It looks like Kafka is far and away the standard way to handle persistent logs/events at scale. AFAIK a company here in Japan called LINE has all their messaging flowing through a large Kafka cluster.
Wonder if anyone is running large NATS JetStream[0]/Liftbridge[1] or Pulsar[2] (Yahoo runs Pulsar) clusters. I guess Pulsar might be #2 in terms of adoption at large scale?
Pulsar is a much better fit when your architecture absolutely requires many queues, e.g. you need one queue per customer across hundreds of thousands of customers.
This architecture certainly exists, but it is a lot more burdensome and less common than partitioning by customer id across a Kafka topic.
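For anyone unfamiliar with what "partitioning by customer id" means in practice, here is a rough sketch. It assumes a single shared topic with a hypothetical partition count of 12, and it uses crc32 purely as a stand-in hash (the Kafka Java client's default partitioner hashes the key bytes with murmur2, which is not reproduced here). The point is only that every record for one customer lands on the same partition, so per-customer ordering is kept without one queue per customer.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical partition count for one shared topic


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a customer id to a partition using a stable hash.

    crc32 is a stand-in here; it is not the hash Kafka itself uses.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions


# 100,000 hypothetical customers share the same small set of partitions.
customers = [f"customer-{i}" for i in range(100_000)]
load = Counter(partition_for(c) for c in customers)

# The same customer always maps to the same partition.
same_partition = partition_for("customer-42") == partition_for("customer-42")
```

The trade-off is that customers sharing a partition also share its throughput and consumer, which is exactly the case where one queue per customer starts to look attractive.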
Kafka is a wonderful tool. I built a few systems on top of it and all of them delivered the scale that was promised and more, with surprisingly little hardware.
I'm very hostile to a lot of hipster tech but Kafka is one of the few genuinely good pieces of tech from the whole "Big Data" craze of the past decade.
It seems weird to hear "a company here in Japan called LINE" -- LINE is big enough in Japan that it sounds kind of equivalent to "a company here in America called Discord".
I think that works one way but not the other -- America has the blessing of being the source of lots of new apps and tech companies with global success/ambitions (e.g. Discord has some penetration among gamers anywhere), but Japan less so.
AFAIK LINE has not had such success. I wouldn't be surprised if most people in the US did not know of LINE. Unless they were avid readers of TechCrunch or something, it just doesn't come up.
Would be interesting to know what % of people on HN know about LINE though.
I’m pretty sure LINE has more than twice the users that Twitter has. Not knowing it is like not knowing about WeChat: it’s because you’re not familiar with things outside of the US, rather than not being up-to-date with the space in general.
How well does Kafka handle high-density data (e.g. A/V and images)? I'm scouting out systems for our computer vision pipeline and Kafka would simplify the aggregation/collation step for marshalling to GPUs, and it would be simplest if I could just send raw frames rather than use some alternate transport.
I think the important thing there would be the frame size, no? Clearly Kafka can handle the throughput side of things but it doesn't seem to be meant for large messages out of the box[0].
I wouldn't be surprised if it was perfectly fine though -- with compression (and all the video/image specific tricks) the file sizes should get pretty small...
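To put rough numbers on the frame-size question, here is a back-of-the-envelope calculation. The resolution and pixel format are assumptions for illustration, and the limit is an approximation of Kafka's default maximum message size, which is roughly 1 MB; the real values come from settings such as `message.max.bytes` on the broker, `max.request.size` on the producer and `max.partition.fetch.bytes` on the consumer.

```python
# Assumed numbers for illustration only; check the actual broker and client settings.
APPROX_DEFAULT_MAX_MESSAGE_BYTES = 1_000_000  # Kafka's default limit is roughly 1 MB


def raw_frame_bytes(width: int, height: int, channels: int = 3,
                    bytes_per_channel: int = 1) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * channels * bytes_per_channel


frame = raw_frame_bytes(1920, 1080)  # hypothetical 8-bit RGB 1080p frame
ratio = frame / APPROX_DEFAULT_MAX_MESSAGE_BYTES

print(f"raw 1080p frame: {frame} bytes, about {ratio:.1f}x the assumed limit")
```

Under those assumptions a single raw 1080p frame is about 6.2 MB, so raw frames would need either raised limits on broker, producer and consumer, or compression before they fit.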
> Thanks. That's kinda what I figured, but I wanted to sound it out a bit as a sanity check.
I'm by no means a Kafka expert or a video expert of course, but glad I could serve as a rubber duck. Maybe there are some lessons to be learned from Encore?[0]
> The link is a great reference by the way.
Yeah the amount of info in there is pretty good -- feels like Kafka could definitely be tuned to do the job but maybe it's better to just start with something better attuned.
> This is more or less what I figured. We already archive to S3 anyways so switching to using it as transport would be straightforward.
Yeah I figured this is what you were trying to avoid -- the round trips to S3 to get the data to the processing step would be wasteful if the data is small enough to flow along the processing route. Guess it really depends on your data. I could have sworn I saw some analysis of how Kafka performs versus the size of the messages it must deliver...
Looks like DZone has some good content[1], LinkedIn of course[2]... Ah, I finally found the one I was looking for and it's DZone[3]. All those links make mention of message size.
It can, but this depends on the volume and size of the topic messages. The broker and consumer will need a LOT more memory. I did this at a previous job and the GC on the broker started getting very shitty and performance was crap. Consumers were constantly getting OOM and needed bigger containers, etc. It was a bad idea and we just moved the stuff to S3.
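For what it's worth, the "move the stuff to S3" approach usually ends up looking like the sketch below: the payload goes to the object store and only a small reference goes through the log. Everything here is a stand-in for illustration. The dict plays the role of an S3 bucket, the list plays the role of a Kafka topic, and the function names, key layout and `cam-1` id are hypothetical rather than anything from a real client library.

```python
import hashlib
import json

object_store = {}  # stand-in for an S3 bucket: key -> payload bytes
topic = []         # stand-in for a Kafka topic: list of small messages


def publish_frame(frame: bytes, camera_id: str) -> str:
    """Store the payload out of band and publish only a reference to it."""
    key = f"frames/{camera_id}/{hashlib.sha256(frame).hexdigest()}"
    object_store[key] = frame
    reference = {"camera_id": camera_id, "key": key, "size": len(frame)}
    topic.append(json.dumps(reference).encode("utf-8"))
    return key


def consume_frames():
    """Read references from the topic and fetch each payload from the store."""
    for message in topic:
        reference = json.loads(message)
        yield reference["camera_id"], object_store[reference["key"]]


# A hypothetical raw frame of about 6 MB becomes a message of well under 1 KB.
publish_frame(b"\x00" * 6_220_800, "cam-1")
received = list(consume_frames())
```

The cost is the extra round trip to the store on the consumer side, which is the waste mentioned above, so this mostly pays off when the payloads are too large for the broker to handle comfortably.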
[0]: https://docs.nats.io/jetstream/jetstream
[1]: https://liftbridge.io/
[2]: https://pulsar.apache.org/