It's hard to evaluate this article without seeing the detail of the "algorithm_w...

bertr4nd · on June 16, 2022

I also found this disappointing. There’s supposedly a 100x speedup to be had going from something in pandas to something using plain python lists, but I have no real idea what it is or why it might have produced a speedup. I can guess, but what’s the point of writing an article that just makes me guess at the existence of some hypothetical slow code?

geph2021 · on June 16, 2022

The author says:

  "The function looks something like this:"

And then shows some grouping and sorting functions using pandas.

Then he says:

  "I replaced Pandas with simple python lists and implemented the algorithm manually to do the group-by and sort."

I think the point of the first optimization is that you can do the relatively expensive group/sort operations without pandas and improve performance. For the rest of the article it's just "algorithm_wizardry", which no longer deals with that portion of the code.
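
As a rough illustration of what "group-by and sort with plain python lists" could look like: this is not the author's code (the article doesn't show it), and the `user` and `score` fields and the `group_and_sort` helper are made up for the example.

```python
from collections import defaultdict

# Hypothetical rows; the article does not show its real data layout.
rows = [
    {"user": "a", "score": 3},
    {"user": "b", "score": 1},
    {"user": "a", "score": 2},
    {"user": "b", "score": 5},
]


def group_and_sort(rows, key, sort_by):
    """Group dicts by `key`, then sort each group by `sort_by` (ascending)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Sort each group in place; list.sort is stable and runs in C.
    for members in groups.values():
        members.sort(key=lambda r: r[sort_by])
    return dict(groups)


grouped = group_and_sort(rows, "user", "score")
```

Nothing here builds a DataFrame or Series, so if object creation is where the time goes (as the author's footnote suggests), a version like this avoids that cost on small inputs.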

eterm · on June 16, 2022

We never get a good sense of how much time was actually saved with that change, not least because the original function calls "initialise weights" inside every loop and the new function does not. It would have been interesting to see what difference that alone made.
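
To make the loop point concrete, here is a hypothetical sketch. The real function isn't shown in the article, so `initialise_weights`, `score_all_slow` and `score_all_fast` are invented stand-ins, and the sketch assumes every item has the same length so the weights can be shared.

```python
def initialise_weights(n):
    # Stand-in for an expensive setup step; the real one is not shown in the article.
    initialise_weights.calls += 1
    return [1.0 / n] * n


initialise_weights.calls = 0  # call counter, only here to make the difference visible


def score_all_slow(items):
    results = []
    for item in items:
        weights = initialise_weights(len(item))  # recomputed on every iteration
        results.append(sum(w * x for w, x in zip(weights, item)))
    return results


def score_all_fast(items, n):
    weights = initialise_weights(n)  # computed once, outside the loop
    return [sum(w * x for w, x in zip(weights, item)) for item in items]


items = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

slow = score_all_slow(items)
slow_calls = initialise_weights.calls

initialise_weights.calls = 0
fast = score_all_fast(items, 2)
fast_calls = initialise_weights.calls
```

Both versions return the same results, but the first calls the setup once per item and the second calls it once in total. That change has nothing to do with pandas, which is why it muddies the comparison.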

The takeaway of the article, that computers are blindingly fast and we make them spend most of their time doing unnecessary work (and often sitting around waiting on I/O), is true of course.

I'm currently writing a utility to do a basic benchmark of data structures and I/O, and it's been a real learning experience for me in just how fast computers can be, but also in just how much a little bit of overhead or contention can slow things down. That's better left for a full write-up another day.

geph2021 · on June 16, 2022

   We never get a good sense of how much time was actually saved with that change, not least because the original function calls "initialise weights" inside every loop and the new function does not.

Good point. Further to your point, I would assume a library like pandas has fairly well-optimized group and sort operations. It would not have occurred to me that pandas is the bottleneck, but the author does clarify in his footnote that pandas operations, by virtue of creating more complex pandas objects, can indeed be a bottleneck.

   [1] Please don't get me wrong. Pandas is pretty fast for a typical dataset but it's not the processing that slows down pandas in my case. It's the creation of Pandas objects itself which can be slow. If your service needs to respond in less than 500ms, then you will feel the effect of each line of Pandas code.