
A somewhat common problem is being limited by the throughput of CPU-heavy tasks while the OS reports lower than expected CPU usage. A lot of companies/teams just kind of handwave it away as "hyperthreading is weird" and allocate more machines. An actual cause might be poor cache usage: programs stall waiting for data to be loaded from memory, which, depending on the CPU metrics you use, may not show up as CPU busy time.

For companies at much smaller scale than Netflix, where employee time is relatively more costly than computer time, this might even be the right decision. So you might end up with 20 servers at 50% usage, when 10 servers would take twice as long to do the same work yet still appear to be at 50% usage.

If the bottlenecks and overhead are reduced so the software can make fuller use of the CPU, you might be able to drop to e.g. 15 machines at 75% CPU usage. In that case the increased CPU usage represents more efficient use of resources.



>> while the OS reports lower than expected CPU usage

>> which depending on the CPU metrics you use, may not show as CPU busy time

If your userspace process is waiting on memory (be that cache or RAM) then you'll show as CPU busy when you look in top or whatever, even though if you look under the covers, such as via perf counters, you'll see a lack of instructions executed.

The CPU is busy in this case and the OS won't context switch to another task; your stalled process will be treated as running by the OS. At the hardware thread level, the core will hopefully use the opportunity to run another thread thanks to hyperthreading, but at the OS level your process will show as userspace CPU bound. You'll have to look at perf counters to see what's actually happening.
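You can demo this yourself with a pointer-chasing loop. A minimal sketch (buffer size and step count are made-up numbers; size the buffer well past your last-level cache): top reports the process at ~100% user CPU, while perf stat -e cycles,instructions shows instructions-per-cycle far below what the core can do.

    /* Pointer-chase microbenchmark: top shows ~100% user CPU while most
     * cycles are actually stalls on cache misses. Compare the counters:
     *   cc -O2 chase.c -o chase
     *   perf stat -e cycles,instructions ./chase
     * Sizes below are arbitrary; make the buffer much bigger than your
     * last-level cache. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N     ((size_t)1 << 25)    /* 32M entries * 8 B = 256 MB */
    #define STEPS ((uint64_t)1 << 27)  /* ~134M dependent loads */

    static uint64_t rng = 88172645463325252ULL;
    static uint64_t xorshift64(void) {  /* cheap PRNG, avoids RAND_MAX limits */
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm: a single-cycle random permutation, so the
         * chase touches the whole buffer and the prefetcher can't predict it. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)(xorshift64() % i);  /* j < i keeps it one cycle */
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        /* Each load depends on the previous one: a likely cache miss per
         * step, and the OS bills all of it as user CPU time. */
        size_t p = 0;
        for (uint64_t s = 0; s < STEPS; s++) p = next[p];

        printf("%zu\n", p);  /* keep the loop from being optimized away */
        free(next);
        return 0;
    }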

>> you might end up with 20 servers at 50% usage, but using 10 servers will take twice as long but still appear to be at 50% usage.

Queueing theory is fascinating: the latency change when dropping to half the servers may not be just a doubling. It depends on the arrival rate and processing time, but the results can be wild, like 10x worse.
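To put rough numbers on it, here's a minimal M/M/c sketch using the Erlang C formula (the arrival rate, service rate, and server counts are made-up for illustration). With the same arrival rate, 20 servers sit at ~50% utilization, while 10 servers sit at 99% and queueing delay dominates:

    /* Mean response time for an M/M/c queue via the Erlang C formula.
     * Numbers are illustrative: lambda = 9.9 jobs/s, mu = 1 job/s/server.
     * Build: cc -O2 mmc.c -o mmc */
    #include <stdio.h>

    /* Erlang B via its stable recursion, then the Erlang C conversion
     * (probability an arriving job has to wait). a = lambda/mu offered load. */
    static double erlang_c(int c, double a) {
        double b = 1.0;                 /* Erlang B with 0 servers */
        for (int n = 1; n <= c; n++)
            b = a * b / (n + a * b);
        double rho = a / c;
        return b / (1.0 - rho * (1.0 - b));
    }

    /* Mean time in system: W = C(c,a) / (c*mu - lambda) + 1/mu */
    static double mean_response(int c, double lambda, double mu) {
        return erlang_c(c, lambda / mu) / (c * mu - lambda) + 1.0 / mu;
    }

    int main(void) {
        double lambda = 9.9, mu = 1.0;  /* must keep lambda < c*mu (stable) */
        printf("20 servers (rho = %.2f): W = %5.2f s\n",
               lambda / (20 * mu), mean_response(20, lambda, mu));
        printf("10 servers (rho = %.2f): W = %5.2f s\n",
               lambda / (10 * mu), mean_response(10, lambda, mu));
        return 0;
    }

On these made-up numbers it prints roughly W = 1.0 s for 20 servers versus W = 10.6 s for 10: the halving that "should" double latency makes it about ten times worse once utilization approaches 1.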


When you put it like that, yes. Hardware is cheap and all that. In practice I think that an organization that doesn't understand the software it is developing has a people problem. And people problems generally can't be solved with hardware.

If somebody knows how to make that insight actionable, let me know. No, hiring new people is not the answer. In all likelihood that swaps one hard problem for an even harder one.


IMHO, usually the people problem is that there are too many people working on the same machine. Sometimes that's unavoidable.

Sometimes, honestly, understanding the software it's developing isn't an important business goal. It makes me personally angry, but most businesses do right by not picking business goals to placate me.

Sometimes you just have too many people.

Sometimes you can restructure your software and systems so that fewer people are working on a system and they can understand it better. Sometimes that would also involve restructuring your organization, which has pluses and minuses.

If you can ensure the smaller teams run similar stacks, there can be some good knowledge transfer when one team figures out an underlying truth about the platform that applies elsewhere. And sometimes you get a platform expert team that can help with understanding and problem solving across the teams.



