The only valid benchmark for that is the application you actually intend to run. Even embarrassingly parallel problems can have different scaling characteristics depending on how they use memory and caches, and on the thermal behavior of the CPU. Something that touches only the L1 cache and registers will probably scale almost linearly with core count, except for thermal effects. Something that hits the L2 or L3 caches, or even main memory, will scale sublinearly.
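To make that concrete, here's a minimal sketch (my own illustration; the kernels and iteration counts are invented, not from any real benchmark suite) contrasting a register-bound kernel with a memory-streaming one across thread counts. On most machines the compute column scales close to linearly while the stream column flattens once memory bandwidth saturates:

    // Sketch: compare scaling of a register-bound kernel vs. a memory-bound
    // kernel as thread count grows. C++17; parameters chosen arbitrarily.
    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Register-bound: pure ALU work (an LCG), never leaves the registers.
    static uint64_t compute_kernel(uint64_t iters) {
        uint64_t x = 1;
        for (uint64_t i = 0; i < iters; ++i)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        return x;
    }

    // Memory-bound: streams a buffer far larger than any L3, so threads
    // compete for memory bandwidth.
    static uint64_t stream_kernel(const uint64_t* buf, size_t n, int passes) {
        uint64_t sum = 0;
        for (int p = 0; p < passes; ++p)
            for (size_t i = 0; i < n; ++i)
                sum += buf[i];
        return sum;
    }

    // Run nthreads copies of fn concurrently and return the wall time.
    template <typename F>
    static double timed_threads(int nthreads, F fn) {
        std::vector<std::thread> ts;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < nthreads; ++i) ts.emplace_back(fn);
        for (auto& t : ts) t.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const size_t n = 64u << 20;        // 512 MiB of uint64_t
        std::vector<uint64_t> buf(n, 1);
        std::atomic<uint64_t> sink{0};     // defeats dead-code elimination

        for (int nt : {1, 2, 4, 8, 16}) {
            double tc = timed_threads(nt, [&] {
                sink.fetch_add(compute_kernel(100000000),
                               std::memory_order_relaxed);
            });
            double tm = timed_threads(nt, [&] {
                sink.fetch_add(stream_kernel(buf.data(), n, 4),
                               std::memory_order_relaxed);
            });
            std::printf("%2d threads: compute %.2fs  stream %.2fs\n",
                        nt, tc, tm);
        }
        std::printf("(sink=%llu)\n", (unsigned long long)sink.load());
    }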
You're essentially just arguing that all general-purpose benchmarks are worthless because your application could be different.
Suppose I run many different kinds of applications and am just looking for an overall score to provide a general idea of how two machines compare with one another. That's supposed to be the purpose of these benchmarks, isn't it? But this one seems to be unusually useless at distinguishing between various machines with more than a small number of cores.
Your analysis is also incorrect for many of these systems. Each core may have its own L2 cache and each core complex its own L3, so systems with more core complexes don't inherently have more cache contention, because they also have more caches. Likewise, systems with more cores often have more memory bandwidth, so bandwidth per core isn't inherently lower than in systems with fewer cores; in some cases it's actually higher, e.g. a HEDT processor may have twice as many cores but four times as many memory channels.
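On Linux you can inspect that topology directly; a quick sketch, assuming the standard sysfs cache interface (present on mainline kernels):

    // Sketch: print which logical CPUs share each of cpu0's caches, via the
    // standard Linux sysfs layout. On a multi-CCX part you would typically
    // see a private L2 per core and a separate L3 per core complex.
    #include <fstream>
    #include <iostream>
    #include <string>

    static std::string read_line(const std::string& path) {
        std::ifstream f(path);
        std::string s;
        std::getline(f, s);  // empty if the file doesn't exist
        return s;
    }

    int main() {
        for (int idx = 0; ; ++idx) {
            std::string base = "/sys/devices/system/cpu/cpu0/cache/index" +
                               std::to_string(idx) + "/";
            std::string level = read_line(base + "level");
            if (level.empty()) break;  // ran out of cache indices
            std::cout << "L" << level
                      << " (" << read_line(base + "type")
                      << ", " << read_line(base + "size")
                      << ") shared by CPUs: "
                      << read_line(base + "shared_cpu_list") << "\n";
        }
    }

And the bandwidth arithmetic is straightforward: if the HEDT part has 2x the cores and 4x the memory channels, it has 4/2 = 2x the bandwidth per core, not less.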
General-purpose benchmarks aren't worthless. They can be used to predict, in very broad strokes, what application performance might be, especially if you don't know in advance what the applications will be, or if benchmarking the real applications is too tedious.
But in your example, deciding between 24 cores at a somewhat higher frequency and 32 cores at a somewhat lower frequency based on some general-purpose benchmark is essentially pointless. The difference will be small enough that only a real application benchmark can tell you what you need to know. A general-purpose benchmark will be no better than a coin toss, because the exact workings of the benchmark, the weighting of its components into a score, and the exact hardware you are running on will interact in ways that dominate the decision. You are right that there could be shared or separate caches, shared or separate memory channels. The benchmark might exercise those, or it might not. It might heat certain parts of the die more than others. It might just be the epitome of embarrassingly parallel benchmarks, BogoMIPS, which is essentially an empty delay loop that never leaves the registers. The predictive value of the general-purpose benchmark is nil in those cases. The variability from the benchmark maker's choices necessarily introduces a bias, and therefore a measurement uncertainty, and what you are trying to measure is usually smaller than that uncertainty. Therefore: no better than a coin toss.
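To make the weighting point concrete, here's a toy example (every number invented for illustration): two machines, three sub-benchmarks, and two equally defensible weightings that rank them in opposite orders.

    // Toy illustration, all numbers invented: the same two machines rank in
    // opposite orders depending on how sub-scores are weighted into one number.
    #include <array>
    #include <cstdio>

    int main() {
        // Normalized sub-scores: {single-thread, multi-thread, memory}.
        std::array<double, 3> a = {110, 90, 100};   // 24 cores, higher clock
        std::array<double, 3> b = {95, 115, 95};    // 32 cores, lower clock

        std::array<double, 3> w1 = {0.5, 0.3, 0.2}; // single-thread-heavy
        std::array<double, 3> w2 = {0.2, 0.6, 0.2}; // multi-thread-heavy

        auto score = [](const std::array<double, 3>& s,
                        const std::array<double, 3>& w) {
            return s[0] * w[0] + s[1] * w[1] + s[2] * w[2];
        };

        // weighting 1: A=102.0 B=101.0 -> A "wins"
        std::printf("weighting 1: A=%.1f B=%.1f\n",
                    score(a, w1), score(b, w1));
        // weighting 2: A=96.0  B=107.0 -> B "wins"
        std::printf("weighting 2: A=%.1f B=%.1f\n",
                    score(a, w2), score(b, w2));
    }

Neither weighting is wrong; the ranking simply isn't a property of the machines alone.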
You're just back to arguing that general-purpose benchmarks are worthless again. Yes, they won't predict the performance of a specific application as well as testing that application directly, but you don't always have a specific application in mind. Many systems run a wide variety of different applications.
And a benchmark can then provide a reasonable cross-section of those different applications. Or it can yield scores that don't reflect real-world performance differences, in which case it's poorly designed.