If by people you mean me, then I wasn't clear enough in my comment. The example the GP was talking about implied an image without any objects, just a uniform texture.
Imagine if, instead of generating the RGB image directly, the model generated something like that, but with richer descriptive embeddings on each segment, and a separate model then generated the final RGB image. It would then be easy to change the background, rotate the peach, change a color, add other fruits, etc., by editing this semantic representation of the image instead of wrestling with the prompt to try to make small changes without regenerating the entire image from scratch. A rough sketch of what that intermediate representation could look like is below.
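Purely illustrative sketch in Python: every name here (Segment, SceneGraph, edit, render_rgb) is made up, the embeddings are placeholder numbers, and the "renderer" is a stub standing in for whatever separate model would decode the representation to pixels.

```python
# A minimal, hypothetical editable scene representation: a set of segments,
# each carrying a label, a richer description, and a descriptive embedding.
# A real system would use learned embeddings and a proper image decoder.
from __future__ import annotations
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    label: str                    # human-readable tag, e.g. "peach"
    description: str              # richer text description of this segment
    embedding: tuple[float, ...]  # stand-in for a learned descriptive embedding
    # A real pipeline would also carry a mask / geometry for the region.


@dataclass(frozen=True)
class SceneGraph:
    segments: tuple[Segment, ...]

    def edit(self, label: str, **changes) -> SceneGraph:
        """Return a new scene with one segment's attributes changed."""
        updated = tuple(
            replace(seg, **changes) if seg.label == label else seg
            for seg in self.segments
        )
        return SceneGraph(segments=updated)


def render_rgb(scene: SceneGraph) -> str:
    """Placeholder for the separate model that decodes the scene to pixels."""
    return " + ".join(seg.description for seg in scene.segments)


if __name__ == "__main__":
    scene = SceneGraph(segments=(
        Segment("background", "plain white studio backdrop", (0.1, 0.2)),
        Segment("peach", "ripe peach with visible fuzz", (0.7, 0.3)),
    ))
    # Edit only the semantic representation; everything else stays untouched.
    edited = scene.edit("peach", description="frozen cyan peach fuzz, close up")
    print(render_rgb(edited))
```

The point is just that edits are local operations on a structured object, so the rest of the scene is guaranteed to stay the same, unlike re-prompting the whole image.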
I want a picture of frozen cyan peach fuzz.