Also, I think you need a 40GB "card", not just 40GB of VRAM. I wrote about this upthread: you're probably going to need a single card; I'd be surprised if you could chain several GPUs together.
Oh right, I forgot that some diffusion models can't offload or split layers. I don't use vision generation models much at all; I was just going off LLM work. Apologies for the potential misinformation.
Nah, that won’t gain you much (if anything?) over just swapping layers to RAM. You can put the text encoder on the second card, but you can also just keep it in RAM with little downside; see the sketch below.
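For reference, if you're using something like Hugging Face's diffusers, the RAM-swap approach is a one-liner. A minimal sketch, assuming diffusers with accelerate installed (the SDXL model ID and prompt are just examples, not a specific recommendation):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load in half precision to cut VRAM use (example model ID).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

# Coarse-grained offload: each submodel (text encoders, UNet, VAE) is
# moved to the GPU only while it's actually running; the rest sit in RAM.
# This is why a second card for the text encoder buys you little.
pipe.enable_model_cpu_offload()

# Finer-grained alternative (lowest VRAM, slower): offload per layer.
# pipe.enable_sequential_cpu_offload()

image = pipe("an astronaut riding a horse").images[0]
image.save("out.png")
```

The text encoder only runs once per prompt, so parking it in RAM and paying the one-time transfer cost is cheap compared to the denoising loop, which is where the GPU time actually goes.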