Updating cuda inside venv - fix SDP VRAM usage/slowness/lag #132
-
This repo is already compatible with CUDA 12.1, but the latest torch itself is compiled against CUDA 11.8, so that's what gets used. Regarding the SDP lag at start/end, it's because it handles the initial (default 3) and final (default 1) steps differently than all other steps.
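Not from the thread, but as a rough illustration of that start/end behavior (all names and defaults below are hypothetical, for illustration only): a sampler that routes its first few and last steps through a different code path will show per-step lag exactly at those points.

```python
# Illustrative sketch only: a sampler that treats the initial N and final M
# steps differently from the bulk of the run. Names are hypothetical.

FIRST_SPECIAL = 3   # default number of initial steps handled differently
LAST_SPECIAL = 1    # default number of final steps handled differently

def step_path(i: int, total: int) -> str:
    """Return which code path step i (0-based) of `total` steps would take."""
    if i < FIRST_SPECIAL or i >= total - LAST_SPECIAL:
        return "special"   # e.g. a slower fallback path -> lag at start/end
    return "fast"          # the regular attention path for all other steps

paths = [step_path(i, 20) for i in range(20)]
assert paths[:3] == ["special"] * 3      # first 3 steps lag
assert paths[-1] == "special"            # last step lags
assert set(paths[3:19]) == {"fast"}      # the bulk runs the fast path
```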
-
https://www.reddit.com/r/StableDiffusion/comments/12grqts/cuda_out_of_memory_errors_after_upgrading_to/
-
For the performance, I found a chart here: https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/torch2.0
The third column is SDP, the second is xformers; SDP is slower, yet they present it as faster in their charts. I have no idea what the pytorch devs are smoking, or else I can't read that chart. Anyway, is there ANY advantage to pytorch 2 that maybe I am not seeing? At this point Automatic1111 has none of these issues; maybe there's a reason that ghost guy didn't implement this.
-
Most complex ops are done by cuDNN, so if that is fine, the difference in raw performance between torch 1.13+CUDA 11.7 and torch 2.0+CUDA 11.8 is negligible. torch.compile offers much more, but thinking it's enough to simply wrap everything in a single call leads exactly to these kinds of charts where it shows no benefit.
-
Hi!
I was thinking of reaching CUDA 12.1 using this fork; this wasn't possible on Automatic1111 I think, as anything over 11.8 seemed to be missing prerequisites. The goal is to try to improve SDP memory usage. I tested a lot, and it's very slow in actual usage, even if the benchmark looks good/similar to xformers: a lot of lag at the start and end of renders, and almost 50% of VRAM (10GB, to be exact) seems to be occupied randomly by pytorch, without any way to use it. This is from a total of 24GB of VRAM, btw, with 0.5GB reserved by the OS.
With the current setup I am gimping myself to less than half of what I can do in Automatic1111, so my ideas so far are:
Thanks