Updating cuda inside venv - fix SDP VRAM usage/slowness/lag #132
-
This repo is already compatible with CUDA 12.1, but the latest torch itself is compiled against CUDA 11.8, so that's what gets used. Regarding the SDP lag at start/end, it's because it handles the initial (default 3) and final (default 1) steps differently than all other steps.
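Not from the thread, but as a rough illustration of that start/end behavior (all names and defaults below are hypothetical, for illustration only): a sampler that routes its first few and last steps through a different code path will show per-step lag exactly at those points.

```python
# Illustrative sketch only: a sampler that treats the initial N and final M
# steps differently from the bulk of the run. Names are hypothetical.

FIRST_SPECIAL = 3   # default number of initial steps handled differently
LAST_SPECIAL = 1    # default number of final steps handled differently

def step_path(i: int, total: int) -> str:
    """Return which code path step i (0-based) of `total` steps would take."""
    if i < FIRST_SPECIAL or i >= total - LAST_SPECIAL:
        return "special"   # e.g. a slower fallback path -> lag at start/end
    return "fast"          # the regular attention path for all other steps

paths = [step_path(i, 20) for i in range(20)]
assert paths[:3] == ["special"] * 3      # first 3 steps lag
assert paths[-1] == "special"            # last step lags
assert set(paths[3:19]) == {"fast"}      # the bulk runs the fast path
```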
-
https://www.reddit.com/r/StableDiffusion/comments/12grqts/cuda_out_of_memory_errors_after_upgrading_to/
-
For the performance, I found a chart here: https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/torch2.0
The third column is SDP, the second is xformers; SDP is slower, yet they present it as faster in their charts. I have no idea what the pytorch devs are smoking, or else I can't read that chart. Anyway, is there ANY advantage to pytorch 2 that maybe I am not seeing? At this point Automatic1111 has none of these issues; maybe there's a reason that ghost guy didn't implement this.
-
Most complex ops are done by cuDNN, so if that is fine, the difference in raw performance between torch 1.13+CUDA 11.7 and torch 2.0+CUDA 11.8 is negligible. torch.compile offers much more, but thinking it's enough to simply wrap everything in a single call leads exactly to these kinds of charts where it shows no benefit.
-
Hi!
I was thinking of reaching CUDA 12.1 using this fork; this wasn't possible on Automatic1111 I think, as anything over 11.8 seemed to be missing prerequisites. The goal is to try to improve SDP memory usage. I tested a lot, and it's very slow in actual usage, even if the benchmark looks good/similar to xformers: a lot of lag at the start and end of renders, and almost 50% of VRAM (10GB, to be exact) seems to be occupied randomly by pytorch, without any way to use it. This is from a total of 24GB of VRAM, btw, with 0.5GB reserved by the OS.
With the current setup I am gimping myself to less than half of what I can do in Automatic1111, so my ideas so far are:
Thanks