Rework elementwise op fusion in IREE. #9348
Replies: 7 comments
-
Looked a bit deeper into the dispatch regions created for DeeplabV3 (before: deeplabbase.print.mlir.txt, after: deeplabmodified.print.mlir.txt). A couple of interesting things stand out.
Before
which is really a bug, since that is not what dispatch region formation intends. After the change, the same dispatch region is
which has the same two ops but fused.
Before
After
The reason for the difference is that the result of the gemm (which is the result of
-
(this is a fantastic breakdown!)
-
FYI @okkwon
-
Did a bit more digging. I realized I wrongly implicated CSE in my earlier analysis. It is still valuable to have Linalg ops CSE-able, but given that the input to IREE comes mostly from MHLO or TOSA, where ops don't have regions and do get CSE'd, redundant computation is not expected in the input to IREE core (which works on Linalg on tensors). The place where redundant computation did get introduced was the elementwise fusion pass. The control function fused operations with a lower-dimensional iteration space into operations with a higher-dimensional iteration space, potentially doing redundant computation. While it is useful to have these fused (even with redundant computation), such fusions probably need to be made through more deliberate decisions, not as part of the fixed-point loop of fusion. Modifying the control function to disallow redundant computation shows better results.
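A minimal sketch of that kind of control function, assuming MLIR's `linalg::ControlFusionFn` callback shape (the exact signature has changed across MLIR versions, and the heuristic actually used in the PR may differ):

```cpp
// Illustrative sketch only, not the control function from the PR.
// Heuristic: refuse to fuse a producer into a consumer whose iteration
// space is strictly larger, since the producer body would then be
// re-executed for every extra consumer iteration (redundant computation).
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

static bool fuseOnlyWithoutRedundantCompute(mlir::OpOperand *fusedOperand) {
  auto producer =
      fusedOperand->get().getDefiningOp<mlir::linalg::GenericOp>();
  auto consumer =
      llvm::dyn_cast<mlir::linalg::GenericOp>(fusedOperand->getOwner());
  if (!producer || !consumer)
    return false;
  // Comparing loop counts is a simplified proxy for "the producer's
  // iteration space covers the consumer's".
  return producer.getNumLoops() >= consumer.getNumLoops();
}
```

Such a callback would typically be handed to `linalg::populateElementwiseOpsFusionPatterns` by whatever pass drives the fusion.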
Specifically:
The PR has been updated with the new heuristics, since they give better defaults. It might be worth verifying this a bit more by looking at the models where the number of dispatches doesn't change and checking for missed opportunities. While I do that, I will also try to land the PR. A couple of other things to try:
-
@MaheshRavishankar the comparison looks neat! How do you get these numbers? Is there a flag that triggers these statistics?
-
Landing back on this (old-ish) discussion thread after a long while. A bunch of work has happened on this in the background. Primarily this work has been developed under the
There are still a few issues. In some cases, the aggressive fusion creates a stack allocation during bufferization. This is a known issue: IREE has a strict compile-time check that requires stack allocations to be statically sized and smaller than 32 KB. With these models, I get a stack allocation of 67 KB in some cases due to static allocation. Until that issue is addressed, you need to pass
Edit: These are only for the CPU backend.
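For context, here is a rough, hypothetical sketch of what such a static stack-allocation check could look like; IREE's actual check lives in its codegen pipeline, and only the 32 KB budget comes from the discussion above, everything else here is assumed:

```cpp
// Illustrative sketch only, not IREE's actual implementation: reject
// functions whose statically sized stack allocations exceed a fixed budget,
// and reject dynamically sized allocations outright.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

static constexpr int64_t kStackBudgetInBytes = 32 * 1024;

static mlir::LogicalResult checkStackAllocations(mlir::func::FuncOp funcOp) {
  int64_t totalBytes = 0;
  auto walkResult = funcOp.walk([&](mlir::memref::AllocaOp allocaOp) {
    mlir::MemRefType type = allocaOp.getType();
    // A dynamically shaped alloca cannot be bounded at compile time.
    if (!type.hasStaticShape())
      return mlir::WalkResult::interrupt();
    totalBytes += type.getNumElements() * type.getElementTypeBitWidth() / 8;
    return mlir::WalkResult::advance();
  });
  if (walkResult.wasInterrupted() || totalBytes > kStackBudgetInBytes)
    return funcOp.emitOpError("stack allocation exceeds the allowed budget");
  return mlir::success();
}
```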
-
With #11686, it looks like all x86_64 benchmark models can be built with aggressive fusion enabled. The table shows the number of dispatches after enabling the aggressive fusion (at stream level by
-
PR #8723 attempts to rework the elementwise op fusion in IREE. This bug is just to document findings along the way.
To start with, below are some stats collected with base commit f0f64ca23.
These are the stats from using PR #8723.
Except for Resnet50, there is a substantial decrease in the number of generic ops with the modification. Surprisingly, though, the number of dispatches stays the same or increases. So while the changes in the PR improve fusion as such, something in dispatch region formation gets pessimized. A couple of things to look into going forward.
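For background on the mechanism being reworked, a minimal sketch of how MLIR's elementwise fusion patterns are typically driven, with a control callback deciding which producer/consumer pairs to fuse (the helper name here is hypothetical; IREE's actual fusion pass is more involved):

```cpp
// Illustrative sketch only: drive MLIR's elementwise fusion patterns over a
// module, with a control callback deciding which producer/consumer pairs to
// fuse. IREE's actual fusion pass (reworked by the PR) is more involved.
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

static mlir::LogicalResult
fuseElementwiseOps(mlir::ModuleOp module,
                   const mlir::linalg::ControlFusionFn &controlFn) {
  mlir::RewritePatternSet patterns(module.getContext());
  mlir::linalg::populateElementwiseOpsFusionPatterns(patterns, controlFn);
  // The greedy driver applies the fusion patterns until a fixed point is
  // reached.
  return mlir::applyPatternsAndFoldGreedily(module.getOperation(),
                                            std::move(patterns));
}
```

Paired with a restrictive control function like the one sketched earlier in the thread, this loop skips fusions that would introduce redundant computation.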