Rework elementwise op fusion in IREE. #9348
Replies: 7 comments
-
Looked a bit deeper into the dispatch regions created for DeeplabV3 (before: deeplabbase.print.mlir.txt, after: deeplabmodified.print.mlir.txt). A couple of interesting things stand out.
Before
which is really a bug, since that is not what dispatch region formation intends. After the change, the same dispatch region is
which has the same two ops but fused.
Before
After
The reason for the difference is that the result of the gemm (which is the result of
-
(this is a fantastic breakdown!)
-
FYI @okkwon
-
Did a bit more digging. I realized I wrongly implicated CSE in my earlier analysis. It is still valuable to have Linalg ops CSE-able, but given that the input to IREE comes mostly from MHLO or TOSA, where ops don't have regions and do get CSE'd, redundant computation is not expected in the input to IREE core (which works on Linalg on tensors). The place where redundant computation did get introduced was the elementwise fusion pass. The control function fused operations with a lower-dimensional iteration space into operations with a higher-dimensional iteration space, potentially doing redundant computation. While it is useful to have these fused (even with redundant computation), such fusions probably need to be made through more deliberate decisions, not as part of the fixed-point loop of fusion. Modifying the control function to disallow redundant computation shows better results.
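A minimal sketch of that kind of control function, assuming MLIR's `linalg::ControlFusionFn` callback shape (the exact signature has changed across MLIR versions, and the heuristic actually used in the PR may differ):

```cpp
// Illustrative sketch only, not the control function from the PR.
// Heuristic: refuse to fuse a producer into a consumer whose iteration
// space is strictly larger, since the producer body would then be
// re-executed for every extra consumer iteration (redundant computation).
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

static bool fuseOnlyWithoutRedundantCompute(mlir::OpOperand *fusedOperand) {
  auto producer =
      fusedOperand->get().getDefiningOp<mlir::linalg::GenericOp>();
  auto consumer =
      llvm::dyn_cast<mlir::linalg::GenericOp>(fusedOperand->getOwner());
  if (!producer || !consumer)
    return false;
  // Comparing loop counts is a simplified proxy for "the producer's
  // iteration space covers the consumer's".
  return producer.getNumLoops() >= consumer.getNumLoops();
}
```

Such a callback would typically be handed to `linalg::populateElementwiseOpsFusionPatterns` by whatever pass drives the fusion.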
Specifically:
The PR has been updated with the new heuristics, since they give better defaults. It might be worth verifying this a bit more by looking at the models where the number of dispatches doesn't change and checking for missed opportunities. While I do that, I will also try to land the PR. A couple of other things to try:
-
@MaheshRavishankar the comparison looks neat! How do you get these numbers? Is there a flag that triggers these statistics?
-
Landing back on this (old-ish) discussion thread after a long while. A bunch of work has happened on this in the background. Primarily this work has been developed under the
There are still a few issues. In some cases, the aggressive fusion creates a stack allocation during bufferization. This is a known issue: IREE has a strict compile-time check that requires stack allocations to be statically sized and smaller than 32 KB. With these models, I get a stack allocation of 67 KB in some cases due to static allocation. Until that issue is addressed, you need to pass
Edit: These are only for the CPU backend.
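For context, here is a rough, hypothetical sketch of what such a static stack-allocation check could look like; IREE's actual check lives in its codegen pipeline, and only the 32 KB budget comes from the discussion above, everything else here is assumed:

```cpp
// Illustrative sketch only, not IREE's actual implementation: reject
// functions whose statically sized stack allocations exceed a fixed budget,
// and reject dynamically sized allocations outright.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

static constexpr int64_t kStackBudgetInBytes = 32 * 1024;

static mlir::LogicalResult checkStackAllocations(mlir::func::FuncOp funcOp) {
  int64_t totalBytes = 0;
  auto walkResult = funcOp.walk([&](mlir::memref::AllocaOp allocaOp) {
    mlir::MemRefType type = allocaOp.getType();
    // A dynamically shaped alloca cannot be bounded at compile time.
    if (!type.hasStaticShape())
      return mlir::WalkResult::interrupt();
    totalBytes += type.getNumElements() * type.getElementTypeBitWidth() / 8;
    return mlir::WalkResult::advance();
  });
  if (walkResult.wasInterrupted() || totalBytes > kStackBudgetInBytes)
    return funcOp.emitOpError("stack allocation exceeds the allowed budget");
  return mlir::success();
}
```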
-
With #11686, it looks like all x86_64 benchmark models can be built with aggressive fusion enabled. The table shows the number of dispatches after enabling the aggressive fusion (at stream level by
-
PR #8723 attempts to rework the elementwise op fusion in IREE. This bug is just to document findings along the way.
To start with, below are some stats collected with base commit f0f64ca23.
These are the stats from using PR #8723.
Except for Resnet50, there is a substantial decrease in the number of generic ops with the modification. Surprisingly, though, the number of dispatches stays the same or increases. So while the changes in the PR improve fusion as such, something in dispatch region formation gets pessimized. A couple of things to look into going forward.
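For background on the mechanism being reworked, a minimal sketch of how MLIR's elementwise fusion patterns are typically driven, with a control callback deciding which producer/consumer pairs to fuse (the helper name here is hypothetical; IREE's actual fusion pass is more involved):

```cpp
// Illustrative sketch only: drive MLIR's elementwise fusion patterns over a
// module, with a control callback deciding which producer/consumer pairs to
// fuse. IREE's actual fusion pass (reworked by the PR) is more involved.
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

static mlir::LogicalResult
fuseElementwiseOps(mlir::ModuleOp module,
                   const mlir::linalg::ControlFusionFn &controlFn) {
  mlir::RewritePatternSet patterns(module.getContext());
  mlir::linalg::populateElementwiseOpsFusionPatterns(patterns, controlFn);
  // The greedy driver applies the fusion patterns until a fixed point is
  // reached.
  return mlir::applyPatternsAndFoldGreedily(module.getOperation(),
                                            std::move(patterns));
}
```

Paired with a restrictive control function like the one sketched earlier in the thread, this loop skips fusions that would introduce redundant computation.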