Rationale behind moving tile and distribute out of Flow #9321
Replies: 1 comment
Thanks Mahesh!
Good point, and thanks for calling this out - that's a bug we should work on fixing, insofar as the static information should not be a requirement but only a potential optimization. I think today there's a bit too much reliance on that information, and that creates some big performance cliffs that will become much easier to fall into as we move towards models with dynamic shapes. One way to think of it is that static information should be used to prune a search space down rather than to build one up. Lots of detailed work there for sure, and it's good to keep an eye on this as a design smell in any part of the code that uses such information. However we end up moving forward, we should make sure it at least has provisions for doing the right thing in the face of missing static information.

At least on the CPU backend these are only used to "optimize" the generated code. For dynamic cases it falls back to some default, which might not be optimal (but is still correct). That being said, I don't think it is easy to generate optimized code for a purely dynamic case. There has to be at least some information about the potential range if something reasonable is to be done.
IREE recently moved away from using tile and distribute at the Flow level. At the point of the move, all dispatch regions were tiled and distributed parametrically at the Flow level, with the backends deciding what the static values to use would be. With a lot of effort, it was ensured that all operations that were moved into dispatch regions were tiled and distributed. Having gone through that effort though, this approach still seemed to get in the way constantly, with backend development having to work around some of the fallout from doing tile and distribute at the Flow level. Trying to document some of this here.
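For concreteness, consider a dispatch whose body is a single static-shape `linalg.matmul` (a hand-written sketch with made-up shapes and names, not the exact IR from the original discussion):

```mlir
// A static-shape matmul before tile and distribute (simplified sketch).
func.func @matmul_static(%lhs: tensor<128x256xf32>, %rhs: tensor<256x512xf32>,
                         %init: tensor<128x512xf32>) -> tensor<128x512xf32> {
  %0 = linalg.matmul ins(%lhs, %rhs : tensor<128x256xf32>, tensor<256x512xf32>)
                     outs(%init : tensor<128x512xf32>) -> tensor<128x512xf32>
  return %0 : tensor<128x512xf32>
}
```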
Such an op would be tiled and distributed at the Flow level into a loop nest along the lines of the sketch below.
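Here `%wg_id_*` / `%wg_count_*` stand in for the workgroup id and count values that IREE provides inside a dispatch region, `%ts_*` stand in for the tile sizes the backend would later pick, and boundary clamping with `affine.min` is elided to keep the sketch short:

```mlir
// Hand-written sketch: in IREE this body would live inside a dispatch region;
// it is shown as a plain function here so the example is self-contained.
func.func @matmul_tiled_and_distributed(
    %lhs: tensor<128x256xf32>, %rhs: tensor<256x512xf32>,
    %init: tensor<128x512xf32>,
    %wg_id_x: index, %wg_count_x: index,
    %wg_id_y: index, %wg_count_y: index,
    %ts_x: index, %ts_y: index) -> tensor<128x512xf32> {
  %c128 = arith.constant 128 : index
  %c512 = arith.constant 512 : index
  // Cyclic distribution: each workgroup starts at id * tile_size and strides by
  // count * tile_size, so the code stays correct for any number of workgroups.
  %lb_y = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%wg_id_y, %ts_y]
  %step_y = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%wg_count_y, %ts_y]
  %lb_x = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%wg_id_x, %ts_x]
  %step_x = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%wg_count_x, %ts_x]
  %result = scf.for %iv_y = %lb_y to %c128 step %step_y
      iter_args(%out0 = %init) -> (tensor<128x512xf32>) {
    %inner = scf.for %iv_x = %lb_x to %c512 step %step_x
        iter_args(%out1 = %out0) -> (tensor<128x512xf32>) {
      // The tile shapes are dynamic because the tile sizes are still symbolic
      // here, even though the original matmul had fully static shapes.
      %lhs_tile = tensor.extract_slice %lhs[%iv_y, 0] [%ts_y, 256] [1, 1]
          : tensor<128x256xf32> to tensor<?x256xf32>
      %rhs_tile = tensor.extract_slice %rhs[0, %iv_x] [256, %ts_x] [1, 1]
          : tensor<256x512xf32> to tensor<256x?xf32>
      %out_tile = tensor.extract_slice %out1[%iv_y, %iv_x] [%ts_y, %ts_x] [1, 1]
          : tensor<128x512xf32> to tensor<?x?xf32>
      %mm = linalg.matmul
          ins(%lhs_tile, %rhs_tile : tensor<?x256xf32>, tensor<256x?xf32>)
          outs(%out_tile : tensor<?x?xf32>) -> tensor<?x?xf32>
      %updated = tensor.insert_slice %mm into %out1[%iv_y, %iv_x] [%ts_y, %ts_x] [1, 1]
          : tensor<?x?xf32> into tensor<128x512xf32>
      scf.yield %updated : tensor<128x512xf32>
    }
    scf.yield %inner : tensor<128x512xf32>
  }
  return %result : tensor<128x512xf32>
}
```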
(this isn't the exact form, but it captures what is required for the discussion here). The loops here implement a cyclic distribution of workgroups, which is deemed a good feature to have since it allows you to change the number of workgroups used dynamically and still have the generated code be correct, i.e. it isn't hardwired to the exact number of workgroups. This leads to a couple of issues:
- Tile and distribute at the Flow level bakes in an opinion about how tiles are distributed along the different logical dimensions (i.e. the x-y-z grid of processors that IREE uses, mirroring Vulkan/CUDA). All backends are forced to obey the same traversal order. This might not be ideal for every backend in every case (it probably is for the majority of cases). Deferring that decision to the backends seemed more prudent.
- Doing the tile and distribute parametrically at the Flow level implied that the shapes of the tiled operations would no longer be static even if the shapes of the original operations were static (as with the `tensor<?x?xf32>` tiles in the sketch above). With more complex sequences of ops, the analysis needed to recover the original problem size was very involved and fragile, and the original problem size is what all backends use to decide the static tile sizes.
Doing tile and distribute at the Flow level also meant that every operation that needed to be moved into a dispatch had to be "tile and distribute"-able. This led to the development of the `TiledOpInterface` in IREE (being upstreamed as `TilingInterface`). That was very useful, but it also meant that some of the corner cases were strange. There are cases where `tensor.extract_slice` and `tensor.insert_slice` need to be in their own dispatch, which means these ops also have to be tiled and distributed. That is possible to do, but the justification for it seemed flimsy. Further, these ops are sometimes a "true copy" and sometimes get lowered into a `memref.subview`, which is not a copy. Disambiguating between the different uses at this level is not possible, and that ends up pessimizing things by always treating them as copies. By moving tile and distribute to the backends, such corner cases were easier to handle:

- For a dispatch that contained only a `tensor.extract_slice` / `tensor.insert_slice`, these ops could be folded with the `flow.dispatch.load` / `flow.dispatch.store`.
- For dispatches that had only `flow.` operations, you could bufferize early to get a `linalg.copy` operation that better represents the computation and lets the backends handle this case like any other operation.

One of the main rationales behind doing tile and distribute at the Flow level was that you could also fuse during this stage: since the computation is already tiled and distributed, it ensures that the operations that are expected to be fused are indeed fused. There is still a question of heuristics when it comes to deciding "what to fuse". Whether or not the computation was tiled + fused + distributed, this heuristic still exists, and the heuristic we use today is very simple. The maintenance/development cost of the things mentioned above didn't seem to justify continuing to use tile + distribute at the Flow level. When we want to search for different heuristics, then maybe doing tile + fuse + distribute would be useful, but it is still unclear how much the heuristic is actually related to tiling. They seem fairly orthogonal: the fusion decision is more related to the access pattern of data between producer and consumer, which does not really need tiling to have been done (see the sketch below). The decision to remove tile and distribute from the Flow level was a pragmatic choice to remove the cost currently incurred by this approach until we have a better idea of a case where tiling is required to make decisions about fusion.
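As a small illustration of that last point (a made-up example, not IREE's actual fusion logic): whether an elementwise consumer belongs in the same dispatch as its producer can be read off the `linalg` indexing maps directly, before any tiling has happened.

```mlir
#map_id = affine_map<(d0, d1) -> (d0, d1)>
#map_bcast = affine_map<(d0, d1) -> (d1)>

func.func @matmul_bias_add(%lhs: tensor<128x256xf32>, %rhs: tensor<256x512xf32>,
                           %bias: tensor<512xf32>,
                           %init: tensor<128x512xf32>) -> tensor<128x512xf32> {
  // Producer.
  %mm = linalg.matmul ins(%lhs, %rhs : tensor<128x256xf32>, tensor<256x512xf32>)
                      outs(%init : tensor<128x512xf32>) -> tensor<128x512xf32>
  // Consumer: an elementwise bias-add. It reads %mm through the identity
  // indexing map, so producer and consumer touch the data in the same order;
  // that access-pattern information is available without tiling either op.
  %result = linalg.generic
      {indexing_maps = [#map_id, #map_bcast, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%mm, %bias : tensor<128x512xf32>, tensor<512xf32>)
      outs(%init : tensor<128x512xf32>) {
  ^bb0(%m: f32, %b: f32, %out: f32):
    %add = arith.addf %m, %b : f32
    linalg.yield %add : f32
  } -> tensor<128x512xf32>
  return %result : tensor<128x512xf32>
}
```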
An added benefit of not tiling and distributing at the Flow level is that, in the future, if we want to target hardware where the full operation is needed, it can still be targeted easily. Requiring that all target backends handle a tiled + distributed form seemed very opinionated. This reasoning is a bit shallow though, since most ML-focused hardware is moving towards using a tiled + distributed approach.