We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs.
We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics.
What we've observed:
- Device-scope atomics are available.
- It is possible to build global counters and persistent-thread style coordination structures.
- However, we cannot find any documented guarantee regarding:
  - threadgroup co-residency,
  - global forward progress,
  - occupancy-bounded synchronization safety.
In our experiments, synchronization schemes that rely on every threadgroup eventually making progress can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust.
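To make the failure mode concrete, here is a minimal CPU-side model of the kind of scheme we mean. It is not Metal code and says nothing about how Apple GPUs actually schedule work. It assumes a hypothetical scheduler with a fixed number of resident slots, where a threadgroup that is spinning on a global counter keeps its slot until the barrier opens. The names `simulate_global_barrier`, `BarrierOutcome`, and `resident_slots` are ours, purely for illustration.

```cpp
// Hypothetical CPU-side model (not Metal code): threadgroups occupy a fixed
// number of resident slots and spin on a shared counter until all have arrived.
struct BarrierOutcome {
    int arrived;     // groups that incremented the counter
    bool completed;  // true if the counter reached total_groups
};

BarrierOutcome simulate_global_barrier(int total_groups, int resident_slots) {
    int arrived = 0;   // stands in for a device-scope atomic counter
    int launched = 0;  // groups that have been given a slot so far
    int resident = 0;  // groups currently holding a slot

    bool progress = true;
    while (progress) {
        progress = false;
        // Assumption: the scheduler only starts a group when a slot is free.
        while (launched < total_groups && resident < resident_slots) {
            ++launched;
            ++resident;
            ++arrived;  // the group arrives at the barrier, then spins
            progress = true;
        }
        // Spinning groups release their slots only once the barrier opens.
        if (arrived == total_groups && resident > 0) {
            resident = 0;
            progress = true;
        }
    }
    // If the loop ends with arrived < total_groups, the model is stuck:
    // resident groups are waiting on groups that can never be launched.
    return {arrived, arrived == total_groups};
}
```

Under these assumptions, `simulate_global_barrier(8, 8)` completes, while `simulate_global_barrier(16, 8)` stalls with only 8 arrivals. In this model the barrier is safe only when every threadgroup is co-resident, which is exactly the property we cannot find documented.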
Questions:
- Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification?
- Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions?
- For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API?
We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems.
Thanks.