We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive.
In several cases, MoE models decode significantly faster than dense models with similar total parameter counts.
Examples (simplified):
- Dense model: ~30B parameters
- MoE model: ~30B total parameters, ~3B active parameters
On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead.
A few hypotheses we're considering:
- Active parameter count appears to matter more than total parameter count for decode throughput.
- Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected.
- Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not.
- Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput.
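To make the memory-traffic hypothesis concrete, here is a rough back-of-envelope sketch, not a measurement. It assumes decode is purely bandwidth-bound and that every active weight is read exactly once per token. The 400 GB/s bandwidth and 4-bit weights are placeholder assumptions; substitute the numbers for your own chip and quantization. It ignores KV-cache reads, routing overhead, and any always-active non-expert weights, so real throughput will land below these ceilings.

```python
def decode_ceiling_tok_per_s(active_params: float,
                             bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed values for illustration only; not measured on any device.
BANDWIDTH_GB_S = 400.0   # hypothetical unified-memory bandwidth
BITS_PER_WEIGHT = 4.0    # hypothetical 4-bit quantization

dense_ceiling = decode_ceiling_tok_per_s(30e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe_ceiling = decode_ceiling_tok_per_s(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense ~30B ceiling:      {dense_ceiling:.1f} tok/s")
print(f"MoE ~3B active ceiling:  {moe_ceiling:.1f} tok/s")
print(f"ratio:                   {moe_ceiling / dense_ceiling:.1f}x")
```

Under these assumptions the ceilings differ by the ratio of active parameters (10x here), regardless of the bandwidth figure chosen. How close a real kernel gets to either ceiling is exactly where expert geometry and quantization layout would show up.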
What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically.
Has anyone profiled decode throughput across:
- dense models
- large-expert MoE
- many-small-expert MoE
and identified which hardware characteristics are actually driving the difference?
I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.