LLM inference on Apple Silicon: wh… | Apple Developer Forums

LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?

We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive.

In several cases, MoE models achieve significantly higher decode throughput than dense models with similar total parameter counts.

Examples (simplified):

On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead.

A few hypotheses we're considering:

Active parameter count appears to matter more than total parameter count for decode throughput.
Memory traffic may dominate single-token (M=1) autoregressive decode, where every matrix multiply degenerates to a GEMV, making sparse activation more important than expected.
Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not.
Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput.
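
To make the first two hypotheses concrete, here is a rough back-of-the-envelope sketch in Python. It assumes decode is purely memory-bandwidth-bound, so the throughput ceiling is just bandwidth divided by the weight bytes read per token. The model sizes, the bits-per-weight figure, and the 400 GB/s bandwidth are hypothetical placeholders, not measurements from our runs, and the estimate ignores KV-cache reads, attention, routing overhead, and any always-active shared layers.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    total_params: float     # all weights stored in memory
    active_params: float    # weights actually read per generated token
    bits_per_weight: float  # average, including quantization scales


def bytes_per_token(spec: ModelSpec) -> float:
    """Weight bytes streamed from memory to produce one token."""
    return spec.active_params * spec.bits_per_weight / 8.0


def decode_ceiling_tokens_per_s(spec: ModelSpec, bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput if weight reads are the only cost."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(spec)


# Hypothetical example values, chosen only to illustrate the arithmetic.
dense = ModelSpec("dense-30B", total_params=30e9, active_params=30e9, bits_per_weight=4.5)
moe = ModelSpec("moe-30B-A3B", total_params=30e9, active_params=3e9, bits_per_weight=4.5)
BANDWIDTH_GB_S = 400.0  # assumed unified-memory bandwidth, not a measured figure

for spec in (dense, moe):
    ceiling = decode_ceiling_tokens_per_s(spec, BANDWIDTH_GB_S)
    print(f"{spec.name}: {bytes_per_token(spec) / 1e9:.2f} GB/token, "
          f"ceiling {ceiling:.1f} tok/s")
```

Under these assumptions the two models have the same total size, but the MoE model's ceiling is higher by exactly the ratio of active parameters. Any gap between this ceiling and measured throughput would then point at the other two hypotheses (GEMV shape and quantization layout).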

I'm curious whether others have observed similar behavior on Apple Silicon specifically.

Has anyone profiled decode throughput across:

and identified which hardware characteristics are actually driving the difference?

I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.