Architecture Decisions
Mixture of Experts (MoE)
Why Mixtral is cheaper than GPT-4
MoE models contain many specialized 'expert' sub-networks, but a small router activates only a few of them for each token. Mixtral 8x7B, for example, holds roughly 47B parameters in total yet uses only about 13B per forward pass (the shared layers plus 2 of its 8 experts per token), giving you large-model quality at small-model inference cost.
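Here is a minimal PyTorch sketch of the idea, assuming a top-2 router over 8 expert MLPs. The MoELayer class, its toy dimensions, and the loop-based dispatch are illustrative only, not Mixtral's actual implementation (which uses much larger hidden sizes and fused expert kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing."""

    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary MLP; only top_k of them run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = MoELayer()
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)                   # torch.Size([10, 64])
```

Note that all 8 experts' weights still live in memory; the saving is in per-token compute, because only 2 expert MLPs run for any given token.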
Why this matters
MoE is why some large models are surprisingly cheap to run: per-token compute scales with the active parameters, not the total parameter count (although all experts must still fit in memory). Understanding this architecture helps you evaluate vendor claims about model size versus actual inference cost.
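A quick back-of-the-envelope check of the total-versus-active gap. The per-expert and shared parameter counts below are assumptions chosen to roughly match an 8-expert, top-2 configuration like Mixtral's, not exact figures.

```python
# Illustrative parameter counts for a Mixtral-like MoE (assumed, not exact).
num_experts, top_k = 8, 2
expert_params = 5.6e9   # parameters per expert, summed across all layers (assumed)
shared_params = 2.0e9   # attention, embeddings, router, norms (assumed)

total = shared_params + num_experts * expert_params   # ~47B stored
active = shared_params + top_k * expert_params        # ~13B used per token

print(f"total ~= {total / 1e9:.0f}B, active per token ~= {active / 1e9:.0f}B "
      f"({active / total:.0%} of the weights do the work)")
```

The ratio, not the headline parameter count, is what drives per-token compute cost.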