Architecture Decisions

Mixture of Experts (MoE)

Why Mixtral is cheaper than GPT-4

MoE models contain many specialized 'expert' sub-networks, but a lightweight router activates only a few of them for each token. Mixtral 8x7B, for example, has roughly 47B total parameters yet routes each token through just 2 of its 8 experts, so only about 13B parameters are used per forward pass, giving you large-model quality at close to small-model inference cost.
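
To make the routing concrete, here is a minimal sketch of a Mixtral-style top-k MoE layer in PyTorch. The class and parameter names (MoELayer, num_experts, top_k, d_model, d_ff) are illustrative assumptions, not any specific library's API; the point is that the router scores all experts but only the top two run for each token.

```python
# Minimal sketch of top-k expert routing (assumed names, not a real library API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: produces one score per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks; only a few run per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (batch, seq, d_model)
        scores = self.router(x)                         # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(1, 4, 512)
print(layer(tokens).shape)  # torch.Size([1, 4, 512]); only 2 of the 8 experts ran per token
```

The loop over experts is written for readability; production implementations batch tokens per expert and add a load-balancing loss so the router does not collapse onto a few favorites.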

Why this matters

MoE is why some models are surprisingly cheap to serve. Understanding this architecture helps you evaluate vendor claims about model size versus actual inference cost: total parameter count tells you the memory footprint, while active parameter count tells you the per-token compute.
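
A back-of-the-envelope sketch of that distinction, using the 47B-total / 13B-active figures from the example above (the proportionality of per-token FLOPs to active parameters is a rough assumption, ignoring attention and memory-bandwidth effects):

```python
# Rough sketch: per-token compute tracks *active* parameters, memory tracks *total* parameters.
total_params = 47e9    # all experts must be loaded into memory
active_params = 13e9   # parameters actually used per token (top-2 experts + shared layers)

print(f"Active fraction: {active_params / total_params:.0%}")  # ~28%
# Per-token compute is closer to a 13B dense model than a 47B one,
# but the serving hardware still needs room for all 47B parameters.
```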