Architecture Decisions
Mixture of Experts (MoE)
Why Mixtral is cheaper than GPT-4
MoE models contain many specialized 'expert' sub-networks, but a small router activates only a few of them for each token. Mixtral 8x7B, for example, holds roughly 47B parameters in total yet uses only about 13B per forward pass (the shared layers plus 2 of its 8 experts per token), giving you large-model quality at small-model inference cost.
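Here is a minimal PyTorch sketch of the idea, assuming a top-2 router over 8 expert MLPs. The MoELayer class, its toy dimensions, and the loop-based dispatch are illustrative only, not Mixtral's actual implementation (which uses much larger hidden sizes and fused expert kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing."""

    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary MLP; only top_k of them run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = MoELayer()
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)                   # torch.Size([10, 64])
```

Note that all 8 experts' weights still live in memory; the saving is in per-token compute, because only 2 expert MLPs run for any given token.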
Why this matters
MoE is why some large models are surprisingly cheap to run: per-token compute scales with the active parameters, not the total parameter count (although all experts must still fit in memory). Understanding this architecture helps you evaluate vendor claims about model size versus actual inference cost.
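A quick back-of-the-envelope check of the total-versus-active gap. The per-expert and shared parameter counts below are assumptions chosen to roughly match an 8-expert, top-2 configuration like Mixtral's, not exact figures.

```python
# Illustrative parameter counts for a Mixtral-like MoE (assumed, not exact).
num_experts, top_k = 8, 2
expert_params = 5.6e9   # parameters per expert, summed across all layers (assumed)
shared_params = 2.0e9   # attention, embeddings, router, norms (assumed)

total = shared_params + num_experts * expert_params   # ~47B stored
active = shared_params + top_k * expert_params        # ~13B used per token

print(f"total ~= {total / 1e9:.0f}B, active per token ~= {active / 1e9:.0f}B "
      f"({active / total:.0%} of the weights do the work)")
```

The ratio, not the headline parameter count, is what drives per-token compute cost.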