GPTCrunch
Deployment

Inference

The process of running a trained AI model to generate predictions or outputs from new inputs. This is what happens every time you send a prompt to an AI model and receive a response.

Inference is the operational phase of an AI model — when it takes your input and produces an output. While training is a one-time (or periodic) process that creates the model, inference happens every time anyone uses it. For large language models, inference means processing a prompt and generating text token by token until the response is complete.

LLM inference has two distinct phases. The "prefill" phase processes all input tokens in parallel to build the model's internal representation of your prompt. The "decode" phase generates output tokens one at a time, each conditioned on all previous tokens. The prefill phase is compute-bound (limited by raw processing power), while the decode phase is memory-bandwidth-bound (limited by how fast the model weights and KV cache can be read from GPU memory). This distinction matters for optimization: different hardware and software strategies target each phase differently.

Inference costs dominate the economics of AI applications. While training a frontier model costs millions of dollars, the cumulative cost of serving it to millions of users quickly dwarfs the training investment. This is why efficiency matters: techniques like quantization, speculative decoding, KV-cache optimization, and continuous batching, along with dedicated inference software and hardware (such as NVIDIA's TensorRT-LLM or specialized accelerator chips), aim to reduce the cost per token while maintaining quality.
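A back-of-the-envelope calculation shows how serving costs overtake a training budget. Every figure here (training cost, price per million tokens, daily token volume) is an invented assumption for illustration, not a real price:

```python
def days_until_inference_exceeds_training(
    training_cost_usd: float,
    usd_per_million_tokens: float,
    tokens_served_per_day: float,
) -> float:
    """Days of serving until cumulative inference spend equals training cost."""
    daily_cost = tokens_served_per_day / 1e6 * usd_per_million_tokens
    return training_cost_usd / daily_cost

# Illustrative: a $100M training run, $10 per 1M tokens, 100B tokens/day
# -> $1M/day in inference, so serving costs match training in ~100 days.
days = days_until_inference_exceeds_training(100e6, 10.0, 100e9)
```

Halving the cost per token (via quantization, better batching, etc.) doubles that horizon, which is why per-token efficiency work compounds at scale.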

For developers choosing between models, inference characteristics — latency (time to first token and total generation time), throughput (tokens per second), and cost per token — are often more important than raw benchmark scores. A model that scores 2% higher on benchmarks but costs 5x more to run may not be the right choice. GPTCrunch's pricing and performance data helps you evaluate these tradeoffs across providers and models.
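The benchmark-versus-price tradeoff above can be made concrete with a crude value-per-dollar metric. The scores and prices below are invented for illustration, not measurements of real models:

```python
def score_per_dollar(benchmark_score: float, usd_per_million_tokens: float) -> float:
    """Crude value metric: benchmark points per dollar of inference spend."""
    return benchmark_score / usd_per_million_tokens

# Hypothetical numbers: a frontier model scoring ~2% higher at 5x the price
# versus a cheaper model -- the cheaper model wins on this metric.
frontier = score_per_dollar(benchmark_score=88.4, usd_per_million_tokens=15.0)
efficient = score_per_dollar(benchmark_score=86.6, usd_per_million_tokens=3.0)
```

A single ratio like this is deliberately simplistic; in practice you would weigh latency, throughput, and task-specific quality alongside price.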
