The total number of tokens a model accepts in a single request, counting input and output together. Larger windows raise cost and latency, and quality often degrades toward the far end.
Definition: The total number of tokens a model accepts in a single request, counting input and output together. Larger windows raise cost and latency, and quality often degrades toward the far end.
The maximum number of tokens a model can hold in one request, input and output combined. Larger windows let you feed more transcripts or documents at once, but cost rises, latency rises, and quality often degrades at the far end of the window. "Just paste the whole study in there" is rarely the right move.
The unit a language model processes input and produces output in. Roughly four characters of English on average, less for code and non-Latin scripts.
Running a trained model to produce output, as opposed to training it. Every API call to a language model is inference, and inference cost per token is what shapes LLM economics.
A technique that enhances LLM responses by first retrieving relevant information from a specific knowledge base, then using that information to ground the model's output.