Token

The fundamental unit of text processed by language models.

Overview

A token is the smallest unit of text that an AI model processes. Tokens are the building blocks of language that models operate on: they can be whole words, parts of words, or even single characters, depending on the tokenizer. For example, the word "understanding" might be split into the tokens "under" and "standing", while a common word like "cat" is typically a single token.
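The splitting described above can be sketched as a greedy longest-match tokenizer. This is a toy illustration, not how production tokenizers work (real tokenizers such as BPE learn their vocabularies from data), and the vocabulary here is invented for the example:

```python
# Toy greedy longest-match tokenizer over a tiny, hand-picked vocabulary.
# Purely illustrative; real vocabularies are learned, not hand-written.
VOCAB = {"under", "standing", "stand", "ing", "cat",
         "c", "a", "t", "u", "n", "d", "e", "r", "s", "i", "g"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("understanding"))  # ['under', 'standing']
print(tokenize("cat"))            # ['cat']
```

Because single characters are in the vocabulary as a fallback, any string built from those characters can still be tokenized, just less compactly.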

How Tokens Work

Tokens help AI models process text efficiently by:

  • Breaking text into manageable pieces
  • Mapping arbitrary text onto a fixed vocabulary
  • Capturing common patterns in language, such as frequent prefixes and suffixes
  • Handling multiple languages and rare words without unknown-word failures

Common Examples

  • Full words: "cat", "dog", "the"
  • Word pieces: "under" + "stand" + "ing"
  • Special tokens: [START], [END], [MASK]
  • Numbers and punctuation: "123", "!", "?"
  • Common sequences: "ing", "'s", "pre"
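Inside a model, every token (including special tokens like [START] and [END]) is mapped to an integer ID from a fixed vocabulary. The mapping below is invented for illustration; real vocabularies contain tens of thousands of entries:

```python
# Hypothetical token-to-ID vocabulary, including special tokens.
VOCAB_IDS = {"[START]": 0, "[END]": 1, "the": 2, "cat": 3,
             "'s": 4, "pre": 5, "ing": 6, "!": 7, "123": 8}

def encode(tokens: list[str]) -> list[int]:
    """Wrap a token sequence with special tokens and map each to its ID."""
    return ([VOCAB_IDS["[START]"]]
            + [VOCAB_IDS[t] for t in tokens]
            + [VOCAB_IDS["[END]"]])

print(encode(["the", "cat", "!"]))  # [0, 2, 3, 7, 1]
```

The model never sees the raw strings, only these integer IDs, which is why special tokens can mark structure (start, end, masked positions) without appearing as literal text.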

Practical Impact

Understanding tokens is important because they:

  • Affect model performance, especially on character-level tasks like spelling and arithmetic
  • Influence processing costs, since API usage is typically billed per token
  • Set text length limits, because context windows are measured in tokens rather than characters
  • Shape model behavior, since the model reasons over token boundaries rather than raw text
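The cost and length points above can be sketched with a back-of-envelope estimate. The 4-characters-per-token ratio is a common rule of thumb for English text, not an exact value, and the price is hypothetical:

```python
# Rough token and cost estimate. Both constants are assumptions:
# ~4 characters per token is a rule of thumb for English, and the
# price per 1,000 tokens is invented for illustration.
AVG_CHARS_PER_TOKEN = 4
PRICE_PER_1K_TOKENS = 0.002  # hypothetical, in dollars

def estimate(text: str) -> tuple[int, float]:
    n_tokens = max(1, len(text) // AVG_CHARS_PER_TOKEN)
    cost = n_tokens / 1000 * PRICE_PER_1K_TOKENS
    return n_tokens, cost

tokens, cost = estimate("The quick brown fox jumps over the lazy dog.")
print(tokens)  # 11 (44 characters // 4)
```

For an accurate count, use the actual tokenizer of the model in question; estimates like this only help with rough capacity and budget planning.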