★ Gen AI Summit Asia · August 2026 · Malaysia · Get your ticket →
AI Terms for Beginners: The Essential Glossary
AI for Beginners · May 8, 2026 · 6 min read


Tokens, inference, quantization: the AI terms that trip up beginners are simpler than they sound. Here is your practical starting glossary, decoded.

Jackson Yew

Professionals everywhere are picking up AI tools faster than they are picking up AI vocabulary. The core terms (tokens, inference, quantization, context window) map to a handful of concrete ideas. Once you know those ideas, every AI article, job listing, and product page becomes readable. This glossary gives you that foundation in plain language, no engineering background needed.

LinkedIn's 2025 Workplace Learning Report ranked AI literacy as the number one skill professionals sought to build, with enrollment in foundational AI courses rising 142% year over year (LinkedIn Learning, 2025). As of May 2026, the term "AI literacy" appears in more than half of all new technology job postings on LinkedIn globally, up from roughly one in five postings in 2023 (LinkedIn Talent Insights). The vocabulary gap is real, and it has a cost. These are the terms that close it.

If you want to go deeper after reading this, How to Learn Claude AI from Scratch in 2026 is a good next stop.


What Is an AI Model and How Does It Learn?

An AI model is a system trained on data to recognize patterns and produce outputs. Think of it as a refined pattern-matching machine. Feed it enough text, images, or code, and it learns to predict what comes next.

Two words separate beginners from confused beginners: training and inference. Training is when the model learns. Engineers feed it massive datasets, and the model adjusts billions of internal values called parameters. Think of parameters like the knobs on a mixing board. Each small adjustment shapes how the model responds to any given input.

Inference is when you use the model. You type a prompt. The model runs its learned parameters and produces an output. Training happens once. Inference happens every time you press send.

Model size is measured in parameters. A 7-billion-parameter model runs on a laptop. A 400-billion-parameter model needs a data center. Bigger is not always better for your specific use case, but size does affect both capability and cost. Knowing this helps you pick the right tool for the job, not just the biggest one available.
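The laptop-versus-data-center distinction above is just multiplication: parameter count times bytes per parameter. A minimal sketch, using standard storage sizes (2 bytes for 16-bit precision); the model sizes are illustrative, not tied to any specific product.

```python
# Rough memory needed just to hold a model's parameters.
# Bytes per parameter reflect standard numeric storage sizes.

def model_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate gigabytes needed to store the parameters alone."""
    return num_params * bytes_per_param / 1e9

# A 7-billion-parameter model at 16-bit (2-byte) precision: ~14 GB.
print(model_memory_gb(7e9, 2))    # 14.0

# A 400-billion-parameter model at the same precision: ~800 GB.
print(model_memory_gb(400e9, 2))  # 800.0
```

Actual memory use is higher in practice (activations, caches, overhead), but this arithmetic explains why parameter count is the headline spec.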


What Are Tokens and Why Do They Matter?

Tokens are the chunks of text AI models read and write. They are not exactly words. One token equals roughly 0.75 words on average. A simple 10-word sentence uses about 13 tokens. A full 10,000-word report uses around 13,000 tokens.

Why does this matter? Because every major AI API charges by token. Input tokens (what you send) and output tokens (what the model returns) both cost money. Understanding tokens helps you write tighter prompts and cut costs in production.

Tokenization also has quirks. The word "ChatGPT" might be one token. The word "unbelievable" might be three. Common short words tend to be single tokens. Rare or long words get split. This is why AI sometimes stumbles on made-up words or unusual names.

Token limits also shape what you can send. Every model has a maximum token count. Exceed it and the model cuts off your input, returns an error, or silently drops earlier content. Knowing that tokens are the unit of measurement, not words or characters, is the first step toward working with AI tools rather than against them.


How Does a Context Window Affect What AI Can Do?

The context window is the total number of tokens a model can "see" at once. It covers both your input and the model's output. If the window is 8,000 tokens, that is roughly 6,000 words total, shared between what you send and what the model returns.
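Because input and output share one window, the room left for a response is simply the window minus the prompt. A minimal sketch using the article's 8,000-token example:

```python
# Context-window budgeting: input and output share the same window.

def response_budget(window_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the model's reply after the prompt is counted."""
    return max(window_tokens - prompt_tokens, 0)

# An 8,000-token window with a 5,000-token prompt leaves 3,000 tokens
# (roughly 2,250 words) for the answer.
print(response_budget(8_000, 5_000))  # 3000

# A 9,000-token prompt overflows the window before the reply starts.
print(response_budget(8_000, 9_000))  # 0
```

This is why pasting a huge document and then asking for a long summary can fail even when each piece fits on its own.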

A small context window causes a familiar problem: the AI forgets earlier parts of a long conversation. Ask it about something mentioned 20 messages back and it may have no memory of it. That text simply fell outside the window.

Larger windows solve this. As of May 2026, Anthropic's Sonnet 4.6 and OpenAI's GPT-5.5 both support context windows of 200,000 tokens or more, making context window size one of the most-compared specs among first-time AI tool buyers.

In practice, this matters for real tasks. Summarizing a 50-page report needs a large context window. Answering a quick question needs almost none. Match the model's window to the size of your task, and you will avoid the frustrating experience of watching the AI "forget" what you already told it.


Inference, Latency, and Throughput: Running AI in the Real World

Inference is the act of sending a prompt and receiving a response. Every time you use an AI tool, you are running inference. This is distinct from training, which is the far more expensive process of building the model in the first place.

Two speed metrics matter most in production. Latency is the time to your first token: how long before the model starts responding. Throughput is tokens per second: how fast the full response arrives. Low latency matters for live chat. High throughput matters for batch document processing.

Inference cost dominates most production AI budgets. Training a large model might cost millions of dollars once. But running inference at scale, thousands of requests per hour, adds up fast. This is why pricing pages on AI products almost always show a cost-per-token structure.

The bridge between a model and any app is the API (application programming interface). You send a request. The model returns a response. Your app handles the rest. Every AI product you use, from a chatbot to an AI writing tool, sits on top of this same basic structure.
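The request/response shape most chat-style AI APIs share can be sketched with the standard library alone. The endpoint URL, header names, model name, and payload fields below are generic placeholders, not any vendor's actual schema; check your provider's documentation for the real one.

```python
import json
import urllib.request

def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Assemble a generic chat-completion request (placeholder schema)."""
    payload = {
        "model": "example-model",              # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,                     # cap on output tokens
    }
    return urllib.request.Request(
        "https://api.example.com/v1/chat",     # placeholder endpoint
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("sk-...", "What is a context window?")
print(req.get_method())  # POST
```

Every AI-powered product, whatever its interface, reduces to some variant of this POST-a-prompt, parse-a-response loop.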


What Is Quantization and Why Should Beginners Care?

Quantization is compressing a model's numerical precision to reduce memory and compute requirements. Every parameter in a model is stored as a number. Full precision stores that number with high detail, using more memory. Quantization rounds that number to fewer decimal places, using less.

The tradeoff is small but real. A quantized model runs faster and fits on cheaper hardware. Output quality drops slightly, though for most everyday tasks the difference is hard to notice.
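What "rounding to fewer decimal places" looks like in practice: map a weight onto a coarse integer grid and back. The small round-trip error is the quality cost described above. This is a toy sketch, not a real quantization scheme (real ones work per block of weights with learned scales and zero-points).

```python
# Toy 8-bit quantization: snap a float weight to a 256-level grid.

def quantize_int8(w: float, scale: float) -> int:
    """Round a weight to the nearest step, clamped to int8 range."""
    return max(-128, min(127, round(w / scale)))

def dequantize(q: int, scale: float) -> float:
    """Recover an approximation of the original weight."""
    return q * scale

scale = 0.01
w = 0.7345
q = quantize_int8(w, scale)       # 73 -- fits in a single byte
restored = dequantize(q, scale)   # ~0.73, off by 0.0045
print(q, restored)
```

Multiply that tiny per-weight error across billions of parameters and you get the "slightly lower quality, much smaller file" tradeoff.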

Beginners encounter quantization most often when running local models on a laptop, using tools like Ollama or LM Studio. Model cards on Hugging Face show shorthand like Q4 or Q8. The number refers to how many bits are used per parameter. Q4 uses 4 bits: a smaller file, more compression. Q8 uses 8 bits: closer to full quality. As of early 2026, Hugging Face hosts over 1.2 million public model repositories, with the majority referencing quantization formats directly in their model cards, exposing the term to millions of newcomers each month.
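The file-size arithmetic behind those Q4 and Q8 labels is bits per parameter times parameter count. Figures below are approximate; real quantized files add some overhead for scales and metadata.

```python
# Approximate download size at different bit widths.

def quantized_size_gb(num_params: float, bits_per_param: int) -> float:
    """Gigabytes needed at the given bits-per-parameter."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B model at {label}: ~{quantized_size_gb(7e9, bits):.1f} GB")
# 7B model at FP16: ~14.0 GB
# 7B model at Q8: ~7.0 GB
# 7B model at Q4: ~3.5 GB
```

A Q4 file is a quarter the size of the full-precision original, which is often the difference between fitting in a laptop's memory and not.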

When you see "Q4_K_M" in a model filename, do not let it scare you. It just means the model has been compressed for faster local use.


Terms You Will See in AI Job Listings and Product Pages

Three pairs of terms appear constantly in job postings and vendor comparisons. Knowing them makes any AI conversation easier.

Fine-tuning is adapting a pre-trained model on a smaller, domain-specific dataset. A hospital might fine-tune a base model on medical records. The result responds better to medical queries than a general model would. Fine-tuning is how companies build AI that feels native to their industry.

RAG stands for retrieval-augmented generation. Instead of relying only on what the model learned during training, RAG connects it to an external knowledge source at query time. Think of it as giving the model a live reference book to check before answering. This is how many AI products stay current without retraining constantly.
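The "retrieve, then generate" shape of RAG fits in a few lines. This sketch scores documents by simple keyword overlap; real systems use vector embeddings and a proper search index, so treat this only as the skeleton of the idea.

```python
import re

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the question."""
    q = words(question)
    return max(documents, key=lambda d: len(q & words(d)))

def build_prompt(question: str, documents: list[str]) -> str:
    """Prepend the retrieved document so the model answers from it."""
    context = retrieve(question, documents)
    return f"Using this reference:\n{context}\n\nAnswer: {question}"

docs = [
    "Refund policy: purchases can be returned within 30 days.",
    "Shipping: standard delivery takes 3-5 business days.",
]
print(build_prompt("Within how many days can I return a purchase for a refund?", docs))
```

The model never needs retraining: swapping in fresh documents updates what it can answer, which is the whole appeal of RAG.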

Foundation model refers to a large base model trained on broad data, the starting point for many specific applications. Open weights means the model's parameters are publicly available for anyone to download and run. Proprietary model means a company controls access, usually through an API.

Knowing these three pairs (fine-tuned vs. base, RAG vs. no RAG, open vs. proprietary) covers most of what you will read on any AI product page or job description. For a practical look at how these ideas show up in real tool comparisons, 5 Best ChatGPT Alternatives in 2026 That Actually Work walks through several models side by side.


Key Takeaway

AI has a vocabulary that sounds technical but maps to a handful of concrete ideas. Once you know that a model learns during training, runs during inference, reads in tokens, and is bounded by a context window, every other term clicks into place. This glossary is not about memorizing definitions. It is about building the mental model that lets you read any AI article, evaluate any AI tool, and speak confidently in any AI conversation.

FAQ

What does token mean in AI?

A token is the basic unit of text that an AI language model processes. It is not exactly a word: one token is roughly 0.75 words in English, so the sentence 'The cat sat' is about four tokens. Tokens matter because AI APIs charge per token consumed (input plus output), and every model has a maximum number of tokens it can handle in a single request. Understanding tokens helps you estimate costs and structure prompts more efficiently.

What is the difference between AI training and inference?

Training is the process of building an AI model by exposing it to large amounts of data and adjusting its internal parameters until it produces useful outputs. Inference is what happens when you actually use the model: you send it a prompt and it generates a response. Training is expensive and happens once (or periodically). Inference is cheap per request but adds up at scale. Most people interacting with AI tools like ChatGPT or Claude are always on the inference side.

What does quantization mean in simple terms?

Quantization is a technique that shrinks an AI model's file size by reducing the numerical precision of its weights. Think of it like compressing a photo: you lose a little detail, but the file becomes far more manageable. A quantized model uses less memory and runs faster, which is why it is popular for running AI locally on a laptop or phone. You will see labels like Q4 or Q8 in open-weight model names, where higher numbers generally mean better quality at the cost of larger file size.

What is a context window in AI models?

A context window is the maximum amount of text, measured in tokens, that an AI model can read and consider at one time. It includes your prompt, any documents you paste in, and the model's own previous responses in a conversation. If content exceeds the context window, the model cannot 'see' the earlier parts. Larger context windows let you work with longer documents, longer conversations, and more complex tasks without losing information mid-session.

What does fine-tuning an AI model mean?

Fine-tuning means taking a pre-trained foundation model and continuing to train it on a smaller, targeted dataset so it becomes better at a specific task or domain. For example, a general-purpose language model might be fine-tuned on medical literature to improve its performance in clinical settings. Fine-tuning is less expensive than training from scratch, but it still requires compute resources and carefully curated data. It is distinct from prompt engineering, which shapes model behavior without changing the model's weights.

Sources

  1. Artificial Intelligence Index Report 2025
  2. OpenAI Tokenizer Documentation
  3. Hugging Face Model Hub Documentation
