🔍 Click image to zoom
Frequently Asked Questions
What makes a language model "large"?
A language model is considered "large" when it has enough parameters — typically above one billion — to exhibit emergent capabilities such as in-context learning, chain-of-thought reasoning, and zero-shot task generalisation. The threshold is not fixed; as the field progresses, "large" is a relative term compared to the models of the time.
What is the difference between an LLM and a chatbot?
An LLM is the underlying AI model — a mathematical system trained on text data. A chatbot is a product layer built on top of an LLM, which adds a user interface, memory management, and system prompts to shape how the model behaves. ChatGPT and Claude are chatbots; GPT-4 and Claude 3 Sonnet are the LLMs powering them.
How are LLMs evaluated?
LLMs are evaluated on standardised benchmarks such as MMLU (multiple-choice knowledge), HumanEval (code generation), GSM8K (maths reasoning), and MT-Bench (instruction following). Human evaluation — where annotators compare model outputs — is also widely used, as benchmarks can be gamed through overfitting to test sets. No single benchmark fully captures real-world LLM usefulness.