How LLMs Scaled from 512 to 2M Context
2026-02-12 • Lausanne Metro
Do you remember the days when language models had a context length of just 512 tokens (the original Transformer, BERT, etc.)? How did we get from that to 2 million tokens?
For intuition: on average, 1 word ≈ 1.33 tokens, so 2M tokens ≈ 1.5M words.
That’s roughly equivalent to:
| Book | Approx. Words | Copies That Fit in 2M Tokens |
|---|---|---|
| Typical English Protestant Bible | 783,000 | 1.92 |
| Entire 7-book Harry Potter series | 1,084,000 | 1.38 |
| Entire LOTR trilogy | 481,000 | 3.12 |
| Complete works of William Shakespeare | 884,000 | 1.70 |
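The numbers in the table come from a quick back-of-the-envelope division. A minimal sketch, using the 1.33 tokens-per-word rule of thumb from above (the book word counts are the approximate figures from the table):

```python
# A 2M-token context at ~1.33 tokens per word holds roughly 1.5M words.
WORDS_IN_CONTEXT = 1_500_000  # ≈ 2_000_000 tokens / 1.33 tokens per word

# Approximate word counts from the table above.
books = {
    "Typical English Protestant Bible": 783_000,
    "Entire 7-book Harry Potter series": 1_084_000,
    "Entire LOTR trilogy": 481_000,
    "Complete works of William Shakespeare": 884_000,
}

for title, word_count in books.items():
    copies = WORDS_IN_CONTEXT / word_count
    print(f"{title}: {copies:.2f} copies fit")
```

Keep in mind that 1.33 tokens per word is only an average for English prose; the exact ratio depends on the tokenizer and the text, so these figures are ballpark estimates, not precise counts.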
I hadn’t really thought about this until a colleague asked me how context length actually scaled.
What made it even more interesting was that just a few days earlier, I had read that context length is one of the biggest bottlenecks for autonomous agents, and that even 2M tokens is still not enough.
That made me very curious to do a deep dive into how context scaling actually works — and in this post, I’ll summarize what I learned.
Writing....