How LLMs Scaled from 512 to 2M Context

2026-02-12 · Lausanne Metro

Do you remember the days when language models had a context length of just 512 tokens (the original Transformer, BERT, etc.)? How did we get from that to 2 million tokens?

For intuition: on average, 1 word ≈ 1.33 tokens, so 2M tokens ≈ 1.5M words.

That’s roughly equivalent to:

| Book | Approx. Words | How Many Fit in 2M Tokens |
| --- | --- | --- |
| Typical English Protestant Bible | 783,000 | 1.92 |
| Entire 7-book Harry Potter series | 1,084,000 | 1.38 |
| Entire LOTR trilogy | 481,000 | 3.12 |
| Complete works of William Shakespeare | 884,000 | 1.70 |
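The arithmetic above can be sketched in a few lines. This is just a back-of-the-envelope check, assuming the ~1.33 tokens-per-word ratio and the rounded 1.5M-word figure from the post; the word counts are the ones in the table.

```python
# Sketch of the arithmetic above: at ~1.33 tokens per word,
# a 2M-token context holds about 2_000_000 / 1.33 ≈ 1.5M words.
CONTEXT_WORDS = 1_500_000  # the rounded figure used in the table

# Word counts from the table above.
books = {
    "Typical English Protestant Bible": 783_000,
    "Entire 7-book Harry Potter series": 1_084_000,
    "Entire LOTR trilogy": 481_000,
    "Complete works of William Shakespeare": 884_000,
}

for title, words in books.items():
    # e.g. the Bible fits about 1.92 times into a 2M-token context
    print(f"{title}: fits {CONTEXT_WORDS / words:.2f} times")
```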

I hadn’t really thought about this until a colleague asked me how context length actually scaled.

What made it even more interesting was that just a few days earlier, I had read that context length is one of the biggest bottlenecks for autonomous agents, and that even 2M tokens is still not enough.

That made me curious enough to do a deep dive into how context scaling actually works — and in this post, I'll summarize what I learned.

Writing...