How LLMs Scaled from 512 to 2M Context

2026-02-12 · Lausanne Metro

Do you remember the days when language models had a context length of just 512 tokens (the original Transformer, BERT, etc.)? How did we get from that to 2 million tokens?

For intuition: on average, 1 word ≈ 1.33 tokens, so 2M tokens ≈ 1.5M words.

That’s roughly equivalent to:

| Book | Approx. Words | How Many Fit in 2M Tokens |
| --- | --- | --- |
| Typical English Protestant Bible | 783,000 | 1.92 |
| Entire 7-book Harry Potter series | 1,084,000 | 1.38 |
| Entire LOTR trilogy | 481,000 | 3.12 |
| Complete works of William Shakespeare | 884,000 | 1.70 |
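The arithmetic above can be sketched in a few lines. This is just a back-of-the-envelope check, assuming the ~1.33 tokens-per-word ratio and the rounded 1.5M-word figure from the post; the word counts are the ones in the table.

```python
# Sketch of the arithmetic above: at ~1.33 tokens per word,
# a 2M-token context holds about 2_000_000 / 1.33 ≈ 1.5M words.
CONTEXT_WORDS = 1_500_000  # the rounded figure used in the table

# Word counts from the table above.
books = {
    "Typical English Protestant Bible": 783_000,
    "Entire 7-book Harry Potter series": 1_084_000,
    "Entire LOTR trilogy": 481_000,
    "Complete works of William Shakespeare": 884_000,
}

for title, words in books.items():
    # e.g. the Bible fits about 1.92 times into a 2M-token context
    print(f"{title}: fits {CONTEXT_WORDS / words:.2f} times")
```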

I hadn’t really thought about this until a colleague asked me how context length actually scaled.

What made it even more interesting was that just a few days earlier, I had read that context length is one of the biggest bottlenecks for autonomous agents, and that even 2M tokens is still not enough.

That made me curious enough to do a deep dive into how context scaling actually works — and in this post, I'll summarize what I learned.

Writing...