How DeepSeek learns reasoning capability
It first learns simple rules exactly right, then a more complex, combinatorial thinking process.
My understanding after discussions with friends (https://jairwuilloud.com/) and at the Turing Institute:
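Pattern matching in a memory-economical way ("attention latent vector"): the attention keys and values are compressed into a small latent vector, so less memory is needed to look back over the text. A pattern is a short piece of text, for example "United Kingdom" as in "Lorem ipsum dolor sit amet United Kingdom consectetur adipiscing elit".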
Learn the basic rules exactly right ("Finetuning"), like 2 + 3 = 5.
Optimise the thinking process towards the correct one for more complex problems ("GRPO", Group Relative Policy Optimisation). For each problem it generates many candidate thinking processes and gives those that reach the correct answer a higher score than the rest of the group.
It does the Finetuning and Policy Optimisation twice over.
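The scoring in the GRPO step can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's code: the function names, the exact-match reward and the toy candidates are all assumptions made for the example. It shows only the group-relative part, where each candidate's reward is compared with the mean and standard deviation of its own group, so correct candidates get a positive score and incorrect ones a negative score. The policy update that uses these scores is left out.

```python
from statistics import mean, pstdev


def rule_based_reward(candidate_answer, reference_answer):
    # Hypothetical reward: 1.0 for an exact match with the reference, else 0.0.
    return 1.0 if candidate_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    # Score each candidate relative to its own group:
    # (reward - group mean) / group standard deviation.
    # Assumption: population standard deviation, plus a small eps
    # so that a group with identical rewards does not divide by zero.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: final answers of four candidate thinking processes for "2 + 3 = ?"
candidates = ["5", "6", "5", "23"]
rewards = [rule_based_reward(c, "5") for c in candidates]
advantages = group_relative_advantages(rewards)

for answer, advantage in zip(candidates, advantages):
    print(f"answer={answer!r} advantage={advantage:+.2f}")
```

Because the scores are relative to the group, a group in which every candidate is right (or every candidate is wrong) yields advantages of zero, so nothing is learned from that problem in that round.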
Reference
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948