I'm not a math person either, but I'm familiar with some of the terminology. The two things I got out of this paper were:
1. Depth is more important than breadth for making transformers smarter. That is, for a given model size it will be more powerful with more, smaller layers than with fewer, bigger ones (see the parameter-count sketch after this list). Interestingly, Mistral just updated their small model yesterday with a big reduction in layers in order to improve performance. Among the more technical ways the authors put it, they say directly that "depth plays a more critical role than width in reasoning and composition tasks".
2. As I understand it, they claim to prove mathematically that Chain of Thought, as seen in the new DeepSeek R1 and OpenAI o1/o3 models, produces results that wouldn't be possible without it. The thought chain effectively acts as additional layers, and per the previous point, the more layers the better (a toy sketch of this "extra computation" view follows below). "From a theoretical view, CoT provides Transformer with extra computation space, and previous work ... proved that log-precision Transformer with CoT could simulate any polynomial-time algorithm. Therefore, by further assuming certain complexity conjecture ... their results imply that constant depth Transformer with CoT could simulate poly-time algorithm, while constant depth Transform ... itself can not solve P-complete task."
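To make the "same parameter budget, different shapes" idea concrete, here is a minimal sketch (my own illustration, not from the paper) that counts parameters for a deep/narrow stack versus a shallow/wide one using PyTorch's stock encoder layers. The specific widths and layer counts are made-up values, chosen so the two configurations land at roughly the same size:

    # Illustrative only: compare parameter counts for two encoder stacks that
    # spend a similar budget either on depth (many narrow layers) or on width
    # (few wide layers). Layer counts and widths are arbitrary example values.
    import torch.nn as nn

    def encoder_params(d_model: int, n_layers: int, n_heads: int) -> int:
        """Total trainable parameters in a stack of standard encoder layers."""
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,  # the usual 4x feed-forward expansion
            batch_first=True,
        )
        stack = nn.TransformerEncoder(layer, num_layers=n_layers)
        return sum(p.numel() for p in stack.parameters())

    deep_narrow  = encoder_params(d_model=512,  n_layers=24, n_heads=8)   # more, smaller layers
    shallow_wide = encoder_params(d_model=1024, n_layers=6,  n_heads=16)  # fewer, bigger layers

    print(f"deep/narrow : {deep_narrow / 1e6:.1f}M parameters")
    print(f"shallow/wide: {shallow_wide / 1e6:.1f}M parameters")

Both stacks come out around 75M parameters, so the paper's claim is about which shape reasons better at a given capacity, not about which one is bigger.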
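And a rough back-of-the-envelope illustration of the "extra computation space" point (my framing, not the paper's actual proof): with a fixed number of layers, every intermediate thought token the model writes and then reads back is another full forward pass, so the chain length multiplies the sequential computation available before the final answer:

    # Toy accounting, not the paper's construction: a constant-depth model that
    # answers directly gets n_layers sequential layer applications; writing a
    # chain of thought first adds n_layers more per generated token.
    def sequential_steps(n_layers: int, cot_tokens: int) -> int:
        """Sequential layer applications before the final answer token."""
        return n_layers * (cot_tokens + 1)

    print(sequential_steps(n_layers=32, cot_tokens=0))    # direct answer -> 32
    print(sequential_steps(n_layers=32, cot_tokens=100))  # 100-token chain -> 3232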