100 TFLOPs vs. 2 Human Hours: Measuring Work in FLOPs

Current 70B models spend ~100 TFLOPs to produce output that, in many cases, might take an expert human ~2 hours of work.
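(As a rough sanity check of that figure, my own back-of-envelope rather than anything from the thread: a dense transformer spends roughly 2 × parameters FLOPs per generated token, so ~100 TFLOPs from a 70B model corresponds to generating on the order of 700 tokens.)

```python
# Back-of-envelope check (assumption: forward-pass cost for a dense
# transformer is roughly 2 * n_params FLOPs per token, ignoring
# attention overhead).
n_params = 70e9                      # 70B-parameter model
flops_per_token = 2 * n_params       # ~1.4e11 FLOPs per token

budget = 100e12                      # ~100 TFLOPs, the figure quoted above
tokens = budget / flops_per_token
print(f"~{tokens:.0f} tokens for {budget:.0e} FLOPs")  # ~714 tokens
```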

If you start measuring the amount of work done by humans and machines in FLOPs, you get some very interesting consequences! 🧵

One can argue that problems take more time to solve as they get harder. Einstein spent 10 intense years finding the field equations. It took many generations of mathematicians, building on each other's work for over 350 years, to prove Fermat's Last Theorem.

A homework problem, a Kaggle challenge, a major research breakthrough - all might lie on one continuous curve of FLOPs expenditure. While a big breakthrough might appear non-linear, it is the continuation of FLOPs spent over a long period of time by many people building on each other's work.

Current frontier models are great at homework problems, mediocre at Kaggle challenges, and cannot yet produce a legitimate research paper.

Could this be because of the limited FLOPs available in a forward pass, aka the thinking "bandwidth" of LLMs? Scaling increases this bandwidth.

The big issue: forward-pass FLOPs increase ~linearly with model size, BUT training cost increases with ~the square of model size (under compute-optimal scaling).
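A minimal sketch of where that square comes from, assuming Chinchilla-style compute-optimal scaling (training tokens D grow roughly in proportion to parameters N, and training cost ≈ 6·N·D; the ~20 tokens-per-parameter ratio is an assumption, not from the thread):

```python
# Chinchilla-style compute-optimal scaling (assumed): training tokens
# D grow in proportion to parameters N (D ≈ 20 * N), so training cost
# C ≈ 6 * N * D grows with N**2, while forward-pass cost per token
# (≈ 2 * N) grows only linearly.
def forward_flops_per_token(n_params: float) -> float:
    return 2 * n_params

def training_flops(n_params: float, tokens_per_param: float = 20) -> float:
    d_tokens = tokens_per_param * n_params
    return 6 * n_params * d_tokens

for scale in (1, 10, 100):
    n = scale * 70e9
    print(f"{scale:>4}x model: forward {forward_flops_per_token(n):.1e} "
          f"FLOPs/token, training {training_flops(n):.1e} FLOPs")
# Scaling the model 10x makes each forward pass 10x more expensive,
# but makes training ~100x more expensive.
```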

This effectively rules out scaling as the way to make models smart enough to achieve big scientific breakthroughs!!

For example, if you ask how big a model would have to be to prove a math theorem that took 1M hours of work by expert mathematicians over 350 years, the answer is a model ~100,000 times bigger than GPT-4, because otherwise a forward pass won't have the needed FLOPs!
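A hypothetical worked version of that estimate, under the thread's own linear assumption (hours of expert-equivalent work per prompt scale linearly with forward-pass FLOPs, hence with model size, from a ~2-hour baseline; the exact multiplier depends on which baseline model you pick):

```python
# Hypothetical arithmetic under the thread's linear assumption:
# hours of expert-equivalent work per prompt scale linearly with
# model size, starting from a ~2-hour baseline.
baseline_hours = 2
target_hours = 1_000_000     # a theorem worth ~1M expert-hours

scale_up = target_hours / baseline_hours
print(f"needed model scale-up: ~{scale_up:,.0f}x")        # ~500,000x
print(f"training cost multiplier: ~{scale_up**2:.1e}x")   # ~2.5e11x
```

The quadratic training bill, not the inference bill, is what makes this prohibitive.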

We will surely soon get a 10X larger model, and it will be able to generate output for a prompt that takes 20 hours of expert work instead of 2. That would be super awesome, but still far away from solving the Riemann Hypothesis. It won't happen even with a 100X larger model!

All this points to the possibility that scaling is likely not the answer to everything. In fact, it has very stringent limitations.

Instead of expecting the answer in one forward pass for a given prompt, LLMs need to be able to create a plan, review it, and re-plan.

I think agentic AI is where the next biggest breakthroughs might happen, not scaling. In essence, agentic AI trades model size for time by accumulating FLOPs, as sketched below.
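A minimal sketch of that trade-off (the loop structure and FLOP accounting are my own illustration, not an existing API):

```python
# Illustrative only: an agentic loop accumulates forward-pass FLOPs
# across many plan/review/re-plan steps, so a smaller model can match
# the total inference FLOPs of a single pass by a much larger one.
FLOPS_PER_PASS_70B = 100e12   # ~100 TFLOPs per forward pass (see above)

def agentic_flops(n_steps: int, flops_per_pass: float = FLOPS_PER_PASS_70B) -> float:
    """Total compute spent by a plan -> act -> review -> re-plan loop."""
    return n_steps * flops_per_pass

# A 70B model looping 10,000 times spends as many inference FLOPs as
# a single pass of a model 10,000x its size -- without the ~10^8x
# larger training bill implied by quadratic training cost.
print(f"{agentic_flops(10_000):.0e} FLOPs")  # 1e18 FLOPs
```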

On many tasks, LLMs are ~50,000 times faster than humans. Imagine proving FLT in under a day instead of 358 years!
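(A quick check of that arithmetic, using the thread's own numbers: FLT cost roughly 1,000,000 expert-hours, and 1,000,000 h ÷ 50,000 ≈ 20 h, i.e. under a day of wall-clock time.)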
