Agents Are Still Far Out

17 January 2025·41 words·1 min

Tweets

On real-world use cases, Devin got only 3 out of 20 tasks (15%) done correctly. Models seem to be at PhD level for certain tasks, yet they fail on simpler work that a “junior SWE” would have done flawlessly. https://x.com/HamelHusain/status/1880129024737104118

https://pbs.twimg.com/media/Gher8OQaUAAuy-Q.jpg
