We Can Beat Devin

April 2, 2024 · 8 min read

Host of Craft vs Cruft

On March 12th, AI startup Cognition Labs came out of stealth mode with a $21M Series A round, a state-of-the-art SWE-bench score, and a demo that has captured the imagination of the AI coding space, even calling into question (for some) the very future of the Software Engineering profession.

"Can we beat Devin?" was the question soon posed by a Discord server of that name, gathering 1,000 users within days of Devin's announcement!

In the spirit of friendly competition, Mender.AI answers that rallying cry.

What is Devin?

Cognition Labs advertises Devin as "the first AI Software Engineer." Considering the climate of grandiose AI claims, this marketing decision is understandable, but not even on April Fools' Day do we abide by that terminology.

Still, it does seem they have created something impressive and important. We ought to call it something. We suggest the following distinction:

Coding Assistants operate as editor plugins offering auto-complete style suggestions during development. Examples include GitHub Copilot, Cursor, and the Open Source Continue.dev.

Autonomous Coders (perhaps AutoCoder for short?) have greater independence. When a user assigns them a higher-level task, they are capable of breaking it down into subtasks and overcoming obstacles. These may be an early entry into the broader category of Autonomous DevTools.

Devin, then, is an Autonomous Coder. Another term that has been in play is "Coding Agents." There have been several Open Source agents in active development, though none as polished as Devin's demo videos.

Why not "AI Software Engineer"

Software Engineer was a term introduced by Margaret Hamilton for the Apollo Moon landing project in order to secure professional respect for her colleagues and to emphasize the disciplined practices they were developing. By consensus, the term's use in most of the industry remains somewhat aspirational, but with the world now running on software we have good reason to use it carefully.

Cognition Labs has not stated their definition; if they reach out with one, we'll be happy to provide it here. For discussion's sake, let's consider some basic necessary, though not sufficient, conditions. We might all agree that a Software Engineer:

- Is a professional, having an obligation to uphold a standard of care
  - Therefore, has legal and moral agency typically associated with personhood
- Is capable not only of solving small predefined tasks in a codebase, but also
  - Creating and maintaining a codebase of significant size over a period of time
  - Partnering to define the requirements to be done

Without the supervision of a human programmer, every autonomous coding tool to date would fail each one of these minimal prerequisites. That's before any consideration of the actual engineering discipline as currently understood; see Dave Farley's recent book Modern Software Engineering for a nice formulation.

Again, this does not discount the usefulness of tools like Devin. They can help us. We will buy them and we will build them.

What does it mean to beat Devin?

Beating Devin could refer to at least 3 ideas.

1. The relevance of human programmers

Can we as humans compete with this automation of our role?

Yes. That's not even close to a question today. The world’s increased dependence on complex, poorly understood software makes one worry about many things, but not about the employability of software engineers.

For more, see my video "Will AI Democratize Programming?" from 11 months ago.

2. Reproducing the demo

Can the Open Source community produce a self-hostable interactive experience similar to the Devin demo?

Yes, perhaps in 2-4 months. For instance, the OpenDevin project has produced a rudimentary demo already. More effort is required and there is plenty of prior work that can be leveraged.

The shiny stuff

The Devin demos show users directing tasks over a remote dev environment in a unified browser-based UX. A viable reproduction would probably include most of the same features:

Chat
Task planner
Code editor
Terminal
Remote-controlled browser (snapshots)

3. SWE-bench or bust

Can the Open Source community beat 14% on the SWE-bench AI coding benchmark?

This is an important question because Devin represents a huge leap over previous scores. Nonetheless, the video preview of this article (March 23rd) predicted an Open Source agent would beat it in a similar timeline of a few months.

Even that may turn out to be conservative because just this morning SWE-agent was announced by the same Princeton NLP Group that created SWE-bench! This Open Source agent (using GPT-4) is now just 2 percentage points away from beating Devin on this criterion. It's possible that incorporating Anthropic's Claude 3 Opus model (released March 4th) could push it over the edge.

The significance of SWE-bench is that it requires solving realistic tasks in the context of a full codebase. Currently, the bottleneck is not the LLMs themselves but our orchestration and tool support. SWE-bench's examples only cover Python and there will also be other approaches, but generally this move towards gauging performance in a practical context is promising.
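To make the headline percentage concrete, here is a minimal sketch of how a resolve rate of this kind can be computed. It assumes the general idea behind SWE-bench scoring: a task counts as resolved only when the tests that the fix is meant to repair now pass and the tests that passed before still pass. The `TaskResult` structure, the function names, and the sample data are all made up for illustration; this is not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the agent's patch is applied (hypothetical structure)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix should repair -> did they pass?
    pass_to_pass: dict[str, bool]  # tests that passed before -> do they still pass?


def is_resolved(result: TaskResult) -> bool:
    """Resolved means every target test passes and nothing regressed."""
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Illustrative data only: four invented tasks, one of which is fully resolved.
sample_results = [
    TaskResult("demo-1", {"test_fix": True}, {"test_old": True}),    # resolved
    TaskResult("demo-2", {"test_fix": False}, {"test_old": True}),   # fix did not work
    TaskResult("demo-3", {"test_fix": True}, {"test_old": False}),   # fix caused a regression
    TaskResult("demo-4", {"test_fix": False}, {"test_old": False}),  # both failed
]

print(f"Resolve rate: {resolve_rate(sample_results):.1f}%")
```

Note that a patch which fixes the target test but breaks an existing one earns no credit under this rule, which is part of why scoring in the context of a full codebase is harder than solving isolated puzzles.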

What about models?

The above assumes that an Open Source reproduction would rely on third-party proprietary models from the likes of OpenAI, Google, and Anthropic. We assume (but don't know) that Devin also relies on third-party models; see the March 12th Bloomberg article:

Cognition AI has declined to say to what extent it does or does not rely on existing LLMs from other companies, making it hard to tell how it stacks up against the work being done by Magic AI.

An Open Source reproduction that included the LLMs themselves would be a much more daunting task due to the expertise and compute required to train them. Open Weight models are becoming usable, if not competitive, for dev assistants and agents. We have no specific "catch-up" predictions; see r/LocalLLaMA for the latest news.

Again, we must be careful with the terminology: most Open Weight models are not Open Source, because weights are not source; the training data and process are the source.

Who are the players?

Here are some of the projects in the conversation right now.

SWE-agent - The benchmark leader
OpenDevin - An early front-runner for Open Source Devin-like UX, along with Devika
Devika
Mirko.AI - An Open Source re-write of the Softgen.AI demo by Marko Kraemer
OpenAgents
Aider
GPT Pilot
GPT Engineer

See Coding Agents for more...

Following SWE-bench and OpenDevin's example, we recommend defaulting to the MIT license for these efforts so we can share with minimum friction.

While most efforts have focused on expanding agent capabilities to better meet the work-as-done, it's also possible to constrain the problem scope by coevolving the app framework with the assistant. Examples are NextPy and Sublayer's Blueprints. While changing the problem makes it hard to objectively compare performance, it seems like a very practical approach in the long term.

Are you building an AI Coding Assistant, Autonomous Coder, or related tool? Reach out on LinkedIn or submit a Pull Request to add to our list!

Harnessing the constituency

What does this sea of potential competitors mean for AI startups like Cognition? DevTool growth expert Ana Hevesi had these words of wisdom:

Contributors are hungry for the transparency and self-determination an open source implementation would provide. A combination of love of the game—and perhaps some concern for job security—has spurred a spectacular show of collaboration.

So: how would it improve Cognition AI's position if they sought ways to engage and harness the energy of this extended constituency? Despite the fact that this reaction wasn’t in anyone’s plan, this group could surely be incredibly engaged testers. It’s also easy to imagine this group helping the Cognition team find new applications and customer profiles, and even red team the product.

Extended constituencies have minds of their own, which can be inconvenient if you didn’t plan with them in mind. But long term, they can multiply your success—if you harness their energy and reward their efforts.

When rubber meets the road

At the end of the day, we're here to deliver value, not just funding rounds and clicks. There are problems left to solve before we can fully operationalize these tools.

DevOps pioneer John Willis offers a much-needed reality check in his article To Devin or Not to Devin, balancing his optimism with past lessons from robotic process automation (RPA) and his work on Chef infra management. Worth a read!

One of Devin's most intriguing aspects is its autonomous operation through its shell code editor and web browser. [...] However, as a software engineer with 45 years of experience, I take these claims with a hefty chaser of skepticism. Devin.ai may introduce many security concerns, especially when dealing with sensitive information or performing actions on services, databases and APIs, as well as the handling of sensitive information and the potential for unintended actions. [...] While I’m not against AI, I am cautious about Devin’s potential to replace human full-stack software engineers.

And as of press time, Lao Tzu said:

With patience the most tangled cord may be undone.
