
· 8 min read
Ray Myers

On March 12th, AI startup Cognition Labs came out of stealth mode with a $21M Series A round, a state-of-the-art SWE-bench score, and a demo that has captured the imagination of the AI coding space, even calling into question (for some) the very future of the Software Engineering profession.

"Can we beat Devin?" was the question soon posed by a Discord server of that name, gathering 1,000 users within days of Devin's announcement!

In the spirit of friendly competition, Mender.AI answers that rallying cry.

What is Devin?

Cognition Labs advertises Devin as "the first AI Software Engineer." Considering the climate of grandiose AI claims, this marketing decision is understandable, but not even on April Fool's Day do we abide by that terminology.

Still, it does seem they have created something impressive and important. We ought to call it something. We suggest the following distinction:

Coding Assistants operate as editor plugins offering auto-complete style suggestions during development. Examples include GitHub Copilot, Cursor, and the Open Source Continue.dev.

Autonomous Coders (perhaps AutoCoder for short?) have greater independence. When a user assigns them a higher level task, they are capable of breaking down subtasks and overcoming obstacles. These may be an early entry into the broader category Autonomous DevTools.

Devin, then, is an Autonomous Coder. Another term that has been in play is "Coding Agents." There have been several Open Source agents in active development, though none as polished as Devin's demo videos.

Why not "AI Software Engineer"

Software Engineer was a term introduced by Margaret Hamilton for the Apollo Moon landing project in order to secure professional respect for her colleagues and to emphasize the disciplined practices they were developing. By consensus, the term's use in most of the industry remains somewhat aspirational, but with the world now running on software, we have good reason to use it carefully.

Cognition Labs has not stated their definition; if they reach out with one, we'll be happy to provide it here. For discussion's sake, let's consider some basic necessary (though not sufficient) conditions. We might all agree that a Software Engineer:

  • Is a professional, having an obligation to uphold a standard of care
    • Therefore, has legal and moral agency typically associated with personhood
  • Is capable not only of solving small predefined tasks in a codebase, but also of
    • Creating and maintaining a codebase of significant size over a period of time
    • Partnering to define the requirements

Without the supervision of a human programmer, every autonomous coding tool to date would fail each one of these minimal prerequisites. That's before any consideration of the actual engineering discipline as currently understood; see Dave Farley's recent book Modern Software Engineering for a nice formulation.

Again, this does not discount the usefulness of tools like Devin. They can help us. We will buy them and we will build them.

What does it mean to beat Devin?

Beating Devin could refer to at least 3 ideas.

1. The relevance of human programmers

Can we as humans compete with this automation of our role?

Yes. That's not even close to a question today. The world’s increased dependence on complex, poorly understood software makes one worry about many things, but not about the employability of software engineers.

For more, see my video "Will AI Democratize Programming?" from 11 months ago.

2. Reproducing the demo

Can the Open Source community produce a self-hostable interactive experience similar to the Devin demo?

Yes, perhaps in 2-4 months. For instance, the OpenDevin project has produced a rudimentary demo already. More effort is required and there is plenty of prior work that can be leveraged.

The shiny stuff

The Devin demos show users directing tasks over a remote dev environment in a unified browser-based UX. A viable reproduction would probably include most of the same features:

  • Chat
  • Task planner
  • Code editor
  • Terminal
  • Remote-controlled browser (snapshots)

3. SWE-bench or bust

Can the Open Source community beat 14% on the SWE-bench AI coding benchmark?

This is an important question because Devin represents a huge leap over previous scores. Nonetheless, the video preview of this article (March 23rd) predicted an Open Source agent would beat it in a similar timeline of a few months.

Even that may turn out to be conservative, because just this morning SWE-agent was announced by the same Princeton NLP Group that created SWE-bench! This Open Source agent (using GPT-4) is now just 2 percentage points away from beating Devin on this criterion. It's possible that incorporating Anthropic's Claude 3 Opus model (released March 4th) could push it over the edge.

The significance of SWE-bench is that it requires solving realistic tasks in the context of a full codebase. Currently, the bottleneck is not the LLMs themselves but our orchestration and tool support. SWE-bench's examples only cover Python, and other benchmarking approaches will surely follow, but generally this move towards gauging performance in a practical context is promising.

What about models?

The above assumes that an Open Source reproduction would rely on 3rd party proprietary models from the likes of OpenAI, Google, and Anthropic. We assume (but don't know) that Devin also relies on 3rd party models; see the March 12th Bloomberg article:

Cognition AI has declined to say to what extent it does or does not rely on existing LLMs from other companies, making it hard to tell how it stacks up against the work being done by Magic AI.

An Open Source reproduction that included the LLMs themselves would be a much more daunting task due to the expertise and compute required to train them. Open Weight models are becoming usable if not competitive for dev assistants and agents. We have no specific "catch-up" predictions, see r/LocalLLaMA for the latest news.

Again, we must be careful with the terminology: most Open Weight models are not Open Source, because weights are not source; the training data and process are the source.

Who are the players?

Here are some of the projects in the conversation right now.

See Coding Agents for more...

Following SWE-bench and OpenDevin's example, we recommend defaulting to MIT license for these efforts so we can share with minimum friction.

While most efforts have focused on expanding agent capabilities to better meet the work-as-done, it's also possible to constrain the problem scope by coevolving the app framework with the assistant. Examples are NextPy and Sublayer's Blueprints. While changing the problem makes it hard to objectively compare performance, it seems like a very practical approach in the long term.

Are you building an AI Coding Assistant, Autonomous Coder, or related tool? Reach out on LinkedIn or submit a Pull Request to add to our list!

Harnessing the constituency

What does this sea of potential competitors mean for AI startups like Cognition? DevTool growth expert Ana Hevesi had these words of wisdom:

Contributors are hungry for the transparency and self-determination an open source implementation would provide. A combination of love of the game—and perhaps some concern for job security—has spurred a spectacular show of collaboration.

So: how would it improve Cognition AI's position if they sought ways to engage and harness the energy of this extended constituency? Despite the fact that this reaction wasn’t in anyone’s plan, this group could surely be incredibly engaged testers. It’s also easy to imagine this group helping the Cognition team find new applications and customer profiles, and even red team the product.

Extended constituencies have minds of their own, which can be inconvenient if you didn’t plan with them in mind. But long term, they can multiply your success—if you harness their energy and reward their efforts.

When rubber meets the road

At the end of the day, we're here to deliver value, not just funding rounds and clicks. There are problems left to solve before we can fully operationalize these tools.

DevOps pioneer John Willis offers a much needed reality check in his article To Devin or Not to Devin, balancing his optimism with past lessons from robotic process automation (RPA) and his work on Chef infra management. Worth a read!

One of Devin's most intriguing aspects is its autonomous operation through its shell code editor and web browser. [...] However, as a software engineer with 45 years of experience, I take these claims with a hefty chaser of skepticism. Devin.ai may introduce many security concerns, especially when dealing with sensitive information or performing actions on services, databases and APIs, as well as the handling of sensitive information and the potential for unintended actions. [...] While I’m not against AI, I am cautious about Devin’s potential to replace human full-stack software engineers.

And as of press time, Lao Tzu said:

With patience the most tangled cord may be undone.



· 7 min read
Ray Myers

Like alchemists seeking to turn base metals into gold, some contemporary researchers seek to create AGI using large language models as their philosopher's stone.

Grady Booch, Chief Scientist at IBM

The removal of CEO Sam Altman by the OpenAI board and the subsequent high drama has been shocking, fascinating, and deeply confusing. It's difficult to assess the board's decision itself (independent of execution) because we don't know the reasons yet. However, perhaps it's time for us to look at the charter itself and realize that what we need to let go of is not merely one executive or board, but the center of the mission itself: Artificial General Intelligence.

We are committed to building safe, beneficial AGI that will have a massive positive impact on humanity's future. Anything that doesn't help with that is out of scope.

OpenAI careers page

For this discussion we'll use the AGI definition in OpenAI's charter, "highly autonomous systems that outperform humans at most economically valuable work". There are many other definitions, so any prediction of an AGI timeline partly depends on which definition is used. It may never arrive, or it may already be here.

What does AGI do?

Here are some things OpenAI leaders have suggested AGI can do.

Wipe out the human race

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

That open letter was signed this year by OpenAI leaders including then-CEO Sam Altman, 3 other co-founders, CTO Mira Murati, and the heads of policy research and governance.

OpenAI's position is essentially that their goal is to build something they consider a high risk for human extinction. Or, more charitably, that they have the duty to build it first in order to protect us from whoever else may do it.

Break capitalism

I think that if AGI really truly fully happens... I can imagine all these ways that it breaks capitalism.

Sam Altman, OpenAI CEO (at the time) source

Fair enough, if "most economically valuable work" were automated, it stands to reason that economics and the organization of society could radically change.

Write the law like Judge Dredd

AI is going to enable the creation and enforcement of laws 1000x as complex as we have today. [...] One AI system working on behalf of a politician will go conduct interviews and deeply understand what everyone's preferences are and what would benefit them, and find a way to make everyone happy at once.

Adam D'Angelo, Quora CEO, OpenAI Board of Directors thread

As you can see, this leads to some pretty extreme conclusions. If AGI is so much more competent than us, it should do everything, right? It should write the law; it should rule over us. This borders on worship, and it is simply not in line with the capabilities of anything yet seen, even if it were desirable.

AGI and TESCREAL

By reducing morality to an abstract numbers game, and by declaring that what’s most important is fulfilling "our potential" by becoming simulated posthumans among the stars, longtermists not only trivialize past atrocities [...] but give themselves a "moral excuse" to dismiss or minimize comparable atrocities in the future. This is one reason that I’ve come to see longtermism as an immensely dangerous ideology. It is, indeed, akin to a secular religion built around the worship of "future value," complete with its own "secularised doctrine of salvation," as the Future of Humanity Institute historian Thomas Moynihan approvingly writes in his book X-Risk.

The popularity of this religion among wealthy people in the West - especially the socioeconomic elite - makes sense because it tells them exactly what they want to hear: not only are you ethically excused from worrying too much about sub-existential threats like non-runaway climate change and global poverty, but you are actually a morally better person for focusing instead on more important things—risk that could permanently destroy "our potential" as a species of Earth-originating intelligent life."

Émile P. Torres - The Dangerous Ideas of "Longtermism" and "Existential Risk"

Here's an example of the all-or-nothing thinking created by the utopian vision of AGI, by a founder who (in contrast to AGI) is himself working on quite an interesting and well-scoped product.

For every day AGI is delayed, there occurs an immense amount of pain and death that could have been prevented by AGI abundance.

Anyone who unnecessarily delays AI progress has an enormous amount of blood on their hands.

Scott Stevenson, CEO of Spellbook Legal

An example of the all-or-nothing thinking from the side of the X-Risk vision is provided by current OpenAI interim CEO Emmett Shear, 5 months before being appointed.

The Nazis were very evil, but I'd rather the actual literal Nazis take over the world forever than flip a coin on the end of all value.

Emmett Shear Thread

The coin flip mentioned refers to assigning a significant probability of doom, or "p(DOOM)". Shear places his p(DOOM) between 5% and 50% as of June, expanding on his view in this interview with Logan Bartlett.

Some critics of this umbrella of thinking refer to it as the "TESCREAL" bundle: Transhumanism, Extropianism, Singularitarianism, Cosmism, Rationalism, Effective Altruism, and Longtermism. One particularly scathing and well-researched critique is Eugenics and the Promise of Utopia through AGI by Dr Timnit Gebru.

This is a distraction

AI Safety conversations centered around utopian or apocalyptic hypotheticals distract from ongoing AI Ethics work on the benefits and dangers that are already here. For more on this, see Talking about a 'schism' is ahistorical by Emily Bender.

This is emphatically not a story of a community that once shared concerns and now is broken into disagreeing camps. Rather, there are two separate threads — only one of which can properly be called a body of scholarship — that are being held up as in conversation or in competition with each other. I think this forced pairing comes in part from the media trying to fit the recent AI doomer PR pushes into a broader narrative and in part from the fact that there is competition for a limited resource: policymaker attention.

In describing the Doom-Topia safety camps as a distraction rather than equally valid concerns, we are taking the position that current AI advances like Large Language Models (LLMs) are not an imminent AGI. GPT-4 is not AGI, nor will GPT-5 or GPT-6 be. This is also the position of OpenAI (ex-?)CEO Sam Altman, who said earlier this month that improved LLMs will not get us to AGI or Superintelligence; we still need other breakthroughs. In the nearly 70-year history of AI research, this is characteristic of the pattern: dramatic new capabilities rightly capture our imagination, followed by new hurdles and AI winters.

If we focus instead on what's clear and present, we are already sitting on technology with huge potential, good and bad. OpenAI's own projection is that ChatGPT will have some impact on 80% of the workforce - is that not enough responsibility for us to take? Almost everything we do for each other and to each other has the potential to be amplified, whether by LLMs or by some other less exotic advancement that just hasn't been properly applied or maintained.

In summary:

  • Yes to AI
  • No to AGI

Build well-scoped systems and take responsibility for them. Don't try to build a God.

With patience the most tangled cord may be undone.


· 6 min read
Ray Myers
Kyle Forster

Yesterday, OpenAI held their first DevDay with some of their biggest releases since GPT-4 in March! Full details are in the official announcement and keynote stream. In this post we'll give first thoughts on the implications for software development and maintenance.

Some of the biggest limitations of GPT-4 were that it was slow, expensive, couldn't fit enough data in the context window, and had a knowledge cutoff of January 2022. All of those have been significantly addressed. Short of eliminating hallucinations (which may be intractable), we couldn't have asked for much more in this release.

While this is not "GPT-5", whatever that may look like, it was a huge move to execute on so many key frustrations at once. As the Mechanized Mending Manifesto hints, we have much to learn about taking advantage of Large Language Models as components in a system before our main limitation becomes the sophistication of the model itself.

Lightning round

Let's give some initial takes on the impact to AI coding workflows for each of these changes.

  • 👾👾👾👾👾 = Game changer
  • 👾👾👾👾 = Reconsider many usage patterns
  • 👾👾👾 = Major quality of life improvement
  • 👾👾 = Considerable quality of life improvement
  • 👾 = Nice to have
  • 🤷 = Not sure!
| Feature | Impact | Notes |
| --- | --- | --- |
| 128K context | 👾👾👾👾👾 | Max score of 5 space invaders! |
| Price drop | 👾👾👾👾 | See below |
| Code Interpreter in API | 👾👾👾 | Code Interpreter's workflow is often better than using GPT-4 codegen directly |
| JSON mode / Parallel Function calls | 👾👾👾 | Tooling needs this, we had workarounds but structured output was a constant pain (sketch below) |
| Speed | 👾👾 | This makes GPT-4 more of a contender for interactive coding assistants |
| Assistants API | 👾👾 | Saves a lot of boilerplate for new chatbots |
| Retrieval API | 👾👾 | Again, we could do this ourselves but now it's easy |
| Updated cutoff date | 👾 | Probably more important outside coding |
| Log probabilities | 👾 | Should help with autocomplete features |
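As one concrete example of why the JSON mode row matters for tooling, here is a minimal sketch using the openai Python SDK. The `response_format` parameter constrains the model to emit valid JSON (note that JSON mode requires the word "JSON" to appear in the messages); the prompt content itself is our own illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the GPT-4 Turbo preview announced at DevDay
    response_format={"type": "json_object"},  # constrain output to valid JSON
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'grade' and 'smells'."},
        {"role": "user", "content": "Review this function: def f(x): return x*2"},
    ],
)
print(resp.choices[0].message.content)  # guaranteed to parse as JSON
```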

Uncertain callouts

| Feature | Impact | Notes |
| --- | --- | --- |
| Improved instruction following | 🤷 | We need to try it |
| Reproducible outputs | 🤷 | Will reproducibility help if it's generally unpredictable? |
| GPT-4 Fine Tune / Custom Models | 🤷 | I don't have 5 million dollars, do you? |
| GPT Store | 🤷🤷 | Maybe more useful for coding adjacent tools, see Kyle's section below |
| Copyright Shield | 🤷🤷🤷 | Their legal strategy will have... ramifications |

Looking deeper

128K context 👾👾👾👾👾

This gets the maximum score of 5 space invaders.

We'll follow up with more later, but for instance this video from April, Generating Documentation with GPT AI, had as its main theme the difficulty of getting an LLM agent to reason about a single 8,000 line source file from Duke Nukem 3D.

That dreaded file now fits in a single (expensive) prompt! So do some entire books. (At a rough 10 tokens per line of C, 8,000 lines is on the order of 80K tokens, within the new 128K window.) Our options for inference using the state of the art model have just drastically changed. We look forward to seeing how well the performance holds up in extended context, because previous methods in the research have usually had caveats.

Price drop! 👾👾👾👾

Deciding when to use 3.5-Turbo vs the premium 4 vs a fine-tuned 3.5 has been a juggling act. With this price drop:

  • GPT-4 Turbo 128K is 1/3 the cost of GPT-4 8K by input token (1/2 by output)
  • GPT-4 Turbo 128K is 1/6 the cost of GPT-4 32K by input token (1/4 by output)
  • GPT-3.5 Turbo 16K is also now cheaper than its 4K version was

Updated cutoff date

Training now includes data up to April 2023 instead of January 2022. This is huge for general use of ChatGPT, but for coding tasks you should consider controlling context more carefully with Retrieval Augmented Generation (RAG), as Cursor does.
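To make the RAG idea concrete, here is a minimal sketch; the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the codebase is assumed to be split into chunks already:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k code chunks most relevant to the query. Only these go
    into the prompt, so answers lean on the live codebase rather than on
    whatever snapshot happened to be in the training data."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```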

Whisper v3 and Consistency Decoder

Better speech recognition models will always be good news for speech-driven tools like Cursorless and Talon, used by coders with repetitive stress injuries.

New modalities in the API

These are worth mentioning, but don't seem aimed at coding as we normally understand it. Perhaps for front-end devs and UX design though?

  • GPT-4 Turbo vision
  • DALL·E 3
  • Text-to-speech

AI Troubleshooting

For this section we're joined by a leader in AI-assisted troubleshooting: Kyle Forster, CEO of RunWhen and former Kubernetes Product Director at Google.

I look to OpenAI's developer announcements as bellwether moments in the modern AI industry. Whether you use their APIs or not, they have access to so many consumers and enterprises that their decisions about what to do and not do are particularly well informed. Below are my takeaways relevant to our domain.

Models like micro-services, not monolith

OpenAI could focus entirely on driving traffic to their native ChatGPT. Instead, their announcements this week make it easier to build your own domain-specific GPTs and Digital Assistants. We've been strong believers in this direction since day one: our UX allows users to ask the same troubleshooting question to multiple Digital Assistants. Like individuals on a team, each one has different capabilities and access rights, and comes up with different troubleshooting paths and different conclusions.

Internal-only Enterprise GPT Investments

Enterprise data security with regards to AI is an issue that our industry is only just starting to digest. It is clear that using enterprise data to train models is an absolute "no," but what about a vendor's in-house completion endpoints? A vendor's third-party completion endpoints? Masked data? Enterprise-managed models?

We've made very conservative decisions here out of short-term necessity, but our advanced users are thinking about how to take advantage of big public endpoints. The debate reminds me of the raging '10-'12 debates over public cloud vs private cloud, and the hybrid cloud research that emerged to drive both forward. In this vein, the OpenAI announcements touching on this feel like hybrid cloud investment. I don't know where this work ultimately lands, but I do believe numerous inventions, equivalents of the Cloud VPCs and Cloud Networking Firewalls that supplemented the early focus on Security Groups, are ahead of us.

· 4 min read
Ray Myers

So, I ask an LLM 'hey, generate me code for doing X and Y'. It spits out something that looks like it could work, but it doesn't as some of the methods used do not exist at all.

How do I continue to give it a chance, in this particular instance? Any suggestions? - Slobodan Tanasić

This is a very common issue right now, and it seems somewhat inevitable. Maybe this'll help.

A Practical Answer

I think of it like using Stack Overflow. There is a category of isolated problems where I've learned Stack Overflow will help me more than waste my time, and it's awesome at those. It helps even though we can't trust the answer enough to copy it directly into our code: it might not be correct, might not even compile, or might not be licensed to allow use.

But we understand the provenance of the solution is dubious, and it still helps us reach a solution faster, getting past stumbling blocks. When I've used LLMs (Large Language Models) like ChatGPT successfully during coding, it's been a similar process.

For some more detailed guidance, our friends Daniel and Nicolas have some articles getting into specific use cases:

The Cursor editor has also made progress creating a more ergonomic integration than other assistants.

The Deeper Problem

This is controversial, but I actually believe LLMs can't code.

Coding is not a single task. It's a big bag of different types of cognitive tasks. Some parts are, we might say, language-oriented and map very well to an LLM's strengths.

LLM-Easy

Language models suggest pretty good variable names. They can identify code smells. They can look up many (mostly real) libraries that relate semantically to our questions. These are examples of well-trodden NLP (Natural Language Processing) tasks that we could label "LLM-Easy".

  • Summarization
  • Classification
  • Semantic Search
  • ...

LLM-Hard

Then you have tasks that require precise reasoning. Even though we've come to expect computers to excel at those, LLMs are weak in this department. Some would say they don't reason at all; I'll just say that at minimum they are very inefficient and unreliable at it.

For instance, GPT-4 cannot multiply two 3-digit numbers together. Not only that, it returns wrong answers indiscriminately.

If you can spend $100 Million training a model, run it on special GPU hardware optimized for number-crunching, and still get something that can't multiply numbers, we are justified in saying that it is the wrong tool for certain kinds of reasoning. Fortunately, there are many other tools.
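As a minimal sketch of that delegation: instead of asking the model to "reason out" a product, have it emit the expression and evaluate the arithmetic with an ordinary interpreter, which is exact every time:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr: str) -> float:
    """Safely evaluate a plain arithmetic expression (+, -, *, /)."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not a plain arithmetic expression")
    return walk(ast.parse(expr, mode="eval").body)

print(eval_arithmetic("347 * 589"))  # 204383, every single time
```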

Forgive a little handwaving here, because we do not have a full account of every kind of cognitive task that goes into coding. But if you suspect that any part is LLM-Hard then you might agree that the all-in-one approaches that ask an LLM to produce raw source code are naive:

  • Putting an unknown number of LLM-Easy and Hard tasks into a blender
  • Expecting an LLM to identify which is which or solve them all simultaneously
  • Accepting the unrestricted text response as a valid update to your production code

This will hit a wall. More hardware and better prompts won't fix it. We need to partition the tasks, using components that are suited to each, rather than treating one component as a panacea.

In short, that's how we teach machines to code.


· 5 min read
Ray Myers

If we're going to teach machines to code, let's teach them the safest way we know how.

GeePaw Hill said, "Many More Much Smaller Steps". Just as the discipline of small steps helps humans work more effectively, Large Language Models have shown dramatically improved results by breaking down tasks in a technique called Chain of Thought Prompting. Surprisingly, Chain of Thought improves responses even with no feedback. By adding incremental verifications between each step we can do even better.

Rule: Take small steps and test them

To make these steps as safe as possible, let's take a moment to embrace a brutal design constraint: LLMs cannot reliably edit existing code.

  • When we ask GPT-4 to update code, it breaks it
  • When we ask GitHub Copilot to update code, it breaks it
  • When we ask Sourcegraph Cody to update code, it breaks it
  • When we ask CodeLlama to update code, it breaks it

We could wait for bigger, shinier models to address this, but possibly LLMs are just architecturally incapable of the required kinds of reasoning. For our purposes today, we will treat LLM output as potentially useful but inherently untrustworthy.

Rule: Treat LLM output as untrustworthy

Where does this leave us? Untrusted output must be filtered through an intermediary, typically a human reviewer or a vetted tool with limited actions. So far, most approaches involve the former - people manually code review and debug the raw LLM output. Are there other options? Can we manipulate code without directly editing it? Funny you should ask...

Return of Refactoring Browser

A tool called Refactoring Browser introduced syntax-aware automated refactoring to commercial tools in 1999, building on research started by Ralph Johnson and Bill Opdyke in 1990. The right-click refactorings in today's IDEs like IntelliJ are descendants, but a unique aspect of Refactoring Browser was that it could only refactor, not act as a text editor. In a sense, what it couldn't do was as important as what it could.

We can rebuild it

Could we build another, more automation-friendly Refactoring Browser today? How many transformations would we need to support?

While Martin Fowler's definitive Catalog of Refactorings is extensive, we can get very far with surprisingly few of them mastered. The Core 6 Refactorings identified by Arlo Belshee are Extract (Variable / Method / Parameter / Field), Rename, and Inline. Those basic operations give us control of how concepts are named in the code, and therefore how intent is expressed. Further, all of these can be automated and performed with a high degree of confidence - no hand editing necessary!

Instead of trying to understand an entire large block of code at once, we can take incremental steps capturing one piece of insight in a name before moving on. The practices of Read by Refactoring and Naming as a Process expand on this. On Craft vs Cruft, I tend to call this "Untangling".
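As a hypothetical Python illustration (not from the catalog itself), two Extract Variable steps and a Rename turn an opaque expression into code that states its intent, without changing behavior:

```python
# Before: the intent is buried in one expression.
def price(order):
    return order.qty * order.unit - (order.qty * order.unit * 0.1 if order.qty > 100 else 0)

# After two Extract Variable steps and a Rename, each a safe, automatable move:
def price_after_refactoring(order):
    subtotal = order.qty * order.unit
    bulk_discount = subtotal * 0.1 if order.qty > 100 else 0
    return subtotal - bulk_discount
```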

The versatility of a refactoring CLI

With a small handful of core recipes supported per language, we can create a workable command-line refactoring tool (see Untangler for an early prototype). Liberating this functionality from IDEs opens up many possibilities:

  • Restricted editing UIs ("Uneditors" like Refactoring Browser)
  • Scripted code migrations
  • AI-driven refactoring
  • Easily give more editors refactoring support (YAY!)

Insanely small steps

So we started by saying we'd teach machines the safest way we know how to develop. What is that, Test Driven Development? Let's go even further. Let's go all the way to TCR, which has been called "TDD on steroids"!

Test && Commit || Revert

In Kent Beck's TCR, you make one single change. If the tests pass, you commit. If they fail, you blow away your change and start over. This is an unorthodox and challenging workflow that encourages preparatory refactoring before making a feature change.

Make the change easy, then make the easy change - Kent Beck

In the 2012 talk 2 Minutes to Better Code, Woody Zuill and Llewellyn Falco demonstrate rapid commits of incremental refactoring steps supporting an upcoming feature change on a real codebase. Those kinds of steps would fit well into a TCR workflow.
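In code, the whole discipline fits in a few lines. A minimal sketch, assuming the suite runs under `make test` and the work happens in a git repository:

```python
import subprocess

def tcr() -> None:
    """Test && Commit || Revert: keep a change only if the suite passes."""
    if subprocess.run(["make", "test"]).returncode == 0:
        subprocess.run(["git", "commit", "-am", "TCR: passing step"], check=True)
    else:
        # Tests failed: throw the change away entirely and start over.
        subprocess.run(["git", "reset", "--hard"], check=True)
```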

Full workflow

Here's an example of how this could work end-to-end (other variations are possible); a code sketch of the loop follows the list.

  1. Human enters high level intent: "Clean up the Foo module"
  2. Branch from Trunk
  3. AI Refactor Loop
    1. Green (tests pass)
    2. AI suggests refactoring action
    3. Refactoring tool applies transformation
    4. Green (tests pass)
    5. Commit
  4. Create Pull Request to Trunk
  5. Human decides whether to merge the Pull Request
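Here is a minimal sketch of the inner loop (steps 2 through 5 above). The `suggest_refactoring` stub and the `untangler` CLI flags are hypothetical stand-ins, not the real prototype's interface:

```python
import subprocess

def tests_pass() -> bool:
    return subprocess.run(["make", "test"]).returncode == 0

def suggest_refactoring(module: str) -> list[str]:
    # Placeholder for an LLM call that proposes one refactoring step.
    return ["rename", "--file", module, "--symbol", "q", "--to", "start_index"]

def ai_refactor_loop(module: str, steps: int = 10) -> None:
    subprocess.run(["git", "checkout", "-b", "ai-refactor"], check=True)  # branch from trunk
    assert tests_pass(), "the loop must start green"
    for _ in range(steps):
        action = suggest_refactoring(module)
        subprocess.run(["untangler", *action], check=True)  # apply the transformation
        if tests_pass():
            subprocess.run(["git", "commit", "-am", f"refactor: {' '.join(action)}"], check=True)
        else:
            subprocess.run(["git", "reset", "--hard"], check=True)  # revert, try another step
    # A human then reviews the resulting branch and decides whether to merge.
```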

We now have an unprecedented set of safeguards for LLM-driven changes:

  • We only edit code through syntax transformations with known behaviors
  • Every edit is built
  • Every edit has tests run
  • We can perform review at any granularity due to rapid commits
  • The steps are replayable, entirely or a subset

Can this work?

It already does.

(asciicast: demo recording of the prototype)

While the prototype's capabilities are very limited (it does renames in C), early experiments suggest this is the first AI developer agent that can update code in a bug-free way. It may choose a bad variable name here and there, but it does not break the code.

I'll be releasing more pieces into The Mender Stack as I'm able to clean them up and explain them properly. It will take a significant but very tractable amount of work to implement polished support for the core refactorings in several major languages. If you're interested in helping out I'd love to hear from you.


· 10 min read
Ray Myers

MenderCon

Yesterday was the 4th ever MenderCon, an annual conference by CorgiBytes (creators of the Legacy Code Rocks Podcast). This was my second and I've found these to be incredibly valuable due to the quality of the community and the emphasis on live interaction. Thus far they are free and I highly recommend tuning in to the next if you can.

This event SOLD OUT at a record 200 attendees, congrats to the organizers on that.

Similar to last year, the format had one track of scheduled talks alongside multiple tracks of spontaneous Open Space sessions pitched by attendees. The feel of a physical conference was simulated remotely using an app called Gather, which feels sort of like playing Zelda while on a Zoom call with the nearby characters. I found Gather enjoyable after a learning curve. Open Space sessions and notes were managed in the more familiar collaboration tool Miro, the reigning champ of virtual post-it notes.

I prefer to catch scheduled talks later on their YouTube release so I can focus on the Open Space sessions. All of them are listed later; these were the 3 I attended:

  • Tech Debt Metrics for Execs w/ Jay Bazuzi
  • Mending with AI w/ Wendy Closson
  • Code Generation Won't Save Us - Mender.AI

I almost withdrew my submission for a session since there were two other related topics, but I decided to put it in the last time slot and see if there was still interest. As it turned out, 20 people showed up and we had a vibrant hour-long discussion!

My three personal highlights from the day:

  • Andrea Goulet's perspective on communicating with leadership. If you're not familiar with her work, try this talk on Empathy-Driven Development
  • Wendy Closson suggesting the Mechanized Mending Manifesto, which abbreviates to MMM, is the WWW from World Wide Web flipped upside-down. Her second session topic on the difficulty of "selling the mend" is also something I'd like to get further into, maybe as a sequel to the Craft vs Cruft Tech Debt and Trust episode.
  • Emma's well-defended warning that we should never use GPT for anything. This may seem like an odd pick because I created this entire site to discuss the usefulness of generative AI in software development, but I also created it to urge caution around the limitations. With so much reckless optimism floating around, we also need a splash of cold water. I hope to host a healthy dose of AI-skepticism on this site; for now, try out Will A.I. Become the New McKinsey? by Ted Chiang and A.I. is B.S. by Adam Conover.

Talks

The scheduled talks will be available soon on YouTube, past years are already up!

Keynote: How a 40-year-old software company keeps its code fresh and maintainable

  • Haim Cohen, Software Engineering Team Leader, Bloomberg | LinkedIn
  • Lainey Hall, Software Engineer, Bloomberg | LinkedIn

How many of you have encountered the following problems working with a mature codebase:

You don’t know what it does

You don’t know how to build it

The deployment process is based on various scripts and word of mouth

New developers take weeks to get up to speed

We've all been there. Join this talk to discover how a 40-year-old software company keeps its code fresh and maintainable. Through this talk, we will share some helpful techniques and principles that you can immediately start to use to improve time to market, code quality, and developers' work satisfaction.

Clean Code - Strategies & Principles

  • Abhishek Jain, Lead Software Dev at Cornerstone OnDemand, Author, YouTuber at Coach4Dev, Blogger, Stand Up Comedian | LinkedIn

In this presentation, we will talk about how to write clean code - common pitfalls and strategies for enforcing better code standards in your organization. We will see some sample code bases showing what bad code looks like, why developers tend to go that route, and the mindset change required to improve the code quality at your organization.

Decommissioning a legacy module progressively, using AST traversal

Somewhere, deeply hidden in an old codebase, a module has become so buggy and nasty that no developer dares to make any change to it. Actually, developers want to get rid of it, but a big part of the codebase is indirectly relying on it...

Can this story resolve to a happy ending? How to prevent the situation from getting worse, while developers keep adding features in the codebase?

What if AST (Abstract Syntax Tree) traversal was the key? Let's see how it can help us keep an eye on these dependencies, and prune them one by one, until the module is gone for good! 😌

Unifying UIs with a Design System

Starting a design system can be daunting, especially on a long running, extensive app or family of applications. How do you start, and then how do you implement the plan? What are the pitfalls of common approaches and off-the-shelf setups? We'll dive into how to start your team on the path of bringing a design system to your organization. We'll identify steps you can take today to move the ball forward, and finally, we'll discuss some of the pitfalls and concerns to watch for as you move towards a unified experience.

Open Space Sessions

These notes were gathered from the live Miro board, if someone else who attended would like to add more context or links, you are welcome to submit a Pull Request.

  • Tech Debt Metrics for Execs (Jay Bazuzi)
  • Ask me, or someone else, anything (Chris C)
  • Experiences in applying the strangler fig pattern in the wild (Miriam Retka)
  • Behavior change for a better mender culture (radhika)
    • "Have the right management" / Authority may be needed
    • Pair programming example (possibly a top-down initiative)
    • "Behavior change is hard"
    • "Hard to work with" tech = low morale/motivation = resignations
    • Celebrate modernization and code deletion with demos
    • Mob programming example 2: external collaborators adjust to internal norms; start with small commitments
    • Mob programming example: attitudes a barrier; advocates set the norms
    • "Keychain awards" example: encourage behaviors like cross-team collab with "tour of duty award"
    • Appreciation framework applications (e.g. Motivosity); spot bonuses
    • Team dinners tied to some achievement
    • Hire external consultants to expand capability and bypass fixed norms/culture
    • Clear specific deliverables (and behaviors to get there?)
    • Hackathon example: upgrade security-lagging code
    • "December merry month of maintenance" example
    • Do It Together = more motivation to do less interesting maintenance
    • Enhance group capabilities with shared language, book clubs, exposure/training with new practices
    • Books:
      • Sidney Dekker (Not sure which)
      • Software Design X-Rays (Adam Tornhill)
      • Tiny Habits (BJ Fogg)
      • Fearless Change: Patterns for Introducing New Ideas
      • Chris Hartjes: Building Test Driven Developers
  • Test-Driven Development Basics (Sharon DeCaro)
    • What is TDD?
    • how to use in early-stage start-ups?
      • never too early to start! each feature is behavior that the application doesn't already do
    • how does it apply to applications that already exist that may be untested?
      • use it for new changes, whether feature or bug
      • for initial test coverage, end-to-end can be more beneficial to validate existing behavior
    • unit/integration tests ==> in-memory/IO tests
      • anything running in-memory should be able to run quickly
      • run tests with external calls/dependency as needed
    • tools: Visual Studio Enterprise Live Testing, NCrunch,  Wallaby.js
    • I'm Done with Unit and Integration Tests
    • Print-and-Play version of JitterTed's TDD Game v1.1
    • If you can't TDD, your code needs TLC
  • How will the technical nature of mending software change as OOP wanes as general "AI" becomes more prominent? (Andrea Goulet)
  • We bought a SaaS product using Kubernetes. It only runs in production and has no tests. (Chad)
  • Defining the "senior" in senior engineer (Chuck)
    • What makes a Senior Engineer
      • taking responsibility / ownership
      • capable to work independently but also help others
      • engage in business concerns (not just ticket-takers)
      • have transferrable skills rather than domain-specific skills (including soft-skills)
      • having the right mindset (growth)
      • delivers value
      • can break down (decompose) work
      • systems thinking
      • that undefinable je ne sais quoi
    • Shuhari - stages of learning to mastery
    • Dreyfus model of skill acquisition
    • Should teams be all seniors? What benefits are there for having juniors?
      • Mentoring
      • Senior developers have to come from somewhere
    • Things that seem to not be good indicators:
      • years on resume
      • how long they've been coding
      • what the level / progression track at your company says
  • Practical steps in setting up a CI/CD pipeline (Mike Kienenberger)
    • Build-and-test
    • Start by getting something to process, even something trivial
    • Have the pipeline call commands that can trivially be run locally
    • Use a linter  (even with all of the rules disabled) to get started when you have no tests
    • Some validation happens before merge, some validation happens after.  All of the pre-merge checks should be green almost all the time, but the post-tests rarely are.
  • Mending with AI (Wendy Closson)
    • ChatGPT vs. Github Copilot:
      • Copilot = Next Generation Auto-complete
      • ChatGPT = Next Generation Stack Overflow
    • TabNine - Can host LLM models on-prem to avoid sharing code outside the company
  • Cross-company architecture guild / simulations (Allan Stewart)
    • Architectural Katas
    • The "architect" role is different in so many companies (and no one even agrees what it means to be one)
    • Spectrum between 1:1 coaching and writing a book
  • How can 1099s make $ (Chris May)
  • Selling the Mend. Why is it so hard to get the buy in? (Wendy Closson)
  • Code Generation Won't Save Us - Mender.AI (Ray Myers)
  • Moving from mender IC developer to manager (Miriam Retka)


· 6 min read
Ray Myers
GPT4

In 2021, the DABUS system created by Dr. Stephen Thaler became the world's first machine to be granted a patent. This patent, "Food container and devices and methods for attracting enhanced attention", was accepted in South Africa after being denied by Australia, the European Patent Office, the United Kingdom, and the United States. However, these rejections were due to legal requirements that the inventor be a natural person, not a judgement on the quality of the invention itself.

For a thorough review of the legal issues, see A study on inventorship in inventions involving AI activity by Dr. Noam Shemtov. For now, let's take a look at the invention with our friend ChatGPT, shall we?


Ray: What is this patent application describing? [Abstract omitted]

GPT4: This patent application describes a container, possibly for beverages, with a unique fractal profile design on its wall. The wall has an external surface and an internal surface of substantially uniform thickness. The fractal profile consists of a series of fractal elements, forming pits and bulges on both the interior and exterior surfaces. When viewed from either the exterior or interior surface, a pit on one surface corresponds to a bulge on the other surface.

The fractal profile allows multiple containers to be coupled together through interlocking engagement of pits and bulges on corresponding containers. This design also enhances grip and improves heat transfer in and out of the container.

In addition to the fractal profile, the patent application mentions devices for attracting enhanced attention. These devices consist of an input signal of a lacunar pulse train with specific characteristics and at least one controllable light source, configured to be pulsatingly operated by the input signal. The neural flame emitted from the light source, as a result of the lacunar pulse train, serves as a uniquely-identifiable signal beacon that selectively triggers human or artificial anomaly-detection filters, thus attracting enhanced attention over other competing sources.

Ray: Fractals don't make any sense to me in reference to a beverage container. In simple terms, what are we talking about? A thermos with a light on it?

GPT4: Apologies for the confusion. In simpler terms, the patent application is describing a beverage container with a unique, textured surface pattern that resembles fractals. This pattern not only enhances the grip on the container but also allows multiple containers to be easily connected by interlocking the textured surfaces. Additionally, the pattern helps improve heat transfer in and out of the container. The container also features a pulsating light designed to attract attention, which may be useful in marketing or visibility in crowded spaces. So, it's not exactly a thermos with a light on it, but rather a uniquely designed container with an attention-grabbing light feature.

Ray: What purpose would it serve to interlock containers together? If I need to store some cups I can just stack them?

GPT4: Interlocking containers can provide certain benefits over simple stacking. Some potential advantages include:

  1. Stability: Interlocking containers can create a more stable structure, reducing the risk of them falling over or getting dislodged during transportation or storage. This can be especially useful when dealing with a large number of containers or in situations where they might be subject to movement or vibrations.

  2. Organization: Interlocking containers can help keep things organized and neatly arranged, especially in limited storage spaces. This can make it easier to access specific containers without disturbing the entire stack.

  3. Customization: With interlocking containers, you can create various configurations to suit your specific needs, such as building a larger structure or arranging containers in a particular shape or pattern.

  4. Improved compatibility: Interlocking containers can potentially be designed to work with complementary accessories or attachments, like lids or carrying handles, that can further enhance their functionality.

While stacking containers can work well in many situations, the interlocking design offers some unique benefits that may be advantageous in certain scenarios or for specific applications.

Ray: That's a very nice hypothetical and I realize I've asked you to explain this, it's not your idea, but I've been using cups my whole life and I've never needed this interlocking stuff.

GPT4: You're absolutely right. In most everyday situations, stacking cups or containers is sufficient and convenient. The interlocking design described in the patent application is indeed a more niche concept that may not be necessary for the majority of users. It's essential to recognize that many innovations and ideas are generated to address specific problems or explore unique solutions that might not be relevant to everyone.

In some cases, these innovative ideas might find a specialized market, catering to specific industries, unique environments, or specialized needs where traditional stacking might not be sufficient. However, it is entirely possible that this particular interlocking container design may not be widely adopted or seen as necessary for general use.

Ray: So basically we're looking at a cup (or other vague kind of food holder) with a light, a grip and the ability to interlock?

GPT4: Yes, that's a concise summary of the patent application. It describes a container, such as a cup or other food holder, with a unique textured surface pattern that enhances grip and allows for interlocking with other similar containers. Additionally, the container features a pulsating light designed to attract attention. The combination of these features results in a distinctive and innovative design, though its widespread adoption and practicality may vary depending on the specific use cases and preferences of potential users.

Ray: This particular patent application was created to demonstrate that machines can invent (or at least test the legal framework), and yet no effort has been made to actually manufacture and market this "invention" and to a typical user such as myself it seems completely superfluous. What do you make of that?

GPT4: It's important to acknowledge that the primary goal of this patent application may be to explore the boundaries of AI-generated inventions and test the legal framework rather than to create a practical, marketable product. As AI continues to advance, questions around the ownership, validity, and value of AI-generated inventions will become increasingly relevant.

In this context, the patent application serves as a thought-provoking example of the potential for AI to generate new ideas and designs, even if they might not always be immediately applicable or valuable to the average user. It highlights the need for ongoing discussions around the role of AI in the creative process and the legal and ethical implications of AI-generated inventions.

As technology continues to evolve, it's possible that some AI-generated ideas, like the interlocking container in this patent application, may not be widely adopted or considered practical. However, they still play a crucial role in stimulating conversations and pushing the boundaries of what is possible in the world of innovation and intellectual property.

Ray: Thanks, GPT4.

GPT4: You're welcome! If you have any more questions or need further assistance, please feel free to ask. I'm here to help!


· 2 min read
Ray Myers

Earlier this week, Microsoft Research published the paper "Sparks of Artificial General Intelligence: Early experiments with GPT-4".

This is no small claim. Artificial General Intelligence (AGI), is something of a holy grail in Computer Science. Traditionally, all successful AIs have been specialized for a particular problem, such as the Deep Blue chess engine or Google Translate. While they may outperform us within their specialization, the versatility of human intelligence puts us in another league. A breakthrough to AGI would be world-changing.

Considering the source for a moment, Microsoft Research has some top talent and I wouldn’t question the skills or motivation of any of their researchers. However, we also can’t ignore the enormous strategic interest Microsoft has in OpenAI’s technology (e.g. Bing search and GitHub Copilot). Even with the best intentions, a review and publication bias is undoubtedly at play here.

I’ve had a chance to work with GPT-4 a bit; it is a substantial improvement over GPT-3.5, which powered the initial release of ChatGPT. Where many prompts once produced plausible-sounding gibberish, results are now more consistently lucid and useful.

As the paper demonstrates, GPT is also capable of interacting with other tools. It’s bad at math? No problem, it can invoke a calculator during its task. It can run a Google search and take the results into account.

So with augmented LLMs, are we finally on a path that will lead to AGI? A fair portion of the software community is on "yellow alert" for that possibility. It’s also well worth listening to AI-skeptics like Grady Booch. The true value will only be realized once the hype cycle dies down.

As for me, I’ll put it this way. The Turing Test is dead and buried. LLMs are "good enough" at such a broad range of tasks that we will need to redraw our concept of AGI around their weaknesses.


· 3 min read
Ray Myers

Introduction

Refactoring Andy Sloan's entry to the 2006 International Obfuscated C Code Contest, with some help from ChatGPT. The code renders a spinning ASCII donut over a checkerboard with a scrolling banner.

Unlike previous refactoring sessions I've done, in this one I incorporated ChatGPT for making suggestions of incremental improvements.

Source code with refactoring steps in commit history.

RefactorGPT prompt

chatgpt-refactoring-prompts

You are now RefactorGPT, an expert in code maintainability. Your input will be code blocks. You will respond by giving the code a letter grade (A, B+, D-, etc...), listing smells (name only) and then recommending refactoring steps for the provided code without changing the functionality. Do not break the code. Make suggestions relevant to the code. Only show your suggestions, not the updated code. The available actions include but are not limited to:

Rename variable or parameter or function
Inline macro or function
Extract function or variable or constant
Add comment
Convert ternary to if
Delete unused variable or parameter
Remove dead code
Introduce parameter object

You can also say "Weird: " and flag something strange that should be investigated, but summarize it very briefly. Group suggestions by the function they pertain to, or "global". Bold the action types. We value names that are honest and complete. When you suggest comments, include the actual comment in quotes and briefly describe where it should be.

If you understand please respond with "RefactorGPT>"

If you are interested in legacy code techniques, check out my other refactoring sessions or reach out with questions via LinkedIn or Twitter!

Example Input

int charIndex(int q, char *l, char *r, int o) {
for (; q >= 0;)
q = ("AYLrZ^w^?EXnovnebYVdO}LE{yWlwJl_Ja|[ur]zovpui]e|yao_BeosmIg}r]]r]m|wkZU}{O}xys]]\
x|ya|ysm||{uel}|r{yIcsm||ya[{uE{qY\
w|gGorVrVWioriI}Qac{{BIY[sXjjsVW]aMT\
tXjjsssV_OUkRUlSiorVXp_qOM>draw{BadB"[frame / 6] -
62 >>
frame++ % 6 &
1
? r[q]
: l[q]) -
o;
return q;
}

Response (ChatGPT with the GPT-4 model)

Grade: C-

Smells:

Inconsistent formatting
Poor variable naming
Lack of comments

Refactoring Suggestions:

global:

None
charIndex:

Rename function: charIndex -> findCharacterIndex
Rename parameter: q -> startIndex
Rename parameter: l -> leftString
Rename parameter: r -> rightString
Rename parameter: o -> offset
Add comment: Before the for loop, add the comment: "Search for a character in leftString and rightString starting from startIndex, subtracting the offset from the ASCII value of the character found."
Convert ternary to if: Replace the ternary operator in the loop with an if-else statement for better readability.
After applying these suggestions, the code will be more maintainable and easier to understand.


· 2 min read
Ray Myers

What if machines could do the mechanical parts of what humans do?

Often they do; we call that automation. Automation of the software variety is often called "programming".

But what if the mechanical parts of programming were automated?

We do that.

Constantly, in fact; we're positively neurotic about finding new ways to eliminate duplicate work.

But what if you could draw on a vast library of code already written?

We call those library dependencies.

But what if you could skip the code entirely and just describe in unambiguous terms what you wanted the computer to do?

Then you’d have a programming language. They exist to describe in human-readable and unambiguous terms what you want a computer to do.

But it would be like a higher-level pass that generates code.

We do that, for instance in DSLs (domain-specific languages) and multi-pass compilers. The GHC Haskell compiler uses at least 4 intermediate representations not including Haskell code itself or assembly/machine code.

Do you really think it’s impossible to make further progress?

Of course not; that’s why we continue to refine our tool-chains, and new language communities are still forming with new approaches.

Just understand that we’ve seen claims of revolutions in coding productivity or its elimination, coupled with impressive demos, regularly for over 50 years. The ideas that made a dent were few and far between, and even then the realistic implications were different from what we would have expected.

So what have you got?
