4 posts tagged with "chatgpt"

OpenAI DevDay Announcements and Coding

November 7, 2023 · 6 min read

Ray Myers

Host of Craft vs Cruft

Kyle Forster

CEO of RunWhen

Yesterday, OpenAI held their first DevDay with some of their biggest releases since GPT-4 in March! Full details are in the official announcement and keynote stream. In this post we'll give first thoughts on the implications for software development and maintenance.

Some of the biggest limitations of GPT-4 were that it was slow, expensive, couldn't fit enough data in the context window, and had a knowledge cutoff of January 2022. All of those have been significantly addressed. Short of eliminating hallucinations (which may be intractable), we couldn't have asked for much more in this release.

While this is not "GPT-5", whatever that may look like, it was a huge move to address so many key frustrations at once. As the Mechanized Mending Manifesto hints, we have much to learn about taking advantage of Large Language Models as components in a system before our main limitation becomes the sophistication of the model itself.

Lightning round

Let's give some initial takes on the impact to AI coding workflows for each of these changes.

👾👾👾👾👾 = Game changer
👾👾👾👾 = Reconsider many usage patterns
👾👾👾 = Major quality of life improvement
👾👾 = Considerable quality of life improvement
👾 = Nice to have
🤷 = Not sure!

Feature	Impact	Notes
128K context	👾👾👾👾👾	Max score of 5 space invaders!
Price drop	👾👾👾👾	See below
Code Interpreter in API	👾👾👾	Code Interpreter's workflow is often better than using GPT-4 codegen directly
JSON mode / Parallel Function calls	👾👾👾	Tooling needs this, we had workarounds but structured output was a constant pain
Speed	👾👾	This makes GPT-4 more of a contender for interactive coding assistants
Assistants API	👾👾	Saves a lot of boilerplate for new chatbots
Retrieval API	👾👾	Again, we could do this ourselves but now it's easy
Updated cutoff date	👾	Probably more important outside coding
Log probabilities	👾	Should help with autocomplete features

Uncertain callouts

Feature	Impact	Notes
Improved instruction following	🤷	We need to try it
Reproducible outputs	🤷	Will reproducibility help if it's generally unpredictable?
GPT-4 Fine Tune / Custom Models	🤷	I don't have 5 million dollars, do you?
GPT Store	🤷🤷	Maybe more useful for coding adjacent tools, see Kyle's section below
Copyright Shield	🤷🤷🤷	Their legal strategy will have... ramifications

Looking deeper

128K context 👾👾👾👾👾

This gets the maximum score of 5 space invaders.

We'll follow up with more later, but for instance this video from April, Generating Documentation with GPT AI, had as its main theme the difficulty of getting an LLM agent to reason about a single 8,000 line source file from Duke Nukem 3D.

That dreaded file now fits in a single (expensive) prompt! So do some entire books. Our options for inference using the state of the art model have just drastically changed. We look forward to seeing how well the performance holds up in extended context, because previous long-context methods in the research have usually had caveats.

Price drop! 👾👾👾👾

Deciding when to use 3.5-Turbo vs the premium 4 vs a fine-tuned 3.5 has been a juggling act. With this price drop:

GPT-4 Turbo 128K is 1/3 the cost of GPT-4 8K by input token (1/2 by output)
GPT-4 Turbo 128K is 1/6 the cost of GPT-4 32K by input token (1/4 by output)
GPT-3.5 Turbo 16K is also now cheaper than its 4K version was
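
The GPT-4 ratios can be checked with a little arithmetic. This sketch assumes the per-1K-token list prices as we recall them from the announcement (GPT-4 8K at $0.03 in / $0.06 out, GPT-4 32K at $0.06 / $0.12, GPT-4 Turbo 128K at $0.01 / $0.03). Treat those numbers and the model labels as assumptions and check the current pricing page before budgeting around them.

```python
from fractions import Fraction

# Assumed list prices in USD per 1K tokens: (input, output).
# These are recalled from the announcement, not read from an API.
PRICES = {
    "gpt-4-8k": (Fraction("0.03"), Fraction("0.06")),
    "gpt-4-32k": (Fraction("0.06"), Fraction("0.12")),
    "gpt-4-turbo-128k": (Fraction("0.01"), Fraction("0.03")),
}


def cost_ratio(new_model, old_model):
    """Return (input_ratio, output_ratio) of new_model's price to old_model's."""
    new_in, new_out = PRICES[new_model]
    old_in, old_out = PRICES[old_model]
    return new_in / old_in, new_out / old_out


for old in ("gpt-4-8k", "gpt-4-32k"):
    in_ratio, out_ratio = cost_ratio("gpt-4-turbo-128k", old)
    print(f"Turbo vs {old}: {in_ratio} input, {out_ratio} output")
```

Using `Fraction` keeps the ratios exact, so they print as 1/3 and 1/2 rather than rounded floats.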

Updated cutoff date

Training now includes data up to April 2023 instead of January 2022. This is huge for general use of ChatGPT, but for coding tasks you should consider controlling context more carefully with Retrieval Augmented Generation (RAG), as Cursor does.

Whisper v3 and Consistency Decoder

Better speech recognition models will always be good news for speech driven tools like Cursorless and Talon, used by coders with repetitive stress injury.

New modalities in the API

These are worth mentioning, but don't seem aimed at coding as we normally understand it. Perhaps for front-end devs and UX design though?

GPT-4 Turbo vision
DALL·E 3
Text-to-speech

AI Troubleshooting

For this section we're joined by a leader in AI-assisted troubleshooting: Kyle Forster, CEO of RunWhen and former Kubernetes Product Director at Google.

I look to OpenAI's developer announcements as bellwether moments in the modern AI industry. Whether you use their APIs or not, they have access to so many consumers and enterprises that their decisions of what to do and not do are particularly well informed. Below are my take-aways relevant to our domain.

Models like micro-services, not monolith

OpenAI could focus entirely on driving traffic to their native ChatGPT. Instead, their announcements this week are making it easier to build your own domain-specific GPT and Digital Assistants. We've been strong believers in this direction since day 1: our UX allows users to ask the same troubleshooting question to multiple Digital Assistants. Like individuals on a team, each one has different capabilities and different access rights, and comes up with different troubleshooting paths and different conclusions.

Internal-only Enterprise GPT Investments

Enterprise data security with regards to AI is an issue that our industry is only just starting to digest. It is clear that using enterprise data to train models is an absolute "no," but what about a vendor's in-house completion endpoints? A vendor's third-party vendors' completion endpoints? Masked data? Enterprise-managed models?

We've made very conservative decisions here out of short-term necessity, but our advanced users are thinking about how to take advantage of big public endpoints. The debate reminds me of the raging '10-'12 debates over public cloud vs private cloud and the emergence of hybrid cloud research that drove both forward. In this vein, the OpenAI announcements touching on this feel like hybrid cloud investment. I don't know where this work ultimately lands, but I do see that numerous inventions - equivalents of the Cloud VPCs and Cloud Networking Firewalls that supplemented the early focus on Security Groups - are ahead of us.

LLMs Can't Code

September 24, 2023 · 4 min read

Ray Myers

Host of Craft vs Cruft

So, I ask an LLM 'hey, generate me code for doing X and Y'. It spits out something that looks like it could work, but it doesn't as some of the methods used do not exist at all.

How do I continue to give it a chance, in this particular instance? Any suggestions? - Slobodan Tanasić

This is a very common issue right now, and seems somewhat inevitable. Maybe this'll help.

A Practical Answer

I think of it like using Stack Overflow. There is a category of isolated problems where I've learned Stack Overflow will help me more than waste my time, and it's awesome at those. It helps even though we can't trust copying the answer directly into our code. It might not be correct, might not even compile, or might not be licensed to allow use.

But we understand the provenance of the solution is dubious and it still helps us reach a solution faster, getting past stumbling blocks. When I've used LLMs (Large Language Models) like ChatGPT successfully during coding, it's a similar process.

For some more detailed guidance, our friends Daniel and Nicolas have some articles getting into specific use cases:

19 Tips for Coding Faster with ChatGPT - Daniel Feldman
Can AI help me refactor legacy code? - Nicolas Carlo
Can AI help me write tests on legacy code? - Nicolas Carlo

The Cursor editor has also made progress creating a more ergonomic integration than other assistants.

The Deeper Problem

This is controversial, but I actually believe LLMs can't code.

Coding is not a single task. It's a big bag of different types of cognitive tasks. Some parts are, we might say, language-oriented and map very well to an LLM's strengths.

LLM-Easy

Language models suggest pretty good variable names. They can identify code smells. They can look up many (mostly real) libraries that relate semantically to our questions. These are examples of well-trodden NLP (Natural Language Processing) tasks that we could label "LLM-Easy".

Summarization
Classification
Semantic Search
...

LLM-Hard

Then you have tasks that require precise reasoning. Even though we've come to expect computers to excel at those, LLMs are weak in this department. Some would say they don't reason at all; I'll just say that at minimum they are very inefficient and unreliable at it.

For instance, GPT-4 cannot multiply two 3-digit numbers together. Not only that, it returns wrong answers indiscriminately.

If you can spend $100 Million training a model, run it on special GPU hardware optimized for number-crunching, and still get something that can't multiply numbers, we are justified in saying that it is the wrong tool for certain kinds of reasoning. Fortunately, there are many other tools.
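
As one illustration of reaching for another tool, exact arithmetic is trivial for ordinary code. This is a minimal sketch, not something from our tooling: the name `evaluate_arithmetic` is hypothetical, and it deliberately handles only integer literals combined with +, -, * and parentheses.

```python
import ast
import operator

# Only these binary operators are accepted; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def evaluate_arithmetic(expression: str) -> int:
    """Exactly evaluate an integer expression built from +, -, * and parentheses."""

    def walk(node):
        # Plain integer literal (bools are excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        # Names, calls, attribute access and so on never get evaluated.
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


print(evaluate_arithmetic("123 * 456"))  # 56088
```

The point is the division of labor: a model might be good at spotting that a question contains arithmetic, while a small deterministic component like this one does the actual calculation.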

Forgive a little handwaving here, because we do not have a full account of every kind of cognitive task that goes into coding. But if you suspect that any part is LLM-Hard then you might agree that the all-in-one approaches that ask an LLM to produce raw source code are naive:

Putting an unknown number of LLM-Easy and Hard tasks into a blender
Expecting an LLM to identify which is which or solve them all simultaneously
Accepting the unrestricted text response as a valid update to your production code

This will hit a wall. More hardware and better prompts won't fix it. We need to partition the tasks, using components that are suited for each rather than treating one component as a panacea.

In short, that's how we teach machines to code.

Turbo Refactoring

September 2, 2023 · 5 min read

Ray Myers

Host of Craft vs Cruft

If we're going to teach machines to code, let's teach them the safest way we know how.

GeePaw Hill said, "Many More Much Smaller Steps". Just as the discipline of small steps helps humans work more effectively, Large Language Models have shown dramatically improved results by breaking down tasks in a technique called Chain of Thought Prompting. Surprisingly, Chain of Thought improves responses even with no feedback. By adding incremental verifications between each step we can do even better.

Rule: Take small steps and test them

To make these steps as safe as possible, let's take a moment to embrace a brutal design constraint: LLMs cannot reliably edit existing code.

When we ask GPT4 to update code, it breaks it
When we ask GitHub Copilot to update code, it breaks it
When we ask Sourcegraph Cody to update code, it breaks it
When we ask CodeLlama to update code, it breaks it

We could wait for bigger shinier models to address this, but possibly LLMs are just architecturally incapable of the required kinds of reasoning. For our purposes today, we will treat LLM output as potentially useful but inherently untrustworthy.

Rule: Treat LLM output as untrustworthy

Where does this leave us? Untrusted output must be filtered through an intermediary, typically a human reviewer or a vetted tool with limited actions. So far, most approaches involve the former - people manually code review and debug the raw LLM output. Are there other options? Can we manipulate code without directly editing it? Funny you should ask...
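
One way to picture a "vetted tool with limited actions" is a gatekeeper that accepts only a fixed set of structured requests and rejects everything else. The sketch below is hypothetical: the JSON shape, the action names, and `parse_action` are assumptions made for illustration, not the interface of any existing tool.

```python
import json
from dataclasses import dataclass

# Assumed action names; a real tool would define its own vocabulary.
ALLOWED_ACTIONS = {"rename", "extract_variable", "inline"}


@dataclass(frozen=True)
class RefactorAction:
    action: str
    target: str
    new_name: str = ""


def parse_action(llm_output: str) -> RefactorAction:
    """Turn untrusted LLM text into a RefactorAction, or raise ValueError."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    target = data.get("target")
    new_name = data.get("new_name", "")

    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action!r}")
    # Python's identifier rules stand in for the target language's here.
    if not isinstance(target, str) or not target.isidentifier():
        raise ValueError("target must be an identifier")
    if not isinstance(new_name, str):
        raise ValueError("new_name must be a string")
    if action == "rename" and not new_name.isidentifier():
        raise ValueError("rename needs an identifier as new_name")
    return RefactorAction(action, target, new_name)
```

Free-form text such as a pasted block of rewritten code fails at the first check, so the only things that can reach the codebase are the few actions the tool already knows how to perform.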

Return of Refactoring Browser

A tool called Refactoring Browser introduced syntax-aware automated refactoring in commercial tools in 1999, building on research started by Ralph Johnson and Bill Opdyke in 1990. The right-click refactorings in today's IDEs like IntelliJ are descendants, but a unique aspect of Refactoring Browser was that it could only refactor, not act as a text editor. In a sense, what it couldn't do was as important as what it could.

We can rebuild it

Could we build another, more automation-friendly Refactoring Browser today? How many transformations would we need to support?

While Martin Fowler's definitive Catalog of Refactorings is extensive, we can get very far with surprisingly few of them mastered. The Core 6 Refactorings identified by Arlo Belshee are Extract (Variable / Method / Parameter / Field), Rename, and Inline. Those basic operations give us control of how concepts are named in the code, and therefore how intent is expressed. Further, all of these can be automated and performed with a high degree of confidence - no hand editing necessary!

Instead of trying to understand an entire large block of code at once, we can take incremental steps capturing one piece of insight in a name before moving on. The practices of Read by Refactoring and Naming as a Process expand on this. On Craft vs Cruft, I tend to call this "Untangling".

The versatility of a refactoring CLI

With a small handful of core recipes supported per language, we can create a workable command-line refactoring tool (see Untangler for an early prototype). Liberating this functionality from IDEs opens up many possibilities:

Restricted editing UIs ("Uneditors" like Refactoring Browser)
Scripted code migrations
AI-driven refactoring
Easily give more editors refactoring support (YAY!)

Insanely small steps

So we started by saying we'd teach machines the safest way we know how to develop. What is that, Test Driven Development? Let's go even further. Let's go all the way to TCR, which has been called "TDD on steroids"!

Test && Commit || Revert

In Kent Beck's TCR, you make one single change. If the tests pass, you commit. If they fail, you blow away your change and start over. This is an unorthodox and challenging workflow that encourages preparatory refactoring before making a feature change.
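
A single TCR step can be sketched in a few lines. This is an illustration under assumptions rather than Kent Beck's own script: the function names, the default commit message, and the choice of `git reset --hard` for the revert are ours, and the test command is whatever your project uses.

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def tcr(
    test_cmd: Sequence[str],
    message: str = "tcr: small step",
    run: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Test && Commit || Revert. Returns True if the change was committed."""
    if run(test_cmd) == 0:
        # Green: keep the change.
        run(["git", "add", "--all"])
        run(["git", "commit", "--message", message])
        return True
    # Red: discard changes to tracked files.
    # Note that newly created untracked files are left in place.
    run(["git", "reset", "--hard"])
    return False
```

Calling `tcr(["make", "test"])` after each tiny edit gives the workflow described above. The `run` parameter exists only so the step can be exercised without touching a real repository.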

Make the change easy, then make the easy change - Kent Beck

In the 2012 talk 2 Minutes to Better Code, Woody Zuill and Llewellyn Falco demonstrate rapid commits of incremental refactoring steps supporting an upcoming feature change on a real codebase. Those kinds of steps would fit well into a TCR workflow.

Full workflow

Here's an example of how this could work end-to-end; other variations are possible.

Human enters high level intent: "Clean up the Foo module"
Branch from Trunk
AI Refactor Loop
1. Green (tests pass)
2. AI suggests refactoring action
3. Refactoring tool applies transformation
4. Green (tests pass)
5. Commit
Create Pull Request to Trunk
Human decides whether to merge the Pull Request
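
The AI Refactor Loop in the middle of that workflow could be sketched as follows. Every name here is hypothetical, and the five callables are placeholders for the real test runner, model call, refactoring tool, and version control. Reverting when a step goes red is an assumption borrowed from TCR; the list above only shows the green path.

```python
from typing import Callable, List, Optional


def refactor_loop(
    tests_pass: Callable[[], bool],
    suggest: Callable[[], Optional[str]],
    apply: Callable[[str], None],
    commit: Callable[[str], None],
    revert: Callable[[], None],
    max_steps: int = 20,
) -> List[str]:
    """Run the loop and return the actions that were committed."""
    committed: List[str] = []
    if not tests_pass():      # 1. only start from green
        return committed
    for _ in range(max_steps):
        action = suggest()    # 2. AI suggests a refactoring action
        if action is None:    # nothing left to suggest
            break
        apply(action)         # 3. the refactoring tool edits the code, not the AI
        if tests_pass():      # 4. still green?
            commit(action)    # 5. commit
            committed.append(action)
        else:
            revert()          # assumed: return to the last green state
    return committed
```

The returned list doubles as a replayable record of the steps, which is what makes review at any granularity possible.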

We now have an unprecedented set of safeguards for LLM-driven changes:

We only edit code through syntax transformations with known behaviors
Every edit is built
Every edit has tests run
We can perform review at any granularity due to rapid commits
The steps are replayable, entirely or a subset

Can this work?

It already does.

While the prototype's capabilities are very limited (it does renames in C), early experiments suggest this is the first AI developer agent that can update code in a bug-free way. It may choose a bad variable name here and there, but it does not break the code.

I'll be releasing more pieces into The Mender Stack as I'm able to clean them up and explain them properly. It will take a significant but very tractable amount of work to implement polished support for the core refactorings in several major languages. If you're interested in helping out I'd love to hear from you.

AI vs Donut! Using GPT4 to De-Obfuscate C

March 23, 2023 · 3 min read

Ray Myers

Host of Craft vs Cruft

Introduction

Refactoring Andy Sloane's entry to the 2006 International Obfuscated C Code Contest, with some help from ChatGPT. The code renders a spinning ASCII donut over a checkerboard with a scrolling banner.

Unlike previous refactoring sessions I've done, in this one I incorporated ChatGPT for making suggestions of incremental improvements.

Source code with refactoring steps in commit history.

RefactorGPT prompt

chatgpt-refactoring-prompts

You are now RefactorGPT, an expert in code maintainability. Your input will be code blocks. You will respond by giving the code a letter grade (A, B+, D-, etc...), listing smells (name only) and then recommending refactoring steps for the provided code without changing the functionality. Do not break the code. Make suggestions relevant to the code. Only show your suggestions, not the updated code. The available actions include but are not limited to:

Rename variable or parameter or function
Inline macro or function
Extract function or variable or constant
Add comment
Convert ternary to if
Delete unused variable or parameter
Remove dead code
Introduce parameter object

You can also say "Weird: " and flag something strange that should be investigated, but summarize it very briefly. Group suggestions by the function they pertain to, or "global". Bold the action types. We value names that are honest and complete. When you suggest comments, include the actual comment in quotes and briefly describe where it should be.

If you understand please respond with "RefactorGPT>"

If you are interested in legacy code techniques, check out my other refactoring sessions or reach out with questions via LinkedIn or Twitter!

Example Input

int charIndex(int q, char *l, char *r, int o) {
  for (; q >= 0;)
    q = ("AYLrZ^w^?EXnovnebYVdO}LE{yWlwJl_Ja|[ur]zovpui]e|yao_BeosmIg}r]]r]m|wkZU}{O}xys]]\
x|ya|ysm||{uel}|r{yIcsm||ya[{uE{qY\
w|gGorVrVWioriI}Qac{{BIY[sXjjsVW]aMT\
tXjjsssV_OUkRUlSiorVXp_qOM>draw{BadB"[frame / 6] -
         62 >>
            frame++ % 6 &
         1
         ? r[q]
         : l[q]) -
        o;
  return q;
}

Response (ChatGPT with GPT 4 model)

Grade: C-

Smells:

Inconsistent formatting
Poor variable naming
Lack of comments

Refactoring Suggestions:

global:

None

charIndex:

Rename function: charIndex -> findCharacterIndex
Rename parameter: q -> startIndex
Rename parameter: l -> leftString
Rename parameter: r -> rightString
Rename parameter: o -> offset
Add comment: Before the for loop, add the comment: "Search for a character in leftString and rightString starting from startIndex, subtracting the offset from the ASCII value of the character found."
Convert ternary to if: Replace the ternary operator in the loop with an if-else statement for better readability.

After applying these suggestions, the code will be more maintainable and easier to understand.
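
Because the rename suggestions follow a regular "Rename kind: old -> new" shape, they could be pulled out mechanically and handed to a refactoring tool instead of being applied by hand. This is a sketch under assumptions: `parse_renames` is a hypothetical helper, and it relies on the model sticking to the exact wording seen in this one response.

```python
import re

# Matches lines such as "Rename parameter: q -> startIndex".
RENAME_LINE = re.compile(r"^Rename (function|parameter|variable): (\w+) -> (\w+)$")


def parse_renames(response: str):
    """Return (kind, old_name, new_name) for each rename line in the response."""
    renames = []
    for line in response.splitlines():
        match = RENAME_LINE.match(line.strip())
        if match:
            renames.append(match.groups())
    return renames


sample = """\
Rename function: charIndex -> findCharacterIndex
Rename parameter: q -> startIndex
Convert ternary to if: Replace the ternary operator in the loop with an if-else statement for better readability.
"""
print(parse_renames(sample))
# [('function', 'charIndex', 'findCharacterIndex'), ('parameter', 'q', 'startIndex')]
```

Lines that do not match, like the ternary suggestion, are simply skipped and left for a human to consider.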
