
2 posts tagged with "reasoning"


· 6 min read
Ray Myers
Kyle Forster

Yesterday, OpenAI held their first DevDay with some of their biggest releases since GPT-4 in March! Full details are in the official announcement and keynote stream. In this post we'll give first thoughts on the implications for software development and maintenance.

Some of the biggest limitations of GPT-4 were that it was slow, expensive, couldn't fit enough data in the context window, and had a knowledge cutoff of January 2022. All of those have been significantly addressed. Short of eliminating hallucinations (which may be intractable), we couldn't have asked for much more in this release.

While this is not "GPT-5", whatever that may look like, it was a huge move to execute on so many key frustrations at once. As the Mechanized Mending Manifesto hints, we have much to learn about taking advantage of Large Language Models as components in a system before our main limitation becomes the sophistication of the model itself.

Lightning round

Let's give some initial takes on the impact to AI coding workflows for each of these changes.

  • 👾👾👾👾👾 = Game changer
  • 👾👾👾👾 = Reconsider many usage patterns
  • 👾👾👾 = Major quality of life improvement
  • 👾👾 = Considerable quality of life improvement
  • 👾 = Nice to have
  • 🤷 = Not sure!
  • 128K context 👾👾👾👾👾: Max score of 5 space invaders!
  • Price drop 👾👾👾👾: See below
  • Code Interpreter in API 👾👾👾: Code Interpreter's workflow is often better than using GPT-4 codegen directly
  • JSON mode / parallel function calls 👾👾👾: Tooling needs this; we had workarounds, but structured output was a constant pain
  • Speed 👾👾: This makes GPT-4 more of a contender for interactive coding assistants
  • Assistants API 👾👾: Saves a lot of boilerplate for new chatbots
  • Retrieval API 👾👾: Again, we could do this ourselves, but now it's easy
  • Updated cutoff date 👾: Probably more important outside coding
  • Log probabilities 👾: Should help with autocomplete features

Uncertain callouts

  • Improved instruction following 🤷: We need to try it
  • Reproducible outputs 🤷: Will reproducibility help if it's generally unpredictable?
  • GPT-4 fine-tuning / custom models 🤷: I don't have 5 million dollars, do you?
  • GPT Store 🤷🤷: Maybe more useful for coding-adjacent tools; see Kyle's section below
  • Copyright Shield 🤷🤷🤷: Their legal strategy will have... ramifications

Looking deeper

128K context 👾👾👾👾👾

This gets the maximum score of 5 space invaders.

We'll follow up with more later, but for instance, this video from April, Generating Documentation with GPT AI, had as its main theme the difficulty of getting an LLM agent to reason about a single 8,000-line source file from Duke Nukem 3D.

That dreaded file now fits in a single (expensive) prompt! So do some entire books. Our options for inference using the state of the art model have just drastically changed. We look forward to seeing how well the performance holds up in extended context because previous methods in the research have usually had caveats.
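As a back-of-the-envelope check, a file that size really does squeeze into the new window. This sketch assumes the common rough heuristic of ~4 characters per token and an average line length of ~60 characters; real counts vary by tokenizer and should be measured with a tool like tiktoken.

```python
# Rough check: does an 8,000-line source file fit in a 128K-token window?
# Assumes ~4 characters per token (a heuristic, not an exact count).

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for typical code."""
    return len(text) // 4

# Hypothetical stand-in for an 8,000-line file of ~60-character lines.
source_file = ("x" * 60 + "\n") * 8000

tokens = estimate_tokens(source_file)
print(tokens)             # 122000 -- roughly 122K tokens
print(tokens <= 128_000)  # True: fits, barely
```

Barely fitting is still fitting, though at these sizes a single prompt is not cheap.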

Price drop! 👾👾👾👾

Deciding when to use 3.5-Turbo vs the premium GPT-4 vs a fine-tuned 3.5 has been a juggling act. With this price drop:

  • GPT-4 Turbo 128K is 1/3 the cost of GPT-4 8K by input token (1/2 by output)
  • GPT-4 Turbo 128K is 1/6 the cost of GPT-4 32K by input token (1/4 by output)
  • GPT-3.5 Turbo 16K is also now cheaper than its 4K version was
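Those ratios follow from the per-1K-token prices announced at DevDay (input/output, in USD): $0.01/$0.03 for GPT-4 Turbo, $0.03/$0.06 for GPT-4 8K, and $0.06/$0.12 for GPT-4 32K. A quick sanity check, though verify against OpenAI's current pricing page before budgeting:

```python
# Per-1K-token prices (input, output) in USD as announced at DevDay.
# These numbers change over time; check OpenAI's pricing page.
PRICES = {
    "gpt-4-turbo-128k": (0.01, 0.03),
    "gpt-4-8k":         (0.03, 0.06),
    "gpt-4-32k":        (0.06, 0.12),
}

turbo_in, turbo_out = PRICES["gpt-4-turbo-128k"]

print(turbo_in / PRICES["gpt-4-8k"][0])    # ~0.333 -> 1/3 of GPT-4 8K per input token
print(turbo_out / PRICES["gpt-4-8k"][1])   # 0.5    -> 1/2 per output token
print(turbo_in / PRICES["gpt-4-32k"][0])   # ~0.167 -> 1/6 of GPT-4 32K per input token
print(turbo_out / PRICES["gpt-4-32k"][1])  # 0.25   -> 1/4 per output token
```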

Updated cutoff date

Training now includes data up to April 2023 instead of January 2022. This is huge for general use of ChatGPT, but for coding tasks you should consider controlling context more carefully with Retrieval-Augmented Generation (RAG), as Cursor does.
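To make the RAG idea concrete, here is a minimal sketch of the retrieval step: pull the most relevant chunks of your own codebase into the prompt instead of relying on the model's (possibly stale) training data. Real systems like Cursor use embeddings and a vector index; this toy version scores chunks by word overlap purely to stay self-contained, and the code snippets are hypothetical.

```python
# Toy RAG retrieval: rank code chunks by word overlap with the question,
# then splice the winners into the prompt. Production systems would use
# embedding similarity instead of this keyword score.

def score(query: str, chunk: str) -> int:
    """Count distinct query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().replace("(", " ").replace(")", " ").split())
    return len(q & c)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

docs = [
    "def parse_config(path): ...  # loads YAML config",
    "def render_player(sprite): ...  # draws the player sprite",
    "def load_level(name): ...  # reads level data from disk",
]

context = retrieve("where is the yaml config parsed", docs, k=1)
prompt = "Using this code:\n" + "\n".join(context) + "\n\nAnswer the question."
print(context[0])  # the config-parsing chunk wins
```

The point is that the prompt now contains your actual code, so the cutoff date matters much less than the quality of retrieval.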

Whisper v3 and Consistency Decoder

Better speech recognition models will always be good news for speech driven tools like Cursorless and Talon, used by coders with repetitive stress injury.

New modalities in the API

These are worth mentioning, but don't seem aimed at coding as we normally understand it. Perhaps for front-end devs and UX design though?

  • GPT-4 Turbo vision
  • DALL·E 3
  • Text-to-speech

AI Troubleshooting

For this section we're joined by a leader in AI-assisted troubleshooting: Kyle Forster, CEO of RunWhen and former Kubernetes Product Director at Google.

I look to OpenAI's developer announcements as bellwether moments in the modern AI industry. Whether you use their APIs or not, they have access to so many consumers and enterprises that their decisions of what to do and not do are particularly well informed. Below are my take-aways relevant to our domain.

Models like micro-services, not monolith

OpenAI could focus entirely on driving traffic to their native ChatGPT. Instead, their announcements this week make it easier to build your own domain-specific GPTs and Digital Assistants. We've been strong believers in this direction since day 1: our UX allows users to ask the same troubleshooting question to multiple Digital Assistants. Like individuals on a team, each one has different capabilities and different access rights, and comes up with different troubleshooting paths and different conclusions.

Internal-only Enterprise GPT Investments

Enterprise data security with regards to AI is an issue that our industry is only just starting to digest. It is clear that using enterprise data to train models is an absolute "no," but what about a vendor's in-house completion endpoints? A vendor's third-party completion endpoints? Masked data? Enterprise-managed models?

We've made very conservative decisions here out of short-term necessity, but our advanced users are thinking about how to take advantage of big public endpoints. The debate reminds me of the raging public cloud vs. private cloud debates of '10-'12 and the emergence of hybrid cloud research that drove both forward. In this vein, the OpenAI announcements touching on this feel like hybrid cloud investment. I don't know where this work ultimately lands, but I do believe numerous inventions - equivalents of the Cloud VPCs and Cloud Networking Firewalls that supplemented the early focus on Security Groups - are ahead of us.

· 4 min read
Ray Myers

So, I ask an LLM 'hey, generate me code for doing X and Y'. It spits out something that looks like it could work, but it doesn't as some of the methods used do not exist at all.

How do I continue to give it a chance, in this particular instance? Any suggestions? - Slobodan Tanasić

This is a very common issue right now, and it seems somewhat inevitable. Maybe this'll help.

A Practical Answer

I think of it like using Stack Overflow. There is a category of isolated problems where I've learned Stack Overflow will help me more than waste my time, and it's awesome at those. It helps even though we can't trust the answer enough to copy it in directly: it might not be correct, might not even compile, or might not be licensed to allow use.

But we understand the provenance of the solution is dubious and it still helps us reach a solution faster, getting past stumbling blocks. When I've used LLMs (Large Language Models) like ChatGPT successfully during coding, it's a similar process.

For some more detailed guidance, our friends Daniel and Nicolas have some articles getting into specific use cases:

The Cursor editor has also made progress creating a more ergonomic integration than other assistants.

The Deeper Problem

This may be controversial, but I actually believe LLMs can't code.

Coding is not a single task. It's a big bag of different types of cognitive tasks. Some parts are, we might say, language-oriented and map very well to an LLM's strengths.


Language models suggest pretty good variable names. They can identify code smells. They can look up many (mostly real) libraries that relate semantically to our questions. These are examples of well-trodden NLP (Natural Language Processing) tasks that we could label "LLM-Easy":

  • Summarization
  • Classification
  • Semantic Search
  • ...


Then you have tasks that require precise reasoning. Even though we've come to expect computers to excel at those, LLMs are weak in this department. Some would say they don't reason at all; I'll just say that, at minimum, they are very inefficient and unreliable at it.

For instance, GPT-4 cannot multiply two 3-digit numbers together. Not only that, it returns wrong answers indiscriminately.

If you can spend $100 Million training a model, run it on special GPU hardware optimized for number-crunching, and still get something that can't multiply numbers, we are justified in saying that it is the wrong tool for certain kinds of reasoning. Fortunately, there are many other tools.
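To make that concrete: the standard escape hatch is delegation, as in the function-calling pattern, where the model emits a structured tool request and the host executes it exactly. Here is a sketch with a stubbed model response; a real system would get that JSON back from the API rather than hard-coding it.

```python
# Delegating arithmetic: the model's job is to produce a structured tool
# call, not the answer. The host runs the tool deterministically.
import json

TOOLS = {
    "multiply": lambda a, b: a * b,
    "add": lambda a, b: a + b,
}

def run_tool_call(model_output: str) -> int:
    """Parse a structured tool call and execute it exactly."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](*call["args"])

# Stub of what a function-calling model might return for "what is 123 * 456?"
stubbed_response = '{"tool": "multiply", "args": [123, 456]}'
print(run_tool_call(stubbed_response))  # 56088 -- exact, every time
```

The multiplication itself never touches the model; it only has to recognize that a multiplication is being asked for.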

Forgive a little handwaving here, because we do not have a full account of every kind of cognitive task that goes into coding. But if you suspect that any part is LLM-Hard then you might agree that the all-in-one approaches that ask an LLM to produce raw source code are naive:

  • Putting an unknown number of LLM-Easy and Hard tasks into a blender
  • Expecting an LLM to identify which is which or solve them all simultaneously
  • Accepting the unrestricted text response as a valid update to your production code

This will hit a wall. More hardware and better prompts won't fix it. We need to partition the tasks, using components suited to each rather than treating one component as a panacea.
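A sketch of what that partitioning might look like: classify each subtask, then route the language-shaped ones to the model and the precise ones to deterministic tools. The task names and stub functions here are hypothetical, standing in for real model calls, compilers, and refactoring engines.

```python
# Route LLM-Easy subtasks to the model, precise subtasks to exact tools.

def ask_llm(task: str) -> str:
    """Stub for a model call; suggestions still need review."""
    return f"LLM suggestion for: {task}"

def run_deterministic(task: str) -> str:
    """Stand-in for an exact tool: compiler, parser, refactoring engine."""
    return f"verified result for: {task}"

ROUTES = {
    "suggest_name": ask_llm,            # language-shaped: LLM-Easy
    "summarize_diff": ask_llm,          # language-shaped: LLM-Easy
    "apply_rename": run_deterministic,  # precise edit: refactoring tool
    "type_check": run_deterministic,    # precise reasoning: the compiler
}

def handle(task_kind: str, payload: str) -> str:
    return ROUTES[task_kind](payload)

print(handle("suggest_name", "loop variable in parser.c"))  # LLM suggestion for: loop variable in parser.c
print(handle("type_check", "parser.c"))                     # verified result for: parser.c
```

The unrestricted text response never lands in production code; only the outputs of the deterministic components do.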

In short, that's how we teach machines to code.
