In this segment, I’ll generate many candidate applications using my experimental framework, CodeAgents, choosing from a set of models: GPT-4.1, Claude 3.7, and GPT-4o. Then, I’ll compare and contrast the solutions. Along the way, I’ll present some ideas and tips on improving AI-generated code in ways that generally translate to other tools and frameworks.

It isn’t easy to score how good an AI-coded solution is. Of the possible metrics, code complexity may matter less as long as the AI understands the code, and the same goes for “maintainability,” since that notion is rooted in human limitations; the AI can refactor on the fly. Test coverage is a better metric because it measures how well the AI-generated test suite exercises the code.

Other, more subjective qualities of the generated application are harder to quantify. For example, UX (user experience) and how well the problem was solved are difficult to measure and automate. Still, a combination of all these metrics may serve as a good proxy for choosing the best candidates.

I will use radon, a Python tool that extracts various metrics from source code, and PyTest to pull other statistics (see the sketch after this list):

  • Raw statistics:
    • Logical lines of code (LLOC) - each logical line contains exactly one statement, so LLOC is a good proxy for how “chatty” the code is. Chatty code is typically more complicated.
  • Halstead metrics:
    • Difficulty - another measure of complexity.
    • Time to program (sec) - a size metric.
    • Delivered bugs - an estimate of the number of bugs in the implementation.
  • Cyclomatic Complexity:
    • Average complexity - measures path complexity in the code.
  • Test coverage percentage - using PyTest coverage reporting.
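
A minimal sketch of how these metrics can be pulled with radon’s Python API follows; the file name and report format are placeholders, not my framework’s actual code, and test coverage still comes from PyTest with the pytest-cov plugin:

# Sketch only: assumes radon and pytest-cov are installed; "calculator.py"
# is a placeholder for one generated solution.
from radon.raw import analyze
from radon.metrics import h_visit
from radon.complexity import cc_visit

with open("calculator.py") as f:
    source = f.read()

raw = analyze(source)                # raw statistics, including LLOC
halstead = h_visit(source).total     # module-level Halstead report
blocks = cc_visit(source)            # cyclomatic complexity per function/class
avg_cc = sum(b.complexity for b in blocks) / len(blocks) if blocks else 0.0

print(f"LLOC:             {raw.lloc}")
print(f"Difficulty:       {halstead.difficulty:.2f}")
print(f"Time to program:  {halstead.time:.1f} sec")
print(f"Delivered bugs:   {halstead.bugs:.3f}")
print(f"Avg complexity:   {avg_cc:.2f}")

# Coverage is reported separately, e.g.:
#   pytest --cov=calculator --cov-report=term-missing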

Lastly, I will add a 1-5 score of the UX based on my interaction with the application. This score will include a sense of how well it addresses the given problem and how easy it is to use generally.

I will use only three of these metrics to compare solutions: UX score, time to program, and average complexity, but I will include the others to sanity-check how the various statistics relate. This approach favors the best UX and breaks ties using the two size and complexity metrics.
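
As a rough illustration of that ranking (the field names and values below are hypothetical, not actual results):

# Hypothetical ranking: best UX first, then smaller time to program and
# lower average complexity as tie-breakers. Values are dummy data.
experiments = [
    {"run": 1, "ux": 4, "time_sec": 310.5, "avg_cc": 3.2},
    {"run": 2, "ux": 5, "time_sec": 280.0, "avg_cc": 2.9},
    {"run": 3, "ux": 5, "time_sec": 295.4, "avg_cc": 3.5},
]

ranked = sorted(experiments, key=lambda e: (-e["ux"], e["time_sec"], e["avg_cc"]))
top_candidates = ranked[:2]  # runs 2 and 3 in this dummy data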

The Good, The Bad…

Key Points

  • How do the models compare (time to generate, variability, prompting, and overall quality)?
  • From the general to the specific - updating the problem to improve overall quality.
  • Further areas to explore: performance, change management, AI coding vs. human coding, mixing models.

The Problem

Version 1 of the problem asks for a basic four-function calculator as a command-line REPL (Read-Eval-Print Loop):

create a cli calculator that runs in a REPL loop and handles basic arithmetic. 
Ensure that operations follow the precedence rules for arithmetic, and handle
negative number literals properly, keeping in mind that consecutive minus signs
can precede a unary expression. 
For example: --3 is equal to 3. 
The negative sign should have higher precedence when preceding a unary
expression. 
The basic arithmetic operations (+,-,/,*) when used to take two arguments should
treat multiplication and division with higher priority than addition and 
subtraction. When precedent is equal, operations are handled in sequence.
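
To make the precedence requirements concrete, here is a hypothetical set of acceptance tests; it assumes the generated calculator exposes an evaluate(expr) function, which is not necessarily how any given solution is structured:

# Hypothetical checks; `calculator` and `evaluate` are assumed names.
from calculator import evaluate

def test_unary_minus():
    assert evaluate("--3") == 3      # consecutive minus signs cancel out
    assert evaluate("-3+5") == 2     # unary minus binds to the literal

def test_binary_precedence():
    assert evaluate("2+3*4") == 14   # * and / bind tighter than + and -
    assert evaluate("10-4-3") == 3   # equal precedence runs left to right
    assert evaluate("8/4*2") == 4    # equal precedence runs left to right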

The problem stated above was originally given without all the details about precedence, but using that description caused all the models to generate code that badly fumbled the arithmetic. So, I needed to add more context to improve the generated applications.

Tip: the more context you can provide to the AI models, the better the solutions will be, sometimes in surprising ways. That said, add only as much context as needed; don’t overdo it.

Problem One Results

GPT-4.1

The experiments are ordered as shown in the sort statement: UX descending, then ascending by time to program and average complexity. This ordering is consistent throughout the following examples, leaving the top two experiments as the best candidates.

The sixth and second runs gave the best results, but experiments three and four produced some bad examples in which the code did not recognize parentheses. This led me to add more context to the problem later on (see Tip 1).

Speed-wise, GPT-4o is the fastest model, followed by GPT-4.1 and Claude 3.7. Another interesting observation: LLOC and complexity tend to be lower in the higher-ranked solutions, even though I’m not explicitly sorting by those features.

Example session:

GPT-4o and Claude 3.7 produced similar results on this problem, so I will skip ahead to the revised problem, which adds parentheses handling and a requirement for a ‘help’ command.

Problem Update 1

I added this context to the problem:

Allow expressions to be inside parentheses.
Provide a help command that explains how to use the calculator. 
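
Extending the earlier hypothetical test sketch, the two additions could be checked roughly as follows (again assuming an evaluate(expr) entry point; the help command is more naturally verified interactively in the REPL):

# Hypothetical checks for the added requirements; names are assumptions.
from calculator import evaluate

def test_parentheses():
    assert evaluate("(2+3)*4") == 20     # parentheses override precedence
    assert evaluate("-(1+2)") == -3      # unary minus applied to a group

def test_nested_parentheses():
    assert evaluate("((1+2)*(3+4))") == 21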

GPT-4.1

GPT-4o

Claude 3.7

All the models produced good solutions to this revised problem, with some surprises.

Note the variability in LLOC and difficulty. The complexity and size metrics are scattered to varying degrees depending on the model, and Claude 3.7 shows the most variability. Surprisingly, Claude produced both the best and the worst solution in this run.

I would need more data to confirm my intuition that Claude 3.7 tends to produce more complex and diverse solutions than GPT-4.1 and GPT-4o, but I’ve noticed the same pattern in other projects. It can produce very high-quality solutions, yet it’s hard to predict what you’ll get. Adding more context often improves the result, but running more cases is a little painful, as Claude is chatty and slower than GPT-4.1 or GPT-4o.

I don’t want to pick on Claude 3.7. I may also need to provide better (or different) agent prompting in my framework, tuned specifically to Claude 3.7; that could help as well.

With all this said, Claude produced the best version, which included command line history, even though I didn’t ask for that. So, was this “creativity” at work or something else? I’ll leave that discussion for another day.

Example session:

Final Version

In the final iteration, I added:

the calculator should support command line history via up/down arrow.
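
In Python, the usual route to this is the standard-library readline module; a minimal sketch of the idea (not the generated code itself):

# Importing readline gives input() in-session history and line editing on
# Unix-like systems; Windows typically needs the pyreadline3 package.
import readline  # noqa: F401  (imported for its side effect)

def repl() -> None:
    while True:
        try:
            line = input("calc> ")
        except EOFError:
            break
        if line.strip().lower() in {"quit", "exit"}:
            break
        print(f"(would evaluate: {line})")  # placeholder for the real evaluator

if __name__ == "__main__":
    repl()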

Overall, for this exercise, GPT-4.1 turned out to be the best option, as it shows more consistent results and runs faster than Claude 3.7.

GPT-4.1

Example session:

Further Areas to Explore

This experimental framework has been a useful and fun tool for exploring AI code generation.

A key takeaway is the value of adding context to the initial prompt used for code generation. This technique proves useful across all the models.

Choosing metrics to evaluate solutions is still an art form. In many cases, I think the way AI models generate code reflects the human examples they were trained on. So, for now, some of these metrics remain useful for automated evaluation, especially those related to complexity and code coverage. In the future, who knows?

We still need a “human in the loop” to evaluate the user experience, but perhaps a mixture of automated evaluations can help us narrow down the candidates for review.

Other topics:

  • Change management - how to minimize the impact of new requirements.
  • Performance - how to scale the generation of code.
  • Evaluation - more ideas are needed.
  • Computer languages
    • Are AIs limited by human-designed languages?
    • Could they do better with languages designed by AIs themselves?
    • What metrics would apply to evaluating these?

I’ll explore more on these topics in future posts.

Especially that last one 😀.