I’ve tried Cursor, Replit, Lovable, and Bolt with varying degrees of success, and I’ve noticed a recurring theme: using these tools means “vibing” until you arrive at a finished, hopefully working, result. Whether that result is good can sometimes be in the eye of the beholder.

I’ve also become fascinated by how these tools will change the way programmers think about code and its organization — how many rules will be thrown completely out the window and how, oddly, the new rules will harken back to the early days of programming before Google and the Internet.

One of these ideas is that we may no longer need so many dependencies in our code bases. Eventually, all LLMs will be able to generate all of the code needed and, more importantly, only the code needed to solve any given problem.

Why do we have so many dependencies in our projects? In a nutshell, programmers are lazy. Alright, maybe it’s just that we are human beings with limited time and memory. We solve related problems once and lump the code into libraries to save time through reuse; that way, we can stay cognitively focused on the main problems we are trying to solve.

Why do LLMs need so many dependencies in their generated projects? Perhaps because they learned to code from humans? “Cargo culting” all of that code has brought certain old problems into the equation, such as security exploits and buggy dependencies.

We can ask our LLMs to code from “first principles.”

To learn more and explore some of these ideas while also having a meaty example to later test on the open source and commercial tools I mentioned earlier, I decided to use a powerful code generation model, Claude 3.7, to generate an “agentic” code generator from first principles.

The journey has revealed some insights and taken on a life of its own.

Hey Claude, make this…

Using a Claude Desktop project, I generated a basic code generation framework using these prompts:

Write a python program which implements an agentic flow from basic
principles, using the openai api and the GPT-4o model.
There will be two agents:
1. Architect - an agent which takes a problem description and 
   converts it into a plan for developing a software solution.
2. The second agent will be a software engineer that can write python code
   which implements the system outlined by the architect agent.
we need to add a third agent to this framework that will be able to compile
and test the code developed by the software engineer. It will take the
implementation from the software engineer agent, and return a test_report
showing whether the code is compiling, and whether the tests are passing,
along with suggestions how the code might be fixed if it fails to compile
or run properly.
The report may itself be a json object with a boolean 'passed' key,
along with 'results' key to suggested fixes if passed == False, or test results
if passed = True.
we can adjust the software engineer agent to look for test_report,
along with the software engineer specification, and its own last-generated
implementation. If the test_report shows passed == False, it can use the
suggestions from the test agent to attempt to fix the implementation.
We can then put this interaction in a loop that breaks once the passed == True
is returned from the test agent, or after a maximum number of attempts.
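
Taken together, these prompts describe a plan–implement–test loop. The sketch below is purely illustrative and is not the code Claude generated: call_llm stands in for a GPT-4o chat completion, and the agent prompts are reduced to one line each.

import json
from openai import OpenAI

client = OpenAI()

def call_llm(system: str, user: str) -> str:
    # One chat completion against GPT-4o; the three agents differ only in their prompts.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content

def generate(problem: str, max_attempts: int = 5) -> str:
    # Architect: convert the problem description into a development plan.
    plan = call_llm("You are a software architect. Produce a development plan.", problem)

    implementation, test_report = None, None
    for _ in range(max_attempts):
        # Engineer: write or repair the implementation, given the plan, its own
        # last-generated implementation, and the last test report (if any).
        engineer_input = json.dumps({
            "plan": plan,
            "previous_implementation": implementation,
            "test_report": test_report,
        })
        implementation = call_llm("You are a Python software engineer.", engineer_input)

        # Tester: return a JSON report of the form {"passed": bool, "results": ...}.
        report_text = call_llm(
            "You are a software tester. Reply only with JSON containing "
            "'passed' and 'results' keys.", implementation)
        test_report = json.loads(report_text)

        if test_report.get("passed"):
            break  # exit the loop once the test agent reports success
    return implementation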

The resulting code worked, but the testing agent did not actually run any tests; instead, it generated test cases and then attempted to predict whether each one would pass. Those predicted results were fed back and incorporated into subsequent iterations. So, although the agent hallucinated its ability to interpret the code, it did create the overall flow correctly.

So I needed to create a real software testing agent, which I generated using the following prompts:

Create a python class that can take a python program as text and
perform the following:
1. Determine the python packages needed by the program.
2. Create a Dockerfile that loads the required packages along
   with the pytest framework required to run tests against the program.
but our code, implementation.py, and its test test_implementation.py
need to be copied over to app dir in the Dockerfile, and the pytest
command can be run with app as the curent dir, so tests will run.
Make the changes needed for this to be done in the dockerfile
let's change this code as follows:
1. replace the sample_code with a check for a src_dir on the command line,
   from which python files are loaded, and checked for dependencies so that
   a requirements.txt file can be created.
2. the src_dir will be used in the Docker file to copy python files into
   the WORKDIR
minor change to the requirements file generation needed to exclude package
names matching the application files loaded from the source directory.

Finally, Claude returned code for a testing framework:

Figure 1. Claude test agent description.
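
Figure 1 shows the generated class; the condensed sketch below is only meant to convey its shape as described by the prompts above. The class name DockerTestRunner and the details of the Dockerfile are my own stand-ins, and the import scan is deliberately naive.

import subprocess
from pathlib import Path

class DockerTestRunner:  # hypothetical name, not necessarily what Claude produced
    def __init__(self, src_dir: str):
        self.src_dir = Path(src_dir)

    def write_requirements(self) -> None:
        # Collect top-level imports from the source files, excluding the
        # application's own module names, and write requirements.txt.
        # (Naive: real code would also filter the standard library and map
        # import names to PyPI package names.)
        local = {p.stem for p in self.src_dir.glob("*.py")}
        packages = set()
        for path in self.src_dir.glob("*.py"):
            for line in path.read_text().splitlines():
                if line.startswith(("import ", "from ")):
                    name = line.split()[1].split(".")[0]
                    if name not in local:
                        packages.add(name)
        (self.src_dir / "requirements.txt").write_text("\n".join(sorted(packages)))

    def write_dockerfile(self) -> None:
        # Copy the code and its tests into /app, install the requirements plus
        # pytest, and run pytest with /app as the working directory.
        dockerfile = (
            "FROM python:3.11-slim\n"
            "WORKDIR /app\n"
            "COPY *.py requirements.txt /app/\n"
            "RUN pip install -r requirements.txt pytest\n"
            'CMD ["pytest", "-q"]\n'
        )
        (self.src_dir / "Dockerfile").write_text(dockerfile)

    def run_tests(self) -> bool:
        self.write_requirements()
        self.write_dockerfile()
        subprocess.run(["docker", "build", "-t", "agent-tests", "."],
                       cwd=self.src_dir, check=True)
        result = subprocess.run(["docker", "run", "--rm", "agent-tests"],
                                cwd=self.src_dir)
        return result.returncode == 0  # True when the pytest run passes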

Putting all the pieces together

Putting the code together into a project, I ended up adding a few enhancements, including the ability to choose other model providers. I then had Claude 3.7 analyze the combined code and produce the flowchart shown below as a DOT-language (Graphviz) representation, using the following prompt:

Examine the code and create a flowchart of its operation using the DOT language.
(code attached)

Figure 2. Agentic code generator flowchart.

As was my goal, the generated agent code has only standard packages as dependencies (shown below). Although I initially had Claude build the ability to choose different model providers, I eventually switched to LiteLLM so I could compare output from other providers without having to regenerate that part.

Figure 3. Dependencies for agentic code generator.
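
LiteLLM exposes a single completion call that routes to different providers based on the model string, which is what makes provider comparisons easy. A minimal example follows; the model identifiers and the prompt are illustrative, not taken from the project.

from litellm import completion

# The same call works across providers; only the model string changes,
# for example "gpt-4o" for OpenAI or an "anthropic/..." model for Claude.
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Plan a CLI scientific calculator."}],
)
print(response.choices[0].message.content)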

The current project can be found at CodeAgents on GitHub.

Taking it out for a test drive

The GitHub repository above includes two notable example projects that I generated with this framework: a scientific calculator CLI and a weather API built with FastAPI. I found these project types useful for testing across different models and providers: they are non-trivial enough to be interesting, yet small enough to run frequent passes with different providers. In my next post on this topic, I’ll present some results and conclusions from these comparisons.

Figure 4. Scientific calculator CLI.

Figure 5. Weather API.

Figure 6. Forecast.
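
For a sense of scale, the weather project is on the order of a small FastAPI service. The sketch below is a hand-written illustration of that kind of endpoint, not the generated code; the /forecast route and the response shape are invented for the example.

from fastapi import FastAPI

app = FastAPI(title="Weather API")

@app.get("/forecast/{city}")
async def forecast(city: str, days: int = 3):
    # Placeholder data; a real implementation would call a weather data source.
    return {
        "city": city,
        "days": [
            {"day": i + 1, "high_c": 21.0, "low_c": 12.0, "summary": "Partly cloudy"}
            for i in range(days)
        ],
    }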

Coming next

For now, I can say that the promise of “hands-free” generation of production-quality code is still a way off, although I see no reason why the quality won’t continue to improve. I think we need to set our expectations accordingly. One thing is clear: these tools will change the way we develop software, mostly for the better.