Posts

Building a Hybrid Summary Evaluation Framework

Combining deterministic NLP with LLM-as-Judge for robust evaluation. Summary Evaluation Challenges: Summary evaluation metrics sometimes fall short in capturing the qualities most relevant to assessing summary quality. Traditional machine learning for natural language processing (NLP) has covered a lot of ground in this area. Widely used measures such as ROUGE focus on surface-level token overlap and n-gram matches. While effective for evaluating lexical similarity, these approaches offer limited insight into aspects such as factual accuracy or semantic completeness [1]. ...

Vibecoding an Agentic Coder - Part 2

In this segment, I’ll generate many candidate applications using my experimental framework, CodeAgents, choosing from a set of models: GPT-4.1, Claude 3.7, and GPT-4o. Then, I’ll compare and contrast the solutions. Along the way, I’ll present some ideas and tips on improving AI-generated code in ways that generally translate to other tools and frameworks. It isn’t easy to score how good an AI-coded solution is. Of the possible metrics, code complexity might not be meaningful as long as the AI understands the code, and the same goes for “maintainability,” which is based on human limitations; the AI can refactor on the fly. Test coverage is a good metric as it measures how well the AI-generated test suite covers the code. ...

Vibecoding an Agentic Coder - Part 1

I’ve tried Cursor, Replit, Lovable, and Bolt with varying degrees of success and found recurring themes in the use of these tools that require “vibing” until you arrive at a finished, hopefully working, result. Whether the result is good can sometimes be in the eye of the beholder. I’ve also become fascinated by how these tools will change the way programmers think about code and its organization — how many rules will be thrown completely out the window and how, oddly, the new rules will harken back to the early days of programming before Google and the Internet. ...

LLMs At The Command Line - Part 1

If you are a command-line fan and want to experiment with large language models (LLMs), you will love AiChat. There are many popular graphical front ends for working with LLMs, such as OpenAI’s ChatGPT and Anthropic’s Claude, but get ready for this little powerhouse for CLI lovers, as it has many advanced and useful features. One such feature is an easy-to-use, out-of-the-box RAG (Retrieval Augmented Generation) feature, useful for searching existing content. I’ve put together a small demo here that shows how easy it can be to use in a pinch. There are many use cases where such an approach is just the right size. ...

Experimenting with Agentic AI Tooling: My Journey Through the Cutting Edge

The first time I fired up an MCP (Model Context Protocol) server plugin, “Agent,” I was excited to see it registered in Claude Desktop but immediately annoyed by the errors that popped up. I didn’t expect a smooth experience in my encounter with the future of Agentic AI, and I ran into plenty of configuration tweaks, clunky debugging tools, and broken dependencies along the way. It was a stark reminder that we’re in the early days, and there’s a lot of ground to cover before Agents become seamless collaborators. ...

Navigating the Fragmented Landscape of Agentic AI Tools

Agentic AI, with its promise of creating systems capable of autonomous reasoning and action, has been a hotbed of innovation in the AI community. Tools from OpenAI, LangChain, and Microsoft are spearheading this new wave, each offering unique features and capabilities. However, the lack of standardization in this ecosystem presents significant challenges to developers, researchers, and organizations eager to adopt these technologies. The Current State of Agentic AI Tools: The diversity of agentic AI tools is both a strength and a weakness. On one hand, it fosters creativity and innovation as developers explore various approaches to building autonomous systems. On the other hand, the fragmented landscape leads to: ...