On vibes, delegation, context, and specs
After many months of getting myself up to speed writing backend code, one demo project after another, I was finally getting confident at it, without really having an LLM generate the code, despite the temptation. A few months ago, the looming moment came for Alexandria (my current attempt at a SaaS): it was time to sit down and build the website, with my zero front-end experience and my brain's reluctance to read reactive code comfortably.
I could learn it, like I did with Python, and that would, in itself, make me better at delegating to LLMs, but speed also mattered at that stage. On the other hand, learning how to use LLMs properly, from a manager's perspective, is a skill I have always cared about, and one that I think will be highly critical in the future. So, for Alexandria, the idea was simple: vibe code the website, but without the vibes.
What do I mean by this? Well, I mean: make a product I trust, one that I can expand, iterate on, experiment with, deploy, test, share, and use! But without me writing a single line of code for it.
Now, a similar journey had already started with the Pokémon project. I wanted to control the loops from a webpage and asked Windsurf to make a basic one. And it worked, but there was this aftertaste, this gut feeling that something was off, mainly because, yeah, there was no plan, just a bland request. What happens after that? Who knows.
For Alexandria, I got my hands a bit dirtier: some basics of HTML, some JavaScript, some CSS. What are Next.js, React, and the rest? And to be honest, I should do more of that, because the more I do, the more comfortable I feel. That's rule number 1, especially when it's a technology I'm going to work with for quite a while, not a one-off obscure project, and it looks like TypeScript and web tech are here to stay. So I'd say that's goal #1: get better at what you are delegating, to be better at delegating. This in itself opens a whole window onto how to learn properly.
Speed, Precision, Guardrails, and Experimentation
After some iteration, the first online version of Alexandria was hosted on Vercel, with a fine-looking UI and working routes, APIs, and events. It was not that painful, and it did the job.
The trouble started when I wanted to add the main real-time functionality. The idea was, or is, simple: you search for papers, and they show up in the UI in real time as they come in; subscribe to Supabase, and render. Do I know how to do it? Absolutely not. But it shouldn't be that difficult: you listen to events, then you have a factory build instances from the content of those events. Code-wise, it should not be that long.
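Conceptually, that's not much more than this minimal sketch, assuming a supabase-js v2 client, a hypothetical papers table, and a hypothetical renderPaper helper on the UI side:

```ts
import { createClient } from '@supabase/supabase-js';

// Hypothetical project credentials and table name; the real schema will differ.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

type Paper = { id: string; title: string; abstract: string };

// The "factory": turn the raw row from an insert event into a UI model.
function paperFromRow(row: Record<string, unknown>): Paper {
  return { id: String(row.id), title: String(row.title), abstract: String(row.abstract) };
}

// Listen for new rows and hand each one to whatever renders the list.
supabase
  .channel('papers-feed')
  .on(
    'postgres_changes',
    { event: 'INSERT', schema: 'public', table: 'papers' },
    (payload) => {
      renderPaper(paperFromRow(payload.new as Record<string, unknown>));
    }
  )
  .subscribe();

declare function renderPaper(paper: Paper): void; // stands in for the real UI hook
```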
Well, Codex did it, and it worked, but that file, that page, was huge. Let me check how many lines of code: 700 LoC. In web dev, I have no clue whether that's a lot or a little, but to me it seemed like too much. I would ask it to change a few things, and those changes would break everything and take forever. My gut feeling was going off like crazy. At that moment, I also noticed I didn't know if the scaffolding was correct, if the tech stack was fine, or if I was following best practices.
That brought me to a few weeks ago, when I decided to start from scratch, and properly, from top to bottom: What is the value proposition? What do I care about? And how do I build something that helps me do that? That became design principles, Figma, tokens, and so on. I spent a whole day just on the setup for all of this, designing the colors, animations, accessibility, you name it. The main idea was simple: make the big decisions yourself so the LLM has a really narrow space to operate in. From the top, you bring the strategy, the vision, the product, the PRD, the user experience, and the user flow, but YOU decide those. From the bottom up, you put your guardrails in place: unit testing, linting, build requirements, tokens, coding style, libraries, and reviews (a small sketch of what that can look like follows the list below). The LLM finds itself confined, or assisted, in a two-way manner:
Clear and executable goal: There is no more mind-reading, because the specs are written down. Writing specs is hard work; it's extremely useful, but mentally it's hard.
Feedback loops: There is no more guesswork. Either the test passes or it does not; either it builds or it does not; either the user flow works as intended or it does not.
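To make the bottom-up side concrete, here is a toy sketch, with made-up token names and a Vitest check, of how a guardrail like "use the design tokens, and keep them valid" becomes a pass/fail signal rather than a vibe:

```ts
// tokens.ts — hypothetical design tokens the generated code is told to use
// instead of hard-coded values.
export const tokens = {
  color: { background: '#0f0f10', text: '#f5f5f4', accent: '#7c9cff' },
  spacing: { sm: '0.5rem', md: '1rem', lg: '2rem' },
  motion: { fast: '150ms', slow: '400ms' },
} as const;
```

```ts
// tokens.test.ts — a Vitest guardrail: either it passes or it does not.
import { describe, it, expect } from 'vitest';
import { tokens } from './tokens';

describe('design tokens', () => {
  it('only exposes six-digit hex colors', () => {
    for (const value of Object.values(tokens.color)) {
      expect(value).toMatch(/^#[0-9a-f]{6}$/i);
    }
  });
});
```

It's a trivial check, but it's the shape that matters: the rule lives in the repo, and the agent either satisfies it or fails the build.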
This "pinching" removes a lot of the discomfort about not writing the code myself, and it's fulfilling in its own way because it’s a form of craft, a new form of craft that requires thinking.
Of course, this is all easier said than done. Last week, when I discovered SDD (Specification-Driven Development), I found myself in a really happy place, and I put it to the test by trying to build my Gemini-conversation-to-video app. It was a really good experience, because it forces you to think about the app in specs and to discover which parts of your idea are missing and still need defining. So I sat down, wrote the specs, and left the rest on autopilot. The docs seemed solid, and I let it write the code.
Many hundreds of lines of code later, my gut feeling was kicking in again: "That looks like it's following the plan, but how do we know it's doing the thing, and doing it right?" And then you pay the price of the autopilot: the model had no clue how to use Gemini for audio or image generation, so it just left a bunch of placeholders. It built the software around them, and when it came down to testing, it had not tested any of the other components either, so the whole thing felt brittle, and I closed the project because the cognitive load of fixing it seemed higher than starting from scratch. Besides, that's not how I work, and not why I enjoy these things. The way I would have done it: use those two new, fancy, fun Gemini APIs, see if they work, test them, and then build a system around them. Then you can craft the context for the LLM to elevate it, build tests, and work on the rest.
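That spike-first step doesn't need to be more than a throwaway script. A minimal sketch, assuming the @google/genai SDK and a plain text call (the model name is a guess, and the audio and image endpoints the app actually needs would have to be verified the same way before anything is built on top of them):

```ts
// spike.ts — throwaway script: prove the API call works before any architecture
// gets built around it. Model name and prompt are placeholders, not the app's real ones.
import { GoogleGenAI } from '@google/genai';

async function main() {
  const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

  const response = await ai.models.generateContent({
    model: 'gemini-2.0-flash',
    contents: 'Summarize this exchange in one sentence: "hello" / "hi, how are you?"',
  });

  // If this prints something sensible, the call pattern is proven; only then is it
  // worth wiring the audio and image generation into the real app.
  console.log(response.text);
}

main().catch((err) => {
  console.error('Spike failed; nothing to build on yet:', err);
  process.exit(1);
});
```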
Out of all this, a couple of things come out as king. The first is context. The webpage works because of that goddamn AGENTS.md. There is a simple line in there stating that every change needs to be tested; without it, they just go ahead and fail a test. And context, or the lack of it, is why that side project failed: the model didn't have the context on how to use the APIs.

The second is metrics, evaluations, the utility function, you name it. A simple "what do we call good enough?", and it's fascinating that it's never a single metric. You can have a perfect-looking UI, but if it takes 3 seconds to load a page, the system has failed. And that makes you wonder: Was there any performance benchmark in the system? Was any test in place for it? Well, no, because I didn't think about performance in the implementation, or in the planning, or in the architecture, or in the product. As much as I'd love to be good at writing algorithms in JavaScript, I'd also love to read about the history of browser performance up to the present day, and, at a higher level, about what a human with a low attention span considers fast enough.
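And a performance budget could have been just one more guardrail, another test that either passes or does not. A rough sketch with Playwright, where the URL and the 3-second budget are made up for illustration:

```ts
// perf.spec.ts — a naive performance budget expressed as an ordinary test.
import { test, expect } from '@playwright/test';

test('home page loads within the budget', async ({ page }) => {
  const start = Date.now();
  await page.goto('http://localhost:3000', { waitUntil: 'load' });
  const elapsed = Date.now() - start;

  // A perfect-looking UI that takes longer than this still counts as a failure.
  expect(elapsed).toBeLessThan(3000);
});
```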
Why precedes the What, What precedes the How... the Whom and the When? Well, that's a bit of an art.