I locked Claude in a box with some detailed specifications and docs, and then went for a weekend with the family.

I set up one continuous orchestrator that watches for new commits and reviews them; the reviews are dropped into the dev orchestrator's context after it finishes a task. The two work in a continuous loop until the dev orchestrator has produced no new commits for 15 minutes.
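
Roughly, the stop condition looks like the sketch below. This is not my actual setup, just a minimal illustration of "review commits as they arrive, quit after 15 idle minutes". Every name here (`run_review_loop`, `fetch_new_commits`, `review_commit`, `deliver_review`) is hypothetical, and the 30-second poll interval is an assumption; in practice the callbacks would wrap git and whatever hands the review to the dev orchestrator.

```python
import time
from typing import Callable, List

# From the post: stop after 15 minutes without new commits.
IDLE_TIMEOUT_SECONDS = 15 * 60


def run_review_loop(
    fetch_new_commits: Callable[[], List[str]],   # hypothetical: returns unseen commit ids
    review_commit: Callable[[str], str],          # hypothetical: produces a review for one commit
    deliver_review: Callable[[str], None],        # hypothetical: hands the review to the dev side
    idle_timeout: float = IDLE_TIMEOUT_SECONDS,
    poll_interval: float = 30.0,                  # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Review commits as they arrive and return how many were reviewed.

    The loop ends once no new commit has shown up for `idle_timeout` seconds.
    """
    reviewed = 0
    last_activity = clock()
    while clock() - last_activity < idle_timeout:
        commits = fetch_new_commits()
        if commits:
            for commit_id in commits:
                deliver_review(review_commit(commit_id))
                reviewed += 1
            # Any new commit resets the idle timer.
            last_activity = clock()
        else:
            # Nothing new: wait before polling again.
            sleep(poll_interval)
    return reviewed
```

The clock and sleep functions are injectable so the idle timeout can be checked with a fake clock instead of actually waiting 15 minutes.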

I'm checking in periodically on the progress and results; it's still running and the progress is surprisingly good so far. The token spend is not catastrophic despite multiple workflows and a high thinking budget (still <50% of the weekly limit on the 20x plan).

I've been doing tests like this about once a month, where I just provide some obscene task and see where the limits are: what can be built in a finite amount of time. Then I informally evaluate the functionality, adherence to the design intent, and quality. The results of these are never published; sometimes they are related to my hobbies, but sometimes they are just throwaway "what if" experiments/ideas that we discuss internally at Tabular Editor. Since about February they have been long-running or autonomous tasks.

Up until about November the results of these tests had always been sloppy on all fronts, and tbh I also hadn't been very serious or disciplined about my workflow or approach. Then (after Opus 4.5) suddenly the functionality was there, but the quality not at all. Slowly over months it's been able to do more and more and more... and go further, but not do better.

Something seems to have changed in the last 1-2 months though. I'm noticing significantly better outputs in smaller experiments, roughly since I spent a bunch of effort trying to really lock things down and clean up my workflow... trying to ensure consistency, security, and scalability. I'm not sure if that's why, but it probably helps. I just don't know how much.

But this test already seems to be showing a huge leap in progress and even possibly quality compared to the last time I tried a big experiment like this. I'm still skeptical of the clanker though and had to intervene once already to steer away from a pileup... but so far quite surprised. It's making me re-think a lot of things, again.

I'm not sure, but the new dynamic workflows feature in Claude Code seems to be a huge change in the right context, especially with skills that you tune the right way and custom tools that you've tailor-made for your scenario. I don't yet know though if it's just the change to my setup that's making the difference.

Some things I'm settling on as I try to find sleep

Lock down your memory and maintain it well. Don't be afraid to fill it temporarily with relevant info. Never let AI curate it; give the AI an Obsidian vault or agent docs repo.

You should use skills.
Don't just make technical skills for technical processes and formats/frameworks. Skills work best when you complement them with conceptual and nontechnical process context. Routing skills seem great too but I'm not sure yet.
Don't just use skills ootb from a vendor. Take a skill or template that you like, fork it, and own it.
Don't use skills to replace tools, that's stupid.
You should test skills but this is really fucking hard and tbh I don't think I've come up with evals for this yet that don't feel like hand-wavy bs.

Don't use AI to make plans or create context. The human should own it; the AI can iterate on it and clean or refine it.

/goal is incredible if you know you have a complete feedback loop that you've verified. /loop and /schedule feel icky.

I think there's something special to dynamic workflows. I don't know yet for sure.
