Review of the engine by Claude Code yielded 11 bug-fixes. The good, the bad about the experience, how you can do this in your own codebase, Docker SBX and more


PLY point cloud. with proper colors and opacity

I gave Claude a long-running task: Review every file of Castle Game Engine.

The process found 11 “real” bugs (something was indeed a bug in the engine, and it could be a bug in real-life applications using the engine) which we promptly fixed. So I’m happy:) See the “Detailed Stats” section at the end for all the numbers.

I will describe the process below, to give some pointers on how to do it on your own codebases. In short: fine-tune the instructions to get quality reviews, watch for the agent trying to avoid doing real reviews, and use Docker SBX to run securely unattended.

Note: If you want to know my general thoughts/experience about using AI with Castle Game Engine, see AI guidelines for Castle Game Engine. A short version: let’s work to gain understanding / be smarter thanks to AI, and not to lose understanding / be dumber. AI can do impressive stuff, but can also be terribly unreliable — in fact the post below is another good example of both of those facts being true at the same time.

For independent opinions about AI that I agree with, see the links at the bottom of our guidelines, and also “Thoughts on slowing the fuck down”, which I recently read, by the creator of the open-source AI harness Pi.

Motivation: AI reviews (subset of them) are useful

  1. AI reviews of commits/PRs by Claude are useful in my experience. Using commands like review staged, review unstaged, review unmerged consistently gives me useful suggestions on how to improve the code before I push it.

    I have to emphasize that by useful I mean that they contain useful bits. Of course they are also filled with nonsense — AI misunderstanding what the application/engine around it does, misunderstanding differences between FPC/Delphi or particular platforms, suggesting bad approaches (always with confidence), suggesting “defensive” coding techniques that would actually hide problems instead of keeping the flow validated…

    But if you’re prepared for and capable of filtering out the nonsense, you’re left with useful observations on how to improve the code. AI reviews do sometimes note a real improvement or even a real bug-fix that my own eyes didn’t catch, and that’s why I use them. It’s a special case of the “many eyeballs make all the bugs shallow” rule: here AI is another set of eyeballs.

    ( This of course also deserves a disclaimer: Don’t write bad quality code just because you count on an “AI review” to smooth it out. It will not. It will not catch all the wrong edge-cases if your design is flawed, and it will not do the thinking for you. Use AI as a helper, not as a replacement for your own thinking/understanding. )

  2. So, why am I using AI reviews only for new code that I commit? Let’s review all the past code!

  3. A bonus note about security, especially if you develop an application where security is a concern (you consume untrusted input, from network or game MODs — see how this applies to games / game engines too): this approach could be focused to perform a “security audit”.

    You don’t need Mythos to do it:) And, to be serious, the bad actors will do such a review of your code, esp. if you’re open-source and your code is simply public. So you’d better do the audit yourself, and fix the issues before they are exploited.

How I did this: on tuning the reviews quality/verbosity and Docker sbx

The details of how I set this up (so that you can consider doing this for your own codebase too):

  1. Initial prompt. At the start, I asked to review the engine, file by file, starting from src/ and provide a review for each file.

    Note: Our engine already includes CLAUDE.md to help Claude follow our rules better. If you don’t have such a file, I would recommend creating and maintaining one, as it really helps to “steer” Claude in a good direction. To make it available to non-Claude agents, use AGENTS.md and point to it from CLAUDE.md.

  2. Fine-tune to get useful reviews. It took some fine-tuning to make the initial prompt really useful.

    At the beginning, it tended to produce overly-verbose and useless reviews, often arguing with itself (like “Bug! X does Y. Oh, no wait, it is actually OK. Not a bug”.) or pointing to unimportant details (to give it credit, it sometimes flagged such notes as nit-picking itself).

    Instructing it to give more concise reports… resulted in getting a simple review “No definite bugs found.” for all engine files:) Well that’s also useless, of course.

    In the end, we found a middle ground, although effectively I cannot share with you “just a single prompt to do this”, because the end result is a conversation, with me pointing to examples of what to do / what not to do. For some files, I got “No bugs.”, for some others I got txt files with 1-2 issues to correct.

    The point is: you need to look at the initial reviews, consider if they are too verbose/succinct, and give feedback to steer the agent in a useful direction. Otherwise browsing 4000 sloppy reviews will be a huge pointless job for you.

  3. It was worth it! The issues detected (after the fine-tuning above) are genuinely useful reports. Real, meaty bugs that I missed over the years. E.g.

  4. Fixing them manually or not? Initially I considered letting Claude also fix the issues by itself.

    • But in the end, I fixed most of them manually (i.e. I’m still using AI code completion through GitHub Copilot, but not an AI agent), since the AI agent “fixes” mostly required further fixes on top of them. The agent didn’t add automated tests and it didn’t make the code clean (centralizing logic). Reviewing + fixing the agent’s work was taking me longer than “just doing” the fixes the way I want from the start.

    • And some reviews were plain wrong. E.g. it wanted to break TCastleTouchControl by limiting speed, because it wanted to adapt visual clamping to speed clamping — without understanding that the current version is actually what we want, more functional to use.

  5. Place reviews in .ai-review subdirs. I started by instructing Claude to show me the review, and only continue once I acknowledge. This was of course not good in the long run, as it needlessly blocked the work, so I switched to “put each review in a txt file alongside the source, and don’t wait for my confirmation to move on”.

    In the end I switched to “put each review in a .ai-review subdirectory”, e.g. the review for src/foo.pas goes to src/.ai-review/foo.txt.

    This allowed me to easily ignore these reviews in version control, by adding .ai-review/ to Git exclusions (e.g. in .git/info/exclude, so the exclusions themselves remain private to me).

  6. Exclude auto-generated and 3rd party stuff. I fine-tuned which files to include / exclude. I did want to go through the entire engine (sources, examples, tools etc.). But I excluded auto-generated files (like LPK, DPROJ, auto-generated per-node includes in src/scene/x3d/auto_generated_node_helpers/) and 3rd-party units (like Vampyre Imaging).

  7. Watch it, as it tries to abort / avoid doing the work! I needed to push Claude to actually do all the reviews, unattended. I used /goal to enforce it, and it worked… only to some extent.

    Claude started to avoid work by doing “shallow” reviews of most files, and happily reported “No bugs found” for ~1000 files while admitting it only scanned them (grepped for potential issues, not read). I instructed it to use “deep review”, and again enforced it using /goal.

    Then Claude lied to me that it “finished the work” while it turned out it only processed 3 sub-directories in src and the rest was quickly scanned. Not surprisingly, the “scanned sub-directories” did not contain any issues according to the review. This is hilariously unreliable:) I just kept watching what it was doing and asked for the “deep review” whenever I saw it trying to mass-qualify a bunch of files as “No bugs found”.

  8. This used multiple 4h sessions. This was a very long-running task. Multiple times I exhausted my “tokens in 4h window” limit from Claude Code. I was ready and OK for this, I just waited, and later/next day said “continue”.

  9. Docker SBX rocks for unattended agent usage. I wanted it to run without interaction, since otherwise I’m a human bottleneck and I mostly just accept the commands executed. (Some of the commands could not be auto-accepted by a useful pattern.)

    So I wanted to run with --dangerously-skip-permissions, but that’s something you should never do without being in a secure container. ( Really, don’t ignore this warning. Esp. for longer-running tasks where you don’t observe everything the AI is doing. Letting it run with --dangerously-skip-permissions will lead to a disaster! Weird things will happen at least in your repo (and maybe beyond) and you will not even know what happened or why. )

    Solution: I ran the Claude session inside Docker’s sbx, which is a fantastic tool to run AI agents in a sandbox. Docker containers are an obvious solution for isolating agents, and sbx makes using them for this specific purpose a breeze. It creates containers with a read-write mount on your disk, so you can inspect the work (and even commit+push) comfortably on your host system, while the AI runs in a really isolated environment.

    I highly recommend this approach for any longer-running AI tasks. It’s really a relief to just run inside sbx with --dangerously-skip-permissions, so you’re no longer a dummy clicking “yes I allow” (and constantly interrupting your work to look at Claude’s state) and you can just let it run. You don’t specifically need sbx for this, since any Docker-like virtualization solution could achieve this, but I found sbx to fit me right out-of-the-box, and I already use Docker for other stuff so I felt “at home” with it.

  10. Moving conversations from outside to inside SBX. As I didn’t start with the sbx approach, I began with a normal session outside sbx and then resumed the work in a session inside sbx. I wondered about copying around ~/.claude to make this resume smooth, but in the end it was enough to carry the “current review state” in a collection of Markdown files, point the new session to them, and also copy a bunch of “memory files”.
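
Steps 5 and 6 of the list above (where reviews go, and which files are skipped) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the instructions actually given to Claude: the function names, the set of reviewable extensions and the src/third_party_placeholder/ directory are hypothetical, and only the .ai-review layout, the .git/info/exclude file and the auto_generated_node_helpers path come from the description above.

```python
from pathlib import Path, PurePosixPath

# Assumption: which extensions count as reviewable Pascal sources.
REVIEWABLE_SUFFIXES = {".pas", ".inc", ".lpr", ".dpr"}

# Directories skipped entirely. The first is the auto-generated directory
# named in the post; the second is a HYPOTHETICAL placeholder for wherever
# 3rd-party code lives in your repository.
EXCLUDED_DIR_PREFIXES = (
    "src/scene/x3d/auto_generated_node_helpers/",
    "src/third_party_placeholder/",
)


def review_path(source: str) -> str:
    """Map a source file to its review file in a sibling .ai-review directory."""
    p = PurePosixPath(source)
    return str(p.parent / ".ai-review" / (p.stem + ".txt"))


def should_review(source: str) -> bool:
    """Decide whether a repository-relative path should get a review."""
    p = PurePosixPath(source)
    if ".ai-review" in p.parts:
        return False  # never review the reviews themselves
    if p.suffix.lower() not in REVIEWABLE_SUFFIXES:
        return False  # also drops LPK, DPROJ and other project files
    return not source.startswith(EXCLUDED_DIR_PREFIXES)


def add_private_exclude(repo_root: Path, pattern: str = ".ai-review/") -> None:
    """Append the pattern to .git/info/exclude unless it is already listed."""
    exclude_file = repo_root / ".git" / "info" / "exclude"
    exclude_file.parent.mkdir(parents=True, exist_ok=True)
    text = exclude_file.read_text(encoding="utf-8") if exclude_file.exists() else ""
    if pattern in text.splitlines():
        return
    # Keep the new pattern on its own line even if the file lacks a final newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with exclude_file.open("a", encoding="utf-8") as f:
        f.write(separator + pattern + "\n")
```

With this sketch, review_path("src/foo.pas") gives src/.ai-review/foo.txt, matching the example in step 5. Note that two files differing only by extension would map to the same review file under this naming, so a real setup may want to keep the extension in the review name.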

All in all, I do recommend this process to everyone, if you have some “tokens to spare”. As pointed out above, the process had flaws: you need to fine-tune the verbosity/quality of the reviews, and you need to make sure the agent is really doing the work (and not lying to you). But it also had real gains: real, user-affecting bugs have been found and fixed in the engine thanks to this.
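
Checking that the agent really covered everything can be partly automated. The Python sketch below is an assumption-laden illustration, not something used in the actual run: it assumes the .ai-review layout described earlier, treats .pas and .inc as the source extensions, and treats reviews starting with “No bugs” or “No definite bugs” as clean. All names in it (DirStats, audit_reviews, all_clean_dirs) are hypothetical. It reports, per directory, how many sources have no review at all, and lists directories where every single review is clean, which is the pattern that gave away the “scanned, not read” sub-directories.

```python
from dataclasses import dataclass
from pathlib import Path

SOURCE_SUFFIXES = {".pas", ".inc"}                   # assumption
CLEAN_PREFIXES = ("no bugs", "no definite bugs")     # assumption


@dataclass
class DirStats:
    total: int = 0    # source files seen in this directory
    missing: int = 0  # sources without any review file
    clean: int = 0    # sources whose review starts with a "no bugs" phrase


def audit_reviews(root: Path) -> dict[Path, DirStats]:
    """Collect per-directory review statistics (keys are relative to root)."""
    stats: dict[Path, DirStats] = {}
    for source in sorted(root.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in SOURCE_SUFFIXES:
            continue
        if ".ai-review" in source.parts:
            continue
        entry = stats.setdefault(source.parent.relative_to(root), DirStats())
        entry.total += 1
        review = source.parent / ".ai-review" / (source.stem + ".txt")
        if not review.exists():
            entry.missing += 1
            continue
        text = review.read_text(encoding="utf-8", errors="replace")
        if text.strip().lower().startswith(CLEAN_PREFIXES):
            entry.clean += 1
    return stats


def all_clean_dirs(stats: dict[Path, DirStats], min_files: int = 5) -> list[Path]:
    """Directories with at least min_files sources where every review is clean."""
    return sorted(
        d for d, s in stats.items()
        if s.total >= min_files and s.clean == s.total
    )
```

A directory listed by all_clean_dirs is not proof of a shallow review (the code there may simply be fine), but it is a cheap hint about where to ask for a “deep review” again.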

Detailed stats

If you’re curious, detailed stats and links to what was fixed:

And that’s it 🙂 Thank you for reading.

If you like what we’re doing, please support the engine development. Thank you!
