Watch Your Agents

I’ve been telling developers to watch their logs for years.

Not just when something is broken. Not just when production is on fire. Watch them while you are building.

Your logs are the closest thing you have to x-ray vision for a web application. Click a button in the browser, watch the request move through the app, and you can see what is really happening behind the scenes.

Watch your logs

The habit is simple: keep the server log visible while you work.

When you do, you start spotting problems long before they become production issues:

the same query firing 50 times because of an N+1
a page that feels fine locally but is doing way too much work
a slow query that needs an index
an unexpected redirect or extra request
a cache miss you thought was a cache hit
a background job being enqueued more often than expected
parameters coming through in a shape you did not expect

The logs give you immediate feedback. They make the invisible visible.

Agents have logs too

Coding agents need the same treatment.

When you are working with an agent, do not just look at the final diff. Watch what it is doing. Watch the commands it runs, the files it opens, the mistakes it repeats, and the little bits of glue code it keeps inventing along the way.

That is the agent equivalent of watching your development log.

You are not only checking whether this turn succeeded. You are looking for patterns that can make future turns better.

Read the session logs

Most coding agents keep some kind of session history: transcripts, tool calls, command output, file edits, errors, retries, and sometimes timing information.

Those logs are useful after the fact. Point the agent at its own session logs and ask it to look for patterns:

What tasks did you repeat multiple times in this session?
What code did you generate only to throw away later?
Which commands failed, and what would have prevented those failures?
Did you write any one-off scripts that should become checked-in tools?
Did you repeatedly search for the same files or project conventions?
Were there project rules you had to infer that should be documented?
Which parts of the workflow were deterministic enough to automate?
What should be added to AGENTS.md, a skill, or a bin/ script?
If a smaller model had to do this next time, what tools or instructions would it need?

A prompt I like for this:

Review the session logs for this task. Identify repeated reasoning, failed commands,
one-off scripts, unclear project conventions, and deterministic workflows that could
be turned into documentation, skills, or checked-in scripts. Give me the top five
improvements to make future sessions faster, cheaper, and more reliable.

This is the same habit as watching the Rails log after clicking around a page. You are looking for the part of the system that is doing too much work, guessing too often, or hiding useful signal.
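Some of these questions are mechanical enough to answer with a script before you even ask the agent. Here is a minimal sketch in Python that counts repeated and failed commands. It assumes the session log is JSON Lines with a `command` string and an `exit_code` integer on each command event. Those field names are hypothetical; real agents store their history in different formats, so treat this as a starting point rather than a parser for any particular tool.

```python
import json
from collections import Counter


def summarize_session(lines):
    """Count repeated and failed commands in a JSON Lines session log.

    Assumes each command event has a "command" string and an "exit_code"
    integer. Both field names are hypothetical.
    """
    runs = Counter()
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        command = event.get("command")
        if command is None:
            continue  # not a command event (for example a message or file edit)
        runs[command] += 1
        if event.get("exit_code", 0) != 0:
            failures[command] += 1
    # Only commands that ran more than once count as "repeated".
    repeated = {cmd: count for cmd, count in runs.items() if count > 1}
    return repeated, dict(failures)


# Made-up sample events, purely for illustration.
sample_log = [
    '{"command": "bin/rails test", "exit_code": 1}',
    '{"command": "bin/rails test", "exit_code": 0}',
    '{"command": "grep -r front_matter lib", "exit_code": 0}',
    '{"message": "edited app/models/post.rb"}',
]
repeated, failed = summarize_session(sample_log)
print("repeated:", repeated)
print("failed:", failed)
```

The counts will not tell you why something was repeated, but they give you and the agent a short list of places to look first.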

Repeated reasoning is a smell

One useful signal: the model keeps generating code to do the same mechanical task.

For example, imagine you have a skill for publishing blog posts. Every time you run it, the model writes a small Ruby or Python snippet to:

parse front matter
validate the title, summary, badge, tags, and date
derive the final filename
move the draft into _posts/

If the agent is generating that code every time, that is a smell. The model is doing work that should probably be deterministic.

Ask the agent to turn that behavior into a script:

bin/validate-post _drafts/my-post.md
bin/publish-draft _drafts/my-post.md

Then update the skill so future agents call the script instead of improvising the logic.
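As a rough idea of what the publishing script could contain, here is a sketch of the "derive the final filename" and "move the draft" steps in Python. It assumes a Jekyll-style `_posts/YYYY-MM-DD-slug.md` naming scheme; the function names and the slug rules are my own illustration, not a fixed convention.

```python
import re
from datetime import date
from pathlib import Path


def slugify(title):
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def final_post_path(title, published_on, posts_dir="_posts"):
    """Build the destination path, e.g. _posts/2024-05-01-watch-your-agents.md."""
    slug = slugify(title)
    if not slug:
        raise ValueError("title produces an empty slug")
    return Path(posts_dir) / f"{published_on.isoformat()}-{slug}.md"


def publish_draft(draft_path, title, published_on, posts_dir="_posts"):
    """Move the draft to its final location and return the new path."""
    destination = final_post_path(title, published_on, posts_dir)
    if destination.exists():
        # Refuse to overwrite an already-published post.
        raise FileExistsError(f"{destination} already exists")
    destination.parent.mkdir(parents=True, exist_ok=True)
    Path(draft_path).rename(destination)
    return destination


print(final_post_path("Watch Your Agents", date(2024, 5, 1)))
```

Once rules like "what counts as a valid slug" live in one place, the agent no longer gets to reinvent them slightly differently in every session.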

Specific examples

Example: front matter validation

Bad pattern: every publishing session, the agent manually inspects YAML front matter and tries to remember the required fields.

Better pattern: create bin/check-frontmatter that exits non-zero when summary, badge, tags, or title are missing or malformed.

Now the agent does not need to reason about the rules from scratch. It runs the command and reacts to the result.
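A minimal sketch of the checking logic, in Python, might look like this. It uses a deliberately tiny parser that only understands top-level `key: value` lines, so it assumes inline values such as `tags: [agents, tooling]`; a real script would more likely use a proper YAML library. The function names are hypothetical.

```python
REQUIRED_FIELDS = ("title", "summary", "badge", "tags")


def parse_front_matter(text):
    """Return top-level key/value pairs from a leading '---' block.

    Deliberately tiny: handles only `key: value` lines, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if line[:1].isspace() or line.lstrip().startswith("#"):
            continue  # skip indented continuation lines and comments
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # closing '---' never found, so treat the block as malformed


def check_front_matter(text):
    """Return a list of problems; an empty list means the post passes."""
    fields = parse_front_matter(text)
    return [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not fields.get(name)
    ]


draft = "---\ntitle: Watch Your Agents\nbadge:\n---\nBody text\n"
for problem in check_front_matter(draft):
    print(problem)
```

To get the exit-code behavior, a thin command-line wrapper could read the file, print the problems to stderr, and exit with status 1 whenever the list is non-empty.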

Example: screenshot comparison

Bad pattern: the agent repeatedly writes one-off Python to resize screenshots, compare image dimensions, or calculate visual diffs.

Better pattern: create bin/compare-screenshots before.png after.png with clear output like:

METRIC changed_pixels=1824
METRIC changed_percent=0.7

The agent can use the result without reinventing image processing each time.
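Here is one way the core of such a script could work, sketched in Python. To keep it dependency-free, it compares raw pixel buffers rather than image files; loading the files is left out, and would need an imaging library (for example, Pillow's `Image.open(path).convert("RGB").tobytes()` produces a buffer in this shape). The function names are my own.

```python
def changed_pixel_metrics(before, after, channels=3):
    """Compare two raw pixel buffers and return (changed_pixels, changed_percent).

    `before` and `after` are bytes objects with `channels` bytes per pixel.
    """
    if len(before) != len(after):
        raise ValueError("buffers must be the same length")
    if len(before) % channels != 0:
        raise ValueError("buffer length is not a whole number of pixels")
    total = len(before) // channels
    # A pixel counts as changed if any of its channel bytes differ.
    changed = sum(
        before[i:i + channels] != after[i:i + channels]
        for i in range(0, len(before), channels)
    )
    percent = round(100 * changed / total, 1) if total else 0.0
    return changed, percent


def format_metrics(changed, percent):
    """Render the result as METRIC key=value lines."""
    return f"METRIC changed_pixels={changed}\nMETRIC changed_percent={percent}"


# Two 2x2 RGB images that differ in exactly one pixel.
before = bytes([0, 0, 0] * 4)
after = bytes([0, 0, 0] * 3 + [255, 0, 0])
print(format_metrics(*changed_pixel_metrics(before, after)))
```

Equal buffer lengths do not guarantee equal dimensions, so a real tool should also compare width and height before diffing.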

Example: database checks

Bad pattern: the agent keeps constructing ad hoc SQL to answer common questions like “which users have duplicate active subscriptions?” or “which jobs are stuck?”

Better pattern: create named scripts or Rails tasks:

bin/rails audit:duplicate_subscriptions
bin/rails jobs:stuck

Now the workflow is repeatable, reviewable, and safe to run again.
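In a Rails app this would be a Rake task, but the idea is the same in any language: give the query a name and check it in. Here is a sketch of the duplicate-subscriptions check in Python using the standard library's `sqlite3` module. The `subscriptions` table and its `user_id` and `status` columns are assumptions for illustration, not a real schema.

```python
import sqlite3

# The named, reviewable query. Table and column names are hypothetical.
DUPLICATE_ACTIVE_SUBSCRIPTIONS = """
    SELECT user_id, COUNT(*) AS active_count
    FROM subscriptions
    WHERE status = 'active'
    GROUP BY user_id
    HAVING COUNT(*) > 1
    ORDER BY user_id
"""


def duplicate_active_subscriptions(conn):
    """Return (user_id, active_count) rows for users with more than one active subscription."""
    return conn.execute(DUPLICATE_ACTIVE_SUBSCRIPTIONS).fetchall()


# Demo against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO subscriptions (user_id, status) VALUES (?, ?)",
    [(1, "active"), (1, "active"), (2, "active"), (2, "canceled"), (3, "active")],
)
print(duplicate_active_subscriptions(conn))
```

Because the query is read-only and lives in one place, it can be reviewed once and then run as often as needed.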

Example: API fixture generation

Bad pattern: the agent writes custom code every time it needs to build a fake webhook payload or API response.

Better pattern: create bin/generate-webhook-fixture invoice.paid or a small fixture library that produces known-good examples.

The agent stops guessing at payload shapes and starts using something the test suite can trust.
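A small fixture library could be as simple as the following Python sketch. The payload shape here is invented for illustration and does not match any particular provider; in practice you would seed it with payloads captured from the real service.

```python
import copy
import json

# Hypothetical payload shapes; replace with examples captured from the real provider.
FIXTURES = {
    "invoice.paid": {
        "type": "invoice.paid",
        "data": {
            "invoice_id": "inv_test_001",
            "amount_cents": 1500,
            "currency": "usd",
        },
    },
}


def generate_webhook_fixture(event_type, **overrides):
    """Return a fresh copy of a known-good payload, with optional data overrides."""
    if event_type not in FIXTURES:
        raise KeyError(f"unknown event type: {event_type}")
    # Deep copy so callers can mutate the result without corrupting the template.
    payload = copy.deepcopy(FIXTURES[event_type])
    payload["data"].update(overrides)
    return payload


fixture = generate_webhook_fixture("invoice.paid", amount_cents=2500)
print(json.dumps(fixture, indent=2, sort_keys=True))
```

Unknown event types fail loudly instead of producing a plausible-looking guess, which is the whole point of taking this job away from the model.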

Turn patterns into tools

Moving repeated agent behavior into deterministic tools gives you a few wins:

Determinism: the same input produces the same output.
Dependability: fewer “creative” variations in routine work.
Testability: scripts can have tests; improvised reasoning usually cannot.
Reviewability: a script can be read, improved, and versioned.
Cost: once the workflow is encoded, you may be able to use a smaller model for that task.
Speed: future turns spend less time rediscovering the same procedure.

The loop

Watch the agent the way you watch your logs.

When you see friction, repetition, or uncertainty, ask whether the agent needs better instructions or a better tool.

Sometimes the answer is a clearer prompt. Sometimes it is a skill. And sometimes the best thing you can do is take the fragile reasoning out of the model entirely and give it a boring, deterministic script to call.

That is not making the agent less useful. That is making the whole system more useful.