
Anyone have luck getting LLMs to write tests without mocks? The tests I want are often just 1-2 lines of code, but anything I get from Claude or Gemini ends up being 20-30 lines long, despite asking for conciseness, saying no mocks are needed, and saying that using real resources is OK.

(I use LLMs a lot for other stuff, but tests seem to be particularly bad.)
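For concreteness, here's the kind of thing I mean – a minimal sketch with a hypothetical `parse_log` function, using a real temporary file via pytest's `tmp_path` fixture rather than a mocked filesystem:

```
# Hypothetical example of the short, mock-free style of test I'm after:
# exercise the real function against a real temporary file and assert on
# the result directly.
from myapp.logs import parse_log  # hypothetical import


def test_parse_log_reads_real_file(tmp_path):
    log = tmp_path / "app.log"
    log.write_text("INFO started\n")
    assert parse_log(log) == [("INFO", "started")]
```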

in reply to Satvik

I have issues in general with getting LLMs to write anything "simply". It feels to me like they only know how to write "pre-crufted" code, maybe because their training data is mostly ten-year-old, median-quality GitHub repos, and because training them on programming-challenge problems rewards correctness but not simplicity.
in reply to Ben Weinstein-Raun

Yeah – I'm used to LLMs writing 2-3x as much code as I would, and trimming it down or rewriting from scratch, but with tests it's 10-15x and at that point I don't reap any benefit.
in reply to Satvik

I haven't had a problem with this; I wonder what you're doing differently. An example from the other day: I described this proposal, then asked for

> test: for `.zip`, an empty iterable as argument produces an iterable which is finished

Gemini 2.5 gave me

```
assert.deepEqual(Iterator.zip([]).next(), { value: undefined, done: true });
```

which is exactly what I'd have written. (Somewhat surprisingly, actually, because Gemini tends to be more verbose.)

Claude 3.7 gave me

```
// Test that calling Iterator.zip with an empty iterable produces an immediately finished iterator
const result = Iterator.zip([]);

// Check that the iterator is done immediately
const firstResult = result.next();

assert.sameValue(firstResult.done, true, "Iterator should be done immediately");
assert.sameValue(firstResult.value, undefined, "Iterator's value should be undefined");
```

The comments are unnecessary but the test is otherwise fine.

in reply to Satvik

I would love that first result, and the second would be fine too. But here is an example of what I get: aistudio.google.com/app/prompt…

(I'm literally just trying to generate ~100 tests that call various functions, to see if anything crashes when we change Python versions.)
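Roughly, each one just needs to be a smoke check like the sketch below – the module and function names here are made up, but that's the shape: call the function with real inputs and pass as long as nothing raises.

```
# Hypothetical sketch of the "does anything crash on the new Python version?"
# tests I'm trying to generate: call each function with real inputs, no mocks,
# and no assertions beyond "it didn't raise".
from mypackage import summarize, tokenize  # hypothetical imports


def test_tokenize_does_not_crash():
    tokenize("a quick smoke-test sentence")


def test_summarize_does_not_crash():
    summarize(["one line", "another line"])
```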

Claude Code gave me similar results. I did manage to get much better results using Aider + Claude, interestingly enough.

(These are not particularly well-written prompts, but I'm generally pretty lazy with my prompts, and prefer to just provide feedback. This works fine for a lot of stuff, but I haven't gotten it to work with tests.)

in reply to Kevin Gibbons

Hmm, interesting. I tried it in the API and in the online playground, both on Sonnet 3.7, and the API output was closer to yours – noticeably more concise. I've also tried several other versions (with much more context, enough to write out real tests), and the pattern seems to hold – the API results are better.

I'm surprised, I would have expected them to be about the same. But using the API is an easy enough change.
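(For reference, "the API" here means calling it directly with something like the sketch below – the Anthropic Python SDK with a short system prompt; the model string and prompts are placeholders, not my exact setup.)

```
# Rough sketch of the direct-API setup described above, using the Anthropic
# Python SDK; the model name, system prompt, and user prompt are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=512,
    system="Write minimal tests. No mocks, no comments; real resources are fine.",
    messages=[
        {"role": "user", "content": "Write a pytest test that load_config('settings.toml') does not raise."},
    ],
)

print(response.content[0].text)
```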

in reply to Daniel Ziegler

Sonnet 4 in the console gave me a slightly better answer than Kevin's above, using some more Julia-specific tricks for a better test. It's about as good as a test can get with such minimal information!

I'll be interested to try it out on more complex stuff during the week; there were some refactors that I couldn't get to work with Claude Code on 3.7, and maybe they'll work now.