Satvik

satvik@superstimul.us

Anyone have luck getting LLMs to write tests without mocks? The tests I want are often just 1-2 lines of code, but anything I get from Claude or Gemini ends up being 20-30 lines long, despite requests for conciseness, saying no mocks are needed, and seeing using real resources is ok.

(I use LLMs a lot for other stuff, but tests seem to be particularly bad.)

like this

in reply to Satvik

Ben Weinstein-Raun

in reply to Satvik • 2 months ago •

I have issues in general with getting LLMs to write anything "simply". It feels to me like they only know how to write "pre-crufted" code, maybe because their training data is mostly ten-year-old, median-quality GitHub repos, and when you train them using programming challenge problems they only learn based on correctness and not simplicity.

Satvik likes this.

in reply to Ben Weinstein-Raun

Satvik

in reply to Ben Weinstein-Raun • 2 months ago •

Yeah – I'm used to LLMs writing 2-3x as much code as I would, and trimming it down or rewriting from scratch, but with tests it's 10-15x and at that point I don't reap any benefit.

Ben Weinstein-Raun likes this.

in reply to Satvik

Kevin Gibbons

in reply to Satvik • 2 months ago •

I haven't had a problem with this; I wonder what you're doing different. An example from the other day, I described this proposal then asked for

> test: for `.zip`, an empty iterable as argument produces an iterable which is finished

Gemini 2.5 gave me

```
assert.deepEqual(Iterator.zip([]).next(), { value: undefined, done: true });
```

which is exactly what I'd have written. (Somewhat surprisingly, actually, because Gemini tends to be more verbose.)

Claude 3.7 gave me

```
// Test that calling Iterator.zip with an empty iterable produces an immediately finished iterator
const result = Iterator.zip([]);

// Check that the iterator is done immediately
const firstResult = result.next();

assert.sameValue(firstResult.done, true, "Iterator should be done immediately");
assert.sameValue(firstResult.value, undefined, "Iterator's value should be undefined");
```

The comments are unnecessary but the test is otherwise fine.

Satvik likes this.

in reply to Satvik

Satvik

in reply to Satvik • 2 months ago •

I would love that first result, and the second would be fine too. But here is an example of what I get: aistudio.google.com/app/prompt…

(I'm literally just trying to generate ~100 tests that call various functions, to see if anything crashes when we change Python versions.)

Claude Code gave me similar results. I did manage to get much better results using Aider + Claude, interestingly enough.

(These are not particularly well-written prompts, but I'm generally pretty lazy with my prompts, and prefer to just provide feedback. This works fine for a lot of stuff, but I haven't gotten it to work with tests.)

in reply to Satvik

Kevin Gibbons

in reply to Satvik • 2 months ago •

I added "The test will run in a real environment, so you don't need to mock anything." to the end of your prompt, and Sonnet 3.7 gave me

```
using Test
using PyCall

@testset "get_agg_strat_vectors_cache_filepath basic functionality" begin
# Get a valid account name from the environment
# This assumes there's at least one account configured in the system
accounts = pyimport("execution.account_settings").get_all_accounts()
@test length(accounts) > 0

test_account = accounts
[1] test_strategy = "momentum"
test_symbol = "BTCUSDT"
test_leverage = 1

# Test that the function runs without errors
result = get_agg_strat_vectors_cache_filepath(test_account, test_strategy, test_symbol, test_leverage)

# Basic validation of the result
@test result isa String
@test !isempty(result)

# Print the result for manual inspection
@info "Generated filepath: $result"
end
```

which, idk, seems reasonable? I think if you want a better result than that you'll nee

Satvik likes this.

in reply to Kevin Gibbons

Satvik

in reply to Kevin Gibbons • 2 months ago •

Hmm interesting. I tried in the API and the online playground, both on Sonnet 3.7, and the API was more similar to your result – noticeably more concise. I've also tried several other versions (with much more context, enough to write out real tests), and the pattern seems to be there – the API results are better.

I'm surprised, I would have expected them to be about the same. But using the API is an easy enough change.

in reply to Satvik

Daniel Ziegler

in reply to Satvik • 2 months ago •

Try again with Sonnet 4!

like this

in reply to Daniel Ziegler

Satvik

in reply to Daniel Ziegler • 2 months ago •

Sonnet 4 in the console gave me a slightly better answer than Kevin's above, using some more Julia-specific tricks for a better test. It's about as good as a test can get with such minimal information!

I'll be interested to try it out on more complex stuff during the week, there were some refactors that I couldn't get to work with Claude Code on 3.7, maybe they'll work now.

like this

in reply to Satvik

Daniel Ziegler

in reply to Satvik • 2 months ago •

I think Sonnet 4 is not *that* much smarter than 3.7 but it should be significantly more steerable and less likely to insert silly mocks

Satvik likes this.

in reply to Daniel Ziegler

Satvik

in reply to Daniel Ziegler • 2 months ago •

Sonnet 4 is tremendously more effective for my use cases, probably because I use a niche programming language (Julia). Two weeks ago I would have said LLMs make me ~10% more productive, now it looks closer to +100%.

And I'm not even committing LLM-generated code – I just use it to iterate and test on designs, then delete the code and implement from scratch manually.

⇧

Satvik 2 months ago •

Satvik
2 months ago •