I think the thing I really like about LLM-assisted coding is that it makes context switching easier.
I can be in "words mode" or "code mode", and switching between these takes time and effort. (There are more categories, but they don't change the fundamental point.)
In my job, I have to spend a lot of time in words mode, due to things like hiring and managing. Historically, this has meant that I only really get engineering work done when I have 2+ hour chunks to focus on it. But now I can often get work done in much shorter chunks, while still in words mode.
I would not like to spend all my time in words mode – I enjoy digging into the details – but it's really nice to have the option.
Anyone have luck getting LLMs to write tests without mocks? The tests I want are often just 1-2 lines of code, but anything I get from Claude or Gemini ends up being 20-30 lines long, despite requests for conciseness, saying no mocks are needed, and saying that using real resources is ok.
(I use LLMs a lot for other stuff, but tests seem to be particularly bad.)
Run-time type checking is way more useful than I expected. I've been using it in Julia for 4 years now, and I expected it to provide ~25% of the value of static type checking, but it's actually been closer to 90%.
I guess it's because when I'm developing, I'm constantly running code anyway, either through a notebook or tests. And the change -> run loop in Julia is not noticeably slower than the change -> compile loop in Scala.
The big exception is when I have code that can only reasonably be run on a remote machine and takes 5+ minutes to set up/execute. Then I'd really like more static analysis.
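As a toy illustration (the types here are made up for the example, not from a real codebase), this is the kind of thing the run-time checks catch the moment the code runs:
```
struct Trade
    symbol::String
    qty::Int
end

# The ::Float64 return annotation is checked when the method runs,
# not ahead of time.
notional(t::Trade, px)::Float64 = px * t.qty

notional(Trade("BTCUSDT", 2), 50_000.0)  # 100000.0
Trade("BTCUSDT", "two")  # throws a MethodError as soon as this line runs
```
Since I'm re-running code constantly anyway, that error shows up about as fast as a compiler would have flagged it.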
Cracking Eggs
The best way to crack eggs is the highlander method: beat two eggs against each other. This method is absurdly easy, rarely makes a mess, and is tolerant of a wide range of force.
But don't just look on the sunny side: the highlander method has a major flaw. What do you do with the last egg? If you haven't hatched a plan, you may scramble to one of the inferior methods: counter or bowl.
The counter method is the safe option: it consistently produces a small mess, even if your strike is eggsceptional. But if you're ready to leave your shell, the bowl is for gamblers and dreamers: it can produce a mess-free egg, but you'll end up with shell everywhere unless you crack it eggsactly right.
One of the main questions I ask in interviews is basically "we have a data pipeline with goal X and constraints A, B, and C. How would you design it?" Depending on how they do, we'll discuss various tradeoffs, other possible goals/constraints, and so on.
This is based on a real system I designed and have been maintaining for ~5 years, and is also very similar to other systems I've run at previous jobs.
About half the candidates complain that it's not a realistic question.
I've asked for more specific feedback, and the complaints often come down to "nothing I've done has been like this" and "most of development is web development." That might be true, but we don't have a website/web app, and we're pretty specific about the work involved in both the job description and the phone interview.
(We have had other feedback that's been helpful.)
Generally, everyone who's done well on this question and joined has been a strong hire, though we've also hired some people who didn't do well on that specific question. So I'm pretty sure it's a good question. I'm just a little amused/dismayed at how many people seem to think "realistic" means "web development."
Kevin Gibbons
in reply to Satvik

I haven't had a problem with this; I wonder what you're doing differently. An example from the other day: I described this proposal, then asked for
> test: for `.zip`, an empty iterable as argument produces an iterable which is finished
Gemini 2.5 gave me
```
assert.deepEqual(Iterator.zip([]).next(), { value: undefined, done: true });
```
which is exactly what I'd have written. (Somewhat surprisingly, actually, because Gemini tends to be more verbose.)
Claude 3.7 gave me
```
// Test that calling Iterator.zip with an empty iterable produces an immediately finished iterator
const result = Iterator.zip([]);
// Check that the iterator is done immediately
const firstResult = result.next();
assert.sameValue(firstResult.done, true, "Iterator should be done immediately");
assert.sameValue(firstResult.value, undefined, "Iterator's value should be undefined");
```
The comments are unnecessary but the test is otherwise fine.
Satvik
in reply to Satvik

I would love that first result, and the second would be fine too. But here is an example of what I get: aistudio.google.com/app/prompt…
(I'm literally just trying to generate ~100 tests that call various functions, to see if anything crashes when we change Python versions.)
Claude Code gave me similar results. I did manage to get much better results using Aider + Claude, interestingly enough.
(These are not particularly well-written prompts, but I'm generally pretty lazy with my prompts, and prefer to just provide feedback. This works fine for a lot of stuff, but I haven't gotten it to work with tests.)
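For reference, the shape of test I'm after is basically a smoke test like this (argument values here are made up):
```
using Test

@testset "get_agg_strat_vectors_cache_filepath smoke test" begin
    # Call the function with plausible arguments; the point is just to
    # catch crashes across Python versions, not to validate the output.
    @test get_agg_strat_vectors_cache_filepath("acct1", "momentum", "BTCUSDT", 1) isa String
end
```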
Kevin Gibbons
in reply to Satvik

I added "The test will run in a real environment, so you don't need to mock anything." to the end of your prompt, and Sonnet 3.7 gave me
```
using Test
using PyCall

@testset "get_agg_strat_vectors_cache_filepath basic functionality" begin
    # Get a valid account name from the environment
    # This assumes there's at least one account configured in the system
    accounts = pyimport("execution.account_settings").get_all_accounts()
    @test length(accounts) > 0
    test_account = accounts[1]

    test_strategy = "momentum"
    test_symbol = "BTCUSDT"
    test_leverage = 1

    # Test that the function runs without errors
    result = get_agg_strat_vectors_cache_filepath(test_account, test_strategy, test_symbol, test_leverage)

    # Basic validation of the result
    @test result isa String
    @test !isempty(result)

    # Print the result for manual inspection
    @info "Generated filepath: $result"
end
```
which, idk, seems reasonable? I think if you want a better result than that you'll need to give it more context. I mean, _I_ couldn't write a test for that function without having more context.
I'm using the API, fwiw; it's possible that the online playgrounds are more verbose by default.
Satvik
in reply to Kevin Gibbons

Hmm, interesting. I tried the API and the online playground, both on Sonnet 3.7, and the API result was closer to yours – noticeably more concise. I've also tried several other versions (with much more context, enough to write out real tests), and the pattern holds – the API results are better.
I'm surprised, I would have expected them to be about the same. But using the API is an easy enough change.
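In case anyone wants to reproduce the comparison, here's a minimal sketch of what "using the API" looks like from Julia (model name and prompt are placeholders, and it assumes ANTHROPIC_API_KEY is set in the environment):
```
using HTTP, JSON3

prompt = "Write a minimal Julia smoke test for foo()."  # placeholder

resp = HTTP.post(
    "https://api.anthropic.com/v1/messages",
    ["x-api-key" => ENV["ANTHROPIC_API_KEY"],
     "anthropic-version" => "2023-06-01",
     "content-type" => "application/json"],
    JSON3.write(Dict(
        "model" => "claude-3-7-sonnet-latest",
        "max_tokens" => 1024,
        "messages" => [Dict("role" => "user", "content" => prompt)],
    )),
)

# The reply text lives in the first content block of the response.
println(JSON3.read(resp.body).content[1].text)
```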
Daniel Ziegler
in reply to Satvik
Satvik
in reply to Daniel Ziegler

Sonnet 4 in the console gave me a slightly better answer than Kevin's above, using some Julia-specific tricks to improve the test. It's about as good as a test can get with such minimal information!
I'll be interested to try it out on more complex stuff during the week; there were some refactors I couldn't get to work with Claude Code on 3.7, and maybe they'll work now.