I've said this and things like it elsewhere, but o1-preview feels qualitatively better than other LLMs to me, in a way that I don't think I experienced even with GPT-4 vs GPT-3. My implicit superintelligence timelines actually grew a bit longer with GPT-4's release, and have grown a bit more in the time since, but using o1-preview has shrunk them again. It's also increased my felt probability that AI systems will be scheme-y in ways that are hard to detect.
Ben Weinstein-Raun
in reply to Ben Weinstein-Raun:
One aspect of the qualitative difference is that o1-preview appears better at answering difficult science and engineering questions in 10-30s than I am in 10m with Google (on topics I'm unfamiliar with), to roughly the same accuracy, which hasn't seemed true of 4o or Claude. I fairly often catch 4o in reasoning mistakes or simple fabrications, but this has only happened maybe once with o1-preview, in the month or so it took me to use up my credits. And in that case I'm not even hugely confident that it was wrong.
It's seriously limited by having a 3-year-old knowledge cutoff and no access to external tools, and yet I find myself needing to be selective about which things I ask it for fear of running out of credits.
Satvik and Daniel Filan like this.
in reply to Ben Weinstein-Raun:
Interesting. I haven't found it to be much better than trying a different prompting approach when 4o loses the plot. What's an example prompt where it does much better than 4o?
(That said, I've mostly been using Claude Sonnet 3.5 lately. Also, my recent science/engineering work has mostly been stuff I'm already good at.)
Kevin Gibbons
in reply to Ben Weinstein-Raun:
I have been seriously considering building a poor man's o1, which is just "automatically feed the output back in and ask it to correct any mistakes", probably along with some examples of good/bad outputs. The way LLMs hallucinate and then live in the world where the thing they said makes sense seems like it should be easy to break out of by adding an initial prompt like "here's an example of a conversation; assess it for accuracy and correct mistakes by either party" or whatever. That is at least some of what o1 is doing (which is why it takes so long to respond).
Re: running out of credits: I really recommend using the API. Caveat: o1 is currently only available at the "$100 paid on the API and 7+ days since first successful payment" tier, so if you aren't already using the API I guess that doesn't really help you for o1 access. It's pay-as-you-go, so there are no limits except what you're willing to spend (which for me ends up being trivial for everything _except_ o1, which often costs me a dollar or two for a longer conversation). It doesn't have fancy stuff like LaTeX rendering, but you can use something like github.com/open-webui/open-web… if you want that (or roll your own).
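The feed-the-output-back-in loop described above can be sketched in a few lines. The function and prompt names here are illustrative, and the exact critique wording is a guess at what might work; `ask` is any text-in/text-out callable wrapping an LLM:

```python
# Sketch of a "poor man's o1": feed the model's answer back in and ask it
# to critique and correct itself. `ask` is any function str -> str, e.g. a
# wrapper around a chat-completion API. Prompt wording is illustrative.

CRITIQUE_PROMPT = (
    "Here's a question and a draft answer from an earlier conversation. "
    "Assess the draft for accuracy, point out any mistakes or fabrications, "
    "and then give a corrected final answer.\n\n"
    "Question: {question}\n\nDraft answer: {draft}"
)

def self_correct(ask, question, rounds=2):
    """Run `rounds` critique-and-revise passes over an initial answer."""
    answer = ask(question)
    for _ in range(rounds):
        # Feed the previous answer back in, asking for a critique + fix.
        answer = ask(CRITIQUE_PROMPT.format(question=question, draft=answer))
    return answer
```

With the OpenAI Python client, `ask` could be something like `lambda p: client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": p}]).choices[0].message.content`.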
Ben Weinstein-Raun and kip like this.