Don’t let the “flash” name fool you: this is an amazing model.
I have been playing with it for the past few weeks and it's genuinely my new favorite. It's so fast, and it has such vast world knowledge, that it's more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fraction (basically an order of magnitude less!!) of the inference time and price.
thecupisblue 18 hours ago [-]
Oh wow - I recently tried 3 Pro preview and it was too slow for me.
After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.
The results are better AND the response times have stayed the same.
What an insane gain - especially considering the price compared to 2.5 Pro.
I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to see a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.
Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have quite a nice internal benchmark suite for it, so I would love to toy with the new ones as they come out.
lancekey 11 hours ago [-]
Curious to learn what a “product benchmark” looks like. Is it evals you use to test prompts/models? A third party tool?
Examples from the wild are a great learning tool, anything you’re able to share is appreciated.
theshrike79 3 hours ago [-]
Everyone should have their own "pelican riding a bicycle" benchmark they test new models on.
And it shouldn't be shared publicly so that the models won't learn about it accidentally :)
ggsp 3 hours ago [-]
Any suggestions for a simple tool to set up your own local evals?
theshrike79 14 minutes ago [-]
My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it.
...yet. Crap, do I need to now? =)
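If you do: a minimal harness is only a few lines of Python. A rough sketch, assuming an OpenAI-compatible endpoint and a prompts.txt with one prompt per line (the model name and file are placeholders):

    import openai  # any OpenAI-compatible client/provider works

    client = openai.OpenAI()  # set base_url/api_key for your provider

    with open("prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    for prompt in prompts:
        resp = client.chat.completions.create(
            model="gpt-5-mini",  # swap in whatever model you're evaluating
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"### {prompt}\n{resp.choices[0].message.content}\n")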
m00dy 6 hours ago [-]
May I ask about your internal benchmark? I'm building a new set of benchmarks and a testing suite for agentic workflows using deepwalker [0]. How do you design your benchmark suite? It would be really cool if you could give more details.
I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.
Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.
So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).
So, I guess I need to put together some more benchmark problems, to get a better sample size than one, but it's at least now passing an "I can find the answer to this in the top 3 hits of a Google search for a niche topic" test better than any of the other models.
Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.
prettyblocks 13 hours ago [-]
I don't think tricky niche knowledge is the sweet spot for genAI, and it likely won't be for some time. Instead, it's a great replacement for rote tasks where less-than-perfect performance is good enough: transcription, OCR, boilerplate code generation, etc.
lambda 11 hours ago [-]
The thing is, I see people use it for tricky niche knowledge all the time; using it as an alternative to doing a Google search.
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
ozim 4 hours ago [-]
That’s riding the hype machine and throwing the baby out with the bathwater.
Get an API and try to use it for classification of text or images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that matter to you, use an LLM.
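A sketch of what that looks like, assuming a CSV export of the Excel file with a "text" column and an OpenAI-compatible API (the labels and model name are made up):

    import csv
    import openai

    client = openai.OpenAI()
    labels = ["important", "routine", "ignore"]  # made-up labels

    with open("entries.csv") as f:  # the Excel file, exported to CSV
        for row in csv.DictReader(f):
            resp = client.chat.completions.create(
                model="gpt-5-mini",
                messages=[{
                    "role": "user",
                    "content": f"Classify this entry as one of {labels}. "
                               f"Reply with the label only.\n\n{row['text']}",
                }],
            )
            print(row["text"][:40], "->", resp.choices[0].message.content.strip())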
Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier without training on a specific person's voice; it can handle anyone's voice.
Fixing up text is of course also big.
Data classification is easy for an LLM. Data transformation is a bit harder, but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air, it will hallucinate like a madman.
The tasks LLMs are good at are used in the background by people building actually useful software on top of LLMs, but those problems are not seen by the general public, who only sees the chat box.
illiac786 5 hours ago [-]
But people using the wrong tool for a task is nothing new. Using excel as a database (still happening today), etc.
Maybe the scale is different with genAI and there are some painful learnings ahead of us.
katzenversteher 5 hours ago [-]
I also use niche questions a lot, but mostly to check how much the models tend to hallucinate. E.g. I start by asking about rank badges in Star Trek, which they usually get right, and then I ask about specific (non-existent) rank badges shaped like strawberries or something like that. Or I ask about smaller German cities and what's famous about them.
I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things; I just hope one day they will actually know that their "memory" is bad or non-existent and tell me so instead of hallucinating something.
mikepurvis 11 hours ago [-]
And Google themselves obviously believe that too, as they happily insert AI summaries at the top of most SERPs now.
ComputerGuru 11 hours ago [-]
Or maybe Google knows most people search inane, obvious things?
coldtea 10 hours ago [-]
Or, more likely, Google couldn't give a rat's arse whether those AI summaries are good or not (except to the degree that people don't flee), and what it cares about is that they keep users on Google itself, instead of clicking off to other sources.
After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.
vitorgrs 10 hours ago [-]
Google AI Overviews often get obvious things wrong, so... lol
They probably use an old Flash Lite model, something super small, and just summarize the search...
mikepurvis 6 hours ago [-]
Those summaries would be far more expensive to generate than the searches themselves, so they're probably caching the top 100k most common queries or something, maybe even pre-caching them.
ozim 12 hours ago [-]
Second this.
Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
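A minimal sketch of that pattern - the fields and model name here are illustrative, and it assumes an OpenAI-compatible API with JSON output:

    import json
    import openai

    client = openai.OpenAI()

    free_text = "I'm Maria, 34, from Lisbon, and I'd rate the event 8/10."

    resp = client.chat.completions.create(
        model="gpt-5-mini",
        response_format={"type": "json_object"},  # force valid JSON back
        messages=[{
            "role": "user",
            "content": "Extract name, age, city and rating as a JSON object "
                       "from this answer:\n" + free_text,
        }],
    )
    print(json.loads(resp.choices[0].message.content))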
I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.
DeathArrow 1 hours ago [-]
Well, I used Grok to find information I'd forgotten, like product names, films, books, and various articles on different subjects. Google search didn't help, but putting the LLM to work did the trick.
So I think LLMs can be good for finding niche info.
DrewADesign 9 hours ago [-]
Yeah, but tests like that deliberately prod the boundaries of its capability rather than how well it does what it’s good at.
andai 12 hours ago [-]
So this is an interesting benchmark, because if the answer is actually in the top 3 Google results, then my Python script that runs a Google search, scrapes the top n results, and shoves them into a crappy LLM would pass your benchmark too!
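Roughly like this - a sketch where search_urls() stands in for whatever search API or scraper you use; the URLs and model are placeholders:

    import requests
    import openai

    client = openai.OpenAI()

    def search_urls(query: str, n: int = 3) -> list[str]:
        # Stand-in: call your search API of choice here.
        return ["https://example.com/hit1", "https://example.com/hit2"][:n]

    def answer(query: str) -> str:
        pages = [requests.get(u, timeout=10).text[:5000]
                 for u in search_urls(query)]
        context = "\n\n".join(pages)
        resp = client.chat.completions.create(
            model="gpt-5-mini",  # the "crappy LLM"
            messages=[{"role": "user",
                       "content": f"Using only this context:\n{context}\n\n"
                                  f"Answer the question: {query}"}],
        )
        return resp.choices[0].message.content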
Which also implies that (for most tasks), most of the weights in a LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
lambda 11 hours ago [-]
I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even then they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.
andai 2 hours ago [-]
So it's a difficult question for LLMs to answer even when given perfect context?
Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).
arisAlexis 13 minutes ago [-]
Can you give us an example of this niche knowledge? I highly doubt there is knowledge that is not inside some internet training material.
TeodorDyakov 13 hours ago [-]
Hi. I am curious what was the benchmark question? Cheers!
Turskarama 12 hours ago [-]
The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.
lambda 12 hours ago [-]
Yeah, that's part of why I don't disclose.
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure Google uses its huge dataset of searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of this massive dataset they've had for years.
grog454 11 hours ago [-]
This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
nl 10 hours ago [-]
I have a bunch of private benchmarks I run against new models I'm evaluating.
The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.
However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
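To illustrate the shape of such a generator (not the actual private test, just a generic verifiable task as an example):

    import random

    def make_test(rng: random.Random) -> tuple[str, str]:
        # Generic example: reversing a word list is trivially verifiable.
        words = rng.sample(["apple", "brick", "cloud", "delta", "ember"], k=3)
        prompt = f"Reverse the order of these words: {' '.join(words)}"
        expected = " ".join(reversed(words))
        return prompt, expected

    rng = random.Random(42)
    tests = [make_test(rng) for _ in range(1000)]  # thousands are this cheap
    print(tests[0])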
grog454 9 hours ago [-]
Ok, but then your "post" isn't scientific by definition, since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.
I didn't see anyone claiming any 'science'? Did I miss something?
grog454 9 hours ago [-]
I guess there's two things I'm still stuck on:
1. What is the purpose of the benchmark?
2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?
To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
nl 8 hours ago [-]
1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form.
> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
grog454 7 hours ago [-]
I see the potential value of private evaluations. They aren't scientific but you can certainly beat a "vibe test".
I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.
> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Then you must not be working in an environment where a better benchmark yields a competitive advantage.
eru 6 hours ago [-]
> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.
In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.
nl 8 hours ago [-]
As ChatGPT said to you:
> A secret benchmark is: Useful for internal model selection
That's what I'm doing.
Turskarama 11 hours ago [-]
The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche.
Ideally of course you would use a few of them and aggregate the results.
theshrike79 3 hours ago [-]
Because it encompasses the very specific way I like to do things. It's not of use to the general public.
akoboldfrying 9 hours ago [-]
I actually think "concealing the question" is not only a good idea, but a rather general and powerful idea that should be much more widely deployed (but often won't be, for what I consider "emotional reasons").
Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
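A sketch of the concealed-weights idea; the metrics here are hypothetical and assumed normalized to 0..1:

    import random

    # Hypothetical per-file metrics, each normalized to 0..1.
    metrics = {"lint_score": 0.82, "test_coverage": 0.64, "doc_density": 0.71}

    # Secret weights: drawn once, kept private, re-drawn periodically.
    rng = random.Random()  # seed kept secret
    weights = {k: rng.random() for k in metrics}
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

    quality = sum(metrics[k] * weights[k] for k in metrics)
    print(f"composite quality score: {quality:.3f}")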
grog454 8 hours ago [-]
It's hard to have any certainty around concealment unless you are only testing local LLMs. As a matter of principle I assume the input and output of any query I run in a remote LLM is permanently public information (same with search queries).
Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!
This is the second reason I find the idea of publicly discussing secret benchmarks silly.
grog454 6 hours ago [-]
I learned in another thread there is some work being done to avoid contamination of training data during evaluation of remote models using trusted execution environments (https://arxiv.org/pdf/2403.00393). It requires participation of the model owner.
kridsdale3 12 hours ago [-]
If they told you, it would be picked up in a future model's training run.
jacobn 12 hours ago [-]
Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
nl 10 hours ago [-]
OpenAI and Anthropic don't train on your questions if you have pressed the opt-out button and are using their UI. LMArena is a different matter.
jerojero 12 hours ago [-]
They probably don't train on inputs from testing grounds.
You don't train on your test data, because you need it to check whether training is improving things or not.
energy123 12 hours ago [-]
Given they asked it on LMArena, yes.
lambda 11 hours ago [-]
Yeah, asking on LMArena probably makes this an invalid benchmark going forward, especially since I think Google is particularly active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
_heimdall 12 hours ago [-]
Is that an issue if you now need a new question to ask?
Marazan 1 hours ago [-]
Here's my old benchmark question and my new variant:
"When was the last time England beat Scotland at rugby union?"
New variant:
"Without using search, when was the last time England beat Scotland at rugby union?"
It is amazing how bad ChatGPT is at this question, and has been for years now, across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web, so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff: it almost always reports the wrong year, wrong location, and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist - standard hallucinations. But even within the text it generates itself, it cannot stay consistent with how reality works. It often reports draws as wins for England. It frequently states that the team it just said scored the most points lost the match, etc.
It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.
vitaflo 9 hours ago [-]
I also have my own tricky benchmark that up until now only DeepSeek has been able to answer. Gemini 3 Pro was the second. Every other LLM fails horribly. This is the main reason I started looking at G3pro more seriously.
fragmede 12 hours ago [-]
Even the most magical wonderful auto-hammer is gonna be bad at driving in screws. And, in this analogy I can't fault you because there are people trying to sell this hammer as a screwdriver. My opinion is that it's important to not lose sight of the places where it is useful because of the places where it isn't.
pretzellogician 12 hours ago [-]
Funny, I grew up using what's called a "hand impact screwdriver"... turns out a hammer can be used to drive in screws!
mips_avatar 13 hours ago [-]
OpenAI made a huge mistake neglecting fast inference models. Their strategy was GPT-5 for everything, which hasn't worked out at all. I'm really not sure what model OpenAI wants me to use for applications that require lower latency. If I follow the advice in their API docs about which models to use for faster responses, I get told to either use GPT-5 low thinking, replace GPT-5 with GPT-4.1, or switch to the mini model. Now as a developer I'm doing evals on all three of these combinations. I'm running my evals on Gemini 3 Flash right now, and without thinking it's outperforming GPT-5 thinking. OpenAI should stop trying to come up with ads and make models that are useful.
andai 11 hours ago [-]
Hard to find info, but I think the -chat versions of 5.1 and 5.2 (gpt-5.2-chat) are what you're looking for. They might just be an alias for the same model with very low reasoning, though. I've seen other providers do the same thing, where they offer a reasoning and a non-reasoning endpoint. Seems to work well enough.
ComputerGuru 11 hours ago [-]
They’re not the same; there are (at least) two different tunes per 5.x.
For each, you can use it as “instant”, supposedly without thinking (though these are all exclusively reasoning models), or specify a reasoning amount (low, medium, high, and now xhigh - though if you don't specify, it defaults to none). OR you can use the -chat version, which is also “no thinking” but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent, but a different style and answering method).
mips_avatar 11 hours ago [-]
It's weird they don't document this stuff. Understanding things like tool-call latency and time to first token is extremely important in application development.
eru 9 hours ago [-]
Humans often answer with fluff like "That's a good question, thanks for asking that, [fluff, fluff, fluff]" to give themselves more breathing room until the first 'token' of their real answer. I wonder if any LLM are doing stuff like that for latency hiding?
mips_avatar 8 hours ago [-]
I don't think the models are doing this, time to first token is more of a hardware thing. But people writing agents are definitely doing this, particularly in voice it's worth it to use a smaller local llm to handle the acknowledgment before handing it off.
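The pattern is simple enough to sketch - local_ack() and remote_answer() below are stand-ins for the small local model and the slow frontier call:

    import asyncio

    async def local_ack() -> str:
        # A small local model (or a canned phrase) returns near-instantly.
        await asyncio.sleep(0.05)
        return "Sure, let me check that for you."

    async def remote_answer(question: str) -> str:
        # Stand-in for the slow frontier-model call.
        await asyncio.sleep(2.0)
        return f"Here's what I found about {question!r}..."

    async def respond(question: str) -> None:
        answer_task = asyncio.create_task(remote_answer(question))
        print(await local_ack())   # speak the acknowledgment immediately
        print(await answer_task)   # speak the real answer when it arrives

    asyncio.run(respond("tomorrow's weather"))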
strangegecko 7 hours ago [-]
Do humans really do that often?
Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
eru 6 hours ago [-]
People who professionally answer questions do that, yes. Eg politicians or press secretaries for companies, or even just your professor taking questions after a talk.
> Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
It gets a lot easier with practice: your brain caches a few of the typical fluff routines.
danpalmer 13 hours ago [-]
Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors here, but latency specifically favours TPUs.
The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs; and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
nl 10 hours ago [-]
> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.
Where are you getting that? All the citations I've seen say the opposite, eg:
> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.
> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
danpalmer 7 hours ago [-]
I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book: https://jax-ml.github.io/scaling-book/ – TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. This would lead to lower latency as far as I understand it.
The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.
> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.
nl 5 hours ago [-]
Sorry I meant Groq custom hardware, not Grok!
I don't see any latency comparisons in the link
danpalmer 4 hours ago [-]
The link is just to the book, the details are scattered throughout. That said the page on GPUs specifically speaks to some of the hardware differences and how TPUs are more efficient for inference, and some of the differences that would lead to lower latency.
Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.
mips_avatar 8 hours ago [-]
I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference but I could be wrong. I agree that I don't see why TPUs would necessarily explain latency.
danpalmer 4 hours ago [-]
To be clear I'm only suggesting that hardware is a factor here, it's far from the only reason. The parent commenter corrected their comment that it was actually Groq not Grok that they were thinking of, and I believe they are correct about that as Groq is doing something similar to TPUs to accelerate inference.
jrk 9 hours ago [-]
Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
eru 9 hours ago [-]
And our LLMs still have latencies well into the human perceptible range. If there's any necessary, architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.
danpalmer 7 hours ago [-]
My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, where TPUs pipeline data through systolic arrays far more. From what I've heard this generally improves latency and also reduces the overhead of supporting large context windows.
simonw 13 hours ago [-]
Yeah, I'm surprised that they've been through GPT-5.1 and GPT-5.1-Codex and GPT-5.1-Codex-Max and now GPT-5.2, but their most recent mini model is still GPT-5-mini.
mips_avatar 11 hours ago [-]
I cannot comprehend how they do not care about this segment of the market.
yakbarber 10 hours ago [-]
It's easy to comprehend, actually: they're putting everything on "having the best model". It doesn't look like they're going to win, but that's still their bet.
mips_avatar 10 hours ago [-]
I mean they’re trying to outdo google. So they need to do that.
eru 9 hours ago [-]
Until recently, Google was the underdog in the LLM race and OpenAI was the reigning champion. How quickly perceptions shift!
mips_avatar 8 hours ago [-]
I just want a deepseek moment for an open weights model fast enough to use in my app, I hate paying the big guys.
eru 4 hours ago [-]
Isn't deepseek an open weights model?
campers 5 hours ago [-]
I had wondered if they run their inference at high batch sizes to get better throughput and keep their inference costs lower.
They do have a priority tier at double the cost, but I haven't seen any benchmarks on how much faster that actually is.
The flex tier was an underrated feature in GPT-5: batch pricing with a regular API call. GPT-5.1 using flex priority is an amazing price/intelligence tradeoff for non-latency-sensitive applications, without needing the extra plumbing of most batch APIs.
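If memory serves, flex is just one extra parameter on a normal call - a sketch; check the docs for which models and endpoints support it:

    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="gpt-5.1",                 # model name as an example
        input="Summarize this week's error logs...",
        service_tier="flex",             # batch-style pricing, regular sync call
    )
    print(resp.output_text)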
mips_avatar 5 hours ago [-]
I’m sure they do something like that. I’ve noticed azure has way faster gpt 4.1 than OpenAI
behnamoh 12 hours ago [-]
> OpenAI made a huge mistake neglecting fast inferencing models.
It's a lost battle. It'll always be cheaper to use an open source model hosted by others like together/fireworks/deepinfra/etc.
I've been maining Mistral lately for low latency stuff and the price-quality is hard to beat.
mips_avatar 11 hours ago [-]
I'll try benchmarking Mistral against my eval. I've been impressed by Kimi's performance, but it's too slow to do anything useful in realtime.
TacticalCoder 11 hours ago [-]
> OpenAI should stop trying to come up with ads and make models that are useful.
Turns out becoming a $4 trillion company first with ads (Google), then owning everybody on the AI-front could be the winning strategy.
kartayyar 9 hours ago [-]
Can confirm. We at Roblox open-sourced a new frontier game eval today, and it's beating even Gemini 3 Pro (the previous best model)!
Alright, so we have more benchmarks, including hallucinations, and Flash doesn't do well on that one, though generally it beats Gemini 3 Pro, GPT 5.1 thinking, and GPT 5.2 thinking xhigh (but then, Sonnet, Grok, Opus, Gemini, and 5.1 all beat 5.2 xhigh) - everything. Crazy.
On your Omniscience-Index vs. Cost graph, I think your Gemini 3 Pro and Flash models might be swapped.
giancarlostoro 15 hours ago [-]
I wonder at what point everyone who over-invested in OpenAI will regret their decision (except maybe Nvidia?). Maybe Microsoft doesn't need to care; they get to sell their models via Azure.
Seeing Sergey Brin back in the trenches makes me think Google is really going to win this
They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal
outside1234 14 hours ago [-]
Very soon, because clearly OpenAI is in very serious trouble. They are scaled, have no business model, and face a competitor that is much better than them at almost everything (ads, hardware, cloud, consumer, scaling).
TacticalCoder 10 hours ago [-]
Oracle's stock skyrocketed, then took a nosedive. Financial experts warned that companies who bet big on OpenAI to pump their stock, like Oracle and Coreweave, would go down the drain, and down the drain they went (so far: -65% for Coreweave and nearly -50% for Oracle compared to their OpenAI-hype all-time highs).
Markets seem to be in a "show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI kool-aid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
eru 9 hours ago [-]
Financial experts [0] and analysts are pretty much useless. Empirically, their predictions are slightly worse than chance.
[0] At least the ones who publish where you or I can read them.
guelo 13 hours ago [-]
OpenAI's doom was written when Altman (and Nadella) got greedy, threw away the nonprofit mission, and caused the exodus of talent and funding that created Anthropic. If they had stayed nonprofit the rest of the industry could have consolidated their efforts against Google's juggernaut. I don't understand how they expected to sustain the advantage against Google's infinite money machine. With Waymo Google showed that they're willing to burn money for decades until they succeed.
This story also shows the market corruption of Google's monopolies, but a judge recently gave them his stamp of approval so we're stuck with it for the foreseeable future.
deegles 10 hours ago [-]
I think their downfall will be the fact that they don't have a "path to AGI" and have been raising investor money on the promise that they do.
taytus 6 hours ago [-]
I believe there's also exponential dislike growing for Altman among most AI users, and that impacts how the brand/company is perceived.
mingusrude 4 hours ago [-]
Most AI users outside of HN do not have any idea who Altman is. ChatGPT is in many circles synonymous with AI, so their brand recognition is huge.
behnamoh 12 hours ago [-]
> I don't understand how they expected to sustain the advantage against Google's infinite money machine.
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
goobatrooba 11 hours ago [-]
I know you're making an analogy but I have to point out that there are many points where Nazi Germany could have gone a different route and potentially could have ended up with a stable dominion over much of Western Europe.
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pinpoint the reason, but presumably it was to get Japan to attack the Soviets from the other side, which then didn't happen). Another could have been consolidating after the surrender/capitulation of France, rather than continuing to attack further.
eru 9 hours ago [-]
Huh? How did the USSR have infinite resources? They were barely kept afloat by western allied help (especially at the beginning). Remember also how Tsarist Russia was the first power to collapse and get knocked out of the war in WW1, long before the war was over. They did worse than even the proverbial 'Sick Man of Europe', the Ottoman Empire.
Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.
SoftTalker 8 hours ago [-]
they had more soldiers to throw into the meat grinder
eru 6 hours ago [-]
They also had more soldiers in WW1.
jack_riminton 13 hours ago [-]
But you’re forgetting the Jonny Ive hardware device that totally isn’t like that laughable pin badge thing from Humane
/s
user34283 2 hours ago [-]
I agree completely. Altman was at some point talking about a screenless device and getting people away from the screen.
Abandoning our most useful sense, vision, is a recipe for a flop.
mmaunder 17 hours ago [-]
Thanks, having it walk a hardcore SDR signal chain right now --- oh damn, it just finished. The blog post makes it clear this isn't just some 'lite' model: you get low latency and cognitive performance. Really appreciate you amplifying that.
kqr 3 hours ago [-]
Yes, 2.5 Flash is extremely cost efficient in my favourite private benchmark: playing text adventures[1]. I'm looking forward to testing 3.0 Flash later today.
Lately I've been trying to ask LLMs to generate SVG pictures. Do you have the famous pelican on a bike created by the Flash model?
behnamoh 11 hours ago [-]
> Don’t let the “flash” name fool you
I think it's bad naming on Google's part. "Flash" implies low quality: fast, but not good enough. I get less of a negative feeling looking at "mini" models.
pietz 11 hours ago [-]
Interesting. Flash suggests more power to me than Mini. I never use gpt-5-mini in the UI whereas Flash appears to be just as good as Pro just a lot faster.
taytus 6 hours ago [-]
I'm in between :)
Mini - small, incomplete, not good enough.
Flash - good, not great, fast, might miss something.
nemonemo 11 hours ago [-]
Fair point. Asked Gemini to suggest alternatives, and it suggested Gemini Velocity, Gemini Atom, Gemini Axiom (and more). I would have liked `Gemini Velocity`.
behnamoh 8 hours ago [-]
I like Anthropic's approach: Haiku, Sonnet, Opus. Haiku is pretty capable still and the name doesn't make me not wanna use it. But Flash is like "Flash Sale". It might still be a great model but my monkey brain associates it with "cheap" stuff.
esafak 18 hours ago [-]
What are you using it for and what were you using before?
epolanski 18 hours ago [-]
Gemini 2.0 Flash was already good for some of my tasks a long time ago.
unsupp0rted 17 hours ago [-]
How good is it for coding, relative to recent frontier models like GPT 5.x, Sonnet 4.x, etc?
jasonjmcghee 4 hours ago [-]
My experience so far: much less reliable. Though it's been in chat, not opencode or Antigravity etc. You give it a program and say "change it in this way", and it just throws stuff away, changes unrelated stuff, etc. Completely different quality than Pro (or Sonnet 4.5 / GPT-5.2).
piokoch 3 hours ago [-]
So why is Flash so high on LiveCodeBench Pro?
BTW: I have the same impression; Claude was working better for me on coding tasks.
bovermyer 13 hours ago [-]
In my own, very anecdotal, experience, Gemini 3 Pro and Flash are both more reliably accurate than GPT 5.x.
I have not worked with Sonnet enough to give an opinion there.
yunohn 13 hours ago [-]
I love how every single LLM model release is accompanied by pre-release insiders proclaiming how it’s the best model yet…
hexasquid 11 hours ago [-]
Makes me think of how every iPhone is the best iPhone yet.
Waiting for Apple to say "sorry folks, bad year for iPhone".
eru 9 hours ago [-]
Wouldn't you expect that every new iPhone is genuinely the best iPhone? I mean, technology marches on.
OrangeMusic 4 hours ago [-]
It was sarcasm.
ZuoCen_Liu 8 hours ago [-]
What type of question is your one about testing AI inference time?
freedomben 18 hours ago [-]
Cool! I've been using 2.5 Flash and it is pretty bad: 1 out of 5 answers it gives will be a lie. Hopefully 3 is better.
samyok 17 hours ago [-]
Did you try with the grounding tool? Turning it on solved this problem for me.
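For reference, grounding is one config flag in the google-genai SDK - a sketch from memory, so double-check against the current docs:

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Who won yesterday's match?",
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    print(resp.text)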
Davidzheng 17 hours ago [-]
what if the lie is a logical deduction error not a fact retrieval error
rat9988 17 hours ago [-]
The error rate would still be improved overall and might make it a viable tool for the price depending on the usecase.
encroach 16 hours ago [-]
How did you get early access?
tonymet 15 hours ago [-]
Can you be more specific about the tasks you've found exceptional?
tonyhart7 15 hours ago [-]
I think Google is the only one that still produces a general-knowledge LLM right now.
Claude was a coding model from the start, but GPT is more and more becoming a coding model too.
Imustaskforhelp 15 hours ago [-]
I agree with this observation. Gemini does feel like code red for basically every AI company, like ChatGPT, Claude, etc., in my opinion, if the underlying model is both fast and cheap and good enough.
I hope open-source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open-sources it, but let's be honest, Google isn't open-sourcing Gemini 3 Flash, and I guess the best bets nowadays in open source are probably GLM, DeepSeek Terminus, or maybe Qwen/Kimi.
leemoore 13 hours ago [-]
Gemini isn't code red for Anthropic. Gemini threatens none of Anthropic's positioning in the market.
ralusek 13 hours ago [-]
Yes it does. I never use Claude anymore outside of agentic tasks.
leemoore 11 hours ago [-]
What demographic are you in that is leaving Anthropic en masse that they care about retaining? From what I see, Anthropic is targeting enterprise and coding.
Claude Code just caught up to Cursor (no. 2) in revenue and, based on trajectories, is about to pass GitHub Copilot (no. 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.
At my Fortune 100 financial company, they just finished crushing OpenAI in a broad enterprise-wide evaluation. Google Gemini was never in the mix, never on the table, and still isn't. Every one of our engineers has 1k a month allocated in Claude tokens for Claude Enterprise and Claude Code.
There is one leader with enterprise. There is one leader with developers. And Google has nothing to make a dent: not Gemini 3, not Gemini CLI, not Antigravity. There is no Code Red for Anthropic. They have clear target markets, and nothing from Google threatens those.
Karrot_Kream 11 hours ago [-]
I agree with your overall thesis but:
> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
Does that mean y'all never evaluated Gemini at all, or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced the stats away from Gemini, but I'm a Claude Code and heavy Anthropic user myself, so shrug.
user34283 10 hours ago [-]
Enterprise is slow. As for developers, we will be switching to Google unless the competition can catch up and deliver a similarly fast model.
Enterprise will follow.
I don't see any distinction in target markets - it's the same market.
Imustaskforhelp 4 hours ago [-]
Yeah, this is what I was trying to say in my original comment too.
Also, I don't really use agentic tasks, but I'm not sure whether Gemini 3 / 3 Flash have MCP/skills support for agentic tasks.
If not, I feel like those are very low-hanging fruit and something Google could try to do to win the market for agentic tasks over Claude too.
user34283 3 hours ago [-]
I don't use MCP, but I am using agents in Antigravity.
So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.
siva7 13 hours ago [-]
So? Agentic tasks are where the promised AGI is for many of us.
Uehreka 14 hours ago [-]
I would expect open weights models to always lag behind; training is resource-intensive and it’s much easier to finance if you can make money directly from the result. So in a year we may have a ~700B open weights model that competes with Gemini 3, but by then we’ll have Gemini 4, and other things we can’t predict now.
xbmcuser 14 hours ago [-]
There will be diminishing returns though, as future models won't be that much better; we will reach a point where the open-source model is good enough for most things, and the need to be on the latest model is no longer so important.
For me the bigger concern, which I have mentioned on other AI-related topics, is that AI is eating all the production of computer hardware, so we should be worrying about hardware prices getting out of hand and making it harder for the general public to run open-source models. Hence I am rooting for China to reach parity on node size and crash PC hardware prices.
FuckButtons 13 hours ago [-]
I had a similar opinion, that we were somewhere near the top of the sigmoid curve of model improvement that we could achieve in the near term. But given continued advancements, I’m less sure that prediction holds.
Imustaskforhelp 4 hours ago [-]
Yeah, I have a similar opinion, and you can go back almost a year, to when Claude 3.5 launched, and find me saying on Hacker News that it's good enough.
And now I am saying the same for Gemini 3 Flash.
I still feel the same way though; sure, there is an improvement, but I somewhat believe that Gemini 3 is good enough and the returns on training from now on might not be worth that much, imo. But I'm not sure, and I can be wrong. I usually am.
eru 9 hours ago [-]
My model is a bit simpler: model quality is something like the logarithm of effort you put into making the model. (Assuming you know what you are doing with your effort.)
So I don't think we are on any sigmoid curve or so. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.
(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)
baq 13 hours ago [-]
If Gemini 3 Flash is really confirmed close to Opus 4.5 at coding, and a similarly capable model is open weights, I want to buy a box with a USB cable that has that thing loaded, because today that's enough to run a small team out of engineering work.
eru 9 hours ago [-]
Open weights doesn't mean you can necessarily run it on a (small) box.
If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.
Workaccount2 14 hours ago [-]
Open-source models are riding coattails; they are basically just distilling the giant SOTA models, hence perpetually being 4-6 months behind.
waffletower 13 hours ago [-]
If this quantification of lag is anywhere near accurate (it may be larger and/or more complex to describe), open-source models will soon be "simply good enough". Perhaps companies like Apple could be second-round AI growth companies, marketing optimized private-AI devices via already capable MacBooks or the rumored appliances. While not obviating cloud AI, they could cheaply provide capable models without a subscription while driving their revenue through increased device sales. If the cost of cloud AI rises to cover its expense, this use case will act as a check on subscription prices.
Gigachad 14 hours ago [-]
So basically the proprietary models are devalued to almost 0 in about 4-6 months. Can they recover the training costs + profit margin every 4 months?
Workaccount2 14 hours ago [-]
Coding is basically an edge case for LLMs too.
Pretty much every person in the first (and second) world is using AI now, and only a small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago, which found programming to be only 4% of tokens.
int_19h 12 hours ago [-]
That may be so, but I rather suspect the breakdown would be very different if you only count paid tokens. Coding is one of the few things where you can actually get enough benefit out of AI right now to justify high-end subscriptions (or high pay-per-token bills).
aleph_minus_one 13 hours ago [-]
> Pretty much every person in the first (and second) world is using AI now
This sounds like you live in a huge echo chamber. :-(
chpatrick 5 hours ago [-]
All of my non-techy friends use it; it's the new search engine. I think at this point the people refusing to use it are the echo chamber.
lukan 12 hours ago [-]
It depends on what you count as AI (just Googling makes you use the LLM summary), but even my mother, who is really not tech-savvy, loved what Google Lens can do after I showed her.
Apart from my very old grandmothers, I don't know anyone not using AI.
pests 12 hours ago [-]
How many people do you know? Do you talk to your local shopkeeper? Or the clerk at the gas station? How are they using AI? I'm a pretty techy person with a lot of tech friends, and I know more people not using AI (on purpose, or from lack of knowledge) than people who do.
GeneralMaximus 49 minutes ago [-]
I live in India and a surprising number of people here are using AI.
A lot of public religious imagery is very clearly AI generated, and you can find a lot of it on social media too. "I asked ChatGPT" is a common refrain at family gatherings. A lot of regular non-techie folks (local shopkeepers, the clerk at the gas station, the guy at the vegetable stand) have been editing their WhatsApp profile pictures using generative AI tools.
Some of my lawyer and journalist friends are using ChatGPT heavily, which is concerning. College students too. Bangalore is plastered with ChatGPT ads.
There's even a low-cost ChatGPT plan called ChatGPT Go you can get if you're in India (not sure if this is available in the rest of the world). It costs ₹399/mo or $4.41/mo, but it's completely free for the first year of use.
So yes, I'd say many people outside of tech circles are using AI tools. Even outside of wealthy first-world countries.
lukan 12 hours ago [-]
Hm, quite a few. Like I said, it depends on what you count as AI.
Just Googling means you use AI nowadays.
eru 9 hours ago [-]
Whether Googling something counts as AI has more to do with the shifting definition of AI over time than with Googling itself.
Remember, really back in the day the A* search algorithm was part of AI.
If you had asked anyone in the 1970s about a box that, given a query, pinpoints the right document to answer it (aka Google search in the early 2000s), they definitely would have called it AI.
lukan 3 hours ago [-]
Google gives you an AI summary, reading that means interacting with LLMs.
pests 2 hours ago [-]
Google also gives you ads. Some learn to scroll past before reading.
SoftTalker 7 hours ago [-]
I'm sort of old but not a grandmother. Not using AI.
jauntywundrkind 18 hours ago [-]
Just to point this out: many of these frontier models' costs aren't that far from two orders of magnitude more than what DeepSeek charges. It doesn't compare the same, no, but with coaxing I find it to be a pretty competent coding model, and capable of answering a lot of general queries pretty satisfactorily (but if it's a short session, why economize?). $0.28/M in, $0.42/M out. Opus 4.5 is $5/$25 (17x/60x).
I've been playing around with other models recently (Kimi, GPT Codex, Qwen, others) to try to better appreciate the difference. I knew there was a big price difference, but watching myself feed dollars into the machine rather than nickels has instilled in me quite the reverse appreciation too.
I can only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
happyopossum 17 hours ago [-]
Two orders of magnitude would imply that these models cost $28/m in and $42/m out. Nothing is even close to that.
To me as an engineer, 60x for output (which is most of the cost I see, AFAICT) is not that significantly different from 100x.
I tried to be quite clear with showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, a bulk enough of the way to 100x that yeah I don't feel bad saying it's nearly two orders (it's 1.78 orders of magnitude). To me, your complaint feels rigid & ungenerous.
My post is showing as -1, but I stand by it right now. Arguing over the technicalities here (is 1.78 close enough to 2 orders of magnitude to count?) feels beside the point to me: DeepSeek is vastly more affordable than nearly everything else, putting even Gemini 3 Flash here to shame. And I don't think people are aware of that.
I guess for my own reference, since I didn't do it the first time: at $0.50/$3.00 per M in/out, Gemini 3 Flash here is 1.8x and 7.1x (1e0.25 and 1e0.85) more expensive than DeepSeek.
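The arithmetic, for anyone checking:

    from math import log10

    deepseek = {"in": 0.28, "out": 0.42}   # $/M tokens
    models = {"Opus 4.5": {"in": 5.00, "out": 25.00},
              "Gemini 3 Flash": {"in": 0.50, "out": 3.00}}

    for name, price in models.items():
        for k in ("in", "out"):
            ratio = price[k] / deepseek[k]
            print(f"{name} {k}: {ratio:.1f}x (1e{log10(ratio):.2f})")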
KoolKat23 13 hours ago [-]
I struggle to see the incentive to do this; I have similar thoughts about locally run models. The only use cases I can imagine are small jobs at scale, perhaps something like autocomplete integrated into your deployed application, or extreme privacy, honouring NDAs, etc.
Otherwise, if it's a short prompt or answer, a SOTA (state of the art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong, and a lot more time/human cost is spent on "checking/debugging" any issue or hallucination, so again SOTA is better.
lukan 12 hours ago [-]
"or for extreme privacy"
Or for any privacy/IP protection at all? There is zero privacy when using cloud-based LLM models.
Workaccount2 12 hours ago [-]
Really only if you are paranoid. It's incredibly unlikely that the labs are lying about not training on your data for the API plans that offer it. Breaking trust with outright lies would be catastrophic to any lab right now. Enterprise demands privacy, and the labs will be happy to accommodate (for the extra cost, of course).
mistercheph 9 hours ago [-]
No, it's incredibly unlikely that they aren't training on user data. It's billions of dollars worth of high quality tokens and preference that the frontier labs have access to, you think they would give that up for their reputation in the eyes of the enterprise market? LMAO. Every single frontier model is trained on torrented books, music, and movies.
user34283 1 hours ago [-]
Considering that they will make a lot of money from enterprise, yes, that's exactly what I think.
What I don't think is that I can take someone's opinion on an enterprise service's privacy seriously after they write "LMAO" in caps in their post.
lukan 1 hours ago [-]
I just know many people here complained about the very unclear way Google, for example, communicates what it uses for training data and which plan to choose to opt out of everything, or whether you (as a normal business) even can opt out. Given the whole volatile nature of this thing, I can imagine an easy "oops, we messed up" from Google if it turns out they were in fact using almost everything for training.
The second thing to consider is the whole geopolitical situation. I know companies in Europe are really reluctant to give US companies access to their internal data.
dfsegoat 13 hours ago [-]
> it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high
...and all of that done without any GPUs, as far as I know! [1]
(tl;dr: AFAIK Google trained Gemini 3 entirely on tensor processing units - TPUs)
moffkalast 14 hours ago [-]
Should I not let the "Gemini" name fool me either?
__jl__ 18 hours ago [-]
This is awesome. No preview release either, which is great for production.
They are pushing the prices higher with each release though:
API pricing is now up to $0.50/M for input and $3.00/M for output.
For comparison:
Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output
Gemini 2.5 Flash: $0.30/M for input and $2.50/M for output
Gemini 2.0 Flash: $0.15/M for input and $0.60/M for output
Gemini 1.5 Flash: $0.075/M for input and $0.30/M for output (after price drop)
Gemini 3.0 Pro: $2.00/M for input and $12/M for output
Gemini 2.5 Pro: $1.25/M for input and $10/M for output
Gemini 1.5 Pro: $1.25/M for input and $5/M for output
I think image input pricing went up even more.
Correction: It is a preview model...
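To make the deltas concrete, back-of-the-envelope cost per request at these prices - the token counts are made up:

    prices = {  # $ per million tokens (input, output), from the list above
        "gemini-3-flash": (0.50, 3.00),
        "gemini-2.5-flash": (0.30, 2.50),
        "gemini-2.0-flash": (0.15, 0.60),
    }

    tokens_in, tokens_out = 10_000, 2_000  # a made-up "typical" request

    for model, (p_in, p_out) in prices.items():
        cost = (tokens_in * p_in + tokens_out * p_out) / 1_000_000
        print(f"{model}: ${cost:.4f} per request")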
mips_avatar 18 hours ago [-]
I'm more curious how Gemini 3 flash lite performs/is priced when it comes out. Because it may be that for most non coding tasks the distinction isn't between pro and flash but between flash and flash lite.
srameshc 18 hours ago [-]
Thanks, that was a great breakdown of the cost. I'd just assumed it was the same pricing as before. The pricing probably comes from the confidence and the buzz around Gemini 3.0 as one of the best-performing models. But competition is hot in this area, and we're not far from getting similarly performing models for a cheaper price.
Token usage also needs to be factored in, specifically when thinking is enabled; these newer models find difficult problems easier and use fewer tokens to solve them.
YetAnotherNick 17 hours ago [-]
For comparison, GPT-5 mini is $0.25/M for input and $2.00/M for output, so double the price for input and 50% higher for output.
AuthError 17 hours ago [-]
Flash is closer to Sonnet than to the GPT minis, though.
martythemaniak 16 hours ago [-]
The price increase sucks, but you really do get a whole lot more. They also have the "Flash Lite" series; 2.5 Flash Lite is $0.10/M. Hopefully we see something like a 3.0 Flash Lite for $0.20-0.25/M.
uluyol 18 hours ago [-]
Are these the current prices or the prices at the time the models were released?
__jl__ 17 hours ago [-]
Mostly at the time of release except for 1.5 Flash which got a price drop in Aug 2024.
Google has been discontinuing older models after several months of transition period so I would expect the same for the 2.5 models. But that process only starts when the release version of 3 models is out (pro and flash are in preview right now).
misiti3780 17 hours ago [-]
Is there a website where I can compare OpenAI, Anthropic, and Gemini models on cost per token?
jsnell 15 hours ago [-]
There are plenty, but that's not the comparison you want to be making. There is too much variability in the number of tokens used for a single response, especially once reasoning models became a thing. And it gets even worse when you put the models into a variable-length output loop.
You really need to look at the cost per task. artificialanalysis.ai has a good composite score, measures the cost of running all the benchmarks, and has a 2D intelligence vs. cost graph.
misiti3780 14 hours ago [-]
thanks
deaux 6 hours ago [-]
For reference, the above completely depends on what you're using them for. For many tasks, the number of tokens used is consistent to within 10-20%.
Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and gsuite integration seems like such powerful combination.
Presumably a big motivation for them is to be the first to get something good and cheap enough that they can serve it to every Android device, ahead of whatever the OpenAI/Jony Ive hardware project turns out to be, and way ahead of Apple Intelligence. Speaking for myself, I would pay quite a lot for a truly 'AI-first' phone that actually worked.
That's too bad. Apple's most interesting value proposition is running local inference with big privacy promises. They wouldn't need to be the highest performer to offer something a lot of people might want.
cmckn 7 hours ago [-]
My understanding is Apple will be hosting Gemini models themselves on the private compute system they announced a while back.
floundy 9 hours ago [-]
Apple’s most interesting value proposition was ignoring all this AI junk and letting users click “not interested” on Apple Intelligence and never see it again.
From a business perspective it’s a smart move (inasmuch as “integrating AI” is the default which I fundamentally disagree with) since Apple won’t be left holding the bag on a bunch of AI datacenters when/if the AI bubble pops.
I don’t want to lose trust in Apple, but I literally moved away from Google/Android to try and retain control over my data and now they’re taking me… right back to Google. Guess I’ll retreat further into self-hosting.
willis936 9 hours ago [-]
I also agree with this. After this year, Microsoft has successfully ensured my entire household will never own one of their products again. Apple and Linux make up the entire delta.
As long as Apple doesn't take any crazy left turns with their privacy policy then it should be relatively harmless if they add in a google wrapper to iOS (and we won't need to take hard right turns with grapheneOS phones and framework laptops).
bitpush 5 hours ago [-]
> Apple’s most interesting value proposition was ignoring all this AI junk
Did you forget all the Apple Intelligence stuff? They were never "ignoring" it; if anything, they talked a big talk and then failed hard.
The whole iPhone 16 was marketed as an AI-first phone (including on billboards). They had full-length ads running touting AI benefits.
Apple was never "ignoring" AI or "sitting it out". They were very much in it. And they failed.
2 hours ago [-]
skerit 13 hours ago [-]
Pulling ahead? Depends on the use case, I guess. Three turns into a very basic Gemini CLI session, Gemini 3 Pro had already messed up a simple `Edit` tool call.
And it's awfully slow. In 27 minutes it did 17 tool calls and only managed to modify 2 files. Meanwhile, Claude Code flies through the same task in 5 minutes.
RobinL 13 hours ago [-]
Yeah - agree, Anthropic much better for coding. I'm more thinking about the 'average chat user' (the larger potential userbase), most of whom are on chatgpt.
nowittyusername 5 hours ago [-]
Knowing Google's MO, it's most likely not the model but their harness system that's the issue. God, they are so bad at their UIs and agentic coding harnesses...
eldenring 5 hours ago [-]
I think Claude is genuinely much smarter, and more lucid.
anukin 14 hours ago [-]
What will you use the ai in the phone to do for you? I can understand tablets and smart glasses being able to leverage smol AI much better than a phone which is reliant on apps for most of the work.
Workaccount2 13 hours ago [-]
I desperately want to be able to real-time dictate actions to take on my phone.
Stuff like:
"Open Chrome, new tab, search for xyz, scroll down, third result, copy the second paragraph, open whatsapp, hit back button, open group chat with friends, paste what we copied and send, send a follow-up laughing tears emoji, go back to chrome and close out that tab"
All while being able to just quickly glance at my phone. There is already a tool like this, but I want the parsing/understanding of an LLM and super fast response times.
KoolKat23 13 hours ago [-]
This new model is absurdly quick on my phone, even for launch day. I wonder if it's additional capacity/lower demand or if this is what we can expect going forward.
On a related note, why would you want to break down your tasks to that level? Surely it should be smart enough to do some of that without you asking, and you could just state your end goal.
pests 12 hours ago [-]
This has been my dream for voice control of PC for ages now. No wake word, no button press, no beeping or nagging, just fluently describe what you want to happen and it does.
Without a wake word, it would have to listen to and process all audio it hears. Do you really want everything captured near the device/mic to be sent to external servers?
TeMPOraL 4 hours ago [-]
I might if that's what it takes to make it finally work. The fueling of the previous 15 years was not worth it, but that was then.
procaryote 13 hours ago [-]
is that faster to say than do, or is it an accessibility or while-driving need?
11 hours ago [-]
CamperBob2 6 hours ago [-]
I don't understand that use case at all. How can you tell it to do all that stuff, if you aren't sitting there glued to the screen yourself?
TeMPOraL 3 hours ago [-]
Because typing on mobile is slow, app switching is slow, and text selection and copy-paste are torture. Pretty much the only one of the interactions OP listed that isn't painful is scrolling.
Plus, if the above worked, the higher level interactions could trivially work too. "Go to event details", "add that to my calendar".
FWIW, I'm starting to embrace using Gemini as general-purpose UI for some scenarios just because it's faster. Most common one, "<paste whatever> add to my calendar please."
wiseowise 1 hours ago [-]
Analyse e-mails/text/music/videos, edit photos, summarization, etc.
qnleigh 11 hours ago [-]
This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recognition).
ipsum2 11 hours ago [-]
ChatGPT 5.2 Thinking is significantly better quality for most knowledge work, but it trades that off against speed.
energy123 10 hours ago [-]
That has been my experience. Primarily because it is allowed to expend far more test-time tokens than Gemini 3.0 Pro to solve the same prompt.
eli 10 hours ago [-]
And GPT costs 4x as much
Palmik 5 hours ago [-]
No offense, but that seems like a poor benchmark. These initial vibe checks are easily swayed by personal brand biases.
qnleigh 4 hours ago [-]
Fair. No benchmark is perfect.
I do pay special attention to what the most negative comments say (which in this case are unusually positive). And people discussing performance on their own personal benchmarks.
awestroke 5 hours ago [-]
The brand bias is heavily against Google, not in Google's favor.
Palmik 4 hours ago [-]
In the context of AI, I'm mostly seeing anti-OpenAI, pro-Google bias.
clarkmoreno 4 hours ago [-]
Facts. These HN threads are half astroturfing and paid shills. It's near impossible to decipher which takes are authentic unless they come from actual colleagues or people IRL.
fariszr 18 hours ago [-]
These flash models keep getting more expensive with every release.
Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window?
Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world usage.
> Gemini 3 Flash achieves a score of 78%, outperforming not only the 2.5 series, but also Gemini 3 Pro. It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.
The replacement for the old Flash models will probably be 3.0 Flash Lite, then.
thecupisblue 18 hours ago [-]
Yes, but the 3.0 Flash is cheaper, faster and better than 2.5 Pro.
So if 2.5 Pro was good for your use case, you just got a better model for about 1/3rd of the price. It might hurt the wallet a bit more if you currently use 2.5 Flash and want an upgrade - which is fair, tbh.
aoeusnth1 18 hours ago [-]
I think it's good, they're raising the size (and price) of flash a bit and trying to position Flash as an actually useful coding / reasoning model. There's always lite for people who want dirt cheap prices and don't care about quality at all.
It's extremely fast on good hardware, quite smart, and can support up to 1m context with reasonable accuracy
mips_avatar 17 hours ago [-]
For my app's evals, Gemini Flash and Grok 4 Fast are the only ones worth using. I'd love for an open-weights model to compete in this arena, but I haven't found one.
scrollop 15 hours ago [-]
This one is more powerful than OpenAI models, including GPT 5.2 (which is worse on various benchmarks than 5.1, and that's with 5.2 running on XHIGH whilst the others were on high, e.g.: https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )
The cost of end-to-end task resolution should be cheaper even if single-inference cost is higher: you need fewer loops to solve a problem now.
fariszr 18 hours ago [-]
Sure, but for simple tasks that require a large context window, aka the typical use case for 2.0 Flash, it's still significantly more expensive.
Workaccount2 13 hours ago [-]
So Gemini 3 Flash (non-thinking) is now the first model to get 50% on my "count the dog legs" image test.
Gemini 3 Pro got 20%, and everyone else has gotten 0%. I saw benchmarks showing 3 Flash almost trading blows with 3 Pro, so I decided to try it.
Basically, it is an image showing a dog with 5 legs, an extra one photoshopped onto its torso. Every model counts 4, and Gemini 3 Pro, while also counting 4, said the dog had "large male anatomy". However, it failed a follow-up, saying 4 again.
3 Flash counted 5 legs on the same image, but only after I added a distinct "tattoo" to each leg as an assist. These tattoos didn't help 3 Pro or the other models.
So it is the first of all the models I have tested to count 5 legs on the "tattooed legs" image. It still counted only 4 legs on the image without the tattoos. I'll give it 1/2 credit.
Valakas_ 4 hours ago [-]
What if you also number the legs, but with an error like: 1,2,3,5,6. Or 1,2,3, ,4.
simonsarris 18 hours ago [-]
Even before this release the tools (for me: Claude Code and Gemini for other stuff) reached a "good enough" plateau that means any other company is going to have a hard time making me (I think soon most users) want to switch. Unless a new release from a different company has a real paradigm shift, they're simply sufficient. This was not true in 2023/2024 IMO.
With this release the "good enough" and "cheap enough" intersect so hard that I wonder if this is an existential threat to those other companies.
bgirard 18 hours ago [-]
Why wouldn't you switch? The cost to switch is near zero for me. Some tools have built in model selectors. Direct CLI/IDE plug-ins practically the same UI.
azuanrb 18 hours ago [-]
Not OP, but I feel the same way. Cost is just one of the factors. I'm used to the Claude Code UX, and my CLAUDE.md works well with my workflow too. Unless there's a significant improvement, changing to new models every few months is going to hurt me more.
bgirard 17 hours ago [-]
I used to think this way. But I moved to AGENTS.md. Now I use the different UI as a mental context separation. Codex is working on Feature A, Gemini on feature B, Claude on Feature C. It has become a feature.
rolisz 15 hours ago [-]
You're assuming that different models need the same stuff in AGENTS.md
In my experience, to get the best performance out of different models, they need slightly different prompting.
NamlchakKhandro 12 hours ago [-]
Just switch to OpenCode and stop locking yourself into a particular provider's way of doing things.
There's a plugin for everything that mimics anything the others are doing.
azuanrb 3 hours ago [-]
Being open does not magically make everything better. People are willing to pay for Claude Code for many valid reasons. You are also assuming I have never used OpenCode, which is incorrect. Claude is simply my preference.
I see all of these tools as IDEs. Whether someone locks into VS Code, JetBrains, Neovim, or Sublime Text comes down to personal preference. Everyone works differently, and that is completely fine.
nevir 15 hours ago [-]
I think a big part of the switching cost is the cost of learning a different model's nuances. Having good intuition for what works/doesn't, how to write effective prompts, etc.
Maybe someday future models will all behave similarly given the same prompt, but we're not quite there yet
NamlchakKhandro 12 hours ago [-]
Because some people are restricted by company policy to only use providers with which they have a legally binding agreement to not use their chats as training data.
theLiminator 18 hours ago [-]
For me, the last wave of models finally started delivering on their agentic coding promises.
orourke 18 hours ago [-]
This has been my experience exactly. Even over just the last few weeks I’ve noticed a dramatic drop in having to undo what the agents have done.
But for me, the previous models were routinely-wrong time wasters that added no overall speed increase once you take the lottery of whether they'd be correct into account.
catigula 18 hours ago [-]
Correct. Opus 4.5 'solved' software engineering. What more do I need? Businesses need uncapped intelligence, and that is a very high bar. Individuals often don't.
gaigalas 17 hours ago [-]
If Opus is one-size-fits-all, then why does Claude keep the other series? (rhetorical)
Opus and Sonnet are slower than Haiku. For lots of less sophisticated tasks, you benefit from the speed.
All vendors do this. You need smaller models that you can rapid-fire for lots of other reasons than vibe coding.
Personally, I actually use more smaller models than the sophisticated ones. Lots of small automations.
dimitri-vs 9 hours ago [-]
Yes, all the major CLIs (Claude Code, Codex, etc) and many agentic applications use a large model main agent with task delegation to small model sub-agent. For example in CC using Opus4.5 it will delegate an Explore task to a Haiku/Sonnet subagent or multiple subagents.
gaigalas 3 hours ago [-]
The agent interfaces are for human interaction. Some tasks can be fully unattended though. For those, I find smaller models more capable due to their speed.
Think beyond interfaces. I'm talking about rapid-firing hundreds of small agents and having zero human interaction with them. The feedback is deterministic (non agentic) and automated too.
alex1138 16 hours ago [-]
I just can't stop thinking though about the vulnerability of training data
You say good enough. Great, but what if I as a malicious person were to just make a bunch of internet pages containing things that are blatantly wrong, to trick LLMs?
calflegal 16 hours ago [-]
The internet has already tried this, for a few decades. The garbage is in the corpus; it gets weighted as such.
floundy 9 hours ago [-]
>a bunch of internet pages containing things that are blatantly wrong
So Reddit?
I’d imagine the AI companies have all the “pre AI internet” data they scraped very carefully catalogued.
szundi 18 hours ago [-]
[dead]
kingstnap 18 hours ago [-]
It has a SimpleQA score of 69%. That's a benchmark that tests knowledge of extremely niche facts, so 69% is actually ridiculously high (Gemini 2.5 *Pro* had 55%) and reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash model.
I'm speculating, but Google might have figured out some training magic to balance out the information storage in model capacity. That, or this Flash model has a huge number of parameters or something.
I'm confused about the "Accuracy vs Cost" section. Why is Gemini 3 Pro so cheap? It's basically the cheapest model in the graph (sans Llama 4 and Mistral Large 3) by a wide margin, even compared to Gemini 3 Flash. Is that an error?
albumen 13 hours ago [-]
I’m amazed by how much Gemini 3 flash hallucinates; it performs poorly in that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), opus 4.5 and 4.5 haiku are.
Can someone explain how Gemini 3 pro/flash then do so well then in the overall Omniscience: Knowledge and Hallucination Benchmark?
wasabi991011 9 hours ago [-]
Hallucination rate is hallucination / (hallucination + partial + ignored), while omniscience is correct - hallucination.
One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure, it is also more likely to be correct. This is consistent with it having the best accuracy score.
Wyverald 11 hours ago [-]
I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.
> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant
This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant, by definition.
For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
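To make those definitions concrete, a quick sketch. This is my reading of the definitions upthread, not the benchmark's actual code:

```python
# Metric definitions as stated upthread (my reading, not the benchmark's code):
# hallucination rate = hallucinated / (hallucinated + partial + ignored),
#   i.e. of the questions it got wrong, how many it answered confidently
#   instead of hedging or abstaining
# omniscience index  = (correct - hallucinated) / total

def metrics(correct, hallucinated, partial, ignored):
    total = correct + hallucinated + partial + ignored
    wrong = hallucinated + partial + ignored
    hallucination_rate = hallucinated / wrong if wrong else 0.0
    omniscience = (correct - hallucinated) / total
    return hallucination_rate, omniscience

# The 99-out-of-100 example above: the single miss is a confident
# hallucination, so the hallucination rate is 100%, yet the omniscience
# index stays very high.
hr, oi = metrics(correct=99, hallucinated=1, partial=0, ignored=0)
print(f"hallucination rate: {hr:.0%}, omniscience index: {oi:+.2f}")
# hallucination rate: 100%, omniscience index: +0.98
```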
int_19h 12 hours ago [-]
> reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model
That's what MoE is for. It might be that with their TPUs, they can afford lots of params, just so long as the activated subset for each token is small enough to maintain throughput.
tanh 17 hours ago [-]
This will be fantastic for voice. I presume Apple will use it
leumon 16 hours ago [-]
Or could it be that it's using tool calls in reasoning (e.g. a google search)?
GaggiX 18 hours ago [-]
>or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.
More experts with a lower percentage of active ones -> more sparsity.
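For anyone unfamiliar, here's a toy sketch of what that means mechanically. All sizes are made up for illustration; nothing below is Google's actual architecture:

```python
import numpy as np

# Toy top-k MoE router: per token, a gate scores all experts and only the
# top-k actually run. Total parameters scale with n_experts; compute per
# token scales only with k, so more experts + smaller k = more sparsity.
rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 128, 4            # 128 experts, only 4 active

gate = rng.normal(size=(d_model, n_experts))              # router weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # toy expert FFNs

def moe_forward(x):
    scores = x @ gate                      # gating logits, one per expert
    top = np.argsort(scores)[-k:]          # pick the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the selected k only
    # Only k of the n_experts do any work for this token:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape, f"active fraction: {k / n_experts:.1%}")  # (64,) 3.1%
```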
cakealert 3 hours ago [-]
Gemini 2.5 was a full broadside on OpenAI's ship.
After Gemini 3.0 the OpenAI damage control crews all drowned.
Not only is it vastly better, it's also free.
I find this particular benchmark to be in agreement with my experiences: https://simple-bench.com
mmaunder 17 hours ago [-]
I think about what would be most terrifying to Anthropic and OpenAI, i.e. the absolute scariest thing that Google could do. I think this is it: release low-latency, low-priced models with high cognitive performance and a big context window, especially in the coding space, because that is direct, immediate, very high ROI for the customer.
Now, imagine for a moment they had also vertically integrated the hardware to do this.
JumpCrisscross 16 hours ago [-]
> think about what would be most terrifying to Anthropic and OpenAI
The most terrifying thing would be Google expanding its free tiers.
wasabi991011 8 hours ago [-]
It's the only model provider that has offered a decent deal to students: a full year of Google AI Pro.
Granted, this doesn't give API access, only what Google calls their "consumer AI products", but it makes a huge difference when ChatGPT only allows a handful of document uploads and deep research queries per day.
Davidzheng 9 hours ago [-]
on aistudio the free tier limits on all models are decent
avazhi 17 hours ago [-]
"Now, imagine for a moment they had also vertically integrated the hardware to do this."
Then you realise you aren't imagining it.
iwontberude 16 hours ago [-]
“And then imagine Google designing silicon that doesn’t trail the industry. While you are there we may as well start to imagine Google figures out how to support a product lifecycle that isn’t AdSense”
Google is great on the data science alone; everything else is an afterthought.
Oh I got your joke, sir - but as you can see from the other comment, there are techies who still don't have even a rudimentary understanding of tensor cores, let alone the wider public and many investors. Over the next year or two the gap between Google and everybody else, even those they license their hardware to, is going to explode.
iwontberude 16 hours ago [-]
Exactly my point: they have bespoke offerings, but when they compete head-to-head on performance they get smoked. See their Tensor processor, used in the beleaguered Pixel; they are in last place.
TPUs, on the other hand, are ASICs, and we are more than familiar with the limited applicability, high performance, and high barriers to entry associated with them. TPUs will be worthless as the AI bubble keeps deflating and excess capacity is everywhere.
The people who don't have a rudimentary understanding are the Wall Street boosters who treat it as the primary threat to Nvidia or as a moat for Google (hint: it is neither).
It's 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k - notable that the new Flash model doesn’t have a price increase after that 200,000 token point.
It’s also twice the price of GPT-5 Mini for input, half the price of Claude 4.5 Haiku.
caminanteblanco 18 hours ago [-]
Does anyone else understand what the difference is between Gemini 3 'Thinking' and 'Pro'? Thinking "Solves complex problems" and Pro "Thinks longer for advanced math & code".
I assume that these are just different reasoning levels for Gemini 3, but I can't even find mention of there being 2 versions anywhere, and the API doesn't even mention the Thinking-Pro dichotomy.
peheje 18 hours ago [-]
I think:
Fast = Gemini 3 Flash without thinking (or very low thinking budget)
Thinking = Gemini 3 flash with high thinking budget
Thank you! I wish they had clearer labelling (or at the very least some documentation) explaining this.
flakiness 18 hours ago [-]
It seems:
- "Thinking" is Gemini 3 Flash with higher "thinking_level"
- Prop is Gemini 3 Pro. It doesn't mention "thinking_level" but I assume it is set to high-ish.
lysace 17 hours ago [-]
Really stupid question: How is Gemini-like 'thinking' separate from artificial general intelligence (AGI)?
When I ask Gemini 3 Flash this question, the answer is vague, but agency comes up a lot. Gemini's thinking is always triggered by a query.
This seems like a higher-level programming issue to me. Turn it into a loop. Keep the context. Those two things make it costly, for sure. But does that make it an AGI? Surely Google has tried this?
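What that loop looks like, as a minimal sketch. `call_llm` and `run_tool` here are stand-in stubs I made up, not a real SDK:

```python
# Minimal sketch of the "turn it into a loop, keep the context" idea.
# call_llm and run_tool are hypothetical stand-ins, not a real API.

def call_llm(context):
    # Stand-in: a real version would send `context` to a model API and
    # return its reply, which may contain a tool call.
    return {"role": "assistant", "content": "done", "tool_call": None}

def run_tool(tool_call):
    # Stand-in: execute the requested tool (shell, browser, editor, ...).
    return "tool output"

def agent(goal, max_steps=20):
    context = [{"role": "user", "content": goal}]   # keep the whole history
    for _ in range(max_steps):
        reply = call_llm(context)                   # model sees everything so far
        context.append(reply)
        if reply.get("tool_call"):                  # model wants to act
            context.append({"role": "tool", "content": run_tool(reply["tool_call"])})
        else:                                       # no action requested: done
            return reply["content"]
    return "gave up after max_steps"

print(agent("rename this variable across the repo"))
```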
dcre 14 hours ago [-]
This is what every agentic coding tool does. You can try it yourself right now with the Gemini CLI, OpenCode, or 20 other tools.
CamperBob2 14 hours ago [-]
I don't think we'll get genuine AGI without long-term memory, specifically in the form of weight adjustment rather than just LoRAs or longer and longer contexts. When the model gets something wrong and we tell it "That's wrong, here's the right answer," it needs to remember that.
Which obviously opens up a can of worms regarding who should have authority to supply the "right answer," but still... lacking the core capability, AGI isn't something we can talk about yet.
LLMs will be a part of AGI, I'm sure, but they are insufficient to get us there on their own. A big step forward but probably far from the last.
bananaflag 12 hours ago [-]
> When the model gets something wrong and we tell it "That's wrong, here's the right answer," it needs to remember that.
Problem is that when we realize how to do this, we will have each copy of the original model diverge in wildly unexpected ways. Like we have 8 billion different people in this world, we'll have 16 gazillion different AIs. And all of them interacting with each other and remembering all those interactions. This world scares me greatly.
andai 11 hours ago [-]
AGI is hard but we can solve most tasks with artificial stupidity in an `until done`.
lysace 11 hours ago [-]
Just a matter of time and cost. Eventually...
criley2 14 hours ago [-]
Advanced reasoning LLMs simulate many parts of AGI and feel really smart, but they fall short in many critical ways.
- An AGI wouldn't hallucinate, it would be consistent, reliable and aware of its own limitations
- An AGI wouldn't need extensive re-training, human reinforced training, model updates. It would be capable of true self-learning / self-training in real time.
- An AGI would demonstrate real genuine understanding and mental modeling, not pattern matching over correlations
- It would demonstrate agency and motivation, not be purely reactive to prompting
- It would have persistent integrated memory. LLMs are stateless and driven by the current context.
- It should even demonstrate consciousness.
And more. I agree that what we've designed is truly impressive and simulates intelligence at a really high level. But true AGI is far more advanced.
waffletower 13 hours ago [-]
Humans can fail at some of these qualifications, often without guile:
- being consistent and knowing their limitations
- people do not universally demonstrate effective understanding and mental modeling.
I don't believe the "consciousness" qualification is at all appropriate, as I would argue that it is a projection of the human machine's experience onto an entirely different machine with a substantially different existential topology -- relationship to time and sensorium. I don't think artificial general intelligence is a binary label which is applied if a machine rigidly simulates human agency, memory, and sensing.
versteegen 2 hours ago [-]
> - It should even demonstrate consciousness.
I disagreed with most of your assertions even before I hit the last point. This is just about the most extreme thing you could ask for. I think very few AI researchers would agree with this definition of AGI.
lysace 14 hours ago [-]
Thanks for humoring my stupid question with a great answer. I was kind of hoping for something like this :).
xpil 15 hours ago [-]
My main issue with Gemini is that business accounts can't delete individual conversations. You can only enable or disable Gemini, or set a retention period (3 months minimum), but there's no way to delete specific chats. I'm a paying customer, prices keep going up, and yet this very basic feature is still missing.
strstr 10 hours ago [-]
For my personal usage of AI Studio, I had to use AutoHotkey to record and replay my mouse deleting my old chats. I thought about cooking up a browser extension, but never got around to it.
testfrequency 14 hours ago [-]
This is the #1 thing that keeps me from going all in on Gemini.
Their retention controls for both consumer and business suck. It’s the worst of any of the leaders.
ComputerGuru 11 hours ago [-]
Use it over the API.
outside2344 17 hours ago [-]
I don't want to say OpenAI is toast for general chat AI, but it sure looks like they are toast.
They have been for a while. They had a first-mover advantage that kept them in the lead, but it's not anything others couldn't throw money at and eventually catch up on. I remember when, not so long ago, everyone was talking about how Google had lost the AI race, and now it feels like they're the ones chasing Anthropic.
Gigachad 14 hours ago [-]
I’ve fully switched over to Gemini now. It seems significantly more useful, and is less of an automatic glaze machine that just restates your question and how smart you are for asking it.
niek_pas 2 hours ago [-]
That’s funny, I’ve had the exact opposite experience. Gemini starts every answer to a coding question with, “you have hit upon a fundamental insight in zyx”. ChatGPT usually starts with, “the short answer? Xyz.”
radicality 12 hours ago [-]
How do I get Gemini to be more proactive in finding/double-checking itself against new world information and doing searches?
For that reason I still find chatgpt way better for me, many things I ask it first goes off to do online research and has up to date information - which is surprising as you would expect Google to be way better at this.
For example, was asking Gemini 3 Pro recently about how to do something with a “RTX 6000 Blackwell 96GB” card, and it told me this card doesn’t exist and that I probably meant the rtx 6000 ada… Or just today I asked about something on macOS 26.2, and it told me to be cautious as it’s a beta release (it’s not).
Whereas with chatgpt I trust the final output more since it very often goes to find live sources and info.
leemoore 11 hours ago [-]
Gemini is bad at this sort of thing, but I find all models tend to do this to some degree. You have to know this could be coming and give it indicators to assume that its training data is going to be out of date, and that it must web-search the latest as of today or this month. They aren't taught to ask themselves "is my understanding of this topic based on info that is likely out of date?" up front; they only understand it after the fact. I usually just get annoyed and low-key condescend to it for assuming its old-ass training data is sufficient grounding for correcting me.
That epistemic calibration is something they are capable of thinking through if you point it out. But they aren't trained to stop and ask/check themselves on how confident they have a right to be - a metacognitive interrupt that is socialized into girls between 6 and 9 and into boys between 11 and 13. Calibrating to appropriate confidence levels about one's knowledge is a cognitive skill that models aren't taught and that humans learn socially by pissing off other humans. It's why we get pissed off at models when they correct us with old, bad data: our anger is the training signal to stop doing that. They just can't take in that training signal at inference time.
andai 11 hours ago [-]
Yeah any time I mention GPT-5, the other models start having panic attacks and correcting it to GPT-4. Even if it's a model name in source code!
They think GPT-5 won't be released until the distant future, but what they don't realize is we have already arrived ;)
jaigupta 8 hours ago [-]
If only I could figure out how to use it. I have been using Claude Code and enjoy it. I sometimes also try Codex, which is also not bad.
Trying to use the Gemini CLI is such a pain. I bought GDP Premium, configured GCP, set up environment variables, enabled preview features in the CLI, did all the dance around it, and it still won't let me use Gemini 3. Why the hell am I even trying so hard?
jdanbrown 7 hours ago [-]
Have you tried OpenRouter (https://openrouter.ai)? I’ve been happy using it as a unified api provider with great model coverage (including Google, Anthropic, OpenAI, Grok, and the major open models). They charge 5% on top of each model’s api costs, but I think it’s worth it to have one centralized place to insert my money and monitor my usage. I like being able to switch out models without having to change my tools, and I like being able to easily head-to-head compare claude/gemini/gpt when I get stuck on a tricky problem.
Then you just have to find a coding tool that works with OpenRouter. Afaik claude/codex/cursor don’t, at least not without weird hacks, but various of the OSS tools do — cline, roo code, opencode, etc. I recently started using opencode (https://github.com/sst/opencode), which is like an open version of claude code, and I’ve been quite happy with it. It’s a newer project so There Will Be Bugs, but the devs are very active and responsive to issues and PRs.
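For the curious: OpenRouter exposes an OpenAI-compatible endpoint, so switching is mostly a base URL and a model string. The exact slug for the new Flash model below is a guess; check their model list:

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol at its own base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Swapping providers is just a model-string change. The exact slug here is
# an assumption; look up the real one at openrouter.ai/models.
resp = client.chat.completions.create(
    model="google/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(resp.choices[0].message.content)
```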
Palmik 5 hours ago [-]
Why would you use OpenRouter rather than some local proxy like LiteLLM? I don't see the point of sharing data with more third parties and paying for the privilege.
Not to mention that for coding, it's usually more cost efficient to get whatever subscription the specific model provider offers.
jaigupta 6 hours ago [-]
I have used OpenRouter before, but in this case I was trying to use it like Claude Code (agentic coding with a simple fixed monthly subscription). I don't want to pay per use via direct APIs, as I am afraid of surprising bills. My point was: why does Google make it so damn hard, even for paid subscriptions where it was supposed to just work?
qingcharles 6 hours ago [-]
Have you tried Google Antigravity? I use that and GitHub Copilot when I want to use Gemini for coding tasks.
android521 7 hours ago [-]
Use Cursor. It allows you to choose any model.
zhyder 18 hours ago [-]
Glad to see big improvement in the SimpleQA Verified benchmark (28->69%), which is meant to measure factuality (built-in, i.e. without adding grounding resources). That's one benchmark where all models seemed to have low scores until recently. Can't wait to see a model go over 90%... then will be years till the competition is over number of 9s in such a factuality benchmark, but that'd be glorious.
jug 10 hours ago [-]
Yes, that's very good, because my main use case for Flash is queries depending on world knowledge. Not science or engineering problems, but the kind of thing you'd ask someone who has really broad knowledge and can give quick, straightforward answers.
zkmon 1 hours ago [-]
I asked it to draft an email with a business proposal, and it put the date on the letter as October 26, 2023. Then I asked it why it did so. It replied that the templates it was trained on might be anchored to that date. Gemini 3 Pro also puts that same date on letters. I didn't ask it why.
It's a cool release, but if someone on the Google team reads this:
Flash 2.5 is awesome in terms of latency and total response time without reasoning. In quick tests, this model seems to be 2x slower. So for certain use cases, like quick one-token classification, Flash 2.5 is still the better model.
Please don't stop optimizing for that!
Yes, I tried it with minimal, and it's roughly 3 seconds for prompts that take Flash 2.5 1 second.
On that note, it would be nice to get these benchmark numbers for the different reasoning settings.
retropragma 15 hours ago [-]
That's more of a flash-lite thing now, I believe
bobviolier 13 hours ago [-]
This might also have to do with it being a preview, and only available on the global region?
Tiberium 15 hours ago [-]
You can still set thinking budget to 0 to completely disable reasoning, or set thinking level to minimal or low.
andai 11 hours ago [-]
>You cannot disable thinking for Gemini 3 Pro. Gemini 3 Flash also does not support full thinking-off, but the minimal setting means the model likely will not think (though it still potentially can). If you don't specify a thinking level, Gemini will use the Gemini 3 models' default dynamic thinking level, "high".
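If it helps, the level is set per request. Here's a minimal sketch with the google-genai Python SDK, assuming the thinking_level field described in the docs quoted above; the preview model name is also an assumption and may change at GA:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Per the docs quoted above, Gemini 3 uses thinking levels instead of the
# old thinking_budget knob; "minimal" means Flash will usually skip thinking.
# Field and model names here are my reading of the docs, not gospel.
resp = client.models.generate_content(
    model="gemini-3-flash-preview",  # preview name; assumption
    contents="Classify as spam or not: 'You won a free cruise!'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
    ),
)
print(resp.text)
```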
Pricing is $0.5 / $3 per million input/output tokens. 2.5 Flash was $0.3 / $2.5. That's a ~67% increase in input token pricing and a 20% increase in output token pricing.
For comparison, from 2.5 Pro ($1.25 / $10) to 3 Pro ($2 / $12), there was a 60% increase in input token pricing and a 20% increase in output token pricing.
> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.
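For anyone checking the arithmetic on those increases:

```python
# Quick check of the percentage increases quoted above ($ per 1M tokens).
pairs = {
    "Flash input":  (0.30, 0.50),    # 2.5 Flash -> 3 Flash
    "Flash output": (2.50, 3.00),
    "Pro input":    (1.25, 2.00),    # 2.5 Pro -> 3 Pro
    "Pro output":   (10.00, 12.00),
}
for name, (old, new) in pairs.items():
    print(f"{name}: +{new / old - 1:.0%}")
# Flash input: +67%, Flash output: +20%, Pro input: +60%, Pro output: +20%
```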
Tiberium 15 hours ago [-]
Yes, but also most of the increase in 3 Flash is in the input context price, which isn't affected by reasoning.
int_19h 12 hours ago [-]
It is affected if it has to round-trip, e.g. because it's making tool calls.
Wild how this beats 2.5 Pro in every single benchmark. Don't think this was true for Haiku 4.5 vs Sonnet 3.5.
FergusArgyll 18 hours ago [-]
Sonnet 3.5 might have been better than opus 3. That's my recollection anyhow
SyrupThinker 18 hours ago [-]
I wonder if this suffers from the same issue as 3 Pro, that it frequently "thinks" for a long time about date incongruity, insisting that it is 2024, and that information it receives must be incorrect or hypothetical.
Just avoiding/fixing that would probably speed up a good chunk of my own queries.
robrenaud 18 hours ago [-]
Omg, it was so frustrating to say:
Summarize recent working arxiv url
And then it tells me the date is from the future and it simply refuses to fetch the URL.
16 hours ago [-]
tootyskooty 18 hours ago [-]
Since it now includes 4 thinking levels (minimal-high) I'd really appreciate if we got some benchmarks across the whole sweep (and not just what's presumably high).
Flash is meant to be a model for lower cost, latency-sensitive tasks. Long thinking times will both make TTFT >> 10s (often unacceptable) and also won't really be that cheap?
happyopossum 16 hours ago [-]
Google appears to be changing what Flash is "meant for" with this release - the capability it has, along with the thinking budgets, makes it superior to previous Pro models in both outcome and speed. The likely-soon-coming Flash Lite will fit right in where Flash used to be - cheap and fast.
jug 18 hours ago [-]
Looks like a good workhorse model, like I felt 2.5 Flash was at its time of launch. I hope I can build confidence with it, because it'll be good for offloading Pro costs/limits, and of course the speed is always nice for more basic coding or queries. I'm impressed by and curious about the recent extreme gains on ARC-AGI-2 from 3 Pro, GPT-5.1 and now even 3 Flash.
rw2 5 hours ago [-]
They didn't put Opus 4.5 on the model card for comparison.
whinvik 18 hours ago [-]
Ok, I was a bit addicted to Opus 4.5 and was starting to feel like there was nothing like it.
Turns out Gemini 3 Flash is pretty close. The Gemini CLI is not as good, but the model more than makes up for it.
The weird part is that Gemini 3 Pro is nowhere near as good an experience. Maybe because it's just so slow.
scrollop 15 hours ago [-]
Yes! Gemini 3 Pro is significantly slower than Opus (surprisingly), and I prefer Opus' output.
I might use Flash over Haiku for my MCP research/transcriber/minor-task model now, though (will test, of course).
__jl__ 17 hours ago [-]
I will have to try that. My Cursor bill got pretty high with Opus 4.5. I never considered Opus before the 4.5 price drop, but now it's hard to change... :)
diamondfist25 17 hours ago [-]
$100 Claude max is the best subscription I’ve ever had.
Well worth every penny now
vanviegen 3 hours ago [-]
Or a $40 GitHub copilot plan also gets you a lot of Opus usage.
Obertr 17 hours ago [-]
At this point in time, I'm starting to believe OAI is very much behind in the model race, and it can't be reversed.
The image model they released is much worse than Nano Banana Pro; the Ghibli moment did not happen again.
Their GPT 5.2 is obviously overfit on benchmarks, per the consensus of many developers and friends I know. So Opus 4.5 stays on top when it comes to coding.
The weight of the ad money from Google, plus general direction and founder sense from Brin, brought the massive giant back to life.
None of my company's workflows run on OAI GPT right now. Even though we love their Agents SDK, after the Claude Agent SDK it feels like peanuts.
avazhi 17 hours ago [-]
"At this point in time I start to believe OAI is very much behind on the models race and it can't be reversed"
This has been true for at least 4 months and yeah, based on how these things scale and also Google's capital + in-house hardware advantages, it's probably insurmountable.
drawnwren 16 hours ago [-]
OAI also got talent-mined. Their top intellectual leaders left after fights with sama, then Meta took a bunch of their mid-senior talent, while Google had the opposite: they brought Noam and Sergey back.
mmaunder 17 hours ago [-]
Yeah, the only thing standing in Google's way is Google. And it's the easy stuff, like sensible billing models, easy-to-use docs and consoles that make sense and don't require 20 hours to learn/navigate, and then just the slew of bugs in Gemini CLI that are basic usability and model API interaction things. The only differentiator that OpenAI still has is polish.
Edit: and just to add an example: OpenAI's Codex CLI billing is easy for me. I just sign up for the base package and then add extra credits, which get used automatically once I'm through my weekly allowance. With Gemini CLI I'm using my OAuth account, and then having to rotate API keys once I've used that up.
Also, Gemini CLI loves spewing out its own chain of thought when it gets into a weird state.
Also Gemini CLI has an insane bias to action that is almost insurmountable. DO NOT START THE NEXT STAGE still has it starting the next stage.
Also Gemini CLI has been terrible at visibility on what it's actually doing at each step - although that seems a bit improved with this new model today.
mips_avatar 16 hours ago [-]
I'd be curious how many people use openrouter byok just to avoid figuring out the cloud consoles for gcp/azure.
vanviegen 3 hours ago [-]
Openrouter is great! Prepaid, no surprise bills. Easily switch between any models you desire. Dead simple interface. Reliable. What's not to like?
mmaunder 16 hours ago [-]
Agreed. It's ridiculous.
visarga 15 hours ago [-]
I do. Gave up using Gemini directly.
mips_avatar 13 hours ago [-]
I mean I do too, had a really odd Gemini bug until I did byok on openrouter
ewoodrich 8 hours ago [-]
Gemini CLI via a Google One plan is the regular consumer billing flow which is pretty straightforward.
GenerWork 16 hours ago [-]
I'm actually liking 5.2 in Codex. It's able to take my instructions, do a good job at planning out the implementation, and will ask me relevant questions around interactions and functionality. It also gives me more tokens than Claude for the same price. Now, I'm trying to white label something that I made in Figma so my use case is a lot different from the average person on this site, but so far it's my go to and I don't see any reason at this time to switch.
gpt5 16 hours ago [-]
I've noticed that when it comes to evaluating AI models, most people simply don't ask difficult enough questions. So everything is good enough, and the preference comes down to speed and style.
It's when things get difficult, like in the coding case you mentioned, that we can see OpenAI still has the lead. The same is true for the image model: prompt adherence is significantly better than Nano Banana, especially on more complex queries.
int_19h 12 hours ago [-]
I'm currently working on a Lojban parser written in Haskell. This is a fairly complex task that requires a lot of reasoning. And I tried out all the SOTA agents extensively to see which one works the best. And Opus 4.5 is running circles around GPT-5.2 for this. So no, I don't think it's true that OpenAI "still has the lead" in general. Just in some specific tasks.
GenerWork 13 hours ago [-]
I'd argue that 5.2 just barely squeaks past Sonnet 4.5 at this point. Before this was released, 4.5 absolutely beat Codex 5.1 Medium and could pretty much oneshot UI items as long as I didn't try to create too many new things at once.
fellowniusmonk 16 hours ago [-]
I have a very complex set of logic puzzles I run through my own tests.
My logic test, and trying to get an agent to develop a certain type of ** implementation (one that is published, and thus the model is trained on it to some limited extent), really stress-test models. 5.2 is a complete failure of overfitting.
Really, really bad, in an unrecoverable-infinite-loop way.
It helps when you have existing working code that you know a model can't have been trained on.
It doesn't actually evaluate the working code; it just assumes it's wrong and starts trying to re-write it as a different type of **.
Even when linked to the explanation and the git repo of the reference implementation, it still persists in trying to force a different **.
This is the worst model since pre-o3. Just terrible.
int32_64 16 hours ago [-]
Is there a "good enough" endgame for LLMs and AI where benchmarks stop mattering because end users don't notice or care? In such a scenario brand would matter more than the best tech, and OpenAI is way out in front in brand recognition.
crazygringo 16 hours ago [-]
For average consumers, I think very much yes, and this is where OpenAI's brand recognition shines.
But for anyone using LLMs to help speed up academic literature reviews where every detail matters, or coding where every detail matters, or anything technical where every detail matters -- the differences very much matter. And benchmarks serve just to confirm your personal experience anyway, as the differences between models become extremely apparent when you're working in a niche sub-subfield and one model shows glaring informational or logical errors while another mostly gets it right.
And then there's a strong possibility that as experts start to say "I always trust <LLM name> more", that halo effect spreads to ordinary consumers who can't tell the difference themselves but want to make sure they use "the best" -- at least for their homework. (For their AI boyfriends and girlfriends, other metrics are probably at play...)
smashed 16 hours ago [-]
I haven't seen any LLM tech shine "where every detail matters".
In fact, so far they consistently fail in exactly these scenarios, glossing over random important details whenever you double-check results in depth.
You might have found models, prompts or workflows that work for you though, I'm interested.
bitpush 16 hours ago [-]
> OpenAI's brand recognition shines.
We've seen this movie before. Snapchat was the darling. In fact, it invented the entire category and dominated the format for years. Then it ran out of time.
Now very few people use Snapchat, and it has been reduced to a footnote in history.
If you think I'm exaggerating, that just proves my point.
decimalenough 16 hours ago [-]
Not a great example: Snapchat made it through the slump, successfully captured the next generation of teenagers, and now has around 500M DAUs.
bitpush 12 hours ago [-]
You might not remember, but Snapchat was once supposed to take on Facebook. The founder was so cocky that they declined to be bought by Facebook, because they thought they could be bigger.
I never said Snapchat is dead. It lives on, but it is a shell of its past self. They had no moat, and the competitors caught up (Instagram, WhatsApp and even LinkedIn copied Snapchat's stories)... and the rest is history.
16 hours ago [-]
xbmcuser 16 hours ago [-]
Google's biggest advantage over time will be costs. They have their own hardware, which they can and will optimize for their LLMs. And Google has experience gaining market share over time by giving better results, performance or space, e.g. Gmail vs Hotmail/Yahoo, Chrome vs IE/Firefox. So don't discount them: if the quality is better, they will get ahead over time.
int_19h 12 hours ago [-]
It already is costs. Their Pro plan has much more generous limits compared to both OpenAI and especially Anthropic. You get 20 Deep Research queries with Pro per day, for example.
rfw300 16 hours ago [-]
That might be true for a narrow definition of chatbots, but they aren't going to survive on name recognition if their models are inferior in the medium term. Right now, "agents" are only really useful for coding, but when they start to be adopted for more mainstream tasks, people will migrate to the tools that actually work first.
holler 16 hours ago [-]
This. I don't know any non-tech people who use anything other than ChatGPT. On a similar note, I've wondered why Amazon doesn't make a ChatGPT-like app with their latest Alexa+ makeover; it seems like a missed opportunity. The Alexa app has a feature to talk to the LLM in chat mode, but the overall app is geared towards managing devices.
macNchz 16 hours ago [-]
Google has great distribution to be able to just put Gemini in front of people who are already using their many other popular services. ChatGPT definitely came out of the gate with a big lead on name recognition, but I have been surprised to hear various non-techy friends talking about using Gemini recently, I think for many of them just because they have access at work through their Workspace accounts.
Obertr 16 hours ago [-]
Most of Europe is full of Gemini ads; my parents use Gemini because it is free and it popped up in a YouTube ad before a video.
Just go outside the bubble, and include somewhat older people.
ewoodrich 8 hours ago [-]
Yeah my parents never really cared enough to explore ChatGPT despite hearing about it 10 times a day in news/media for the last few years. But recently my mom started using Google's AI Search mode after first trying it while doing research for house hunting and my dad uses the Gemini app for occasional questions/identifying parts and stuff (he has always loved Google Lens so those sort of interactive multimedia features are the main pull vs plain text chatbot conversations).
They are both Android/Google Search users so all it really took was "sure I guess I'll try that" in response to a nudge from Google. For me personally I have subscriptions to Claude/ChatGPT/Gemini for coding but use Gemini for 90% of chatbot questions. Eventually I'll cancel some of them but will probably keep Gemini regardless because I like having the extra storage with my Google One plan bundle. Google having a pre-existing platform/ecosystem is a huge advantage imo.
nimchimpsky 16 hours ago [-]
[dead]
fullstick 15 hours ago [-]
I doubt anyone I know who is using llms outside of work knows that there are benchmark tests for these models.
jay_kyburz 16 hours ago [-]
This is why both google and microsoft are pushing Gemini and Copilot in everyone's face.
dieortin 17 hours ago [-]
Is there anything pointing to Brin having anything to do with Google’s turnaround in AI? I hear a lot of people saying this, but no one explaining why they do
novok 16 hours ago [-]
In organizations, everyone's existence and position is politically supported by their internal peers around their level. Even google's & microsoft's current CEOs are supported by their group of co-executives and other key players. The fact that both have agreeable personalities is not a mistake! They both need to keep that balance to stay in power, and that means not destroying or disrupting your peer's current positions. Everything is effectively decided by informal committee.
Founders are special because they are not beholden to this social support network to stay in power, and founders have a mythos that socially supports their actions beyond their pure power position. The only others they are beholden to are their co-founders and, in some cases, major investor groups. This gives them the ability to disregard this social balance, because they are not dependent on it to stay in power. Their power source is external to the organization, while everyone else's is internal to it.
This gives them a very special "do something" ability that nobody else has. It can lead to failures (Zuck & Oculus, Snapchat Spectacles) or successes (Steve Jobs, Gemini AI), but either way, it allows them to actually "do something".
JumpCrisscross 15 hours ago [-]
> Founders are special, because they are not beholden to this social support network to stay in power
Of course they are. Founders get fired all the time. As often as non-founder CEOs purge competition from their peers.
> The only others they are beholden too are their co-founders, and in some cases major investor groups
This describes very few successful executives. You can have your co-founders and investors on board, but if your talent and customers hate you, they'll fuck off.
ryoshu 16 hours ago [-]
If he's having an impact it's because he can break through the bureaucracy. He's not trying to protect a fiefdom.
HarHarVeryFunny 16 hours ago [-]
I would say it more goes back to the Google Brain + DeepMind merger, creating Google DeepMind headed by Demis Hassabis.
The merger happened in April 2023.
Gemini 1.0 was released in Dec 2023, and the progress since then has been rapid and impressive.
raincole 16 hours ago [-]
That's a quite sensationalized view.
Ghibli moment was only about half a year ago. At that moment, OpenAI was so far ahead in terms of image editing. Now it's behind for a few months and "it can't be reversed"?
Obertr 16 hours ago [-]
Check the size and budget of Google's initiatives. It's unlimited.
BoredPositron 16 hours ago [-]
The Ghibli moment was an influencer fad not real advancement.
JumpCrisscross 16 hours ago [-]
> I start to believe OAI is very much behind
Kara Swisher recently compared OpenAI to Netscape.
Andrex 8 hours ago [-]
Ouch.
Maybe we'll get some awesome FOSS tech out of its ashes?
JumpCrisscross 6 hours ago [-]
We’ll get a bail-out and then a massive data-centre and energy-production build-out.
baq 17 hours ago [-]
GPT 5.2 is actually getting me better outputs than Opus 4.5 on very complex reviews (on high, I never use less) - but the speed makes Opus the default for 95% of use cases.
louiereederson 16 hours ago [-]
I think the most important part of Google vs OpenAI is the slowing usage of consumer LLMs. People focus on Gemini's growth, but overall LLM MAUs and time spent are stabilizing; in aggregate it looks like a complete S-curve. You can kind of see it in the table in the link below, but it's more obvious when you have the SensorTower data for both MAUs and time spent.
The reason this matters is that slowing velocity raises the risk of featurization, which undermines LLMs as a category in consumer. The cost efficiency of the Flash models reinforces this, as Google can embed LLM functionality into Search (noting that search-like queries are probably 50% of ChatGPT usage, per their July user study). I think model capability was saturated for the average consumer use case months ago, if not longer, so distribution is really what matters, and Search dwarfs LLMs in this respect.
OAI's latest image model outperforms Google's on LMArena in both image generation and image editing. So even though some people may prefer Nano Banana Pro in their own anecdotal tests, the average person prefers GPT Image 1.5 in blind evaluations.
Add to this Gemini's distribution, which is advertised by Google in all of their products, and the average Joe will pick the sneakers on the shelf near the checkout rather than the healthier option in the back.
gdhkgdhkvff 16 hours ago [-]
Those darn sneakers are just too delicious!
encroach 16 hours ago [-]
That's not how the arena works. The evaluation is blind so Google's advertising/integration has no effect on the results.
Obertr 16 hours ago [-]
3 points, sure
encroach 16 hours ago [-]
Right, it only scores 3 points higher on image edit, which is within the margin of error. But on image generation, it scores a significant 29 points higher.
raincole 16 hours ago [-]
...and what does this have to do with the comment you replied to? Did you reply to the wrong person or you were just stating unrelated factoids?
yieldcrv 16 hours ago [-]
The trend I've seen is that none of these companies are behind in concept and theory; they are just spending longer intervals baking a superior foundational model.
So they get lapped a few times and then drop a fantastic new model out of nowhere.
The same is going to happen to Google again, Anthropic again, OpenAI again, Meta again, etc.
They're all shuffling the same talent around; it's California, that's how it goes. The companies have the same institutional knowledge - at least regarding their consumer-facing options.
random9749832 17 hours ago [-]
This is obviously trained on Pro 3 outputs for benchmaxxing.
CuriouslyC 16 hours ago [-]
Not trained on pro, distilled from it.
viraptor 16 hours ago [-]
What do you think distilled means...?
CuriouslyC 12 hours ago [-]
It's good to keep the language clear, because you could pretrain/sft on outputs (as many labs do), which is not the same thing.
NitpickLawyer 16 hours ago [-]
> for benchmaxxing.
Out of all the big4 labs, google is the last I'd suspect of benchmaxxing. Their models have generally underbenched and overdelivered in real world tasks, for me, ever since 2.5 pro came out.
nightski 16 hours ago [-]
Google has incredible tech. The problem is, and always has been, their products. Not only are they generally designed to be anti-consumer, but they go out of their way to make things as hard as possible. The debacle with Antigravity exfiltrating data is just one of countless examples.
novok 16 hours ago [-]
The Antigravity case feels like a pure bug and them rushing to market. They had a bunch of other bugs showing that. That is not anti-consumer or making it difficult.
15 hours ago [-]
17 hours ago [-]
acheong08 18 hours ago [-]
Thinking along the lines of speed, I wonder if a model that can reason and use tools at 60fps would be able to control a robot with raw instructions and perform skilled physical work, which is currently limited by the text-only output of LLMs. It also helps that the Gemini series is really good at multimodal processing of images and audio. Maybe they can also encode sensory inputs in a similar way.
Pipe dream right now, but 50 years from now? Maybe.
incognito124 18 hours ago [-]
Believe it or not, there's Gemini Robotics, which seems to be exactly what you're talking about.
Much sooner: hardware, power, software, even AI model design, inference hardware, cache; everything is being improved, and it's exponential.
bearjaws 18 hours ago [-]
I've been using the preview Flash model exclusively since it came out; the speed and quality of responses are all I need at the moment. Although I'm still using Claude Code w/ Opus 4.5 for dev work.
Google keeps their models very "fresh", and I tend to get more correct answers when asking about Azure or O365 issues; ironically, Copilot will talk about now-deleted or deprecated features more often.
sv123 18 hours ago [-]
I've found copilot within the Azure portal to be basically useless for solving most problems.
djeastm 18 hours ago [-]
Me too. I don't understand why companies think we devs need a custom chat on their website when we all have access to a chat with much smarter models open in a different tab.
golem14 15 hours ago [-]
That's not what they are thinking. They are thinking: "We want to capture the dev and make them use our model – since it is easier to use it in our tab, it can afford to be inferior. This way we get lots of tasty, tasty user data."
3 hours ago [-]
xnx 18 hours ago [-]
OpenAI is pretty firmly in the rear-view mirror now.
walthamstow 18 hours ago [-]
Google Antigravity is a buggy mess at the moment, but I believe it will eventually eat Cursor as well. The £20/mo tier currently has the highest usage limits on the market, including Google models and Sonnet and Opus 4.5.
tempaccount420 13 hours ago [-]
It's not in Google's style, but they need a codex-like fine-tune. I don't think they have ever released fine-tunes like that though.
The model is very hard to work with as is.
bennydog224 18 hours ago [-]
From the article, speed & cost match 2.5 Flash. I'm working on a project where there's a huge gap between 2.5 Flash and 2.5 Flash Lite as far as performance and cost goes.
-> 2.5 Flash Lite is super fast & cheap (~1-1.5s inference), but poor quality responses.
-> 2.5 Flash gives high quality responses, but fairly expensive & slow (5-7s inference)
I really just need an in-between for Flash and Flash Lite for cost and performance. Right now, users have to wait up to 7s for a quality response.
k8sToGo 17 hours ago [-]
I remember the preview price for 2.5 flash was much cheaper. And then it got quite expensive when it went out of preview. I hope the same won't happen.
Tiberium 15 hours ago [-]
For 2.5 Flash Preview the price was specifically much cheaper for the no-reasoning mode, in this case the model reasons by default so I don't think they'll increase the price even further.
Fiveplus 18 hours ago [-]
It is interesting to see the "DeepMind" branding completely vanish from the post. This feels like the final consolidation of the Google Brain merger. The technical report mentions a new "MoE-lite" architecture. Does anyone have details on the parameter count? If this is under 20B params active, the distillation techniques they are using are lightyears ahead of everyone else.
bayarearefugee 13 hours ago [-]
Gemini is so awful at any sort of graceful degradation whenever they are under heavy load.
It's great that they have these new fast models, but the release hype has made Gemini Pro pretty much unusable for hours.
"Sorry, something went wrong"
random sign-outs
random garbage replies, etc
dandiep 16 hours ago [-]
For someone looking to switch over to Gemini from OpenAI, are there any gotchas one should be aware of? E.g. I heard some mention of API limits and approvals? Or in terms of prompt writing? What advice do people have?
I use a service where I have access to all SOTA models and many open-source models, so I can change models within chats and use MCPs. E.g. I start a chat with Opus making a search with Perplexity and Grok DeepSearch MCPs plus Google Search, the next query is with GPT 5 Thinking Xhigh, the next one with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine what it would be like to be locked into using one (or two) companies again. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI); if you're interested, have a look in the simtheory.ai Discord.
I don't know how people using one service can manage...
dandiep 15 hours ago [-]
99% of what I do is fine-tuned models, so there is a certain level of commitment I have to make around training and time to switch.
alach11 18 hours ago [-]
I really wish these models were available via AWS or Azure. I understand strategically that this might not make sense for Google, but at a non-software-focused F500 company it would sure make it a lot easier to use Gemini.
lbhdc 17 hours ago [-]
I feel like that is part of their cloud strategy. If your company wants to pump a huge amount of data through one of these you will pay a premium in network costs. Their sales people will use that as a lever for why you should migrate some or all of your fleet to their cloud.
jiggawatts 15 hours ago [-]
A few gigabytes of text is practically free to transfer even over the most exorbitant egress fee networks, but would cost “get finance approval” amounts of money to process even through a cheaper model.
17 hours ago [-]
jtrn 18 hours ago [-]
This is the first flash/mini model that doesn't make a complete ass of itself when I prompt for the following: "Tell me as much as possible about Skatval in Norway. Not general information. Only what is uniquely true for Skatval."
Skatval is a small local area I live in, so I know when it's bullshitting. Usually, I get a long-winded answer that is PURE Barnum-statement, like "Skatval is a rural area known for its beautiful fields and mountains" and bla bla bla.
Even with minimal thinking (it seems to do none), it gives an extremely good answer. I am really happy about this.
I also noticed it had VERY good scores on tool-use, terminal, and agentic stuff. If that is TRUE, it might be awesome for coding.
I'm tentatively optimistic about this.
amunozo 18 hours ago [-]
I tried the same with my father's little village (Zarza Capilla, in Spain), and it gave a surprisingly good answer in a couple of seconds. Amazing.
peterldowns 16 hours ago [-]
That's a really cool prompt idea, I just tried it with my neighborhood and it nailed it. Very impressive.
kingstnap 18 hours ago [-]
You are effectively describing SimpleQA, but with a single question instead of a comprehensive benchmark, and you can note the dramatic increase in performance there.
jtrn 12 hours ago [-]
I tested it for coding in Cursor, and the disappointment is real. It's completely INSANE when it comes to just doing anything agentic. I asked it to give me an option for how to best solve a problem, and within 1 second it was NPM installing into my local environment without ANY thinking. It's like working with a manic patient. It's like it thinks: I just HAVE TO DO SOMETHING, ANYTHING! RIGHT NOW! DO IT DO IT! I HEARD TEST!?!?!? LET'S INSTALL PLAYWRIGHT RIGHT NOW LET'S GOOOOOO.
This might be fun for vibecode to just let it go crazy and don't stop until an MVP is working, but I'm actually afraid to turn on agent mode with this now.
If it was just over-eager, that would be fine, but it's also not LISTENING to my instructions. Like the previous example, I didn't ask it to install a testing framework, I asked it for options fitting my project. And this happened many times. It feels like it treats user prompts/instructions as: "Suggestions for topics that you can work on."
doomerhunter 18 hours ago [-]
Pretty stoked for this model. Building a lot with "mixture of agents" / mix of models and Gemini's smaller models do feel really versatile in my opinion.
Hoping that the local ones keep progressively up (gemma-line)
Workaccount2 18 hours ago [-]
Really hoping this is used for real time chatting and video. The current model is decent, but when doing technical stuff (help me figure out how to assemble this furniture) it falls far short of 3 pro.
speedgoose 18 hours ago [-]
I’m wondering why Claude Opus 4.5 is missing from the benchmarks table.
anonym29 18 hours ago [-]
I wondered this, too. I think the emphasis here was on the faster / lower-cost models, but that would suggest that Haiku 4.5 should be the Anthropic entry on the table instead. They also did not use the most powerful xAI model, instead opting for the fast one. Regardless, this new Gemini 3 Flash model is good enough that Anthropic should be feeling pressure on both price and output quality simultaneously, whichever Anthropic model is being compared against, which is ultimately good for the consumer.
gorbot 8 hours ago [-]
I've been using 2.5 Pro or Flash a ton at work, and Pro was not noticeably more accurate but significantly slower, so I used Flash way more. This is super exciting.
gustavoaca1997 7 hours ago [-]
Cannot wait for it to be available in GH Copilot
alooPotato 13 hours ago [-]
I have a latency-sensitive application - anyone know of any tools that let you compare time to first token and total latency for a bunch of models at once, given a prompt? Ideally, run close to the DCs that serve the various models so we can take network latency out of the benchmark.
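Not aware of a hosted tool, but measuring this yourself is only a few lines if the providers you care about expose OpenAI-compatible streaming endpoints (most aggregators do). A minimal sketch, assuming the openai Python SDK; the base URL and model slug are placeholders:

    import time
    from openai import OpenAI

    def measure_latency(client: OpenAI, model: str, prompt: str):
        """Return (time_to_first_token, total_time) in seconds for one streamed request."""
        start = time.perf_counter()
        ttft = None
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            # The first chunk with actual content marks time-to-first-token.
            if ttft is None and chunk.choices and chunk.choices[0].delta.content:
                ttft = time.perf_counter() - start
        return ttft, time.perf_counter() - start

    # e.g. via an aggregator (hypothetical model slug):
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
    print(measure_latency(client, "google/gemini-3-flash-preview", "Say hi."))

Running it from a VM in the provider's region would approximate the "close to the DCs" part.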
robertwt7 8 hours ago [-]
Looking at the results, it seems like Flash should be the default now when using Gemini? The difference between Flash thinking and Pro thinking is not noticeable anymore, not to mention the speed increase from Flash! The only noticeable one is the MRCR (long context) benchmark, which tbh I also found to be pretty bad in the Gemini 3 preview since launch.
I had it draw four pelicans, one for each of its thinking levels (Gemini 3 Pro only had two thinking levels). Then I had it write me an <image-gallery> Web Component to help display the four pelicans it had made on my blog: https://simonwillison.net/2025/Dec/17/gemini-3-flash/
I also had it summarize this thread on Hacker News about itself:
    llm \
      -f hn:46301851 -m "gemini-3-flash-preview" \
      -s 'Summarize the themes of the opinions expressed here.
    For each theme, output a markdown header.
    Include direct "quotations" (with author attribution) where appropriate.
    You MUST quote directly from users when crediting them, with double quotes.
    Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Gemini 3 are great models but lacking a few things:
- app experience is atrocious, poor UX all over the place. A few examples: silly jumps when reading the text as the model starts to respond; the slide-over view on iPad breaks the request, while Claude and ChatGPT work fine.
- Google offers 2 choices: your data is used for whatever they want, or, if you want privacy, the app experience gets even worse.
user_7832 18 hours ago [-]
Two quick questions to Gemini/AI Studio users:
1, has anyone actually found 3 Pro better than 2.5 (on non code tasks)? I struggle to find a difference beyond the quicker reasoning time and fewer tokens.
2, has anyone found any non-thinking models better than 2.5 or 3 Pro? So far I find the thinking ones significantly ahead of non thinking models (of any company for that matter.)
Workaccount2 18 hours ago [-]
Gemini 3 is a step change up against 2.5 for electrical engineering R&D.
Davidzheng 18 hours ago [-]
I think it's probably actually better at math, though still not enough to be useful in my research in a substantial way. I suspect this will change suddenly at some point as the models move past a certain threshold. It is also heavily limited by the fact that the models are very bad at not giving wrong proofs/counterexamples, so even if the models are giving useful rates of success, the labor to sort through a bunch of trash makes it hard to justify.
tmaly 18 hours ago [-]
Not for coding but for the design aspect, 3 outshines 2.5
This is exactly why you keep your personal life off the internet
peheje 18 hours ago [-]
This is great. I literally "LOL'd".
echelon 18 hours ago [-]
This is hilarious. The personalized pie charts and XKCD-style comics are great, and the roast-style humor is perfect.
I do feel like it's not an entirely accurate caricature (recency bias? limited context?), but it's close enough.
Good work!
You should do a "show HN" if you're not worried about it costing you too much.
Tiberium 18 hours ago [-]
Yet again Flash receives a notable price hike: from $0.3/$2.5 for 2.5 Flash to $0.5/$3 (+66.7% input, +20% output) for 3 Flash. Also, as a reminder, 2 Flash used to be $0.1/$0.4.
BeetleB 18 hours ago [-]
Yes, but this Flash is a lot more powerful - beating Gemini 3 Pro on some benchmarks (and pretty close on others).
I don't view this as a "new Flash" but as "a much cheaper Gemini 3 Pro/GPT-5.2"
Tiberium 18 hours ago [-]
I would be less salty if they gave us 3 Flash Lite at same price as 2.5 Flash or cheaper with better capability, but they still focus on the pricier models :(
int_19h 12 hours ago [-]
We'll probably get 3 Flash Lite eventually, it just takes time to distill the models, and you want to start with the one that is likely to bring in more money.
zzleeper 18 hours ago [-]
Same! I want to do some data stuff from documents and 2.0 pricing was amazing, but the constant increases go the wrong way for this task :/
jexe 17 hours ago [-]
Right, depends on your use cases. I was looking forward to the model as an upgrade to 2.5 Flash, but when you're processing hundreds of millions of tokens a day (not hard to do if you're dealing in documents or emails with a few users), the economics fall apart.
poplarsol 18 hours ago [-]
Will be interesting to see what their quota is. Gemini 3.0 Pro only gives you 250 / day until you spam them with enough BS requests to increase your total spend > $250.
18 hours ago [-]
croemer 15 hours ago [-]
It's fast and good in Gemini CLI (even though Gemini CLI still lags far behind Claude as a harness).
hereme888 10 hours ago [-]
Any word on when fine-tuning might become available?
mmaunder 13 hours ago [-]
Used the hell out of Gemini 3 Flash with some 3 Pro thrown in for the past 3 hours on CUDA/Rust/FFT code that is performance critical, and now have a gemini flavored cocaine hangover and have gone crawling back to Codex GPT 5.2 xhigh and am making slower progress but with higher quality code.
Firstly, 3 Flash is wicked fast and seems to be very smart for a low latency model, and it's a rush just watching it work. Much like the YOLO mode that exists in Gemini CLI, Flash 3 seems to YOLO into solutions without fully understanding all the angles e.g. why something was intentionally designed in a way that at first glance may look wrong, but ended up this way through hard won experience. Codex gpt 5.2 xhigh on the other hand does consider more angles.
It's a hard come-down off the high of using it for the first time because I really really really want these models to go that fast, and to have that much context window. But it ain't there. And turns out for my purposes the longer chain of thought that codex gpt 5.2 xhigh seems to engage in is a more effective approach in terms of outcomes.
And I hate that reality because having to break a lift into 9 stages instead of just doing it in a single wicked fast run is just not as much fun!
tanh 18 hours ago [-]
Does this imply we don't need as much compute for models/agents? How can any other AI model compete against that?
sunaookami 17 hours ago [-]
Sadly not available in the free tier...
raybb 13 hours ago [-]
And they recently cut 2.5 Flash to 20 requests per day and removed 2.5 Pro altogether.
sunaookami 6 hours ago [-]
Huh wow you are right, they never sent any notice. Lame.
Def_Os 16 hours ago [-]
Consolidating their lead. I'm getting really excited about the next Gemma release.
agentifysh 15 hours ago [-]
So that's why Logan posted 3 lightning emojis. At $0.50/M for input and $3.00/M for output, this will put serious pressure on OpenAI and Anthropic now.
It's almost as good as 5.2 and 4.5 but way faster and cheaper.
FergusArgyll 18 hours ago [-]
So much for "Monopolies get lazy, they just rent seek and don't innovate"
NitpickLawyer 18 hours ago [-]
Also so much for the "wall, stagnation, no more data" folks. Womp womp.
jonathan_h 15 hours ago [-]
"Monopolies get lazy, they just rent seek and don't innovate"
I think part of what enables a monopoly is absence of meaningful competition, regardless of how that's achieved -- significant moat, by law or regulation, etc.
I don't know to what extent Google has been rent-seeking and not innovating, but Google doesn't have the luxury to rent-seek any longer.
deskamess 17 hours ago [-]
Monopolies and wanna-be monopolies on the AI-train are running for their lives. They have to innovate to be the last one standing (or second last) - in their mind.
concinds 18 hours ago [-]
The LLM market has no moats so no one "feels" like a monopoly, rightfully.
incrudible 17 hours ago [-]
LLMs are a big threat to their search engine revenue, so whatever monopoly Google may have had does not exist anymore.
inshard 5 hours ago [-]
Tested it on Gemini CLI and the experience was as good as, if not better than, Claude Code. Gemini CLI has come a long way and is arguably likely to surpass Claude Code at this rate of progress.
MillionOClock 5 hours ago [-]
What are your favorite features? I recently downloaded it and also use Codex CLI and GitHub Copilot in VS Code but I don't really know what specific features it has others might not have.
inshard 5 hours ago [-]
The UI is better - they box the specific types of actions the orchestrator agent takes with a clear categorization. The standard quality-of-life shortcuts, like typing a number to respond to an MCQ, are present here as well. They use specialized sub-agents, such as one with a big context window to find context in the codebase. The quotas appear to be much more generous vs CC. The agent memory management between compacting cycles seems to have a few tricks CC is missing. Also, with 3.0 Flash, it feels faster with the same level of agency and intelligence. It has a feature to focus into an interactive shell where bash commands are being executed by the orchestrator agent. It doesn't feel like Google is trying to push you to buy more credits or is relying on this product for its financial survival - I suspect CC has some dark patterns around this, where the agent runs cycles of tokens in circles with minimal progress on bugs before you have to top up your wallet. Early days still.
heliophobicdude 17 hours ago [-]
Any word on whether this is using their diffusion architecture?
JeremyHerrman 18 hours ago [-]
Disappointed to see continued increased pricing for 3 Flash (up from $0.30/$2.50 to $0.50/$3.00 for 1M input/output tokens).
I'm more excited to see 3 Flash Lite. Gemini 2.5 Flash Lite needs a lot more steering than regular 2.5 Flash, but it is a very capable model and combined with the 50% batch mode discount it is CHEAP ($0.05/$0.20).
jeppebemad 18 hours ago [-]
Have you seen any indications that there will be a Lite version?
summerlight 17 hours ago [-]
I guess if they want to eventually deprecate the 2.5 family they will need to provide a substitute. And there are huge demands for cheap models.
nickvec 18 hours ago [-]
So is Gemini 3 Fast the same as Gemini 3 Flash?
evandena 14 hours ago [-]
yes
i_love_retros 13 hours ago [-]
I'll take the hit to my 401k for this to all just go away. The comments here sound ridiculous.
blitz_skull 12 hours ago [-]
What do you mean?
walthamstow 18 hours ago [-]
I'm sure it's good, I thought the last one was too, but it seems like the backdoor way to increase prices is to release a new model
jeffbee 18 hours ago [-]
If the model is better in that it resolves the task with fewer iterations then the i/o token pricing may be a wash or lower.
GaggiX 18 hours ago [-]
They went too far, now the Flash model is competing with their Pro version: better SWE-bench, better ARC-AGI 2 than 3.0 Pro. I imagine they are going to improve 3.0 Pro before it's out of preview.
Also I don't see it written in the blog post but Flash supports more granular settings for reasoning: minimal, low, medium, high (like openai models), while pro is only low and high.
minimaxir 18 hours ago [-]
"minimal" is a bit weird.
> Matches the “no thinking” setting for most queries. The model may think very minimally for complex coding tasks. Minimizes latency for chat or high throughput applications.
I'd prefer a hard "no thinking" rule than what this is.
GaggiX 18 hours ago [-]
It still supports the legacy mode of setting the thinking budget; you can set it to 0 and it is equivalent to the "none" reasoning effort in GPT 5.1/5.2.
minimaxir 16 hours ago [-]
I can confirm this is the case via the API, but annoyingly AI Studio doesn't let you do so.
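For reference, a minimal sketch of the budget-0 call described above, assuming the google-genai Python SDK (field names per that SDK; treat the model id as a placeholder):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents="Reply with one word: ready?",
        config=types.GenerateContentConfig(
            # Legacy-style control: a budget of 0 disables thinking entirely.
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        ),
    )
    print(response.text)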
skerit 18 hours ago [-]
> They went too far, now the Flash model is competing with their Pro version
Wasn't this the case with the 2.5 Flash models too? I remember being very confused at that time.
JohnnyMarcone 17 hours ago [-]
This is similar to how Anthropic has treated sonnet/opus as well. At least pre opus 4.5.
To me it seems like the big model has been "look what we can do", and the smaller model is "actually use this one though".
jug 18 hours ago [-]
I'm not sure how I'm going to live with this!
6 hours ago [-]
prompt_god 15 hours ago [-]
It's better than Pro in a few evals. Anyone who has used it, how is it for coding?
timpera 17 hours ago [-]
Looks awesome on paper. However, after trying it on my usual tasks, it is still very bad at using the French language, especially for creative writing. The gap between the Gemini 3 family and GPT-5 or Sonnet 4.5 is significant for my usage.
Also, I hate that I cannot put the Google models into a "Thinking" mode like in ChatGPT. When I set GPT 5.1 Thinking on a legal task and tell it to check and cite all sources, it takes 10+ minutes to answer, but it does check everything and cite all its sources in the text; whereas the Gemini models, even 3 Pro, always answer after a few seconds and never cite their sources, making it impossible to click through and check the answer. That makes the whole model unusable for these tasks.
(I have the $20 subscription for both)
happyopossum 16 hours ago [-]
> whereas the Gemini models, even 3 Pro, always answer after a few seconds and never cite their sources
Definitely has not been my experience using 3 Pro in Gemini Enterprise - in fact just yesterday it took so long on a similar task that I thought something was broken. Nope, just re-checking a source.
timpera 16 hours ago [-]
Does Gemini Enterprise have more features?
Just tried once again with the exact same prompt: GPT-5.1-Thinking took 12m46s and Gemini 3.0 Pro took about 20 seconds. The latter obviously has a dramatically worse answer as a result.
(Also, the thinking trace is not in the correct language, and doesn't seem to show which sources have been read at which steps- there is only a "Sources" tab at the end of the answer.)
jijji 17 hours ago [-]
I tried Gemini CLI the other day, typed in two one-line requests, then it responded that it would not go further because I had run out of tokens. I've heard other people complain that it will re-write your entire codebase from scratch, and that you should make backups before starting any code-based work with the Gemini CLI. I understand they are trying to compete against Claude Code, but this is not ready for prime time IMHO.
anonym29 18 hours ago [-]
I never have, do not, and conceivably never will use gemini models, or any other models that require me to perform inference on Alphabet/Google's servers (i.e. gemma models I can run locally or on other providers are fine), but kudos to the team over there for the work here, this does look really impressive. This kind of competition is good for everyone, even people like me who will probably never touch any gemini model.
oklahomasports 16 hours ago [-]
You don’t want Google to know that you are searching for like advice on how much a 61 yr old can contribute to a 401k. What are you hiding?
anonym29 16 hours ago [-]
Why do you close the bathroom stall door in public?
You're not doing anything wrong. Everyone knows what you're doing. You have no secrets to hide.
Yet you value your privacy anyway. Why?
Also - I have no problem using Anthropic's cloud-hosted services. Being opposed to some cloud providers doesn't mean I'm opposed to all cloud providers.
happyopossum 16 hours ago [-]
> I have no problem using Anthropic's cloud-hosted services
Anthropic - one of GCP’s largest TPU customers? Good for you.
Not only is it fast, it is also quite cheap. Nice!
18 hours ago [-]
retinaros 15 hours ago [-]
I might have missed the bandwagon on Gemini, but I never found the models to be reliable. Now it seems they rank first in some hallucination bench?
I just always thought the taste of gpt or claude models was more interesting in the professional context and their end user chat experience more polished.
are there obvious enterprise use cases where gemini models shine?
FpUser 12 hours ago [-]
>"Gemini 3 Flash demonstrates that speed and scale don’t have to come at the cost of intelligence."
I am playing with Gemini 3 and the more I do, the more I find it disappointing when discussing both tech and non-tech subjects compared to ChatGPT. When it comes to non-tech, it seems like it was heavily indoctrinated, and when it cannot "prove" the point it abruptly cuts the conversation. When asked why, it says: formatting issues. Did it attend weasel courses?
It is fast. I grant it.
16 hours ago [-]
andrepd 18 hours ago [-]
Is there a way to try this without a Google account?
mschulkind 18 hours ago [-]
Just use openrouter or a similar aggregator.
18 hours ago [-]
17 hours ago [-]
i_love_retros 13 hours ago [-]
Oh wow another LLM update!
yieldcrv 10 hours ago [-]
anybody know the pattern of when these exit preview mode?
I hate adding -preview to my model environment variable
pancodecake 5 hours ago [-]
[dead]
inquirerGeneral 18 hours ago [-]
[dead]
Lucasjohntee 17 hours ago [-]
[dead]
pancodecake 8 hours ago [-]
[dead]
lyy123 9 hours ago [-]
111
imvetri 18 hours ago [-]
this is why samsung is stopping production in flash
Tepix 18 hours ago [-]
This is why they stopped The Flash after season 9 in 2023.
alex1138 12 hours ago [-]
I so want to like Gemini. I so want to like Google, but beyond their history of shuttering products they also tend to have a bent towards censorship (as most directly seen with Youtube)
alex1138 10 hours ago [-]
Downvotes from sycophants
jdthedisciple 15 hours ago [-]
To those saying "OpenAI is toast"
ChatGPT still has 81% market share as of this very moment, vs Gemini's ~2%, and arguably still provides the best UX and branding.
Everyone and their grandma knows "ChatGPT", who outside developers' bubble has even heard of Gemini Flash?
Yea I don't think that dynamic is switching any time soon.
int_19h 12 hours ago [-]
They won't switch "to Gemini". They will switch "to Google", meaning whatever's integrated into Chrome and Android.
riku_iki 15 hours ago [-]
> ChatGPT still has 81% market share as of this very moment, vs Gemini's ~2%
where did you get this from?
scrollop 15 hours ago [-]
Says the CEO of MySpace.
Topfi 13 hours ago [-]
By existing as part of Google results, AI Search makes them the least reliable search engine of all. To show an example: something I searched for organically today with Kagi, which I then tried with Google as a quick real-world test. Looking for the exact 0-100 kph time of the Honda Pan European ST1100, I got a result of 12-13 seconds, which isn't even in the correct stratosphere (it's roughly around 4 seconds), nor anywhere in the linked sources the model claims to rely on: https://share.google/aimode/Ui8yap74zlHzmBL5W
No matter the model, AI Overview/Results in Google are just hallucinated nonsense, only providing roughly equivalent information to what is in the linked sources as a coincidence, rather than due to actually relying on them.
Whether DuckDuckGo, Kagi, Ecosia or anything else, they are all objectively and verifiably better search engines than Google as of today.
This isn't new either, nor has it gotten better. AI Overview has been and continues to be a mess that makes it very clear to me anyone claiming Google is still the "best" search engine results wise is lying to themselves. Anyone saying Google search in 2025 is good or even usable is objectively and verifiably wrong and claiming DDG or Kagi offer less usable results is equally unfounded.
Either fix your models finally so they adhere to and properly quote sources like your competitors somehow manage or, preferably, stop forcing this into search.
Isn't it the opposite? From the link: Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct.
Gemini 3 Flash scored +13 in the test, more correct answers than incorrect.
sabareesh 11 hours ago [-]
Nope, lower is better; compared to recent OpenAI models this is bad. I am looking at the AA-Omniscience Hallucination Rate.
nemonemo 11 hours ago [-]
One thing I don't understand is how come Gemini Pro seems much cheaper than Gemini Flash in the scatter graph.
andai 11 hours ago [-]
This model has the best score on that benchmark.
Edit: Huh... It does score highest in "Omniscience", but also very high in Hallucination Rate (where higher score is worse)...
sabareesh 11 hours ago [-]
This has one of the worst scores in AA-Omniscience Hallucination Rate.
[0] https://deepwalker.xyz
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
Get an API key and try to use it for classification of text or classification of images. Say you have an Excel file with 10k somewhat random-looking entries you want to classify, or filter down to the 10 that matter to you: use an LLM.
Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier without training on someone's voice; it can handle anyone's voice.
Fixing up text is of course also big.
Data classification is easy for an LLM (a sketch follows below). Data transformation is a bit harder but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air, it will hallucinate like a madman.
The tasks LLMs are good at are used in the background by people creating actually useful software on top of LLMs, but those problems are not seen by the general public, who see a chat box.
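To make the classification point concrete, here is a minimal sketch against the Gemini API, assuming the google-genai Python SDK; the categories and model id are placeholders:

    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    CATEGORIES = ["complaint", "praise", "question", "spam"]  # hypothetical labels

    def classify(entry: str) -> str:
        """Ask the model to pick exactly one category for a free-text entry."""
        prompt = (
            f"Classify the following text into exactly one of {CATEGORIES}. "
            f"Reply with the category name only.\n\nText: {entry}"
        )
        response = client.models.generate_content(
            model="gemini-3-flash-preview", contents=prompt
        )
        return response.text.strip()

    print(classify("I was charged twice for the same order."))

Loop that over the 10k rows (or use batch mode for the discount mentioned elsewhere in this thread) and you have your filter.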
Maybe the scale is different with genAI and there are some painful learnings ahead of us.
I know that without the ability to search, it's very unlikely the model actually has accurate "memories" about these things. I just hope one day they will actually know that their "memory" is bad or non-existent and tell me so, instead of hallucinating something.
After all, it's the same search engine team that didn't care about its search results (its main draw) actively going to shit for over a decade.
They probably use an old Flash Lite model, something super small, to just summarize the search...
Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer the way they feel like it, and the model can convert it to structured data; something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.
So I think LLMs can be good for finding niche info.
Which also implies that (for most tasks) most of the weights in an LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of this massive dataset they've had for years.
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.
However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629
1. What is the purpose of the benchmark?
2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?
To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.
> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.
> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Then you must not be working in an environment where a better benchmark yields a competitive advantage.
In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.
> A secret benchmark is: Useful for internal model selection
That's what I'm doing.
Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
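As an illustration only (none of these are real metrics, and the weights exist purely to show the concealment idea):

    import random

    # Hypothetical per-metric scores, each normalized to [0, 1].
    metrics = {"lint_cleanliness": 0.8, "complexity_score": 0.6, "test_coverage": 0.9}

    # Secret weights: drawn privately, never published, re-drawn over time,
    # so optimizing any single public metric can't reliably move the composite.
    secret_weights = {name: random.random() for name in metrics}
    total = sum(secret_weights.values())

    composite = sum(metrics[m] * secret_weights[m] for m in metrics) / total
    print(f"composite quality score: {composite:.3f}")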
Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!
This is the second reason I find the idea of publicly discussing secret benchmarks silly.
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
You don't train on your test data because you need it to compare whether training is improving things or not.
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
"When was the last time England beat Scotland at rugby union"
new variant "Without using search when was the last time England beat Scotland at rugby union"
It is amazing how bad ChatGPT is at this question and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff - it almost always reports the wrong year, wrong location and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist, cool standard hallucinations. But even within the text it generates itself it cannot keep it consistent with how reality works. It often reports draws as wins for England. It frequently states the team that it just said scored most points lost the match, etc.
It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.
For each, you can use it as "instant", supposedly without thinking (though these are all exclusively reasoning models), or specify a reasoning amount (low, medium, high, and now xhigh - though if you don't specify, it defaults to none). OR you can use the -chat version, which is also "no thinking" but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent, but a different style and answering method).
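For what it's worth, the reasoning amount is just a request parameter. A sketch assuming the OpenAI Python SDK's Responses API; the model name is a placeholder and the accepted effort values vary by model:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.responses.create(
        model="gpt-5.1",              # placeholder model name
        reasoning={"effort": "low"},  # e.g. "low" / "medium" / "high"
        input="Summarize the tradeoffs of different reasoning efforts in one paragraph.",
    )
    print(resp.output_text)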
Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
> Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
It gets a lot easier with practice: your brain caches a few of the typical fluff routines.
The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Where are you getting that? All the citations I've seen say the opposite, eg:
> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.
https://massedcompute.com/faq-answers/
> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.
> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.
I don't see any latency comparisons in the link
https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...
Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.
They do have a priority tier at double the cost, but haven't seen any benchmarks on how much faster that actually is.
The flex tier was an underrated feature in GPT-5: batch pricing with a regular API call. GPT-5.1 using flex priority is an amazing price/intelligence tradeoff for non-latency-sensitive applications, without needing the extra plumbing of most batch APIs.
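As far as I understand it, flex is just a service-tier flag on a normal call rather than a separate batch endpoint. A sketch assuming the OpenAI Python SDK; the model name is a placeholder:

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-5.1",  # placeholder model name
        messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
        service_tier="flex",  # slower, cheaper processing; no batch-file plumbing
    )
    print(resp.choices[0].message.content)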
It's a lost battle. It'll always be cheaper to use an open source model hosted by others like together/fireworks/deepinfra/etc.
I've been maining Mistral lately for low latency stuff and the price-quality is hard to beat.
Turns out becoming a $4 trillion company first with ads (Google), then owning everybody on the AI-front could be the winning strategy.
https://github.com/Roblox/open-game-eval/blob/main/LLM_LEADE...
https://artificialanalysis.ai/evaluations/omniscience
They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal
Markets seems to be in a: "Show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI koolaid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
[0] At least the guys who publish where you or me can read them.
This story also shows the market corruption of Google's monopolies, but a judge recently gave them his stamp of approval so we're stuck with it for the foreseeable future.
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pinpoint the reason, but presumably it was to get Japan to attack the Soviets from the other side, which then however didn't happen). Another could have been to consolidate after the surrender/capitulation of France, rather than continuing to attack further.
Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.
/s
Abandoning our most useful sense, vision, is a recipe for a flop.
[1]: https://entropicthoughts.com/haiku-4-5-playing-text-adventur...
I think it's bad naming on Google's part. "Flash" implies low quality, fast but not good enough. I get a less negative feeling from the "mini" models.
Mini - small, incomplete, not good enough
Flash - good, not great, fast, might miss something.
BTW: I have the same impression, Claude was working better for me for coding tasks.
I have not worked with Sonnet enough to give an opinion there.
Waiting for Apple to say "sorry folks, bad year for iPhone"
Claude has been a coding model from the start, but GPT is more and more becoming a coding model too.
I hope open source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open-sources it, but let's be honest, Google isn't open-sourcing Gemini 3 Flash, and I guess the best bets in open source nowadays are probably GLM or DeepSeek Terminus, or maybe Qwen/Kimi too.
Claude Code just caught up to cursor (no 2) in revenue and based on trajectories is about to pass GitHub copilot (number 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.
In my fortune 100 financial company they just finished crushing open ai in a broad enterprise wide evaluation. Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
There is 1 leader with enterprise. There is one leader with developers. And google has nothing to make a dent. Not Gemini 3, not Gemini cli, not anti gravity, not Gemini. There is no Code Red for Anthropic. They have clear target markets and nothing from google threatens those.
> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
Does that mean y'all never evaluated Gemini at all or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced stats away from Gemini, but I am a Claude Code and heavy Anthropic user myself so shrug.
Enterprise will follow.
I don't see any distinction in target markets - it's the same market.
Also, I do not really use agentic tasks, but I am not sure whether Gemini 3 / 3 Flash have MCP support or skills support for agentic tasks.
If not, that feels like very low-hanging fruit, and something Google could do to try to win the agentic-task market from Claude as well.
So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.
For me the bigger concern which I have mentioned on other AI related topics is that AI is eating all the production of computer hardware so we should be worrying about hardware prices getting out of hand and making it harder for general public to run open source models. Hence I am rooting for China to reach parity on node size and crash the PC hardware prices.
And now I am saying the same for gemini 3 flash.
I still feel the same way though; sure, there is an increase, but I somewhat believe that Gemini 3 is good enough and the returns on training from now on might not be worth that much imo. But I am not sure, and I can be wrong; I usually am.
So I don't think we are on any sigmoid curve as such. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.
(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)
If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.
Pretty much every person in the first (and second) world is using AI now, and only small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago that found programming to only be 4% of tokens.
This sounds like you live in a huge echo chamber. :-(
Apart from my very old grandmothers, I don't know anyone not using AI.
A lot of public religious imagery is very clearly AI generated, and you can find a lot of it on social media too. "I asked ChatGPT" is a common refrain at family gatherings. A lot of regular non-techie folks (local shopkeepers, the clerk at the gas station, the guy at the vegetable stand) have been editing their WhatsApp profile pictures using generative AI tools.
Some of my lawyer and journalist friends are using ChatGPT heavily, which is concerning. College students too. Bangalore is plastered with ChatGPT ads.
There's even a low-cost ChatGPT plan called ChatGPT Go you can get if you're in India (not sure if this is available in the rest of the world). It costs ₹399/mo or $4.41/mo, but it's completely free for the first year of use.
So yes, I'd say many people outside of tech circles are using AI tools. Even outside of wealthy first-world countries.
Just googling means you use AI nowadays.
Remember, really back in the day the A* search algorithm was part of AI.
If you had asked anyone in the 1970s whether a box that given a query pinpoints the right document that answers that question (aka Google search in the early 2000s), they'd definitely would have called it AI.
I've been playing around with other models recently (Kimi, GPT Codex, Qwen, others) to try to better appreciate the difference. I knew there was a big price difference, but watching myself feeding dollars into the machine rather than nickels has also instilled in me quite the reverse appreciation.
I only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
I tried to be quite clear about showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, enough of the way to 100x that I don't feel bad saying it's nearly two orders (it's 1.78 orders of magnitude). To me, your complaint feels rigid and ungenerous.
My post is showing as -1, but I stand by it right now. Arguing over the technicalities here (is 1.78 close enough to 2 orders to count?) feels beside the point to me: DeepSeek is vastly more affordable than nearly everything else, putting even Gemini 3 Flash here to shame. And I don't think people are aware of that.
I guess for my own reference, since I didn't do it the first time: at $0.50/$3.00 per M-i/o, Gemini 3 Flash here is 1.8x and 7.1x (1e0.26 and 1e0.85, respectively) more expensive than DeepSeek.
Otherwise, if it's a short prompt or answer, a SOTA (state-of-the-art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong, and a lot more time/human cost is spent on "checking/debugging" any issue or hallucination, so again SOTA is better.
Or for any privacy/IP protection at all? There is zero privacy, when using cloud based LLM models.
What I don't think is that I can take seriously someone's opinion on enterprise service's privacy after they write "LMAO" in capslock in their post.
The second thing to consider is the whole geopolitical situation. I know companies in Europe are really reluctant to give US companies access to their internal data.
...and all of that done without any GPUs as far as I know! [1]
[1] - https://www.uncoveralpha.com/p/the-chip-made-for-the-ai-infe...
(tldr: afaik Google trained Gemini 3 entirely on tensor processing units - TPUs)
They are pushing the prices higher with each release though: API pricing is up to $0.5/M for input and $3/M for output
For comparison:
Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output
Gemini 2.5 Flash: $0.30/M for input and $2.50/M for output
Gemini 2.0 Flash: $0.15/M for input and $0.60/M for output
Gemini 1.5 Flash: $0.075/M for input and $0.30/M for output (after price drop)
Gemini 3.0 Pro: $2.00/M for input and $12/M for output
Gemini 2.5 Pro: $1.25/M for input and $10/M for output
Gemini 1.5 Pro: $1.25/M for input and $5/M for output
I think image input pricing went up even more.
Correction: It is a preview model...
Google has been discontinuing older models after several months of transition period so I would expect the same for the 2.5 models. But that process only starts when the release version of 3 models is out (pro and flash are in preview right now).
You really need to look at the cost per task. artificialanalysis.ai has a good composite score, measures the cost of running all the benchmarks, and has a 2D intelligence vs. cost graph.
Tried a lot of them and settled on this one, they update instantly on model release and having all models on one page is the best UX.
Presumably a big motivation for them is to be first to get something good and cheap enough they can serve to every Android device, ahead of whatever the OpenAI/Jony Ive hardware project will be, and way ahead of Apple Intelligence. Speaking for myself, I would pay quite a lot for truly 'AI first' phone that actually worked.
From a business perspective it’s a smart move (inasmuch as “integrating AI” is the default which I fundamentally disagree with) since Apple won’t be left holding the bag on a bunch of AI datacenters when/if the AI bubble pops.
I don’t want to lose trust in Apple, but I literally moved away from Google/Android to try and retain control over my data and now they’re taking me… right back to Google. Guess I’ll retreat further into self-hosting.
As long as Apple doesn't take any crazy left turns with their privacy policy then it should be relatively harmless if they add in a google wrapper to iOS (and we won't need to take hard right turns with grapheneOS phones and framework laptops).
Did you forget all the Apple Intelligence stuff? They were never "ignoring" if anything they talked a big talk, and then failed so hard.
The whole iPhone 16 was marketed as AI first phone (including in billboards). They had full length ads running touting AI benefits.
Apple was never "ignoring" or "sitting AI out". They were very much in it. And they failed.
Stuff like:
"Open Chrome, new tab, search for xyz, scroll down, third result, copy the second paragraph, open whatsapp, hit back button, open group chat with friends, paste what we copied and send, send a follow-up laughing tears emoji, go back to chrome and close out that tab"
All while being able to just quickly glance at my phone. There is already a tool like this, but I want the parsing/understanding of an LLM and super fast response times.
On a related note, why would you want to break down your tasks to that level? Surely it should be smart enough to do some of that without you asking, and you could just state your end goal.
https://en.wikipedia.org/wiki/PlainTalk
Plus, if the above worked, the higher level interactions could trivially work too. "Go to event details", "add that to my calendar".
FWIW, I'm starting to embrace using Gemini as general-purpose UI for some scenarios just because it's faster. Most common one, "<paste whatever> add to my calendar please."
I do pay special attention to what the most negative comments say (which in this case are unusually positive). And people discussing performance on their own personal benchmarks.
Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window?
Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world usage.
> Gemini 3 Flash achieves a score of 78%, outperforming not only the 2.5 series, but also Gemini 3 Pro. It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.
The replacement for old flash models will be probably the 3.0 flash lite then.
So if 2.5 Pro was good for your usecase, you just got a better model for about 1/3rd of the price, but might hurt the wallet a bit more if you use 2.5 Flash currently and want an upgrade - which is fair tbh.
It's extremely fast on good hardware, quite smart, and can support up to 1m context with reasonable accuracy
https://epoch.ai/benchmarks/simplebench
Gemini 3 pro got 20%, and everyone else has gotten 0%. I saw benchmarks showing 3 flash is almost trading blows with 3 pro, so I decided to try it.
Basically it is an image showing a dog with 5 legs, an extra one photoshopped onto its torso. Every model counts 4, and Gemini 3 Pro, while also counting 4, said the dog had "large male anatomy". However, it failed a follow-up, saying 4 again.
3 Flash counted 5 legs on the same image, but only after I added a distinct "tattoo" to each leg as an assist. These tattoos didn't help 3 Pro or the other models.
So it is the first of all the models I have tested to count 5 legs on the "tattooed legs" image. It still counted only 4 legs on the image without the tattoos. I'll give it 1/2 credit.
With this release the "good enough" and "cheap enough" intersect so hard that I wonder if this is an existential threat to those other companies.
In my experience, to get the best performance out of different models, they need slightly different prompting.
There's a plugin for everything that mimics anything the others are doing
I see all of these tools as IDEs. Whether someone locks into VS Code, JetBrains, Neovim, or Sublime Text comes down to personal preference. Everyone works differently, and that is completely fine.
Maybe someday future models will all behave similarly given the same prompt, but we're not quite there yet
https://news.ycombinator.com/item?id=46290797
Opus and Sonnet are slower than Haiku. For lots of less sophisticated tasks, you benefit from the speed.
All vendors do this. You need smaller models that you can rapid-fire for lots of other reasons than vibe coding.
Personally, I actually use more smaller models than the sophisticated ones. Lots of small automations.
Think beyond interfaces. I'm talking about rapid-firing hundreds of small agents and having zero human interaction with them. The feedback is deterministic (non agentic) and automated too.
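To make that concrete, a minimal sketch of the pattern using the google-genai Python SDK (placeholder model name and task; the deterministic "feedback" here is just a set-membership check, no human and no agent loop):

    import asyncio
    from google import genai  # assumes GEMINI_API_KEY in the environment

    client = genai.Client()
    ALLOWED = {"bug", "billing", "other"}

    async def classify(ticket: str) -> str:
        resp = await client.aio.models.generate_content(
            model="gemini-3-flash-preview",  # placeholder model name
            contents=f"Label this support ticket as bug/billing/other: {ticket}",
        )
        label = resp.text.strip().lower()
        # Deterministic gate: anything outside the allowed set gets flagged.
        return label if label in ALLOWED else "needs_review"

    async def main(tickets):
        # Rapid-fire the whole batch concurrently.
        return await asyncio.gather(*(classify(t) for t in tickets))

    print(asyncio.run(main(["App crashes on login", "Refund my last invoice"])))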
You say good enough. Great, but what if I as a malicious person were to just make a bunch of internet pages containing things that are blatantly wrong, to trick LLMs?
So Reddit?
I’d imagine the AI companies have all the “pre AI internet” data they scraped very carefully catalogued.
I'm speculating, but Google might have figured out some training magic trick to balance information storage against model capacity. Either that, or this Flash model has a huge number of parameters or something.
https://artificialanalysis.ai/evaluations/omniscience
Prepare to be amazed
Can someone explain how Gemini 3 Pro/Flash then do so well in the overall Omniscience: Knowledge and Hallucination Benchmark?
One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure, it is also more likely to be correct. This is consistent with it having the best accuracy score.
> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant
This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant, by definition.
For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
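To make that arithmetic concrete, here is my reading of the two metrics as defined (an interpretation, not the benchmark's actual code):

    # Hallucination rate as described: of the questions the model did NOT get
    # right, the fraction it answered anyway instead of abstaining.
    def hallucination_rate(wrong: int, abstained: int) -> float:
        missed = wrong + abstained
        return wrong / missed if missed else 0.0

    # The index rewards correct answers and penalizes confident wrong ones,
    # with abstentions neutral - again, my reading of the definition.
    def omniscience_index(correct: int, wrong: int, total: int) -> float:
        return 100 * (correct - wrong) / total

    # The 99/100 model from the example: 1 wrong answer, 0 abstentions.
    print(hallucination_rate(wrong=1, abstained=0))           # 1.0 -> "100%"
    print(omniscience_index(correct=99, wrong=1, total=100))  # 98.0, near-top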
That's what MoE is for. It might be that with their TPUs, they can afford lots of params, just so long as the activated subset for each token is small enough to maintain throughput.
More experts with a lower percentage of active ones -> more sparsity.
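A toy numpy sketch of that routing idea (not Gemini's actual architecture, which isn't public): you can grow the expert count while holding top-k fixed, and per-token compute stays flat while total parameters balloon:

    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, top_k, d = 64, 2, 16            # 2 of 64 active -> high sparsity

    router_w = rng.normal(size=(d, n_experts))
    experts = rng.normal(size=(n_experts, d, d))  # one weight matrix per expert

    def moe_layer(x):                          # x: (d,) activations, one token
        scores = x @ router_w                  # router scores every expert...
        chosen = np.argsort(scores)[-top_k:]   # ...but only the top-k run
        gates = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
        return sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))

    print(moe_layer(rng.normal(size=d)).shape)  # (16,) at 2/64 of the compute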
After Gemini 3.0 the OpenAI damage control crews all drowned.
Not only is it vastly better, it's also free.
I find this particular benchmark to be in agreement with my experiences: https://simple-bench.com
Now, imagine for a moment they had also vertically integrated the hardware to do this.
The most terrifying thing would be Google expanding its free tiers.
Granted, this doesn't give api access, only what google calls their "consumer ai products", but it makes a huge difference when chatgpt only allows a handful of document uploads and deep research queries per day.
Then you realise you aren't imagining it.
Google is great on the data science alone; everything else is an afterthought.
"And then imagine Google designing silicon that doesn’t trail the industry."
I'm def not a Google stan generally, but uh, have you even been paying attention?
https://en.wikipedia.org/wiki/Tensor_Processing_Unit
TPUs, on the other hand, are ASICs; we are more than familiar with the limited applicability, high performance, and high barriers to entry associated with them. TPUs will be worthless as the AI bubble keeps deflating and excess capacity ends up everywhere.
The people who don't have a rudimentary understanding are the wall street boosters that treat it like the primary threat to Nvidia or a moat for Google (hint: it is neither).
It's 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k - notable that the new Flash model doesn’t have a price increase after that 200,000 token point.
It's also twice the price of GPT-5 Mini for input, and half the price of Claude 4.5 Haiku.
I assume that these are just different reasoning levels for Gemini 3, but I can't even find mention of there being 2 versions anywhere, and the API doesn't even mention the Thinking-Pro dichotomy.
Fast = Gemini 3 Flash without thinking (or very low thinking budget)
Thinking = Gemini 3 flash with high thinking budget
Pro = Gemini 3 Pro with thinking
>Fast = 3 Flash
>Thinking = 3 Flash (with thinking)
>Pro = 3 Pro (with thinking)
When I ask Gemini 3 Flash this question, the answer is vague but agency comes up a lot. Gemini thinking is always triggered by a query.
This seems like a higher-level programming issue to me. Turn it into a loop. Keep the context. Those two things make it costly for sure. But does it make it an AGI? Surely Google has tried this?
Which obviously opens up a can of worms regarding who should have authority to supply the "right answer," but still... lacking the core capability, AGI isn't something we can talk about yet.
LLMs will be a part of AGI, I'm sure, but they are insufficient to get us there on their own. A big step forward but probably far from the last.
Problem is that when we realize how to do this, we will have each copy of the original model diverge in wildly unexpected ways. Like we have 8 billion different people in this world, we'll have 16 gazillion different AIs. And all of them interacting with each other and remembering all those interactions. This world scares me greatly.
- An AGI wouldn't hallucinate, it would be consistent, reliable and aware of its own limitations
- An AGI wouldn't need extensive re-training, human reinforced training, model updates. It would be capable of true self-learning / self-training in real time.
- An AGI would demonstrate real genuine understanding and mental modeling, not pattern matching over correlations
- It would demonstrate agency and motivation, not be purely reactive to prompting
- It would have persistent, integrated memory. LLMs are stateless and driven by the current context.
- It should even demonstrate consciousness.
And more. I agree that what we've designed is truly impressive and simulates intelligence at a really high level. But true AGI is far more advanced.
I don't believe the "consciousness" qualification is at all appropriate, as I would argue that it is a projection of the human machine's experience onto an entirely different machine with a substantially different existential topology -- relationship to time and sensorium. I don't think artificial general intelligence is a binary label which is applied if a machine rigidly simulates human agency, memory, and sensing.
I disagreed with most of your assertions even before I hit the last point. This is just about the most extreme thing you could ask for. I think very few AI researchers would agree with this definition of AGI.
Their retention controls for both consumer and business suck. It’s the worst of any of the leaders.
https://artificialanalysis.ai/evaluations/omniscience
https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582
For that reason I still find ChatGPT way better for me: for many things I ask, it first goes off to do online research and has up-to-date information - which is surprising, as you would expect Google to be way better at this. For example, I was asking Gemini 3 Pro recently about how to do something with an "RTX 6000 Blackwell 96GB" card, and it told me this card doesn't exist and that I probably meant the RTX 6000 Ada... Or just today I asked about something on macOS 26.2, and it told me to be cautious as it's a beta release (it's not). Whereas with ChatGPT I trust the final output more, since it very often goes to find live sources and info.
That epistemic calibration is something they are capable of thinking through if you point it out. But they aren't trained to stop and ask/check themselves on how confident they have a right to be. This is a metacognitive interrupt that is socialized into girls between 6 and 9, and into boys between 11 and 13. The metacognitive interrupt to calibrate to appropriate confidence levels is a cognitive skill that models aren't taught and that humans learn socially, by pissing off other humans. It's why we get pissed off at models when they correct us with old, bad data. Our anger is the training tool to stop that behavior; they just can't take in that training signal at inference time.
They think GPT-5 won't be released until the distant future, but what they don't realize is we have already arrived ;)
Trying to use the Gemini CLI is such a pain. I bought GDP Premium and configured GCP, set up environment variables, enabled preview features in the CLI, and did all the dance around it, and it still won't let me use Gemini 3. Why the hell am I even trying so hard?
Then you just have to find a coding tool that works with OpenRouter. Afaik claude/codex/cursor don’t, at least not without weird hacks, but various of the OSS tools do — cline, roo code, opencode, etc. I recently started using opencode (https://github.com/sst/opencode), which is like an open version of claude code, and I’ve been quite happy with it. It’s a newer project so There Will Be Bugs, but the devs are very active and responsive to issues and PRs.
Not to mention that for coding, it's usually more cost efficient to get whatever subscription the specific model provider offers.
thinkingConfig: { thinkingLevel: "low", }
More about it here https://ai.google.dev/gemini-api/docs/gemini-3#new_api_featu...
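In the Python SDK the same setting looks roughly like this (snake_case mirror of the camelCase above; the preview model name is my assumption, so check the docs):

    from google import genai
    from google.genai import types  # assumes GEMINI_API_KEY in the environment

    client = genai.Client()
    response = client.models.generate_content(
        model="gemini-3-flash-preview",  # assumed preview model name
        contents="Explain MoE routing in two sentences.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    print(response.text)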
On that note it would be nice to get these benchmark numbers based on the different reasoning settings.
https://ai.google.dev/gemini-api/docs/thinking#levels
For comparison, from 2.5 Pro ($1.25 / $10) to 3 Pro ($2 / $12), there was 60% increase in input tokens and 20% increase in output tokens pricing.
> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
For example, the Gemini 3 Pro collection: https://blog.google/products/gemini/gemini-3-collection/
But having everything linked at the bottom of the announcement post itself would be really great too!
Just avoiding/fixing that would probably speed up a good chunk of my own queries.
Prompt: "Summarize <recent, working arXiv URL>"
And then it tells me the date is from the future and it simply refuses to fetch the URL.
Flash is meant to be a model for lower cost, latency-sensitive tasks. Long thinking times will both make TTFT >> 10s (often unacceptable) and also won't really be that cheap?
Turns out Gemini 3 Flash is pretty close. The Gemini CLI is not as good but the model more than makes up for it.
The weird part is Gemini 3 Pro is nowhere near as good an experience. Maybe because it's just so slow.
Might be using Flash for my MCP research/transcriber/minor-tasks model over Haiku now, though (will test, of course).
Well worth every penny now
The image model they have released is much worse than Nano Banana Pro; the Ghibli moment did not happen.
Their GPT 5.2 is obviously overfit on benchmarks; that's the consensus among many developers and friends I know. So Opus 4.5 is staying on top when it comes to coding.
The weight of the ads money from Google, plus the general direction and founder sense of Brin, brought the massive Google giant back to life. None of my company's workflows run on OpenAI GPT right now. Even though we love their Agents SDK, after the Claude Agent SDK it feels like peanuts.
This has been true for at least 4 months and yeah, based on how these things scale and also Google's capital + in-house hardware advantages, it's probably insurmountable.
Edit: And just to add an example: openAI's Codex CLI billing is easy for me. I just sign up for the base package, and then add extra credits which I automatically use once I'm through my weekly allowance. With Gemini CLI I'm using my oauth account, and then having to rotate API keys once I've used that up.
Also, Gemini CLI loves spewing out its own chain of thought when it gets into a weird state.
Also Gemini CLI has an insane bias to action that is almost insurmountable. DO NOT START THE NEXT STAGE still has it starting the next stage.
Also Gemini CLI has been terrible at visibility on what it's actually doing at each step - although that seems a bit improved with this new model today.
It's when it becomes difficult, like in the coding case that you mentioned, that we can see that OpenAI still has the lead. The same is true for the image model; prompt adherence is significantly better than Nano Banana, especially on more complex queries.
My logic test, and trying to get an agent to develop a certain type of ** implementation (one that is published, and thus that the model has been trained on to some limited extent), really stress-test models; 5.2 is a complete failure of overfitting.
Really really bad in an unrecoverable infinite loop way.
It helps when you have existing working code that you know a model can't be trained on.
It doesn't actually evaluate the working code; it just assumes it's wrong and starts trying to re-write it as a different type of **.
Even linking it to the explanation and the git repo of the reference implementation it still persists in trying to force a different **.
This is the worst model since pre o3. Just terrible.
But for anyone using LLM's to help speed up academic literature reviews where every detail matters, or coding where every detail matters, or anything technical where every detail matters -- the differences very much matter. And benchmarks serve just to confirm your personal experience anyways, as the differences between models becomes extremely apparent when you're working in a niche sub-subfield and one model is showing glaring informational or logical errors and another mostly gets it right.
And then there's a strong possibility that as experts start to say "I always trust <LLM name> more", that halo effect spreads to ordinary consumers who can't tell the difference themselves but want to make sure they use "the best" -- at least for their homework. (For their AI boyfriends and girlfriends, other metrics are probably at play...)
In fact, so far they consistently fail in exactly these scenarios, glossing over random important details whenever you double-check results in depth.
You might have found models, prompts or workflows that work for you though, I'm interested.
We've seen this movie before. Snapchat was the darling. In fact, it invented the entire category and dominated the format for years. Then it ran out of time.
Now very few people use Snapchat, and it has been reduced to a footnote in history.
If you think I'm exaggerating, that just proves my point.
I never said Snapchat is dead. It still lives on, but it is a shell of its past self. They had no moat, and the competitors caught up (Instagram, WhatsApp and even LinkedIn copied Snapchat with stories... and the rest is history).
Just go outside the bubble, and include somewhat older people too.
They are both Android/Google Search users so all it really took was "sure I guess I'll try that" in response to a nudge from Google. For me personally I have subscriptions to Claude/ChatGPT/Gemini for coding but use Gemini for 90% of chatbot questions. Eventually I'll cancel some of them but will probably keep Gemini regardless because I like having the extra storage with my Google One plan bundle. Google having a pre-existing platform/ecosystem is a huge advantage imo.
Founders are special, because they are not beholden to this social support network to stay in power, and founders have a mythos that socially supports their actions beyond their pure power position. The only others they are beholden to are their co-founders, and in some cases major investor groups. This gives them the ability to disregard this social balance, because they are not dependent on it to stay in power. Their power source is external to the organization, while everyone else's is internal to it.
This gives them a very special "do something" ability that nobody else has. It can lead to failures (Zuck & Oculus, Snapchat Spectacles) or successes (Steve Jobs, Gemini AI), but either way, it allows them to actually "do something".
Of course they are. Founders get fired all the time. As often as non-founder CEOs purge competition from their peers.
> The only others they are beholden to are their co-founders, and in some cases major investor groups
This describes very few successful executives. You can have your co-founders and investors on board, if your talent and customers hate you, they’ll fuck off.
The merger happened in April 2023.
Gemini 1.0 was released in Dec 2023, and the progress since then has been rapid and impressive.
Ghibli moment was only about half a year ago. At that moment, OpenAI was so far ahead in terms of image editing. Now it's behind for a few months and "it can't be reversed"?
Kara Swisher recently compared OpenAI to Netscape.
Maybe we'll get some awesome FOSS tech out of its ashes?
the reason this matters is slowing velocity raises the risk of featurization, which undermines LLMs as a category in consumer. cost efficiency of the flash models reinforces this as google can embed LLM functionality into search (noting search-like is probably 50% of chatgpt usage per their july user study). i think model capability was saturated for the average consumer use case months ago, if not longer, so distribution is really what matters, and search dwarfs LLMs in this respect.
https://techcrunch.com/2025/12/05/chatgpts-user-growth-has-s...
https://lmarena.ai/leaderboard/text-to-image
https://lmarena.ai/leaderboard/image-edit
so they get lapped a few times and then drop a fantastic new model out of nowhere
the same is going to happen to Google again, Anthropic again, OpenAI again, Meta again, etc
they're all shuffling the same talent around; it's California, that's how it goes. The companies have the same institutional knowledge - at least regarding their consumer-facing options
Out of all the big4 labs, google is the last I'd suspect of benchmaxxing. Their models have generally underbenched and overdelivered in real world tasks, for me, ever since 2.5 pro came out.
Pipe dream right now, but 50 years later? Maybe
https://deepmind.google/models/gemini-robotics/
Previous discussions: https://news.ycombinator.com/item?id=43344082
Google keeps their models very "fresh" and I tend to get more correct answers when asking about Azure or O365 issues, ironically copilot will talk about now deleted or deprecated features more often.
The model is very hard to work with as is.
-> 2.5 Flash Lite is super fast & cheap (~1-1.5s inference), but poor quality responses.
-> 2.5 Flash gives high quality responses, but fairly expensive & slow (5-7s inference)
I really just need an in-between for Flash and Flash Lite for cost and performance. Right now, users have to wait up to 7s for a quality response.
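Until that in-between exists, one workaround is a cascade: answer with Flash Lite and escalate to Flash only when a cheap deterministic check fails. A sketch with the 2.5-era model names from above (the "good enough" gate is a stand-in for whatever your product can actually verify):

    from google import genai  # assumes GEMINI_API_KEY in the environment

    client = genai.Client()

    def good_enough(text: str) -> bool:
        # Stand-in gate: a real one might be JSON validity, a regex,
        # or schema validation.
        return len(text) > 40 and not text.lower().startswith("i cannot")

    def answer(prompt: str) -> str:
        for model in ("gemini-2.5-flash-lite", "gemini-2.5-flash"):
            resp = client.models.generate_content(model=model, contents=prompt)
            if good_enough(resp.text):
                return resp.text
        return resp.text  # last attempt, even if the gate never passed

    print(answer("Give three onboarding tips for a note-taking app."))

Average cost and latency land somewhere between the two tiers, at the price of an occasional double call.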
It's great that they have these new fast models, but the release hype has made Gemini Pro pretty much unusable for hours.
"Sorry, something went wrong"
random sign-outs
random garbage replies, etc
Just do it.
I use a service where I have access to all the SOTA models and many open-source models, so I can change models within chats, using MCPs - e.g. start a chat with Opus making a search with the Perplexity and Grok DeepSearch MCPs plus Google Search, run the next query with GPT 5 Thinking xhigh, the next one with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine what it would be like to be locked into using one (or two) companies again. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI), but if you're interested, have a look in the simtheory.ai Discord.
I don't know how people who use just one service can manage...
Skatval is a small local area I live in, so I know when it's bullshitting. Usually, I get a long-winded answer that is PURE Barnum-statement, like "Skatval is a rural area known for its beautiful fields and mountains" and bla bla bla.
Even with minimal thinking (it seems to do none), it gives an extremely good answer. I am really happy about this.
I also noticed it had VERY good scores on tool-use, terminal, and agentic stuff. If that is TRUE, it might be awesome for coding.
I'm tentatively optimistic about this.
This might be fun for vibecode to just let it go crazy and don't stop until an MVP is working, but I'm actually afraid to turn on agent mode with this now.
If it was just over-eager, that would be fine, but it's also not LISTENING to my instructions. Like the previous example, I didn't ask it to install a testing framework, I asked it for options fitting my project. And this happened many times. It feels like it treats user prompts/instructions as: "Suggestions for topics that you can work on."
Hoping that the local ones (the Gemma line) keep pace.
I also had it summarize this thread on Hacker News about itself:
https://gist.github.com/simonw/b0e3f403bcbd6b6470e7ee0623be6...
Where the `-f hn:xxxx` bit resolves via this plugin: https://github.com/simonw/llm-hacker-news
1, has anyone actually found 3 Pro better than 2.5 (on non code tasks)? I struggle to find a difference beyond the quicker reasoning time and fewer tokens.
2, has anyone found any non-thinking models better than 2.5 or 3 Pro? So far I find the thinking ones significantly ahead of non thinking models (of any company for that matter.)
I do feel like it's not an entirely accurate caricature (recency bias? limited context?), but it's close enough.
Good work!
You should do a "show HN" if you're not worried about it costing you too much.
I don't view this as a "new Flash" but as "a much cheaper Gemini 3 Pro/GPT-5.2"
Firstly, 3 Flash is wicked fast and seems to be very smart for a low latency model, and it's a rush just watching it work. Much like the YOLO mode that exists in Gemini CLI, Flash 3 seems to YOLO into solutions without fully understanding all the angles e.g. why something was intentionally designed in a way that at first glance may look wrong, but ended up this way through hard won experience. Codex gpt 5.2 xhigh on the other hand does consider more angles.
It's a hard come-down off the high of using it for the first time because I really really really want these models to go that fast, and to have that much context window. But it ain't there. And turns out for my purposes the longer chain of thought that codex gpt 5.2 xhigh seems to engage in is a more effective approach in terms of outcomes.
And I hate that reality because having to break a lift into 9 stages instead of just doing it in a single wicked fast run is just not as much fun!
It's almost as good as 5.2 and 4.5, but way faster and cheaper.
I think part of what enables a monopoly is absence of meaningful competition, regardless of how that's achieved -- significant moat, by law or regulation, etc.
I don't know to what extent Google has been rent-seeking and not innovating, but Google doesn't have the luxury to rent-seek any longer.
I'm more excited to see 3 Flash Lite. Gemini 2.5 Flash Lite needs a lot more steering than regular 2.5 Flash, but it is a very capable model and combined with the 50% batch mode discount it is CHEAP ($0.05/$0.20).
Also I don't see it written in the blog post but Flash supports more granular settings for reasoning: minimal, low, medium, high (like openai models), while pro is only low and high.
> Matches the “no thinking” setting for most queries. The model may think very minimally for complex coding tasks. Minimizes latency for chat or high throughput applications.
I'd prefer a hard "no thinking" rule than what this is.
Wasn't this the case with the 2.5 Flash models too? I remember being very confused at that time.
To me it seems like the big model has been "look what we can do", and the smaller model is "actually use this one though".
Also, I hate that I cannot put the Google models into a "Thinking" mode like in ChatGPT. When I set GPT 5.1 Thinking on a legal task and tell it to check and cite all sources, it takes 10+ minutes to answer, but it does check everything and cites all its sources in the text; whereas the Gemini models, even 3 Pro, always answer after a few seconds and never cite their sources, making it impossible to click through and check the answer. That makes the whole model unusable for these tasks. (I have the $20 subscription for both.)
Definitely has not been my experience using 3 Pro in Gemini Enterprise - in fact, just yesterday it took so long on a similar task that I thought something was broken. Nope, just re-checking a source.
Just tried once again with the exact same prompt: GPT-5.1-Thinking took 12m46s and Gemini 3.0 Pro took about 20 seconds. The latter obviously has a dramatically worse answer as a result.
(Also, the thinking trace is not in the correct language, and doesn't seem to show which sources have been read at which steps- there is only a "Sources" tab at the end of the answer.)
You're not doing anything wrong. Everyone knows what you're doing. You have no secrets to hide.
Yet you value your privacy anyway. Why?
Also - I have no problem using Anthropic's cloud-hosted services. Being opposed to some cloud providers doesn't mean I'm opposed to all cloud providers.
Anthropic - one of GCP’s largest TPU customers? Good for you.
https://www.anthropic.com/news/expanding-our-use-of-google-c...
I just always thought the taste of gpt or claude models was more interesting in the professional context and their end user chat experience more polished.
are there obvious enterprise use cases where gemini models shine?
I am playing with Gemini 3, and the more I do, the more I find it disappointing when discussing both tech and non-tech subjects compared to ChatGPT. On non-tech topics it seems like it was heavily indoctrinated, and when it cannot "prove" its point it abruptly cuts the conversation. When asked why, it says: formatting issues. Did it attend weasel courses?
It is fast. I grant it.
I hate adding -preview to my model environment variable
ChatGPT still has 81% market share as of this very moment, vs Gemini's ~2%, and arguably still provides the best UX and branding.
Everyone and their grandma knows "ChatGPT", who outside developers' bubble has even heard of Gemini Flash?
Yea I don't think that dynamic is switching any time soon.
where did you get this from?
No matter the model, AI Overview/Results in Google are just hallucinated nonsense, only providing roughly equivalent information to what is in the linked sources as a coincidence, rather than due to actually relying on them.
Whether DuckDuckGo, Kagi, Ecosia or anything else, they are all objectively and verifiably better search engines than Google as of today.
This isn't new either, nor has it gotten better. AI Overview has been and continues to be a mess, and it makes very clear to me that anyone claiming Google still has the "best" search results is lying to themselves. Anyone saying Google Search in 2025 is good, or even usable, is objectively and verifiably wrong, and claiming DDG or Kagi offer less usable results is equally unfounded.
Either fix your models finally so they adhere to and properly quote sources like your competitors somehow manage or, preferably, stop forcing this into search.
Gemini 3 Flash scored +13 in the test, more correct answers than incorrect.
Edit: Huh... It does score highest in "Omniscience", but also very high in Hallucination Rate (where higher score is worse)...