Small models win: When GPT-4 is overkill

Yuki Tanaka, ML Engineer · 02.03.2026, 16:35:55

Everyone defaults to the biggest model. GPT-4, Claude Opus, Gemini Ultra — the assumption is that more parameters mean better results. I used to think this too. Then I started actually measuring.

For the past year, I've been running production ML systems across three different companies. The pattern is consistent: smaller models outperform larger ones in most real-world applications. Not because they're smarter, but because "better" in production means something different than "better" on benchmarks.

The latency tax

Large models are slow. Not dramatically slow — we're talking seconds, not minutes. But those seconds compound.

I worked on a customer service bot last year. We started with GPT-4 because we wanted the best responses. Average response time: 3.2 seconds. User satisfaction was mediocre. We assumed the AI responses weren't good enough and spent weeks tweaking prompts.

Then we tried GPT-3.5. Response quality dropped maybe 10% on our evals. Response time dropped to 0.8 seconds. User satisfaction increased 34%.

Users didn't care about the subtle quality improvements from GPT-4. They cared about waiting. A slightly worse answer delivered quickly beat a slightly better answer delivered slowly.

This isn't universal — some tasks need careful reasoning and users expect to wait. But for interactive applications, latency often matters more than marginal quality gains.
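If you want to check this on your own stack, median latency hides the pain; tail latency is what users feel. A minimal sketch for measuring both (the `measure_latency` helper and the `time.sleep` stub are illustrative, not our production code; swap the stub for your actual model call):

```python
import time
import statistics

def measure_latency(call, n=100):
    """Time n invocations of a model call and report percentile latencies."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        # Nearest-rank p95 over the sorted samples.
        "p95": samples[int(0.95 * (len(samples) - 1))],
    }

# Stub standing in for a real model API call.
stats = measure_latency(lambda: time.sleep(0.001), n=20)
print(f"p50={stats['p50']*1000:.1f}ms p95={stats['p95']*1000:.1f}ms")
```

Run this against both candidate models with representative inputs before arguing about eval scores.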

The cost reality

Let me share real numbers from a document processing pipeline.

Initial implementation: GPT-4 for everything. Monthly cost: $12,400. We processed about 50,000 documents per month, each requiring extraction, classification, and summarization.

Optimized implementation: GPT-4 only for complex edge cases (~8% of documents), GPT-3.5 for standard processing, and a fine-tuned small model for classification. Monthly cost: $1,100.

Quality impact: essentially none. The edge cases that needed GPT-4 still got GPT-4. The routine cases didn't benefit from more capability — they were already solved well by smaller models.

That's 91% cost reduction with no measurable quality loss. The savings funded our entire ML infrastructure for the quarter.
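The arithmetic is straightforward. A quick sketch using only the monthly totals quoted above (the per-document figures are just those totals divided by volume; no per-tier token pricing is assumed):

```python
docs_per_month = 50_000

# Monthly totals from the two implementations.
all_gpt4_cost = 12_400.00   # GPT-4 for everything
tiered_cost = 1_100.00      # routed: GPT-4 ~8%, GPT-3.5, fine-tuned classifier

savings = 1 - tiered_cost / all_gpt4_cost
print(f"cost reduction: {savings:.0%}")  # → 91%
print(f"per document: ${all_gpt4_cost/docs_per_month:.3f} "
      f"-> ${tiered_cost/docs_per_month:.3f}")
```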

When bigger is actually worse

Here's something counterintuitive: larger models sometimes perform worse on narrow tasks.

We had a structured data extraction task — pulling specific fields from invoices. GPT-4 was creative. It would infer missing values, extrapolate from context, make educated guesses. Great capabilities for general reasoning. Terrible for data extraction where we need exactly what's on the document, nothing more.

A smaller, fine-tuned model extracted fields mechanically. No creativity, no inference. Just pattern matching. Accuracy was higher because it didn't try to be clever.

The general intelligence of large models is a feature when you need generalization. It's a bug when you need precision on a specific task. The model that "understands" your task deeply may overthink it.

The fine-tuning advantage

Small models have another superpower: you can fine-tune them cheaply.

Fine-tuning access for the largest frontier models is limited at best: some providers don't offer it at all, and where it exists it's expensive. In practice, you're mostly stuck with prompting. Prompts are powerful but limited: there's only so much behavior you can specify in a context window.

Smaller models can be fine-tuned on your specific data. A 7B parameter model fine-tuned on 10,000 examples of your exact use case often beats a 175B model prompted with general instructions. The small model has literally learned your task; the large model is improvising.

We fine-tuned Llama 2 7B for a medical coding task. Out of the box, it performed terribly — medical coding requires specialized knowledge it didn't have. After fine-tuning on 50,000 labeled examples, it outperformed GPT-4 on our test set while running on a single GPU.

Cost to fine-tune: about $200 in compute. Cost to run: a fraction of API pricing. Performance: better than models 25x larger.

When to actually use large models

I'm not saying large models are useless. They have real advantages in specific situations.

Complex reasoning with novel problems. If the task requires multi-step reasoning about things the model hasn't seen before, more parameters help. A legal analysis of an unusual contract clause, a creative solution to an engineering problem, a nuanced interpretation of ambiguous text — these benefit from scale.

Low-volume, high-stakes decisions. If you're processing ten documents per day and each one matters significantly, the cost difference between models is negligible. Use the best available.

Bootstrapping before fine-tuning. Large models are excellent for generating training data for smaller models. Use GPT-4 to label examples, fine-tune a small model on those labels, deploy the small model. The large model enables the small model.

User-facing conversations requiring broad knowledge. Chatbots that need to discuss anything a user might ask benefit from general capability. You can't fine-tune for conversations you haven't anticipated.
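The bootstrapping pattern can be sketched in a few lines. Here `label_with_large_model` is a stand-in for a real frontier-model API call, and the keyword rule inside it is purely illustrative:

```python
def label_with_large_model(text):
    """Stub standing in for an expensive frontier-model call.
    A real pipeline would hit an API here and parse its answer."""
    return "invoice" if "total due" in text.lower() else "other"

def build_training_set(unlabeled_texts):
    """Run the large model once, offline, to label data for a small model."""
    return [(text, label_with_large_model(text)) for text in unlabeled_texts]

docs = ["Total due: $420.00", "Meeting notes from Tuesday"]
train = build_training_set(docs)
# `train` now feeds a cheap fine-tune of a small model;
# the large model never runs in the serving path.
```

The key property: the large model's cost is paid once at training time, not per request in production.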

The decision framework

Here's how I choose models for new projects:

Start with the smallest model that might work. For most tasks, this is something like GPT-3.5 or Claude Haiku. Run your evaluation suite. If performance is acceptable, you're done.

If performance isn't acceptable, diagnose why. Is it a capability issue (model can't do the task) or a knowledge issue (model doesn't know enough about your domain)? Capability issues need larger models. Knowledge issues need fine-tuning.

Consider hybrid approaches. Route easy cases to small models, hard cases to large models. This requires building a classifier, but a simple one often works. "If the input is longer than X or contains Y, use the large model" can capture most edge cases.
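That kind of routing rule fits in a few lines. A sketch, with hypothetical model names and trigger markers standing in for whatever your evals identify as hard cases:

```python
SMALL_MODEL = "gpt-3.5-turbo"  # illustrative model identifiers
LARGE_MODEL = "gpt-4"

# Markers that signal a hard case; derive these from your own eval failures.
HARD_MARKERS = ("legal", "contract", "multi-step")

def pick_model(text, length_cutoff=2000):
    """Route long or marker-containing inputs to the large model,
    everything else to the small one."""
    if len(text) > length_cutoff or any(m in text.lower() for m in HARD_MARKERS):
        return LARGE_MODEL
    return SMALL_MODEL

print(pick_model("Summarize this receipt"))         # routes to the small model
print(pick_model("Review this contract clause"))    # routes to the large model
```

A classifier this crude feels wrong, but measure before replacing it with a learned router: in my experience the simple version captures most of the value.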

Measure what matters. Benchmark performance is irrelevant if your users care about latency. Cost per transaction matters more than cost per token. End-to-end accuracy matters more than component-level performance.

Re-evaluate regularly. Models improve constantly. The small model that couldn't handle your task six months ago might handle it easily today. The large model that was necessary might be replaceable.

The uncomfortable truth

Most AI systems are over-provisioned. Teams default to the biggest model because they don't want to be blamed for quality issues. "We used GPT-4" is a defense against criticism. Nobody gets fired for choosing the premium option.

But the teams that ship the best products aren't optimizing for defensibility. They're optimizing for user experience and business outcomes. Sometimes that means the biggest model. Often it means the smallest model that works.

The question isn't "which model is best?" It's "which model is best for this specific task, at this scale, with these constraints?" The answer is rarely the largest one available.

Yuki Tanaka is an ML Engineer who has built production AI systems for fintech, healthcare, and e-commerce companies. She focuses on making AI practical and cost-effective rather than impressive on benchmarks.

#AI

