Balancing Performance and Cost in Leading AI Language Models for Zero-shot Classification
5/19/2025



As AI becomes more deeply integrated into everyday tools, choosing the right language model is more important than ever. Whether you're building a classifier, summarization engine, or chatbot, both performance and cost can make or break your product. In this benchmark, we evaluated leading models from OpenAI, Google, Anthropic, DeepSeek, and Mistral using a zero-shot classification task, assigning documents to one of 34 categories such as Business and Finance, Education, Automotive, Healthy Living, Entertainment, and Law.
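For illustration, the task looks roughly like the sketch below; the prompt wording and the truncated category list are ours, not the exact prompt used in the benchmark:

```python
# Illustrative only: the exact benchmark prompt and the full 34-category
# list are not reproduced here.
CATEGORIES = [
    "Business and Finance", "Education", "Automotive",
    "Healthy Living", "Entertainment", "Law",
    # ... 28 more categories
]

def build_prompt(document: str) -> str:
    """Build a zero-shot classification prompt asking for the label only."""
    return (
        "Classify the following document into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + "\n\nRespond with the category label only.\n\n"
        + "Document:\n" + document
    )
```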
Despite prompting for a response containing only the label, many models included extra text. To handle this, we used a string-inclusion approach: we looped over the categories and returned the first one that appeared in the response.
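A minimal sketch of that extraction step; the function name and the example categories are ours for illustration:

```python
def extract_label(response: str, categories: list[str]) -> str | None:
    """Return the first listed category that appears in the model's response."""
    response_lower = response.lower()
    for category in categories:
        if category.lower() in response_lower:
            return category
    return None  # no listed category found; scored as incorrect

# Copes with verbose replies that ignore the "label only" instruction:
reply = "Sure! I would classify this document as Business and Finance."
print(extract_label(reply, ["Business and Finance", "Education", "Automotive"]))
# -> Business and Finance
```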
We measured accuracy, input/output token pricing, and overall cost-efficiency.

Key Observations
- Accuracy is strong and consistent across models, with only a 4–6% spread. The mean accuracy was 72.7%, with OpenAI’s models and DeepSeek outperforming the average.
- Token pricing varies widely. Claude’s models had the highest input costs, followed by OpenAI’s larger models. In contrast, OpenAI’s GPT-4o-mini, DeepSeek, Gemini, and Mistral offered much lower input prices.
- Input tokens are key to controlling cost. In a classification task the prompt accounts for most of the tokens, so concise, direct prompts keep costs down; and since output tokens are priced significantly higher across all providers, specifying brevity in the prompt makes the response cheaper as well (see the cost sketch after this list).
- Caching saves money. Model APIs that cache repeated prompt prefixes at discounted input rates can substantially lower costs, since the same instructions and category list are resent with every document.
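To make the token economics concrete, here is a toy cost estimator; the per-million-token prices are placeholders, not any provider's actual rates:

```python
# Hypothetical prices in USD per million tokens: (input, cached input, output).
PRICES_PER_MTOK = {
    "small-model": (0.15, 0.075, 0.60),
    "large-model": (3.00, 1.50, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate one request's cost, billing cached input at the discounted rate."""
    input_price, cached_price, output_price = PRICES_PER_MTOK[model]
    uncached_tokens = input_tokens - cached_tokens
    return (uncached_tokens * input_price
            + cached_tokens * cached_price
            + output_tokens * output_price) / 1_000_000

# A label-only reply vs. a chatty one, same 500-token prompt:
print(request_cost("large-model", 500, 5))    # ~0.0016 USD
print(request_cost("large-model", 500, 300))  # ~0.0060 USD
# Caching most of the prompt prefix lowers the input portion:
print(request_cost("large-model", 500, 5, cached_tokens=400))  # ~0.0010 USD
```

Even in this toy setup, the verbose reply costs nearly four times the label-only one, and caching the shared prompt prefix cuts the remaining cost by roughly a third.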

Pricing vs. Performance
Scatter plots comparing input and output pricing confirm that output tokens are universally more expensive. While output costs shift rightward (i.e., higher) on the pricing axis, the relative positions of models remain stable. This reinforces the idea that prompt efficiency, both in structure and reuse, can be as valuable as choosing the “best” model.
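A pricing-versus-performance scatter like the ones referenced here takes only a few lines of matplotlib; the accuracy and price values below are made up for illustration, not the benchmark's results:

```python
import matplotlib.pyplot as plt

# Hypothetical (input price per MTok, accuracy) points, not real measurements.
models = {
    "model-a": (0.15, 0.74),
    "model-b": (0.40, 0.73),
    "model-c": (3.00, 0.71),
}

fig, ax = plt.subplots()
for name, (price, accuracy) in models.items():
    ax.scatter(price, accuracy)
    ax.annotate(name, (price, accuracy))
ax.set_xscale("log")  # provider prices span orders of magnitude
ax.set_xlabel("Input price (USD per million tokens)")
ax.set_ylabel("Zero-shot accuracy")
ax.set_title("Pricing vs. performance")
plt.show()
```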
Despite comparable accuracy, pricing strategies and token dynamics differ. For this problem, OpenAI's gpt-4o-mini and DeepSeek's models stood out as clear winners. In particular, it's worth noting that more expensive models, even from the same provider, do not necessarily produce better results for a given problem. As you deploy models in production, remember: design your prompts carefully, take advantage of caching, and optimize for brevity to reduce costs without compromising quality.