Case Study

How Wolfcast Cut Luna AI Costs by 84%

One line of code. No quality loss. No architectural changes. Just smarter routing.

84% cost reduction · 1 line changed · 0 quality loss · 5 min integration

Wolfcast.ai — Prediction Markets, Powered by AI

Wolfcast.ai is a prediction market platform where users place forecasts and compete on accuracy. To help users navigate complex financial data, Wolfcast built Luna — an AI assistant powered by Google Gemini that answers everything from basic questions to deep analytical queries about market movements.

Luna handles thousands of queries daily across a spectrum of complexity: simple factual questions, platform how-tos, and sophisticated multi-turn conversations involving market context, trend analysis, and prediction comparisons.

Premium Models for Every Query

Wolfcast routed all Luna queries to Google's premium Gemini models. Great for quality. Terrible for the budget. As user counts grew, AI became the largest operational expense — and most queries didn't need a premium model.

💬 Simple queries are expensive

"What is a prediction market?" doesn't need gemini-2.5-pro. But it was being routed there anyway, at $1.25/MTok input.

📈 Scale made it worse

At 200 messages per user per day, costs grew linearly with the user base. There was no easy way to separate simple queries from complex ones without building a custom classifier.

🛠 Building a classifier = months

Wolfcast evaluated building their own query classifier. Estimated timeline: 2–3 months of engineering, plus ongoing maintenance.

Can't sacrifice quality

Luna's value is accurate, helpful answers. Any cost optimization had to preserve response quality for complex analytical queries.
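To see how the linear scaling above bites, here is a back-of-envelope cost model. The per-message token counts are illustrative assumptions, not Wolfcast's actual numbers; the prices are gemini-2.5-pro list prices ($1.25 input / $10.00 output per MTok) from the figures in this case study.

```javascript
// Back-of-envelope cost model for routing every query to the premium model.
// Token counts per message are assumed for illustration only.
const PRICE_IN = 1.25 / 1e6;   // $ per input token (gemini-2.5-pro)
const PRICE_OUT = 10.00 / 1e6; // $ per output token (gemini-2.5-pro)

function dailyCostPerUser(messages, inTokensPerMsg, outTokensPerMsg) {
  return messages * (inTokensPerMsg * PRICE_IN + outTokensPerMsg * PRICE_OUT);
}

// 200 msgs/day with an assumed 2,000 input + 500 output tokens per message:
const perUser = dailyCostPerUser(200, 2000, 500);
console.log(perUser.toFixed(2)); // "1.50" — dollars per user per day
```

Under these assumptions every new user adds a fixed ~$1.50/day, which is why the bill tracks user growth one-for-one.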

One Line. Instant Savings.

TokenSurf sits between Wolfcast and Google Gemini. It classifies each query's complexity in real time and routes simple questions to cheaper models — while passing complex analytical queries straight through to the premium model, untouched.

Before — Direct Gemini
const response = await fetch(
  'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent',
  {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Gemini's REST API takes the key in this header, not Authorization
      'x-goog-api-key': apiKey
    },
    body: JSON.stringify(query)
  }
);
After — TokenSurf
const response = await fetch(
  'https://api.tokensurf.io/v1/chat/completions',
  {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': tsKey
    },
    body: JSON.stringify(query)
  }
);

Three steps: create a TokenSurf account, store the key with defineSecret('TOKENSURF_API_KEY'), and change the URL.

72–84% Cost Savings, Zero Quality Loss

Model Downgrade                        Cost Before (input / output, per MTok)   Cost After       Savings
gemini-2.5-pro → gemini-2.5-flash      $1.25 / $10.00                           $0.30 / $2.50    76%
gemini-3.1-pro → gemini-2.5-flash      $2.00 / $12.00                           $0.30 / $2.50    84%
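The savings column follows directly from the list prices. A quick sketch of the arithmetic for a downgraded query, using the first row's prices (input shown as the headline number; the output side works out almost identically):

```javascript
// Per-token savings when a query is downgraded, from list prices per MTok.
function savingsPercent(priceBefore, priceAfter) {
  return Math.round((1 - priceAfter / priceBefore) * 100);
}

console.log(savingsPercent(1.25, 0.30));  // 76 — input: $1.25 → $0.30
console.log(savingsPercent(10.00, 2.50)); // 75 — output: $10.00 → $2.50
```

Note these figures apply per downgraded query; the blended savings across all traffic depends on what share of queries gets downgraded.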

Query Distribution

60% Simple → routed to a cheaper model (saved)
40% Complex → passed through unchanged (full quality)
TokenSurf's response headers (X-TokenSurf-Downgraded, X-TokenSurf-Model) let Wolfcast monitor every routing decision in production: full transparency into what's being optimized and what's being preserved.
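A minimal sketch of consuming those headers with the standard fetch Headers API. The header names come from this case study; the values, model name, and logging format are illustrative assumptions:

```javascript
// Sketch: log TokenSurf's routing decision for each response.
// Header names are from the case study; everything else is illustrative.
function logRoutingDecision(headers) {
  const downgraded = headers.get('X-TokenSurf-Downgraded') === 'true';
  const model = headers.get('X-TokenSurf-Model');
  console.log(downgraded
    ? `downgraded → served by ${model}`
    : `passed through → served by ${model}`);
  return { downgraded, model };
}

// In production this would be response.headers from the fetch call above:
const decision = logRoutingDecision(new Headers({
  'X-TokenSurf-Downgraded': 'true',
  'X-TokenSurf-Model': 'gemini-2.5-flash'
}));
```

Wiring this into the existing fetch call is one extra line: `logRoutingDecision(response.headers)`.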

How TokenSurf Classifies Luna's Queries

TokenSurf's classification engine runs in real time on every request. Here's how it handles typical Luna traffic:

💬 "What is a prediction market?"

Short factual question. Matches simple patterns. Under 50 tokens.

→ gemini-2.5-flash
📊 "Analyze top movers in tech markets"

Analytical request with market context. Matches complex patterns. 10K+ token system prompt.

→ gemini-2.5-pro
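TokenSurf's classifier runs server-side and its internals aren't public. Purely as an illustration, a heuristic combining the two signals described above (a token-length threshold and simple/complex pattern matching) might look like this; the pattern list and the 4-chars-per-token estimate are assumptions:

```javascript
// Illustrative heuristic only — TokenSurf's real classifier is proprietary.
// Short factual phrasings route cheap; analytical keywords or long prompts
// stay on the premium model.
const COMPLEX_PATTERNS = /\b(analy[sz]e|compare|trend|forecast|explain)\b/i;

function classifyQuery(query, approxTokens = Math.ceil(query.length / 4)) {
  if (approxTokens > 50 || COMPLEX_PATTERNS.test(query)) {
    return 'gemini-2.5-pro';   // complex → pass through at full quality
  }
  return 'gemini-2.5-flash';   // simple → downgrade and save
}

console.log(classifyQuery('What is a prediction market?'));       // gemini-2.5-flash
console.log(classifyQuery('Analyze top movers in tech markets')); // gemini-2.5-pro
```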

Ready to cut your AI costs?

Sign up in 30 seconds. Get 1,000 free credits. No credit card required.