Qualifying 300K venues for under $25
After Sherpa, a question kept nagging me.
We had built real AI-powered GTM for the midmarket. LLM scoring, CRM enrichment, the full stack. But what about the businesses below that? The ones with five employees and no data budget. Could you offer them serious AI outreach capabilities for almost nothing?
I needed a real use case to find out. A friend runs a premium e-commerce tea business selling to HoReCa venues across Spain. Restaurants, coffee shops, hotels. His sales process was entirely manual. Google a neighborhood, find places that look premium enough, cold-call them. One by one.
Spain has roughly 300,000 HoReCa venues. Most have no structured digital presence. No CRM data to buy. No intent signals to subscribe to. Just names and addresses scattered across Google Maps, OpenStreetMap, and random review sites.
The core question: can you qualify 300K venues at near-zero cost, with useful precision?
Two agents, one pipeline
The system has two agents that work in sequence.
The Enrichment Agent discovers potential customers from public data sources, enriches them with business signals, and scores them for fit using a fine-tuned model. This is where 90% of the engineering lives.
The SDR Agent takes top-scored venues, selects an email template per venue type, personalizes one line using data from the Enrichment Agent, and syncs everything to the CRM. No RAG, no LLM. Heuristics and regex. Good enough, and practically free.
This post covers the Enrichment Agent. The SDR Agent gets its own article.
The enrichment state machine
I tried an LLM-driven agent loop early on. The model decided which tool to call next. It worked but was slow, expensive, and unpredictable. For a classification pipeline processing hundreds of thousands of venues, I needed determinism.
So I built a state machine with rule-based routing. Each venue walks through the pipeline independently. The agent inspects what data fields are available and decides the next tool to call. No LLM in the loop for routing decisions.
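A minimal sketch of what this kind of rule-based routing looks like. The tool names follow the list below, but the field sets, the `allow_paid` gate, and the function itself are illustrative assumptions, not the actual implementation:

```python
# Hypothetical routing table: each tool, the profile fields it can fill,
# and whether it is free. Ordered cheapest-first.
TOOL_ORDER = [
    ("osm_discovery",   {"name", "address", "category"},  True),
    ("gmaps_db_match",  {"rating", "review_count"},       True),
    ("google_places",   {"rating", "review_count"},       False),
    ("ddg_search",      {"website"},                      True),
    ("website_scraper", {"menu", "price_signals"},        True),
    ("gemini_gapfill",  {"menu", "price_signals"},        True),
]

def next_tool(venue, tried, allow_paid=True):
    """Pick the cheapest untried tool that could still fill a missing field."""
    for tool, fills, free in TOOL_ORDER:
        if tool in tried:
            continue
        if not free and not allow_paid:
            continue  # venue not promising enough to justify a paid call
        if any(venue.get(f) is None for f in fills):
            return tool
    return None  # chain exhausted -> move on to scoring or discard
```

Because each venue carries its own state, the same function drives rich venues straight toward scoring and pushes sparse ones through the whole chain.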
Seven tools, ordered by cost:
- OSM Discovery — OpenStreetMap Overpass API. Free. One query pulls every restaurant, hotel, and cafe in a province with names, addresses, categories.
- Google Maps DB fuzzy match — Free. I had a local database of 126K Google Maps records. Three-tier matching cascade: exact name match, then proximity + containment, then proximity + trigram similarity.
- Google Places API — $0.032 per call. Only fires when the free match fails and the venue looks promising enough to justify the spend.
- DuckDuckGo search — Free. Fallback for web presence signals when Google Places returns nothing useful.
- Website scraper — Free. Extracts menu details, price signals, brand positioning from venue websites when they exist.
- Gemini gap-fill — Free tier. Fills remaining data gaps from whatever context is available.
- Scoring via RunPod vLLM — ~$0.001 per call. The fine-tuned model scores the enriched venue profile.
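The third tier of the free Google Maps match leans on trigram similarity. Here is a self-contained sketch of the cascade; the distance and similarity thresholds are illustrative, and the padding scheme is borrowed in spirit from Postgres's `pg_trgm`, not taken from the actual code:

```python
def trigrams(s):
    # Lowercase and pad so word boundaries produce their own trigrams.
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match(venue, candidates, max_dist_m=150, sim_threshold=0.45):
    name = venue["name"].lower()
    # Tier 1: exact name match anywhere in the local table.
    for c in candidates:
        if c["name"].lower() == name:
            return c
    # Tier 2: nearby candidate where one name contains the other.
    nearby = [c for c in candidates if c["dist_m"] <= max_dist_m]
    for c in nearby:
        cn = c["name"].lower()
        if name in cn or cn in name:
            return c
    # Tier 3: nearby candidate with high trigram similarity (typo-tolerant).
    best = max(nearby, key=lambda c: trigram_similarity(name, c["name"]),
               default=None)
    if best and trigram_similarity(name, best["name"]) >= sim_threshold:
        return best
    return None
```

The ordering matters: exact and containment matches are cheap and unambiguous, so the fuzzier trigram comparison only runs when they both fail.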
The key design principle: graceful degradation. If one source fails, the next picks up without blocking. A venue that gets rich data from OSM + Google Maps might skip straight to scoring. A venue with nothing but a name and address goes through the full chain. A venue that reaches the end with insufficient data gets auto-discarded. No API call wasted.
Data sufficiency before scoring
This was the most important engineering decision in the project.
Early on, the model was over-qualifying venues when enrichment data was too thin. The fine-tuned model is good at classification when it has real signals. But give it a venue name with zero context and it hallucinates a score based on pretraining knowledge. A restaurant called “La Terraza” would get a decent score because the model “knows” terrace restaurants tend to be mid-to-upscale. That is not useful.
I built a richness score. Simple formula: count non-null signal fields across the enrichment profile. Scale of 0 to 8.
- Richness >= 3: score with confidence.
- Richness < 3: try gap-filling first.
- No scoring signal at all: auto-mark as not_qualified, score 0.0. No inference call.
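The thresholds above can be sketched as a small gate in front of the scorer. The field names are illustrative, not the real profile schema:

```python
# Hypothetical signal fields counted toward the 0-8 richness score.
SIGNAL_FIELDS = [
    "category", "rating", "review_count", "price_level",
    "website", "menu", "price_signals", "description",
]

def richness(profile):
    """Count non-empty signal fields in the enrichment profile."""
    return sum(1 for f in SIGNAL_FIELDS if profile.get(f) not in (None, "", []))

def route_for_scoring(profile):
    r = richness(profile)
    if r >= 3:
        return "score"     # enough real signal: call the fine-tuned model
    if r > 0:
        return "gap_fill"  # thin profile: try Gemini gap-fill first
    return "discard"       # no signal at all: not_qualified, score 0.0
```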
This is the kind of thing that sounds obvious in retrospect but took me real debugging time to identify. The model’s accuracy metrics looked fine in aggregate. The precision problem only showed up when I sliced by data richness.
Distilling Opus into Mistral 7B
The core AI challenge was lead qualification at scale. I needed a model that could look at a venue’s enrichment profile and predict whether it was a good fit for premium tea.
Step one: I used Claude Opus to score 5,000 real venue profiles. These were stratified by venue type (restaurants, cafes, hotels, bars) to avoid the model learning a shortcut like “all hotels score high.” Opus is excellent at this task but costs roughly $0.027 per call. At 300K venues, that is $8,100. Not viable.
Step two: I fine-tuned Mistral 7B with LoRA at rank 8. Rank 8 was deliberate. This is a narrow binary classification task over lexical signals the base model already understands. Words like “organic,” “premium,” “tasting menu,” “artisan” are strong positive signals. The model does not need to learn new representations. It needs to learn which combinations of existing representations indicate fit. Low rank is sufficient for that.
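For readers who want to reproduce something similar, a rank-8 setup with the `peft` library might look like this. The training ran through Together.ai, so every hyperparameter here other than the rank is an assumption, not the actual config:

```python
from peft import LoraConfig

# Illustrative rank-8 LoRA config; alpha, dropout, and target modules
# are common defaults, assumed rather than taken from the real run.
lora_config = LoraConfig(
    r=8,                                  # low rank: recombine known features
    lora_alpha=16,                        # common choice: alpha = 2 * r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
```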
Cost of fine-tuning: under $15 on Together.ai. One-time investment.
Step three: deploy on RunPod serverless with vLLM for production. Inference cost drops to roughly $0.0005 per venue. For development, I ran the model locally on my gaming PC GPU.
Evaluation
I evaluated on 1,000 fully scored venues with Opus as ground truth.
89% agreement at the 0.6 threshold. That is the accuracy number. But accuracy is not the metric that matters here.
In a 300K venue market, leads are abundant. You do not need to find every good venue. You need the ones you do find to actually be good. Outreach has a real cost: someone writes a message, makes a call, follows up. False positives waste that effort. Precision matters more than recall.
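The trade-off is easy to see in the metric definitions themselves. The counts below are made up for illustration; they are not the project's actual confusion matrix:

```python
def precision(tp, fp):
    # Of the venues we contact, what fraction are actually good?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all good venues out there, what fraction did we find?
    return tp / (tp + fn)

# Abundant-lead regime: missing good venues is cheap, chasing bad ones
# costs real human outreach effort.
tp, fp, fn = 80, 20, 60
print(precision(tp, fp))  # 0.8  -> 4 of 5 contacted leads are good
print(recall(tp, fn))     # ~0.57 -> many good venues missed, and that is fine
```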
Initial precision was 68%. Not good enough. After I added the data sufficiency thresholds, precision rose to 80%.
I manually validated a stratified sample of 200 scored examples. 88% accuracy confirmed against my own judgment. The model and I mostly disagree on edge cases where the data genuinely supports either interpretation.
The numbers
For 50,000 venues:
| Approach | Cost |
|---|---|
| Claude Opus direct | ~$1,350 |
| Fine-tuned Mistral 7B on RunPod | ~$25 |
| Fine-tuning investment (one-time) | ~$15 |
50x cheaper. With 89% agreement against the expensive model.
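The table's arithmetic checks out from the per-call prices quoted earlier; the exact ratio is about 54x, which the round "50x" understates slightly:

```python
venues = 50_000
opus = venues * 0.027      # Claude Opus at $0.027 per call
mistral = venues * 0.0005  # fine-tuned Mistral on RunPod at ~$0.0005 per venue
print(f"Opus: ${opus:.0f}, Mistral: ${mistral:.0f}, {opus / mistral:.0f}x cheaper")
# prints: Opus: $1350, Mistral: $25, 54x cheaper
```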
The full stack runs on Python/Flask, React, PostgreSQL with pgvector, Docker Compose. Self-hosted scoring on RunPod serverless. Total infrastructure cost per month is negligible outside of inference.
What I actually learned
The real engineering problem was not the model. The model part was straightforward once I had clean training data from Opus.
The hard part was reliability and cost control. Chaining data sources that fail in different ways. Keeping inference consistent across heterogeneous venue data where some venues have rich web presence and others have nothing but a name. Building a system that degrades gracefully instead of crashing or wasting money on hopeless cases.
The data sufficiency threshold was the single highest-leverage decision. It turned a model with mediocre precision into one I would actually trust for outreach. Not by changing the model. By controlling what the model sees.
Open data, a $15 fine-tuned model, and rule-based control. That is enough to qualify 300K venues with 80% precision at negligible cost.
But these two pipelines are still batch, deterministic flows. They work, but they are not agents yet. In the next post, I cover how I turned them into proper agentic systems. What makes something an “agent” versus a script with API calls. And why that distinction actually matters for reliability.