How to Evaluate Robotics Annotation Vendors

Cheap annotation from text-RLHF vendors fails on egocentric video and gripper sequences. Vendor selection is a capability-match problem. The differentiators are domain expertise, staff retention, QA discipline, and closed-loop feedback infrastructure.

Author · Mark Pinnes

26 May 2026

11 min

Image: IndiVillage robotics specialist at workstation (IndiVillage Robotics · Bengaluru)

The Vendor Selection Problem: Capability vs. Cost

You need robot training data annotated. Five vendors respond. Each claims to handle your requirements. The cheapest vendor wins on cost. And fails in production.

When your robot hesitates on an egocentric gripper state it never learned, you'll realise your vendor was built for text RLHF and 2D images, not temporal video sequences and gripper mechanics. Restarting the project with a new vendor costs far more, in both delay and money, than the original savings.

Vendor selection is not cost optimisation. It is capability-matching. And the capability most vendors lack is the one that matters most: egocentric video and multi-modal domain expertise. The label quality on those sequences determines whether your model learns generalisation or noise.

The Lead Filter: Domain Expertise in Robotics

The first question to ask any vendor is not "what do you charge?" but "do you have egocentric video annotation experience?"

Most annotation vendors will say yes. Most will be wrong.

Here's why: text RLHF or 2D image labelling on one side, and robotics annotation on the other, are two different domains with entirely different skill sets.

Text RLHF vendor expertise (vendors like Scale, Surge):

Taxonomy: simple (coherent response vs. incoherent response; helpful vs. harmful)
Task: one annotator labels one piece of text; minimal ambiguity
Volume: extremely high-throughput; millions of labels per month
Talent: crowdsourced workers; some training on taxonomy, but typical churn patterns

Robotics annotation expertise (vendors with deep robotics focus):

Taxonomy: complex (50+ gripper states, sensor modalities, temporal sequences)
Task: temporal reasoning required; single frame is meaningless; sequence context is everything
Volume: lower throughput; consistency matters more than speed
Talent: team members with robotics or computer vision background; sustained tenure; domain knowledge compounds over time

A text-RLHF vendor can scale to handle robotics annotation. But they will likely make systematic errors: misunderstanding spatial reasoning, mislabelling temporal sequences, losing consistency across long videos. By the time you discover these errors in production, your training data is corrupted.

The lead filter: Ask each vendor for egocentric/multi-modal case studies and customer references in robotics. If they can't name one, disqualify them.

The Retention Proxy: Staff Stability as Quality Indicator

Robotics annotation requires learning. A new annotator starts at 70% accuracy and reaches 95%+ accuracy after 3-4 months of learning the taxonomy, understanding the gripper mechanics, and building pattern recognition for edge cases.

Staff turnover is the hidden cost of cheap annotation. Here's the dynamic:

High-churn vendors (typical industry average: 50-70% annual turnover):

Hire 100 annotators → 50-70 leave within 12 months
Constant retraining of new cohorts → institutional knowledge never accumulates
Quality plateaus (the new cohort doesn't have the nuance the departing cohort was developing)
SLA failures increase → project delays and rework

Low-churn vendors (IndiVillage: 96% retention over 16 years):

Hire 100 annotators → 96 are still there after 12 months
Same team across projects → learning compounds
Year 2 project is faster and higher-quality than Year 1 (same team understands your domain)
SLA compliance is predictable → no surprises

The retention proxy: Ask vendors for their annual turnover rate. Below 30% suggests that specialist knowledge compounds. Between 30% and 50% is acceptable for most tasks. Above 70% signals a churn-driven delivery model.
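The bands above can be sketched as a small helper for comparing RFP responses. This is a minimal illustration, not a standard formula: the function names are hypothetical, turnover is assumed to be departures divided by average headcount, and the label for the 50-70% band is an assumption, since the text does not name that range.

```python
def annual_turnover_rate(departures: int, average_headcount: float) -> float:
    """Fraction of the average headcount that left during the year."""
    if average_headcount <= 0:
        raise ValueError("average_headcount must be positive")
    return departures / average_headcount


def classify_turnover(rate: float) -> str:
    """Map a turnover rate (0.0-1.0) to the bands described in the text."""
    if rate < 0.30:
        return "specialist knowledge likely compounds"
    if rate < 0.50:
        return "acceptable for most tasks"
    if rate <= 0.70:
        # Assumption: the text does not label the 50-70% band.
        return "elevated: probe further"
    return "churn-driven delivery model"


# 4 departures from a team of 100, as in the low-churn example above.
print(classify_turnover(annual_turnover_rate(4, 100)))
```

Asking each vendor for the raw departures and headcount figures, rather than a self-reported percentage, makes the rates comparable across responses.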

IndiVillage's 96% retention over 16 years means the team annotating Machani Robotics' egocentric video in 2025 includes people who learned robotics annotation principles in 2010-2012. That accumulated expertise is not transferable; you cannot buy it by switching vendors mid-project.

The QA Discipline Filter: Autonomy-Grade Standards

"Multi-pass review" and "gold sets" are table stakes in annotation. The question is whether a vendor treats these as checkbox activities or as core infrastructure.

Commodity-grade QA (checkbox):

L1 annotation (primary labeller)
L2 review (spot-check 5% of work)
"Quality assurance passed" ✓
No zero-drift monitoring
No inter-rater agreement tracking

Autonomy-grade QA (infrastructure):

L1 annotation (primary labeller)
L2 internal review (10-20% re-audit for consistency)
L3 expert QA (5-10% roboticist review for edge cases)
Weekly inter-rater agreement checks (work pauses when κ falls below a set threshold)
Monthly gold-set recalibration
Quarterly taxonomy audits
Zero-drift dashboards (tracking accuracy over time)

The QA discipline filter: Ask vendors to describe their zero-drift monitoring process. If they don't have one, they're not operating at autonomy grade.
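To make the idea of a κ threshold-based pause concrete, the sketch below computes Cohen's kappa for two annotators who labelled the same frames and flags when agreement falls below a threshold. Nothing here describes any vendor's actual tooling: the choice of Cohen's kappa, the 0.80 threshold, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence

# Hypothetical threshold; a real project would set this per taxonomy.
KAPPA_PAUSE_THRESHOLD = 0.80


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def should_pause(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable],
                 threshold: float = KAPPA_PAUSE_THRESHOLD) -> bool:
    """True when agreement is below the threshold and labelling should pause."""
    return cohens_kappa(labels_a, labels_b) < threshold


annotator_1 = ["open", "open", "closed", "closed"]
annotator_2 = ["open", "open", "closed", "open"]
print(should_pause(annotator_1, annotator_2))
```

A useful follow-up question for a vendor is which agreement statistic they track, on what sample, and what happens operationally when it dips.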

IndiVillage's 18-month zero-drift QA on autonomous-robotics workloads is built on this infrastructure. Other vendors could copy it, but most don't, because it is expensive and complex. The vendors competing on price cannot afford to operate it.

The Proof Filter: Case Studies in Your Modality

A vendor might be excellent at agricultural annotation and terrible at robotics annotation. Conversely, they might excel at 2D images but struggle with 3D or temporal sequences.

The proof filter: Ask for case studies that match YOUR data type. If you need egocentric video annotation, ask for egocentric video case studies (not 2D images, not commodity text). If you need gripper-state labelling, ask for gripper-state labelling (not bounding boxes).

IndiVillage: 18 months zero-drift on egocentric robotics video (Machani Robotics CeCe/RIA). This is direct proof for egocentric use cases.
Commodity vendor: 4.2M image labels for tech companies. Generic. Not robotics-specific.

If a vendor can't provide case studies in your modality, they don't have proven expertise. Marketing claims ("we support video annotation") are not proof.

The Team Structure Filter: Dedicated vs. Shared Resources

When you sign with a vendor, are you getting a dedicated team or sharing a pool?

Shared pool model (typical for low-cost vendors):

Your project is one of 20 projects the team handles
Annotators context-switch between robotics, medical imaging, agriculture
No deep domain expertise in any single area
Ramp time on your project is slower (annotators are new to the domain)
Consistency suffers (context-switching introduces errors)

Dedicated team model (IndiVillage for long-term partnerships):

Your project gets a dedicated team (10-50 people depending on scale)
All their time is spent on YOUR domain (robotics, medical imaging, or agriculture)
Deep domain expertise builds over time
Ramp time is faster (in a multi-project relationship, the team understands your domain from day one)
Consistency is high (no context-switching)

The team structure filter: Ask whether you get a dedicated team or a shared pool. For robotics, dedicated is non-negotiable.

A shared-pool vendor will claim they can "allocate dedicated resources if needed." This is true but expensive, often negating the per-image savings. If dedicated allocation is optional, the default is a shared pool. Avoid it.

The Process and Pricing Transparency Filter

Vendors that hide their process behind a "custom quote" or refuse to share security documents upfront are not being transparent.

Red flags in vendor responses:

"Pricing depends on scope; fill out our form to get a quote" (obscuring cost structure)
"Security details are proprietary; we'll discuss under NDA" (hiding potential gaps)
"SLAs are negotiable project-by-project" (no standard accountability)
"We'll adjust our process to your needs" (no documented, repeatable process)

Green flags:

Clear, itemised pricing (per-image cost, modality multipliers, volume discounts)
Downloadable security documentation (SOC 2, DPA, BAA)
Published SLAs (accuracy thresholds, throughput targets, turnaround times)
Documented process (QA steps, taxonomy versioning, escalation protocols)

IndiVillage's stance: pricing is transparent (modality-specific pricing table available), security docs are downloadable, SLAs are defined contractually, processes are documented. Nothing is hidden, and that demonstrates confidence in the service.

The Scalability Proof Filter

You might start with a 50,000-frame pilot. If successful, you'll scale to 500,000 frames, then 5 million. Can your vendor scale without quality degradation?

Scaling failure signature:

Pilot (50K frames): 98% accuracy, excellent consistency
Production (500K frames): 95% accuracy, noticeable drift in edge-case handling
Scale-up (5M frames): 93% accuracy, systematic errors in temporal consistency

This pattern appears when a vendor's QA infrastructure doesn't scale proportionally with volume. They can maintain discipline on small projects; discipline breaks at larger scale.
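One way to check a reference customer's numbers against this signature is to compare each stage's audited accuracy with the pilot baseline. The sketch below uses the illustrative figures from the failure signature above; the function name and the one-percentage-point tolerance are assumptions, not an industry standard.

```python
from typing import Dict, List


def flag_drift(stage_accuracy: Dict[str, float], baseline_stage: str,
               tolerance: float = 0.01) -> List[str]:
    """Return the stages whose accuracy fell more than `tolerance` below the baseline."""
    baseline = stage_accuracy[baseline_stage]
    return [
        stage
        for stage, accuracy in stage_accuracy.items()
        if baseline - accuracy > tolerance
    ]


# Figures taken from the scaling failure signature described above.
stages = {"pilot": 0.98, "production": 0.95, "scale_up": 0.93}
print(flag_drift(stages, baseline_stage="pilot"))
```

The comparison only means something if every stage is measured against the same gold set, so it is worth asking how the audit sample was drawn at each volume.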

The scalability proof: Ask for a reference customer with 500K+ images. Ask that reference: Did accuracy hold steady as volume increased?

Taranis (4.5M images across 8+ years) is the gold standard. FMC (300K+ hectares, multiple models) is another. Ask IndiVillage for these references. They will provide them because the work is real and the results are documented.

A vendor that can't name a reference with 500K+ images has not proven it can scale.

Common Mistakes in Vendor Selection

Mistake 1: Optimising for per-image cost. "Vendor A is £1/image; Vendor B is £5/image. Choose A." This ignores quality. If Vendor A delivers 93% accuracy and Vendor B delivers 98%, you're paying Vendor A to corrupt your training data. The true cost includes retraining cycles, delayed deployment, and lower production accuracy.

Mistake 2: Treating all "case studies" equally. "They say they've done robotics annotation." Ask for proof. A 2-paragraph case study from a generic vendor is worth less than a detailed, multi-year case study from a specialist. Depth of proof matters.

Mistake 3: Underestimating the vendor lock-in risk. "We can switch vendors if this one doesn't work." Switching is extremely expensive. You have 500K annotated frames in the current vendor's taxonomy. A new vendor uses a different taxonomy, and training their team on your domain takes 3-4 months. Mid-project vendor switches are a last resort, not a Plan B.

Mistake 4: Assuming "scaling up" is linear. "If they can handle 100K images, they can handle 1M." Not necessarily. Scaling introduces new challenges: maintaining inter-rater agreement across larger teams, preventing taxonomy drift, managing QA infrastructure at 10x volume. A vendor that scales poorly will show cracks at 500K when they seemed fine at 100K.

Mistake 5: Not checking staff retention. "They hired a lot of people; they can scale." High hiring is often a signal of high turnover. If a vendor is hiring 100 people/quarter, they're likely replacing people who left. This is not a sign of capacity; it's a sign of churn.
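The trade-off in Mistake 1 can be put into rough numbers. The sketch below is a simplified model under stated assumptions: every mislabelled frame carries one flat downstream cost (rework, retraining, delay), which is a hypothetical parameter you would have to estimate yourself. The prices and accuracies are the ones quoted in Mistake 1; the function names are illustrative.

```python
def total_cost(labels: int, price_per_label: float, accuracy: float,
               cost_per_error: float) -> float:
    """Annotation spend plus the assumed downstream cost of every bad label."""
    annotation = labels * price_per_label
    rework = labels * (1.0 - accuracy) * cost_per_error
    return annotation + rework


def break_even_cost_per_error(price_a: float, accuracy_a: float,
                              price_b: float, accuracy_b: float) -> float:
    """Downstream cost per bad label at which vendors A and B cost the same overall."""
    if accuracy_b <= accuracy_a:
        raise ValueError("vendor B must be the more accurate vendor")
    return (price_b - price_a) / (accuracy_b - accuracy_a)


# Vendor A: £1/image at 93% accuracy. Vendor B: £5/image at 98% accuracy.
print(break_even_cost_per_error(1.0, 0.93, 5.0, 0.98))
print(total_cost(500_000, 1.0, 0.93, cost_per_error=100.0))
print(total_cost(500_000, 5.0, 0.98, cost_per_error=100.0))
```

Under this model the cheaper vendor becomes the more expensive choice once a bad label costs more downstream than the break-even figure, so the estimate of that downstream cost is the number worth arguing about internally.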

Your Robotics Vendor Evaluation Checklist

Use this as a filtering tool in RFP responses:

Required (disqualify if absent):

Egocentric video annotation case study with customer reference
Published accuracy metrics (with confidence intervals) on robotics workload
Documented zero-drift QA process with inter-rater agreement thresholds
Annual staff turnover rate below 50%
Dedicated team model (not shared pool)

Strongly preferred:

Multiple robotics case studies (egocentric, gripper-state, 3D annotation)
Reference customer with 500K+ annotated images
Downloadable SOC 2 and DPA documents
Published pricing (modality-specific, volume discounts)
Formal SLA documentation (accuracy thresholds, throughput, turnaround)

Red flags (disqualify):

No egocentric robotics proof
Cannot name turnover rate
Pricing hidden behind "custom quote"
No documented QA process
Cannot provide customer references (only internal metrics)
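If you are screening several RFP responses, the required items can be encoded as a simple filter. The sketch below is one possible encoding, not a prescribed tool: the `VendorResponse` fields and the `disqualifiers` function are hypothetical names, and each field is assumed to be filled in by a reviewer from the vendor's written answers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VendorResponse:
    name: str
    has_egocentric_case_study: bool      # with a customer reference
    publishes_accuracy_with_ci: bool     # on a robotics workload
    has_zero_drift_qa: bool              # documented, with agreement thresholds
    annual_turnover: Optional[float]     # None if the vendor cannot name a rate
    dedicated_team: bool


def disqualifiers(vendor: VendorResponse) -> List[str]:
    """List the required checklist items this vendor fails; empty means it passes."""
    reasons = []
    if not vendor.has_egocentric_case_study:
        reasons.append("no egocentric video case study")
    if not vendor.publishes_accuracy_with_ci:
        reasons.append("no published accuracy metrics with confidence intervals")
    if not vendor.has_zero_drift_qa:
        reasons.append("no documented zero-drift QA process")
    if vendor.annual_turnover is None:
        reasons.append("cannot name turnover rate")
    elif vendor.annual_turnover >= 0.50:
        reasons.append("annual turnover not below 50%")
    if not vendor.dedicated_team:
        reasons.append("shared pool rather than dedicated team")
    return reasons


candidate = VendorResponse("Vendor X", True, True, True, None, False)
print(disqualifiers(candidate))
```

Returning the reasons, rather than a single pass/fail flag, keeps the record of why a vendor was dropped.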

Frequently Asked Questions

Q: Is commodity annotation acceptable for robotics? A: Not for production-grade systems. Commodity vendors excel at high-volume, low-ambiguity tasks (text RLHF, simple 2D images). They struggle with egocentric video, temporal reasoning, and edge-case handling. Use commodity vendors only for non-critical applications or foundational datasets that will be expertly curated later.

Q: Can a vendor improve if they don't currently have robotics expertise? A: Yes, but it's slow. You're effectively paying them to learn. Ramp time is 4-6 months (vs. 2-3 weeks for an experienced vendor). Project timelines slip. Your training data becomes a learning vehicle for their team, not a competitive advantage for you. This is viable only if your timeline allows it.

Q: Should I bring annotation in-house instead of outsourcing? A: It depends on project size. For 500K+ frames, building an in-house team might be more cost-effective than outsourcing to a vendor. For 50K-500K frames, outsourcing to a specialist vendor is usually faster and cheaper than ramping an internal team. For one-off projects, always outsource.

Q: How do I negotiate better pricing with a vendor? A: Demonstrate volume. "We have 5M frames over 24 months" is a negotiating signal. Vendors offer volume discounts. Also negotiate SLA terms: higher accuracy thresholds = higher cost; longer turnarounds = lower cost. Trade-offs exist; you can optimise based on your timeline.

Q: What questions should I ask a reference customer? A: (1) Did accuracy hold as volume increased? (2) Were there hidden costs (re-annotation, taxonomy changes)? (3) Was the vendor responsive to edge-case issues? (4) Would you re-engage with them? (5) How did they handle scale-up from pilot to production?

Q: How do I compare vendors fairly if they use different metrics? A: Insist on common metrics. Accuracy across the same gold set. Throughput in frames/week. Turnaround time in calendar days. Don't accept "it depends" as an answer. Vendors should be able to quote standardised SLAs.
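A common-metric comparison might look like the sketch below: score every vendor against the same gold set and report accuracy with a confidence interval, so that a small pilot sample is not over-read. The Wilson score interval is one standard choice for a proportion; using it here, along with the function names and the 95% level (z = 1.96), is an assumption rather than something the vendors above are known to report.

```python
import math
from typing import Hashable, Sequence, Tuple


def count_correct(vendor_labels: Sequence[Hashable], gold_labels: Sequence[Hashable]) -> int:
    """Number of vendor labels that match the shared gold set."""
    if len(vendor_labels) != len(gold_labels):
        raise ValueError("vendor and gold label sequences must be the same length")
    return sum(v == g for v, g in zip(vendor_labels, gold_labels))


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an accuracy estimate of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denominator = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denominator
    half_width = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denominator
    return centre - half_width, centre + half_width


gold = ["open", "closed", "open", "grasping"]
vendor = ["open", "closed", "closed", "grasping"]
correct = count_correct(vendor, gold)
print(correct, wilson_interval(correct, len(gold)))
```

With only a few hundred gold items the interval is wide, which is a useful reminder that two vendors a point apart on a pilot may not be distinguishable yet.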

JSON-LD Schema

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "What Should I Look for in a Robotics Annotation Vendor?",
  "description": "How to evaluate robotics annotation vendors on domain expertise, staff retention, QA discipline, and proof of production capability.",
  "author": {
    "@type": "Organization",
    "name": "IndiVillage"
  },
  "datePublished": "2026-05-26"
}