IndiVillage

Playbook · Robotics

The Egocentric Annotation Playbook.

Twelve patterns from our humanoid-robotics annotation programme. Written for ML teams scoping egocentric video at training cadence, and the operators who have to staff the team doing it.

40-page illustrated PDF

Annotator rubrics, schema templates, QA checklist

Worked examples from programmes shipped with customers including Machani Robotics

Pattern 01 is below, unlocked. The other eleven are in the PDF.

Contents

Twelve patterns.

Each pattern is a rubric, a decision tree, or a failure-mode analysis. All twelve were written by our senior reviewers from programmes that actually shipped, not modelled on what a generic annotation handbook would say.

  • Pattern 01

    Schema design for first-person action (unlocked below)

    Hierarchical action taxonomies that hold up across kitchens, care environments, and workshop spaces. How to avoid the taxonomy debt that forces rewrites in month six.

  • Pattern 02

    Rubrics for object-in-hand versus out-of-hand

    The single largest disagreement driver in egocentric annotation. A decision tree that cuts inter-annotator variance by half.

  • Pattern 03

    Partial grasps and re-grasps

    Temporal boundary calls at grasp / release / adjustment. How we label them so the downstream action-recognition model is not confused by the continuous contact sequence.

  • Pattern 04

    Object hand-off labelling

    The bidirectional giver/receiver problem. Person-to-person, person-to-robot, and robot-to-robot hand-offs need distinct schemas; here is the one we use.

  • Pattern 05

    Navigable-space parsing in clutter

    Ground-plane segmentation that ignores transient clutter but respects permanent obstacles. Rules of thumb for edge cases.

  • Pattern 06

    Affordance mapping

    Beyond object class: which surfaces support, which handles afford pull, which edges are safe contact. The labels VLA models actually learn from.

  • Pattern 07

    Temporal segmentation for action boundaries

    When does 'reach' end and 'grasp' begin? A 30-frame rule that reduces ambiguity and improves action-recognition F1.

  • Pattern 08

    Multi-person egocentric disambiguation

    When the robot's POV catches multiple humans interacting. How to label the subject of attention without losing context on the others.

  • Pattern 09

    Occlusion handling in first-person

    The hand always occludes the object being manipulated. Pattern for consistent occluded-object labelling and downstream model robustness.

  • Pattern 10

    Calibration drift monitoring across training runs

    The schema evolves; the annotator pool evolves. How to monitor for silent rubric drift across quarters without blocking throughput.

  • Pattern 11

    Safety-critical scene flagging

    Annotators as the first line of safety review. Flags for hot surfaces, sharp edges, human contact with the robot. Escalation tree into senior review.

  • Pattern 12

    Disagreement-aware sampling for ambiguous transitions

    Where annotators disagree, models learn brittle boundaries. Sampling strategy that routes high-disagreement clips to domain-expert reviewers for schema-level resolution.

Pattern 01 · Unlocked preview

Schema design for first-person action.

Most egocentric annotation schemas begin life as a flat list of actions — pick up, put down, push, pull, reach, grasp, release — and fall apart inside three months. The failure mode is always the same: real first-person video contains actions that compose, overlap, and interrupt each other in ways the flat list cannot express. Once annotators start stacking tags or inventing new ones inline, you have lost calibration and the training signal degrades.

Our schema is hierarchical. The top layer is a five-class action family: contact, manipulation, locomotion, perception, idle. Beneath each family sits a closed list of primitives, and beneath those, optional modifiers. The primitives do not compose — at any moment a frame belongs to exactly one primitive — but modifiers can stack. Manipulation / rotate-in-hand / one-handed / on-support is a valid four-token label. Manipulation / grasp / push is not; it would have to be split into two temporal segments.
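One way to picture the family / primitive / modifier rule is as a small validator. The sketch below is illustrative only and is not our production tooling. The five family names and the example tokens come from the text above; the contents of `PRIMITIVES` and `MODIFIERS`, and the names `Label` and `parse_label`, are hypothetical placeholders. Filing `grasp` and `push` under manipulation simply mirrors the invalid example above.

```python
from dataclasses import dataclass

# The five families are from the schema; the primitive and modifier lists
# below are hypothetical placeholders, not the real closed lists.
FAMILIES = {"contact", "manipulation", "locomotion", "perception", "idle"}
PRIMITIVES = {
    "manipulation": {"rotate-in-hand", "grasp", "push"},
}
MODIFIERS = {"one-handed", "on-support"}


@dataclass(frozen=True)
class Label:
    family: str
    primitive: str
    modifiers: tuple = ()


def parse_label(text: str) -> Label:
    """Parse 'family / primitive / modifier / ...' and reject primitive stacking."""
    tokens = [t.strip().lower() for t in text.split("/")]
    if len(tokens) < 2:
        raise ValueError("a label needs at least a family and a primitive")
    family, primitive, *modifiers = tokens
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family}")
    allowed = PRIMITIVES.get(family, set())
    if primitive not in allowed:
        raise ValueError(f"unknown primitive for {family}: {primitive}")
    for token in modifiers:
        # A second primitive in modifier position is the stacking we ban.
        if any(token in prims for prims in PRIMITIVES.values()):
            raise ValueError(
                f"primitive stacking ({primitive} + {token}): "
                "split into two temporal segments"
            )
        if token not in MODIFIERS:
            raise ValueError(f"unknown modifier: {token}")
    return Label(family, primitive, tuple(modifiers))
```

Under these assumptions, `parse_label("Manipulation / rotate-in-hand / one-handed / on-support")` returns a four-token label, while `parse_label("Manipulation / grasp / push")` raises an error asking for two temporal segments.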

This constraint is load-bearing. By banning primitive stacking we force annotators to draw temporal boundaries rather than paper over ambiguity with multiple tags. The model learns cleaner transition statistics. Inter-annotator agreement stabilises inside four weeks. And when the taxonomy evolves — which it will — the evolution happens at the modifier layer, not the primitive layer, so existing annotations stay valid.
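The "exactly one primitive per frame" constraint can be checked mechanically once annotations are stored as temporal segments. The following is a minimal sketch, assuming half-open frame ranges and assuming that every frame must be covered (with unlabelled stretches expected to carry an idle primitive). The names `Segment` and `check_single_primitive` are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int      # first frame, inclusive
    end: int        # one past the last frame (exclusive)
    primitive: str


def check_single_primitive(segments: List[Segment], n_frames: int) -> List[str]:
    """Return a list of problems; empty means each frame has exactly one primitive."""
    problems = []
    cursor = 0  # first frame not yet covered by any segment
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.end <= seg.start:
            problems.append(f"empty segment for {seg.primitive}")
            continue
        if seg.end > n_frames:
            problems.append(f"{seg.primitive} runs past the clip end")
        if seg.start < cursor:
            # Two primitives claim the same frames: stacking in disguise.
            last = min(seg.end, cursor) - 1
            problems.append(f"overlap at frames {seg.start}-{last}")
        elif seg.start > cursor:
            problems.append(f"gap at frames {cursor}-{seg.start - 1}")
        cursor = max(cursor, seg.end)
    if cursor < n_frames:
        problems.append(f"gap at frames {cursor}-{n_frames - 1}")
    return problems
```

A clip labelled `grasp` over frames 0 to 9 and `push` over frames 10 to 24 passes cleanly; extending the `grasp` segment to frame 11 without moving the `push` boundary is reported as an overlap.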

Three heuristics for the primitive set itself:

  1. Primitives must be describable in under eight words. If the definition needs a paragraph, you are describing a composition. Split it.
  2. No primitive can depend on object class. “Pour-liquid” is a composition of manipulation / tilt-object plus an object-class modifier. Keeping action and object orthogonal is what makes the schema transferable across environments.
  3. Every primitive must have a reasonable idle counterpart. If there is no sensible “no-op equivalent,” the primitive is too specific.
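The three heuristics above lend themselves to a rough lint pass over a candidate primitive list. This is a sketch under stated assumptions, not a substitute for reviewer judgement: `OBJECT_TERMS` is a hypothetical object vocabulary, `lint_primitive` is a hypothetical helper, and the idle-counterpart check only confirms that a reviewer has recorded one.

```python
# Hypothetical object-class vocabulary; a real programme would draw this
# from its own object taxonomy.
OBJECT_TERMS = {"liquid", "cup", "door", "knife"}


def lint_primitive(name, definition, idle_counterpart=None):
    """Return heuristic warnings for one candidate primitive."""
    issues = []
    # Heuristic 1: describable in under eight words.
    if len(definition.split()) >= 8:
        issues.append("definition has eight or more words; likely a composition")
    # Heuristic 2: no dependence on object class (checked on the name only).
    if OBJECT_TERMS & set(name.lower().split("-")):
        issues.append("name references an object class")
    # Heuristic 3: a sensible no-op equivalent must have been recorded.
    if not (idle_counterpart or "").strip():
        issues.append("no idle counterpart recorded")
    return issues
```

With these placeholder values, `lint_primitive("pour-liquid", "Tilt a container so that its liquid contents flow out")` trips all three checks, whereas `lint_primitive("tilt-object", "Rotate held object about a horizontal axis", "hold-object")` returns no warnings. The `hold-object` counterpart is invented for illustration.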

In the full PDF, Pattern 02 picks up from here with the rubric for disambiguating object-in-hand state, the single largest source of disagreement once the primitive set is stable. Patterns 03–12 cover the rest of the stack.