Independent Research · 2025–2026 · In Progress

Resume TLM

High-fidelity resume parsing without the LLM tax. A modular PyTorch pipeline: boundary detection → section classification → entity extraction.

resume-tlm · inference
Architecture
DistilRoBERTa + CRF
Parameters
66M
CPU Inference
~78ms
Quantization
INT8 · ONNX
01 — Engineering Thesis

Efficiency as a feature.

The “LLM Tax”: the unnecessary cost and latency of using a massive model for a structured extraction task.

| Metric | GPT-4o / Claude 3.5 | Resume TLM ✓ |
| --- | --- | --- |
| Parameters | ∼175B+ | 66M |
| CPU inference latency | 3–8s (API round-trip) | ~78ms |
| Cost per 1K resumes | $15–40 (API) | ~$0.002 (compute) |
| Schema adherence | ~92% (hallucination risk) | 99.8% (deterministic CRF) |
| Runs offline / on-device | No | Yes (TorchScript / ONNX) |
| Token-level confidence | No | Yes (per-entity score) |
02 — The Data Factory

High-quality parsing needs high-quality ground truth.

Instead of downloading a dataset, I built the machinery to create one — a custom Human-in-the-Loop Labeling Workbench.

01
Raw PDF
203 resumes
02
Local LLM Pre-label
Gemma 4 via Ollama
03
Human Review UI
Next.js annotator
04
Gold Dataset
MongoDB Atlas
🏷️

LLM Pre-annotation

Gemma 4 generates "silver standard" labels for every token before a human sees the resume. Batch runs process 12 resumes per Ollama cooldown window with smart skip logic for already-labeled docs.
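The batch-with-skip loop described above can be sketched as follows. This is a minimal illustration, not the project's actual labeling code: the helper names `batch_unlabeled` and `prelabel`, the `labels` field, and the model tag `"gemma"` are all assumptions; only the Ollama `/api/generate` endpoint and its `model` / `prompt` / `stream` request fields come from Ollama's public API.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def batch_unlabeled(docs, batch_size=12):
    """Smart skip logic: drop docs that already carry labels,
    then yield fixed-size batches (12 per cooldown window)."""
    pending = [d for d in docs if not d.get("labels")]
    for i in range(0, len(pending), batch_size):
        yield pending[i:i + batch_size]

def prelabel(tokens, model="gemma"):
    """One local LLM call per document: ask for a silver-standard BIO tag per token."""
    prompt = "Assign one BIO tag per line:\n" + "\n".join(tokens)
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

The skip filter is what makes repeated batch runs cheap: already-labeled documents never reach the model again.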

70% less manual effort
👁️

Visual Labeling UI

Custom Next.js app renders token bounding boxes, section clusters, and BIO tag assignments in a 3-stage interface (Live → Heuristic → AI) across Skills, Experience, Education, and Projects sections.

3-stage review pipeline
🛡️

Data Integrity

The UI enforces 8D/24D spatial feature constraints during labeling. BIO violation audit scripts catch illegal tag transitions before training. Per-document loss scores surface high-loss outliers for re-review.
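The BIO audit mentioned above reduces to one rule: an `I-X` tag is legal only if the previous tag is `B-X` or `I-X`. A minimal sketch of such a check (the function name `bio_violations` is illustrative, not the project's script):

```python
def bio_violations(tags):
    """Return indices of illegal BIO transitions:
    an I- tag must continue a same-type B-/I- run."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                bad.append(i)
        prev = tag
    return bad
```

Running this over every labeled document before training catches annotator slips (an orphaned `I-` after `O`, or an `I-` that switches entity type mid-span) that would otherwise poison the CRF's transition statistics.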

savedBy: 'user' | 'model'
203
Total resumes
189
Fully labeled
5
Label stages
14
Manual review queue
03 — The Laboratory

Architecture deep dive.

GLU Spatial Fusion

Resumes are spatial documents — a header's position is as informative as its text. A Gated Linear Unit selectively blends token semantics with layout features at inference time, without hardcoding layout rules:

# GLUSpatialFusion forward pass
gate = sigmoid(W_g · [f_text ; f_spatial])
output = gate × f_text + (1 − gate) × f_spatial
 
# Personal model: 24D spatial features
# All other models: 8D spatial features
 
# 8D vector per token:
# x0_n · y0_n · w_n · h_n
# bold · caps · font_n · abs_y
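The gating formula above can be realized as a small PyTorch module. This is a sketch under one stated assumption: the 8D/24D spatial vector must first be projected up to the text embedding width before it can be gated against it (the pseudocode leaves that lift implicit), so the `spatial_proj` layer here is my addition, not a confirmed detail of the real `GLUSpatialFusion`.

```python
import torch
import torch.nn as nn

class GLUSpatialFusion(nn.Module):
    """Per-dimension sigmoid gate decides how much text vs. layout signal to keep."""
    def __init__(self, text_dim: int, spatial_dim: int):
        super().__init__()
        # Assumed lift: 8D/24D spatial features -> text embedding width
        self.spatial_proj = nn.Linear(spatial_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, text_dim)  # W_g over [f_text ; f_spatial]

    def forward(self, f_text, f_spatial):
        f_spatial = self.spatial_proj(f_spatial)
        g = torch.sigmoid(self.gate(torch.cat([f_text, f_spatial], dim=-1)))
        return g * f_text + (1 - g) * f_spatial
```

Because the gate is learned per dimension, the model can lean on layout for heading-like tokens and on semantics for body text, without any hand-written layout rule.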

Training Configuration

Optimizer
AdamW · lr=2e-5
LR Schedule
Cosine decay + 10% warmup
Gradient clip
max_norm=1.0
Early stopping
Patience = 4 epochs
Focal Loss
γ=2.0 (personal model)
CRF head
Personal model only
Class weights
sqrt-inverse-freq [0.5, 5.0]
Batch size
8 (token) · 16 (chunk)
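The optimizer settings in the table can be wired together in a few lines. A minimal sketch, assuming a hand-rolled cosine-with-warmup schedule via `LambdaLR` (the real engine may use a library scheduler instead); the tiny `nn.Linear` stands in for the actual encoder.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup to the base lr, then cosine decay to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(8, 2)                 # stand-in for the real model
opt = AdamW(model.parameters(), lr=2e-5)
sched = cosine_with_warmup(opt, warmup_steps=100, total_steps=1000)  # 10% warmup
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

Warmup matters with a pretrained encoder: the first updates at full lr=2e-5 can otherwise wreck the DistilRoBERTa weights before the new heads have learned anything.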

4-Stage Inference Pipeline

Stage 1 · boundary
Token-level: O / B-HEADING / I-HEADING

Detects all section heading boundaries across the full document. Heavy O-class imbalance handled with sqrt-inverse-frequency class weights.
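The sqrt-inverse-frequency weighting can be sketched like this; the helper name `sqrt_inv_freq_weights` and the mean-normalization step are assumptions, while the `[0.5, 5.0]` clamp comes from the training configuration above.

```python
from collections import Counter
import torch

def sqrt_inv_freq_weights(labels, classes, clamp=(0.5, 5.0)):
    """w_c = sqrt(N / count_c), normalized around 1.0, then clamped
    so the dominant O class is damped but never zeroed out."""
    counts = Counter(labels)
    n = len(labels)
    w = torch.tensor([(n / counts[c]) ** 0.5 for c in classes])
    w = w / w.mean()
    return w.clamp(*clamp)
```

The square root is the point: plain inverse frequency would up-weight rare `I-HEADING` tokens by 50x or more and destabilize training, while the sqrt keeps the ratio in a trainable range.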

Stage 2 · section_chunk
Sequence-level: chunk → semantic section label

Classifies heading+body blocks as EXPERIENCE / SKILLS / EDUCATION / etc. 25% header-stripping augmentation for headless section robustness. Virtual PERSONAL chunk for pre-heading personal info.
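The header-stripping augmentation amounts to dropping the heading tokens from a chunk 25% of the time. A minimal sketch under assumed names (`strip_header`, `heading_len` as the token count of the heading):

```python
import random

def strip_header(chunk_tokens, heading_len, p=0.25, rng=random):
    """With probability p, drop the heading tokens so the classifier
    must recognize a section from its body alone."""
    if rng.random() < p:
        return chunk_tokens[heading_len:]
    return chunk_tokens
```

Without this, the classifier can shortcut-learn "chunk starts with the word EXPERIENCE" and then fail on resumes whose sections have no explicit heading.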

Stage 3a · personal
Token-level entity: NAME, EMAIL, PHONE, GITHUB …

24D spatial features + CRF head + Focal Loss (γ=2.0). Faker-based entity swapping augmentation for location data. Enforces legal BIO tag transitions.
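Focal Loss with γ=2.0, as used by the personal model, scales cross-entropy by (1 − p_t)^γ so confident, easy tokens contribute almost nothing. A minimal token-classification sketch (the function name is illustrative; the real head combines this with the CRF):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma:
    easy tokens are down-weighted, hard/rare entities dominate the gradient."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                  # model's probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```

At γ=0 this reduces exactly to cross-entropy; at γ=2 a token predicted correctly with p=0.9 contributes only 1% of its CE loss, which is what lets sparse entities like GITHUB compete with the sea of O tokens.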

Stage 3b · exp_boundary + exp_label
Entry segmentation + role/company/date extraction

ExpBoundaryDataset uses confirmed experienceEntryHeads as ground truth. Skips docs with no confirmed heads. Entity labels: ROLE, COMP, COMP_LOC, SDATE, EDATE, DESC.

04 — JSON Sandbox

Show, don't tell.

Simulated extraction output showing the structured JSON the model produces — with per-field confidence scores.

Raw Resume Text
Priya Sharma
priya@example.com | +91 98765 43210
github.com/priya | Mumbai, India

EXPERIENCE
Software Engineer — Acme Corp (2022–2024)
Built microservices in Go, reduced p99 latency by 40%

SKILLS
Go, Python, Kubernetes, PostgreSQL, gRPC

EDUCATION
B.Tech Computer Science — IIT Bombay (2018–2022)
Extracted JSON · ✓ Structured
{
  "personal": {
    "name": "Priya Sharma",          // conf: 0.994
    "email": "priya@example.com",    // conf: 0.971
    "phone": "+91 98765 43210",      // conf: 0.958
    "github": "github.com/priya",    // conf: 0.933
    "location": "Mumbai, India"      // conf: 0.912
  },
  "sections_detected": [
    "EXPERIENCE",
    "SKILLS",
    "EDUCATION"
  ],
  "experience": [
    {
      "role": "Software Engineer",
      "company": "Acme Corp",
      "start_date": "2022",
      "end_date": "2024",
      "confidence": 0.967
    }
  ]
}
05 — System Architecture

Full stack, production-ready.

// inference flow
PDF Upload ──▶ OCR / PyMuPDF ──▶ Spatial Token Extraction
↓ (8D/24D spatial vectors per token)
DistilRoBERTa-Base ──▶ GLUSpatialFusion ──▶ CRF / Linear Head
↓ (4-stage modular pipeline)
Structured JSON ──▶ MongoDB Atlas ──▶ API Response
// labeling stack
Next.js UI ──▶ Ollama (Gemma 4) ──▶ FastAPI Training Engine
↓ (PyTorch · MPS / CUDA-agnostic)
MongoDB (labels) ──▶ DataLoader ──▶ Model Checkpoints
// deployment target
PyTorch ──▶ TorchScript / ONNX Runtime ──▶ INT8 Quantization
Next.js · React · TailwindCSS · Python · FastAPI · PyTorch · HuggingFace Transformers · distilroberta-base · Ollama / Gemma 4 · MongoDB Atlas · ONNX Runtime · TorchScript · PyMuPDF · torchcrf · AdamW
06 — Failure Log

Where it fails — and how it's being fixed.

Honesty as an engineering signal. Real edge cases, their root causes, and the current mitigation status.

07 — Milestones

Build log.

Labeling app + human-verified gold dataset · ✓ Done
Training engine (FastAPI + PyTorch) · ✓ Done
All dataset classes + model architectures · ✓ Done
Auto-label full DB: Sections, Skills, Experience (May 2026) · ✓ Done
Assign trainingMeta.split to all 203 resumes
Train boundary model (baseline metrics)
Train section_chunk model
Train personal model with updated dataset
Train exp_boundary + exp_label models
Evaluate all models · high-loss outlier analysis
Add /infer endpoint for active-learning loop
INT8 / ONNX quantization for deployment
Get in touch

Interested in the research?

The training engine is actively in development. Happy to talk architecture, labeling strategy, or production NLP challenges.