E-commerce
Classification
1 month

Building a classifier for very large text documents

95% accurate. Weeks to 30 seconds. 1,000+ docs organized. 500 hours saved.

500+
Hours saved
95%
TOP-5 accuracy
30
Seconds/document

Project Overview

A client asked us to classify 1,000+ internal documents with codes that do not exist in public taxonomies. The files were intended for publication, so they had to be easy for the target audience to find.

  • 1,000+ documents
  • 60–300 pages each (average 150–200)
  • Three code systems:
    • System 1: ~7,000 codes (flat)
    • System 2: ~700 codes (specialized)
    • System 3: ~7,000 codes (hierarchy: general → specific → highly specific)

Each code has an article number and a short description (~120 characters). The request was to build an AI agent that assigns the right codes to each document.

Key Challenges

  1. The codes are internal and not in public databases. Standard models do not know them, so we needed a custom zero-shot approach.

  2. Codes are short (~120 characters), while documents are long (60–300 pages). This size gap makes comparison hard.

  3. Only the client’s experts knew the rules; our team had no prior examples.

  4. Each document needs codes from three systems at once, so we had to process them in parallel.

Our Solution

We built a production-ready RAG classifier for very long documents (60–300 pages). It maps each file to codes in three systems in about 30 seconds. Every result includes page-level evidence and a confidence score.

We turned ~14,000 internal codes into simple “concept cards”: code ID, short description, common names, and full parent–child path. We added domain synonyms and near-miss negatives. With subject-matter experts we built a small, high-quality set of true matches and counter-examples to guide thresholds and evaluation.
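As an illustration, here is a minimal sketch of what such a "concept card" might look like as a Python data structure; the field names and example values are our own, not the client's schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptCard:
    """One retrievable record per internal code (illustrative field names)."""
    code_id: str                                            # internal code identifier
    description: str                                        # short official description (~120 chars)
    common_names: list[str] = field(default_factory=list)   # domain synonyms
    parent_path: list[str] = field(default_factory=list)    # general -> specific -> highly specific
    negatives: list[str] = field(default_factory=list)      # near-miss codes it is often confused with

    def to_text(self) -> str:
        """Flatten the card into a single passage for embedding and indexing."""
        parts = [self.code_id, self.description]
        if self.parent_path:
            parts.append("Path: " + " > ".join(self.parent_path))
        if self.common_names:
            parts.append("Also known as: " + ", ".join(self.common_names))
        return " | ".join(parts)

# Example card (made-up values)
card = ConceptCard(
    code_id="SYS3-4821",
    description="Pressure relief valves for industrial pipelines",
    common_names=["safety valve", "PRV"],
    parent_path=["Valves", "Safety valves", "Pressure relief valves"],
)
print(card.to_text())
```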

We parse titles, abstracts, headings, sections, and tables, then chunk the text with a small overlap so meaning is not lost. We keep helpful signals like section paths, page numbers, acronyms, and definition patterns. Titles and abstracts get extra weight.
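A simplified sketch of the chunking step, assuming the document has already been parsed into sections; the overlap size and the word-count token proxy are illustrative, not the production values.

```python
def chunk_section(section_title: str, page: int, text: str,
                  max_words: int = 350, overlap: int = 40) -> list[dict]:
    """Split one section into overlapping chunks, keeping section/page metadata."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append({
            "section_path": section_title,   # helps reranking and page-anchored evidence links
            "page": page,
            "text": " ".join(words[start:end]),
        })
        if end == len(words):
            break
        start = end - overlap                # small overlap so meaning is not cut mid-idea
    return chunks

chunks = chunk_section("2.1 Scope", page=14, text="word " * 900)
print(len(chunks), chunks[0]["section_path"], chunks[0]["page"])
```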

For each system we keep a dense semantic index and a sparse lexical index, query both, and merge. A cross-encoder reranker boosts precision. Hierarchy proximity nudges the model from a likely parent to the specific child. The LLM reasons over a short candidate list and outputs structured JSON with code, rationale, page-anchored evidence, and confidence (exact spans required). We use a conservative sampling setting for stable results, and guardrails block out-of-ontology codes, missing evidence, or conflicting siblings.
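To make the guardrail step concrete, here is a minimal validation sketch of the kind of structured JSON we expect back from the LLM; the exact schema, field names, and example codes are illustrative assumptions.

```python
import json

VALID_CODES = {"SYS1-0042", "SYS2-0137", "SYS3-4821"}   # loaded from the ontology in practice

def validate_prediction(raw_json: str, valid_codes: set[str]) -> dict:
    """Reject out-of-ontology codes and answers without page-anchored evidence."""
    pred = json.loads(raw_json)
    if pred["code"] not in valid_codes:
        raise ValueError(f"code {pred['code']} is not in the ontology")
    evidence = pred.get("evidence", [])
    if not evidence or any("page" not in e or not e.get("span") for e in evidence):
        raise ValueError("every prediction must carry page-anchored evidence spans")
    if not (0.0 <= pred.get("confidence", -1) <= 1.0):
        raise ValueError("confidence must be in [0, 1]")
    return pred

sample = json.dumps({
    "code": "SYS3-4821",
    "rationale": "Section 2.1 describes pressure relief requirements.",
    "evidence": [{"page": 14, "span": "pressure relief valves shall ..."}],
    "confidence": 0.87,
})
print(validate_prediction(sample, VALID_CODES)["code"])
```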

We started with n8n and moved to LangChain for finer control. Specialized agents run each step: ingestion removes PII; per-system retrievers run hybrid search; reranking tightens candidates; the classifier reasons over the reranked candidates and extracts evidence; a hierarchy pass checks path consistency; and a calibrator turns raw scores into reliable confidences via isotonic or temperature scaling on the SME-labeled set. A final validator packages results, deduplicates near-equivalents, and produces a Top-5 per system with short rationales and clickable evidence links.
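A minimal sketch of the calibration step, assuming scikit-learn and a small SME-labeled set of (raw score, correct/incorrect) pairs; the numbers are placeholders.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw reranker/LLM scores vs. whether the SME confirmed the code (placeholder data)
raw_scores = np.array([0.20, 0.35, 0.40, 0.55, 0.60, 0.72, 0.80, 0.91])
sme_labels = np.array([0,    0,    1,    0,    1,    1,    1,    1   ])

# Isotonic regression learns a monotone map from raw score -> empirical correctness rate
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, sme_labels)

print(calibrator.predict([0.45, 0.85]))   # calibrated confidences for new predictions
```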

The pipeline runs in parallel. Chunking and batched retrieval keep latency low. Indexes hot-reload when codes or descriptions change. Dashboards track precision@1/@5, per-system accuracy, hierarchy errors, and performance by document length and section density. Low-confidence cases go to human experts; their decisions update the gold set and improve the system.
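For reference, a small sketch of how a top-k accuracy metric over the gold set can be computed; the per-document data format (ranked predicted codes vs. SME-approved codes) is a hypothetical example, not the production dashboard code.

```python
def hit_rate_at_k(ranked_codes: list[list[str]], gold_codes: list[set[str]], k: int) -> float:
    """Fraction of documents whose top-k predictions contain at least one SME-approved code."""
    hits = sum(
        1 for preds, gold in zip(ranked_codes, gold_codes)
        if any(code in gold for code in preds[:k])
    )
    return hits / len(ranked_codes)

preds = [["A12", "B07", "C33"], ["D41", "A12", "E90"]]
gold  = [{"A12"},               {"E90"}]
print(hit_rate_at_k(preds, gold, k=1))  # 0.5
print(hit_rate_at_k(preds, gold, k=3))  # 1.0
```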

Users just upload a document and get the most relevant codes for each system, with clear reasoning and page citations. In production we reached ~95% accuracy, processed 1,000+ documents, and cut days of work to ~30 seconds per file, saving 500+ hours while keeping experts in the loop.

What we've learned building this

Key insights and lessons from this project that shaped our approach to future AI implementations.

Hybrid approach outperforms pure semantic search

Dense + sparse combination delivers better results than either method alone. Pure semantic search is good for conceptual understanding but misses exact terms. Pure BM25 is effective for keyword matching but doesn't understand synonyms or context. The hybrid approach (an 85/15 dense-to-sparse ratio) balances the two, while neither method on its own delivered acceptable results.
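A minimal sketch of the 85/15 blend, assuming the cosine scores are already in [0, 1]; the min-max normalization of BM25 shown here is one common choice, not necessarily the production one.

```python
import numpy as np

def hybrid_scores(cosine: np.ndarray, bm25: np.ndarray,
                  w_dense: float = 0.85, w_sparse: float = 0.15) -> np.ndarray:
    """Blend dense (cosine) and sparse (BM25) scores after min-max normalizing BM25."""
    bm25_norm = (bm25 - bm25.min()) / (bm25.max() - bm25.min() + 1e-9)
    return w_dense * cosine + w_sparse * bm25_norm

cosine = np.array([0.82, 0.74, 0.69])
bm25   = np.array([3.1,  7.8,  0.4])
print(hybrid_scores(cosine, bm25))   # higher is better; ranking may differ from either signal alone
```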

Weighted vector averaging is critically important

Simple averaging of all chunks causes the document body to "drown out" important metadata (title, abstract). We implemented weighted averaging with per-source weight coefficients. Result: ~15% improvement in TOP-1 accuracy compared to uniform averaging.
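A minimal sketch of the weighted averaging, with the title weighted 3.0 and body chunks 1.0 as described in the pipeline; the embedding vectors here are random placeholders.

```python
import numpy as np

def weighted_document_vector(title_vec: np.ndarray, chunk_vecs: np.ndarray,
                             title_weight: float = 3.0, body_weight: float = 1.0) -> np.ndarray:
    """Weighted mean of title and body-chunk embeddings, then L2-normalized."""
    weights = np.array([title_weight] + [body_weight] * len(chunk_vecs))
    vectors = np.vstack([title_vec, chunk_vecs])
    doc_vec = np.average(vectors, axis=0, weights=weights)
    return doc_vec / np.linalg.norm(doc_vec)

rng = np.random.default_rng(0)
title_vec = rng.normal(size=768)
chunk_vecs = rng.normal(size=(40, 768))   # 40 body chunks
print(weighted_document_vector(title_vec, chunk_vecs).shape)   # (768,)
```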

Context window limitations require selection strategy

You can't cram an entire document into a single embedding. A selection strategy is needed: prioritize the most informative sections, cap the token budget, and chunk the rest. More tokens mean more coverage, but also higher cost and slower processing.
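For illustration, a sketch of the prioritize-limit-chunk idea: always keep the title and abstract, then add body chunks until a token budget is reached. The budget value and the word-based token proxy are placeholders, not project settings.

```python
def select_chunks(title: str, abstract: str, body_chunks: list[str],
                  token_budget: int = 6000) -> list[str]:
    """Always keep title and abstract, then add body chunks until the budget is spent."""
    def n_tokens(text: str) -> int:
        return len(text.split())          # crude proxy; a real tokenizer would be used in practice

    selected = [title, abstract]
    used = sum(n_tokens(t) for t in selected)
    for chunk in body_chunks:             # body chunks assumed pre-ordered by importance
        cost = n_tokens(chunk)
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected

picked = select_chunks("Valve maintenance manual", "Covers relief valve servicing.",
                       ["chunk one " * 100, "chunk two " * 100, "chunk three " * 5000])
print(len(picked))
```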

Zero-shot classification is possible even for domain-specific codes

Even without fine-tuning the model, the right retrieval architecture delivers high accuracy: quality embeddings, proper knowledge indexing, and a hybrid approach. What was NOT needed: fine-tuning a model on a custom dataset, creating labeled training data, or training classifiers from scratch.

Key Results

~30 sec

Processing time per document

95%

Accuracy

500+

Hours saved

Implementation Process

Introduction

We aligned on the project details and received the classification codes and example documents. There was one condition: implement a basic pilot (MVP) on n8n for testing.

First try

We started development on n8n: document upload via Google Drive, document summarization, attempts to extract relevant codes, and sending results by email. The result of this experiment: high token costs, long processing times, and hallucinations. We realized the task is not feasible on n8n, since it would only allow building a ReAct agent that cyclically tries to guess the classification code.

Regroup

We discussed all the details with the client, showed the results, and proposed a different architecture: direct OpenAI integration, custom Python code, RAG, and dataset development. We started building a custom dataset from existing documents; each entry consisted of a document, its correct code, and an incorrect (near-miss) code.

Architecture & Development

Developed a custom Python system with a modular architecture. Deployed a vector DB storing ~14,000 classification codes using an embeddings model. Implemented the pipeline: PDF → text → chunking (800–1,000 tokens) → embeddings → weighted average (title × 3.0, body × 1.0) → L2 normalization. Created hybrid scoring: 85% semantic (cosine) + 15% keyword (BM25). Additionally: RRF for A/B testing and JSON output with score breakdowns.
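Since RRF is mentioned as the A/B alternative to the weighted blend, here is a minimal sketch of reciprocal rank fusion over two candidate lists; k=60 is the commonly used constant and the code IDs are made up, not project values.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked code lists: score(code) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_top  = ["SYS1-0042", "SYS1-0187", "SYS1-0930"]   # from the semantic index
sparse_top = ["SYS1-0187", "SYS1-0042", "SYS1-0551"]   # from BM25
print(reciprocal_rank_fusion([dense_top, sparse_top])[:3])
```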

Testing & Optimization

Tested on 10 documents. Initial results: ~55–60% top-1 accuracy, ~80–85% among the top-5. Conducted a series of experiments with various system parameters. Final results: ~75–80% top-1 accuracy, ~92–95% among the top-5, processing time ~30–45 seconds. After optimization, we gradually grew the dataset from 10 to 100 documents.

Maintenance & Dataset Enrichment

Client experts validate classification results. For each document, 5 proposed options are stored, along with the expert's final choice and reasons for changes. Based on this, we improve the dataset and system settings. Every 2-3 weeks we conduct optimizations: expand descriptions of problematic codes, add synonyms, update the database. The system automatically processes new documents with the option of expert verification.

Ready to transform your business?

Let's discuss how AI can solve your specific challenges. Our team is ready to build custom solutions that deliver measurable results.

EXPLORE SERVICES