
How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches

Last updated: 2026-05-14 09:45:16 · Reviews & Comparisons

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

Source: towardsdatascience.com

What You Need

  • Python 3.8+ installed on your machine
  • pytesseract – Python wrapper for Tesseract OCR engine
  • Tesseract OCR engine installed separately (see Tesseract OCR documentation)
  • Ollama – local LLM server (download from ollama.com)
  • LLaMA 3 model (run ollama pull llama3 after installing Ollama)
  • Python libraries: pdf2image, Pillow, and requests (the re module ships with Python)
  • A sample B2B PDF (e.g., a purchase order with fields: company name, date, line items, totals)

Step-by-Step Instructions

Step 1: Set Up the Environment

Create a new Python virtual environment and install all required packages:

pip install pytesseract pdf2image Pillow requests

Ensure Tesseract OCR is installed globally (sudo apt install tesseract-ocr on Linux, or download the Windows installer). Also install and start Ollama, then pull the LLaMA 3 model:

ollama pull llama3

Step 2: Convert PDF to Images

B2B documents are often scanned PDFs. Use pdf2image to turn each page into a PNG image. Write a function that:

  • Takes the PDF path as input
  • Converts pages to images using convert_from_path
  • Returns a list of PIL Image objects
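The steps above can be sketched as a small helper. The name pdf_to_images is ours, and the converter parameter is not part of pdf2image; it exists only so the sketch can be exercised without Poppler installed:

```python
def pdf_to_images(pdf_path, dpi=300, converter=None):
    """Convert each page of a PDF into a PIL Image.

    `converter` defaults to pdf2image.convert_from_path; it is injectable
    so the function can be tested without the Poppler binaries.
    """
    if converter is None:
        # Imported lazily so the dependency is only needed when converting.
        from pdf2image import convert_from_path
        converter = convert_from_path
    return converter(pdf_path, dpi=dpi)
```

Note that convert_from_path needs the Poppler utilities on the system PATH; a higher dpi improves OCR accuracy at the cost of memory.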

Step 3: Perform OCR with pytesseract

For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.
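A sketch of the per-page OCR loop (the name ocr_pages is ours, and the ocr parameter is injectable only so the sketch runs without the Tesseract engine):

```python
def ocr_pages(images, ocr=None):
    """Run OCR on each page image and return one text string per page."""
    if ocr is None:
        import pytesseract  # requires the Tesseract binary to be installed
        ocr = pytesseract.image_to_string
    return [ocr(img) for img in images]
```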

Step 4: Build the Rule-Based Extractor

Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items. For example:

  • Search for patterns like r'Order\s*#:\s*(\S+)'
  • Use a list of known product names for line items
  • Parse multi-line blocks for tables

This method is fast and predictable, but fragile if the document format changes.
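Putting those rules together, a minimal rule-based extractor might look like this. The field names and patterns are illustrative; tune them to your own templates:

```python
import re

def extract_fields(text):
    """Pull key fields out of OCR text with regex; missing fields become None."""
    patterns = {
        "order_number": r"Order\s*#:\s*(\S+)",
        "date": r"Date:\s*([0-9]{4}-[0-9]{2}-[0-9]{2})",
        "client_name": r"Client:\s*(.+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields

sample = "Order #: PO-1042\nDate: 2026-03-01\nClient: Acme GmbH"
print(extract_fields(sample))
# → {'order_number': 'PO-1042', 'date': '2026-03-01', 'client_name': 'Acme GmbH'}
```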

How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches
Source: towardsdatascience.com

Step 5: Build the LLM-Based Extractor

Instead of writing rules, send the extracted text to LLaMA 3 via Ollama’s API. Send a structured prompt that asks the model to extract specific fields in JSON format:

prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.

Text:
{text}
"""

Use the requests library to call Ollama:

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': prompt, 'stream': False},
)
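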

Parse the JSON from the response.
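With streaming disabled, Ollama wraps the whole generation in a JSON envelope whose response field holds the model's text. One way to pull structured data out of it (the helper name parse_llm_json is ours, and the fence-stripping is a defensive assumption, since some models wrap JSON in markdown fences despite instructions):

```python
import json

def parse_llm_json(api_reply):
    """Extract the model's text from Ollama's envelope and parse it as JSON."""
    text = api_reply["response"].strip()
    # Strip markdown code fences if the model added them anyway.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)
```

In production you would also want to catch json.JSONDecodeError and retry or fall back, since LLM output is not guaranteed to be valid JSON.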

Step 6: Compare Outputs

Run both extractors on the same set of PDFs and compare:

  • Accuracy: Which fields are correct?
  • Robustness: How does each handle missing data or typos?
  • Speed: Rule-based usually finishes in seconds; LLM may take 10–30 seconds per page.

The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
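Once you have hand-labelled ground truth, a simple field-by-field harness (entirely illustrative) makes the accuracy comparison repeatable:

```python
def score_extraction(predicted, expected):
    """Return the fraction of expected fields the extractor got exactly right."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)

truth = {"order_number": "PO-1042", "client_name": "Acme GmbH"}
print(score_extraction({"order_number": "PO-1042", "client_name": None}, truth))
# → 0.5
```

Running this over both extractors and a folder of labelled PDFs gives you the accuracy column of the comparison; robustness and speed you measure by varying the inputs and timing each run.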

Tips for Success

  • Preprocess images: For rule-based OCR, apply thresholding or deskewing to improve accuracy.
  • Optimize LLM prompts: Include example outputs and specify format clearly to reduce hallucinations.
  • Fallback strategy: Use rule-based extraction for well-known templates and LLM as a fallback for unknown documents.
  • Test with diverse samples: Don’t rely on a single document; vary fonts, layouts, and printing quality.
  • Monitor costs: Local LLMs are free to run but need capable hardware (a GPU helps considerably); cloud LLMs charge per token.
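For the preprocessing tip, a basic global threshold with Pillow looks like this. This is a sketch; real scans often need adaptive thresholding or deskewing, which Pillow alone does not provide:

```python
from PIL import Image

def binarize(image, threshold=150):
    """Convert an image to pure black and white to help Tesseract."""
    gray = image.convert("L")  # 8-bit grayscale
    # Map every pixel above the threshold to white, the rest to black.
    return gray.point(lambda p: 255 if p > threshold else 0, mode="1")
```

Feed the result of binarize() into pytesseract instead of the raw page image; the best threshold value depends on the scan quality, so treat 150 as a starting point.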

By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article.