January 13, 2026

Tags: Python, API, PDF, Tutorial

How to Extract Tables from PDFs Using Python (Without Losing Your Mind)

If you’ve ever tried to extract data from a PDF, you know the pain. What looks like a simple table on screen is actually a chaotic mess of positioned text elements in the file.

I built a PDF extraction API for a real project and ended up learning more about PDF internals than I expected. Here’s a practical breakdown.

The Problem: PDFs Don’t Have “Tables”

Open any PDF with tabular data. It looks organized, right? Rows, columns, headers.

Now look at what’s actually in the file:

draw "Product" at position (50, 100)
draw "Price" at position (200, 100)
draw "Widget" at position (50, 120)
draw "$99" at position (200, 120)

There’s no table structure. No rows. No columns. Just text floating at coordinates.

Your job is to reconstruct the logical structure from spatial positions.
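
You can see this for yourself. PyMuPDF's word-level extraction mode returns every word together with its bounding-box coordinates (a minimal sketch; sample.pdf stands in for any PDF with a table):

import fitz  # PyMuPDF

doc = fitz.open("sample.pdf")
page = doc[0]

# "words" mode yields tuples:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
for x0, y0, x1, y1, word, *rest in page.get_text("words")[:10]:
    print(f"{word!r} at ({x0:.0f}, {y0:.0f})")

doc.close()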

Approach 1: PyMuPDF (Basic Text Extraction)

For simple text extraction, PyMuPDF (imported as fitz) is fast and reliable:

import fitz  # PyMuPDF's import name

def extract_text(pdf_path):
    """Concatenate the plain text of every page."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

Pros: Fast, handles most PDFs
Cons: Tables come out as jumbled text

Output from a table:

Product Price Quantity
Widget $99 10
Gadget $149 5

Not useful if you need structured data.

Approach 2: pdfplumber (Table Detection)

pdfplumber is specifically designed for table extraction:

import pdfplumber

def extract_tables(pdf_path):
    """Collect every table detected on every page."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is
            # a list of rows, each row a list of cell strings (or None)
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

Pros: Detects table boundaries automatically
Cons: Struggles with complex layouts, merged cells

Output:

[
    [['Product', 'Price', 'Quantity'],
     ['Widget', '$99', '10'],
     ['Gadget', '$149', '5']]
]

Much better! But still needs post-processing.
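
Post-processing can be as simple as zipping each row with the header row and cleaning up empty cells (a sketch; it assumes the first row really is the header):

def table_to_records(table):
    # First row as headers; normalize None cells to empty strings
    headers = [(h or "").strip() for h in table[0]]
    records = []
    for row in table[1:]:
        cells = [(c or "").strip() for c in row]
        records.append(dict(zip(headers, cells)))
    return records

# table_to_records(tables[0])
# -> [{'Product': 'Widget', 'Price': '$99', 'Quantity': '10'}, ...]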

Approach 3: Combining Both

The best results come from combining approaches:

import fitz
import pdfplumber

def smart_extract(pdf_path):
    # First, check whether the PDF has a selectable text layer
    doc = fitz.open(pdf_path)
    first_page_text = doc[0].get_text().strip()

    if len(first_page_text) < 50:
        # Almost no text on page 1 - likely a scanned PDF that needs OCR
        doc.close()
        return {"error": "Scanned PDF detected, OCR required"}

    # Extract tables with pdfplumber
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if table and len(table) > 1:
                    # Treat the first row as headers, the rest as data
                    headers = table[0]
                    rows = table[1:]
                    tables.append({
                        "page": i + 1,
                        "headers": headers,
                        "rows": rows
                    })

    # Extract the full page text with PyMuPDF
    full_text = ""
    for page in doc:
        full_text += page.get_text()

    page_count = len(doc)
    doc.close()

    return {
        "tables": tables,
        "text": full_text,
        "page_count": page_count
    }
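
Example usage (invoice.pdf is a placeholder filename):

result = smart_extract("invoice.pdf")
if "error" in result:
    print(result["error"])
else:
    print(f"Found {len(result['tables'])} table(s) in {result['page_count']} page(s)")
    for t in result["tables"]:
        print(f"Page {t['page']}: columns = {t['headers']}")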

The Hard Parts Nobody Tells You About

1. Table boundaries are ambiguous

Is this one table or two?

Name     | Email
---------|------------------
John     | john@example.com

Department | Budget
-----------|--------
Sales      | $50,000

Humans see two tables. Algorithms often merge them.
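
One cheap countermeasure: split any extracted table at fully empty rows, since extractors often emit a blank row at the visual gap between two stacked tables (a heuristic sketch, not a general fix):

def split_on_blank_rows(table):
    # Split one raw table into several wherever a row is entirely empty
    chunks, current = [], []
    for row in table:
        if all(cell in (None, "") for cell in row):
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(row)
    if current:
        chunks.append(current)
    return chunks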

2. Headers aren’t always on top

Some invoices put totals at the bottom. Some have headers on the left side. Some have no headers at all.
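
One heuristic that helps: only treat a row as a header if none of its cells look like numbers or amounts (a sketch; plenty of real documents will defeat it):

import re

# Matches things like "$99", "1.234,56 EUR", "JPY 1234" (rough, illustrative)
AMOUNT = re.compile(r"^[\s$€£¥]*(?:EUR|USD|JPY)?\s*[\d.,]+\s*(?:EUR|USD|JPY)?$")

def looks_like_header(row):
    cells = [(c or "").strip() for c in row]
    return all(cell and not AMOUNT.match(cell) for cell in cells)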

3. Multi-page tables

When a table spans pages, you need to:

  • Detect that it’s a continuation (no headers on page 2; see the sketch after this list)
  • Merge rows correctly
  • Handle page breaks mid-row
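
A rough continuation-merging pass, reusing looks_like_header from above on the table dicts that smart_extract produces (assumption: a continuation has the same column count as the previous table and a data-looking first row):

def merge_continuations(tables):
    # tables: dicts like {"page": n, "headers": [...], "rows": [...]}
    merged = []
    for t in tables:
        prev = merged[-1] if merged else None
        if (prev is not None
                and len(t["headers"]) == len(prev["headers"])
                and not looks_like_header(t["headers"])):
            # Continuation: its "headers" row is really the first data row
            prev["rows"].append(t["headers"])
            prev["rows"].extend(t["rows"])
        else:
            merged.append(t)
    return merged

Mid-row page breaks are nastier still: you end up stitching cell fragments back together based on vertical position.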

4. Currency and number parsing

“$1,234.56” vs “1.234,56 EUR” vs “JPY 1234”

Different locales, different formats. Don’t assume.
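
A naive normalizer for those three shapes (illustrative only; for production use a real library such as Babel's number parsing, and note that guessing the locale from the currency is an assumption):

import re

def parse_amount(raw):
    # Very rough: handles "$1,234.56", "1.234,56 EUR", "JPY 1234"
    s = raw.strip()
    currency = "USD" if "$" in s else None
    m = re.search(r"\b(EUR|USD|JPY)\b", s)
    if m:
        currency = m.group(1)
    digits = re.sub(r"[^\d.,]", "", s)
    if currency == "EUR" and "," in digits:
        # Guess European convention: "." groups thousands, "," marks decimals
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    return float(digits), currency

# parse_amount("$1,234.56")    -> (1234.56, 'USD')
# parse_amount("1.234,56 EUR") -> (1234.56, 'EUR')
# parse_amount("JPY 1234")     -> (1234.0, 'JPY')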

A Better Way: Use an API

After building all this myself, I packaged it into an API so others don’t have to:

curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/extract/tables" \
  -H "X-API-Key: sk_demo_123456789" \
  -F "file=@invoice.pdf"

Response:

{
  "tables": [
    {
      "page_number": 1,
      "headers": ["Product", "Price", "Qty"],
      "rows": [
        ["Widget", "$99", "10"],
        ["Gadget", "$149", "5"]
      ]
    }
  ],
  "table_count": 1
}
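
The same call from Python, using requests (the URL and demo key are from the curl example above):

import requests

url = "https://pdfpull-895295000838.europe-west1.run.app/api/v1/extract/tables"
with open("invoice.pdf", "rb") as f:
    resp = requests.post(url, headers={"X-API-Key": "sk_demo_123456789"},
                         files={"file": f})

for table in resp.json()["tables"]:
    print(f"Page {table['page_number']}: {table['headers']}")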

It also has smart parsers for invoices and resumes that extract specific fields:

curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/parse/invoice" \
  -H "X-API-Key: sk_demo_123456789" \
  -F "file=@invoice.pdf"
Response:

{
  "vendor_name": "ACME Corporation",
  "invoice_number": "INV-2024-0042",
  "invoice_date": "January 15, 2024",
  "total_amount": 1735.00,
  "currency": "USD",
  "line_items": [
    {"description": "Widget", "quantity": 10, "amount": 990.00},
    {"description": "Gadget", "quantity": 5, "amount": 745.00}
  ],
  "confidence": 0.91
}

Conclusion

PDF extraction is harder than it looks. For one-off jobs or projects that only occasionally need to parse a PDF, a library like pdfplumber is usually enough. If you need consistent results across lots of PDFs at scale, an API that handles the edge cases for you can save real time.


Building this in public. Follow along on Twitter: @uppnrise