January 13, 2026
How to Extract Tables from PDFs Using Python (Without Losing Your Mind)
If you’ve ever tried to extract data from a PDF, you know the pain. What looks like a simple table on screen is actually a chaotic mess of positioned text elements in the file.
I built a PDF extraction API for a real project and ended up learning more about PDF internals than I expected. Here’s a practical breakdown.
The Problem: PDFs Don’t Have “Tables”
Open any PDF with tabular data. It looks organized, right? Rows, columns, headers.
Now look at what’s actually in the file:
draw "Product" at position (50, 100)
draw "Price" at position (200, 100)
draw "Widget" at position (50, 120)
draw "$99" at position (200, 120)
There’s no table structure. No rows. No columns. Just text floating at coordinates.
Your job is to reconstruct the logical structure from spatial positions.
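To make that concrete, here is a minimal sketch of the core move, using a hypothetical list of (x, y, text) spans like the ones above: cluster spans into rows by y-coordinate, then sort each row left to right. Real extractors do the same thing with tolerances tuned per document.

# Hypothetical input: (x, y, text) spans as a PDF library might report them
spans = [
    (50, 100, "Product"), (200, 100, "Price"),
    (50, 120, "Widget"), (200, 120, "$99"),
]

def spans_to_rows(spans, y_tolerance=2):
    # Group spans whose y-coordinates fall within y_tolerance of each other
    rows = {}
    for x, y, text in sorted(spans, key=lambda s: s[1]):
        key = next((k for k in rows if abs(k - y) <= y_tolerance), y)
        rows.setdefault(key, []).append((x, text))
    # Sort each row left to right and keep just the text
    return [[t for _, t in sorted(cells)] for _, cells in sorted(rows.items())]

print(spans_to_rows(spans))  # [['Product', 'Price'], ['Widget', '$99']]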
Approach 1: PyMuPDF (Basic Text Extraction)
For simple text extraction, PyMuPDF (imported as fitz) is fast and reliable:
import fitz  # PyMuPDF

def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()  # plain text; reading order is best-effort
    doc.close()
    return text
Pros: Fast, handles most PDFs
Cons: Tables come out as jumbled text
Output from a table:
Product Price Quantity
Widget $99 10
Gadget $149 5
Not useful if you need structured data.
Approach 2: pdfplumber (Table Detection)
pdfplumber is specifically designed for table extraction:
import pdfplumber

def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables per page:
            # each table is a list of rows, each row a list of cell strings
            tables.extend(page.extract_tables())
    return tables
Pros: Detects table boundaries automatically
Cons: Struggles with complex layouts, merged cells
Output:
[
  [['Product', 'Price', 'Quantity'],
   ['Widget', '$99', '10'],
   ['Gadget', '$149', '5']]
]
Much better! But still needs post-processing.
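By post-processing I mean things like this: pdfplumber returns None for cells it can't match to text and keeps embedded newlines inside cells, so a small cleanup pass helps before the rows go anywhere else. A minimal sketch:

def clean_table(table):
    # Replace None cells with "" and collapse internal whitespace/newlines
    return [[" ".join(cell.split()) if cell else "" for cell in row]
            for row in table]

# clean_table([["Product", "Price\nper unit"], ["Widget", None]])
# -> [["Product", "Price per unit"], ["Widget", ""]]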
Approach 3: Combining Both
The best results come from combining approaches:
import fitz  # PyMuPDF
import pdfplumber

def smart_extract(pdf_path):
    # First, check whether the PDF has a selectable text layer
    doc = fitz.open(pdf_path)
    first_page_text = doc[0].get_text().strip()
    if len(first_page_text) < 50:  # rough heuristic: almost no text on page one
        # Likely a scanned PDF - needs OCR
        doc.close()
        return {"error": "Scanned PDF detected, OCR required"}

    # Extract tables with pdfplumber
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if table and len(table) > 1:  # skip empty or header-only tables
                    tables.append({
                        "page": i + 1,
                        "headers": table[0],
                        "rows": table[1:],
                    })

    # Extract the full text with PyMuPDF
    full_text = ""
    for page in doc:
        full_text += page.get_text()

    result = {
        "tables": tables,
        "text": full_text,
        "page_count": len(doc),
    }
    doc.close()
    return result
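Typical usage, with a placeholder filename:

result = smart_extract("report.pdf")
if "error" in result:
    print(result["error"])  # route to an OCR pipeline instead
else:
    print(f"Found {len(result['tables'])} tables across {result['page_count']} pages")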
The Hard Parts Nobody Tells You About
1. Table boundaries are ambiguous
Is this one table or two?
Name | Email
---------|------------------
John | john@example.com
Department | Budget
-----------|--------
Sales | $50,000
Humans see two tables. Algorithms often merge them.
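One cheap countermeasure, assuming the extractor at least leaves a fully empty row between the two logical tables (split_on_blank_rows is a hypothetical helper; real splitting usually also checks vertical gaps and column counts):

def split_on_blank_rows(table):
    # Split one extracted "table" into several wherever a row is entirely empty
    chunks, current = [], []
    for row in table:
        if all(not (cell or "").strip() for cell in row):
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(row)
    if current:
        chunks.append(current)
    return chunks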
2. Headers aren’t always on top
Some invoices put totals at the bottom. Some have headers on the left side. Some have no headers at all.
3. Multi-page tables
When a table spans pages, you need to:
- Detect that it’s a continuation (no headers on page 2)
- Merge rows correctly (a sketch follows this list)
- Handle page breaks mid-row
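Here is a naive sketch of that merge, assuming per-page tables shaped like the dicts smart_extract returns (merge_continuations is a hypothetical helper, not a library function):

def merge_continuations(tables):
    # tables: list of {"page", "headers", "rows"} dicts in page order.
    # Naive heuristic: a table starting on the next page with the same
    # column count is treated as a continuation of the previous table.
    merged = []
    for t in tables:
        prev = merged[-1] if merged else None
        if (prev and t["page"] == prev["page"] + 1
                and len(t["headers"]) == len(prev["headers"])):
            if t["headers"] == prev["headers"]:
                # Header row repeated on the new page: drop the duplicate
                prev["rows"].extend(t["rows"])
            else:
                # No header row on the new page: what we captured as
                # "headers" is really the first data row
                prev["rows"].extend([t["headers"]] + t["rows"])
            prev["page"] = t["page"]  # so tables spanning 3+ pages chain
        else:
            merged.append(t)
    return merged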
4. Currency and number parsing
“$1,234.56” vs “1.234,56 EUR” vs “JPY 1234”
Different locales, different formats. Don’t assume.
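A sketch of the usual two-separator disambiguation trick, assuming you only ever see US-style (1,234.56) and European-style (1.234,56) amounts; for anything serious, reach for a locale-aware library such as Babel:

import re

def parse_amount(raw):
    # Strip currency symbols and codes, keep digits and separators
    s = re.sub(r"[^\d.,]", "", raw)
    if "," in s and "." in s:
        # Whichever separator appears last is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
        else:
            s = s.replace(",", "")                    # 1,234.56 -> 1234.56
    elif "," in s:
        # Ambiguous: treat a trailing 2-digit group as decimals, else thousands
        head, _, tail = s.rpartition(",")
        s = head.replace(",", "") + "." + tail if len(tail) == 2 else s.replace(",", "")
    return float(s) if s else None

# parse_amount("$1,234.56")   -> 1234.56
# parse_amount("1.234,56 EUR") -> 1234.56
# parse_amount("JPY 1234")     -> 1234.0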
A Better Way: Use an API
After building all this myself, I packaged it into an API so others don’t have to:
curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/extract/tables" \
-H "X-API-Key: sk_demo_123456789" \
-F "file=@invoice.pdf"
Response:
{
  "tables": [
    {
      "page_number": 1,
      "headers": ["Product", "Price", "Qty"],
      "rows": [
        ["Widget", "$99", "10"],
        ["Gadget", "$149", "5"]
      ]
    }
  ],
  "table_count": 1
}
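The same call from Python with the requests library, using the endpoint and demo key shown above:

import requests

url = "https://pdfpull-895295000838.europe-west1.run.app/api/v1/extract/tables"
with open("invoice.pdf", "rb") as f:
    resp = requests.post(url,
                         headers={"X-API-Key": "sk_demo_123456789"},
                         files={"file": f})
resp.raise_for_status()
for table in resp.json()["tables"]:
    print(table["page_number"], table["headers"])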
It also has smart parsers for invoices and resumes that extract specific fields:
curl -X POST "https://pdfpull-895295000838.europe-west1.run.app/api/v1/parse/invoice" \
-H "X-API-Key: sk_demo_123456789" \
-F "file=@invoice.pdf"
Response:

{
  "vendor_name": "ACME Corporation",
  "invoice_number": "INV-2024-0042",
  "invoice_date": "January 15, 2024",
  "total_amount": 1250.00,
  "currency": "USD",
  "line_items": [
    {"description": "Widget", "quantity": 10, "amount": 990.00},
    {"description": "Gadget", "quantity": 5, "amount": 260.00}
  ],
  "confidence": 0.91
}
Conclusion
PDF extraction is harder than it looks. For one-off jobs or projects that only occasionally touch PDFs, a library like pdfplumber is usually enough. If you need consistent results across lots of PDFs at scale, an API that handles the edge cases for you can save real time.
Building this in public. Follow along on Twitter: @uppnrise