Convert PDF to Markdown — Clean Output for LLMs & RAG
2026-04-21
8 min read
Why Convert PDFs to Markdown?
If you've ever pasted a PDF into ChatGPT, Claude, or your local Llama setup and gotten garbage back, you've already learned the lesson the hard way: PDFs are a terrible input format for language models.
PDFs store text as positioned glyphs — `(x: 142, y: 718) "T"`, `(x: 148, y: 718) "h"`, `(x: 152, y: 718) "e"` — not as flowing paragraphs. Naive PDF-to-text extraction gives you:
For RAG (retrieval-augmented generation), this destroys retrieval quality because chunk boundaries fall in the wrong places and embeddings represent positional noise instead of semantic structure.
Markdown solves all of this.
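To make the positioned-glyph problem concrete, here is a minimal sketch of how reading order goes wrong on a two-column page. The `Word` type, the coordinates, and the `gutter_x` threshold are all hypothetical values chosen for illustration; real extractors work with far messier data.

```python
from typing import NamedTuple

class Word(NamedTuple):
    x: float   # left edge on the page
    y: float   # vertical position, measured from the top of the page
    text: str

# Hypothetical two-column page: left column at x=50, right column at x=320.
words = [
    Word(50, 100, "Left-1"), Word(320, 100, "Right-1"),
    Word(50, 120, "Left-2"), Word(320, 120, "Right-2"),
]

def naive_order(words):
    """Read top-to-bottom, left-to-right across the full page width."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]

def column_aware_order(words, gutter_x):
    """Read the whole left column first, then the whole right column."""
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: w.y)
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: w.y)
    return [w.text for w in left + right]

print(naive_order(words))
print(column_aware_order(words, gutter_x=300))
```

The naive ordering interleaves the two columns line by line, which is the kind of scrambled text that ends up inside a chunk. The column-aware ordering keeps each column's sentences together.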
What Good PDF→Markdown Looks Like
A typical research paper input/output:
Input PDF (raw text extraction):
```
Abstract Introduction Recent advances in transformer 1 architectures have shown that self attention is a Methods more efficient mechanism than recurrence...
```
Output Markdown:
```markdown
## Abstract

Recent advances in transformer architectures have shown that self-attention is a more efficient mechanism than recurrence...

## 1. Introduction

[paragraph here]

## 2. Methods

[paragraph here]
```
The Markdown version is chunkable, embeddable, and directly usable in any LLM pipeline.
Use Cases
1. Feeding Internal Docs to ChatGPT/Claude
Paste Markdown, not PDF, into the chat. Token cost drops 30–60% because raw PDF text extraction includes a lot of garbage, and answer quality goes up noticeably.
2. RAG Pipeline Ingestion
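LangChain, LlamaIndex, Haystack, and Chroma all ingest Markdown natively:
```python
# In older LangChain releases: from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Load the converted Markdown as a string
with open("paper.md", encoding="utf-8") as f:
    markdown_content = f.read()

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("##", "section"),
    ("###", "subsection"),
])
chunks = splitter.split_text(markdown_content)
```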
Chunks split on heading boundaries — exactly what you want for retrieval.
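To show what splitting on heading boundaries means without any library, here is a minimal sketch in plain Python. The `split_on_headings` function and the sample text are illustrative assumptions, not part of LangChain, and the regex deliberately ignores edge cases such as `##` lines inside fenced code blocks.

```python
import re

# Matches "## Title" and "### Title" only (not "#" or "####").
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")

def split_on_headings(markdown: str):
    """Return a list of (heading, body) pairs, one per ## or ### section."""
    chunks = []
    heading, body = None, []

    def flush():
        # Emit the current section if it has a heading or any non-blank text.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks

# Made-up sample document for illustration.
sample = """## 1. Introduction
Self-attention replaces recurrence.

## 2. Methods
We compare two models.
### 2.1 Data
A small test corpus.
"""

for heading, body in split_on_headings(sample):
    print(heading, "->", body)
```

Each chunk carries its heading along with its body, so the retrieved text stays tied to the section it came from.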
3. Obsidian / Notion / Logseq Notes
Drop a PDF you want to take notes on, convert to Markdown, paste into your note-taking app. All headings become an outline, all links become wiki-links.
4. Static Site Publishing
Hugo, Jekyll, MkDocs, Astro, and Next.js all consume Markdown. Your PDF technical docs become a searchable website in 10 minutes.
5. Translation / LLM Editing
Round-trip a PDF through Markdown → LLM rewrite → Markdown → back to PDF (using Markdown to PDF) for a clean, edited output.
Why "No Upload" Matters Here Especially
PDFs you're feeding to LLMs are often the most sensitive things you own:
Uploading them to a SaaS conversion service to "make them LLM-friendly" — and then uploading the output to ChatGPT/Claude/Gemini — doubles your exposure surface. ExactPDF's PDF to Markdown runs entirely in your browser, so the conversion step adds zero risk.
What's Preserved (and What Isn't)
| Element | Preserved | Notes |
|---|---|---|
| Headings | ✅ | Inferred from font-size hierarchy |
| Paragraphs | ✅ | Reflowed correctly across columns |
| Bullet lists | ✅ | Detected from bullet glyphs (•, –, ●) |
| Numbered lists | ✅ | Detected from "1.", "1)", "(1)" patterns |
| Links | ✅ | Embedded URLs preserved as Markdown links |
| Tables | ✅ | Output as GitHub-flavored Markdown tables |
| Code blocks | ✅ | Detected by monospace font + indentation |
| Inline code | ⚠️ | Sometimes misclassified (font heuristic) |
| Bold / italic | ✅ | Detected from font weight/style |
| Math equations | ⚠️ | Output as raw text; won't render unless you wrap it in `$$` |
| Images | ⚠️ | Extracted as separate files (Markdown references them) |
| Footnotes | ✅ | Output as superscript + bottom-of-section reference |
| Headers / footers | ✅ removed | Page numbers and running titles dropped |
| Multi-column layout | ✅ | Re-flowed into single-column reading order |
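The "inferred from font-size hierarchy" row can be illustrated with a minimal sketch. This is an assumed heuristic, not ExactPDF's actual implementation: the most common font size is treated as body text, and each distinct larger size becomes a heading level. The `infer_headings` function, the span format, and the sample title are hypothetical.

```python
from collections import Counter

def infer_headings(spans):
    """Map (font_size, text) spans to Markdown lines.

    The most common font size is treated as body text; each distinct larger
    size becomes a heading level, largest first, starting at "#".
    """
    body_size = Counter(size for size, _ in spans).most_common(1)[0][0]
    heading_sizes = sorted({s for s, _ in spans if s > body_size}, reverse=True)
    level = {size: i + 1 for i, size in enumerate(heading_sizes)}
    lines = []
    for size, text in spans:
        if size in level:
            lines.append("#" * level[size] + " " + text)
        else:
            lines.append(text)
    return lines

# Hypothetical spans as an extractor might report them: (font size in pt, text).
spans = [
    (18.0, "Sample Paper Title"),
    (14.0, "1. Introduction"),
    (10.0, "Recent advances in transformer architectures..."),
    (10.0, "Self-attention is a more efficient mechanism than recurrence."),
    (14.0, "2. Methods"),
    (10.0, "[paragraph here]"),
]
print("\n".join(infer_headings(spans)))
```

Counting spans is the simplest choice here. Weighting by character count would likely be more robust on documents with many short headings.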
Step-by-Step: PDF to Markdown for RAG
Step 1: Convert
Open PDF to Markdown, drop your PDF in, click Convert. Output appears in seconds, with no upload.
Step 2: Inspect
Scroll the output. Check:
Step 3: Save and chunk
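Download the .md, then in your RAG pipeline:
```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter

# TextLoader keeps the raw Markdown, including the "#" markers the splitter needs;
# UnstructuredMarkdownLoader returns plain text with those markers removed.
loader = TextLoader("paper.md", encoding="utf-8")
docs = loader.load()

# Same heading configuration as in the RAG ingestion use case
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("##", "section"),
    ("###", "subsection"),
])
chunks = splitter.split_text(docs[0].page_content)
```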
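Once the chunks exist, Steps 4 and 5 follow the same pattern whatever embedding model and vector store you pick. The sketch below is a toy, dependency-free illustration of that pattern: a bag-of-words count vector stands in for a real embedding model, and a plain list stands in for the vector store. The function names and sample chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, index, k=2):
    """Return the k chunk texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up chunks standing in for the output of the heading splitter.
sample_chunks = [
    "Introduction: self-attention replaces recurrence in sequence models.",
    "Methods: we train on ten thousand documents with a fixed learning rate.",
    "Results: retrieval accuracy improves when chunks follow headings.",
]

# "Index" step: store each chunk next to its vector.
index = [(text, embed(text)) for text in sample_chunks]

# "Retrieve" step: rank chunks against the query and keep the best one.
print(top_k("how many documents were used for training", index, k=1))
```

A real pipeline swaps `embed` for a model call and the list for a vector database, but the ranking logic keeps the same shape.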
Step 4: Embed and index
Pass chunks to your embedding model (OpenAI text-embedding-3, Cohere embed-v3, BGE, Jina, whatever) and into your vector store.
Step 5: Retrieve
At query time, retrieve top-k chunks, pass to your LLM as context. Quality is dramatically higher than chunking raw PDF text.
Scanned PDFs
If your source is a scanned image PDF (no text layer), the Markdown converter has nothing to work with. Use a two-step pipeline: run OCR PDF first to add a text layer, then run PDF to Markdown on the result.
Both run client-side, so your scanned document never leaves your laptop — important for medical records, legal exhibits, and historical archives.
Comparison
| Tool | Server upload? | Quality | Cost |
|---|---|---|---|
| ExactPDF PDF to Markdown | No | High (heading/table aware) | Free |
| Online "PDF to Markdown" sites | Yes | Variable | Often paid |
| Pandoc local | No | Medium (lossy on tables) | Free, install needed |
| LlamaParse / Unstructured.io | Yes (API) | High | Pay per page |
| Adobe API | Yes (API) | High | Enterprise pricing |
| Manual conversion | No | Highest | Slow + expensive |
FAQ
Why convert PDFs to Markdown?
Markdown is the cleanest input format for LLMs, RAG pipelines, note-taking apps, and static-site generators. PDF's positioned-glyph format adds noise that hurts retrieval quality.
Does it preserve tables and code blocks?
Yes. Tables become GFM tables, code blocks are detected by font and indentation, and headings are inferred from font-size hierarchy.
Is it good enough for RAG ingestion?
Yes. The output is structured Markdown ready to chunk and embed. Chunk on heading boundaries (split on ##/###), keep tables as units.
Does the PDF get uploaded?
No. Conversion happens locally in your browser.
Can I convert scanned PDFs to Markdown?
Run OCR PDF first to add a text layer, then run PDF to Markdown.
Try PDF to Markdown
Clean Markdown output for LLM ingestion and RAG pipelines. No upload.