How to Extract Text from a PDF for Free (No Upload, No Account)

By Rui Barreira · Last updated: 13 June 2026

Extracting text from a PDF should take seconds, not require a subscription. brevio PDF to Word / Text uses PDF.js — Mozilla's open-source PDF rendering engine — to read a PDF file entirely in your browser and export the text as a .txt file. Nothing is uploaded to a server. The entire extraction happens locally, in JavaScript, in your tab.

Text PDFs vs Image PDFs

Not all PDFs are created equal. A PDF can contain two fundamentally different types of content: actual text data embedded in the file, or images of text (scanned documents).

Text PDFs

Most PDFs created by exporting from Word, Google Docs, LibreOffice, LaTeX, or a web browser contain real text. The text is encoded in the PDF byte stream and can be selected, copied, and extracted programmatically. PDF.js reads these directly — extraction takes under one second for a 50-page document.

Image PDFs (Scanned Documents)

When a physical document is scanned, the scanner typically produces a JPEG or TIFF image of each page. If that image is embedded in a PDF without running OCR (Optical Character Recognition), the PDF contains only pixels — there is no text layer to extract. PDF.js will return empty strings for these pages. brevio does not perform OCR; the output will be blank for scanned documents.

How to Tell Which Type You Have

Open your PDF in a browser or PDF viewer and try to select text with your mouse. If you can highlight individual words and copy them, the PDF has a text layer. If the entire page selects as a single image, or if you cannot select any text, it is an image-only PDF — you need OCR software, not a text extractor.
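The same check can be approximated in code once the text has been extracted page by page. The sketch below is illustrative only: `findPagesWithoutText` and `describePdf` are hypothetical helper names, and it assumes you already have one extracted string per page.

```javascript
// Hypothetical helper: list the 1-based numbers of pages whose
// extracted text is empty or contains only whitespace.
function findPagesWithoutText(pageTexts) {
  const blankPages = [];
  pageTexts.forEach((text, index) => {
    if (text.trim().length === 0) {
      blankPages.push(index + 1);
    }
  });
  return blankPages;
}

// Hypothetical helper: summarise the document as "text", "image-only",
// "mixed" (some pages have no text layer), or "empty" (no pages at all).
function describePdf(pageTexts) {
  if (pageTexts.length === 0) return "empty";
  const blank = findPagesWithoutText(pageTexts);
  if (blank.length === 0) return "text";
  if (blank.length === pageTexts.length) return "image-only";
  return "mixed";
}
```

A result of "image-only" or "mixed" suggests that OCR is needed for at least part of the document.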

When You Need OCR Instead

PDF Type	brevio Text Extractor	OCR Required
Exported from Word / Google Docs	Works perfectly	No
Generated by LaTeX / InDesign	Works perfectly	No
Web page saved as PDF	Works perfectly	No
Scanned physical document	Returns empty text	Yes
Photo of a document (JPEG)	Returns empty text	Yes
PDF with embedded images only	Returns empty text	Yes
Password-protected / encrypted PDF	May fail to open	N/A — decrypt first

For scanned PDFs, free OCR options include Adobe Acrobat's free online OCR, Google Drive (upload PDF → open with Google Docs → OCR auto-runs), and Tesseract (open-source, command-line).

How PDF.js Extracts Text

PDF.js parses the raw PDF byte stream and reconstructs page content. Each page contains a content stream with drawing operators — text operators like Tj (show text) and TJ (show text with kerning adjustments). PDF.js reads these operators, resolves font encodings, and returns the character strings. The brevio tool joins these character strings with spaces between items, which produces readable sentences for most well-structured PDFs.
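The join-with-spaces step can be sketched as follows. This is a minimal illustration, not brevio's actual source. It assumes an object shaped like a PDF.js document (a `numPages` count, `getPage(n)` with 1-based page numbers, and pages exposing `getTextContent()` that resolves to `{ items }` where each item has a `str` field). The names `joinItems`, `extractText`, and `PAGE_BREAK`, and the blank lines placed around the separator, are assumptions made for the example.

```javascript
// Separator between pages; the surrounding newlines are an assumption.
const PAGE_BREAK = "\n\n--- Page Break ---\n\n";

// Join the `str` field of each text item with a single space.
function joinItems(items) {
  return items.map((item) => item.str).join(" ");
}

// Walk the pages in order and collect the joined text of each one.
// `pdf` only needs to look like a PDF.js document proxy.
async function extractText(pdf) {
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(joinItems(content.items));
  }
  return pages.join(PAGE_BREAK);
}
```

Because the function only relies on the shape of the object, it can be exercised with a small stub before wiring it to a real PDF.js document.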

For complex layouts (multi-column documents, PDFs with overlapping text boxes, or PDFs with rotated text) the extracted order may not match reading order. Column-heavy documents may interleave text from both columns. This is a fundamental limitation of text-stream extraction: the content stream records where each fragment is drawn, not the order in which it should be read. Tagged PDFs can carry a logical structure, but many PDFs are untagged and plain text extraction does not rely on it.
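One common workaround for out-of-order fragments on a single-column page is to sort the items by position before joining them. The sketch below assumes items shaped like PDF.js text items, where `transform[4]` is the x position and `transform[5]` is the y position in PDF coordinates (y grows upward). The function name `sortIntoLines` and the `yTolerance` default are hypothetical, and the approach does not separate columns, so multi-column pages will still interleave.

```javascript
// Group items into lines by similar y position, then order each line left to right.
function sortIntoLines(items, yTolerance = 2) {
  // Top of the page first (largest y), then left to right.
  const sorted = [...items].sort(
    (a, b) => b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4]
  );

  const lines = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    // Items within `yTolerance` units of the line's first item share that line.
    if (last && Math.abs(last.y - item.transform[5]) <= yTolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.transform[5], items: [item] });
    }
  }

  return lines
    .map((line) =>
      line.items
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((item) => item.str)
        .join(" ")
    )
    .join("\n");
}
```

The tolerance absorbs small baseline differences between fragments on the same line; a value that is too large will merge adjacent lines.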

Editing Extracted Text Before Downloading

After extraction, the text appears in an editable textarea. You can clean up the text directly in the browser — remove headers, fix line breaks, delete unwanted sections — before clicking "Download as .txt". The page break separator "--- Page Break ---" can be deleted if you want a continuous text file.

For more advanced post-processing, download the .txt file and open it in a text editor. VS Code, Notepad++, and BBEdit all support regex find-and-replace, which is useful for cleaning up repeated headers, page numbers, or footer text that appears on every page.
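The same kind of cleanup can be scripted. The sketch below is an illustration with hypothetical function names. It assumes that page numbers and running headers sit on their own lines in the downloaded file, which will not hold if a whole page was joined into one line, and the page-number pattern also removes any legitimate line that consists only of a number.

```javascript
// Matches lines such as "12", "Page 12", or "Page 12 of 300".
const PAGE_NUMBER_LINE = /^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$/i;

// Drop every line that contains nothing but a page number.
function stripPageNumbers(text) {
  return text
    .split("\n")
    .filter((line) => !PAGE_NUMBER_LINE.test(line))
    .join("\n");
}

// Drop every line that exactly matches a known running header or footer.
function stripRepeatedLine(text, header) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== header.trim())
    .join("\n");
}

// Replace each "--- Page Break ---" separator (and the newlines around it)
// with a single blank line to get a continuous text file.
function removePageBreaks(text) {
  return text.replace(/\n*--- Page Break ---\n*/g, "\n\n");
}
```

For example, `stripRepeatedLine(text, "ACME Annual Report")` would remove a header with that wording from every page, where the header text is whatever repeats in your own document.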

Common Encoding Issues

Some PDFs use custom font encodings or symbol fonts where character codes don't map to standard Unicode. In these cases, PDF.js may return garbled characters (e.g., ligatures like "fi" appearing as "!"), or substituted characters from the wrong character set. This is most common with older PDFs generated by legacy publishing software, PDFs using embedded symbol fonts for mathematical notation, and PDFs that use East Asian fonts without embedding them.
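A rough way to flag this kind of output automatically is to count characters that rarely appear in ordinary text. The sketch below is a heuristic under stated assumptions, not a feature of PDF.js or brevio: it treats the Unicode replacement character (U+FFFD) and Private Use Area code points (U+E000 to U+F8FF) as suspicious, and the function names and the 5% threshold are hypothetical. It cannot detect substitutions that produce valid characters, such as "fi" turning into "!".

```javascript
// Fraction of characters that are U+FFFD or fall in the Private Use Area.
function suspiciousCharRatio(text) {
  if (text.length === 0) return 0;
  const matches = text.match(/[\uFFFD\uE000-\uF8FF]/g);
  return (matches ? matches.length : 0) / text.length;
}

// Flag the text when the suspicious fraction exceeds the threshold.
function looksGarbled(text, threshold = 0.05) {
  return suspiciousCharRatio(text) > threshold;
}
```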

If you see garbled output, try opening the PDF in Adobe Reader or Foxit Reader and using their built-in "Save as Text" or "Export to Word" feature — these tools have more complete font mapping tables than PDF.js.

DevTools Privacy Verification

Open DevTools (F12) and navigate to the Network tab.
Select a PDF file in the text extractor.
Watch the Network tab as extraction runs.
You will see zero POST requests containing PDF data. PDF.js runs entirely in the browser, with parsing handled by a web worker. The worker script is loaded once from the CDN (pdf.worker.min.js), but your PDF file data never crosses the network.

The CDN request for the PDF.js worker script is one-time and loads only the JavaScript engine — not your document. After the first load, browsers cache the worker script.
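If you prefer to check a recording rather than watch the tab live, the Network tab in most browsers can export the session as a HAR file, which is plain JSON. The sketch below is a hypothetical helper that lists any request in a parsed HAR object that carried a body or used an upload-style method. It relies only on the standard HAR fields `log.entries[].request.method`, `url`, `bodySize`, and `postData`.

```javascript
// Return the requests that could have carried document data:
// anything with a request body, or any POST, PUT, or PATCH request.
function findRequestsWithBody(har) {
  const uploadMethods = ["POST", "PUT", "PATCH"];
  return har.log.entries
    .filter((entry) => {
      const req = entry.request;
      const hasBody = req.bodySize > 0 || Boolean(req.postData);
      return hasBody || uploadMethods.includes(req.method);
    })
    .map((entry) => ({ method: entry.request.method, url: entry.request.url }));
}
```

An empty result for a session in which you extracted a PDF is consistent with the file never leaving the browser.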

PDF Text Extractor Comparison

Tool	Cost	Upload Required	OCR Support	Output Format
brevio PDF to Text	Free	No	No	.txt
Adobe Acrobat Online	Free (2/day) / $14.99/mo	Yes	Yes	.docx, .txt
Smallpdf	Free (2/day) / $9/mo	Yes	No	.docx
ILovePDF	Free / $7/mo	Yes	No	.docx
Google Docs	Free	Yes (Google Drive)	Yes (auto)	Google Doc / .docx
Tesseract (CLI)	Free	No	Yes	.txt

FAQ

Why does the extracted text look scrambled or have missing spaces?

PDFs don't store text as continuous lines — they store text fragments positioned at X/Y coordinates on the page. PDF.js reconstructs reading order by joining text items with spaces, but this heuristic fails for complex layouts (tables, multiple columns, rotated text, small caps). For documents where layout matters, use Adobe Acrobat's PDF-to-Word converter, which has more sophisticated reading-order detection.

Can I extract text from a password-protected PDF?

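If the PDF requires a password to open, PDF.js cannot read it without that password and extraction fails. You need to decrypt the file first. If you know the password, Adobe Acrobat can remove it (the paid editor; the free Reader cannot change security settings): File → Properties → Security → change to "No Security". Alternatively, open the PDF with its password and print it to a new PDF file (File → Print → Microsoft Print to PDF / macOS PDF), which creates an unencrypted copy as long as the document permits printing.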
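A tool can tell this failure apart from other errors by inspecting the rejection it gets when opening the file. The sketch below is illustrative and is not brevio's code. It assumes the error carries `name === "PasswordException"`, as PDF.js errors of this kind do, and that `code` is 1 when a password is needed and 2 when the supplied password is wrong. Treat those code values and the function name `explainOpenError` as assumptions to verify against the PDF.js version you use.

```javascript
// Turn an error from opening a PDF into a message for the reader.
function explainOpenError(err) {
  if (err && err.name === "PasswordException") {
    // Assumed codes: 1 = password needed, 2 = incorrect password.
    return err.code === 2
      ? "The password is incorrect."
      : "This PDF is password-protected. Decrypt it first.";
  }
  const detail = err && err.message ? err.message : "unknown error";
  return "The PDF could not be opened: " + detail;
}
```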

My PDF has 300 pages — will it still work?

Yes. PDF.js processes pages sequentially in the browser. A 300-page text-based PDF typically extracts in 5–15 seconds, depending on font complexity and computer speed. The tool shows a loading indicator during extraction. There is no file size limit imposed by the tool — the browser's available memory is the practical limit, which is rarely hit for text-only PDFs.

I need a .docx file, not a .txt — what should I use?

The brevio tool outputs plain text only. For a Word .docx file with formatting preserved, use Adobe Acrobat Online (free for 2 files per day) or ILovePDF, which convert PDF layout to Word formatting. Note that perfect formatting preservation is impossible — PDFs store appearance, not document structure, so headings, bold text, and tables will require manual cleanup in any converter.
