Extract text from PDF files using PDF.js getTextContent(). Shows extracted text per page with page numbers. Includes word count, character count, copy all, and download as .txt. Works on text-based PDFs — scanned image PDFs require OCR.
Upload a PDF that contains selectable text (not a scanned image PDF). If you can select and copy text in your PDF viewer, this tool will extract it. Scanned PDFs are images and require OCR software to extract text.
Text is extracted from each page and displayed in labelled panels. A word count and character count for the entire document are shown. Scroll to review all pages.
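As a rough illustration of how document-wide totals like these can be computed, here is a minimal sketch. The function name countText and the TextStats shape are hypothetical, and the counting rules (words split on whitespace, characters counted as UTF-16 code units) are assumptions rather than this tool's confirmed behaviour.

```typescript
// Hypothetical result shape for the document-wide totals.
interface TextStats {
  words: number;
  characters: number;
}

// Sum word and character counts over the extracted text of every page.
// Assumption: a "word" is any run of non-whitespace characters, and
// "characters" is the string length in UTF-16 code units, so some
// symbols (for example emoji) count as two.
function countText(pages: string[]): TextStats {
  let words = 0;
  let characters = 0;
  for (const page of pages) {
    const trimmed = page.trim();
    if (trimmed !== "") {
      words += trimmed.split(/\s+/).length;
    }
    characters += page.length;
  }
  return { words, characters };
}
```

Counting per page and summing keeps blank pages from contributing phantom words, and it avoids counting whatever separator is used to join pages for display.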
Click Copy Page to copy individual page text, or Copy All to copy the entire document text. Click Download TXT to save as a plain text file with page separators.
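The plain text file with page separators could be assembled along the following lines. This is only a sketch: the function name buildTxt and the "--- Page N ---" separator format are assumptions chosen for illustration, not necessarily what the Download TXT button produces.

```typescript
// Join per-page text into one string, with a labelled separator line
// before each page. The separator format is an assumption.
function buildTxt(pages: string[]): string {
  const body = pages
    .map((text, index) => `--- Page ${index + 1} ---\n${text}`)
    .join("\n\n");
  // End the file with a single trailing newline.
  return body + "\n";
}
```

In a browser, the resulting string can be wrapped in a Blob with type "text/plain" and offered through a link that has a download attribute and an object URL from URL.createObjectURL.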
If your PDF is a scanned document (a photograph or scan of a physical page), the PDF contains images rather than text — there is no machine-readable text to extract. This tool works only on PDFs with embedded text (PDFs created from Word, Excel, or other digital documents). To extract text from scanned PDFs, you need OCR (Optical Character Recognition) software such as Adobe Acrobat, Google Drive, or Tesseract.
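A simple way to tell a likely scanned PDF from a text-based one is to check whether extraction produced any text at all. The sketch below assumes the per-page text has already been extracted into strings. The function name looksScanned and the minCharsPerPage threshold are hypothetical, and the check is a heuristic only: a PDF with genuinely blank pages would also trigger it.

```typescript
// Heuristic: treat the document as scanned when no page yields at least
// minCharsPerPage non-whitespace characters of extracted text.
// An empty page list is reported as "not scanned" because there is
// nothing to judge.
function looksScanned(pageTexts: string[], minCharsPerPage: number = 1): boolean {
  if (pageTexts.length === 0) {
    return false;
  }
  return pageTexts.every(
    (text) => text.replace(/\s/g, "").length < minCharsPerPage
  );
}
```

A check like this makes it possible to show an OCR hint instead of a set of empty page panels.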
PDF text extraction captures the text content but does not fully preserve visual layout — complex multi-column layouts, tables, and text boxes may appear in a different order than they look on the page. Simple linear documents (articles, reports, ebooks) extract cleanly. For layout-preserving extraction, tools that convert PDF to Word (docx) format do a better job of maintaining structure.
PDF.js provides a getTextContent() method on each page that returns the page's text items, each carrying its string content, position, and font. This tool concatenates those text items into readable paragraphs. The text is extracted in the order it appears in the PDF's internal structure (the page's content stream), which usually, but not always, matches reading order.
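The concatenation step might look roughly like the sketch below. It is one plausible approach, not this tool's actual code. TextItemLike is a hypothetical, cut-down stand-in for the PDF.js text item that keeps only the str and hasEOL fields, and itemsToText is an illustrative name. Items are joined in the order received, so the result inherits whatever order the PDF stores them in.

```typescript
// Cut-down stand-in for a PDF.js text item (assumption: only these
// two fields are needed for simple line joining).
interface TextItemLike {
  str: string;       // the text of this run
  hasEOL?: boolean;  // true when the run ends a line
}

// Join the text runs of one page into lines, in the order given.
function itemsToText(items: TextItemLike[]): string {
  let text = "";
  for (const item of items) {
    text += item.str;
    if (item.hasEOL) {
      text += "\n";
    }
  }
  // Collapse runs of spaces and tabs, and trim each line.
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .join("\n")
    .trim();
}
```

With PDF.js itself, the input would come from the items array of the object that page.getTextContent() resolves to. Entries without a str field (marked-content markers, if those are requested) would need to be filtered out first.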
If the PDF has a user password (required to open the document), PDF.js cannot open it until that password is supplied, so you will be prompted for it. If the PDF is encrypted with an owner password only (which restricts printing and editing but allows opening), PDF.js can still open the document and extract its text. If content copying is specifically restricted, some PDFs may return empty text content.
Best results: PDFs created from Microsoft Word, Google Docs, Excel, or other office applications, where the text is fully embedded. Good results: PDFs created from presentations or web pages, where most text extracts correctly. Poor results: heavily formatted PDFs with complex layouts, where text may come out in a different order than it appears on the page. Zero results: scanned PDFs and PDFs that store text as images, which contain no machine-readable text, and encrypted PDFs that explicitly prohibit text extraction.
Google Drive: upload the scanned PDF, then right-click it and choose Open with > Google Docs; Google's OCR extracts the text. Adobe Acrobat: Edit > Text Recognition > In This File. Online OCR tools: Adobe's online tools, Smallpdf, or dedicated OCR services. Free option: Tesseract OCR (open source, command-line), which reads image files, so PDF pages must be converted to images first. OCR accuracy depends on scan quality: 300 DPI scans produce much better results than 72 DPI scans.