officeParser: One API to Read Every Document You Throw at It

If you have ever built anything that ingests user-uploaded files, you know the drill. A Word doc needs one library, a spreadsheet needs another, PDFs need a third, and the moment someone uploads an OpenOffice file your carefully assembled stack falls over. officeParser exists to delete that entire problem. It is a pure-JavaScript library that reads .docx, .pptx, .xlsx, .odt, .odp, .ods, .pdf, .rtf, .csv, Markdown, and HTML through a single unified call, and it runs on Node.js as well as directly in the browser.

The library started life years ago as a humble "give me the text" extractor, but the version 7 rewrite turned it into something considerably more capable. Today parseOffice returns a rich Abstract Syntax Tree describing the document's structure, metadata, and formatting, and a family of generators can turn that tree into Markdown, HTML, CSV, RTF, PDF, plain text, or retrieval-ready chunks for RAG pipelines. It is a strong default whenever you need to ingest mixed document types: search indexing, document conversion, and feeding LLM retrieval systems are all squarely in its wheelhouse.

What Makes It Worth a Look

One API, many formats. Microsoft Office, OpenOffice/LibreOffice, PDF, RTF, CSV, Markdown, and HTML all go through the same parseOffice entry point.
Structured output, not just a string. You get an AST with paragraphs, headings, tables, formatting, notes, and comments, so you can do more than dump raw text.
Multi-format generation. Parse once, then emit Markdown, HTML, CSV, RTF, PDF, plain text, or RAG chunks from the same tree.
Built-in RAG chunking. Document-structure, fixed-size-with-overlap, and semantic strategies ship in the box, so you do not need a separate chunking library.
Optional OCR. Tesseract.js can pull text out of images embedded in documents and scanned PDFs.
No external binaries. It is pure JavaScript with no antiword or shell-outs, which makes it painless to deploy in Docker and serverless environments, and it even runs client-side in the browser.

Getting It Installed

The core package is a single install:

npm install officeparser

Or with yarn:

yarn add officeparser

There is exactly one optional extra. If you want to generate PDF output (parsing PDFs needs nothing extra), install Puppeteer alongside it:

npm install puppeteer

officeParser requires Node.js 18 or newer. It bundles real parsing engines like pdf.js and tesseract.js, so the install is not featherweight, but that is the trade you make for broad format coverage with zero additional setup.

Reading Your First Document

The heart of the library is parseOffice. Hand it a file path and it returns the document's AST. From there, ast.to(...) converts the tree into whatever format you actually want.

import officeParser from "officeparser";

const ast = await officeParser.parseOffice("/path/to/report.docx");

// Inspect the structure
console.log(ast.type);      // "docx"
console.log(ast.metadata);  // { author, title, created, modified, ... }
console.log(ast.content);   // array of paragraphs, headings, tables, ...

// The classic "just give me the text" path
const { value: text } = await ast.to("text");
console.log(text);

A couple of things are worth calling out for anyone coming from an older release. In the version 4 and 5 era, the main function returned a plain string of extracted text directly. As of version 7 it returns an AST object instead, and you get plain text through ast.to("text"). If you have old tutorials open in another tab, that is the difference you are seeing. There is also a synchronous ast.toText() shortcut, but the async ast.to("text") form is the one to reach for.

The AST itself is straightforward to navigate. Each node in content carries a type (such as paragraph, heading, or table), the text, optional children, and a formatting object describing bold, italic, color, size, font, and alignment. Metadata like heading level, list IDs, and table row/column positions ride along too.
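
To make that concrete, here is a minimal sketch of walking the tree to build a heading outline. The AstNode interface is a hypothetical shape inferred from the description above (in particular, metadata.level as the home of the heading level is an assumption), and the sample array is hand-written data standing in for ast.content, so check the field names against the library's own typings before relying on them.

```typescript
// Hypothetical node shape inferred from the prose above; the real officeParser
// typings may name or nest these fields differently.
interface AstNode {
  type: string; // e.g. "paragraph", "heading", "table"
  text?: string;
  children?: AstNode[];
  metadata?: { level?: number }; // assumed location of the heading level
}

// Depth-first walk that collects every heading as an indented outline line.
export function outline(nodes: AstNode[], out: string[] = []): string[] {
  for (const node of nodes) {
    if (node.type === "heading") {
      const level = node.metadata?.level ?? 1;
      const indent = "  ".repeat(Math.max(0, level - 1));
      out.push(`${indent}${node.text ?? ""}`);
    }
    if (node.children) outline(node.children, out);
  }
  return out;
}

// Hand-written sample standing in for ast.content.
const sample: AstNode[] = [
  { type: "heading", text: "Report", metadata: { level: 1 } },
  { type: "paragraph", text: "Intro text." },
  { type: "heading", text: "Findings", metadata: { level: 2 } },
];

console.log(outline(sample).join("\n"));
```

The same recursive pattern works for pulling out tables or filtering by formatting: test node.type, do something with the node, then descend into children.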

Working With Buffers and Streams

You will not always have a file on disk. officeParser happily accepts a Buffer, which is exactly what you want when handling an upload or pulling a blob from object storage.

import { readFileSync } from "node:fs";
import officeParser from "officeparser";

const buffer = readFileSync("/path/to/file.pdf");
const ast = await officeParser.parseOffice(buffer);

const { value: text } = await ast.to("text");
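
The examples here only show Buffer input. If your data arrives as a Node.js Readable instead (an upload stream, say), one option is to drain it into a Buffer first using Node's built-in stream consumers. This is a sketch under that assumption: streamToBuffer is a hypothetical helper name, and nothing here implies officeParser consumes streams directly.

```typescript
import { Readable } from "node:stream";
import { buffer } from "node:stream/consumers";

// Drain a Node.js Readable into a single Buffer using the built-in consumer.
// Note that this holds the whole document in memory at once.
export async function streamToBuffer(stream: Readable): Promise<Buffer> {
  return buffer(stream);
}

// Usage sketch (uploadStream is a hypothetical Readable from your framework):
//   const ast = await officeParser.parseOffice(await streamToBuffer(uploadStream));
```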

One important gotcha: binary formats like docx and PDF have magic bytes that let the library auto-detect the type, but text-based formats (Markdown, HTML, CSV) do not. When you pass a buffer of one of those, you have to tell officeParser what it is using the fileType config option:

const csvAst = await officeParser.parseOffice(csvBuffer, {
  fileType: "csv",
  csvDelimiter: ";",
});
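
When uploads come with a filename, a small lookup can supply that hint for you. The sketch below is an assumption-laden illustration: fileTypeHint is a hypothetical helper, and only the "csv" value appears in this article. The "md" and "html" values are guesses modeled on the generator target names, so verify them against the library's documentation.

```typescript
// Extension-to-fileType hints for text formats. Only "csv" is confirmed by
// this article; the "md" and "html" values are assumptions.
const TEXT_FILE_TYPES = new Map<string, string>([
  ["csv", "csv"],
  ["md", "md"],
  ["markdown", "md"],
  ["html", "html"],
  ["htm", "html"],
]);

// Returns a fileType hint for text formats, or undefined so that binary
// formats fall through to the library's own auto-detection.
export function fileTypeHint(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  return TEXT_FILE_TYPES.get(filename.slice(dot + 1).toLowerCase());
}

console.log(fileTypeHint("export.CSV")); // csv
console.log(fileTypeHint("report.docx")); // undefined
```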

Converting Documents to Markdown or HTML

Because the AST is format-agnostic, turning a Word document into clean Markdown is a one-liner. This is the move when you want to display uploaded documents on a web page or hand them to a downstream system that prefers structured text.

import officeParser from "officeparser";

const ast = await officeParser.parseOffice("proposal.docx");

const { value: markdown } = await ast.to("md");
const { value: html } = await ast.to("html", { includeFormatting: true });

If you only care about the conversion and never need the intermediate tree, the OfficeConverter helper collapses parse-and-generate into a single step:

import { OfficeConverter } from "officeparser";

const { value: html } = await OfficeConverter.convert("data.xlsx", "html");

And when you want to parse once and fan out into several formats, OfficeGenerator takes an existing AST and emits whatever target you ask for, which keeps you from re-parsing the same file three times.

import officeParser, { OfficeGenerator } from "officeparser";

const ast = await officeParser.parseOffice("report.docx");
const { value: md } = await OfficeGenerator.generate(ast, "md");
const { value: csv } = await OfficeGenerator.generate(ast, "csv");

Building a RAG Ingestion Step

This is where the version 7 rewrite really earns its keep. Retrieval-augmented generation pipelines live or die on how documents get split into chunks, and officeParser ships three strategies so you do not have to bolt on a separate chunking dependency.

The default document-structure strategy splits along the document's own headings, which tends to keep semantically related content together:

import { OfficeConverter } from "officeparser";

const { value: chunks } = await OfficeConverter.convert("report.docx", "chunks", {
  generatorConfig: {
    chunksConfig: {
      strategy: "document-structure",
      splitBy: "heading",
      maxChunkSize: 1500,
    },
  },
});

Prefer uniform chunk sizes with a sliding overlap window? Switch to the fixed-size strategy:

// Pass this object as generatorConfig, as in the previous example
const fixed = {
  chunksConfig: {
    strategy: "fixed-size",
    chunkSize: 1000,
    chunkOverlap: 200,
  },
};
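
If the interplay between chunk size and overlap is unfamiliar, this standalone sketch shows the sliding-window idea. It is an illustration only, not officeParser's implementation, and it assumes sizes are measured in characters; the library may count differently.

```typescript
// Illustration only: NOT officeParser's implementation. Sizes are in characters.
export function fixedSizeChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // the window reached the end
  }
  return chunks;
}

const demo = fixedSizeChunks("x".repeat(2500), 1000, 200);
console.log(demo.map((c) => c.length)); // [ 1000, 1000, 900 ]
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.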

And if you want chunk boundaries decided by meaning rather than length, the semantic strategy uses your own embedding function to group similar passages:

// Also a generatorConfig value; getEmbedding stands in for your own embedding call
const semantic = {
  chunksConfig: {
    strategy: "semantic",
    embeddingFunction: async (text: string) => getEmbedding(text),
    similarityThreshold: 0.8,
  },
};
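
To give a feel for what a similarity threshold does, here is a small sketch of one common approach: compare each passage's embedding with the previous one using cosine similarity and start a new group when the score drops below the threshold. This illustrates the general technique only; the article does not describe officeParser's actual algorithm, which may differ, and both function names are hypothetical.

```typescript
// Cosine similarity of two equal-length vectors; 0 if either has zero length.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Groups passage indices: a passage joins the current group when it is at
// least `threshold` similar to the passage before it, else it starts a new one.
export function groupBySimilarity(embeddings: number[][], threshold: number): number[][] {
  const groups: number[][] = [];
  for (let i = 0; i < embeddings.length; i++) {
    const similar =
      i > 0 && cosineSimilarity(embeddings[i - 1], embeddings[i]) >= threshold;
    if (similar) {
      groups[groups.length - 1].push(i);
    } else {
      groups.push([i]);
    }
  }
  return groups;
}

// Toy two-dimensional "embeddings": the first two point the same way.
const toy = [[1, 0], [0.9, 0.1], [0, 1]];
console.log(groupBySimilarity(toy, 0.8)); // [ [ 0, 1 ], [ 2 ] ]
```

Raising the threshold produces more, smaller groups; lowering it merges more neighbors together.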

The output is JSON chunks ready to drop into a vector store. Combined with the broad format support, this means a single library handles "user uploads a mixed bag of documents, turn them into retrieval-ready chunks" end to end.

OCR, Cancellation, and Other Real-World Touches

Scanned PDFs and documents with text trapped inside images are a perennial headache. officeParser can run Tesseract OCR on embedded images by flipping a single flag, with a config block for language and timeouts:

const ast = await officeParser.parseOffice("scanned.pdf", {
  ocr: true,
  ocrConfig: {
    language: "eng",
    timeout: {
      workerLoad: 30000,
      recognition: 15000,
      autoTerminate: 10000,
    },
  },
});

OCR workers auto-shut down after about ten seconds of idle time, but for short-lived scripts you can call await officeParser.terminateOcr() to let the process exit promptly.

Long-running parses (especially with OCR) are also cancellable through a standard AbortController, which is exactly what you want behind an HTTP request that might time out:

const controller = new AbortController();
setTimeout(() => controller.abort(), 5000);

try {
  const ast = await officeParser.parseOffice("big.pdf", {
    abortSignal: controller.signal,
    ocr: true,
  });
} catch (err) {
  if ((err as Error).name === "AbortError") {
    console.log("Parsing was cancelled");
  }
}
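
One detail the snippet above leaves out is clearing the timer once parsing finishes, since a pending timeout can keep a short-lived Node process alive. A small wrapper handles that. This is a sketch: withDeadline is a hypothetical helper, and the usage comment assumes the abortSignal option shown above.

```typescript
// Runs `task` with an AbortSignal that fires after `ms` milliseconds, and
// always clears the timer once the task settles.
export async function withDeadline<T>(
  ms: number,
  task: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // do not keep the event loop alive after settling
  }
}

// Usage sketch, assuming the abortSignal option shown above:
//   const ast = await withDeadline(5000, (signal) =>
//     officeParser.parseOffice("big.pdf", { abortSignal: signal, ocr: true }));
```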

A handful of other config flags let you tune what gets extracted: ignoreNotes, ignoreComments, ignoreHeadersAndFooters, and ignoreSlideMasters prune the noise, while extractAttachments populates images and charts as Base64 in the AST. There is also an onWarning callback for surfacing non-fatal issues without crashing your pipeline.
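
As a rough sketch of how those flags might combine, here is a "body text only" profile for search indexing. ExtractionConfig is a hypothetical local type covering only the options named above; the assumption that they sit at the top level of the config object (next to ocr) and the shape of the onWarning argument both need checking against the library's own types.

```typescript
// Hypothetical local type covering only the flags named in this article;
// the library's own config type is the source of truth.
interface ExtractionConfig {
  ignoreNotes?: boolean;
  ignoreComments?: boolean;
  ignoreHeadersAndFooters?: boolean;
  ignoreSlideMasters?: boolean;
  extractAttachments?: boolean;
  onWarning?: (warning: unknown) => void; // argument shape is an assumption
}

// Collect warnings instead of throwing, so one odd file does not stop a batch.
export const warnings: unknown[] = [];

// "Body text only" profile: drop notes, comments, and repeated page furniture.
export const bodyTextOnly: ExtractionConfig = {
  ignoreNotes: true,
  ignoreComments: true,
  ignoreHeadersAndFooters: true,
  ignoreSlideMasters: true,
  extractAttachments: false,
  onWarning: (warning) => {
    warnings.push(warning);
  },
};

// Usage sketch:
//   const ast = await officeParser.parseOffice("deck.pptx", bodyTextOnly);
```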

Running It in the Browser

Because no system binaries are involved, officeParser can run entirely client-side, and it ships ESM and IIFE bundles for exactly that. That lets you parse a user's upload without a server round-trip, which is genuinely handy for privacy-sensitive apps where documents should never leave the device.

import { OfficeParser } from "officeparser";

// input is an <input type="file"> element already on the page
const bytes = new Uint8Array(await input.files[0].arrayBuffer());
const ast = await OfficeParser.parseOffice(bytes);
const { value: text } = await ast.to("text");

PDF parsing in the browser uses a Web Worker, which by default pulls a pdf.js build from a CDN. If you would rather self-host it, override the pdfWorkerSrc config option to point at your own copy.

There is also a CLI for quick one-offs, so you can convert a file straight from your terminal. For example, npx officeparser report.docx --to=md --output=report.md converts the Word document to Markdown and writes the result to report.md.

Where It Fits

officeParser's edge is breadth. It is a strong default when you need to ingest mixed document types behind one consistent API, with no external binaries, optional browser support, and RAG chunking built in. For single-format, fidelity-critical work, a specialist may still win: mammoth produces higher-fidelity Word-to-HTML conversion (though it only handles docx), and SheetJS offers far richer cell-level spreadsheet access. But for the common "I have a folder of who-knows-what and I need clean text or chunks out of it" problem, a single, actively maintained, pure-JS library that handles the whole zoo is hard to beat. With around two million downloads a month and frequent releases, it is a dependency you can lean on.