Prototype Development of a PDF to Text Extraction and Image Conversion CLI Tool with Node.js and WASM Libraries
In recent years, technologies like RAG (Retrieval-Augmented Generation) that leverage document content with AI have been gaining attention. As preprocessing for this, there’s growing demand for extracting text and images from PDF files for vectorization.
Today, as a first step in this preprocessing, I’ve developed a prototype PDF processing CLI tool that runs on Node.js.
This article introduces the development background, particularly the reasons for adopting WebAssembly (Wasm) in our technology selection, and an overview of the implemented tool.
The ultimate goal is to build a web application that processes user-uploaded PDFs, vectorizes their content (text and images), and stores them in a database. For the development stack, we’re planning on Vercel and Next.js.
The challenge here is the dependencies of PDF processing libraries. Many of them rely on native binaries such as `poppler` or `ImageMagick`. These binaries can't easily be used in serverless environments like Vercel, often forcing additional build configuration and environment-specific constraints.
That's why we focused on platform-independent libraries built with WebAssembly. WASM code runs directly in the Node.js runtime without special environment setup, making these libraries a great fit for serverless environments.
For this prototype, we adopted the following WASM-based packages:

- `@hyzyla/pdfium`: A WASM build of PDFium, the PDF rendering engine maintained by Google. It provides powerful features including PDF loading, text extraction, and page-by-page rendering.
- `@jsquash/png`: A WASM version of the high-speed image codec library libspng. We use it to encode bitmap data rendered by PDFium into PNG format.

The convenience of these libraries is also appealing: a plain `npm install` downloads the necessary WASM assets, and they can be used without additional build steps.
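As a rough sketch of how the two libraries fit together, the function below extracts one page's text and encodes its rendered bitmap as PNG. The API names (`PDFiumLibrary.init`, `page.getText()`, `page.render()`, `encode()`) follow my reading of each package's README — verify them there before relying on this — and `pageToAssets` itself is a hypothetical helper, not code from the tool:

```javascript
// Assumed imports, per each package's README (verify before use):
//   import { PDFiumLibrary } from '@hyzyla/pdfium';
//   import { encode as encodePng } from '@jsquash/png';

// Hypothetical helper: extract one page's text and encode its rendered
// bitmap as PNG. The encoder is passed in as a parameter so the function
// stays testable without the WASM packages installed.
async function pageToAssets(page, encodePng, scale = 2) {
  const text = page.getText(); // plain text content of the page
  const bitmap = await page.render({ scale, render: 'bitmap' }); // RGBA pixels
  const png = await encodePng({
    data: new Uint8ClampedArray(bitmap.data),
    width: bitmap.width,
    height: bitmap.height,
  });
  return { text, png };
}
```

Passing the encoder in (rather than importing it inside the helper) keeps the PDF and PNG layers decoupled, which also mirrors how the modules could be split later.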
The developed CLI tool `pdf-tool` can perform text extraction and image conversion on specified PDF files.

```shell
# Basic usage
node ./bin/pdf-tool.mjs <pdf-path> [options]
```
Main features include:

- Text extraction with configurable output destination and format (`--text-out`, `--text-format`)
- Joining all pages into a single text (`--text-join`)
- Page image output as PNG (`--png-dir`)
- Render resolution and size control (`--scale`, `--width`, `--height`)
- Page range selection, e.g. `1,3-5` (`--pages`)
- Encrypted PDF support (`--password`)

Let's look at some specific command examples.
For pages 1-3, output text in JSON format to `output/sample.json` and page images at 2x resolution to the `output/images/` directory:
```shell
node ./bin/pdf-tool.mjs ./docs/sample.pdf \
  --pages 1-3 \
  --text-out output/sample.json \
  --text-format json \
  --png-dir output/images \
  --scale 2
```
Extract all text from the PDF and save it as a single text file, ignoring page breaks. This is convenient for creating input data for LLMs.
```shell
node ./bin/pdf-tool.mjs ./docs/sample.pdf --text-out output/sample.txt --text-join
```
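Incidentally, the `1,3-5`-style value accepted by `--pages` is straightforward to expand into a concrete page list. Here is a minimal sketch (`parsePages` is a hypothetical helper, not necessarily how `pdf-tool` implements it):

```javascript
// Expand a page-range spec like "1,3-5" into a sorted, de-duplicated
// list of page numbers: "1,3-5" -> [1, 3, 4, 5].
function parsePages(spec) {
  const pages = new Set();
  for (const part of spec.split(',')) {
    // A bare number like "1" has no "-", so `end` falls back to `start`.
    const [start, end = start] = part.split('-').map(Number);
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}

// parsePages('1,3-5') → [1, 3, 4, 5]
```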
The source code is modularized by functionality:

- `src/pdfium/`: Helper functions for PDFium initialization and document loading
- `src/text/`: Logic for text extraction processing
- `src/image/`: Logic for page rendering and PNG encoding processing
- `bin/pdf-tool.mjs`: Entry point that bundles the above modules and interprets command-line arguments

By separating concerns this way, it becomes easier to port functionality to Next.js API Routes in the future.
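For instance, the text-extraction path could port into a Next.js App Router route handler along these lines. This is only a sketch: the file path and the `extractText` stand-in are hypothetical, with the real logic living in `src/text/`:

```javascript
// Hypothetical app/api/pdf/route.js for a Next.js App Router project.
// extractText() is a placeholder for the real logic in src/text/.
async function extractText(pdfBytes) {
  // The real port would call the PDFium helpers from src/pdfium/ here.
  return { bytes: pdfBytes.byteLength };
}

export async function POST(request) {
  const pdfBytes = new Uint8Array(await request.arrayBuffer());
  if (pdfBytes.byteLength === 0) {
    // Reject empty uploads before touching the PDF engine.
    return Response.json({ error: 'empty upload' }, { status: 400 });
  }
  return Response.json(await extractText(pdfBytes));
}
```

Because App Router handlers use the standard `Request`/`Response` web APIs available in Node 18+, the handler can be exercised outside Next.js as well.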
With an eye toward a Vercel + Next.js deployment, we developed a prototype PDF processing CLI tool built on WASM-based libraries. This confirmed that a robust PDF preprocessing foundation can be implemented without worrying about environment dependencies.
Moving forward, we plan to build on this prototype and expand its features.
Thanks to WASM, processes that were previously difficult to handle on the server side can now be completed within the JavaScript/Node.js ecosystem, and I feel that this has greatly expanded the scope of development.
That’s all from the Gemba.