Prototype Development of a PDF to Text Extraction and Image Conversion CLI Tool with Node.js and WASM Libraries
In recent years, technologies like RAG (Retrieval-Augmented Generation) that leverage document content with AI have been gaining attention. As preprocessing for this, there’s growing demand for extracting text and images from PDF files for vectorization.
Today, as a first step in this preprocessing, I’ve developed a prototype PDF processing CLI tool that runs on Node.js.
This article introduces the development background, particularly the reasons for adopting WebAssembly (Wasm) in our technology selection, and an overview of the implemented tool.
The ultimate goal is to build a web application that processes user-uploaded PDFs, vectorizes their content (text and images), and stores them in a database. For the development stack, we’re planning on Vercel and Next.js.
The challenge here is the dependencies of PDF processing libraries. Many of them rely on native binaries such as `poppler` or `ImageMagick`. These binaries can't easily be used in serverless environments like Vercel, often forcing additional build configuration and environment-specific constraints.
That's why we focused on platform-independent libraries built with WebAssembly. WASM code runs directly in the Node.js runtime without special environment setup, making these libraries a great fit for serverless environments.
For this prototype, we adopted the following WASM-based packages:

- `@hyzyla/pdfium`: A WASM build of PDFium, the PDF rendering engine maintained by Google. It provides powerful features including PDF loading, text extraction, and page-by-page rendering.
- `@jsquash/png`: A WASM version of the high-speed image codec library libspng. We use it to encode bitmap data rendered by PDFium into PNG format.

The convenience of these libraries is also appealing: a plain `npm install` downloads the necessary WASM assets, and they can be used without additional build steps.
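As a rough sketch of how the two libraries fit together, the function below extracts one page's text and encodes its rendered bitmap as PNG. The API names (`PDFiumLibrary.init`, `page.getText()`, `page.render()`, `encode()`) follow my reading of each package's README — verify them there before relying on this — and `pageToAssets` itself is a hypothetical helper, not code from the tool:

```javascript
// Assumed imports, per each package's README (verify before use):
//   import { PDFiumLibrary } from '@hyzyla/pdfium';
//   import { encode as encodePng } from '@jsquash/png';

// Hypothetical helper: extract one page's text and encode its rendered
// bitmap as PNG. The encoder is passed in as a parameter so the function
// stays testable without the WASM packages installed.
async function pageToAssets(page, encodePng, scale = 2) {
  const text = page.getText(); // plain text content of the page
  const bitmap = await page.render({ scale, render: 'bitmap' }); // RGBA pixels
  const png = await encodePng({
    data: new Uint8ClampedArray(bitmap.data),
    width: bitmap.width,
    height: bitmap.height,
  });
  return { text, png };
}
```

Passing the encoder in (rather than importing it inside the helper) keeps the PDF and PNG layers decoupled, which also mirrors how the modules could be split later.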
The developed CLI tool `pdf-tool` can perform text extraction and image conversion on specified PDF files.

```shell
# Basic usage
node ./bin/pdf-tool.mjs <pdf-path> [options]
```
Main features include:

- Text extraction with configurable output destination and format (`--text-out`, `--text-format`)
- Joining all pages into a single text (`--text-join`)
- Page image output as PNG (`--png-dir`)
- Render resolution and size control (`--scale`, `--width`, `--height`)
- Page range selection, e.g. `1,3-5` (`--pages`)
- Encrypted PDF support (`--password`)

Let's look at some specific command examples.
For pages 1-3, output text in JSON format to `output/sample.json` and page images at 2x resolution to the `output/images/` directory:
```shell
node ./bin/pdf-tool.mjs ./docs/sample.pdf \
  --pages 1-3 \
  --text-out output/sample.json \
  --text-format json \
  --png-dir output/images \
  --scale 2
```
Extract all text from the PDF and save it as a single text file, ignoring page breaks. This is convenient for creating input data for LLMs.
```shell
node ./bin/pdf-tool.mjs ./docs/sample.pdf --text-out output/sample.txt --text-join
```
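Incidentally, the `1,3-5`-style value accepted by `--pages` is straightforward to expand into a concrete page list. Here is a minimal sketch (`parsePages` is a hypothetical helper, not necessarily how `pdf-tool` implements it):

```javascript
// Expand a page-range spec like "1,3-5" into a sorted, de-duplicated
// list of page numbers: "1,3-5" -> [1, 3, 4, 5].
function parsePages(spec) {
  const pages = new Set();
  for (const part of spec.split(',')) {
    // A bare number like "1" has no "-", so `end` falls back to `start`.
    const [start, end = start] = part.split('-').map(Number);
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}

// parsePages('1,3-5') → [1, 3, 4, 5]
```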
The source code is modularized by functionality:

- `src/pdfium/`: Helper functions for PDFium initialization and document loading
- `src/text/`: Logic for text extraction processing
- `src/image/`: Logic for page rendering and PNG encoding processing
- `bin/pdf-tool.mjs`: Entry point that bundles the above modules and interprets command-line arguments

By separating concerns this way, it becomes easier to port functionality to Next.js API Routes in the future.
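For instance, the text-extraction path could port into a Next.js App Router route handler along these lines. This is only a sketch: the file path and the `extractText` stand-in are hypothetical, with the real logic living in `src/text/`:

```javascript
// Hypothetical app/api/pdf/route.js for a Next.js App Router project.
// extractText() is a placeholder for the real logic in src/text/.
async function extractText(pdfBytes) {
  // The real port would call the PDFium helpers from src/pdfium/ here.
  return { bytes: pdfBytes.byteLength };
}

export async function POST(request) {
  const pdfBytes = new Uint8Array(await request.arrayBuffer());
  if (pdfBytes.byteLength === 0) {
    // Reject empty uploads before touching the PDF engine.
    return Response.json({ error: 'empty upload' }, { status: 400 });
  }
  return Response.json(await extractText(pdfBytes));
}
```

Because App Router handlers use the standard `Request`/`Response` web APIs available in Node 18+, the handler can be exercised outside Next.js as well.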
With an eye toward a Vercel + Next.js deployment, we developed a prototype PDF processing CLI tool built on WASM-based libraries. This confirmed that a robust PDF preprocessing foundation can be implemented without worrying about environment dependencies.
Moving forward, we plan to build on this prototype and expand its features.
Thanks to WASM, processes that were previously difficult to handle on the server side can now be completed within the JavaScript/Node.js ecosystem, and I feel that this has greatly expanded the scope of development.
That’s all from the Gemba.