Extracting Text from PDFs

by kuligaposten 2025-02-12

Handling PDFs programmatically can be useful for extracting text, analyzing content, and integrating extracted data into web applications.

Extracting Text from PDFs and Saving It as a JavaScript File

Introduction

In this post, we’ll walk through a Node.js script that scans a folder for PDFs, extracts their text, cleans it up for readability, and saves the result to a JavaScript file that a frontend page can load directly.


How the Script Works

This script performs the following steps:

  1. Scans a folder (downloads/) for all PDF files.
  2. Extracts text from each PDF using the pdf-parse library.
  3. Cleans the extracted text by rejoining words hyphenated across line breaks, merging lines broken mid-sentence, and collapsing extra spaces.
  4. Generates an excerpt of the text for quick previews.
  5. Saves the extracted data to output.js as an array assigned to window.data, which a web page can load with a script tag.

The Complete Script

const fs = require('fs');
const path = require('path');
const pdf = require('pdf-parse');

const inputFolder = './downloads'; // Change this to your actual folder
const outputFile = './output.js';

// Ensure the input folder exists
if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}

// Function to clean extracted text
function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Rejoin words hyphenated across a line break (e.g., "long-\nword" → "longword")
    .replace(/\n([a-z])/g, ' $1') // Join a line that continues mid-sentence (next line starts lowercase)
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs; \s would also swallow newlines
    .replace(/\n{2,}/g, '\n\n') // Keep real paragraph breaks, capped at one blank line
    .trim();
}

function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  // Append an ellipsis only when the text was actually cut
  return text.length > length ? trimmed + '...' : trimmed;
}

// Read all PDF files in the folder
fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;

  let pdfDataArray = [];

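  let filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);

      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          const content = cleanText(result.text); // Clean once, reuse for excerpt and content
          pdfDataArray.push({
            filename: file,
            title: path
              .parse(file).name // Strip the extension regardless of case (.pdf, .PDF)
              .replace(/(?:%20|[-_])+/g, ' ') // "%20", "-" and "_" become spaces; [-_%20] would also strip the digits 2 and 0
              .trim(),
            excerpt: getExcerpt(content, 100),
            content,
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
    // Non-PDF entries return undefined, which Promise.all passes through
  });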

  // Wait for all PDFs to be processed, then save to JS file
  Promise.all(filePromises).then(() => {
    // Format object without double quotes on keys
    const formattedData = pdfDataArray
      .map(
        (obj) => `{
    filename: ${JSON.stringify(obj.filename)},
    title: ${JSON.stringify(obj.title)},
    excerpt: ${JSON.stringify(obj.excerpt)},
    content: ${JSON.stringify(obj.content)}
  }`
      )
      .join(',\n');

    const jsContent = `window.data = [\n${formattedData}\n];`;

    fs.writeFileSync(outputFile, jsContent, 'utf-8');
    console.log(`✅ Saved extracted text as a JS file: ${outputFile}`);
  });
});

Breaking Down the Script

1. Ensuring the Folder Exists

if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}

This stops the script early with a clear message if the input folder is missing, instead of failing later inside fs.readdir.


2. Extracting and Cleaning Text

function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Rejoin words hyphenated across a line break
    .replace(/\n([a-z])/g, ' $1') // Join a line that continues mid-sentence
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs; \s would also swallow newlines
    .replace(/\n{2,}/g, '\n\n') // Preserve paragraph breaks, capped at one blank line
    .trim();
}

Text extracted from PDFs usually keeps the line breaks of the page layout, so this function rejoins words hyphenated at line ends, merges lines that continue mid-sentence, and collapses repeated whitespace.
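The cleaning rules are easiest to judge on a small sample. The following is a minimal sketch, not part of the original script: the helper name cleanTextKeepParagraphs and the sample string are made up for illustration. It collapses only spaces and tabs, so that blank lines between paragraphs survive the cleanup.

```javascript
// Sketch: a self-contained cleaning helper (hypothetical name) that keeps paragraph breaks.
function cleanTextKeepParagraphs(text) {
  return text
    .replace(/-\n/g, '') // Rejoin words hyphenated across a line break
    .replace(/\n([a-z])/g, ' $1') // Join a line that continues in lowercase
    .replace(/[ \t]{2,}/g, ' ') // Collapse spaces and tabs only, leaving newlines alone
    .replace(/\n{2,}/g, '\n\n') // Cap paragraph breaks at one blank line
    .trim();
}

// Made-up sample with a hyphenated word, a double space, and a triple newline.
const sample = 'Text extrac-\ntion is  messy\nand noisy.\n\n\nNext paragraph.';

console.log(JSON.stringify(cleanTextKeepParagraphs(sample)));
// "Text extraction is messy and noisy.\n\nNext paragraph."
```

One caveat with the first rule: it also removes hyphens that belong to the word, so "well-\nknown" becomes "wellknown".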


3. Creating an Excerpt

function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  // Append an ellipsis only when the text was actually cut
  return text.length > length ? trimmed + '...' : trimmed;
}

This gives each extracted PDF a short preview: the first 100 characters by default, followed by an ellipsis when the text is longer than that.
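Cutting at a fixed character count can split a word in half. As one possible refinement, the sketch below cuts at the last space before the limit. The helper name getWordExcerpt is hypothetical and not part of the original script.

```javascript
// Sketch: cut the excerpt at the last space before the limit (hypothetical helper).
function getWordExcerpt(text, length = 100) {
  if (text.length <= length) return text; // Short text needs no ellipsis
  const slice = text.slice(0, length);
  const lastSpace = slice.lastIndexOf(' ');
  // Fall back to the raw slice when there is no space to cut at
  const cut = lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
  return cut.trimEnd() + '...';
}

console.log(getWordExcerpt('Extracting text from PDFs can be messy.', 18));
// Extracting text...
console.log(getWordExcerpt('Short text.', 18));
// Short text.
```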


4. Reading and Processing PDFs

fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;

  let pdfDataArray = [];

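  let filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);

      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          const content = cleanText(result.text); // Clean once, reuse for excerpt and content
          pdfDataArray.push({
            filename: file,
            title: path
              .parse(file).name // Strip the extension regardless of case (.pdf, .PDF)
              .replace(/(?:%20|[-_])+/g, ' ') // "%20", "-" and "_" become spaces; [-_%20] would also strip the digits 2 and 0
              .trim(),
            excerpt: getExcerpt(content, 100),
            content,
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
    // Non-PDF entries return undefined, which Promise.all passes through
  });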

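This lists the folder, keeps only files with a .pdf extension, and for each one reads the file, extracts its text with pdf-parse, and pushes a record with the filename, title, excerpt, and cleaned content. Files that fail to parse are logged and skipped. The readdir callback continues in the next step.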
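The title is derived from the filename, and that step is worth testing on its own. The sketch below pulls it into a small helper; the name titleFromFilename and the sample filenames are assumptions made for illustration. Note that a character class such as [-_%20] matches each of the characters -, _, %, 2 and 0 individually, so it would also strip those digits from a title. The sketch uses an alternation instead.

```javascript
const path = require('path');

// Sketch: derive a display title from a PDF filename (hypothetical helper).
function titleFromFilename(file) {
  return path
    .parse(file).name // Drop the extension regardless of case (.pdf, .PDF)
    .replace(/(?:%20|[-_])+/g, ' ') // Treat "%20", "-" and "_" as word separators
    .trim();
}

console.log(titleFromFilename('annual_report-2020.pdf')); // annual report 2020
console.log(titleFromFilename('Meeting%20Notes.PDF')); // Meeting Notes
```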


5. Saving Extracted Data as a JavaScript File

Promise.all(filePromises).then(() => {
  const formattedData = pdfDataArray
    .map(
      (obj) => `{
    filename: ${JSON.stringify(obj.filename)},
    title: ${JSON.stringify(obj.title)},
    excerpt: ${JSON.stringify(obj.excerpt)},
    content: ${JSON.stringify(obj.content)}
  }`
    )
    .join(',\n');

  const jsContent = `window.data = [\n${formattedData}\n];`;

  fs.writeFileSync(outputFile, jsContent, 'utf-8');
  console.log(`Saved extracted text as a JS file: ${outputFile}`);
});

Instead of writing a JSON file, the script writes JavaScript source that assigns an array of records to window.data. A page can then load the data with a plain script tag, without a fetch call or a build step.
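If unquoted keys are not important, the output can be produced with a single JSON.stringify call over the whole array, since JSON text is also valid JavaScript. This is a minimal sketch of that alternative, not the script's own approach. The helper name toWindowDataScript and the sample record are made up, and a plain object stands in for the browser's window so the result can be checked in Node.js.

```javascript
// Sketch: serialize the whole array in one call (hypothetical helper name).
function toWindowDataScript(records) {
  return `window.data = ${JSON.stringify(records, null, 2)};\n`;
}

// Made-up sample record with the same fields the script produces.
const js = toWindowDataScript([
  { filename: 'a.pdf', title: 'a', excerpt: 'Hi...', content: 'Hi there' },
]);

// Evaluate the generated source against a stand-in for the browser's window.
const fakeWindow = {};
new Function('window', js)(fakeWindow);

console.log(fakeWindow.data[0].title); // a
```

The trade-off is cosmetic: keys appear in double quotes, but there is no hand-built template to keep in sync when a field is added.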


How to Use

1. Install Dependencies

npm install pdf-parse

2. Place Your PDFs in the downloads/ Folder

Make sure your target PDFs are in the specified folder.

3. Run the Script

node script.js

4. Use output.js in a Web App

Include the generated output.js file in an HTML page:

<script src="output.js"></script>
<script>
  console.log(window.data); // Access extracted PDF content
</script>
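Once the data is on the page, a simple search needs only a few lines. The sketch below is one possible starting point: the function name searchDocuments and the sample records are assumptions, and in a real page the records argument would be window.data.

```javascript
// Sketch: case-insensitive search over records shaped like window.data entries.
function searchDocuments(records, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return []; // An empty query matches nothing
  return records.filter(
    (doc) =>
      doc.title.toLowerCase().includes(needle) ||
      doc.content.toLowerCase().includes(needle)
  );
}

// Stand-in data; in a page this would be window.data from output.js.
const records = [
  { filename: 'a.pdf', title: 'Annual Report', excerpt: 'Revenue...', content: 'Revenue grew.' },
  { filename: 'b.pdf', title: 'Meeting Notes', excerpt: 'Agenda...', content: 'Agenda and actions.' },
];

console.log(searchDocuments(records, 'agenda').map((doc) => doc.filename)); // [ 'b.pdf' ]
```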

Final Thoughts

This script extracts and cleans text from PDFs and saves it in a format that a web page can load directly. Whether you're building a searchable document archive, a text analysis tool, or an online reading platform, this approach can save time and make the content easier to reach.
