Extracting Text from PDFs

by kuligaposten 2025-02-12

Extracting Text from PDFs and Saving It as a JavaScript File

Introduction

Handling PDFs programmatically can be useful for extracting text, analyzing content, and integrating extracted data into web applications. In this blog post, we’ll walk through a Node.js script that scans a folder for PDFs, extracts text, processes it for readability, and saves the extracted data into a JavaScript file that can be used directly in a frontend application.

How the Script Works

This script performs the following steps:

1. Scans a folder (downloads/) for PDF files.
2. Extracts the text of each PDF with the pdf-parse library.
3. Cleans the extracted text by rejoining words hyphenated across line breaks, merging lines broken mid-sentence, and collapsing repeated spaces.
4. Generates a short excerpt (the first 100 characters) for quick previews.
5. Saves the results to output.js as an array assigned to window.data, which a web page can load directly.

The Complete Script

const fs = require('fs');
const path = require('path');
const pdf = require('pdf-parse');

const inputFolder = './downloads'; // Change this to your actual folder
const outputFile = './output.js';

// Ensure the input folder exists
if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}

// Function to clean extracted text
function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Rejoin words hyphenated across lines (e.g., "long-\nword" → "longword")
    .replace(/\n([a-z])/g, ' $1') // Join a line break followed by a lowercase letter (the sentence continues)
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs; \s would also swallow the newlines
    .replace(/\n{2,}/g, '\n\n') // Normalize paragraph breaks to a single blank line
    .trim();
}

// Return the first `length` characters, with "..." appended when the text was cut
function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  return trimmed.length < text.length ? trimmed + '...' : trimmed;
}

// Read all PDF files in the folder
fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;

  let pdfDataArray = [];

  let filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);

      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          pdfDataArray.push({
            filename: file,
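            title: path
              .basename(file, path.extname(file)) // Strip the extension whatever its case
              .replace(/(?:%20|[-_])+/g, ' ') // "%20", "-" and "_" become spaces; digits are kept
              .trim(),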
            excerpt: getExcerpt(cleanText(result.text), 100),
            content: cleanText(result.text),
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
  });

  // Wait for all PDFs to be processed, then save to JS file
  Promise.all(filePromises).then(() => {
    // Format object without double quotes on keys
    const formattedData = pdfDataArray
      .map(
        (obj) => `{
    filename: ${JSON.stringify(obj.filename)},
    title: ${JSON.stringify(obj.title)},
    excerpt: ${JSON.stringify(obj.excerpt)},
    content: ${JSON.stringify(obj.content)}
  }`
      )
      .join(',\n');

    const jsContent = `window.data = [\n${formattedData}\n];`;

    fs.writeFileSync(outputFile, jsContent, 'utf-8');
    console.log(`✅ Saved extracted text as a JS file: ${outputFile}`);
  });
});
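
One property of the script is worth knowing: each record is pushed as soon as its PDF finishes parsing, so the order of entries in output.js can differ between runs. The sketch below shows one way to keep the input order, by returning each record from its promise and reading the results from Promise.all. The names collectInOrder and fakeExtract are hypothetical, and fakeExtract only simulates parsing with timers; in the real script the worker would read the file and call pdf-parse.

```javascript
// Hypothetical helper: run `worker` on every item concurrently,
// keep the input order, and skip items whose worker fails.
async function collectInOrder(items, worker) {
  const results = await Promise.all(
    items.map((item) =>
      Promise.resolve()
        .then(() => worker(item))
        .catch((err) => {
          console.error(`Error processing ${item}:`, err.message);
          return null; // Mark the failure so it can be filtered out below
        })
    )
  );
  return results.filter((record) => record !== null);
}

// Stand-in for "read the file and parse it" (an assumption for this demo):
// a.pdf finishes last, bad.pdf fails.
const fakeExtract = (file) =>
  new Promise((resolve, reject) => {
    const delay = file === 'a.pdf' ? 30 : 5;
    setTimeout(() => {
      if (file === 'bad.pdf') reject(new Error('unreadable'));
      else resolve({ filename: file });
    }, delay);
  });

collectInOrder(['a.pdf', 'bad.pdf', 'b.pdf'], fakeExtract).then((records) => {
  console.log(records.map((r) => r.filename)); // [ 'a.pdf', 'b.pdf' ]
});
```

Even though a.pdf finishes last, it stays first in the result, because Promise.all resolves to an array in the order of its inputs.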

Breaking Down the Script

1. Ensuring the Folder Exists

if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}

Without this check, fs.readdir would fail with an ENOENT error when the folder is missing. Checking up front lets the script print a clear message and exit with a non-zero status.
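
Note that fs.existsSync also returns true when the path is a regular file, in which case reading it as a folder would still fail. A slightly stricter check is sketched below; isDirectory is a hypothetical helper name, not part of the original script.

```javascript
const fs = require('fs');
const os = require('os');

// Hypothetical helper: true only if the path exists and is a directory
function isDirectory(dirPath) {
  try {
    return fs.statSync(dirPath).isDirectory();
  } catch (err) {
    // A missing path (or a path through a non-directory) simply means "no"
    if (err.code === 'ENOENT' || err.code === 'ENOTDIR') return false;
    throw err; // Other problems, such as permissions, should surface
  }
}

console.log(isDirectory(os.tmpdir())); // The system temp folder is a directory
console.log(isDirectory('./no-such-folder-for-this-demo')); // Missing path
```

In the script, the condition would become if (!isDirectory(inputFolder)).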

2. Extracting and Cleaning Text

function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Rejoin words hyphenated across lines
    .replace(/\n([a-z])/g, ' $1') // Join a line break followed by a lowercase letter
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs; \s would also swallow the newlines
    .replace(/\n{2,}/g, '\n\n') // Normalize paragraph breaks to a single blank line
    .trim();
}

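Text extracted from a PDF usually follows the page layout rather than the sentence structure, so words end up split by end-of-line hyphens and sentences are broken across lines. This function rejoins hyphenated words, merges lines that continue mid-sentence, and tidies the remaining whitespace. Note that the first rule also joins genuine compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown").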
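
A small worked example makes the effect of each rule easier to see. The function is restated under the hypothetical name cleanTextSketch so the snippet runs on its own; it assumes that only spaces and tabs are collapsed, so blank lines between paragraphs survive. The sample string is made up.

```javascript
// Same four rules as cleanText, restated so this snippet is self-contained.
// Assumption: only spaces and tabs are collapsed, never newlines.
function cleanTextSketch(text) {
  return text
    .replace(/-\n/g, '') // "extrac-\ntion" -> "extraction"
    .replace(/\n([a-z])/g, ' $1') // "can be\nmessy" -> "can be messy"
    .replace(/[ \t]{2,}/g, ' ') // "Next   paragraph" -> "Next paragraph"
    .replace(/\n{2,}/g, '\n\n') // three newlines -> one blank line
    .trim();
}

// Made-up input imitating raw PDF output
const raw = 'Text extrac-\ntion from PDFs can be\nmessy.\n\n\nNext   paragraph.';

console.log(JSON.stringify(cleanTextSketch(raw)));
// "Text extraction from PDFs can be messy.\n\nNext paragraph."
```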

3. Creating an Excerpt

// Return the first `length` characters, with "..." appended when the text was cut
function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  return trimmed.length < text.length ? trimmed + '...' : trimmed;
}

Each record gets a short preview: the first 100 characters of the cleaned text, with "..." appended when the text was cut.
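
Because the cut is made at a fixed character position, an excerpt can end in the middle of a word. If that matters for your previews, one possible variant cuts at the last space instead. This is a sketch, and getExcerptAtWord is a hypothetical name that does not appear in the original script.

```javascript
// Hypothetical variant of getExcerpt: cut at the last space that fits
// within `length` characters, so the preview never ends mid-word.
function getExcerptAtWord(text, length = 100) {
  if (text.length <= length) return text.trim();

  // Look one character past the limit so a space exactly at the limit counts
  const head = text.slice(0, length + 1);
  const lastSpace = head.lastIndexOf(' ');

  // Fall back to a hard cut when the text has no space to break at
  const cut = lastSpace > 0 ? head.slice(0, lastSpace) : text.slice(0, length);
  return cut.trim() + '...';
}

console.log(getExcerptAtWord('The quick brown fox jumps', 12)); // The quick...
console.log(getExcerptAtWord('short', 12)); // short
```

With the same input and length, the original getExcerpt returns "The quick br...".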

4. Reading and Processing PDFs

fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;

  let pdfDataArray = [];

  let filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);

      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          pdfDataArray.push({
            filename: file,
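            title: path
              .basename(file, path.extname(file)) // Strip the extension whatever its case
              .replace(/(?:%20|[-_])+/g, ' ') // "%20", "-" and "_" become spaces; digits are kept
              .trim(),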
            excerpt: getExcerpt(cleanText(result.text), 100),
            content: cleanText(result.text),
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
  });

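This reads the folder, keeps only files with a .pdf extension, extracts and cleans the text of each one, and collects the results in pdfDataArray. A PDF that fails to parse is logged and skipped, so one bad file does not stop the run. The snippet stops inside the readdir callback; the saving step in the next section runs inside that same callback.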
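
The file filter and the title derivation can be tried in isolation, without any PDFs. The sketch below uses two hypothetical helpers, isPdf and toTitle, and made-up file names. Note the grouping (?:%20|[-_]): inside a character class such as [-_%20], the sequence %20 would be read as three separate characters, so the digits 2 and 0 would be stripped from titles.

```javascript
const path = require('path');

// Hypothetical helper: true for names ending in .pdf, in any letter case
const isPdf = (file) => path.extname(file).toLowerCase() === '.pdf';

// Hypothetical helper: turn a file name into a readable title
function toTitle(file) {
  return path
    .basename(file, path.extname(file)) // Drop the extension, whatever its case
    .replace(/(?:%20|[-_])+/g, ' ') // "%20", "-" and "_" become spaces
    .trim();
}

// Made-up file names for illustration
const files = ['annual_report-2020.pdf', 'My%20Notes.PDF', 'readme.txt'];

console.log(files.filter(isPdf).map(toTitle));
// [ 'annual report 2020', 'My Notes' ]
```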

5. Saving Extracted Data as a JavaScript File

Promise.all(filePromises).then(() => {
  const formattedData = pdfDataArray
    .map(
      (obj) => `{
    filename: ${JSON.stringify(obj.filename)},
    title: ${JSON.stringify(obj.title)},
    excerpt: ${JSON.stringify(obj.excerpt)},
    content: ${JSON.stringify(obj.content)}
  }`
    )
    .join(',\n');

  const jsContent = `window.data = [\n${formattedData}\n];`;

  fs.writeFileSync(outputFile, jsContent, 'utf-8');
  console.log(`Saved extracted text as a JS file: ${outputFile}`);
});

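Instead of writing a JSON file, the script writes a JavaScript file that assigns the array of records to window.data. A page can then load the data with a plain script tag, with no fetch call or JSON parsing step, which typically also works when the page is opened straight from disk.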
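
If unquoted keys are not important to you, a shorter route is to let JSON.stringify format the whole array, since JSON output is also a valid JavaScript expression. The sketch below is an alternative, not the script's own method; toWindowDataScript and the sample record are hypothetical. It checks the generated source by evaluating it against a stand-in for the browser's window object.

```javascript
// Hypothetical alternative: serialize the whole array in one call.
// Keys are double-quoted, which is still valid JavaScript.
function toWindowDataScript(records) {
  return `window.data = ${JSON.stringify(records, null, 2)};\n`;
}

// Made-up record in the same shape the script produces
const records = [
  {
    filename: 'a.pdf',
    title: 'a',
    excerpt: 'Hi there...',
    content: 'Hi there "reader".\n\nSecond paragraph.',
  },
];

const js = toWindowDataScript(records);

// Evaluate the generated source with a plain object standing in for window
const fakeWindow = {};
new Function('window', js)(fakeWindow);

console.log(fakeWindow.data[0].title); // a
console.log(fakeWindow.data.length); // 1
```

In the script, the string returned by toWindowDataScript(pdfDataArray) would be passed to fs.writeFileSync in place of jsContent.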

How to Use

1. Install Dependencies

npm install pdf-parse

2. Place Your PDFs in the `downloads/` Folder

Make sure your target PDFs are in the specified folder.

3. Run the Script

node script.js

4. Use `output.js` in a Web App

Include the generated output.js file in an HTML page:

<script src="output.js"></script>
<script>
  console.log(window.data); // Access extracted PDF content
</script>
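
Once the data is loaded, ordinary array methods are enough for simple features. As an illustration, here is a minimal case-insensitive search over titles and content. The helper searchDocs is hypothetical and the sample records are made up; in the page you would pass window.data instead of sample.

```javascript
// Hypothetical helper: case-insensitive substring search over title and content
function searchDocs(docs, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return []; // An empty query matches nothing

  return docs.filter(
    (doc) =>
      doc.title.toLowerCase().includes(needle) ||
      doc.content.toLowerCase().includes(needle)
  );
}

// Made-up records in the same shape as window.data
const sample = [
  { filename: 'a.pdf', title: 'Invoice March', excerpt: '...', content: 'Total due: 120 EUR' },
  { filename: 'b.pdf', title: 'Meeting notes', excerpt: '...', content: 'Discussed the invoice backlog' },
  { filename: 'c.pdf', title: 'Manual', excerpt: '...', content: 'Press the power button' },
];

console.log(searchDocs(sample, 'invoice').map((doc) => doc.filename));
// [ 'a.pdf', 'b.pdf' ]
```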

Final Thoughts

This script is a compact way to extract and clean text from PDFs and save it in a form a web page can load directly. Whether you're building a searchable document archive, a text analysis tool, or an online reading platform, this approach can save time and make the content of your PDFs easier to reach.