Extracting Text from PDFs
by kuligaposten 2025-02-12
Extracting Text from PDFs and Saving It as a JavaScript File
Introduction
Handling PDFs programmatically can be useful for extracting text, analyzing content, and integrating extracted data into web applications. In this blog post, we’ll walk through a Node.js script that scans a folder for PDFs, extracts text, processes it for readability, and saves the extracted data into a JavaScript file that can be used directly in a frontend application.
How the Script Works
This script performs the following steps:
- Scans a folder (downloads/) for all PDF files.
- Extracts text from each PDF using the pdf-parse library.
- Cleans the extracted text by removing unnecessary line breaks, hyphens, and extra spaces.
- Generates an excerpt of the text for quick previews.
- Saves the extracted data in a window.data JavaScript object (output.js), which can be directly used in a web page.
The Complete Script
const fs = require('fs');
const path = require('path');
const pdf = require('pdf-parse');
const inputFolder = './downloads'; // Change this to your actual folder
const outputFile = './output.js';
// Ensure the input folder exists
if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}
// Function to clean extracted text
function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Remove hyphenated line breaks (e.g., "long-\nword" → "longword")
    .replace(/\n([a-z])/g, ' $1') // Join a line break that is followed by a lowercase letter
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs, leaving newlines alone
    .replace(/\n{2,}/g, '\n\n') // Keep real paragraph breaks
    .trim();
}
// Function to build a short preview of the text
function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  return trimmed.length < text.length ? trimmed + '...' : trimmed;
}
// Read all PDF files in the folder
fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;
  const pdfDataArray = [];
  const filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);
      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          const content = cleanText(result.text); // Clean once, reuse for the excerpt
          pdfDataArray.push({
            filename: file,
            title: path
              .basename(file, path.extname(file)) // Strip the extension regardless of its case
              .replace(/(?:%20|[-_])+/g, ' ') // Turn "%20", hyphens, and underscores into spaces
              .trim(),
            excerpt: getExcerpt(content, 100),
            content,
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
  });
  // Wait for all PDFs to be processed, then save to JS file
  Promise.all(filePromises).then(() => {
    // Format object without double quotes on keys
    const formattedData = pdfDataArray
      .map(
        (obj) => `{
  filename: ${JSON.stringify(obj.filename)},
  title: ${JSON.stringify(obj.title)},
  excerpt: ${JSON.stringify(obj.excerpt)},
  content: ${JSON.stringify(obj.content)}
}`
      )
      .join(',\n');
    const jsContent = `window.data = [\n${formattedData}\n];`;
    fs.writeFileSync(outputFile, jsContent, 'utf-8');
    console.log(`✅ Saved extracted text as a JS file: ${outputFile}`);
  });
});
Breaking Down the Script
1. Ensuring the Folder Exists
if (!fs.existsSync(inputFolder)) {
  console.error('Error: Folder does not exist', inputFolder);
  process.exit(1);
}
If the folder is missing, the script exits with a clear message and a non-zero status instead of failing later inside fs.readdir.
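fs.existsSync also returns true when the path points to a regular file, not a folder. As a minimal sketch, assuming you want to reject that case too, a stricter check could look like this. The helper name isDirectory is hypothetical and not part of the original script.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: true only when the path exists AND is a directory.
function isDirectory(folder) {
  return fs.existsSync(folder) && fs.statSync(folder).isDirectory();
}

// The system temp folder exists on any machine, so it serves as a demo input.
console.log(isDirectory(os.tmpdir()));
// A path that should not exist (made-up name for illustration).
console.log(isDirectory(path.join(os.tmpdir(), 'no-such-folder-for-demo')));
```

In the script, the guard would then read if (!isDirectory(inputFolder)) with the same error message and exit code.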
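2. Extracting and Cleaning Text
function cleanText(text) {
  return text
    .replace(/-\n/g, '') // Remove hyphenated line breaks
    .replace(/\n([a-z])/g, ' $1') // Join a line break that is followed by a lowercase letter
    .replace(/[ \t]{2,}/g, ' ') // Collapse runs of spaces and tabs, leaving newlines alone
    .replace(/\n{2,}/g, '\n\n') // Preserve paragraph breaks
    .trim();
}
Text extracted from PDFs often carries the line layout of the printed page. This function rejoins words that were hyphenated across lines, merges lines that continue mid-sentence, collapses repeated spaces and tabs, and reduces longer runs of newlines to a single blank line. The space-collapsing pattern matches only spaces and tabs; a broader \s pattern would also swallow the newlines that the last step relies on.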
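To see what each replacement does, here is a small sketch that runs a version of cleanText that collapses only spaces and tabs on a sample string. The sample is made up for illustration; real pdf-parse output varies from document to document.

```javascript
// cleanText variant that collapses only spaces and tabs, so blank lines survive.
function cleanText(text) {
  return text
    .replace(/-\n/g, '')          // "extrac-\ntion" becomes "extraction"
    .replace(/\n([a-z])/g, ' $1') // newline before a lowercase letter becomes a space
    .replace(/[ \t]{2,}/g, ' ')   // runs of spaces or tabs become one space
    .replace(/\n{2,}/g, '\n\n')   // two or more newlines become one blank line
    .trim();
}

// Made-up sample that mimics typical PDF extraction noise.
const sample = 'Text extrac-\ntion from PDFs\ncan be   messy.\n\n\nNext paragraph.';

// JSON.stringify makes the remaining newlines visible in the console.
console.log(JSON.stringify(cleanText(sample)));
// prints "Text extraction from PDFs can be messy.\n\nNext paragraph."
```

One limitation to keep in mind: the first replacement cannot tell a line-wrap hyphen from a real one, so a compound such as "well-known" that happens to break at the hyphen comes out as "wellknown".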
3. Creating an Excerpt
function getExcerpt(text, length = 100) {
  const trimmed = text.slice(0, length).trim();
  return trimmed.length < text.length ? trimmed + '...' : trimmed;
}
This gives each PDF a preview of at most 100 characters, with an ellipsis appended when the text was cut.
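A cut at a fixed character count can land in the middle of a word. One possible refinement, sketched here under the hypothetical name getWordExcerpt (it is not part of the original script), backs up to the last space before the limit.

```javascript
// Hypothetical helper: like getExcerpt, but cuts at a word boundary when it can.
function getWordExcerpt(text, length = 100) {
  if (text.length <= length) return text.trim();
  const slice = text.slice(0, length);
  const lastSpace = slice.lastIndexOf(' ');
  // Fall back to a hard cut when the slice contains no usable space.
  const cut = lastSpace > 0 ? slice.slice(0, lastSpace) : slice;
  return cut.trim() + '...';
}

console.log(getWordExcerpt('Handling PDFs programmatically can be useful', 20));
// prints "Handling PDFs..."
```

The trade-off is that excerpts become slightly shorter than the limit, which is usually fine for previews.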
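4. Reading and Processing PDFs
fs.readdir(inputFolder, (err, files) => {
  if (err) throw err;
  const pdfDataArray = [];
  const filePromises = files.map((file) => {
    if (path.extname(file).toLowerCase() === '.pdf') {
      const pdfPath = path.join(inputFolder, file);
      return fs.promises
        .readFile(pdfPath)
        .then((data) => pdf(data))
        .then((result) => {
          const content = cleanText(result.text); // Clean once, reuse for the excerpt
          pdfDataArray.push({
            filename: file,
            title: path
              .basename(file, path.extname(file)) // Strip the extension regardless of its case
              .replace(/(?:%20|[-_])+/g, ' ') // Turn "%20", hyphens, and underscores into spaces
              .trim(),
            excerpt: getExcerpt(content, 100),
            content,
          });
        })
        .catch((err) => console.error(`Error processing ${file}:`, err));
    }
  });
This reads every entry in the folder, skips anything that is not a PDF, and turns each PDF into a record with a filename, title, excerpt, and cleaned content. The files are processed in parallel, so records land in pdfDataArray in completion order, which may differ from the directory order. A failure in one file is logged and does not stop the others.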
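The title pattern deserves a closer look. Inside square brackets, %20 is read as three separate characters (%, 2, and 0), so a character class such as [-_%20] would also strip the digits 2 and 0 from a title. The sketch below, with the hypothetical helper name titleFromFilename, treats %20 as a unit instead.

```javascript
const path = require('path');

// Hypothetical helper: derive a readable title from a PDF filename.
function titleFromFilename(file) {
  return path
    .basename(file, path.extname(file)) // drop the extension, whatever its case
    .replace(/(?:%20|[-_])+/g, ' ')     // "%20" as a unit, or runs of - and _
    .trim();
}

console.log(titleFromFilename('annual_report-2020.pdf')); // prints "annual report 2020"
console.log(titleFromFilename('My%20Notes.PDF'));         // prints "My Notes"
```

Passing path.extname(file) as the suffix also handles uppercase extensions, which a hard-coded '.pdf' suffix can leave in the title.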
5. Saving Extracted Data as a JavaScript File
  Promise.all(filePromises).then(() => {
    // Format each record as an object literal with unquoted keys
    const formattedData = pdfDataArray
      .map(
        (obj) => `{
  filename: ${JSON.stringify(obj.filename)},
  title: ${JSON.stringify(obj.title)},
  excerpt: ${JSON.stringify(obj.excerpt)},
  content: ${JSON.stringify(obj.content)}
}`
      )
      .join(',\n');
    const jsContent = `window.data = [\n${formattedData}\n];`;
    fs.writeFileSync(outputFile, jsContent, 'utf-8');
    console.log(`Saved extracted text as a JS file: ${outputFile}`);
  });
}); // End of the fs.readdir callback
Instead of saving the extracted data as a JSON file, the script writes a JavaScript file that assigns the array to window.data. A page can then load it with a plain script tag, with no fetch call or JSON parsing step, which is convenient for static pages opened straight from disk. Each value still goes through JSON.stringify, so quotes and newlines in the PDF text are escaped safely.
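If unquoted keys are not important to you, the formatting step can be shortened, because JSON text is also valid JavaScript. The following is a sketch of that alternative, not the script's own method. The helper name buildOutputJs and the sample record are made up, and Node's built-in vm module stands in for the browser so the result can be checked.

```javascript
const vm = require('vm');

// Hypothetical helper: serialize all records with a single JSON.stringify call.
// Keys come out double-quoted, which is still valid JavaScript.
function buildOutputJs(records) {
  return `window.data = ${JSON.stringify(records, null, 2)};\n`;
}

// Made-up record for illustration.
const records = [
  { filename: 'a.pdf', title: 'a', excerpt: 'Hi', content: 'Hi "there"\nBye' },
];
const js = buildOutputJs(records);

// Simulate the browser: run the generated code against a fake window object.
const sandbox = { window: {} };
vm.runInNewContext(js, sandbox);
console.log(sandbox.window.data[0].content === records[0].content); // prints true
```

The round trip shows that quotes and newlines inside the content survive serialization unchanged.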
How to Use
1. Install Dependencies
npm install pdf-parse
2. Place Your PDFs in the downloads/ Folder
Make sure your target PDFs are in the folder that inputFolder points to. The default, ./downloads, is resolved relative to the directory you run the script from.
3. Run the Script
Save the script as script.js (any filename works) and run it with Node.js:
node script.js
4. Use output.js in a Web App
Include the generated output.js file in an HTML page:
<script src="output.js"></script>
<script>
  console.log(window.data); // Access extracted PDF content
</script>
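As one example of what a page could do with window.data, here is a sketch of a case-insensitive search over the records. The function name searchDocs and the sample array are hypothetical; in a browser you would pass window.data instead of sampleData.

```javascript
// Hypothetical helper: return records whose title or content contains the query.
function searchDocs(data, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return []; // an empty query matches nothing
  return data.filter(
    (doc) =>
      doc.title.toLowerCase().includes(needle) ||
      doc.content.toLowerCase().includes(needle)
  );
}

// Made-up stand-in for window.data so the sketch also runs outside a browser.
const sampleData = [
  { filename: 'guide.pdf', title: 'guide', excerpt: 'Install...', content: 'Install the package first.' },
  { filename: 'faq.pdf', title: 'faq', excerpt: 'Common...', content: 'Common questions and answers.' },
];

console.log(searchDocs(sampleData, 'INSTALL').map((doc) => doc.filename));
// prints [ 'guide.pdf' ]
```

A linear scan like this is usually enough for a few hundred documents; larger archives may call for a prebuilt index.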
Final Thoughts
This script extracts and cleans text from a folder of PDFs and saves it in a format that a web page can load directly. Whether you're building a searchable document archive, a text analysis tool, or an online reading platform, this approach can save time and improve content accessibility.