Web scraping in JavaScript: Complete guide

Scraping in JavaScript is an effective way to collect web data, whether the content is static or dynamic. This guide explains everything you need to know.

Prerequisites for web scraping in JavaScript

Web scraping with JavaScript. Cristina for Alucare.fr

Before coding, you need to set up your environment. For that, you need two basic tools:

  • ✅ Node.js: an open-source, cross-platform JavaScript runtime environment. It lets you run JS code outside a browser.
  • ✅ npm (Node Package Manager): a package manager bundled with Node.js, handy for installing libraries quickly.

For libraries, here are the most popular:

  • Puppeteer and Playwright for scraping dynamic sites. They simulate a real browser and load pages just like a user would.
  • Cheerio for static scraping. It is notably lightweight, fast and easy to use.

Web scraping in JavaScript

Let's get down to business with this short tutorial.

Step 1: Installation and configuration

  • Download Node.js, then install it. Check the installation from the terminal:
node -v
npm -v
  • Create a Node.js project in your terminal:
mkdir my-scraping
cd my-scraping
npm init -y

This creates a Node.js project with a package.json file.

  • Install the necessary libraries

👉 For a static page (Cheerio)

npm install axios cheerio

👉 For a dynamic page (Puppeteer)

npm install puppeteer

Step 2: Creating a scraping script

  • Scraping a static page with Cheerio
// Import libraries
const axios = require('axios');
const cheerio = require('cheerio');

// URL of the page to scrape
const url = "https://exemple.com";

// Main function
async function scrapePage() {
  try {
    // Download HTML content
    const { data } = await axios.get(url);

    // Load HTML with Cheerio
    const $ = cheerio.load(data);

    // Example: retrieve all h1 titles
    const titles = [];
    $("h1").each((i, elem) => {
      titles.push($(elem).text().trim());
    });

    // Display results
    console.log("Titles found:", titles);

  } catch (error) {
    console.error("Error while scraping:", error);
  }
}

// Run script
scrapePage();

👉 You can replace https://exemple.com with the URL of the page you want to scrape and modify the selector $("h1") to target what you're interested in (e.g. $("p"), .class, #id, etc.).
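
For instance, here is a hypothetical variation to drop inside scrapePage() in place of the h1 loop. The .product and .price class names, and the markup they imply, are assumptions about the target page, not part of the example above:

// Hypothetical markup: each .product block contains an h2 title, a .price element and a link
const products = [];
$(".product").each((i, elem) => {
  products.push({
    name: $(elem).find("h2").text().trim(),
    price: $(elem).find(".price").text().trim(),
    link: $(elem).find("a").attr("href"),
  });
});
console.log("Products found:", products);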

  • Scraping a dynamic page with Puppeteer
// Import Puppeteer
const puppeteer = require("puppeteer");

// URL of the page to scrape
const url = "https://exemple.com";

async function scrapePage() {
  // Launch a headless browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  try {
    // Go to page
    await page.goto(url, { waitUntil: "networkidle2" });

    // Example: extract text from all h1 titles
    const titles = await page.$$eval("h1", elements =>
      elements.map(el => el.textContent.trim())
    );

    console.log("Titles found:", titles);

  } catch (error) {
    console.error("Error while scraping:", error);
  } finally {
    // Close browser
    await browser.close();
  }
}

// Run script
scrapePage();

Step 3: Managing extracted data

The extracted data can be saved in CSV format for Excel, or stored in JSON format for database integration. The example below reuses the Cheerio scraper from step 2; the Parser used for the CSV conversion is assumed to come from the json2csv package (npm install json2csv).

// Import libraries (fs is built in; Parser is assumed to come from the json2csv package)
const fs = require("fs");
const { Parser } = require("json2csv");
const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://exemple.com";

async function scrapeAndSave() {
  try {
    // Scrape the data (here: all h1 titles, as in the Cheerio example above)
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    const results = $("h1")
      .map((i, elem) => ({ title: $(elem).text().trim() }))
      .get();

    // Save as JSON
    fs.writeFileSync("results.json", JSON.stringify(results, null, 2), "utf-8");
    console.log("✅ Data saved in results.json");

    // Save as CSV
    const parser = new Parser();
    const csv = parser.parse(results);
    fs.writeFileSync("results.csv", csv, "utf-8");
    console.log("✅ Data saved in results.csv");
  } catch (error) {
    console.error("❌ Error while scraping:", error);
  }
}

scrapeAndSave();

Best practices for web scraping in JavaScript

Before running your scripts, it is essential to adopt a few best practices so that your scraping stays effective.

  • 🔥 Respect the robots.txt file: this is the golden rule for avoiding legal and ethical problems.
  • 🔥 Handle CAPTCHAs and blocking with proxies or anti-CAPTCHA services.
  • 🔥 Make your script more robust: add error and exception handling to avoid crashes (a minimal retry sketch follows this list).
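
To illustrate the robustness point, here is a minimal sketch of a download helper that retries on failure and waits between attempts. The retry count, delay and User-Agent string are arbitrary assumptions to adapt to your target site; it can replace the direct axios.get call in the Cheerio example:

// Minimal retry helper around axios
const axios = require("axios");

async function fetchWithRetry(url, maxRetries = 3, delayMs = 2000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const { data } = await axios.get(url, {
        // A realistic User-Agent reduces the chance of being blocked outright
        headers: { "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)" },
      });
      return data;
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);
      if (attempt === maxRetries) throw error;
      // Wait before retrying to avoid hammering the server
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: fetchWithRetry("https://exemple.com").then(html => console.log(html.length));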

What alternatives should you consider?

Web scraping is not limited to JavaScript. Several other options are available to you, for example:

  • Python: Scrapy and BeautifulSoup work wonders for efficient data collection.
  • PHP: ideal for web developers who want to integrate scraping directly into their projects.
  • Web scraping tools such as Bright Data, Octoparse and Apify: perfect if you don't want to code but still want full control over your data.

FAQs

How do I scrape a site with JavaScript?

To scrape a site with JavaScript, you need to follow a few key steps:

  1. Identify whether the page is static or dynamic (one rough way to check is sketched after this list).
  2. For a static page, use Cheerio to extract HTML directly.
  3. For a dynamic page, use Puppeteer or Playwright to simulate a browser, wait for the content to load completely, then extract data.
  4. Structure and save the data in a usable format such as JSON or CSV.
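
For step 1, here is one rough heuristic rather than a definitive test: fetch the raw HTML and check whether the selector you care about is already present. If it is not, the content is probably injected by JavaScript and Puppeteer or Playwright is the better choice. The URL and selector below are placeholders:

const axios = require("axios");
const cheerio = require("cheerio");

// Heuristic: if the selector is missing from the raw HTML, the page is probably dynamic
async function isLikelyStatic(url, selector) {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  return $(selector).length > 0;
}

// Example usage
isLikelyStatic("https://exemple.com", "h1").then(isStatic =>
  console.log(isStatic ? "Static enough: try Cheerio" : "Dynamic: use Puppeteer/Playwright")
);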

What's the best JavaScript web scraping tool?

It all depends on the type of site:

  • 🔥 Cheerio: fast and lightweight, perfect for static pages.
  • 🔥 Puppeteer: ideal for simulating a real browser and handling dynamic content.
  • 🔥 Playwright: similar to Puppeteer, but with more advanced, cross-browser features (a minimal example follows this list).
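
For comparison, here is a minimal Playwright sketch of the same h1 example as in the tutorial above. Install it first with npm install playwright, and treat this as a starting point rather than a definitive implementation:

// Import Playwright (Chromium here; Firefox and WebKit builds are also available)
const { chromium } = require("playwright");

const url = "https://exemple.com";

async function scrapePage() {
  // Launch a headless browser
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  try {
    // Go to the page and wait for network activity to settle
    await page.goto(url, { waitUntil: "networkidle" });

    // Extract text from all h1 titles
    const titles = await page.$$eval("h1", elements =>
      elements.map(el => el.textContent.trim())
    );

    console.log("Titles found:", titles);
  } catch (error) {
    console.error("Error while scraping:", error);
  } finally {
    // Close browser
    await browser.close();
  }
}

// Run script
scrapePage();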

What's the best programming language for scraping?

There is no single best programming language for scraping. The choice depends on your project and your environment.

  • 🔥 Python: for projects you need up and running quickly.
  • 🔥 JavaScript: if you already work in the web ecosystem.
  • 🔥 PHP: to integrate scraping directly into an existing website.
  • 🔥 Code-free tools such as Bright Data, Octoparse and Apify.

To conclude, web scraping in JavaScript simplifies data collection, whether with Cheerio, Puppeteer or Playwright. What about you? Which techniques do you use?

💬 Share your experiences or ask your questions in comments!
