
Guide to Puppeteer Scraping for Efficient Data Extraction

25 October 2025 | 16 min read

Puppeteer scraping lets you automate real browsers to open tabs, visit desired web pages, and extract public data. But how do you use this Node.js library without prior experience?

In this guide, we will show you how to set up Puppeteer, navigate pages, extract data with $eval/$$eval/XPath, paginate, and export results. You’ll also see where Puppeteer hits limits at scale and how our HTML API unlocks consistent access to protected websites with the ability to rotate IP addresses and bypass anti-bot systems. Stay tuned, and you will have a working Puppeteer scraper in just a few minutes!

Quick Answer (TL;DR)

This copy-and-paste-ready code covers all the steps outlined in this tutorial. Copy it to launch headless Chromium, open a page, wait for network idle, extract elements, paginate, take screenshots, and export to CSV/JSON. For more information, visit our blog on Scraping Dynamic Pages.

// Load Puppeteer so we can control a Chromium browser from Node.
// "require" returns the imported module object, while 'puppeteer' names the package to import
const puppeteer = require('puppeteer');

// Load the Node.js file system module to write files
const fs = require('fs');

(async () => {
  // Start a new Chromium instance without a visible window
  const browser = await puppeteer.launch({ headless: true });

  // Open a new tab inside the browser
  const page = await browser.newPage();

  // Array to store all extracted records for writing later
  const logs = [];

  const response = await page.goto('https://en.wikipedia.org/wiki/Headless_browser', { waitUntil: 'networkidle2', timeout: 10000 });
  console.log('Status code:', response.status());

  // Loop to handle pagination
  while (true) {
    // $eval grabs the first element matching the CSS selector
    // el is the matched DOM element; textContent returns its text (ignoring HTML tags), trim() strips surrounding whitespace
    const heading = await page.$eval('h1', el => el.textContent.trim());
    console.log('Page Title:', heading);
    logs.push({ type: 'Page Title', message: heading });

    // $$eval selects all elements matching a CSS selector (like querySelectorAll)
    // It passes an array of elements, which we map to clean section titles
    const sectionTitles = await page.$$eval('h2', elements =>
      elements.map(el => el.textContent.trim())
    );
    console.log('Section titles:', sectionTitles);
    logs.push({ type: 'Section titles', message: sectionTitles });

    // XPath selecting the first non-empty paragraph under #mw-content-text
    const xpath = '//*[@id="mw-content-text"]/div[1]/p[string-length(normalize-space()) > 0][1]';
    const paraText = await page.evaluate((xp) => {
      const result = document.evaluate(xp, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      const node = result.singleNodeValue;
      return node ? node.textContent.trim() : null;
    }, xpath);
    console.log('First paragraph:', paraText);
    logs.push({ type: 'First paragraph', message: paraText });

    // Take a full-page screenshot of the current page
    await page.screenshot({ path: 'page.png', fullPage: true, timeout: 10000 });

    // Check for a "next" pagination link (modify the selector according to the actual link)
    const nextLink = await page.$('a[title="Web browser"]');

    if (nextLink) {
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 10000 }),
        nextLink.click(),
      ]);
    } else {
      break; // no more pages, exit the loop
    }
  }

  // Convert the logs array to a CSV-formatted string
  const csvHeader = 'Type,Message\n';
  const csvRows = logs.map(log => {
    // Escape double quotes so the CSV stays valid (values are wrapped in quotes)
    const message = String(log.message).replace(/"/g, '""');
    return `"${log.type}","${message}"`;
  }).join('\n');
  const csvContent = csvHeader + csvRows;

  // Write CSV to file
  fs.writeFileSync('logs.csv', csvContent, 'utf8');

  // Write JSON logs to file with pretty printing
  fs.writeFileSync('logs.json', JSON.stringify(logs, null, 2), 'utf8');

  // Close the browser
  await browser.close();
})();

Introduction to Puppeteer Scraping

Puppeteer is a Node.js library that lets you control Chrome-based Headless Browsers. You can open pages, wait for scripts to finish, click buttons, and read the DOM that JavaScript actually builds.

Pages that look empty in raw HTML reveal their prices, reviews, and content once the scripts run. With Puppeteer, you can click cookie banners, fill forms, trigger hovers and scrolls, handle shadow-DOM widgets, upload files, and capture pixel-perfect screenshots.
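
For instance, here is a minimal sketch of such interactions; the target URL and the selectors ('#accept-cookies', 'input[name="q"]', 'nav a') are hypothetical placeholders you would swap for the ones on your page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Dismiss a cookie banner if one is present (hypothetical selector)
  const consent = await page.$('#accept-cookies');
  if (consent) await consent.click();

  // Type into a search box and submit (hypothetical selector)
  const searchBox = await page.$('input[name="q"]');
  if (searchBox) {
    await searchBox.type('headless browser');
    await page.keyboard.press('Enter');
  }

  // Hover a navigation link to trigger a dropdown, if the page has one
  const navLink = await page.$('nav a');
  if (navLink) await navLink.hover();

  await browser.close();
})();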

Why Use Puppeteer for Scraping?

Puppeteer scraping is the optimal choice for JavaScript-heavy sites because it waits for client-side rendering and gives you the final DOM. Prices, reviews, lazy-loaded lists, and modal content are actually present, unlike with scrapers built on the requests and BeautifulSoup Python libraries, which only capture the initial HTML response and often miss this data.

Puppeteer can click consent dialogs, type into search boxes, scroll to trigger infinite lists, and read elements that appear after timers or intersection events. You can build consistent scrapers for dynamic websites by combining it with tools that bypass the anti-scraping systems often found on popular retailer websites. To learn more about how it is done, check out our tutorial on Scraping Without Getting Blocked.
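
As an illustration, here is a hedged sketch of scrolling to trigger lazy-loaded content; the URL, the scroll count, and the 'li' selector are placeholders chosen for the example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Scroll down a few times, pausing so lazy-loaded items have time to appear
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  // Count how many list items are present after scrolling
  const itemCount = await page.$$eval('li', els => els.length);
  console.log('Items found:', itemCount);

  await browser.close();
})();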

Setting Up Puppeteer

To start Web Scraping with Node.js, we first have to set up our environment: install Node.js to run JavaScript code on your device, and use "npm", its package manager, to set up Puppeteer:

  • Windows users: Download and run the Windows installer from nodejs.org.

  • Linux users (Ubuntu/Debian): Enter "sudo apt install nodejs npm -y" in your Terminal window.

In this Puppeteer tutorial, we will focus on the Windows integration. After the installation, enter the following lines in your Command Prompt to confirm it:

node -v
npm -v

Now, create a separate folder for our web scraping project. Go to that folder through your Command Prompt and enter the following line to create a package.json file:

npm init -y   

After that, we can see that the file has been created, and we can observe its contents.

[Screenshot: contents of the generated package.json file]

Once the JSON file has been created, we can use npm to install Puppeteer. Enter the following line:

npm install puppeteer   

Once the folder has been updated with additional Node packages, we can start working on our Puppeteer browser.
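
Before moving on, you can verify the installation with a minimal sketch like the one below; save it as, say, check.js and run it with "node check.js". It launches the bundled Chromium, prints its version, and exits:

// check.js: quick sanity check that Puppeteer can launch its bundled Chromium
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  console.log('Chromium version:', await browser.version());
  await browser.close();
})();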

Scraping Workflow with Puppeteer

To test Puppeteer's scraping capabilities, we will target one of the biggest websites and a common target for automated web scraping – Wikipedia. In our project folder, create a file called "index.js" that will contain our data collection workflow.

To start our script, import Puppeteer to control a Chromium browser:

// Load Puppeteer so we can control a Chromium browser from Node. 
// The variable will contain the module object returned by "require", while the 'puppeteer' string names the package to import
const puppeteer = require('puppeteer');

Then add an "async" function that executes immediately after being defined, so we can use "await" at the top level:

// Run an async function immediately so we can use await at top level
(async () => {

Note: If you prefer alternative scraping solutions for a specific target website, check out the guide on Scraping APIs.

Step 1: Launch a Browser

Now we can proceed with our headless browser. The "await" in the "const browser" line pauses execution until Puppeteer launches a headless browser instance:

  // Start a new Chromium instance without a visible window
  const browser = await puppeteer.launch({ headless: true });

Then, a new "const page" variable will open a new tab within the launched Puppeteer browser.

  // Open a new tab inside the browser
  const page = await browser.newPage();

Note: Chromium uses a sandbox – a security layer that isolates browser processes to prevent malicious code from affecting the host system. Flags such as "--no-sandbox" disable this built-in process isolation so that headless browsers can run in restricted environments. To learn more about them, check out our Headless Chrome Guide.
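
If you ever need to launch Puppeteer in such a restricted environment (for example, certain containers), a minimal sketch of passing these flags is shown below; treat it as an illustration and only disable the sandbox for pages you trust. You would swap the launch call from Step 1 for something like this:

// Launch Chromium with its sandbox disabled; only advisable in trusted, isolated environments
const browser = await puppeteer.launch({
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});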

Step 2: Navigate and Wait

Now we can use the "page" handle to navigate the Puppeteer browser. Let's use the Wikipedia "Headless browser" page to test the setup:

const response = await page.goto('https://en.wikipedia.org/wiki/Headless_browser', { waitUntil: 'networkidle2', timeout: 10000 });

The "goto" method tells the Puppeteer browser to open the provided URL, while waitUntil: 'networkidle2' waits for background requests to settle. On top of that, we added a timeout so the script throws an error if the page does not respond within 10 seconds.

Note: You can also "await" specific CSS selectors to make sure that the page content is loaded before extraction. Watch out for infinite loaders and always set timeouts when using the "await" expression.
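
For example, here is a minimal sketch of waiting for the page heading before reading it, using the same selector and timeout as elsewhere in this tutorial (this line goes inside the async function, after the "goto" call):

// Wait until the <h1> element exists in the DOM, but give up after 10 seconds
await page.waitForSelector('h1', { timeout: 10000 });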

Because we are using our Puppeteer browser in headless mode, there is no way of knowing whether the connection succeeded unless we add a "console.log" call to output the HTTP status code. However, we can take it a step further and capture a screenshot of the entire page before closing the Puppeteer browser:

console.log('Status code:', response.status());
await page.screenshot({ path: 'page.png', fullPage: true, timeout: 10000 });

After putting everything together, your code should look something like this:

const puppeteer = require('puppeteer');

(async () => {
  // Start a new Chromium instance without a visible window
  const browser = await puppeteer.launch({ headless: true });

  // Open a new tab inside the browser
  const page = await browser.newPage();

  const response = await page.goto('https://en.wikipedia.org/wiki/Headless_browser', { waitUntil: 'networkidle2', timeout: 10000 });
  console.log('Status code:', response.status());

  await page.screenshot({ path: 'page.png', fullPage: true, timeout: 10000 });

  await browser.close(); // Add this to close the browser
})();

Let's test it. Go to your Command Line or terminal and run the script with the following command:

node index.js

After running the code, we can see that the connection is successful because it outputs HTTP status code "200". If we look at the created screenshot, we can see that our Puppeteer scraper has accessed the page:

[Screenshot: full-page capture of the Wikipedia "Headless browser" article]

Step 3: Extract Data

Now that we have secured access to the platform, we can extract specific data elements using $eval, $$eval, and XPath methods:

  • $eval: extracts data from the first element that matches a CSS selector

  • $$eval: extracts data from all elements that match a CSS selector

  • XPath: selects elements using XPath expressions

First, let's use $eval to extract the title of the page. Add the following section after the printed HTTP status code message:

  const heading = await page.$eval('h1', el => el.textContent.trim());
  console.log('Page Title:', heading);
    // el refers to the actual DOM element matched by the selector
    // el.textContent gets all the text inside the element (ignores HTML tags)
    // .trim() removes extra spaces and line breaks from the beginning and end

Now let's do the same with the page's multiple section titles using $$eval:

  const sectionTitles = await page.$$eval('h2', elements => {
    return elements.map(el => {
      // el here is each <h2> heading element on the page
      // We again use textContent and trim to get clean section titles
      return el.textContent.trim();
    });
  });
  console.log('Section titles:', sectionTitles);

As for XPath, it lets you select elements using path-like syntax instead of CSS selectors. You can read more about it in our blog post about XPath in Web Scraping. For example, let's use it to extract the first paragraph on our page.

Note: Some Wikipedia pages have no content in their first <p> tag. By adding the [string-length(normalize-space()) > 0] filter to our XPath we ensure that our Puppeteer scraper only extracts the first paragraph containing text.

// XPath selecting the first non-empty paragraph under #mw-content-text
const xpath = '//*[@id="mw-content-text"]/div[1]/p[string-length(normalize-space()) > 0][1]';

// Run the XPath inside the page and return the result text
const paraText = await page.evaluate((xp) => {
  // Evaluate the XPath and request the first matching node
  const result = document.evaluate(xp, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
  // Grab the matched node or null if none found
  const node = result.singleNodeValue;
  // Return the node text trimmed, or null when not found
  return node ? node.textContent.trim() : null;
}, xpath);

// Print the extracted paragraph
console.log('First paragraph:', paraText);

Once we run the code, we can see that our parser is working:

[Screenshot: terminal output showing the extracted page title, section titles, and first paragraph]

Step 4: Handle Pagination

By adding the following section of code, we can extract data from multiple Wikipedia pages. This will make our Headless Browser loop the extraction steps, following the hyperlink that leads to the "Web browser" Wikipedia page.

To keep things simple, we will go through one redirection, but you can set up more iterations by adjusting the link selector. Once the matching element no longer exists in the loaded page's HTML content, the loop will break.

Go back to right before the first "h1" extraction and open a while loop:

  // New: loop to handle pagination
  while (true) {
    // Your previously defined heading variable
    const heading = await page.$eval('h1', el => el.textContent.trim());

Then, right after the screenshot command, the "nextLink" variable will check whether the link exists based on its CSS selector. If it does, the scraper will click it, go to the next page, and execute the same extraction steps. Once the "Web browser" hyperlink is no longer present on the scraped page, the loop breaks:

    // Check for a "next" pagination link (modify the selector according to the actual link)
    const nextLink = await page.$('a[title="Web browser"]'); // selector for the next page

    if (nextLink) {
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 10000 }),
        nextLink.click(),
      ]);
    } else {
      break; // no more pages, exit loop
    }
  }
await browser.close(); // Add this to close the browser
})();

Once we run the full code, we can see that it successfully extracted the same elements from both pages:

[Screenshot: terminal output showing the same elements extracted from both pages]
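
On real listing sites, the "next" link usually follows a stable pattern. A hedged sketch of a more generic loop condition, with a hypothetical 'a[rel="next"]' selector and a maxPages cap, could replace the while (true) loop above:

// Hypothetical generic pagination loop; adapt the selector and the cap to your target site
let pagesVisited = 0;
const maxPages = 5;

while (pagesVisited < maxPages) {
  pagesVisited++;

  // ...run the same $eval/$$eval/XPath extraction steps here...

  const nextLink = await page.$('a[rel="next"]');
  if (!nextLink) break; // no more pages

  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 10000 }),
    nextLink.click(),
  ]);
}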

Step 5: Save and Export

To export the content into CSV and JSON files, we have to load the Node.js file system module at the top of the script, right after the Puppeteer import:

//Load Node.js file system module to write files
const fs = require('fs');

Then, right before the loop declaration, create an empty array that will store our logs.

//Array to store all logs for writing later
  const logs = [];

After that, add "logs.push" statements near the already created console statements for our data variables:

// comments here represent the previously filled code:

  console.log('Page Title:', heading);
  logs.push({ type: 'Page Title', message: heading });

// ...

console.log('Section titles:', sectionTitles);
logs.push({ type: 'Section titles', message: sectionTitles });

// ...

console.log('First paragraph:', paraText);
logs.push({ type: 'First paragraph', message: paraText });

Once that is done, the following section, which should go after the closing brace of the while loop (and before browser.close()), creates the headers and rows for the CSV file, as well as the JSON objects, and exports the data into two separate files: "logs.csv" and "logs.json":

  // NEW LINE: Convert logs array to a CSV-formatted string
  const csvHeader = 'Type,Message\n';
  const csvRows = logs.map(log => {
    // Escape double quotes so the CSV stays valid (values are wrapped in quotes)
    const message = String(log.message).replace(/"/g, '""');
    return `"${log.type}","${message}"`;
  }).join('\n');
  const csvContent = csvHeader + csvRows;

  // NEW LINE: Write CSV to file
  fs.writeFileSync('logs.csv', csvContent, 'utf8');

  // NEW LINE: Write JSON logs to file with pretty print
  fs.writeFileSync('logs.json', JSON.stringify(logs, null, 2), 'utf8');

After running the code, we can see that two files have been created:

[Screenshot: logs.csv and logs.json files created in the project folder]

And we're done! Feel free to grab the copy-paste-ready code from the TL;DR section, tweak it for your needs, or use it for testing purposes.

Challenges with Puppeteer Scraping

Puppeteer offers deep control over browser automation, but it also introduces several challenges that limit its scalability and reliability for large-scale web scraping:

  • Performance. Each Puppeteer instance launches a full Chromium browser, consuming significant CPU and memory resources. When running dozens of concurrent sessions, resource use grows quickly and can overwhelm your system. Even with headless mode enabled, Chromium still loads CSS, fonts, and scripts that are unnecessary for data extraction (see the request-interception sketch after this list).

  • Speed. Puppeteer must render pages fully to access JavaScript-generated content. Compared to lightweight HTTP requests, this adds seconds per page, making it inefficient for large datasets or frequent scraping runs.

  • Bot detection. Many websites detect automation by fingerprinting browser behavior or looking for signs like missing user activity, rapid navigation, or default Puppeteer headers. Once detected, access can be blocked or redirected to CAPTCHA pages. Avoiding this requires randomized headers, delays, and proxy rotation.

  • Consistency. Pages with infinite loaders, heavy scripts, or dynamic DOM updates often cause timeouts or partial loads. Each error may crash the session or hang the browser process, forcing manual cleanup.

  • Scalability. Running Chromium instances across servers or containers demands careful setup, sandbox flags, and disk space. Updating Chrome versions or dependencies can break compatibility in CI/CD pipelines.
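
To illustrate the performance point above, here is a hedged sketch of request interception that skips images, stylesheets, fonts, and media during extraction; the target URL is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Intercept every request and abort resource types we don't need for extraction
  await page.setRequestInterception(true);
  page.on('request', request => {
    const skipped = ['image', 'stylesheet', 'font', 'media'];
    if (skipped.includes(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com', { waitUntil: 'networkidle2', timeout: 10000 });
  console.log('Title:', await page.title());

  await browser.close();
})();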

Using ScrapingBee with Puppeteer

There's a reason why we used Wikipedia in this tutorial. If we try to visit websites that block scraper connections, we can see that our script fails to access the site.

For example, here is a very simplified version of our code that only screenshots the page and does nothing else. This time, we will try to access the Yellowpages.com website:

// Load Puppeteer so we can control a Chromium browser from Node.
// "require" returns the imported module object, while 'puppeteer' names the package to import
const puppeteer = require('puppeteer');

(async () => {
  // Start a new Chromium instance without a visible window
  const browser = await puppeteer.launch({ headless: true });

  // Open a new tab inside the browser
  const page = await browser.newPage();

  const response = await page.goto('https://www.yellowpages.com/', { waitUntil: 'networkidle2', timeout: 10000 });
  console.log('Status code:', response.status());

  await page.screenshot({ path: 'page.png', fullPage: true, timeout: 10000 });

  // Close the browser
  await browser.close();
})();

After running the code, we can see that it failed to establish a connection, returning the HTTP status code 403 – access forbidden.

[Screenshot: terminal output showing the 403 status code]

After inspecting the screenshot, we can see the exact same result:

[Screenshot: blocked Yellowpages.com page captured by the scraper]

However, by implementing our powerful HTML API, you can reroute the connection and avoid getting blocked while utilizing our "stealth_proxy" feature. On our ScrapingBee Documentation page, you can find the Node.js implementation of our Software Development Kit, but we can continue using our Puppeteer logic and only adjust the URL to deliver the GET API call, just like we would with a single cURL command.

Let's get straight to the point. Instead of using the default YellowPages URL, enter the following link:

https://app.scrapingbee.com/api/v1?api_key=YOUR_API_KEY&stealth_proxy=true&render_js=true&url=https://www.yellowpages.com/

After replacing the "YOUR_API_KEY" part with the key from your ScrapingBee account dashboard, we added two parameters to our request:

  • stealth_proxy=true: Uses a special stealth proxy to reduce detection and blocking by websites.

  • render_js=true: Enables JavaScript rendering on the page, allowing dynamic content generated by JavaScript to be fully loaded and included in the scraped HTML.

After all the changes, your "response" variable should look like this:

  const response = await page.goto('https://app.scrapingbee.com/api/v1?api_key=YOUR_API_KEY&stealth_proxy=true&render_js=true&url=https://www.yellowpages.com/', { waitUntil: 'networkidle2', timeout: 10000 });

If we try the connection again, the result will be different – Puppeteer has successfully reached the Yellowpages.com site via our API:

[Screenshot: full-page capture of Yellowpages.com retrieved through the API]

Scale Puppeteer Scraping with ScrapingBee

Puppeteer is ideal for controlled workflows and proof-of-concepts. However, browser overhead, bot defenses, and inconsistent page loads can ruin your scraping efforts.

Our HTML API handles the tough parts for you with automated JavaScript rendering and high-quality proxy rotation. Just keep your existing selectors and parsing logic while replacing the target URL with our API call.
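
As a hedged sketch, a small helper that wraps any target URL in the API call shown earlier might look like this; YOUR_API_KEY is a placeholder, and URLSearchParams keeps the target URL's own query parameters properly encoded:

// Build a ScrapingBee HTML API URL for a given target page
// YOUR_API_KEY is a placeholder; use the key from your account dashboard
function buildApiUrl(targetUrl, apiKey = 'YOUR_API_KEY') {
  const params = new URLSearchParams({
    api_key: apiKey,
    stealth_proxy: 'true',
    render_js: 'true',
    url: targetUrl,
  });
  return `https://app.scrapingbee.com/api/v1?${params.toString()}`;
}

// Usage: pass the result to page.goto() instead of the raw target URL
console.log(buildApiUrl('https://www.yellowpages.com/'));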

After creating an account on our platform, you will be able to customize various connection-related parameters and never get blocked. Register today to test the service with our 1-week free trial, or go to ScrapingBee Pricing to see available tiers and credit limits. Good luck with your scraping!

Frequently Asked Questions (FAQs)

Can I scrape any website with Puppeteer?

Yes, but some sites deploy aggressive anti-bot measures that block default Puppeteer fingerprints. You can bypass these obstacles by customizing headers, adding delays, and rotating proxies, or by routing requests through our API. Just stick to public data sources to avoid liability.

Is Puppeteer scraping fast?

Yes, but not as fast as raw HTML scrapers. Each opened page loads scripts and executes JavaScript, which adds seconds per page and scales poorly under high concurrency. You can avoid this by writing a Python scraper with our SDK.

Do I need proxies with Puppeteer?

Yes, if you plan to scrape protected websites. Repeated requests from one IP can trigger the website's rate limits and CAPTCHAs. Our API automates Rotating Proxies and geo-targeting without requiring manual configuration.

What is the difference between Puppeteer and ScrapingBee?

Puppeteer is a self-hosted browser automation library, while ScrapingBee is a managed API that returns JS-rendered HTML with proxy rotation. You can combine them by keeping your Puppeteer flow and selectors while accessing the target URLs via our API.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.