Web scraping is the automation of data collection from the web. It usually means deploying a "crawler" that automatically browses the web and scrapes data from selected pages. Collecting data this way can be much faster than gathering it manually, and it may be the only option when a website provides no API. The right scraping method depends on how the website displays its data.
One way to display content is through a one-page website, also known as a single-page application (SPA). SPAs have become a trend, and with infinite scrolling techniques, developers can build SPAs that let users scroll forever. If you are an avid social media user, you have most likely experienced this feature on platforms like Instagram, Twitter, Facebook, and Pinterest.
While a one-page website benefits user experience (UX), it can make extracting data look more complicated. But there is no need to worry: thanks to Puppeteer, you will be able to scrape infinitely scrolling pages by the end of this article.
Prerequisites & Goals
To fully benefit from this post, you should have the following:
- ✅ Some experience with writing ES6 JavaScript.
- ✅ A proper understanding of promises and some experience with async/await.
- ✅ Node.js installed on your development machine.
What is Infinite Scrolling?
Before you attempt to scrape data from a never-ending timeline, it is essential to ask: what exactly is infinite scrolling?
Infinite scrolling is a web-design technique that loads content continuously as the user scrolls down the page. There is an Infinite Scroll JavaScript plugin that automatically adds the next page, preventing a full page load; its first version was created in 2008 by Paul Irish and was a breakthrough in web development. The plugin uses Ajax to pre-fetch content from a subsequent page and add it directly to the current one. There are many other ways to produce infinitely scrolling content: incrementally delivering more data through API endpoints, processing data from multiple endpoints before injecting it into the page, or pushing data in real time through WebSockets.
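To make the endpoint-based approach concrete, here is a minimal browser-side sketch of how a site might implement it. The `/api/posts?page=N` endpoint and the `#container` element are hypothetical stand-ins for whatever a real site uses:

```js
// Minimal infinite-scroll sketch. The endpoint and container element
// are hypothetical -- a real site's implementation will differ.
let nextPage = 1;
let loading = false;

async function loadNextPage() {
  loading = true;
  const response = await fetch(`/api/posts?page=${nextPage}`);
  const posts = await response.json();
  for (const post of posts) {
    const div = document.createElement('div');
    div.className = 'blog-post';
    div.innerText = post.title;
    document.querySelector('#container').appendChild(div);
  }
  nextPage += 1;
  loading = false;
}

window.addEventListener('scroll', () => {
  // Fetch the next batch when the user nears the bottom of the page.
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.scrollHeight - 200;
  if (nearBottom && !loading) loadNextPage();
});
```

This is exactly the behavior the scraper later in this article has to emulate: scroll, wait for new content, repeat.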
Advantages ✨
- Discovery Applications
- It is almost a must-have feature for discovery applications and interfaces. If a user does not know exactly what to search for, they may need to browse an immense number of items to find the one thing they like.
- Mobile Devices
- Since mobile devices have a much smaller screen size, infinite scrolling can create a much more pleasant UX.
- User Engagement
- Since new results are always loading onto the page, users are drawn deeper into the application.
Disadvantages ⛔
- Poor for Page Performance
- Page load speed matters for UX. As a user scrolls further down a page, more and more content has to load into the same page, so performance becomes increasingly sluggish.
- Poor for Item Search and Location
- Users may reach a certain point in the stream where they cannot bookmark their location. If they leave the site, they lose all their progress, which hurts UX.
- Loss of Footers
- Footers, a vital part of applications that can hold easily accessible, essential information, are pushed perpetually out of reach.
Now that you know a little more about this content-presentation style and its uses in development, you can better understand how to scrape data from infinitely scrolling interfaces. That is where Puppeteer comes into play.
What is Puppeteer?
It can take time to understand and reverse-engineer an app's data delivery, and websites take different approaches to creating infinitely scrolling content. However, you will not need to worry about any of that today, all thanks to Puppeteer. And no, not the kind that works with puppets. Puppeteer is a headless Chrome Node.js API that lets you emulate scrolling on the page and retrieve the desired data from the rendered elements.
Puppeteer allows you to behave almost exactly as if you were in your regular browser, except programmatically and without a user interface. Here are some examples of what you can do:
- Generate screenshots and PDFs of pages.
- Crawl a SPA and generate pre-rendered content.
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment: run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
You can find this list, along with additional information about Puppeteer's uses, in the documentation.
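As a small taste of the API, the sketch below takes a screenshot of the demo page scraped later in this article; `page.goto` and `page.screenshot` are standard Puppeteer methods:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://mmeurer00.github.io/infinite-scroll-example/');
  // Capture the current viewport to a PNG file.
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```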
How to Scrape Infinite Scrolling Websites Using Puppeteer
Assuming you already have [npm](https://www.npmjs.com/) installed, create a folder to store your Puppeteer project and install the package:
```bash
mkdir infinite-scroll
cd infinite-scroll
npm install --save puppeteer
```
By using `npm`, you install both Puppeteer and a version of the Chromium browser that Puppeteer uses (on Linux machines, Puppeteer might require some additional dependencies). This saves you time and lets you jump straight into writing the scraping script. Open your go-to text editor and create a `scrape-infinite-scroll.js` file. In that file, copy in the following code:
```js
// Puppeteer will not run without these lines
const fs = require('fs');
const puppeteer = require('puppeteer');
```
These first lines are boilerplate that pulls in the modules the script depends on. Next, you will create a function for the items you would like to scrape. Open your browser console and examine the page HTML to determine your `extractedElements` constant.
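For example, on the demo page used in this article, you can sanity-check a candidate selector straight from the DevTools console before committing it to your script:

```js
// Run in the browser console: how many posts does the selector match?
document.querySelectorAll('#container > div.blog-post').length;
```

With the selector confirmed, the extraction function looks like this: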
```js
function extractItems() {
  /* For extractedElements, you are selecting the tag and class
     that hold your desired information, then choosing the desired
     child element you would like to scrape from. In this case, you
     are selecting the "<div class=blog-post />" elements inside
     "<div id=container />". See below: */
  const extractedElements = document.querySelectorAll('#container > div.blog-post');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}
```
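Note that `innerText` flattens each post into a single block of text. If you wanted structured records instead, a variant like the sketch below would work; the `h2`, `p`, and `span` child selectors are assumptions about the post markup, so inspect the actual HTML first:

```js
// A hypothetical structured variant of extractItems. The child
// selectors below are assumptions about the demo page's markup.
function extractStructuredItems() {
  const extractedElements = document.querySelectorAll('#container > div.blog-post');
  const items = [];
  for (let element of extractedElements) {
    items.push({
      title: element.querySelector('h2')?.innerText ?? '',
      body: element.querySelector('p')?.innerText ?? '',
      author: element.querySelector('span')?.innerText ?? '',
    });
  }
  return items;
}
```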
The next function is `scrapeItems`. It controls the actual scrolling and extraction, using `page.evaluate` to repeatedly scroll down the page and run the injected `extractItems` function, until at least `itemCount` items have been scraped. Since Puppeteer's methods are Promise-based, placing everything in an `async` wrapper and using `await` lets you write the code as if it executed synchronously.
```js
async function scrapeItems(
  page,
  extractItems,
  itemCount,
  scrollDelay = 800,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await page.waitForTimeout(scrollDelay);
    }
  } catch (e) {
    // Swallow the timeout error waitForFunction throws when the page
    // stops growing, and return whatever has been collected so far.
  }
  return items;
}
```
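One caveat: `page.waitForTimeout` has been deprecated in recent Puppeteer releases and removed in newer major versions. If your installed version no longer has it, a plain Promise-based delay is a drop-in replacement:

```js
// Drop-in replacement for page.waitForTimeout(scrollDelay) on
// Puppeteer versions that no longer ship that helper.
await new Promise((resolve) => setTimeout(resolve, scrollDelay));
```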
This last chunk of code launches the Chromium browser, navigates to the example page, and specifies how many items to scrape and where the data goes.
```js
(async () => {
  // Set up Chromium browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });
  // Navigate to the example page.
  await page.goto('https://mmeurer00.github.io/infinite-scroll-example/');
  // Auto-scroll and extract desired items from the page. Currently set to extract ten items.
  const items = await scrapeItems(page, extractItems, 10);
  // Save extracted items to a new file.
  fs.writeFileSync('./items.txt', items.join('\n') + '\n');
  // Close the browser.
  await browser.close();
})();
```
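Because `headless` is set to `false`, a visible browser window opens so you can watch the scrolling, which is useful while debugging. Once the script works, you can flip the flag to run without a UI:

```js
// Once the script is debugged, run without a visible window.
const browser = await puppeteer.launch({
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
```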
It's important to include everything you need for item extraction inside the `extractItems` function's definition. The following line:

```js
items = await page.evaluate(extractItems);
```

serializes the `extractItems` function before evaluating it in the browser's context, which means its enclosing lexical environment is unavailable during execution.
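Concretely, a function that closes over an outer variable will fail inside `page.evaluate`, because only the function's source code is shipped to the browser. If you need outside values, pass them as extra arguments, which `page.evaluate` serializes for you; the `selector` variable here is just an illustration:

```js
const selector = '#container > div.blog-post';

// Fails: `selector` is not defined in the page's context.
// const count = await page.evaluate(() => document.querySelectorAll(selector).length);

// Works: extra arguments to page.evaluate are serialized and passed in.
const count = await page.evaluate(
  (sel) => document.querySelectorAll(sel).length,
  selector,
);
```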
When finished, your file should look similar to:
```js
// Puppeteer will not run without these lines
const fs = require('fs');
const puppeteer = require('puppeteer');

function extractItems() {
  /* For extractedElements, you are selecting the tag and class
     that hold your desired information, then choosing the desired
     child element you would like to scrape from. In this case, you
     are selecting the "<div class=blog-post />" elements inside
     "<div id=container />". See below: */
  const extractedElements = document.querySelectorAll('#container > div.blog-post');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}

async function scrapeItems(
  page,
  extractItems,
  itemCount,
  scrollDelay = 800,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await page.waitForTimeout(scrollDelay);
    }
  } catch (e) {
    // Swallow the timeout error waitForFunction throws when the page
    // stops growing, and return whatever has been collected so far.
  }
  return items;
}

(async () => {
  // Set up Chromium browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });
  // Navigate to the example page.
  await page.goto('https://mmeurer00.github.io/infinite-scroll-example/');
  // Auto-scroll and extract desired items from the page. Currently set to extract ten items.
  const items = await scrapeItems(page, extractItems, 10);
  // Save extracted items to a new file.
  fs.writeFileSync('./items.txt', items.join('\n') + '\n');
  // Close the browser.
  await browser.close();
})();
```
Great, you are ready to go! 😎 Run the script with:
```bash
node scrape-infinite-scroll.js
```
That command opens the demo page in a Chromium browser and scrolls until ten `#container > div.blog-post` items have loaded, saving the text from the extracted items in `./items.txt`. By running

```bash
open ./items.txt
```

you will have access to all your scraped data, as you can see below:
```
Blog Post #1
sed ab est est
at pariatur consequuntur earum quidem quo est laudantium soluta voluptatem qui ullam et est et cum voluptas voluptatum repellat est
كيان سلطانی نژاد
Blog Post #2
enim unde ratione doloribus quas enim ut sit sapiente
odit qui et et necessitatibus sint veniam mollitia amet doloremque molestiae commodi similique magnam et quam blanditiis est itaque quo et tenetur ratione occaecati molestiae tempora
Carla Vidal
Blog Post #3
fugit voluptas sed molestias voluptatem provident
eos voluptas et aut odit natus earum aspernatur fuga molestiae ullam deserunt ratione qui eos qui nihil ratione nemo velit ut aut id quo
Anna Lane
Blog Post #4
laboriosam dolor voluptates
doloremque ex facilis sit sint culpa soluta assumenda eligendi non ut eius sequi ducimus vel quasi veritatis est dolores
Tyrone Terry
Blog Post #5
commodi ullam sint et excepturi error explicabo praesentium voluptas
etc...
```
You can also run `tail ./items.txt` to see the last ten lines of the scraped output in your terminal.
What happens if the script is unable to extract the number of items requested? Puppeteer functions that evaluate JavaScript on the page, like `page.waitForFunction`, have a default 30-second timeout (which can be customized). Here, `waitForFunction` waits for the page height to increase after each scroll. While the page keeps loading more items, the `while` loop continues; when the height stops changing for 30 seconds (or your custom timeout), `waitForFunction` throws, the `catch` block swallows the error, and the function returns whatever items it has collected.
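If the site you are scraping loads new content slowly, you can raise that timeout by passing an options object to `waitForFunction`:

```js
// Wait up to 60 seconds (instead of the default 30) for new content.
await page.waitForFunction(
  `document.body.scrollHeight > ${previousHeight}`,
  { timeout: 60000 },
);
```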
Alternative Scraping Methods for Infinite Scrolling
While Puppeteer can decrease your workload, it may not always be the best approach to scraping, depending on your use case. A lighter-weight alternative is Cheerio. Cheerio is an npm library, often called "jQuery for Node", that lets you parse and extract data from raw HTML with a lightweight framework. Cheerio works on the raw HTML you feed it, so it is best suited to data that can be extracted directly from a URL's initial response. If you are interested in this scraping method, check out the article Web Scraping with node-fetch to learn more about Cheerio and its use cases.
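As a rough sketch of that approach, here is how fetching and parsing a page with Cheerio might look. Keep in mind that this only sees the initial HTML response, so it will not capture content injected later by infinite scrolling:

```js
const cheerio = require('cheerio');

(async () => {
  // Node 18+ ships a global fetch; on older versions, use the node-fetch package.
  const response = await fetch('https://mmeurer00.github.io/infinite-scroll-example/');
  const html = await response.text();
  const $ = cheerio.load(html);
  // Print the text of each blog post present in the initial HTML.
  $('#container > div.blog-post').each((i, el) => {
    console.log($(el).text().trim());
  });
})();
```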
Conclusion
Thanks to Puppeteer, you can now extract data from infinite scrolling applications quickly and efficiently. While it may not be the right tool in every case, the script from this article should serve as a starting point for emulating human-like scrolling on an application.
If you have enjoyed this article on Scraping Infinite Scrolling Applications with Puppeteer, give ScrapingBee a try, and get the first 1,000 requests free. Check out the getting started guide here!
Scraping the web is challenging, given that anti-scraping mechanisms grow more sophisticated by the day, so getting it done right can be quite a tedious task. ScrapingBee lets you skip the noise and focus only on what matters most: the data.