Web Scraping with PHP

14 April 2022 | 17 min read

You might have seen one of our other tutorials on how to scrape websites, for example with Ruby, JavaScript or Python, and wondered: what about the most widely used server-side programming language for websites, which, at the same time, is one of the most dreaded? Wonder no more - today it's time for PHP 🥳!

Believe it or not, PHP and web scraping have much in common: just like PHP, web scraping can be done in a quick and dirty way, or in a more elaborate fashion with the help of additional tools and services.

In this article, we'll look at some ways to scrape the web with PHP. Please keep in mind that there is no general "best way" - each approach has its use-case, depending on what you need, how you like to do things, and what you want to achieve.

As an example, we will try to get a list of people who share the same birthday, as you can see, for instance, on famousbirthdays.com. If you want to code along, please ensure that you have installed a current version of PHP and Composer.

Create a new directory and run the following commands from it:

$ composer init --require="php >= 8.1" --no-interaction
$ composer update

We're ready!

1. HTTP Requests

When it comes to browsing the web, the one important communication protocol you need to be familiar with is HTTP, the Hypertext Transfer Protocol. It defines how participants on the World Wide Web communicate with each other. There are servers hosting resources and clients requesting resources from them.

Your browser is such a client, and when we open the developer console (press F12), select the "Network" tab and open the famous example.com, we can see the full request sent to the server, as well as the full response:

Network tab
Network tab of your browser developer console

Those are quite a few request and response headers, but in its most basic form, a request looks like this:

GET / HTTP/1.1
Host: www.example.com

Let's try to recreate with PHP what the browser just did for us!

fsockopen()

Typically, we won't use such "low-level" communication much, but just for the sake of it, let's create this request with the most basic tool PHP has to offer, fsockopen():

<?php
    # fsockopen.php

    // In HTTP, lines have to be terminated with "\r\n" (carriage return + line feed)
    $request = "GET / HTTP/1.1\r\n";
    $request .= "Host: www.example.com\r\n";
    $request .= "\r\n"; // We need to add a last new line after the last header

    // We open a connection to www.example.com on the port 80
    $connection = fsockopen('www.example.com', 80);

    // The information stream can flow, and we can write and read from it
    fwrite($connection, $request);

    // As long as the server returns something to us...
    while(!feof($connection)) {
        // ... print what the server sent us
        echo fgets($connection);
    }

    // Finally, close the connection
    fclose($connection);

And indeed, if you put this code snippet into a file fsockopen.php and run it with php fsockopen.php, you will see the same HTML that you get when you open http://example.com in your browser.

Next step: performing an HTTP request with Assembler... just kidding! But in all seriousness: fsockopen() is usually not used to perform HTTP requests in PHP; I just wanted to show you that it's possible, using the easiest possible example. While one can handle HTTP tasks with it, it's not fun and requires a lot of boilerplate code that we don't need to write - performing HTTP requests is a solved problem, and in PHP (and many other languages) it's solved by…

cURL

Enter cURL (a client for URLs)!

Let's jump right into the code; it's quite straightforward:

<?php
# curl.php

// Initialize a connection with cURL (ch = cURL handle)
$ch = curl_init();

// Set the URL
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');

// Set the HTTP method (GET is the default; we set it explicitly for clarity)
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// Return the response instead of printing it out
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Send the request and store the result in $response
$response = curl_exec($ch);

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// Close cURL resource to free up system resources
curl_close($ch);

Now, this already looks less low-level than our previous example, doesn't it? No need to manually compose the HTTP request, establish and manage the TCP connection, or handle the response byte by byte. Instead, we only initialize the cURL handle, pass the actual URL, and perform the request with curl_exec().

If, for example, we wanted cURL to automatically handle HTTP 3xx redirect codes, we'd only need to add curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);. Plus, there are quite a few additional options and flags to support other use cases.
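
To give you an idea, here's a small sketch of a few more options you'll frequently encounter - the values are illustrative, not recommendations:

// Follow 3xx redirects automatically...
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// ...but at most five of them
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
// Abort if the whole transfer takes longer than 30 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// Send a custom User-Agent header
curl_setopt($ch, CURLOPT_USERAGENT, 'my-little-scraper/1.0');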

Great! Now let's get to actual scraping!

If you would like to learn more about cURL without PHP, you can check out: How to send a GET request using cURL?, How to send a DELETE request using cURL?, or How to send a POST request using cURL?

2. Strings, regular expressions, and Wikipedia

Let's look at Wikipedia as our first data provider. Each day of the year has its own page for historical events, including birthdays! When we open, for example, the page for December 10th (which happens to be my birthday), we can inspect the HTML in the developer console and see how the "Births" section is structured:

Wikipedia HTML structure
Wikipedia's HTML structure

This looks nice and organized! We can see that:

  • There's an <h2> header element containing <span id="Births" ...>Births</span> (only one element on the whole page should have an ID named "Births").
  • Following is a list of sub-headings (<h3>) for individual epochs, each followed by a list (<ul>) of entries.
  • Each list entry is represented by a <li> item, containing the year, a dash, the name of the person, a comma, and a teaser of what the person is known for.

This is something we can work with, isn't it? Let's go!

<?php
# wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

echo $html;

Wait, what? Surprise! Yes, file_get_contents() makes use of PHP's fopen wrappers and (as long as they are enabled) can be used to fetch HTTP URLs. Though primarily meant for local files, it is probably the easiest and fastest way to perform basic HTTP GET requests, and it's fine for our example here or for quick one-off scripts, as long as you use it carefully.
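
For instance, we could pass a stream context to set a timeout and a User-Agent header instead of relying on the defaults - a quick sketch, with illustrative values:

$context = stream_context_create([
    'http' => [
        'timeout'    => 10,                      // give up after ten seconds
        'user_agent' => 'my-little-scraper/1.0', // identify ourselves
    ],
]);

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10', false, $context);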

Script output

Have you read all the HTML that the script has printed out? I hope not, because it's a lot! The important thing is that we know where we should start looking: we're only interested in the part starting with id="Births" and ending right before the next <h2>:

<?php
# wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

$start = stripos($html, 'id="Births"');

$end = stripos($html, '<h2>', $offset = $start);

$length = $end - $start;

$htmlSection = substr($html, $start, $length);

echo $htmlSection;

We're getting closer!

Cleaner results

This is not valid HTML anymore, but at least we can see what we're working with! Let's use a regular expression to load all list items into an array so that we can handle each item one by one:

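// Note: "." doesn't match newlines by default, so each match must sit on a
// single line - this assumes Wikipedia renders one <li>…</li> pair per line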
preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];

foreach ($listItems as $item) {
    echo "{$item}\n\n";
}
Cleaner results (bis)

Now for the years and names… We can see from the output that the first number is the birth year. It's followed by the HTML entity &#8211; (an en dash). Finally, the name is located within the following <a> element. Let's grab 'em all, and we're done 👌.

<?php
# wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

$start = stripos($html, 'id="Births"');

$end = stripos($html, '<h2>', $offset = $start);

$length = $end - $start;

$htmlSection = substr($html, $start, $length);

preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];

echo "Who was born on December 10th\n";
echo "=============================\n\n";

foreach ($listItems as $item) {
    preg_match('@(\d+)@', $item, $yearMatch);
    $year = (int) $yearMatch[0];

    preg_match('@;\s<a\b[^>]*>(.*?)</a>@i', $item, $nameMatch);
    $name = $nameMatch[1];

    echo "{$name} was born in {$year}\n";
}
Final results

Perfect, isn't it? It works, and we managed to get the data we wanted, right? Well, it does work, but we did not choose a particularly elegant approach. Instead of handling the DOM tree, we resorted to "brute-force" string parsing of the HTML code and, in doing so, missed out on most of what DOM parsers already provide out of the box. Not ideal.

We can do better! When? Now!

3. Guzzle, XML, XPath, and IMDb

Guzzle is a popular HTTP Client for PHP that makes it easy and enjoyable to send HTTP requests. It provides you with an intuitive API, extensive error handling, and even the possibility of extending its functionality with additional plugins/middleware. This makes Guzzle a powerful tool that you don't want to miss. You can install Guzzle from your terminal with composer require guzzlehttp/guzzle.

Let's cut to the chase and have a look at the HTML of https://www.imdb.com/search/name/?birth_monthday=12-10 (Wikipedia's URLs were definitely nicer):

IMDb HTML structure

We can see straight away that we'll need a better tool than string functions and regular expressions here. Instead of a list with list items, we see nested <div>s. There's no id="..." that we can use to jump to the relevant content. But worst of all: the birth year is either buried in the biography excerpt or not visible at all! 😱

We'll try to find a solution for the year situation later, but for now, let's at least get the names of our jubilees with XPath, a query language to select nodes from a DOM document.

In our new script, we'll first fetch the page with Guzzle, convert the returned HTML string into a DOMDocument object and initialize an XPath parser with it:

<?php
# imdb.php

require 'vendor/autoload.php';

$httpClient = new \GuzzleHttp\Client();

$response = $httpClient->get('https://www.imdb.com/search/name/?birth_monthday=12-10');

$htmlString = (string) $response->getBody();

// HTML is often wonky, this suppresses a lot of warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

$xpath = new DOMXPath($doc);

Let's have a closer look at the HTML in the window above:

  • The list is contained in a <div class="lister-list"> element
  • Each direct child of this container is a <div>, with two classes set in the class attribute, lister-item and mode-detail
  • Finally, the name can be found within an <a>, within a <h3>, within a <div> with a lister-item-content class

If we look closely, we can make it even simpler and skip the child divs and class names: there is only one <h3> in each list item, so let's target that directly:

$links = $xpath->evaluate('//div[@class="lister-list"][1]//h3/a');

foreach ($links as $link) {
    echo $link->textContent.PHP_EOL;
}

Let's quickly break that down.

  • //div[@class="lister-list"][1] returns the first ([1]) div tag with an attribute named class that has the exact value lister-list (more on that exact-value requirement right after this list)
  • Within that div, from all <h3> elements (//h3), return all anchors (<a>)
  • We then iterate through the result and print the text content of the anchor elements
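
One caveat worth knowing (as hinted in the first bullet): @class="lister-list" only matches when the attribute value is exactly that string. If the site ever added a second class to the element, the query would silently return nothing. A more defensive, if less readable, variant could look like this:

$links = $xpath->evaluate(
    '//div[contains(concat(" ", normalize-space(@class), " "), " lister-list ")][1]//h3/a'
);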

I hope I explained it well enough for this use case, but in any case, our article "Practical XPath for Web Scraping" here on the blog explains XPath far better and goes much deeper than I ever could, so definitely check it out (but finish reading this one first! 💪)

💡 Check out ScrapingBee's data extraction tools

With a single API call, you can fetch a page and extract any data, based on custom path sets.

Guzzle is a great HTTP client, but many others are equally excellent - it just happens to be one of the most mature and most downloaded. PHP has a vast, active community; whatever you need, there's a good chance someone else has written a library or framework for it, and web scraping is no exception.

4. Goutte and IMDb

Goutte is an HTTP client made for web scraping. It was created by Fabien Potencier, the creator of the Symfony Framework, and combines several Symfony components to make web scraping very comfortable:

  • The BrowserKit component simulates the behavior of a web browser that you can use programmatically.
  • Think of the DomCrawler component as DOMDocument and XPath on steroids - except that steroids are bad, and DomCrawler is good!
  • The CssSelector component translates CSS selectors to XPath expressions (see the short example after this list).
  • The Symfony HTTP Client is developed and maintained by the Symfony team and, naturally, easily integrates into the overall Symfony ecosystem.
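
To get a feeling for what the CssSelector component does under the hood, here's a minimal sketch of its converter class used on its own (assuming the component is installed, e.g. via composer require symfony/css-selector):

<?php
# css_to_xpath.php

require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// Prints the XPath expression equivalent to the CSS selector
echo $converter->toXPath('div.lister-list h3 > a');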

Let's install Goutte with composer require fabpot/goutte and recreate the previous XPath example with it:

<?php
# goutte_xpath.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$crawler = $client->request('GET', 'https://www.imdb.com/search/name/?birth_monthday=12-10');

$links = $crawler->evaluate('//div[@class="lister-list"][1]//h3/a');

foreach ($links as $link) {
    echo $link->textContent.PHP_EOL;
}

This alone is already pretty good - we saved the step where we had to explicitly disable XML warnings and didn't need to instantiate an XPath object ourselves. Now, let's use a "native" CSS selector instead of the manual XPath evaluation (thanks to the CssSelector component integrated into Goutte):

<?php
# goutte_css.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$crawler = $client->request('GET', 'https://www.imdb.com/search/name/?birth_monthday=12-10');

$crawler->filter('div.lister-list h3 > a')->each(function ($node) {
    echo $node->text().PHP_EOL;
});

I like where this is going; our script is looking more and more like a conversation that even a non-programmer can understand, not just code 🥰. However, now is the time to find out whether you're coding along or not 🤨: does this script return results when you run it? Because for me, it didn't at first - I spent an hour debugging why and finally discovered a solution:

composer require masterminds/html5

As it turns out, the reason why Goutte (more precisely: the DomCrawler) doesn't report XML warnings is that it just throws away the parts it cannot parse. The additional library specifically helps with HTML5, and after installing it, the script runs as expected.
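
By the way, masterminds/html5 is also useful on its own whenever you need a standards-compliant HTML5 parser - a minimal sketch:

<?php
# html5.php

require 'vendor/autoload.php';

$html5 = new \Masterminds\HTML5();

// Returns a regular \DOMDocument, parsed according to HTML5 rules
$doc = $html5->loadHTML('<main><article>Hello HTML5!</article></main>');

echo $doc->getElementsByTagName('article')->item(0)->textContent;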

We will talk more about this later, but for now, let's remember that we're still missing the birth years of our jubilees. This is where a web scraping library like Goutte really shines: we can click on links! And indeed: if we click one of the names in the birthday list to go to a person's profile, we can see a "Born: " line and, in the HTML, a <time datetime="YYYY-MM-DD"> element within a div with the id name-born-info:

IMDb HTML structure

This time, I will not explain the single steps that we're going to perform beforehand, but just present you the final script; I believe that it can speak for itself:

<?php
# imdb_birthdates.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$client
    ->request('GET', 'https://www.imdb.com/search/name/?birth_monthday=12-10')
    ->filter('div.lister-list h3 a')
    ->each(function ($node) use ($client) {
        $name = $node->text();

        $birthday = $client
            ->click($node->link())
            ->filter('#name-born-info > time')->first()
            ->attr('datetime');

        $year = (new DateTimeImmutable($birthday))->format('Y');

        echo "{$name} was born in {$year}\n";
    });
Look at this clean output

As there are 50 people on the page, 50 additional GET requests have to be made, so the script will take a bit longer to run - but in return, we get the birthday of each of these fifty people as well.
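
Side note: firing 50 extra requests in quick succession is not the politest thing to do (more on that in the conclusion). If we wanted to throttle the script a little, a single line at the end of the each() callback would do - the delay value is an arbitrary choice:

// Pause for 250 ms before the next profile request
usleep(250_000);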

Great, but what about the others, you say? And you are right: there are more than 1,000 people listed on IMDb who share December 10th as their birthday, and we should not just ignore them. Let's enter the beautiful world of fun with pagination 😳.

Pagination

When content is split across multiple pages, things always get a bit tricky: you need to handle the page management as well and move on to the next page once you have processed all data on the current one. This can be particularly tricky in the context of endless scrolling, but fortunately, IMDb follows a traditional approach with Next » buttons.

So let's extend our previous code example so that it crawls not only the first page, but all of them. To do so, we will introduce a new $url variable, which we initialize with our start URL, then move our main crawler code into a while loop, and - finally - check on each iteration whether there is a "Next" link. If there isn't, we have reached the last page and can call it a day, having compiled a long list of people we now have to send Happy Birthday cards to. If we do find a "Next" link, however, we know that we are not done yet, store its URL in $url, and start another iteration of our loop.

Here we go.

<?php
# imdb_final.php

require 'vendor/autoload.php';

$client = new \Goutte\Client();

$url = 'https://www.imdb.com/search/name/?birth_monthday=12-10';

while (true) {
    $crawler = $client->request('GET', $url);

    $crawler
        ->filter('div.lister-list h3 a')
        ->each(function ($node) use ($client) {
            $name = $node->text();

            $birthday = $client
                ->click($node->link())
                ->filter('#name-born-info > time')->first()
                ->attr('datetime');

            $year = (new DateTimeImmutable($birthday))->format('Y');

            echo "{$name} was born in {$year}\n";
        });

    // Try to find the "Next" link
    $next = $crawler->filter('a.next-page');

    // Stop, if there is no more "Next" link
    if ($next->count() == 0) break;

    $url = $next->link()->getUri();
}

Summary

At this point, our code is still pretty concise, and yet it does exactly what we set out to do. It starts at the given IMDb link, collects the listed profiles along with the dates of birth from the individual actors' pages, and continues on to the next set of actors as long as there is further data.

As Guzzle handles everything request by request by default, it will certainly take a bit of time to fetch all 1,000+ profiles - and that's exactly an opportunity to build upon for further enhancements:

  • Guzzle does support concurrent requests; perhaps we could leverage that to improve the processing speed (see the sketch after this list).
  • IMDb features pictures for most of their listings, wouldn't it be lovely to have the profile picture of each actor? (hint: div.lister-item-image img[src])
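
For the first point, Guzzle's request pooling would be a natural starting point. Here's a rough sketch - $profileUrls is a hypothetical array holding the profile links collected from the list pages:

<?php
# concurrent.php

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client();

// Turn the collected URLs into a lazy sequence of requests
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($profileUrls), [
    'concurrency' => 5, // five requests in flight at a time
    'fulfilled' => function ($response, $index) {
        // extract the birthday from (string) $response->getBody() here
    },
    'rejected' => function ($reason, $index) {
        // log the failed request here
    },
]);

// Initiate the transfers and wait for the pool to complete
$pool->promise()->wait();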

5. Headless Browsers

Here's the thing: when we looked at the HTML DOM tree in the developer console, we didn't see the actual HTML code that was sent from the server to the browser, but the result of the browser's interpretation of that code. If a site does not use JavaScript, the two will not differ much, but the more JavaScript the site runs, the more likely it is that the DOM tree differs from what the server originally sent.

When a website uses AJAX to load content dynamically, or when even the complete HTML is generated by JavaScript, we cannot access it by just downloading the original HTML document from the server. Tools like Goutte can simulate a lot when it comes to browser behaviour, but they still have their limits. This is where so-called headless browsers come into play.

A headless browser runs a full-fledged browser engine without the graphical user interface, and it can be controlled programmatically in a similar way as we did before with the simulated browser.

Symfony Panther is a standalone library that provides the same APIs as Goutte - this means you could use it as a drop-in replacement in our previous Goutte scripts. A nice feature is that it can use an already existing installation of Chrome or Firefox on your computer so that you don't need to install additional software.

Since we have already achieved our goal of getting the birthdays from IMDb, let's conclude our journey by taking a screenshot of the page that we so diligently parsed.

After installing Panther with composer require symfony/panther, we could write our script, for example, like this:

<?php
# screenshot.php

require 'vendor/autoload.php';

$client = \Symfony\Component\Panther\Client::createFirefoxClient();
// or "createChromeClient()" for a Chrome instance
// $client = \Symfony\Component\Panther\Client::createChromeClient();

$client
    ->get('https://www.imdb.com/search/name/?birth_monthday=12-10')
    ->takeScreenshot($saveAs = 'screenshot.jpg');
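
Taking screenshots is just the tip of the iceberg, by the way: since Panther drives a real browser, it can wait for content that JavaScript inserts after the initial page load before you start scraping. A minimal sketch - waitFor() simply blocks (up to 30 seconds by default) until the given selector matches:

$client->get('https://www.imdb.com/search/name/?birth_monthday=12-10');

// Wait until the list container is present in the DOM, then continue
$crawler = $client->waitFor('div.lister-list');

echo $crawler->filter('div.lister-list h3 a')->count() . ' names found' . PHP_EOL;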

Conclusion

We've learned about several ways to scrape the web with PHP today. Still, there are a few topics that we haven't spoken about - for example, website owners often prefer their sites to be accessed only by end-users and are not too happy if they are accessed in any automated fashion.

  • When we used Goutte to load all the pages in quick succession, IMDb could have interpreted this as unusual and could have blocked our IP address from further accessing their website.
  • Many websites have rate limiting in place to prevent Denial-of-Service attacks.
  • Depending on which country you live in and where a server is located, some sites might not be available from your computer.
  • Managing headless browsers for different use cases can take a performance toll on your machine (mine sounded like a jet engine at times).

ℹ️ For more on the basics of Goutte check out our Goutte tutorial.

That's where services like ScrapingBee can help: you can use the ScrapingBee API to delegate thousands of requests per second without the fear of getting rate-limited or even blocked, so that you can focus on what matters: the content 🚀.

I hope you liked this article. If you're more old school, check out this Web scraping with Perl article.

If you'd rather use something free, we have also thoroughly benchmarked the most used free proxy provider.

If you want to read more about web scraping without being blocked, we have written a complete guide, but we would still be delighted if you decided to give ScrapingBee a try - the first 1,000 requests are on us!

Jérôme Gamez

Jérôme is an experienced PHP developer who is very active in the open-source community. If you use PHP and Firebase, you should check out his SDK on GitHub (1.4k stars).