Using Python and wget to Download Web Pages and Files
In many contexts, such as automation, data science, data engineering, and application development, Python is the lingua franca. It’s commonly used for downloading images and web pages, with a variety of methods and packages to choose from. One method that’s simple but robust is to interface with Wget.
Wget is a free twenty-five-year-old command-line program that can retrieve files from web services using HTTP, HTTPS, and FTP. If you use it with Python, you’re virtually unlimited in what you can download and scrape from the web.
This article will show you the benefits of using Wget with Python with some simple examples. You’ll also learn about Wget’s limits and alternatives.
Why Use Wget?
Wget is a convenient and widely supported tool for downloading files over three protocols: HTTP, HTTPS, and FTP. Wget owes its popularity to two of its main features: recursiveness and robustness.
Recursiveness: Using the proper parameters, Wget can operate as a web crawler. Instead of downloading a single file, it can recursively download files linked from a specific web page until all the links have been exhausted or until it reaches a user-specified recursion depth. In this scenario, Wget saves the downloaded files in a directory structure that resembles the server they have been downloaded from. This feature is highly configurable:
- supports wildcards in network locations and filenames
- offers timestamp inspection, so that only new or updated files are downloaded
- respects the robots exclusion standard
Robustness: Wget can recover from broken transfers, making it a good solution for downloading files over unstable or slow networks. Wget uses the Range HTTP header to continue a download from where it left off until the whole file is received. This requires no intervention from the user.
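To make the resume mechanism more concrete, here is a minimal sketch of how a client could derive a Range header from a partially downloaded file. The helper name resume_headers is hypothetical, and the snippet only illustrates the idea; it is not Wget's actual implementation.

```python
import os

def resume_headers(partial_path):
    """Build the HTTP header a client could send to resume a download.

    Hypothetical helper for illustration; Wget handles this internally.
    """
    # Number of bytes already saved locally (0 if nothing was downloaded yet).
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    if offset == 0:
        return {}  # nothing to resume, so request the whole file
    # "bytes=N-" asks the server for everything from byte offset N onward.
    return {"Range": f"bytes={offset}-"}
```

A server that supports range requests would then send only the missing part of the file.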
Wget2 was released in 2021. While it supports more or less the same features, it focuses on parallelization, making it much faster than its predecessor.
Why Wget with Python?
Python is a general-purpose programming language used in finance, academia, cloud/data engineering, data science, web development, and workflow automation. Not only is it widely used in a variety of fields and sectors, but it also has a huge community, is the most-searched programming language via Google, and tops the list of most sought-after programming languages in job openings.
Using Wget, you can easily turn Python scripts into full-fledged web crawling solutions. Some interesting use cases are:
- Creating data sets for academic and business goals. Via Wget, it’s easy to scrape the content of one or multiple websites. These large data sets can be essential to machine learning research. For example, recent NLP models wouldn’t be possible without billions of pieces of content.
- Monitoring large websites. Automate Wget to check if web pages and files are available from different networks and places around the world.
- Content mapping. Lots of web pages generate personalized content. By setting up Wget to behave as different personas, you can create an overview of what content is shown to which users.
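As a rough sketch of the content-mapping idea, the snippet below builds one Wget command per persona by varying the User-Agent and Accept-Language request headers. The persona names, the header values, and the persona_command helper are illustrative assumptions; --user-agent, --header, and --output-document are standard Wget options.

```python
import shlex

# Hypothetical personas: each one maps to the request headers it should send.
PERSONAS = {
    "desktop_en": {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "language": "en-US",
    },
    "mobile_fr": {
        "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
        "language": "fr-FR",
    },
}

def persona_command(url, persona):
    """Return a shell-safe Wget command that fetches url as the given persona."""
    settings = PERSONAS[persona]
    args = [
        "wget",
        f"--user-agent={settings['user_agent']}",
        f"--header=Accept-Language: {settings['language']}",
        f"--output-document={persona}.html",  # one output file per persona
        url,
    ]
    return shlex.join(args)  # quotes the arguments that contain spaces

for name in PERSONAS:
    print(persona_command("https://www.scrapingbee.com", name))
```

Comparing the resulting files gives a first impression of which content differs per persona.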
While there are multiple ways to run shell commands and programs (such as Wget) in Python, you’ll use the subprocess package to interface with the operating system’s shell in this tutorial.
Note that although it shares some functionalities, the Python wget package is unrelated to the Wget command-line program. It is an unfinished package that hasn’t been updated in years and lacks most of Wget's distinguishing features.
Using Wget with Python
Next, you’ll set up Wget to download files in Python.
First, make sure you have Wget installed on your machine. This process differs depending on your operating system.
- If you’re using Linux, you may already have it preinstalled.
- If you’re using Mac, the easiest way to install Wget is by using Homebrew.
- Windows users can download the Wget command-line tool executable from this website. Once it’s downloaded, make sure it’s added to the PATH variable.
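Before running the examples, it can help to confirm from Python that the executable is reachable. The following sketch uses the standard library's shutil.which; the find_program helper is a hypothetical name introduced here.

```python
import shutil

def find_program(name="wget"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)

path = find_program("wget")
if path is None:
    print("Wget was not found on PATH; install it before running the examples.")
else:
    print(f"Using Wget at {path}")
```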
Running Commands with the Subprocess Package
To run Wget commands from within a Python script, you’ll use the Popen class of the subprocess package. Every time your script invokes Popen(), it will execute the command you passed in an independent instance of the operating system’s command processor. By setting the verbose argument of the helper function below to True, it will also print the output of the command. Feel free to adapt this to your needs.
All code snippets can be found in this file.
import subprocess

def runcmd(cmd, verbose = False, *args, **kwargs):
    # Run the command in a shell and capture stdout and stderr as text.
    process = subprocess.Popen(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
        text = True,
        shell = True
    )
    # Wait for the command to finish and collect its output.
    std_out, std_err = process.communicate()
    if verbose:
        print(std_out.strip(), std_err)

runcmd('echo "Hello, World!"', verbose = True)
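If you’d rather not route commands through the shell, a variant along these lines passes the command as a list of arguments and returns the exit code, so the caller can tell whether the command succeeded. The name runcmd_args is hypothetical and not part of the original snippets.

```python
import subprocess
import sys

def runcmd_args(args, verbose=False):
    """Run a command given as a list of arguments, without a shell."""
    result = subprocess.run(args, capture_output=True, text=True)
    if verbose:
        # Wget writes its log to stderr, so print both streams.
        print(result.stdout.strip(), result.stderr.strip())
    return result.returncode

# Demonstrated with the Python interpreter itself so it runs anywhere.
exit_code = runcmd_args([sys.executable, "-c", "print('Hello, World!')"], verbose=True)
print(exit_code)
```

With this variant, a download call would take the form runcmd_args(["wget", url]) instead of a single command string.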
The commands you’ll use throughout this section are all structured in the same way. You’ll use the wget command, give it a URL, and provide specific options to achieve certain goals.
wget [options] url
Check your options in the extensive manual.
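Because the runcmd helper runs commands through a shell, URLs containing characters such as & or ? can be misinterpreted. One possible way to guard against that on POSIX shells is to quote each part with shlex.quote. The wget_command helper below is a hypothetical illustration, and the example URL is made up.

```python
import shlex

def wget_command(url, *options):
    """Assemble 'wget [options] url' with every part quoted for a POSIX shell."""
    parts = ["wget", *options, url]
    return " ".join(shlex.quote(part) for part in parts)

# The URL is wrapped in single quotes because it contains ? and &.
print(wget_command("https://example.com/search?q=wget&page=2", "--quiet"))
```

The resulting string can be passed to runcmd as-is. Note that shlex.quote targets POSIX shells, so this sketch does not cover the Windows command prompt.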
Download a file: To download a file from a server, pass the wget command and the file URL to the custom function you created. Set verbose to True to see the output of the command.
runcmd("wget https://www.scrapingbee.com/images/logo-small.png", verbose = True)
From the output of the command, you can observe that (1) the URL is resolved to the IP address of the server, (2) an HTTP request is sent, and (3) status code 200 OK is received. Finally (4), Wget stores the file in the directory from where the script runs without changing the file name.
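Since Wget keeps the remote file name by default, a script can usually predict where the download will land. The sketch below derives that name from the URL. The expected_filename helper is hypothetical, and it assumes that the URL has no query string and that no file with the same name already exists locally (in that case Wget would pick a different name).

```python
from urllib.parse import urlparse

def expected_filename(url):
    """Guess the local file name Wget uses for a URL by default."""
    # Take everything after the last slash of the URL path.
    name = urlparse(url).path.rsplit("/", 1)[-1]
    # For URLs that end in a directory, Wget falls back to index.html.
    return name or "index.html"

print(expected_filename("https://www.scrapingbee.com/images/logo-small.png"))
```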
Download a file to a custom folder: To download a file to a specific folder, pass it the -P flag (or its long form, --directory-prefix), followed by the destination folder. Interestingly, when the path to the folder doesn’t exist, Wget will create it.
runcmd("wget --directory-prefix=download_folder https://www.scrapingbee.com/images/logo-small.png", verbose = False)
runcmd("wget -P download_folder https://www.scrapingbee.com/images/logo-small.png", verbose = False)
Download a file to a specific file name: Not only can you change the destination folder for a file, but you can also specify its local file name. Provide it the -O flag (or its long form, --output-document), followed by the desired file name.
runcmd("wget -O logo.png https://www.scrapingbee.com/images/logo-small.png")
runcmd("wget --output-document=logo.png https://www.scrapingbee.com/images/logo-small.png")
Download a newer version of a file: Sometimes you’ll only want to download a file if the local copy is older than the version on the server. You can turn this feature on by providing the --timestamping flag.
runcmd("wget --timestamping https://www.scrapingbee.com/images/logo-small.png", verbose = True)
If you have already downloaded the ScrapingBee logo, you’ll most likely see that the server responds with status code 304 Not Modified in this example. In other words, the file on the server is the same version as the file on your local machine, so no file will be downloaded.
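A script may want to know whether the timestamp check skipped the download. One rough way, sketched below, is to look for the 304 status in the log text that Wget produces. The not_modified helper is hypothetical, the sample log line is illustrative, and the exact wording of Wget's output can vary between versions and locales.

```python
def not_modified(log_text):
    """Return True if the captured Wget log reports a 304 Not Modified response."""
    return "304 Not Modified" in log_text

# Illustrative log line; real output depends on the Wget version.
sample_log = "HTTP request sent, awaiting response... 304 Not Modified"
print(not_modified(sample_log))
```

To use this with real downloads, the helper that runs Wget would need to return the captured output instead of only printing it.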
Complete unfinished downloads: The default behavior of Wget is to retry downloading a file if the connection is lost midway through. However, if you want to continue getting a partially downloaded file, you can set the -c or the --continue flag.
runcmd("wget -c https://www.scrapingbee.com/images/logo-small.png")
runcmd("wget --continue https://www.scrapingbee.com/images/logo-small.png")
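To combine resuming with retries at the script level, you could rerun a command until it exits successfully. The sketch below shows one way to do that. The run_until_success function and its parameters are hypothetical, and it relies on the convention that a command returns exit code 0 on success.

```python
import subprocess
import sys
import time

def run_until_success(args, attempts=3, delay=0.0):
    """Rerun a command until it exits with code 0.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(args, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        time.sleep(delay)  # optional pause before the next try
    return None

# With Wget, --continue lets each new attempt resume the partial file:
# run_until_success(["wget", "--continue", "https://www.scrapingbee.com/images/logo-small.png"])

# Demonstrated with the Python interpreter itself so it runs anywhere.
print(run_until_success([sys.executable, "-c", "print('ok')"]))
```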
Recursive retrieval: Wget’s most exciting feature is recursive retrieval. Wget can retrieve and parse the page on a given URL and the files to which the initial document refers via HTML href attributes or CSS url() functional notation. If the next file is also text/html, it will be parsed and followed further until the desired depth is reached. Recursive retrieval is breadth-first: it will download the files on depth 1, then depth 2, etc.
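The breadth-first order can be illustrated with a small in-memory sketch. The link graph below is made up, and crawl_order is a hypothetical function that only mimics the traversal order described above. It does not fetch anything or reproduce Wget's internals.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/post-1", "/"],
    "/pricing": ["/faq"],
    "/blog/post-1": ["/blog/post-1/comments"],
}

def crawl_order(start, max_depth):
    """Return (page, depth) pairs in breadth-first order up to max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append((page, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(page, []):
            if link not in seen:  # skip pages that were already queued
                seen.add(link)
                queue.append((link, depth + 1))
    return order

for page, depth in crawl_order("/", max_depth=2):
    print(depth, page)
```

All pages at depth 1 are visited before any page at depth 2, and the comments page at depth 3 is never reached.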
There are a lot of options you can set:
- The --recursive option enables recursive retrieval.
- The --level option sets the depth, i.e., the number of subdirectories that Wget can recurse into. To prevent crawling huge websites, Wget sets a default depth of 5. Set this option to zero (or inf) for unlimited depth.
- The --convert-links option converts the links in the downloaded documents, making them suitable for local viewing. Downloaded files will be referred to relatively (e.g., ../foo/bar.png). Files that have not been downloaded will be referred to with their hostname (e.g., https://scrapingbee.com/foo/bar/logo.png).
The following command will recursively download the scrapingbee.com website into a www.scrapingbee.com directory, with a maximum depth of 3. Wget will also convert all links to make this copy available locally.
runcmd('wget --recursive --level=3 --convert-links https://www.scrapingbee.com')
This command might take more than a few minutes to complete, depending on your internet connection speed.
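Once the crawl finishes, you may want a quick overview of what was saved. The sketch below walks a download directory and counts files by extension. The summarize_mirror helper is hypothetical, and the directory name assumes the recursive command was run from the current working directory.

```python
from collections import Counter
from pathlib import Path

def summarize_mirror(root):
    """Count the files under root, grouped by file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts

mirror = Path("www.scrapingbee.com")
if mirror.is_dir():
    for suffix, total in summarize_mirror(mirror).most_common():
        print(suffix, total)
```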
When Not to Use Wget
Wget is an excellent solution if you’re focused on recursively downloading files from web servers. However, its use cases are limited due to this narrow focus, and alternatives are worth considering.
- To download files over protocols other than HTTP(S) or FTP(S), cURL with Python is probably your best bet.
- If you need to scrape only certain DOM elements on a web page without storing the file locally, consider requests in combination with [Beautiful Soup](https://www.scrapingbee.com/blog/python-web-scraping-beautiful-soup/). Alternatively, you can use paid solutions such as ScrapingBee.
- Selenium is a wonderful solution to simulate click and scroll behavior on a website (e.g., for testing purposes).
Wget is a convenient solution for downloading files over HTTP and FTP protocols. It works well with Python in recursively downloading multiple files, and the process can easily be automated to save you time.
Wget’s focus can be somewhat limited, but it offers plenty of options for your downloading and web scraping needs.