How to ignore non-HTML URLs when web crawling?
There are two common ways to ignore non-HTML URLs when web crawling:
- Check the URL suffix for unwanted file extensions
Here is some sample code that filters out image file URLs based on extension:
```python
import os

IMAGE_EXTENSIONS = [
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp',
    'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
]

url = "https://scrapingbee.com/logo.png"

# os.path.splitext(url)[-1] is '.png'; [1:] strips the leading dot
if os.path.splitext(url)[-1][1:] in IMAGE_EXTENSIONS:
    print("Abort the request")
else:
    print("Continue the request")
```
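One pitfall with a plain suffix check is that query strings and fragments defeat it: `logo.png?v=2` ends in `2`, not `png`. A minimal sketch of a more robust check, which strips the query and fragment with `urllib.parse.urlsplit` first (the `is_image_url` helper and the trimmed extension list here are illustrative, not from the original):

```python
import os
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = ['png', 'jpg', 'jpeg', 'gif', 'bmp', 'svg', 'ico']

def is_image_url(url):
    # Check the extension of the URL *path* only, so that query
    # strings like "?version=2" do not hide the real suffix.
    path = urlsplit(url).path
    return os.path.splitext(path)[-1][1:].lower() in IMAGE_EXTENSIONS

print(is_image_url("https://scrapingbee.com/logo.png?version=2"))  # True
print(is_image_url("https://scrapingbee.com/blog"))                # False
```

Lower-casing the extension also catches URLs like `PHOTO.JPG`, which the exact-match list above would otherwise miss.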
- Perform a HEAD request to the URL and investigate the response headers
A HEAD request does not download the whole response body; it makes a short request that returns only metadata about the URL. The most useful piece of that metadata is the Content-Type header, which gives you a very good idea of the file type behind a URL. If the HEAD request returns a non-HTML Content-Type, you can skip the full GET request entirely. Here is some sample code for making a HEAD request and checking the response type:
```python
import requests

response = requests.head("https://scrapingbee.com")
print(response.headers['Content-Type'])
# Output: 'text/html; charset=utf-8'

if "text/html" in response.headers['Content-Type']:
    print("You can now make the complete GET request")
else:
    print("Abort the request")

# This request would have failed with the above check:
# response = requests.head("https://practicalpython.yasoob.me/_static/images/book-cover.png")
# print(response.headers['Content-Type'])
# Output: image/png
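In practice some servers mishandle HEAD requests, omit the Content-Type header, or time out, so it helps to wrap the check defensively. Below is a minimal sketch, assuming you would rather make a wasted GET than wrongly skip a page; the `should_fetch` and `media_type` helpers are hypothetical names, not from any library:

```python
import requests

def media_type(content_type):
    # Extract the bare media type: "text/html; charset=utf-8" -> "text/html"
    return content_type.split(';')[0].strip().lower()

def should_fetch(url, timeout=10):
    # Sketch: return True unless a HEAD request clearly reports non-HTML.
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # HEAD failed (server quirk, timeout, ...): let the full GET decide.
        return True
    content_type = response.headers.get('Content-Type', '')
    if not content_type:
        return True  # no header at all; assume it might be HTML
    return media_type(content_type) == 'text/html'
```

Following redirects (`allow_redirects=True`) matters here: the Content-Type you care about is the one served at the final URL, not the `text/html` body of an intermediate redirect page.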