How to ignore non-HTML URLs when web crawling?

There are two ways to skip non-HTML URLs while crawling.

  1. Check the URL suffix for unwanted file extensions

Here is some sample code that filters out image file URLs based on extension:

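import os
from urllib.parse import urlparse

# File extensions (without the leading dot) that indicate an image.
IMAGE_EXTENSIONS = {
    'mng', 'pct', 'bmp', 'gif', 'jpg',
    'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps',
    'ps', 'svg', 'cdr', 'ico',
}

url = "https://scrapingbee.com/logo.png"

# Look only at the URL path so a query string such as "?v=2" is not
# mistaken for part of the extension, and compare case-insensitively.
extension = os.path.splitext(urlparse(url).path)[1][1:].lower()

if extension in IMAGE_EXTENSIONS:
    print("Abort the request")
else:
    print("Continue the request")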
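
If you would rather not maintain the extension list by hand, one possible variation is to let Python's standard `mimetypes` module guess the media type from the URL path. The sketch below is an illustration, not part of the original answer: the helper name `looks_like_non_html` is made up, and it assumes that a URL with no extension or an unknown one should still be crawled. The guess depends on the MIME table of the machine running the code, so dynamic pages such as `.php` URLs may be misclassified on some systems.

```python
import mimetypes
from urllib.parse import urlparse

# Media types treated as HTML pages (an assumption for this sketch).
HTML_MEDIA_TYPES = ("text/html", "application/xhtml+xml")


def looks_like_non_html(url):
    """Guess from the URL path alone whether the target is not an HTML page.

    Hypothetical helper: returns True only when the extension maps to a
    known media type that is not HTML.
    """
    # Use only the path so query strings and fragments are ignored.
    path = urlparse(url).path
    media_type, _encoding = mimetypes.guess_type(path)
    if media_type is None:
        # No extension, or one that mimetypes does not know: we cannot
        # tell, so do not skip the URL.
        return False
    return media_type not in HTML_MEDIA_TYPES
```

A crawler could call `looks_like_non_html(url)` before queuing a URL and drop the URL when it returns `True`. Because the check never touches the network, it is cheap but only as reliable as the URL's extension.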

  2. Perform a HEAD request to the URL and investigate the response headers

A HEAD request does not download the response body. It returns only the status line and the headers that a GET request to the same URL would produce. One of those headers is Content-Type, which states the media type of the resource and therefore gives you a very good idea of what kind of file the URL points to. If the HEAD request returns a non-HTML Content-Type, you can skip the full GET request. Here is some sample code that makes a HEAD request and checks the response type:

import requests

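# requests.head() does not follow redirects by default, so enable them
# to inspect the headers of the final URL. The timeout is in seconds.
response = requests.head("https://scrapingbee.com", allow_redirects=True, timeout=10)

# Use .get() so a response without a Content-Type header does not raise KeyError.
content_type = response.headers.get("Content-Type", "")
print(content_type)
# Output: text/html; charset=utf-8

if "text/html" in content_type:
    print("You can now make the complete GET request")
else:
    print("Abort the request")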

# A non-HTML URL such as this one would be rejected by the check above:
# response = requests.head("https://practicalpython.yasoob.me/_static/images/book-cover.png")
# print(response.headers['Content-Type'])
# Output: image/png
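
The substring test on the Content-Type header can be tightened a little. The following is a minimal sketch, assuming you want to accept only the exact media types `text/html` and `application/xhtml+xml`; the function name `is_html_content_type` is hypothetical and the accepted set is an assumption you may want to adjust. It strips parameters such as `charset`, ignores case, and treats a missing header as non-HTML.

```python
# Media types accepted as HTML (an assumption for this sketch).
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(content_type):
    """Return True if a Content-Type header value denotes an HTML document.

    Hypothetical helper: `content_type` is the raw header value, or None
    when the server did not send the header.
    """
    if not content_type:
        # Missing or empty header: we cannot confirm the resource is HTML.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case,
    # since media type names are case-insensitive.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES
```

With a `response` object from `requests.head`, the call would be `is_html_content_type(response.headers.get("Content-Type"))`. Some servers do not implement HEAD correctly, so you may still want to fall back to a GET request when the HEAD request returns an error status.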
