How to ignore non-HTML URLs when web crawling?
There are two common ways to ignore non-HTML URLs when web crawling:
- Check the URL suffix for unwanted file extensions
Here is some sample code that filters out image file URLs based on extension:
```python
import os

IMAGE_EXTENSIONS = [
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp',
    'tif', 'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
]

url = "https://scrapingbee.com/logo.png"

# os.path.splitext(url)[-1] is '.png'; [1:] strips the leading dot
if os.path.splitext(url)[-1][1:] in IMAGE_EXTENSIONS:
    print("Abort the request")
else:
    print("Continue the request")
```
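One pitfall with a plain suffix check is that query strings and fragments defeat it: `logo.png?v=2` ends in `2`, not `png`. A minimal sketch of a more robust check, which strips the query and fragment with `urllib.parse.urlsplit` first (the `is_image_url` helper and the trimmed extension list here are illustrative, not from the original):

```python
import os
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = ['png', 'jpg', 'jpeg', 'gif', 'bmp', 'svg', 'ico']

def is_image_url(url):
    # Check the extension of the URL *path* only, so that query
    # strings like "?version=2" do not hide the real suffix.
    path = urlsplit(url).path
    return os.path.splitext(path)[-1][1:].lower() in IMAGE_EXTENSIONS

print(is_image_url("https://scrapingbee.com/logo.png?version=2"))  # True
print(is_image_url("https://scrapingbee.com/blog"))                # False
```

Lower-casing the extension also catches URLs like `PHOTO.JPG`, which the exact-match list above would otherwise miss.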
- Perform a HEAD request to the URL and investigate the response headers
A HEAD request does not download the whole response body; it makes a short request that returns only metadata about the URL. The most useful piece of that metadata is the Content-Type header, which gives you a very good idea of the file type behind a URL. If the HEAD request returns a non-HTML Content-Type, you can skip the full GET request entirely. Here is some sample code for making a HEAD request and checking the response type:
```python
import requests

response = requests.head("https://scrapingbee.com")
print(response.headers['Content-Type'])
# Output: 'text/html; charset=utf-8'

if "text/html" in response.headers['Content-Type']:
    print("You can now make the complete GET request")
else:
    print("Abort the request")

# This request would have failed with the above check:
# response = requests.head("https://practicalpython.yasoob.me/_static/images/book-cover.png")
# print(response.headers['Content-Type'])
# Output: image/png
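In practice some servers mishandle HEAD requests, omit the Content-Type header, or time out, so it helps to wrap the check defensively. Below is a minimal sketch, assuming you would rather make a wasted GET than wrongly skip a page; the `should_fetch` and `media_type` helpers are hypothetical names, not from any library:

```python
import requests

def media_type(content_type):
    # Extract the bare media type: "text/html; charset=utf-8" -> "text/html"
    return content_type.split(';')[0].strip().lower()

def should_fetch(url, timeout=10):
    # Sketch: return True unless a HEAD request clearly reports non-HTML.
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # HEAD failed (server quirk, timeout, ...): let the full GET decide.
        return True
    content_type = response.headers.get('Content-Type', '')
    if not content_type:
        return True  # no header at all; assume it might be HTML
    return media_type(content_type) == 'text/html'
```

Following redirects (`allow_redirects=True`) matters here: the Content-Type you care about is the one served at the final URL, not the `text/html` body of an intermediate redirect page.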