What is Screen Scraping and How To Do It With Examples

19 May 2025 | 11 min read

What is Screen Scraping?

The easiest way to get data from another program is to use a dedicated API (Application Programming Interface), but not all programs provide one. In fact, most programs don't.

If there's no API provided, you can still get data from a program by using screen scraping, which is the process of capturing data from the screen output of a program.

This can take all kinds of forms, ranging from parsing terminal output to reading text off screenshots, with the most common being classic web scraping.

Types of Screen Scraping

The most common types of scraping are:

  • Web scraping: extracting data from web pages (most popular of all)
  • GUI scraping: extracting data from graphical user interfaces (often used as a last resort when no other options are available)
  • Terminal scraping: extracting data from command line output (super common for scripting and automation)

Screen Scraping Websites

This is the most common type of scraping and the one most people think of when they hear the term. It's our second-best choice after using a dedicated API.

The simplest form of web scraping is to use an HTTP client to fetch the HTML code of a web page and then parse it to extract desired data.

For example, let's scrape the page title of https://example.com using the requests-html library:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://example.com')

if response.status_code == 200:
	page_title = response.html.find('h1', first=True).text
	print(f"Page Title: {page_title}")
else:
	print(f"Failed to retrieve page. Status code: {response.status_code}")

That's easy! But what if we wanted to extract all links from the page, so that we could later scrape them recursively? We could use BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://4comprehension.com'
response = requests.get(url)

if response.status_code == 200:
	soup = BeautifulSoup(response.text, 'html.parser')
	links = soup.find_all('a')

	for link in links:
		href = link.get('href')
		text = link.get_text(strip=True)
		print(f"Link text: {text} | URL: {href}")
else:
	print(f"Failed to retrieve page. Status code: {response.status_code}")

Result:

// ...
Link text: Follow @pivovarit | URL: https://github.com/pivovarit
Link text: vavr-io/vavr | URL: https://github.com/vavr-io/vavr
Link text: parallel-collectors | URL: https://github.com/pivovarit/parallel-collectors
Link text: throwing-function | URL: https://github.com/pivovarit/throwing-function
Link text: more-gatherers | URL: https://github.com/pivovarit/more-gatherers
// ...

So far, so good! This is a bit more complex, but still manageable. But what if the page contained dynamic JavaScript content?

Let's try to scrape quotes from quotes.toscrape.com, which loads quotes dynamically using JavaScript:


import requests
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com/js/'
response = requests.get(url)

if response.status_code == 200:
	soup = BeautifulSoup(response.text, 'html.parser')
	quotes = soup.find_all(class_='quote')

	if quotes:
		for quote in quotes:
			print(f"{quote.find(class_='text').text}{quote.find(class_='author').text}")
else:
	print(f"Failed to retrieve page. Status code: {response.status_code}")

If you run it, you will see that there's no output! We scraped the HTML code, but the quotes are not there.

That's because BeautifulSoup, just like many similar tools, is not a browser - it merely parses static HTML.

In such cases, one option is to use a browser automation tool like Selenium to render the page in a headless browser first, and then scrape it:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
service = Service()

driver = webdriver.Chrome(service=service, options=options)

try:
	driver.get("https://quotes.toscrape.com/js/")

	WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "quote")))

	quotes = driver.find_elements(By.CLASS_NAME, "quote")

	for quote in quotes:
		print(f"-{quote.find_element(By.CLASS_NAME, "text").text}{quote.find_element(By.CLASS_NAME, "author").text}")

finally:
	driver.quit()

And here's the result:

- “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
- “It is our choices, Harry, that show what we truly are, far more than our abilities.” — J.K. Rowling
- “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” — Albert Einstein
- “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” — Jane Austen
- “Imperfection is beauty, madness is genius, and it's better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe
- “Try not to become a man of success. Rather become a man of value.” — Albert Einstein
- “It is better to be hated for what you are than to be loved for what you are not.” — André Gide
- “I have not failed. I've just found 10,000 ways that won't work.” — Thomas A. Edison
- “A woman is like a tea bag; you never know how strong it is until it's in hot water.” — Eleanor Roosevelt
- “A day without sunshine is like, you know, night.” — Steve Martin

If you want to go full auto-pilot, you could leverage ScrapingBee AI Web Scraping:

from scrapingbee import ScrapingBeeClient
import json

client = ScrapingBeeClient(api_key='SIGN_UP_FOR_FREE_API_KEY')

response = client.get(
	'https://quotes.toscrape.com/js/',
	params={
		'ai_query': 'return a list of quotes, ignore everything outside quotes',
		'ai_extract_rules': json.dumps({
			"quotes": {
				'type': 'list',
				'description': 'retrieved quotes',
			}
		})
	}
)

print(response.content.decode('utf-8'))

And here's the result:

{"quotes": [
  "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
  "It is our choices, Harry, that show what we truly are, far more than our abilities.",
  "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.",
  "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.",
  "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
  "Try not to become a man of success. Rather become a man of value.",
  "It is better to be hated for what you are than to be loved for what you are not.",
  "I have not failed. I've just found 10,000 ways that won't work.",
  "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
  "A day without sunshine is like, you know, night."
]}

Naturally, the Python ecosystem is vibrant and full of libraries. You can discover the most popular ones here: Best Python Web Scraping Libraries.

If you want to look beyond Python, you can also check out the best tools for web scraping.

GUI Scraping

GUI scraping is the hardest type of scraping and should be used only as a last resort. It ranges from simulating human interaction with a GUI to reading text straight off the screen.

For example, if you needed to click through a couple of buttons in a calculator app, you could automate the key presses with pyautogui in Python:

import pyautogui

# Send keystrokes to whatever window currently has focus (e.g. a calculator app)
pyautogui.press('2')
pyautogui.press('+')
pyautogui.press('2')
pyautogui.press('return')

What if you wanted to take a screenshot and scrape its content by running Optical Character Recognition (OCR) on it?

You could use pytesseract to try to read the text:

import pyautogui
import pytesseract

screenshot = pyautogui.screenshot()
screenshot.save("screenshot.png")

print("Result:", pytesseract.image_to_string(screenshot).strip())

Here's my attempt. I purposefully chose a trivial example:

[screenshot]

And the result:

[pytesseract result]

You see what I'm talking about? Don't be surprised if you have a hard time getting this to run properly. This is as brittle as it gets.

Leveraging OpenAI

If you have some API credits, you could even leverage OpenAI's Vision API to read the text from the screenshot:

import base64
from openai import OpenAI
import pyautogui
from io import BytesIO

client = OpenAI(api_key="your api key")

screenshot = pyautogui.screenshot()

buffer = BytesIO()
screenshot.save(buffer, format="PNG")
img_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")


response = client.chat.completions.create(
	model="gpt-4o",
	messages=[
		{
			"role": "user",
			"content": [
				{"type": "text", "text": "Please extract all visible text from this screenshot."},
				{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
			]
		}
	],
	max_tokens=1000,
)

print("Result:", response.choices[0].message.content.strip())

And here's the result:

Result: Sure! Here is the extracted text from the screenshot:
- # HELLO WORLD!

Much better! At the time of writing, that cost me less than $0.01.

Naturally, there are other tricks we can use before we resort to OCRing screenshots, like using pywinauto to read text from Windows applications, or AppKit on macOS, but these are heavily platform-specific.
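
To give you an idea, here's a minimal pywinauto sketch that connects to an already-running Notepad window and dumps the text of its controls (the window title and the 'uia' backend are assumptions - adjust them to your target application):

from pywinauto import Application

# Windows only: attach to a running application by its window title
# (the title below is just an example - change it to match your target)
app = Application(backend='uia').connect(title='Untitled - Notepad')
window = app.top_window()

# Walk the control tree and print whatever text each control exposes
for control in window.descendants():
	text = control.window_text()
	if text:
		print(text)

Because this goes through the accessibility APIs rather than pixels, it tends to be far more reliable than OCR - when it's available at all.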

The point is, avoid this kind of scraping at all costs if you can. For example, if you want to scrape data from a web page, there are much easier and more reliable ways than taking screenshots and running OCR.

Terminal Scraping

Terminal scraping is the easiest type of scraping (except for APIs, which technically aren't scraping at all), since the output of a program is usually just terminal-friendly plain text.

What's more, many command-line programs provide structured output in formats like JSON or CSV, which makes processing them as easy as integrating with an API.
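
For example, pip can list installed packages as JSON, which you can load straight into Python without any parsing gymnastics; a minimal sketch (assuming pip is available on your PATH):

import json
import subprocess

# Ask pip for machine-readable output instead of scraping its table layout
result = subprocess.run(
	['pip', 'list', '--format=json'],
	capture_output=True, text=True, check=True
)

for package in json.loads(result.stdout):
	print(f"{package['name']}=={package['version']}")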

However, not all tools are that scripting-friendly. If that's the case, we need to scrape and parse the output ourselves.

Let's say we have a program that prints some raw data to the terminal, like this:

echo 'Hello, world!'

We can capture the output of this program in Bash easily by running it in a subshell and assigning the output to a variable:

output=$(echo 'Hello, world!')
echo "Captured: $output"

And once the output is captured, we can process it however we want. Easy, right?

Let's look at a more complex example: df -h, which shows disk space usage on Unix-like systems (the output below comes from macOS):

Filesystem        Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/disk2s1s1   460Gi    10Gi    52Gi    17%    425k  541M    0%   /
devfs            202Ki   202Ki     0Bi   100%     698     0  100%   /dev
/dev/disk2s6     460Gi    24Ki    52Gi     1%       0  541M    0%   /System/Volumes/VM
/dev/disk2s2     460Gi   6.6Gi    52Gi    12%    1.3k  541M    0%   /System/Volumes/Preboot
/dev/disk2s4     460Gi   2.4Mi    52Gi     1%      57  541M    0%   /System/Volumes/Update
/dev/disk1s2     500Mi   6.0Mi   481Mi     2%       1  4.9M    0%   /System/Volumes/xarts
/dev/disk1s1     500Mi   5.4Mi   481Mi     2%      32  4.9M    0%   /System/Volumes/iSCPreboot
/dev/disk1s3     500Mi   3.1Mi   481Mi     1%      88  4.9M    0%   /System/Volumes/Hardware
/dev/disk2s5     460Gi   391Gi    52Gi    89%    3.0M  541M    1%   /System/Volumes/Data
map auto_home      0Bi     0Bi     0Bi   100%       0     0     -   /System/Volumes/Data/home

If we wanted to extract the used capacity and the mount point of each filesystem, we would need to capture the output, skip the last row, and then cherry-pick the relevant columns, which can be done with sed and awk on macOS:

df -h | sed '$d' | awk 'NR>1 { print $5 " used on " $9 }'
17% used on /
100% used on /dev
1% used on /System/Volumes/VM
12% used on /System/Volumes/Preboot
1% used on /System/Volumes/Update
2% used on /System/Volumes/xarts
2% used on /System/Volumes/iSCPreboot
1% used on /System/Volumes/Hardware
89% used on /System/Volumes/Data

As you can see, the script might be a bit fragile, but it does the job!

We've just scraped the output of a command-line program!
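
And if you'd rather do the same from Python, here's a minimal sketch using subprocess (it assumes a df-style layout where the mount point is the last column and the capacity is the first field ending in '%'):

import subprocess

# Capture df -h output as plain text (Unix-like systems)
result = subprocess.run(['df', '-h'], capture_output=True, text=True, check=True)
lines = result.stdout.strip().splitlines()

# Skip the header row and cherry-pick the interesting columns
for line in lines[1:]:
	fields = line.split()
	mount = fields[-1]  # mount point is the last column
	capacity = next((f for f in fields if f.endswith('%')), '?')  # first percentage column
	print(f"{capacity} used on {mount}")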

Challenges of Web Scraping

Unfortunately, web scraping is not always easy. Pages that seem easy to scrape at first glance can turn out to be deceptively complex, and even when they aren't, you might run into the worst obstacle of them all - anti-scraping measures.

Website creators often implement anti-scraping measures to prevent unauthorized automated access. These measures include:

  • CAPTCHAs: challenges that test whether the user is human, and which keep getting harder to solve even for humans
  • Rate limiting: sites might limit the number of requests coming from a single IP address in a given time frame
  • IP blocking: sites might block IP addresses that make too many requests in a short period
  • Frequent changes: one of the most effective ways to prevent scraping... is to frequently change the structure of a page

While all of those can be overcome, doing so drastically increases the time and effort required to scrape a page.

For example, scrapers often rely on CAPTCHA-solving services or use machine learning models to recognize and solve challenges.

Rate limiting and IP blocking can be overcome using a pool of distributed proxies (sometimes called a "scraping farm").
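
Here's a minimal sketch of what rotating requests through a proxy pool could look like with requests (the proxy URLs are placeholders, not real endpoints):

import random
import requests

# Placeholder proxy endpoints - swap in your own pool
PROXIES = [
	'http://user:pass@proxy1.example.com:8080',
	'http://user:pass@proxy2.example.com:8080',
	'http://user:pass@proxy3.example.com:8080',
]

def fetch(url):
	# Pick a different proxy for each request to spread the load across IPs
	proxy = random.choice(PROXIES)
	return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

response = fetch('https://quotes.toscrape.com/js/')
print(response.status_code)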

Structural changes can be handled by relying less on selectors and more on heuristics involving text matching, machine learning models, and AI in general.
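
For instance, instead of pinning your scraper to an auto-generated CSS class, you can anchor it to the visible text; here's a minimal BeautifulSoup sketch (the HTML snippet and the 'Price:' label are made up for illustration):

import re
from bs4 import BeautifulSoup

# A made-up snippet whose class names could change with every deployment
html = '<div class="x9f2"><span class="y1c3">Price:</span> <span class="z7q8">$19.99</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Anchor on the visible "Price:" label instead of the class names,
# then grab the first price-looking string that follows it
label = soup.find(string=re.compile(r'Price:'))
if label:
	value = label.find_next(string=re.compile(r'\$\d'))
	print(value)  # $19.99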

You can read more about overcoming anti-scraping measures in our blog post on How to scrape without getting blocked.

That's why our Web Scraping API exists. We abstract away the most difficult parts of scraping and let you focus on the data you want to extract.

Ethics of Screen Scraping

Before scraping a website, it's good practice to check the site's robots.txt, which declares the site's policy for crawlers and scrapers. This file is usually located at the root of the website (for example, https://www.scrapingbee.com/robots.txt) and should be treated as a polite request rather than a legal requirement.
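
If you want to perform that check programmatically, Python's standard library ships with urllib.robotparser; a minimal sketch:

from urllib import robotparser

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://www.scrapingbee.com/robots.txt')
rp.read()

# Check whether a generic crawler ("*") may fetch a given URL
print(rp.can_fetch('*', 'https://www.scrapingbee.com/blog/'))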

Some pages disallow scraping entirely, while others make it even easier by providing sitemaps, which are XML files that list all the pages on the site.

Example of robots.txt:

User-agent: *
Disallow:

Sitemap: https://4comprehension.com/sitemap_index.xml
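
The Sitemap line in the example above points to a sitemap index, which you can fetch and parse like any other XML document; a minimal sketch (assuming the standard sitemaps.org namespace):

import requests
import xml.etree.ElementTree as ET

# Fetch the sitemap index advertised in robots.txt
response = requests.get('https://4comprehension.com/sitemap_index.xml')
root = ET.fromstring(response.content)

# Every <loc> element holds a URL - either a child sitemap or an actual page
namespace = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for loc in root.findall('.//sm:loc', namespace):
	print(loc.text)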

Always remember to make sure you're not violating any terms of service. There's the famous case of hiQ Labs v. LinkedIn, in which hiQ, a data analytics company, scraped public LinkedIn profiles; after years of litigation, the case ultimately ended in LinkedIn's favor.

Conclusion

Screen scraping is a powerful technique that allows us to extract data when no dedicated APIs are provided, whether it's scraping terminal output, web pages, or clunky GUIs.

Now that you know the basics, go ahead and scrape responsibly!

Grzegorz Piwowarek

Independent consultant, blogger at 4comprehension.com, trainer, Vavr project lead - teaching distributed systems, architecture, Java, and Golang