How to scrape data from Twitter.com

26 October 2022 | 11 min read

Twitter is a gold mine for data. It started as a micro-blogging website and has quickly grown to become the favorite hangout spot for millions of people. Twitter provides access to most of its data via its official API but sometimes that is not enough.

Web scraping provides some advantages over using the official API. For example, Twitter's API is rate-limited, and you have to wait for Twitter to approve your application before you can access any data; neither restriction applies to web scraping.

In this article, you will learn how to use Selenium to scrape public profile data from Twitter.

cover image

Setting up the prerequisites

This tutorial uses Python 3.10 but should work with most Python versions. Start by creating a new directory where all of your project files will be stored and then create a new Python file within it for the code itself:

$ mkdir scrape_twitter
$ cd scrape_twitter
$ touch app.py

You will need to install two libraries to follow along: selenium and webdriver-manager.

You can install both of these libraries using this command:

$ pip install selenium webdriver-manager

Selenium provides the APIs for programmatically driving a browser, and webdriver-manager helps you set up a browser's driver binary without having to manually download it and link to it in your code.

Fetching a profile page

Let's try fetching a Twitter profile page using Selenium to make sure everything is set up correctly. Save the following code in the app.py file:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://twitter.com/yasoobkhalid")

Running this code should open up a Chrome window and navigate it to my Twitter profile. The first run might take a minute to open the browser window because the Chrome driver binary has to be downloaded.

The code itself is fairly straightforward. It starts with importing the webdriver, the Service and the ChromeDriverManager. Normally, you would initialize the webdriver by giving it an executable_path for the browser-specific driver binary you want to use:

browser = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")

The biggest downside of this approach is that every time the browser updates, you have to download a matching driver binary. This gets tiring very quickly, and the webdriver_manager library makes it simpler by letting you pass in ChromeDriverManager().install(). This will automatically download the required binary and return the appropriate path so that you don't have to worry about it anymore.

The last line of code asks the driver to navigate to my Twitter profile.

Deciding what to scrape

The next step is to figure out what you want to scrape. In this tutorial, you will learn how to scrape:

  1. profile name
  2. profile handle
  3. user location
  5. user website
  5. join date
  6. following count
  7. followers count
  8. user tweets

The image below highlights where all of this information is located on the page (except the tweets, which we will focus on in a bit):

Annotated Twitter profile

Let the extraction begin!

You will be relying on the APIs provided by Selenium to do the web scraping. You will mostly be using XPath and CSS selectors for this purpose. The exact method you will use will be dictated by the HTML structure and which option is more reliable and easy to use.

How to know the page is fully loaded

If you try to extract data as soon as Selenium navigates to the profile page, you will encounter the NoSuchElementException. This occurs because Twitter uses JavaScript to request data from the server and populate the page and the JavaScript hasn't finished executing when the profile page is loaded.

There are a few ways to be sure the page has fully loaded and the data we need is visible on the page. We will be relying on two of these methods.

The first one just makes sure that the page and all of its sub-resources have finished loading. It does this by checking the value of document.readyState property. The value of this property will be "complete" when everything has loaded:

import time
from random import randint

# ...

state = ""
while state != "complete":
    print("loading not complete")
    time.sleep(randint(3, 5))
    state = driver.execute_script("return document.readyState")

The .execute_script method takes in JavaScript code to execute in the current window. There is a random wait added in the while loop just to make sure you are waiting for a little while before checking the status again.
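If you prefer to avoid an unbounded loop (a page could in principle never reach "complete"), you can wrap the same check in a helper with a maximum number of attempts. This is a sketch, not part of the original code; it only assumes the driver object exposes execute_script:

```python
import time
from random import randint

def wait_for_page_load(driver, max_attempts=10):
    """Poll document.readyState until it is 'complete', or give up."""
    for _ in range(max_attempts):
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            return True
        time.sleep(randint(3, 5))
    return False
```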

However, this document.readyState method alone is not enough as it takes time for the tweets to be populated on the screen. Twitter launches an XHR request to fetch the tweets after the page has finished loading. If you rely only on this first method and try to extract tweets from the profile, you will encounter the same NoSuchElementException again. This second method will fix this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException

# ...

try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '[data-testid="tweet"]')))
except WebDriverException:
    print("Tweets did not appear! Try setting headless=False to see what is happening")

Here you are asking Selenium to pause execution until a DOM element with the data-testid="tweet" attribute is present on the page. It will wait up to 10 seconds for this element to show up; if it doesn't appear in that time, the wait will time out and throw a TimeoutException (a subclass of WebDriverException).

You don't need to worry about how we know about this particular CSS selector for the tweets. We will cover this in a later section.

Extracting the username & handle

Let's open up the profile page in a Chrome window, fire up the Developer Tools, and explore the HTML structure of the page:

Annotated handle

As you can see in the image above, the whole name and handle section is encapsulated in a div. Twitter seems to generate class names randomly, which makes them useless for reliable scraping. Luckily, there is a data-testid attribute that is unique to this div: if you search the HTML, you will quickly see that only one data-testid attribute has the value UserName. You can use this to your benefit and tell Selenium to extract the name and handle from the div that has this attribute:

from selenium.webdriver.common.by import By

#...

name, username = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserName"]').text.split('\n')

Here you are using the CSS_SELECTOR to target this particular div and then extracting the visible text from it. This returns the name and username/handle with a newline character in between. You are then splitting the resulting string at the newline character and assigning the return values to name and username variables.
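To see what this split does in isolation, here is the same logic on a hard-coded string (the text is an assumed example of what .text returns for this div):

```python
# Assumed example of the visible text of the UserName div:
text = "Yasoob Khalid\n@yasoobkhalid"

name, username = text.split("\n")
print(name)      # Yasoob Khalid
print(username)  # @yasoobkhalid
```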

Extracting user bio

Right-click on the user bio and click on "inspect". This will open up the relevant section of the DOM in the developer tools window:

Annotated bio

This is the same as the username and handle section. All the CSS classes are randomly generated. However, there is a data-testid attribute for this div as well and it is also unique on this page. You can target this attribute and then extract the text from the div:

bio = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserDescription"]').text

You could try to be more precise and target specific child tags, but that would be unnecessary: the bio is the only visible text in this div.

Extracting location, website, & join date

The location, website, and join date have a similar extraction method. Look at the image below which shows the DOM structure of the location:

Annotated user location

This span also has the data-testid attribute defined. The same is the case for the website and join date. You can extract all three of these using the same method we used for the previous two extractions:

location = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserLocation"]').text
website = driver.find_element(By.CSS_SELECTOR,'a[data-testid="UserUrl"]').text
join_date = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserJoinDate"]').text
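Not every profile fills in all three fields; on a profile with no location or website, find_element will raise NoSuchElementException. One way to hedge against that is a small wrapper that returns a default instead of raising. This helper is a sketch, not part of the original code; it only assumes the driver object exposes find_element:

```python
def safe_text(driver, by, selector, default=""):
    """Return an element's visible text, or a default if the element is missing."""
    try:
        return driver.find_element(by, selector).text
    except Exception:  # in practice, selenium's NoSuchElementException
        return default

# Usage would look like:
# location = safe_text(driver, By.CSS_SELECTOR, 'span[data-testid="UserLocation"]')
```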

Extracting followers and following count

This one is going to be slightly different as there doesn't seem to be any data-testid attribute defined for the tags containing the followers and following count. Look at the DOM structure of the parent tags containing the following count:

Annotated following count

This seems tricky at first but has a surprisingly easy solution. You can use XPath for targeting these tags. XPath allows you to find a tag on the page based on the text it contains. This is useful as there is only one location on the page where the word "Following" is found. Let me show you the XPath and then explain what is happening:

following = driver.find_element(By.XPATH, "//span[contains(text(), 'Following')]/ancestor::a/span").text

  • //span[contains(text(), 'Following')] tells Selenium to find the span that contains the text "Following"
  • /ancestor::a tells Selenium to go to the a (anchor tag) ancestor of this span
  • /span tells Selenium to move down to the very first child span of the a tag

Look at the DOM structure above and you will see what is happening. We first target the span containing the "Following" string. Then we traverse upwards to the a tag and then again traverse downwards to the first span child of the a tag and extract the text from it.

The same is the case for the "Followers" count:

followers = driver.find_element(By.XPATH, "//span[contains(text(), 'Followers')]/ancestor::a/span").text
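Keep in mind that .text gives you display strings like "1,024" or "11.3K", not integers. If you want numeric values, a small parser along these lines can help (a sketch; the "K"/"M" abbreviations are an assumption about how Twitter formats large counts):

```python
def parse_count(text):
    """Convert a display count like '1,024' or '11.3K' to an int."""
    text = text.strip().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}
    if text and text[-1].upper() in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1].upper()])
    return int(text)

print(parse_count("1,024"))  # 1024
print(parse_count("11.3K"))  # 11300
```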

Extracting tweets

The profile page contains tweets as well as retweets. We will not distinguish between them for now and will consider all of them as simply Tweets. Open up the developer tools and start inspecting a tweet:

Annotated tweet

Turns out that the tweets also have a data-testid defined. You can use this to extract all of the tweets from the page:

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')

Note that here we are using find_elements instead of find_element. find_element returns only the first matching element, whereas find_elements returns all of them. You have to use the latter here because there are multiple tweets on the profile page with the same data-testid and you need to extract all of them.

From here on you can go over each tweet in this list and extract the relevant data. Let's start by extracting the tweet author. Each tweet's author name and author handle are wrapped in a div with the data-testid of User-Names:

Annotated tweet author

If you try to extract the text from this div, it will return the following:

>>> print(tweet.find_element(By.CSS_SELECTOR,'div[data-testid="User-Names"]').text)
Yasoob Khalid
@yasoobkhalid
·
Aug 3

You can simply split the text at the newline character (\n) and extract the name and handle of the tweet author and the timestamp of the tweet:

for tweet in tweets:
    tag_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="User-Names"]').text
    name, handle, _, timestamp = tag_text.split('\n')
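The four-way unpacking assumes the div always contains exactly four lines, which is not guaranteed (a retweet, for example, may add an extra line at the top). A slightly more defensive version checks the line count and takes the last four lines (a sketch; the extra-line behavior is an assumption):

```python
def parse_user_names(tag_text):
    """Split the User-Names text into (name, handle, timestamp), or None."""
    lines = tag_text.split("\n")
    if len(lines) < 4:
        return None
    # The last four lines are: name, handle, separator dot, timestamp
    name, handle, _, timestamp = lines[-4:]
    return name, handle, timestamp

print(parse_user_names("Yasoob Khalid\n@yasoobkhalid\n·\nAug 3"))
# ('Yasoob Khalid', '@yasoobkhalid', 'Aug 3')
```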

The tweet text is also encapsulated in a div with a distinct data-testid attribute:

Annotated tweet text

You can extract the tweet text using the following code:

for tweet in tweets:
    # ...
    tweet_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text

The likes, retweets, and replies counts all follow the same pattern. Right-click on any of these and you will see that there is a data-testid attribute on a parent div for each of them. Here is a screenshot of the DOM structure for the "replies" button:

Annotated tweet replies

You can extract all of these stats of a tweet using the following code:

for tweet in tweets:
    # ...
    retweet_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="retweet"]').text
    like_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="like"]').text
    reply_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="reply"]').text

You can distinguish between an original tweet and a retweet by comparing the tweet author handle with the profile page handle. If they don't match then you are looking at a retweet.
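In code, that check is a one-liner. Normalizing both sides first avoids case and leading-"@" mismatches (a sketch; the normalization is a defensive assumption, not something the page guarantees you need):

```python
def is_retweet(tweet_handle, profile_handle):
    """A tweet whose author handle differs from the profile's handle is a retweet."""
    norm = lambda h: h.lstrip("@").lower()
    return norm(tweet_handle) != norm(profile_handle)

print(is_retweet("@someone_else", "@yasoobkhalid"))  # True
print(is_retweet("@yasoobkhalid", "@yasoobkhalid"))  # False
```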

Complete code

The complete code so far looks something like this:

import time 
from random import randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://twitter.com/yasoobkhalid")

state = ""
while state != "complete":
    print("loading not complete")
    time.sleep(randint(3, 5))
    state = driver.execute_script("return document.readyState")

try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '[data-testid="tweet"]')))
except WebDriverException:
    print("Tweets did not appear! Try setting headless=False to see what is happening")
        
bio = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserDescription"]').text
name, username = driver.find_element(By.CSS_SELECTOR,'div[data-testid="UserName"]').text.split('\n')
location = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserLocation"]').text
website = driver.find_element(By.CSS_SELECTOR,'a[data-testid="UserUrl"]').text
join_date = driver.find_element(By.CSS_SELECTOR,'span[data-testid="UserJoinDate"]').text
following = driver.find_element(By.XPATH, "//span[contains(text(), 'Following')]/ancestor::a/span").text
followers = driver.find_element(By.XPATH, "//span[contains(text(), 'Followers')]/ancestor::a/span").text

tweets = driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweet"]')

for tweet in tweets:
    tag_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="User-Names"]').text
    name, handle, _, timestamp = tag_text.split('\n')
    tweet_text = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="tweetText"]').text
    retweet_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="retweet"]').text
    like_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="like"]').text
    reply_count = tweet.find_element(By.CSS_SELECTOR,'div[data-testid="reply"]').text
    print(name, handle)
    print(tweet_text)
    print("--------------")

Feel free to tweak it to suit your needs!
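For example, instead of printing each tweet you might want to persist the scraped data. Here is a minimal sketch using the standard library's csv module (the field names are assumptions that mirror the variables in the loop above):

```python
import csv

def save_tweets(rows, path="tweets.csv"):
    """Write a list of tweet dicts to a CSV file."""
    fieldnames = ["name", "handle", "timestamp", "text",
                  "replies", "retweets", "likes"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Inside the loop you would collect one dict per tweet, e.g.:
# rows.append({"name": name, "handle": handle, "timestamp": timestamp,
#              "text": tweet_text, "replies": reply_count,
#              "retweets": retweet_count, "likes": like_count})
```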

Conclusion

In this tutorial, you learned how to scrape data from a Twitter profile using Selenium. You can continue in the same manner and extract whatever additional data you need. You can also automate page scrolling and logging in so that you can extract as many tweets as you want.

Or better yet, you can stand on the shoulders of giants and use popular open-source programs that already do all of this for you. One such example is the Twitter scraper selenium project on GitHub that uses Selenium as well and contains a ton of additional features. It automatically extracts images and videos from tweets and also contains code for auto-scrolling to continue extracting additional tweets.

If you have any further questions related to web scraping, feel free to reach out to us! We would love to help you.

Yasoob Khalid

Yasoob is a renowned author, blogger, and tech speaker. He has authored the Intermediate Python and Practical Python Projects books and writes regularly. He is currently working on Azure at Microsoft.