How to scrape channel data from YouTube

27 December 2022 | 14 min read

If you are an internet user, it is safe to assume that you are no stranger to YouTube. It is the hub for videos on the internet, and even back in 2020, 500 hours of video were being uploaded to YouTube every minute! This has led to the accumulation of a ton of useful data on the platform. You can extract and make use of some of this data via the official YouTube API, but it is rate limited and doesn't expose all the data viewable on the website. In this tutorial, you will learn how to scrape YouTube data using Selenium. The tutorial will specifically focus on extracting information about the videos uploaded by a channel, but the techniques are easily transferable to extracting search results and individual video data.

cover image

Setting up the prerequisites

This tutorial will use Python 3.10 but should work with most recent Python versions. Start by creating a new directory for the project and an app.py file for storing all the code:

$ mkdir scrape_youtube
$ cd scrape_youtube
$ touch app.py

You will need to install two libraries to follow along:

  • Selenium, for programmatically driving a real browser
  • webdriver-manager, for automatically downloading and managing the browser driver binary

You can install both of these libraries using this command:

$ pip install selenium webdriver-manager

Selenium will provide you with all the APIs for programmatically accessing a browser, and webdriver-manager will set up a browser's binary driver for you, without the need to manually download it and link to it in your code.

Fetching the channel page

Let's try and fetch a channel's uploaded videos page using Selenium to make sure everything is set up correctly. Save the following code in app.py:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.youtube.com/@YasoobKhalid/videos")

Running this code will open up a Chrome window and navigate it to my YouTube channel. If this is the first time you are using webdriver_manager, it might take a few seconds for the Chrome window to open, as webdriver_manager has to download the latest chromedriver first.

The code is fairly straightforward. It imports webdriver, Service, and ChromeDriverManager and then creates a Chrome driver instance. Generally, you pass the path to a webdriver executable to webdriver.Chrome, but it becomes tedious to keep that binary updated with the latest Chrome version installed on your system. webdriver_manager makes this super simple by automatically downloading the required chromedriver binary and returning its path. This way you don't have to worry about manually downloading the binary with every Chrome update.
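
For comparison, the manual approach would look roughly like this; the driver path below is a placeholder for a binary you would have to download and keep in sync with Chrome yourself:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to a manually downloaded chromedriver binary;
# you would need to re-download it after Chrome updates
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))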

This is what the chrome window might look like on a successful run:

Channel page

Deciding what to scrape

It is always a good idea to plan what you want to scrape before writing any code. In this tutorial, you will be scraping:

  • channel name
  • channel handle
  • subscriber count
  • video titles
  • video view counts
  • video thumbnails

The screenshot below highlights where all of this information is stored on the page:

Annotated channel page

You will be using the default methods (find_element + find_elements) that Selenium provides for accessing DOM elements and extracting data from them. Additionally, you will be relying on CSS selectors and XPath for locating the DOM elements. The exact method you use for locating an element will depend on the DOM structure and which method is the most appropriate.
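
For illustration only (the selectors below are placeholders, not tags from YouTube's actual DOM), this is how the two locator strategies look in Selenium:

from selenium.webdriver.common.by import By

# CSS selector: find_element returns the first matching element
element = driver.find_element(By.CSS_SELECTOR, "div#some-id > span.some-class")

# XPath: find_elements returns all matching elements as a list
elements = driver.find_elements(By.XPATH, '//div[@id="some-id"]/span')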

Scraping channel information

Go ahead and explore the HTML structure of the page and try to identify tags that can reliably be targeted to extract the required information.

Extracting channel title

You can start with the channel name. Right-click on the channel name and click on Inspect in the context menu. This will fire up Developer Tools and will show you the HTML structure of the channel name:

Dev tools channel title

As you can see in the screenshot above, the channel title is encapsulated in a yt-formatted-string tag that has a class of ytd-channel-name. You can use this information to target this element using XPath and extract the text from within it. This is what the code will look like:

from selenium.webdriver.common.by import By

# ...

channel_title = driver.find_element(By.XPATH, '//yt-formatted-string[contains(@class, "ytd-channel-name")]').text

If you have ever worked with Selenium and/or XPath before, the code will look familiar. It makes use of the find_element method, which returns the first element on the page matched by the XPath expression. The XPath expression itself uses some basic XPath features and works like this:

  • //yt-formatted-string matches all tags on the page that are named yt-formatted-string
  • [contains(@class, "ytd-channel-name")] filters the yt-formatted-string tags based on the class name and only returns those tags that contain the ytd-channel-name class

You could have tried to use the following XPath:

//yt-formatted-string[@class="ytd-channel-name"]

However, it would not have worked, as it tries to match all yt-formatted-string tags that have the exact class of ytd-channel-name and no additional classes. You might be wondering what happens if you add the additional class to the XPath as well:

//yt-formatted-string[@class="style-scope ytd-channel-name"]

Well, it will work perfectly fine. However, it is better to make your XPaths a bit flexible just in case the class names change in the future, and using one less class name is more flexible than hardcoding two class names.

Extracting channel handle

Next up, explore the HTML structure of the channel handle:

Channel handle

It seems like you can use the id of channel-handle this time to target this element:

handle = driver.find_element(By.XPATH, '//yt-formatted-string[@id="channel-handle"]').text

Even though there is another yt-formatted-string tag with the same id right underneath the one you extracted, you don't need to worry too much about it for two reasons:

  • Selenium will always return the first tag that matches the XPath, because the code uses the find_element method rather than find_elements
  • Both tags contain the same text, so even if they were interchanged you would end up with the same handle (you can verify this with find_elements, as shown below)
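
If you want to verify the second point yourself, a quick optional check looks like this:

# Optional sanity check: both tags with id="channel-handle" should
# contain the same handle text
handles = driver.find_elements(By.XPATH, '//yt-formatted-string[@id="channel-handle"]')
print([h.text for h in handles])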

Extracting subscriber count

The subscriber count is located right underneath the tag containing the channel handle:

channel subscribers

Again, you can rely on the unique id of subscriber-count to target and extract data from this tag:

subscriber_count = driver.find_element(By.XPATH, '//yt-formatted-string[@id="subscriber-count"]').text

Scraping video data

Now that you have all the channel data you wanted to extract, you can focus on the videos. Let's start with the video titles.

Extracting video title

This is what the video title looks like in developer tools:

video title

You have two workflow options here. You can either extract each video section like this:

video groups

And then extract the individual video data (title, view count, etc.).

Or you can extract all the titles and all the view counts (along with other info) separately and merge them for each video. Generally, the first method is preferred in cases where not all sections have the same amount of info, because it makes it much easier to account for missing info in a particular section (video).

However, because YouTube displays the required info for all the videos, you can go with the second workflow option.
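
For reference, the first workflow would look roughly like the sketch below. The ytd-rich-item-renderer container tag is an assumption about YouTube's current markup, so verify it in developer tools before relying on it:

# Workflow 1 (sketch): grab each video's container first, then pull
# the individual fields out of it. The container tag name is an
# assumption and may change as YouTube updates its markup.
sections = driver.find_elements(By.TAG_NAME, "ytd-rich-item-renderer")
for section in sections:
    title = section.find_element(By.ID, "video-title").text
    # ...extract the view count, thumbnail, etc. from the same section,
    # handling any fields that are missing for a particular video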

According to the developer tools screenshot above, all titles are encapsulated in yt-formatted-string tags with the id of video-title. You can use this knowledge to extract all video titles like so:

titles = driver.find_elements(By.ID, "video-title")

This will return a list of WebElement instances. To get the actual title strings, you can loop over this list and read each element's .text property:

titles = [title.text for title in titles]

Hold off on this conversion for now, though: the merge loop later in this tutorial reads .text directly from the WebElement instances.

Extracting view count

The view count is located in a span that itself is nested in a div:

video views

You can extract the spans using this code:

views = driver.find_elements(By.XPATH, '//div[@id="metadata-line"]/span[1]')

This XPath matches the first span inside every div with the id of metadata-line. Indexing in XPath begins at 1, which is why you see 1 instead of 0 as the span index.
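
As an aside, the same pattern can pull the other metadata too. For example, the second span appears to hold the relative upload date ("2 years ago" and so on); confirm this in developer tools before relying on it:

# Assumption: the second span inside metadata-line holds the relative
# upload date; verify this in developer tools first
upload_dates = driver.find_elements(By.XPATH, '//div[@id="metadata-line"]/span[2]')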

We will extract the text from the individual span tags in a bit.

Extracting video thumbnail

Sweet! Now you have only one last thing that needs to be extracted: video thumbnails. This is what the DOM structure looks like:

Video thumbnails

You need to extract the img tag that is nested inside the yt-image tag which itself is nested within an anchor tag with the id of thumbnail. This is what the extraction code will look like:

thumbnails = driver.find_elements(By.XPATH, '//a[@id="thumbnail"]/yt-image/img')

Merging the extracted video information

At this point, you have three separate lists containing the video information. You can merge the info about individual videos and create a new list of dictionaries:

from pprint import pprint

# ...

videos = []
for title, view, thumb in zip(titles, views, thumbnails):
    video_dict = {
        'title': title.text,
        'views': view.text,
        'thumbnail': thumb.get_attribute('src')
    }
    videos.append(video_dict)
    
pprint(videos)

If you save this code and run it, it will produce an output similar to this:

[
    {
        "thumbnail": "https://i.ytimg.com/vi/9853EAISk2c/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLDaPCvkUgLmsYZ3V7jETCqMu4DEhw",
        "title": "How to Listen to Trunked Police Radio | Complete Instructions 2020",
        "views": "25K views",
    },
    {
        "thumbnail": "https://i.ytimg.com/vi/5N066ISH8og/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLBiWiKDJq8dC1UDhU9-FuUizexzRg",
        "title": "Web Scraping using lxml and Python 2018: Extracting data from "
        "Steam",
        "views": "22K views",
    },
    {
        "thumbnail": "https://i.ytimg.com/vi/MqeO9lQemmQ/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLAYTPts5BIKZ4olyt8o1miJJrYcmQ",
        "title": "Top 14 MOST famous Python libraries & frameworks",
        "views": "55K views",
    },
]

While you are at it, go ahead and add the video links to the dictionaries as well. The code for that would resemble this:

links = driver.find_elements(By.ID, "video-title-link")

videos = []
for title, view, thumb, link in zip(titles, views, thumbnails, links):
    video_dict = {
        'title': title.text,
        'views': view.text,
        'thumbnail': thumb.get_attribute('src'),
        'link': link.get_attribute('href')
    }
    videos.append(video_dict)

And now your output should include the link as well:

[
    {
        "link": "https://www.youtube.com/watch?v=9853EAISk2c",
        "thumbnail": "https://i.ytimg.com/vi/9853EAISk2c/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLDaPCvkUgLmsYZ3V7jETCqMu4DEhw",
        "title": "How to Listen to Trunked Police Radio | Complete Instructions 2020",
        "views": "25K views",
    },
    {
        "link": "https://www.youtube.com/watch?v=5N066ISH8og",
        "thumbnail": "https://i.ytimg.com/vi/5N066ISH8og/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLBiWiKDJq8dC1UDhU9-FuUizexzRg",
        "title": "Web Scraping using lxml and Python 2018: Extracting data from "
        "Steam",
        "views": "22K views",
    },
    {
        "link": "https://www.youtube.com/watch?v=MqeO9lQemmQ",
        "thumbnail": "https://i.ytimg.com/vi/MqeO9lQemmQ/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLAYTPts5BIKZ4olyt8o1miJJrYcmQ",
        "title": "Top 14 MOST famous Python libraries & frameworks",
        "views": "55K views",
    },
]

Perfect! You have extracted all the data you wanted! But wait! There is one slight issue.

Loading all videos on the channel page

By default, YouTube only shows a few videos on the channel page. My channel only has 3 videos so far, so this hasn't been an issue. However, if the channel you are scraping has a ton of videos, you will have to scroll to the bottom of the page to load the older ones. A channel might have 1,000+ videos, so you may have to scroll quite a few times to load them all.

Luckily, there is a way for you to automate this using Selenium. The basic idea is that you will get the current height of the document (page), tell Selenium to scroll to the bottom of the page, wait for a few seconds, and then calculate the height of the document yet again. You will continue doing so until the new height is the same as the old height. This way you can be sure that there are no more videos that need to be loaded. Once all the videos are visible on the page, you can go ahead and scrape all of them in one go.

The code for this logic will resemble this:

import time

WAIT_IN_SECONDS = 5
last_height = driver.execute_script("return document.documentElement.scrollHeight")

while True:
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
    # Wait for new videos to show up
    time.sleep(WAIT_IN_SECONDS)
    
    # Calculate the new document height and compare it with the last height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
    
# All videos loaded
# Go ahead with scraping

The code exactly follows the logic explained earlier. It executes JavaScript to get the current document height, scrolls to the bottom of the page, waits for a bit, and then compares the new height with the old one. If the two heights are the same, it breaks out of the loop.

There is a better way to do this by using selenium-wire and making sure the AJAX call has returned. We follow that workflow in some of our other tutorials. However, for YouTube, this code should work just fine.
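
If you would rather stay within plain Selenium but avoid fixed sleeps, one option is to wait until the number of loaded video titles stops growing. Here is a minimal sketch of that idea (the helper name is my own, not part of Selenium):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def more_videos_loaded(driver, previous_count, timeout=10):
    # Returns True once more video titles are present than before,
    # False if nothing new appears within the timeout
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.ID, "video-title")) > previous_count
        )
        return True
    except TimeoutException:
        return False

You would call this after each scroll in place of time.sleep and break out of the loop once it returns False.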

Complete code

The complete code for this tutorial looks like this:

import time
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.youtube.com/@YasoobKhalid/videos")

channel_title = driver.find_element(By.XPATH, '//yt-formatted-string[contains(@class, "ytd-channel-name")]').text
handle = driver.find_element(By.XPATH, '//yt-formatted-string[@id="channel-handle"]').text
subscriber_count = driver.find_element(By.XPATH, '//yt-formatted-string[@id="subscriber-count"]').text

WAIT_IN_SECONDS = 5
last_height = driver.execute_script("return document.documentElement.scrollHeight")

while True:
    # Scroll to the bottom of page
    driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
    # Wait for new videos to show up
    time.sleep(WAIT_IN_SECONDS)
    
    # Calculate new document height and compare it with last height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height


thumbnails = driver.find_elements(By.XPATH, '//a[@id="thumbnail"]/yt-image/img')
views = driver.find_elements(By.XPATH, '//div[@id="metadata-line"]/span[1]')
titles = driver.find_elements(By.ID, "video-title")
links = driver.find_elements(By.ID, "video-title-link")

videos = []
for title, view, thumb, link in zip(titles, views, thumbnails, links):
    video_dict = {
        'title': title.text,
        'views': view.text,
        'thumbnail': thumb.get_attribute('src'),
        'link': link.get_attribute('href')
    }
    videos.append(video_dict)

pprint(videos)

Avoid getting blocked by using ScrapingBee

There are a few caveats I didn't discuss in detail. The biggest one is that if you run your scraper often enough, YouTube will block it. They have systems in place to figure out when a request is made by a script, and no, simply setting an appropriate User-Agent string is not going to help you bypass them. You would have to use rotating proxies and automated captcha-solving services. This can be too much to handle on your own, and luckily there is a service to help with that: ScrapingBee.

You can use ScrapingBee to extract information from whichever channel page you want and ScrapingBee will make sure that it uses rotating proxies and solves captchas all on its own. This will let you focus on the business logic (data extraction) and let ScrapingBee deal with all the grunt work.

Let's look at a quick example of how you can use ScrapingBee. First, go to the terminal and install the ScrapingBee Python SDK:

$ pip install scrapingbee

Next, go to the ScrapingBee website and sign up for an account:

Signup scrapingbee

After successful signup, you will be greeted with the default dashboard. Copy your API key from this page and edit the code in your app.py file.

Copy API key

I will show you the code and then explain what is happening:

from pprint import pprint
from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='API_KEY')

url = "https://www.youtube.com/@YasoobKhalid/videos"

js_scenario = {
  "instructions": [
    {"click": "form > div > div > button"},
    {"wait": 1500},
    {"scroll_y": 1080}
  ]
}

response = client.get(
    url,
    params={
        "block_resources": "false",
        "premium_proxy": "true", 
        'country_code':'us',
        "js_scenario": js_scenario, 
        'extract_rules': {
            "channel_title": {
                "selector": "yt-formatted-string[class='style-scope ytd-channel-name']",
                "output": "text",
            },
            "handle": {
                "selector": "yt-formatted-string[id='channel-handle']",
                "output": "text",
            },
            "subscriber_count": {
                "selector": "yt-formatted-string[id='subscriber-count']",
                "output": "text",
            },
            "thumbnails": {
                "selector": "a[id='thumbnail'] > yt-image > img",
                "type": "list",
                "output": "@src",
            },
            "views": {
                "selector": "div[id='metadata-line'] > span:nth-of-type(1)",
                "type": "list",
                "output": "text",
            },
            "titles": {
                "selector": "#video-title",
                "type": "list",
                "output": "text",
            },
            "links": {
                "selector": "#video-title-link",
                "type": "list",
                "output": "@href",
            },
        }
    }
)

if response.ok:
    scraped_data = response.json()
    videos = []
    for title, view, thumb, link in zip(scraped_data["titles"], scraped_data["views"], scraped_data["thumbnails"], scraped_data["links"]):
        video_dict = {
            'title': title,
            'views': view,
            'thumbnail': thumb,
            'link': link
        }
        videos.append(video_dict)
    pprint(videos)

Note: Make sure to replace API_KEY with your API key.

The code is somewhat similar to what you wrote using Selenium and XPath. This code, however, makes use of ScrapingBee's powerful extract rules. They let you declare the tags and selectors you want to extract data from, and ScrapingBee returns the scraped data.

The code contains a js_scenario dictionary that tells ScrapingBee what actions to take before extracting the data. The first step of clicking a button might seem a bit out of place, but Google/YouTube often shows a cookie consent form on the first visit that needs to be accepted. This is what it looks like:

Cookie consent

After accepting the cookie consent form, the code instructs ScrapingBee to wait for a few seconds for the page to load and then scroll to the end of the page. You can add as many scrolls and waits as you want and ScrapingBee will do as you say without any hiccups.
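
For example, a scenario with a few extra scroll-and-wait rounds might look like this; how many rounds you actually need depends on the channel's video count:

js_scenario = {
  "instructions": [
    {"click": "form > div > div > button"},
    {"wait": 1500},
    {"scroll_y": 1080},
    {"wait": 1000},
    {"scroll_y": 1080},
    {"wait": 1000},
    {"scroll_y": 1080}
  ]
}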

We are also making use of the premium proxies provided by ScrapingBee as regular proxies are easily blocked by Google.

Finally, we pass in the extract rules. These instruct ScrapingBee on how to extract the data. The selectors look a bit different here because extract rules use CSS selectors instead of XPath, and since ScrapingBee executes the JavaScript on the page before applying them, they run against the fully rendered DOM. This can be extremely useful when you need to extract data from a page that makes heavy use of JavaScript, such as SPAs (Single Page Applications).
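
To make the mapping concrete, here is the view-count locator from earlier in both notations:

XPath: //div[@id="metadata-line"]/span[1]
CSS:   div[id='metadata-line'] > span:nth-of-type(1)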

ScrapingBee will make sure that you are charged only for a successful response which makes it a really good deal.

Conclusion

In this article, you learned how to scrape data from YouTube using Selenium. You got introduced to XPath and saw its usefulness in web scraping. And finally, you learned how to avoid getting blocked by using a reliable proxy provider: ScrapingBee.

You can follow the same steps to extract data about individual videos or to scrape search results from YouTube. You can even use this code in conjunction with youtube-dl or something similar to download specific videos for offline viewing.

I hope you learned something new from this article. If you have any questions regarding web scraping, please feel free to reach out. We would love to help you!

Yasoob Khalid

Yasoob is a renowned author, blogger and a tech speaker. He has authored the Intermediate Python and Practical Python Projects books and writes regularly. He is currently working on Azure at Microsoft.