Using jQuery to Parse HTML and Extract Data

03 May 2022 | 8 min read

Your web page may sometimes need to use information from other web pages that do not provide an API. For instance, you may need to fetch stock price information from a web page in real time and display it in a widget of your web page. However, some of the stock price aggregation websites don’t provide APIs.

In such cases, you need to retrieve the source HTML of the web page and manually find the information you need. This process of retrieving and manually parsing HTML to find specific information is known as web scraping.

In this tutorial, you’ll learn how to scrape a web page using jQuery, a fast and versatile tool for parsing and manipulating HTML. Although jQuery is traditionally used for interacting with HTML and CSS from client-side JavaScript, its DOM traversal and manipulation capabilities, combined with its AJAX features, make it a solid choice for web scraping.

What Is Client-Side Scraping?

Client-side scraping involves fetching a web page’s HTML source from its URL in the browser and parsing that HTML to extract the specific information you need.

For example, you might want to build a code search engine. A website such as Stack Overflow provides an API to access its questions and answers programmatically. However, other tutorial websites, such as this one from Draft.dev, have code blocks but do not supply an API for consuming them. To read their code blocks, you have to use client-side scraping, as explained in this tutorial.

Implementing Client-Side Scraping Using jQuery

This tutorial shows you how to scrape a web page using jQuery. jQuery is a fast and powerful JavaScript library that supports HTML document traversal and manipulation of HTML element attributes. It also has features for handling events on HTML elements. jQuery uses CSS selectors to select elements.

Prerequisites

Start by adding a reference to the jQuery library using the <script> tag:

<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>

Fetching the Remote Page Using Get

In this section, you’ll learn how to fetch the remote page HTML using the jQuery get() method. The get() method loads data from the server using an HTTP GET request.

The get() method lets you define a callback function that runs when the GET request succeeds. jQuery passes the response data to the callback as its first parameter.

Let’s consider a sample web page from Draft.dev’s blog. This web page contains HTML elements with IDs and different classes and some with attributes. You’ll fetch the complete HTML code of this page and alert it.

To fetch the URL, pass it to the jQuery get() method and define a callback function containing the alert statement. The HTML returned by the request is passed to the callback as its first argument, as shown below.

$.get('https://draft.dev/learn/how-to-use-markdown', function(html) {
  alert(html);
});

You should see the complete HTML code of the web page. The output below is trimmed to show only a sample of the HTML.

<!DOCTYPE html><html lang="en"><head>    <meta charset="utf-8">    <title>How to Use Markdown | Draft.dev</title>  <meta content="Draft.dev" property="og:site_name">      <meta content="How to Use Markdown" property="og:title">        <meta content="article" property="og:type">      <meta content="If you don't know what Markdown is, or if you've only heard of it, it may seem confusing how a tool for writing can do these things and be held in such high regard by an entire community. This blog post will dive a bit deeper into why Markdown was developed, what exactly it is, and why you should use it. " property="og:description"> 
--
<script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script></body></html>

This is how you can get the complete HTML source of a web page using the jQuery get() method.
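
The callback passed to get() runs only when the request succeeds. A minimal sketch of the same fetch with a failure path is shown here, assuming jQuery 3.x, where $.get() returns a jqXHR object with done() and fail() methods. The function name fetchPage and its callback parameters are hypothetical, and the jQuery object is passed in as jq only so the function can also be exercised outside a browser; in a page, you would pass $. Keep in mind that browsers apply the same-origin policy to AJAX requests, so a cross-origin fetch is likely to end up in the failure path unless the remote server sends CORS headers.

```javascript
// Hypothetical helper: fetch a page and report success or failure.
// `jq` is the jQuery object ($ in a browser page).
function fetchPage(jq, url, onHtml, onError) {
  return jq.get(url)
    .done(function (html) {
      // Runs only when the request succeeds.
      onHtml(html);
    })
    .fail(function (jqXHR, textStatus) {
      // Runs on network errors, HTTP errors, or blocked cross-origin requests.
      onError(textStatus, jqXHR.status);
    });
}

// Usage in a page that has loaded jQuery:
// fetchPage($, 'https://draft.dev/learn/how-to-use-markdown',
//   function (html) { console.log(html.length + ' characters fetched'); },
//   function (status, code) { console.error('Fetch failed:', status, code); });
```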

Extracting the Desired Data Using jQuery’s Find Method

In this section, you’ll learn how to extract the desired data from the HTML source (for example, extracting the text of a specific HTML element or extracting the text of elements with a specific class). Also, you’ll learn how to access elements that have the same HTML class.

Note: to know the IDs or classes of the HTML elements in a web page, you can right-click the web page and select the view page source option.

jQuery provides the find() method, which searches the descendants of each element in the current set and returns the ones that match a selector. You pass it a CSS selector, the same pattern syntax that CSS uses to decide which elements a style applies to.
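
One detail worth knowing is that find() looks only at descendants. When $(html) is given a full page, the elements directly inside <body> can end up as the top-level members of the set, so find() may skip them. A sketch of one way around this, assuming a browser that provides the standard DOMParser, is to parse the string into a separate document and search from its root. findInPage is a hypothetical name; jq stands for $ and domParser for a DOMParser instance, both passed in so the function does not depend on globals.

```javascript
// Hypothetical helper: parse a full HTML string into its own document and
// search it, so every element is a descendant of the search root.
// `jq` is the jQuery object; `domParser` is `new DOMParser()` in a browser.
function findInPage(jq, domParser, html, selector) {
  // parseFromString builds a separate document from the string.
  var doc = domParser.parseFromString(html, 'text/html');
  // Searching from the document node covers <head> and <body> alike.
  return jq(doc).find(selector);
}

// Usage in a page that has loaded jQuery:
// $.get('https://draft.dev/learn/how-to-use-markdown', function (html) {
//   var title = findInPage($, new DOMParser(), html, '.page-title').text();
//   console.log(title);
// });
```

A document produced this way is not rendered by the browser, which is generally a safer place to inspect untrusted markup than the live page.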

Getting Element Text Using ID

The ID attribute specifies a unique ID for an HTML element in your web page. You should not assign the same ID to more than one element in the same HTML page. CSS and JavaScript in your page use this ID to access the specific element, either to set a style or to perform other operations on it.

During data scraping, you can use this ID to find and access the element. The sample URL has an element with the ID why-use-markdown. Using the find() method, you’ll find the element with this ID and print its text information.

To find an element with an ID, you need to use the ID selector, which is #. For example, to select an element with the ID why-use-markdown, prefix the ID of the element with # and pass it to the find() method, as shown below. Once the element is selected, you can use the text() method to get its text content.


$.get('https://draft.dev/learn/how-to-use-markdown', function(html) {
  alert($(html).find("#why-use-markdown").text());
});

You’ve now accessed the HTML element using its ID.

Getting Element Text Using Class Name

Next, you’ll learn how to select elements using their class. The class attribute specifies one or more classes for an HTML element. Unlike the ID attribute, the same class can be used by more than one element in a single web page.

The class attribute allows you to define a set of styles using CSS. This style will be applied to all the elements defined using that specific class. During data scraping, you can use this class name to find all the elements with a specific class and get data from those elements.

In the sample URL, there is an element with the class page-title.

To find an element with a class, you need to use the class selector, which is a period (.). For example, to select an element with the class page-title, prefix the class name with a period and pass it to the find() method, as shown below. Once the element is selected, you can use the text() method to get its text content.

$.get('https://draft.dev/learn/how-to-use-markdown', function(html) {
  alert($(html).find(".page-title").text());
});

This is how you find an element with a class.

Handling Multiple Elements with the Same Class

As discussed before, it is possible that more than one element can have the same class in an HTML document. Therefore, you’ll now learn how to find more than one element with the same class.

You can use the find() method to find elements by class name. When more than one element has that class, you can use the each() method to iterate over the matches. The callback function you define is called once for each matched element.

In the sample URL, there are multiple elements defined with the class highlight. It is the class used to denote all the code blocks that have markdown tutorials.

When you search using this class, you get a jQuery object that contains all of the matched elements. You can then use the each() method to iterate over these elements and print the text of each tutorial block, as shown below.


// Get the page HTML and find every element with the class "highlight"
$.get('https://draft.dev/learn/how-to-use-markdown', function(html) {
  // Loop through the elements you want to scrape content from
  $(html).find(".highlight").each(function() {
    alert($(this).text());
  });
});

This is how you can find an element with its class name and iterate over the matched elements.
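
If you want to keep the scraped values rather than alert them one by one, the matched set can be turned into a plain array. The sketch below assumes jQuery's map() and get() methods; collectTexts is a hypothetical helper name.

```javascript
// Hypothetical helper: turn a matched jQuery set into an array of strings.
function collectTexts(matched) {
  return matched
    .map(function (index, element) {
      // textContent is the standard DOM property holding an element's text.
      return element.textContent.trim();
    })
    .get(); // get() with no argument returns a plain JavaScript array
}

// Usage inside the get() callback:
// var blocks = collectTexts($(html).find('.highlight'));
// console.log(blocks.length + ' code blocks found');
```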

Security Considerations

The HTML you fetch from a URL might contain scripts. Parsing that HTML with jQuery does not, by itself, run its <script> elements. However, markup such as <img onerror='script'> can still execute code indirectly. Hence, you need to be careful and clean or escape content from sources you do not control. Otherwise, unknown scripts may interfere with your page or expose your users’ personal information to attackers.
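
One way to reduce this risk is to treat scraped values as plain text whenever you put them back into your own page. The sketch below shows a small escaping function; escapeHtml is a hypothetical helper, not part of jQuery, and it covers only the five characters that are significant in HTML text and attribute values.

```javascript
// Hypothetical helper: escape characters that have special meaning in HTML.
function escapeHtml(text) {
  var replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
  };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return replacements[ch];
  });
}

// The markup from the example above becomes inert text:
// escapeHtml("<img onerror='script'>")
//   returns "&lt;img onerror=&#39;script&#39;&gt;"
```

When the destination is a jQuery element, calling its text() method with the scraped string has a similar effect, because text() inserts the value as text instead of parsing it as HTML.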

Limitations

There are some limitations to client-side web scraping using jQuery.

  1. The source of the web page might change over time. Any change to the class names or IDs used in the page can break the scraping application.
  2. Dynamic web pages that change often are difficult to scrape from the client side because their element structure, IDs, and classes keep changing. Also, because of the security considerations above, scripts are not executed while scraping, so components that the page loads through scripts will be missing from the scraped HTML.
  3. It is difficult to scrape pages with pagination where the data is loaded with a script. If pagination is implemented so that each page has a separate URL (for example, https://example.com?page=<page_number>), then you can scrape it by making a separate request for each page. However, if the pages are loaded via AJAX calls (for example, an infinite scrolling page), then it’s difficult to scrape because the scripts that load the data are not run during scraping.

Hence, this approach is most suitable for pages that make use of server-side rendering, simple static HTML, or single-page HTML.
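
For the URL-per-page case in the third point, the list of URLs can be generated up front and each one fetched separately. This is a sketch only: buildPageUrls is a hypothetical helper, the page query parameter follows the example URL above, and the page count is assumed to be known in advance.

```javascript
// Hypothetical helper: build one URL per page for a paginated site.
function buildPageUrls(baseUrl, pageCount) {
  var urls = [];
  for (var page = 1; page <= pageCount; page++) {
    var url = new URL(baseUrl);
    // Sets ?page=<page_number>, replacing any existing value.
    url.searchParams.set('page', String(page));
    urls.push(url.toString());
  }
  return urls;
}

// Usage in a page that has loaded jQuery:
// buildPageUrls('https://example.com', 3).forEach(function (pageUrl) {
//   $.get(pageUrl, function (html) { /* parse each page here */ });
// });
```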

Conclusion

In this article, you’ve learned how to do web scraping using jQuery, how to find elements using their ID or class, and how to deal with multiple HTML elements that share the same class.

jQuery is a fast and versatile tool that provides functionality to parse and manipulate HTML DOM elements, allowing you to build powerful scraping applications for web pages.

However, when manually scraping websites using your own programs, there’s a chance that your IP can get blocked for security reasons or access limits. In that case, you can use APIs, such as Scraping Bee, that handle headless browsers and rotate proxies for you.

Vikram Aruchamy

Vikram Aruchamy is a cloud solutions architect who loves to build things on the cloud and a technical writer who loves writing about how to build things on the cloud.