How to build a job board with web scraping and ChatGPT

09 March 2023 | 10 min read

There are a huge number of job openings! If you wanted to build a job board that collects job openings, how would you go about finding them? And once you have collected them, how would you go about extracting useful information such as salary and benefits?

In this article, you’ll learn how to build a job board by scraping Google and recruiting software sites such as Workable and Lever in order to collect job openings. You’ll then learn how to use the ChatGPT API to extract useful information from the job openings. ChatGPT is an Artificial Intelligence tool that can be used to generate, summarize, classify or extract text.

Setting up the prerequisites

This tutorial uses NodeJS v18.12.1 and OpenAI’s NodeJS library. However, the concepts in this tutorial apply to any language. Official or community-maintained libraries are also available for Python, C#/.NET, Crystal, Go, Java, PHP, R, Ruby, and Swift.

Similarly, in this tutorial you will use ScrapingBee’s NodeJS SDK. However, it is also available for Python, Java, Ruby, PHP and Go.

To set up the new Node project:

$ npm init
$ npm install openai
$ npm install scrapingbee
$ npm install axios # To easily make http requests

Step One: Collecting Companies to Scrape

Most companies use recruiting software such as Workable or Lever to host their job openings. These sites then host the openings on their subdomains. For example, in the case of Workable, openings are hosted on https://apply.workable.com/{company_name} . Thus you can quickly build a list of companies to scrape job openings from by using a Google search query to find sites that match this format. The query has the format {search_term} site:https://apply.workable.com/* . For example, the image below shows the results for remote software engineer site:https://apply.workable.com/* :


Google Search Results for "remote software engineer site:https://apply.workable.com/*"

In order to scrape these results from Google more efficiently, you can use ScrapingBee’s Google Scraper:

import axios from 'axios';

export const retrieveGoogleURLSforSearchTerm = async (searchTerm) => {
    const response = await axios.get('https://app.scrapingbee.com/api/v1/store/google', {
        params: {
            api_key: 'YOUR SCRAPINGBEE API KEY',
            search: searchTerm,
        },
    });
    const organicResults = response.data.organic_results;
    // Return the URL of each organic (non-ad) search result
    return organicResults.map((organicResult) => organicResult.url);
};
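As a rough sketch of how the queries for this step could be assembled and the results tidied up, the helpers below build a query in the format described above and reduce each result to its company page. The names buildSearchQuery, toCompanyUrl and dedupeCompanyUrls are hypothetical and not part of any library; the sketch also assumes that the first path segment of a Workable URL is the company name.

```javascript
// Hypothetical helper: builds a Google query restricted to a recruiting site.
const buildSearchQuery = (searchTerm, site = 'https://apply.workable.com/*') =>
    `${searchTerm} site:${site}`;

// Hypothetical helper: reduces any Workable URL to its company page,
// assuming the first path segment is the company name.
const toCompanyUrl = (url) => {
    try {
        const { hostname, pathname } = new URL(url);
        if (hostname !== 'apply.workable.com') return null;
        const [company] = pathname.split('/').filter(Boolean);
        return company ? `https://apply.workable.com/${company}/` : null;
    } catch {
        return null; // not a valid absolute URL
    }
};

// Hypothetical helper: keeps one company page per company.
const dedupeCompanyUrls = (urls) => [
    ...new Set(urls.map(toCompanyUrl).filter(Boolean)),
];
```

The output of buildSearchQuery can be passed to retrieveGoogleURLSforSearchTerm, and the URLs it returns can then be run through dedupeCompanyUrls so that each company is scraped only once.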

Step Two: Retrieving Openings to Scrape

Now that you have a list of Workable URLs to scrape job openings from, the next step is to extract the links for each individual job opening.


Since the page may contain URLs that aren’t for job openings, you’ll need to filter those out. For example, in the case of Workable, each individual opening is hosted on a URL with the format https://apply.workable.com/{company}/j/{job_id} and hence you should filter out links that don’t match this format.
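A minimal sketch of such a filter is shown below. The function name filterJobOpeningUrls is hypothetical, and the pattern assumes that job IDs are alphanumeric and that a trailing slash may or may not be present; adjust it if the links you collect look different.

```javascript
// Matches https://apply.workable.com/{company}/j/{job_id}, with an optional
// trailing slash. Assumption: job IDs contain only letters and digits.
const WORKABLE_JOB_URL = /^https:\/\/apply\.workable\.com\/[^/]+\/j\/[A-Za-z0-9]+\/?$/;

// Hypothetical helper: keeps only links that point to an individual opening.
const filterJobOpeningUrls = (links) => {
    const jobLinks = links
        .filter((link) => typeof link === 'string')
        .map((link) => link.split(/[?#]/)[0]) // drop query string and fragment
        .filter((link) => WORKABLE_JOB_URL.test(link));
    return [...new Set(jobLinks)]; // remove duplicates
};
```

Stripping the query string before matching means that tracking parameters do not cause the same opening to be collected twice.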

One “gotcha” is that sometimes Workable may detect your location and automatically apply filters based on it:

Location filter applied on Workable

This greatly reduces the number of links you’ll be able to extract, since the page then only shows job openings for that location. Thus you’ll need to remove this filter. To do this, you can execute custom JavaScript to click the “Clear Filters” button above:

import { ScrapingBeeClient } from 'scrapingbee';

const scrapingBeeClient = new ScrapingBeeClient('YOUR API KEY');
const js_scenario = {
    instructions: [
        { wait: 3000 },
        { evaluate:
                `const dismissButton = document.getElementsByClassName('button--2de5X button--14TuV tertiary--1L6hu styles--2s5xh')[0];
      if (dismissButton) {
        const clickEvent = new MouseEvent('click', { view: window, bubbles: true, cancelable: false });
        dismissButton.dispatchEvent(clickEvent);
      }`
        },
        { wait: 2000 },
    ],
};

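// `url` is the company page to scrape and `extract_rules` describes what to
// pull out of it; both must be defined before this call.
const response = await scrapingBeeClient.get({
    url,
    params: {
        extract_rules,
        js_scenario,
    },
});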

Step Three: Parsing Information from the Job Opening

Now that you have a full list of job opening URLs, you’ll want to scrape and parse useful information from them. Some information such as job title and location will be straightforward to retrieve using CSS selectors, whilst other information such as salary, job description, and benefits will be more difficult.

Why It's Hard to Extract Job Description and Benefits

Sometimes, the job description and benefits will each be wrapped in their own div with a clearly labeled identifier, so it is easy to retrieve the text using CSS selectors. Other times, however, the entire text may be in a single div, making it much harder to separate the job description from the benefits.

Job Description and Benefits in separate divs, making it easy to parse

Job Description and Benefits in the same div, making it difficult to parse

To handle both cases, you’ll use ChatGPT to extract the job description and the benefits as separate texts.

A Brief Introduction to the ChatGPT API

To use the ChatGPT API, you'll need to create an account with OpenAI in order to generate an API key. Once you have an API key, you'll be able to make API requests.

Here is an example of what a request may look like:

import axios from 'axios';

const exampleChatGPTRequest = async () => {
    const apiUrl = 'https://api.openai.com/v1/chat/completions';
    const headers = {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${'YOUR_API_KEY'}`,
    };
    const data = {
        model: 'gpt-3.5-turbo',
        messages: [
            { role: 'system', content: "Your job is to extract information from job openings." },
            { role: 'user', content: "Extract the benefits from the following job opening…" },
        ],
    };
    const result = await axios.post(apiUrl, data, { headers });
    return result.data.choices[0].message.content;
}

Notice the messages argument in the example above. This is how you prompt ChatGPT to perform tasks. The system message sets the overall behavior of the assistant, while the user messages tell it what to do.

A complete introduction to the ChatGPT API can be viewed in their docs.

Tokens and The Costs of the ChatGPT API

The ChatGPT API uses a usage-based pricing model. At the time of writing, the cost of the API is $0.002 per 1,000 tokens. A token can be thought of as roughly 4 characters or 0.75 words. When you make an API request, the text in the messages parameter is converted into tokens, and the response from the API is counted in tokens as well. This means that the cost of an API request is determined by the number of tokens in your request plus the number of tokens in the response.
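To get a feel for these numbers, here is a back-of-the-envelope estimate based only on the figures above. The helper names are hypothetical, and the 4-characters-per-token ratio is just a rule of thumb, so treat the result as an approximation rather than what you will actually be billed.

```javascript
const PRICE_PER_1K_TOKENS_USD = 0.002; // price quoted above
const CHARS_PER_TOKEN = 4; // rough rule of thumb, not an exact tokenizer

// Hypothetical helper: approximates the token count of a piece of text.
const estimateTokens = (text) => Math.ceil(text.length / CHARS_PER_TOKEN);

// Hypothetical helper: approximates the cost of one request, counting both
// the prompt and the (expected) response.
const estimateRequestCostUsd = (promptText, expectedResponseText = '') => {
    const totalTokens = estimateTokens(promptText) + estimateTokens(expectedResponseText);
    return (totalTokens / 1000) * PRICE_PER_1K_TOKENS_USD;
};
```

For example, a 4,000-character job opening with a 4,000-character response is about 2,000 tokens, or roughly $0.004 per request under these assumptions.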

How to Extract Job Description and Benefits Using ChatGPT

First, you'll extract all the text from the job opening page. You'll then request ChatGPT to extract the job description and benefits from this text:

import axios from 'axios';

const getJobDescription = async (scrapedText) => {
    const apiUrl = 'https://api.openai.com/v1/chat/completions';
    const headers = {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${'YOUR_API_KEY'}`,
    };
    const data = {
        model: 'gpt-3.5-turbo',
        messages: [
            { role: 'system', content: "Your job is to extract information from job openings." },
            { role: 'user', content: `Here is a job opening: ${scrapedText}\n\nExtract the description from the job opening.` },
        ],
    };
    const result = await axios.post(apiUrl, data, { headers });
    return result.data.choices[0].message.content;
}

const getJobBenefits = async (scrapedText) => {
    const apiUrl = 'https://api.openai.com/v1/chat/completions';
    const headers = {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${'YOUR_API_KEY'}`,
    };
    const data = {
        model: 'gpt-3.5-turbo',
        messages: [
            { role: 'system', content: "Your job is to extract information from job openings." },
            { role: 'user', content: `Here is a job opening: ${scrapedText}\n\nExtract the benefits from the job opening.` },
        ],
    };
    const result = await axios.post(apiUrl, data, { headers });
    return result.data.choices[0].message.content;
}
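The two functions above differ only in the field named in the prompt. One possible refactoring, sketched here with a hypothetical helper called buildExtractionRequest, is to build the request body from the field name so that the same code can serve the description, the benefits, or any other field:

```javascript
// Hypothetical helper: builds the request body for extracting one field
// (e.g. 'description' or 'benefits') from the scraped text.
const buildExtractionRequest = (field, scrapedText) => ({
    model: 'gpt-3.5-turbo',
    messages: [
        { role: 'system', content: 'Your job is to extract information from job openings.' },
        {
            role: 'user',
            content: `Here is a job opening: ${scrapedText}\n\nExtract the ${field} from the job opening.`,
        },
    ],
});
```

The returned object can be posted to the same chat completions endpoint, with the same headers, as in the functions above.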

Here is an output example from this job opening:

Description:
    - Arranging meetings and producing meeting minutes;
    - Set up and maintain project files;
    - Collect actuals data and forecasts;
    - Update project plans;
    - Administer or assist the quality review process;
    - Administer or assist Project Board meetings;
    - Assist with the compilation of reports;
    - Contribute expertise in specialist tools and techniques;
    - Maintain the following records: Quality Register, Configuration Item Records, any other registers/logs delegated by the Project Manager;
    - Administer the configuration management procedure.
Requirements:
    - Minimum 3 years of relevant education (bachelor degree or equivalent) after the secondary school.
    - Minimum 2 years of relevant professional experience, of which minimum 1 year experience in IT project support.
    - Experience of working within a project management office utilising Prince 2 or an equivalent project management methodology is needed.
    - Prince 2 foundation qualifications or an equivalent industry standard project management methodology.

Results:

The job involves arranging meetings and producing minutes, maintaining project files, updating project plans and contributing expertise in specialist tools and techniques. The candidate is also expected to collect actuals data and forecasts, administer Project Board meetings, assist with report compilation and maintain records such as Quality Register and Configuration Item Records. Additionally, they are supposed to administer the quality review process and configuration management procedure.

Parsing Salary with ChatGPT

Salary is one of the pieces of information that job seekers care about most. However, parsing it from a job opening is not straightforward. You could consider doing a simple text search for a currency symbol or code (e.g. “$” or “USD”), but this will not work reliably because:

  • Sometimes the job opening includes the company’s funding data. For example, “We’ve raised $800k in pre-seed funding”.
  • Sometimes the job opening includes a learning budget. For example, “annual learning budget of $5,000”.
  • Not all job openings include a salary range.
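To make the problem concrete, here is a small sketch of the naive approach. The function name findDollarAmounts and the sample text are made up for illustration; the pattern only handles dollar amounts with optional thousands separators and a k/m suffix.

```javascript
// Hypothetical helper: naive search for dollar amounts in a job opening.
const findDollarAmounts = (text) =>
    text.match(/\$\d+(?:,\d{3})*(?:\.\d+)?[kKmM]?/g) ?? [];

// Made-up sample that combines the first two cases from the list above.
const sampleOpening =
    'We’ve raised $800k in pre-seed funding. You get an annual learning budget of $5,000.';

const amounts = findDollarAmounts(sampleOpening);
```

Here amounts contains the funding figure and the learning budget, and neither of them is a salary, so a plain text search cannot tell you which amount (if any) is the one you want.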

Thus a simpler way would be to just use ChatGPT again:

import axios from 'axios';

const getJobSalary = async (scrapedText) => {
    const apiUrl = 'https://api.openai.com/v1/chat/completions';
    const headers = {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${'YOUR_API_KEY'}`,
    };
    const data = {
        model: 'gpt-3.5-turbo',
        messages: [
            { role: 'system', content: "Your job is to extract information from job openings." },
            { role: 'user', content: `Here is a job opening: ${scrapedText}\n\nExtract the salary from the job opening.` },
        ],
    };
    const result = await axios.post(apiUrl, data, { headers });
    return result.data.choices[0].message.content;
}

Here is an output example from this job opening:

Company Description:
We are a fast-growing tech startup that provides cutting-edge software solutions for businesses across various industries. Our company has recently raised $10 million in funding from top-tier investors and we are poised for rapid growth. We have a dynamic and collaborative work culture that values innovation, creativity, and excellence.

Job Overview:
We are seeking a talented and experienced Java Developer to join our development team. The successful candidate will be responsible for designing, developing, and maintaining software applications using Java technologies. They will work closely with cross-functional teams to deliver high-quality software solutions that meet customer requirements.

Key Responsibilities:

Design, develop, and maintain Java-based software applications
Write clean, efficient, and well-documented code
Collaborate with cross-functional teams to understand requirements and deliver high-quality software solutions
Participate in code reviews and contribute to the development of best practices
Troubleshoot and debug software issues as needed
Stay up-to-date with emerging trends and technologies in Java development

Requirements:

Bachelor's degree in Computer Science or a related field
3+ years of experience in Java development
Experience with Spring Framework, Hibernate, and SQL
Strong understanding of object-oriented programming principles
Familiarity with agile software development methodologies
Excellent problem-solving and analytical skills
Strong communication and teamwork skills
Salary Range: $90,000-$120,000

If you are passionate about Java development and want to work in a fast-paced and dynamic environment, please apply with your resume and a cover letter highlighting your relevant experience and skills. We look forward to hearing from you!

Results:

The salary range for the Java Developer position is $90,000-$120,000, commensurate with experience and qualifications.

You would still need to post-process this sentence, for example with a regular expression, to get the actual salary range. However, this is a good start.

Let's try to tweak the prompt a bit: "Extract the salary from the job opening. Extract it as a JS variable called salary"
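As an illustration of that post-processing step, the sketch below pulls the dollar amounts out of ChatGPT's answer and returns the lowest and highest as numbers. The function name parseSalaryRange is hypothetical, and the pattern assumes amounts written like $90,000; other currencies or formats such as "90k" would need extra handling.

```javascript
// Hypothetical helper: extracts a numeric salary range from free text.
// Assumes amounts are written in dollars with optional thousands separators.
const parseSalaryRange = (text) => {
    const amounts = (text.match(/\$\d+(?:,\d{3})*/g) ?? [])
        .map((amount) => Number(amount.replace(/[$,]/g, '')));
    if (amounts.length === 0) return null; // no salary found
    return { min: Math.min(...amounts), max: Math.max(...amounts) };
};

const range = parseSalaryRange(
    'The salary range for the Java Developer position is $90,000-$120,000, commensurate with experience and qualifications.'
);
// range is { min: 90000, max: 120000 }
```

Returning null when nothing matches makes it easy to skip openings that do not state a salary.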

And here is the result:

```javascript
const salary = "$90,000-$120,000";
```

Much better!

Conclusion

As we can see, ChatGPT is a great tool to extract information from job openings. It is fast, easy to use and cost-effective. You can use it to extract information from any text, not just job openings.

However, its non-deterministic nature can be a problem, as can the difficulty of getting the output in exactly the format we want.

But with some practice and correct data cleaning and validation, you can get great results!

ChatGPT might be a bit too young to feed data to critical systems, but it's definitely good enough to extract information for a tool that tolerates some noise from time to time.

I hope you learned something new today. If you have any further questions related to web scraping or ChatGPT, feel free to reach out to us! We would love to help you.

Lior Neu-ner

Lior Neu-ner is the founder of Remote Rocketship, a job board for remote tech job openings. You can reach him on LinkedIn and Twitter.