Humanity got itself a huge upgrade by embracing tool use a few million years ago, and now AI is getting the same upgrade. AI can now use various tools on your behalf: it can search the web, turn on your living room lamps, play Pokémon and, of course, drive a browser to scrape data.
A critical link in the interface between AI and software tools is the Model Context Protocol (MCP). It is an open protocol that defines how tools can expose their data and functionality to be used by AI models. It was introduced by Anthropic in November 2024 and now the internet is full of MCP servers that can enable AI to do various things. Recently, OpenAI and Google have announced that they will be supporting MCP for ChatGPT and Gemini respectively. So it looks like MCP is going to be the industry standard.
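To make the idea concrete, here is a minimal sketch of what exposing a tool over MCP looks like, using the FastMCP helper from the official Python SDK (treat the exact API as an assumption that may vary between SDK versions; the rest of this blog only uses ready-made servers):

from mcp.server.fastmcp import FastMCP

# A tiny MCP server that exposes a single tool an AI model could call.
mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

if __name__ == "__main__":
    # Runs the server over stdio so an MCP client (Claude Desktop, Cursor, ...) can connect to it.
    mcp.run()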
Naturally, the next question is: can we use an MCP server that lets AI scrape data from the web without writing any code? In a previous blog, we attempted to 'vibe scrape with ChatGPT', i.e. see if we could get ChatGPT to write a Python scraper for a website. We used the ChatGPT web interface, which gave us a scraping script, but we had to manually pick CSS selectors to get it working. In this blog, we try out some MCPs and see how much closer we can get to a web scraping task that requires no code at all.
About the task itself, we tried to scrape the website smithery.ai, which lists various MCP servers along with usage statistics. We asked Claude 3.7 Sonnet (using the Cursor IDE) to scrape the homepage and save the extracted data to a JSON file. We've detailed our approach and the results below. If you'd like to follow along, please make sure you have an LLM app that supports MCPs (Claude Desktop, Cursor, etc.), with nodejs, npm, and npx installed (don't forget to update them as well; if you're on Windows using PowerShell as your terminal, it can be helpful to upgrade it to version 7+).
Installing The MCPs On Cursor IDE
The primary MCP we used was the Playwright MCP by Execute Automation. This provided a more comprehensive set of utilities compared to the Microsoft Playwright MCP and seemed more usable for scraping. The Execute Automation YouTube Channel also publishes informative videos about testing and automation if you're interested.
We installed it using npm as follows:
npm install -g @executeautomation/playwright-mcp-server --loglevel verbose
The -g argument installs it globally and also makes the package runnable with npx. This command also installs the browsers necessary for Playwright to run, which usually takes a few minutes; the verbose log level setting helps monitor the progress. Next, we also used the filesystem MCP server, which gives AI access to specified directories on the local filesystem, so it could write a JSON file with the scraped data.
We added these two MCPs to the MCP config file so Cursor could provide these tools to its LLMs:
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/kd/Desktop"
      ]
    }
  }
}
For Cursor, this file needs to be placed at ~/.cursor/mcp.json on macOS/Linux, or at C:\Users\UserName\.cursor\mcp.json on Windows.
Editing mcp.json on Cursor:
Also, we could skip the installation step for the filesystem MCP, as npx handles this automatically. It's just that the installation takes longer for the Playwright MCP, so it's better to have it ready before firing up the IDE.
Finally, we also enabled auto-run for tool calls in Cursor's Agent mode (Cursor calls this the YOLO mode), so Cursor wouldn't ask for our approval on every MCP tool call.
Once we added the two MCPs, Cursor's AI models had the following tools at their disposal:
| MCP | Tool Name | Description |
|---|---|---|
| Playwright MCP | start_codegen_session | Starts a session to generate test automation code |
| | end_codegen_session | Ends the codegen session |
| | get_codegen_session | Gets information about a codegen session |
| | clear_codegen_session | Clears the codegen session |
| | playwright_navigate | Opens a URL in the browser |
| | playwright_screenshot | Takes a screenshot of the current page or a specified element |
| | playwright_click | Clicks the specified page element |
| | playwright_iframe_click | Clicks the specified element inside an iframe |
| | playwright_fill | Fills the specified input field |
| | playwright_select | Selects the specified element |
| | playwright_hover | Hovers over the specified element |
| | playwright_evaluate | Evaluates JS code in the browser console |
| | playwright_console_logs | Gets logs from the browser console |
| | playwright_close | Closes the browser |
| | playwright_get | Sends an HTTP GET request |
| | playwright_post | Sends an HTTP POST request |
| | playwright_put | Sends an HTTP PUT request |
| | playwright_patch | Sends an HTTP PATCH request |
| | playwright_delete | Sends an HTTP DELETE request |
| | playwright_expect_response | Waits for a response |
| | playwright_assert_response | Waits for a response and validates it when received |
| | playwright_custom_user_agent | Sets a custom user agent |
| | playwright_get_visible_text | Gets the visible text of the loaded page |
| | playwright_get_visible_html | Gets the full HTML content of the loaded page |
| | playwright_go_back | Goes back in the navigation history |
| | playwright_go_forward | Goes forward in the navigation history |
| | playwright_drag | Drags the specified element |
| | playwright_press_key | Presses the specified key |
| | playwright_save_as_pdf | Saves the loaded page as a PDF |
| Filesystem MCP | read_file | Reads the contents of the specified file |
| | read_multiple_files | Reads multiple files at once |
| | write_file | Creates a new file with the specified contents |
| | edit_file | Edits the specified file |
| | create_directory | Creates a folder at the specified path |
| | list_directory | Lists the files in the specified folder |
| | move_file | Moves/renames a file |
| | search_files | Searches for files, works recursively |
| | get_file_info | Gets detailed metadata about a file/directory |
| | list_allowed_directories | Lists the folders the MCP is permitted to access |
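Under the hood, every one of these tools is invoked through the same generic MCP mechanism: the client sends a tools/call JSON-RPC request naming the tool and its arguments. As a rough illustration of what Cursor sends when the model decides to open a page (the framing and transport are handled by Cursor and the MCP SDKs, and the "url" argument name is illustrative of the tool's advertised schema rather than something we verified):

import json

# A minimal sketch of the JSON-RPC message an MCP client sends when a model
# decides to call a tool. The argument name "url" is an assumption taken from
# the Playwright MCP's tool description above.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "playwright_navigate",
        "arguments": {"url": "https://smithery.ai/"},
    },
}

print(json.dumps(tool_call, indent=2))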
Asking Cursor To Scrape Smithery
After some trial and error, we found that the following prompt worked best for scraping the homepage of Smithery:
Use playwright mcp to scrape the list of MCP servers from this page using playwright_get_visible_text https://smithery.ai/
Only scrape the homepage
From the output of playwright_get_visible_text, without writing any scripts, extract the name, URL, description, type (remote/local), icon URL, and the number of downloads at the end of the listing. Save this list to a json file with the URL as the unique key. The URL is of the format https://smithery.ai/server/@{username}/{mcp_name}
Cursor used the claude-3.7-sonnet model to handle this prompt and first opened a browser window to visit the Smithery homepage. As per our instruction, it used the playwright_get_visible_text function to get the text from the page, processed it, and saved it to mcp_servers.json on the Desktop. The GIF below shows how this worked. It took just over a minute.
Results
Upon inspecting the mcp_servers.json file that Cursor produced, we found that it had details about 9 servers, in the following format:
{
  "https://smithery.ai/server/@smithery/toolbox": {
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "6.74k",
    "icon_url": null
  },
  "https://smithery.ai/server/@smithery-ai/server-sequential-thinking": {
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "531.96k",
    "icon_url": null
  },
  "...7 more servers...": ""
}
While the name, description, URL, and type looked alright, the downloads field wasn't parsed to an integer, and the icon URL was null for all the entries. This is because the icon URL is inside an HTML tag attribute and not in the visible text. The full output of the prompt run is below:
I'll help you scrape the MCP servers list from smithery.ai using Playwright. Let me break this down into steps:
- First, I'll start a browser session and navigate to the page
- Get the visible text content
- Parse the text and create a JSON structure
- Save it to a file
Let's begin:
> Called MCP tool playwright_navigate
Now, let's get the visible text content:
> Called MCP tool playwright_get_visible_text
Now I'll close the browser since we have the data we need:
> Called MCP tool playwright_close
Now, I'll parse this data and create a JSON file with the MCP servers information. I'll create a structured JSON file:
> Called MCP tool write_file
I've created a JSON file at /Users/kd/Desktop/mcp_servers.json containing the MCP servers information from the homepage. I focused on the featured servers section since that's what's shown on the homepage. The JSON file is structured with:
- URL as the key (in the format https://smithery.ai/server/@{username}/{mcp_name})
- For each server:
- name
- description
- type (Remote/Local)
- downloads (as shown on the site)
- icon_url (set to null as icons weren't directly visible in the text content)
The file has been saved and includes the 9 featured servers from the homepage. Each entry preserves the exact information as shown on the website, including download counts in their original format (e.g., "6.74k").
Would you like me to modify the format or include additional servers from other sections of the page?
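One gap in the output is easy to close after the fact: the downloads field came back as display strings like "6.74k" rather than numbers. A minimal post-processing sketch (using the /Users/kd/Desktop/mcp_servers.json path from the run above; the parsing logic is our own and only handles the "k"/"M" suffixes we saw):

import json
from pathlib import Path

path = Path("/Users/kd/Desktop/mcp_servers.json")
servers = json.loads(path.read_text())

def parse_downloads(value):
    """Convert display strings like '6.74k' or '1.2M' into integers."""
    if not value:
        return None
    value = value.strip().lower()
    multiplier = 1
    if value.endswith("k"):
        multiplier, value = 1_000, value[:-1]
    elif value.endswith("m"):
        multiplier, value = 1_000_000, value[:-1]
    return int(round(float(value) * multiplier))

for url, info in servers.items():
    # Skip placeholder entries like "...7 more servers...": ""
    if isinstance(info, dict) and "downloads" in info:
        info["downloads"] = parse_downloads(info["downloads"])

path.write_text(json.dumps(servers, indent=2))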
Playwright MCP Scraping Limitations
We ran into some major limitations while trying to scrape Smithery using the Playwright MCP and Cursor. First, due to the limited context window of claude-3.7-sonnet, we couldn't have it analyze the full HTML and had to stick to the visible text, which is far fewer tokens for the LLM to process. As a result, we couldn't extract some details, such as the icon URL. On our first few attempts, it couldn't get the URLs of the linked pages either, because these live in the href attributes of anchor tags in the HTML, not in the visible text. We worked around this by asking it to construct each URL from the name using a provided pattern.
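For comparison, that missing information is straightforward to get once you do drop down to code. Here's a short Playwright (Python) sketch that reads the href and icon URLs straight from the page's attributes; the CSS selectors are our assumptions about Smithery's markup and would need checking against the live page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://smithery.ai/")

    # Hypothetical selector: server cards appear to link to /server/... pages.
    for link in page.query_selector_all('a[href^="/server/"]'):
        href = link.get_attribute("href")
        # Hypothetical: an icon, if present, would be an <img> inside the card.
        icon = link.query_selector("img")
        icon_url = icon.get_attribute("src") if icon else None
        print(href, icon_url)

    browser.close()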
The second limitation was that we couldn't repeat this over multiple pages. When we tried to scrape around 60 URLs from the site using a page counter, Cursor often stopped after 13-14 pages, either prompting us to proceed for each page, writing incomplete data, or just stopping entirely. We also tried this with Claude Desktop, which managed 3-4 pages at best. LLMs are non-deterministic, so they aren't well suited to repeating a task based on a fixed definition; code and APIs still seem to be the best way to accomplish that.
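Repetition like this is exactly where plain code shines. As a minimal sketch (assuming we reuse the server URLs collected in mcp_servers.json earlier and only need each detail page's visible text), a loop does it deterministically, without asking for approval or losing count:

import json
from pathlib import Path
from playwright.sync_api import sync_playwright

# Reuse the URLs Cursor already wrote out; skip placeholder keys.
data = json.loads(Path("/Users/kd/Desktop/mcp_servers.json").read_text())
urls = [u for u in data if u.startswith("https://smithery.ai/server/")]

pages = {}
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for url in urls:  # the loop never gets tired and never stops at page 13
        page.goto(url)
        pages[url] = page.inner_text("body")
    browser.close()

Path("/Users/kd/Desktop/mcp_server_pages.json").write_text(json.dumps(pages, indent=2))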
Cursor also has a built-in hurdle that stops MCP tool runs after 25 tool calls and prompts the user to click a link to continue. This can be worked around with this useful script from GitHub; there are a couple of steps to installing it, but it's well worth it if you want your MCPs to run uninterrupted.
Playwright MCP Scraping Alternative
If you want AI-powered web scraping without the limitations of the Playwright MCP, check out our AI Web Scraping API, which:
- Scales easily
- Enables you to extract data without having to mess about with selectors
- Bypasses Anti-Bot Measures
- Automatically adapts to page layout changes
Give it a spin by grabbing your free API key and 1000 free scraping credits. Read more on our AI-powered Web Scraping API endpoint.
# Install the Python ScrapingBee library:
# `pip install scrapingbee`

from scrapingbee import ScrapingBeeClient
import json

client = ScrapingBeeClient(api_key='INSERT_YOUR_SCRAPINGBEE_API_KEY')

response = client.get(
    'https://smithery.ai/',  # Replace this with the actual URL
    params={
        'ai_query': 'Return a list of servers and their attributes',
        'ai_extract_rules': json.dumps({
            "server url": {
                'type': 'list',
                'description': 'The URL of the server in the format https://smithery.ai/server/@{username}/{mcp_name}'
            },
            "server name": {
                'type': 'list',
                'description': 'The name of the server'
            },
            "server description": {
                'type': 'list',
                'description': 'The description of the server'
            },
            "server type": {
                'type': 'list',
                'description': 'Type of server, either Remote or Local'
            },
            "server downloads": {
                'type': 'list',
                'description': 'Number of downloads associated with each MCP server, typically of this format: 1.2k'
            }
        })
    }
)

# Decode the response content from byte string to regular string
response_content_str = response.content.decode('utf-8', errors='ignore')

# Now parse the string into a JSON object
response_content = json.loads(response_content_str)

# Combine the lists into a list of dictionaries
servers = []
for i in range(len(response_content["server url"])):
    server = {
        "url": response_content["server url"][i],
        "name": response_content["server name"][i],
        "description": response_content["server description"][i],
        "type": response_content["server type"][i],
        "downloads": response_content["server downloads"][i]
    }
    servers.append(server)

# Print the list of servers in JSON format for better readability
print('Response HTTP Status Code: ', response.status_code)
print(json.dumps(servers, indent=2))
Output from the Scrapingbee AI Web Scraping API:
Response HTTP Status Code: 200
[
  {
    "url": "/server/@smithery/toolbox",
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "22.39k"
  },
  {
    "url": "/server/@wonderwhy-er/desktop-commander",
    "name": "Desktop Commander",
    "description": "Execute terminal commands and manage files with diff editing capabilities. Coding, shell and terminal, task automation",
    "type": "Local",
    "downloads": "303.87k"
  },
  {
    "url": "/server/@smithery-ai/server-sequential-thinking",
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "133.86k"
  },
  {
    "url": "/server/@browserbasehq/mcp-browserbase",
    "name": "Browserbase",
    "description": "Provides cloud browser automation capabilities using Browserbase, enabling LLMs to interact with web pages, take screenshots, and execute JavaScript in a cloud browser environment.",
    "type": "Remote",
    "downloads": "28.45k"
  },
  {
    "url": "/server/@smithery-ai/github",
    "name": "Github",
    "description": "Access the GitHub API, enabling file operations, repository management, search functionality, and more.",
    "type": "Remote",
    "downloads": "32.87k"
  },
  "...39 more servers..."
]
Conclusion
In this blog, we tried to scrape the Smithery website without actively writing or running code, using the Cursor IDE with a Playwright MCP. We only tried the homepage, and the results were decent: we were able to extract most of the fields we needed, as long as they appeared in the visible text of the page. However, we were unable to effectively scale this approach to multiple pages due to the limitations of this still-new technology.