Playwright MCP - Scraping Smithery MCP database Tutorial with Cursor

28 April 2025 | 11 min read

Humanity got itself a huge upgrade by embracing tool use a few million years ago and now AI is getting that upgrade. AI is now able to use various tools for you. For example, it can search the web, turn on your living room lamps, play Pokémon and of course use browsers to scrape data.

A critical link in the interface between AI and software tools is the Model Context Protocol (MCP). It is an open protocol that defines how tools can expose their data and functionality to AI models. It was introduced by Anthropic in November 2024, and the internet is now full of MCP servers that enable AI to do all sorts of things. Recently, OpenAI and Google announced that they will support MCP in ChatGPT and Gemini, respectively, so it looks like MCP is set to become the industry standard.
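
To make that concrete, here's roughly what exposing a tool over MCP looks like with the official Python SDK (installable with pip install mcp). The server name and tool below are just a toy example, not something we use later in this post:

# toy_mcp_server.py - a minimal MCP server exposing a single tool (illustrative only)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

if __name__ == "__main__":
    # Serves the tool over stdio so an MCP client (Claude Desktop, Cursor, etc.) can call it
    mcp.run()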

Naturally, the next question is: can we use an MCP server to let AI scrape data from the web without writing any code? In a previous blog, we attempted to 'vibe scrape with ChatGPT', i.e. see if we could get ChatGPT to write a Python scraper for a website. The ChatGPT web interface did give us a scraping script, but we had to manually pick CSS selectors to make it work. In this blog, we try out some MCPs to see how much closer we can get to completing a web scraping task without writing code at all.

As for the task itself, we tried to scrape the website smithery.ai, which lists various MCP servers along with usage statistics. We asked Claude 3.7 Sonnet (via the Cursor IDE) to scrape the homepage and save the extracted data to a JSON file. We've detailed our approach and the results below. If you'd like to follow along, make sure you have an LLM app that supports MCPs (Claude Desktop, Cursor, etc.) and that Node.js, npm, and npx are installed and up to date. If you're on Windows and use PowerShell as your terminal, upgrading it to version 7+ can help.

The Smithery home page we tried scraping

Installing The MCPs On Cursor IDE

The primary MCP we used was the Playwright MCP by Execute Automation. This provided a more comprehensive set of utilities compared to the Microsoft Playwright MCP and seemed more usable for scraping. The Execute Automation YouTube Channel also publishes informative videos about testing and automation if you're interested.

We installed it using npm as follows:

npm install -g @executeautomation/playwright-mcp-server --loglevel verbose

The -g argument installs it globally and also makes the package runnable with npx. This command also installs the browsers necessary for Playwright to run, which usually takes a few minutes; the verbose log level helps monitor the progress. Next, we also used the filesystem MCP server, which gives the AI access to specified directories on the local filesystem, so it could write a JSON file with the scraped data.

We added these two MCPs to the MCP config file so Cursor could provide these tools to its LLMs:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/kd/Desktop"
      ]
    }
  }
}

For Cursor, this file needs to be placed at ~/.cursor/mcp.json on MacOS/Linux, or at C:\Users\UserName\.cursor\mcp.json on Windows.

Editing mcp.json on Cursor:

Adding MCPs to Cursor

Also, we could skip the installation step for the filesystem MCP, as npx would handle this automatically. It's just that the installation takes longer for Playwright MCP, and it's better to have it ready before firing up the IDE.

Finally, we enabled Agent Mode in Cursor so that it wouldn't ask for our approval on every MCP tool call. Cursor calls this YOLO mode.

Once we added the two MCPs, Cursor's AI models had the following tools at their disposal:

Playwright MCP

Tool Name | Description
start_codegen_session | Starts a session to generate test automation code
end_codegen_session | Ends the codegen session
get_codegen_session | Gets information about a codegen session
clear_codegen_session | Clears the codegen session
playwright_navigate | Opens a URL in the browser
playwright_screenshot | Takes a screenshot of the current page or a specified element
playwright_click | Clicks on a specified page element
playwright_iframe_click | Clicks a specified element inside an iframe
playwright_fill | Fills a specified input field
playwright_select | Selects the specified element
playwright_hover | Hovers over a specified element
playwright_evaluate | Evaluates JS code in the browser console
playwright_console_logs | Gets logs from the browser console
playwright_close | Closes the browser
playwright_get | Sends an HTTP GET request
playwright_post | Sends an HTTP POST request
playwright_put | Sends an HTTP PUT request
playwright_patch | Sends an HTTP PATCH request
playwright_delete | Sends an HTTP DELETE request
playwright_expect_response | Waits for a response
playwright_assert_response | Waits for a response and validates it when received
playwright_custom_user_agent | Sets a custom user agent
playwright_get_visible_text | Gets the visible text of the loaded page
playwright_get_visible_html | Gets the full HTML content of the loaded page
playwright_go_back | Goes back in the navigation history
playwright_go_forward | Goes forward in the navigation history
playwright_drag | Drags the specified element
playwright_press_key | Presses the specified key
playwright_save_as_pdf | Saves the loaded page as a PDF

Filesystem MCP

Tool Name | Description
read_file | Reads the contents of a specified file
read_multiple_files | Reads multiple files at once
write_file | Creates a new file with specified contents
edit_file | Edits the specified file
create_directory | Creates a folder at the specified path
list_directory | Lists the files in the specified folder
move_file | Moves or renames a file
search_files | Searches for files, works recursively
get_file_info | Gets detailed metadata about a file or directory
list_allowed_directories | Lists the folders the MCP is permitted to access

Asking Cursor To Scrape Smithery

After some trial and error, we found that the following prompt worked best for scraping the homepage of Smithery:

Use playwright mcp to scrape the list of MCP servers from this page using playwright_get_visible_text https://smithery.ai/

Only scrape the homepage

From the output of playwright_get_visible_text, without writing any scripts, extract the name, URL, description, type (remote/local), icon URL, and the number of downloads at the end of the listing. Save this list to a json file with the URL as the unique key. The URL is of the format https://smithery.ai/server/@{username}/{mcp_name}

Cursor used the claude-3.7-sonnet model to handle this prompt and first opened a browser window to visit the Smithery homepage. As per our instructions, it then used the playwright_get_visible_text tool to get the text from the page, processed it, and saved it to mcp_servers.json on the Desktop. The GIF below shows how this worked; the whole run took just over a minute.

Smithery Scraping With Playwright MCP and Cursor

Results

Upon inspecting the mcp_servers.json file that Cursor produced, we found that the file had details about 9 servers, in the following format:

{
 "https://smithery.ai/server/@smithery/toolbox": {
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "6.74k",
    "icon_url": null
  },
  "https://smithery.ai/server/@smithery-ai/server-sequential-thinking": {
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "531.96k",
    "icon_url": null
  },
  "...7 more servers...": ""
}

While the name, description, URL, and type looked alright, the downloads field wasn't parsed to an integer, and the icon URL was null for all the entries. This is because the icon URL is inside an HTML tag attribute and not in the visible text. The full output of the prompt run is below:

I'll help you scrape the MCP servers list from smithery.ai using Playwright. Let me break this down into steps:

  1. First, I'll start a browser session and navigate to the page
  2. Get the visible text content
  3. Parse the text and create a JSON structure
  4. Save it to a file

Let's begin:

> Called MCP tool playwright_navigate

Now, let's get the visible text content:

> Called MCP tool playwright_get_visible_text

Now I'll close the browser since we have the data we need:

> Called MCP tool playwright_close

Now, I'll parse this data and create a JSON file with the MCP servers information. I'll create a structured JSON file:

> Called MCP tool write_file

I've created a JSON file at /Users/kd/Desktop/mcp_servers.json containing the MCP servers information from the homepage. I focused on the featured servers section since that's what's shown on the homepage. The JSON file is structured with:

The file has been saved and includes the 9 featured servers from the homepage. Each entry preserves the exact information as shown on the website, including download counts in their original format (e.g., "6.74k").

Would you like me to modify the format or include additional servers from other sections of the page?
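
One thing that doesn't need an LLM at all is normalizing the downloads field mentioned above. Here is a minimal post-processing sketch, assuming the counts always look like "6.74k" or "531.96k" as in the output we got:

import json

def parse_downloads(count: str) -> int:
    """Convert strings like '6.74k' or '531.96k' into integers."""
    count = count.strip().lower()
    if count.endswith("k"):
        return int(float(count[:-1]) * 1000)
    return int(float(count))

# Normalize the file Cursor wrote to the Desktop
with open("mcp_servers.json") as f:
    servers = json.load(f)

for url, details in servers.items():
    if isinstance(details, dict) and "downloads" in details:
        details["downloads"] = parse_downloads(details["downloads"])

print(json.dumps(servers, indent=2))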

Playwright MCP Scraping Limitations

We ran into major limitations while trying to scrape Smithery using Playwright MCP and Cursor. Due to the limited context window of claude-3.7-sonnet, we couldn't have it analyze the full HTML and had to stick to visible text, which is far fewer tokens for the LLM to process. As a result, we couldn't extract some details, such as the icon URL. On our first few attempts, it couldn't get the URLs of linked pages either, because these live in the href attributes of anchor tags in the HTML, not in the visible text. We worked around this by asking it to construct the URL from the name using a provided pattern.
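
To illustrate why visible text isn't enough: the link target and the icon URL only exist as attributes in the markup, which playwright_get_visible_text never sees. A quick sketch with a made-up listing snippet (the class names and structure here are hypothetical, not Smithery's actual HTML):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical markup for a single listing card - not Smithery's real HTML
html = """
<a href="/server/@smithery/toolbox" class="server-card">
  <img src="https://example.com/icons/toolbox.png" alt="Toolbox icon">
  <h3>Toolbox</h3>
</a>
"""

card = BeautifulSoup(html, "html.parser").select_one("a.server-card")
print(card.get_text(strip=True))       # visible text: just "Toolbox"
print(card["href"])                    # the URL lives in an attribute
print(card.select_one("img")["src"])   # so does the icon URL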

Second, we couldn't repeat this over multiple pages. When we tried to scrape around 60 URLs from the site using a page counter, Cursor often stopped at 13-14 pages, either prompting us to proceed for each page, writing incomplete data, or just stopping entirely. We also tried this with Claude Desktop, which could manage 3-4 pages at best. LLMs are non-deterministic, so they are not well suited to repeating a task from a fixed definition. Code and APIs still seem to be the best way to accomplish that.
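
This is where plain code still wins: a loop doesn't lose patience after a dozen iterations. Below is a rough sketch of the same crawl as a conventional Playwright script, assuming for illustration that the site accepts a ?page= query parameter (Smithery's real pagination may differ):

from playwright.sync_api import sync_playwright  # pip install playwright && playwright install

pages_text = []
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for page_no in range(1, 15):
        # Hypothetical pagination URL - adjust to the site's real scheme
        page.goto(f"https://smithery.ai/?page={page_no}")
        # Same idea as the playwright_get_visible_text tool
        pages_text.append(page.inner_text("body"))
    browser.close()

print(f"Fetched {len(pages_text)} pages of visible text")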

Cursor also has a built-in hurdle that stops MCP tool runs after 25 tool calls and prompts the user to click a link to continue. This can be worked around with a useful script from GitHub; there are a couple of steps to installing it, but it's well worth it if you want your MCPs to run uninterrupted.

Playwright MCP Scraping Alternative

If you want AI-powered web scraping without the limitations of the Playwright MCP, then check out our AI Web Scraping API, which:

  • Scales easily
  • Enables you to extract data without having to mess about with selectors
  • Bypasses Anti-Bot Measures
  • Automatically adapts to page layout changes

Give it a spin by grabbing your free API key and 1000 free scraping credits. Read more on our AI-powered Web Scraping API endpoint.

# Install the Python ScrapingBee library:
# `pip install scrapingbee`
from scrapingbee import ScrapingBeeClient
import json

client = ScrapingBeeClient(api_key='INSERT_YOUR_SCRAPINGBEE_API_KEY')

response = client.get(
    'https://smithery.ai/',  # The page we want to scrape
    params={
        'ai_query': 'Return a list of servers and their attributes',
        'ai_extract_rules': json.dumps({
            "server url": {
                'type': 'list',
                'description': 'The URL of the server in the format https://smithery.ai/server/@{username}/{mcp_name}'
            },
            "server name": {
                'type': 'list',
                'description': 'The name of the server'
            },
            "server description": {
                'type': 'list',
                'description': 'The description of the server'
            },
            "server type": {
                'type': 'list',
                'description': 'Type of server, either Remote or Local'
            },
            "server downloads": {
                'type': 'list',
                'description': 'Number of Downloads associated with each mcp server, its typically of this format: 1.2k'
            }
        })
    }
)

# Decode the response content from byte string to regular string
response_content_str = response.content.decode('utf-8', errors='ignore')

# Now parse the string into a JSON object
response_content = json.loads(response_content_str)

# Combine the lists into a list of dictionaries
servers = []
for i in range(len(response_content["server url"])):
    server = {
        "url": response_content["server url"][i],
        "name": response_content["server name"][i],
        "description": response_content["server description"][i],
        "type": response_content["server type"][i],
        "downloads": response_content["server downloads"][i]
    }
    servers.append(server)

# Print the list of servers in a JSON format for better readability
print('Response HTTP Status Code: ', response.status_code)
print(json.dumps(servers, indent=2))

Output from the Scrapingbee AI Web Scraping API:

Response HTTP Status Code:  200
[
  {
    "url": "/server/@smithery/toolbox",
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "22.39k"
  },
  {
    "url": "/server/@wonderwhy-er/desktop-commander",
    "name": "Desktop Commander",
    "description": "Execute terminal commands and manage files with diff editing capabilities. Coding, shell and terminal, task automation",
    "type": "Local",
    "downloads": "303.87k"
  },
  {
    "url": "/server/@smithery-ai/server-sequential-thinking",
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "133.86k"
  },
  {
    "url": "/server/@browserbasehq/mcp-browserbase",
    "name": "Browserbase",
    "description": "Provides cloud browser automation capabilities using Browserbase, enabling LLMs to interact with web pages, take screenshots, and execute JavaScript in a cloud browser environment.",
    "type": "Remote",
    "downloads": "28.45k"
  },
  {
    "url": "/server/@smithery-ai/github",
    "name": "Github",
    "description": "Access the GitHub API, enabling file operations, repository management, search functionality, and more.",
    "type": "Remote",
    "downloads": "32.87k"
  },
  "...39 more servers..."
]

Conclusion

In this blog, we tried to scrape the Smithery website without actively writing or running code, using the Cursor IDE with a Playwright MCP. We tried it with just the homepage and we could say that the results were decent. We were able to extract most of the fields that we needed as long as the fields were in the visible text of the page. However, we were unable to effectively scale this approach to multiple pages due to limitations of the new technology.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.