Humanity got itself a huge upgrade by embracing tool use a few million years ago, and now AI is getting the same upgrade. AI can now use various tools on your behalf: it can search the web, turn on your living room lamps, play Pokémon and, of course, drive a browser to scrape data.
A critical link in the interface between AI and software tools is the Model Context Protocol (MCP). It is an open protocol that defines how tools can expose their data and functionality to be used by AI models. It was introduced by Anthropic in November 2024 and now the internet is full of MCP servers that can enable AI to do various things. Recently, OpenAI and Google have announced that they will be supporting MCP for ChatGPT and Gemini respectively. So it looks like MCP is going to be the industry standard.
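To make the idea concrete, here is a minimal sketch of what exposing a tool over MCP looks like, using the FastMCP helper from the official Python SDK (treat the exact API as an assumption that may vary between SDK versions; the rest of this blog only uses ready-made servers):

from mcp.server.fastmcp import FastMCP

# A tiny MCP server that exposes a single tool an AI model could call.
mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

if __name__ == "__main__":
    # Runs the server over stdio so an MCP client (Claude Desktop, Cursor, ...) can connect to it.
    mcp.run()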
Naturally, the next question is: can we use an MCP server that lets AI scrape data from the web without writing any code? In a previous blog, we attempted to 'vibe scrape with ChatGPT', i.e. see if we could get ChatGPT to write a Python scraper for a website. We used the ChatGPT web interface, which gave us a scraping script, but we had to manually pick CSS selectors to get it working. In this blog, we try out some MCPs and see how much closer we can get to a web scraping task that requires no code at all.
About the task itself, we tried to scrape the website smithery.ai, which lists various MCP servers along with usage statistics. We asked Claude 3.7 Sonnet (using the Cursor IDE) to scrape the homepage and save the extracted data to a JSON file. We've detailed our approach and the results below. If you'd like to follow along, please make sure you have an LLM app that supports MCPs (Claude Desktop, Cursor, etc.), with nodejs, npm, and npx installed (don't forget to update them as well; if you're on Windows using PowerShell as your terminal, it can be helpful to upgrade it to version 7+).
Installing The MCPs On Cursor IDE
The primary MCP we used was the Playwright MCP by Execute Automation. This provided a more comprehensive set of utilities compared to the Microsoft Playwright MCP and seemed more usable for scraping. The Execute Automation YouTube Channel also publishes informative videos about testing and automation if you're interested.
We installed it using npm as follows:
npm install -g @executeautomation/playwright-mcp-server --loglevel verbose
The -g argument installs it globally and also makes the package runnable with npx. This command also installs the browsers necessary for Playwright to run, which usually takes a few minutes; the verbose log level setting helps monitor the progress. Next, we also used the filesystem MCP server, which gives AI access to specified directories on the local filesystem, so it could write a JSON file with the scraped data.
We added these two MCPs to the MCP config file so Cursor could provide these tools to its LLMs:
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@executeautomation/playwright-mcp-server"]
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/kd/Desktop"
      ]
    }
  }
}
For Cursor, this file needs to be placed at ~/.cursor/mcp.json on macOS/Linux, or at C:\Users\UserName\.cursor\mcp.json on Windows.
Editing mcp.json on Cursor:
Also, we could skip the installation step for the filesystem MCP, as npx handles this automatically. It's just that the installation takes longer for the Playwright MCP, so it's better to have it ready before firing up the IDE.
Finally, we also enabled auto-run for tool calls in Cursor's Agent mode (Cursor calls this the YOLO mode), so Cursor wouldn't ask for our approval on every MCP tool call.
Once we added the two MCPs, Cursor's AI models had the following tools at their disposal:
| MCP | Tool Name | Description |
|---|---|---|
| Playwright MCP | start_codegen_session | Starts a session to generate test automation code |
| | end_codegen_session | Ends the codegen session |
| | get_codegen_session | Gets information about a codegen session |
| | clear_codegen_session | Clears the codegen session |
| | playwright_navigate | Opens a URL in the browser |
| | playwright_screenshot | Takes a screenshot of the current page or a specified element |
| | playwright_click | Clicks the specified page element |
| | playwright_iframe_click | Clicks the specified element inside an iframe |
| | playwright_fill | Fills the specified input field |
| | playwright_select | Selects the specified element |
| | playwright_hover | Hovers over the specified element |
| | playwright_evaluate | Evaluates JS code in the browser console |
| | playwright_console_logs | Gets logs from the browser console |
| | playwright_close | Closes the browser |
| | playwright_get | Sends an HTTP GET request |
| | playwright_post | Sends an HTTP POST request |
| | playwright_put | Sends an HTTP PUT request |
| | playwright_patch | Sends an HTTP PATCH request |
| | playwright_delete | Sends an HTTP DELETE request |
| | playwright_expect_response | Waits for a response |
| | playwright_assert_response | Waits for a response and validates it when received |
| | playwright_custom_user_agent | Sets a custom user agent |
| | playwright_get_visible_text | Gets the visible text of the loaded page |
| | playwright_get_visible_html | Gets the full HTML content of the loaded page |
| | playwright_go_back | Goes back in the navigation history |
| | playwright_go_forward | Goes forward in the navigation history |
| | playwright_drag | Drags the specified element |
| | playwright_press_key | Presses the specified key |
| | playwright_save_as_pdf | Saves the loaded page as a PDF |
| Filesystem MCP | read_file | Reads the contents of the specified file |
| | read_multiple_files | Reads multiple files at once |
| | write_file | Creates a new file with the specified contents |
| | edit_file | Edits the specified file |
| | create_directory | Creates a folder at the specified path |
| | list_directory | Lists the files in the specified folder |
| | move_file | Moves/renames a file |
| | search_files | Searches for files, works recursively |
| | get_file_info | Gets detailed metadata about a file/directory |
| | list_allowed_directories | Lists the folders the MCP is permitted to access |
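Under the hood, every one of these tools is invoked through the same generic MCP mechanism: the client sends a tools/call JSON-RPC request naming the tool and its arguments. As a rough illustration of what Cursor sends when the model decides to open a page (the framing and transport are handled by Cursor and the MCP SDKs, and the "url" argument name is illustrative of the tool's advertised schema rather than something we verified):

import json

# A minimal sketch of the JSON-RPC message an MCP client sends when a model
# decides to call a tool. The argument name "url" is an assumption taken from
# the Playwright MCP's tool description above.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "playwright_navigate",
        "arguments": {"url": "https://smithery.ai/"},
    },
}

print(json.dumps(tool_call, indent=2))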
Asking Cursor To Scrape Smithery
After some trial and error, we found that the following prompt worked best for scraping the homepage of Smithery:
Use playwright mcp to scrape the list of MCP servers from this page using playwright_get_visible_text https://smithery.ai/
Only scrape the homepage
From the output of playwright_get_visible_text, without writing any scripts, extract the name, URL, description, type (remote/local), icon URL, and the number of downloads at the end of the listing. Save this list to a json file with the URL as the unique key. The URL is of the format https://smithery.ai/server/@{username}/{mcp_name}
Cursor used the claude-3.7-sonnet model to handle this prompt and first opened a browser window to visit the Smithery homepage. As per our instruction, it used the playwright_get_visible_text function to get the text from the page, processed it, and saved it to mcp_servers.json on the Desktop. The GIF below shows how this worked. It took just over a minute.
Results
Upon inspecting the mcp_servers.json file that Cursor produced, we found that it had details about 9 servers, in the following format:
{
  "https://smithery.ai/server/@smithery/toolbox": {
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "6.74k",
    "icon_url": null
  },
  "https://smithery.ai/server/@smithery-ai/server-sequential-thinking": {
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "531.96k",
    "icon_url": null
  },
  "...7 more servers...": ""
}
While the name, description, URL, and type looked alright, the downloads field wasn't parsed to an integer, and the icon URL was null for all the entries. This is because the icon URL is inside an HTML tag attribute and not in the visible text. The full output of the prompt run is below:
I'll help you scrape the MCP servers list from smithery.ai using Playwright. Let me break this down into steps:
- First, I'll start a browser session and navigate to the page
- Get the visible text content
- Parse the text and create a JSON structure
- Save it to a file
Let's begin:
> Called MCP tool playwright_navigate
Now, let's get the visible text content:
> Called MCP tool playwright_get_visible_text
Now I'll close the browser since we have the data we need:
> Called MCP tool playwright_close
Now, I'll parse this data and create a JSON file with the MCP servers information. I'll create a structured JSON file:
> Called MCP tool write_file
I've created a JSON file at /Users/kd/Desktop/mcp_servers.json containing the MCP servers information from the homepage. I focused on the featured servers section since that's what's shown on the homepage. The JSON file is structured with:
- URL as the key (in the format https://smithery.ai/server/@{username}/{mcp_name})
- For each server:
- name
- description
- type (Remote/Local)
- downloads (as shown on the site)
- icon_url (set to null as icons weren't directly visible in the text content)
The file has been saved and includes the 9 featured servers from the homepage. Each entry preserves the exact information as shown on the website, including download counts in their original format (e.g., "6.74k").
Would you like me to modify the format or include additional servers from other sections of the page?
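One gap in the output is easy to close after the fact: the downloads field came back as display strings like "6.74k" rather than numbers. A minimal post-processing sketch (using the /Users/kd/Desktop/mcp_servers.json path from the run above; the parsing logic is our own and only handles the "k"/"M" suffixes we saw):

import json
from pathlib import Path

path = Path("/Users/kd/Desktop/mcp_servers.json")
servers = json.loads(path.read_text())

def parse_downloads(value):
    """Convert display strings like '6.74k' or '1.2M' into integers."""
    if not value:
        return None
    value = value.strip().lower()
    multiplier = 1
    if value.endswith("k"):
        multiplier, value = 1_000, value[:-1]
    elif value.endswith("m"):
        multiplier, value = 1_000_000, value[:-1]
    return int(round(float(value) * multiplier))

for url, info in servers.items():
    # Skip placeholder entries like "...7 more servers...": ""
    if isinstance(info, dict) and "downloads" in info:
        info["downloads"] = parse_downloads(info["downloads"])

path.write_text(json.dumps(servers, indent=2))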
Playwright MCP Scraping Limitations
We ran into some major limitations while trying to scrape Smithery using the Playwright MCP and Cursor. First, due to the limited context window of claude-3.7-sonnet, we couldn't have it analyze the full HTML and had to stick to the visible text, which is far fewer tokens for the LLM to process. As a result, we couldn't extract some details, such as the icon URL. On our first few attempts, it couldn't get the URLs of the linked pages either, because these live in the href attributes of anchor tags in the HTML, not in the visible text. We worked around this by asking it to construct each URL from the name using a provided pattern.
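For comparison, that missing information is straightforward to get once you do drop down to code. Here's a short Playwright (Python) sketch that reads the href and icon URLs straight from the page's attributes; the CSS selectors are our assumptions about Smithery's markup and would need checking against the live page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://smithery.ai/")

    # Hypothetical selector: server cards appear to link to /server/... pages.
    for link in page.query_selector_all('a[href^="/server/"]'):
        href = link.get_attribute("href")
        # Hypothetical: an icon, if present, would be an <img> inside the card.
        icon = link.query_selector("img")
        icon_url = icon.get_attribute("src") if icon else None
        print(href, icon_url)

    browser.close()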
The second limitation was that we couldn't repeat this over multiple pages. When we tried to scrape around 60 URLs from the site using a page counter, Cursor often stopped after 13-14 pages, either prompting us to proceed for each page, writing incomplete data, or just stopping entirely. We also tried this with Claude Desktop, which managed 3-4 pages at best. LLMs are non-deterministic, so they aren't well suited to repeating a task based on a fixed definition; code and APIs still seem to be the best way to accomplish that.
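Repetition like this is exactly where plain code shines. As a minimal sketch (assuming we reuse the server URLs collected in mcp_servers.json earlier and only need each detail page's visible text), a loop does it deterministically, without asking for approval or losing count:

import json
from pathlib import Path
from playwright.sync_api import sync_playwright

# Reuse the URLs Cursor already wrote out; skip placeholder keys.
data = json.loads(Path("/Users/kd/Desktop/mcp_servers.json").read_text())
urls = [u for u in data if u.startswith("https://smithery.ai/server/")]

pages = {}
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for url in urls:  # the loop never gets tired and never stops at page 13
        page.goto(url)
        pages[url] = page.inner_text("body")
    browser.close()

Path("/Users/kd/Desktop/mcp_server_pages.json").write_text(json.dumps(pages, indent=2))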
Cursor also has a built-in hurdle that stops MCP tool runs after 25 tool calls and prompts the user to click a link to continue. This can be worked around with this useful script from GitHub; there are a couple of steps to installing it, but it's well worth it if you want your MCPs to run uninterrupted.
Playwright MCP Scraping Alternative
If you want AI-powered web scraping without the limitations of the Playwright MCP, check out our AI Web Scraping API, which:
- Scales easily
- Enables you to extract data without having to mess about with selectors
- Bypasses Anti-Bot Measures
- Automatically adapts to page layout changes
Give it a spin by grabbing your free API key and 1000 free scraping credits. Read more on our AI-powered Web Scraping API endpoint.
# Install the Python ScrapingBee library:
# `pip install scrapingbee`

from scrapingbee import ScrapingBeeClient
import json

client = ScrapingBeeClient(api_key='INSERT_YOUR_SCRAPINGBEE_API_KEY')

response = client.get(
    'https://smithery.ai/',  # Replace this with the actual URL
    params={
        'ai_query': 'Return a list of servers and their attributes',
        'ai_extract_rules': json.dumps({
            "server url": {
                'type': 'list',
                'description': 'The URL of the server in the format https://smithery.ai/server/@{username}/{mcp_name}'
            },
            "server name": {
                'type': 'list',
                'description': 'The name of the server'
            },
            "server description": {
                'type': 'list',
                'description': 'The description of the server'
            },
            "server type": {
                'type': 'list',
                'description': 'Type of server, either Remote or Local'
            },
            "server downloads": {
                'type': 'list',
                'description': 'Number of downloads associated with each MCP server, typically of this format: 1.2k'
            }
        })
    }
)

# Decode the response content from byte string to regular string
response_content_str = response.content.decode('utf-8', errors='ignore')

# Now parse the string into a JSON object
response_content = json.loads(response_content_str)

# Combine the lists into a list of dictionaries
servers = []
for i in range(len(response_content["server url"])):
    server = {
        "url": response_content["server url"][i],
        "name": response_content["server name"][i],
        "description": response_content["server description"][i],
        "type": response_content["server type"][i],
        "downloads": response_content["server downloads"][i]
    }
    servers.append(server)

# Print the list of servers in JSON format for better readability
print('Response HTTP Status Code: ', response.status_code)
print(json.dumps(servers, indent=2))
Output from the Scrapingbee AI Web Scraping API:
Response HTTP Status Code: 200
[
  {
    "url": "/server/@smithery/toolbox",
    "name": "Toolbox",
    "description": "Toolbox dynamically routes to all MCPs in the Smithery registry based on your agent's need. When an MCP requires configuration, our tool will prompt the user to configure their tool with a callback link.",
    "type": "Remote",
    "downloads": "22.39k"
  },
  {
    "url": "/server/@wonderwhy-er/desktop-commander",
    "name": "Desktop Commander",
    "description": "Execute terminal commands and manage files with diff editing capabilities. Coding, shell and terminal, task automation",
    "type": "Local",
    "downloads": "303.87k"
  },
  {
    "url": "/server/@smithery-ai/server-sequential-thinking",
    "name": "Sequential Thinking",
    "description": "An MCP server implementation that provides a tool for dynamic and reflective problem-solving through a structured thinking process.",
    "type": "Remote",
    "downloads": "133.86k"
  },
  {
    "url": "/server/@browserbasehq/mcp-browserbase",
    "name": "Browserbase",
    "description": "Provides cloud browser automation capabilities using Browserbase, enabling LLMs to interact with web pages, take screenshots, and execute JavaScript in a cloud browser environment.",
    "type": "Remote",
    "downloads": "28.45k"
  },
  {
    "url": "/server/@smithery-ai/github",
    "name": "Github",
    "description": "Access the GitHub API, enabling file operations, repository management, search functionality, and more.",
    "type": "Remote",
    "downloads": "32.87k"
  },
  "...39 more servers..."
]
Conclusion
In this blog, we tried to scrape the Smithery website without actively writing or running code, using the Cursor IDE with a Playwright MCP. We only tried the homepage, and the results were decent: we were able to extract most of the fields we needed, as long as they appeared in the visible text of the page. However, we were unable to effectively scale this approach to multiple pages due to the limitations of this still-new technology.