en.Wedoany.com Reported - On May 25, US-based AI web data platform Thunderbit released a developer API, a Model Context Protocol (MCP) server, and a command-line tool (CLI). The tools convert complex web pages into Markdown or structured data for AI agents, RAG pipelines, knowledge bases, and automation workflows. Thunderbit stated that its platform currently has over 100,000 users.
The release addresses an engineering problem: acquiring web data for AI applications. When building agents, retrieval-augmented generation (RAG) systems, market research tools, lead generation tools, e-commerce data monitoring, and internal automation systems, enterprises often need to extract content from product pages, directory pages, search results, comment sections, price lists, and long-tail web pages. Traditional web scraping relies on CSS selectors, XPath, or parsing rules written for individual websites; once a page's structure changes, the scraper can fail. By extending its web extraction capabilities to a developer API, MCP server, and CLI, Thunderbit intends for them to be integrated more directly into AI applications, automation scripts, and internal enterprise systems.
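The fragility of site-specific rules can be illustrated with a minimal Python sketch. The two page snippets and the XPath rule below are hypothetical examples written for this illustration, not taken from Thunderbit or any real site, and the markup is kept well-formed so the standard library parser can read it.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Two versions of the same hypothetical product page (well-formed markup for simplicity).
PAGE_V1 = "<html><body><div class='price'><span>19.99</span></div></body></html>"
PAGE_V2 = "<html><body><section class='cost'><b>19.99</b></section></body></html>"

# A hand-written, site-specific rule of the kind traditional scrapers depend on.
PRICE_RULE = ".//div[@class='price']/span"


def extract_price(page: str) -> Optional[str]:
    """Return the price text if the rule matches, otherwise None."""
    node = ET.fromstring(page).find(PRICE_RULE)
    return node.text if node is not None else None


if __name__ == "__main__":
    print(extract_price(PAGE_V1))  # the rule matches the original layout
    print(extract_price(PAGE_V2))  # the same rule finds nothing after a redesign
```

The price is still on the redesigned page, but the rule no longer locates it, which is the failure mode the paragraph above describes.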
The core of the release is Thunderbit Distill, an adaptive HTML-to-Markdown engine designed for high-fidelity conversion of complex web pages. Thunderbit disclosed that in internal HTML-to-Markdown evaluations, Distill achieved a score of 0.87 on ROUGE-L, a metric based on the longest common subsequence shared by an output and a reference text. According to the company, Distill generates cleaner and more complete Markdown across page types such as product pages, price lists, directories, search results, and reviews, without requiring individual rules for each website.
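For readers unfamiliar with the metric, the following Python sketch shows one common way to compute a ROUGE-L score. It is an illustration only: Thunderbit has not disclosed its tokenization, reference texts, or F-measure weighting, so the whitespace tokenization and balanced F1 used here are assumptions.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L as a balanced F1 over whitespace tokens (an assumed variant)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the common subsequence
    recall = lcs / len(ref)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference_md = "# Widget\n\nPrice: 19.99\n\nIn stock"
    candidate_md = "# Widget\n\nPrice: 19.99"
    print(round(rouge_l(reference_md, candidate_md), 3))
```

A score of 1.0 means the candidate reproduces the reference token sequence exactly; lower scores indicate missing or extra content relative to the reference.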
The Extract feature is geared towards structured data output. Given a URL and a custom schema, it returns JSON or CSV data for use in databases, spreadsheets, data enrichment tasks, and internal tools. Together, Distill and Extract serve AI agents, RAG, knowledge bases, and content ingestion on one end, and tabular data, business systems, and automation processes on the other. For enterprise AI teams, the value of such tools lies not in simply "scraping web pages," but in keeping web noise such as navigation bars, scripts, ads, and template content out of the input to large models, so that AI systems receive more stable, computable, and reusable data.
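What schema-shaped output means in practice can be sketched in a few lines of Python. The schema, field names, and sample record below are hypothetical and do not reflect Thunderbit's actual API or schema format, which the announcement does not detail; the sketch only shows how a fixed schema turns loosely typed extracted values into consistent JSON or CSV rows.

```python
import csv
import io
import json

# Hypothetical schema: field name -> Python type used to coerce extracted values.
SCHEMA = {"name": str, "price": float, "review_count": int}


def conform(record: dict, schema: dict) -> dict:
    """Keep only schema fields and coerce each value; missing fields become None."""
    row = {}
    for field, typ in schema.items():
        value = record.get(field)
        row[field] = typ(value) if value is not None else None
    return row


def to_csv(rows: list, schema: dict) -> str:
    """Serialize schema-shaped rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(schema), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Raw values as they might come off a page: all strings, plus an unwanted field.
    raw = {"name": "Widget", "price": "19.99", "review_count": "12", "banner": "SALE"}
    rows = [conform(raw, SCHEMA)]
    print(json.dumps(rows))
    print(to_csv(rows, SCHEMA), end="")
```

Because every row has the same fields and types, downstream databases and spreadsheets can consume the output without per-site cleanup.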
The inclusion of the MCP server makes it easier for Thunderbit to enter the agent tool ecosystem. The Model Context Protocol is used to connect AI assistants with external tools, databases, file systems, and business services. By exposing web data acquisition to AI assistants through the MCP server, Thunderbit allows developers to embed web content scraping, field extraction, Markdown conversion, and structured output into workflows in MCP-compatible clients such as Claude Desktop and Cursor. For sales, operations, e-commerce, research, and content teams, this means data wrangling tasks that previously relied on manual copying, browser plugins, or one-off scripts can potentially be incorporated into a repeatable, callable AI toolchain.
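At the wire level, an MCP client invokes a server tool with a JSON-RPC 2.0 request whose method is `tools/call`. The Python sketch below builds such a request; the tool name `distill_page` and its `url` argument are hypothetical placeholders, since the announcement does not list the tool names Thunderbit's server exposes.

```python
import json


def tools_call_request(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for an MCP `tools/call` invocation."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


if __name__ == "__main__":
    # "distill_page" is a hypothetical tool name used only for illustration.
    request = tools_call_request(1, "distill_page", {"url": "https://example.com"})
    print(json.dumps(request, indent=2))
```

In practice, clients such as Claude Desktop or Cursor construct and send these messages themselves; developers usually only register the server in the client's configuration.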
Thunderbit stated that its Chrome extension and web application are already used by sales, e-commerce, research, and operations teams to extract tens of millions of pages monthly. With the developer API, MCP server, and CLI, web extraction capabilities previously geared towards no-code users are now also available to developers and enterprise engineering teams. The company's co-founder and CEO, Shuai Guan, stated that the effectiveness of an AI agent depends on whether it can truly access usable web data, and that Thunderbit aims to transform ever-changing web pages into data that software can reliably use.
The release matters for the enterprise software and intelligent data processing market mainly at the data ingestion layer of AI applications. Once large model applications are deployed, enterprises quickly find it difficult to reliably access external web pages, supplier pages, industry directories, competitor information, public prices, review data, and unstructured web content. If data source quality is unstable, RAG knowledge bases, agent task chains, and automated decision-making processes can all produce noisy results. Thunderbit's choice to launch the API, MCP server, and CLI together suggests that AI software tools are extending from "front-end interaction" to "back-end data pipelines." Developers no longer focus only on model capabilities; they are also beginning to pay attention to whether models receive clean, traceable, and structurally consistent data input.
Subsequent developments to watch include the adoption of Thunderbit's developer tools within AI agent and enterprise RAG systems, feedback on MCP server ecosystem integration, and changes in the usage scale of its web extraction capabilities among e-commerce, sales, research, and operations teams. What can be confirmed at this stage is that Thunderbit has released its developer API, MCP server, and CLI. Public information has not disclosed an enterprise customer list, paying user scale, revenue data, specific model provider costs, or major customer contracts, so the release should not be read as evidence that enterprise-level commercial orders have been confirmed.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com










