Baidu Restricts Google and Bing from Accessing Content Amid AI Data Needs

In a move that highlights the growing importance of data in the artificial intelligence (AI) era, Chinese internet giant Baidu has taken steps to limit access to its content by major search engines Google and Bing.

The Beijing-based company recently updated the robots.txt file for Baidu Baike, its Wikipedia-like online encyclopedia. The change, implemented on August 8, tells Google's and Microsoft's search engine crawlers not to access content on the platform, which in turn keeps them from indexing it.
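To illustrate how a robots.txt rule of this kind works, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, the `baike.example` URL, and the `crawler_access` helper are hypothetical stand-ins chosen for illustration; they are not the contents of Baidu Baike's actual file, which may be structured differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real Baidu Baike robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt, agents, url):
    """Return, for each user agent, whether the rules permit fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# baike.example is a placeholder domain, not a real endpoint.
result = crawler_access(
    ROBOTS_TXT,
    ["Googlebot", "Bingbot", "Baiduspider"],
    "https://baike.example/item/1",
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, the two named crawlers are refused everywhere on the site while any other crawler falls through to the wildcard entry. A robots.txt file is a request rather than an enforcement mechanism: it works only because well-behaved crawlers choose to honor it.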

Baidu Baike, which has nearly 30 million entries, is a valuable repository of information. By comparison, the Chinese-language version of Wikipedia contains only 1.43 million entries, roughly one-twentieth as many. That gap in content volume underscores the significance of Baidu's decision.

The move comes at a time when AI developers are increasingly seeking large datasets to train their models. Many companies are striking deals with content publishers to gain access to quality information for their generative AI projects.

For instance, OpenAI recently partnered with Time magazine, gaining access to over a century's worth of archived content. Similarly, Reddit has entered into a multimillion-dollar agreement with Google, allowing the tech giant to scrape data from its platform for AI training purposes.

Baidu's decision to restrict access to its content aligns with a growing trend among tech companies to protect their data assets. Last year, Microsoft reportedly threatened to cut off access to its internet search data for rival companies using it to train chatbots and other AI services.

As of Friday, some Baidu Baike entries were still appearing in Google and Bing search results, likely due to older cached content. However, the long-term impact of this restriction could be significant for these search engines and their AI development efforts.

The development shows how much value the AI industry now places on data, and how companies are moving to control and monetize their information assets. As the race for AI supremacy continues, access to vast and diverse datasets is becoming a critical factor in gaining a competitive edge.