The game between websites and crawlers: can a small robots.txt file still protect the data?

Mondo Technology Updated on 2024-02-20

For more than thirty years, a simple text file, robots.txt, has kept order on the web by governing the behavior of web crawlers. But with the rise of AI companies that collect vast amounts of data to train their models, the web's "social contract" is on the verge of collapse. As website owners and AI companies square off, the future of robots.txt is uncertain.

For decades, a tiny text file has silently guarded the order of the Internet. It has no legal or technical authority, and it isn't even technically sophisticated. It represents a "handshake agreement" among the early pioneers of the Internet to respect each other's wishes and work together to build a network that is good for everyone. It can fairly be called a miniature constitution of the Internet.

It's called robots.txt, and it usually lives in the root directory of your website. This file allows any website owner, big or small, whether it's a cooking blog or a multinational corporation, to tell the internet who can get in and who can't. Which search engines can index your site? Which archive projects can crawl your pages and save a copy? Can competitors monitor your pages for their own use? You get to decide, and to announce that decision to the web.

It's not a perfect system, but it worked well, at least until recently. For decades, the main focus of robots.txt was search engines: you allowed them to scrape your website, and in exchange they promised to send people back to you. Now AI has changed that formula: companies across the web are using your site and its data to build massive training datasets, in order to build models and products that may not acknowledge your existence at all.

The robots.txt file governs an exchange; to many people, AI feels like all take and no give. But the money in AI is now so large, and the technology is changing so fast, that many website owners can't keep up. And the underlying agreement behind robots.txt, and behind the web as a whole, which has long amounted to "everyone stays calm," may not be able to keep up either.

In the early days of the internet, bots went by many names: spider, crawler, worm, WebAnt, web crawler. Most of the time, they were built with good intentions. Usually it was a developer trying to build a cool new directory, make sure their own site was up and running, or build a research database. This was around 1993, when search engines weren't yet widespread and you could fit most of the internet on your computer's hard drive.

The only real problem at the time was traffic: accessing the internet was slow and expensive, both for the person viewing a page and for the person hosting it. If, like many people, you hosted your site on your own computer or ran hastily built server software over your home internet connection, it took only a few bots overzealously downloading your pages for things to crash and your bills to skyrocket.

Over a few months in 1994, software engineer and developer Martijn Koster worked with other web administrators and developers to come up with a solution they called the Robots Exclusion Protocol. The proposal was fairly simple: it asked web developers to add a plain-text file to their domain specifying which bots were not allowed to scour their site, or listing pages that were off limits to all bots. (Again, at the time you could maintain a list of every bot in existence; Koster and a few others helped do just that.) For bot makers, the deal was even simpler: respect the wishes of the text file.

From the beginning, Koster made it clear that he didn't hate bots and didn't plan to get rid of them. "Bots are one of the few aspects of the web that cause operational problems and frustration," he wrote in an initial email in early 1994 to a mailing list called www-talk, which included early web pioneers such as Tim Berners-Lee and Marc Andreessen. "At the same time, they do provide useful services." Koster warned against arguing about whether bots are good or bad, because it doesn't matter: they're already here and aren't going away. He was simply trying to design a system that "minimized the problem and possibly maximized the benefits."

By the summer of that year, his proposal had become a standard: not an official one, but a more or less universally accepted one. Koster wrote to the www-talk group again in June with an update. "In short, it's a way to steer bots away from certain areas of the web server's URL space by providing a simple text file on the server," he wrote. "This is especially handy if you have large archives, CGI scripts with a lot of URL subtrees, temporary information, or just don't want to serve a bot." He set up a topic-specific mailing list, whose members agreed on some basic syntax and structure for those text files and changed the file's name from RobotsNotWanted.txt to a simple robots.txt. Almost everyone agreed to support it.

Over the next 30 years, this worked just fine.

But the internet no longer fits on a hard drive, and bots have become far more powerful. Google uses them to crawl and index the entire web for its search engine, which has become the interface to the internet and brings the company billions of dollars in revenue every year. Bing's crawlers do the same, and Microsoft licenses its database to other search engines and companies. The Internet Archive uses a crawler to store web pages for future generations. Amazon's crawlers scour the web for product information, which, according to a recent antitrust lawsuit, the company uses to penalize sellers who offer better deals outside of Amazon. AI companies like OpenAI are scraping the web to train large language models that could once again fundamentally change the way we access and share information.

The ability to store, organize, and query the modern internet makes the world's accumulated knowledge accessible to any company or developer. Over the past year or so, the rise of AI products like ChatGPT, and of the large language models behind them, has made high-quality training data one of the most valuable commodities on the internet. This has led providers of all kinds to rethink the value of the data on their servers and to reconsider who gets access to what. Being too permissive can drain your website of all its value; being too restrictive can make you invisible. And you have to keep making that choice as new companies, new partners, and new stakeholders arrive.

There are several types of internet bots. You can build a completely innocuous bot that crawls around and makes sure all your page links still point to live pages; you can send a much sketchier bot around the web to harvest every email address or phone number it can find. But the most common, and by far the most controversial, is the simple web crawler. Its job is to find and download as much of the internet as it possibly can.

Web crawlers are usually fairly straightforward. They start with a well-known website, such as cnn.com, wikipedia.org, or health.gov. (If you're running a general-purpose search engine, you'll start with a large number of high-quality domains on different topics; if you only care about sports or cars, you'll start only with car sites.) The crawler downloads that first page and stores it somewhere, then automatically follows every link on that page, downloads those pages, follows every link on each of them, and spreads across the web that way. With enough time and enough computing resources, a crawler will eventually find and download billions of web pages.
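
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link graph FAKE_WEB, the example.org URLs, and the helper names crawl and fake_fetch_links are all hypothetical, and the "fetch" step reads from an in-memory dictionary rather than the network.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
    "https://example.org/b": ["https://example.org/"],
    "https://example.org/c": [],
}


def fake_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit the seeds, then every link they lead to."""
    seen = set(seeds)      # URLs already queued, so no page is visited twice
    queue = deque(seeds)   # URLs waiting to be visited
    stored = []            # pages "downloaded", in visiting order
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        stored.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


visited = crawl(["https://example.org/"], fake_fetch_links)
print(visited)
```

A real crawler would replace fake_fetch_links with an HTTP fetch plus HTML link extraction, and a polite one would consult each site's robots.txt before requesting a page.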

Google estimated in 2019 that more than 500 million websites had a robots.txt page indicating whether, and what, these crawlers could access. The structure of these pages is usually roughly the same: each one names a "User-agent," which is the name a crawler uses to identify itself to the server. Google's is Googlebot; Amazon's is Amazonbot; Bing's is Bingbot; OpenAI's is GPTBot. Pinterest, LinkedIn, Twitter, and many other sites and services have bots of their own, and not every bot is mentioned on every page. (Wikipedia and Facebook are two platforms whose bot lists are particularly thorough.) Below that, the robots.txt page lists the sections or pages of the site that a given agent is not allowed to access, along with any specific exceptions that are allowed. If the line reads only "Disallow: /", the crawler is not welcome at all.
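
As a rough illustration of that structure, the sketch below feeds a small, made-up robots.txt to Python's standard urllib.robotparser module and asks what each agent may fetch. The rules, paths, and example.com URLs are assumptions for demonstration and are not copied from any real site. Python's parser applies the first matching rule within a group, so the Allow exception is listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group of rules per user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /private/press-kit/
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "https://example.com/recipes/soup"),
    ("Googlebot", "https://example.com/recipes/soup"),
    ("Googlebot", "https://example.com/private/notes"),
    ("Googlebot", "https://example.com/private/press-kit/logo.png"),
    ("SomeOtherBot", "https://example.com/tmp/cache"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:12} {url} -> {verdict}")
```

Here GPTBot is shut out entirely, Googlebot is kept out of /private/ except for the press-kit exception, and every other agent falls through to the wildcard group.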

For most people, "server overload" is no longer a real concern. "These days, it's often less about the resources used on the website and more about personal preferences," says John Mueller, a search advocate at Google. "What content do you want to have crawled and indexed, and so on?"

Historically, the biggest question most website owners had to answer was whether to allow Googlebot to crawl their site. The trade-off is fairly simple: if Google can crawl your page, it can index it and show it in search results. Anything you want to be findable on Google, Googlebot needs to see. (Of course, how and where the page actually appears in Google's search results is a completely different issue.) The question is whether you're willing to let Google consume some of your bandwidth and download a copy of your site in exchange for the visibility that search brings.

For most websites, this was a simple deal. "Google is our most important spider," says Tony Stubblebine, CEO of Medium. "Google can download all of Medium's pages, and in exchange, we get a lot of traffic. It's a win-win. Everyone thinks so." This is the bargain Google has struck with the internet as a whole: it sends traffic to other websites while selling ads alongside the search results. And by all accounts, Google has been a model citizen of robots.txt. "Almost all well-known search engines adhere to it," says Google's Mueller. "They're happy to be able to scrape the web, but they don't want to annoy people with it... It just makes it easier for everyone."

In the past year or so, however, the rise of artificial intelligence has turned that equation upside down. For many publishers and platforms, having their data scraped to train AI models feels less like trading and more like stealing. "What we found very quickly with the AI companies," says Stubblebine, "is that not only was it not an exchange of value, we were getting nothing in return. Literally zero." Last fall, when Stubblebine announced that Medium would block AI crawlers, he wrote that "AI companies have squeezed value out of writers in order to spam internet readers."

Over the last year, much of the media industry has expressed the same sentiment as Stubblebine. "We don't think it's in the public interest to 'scrape' BBC data to train general AI models without our permission," Rhodri Talfan Davies, the BBC's director of nations, wrote last fall, announcing that the BBC would also block OpenAI's crawler. The New York Times has blocked GPTBot as well, and a few months ago it sued OpenAI, saying that OpenAI's models "were built by copying and using millions of copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more." A study by Ben Welsh, news applications editor at Reuters, found that 606 of the 1,156 publishers surveyed had blocked GPTBot in their robots.txt file.

It's not just publishers. Amazon, Facebook, Pinterest, WikiHow, WebMD, and many other platforms explicitly block GPTBot from accessing some or all of their websites. On most of these robots.txt pages, OpenAI's GPTBot is the only crawler that is explicitly and completely disallowed. But plenty of other AI-focused bots are starting to crawl the web, such as Anthropic's anthropic-ai and Google's new Google-Extended. According to a study from Originality.AI last fall, 306 of the top 1,000 sites on the web blocked GPTBot, but only 85 blocked Google-Extended and 28 blocked anthropic-ai.

There are also crawlers that are used for both web search and artificial intelligence. CCBot, run by the organization Common Crawl, scours the web for search engine purposes, but its data is also used by OpenAI, Google, and others to train their models. Microsoft's Bingbot is both a search crawler and an AI crawler. And those are just the crawlers that identify themselves; many others try to operate in relative secrecy, making it hard to stop them or even to spot them among other web traffic. For any sufficiently popular website, finding a sneaky crawler is like looking for a needle in a haystack.

In large part, GPTBot has become the main villain of robots.txt because OpenAI allowed it to happen. The company published and promoted a page on how to block GPTBot, and it built its crawler to identify itself loudly every time it approaches a website. Of course, it did all of this after it had trained the underlying models that made it so powerful, and only once it had become an important part of the technology ecosystem. But Jason Kwon, OpenAI's chief strategy officer, says that's exactly the point. "We are participants in the ecosystem," he says. "If you want to participate in this ecosystem in an open way, then this is a reciprocal transaction that everyone is interested in." Without that deal, he says, the web starts to shrink and close off, and that's bad for OpenAI and for everyone. "We do all this to keep the web open."

By default, the Robots Exclusion Protocol has always been permissive. It assumes, as Koster did 30 years ago, that most bots are good and are made by good people, and so it allows them by default. Overall, that was the right decision. "I think the internet is fundamentally a social creature," OpenAI's Kwon says, "and this handshake that has lasted for decades seems to have worked." OpenAI's part in upholding that agreement, he says, includes keeping ChatGPT free for most users, which gives value back to them, and following the rules for bots.

But robots.txt is not a legal document, and 30 years after its creation it still relies on the goodwill of everyone involved. Banning a bot on your robots.txt page is like putting up a "no girls allowed" sign on your treehouse: it sends a message, but it won't hold up in court. Any crawler that wants to ignore robots.txt can simply do so, with little fear of consequences. (There is some legal precedent around web scraping, but even that can be complicated, and it mostly comes down on the side of allowing crawling and scraping.) The Internet Archive, for example, announced in 2017 that it would no longer abide by robots.txt. "Over time, we've observed that robots.txt files aimed at search engine crawlers don't necessarily serve our archival purposes," Mark Graham, director of the Wayback Machine at the Internet Archive, wrote at the time. And that was that.

As AI companies multiply and their crawlers grow more unscrupulous, anyone who wants to sit out or wait out the AI takeover is stuck in a never-ending game of whack-a-mole. They have to block each bot and crawler individually, if that's even possible, while also reckoning with the side effects: if AI is indeed the future of search, as Google and others predict, then blocking AI crawlers could be a short-term victory but a long-term disaster.
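
To make the bot-by-bot blocking concrete, here is a minimal sketch that generates a robots.txt with one blocking group per AI crawler and then checks the result with Python's standard urllib.robotparser. The AI_AGENTS list contains only user-agent names mentioned in this article, and the build_robots_txt helper and example.com URL are hypothetical. Any crawler missing from the list is unaffected, and, as described above, the file only binds crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from this article; the list must be kept up to date
# by hand, which is exactly the whack-a-mole problem.
AI_AGENTS = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that blocks each listed agent from the whole site."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # An empty Disallow line means "nothing is disallowed" for everyone else.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(AI_AGENTS)
print(robots_txt)

# Check the generated file the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
page = "https://example.com/post"
blocked = {agent: not parser.can_fetch(agent, page) for agent in AI_AGENTS}
search_still_allowed = parser.can_fetch("Googlebot", page)
```

In this sketch the search crawler Googlebot stays allowed while the listed AI agents are blocked, which is the split many of the publishers described here are trying to achieve.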

There are people on both sides who believe we need better, stronger, and more rigorous tools for managing crawlers. They argue that there is too much money at stake, and too many new and unregulated use cases, to rely on everyone simply agreeing to do the right thing. "While many participants have some rules in place to regulate their own use of crawlers," two technology-focused lawyers wrote in a 2019 article on the legality of web crawlers, "the rules as a whole are too weak, and it is too difficult to hold them accountable."

Some publishers want more fine-grained control over what is crawled and what it is used for, rather than the blanket allow-or-deny permissions of robots.txt. Google, which a few years ago worked to make the Robots Exclusion Protocol an official standard, has also pushed to play down the role of robots.txt, on the grounds that it is an outdated standard that too many websites ignore. "We recognize that existing web publisher controls were developed before new AI and research use cases emerged," Danielle Romain, Google's vice president of trust, wrote last year. "We believe it's time for the web and AI communities to explore additional machine-readable means for web publishers to exercise choice and control over emerging AI and research use cases."

Even as AI companies face regulatory and legal questions about how they build and train their models, those models keep improving, and new companies seem to pop up every day. Websites large and small face a decision: submit to the AI revolution or stand against it. For those who choose to opt out, their most powerful weapon is an agreement reached thirty years ago by some of the web's earliest and most optimistic true believers. They believed that the internet was a wonderful place, full of kind people, and above all they wanted it to be a good thing. In that world, and on that internet, stating your wishes in a text file should have been enough.
