
Website owners now have a way to prevent Google from using their content to train its Bard AI and future models: a single directive added to the site’s robots.txt file. But despite Google’s ethical framing, the data was already collected without consent, and asking permission after the fact rings hollow. Medium and others are blocking such crawlers outright until better tools emerge, a sign of growing concern over consent in data collection.
Google’s Bard AI, and any future models the company builds, can now be trained without your website’s content. Large language models such as Google’s are trained on enormous amounts of data, much of it collected without the knowledge or consent of the people who created it. Now it’s up to you whether you want Google to use your web content as training material for Bard and whatever models come next.
The process is straightforward: add a directive to your site’s robots.txt file disallowing the user agent “Google-Extended.” Robots.txt is the file that tells automated web crawlers which content they are allowed to access.
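For a site that wants to opt out entirely, an entry along these lines in robots.txt should do it (“Google-Extended” is the product token Google checks for, and “Disallow: /” covers every path on the site):

    User-agent: Google-Extended
    Disallow: /

Because this targets only the Google-Extended token rather than Googlebot, ordinary crawling and search indexing should be unaffected.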
Despite Google’s assertion that it develops its AI ethically and inclusively, AI training is quite different from web indexing. In a blog post, Danielle Romain, the company’s VP of Trust, acknowledges that web publishers have asked for greater control over how their content is used in emerging generative AI applications, phrased as though the request came as a surprise to Google.
Interestingly, the word “train” appears nowhere in the post, even though the data’s evident purpose is to serve as raw material for training machine learning models. Instead, the VP of Trust asks whether you would be willing to “assist in enhancing Bard and Vertex AI generative APIs” so that these AI models become more accurate and capable over time.
Framing the question this way centers consent, presenting participation as a positive choice to contribute. But Bard and other models have already been trained on vast quantities of data gathered from users without their consent, which undercuts the authenticity of that framing.
Google’s actions unmistakably reveal that it initially exploited unrestricted access to web data, acquired what it needed, and is now requesting permission retroactively to create the impression that consent and ethical data collection are priorities. If they genuinely were, this option would have been available years ago.
Notably, Medium recently announced that it will block crawlers like these across the board until a better solution is available, and it is not alone in that decision.