AI Bots and the Pure Portal

AI bots are becoming an increasingly important part of the modern internet in terms of online traffic, workflows and user experience. In this article we provide an overview of how Pure Portal manages bots, what flexibility and choices customers have in the way their Portal interacts with bots, and what we recommend as best practice. We also go through customers' frequently asked questions.

How does Pure Portal treat bots?

Pure uses three key tools to manage bot traffic on the Portal: Cloudflare, Robots.txt and TDMRep.

Here is an overview of how these tools work together:

Cloudflare acts as the first filter, letting human traffic and verified bots pass through to the Portal
Bots are expected to follow the instructions in Robots.txt, which guide their behaviour
TDMRep specifically addresses whether bots are allowed to do text data mining. If TDMRep is activated, bots are asked not to do text data mining without first contacting the rightsholder. By default TDMRep is off

Please note that neither Robots.txt nor TDMRep is enforceable: they act like a sign asking the public not to step on the grass, but if a bot chooses to ignore them, they cannot stop it.

What is Cloudflare and how is it used?

Cloudflare mitigates threats before they reach Pure: it can block malicious traffic and requests that try to exploit vulnerabilities, including yet-undiscovered ones, through attacks such as SQL injection, cross-site scripting, or denial of service.

It also provides protection against malicious applications and DDoS attacks by specifically checking HTTP and network requests: it can inspect the content and parameters of each HTTP request and response and apply granular rules and policies to allow or deny access based on various criteria.

What this means for Portals

Only Cloudflare-verified bots can access the Portal (that is, good bots only)
Customers' own bots are likely to be filtered out. Customers can request that these bots be white-listed by Cloudflare (success for smaller bots is unlikely) or get the data they require via the API
Human visitors of the Portal might see a slightly higher number of “are you a human?” checks

Cloudflare offers the same high level of protection for all Portals.

What is Robots.txt and how is it used?

A robots.txt file acts as a set of instructions for web crawlers (bots) visiting a website. When a bot makes a request to a website, it first checks the robots.txt file to understand which pages it can and can't crawl, which links it should and shouldn't follow, and other requirements for bot behavior. Robots.txt files manage bot activity for the entire site, while the meta robots tags apply to individual web pages (for example, person profile sub-pages are marked as not to be indexed on the Portal).
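
As an illustration of the page-level meta robots tag mentioned above, the following minimal Python sketch extracts the robots directives from a page's HTML. The sample HTML and the function name robots_directives are hypothetical and are not taken from the Portal; the sketch only shows how a crawler might read such a tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of meta robots directives declared in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page markup, for illustration only.
SAMPLE_PAGE = """
<html><head>
<title>Example sub-page</title>
<meta name="robots" content="noindex, follow">
</head><body></body></html>
"""

if __name__ == "__main__":
    directives = robots_directives(SAMPLE_PAGE)
    print("Index this page?", "noindex" not in directives)
```

A well-behaved crawler that finds noindex among the directives would leave the page out of its index, while still obeying the site-wide rules in robots.txt.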

The Robots.txt file cannot enforce these rules, but good bots are programmed to look for the file and follow the rules before they do anything else.

You can choose to exclude certain AI crawlers from your Portal by disallowing them by name in Robots.txt. However, if you are looking to forbid text data mining entirely, adjusting the TDMRep setting is preferable: the list of LLM bots is ever-growing, and banning bots one by one in your Robots.txt is not recommended under best practice guidelines because it can quickly become an unwieldy task.
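
To illustrate what disallowing bots by name looks like, here is a minimal sketch that uses Python's standard urllib.robotparser module to evaluate a sample Robots.txt. The file content, the portal.example.org URL and the choice of bot names are assumptions made for the example; they do not reflect any Portal's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical Robots.txt content: two AI crawlers are disallowed by name,
# every other bot is allowed. The bot names are examples only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://portal.example.org/en/persons/"  # placeholder URL
    for bot in ("GPTBot", "CCBot", "Googlebot"):
        print(bot, "allowed:", is_allowed(ROBOTS_TXT, bot, page))
```

Every additional bot needs its own User-agent block, which is why the list becomes hard to maintain as new crawlers appear.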

To review your Robots.txt file, go to your Portal homepage and add /robots.txt to the URL. The Robots.txt file is a vendor-only setting. To make a change to it, please reach out to Support.

What is TDMRep and how is it used?

The EU legislation known as the EU 2019 Directive for Copyright in the Digital Single Market (CDSM Directive) sets out who may perform text data mining (TDM) on website content, and under what conditions.

It contains two articles that are particularly relevant for how AI bots can interact with the Portal:

Research organisations and cultural heritage institutions may carry out TDM for the purposes of scientific research if they have lawful access to the content (Article 3)
Any organisation that wants to carry out TDM for any purpose, including commercial purposes, can do so if they have lawful access to the content, UNLESS TDM is explicitly reserved (Article 4)

The TDM Reservation Protocol (TDMRep) was developed in collaboration with W3C and the STM organization. This protocol addresses Article 4 of the CDSM Directive in particular.

TDMRep is available as a setting that allows more control over how data displayed on the Portal is used for TDM. If TDMRep is enabled, it covers the entire Portal, and anyone wishing to run TDM must first request the rights to do so. They would need to reach out to the institution whose Portal they are interested in to gain access (we suggest providing the desired data to the requestor via an API).

The CDSM Directive also applies to general-purpose AI models, particularly large generative AI models. Providers offering such AI models in the EU market must adhere to the directive's requirements, irrespective of where the copyright-related activities supporting the training of these models occur. This ensures compliance with copyright regulations and safeguards the rights of content creators.

How the TDMRep works on Portals:

TDMRep is DISABLED => data mining agents can mine the content for TDM purposes without having to contact the rightsholder. (TDM rights are not reserved, so under Article 4 they can do TDM.)
- This is the DEFAULT setting.
TDMRep is ENABLED => data mining agents can NOT mine the content for TDM purposes without first contacting the rightsholder.

TDMRep cannot enforce these rules, but good bots are programmed to look for it and follow the stipulated guidance.
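
The sketch below shows, under stated assumptions, how a well-behaved data mining agent might check for a reservation before mining. The TDMRep specification defines several ways to express the reservation, including tdm-reservation and tdm-policy HTTP response headers, a tdmrep.json file under /.well-known/, and HTML meta tags. This article does not describe which of these the Portal uses, so the header-based check here is an assumption, as are the function names and the sample policy URL. Treating a missing header as "not reserved" mirrors the DEFAULT setting described above.

```python
def tdm_rights_reserved(headers):
    """Return True if the response headers signal that TDM rights are reserved."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    # Assumption: no header at all is treated the same as "tdm-reservation: 0".
    return normalised.get("tdm-reservation", "0") == "1"


def tdm_policy_url(headers):
    """Return the URL of the TDM policy if one is advertised, else None."""
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("tdm-policy") or None


def describe(headers):
    """Summarise what the headers ask a data mining agent to do."""
    if not tdm_rights_reserved(headers):
        return "TDM rights not reserved"
    policy = tdm_policy_url(headers)
    if policy:
        return "TDM rights reserved; see policy at " + policy
    return "TDM rights reserved; contact the rightsholder"


if __name__ == "__main__":
    # Hypothetical response headers, for illustration only.
    print(describe({"Content-Type": "text/html"}))
    print(describe({"TDM-Reservation": "1"}))
    print(describe({"tdm-reservation": "1",
                    "tdm-policy": "https://portal.example.org/tdm-policy.json"}))
```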

The EU 2019 Directive for Copyright in the Digital Single Market is EU legislation, so it primarily applies to AI bots that would like to do TDM on the Portal of an EU institution. However, with the internet being an inherently global place, the TDMRep functionality is available to any customer who wishes to use it. While not all bots will look for it, we expect the main bots to comply with the legislation.

TDMRep is a vendor-only setting. To make a change to it, please reach out to Support.

FAQs

Does banning AI crawlers impact our search ranking?

With many of the key search engine companies also operating in the GenAI space, and with the ongoing discussion of how these companies collect data to train their LLMs, customers have raised concerns that blocking GenAI bots from their Portals would lead to a lower ranking in search results.

In their September 2024 statement, Google explicitly pointed out that their AI and Search bots are separate and that banning their GenAI crawler would not impact how the Portal is indexed for search. Bing made a similar statement in a 2023 blog post, underlining that even sites that choose to be excluded from GenAI chatbot results would still appear in search (although they make no mention of the impact on result ranking specifically). It is also easy to see in the Cloudflare verified bot listing that, for most companies, the search and the GenAI bots are separate.

There are discussions in the industry about whether there could be an indirect impact, and there is no clear consensus yet, mainly because AI chatbots are a rapidly evolving field (so a noticeable impact on search rankings might be yet to come). For now, however, recommendations for optimizing pages for AI bot usefulness are largely the same as those given for SEO and for making pages useful to human users: pages with structured data, clear content, and semantic relevance are more likely to maintain or improve their search rankings.

Would AI Chatbots add traffic or take it away from the Portal?

With the rise of GenAI chatbots, there has been significant worry in the industry about the impact this development would have on traffic: would the bots keep straining websites while actual human traffic diminishes, because the AI chatbot provides all the necessary information?

This worry is not unfounded: in 2024, 35% of chatbot users used a chatbot instead of a search engine to answer a question (Exploding Topics, Nov 2024; Eology, Feb 2025).

Additionally, chatbot behavior itself keeps evolving, with most AI chatbots currently providing source links for their answers. The inclusion of source links is quite a recent addition (e.g., ChatGPT started providing links in March 2024), so it is still early days, and we are likely to get more insight into how chatbot referrals are impacting traffic in the coming months and years.

Several recent studies (2025) have already looked into how chatbot referrals have impacted human website traffic so far:

Smaller websites appear to get more GenAI referrals (Ahrefs, Feb 2025), which could be especially helpful for smaller Portals
Views conflict on how important chatbot referral traffic really is: its share is still relatively small, but there is some evidence that these visitors are more engaged (Ahrefs, Feb 2025; WeAreJunction, Feb 2025; Search Engine Journal, Feb 2025)
A visitor referred by an AI chatbot may not show up correctly in the stats (Ahrefs, Feb 2025), which complicates the task of evaluating the true impact of these referrals

So, on balance, it appears that while the impact might be small right now, it is generally positive, because the users who click through are more engaged and interested in the content. With over a third of chatbot users turning to chatbots with search queries, it is worth considering the importance of having your Portal indexed by AI bots.

Can I link my portal to my own AI to help me find the best expert, etc.? How can I best do that?

An increasing number of customers are considering using their Pure Portal in combination with a home-grown GenAI chatbot. This can include using the data from Pure as the key source of training material for the LLM or as just one of the sources.

If you are considering creating your own GenAI chatbot that works with the Pure Portal, we recommend using the Pure APIs to get the data to train your solution on. The data we provide via the API also includes information that allows you to reconstruct the specific URLs, so your chatbot can refer the user directly to the relevant Portal pages.

It is also worth noting that there is no unified agreement on what GenAI counts as evidence of expertise. In April 2025, we challenged a number of GenAI chatbots to recommend an expert in a number of fields at a number of different institutions (e.g. “Can you recommend an expert in SUBJECT at the INSTITUTION?”). When asked to suggest an expert, different AI chatbots provided different results for the same instruction and field of expertise. The general AI chatbots found it challenging to define expertise as an overall concept and seemed to rely mostly on text descriptions directly referring to someone as having particular expertise in the field or holding a senior position in the relevant department of the institution, rather than on the actual amount of publications and other work the person might have done in the area. Defining indicators of expertise might be a valuable discussion when creating your own GenAI chatbot.

How do bot visits affect visitor stats?

Adobe Analytics does NOT include bot visits in the statistics. The exclusion is done directly by Adobe, who use the IAB (Interactive Advertising Bureau) International Spiders & Bots List to exclude traffic that comes from bots. Google Analytics also automatically excludes traffic from known bots and spiders, and it also uses the IAB list.

Do Search Engines penalize or prioritise pages with AI generated content? Should we optimize for AI, and if yes, how to do that?

Google and other search engines do not penalize AI-generated content or otherwise distinguish between human-generated and AI-generated content. They prioritise content that:

Is original, high-quality, people-first
Demonstrates the qualities of E-E-A-T (experience, expertise, authoritativeness, and trustworthiness) and complies with Google's guidance on content and SEO

There is an increasing view that the content should also be optimised for AI driven indexing, as more users might be coming from the Gen AI reply links. However, people-first content is still the key recommendation. Making the website pages more structured and easy to navigate is recommended as well.

The Portal follows these recommendations and we continue to monitor this evolving landscape for further recommendations.

Can we decide which bots are allowed and which not?

It is possible to disallow specific bots from using your Portal for their training purposes. An indication can be made in the Robots.txt file, disallowing the specific bot. Please bear in mind that you would need to name specifically every bot you wish to disallow, and with the growing number of bots, this can quickly turn into an unwieldy task. Please see our guidance on managing your Robots.txt and TDMRep for more details on the best practices in this area.

Please note that neither Robots.txt nor TDMRep is mandatory for bots to follow, and there is no practical way to enforce them. We rely on the Cloudflare verified bot list to allow only “well-behaved” bots onto the Portal, so we believe it is reasonable to expect them to follow these requirements.

Do bots currently refer to Portals in their source links? Do they obey our requests?

We did some spot checks on a range of AI bot crawlers in April 2025 and can see that they mostly do use Portal links as sources in the search, when allowed by Robots.txt and TDMRep.

The results vary across the different AI bot providers, as well as across Portals, similar to the variation we see in the search engine ranking of pages for different Portals: for some queries the Portal page would be listed as a source, for others the AI chat could point to the institution’s main website.

Note also that the quality of the results may vary and is largely driven by the content the person has on their Portal page (e.g. whether it has enough descriptive text for the AI chatbot to go on). For example, when asked to suggest an expert, different AI chatbots provided different results for the same instruction and field of expertise. The general AI chatbots found it challenging to define expertise as an overall concept and seemed to rely mostly on text descriptions directly referring to someone as having particular expertise in the field or holding a senior position in the relevant department of the institution, rather than on the actual amount of publications and other work the person might have done in the area.

Please bear in mind that this is a rapidly evolving area, so the rules by which AI Chats select and promote source links may change over time.

There are regular reports about AI bots causing disruption for scientific databases and journals. How do you make sure this does not happen to Portals?

The key tool we are using to protect ourselves against this is Cloudflare: by letting in only the verified, well-behaved bots we expect to minimise the chance of such incidents.

We are also adding an update time stamp to the metadata of individual pages, which should help ensure that bots do not scrape excessively (we aim for a page to be scraped at most once every 24 hours).
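
As a rough illustration of how a crawler could use such a time stamp, the sketch below decides whether a page is worth fetching again. The function name, its parameters and the way the time stamp is obtained are all hypothetical; the article does not specify where in the page metadata the time stamp lives or how individual bots interpret it. The 24-hour interval is the target mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Target from the text above: a page is scraped at most once every 24 hours.
MIN_RECRAWL_INTERVAL = timedelta(hours=24)


def should_recrawl(last_crawled, page_updated, now):
    """Decide whether a crawler should fetch a page again.

    last_crawled: when the crawler last fetched the page, or None if never.
    page_updated: the update time stamp from the page metadata, or None.
    now: the current time.
    """
    if last_crawled is None:
        return True  # never fetched before
    if now - last_crawled < MIN_RECRAWL_INTERVAL:
        return False  # fetched too recently, regardless of updates
    if page_updated is None:
        return True  # no time stamp available, fall back to the interval
    return page_updated > last_crawled  # only refetch if the page changed


if __name__ == "__main__":
    now = datetime(2025, 6, 17, 12, 0, tzinfo=timezone.utc)
    crawled = now - timedelta(days=3)
    unchanged = now - timedelta(days=10)
    print(should_recrawl(crawled, unchanged, now))
```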

There is ongoing debate in the industry on whether implementing a longer crawl-delay in Robots.txt is a good approach to safeguard against excessive scraping. However, as many bots (including Google's) do not recognise it, and as it can create unintended consequences for SEO, we are not considering this approach at this point.
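
For reference, this is roughly how a crawler that does honour the directive would read it, sketched with Python's standard urllib.robotparser module. The Robots.txt content and the bot name ExampleBot are hypothetical. As noted above, the Portal does not use this approach; the example only shows what the directive looks like and why it has no effect on bots that never query it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical Robots.txt with a Crawl-delay for one named bot.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /search

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay in seconds for a user agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for bot in ("ExampleBot", "SomeOtherBot"):
        print(bot, "crawl delay:", crawl_delay_for(ROBOTS_TXT, bot))
```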

We are constantly monitoring developments in this area and the recommended best practices, and we will review our approach as the situation evolves.

Published on June 17, 2025