Is Bravebot Masquerading as Googlebot? Concerns About Crawling and Data Usage

Hello Brave community,

I have recently come across claims that Bravebot, the web crawler used for Brave Search, has been masquerading as Googlebot in some cases, making it difficult for website administrators to block or manage its access properly. Some have also raised concerns about whether Brave is selling the collected data externally.

To clarify, I would like to ask:

  1. Does Bravebot use its own unique User-Agent at all times, or has it ever mimicked Googlebot or other crawlers?
  2. If such impersonation has occurred in the past, has Brave taken steps to ensure full transparency moving forward?
  3. What is Brave’s policy on sharing or selling indexed data to third parties?
  4. How can website owners effectively control or block Bravebot if they choose to do so?

I appreciate any official clarification from the Brave team or insights from the community. Transparency is key to maintaining trust, and I hope to get a clear and honest response.

Thank you!

@koihic.miyaji you may want to read https://search.brave.com/help/brave-search-crawler

@koihic.miyaji I guess the key things are:

  1. & 4. Brave doesn’t use its own unique user-agent for its crawler. It uses a generic crawler, so people cannot explicitly block Brave while still allowing other search engines. This has always been how it’s done.

  2. Brave has always had full transparency. I'm not quite sure what you’re getting at here. For example, there’s the help article I linked in my prior reply, and this information has long been shared elsewhere as well, such as in the screenshot below of a Reddit comment made 4 years ago:

  3. This also feels like a leading question. Brave Search is a search engine. People get into semantics on things here all the time. People can use the Search API for training AI, performing searches themselves, or whatever. Rather than putting it in my own words, let me copy/paste part of a reply given to someone else before:

The rights being mentioned are not rights to content, copyrighted or not, as the article misleadingly seems to imply. The rights are to the output of the API request, which is a set of results to a query sent by the API user. Brave Search has the right to monetize and put terms of service on the output of its search-engine. The “content of web page” is always an excerpt that depends on the user’s query, always with attribution to the URI of the content. This is a standard and expected feature of all search engines.

Where you see Brave Search API as a way to shamefully make money, we see it as a service to all the people who want to innovate on search and LLMs, who could use only Microsoft Bing Search API, which is in reality a monopoly (Google’s search API is not open-access). This is a pretty different take, not as clickbait-y though.

There are also some doubts towards how crawling is done, which could have been solved by asking before publishing.

Brave Search has a crawler which is partially powered by information provided by users enrolled in the Web Discovery Project (WDP) option in Brave browser’s search settings, which is an off-by-default AKA opt-in, privacy-preserving system with multiple mechanisms to prevent Brave from knowing who is contributing what (WDP is open-source for inspection by anyone).

The reason we do not expose a crawler user-agent is practical: we do not have the resources to contact all domain-owners, who rightfully or not, discriminate against anyone but Google. If a domain or page is not crawlable by any search engine (it has a no-index tag), or if it is not crawlable by googlebot, then Brave Search’s bot will not crawl it either.

Regards,

Josep M. Pujol
Chief of Search at Brave
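
For site owners wondering what the rule in the email above means in practice ("if it is not crawlable by googlebot, then Brave Search’s bot will not crawl it either"), here is a minimal sketch of that check using Python's standard `urllib.robotparser`. This is only an illustration under stated assumptions, not Brave's actual implementation: the function name `googlebot_may_fetch`, the sample `ROBOTS_TXT`, and the `example.com` URLs are all hypothetical, and the no-index tag check mentioned in the email is not covered.

```python
from urllib.robotparser import RobotFileParser


def googlebot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets Googlebot fetch url.

    Hypothetical helper: per the email, a page Googlebot may not crawl
    is also skipped by Brave Search's bot.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Hypothetical robots.txt: Googlebot is kept out of /private/ only,
# every other crawler is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for path in ("/public", "/private/page"):
        url = "https://example.com" + path
        print(path, googlebot_may_fetch(ROBOTS_TXT, url))
```

Under this reading, the rules a site sets for Googlebot are the ones that matter for Brave Search as well, which is why there is no separate Brave-only entry in the sample file.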

@solso and @sampson tagging on this just in case either of you have something to add or correct on my responses.

cc: @steeven

I’m pretty sure your last quote for point 3 is taken from https://stackdiary.com/brave-selling-copyrighted-data-for-ai-training/, and if so, you should have made it clear that it’s quoted from the article and added the link, as their content is licensed under CC BY 4.0.
Personally, I agree with the author of the article that the email didn’t do a great job of answering the questions. It would be nice if any of the official members could clarify.

Not true at all. It falls under Fair Use. I can also quote from CC directly:

If your use would not require permission from the rights holder because it falls under an exception or limitation, such as fair use, or because the material has come into the public domain, the license does not apply, and you do not need to comply with its terms and conditions. Additionally, if you are using an excerpt small enough to be uncopyrightable, the license does not apply to your use, and you do not need to comply with its terms.

NOTE

Also want to point out:

When you post an email on your website that you received from a company, you generally do not acquire its copyright. Even if you have permission to display the email (for example, if the sender didn’t restrict redistribution), you aren’t creating a new work that is eligible for your own exclusive copyright protection over its content.

In other words, while you might have a license to republish it (or it might fall under an implied permission), you cannot simply affix a copyright notice claiming that the email is your own original work or that others are prohibited from using it.

Sure, it may fall under Fair Use in US law. As for the rest, unless you yourself are Josep M. Pujol, you can never know whether that email is really from him. I don’t think there’s any reason not to link the article, tbh.

@Yuki2718 there was no need. And I find it ironic how people act sometimes. While you’re here complaining about things like copyright and agreements, you’re also violating these things on a regular basis by helping with filter lists. Places like YouTube stipulate that they own the content and that people can only view it if they see ads or pay for a subscription. The same goes for many other sites. It’s part of the Terms of Use for these sites and is also a condition for accounts.

So how is it that you’re okay with helping to circumvent those things, but then you want to come around and complain about things like this? And to answer, the reason for not linking is that the original article was full of false and malicious content. It was just one of the many cases of people shitposting with no understanding of how things work.

Whether it be them, you, or others…it gets annoying when people preach about things of which they have no knowledge. Similar to how you tried to lecture me, saying that the content was licensed under CC BY 4.0 and claiming I needed to link to it. Yet you are 100% wrong in your understanding.

And it isn’t just under US law. It is under Creative Commons itself, the very license you claimed protects it. There is not a single country or law in the world under which I would have violated anything.

I wasn’t criticizing anything—I just wanted to confirm Brave’s stance on the matter. Now that I have my answer, thank you.

Also, regarding the discussion about ad blocking and YouTube’s terms of service, I never intended to express any opinion on those topics. They are not directly related to my original question. But perhaps some people just enjoy getting into unrelated discussions.