April 15, 2025 by Drew DeVault

You cannot have our users' data

As you may have noticed, SourceHut has deployed Anubis to parts of our services to protect ourselves from aggressive LLM crawlers.¹ Much ink has been spilled on the subject of the LLM problem elsewhere, and we needn’t revisit that here. I do want to take this opportunity, however, to clarify how SourceHut views this kind of scraper behavior more generally, and how we feel that the data entrusted to us by our users ought to be used.

Up until this point, we have offered some quiet assurances to this effect in a few places, notably our terms of service and robots.txt file. Quoting the former:

You may use automated tools to obtain public information from the services for the purposes of archival or open-access research. You may not use this data for recruiting, solicitation, or profit.

This has been part of our terms of service since they were originally written in 2018. With the benefit of hindsight, I might propose a different wording to better reflect our intentions – but we try not to update the terms of service too often, because we have to send all users an email letting them know we’ve done so. I have a proposed edit pending for inclusion in the next batch of changes to the terms, which reads as follows:

You may use automated tools to access public SourceHut data in bulk (i.e. crawlers, robots, spiders, etc) provided that:

  1. Your software obeys the rules set forth in robots.txt
  2. Your software uses a User-Agent header which clearly identifies your software and its operators, including your contact information
  3. Your software requests data at a rate which does not negatively affect the performance of our services for other users

You may only collect this data for one or more of the following purposes:

  • Search engine indexing
  • Open-access research
  • Archival

You may not use automated tools to collect SourceHut data for solicitation, profit, training machine learning models, or any other purpose not enumerated here without explicit permission from SourceHut staff.

This text, or something similar, will be included in our next update to the terms of service, which will probably ship around the time we finish setting up our new European billing system.
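To make those expectations concrete, here is a minimal sketch of what a compliant crawler might look like, using only Python’s standard library. The bot name, contact details, base URL, and request interval are hypothetical placeholders; the robots.txt check, identifying User-Agent, and throttling correspond to the three conditions above. This is an illustration, not a blessed implementation.

import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity; use your own project URL and contact address.
USER_AGENT = "ExampleResearchBot/1.0 (+https://example.org/bot; ops@example.org)"
BASE = "https://git.sr.ht"

# Condition 1: obey robots.txt.
robots = RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

def fetch(url):
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site has asked crawlers to stay out of this path
    # Condition 2: clearly identify the software and its operators.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    # Condition 3: throttle requests so the crawl does not degrade the
    # service for other users (the five-second pause is only illustrative).
    time.sleep(5)
    return body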

A careful observer can see our views on scrapers elaborated on in our robots.txt file as well. It begins as follows:

# Our policy
#
# Allowed:
# - Search engine indexers
# - Archival services (e.g. IA)
#
# Disallowed:
# - Marketing or SEO crawlers
# - Anything used to feed a machine learning model
# - Bots which are too aggressive by default. This is subjective; if you annoy
#   our sysadmins you'll be blocked.
#
# If you do not respect robots.txt or you deliberately circumvent it we will
# block your subnets and leave a bag of flaming dog shit on your mother's front
# porch.

One can infer from the tone of the last sentence that attempting to enforce robots.txt is a frustrating, thankless task for our sysadmins.
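For those unfamiliar with the format, directives implementing a policy like this look roughly as follows. The user agents shown (OpenAI’s GPTBot and Common Crawl’s CCBot) are common examples of crawlers that feed machine learning corpora; this is an illustration of the mechanism, not a copy of our actual rules.

# Example entries only; not the complete contents of our real file.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else is welcome, subject to the policy above.
User-agent: *
Disallow:

Of course, robots.txt is only a request: it relies entirely on the crawler’s goodwill, which is precisely the problem.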

To add to these resources, I’d like to elaborate a bit more on our views on scrapers today. Scrapers have been a thorn in the side of sysadmins for a very long time, but the problem has become especially acute as LLM scrapers hoover up the entire Internet to feed expensive, inefficient machine learning models – ignoring the copyright (or copyleft, as it were) of the data as they go. The serious costs, widespread performance issues, and outages caused by reckless scrapers have been on everyone’s mind in the sysadmin community as of late, and have been the subject of much discussion online.

Aside from the much-appreciated responses of incredulity towards LLM operators, and the support and compassion for sysadmins from much of the community, a significant minority views this problem as less important than we believe it to be. Many of their arguments reduce to victim blaming – that it’s not that hard to handle this volume of traffic, that our software should be better optimized to deal with it, that we need more caching or better performance, or that we should pay a racketeer like CloudFlare to make the problem go away. Some suggest that sysadmins should be reaching out to LLM companies to offer them more efficient ways to access our data to address the problem.

Of course, not all software can be as resource-efficient as Joe Naysayer’s static website. Moreover, LLM companies are not particularly interested in the more expensive route of building software integrations for each data source they scrape when they could go the cheap route of making us all pay for the overhead; nor should we sysadmins have to spend the finite time and resources at our disposal (often much more modest than the resources available to these LLM companies) negotiating with terrorists and building bespoke solutions for them.

More important than any of these concerns is to address the underlying assumption: that these companies are entitled to this data. This assumption has varied roots, as benign as misplaced Libertarian ideals and as unhinged as the Rationalism cult belief that AGI is around the corner and everyone ought to be participating as best they can for the benefit of untold numbers of unborn future-humans.

It is our view that these companies are not entitled to the data we provide, nor is anyone else. The intended audience for the publicly available data on SourceHut is users of and contributors to open source software who are accessing the data for those purposes. Indeed, some profitable use of public SourceHut data is permitted, as the Open Source Definition entitles one to, but we do not wish to provide our data in bulk to any business, megacorp and startup alike, that wants to feed it into an LLM or do anything else with it which does not directly serve our mission: to improve open source software.

We would not come to a special arrangement to share this data with any of these companies, either, even in the unlikely event that they offered to pay for it. We are funded by paid subscriptions, not by selling our users’ data. It is not ours to sell – something GitHub, with its own LLM products, would do well to consider more carefully. The data we have been entrusted with belongs to our users, and is dedicated to the commons, and we take our role as the stewards of this data seriously. It is our duty to ensure that it is used in the service of our users’ best interests. We have always put them first.


  1. We’re also researching go-away, a new option which may be effective with a reduced user impact (notably by not necessarily requiring JavaScript).