
Ugh, exposing it with cgit is why.

Put it all behind an OAuth login using something like Keycloak and integrate that into something like GitLab, Forgejo, or Gitea if you must.

However, to host git, all you need is a user and ssh. You don't need a web UI. You don't need port 443 or 80.
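For reference, the whole setup is roughly this (Debian-flavored commands; the user, repo, and host names are just placeholders):

    # create a locked-down git user whose shell only allows git operations
    sudo adduser --disabled-password --shell /usr/bin/git-shell git
    # create a bare repo in that user's home
    sudo -u git -H git init --bare /home/git/myproject.git
    # add your public key to /home/git/.ssh/authorized_keys, then from your machine:
    git clone git@server.example.com:myproject.git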




Using gitea does not help if your goal is to allow non-auth'ed read-only access to the repo from a web browser. The scrapers use that to hit every individual commit, over and over and over.

We used an nginx config to block access to individual commits, while leaving the rest of what gitea exposes to non-auth'ed read-only access unaffected.
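Roughly along these lines; the location patterns and upstream port here are illustrative, not our exact config:

    # deny the per-commit/diff/blame pages that scrapers hammer,
    # pass everything else through to Gitea as usual
    location ~ ^/[^/]+/[^/]+/(commit|compare|blame)/ {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:3000;
    }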


Every commit. Every diff between 2 different commits. Every diff with different query parameters. Git blame for each line of each commit.

Imagine a task to enumerate every possible read-only command you could make against a Git repo, and then imagine a farm of scrapers running exactly one of them per IP address.

Ugh.


Ugh Ugh Ugh ... and endless ughs, when all they needed was "git clone" to get the whole thing and spend as much time and energy as they wanted analyzing it.

Yuk…

    http {
        # ... other http settings
        limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;
        # ...
    }


    server {
        # ... other server settings
        location / {
            limit_req zone=mylimit burst=20 nodelay;
            # ... proxy_pass or other location-specific settings
        }
    }

Rate limit read-only access at the very least. I know this is a hard problem for open source projects that have relied on web access like this for a while. Anubis?

Easier said than done; I have 700k requests from bots in my access.log, coming from 15k different IP addresses.

:: ~/website ‹master*› » rg '(GPTBot|ClaudeBot|Bytespider|Amazonbot)' access.log | awk '{print $1}' | sort -u | wc -l

15163


    map $http_user_agent $uatype {
        default                     'user';
        ~*(googlebot|bingbot)       'good_bot';
        ~*(nastybot|somebadscraper) 'bad_bot';
    }
You can also do something like this to rate limit by user agent instead of by IP address, limiting all 'bad_bots' but not 'good_bots'.
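To actually enforce it you'd feed that variable into a limit_req_zone, something like this (zone name, rate, and burst are just examples):

    # only 'bad_bot' traffic gets a non-empty key; nginx skips
    # rate limiting entirely when the key is empty
    map $uatype $limit_key {
        default   '';
        'bad_bot' $binary_remote_addr;
    }

    limit_req_zone $limit_key zone=botlimit:10m rate=1r/s;

    server {
        location / {
            limit_req zone=botlimit burst=5;
        }
    }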

I’m not dismissing the difficulty of the problem but there are multiple vectors that can identify these ‘bad_bots’.


We used fail2ban to do rate limiting first. It wasn't adequate.

Ooof, maybe a write-up is in order? An opinionated blog post? I'd love to know more.

As noted by others, the scrapers do not seem to respond to rate limiting. When you're being hit by 10-100k different IPs per hour and they don't respond to rate limiting, rate limiting isn't very effective.


