
Yuk…

    http {
        # ... other http settings

        # 10 MB shared zone keyed on client IP, sustained rate of 10 requests/second
        limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;
        # ...
    }

    server {
        # ... other server settings
        location / {
            # allow a burst of 20 extra requests served immediately (nodelay);
            # anything beyond that is rejected with 503
            limit_req zone=mylimit burst=20 nodelay;
            # ... proxy_pass or other location-specific settings
        }
    }

Rate limit read-only access at the very least. I know this is a hard problem for open source projects that have relied on web access like this for a while. Anubis?
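If you go the Anubis route, the nginx side is just a reverse proxy in front of it; a minimal sketch, assuming an Anubis instance you've pointed at your backend and configured to listen on 127.0.0.1:8923 (that address is an assumption, use whatever you bind it to):

    server {
        # ... other server settings
        location / {
            # send everything through Anubis, which challenges suspect
            # clients and proxies the rest to the real backend
            proxy_pass http://127.0.0.1:8923;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }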



Easier said than done; I have 700k requests from bots in my access.log, coming from 15k different IP addresses:

:: ~/website ‹master*› » rg '(GPTBot|ClaudeBot|Bytespider|Amazonbot)' access.log | awk '{print $1}' | sort -u | wc -l

15163


    # classify requests by User-Agent
    map $http_user_agent $uatype {
        default                       'user';
        ~*(googlebot|bingbot)         'good_bot';
        ~*(nastybot|somebadscraper)   'bad_bot';
    }
You can also do something like this to rate limit by user agent instead of by IP address, limiting the ‘bad_bots’ but not the ‘good_bots’.

I’m not dismissing the difficulty of the problem, but there are multiple vectors that can identify these ‘bad_bots’.
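For completeness, here's a minimal sketch of wiring that map into limit_req: a second map turns the classification into a rate-limit key, and nginx doesn't account requests whose key is empty, so only the ‘bad_bot’ group gets throttled. The zone name, zone size, rate, and burst below are placeholder values, not recommendations.

    # http context
    map $uatype $bot_limit_key {
        default     '';                    # empty key => not rate limited
        'bad_bot'   $binary_remote_addr;   # limit bad bots per source IP
    }

    limit_req_zone $bot_limit_key zone=badbots:10m rate=1r/s;

    server {
        location / {
            limit_req zone=badbots burst=5 nodelay;
        }
    }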


We used fail2ban to do rate limiting first. It wasn't adequate.
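If it helps anyone else, the usual pairing is fail2ban's stock nginx-limit-req filter watching the nginx error log for limit_req rejections and banning repeat offenders; a sketch with illustrative paths and thresholds, not the exact setup we ran:

    # /etc/fail2ban/jail.local (illustrative values)
    [nginx-limit-req]
    # stock filter shipped with fail2ban; matches nginx "limiting requests" errors
    enabled  = true
    port     = http,https
    logpath  = /var/log/nginx/error.log
    findtime = 600
    maxretry = 10
    bantime  = 3600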

Ooof, maybe a write-up is in order? An opinionated blog post? I'd love to know more.

As noted by others, the scrapers do not seem to respond to rate limiting. When you're being hit by 10-100k different IPs per hour that simply ignore the limits, rate limiting isn't very effective.


