These days, it feels like the sole use of User-Agent is as a weak defence agains...

jaywalk · on March 25, 2020

Preventing scraping is an entirely futile effort. I've lost count of the number of times I've had to tell a project manager that if a user can see it in their browser, there is a way to scrape it.

Best I've ever been able to do is implement server-side throttling to force the scrapers to slow down. But I manage some public web applications with data that is very valuable to certain other players in the industry, so they will invest the time and effort to bypass any measures I throw at them.
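For anyone wondering what server-side throttling can look like, here is a minimal sketch of a per-client token bucket. This is not the setup described above, just one common way to do it; the `TokenBucket` class, the `rate` and `capacity` parameters, and the idea of keying on a client ID string are all assumptions for illustration.

```python
import time


class TokenBucket:
    """Per-client token bucket: each request costs one token (illustrative sketch)."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second (assumed parameter)
        self.capacity = capacity  # maximum burst size (assumed parameter)
        self.now = now            # injectable clock, handy for testing
        self.buckets = {}         # client_id -> (tokens, last_timestamp)

    def allow(self, client_id):
        """Return True if the client may proceed, False if it should be throttled."""
        t = self.now()
        tokens, last = self.buckets.get(client_id, (self.capacity, t))
        # Refill based on elapsed time, never exceeding capacity.
        tokens = min(self.capacity, tokens + (t - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, t)
            return True
        self.buckets[client_id] = (tokens, t)
        return False
```

A real deployment would probably key on something harder to rotate than an IP address and would evict idle entries from `buckets`. As the comment says, this only slows a determined scraper down.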

pocket_cheese · on March 26, 2020

As a person who scrapes sites (ethically), I think it's impossible or pretty damn near impossible to prevent a motivated actor from scraping your website. However, I've avoided scraping websites because their anti-scraping measures made it not worth the effort of figuring out their site. I think it's still worth doing minimal things like minifying/obfuscating your client-side JS and using some type of one-time-use request token to restrict replayability. The difference between knowing that I can figure it out in 30 minutes vs 4 hours vs a few days is going to filter out a lot of people.

Of course, sometimes obfuscating how your website works can make it needlessly more complicated, so it's a trade off.
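To make the one-time-use request token idea concrete, here is a rough sketch, assuming the server hands out an HMAC-signed nonce with an expiry and refuses to accept the same nonce twice. The `OneTimeTokens` class, the `nonce.expires.signature` token layout, and the `ttl` parameter are made up for illustration and not taken from any particular site.

```python
import hashlib
import hmac
import secrets
import time


class OneTimeTokens:
    """Issue signed tokens that can each be redeemed once (illustrative sketch)."""

    def __init__(self, secret, ttl=60.0, now=time.time):
        self.secret = secret  # server-side key as bytes (assumed)
        self.ttl = ttl        # token lifetime in seconds (assumed)
        self.now = now        # injectable clock, handy for testing
        self.used = set()     # nonces already redeemed

    def _sign(self, payload):
        return hmac.new(self.secret, payload.encode(), hashlib.sha256).hexdigest()

    def issue(self):
        """Create a token of the form nonce.expires.signature."""
        nonce = secrets.token_hex(16)
        expires = int(self.now() + self.ttl)
        payload = f"{nonce}.{expires}"
        return f"{payload}.{self._sign(payload)}"

    def redeem(self, token):
        """Accept a token only if it is well formed, unexpired, and unused."""
        try:
            nonce, expires, sig = token.split(".")
            expires_at = int(expires)
        except ValueError:
            return False
        expected = self._sign(f"{nonce}.{expires}")
        if not hmac.compare_digest(sig.encode(), expected.encode()):
            return False
        if self.now() > expires_at or nonce in self.used:
            return False
        self.used.add(nonce)
        return True
```

The `used` set grows without bound here; a real version would drop nonces once they expire. A scraper can still fetch a fresh token per request, so this raises the cost of replaying captured requests rather than blocking scraping outright.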

cirno · on March 25, 2020

Checking the user-agent string for scrapers doesn't work anyway. In addition to using dozens of proxies in different IP address blocks, archive.is spoofs its user agent to match the latest Chrome release and updates it often.
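A small sketch of why the check is so weak: the header is whatever the client says it is. The `looks_like_scraper` filter, the `BOT_MARKERS` list, and the Chrome-style string below are illustrative assumptions, not anything archive.is is known to use. No request is actually sent.

```python
import urllib.request

# Hypothetical substrings a naive server-side filter might look for.
BOT_MARKERS = ("bot", "crawler", "spider", "python-urllib", "curl")


def looks_like_scraper(user_agent):
    """Naive filter: flag a request if its User-Agent contains a known marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


# Example Chrome-style string, for illustration only.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build (but do not send) a request, optionally overriding the User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


honest_ua = "Python-urllib/3.8"
spoofed = build_request("https://example.com/", CHROME_UA)

print(looks_like_scraper(honest_ua))                         # flagged
print(looks_like_scraper(spoofed.get_header("User-agent")))  # not flagged
```

One line of client code defeats the filter, which is why the header alone says very little about who is on the other end.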