Maybe a good illustration would be Clearview AI. They scrape websites, extract information (images), and train ML models to learn embeddings (distances between faces). They indiscriminately collect personal data without any opt-in, offering only a limited opt-out mechanism.
In this case, if this tool is used to scrape a website, there are two direct issues:
1/ no immediate way for the website owner to exclude this particular scraper (what is its user agent?)
2/ no way for data subjects (whose data is present on the website) to check whether the scraper has learned their personal data into its embeddings. Data being publicly available doesn't mean it can be used freely [at least outside the US, where we have much stricter rules on scraping].
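To make the first point concrete: robots.txt exclusion only works if the scraper declares a stable user agent that site owners can target. A minimal sketch with Python's standard `urllib.robotparser`, where "ExampleScraperBot" is a hypothetical name (the whole problem is that this tool doesn't document one):

```python
from urllib import robotparser

# A site owner trying to block a specific scraper by name.
# "ExampleScraperBot" is hypothetical; without a documented user agent
# string, there is nothing to put on the User-agent line.
ROBOTS_TXT = """\
User-agent: ExampleScraperBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The named bot is excluded site-wide...
print(rp.can_fetch("ExampleScraperBot", "https://example.com/profiles/1"))  # False
# ...but any crawler with an unknown or undeclared name sails through.
print(rp.can_fetch("SomeOtherBot", "https://example.com/profiles/1"))       # True
```

And this is only advisory anyway: robots.txt is a convention, not an enforcement mechanism, so a scraper that ignores it (or rotates user agents) is unaffected.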