Hacker News

As someone who built a machine-learning field detector that works on the source HTML alone, I can say there are a few challenges to doing this detection in the browser:

1. JavaScript is not a great language for building high-speed, low-resource inference engines.

2. Sending a user's HTML to your cloud for inference is a privacy, security, and latency nightmare.

3. Regex matching doesn't get you anywhere near 95% field accuracy on the web (to say nothing of form accuracy, i.e. identifying every field on a page correctly). You'd be amazed (or maybe not, if you're a web dev) at the inconsistency in field names: machine-generated names, missing names, duplicate names... It's magical. Even to a human looking at a rendered page, it can be unclear which field is which. Just watch Chrome try to fill out a complicated address form using a saved persona.

4. Even if achieved in JavaScript, the model would be simple to pull out and reuse elsewhere, possibly to learn how to game it.

5. Good models will be built by good data scientists, whose favorite tools likely don't produce models that can be serialized for use in a JavaScript inference engine.
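To make point 3 concrete, here is a toy sketch (all field names below are hypothetical but typical of real forms) of why a simple regex over `name` attributes falls apart: only conventionally named fields match, while machine-generated, missing, and duplicate names are opaque.

```python
import re

# Hypothetical field names of the kinds seen in real-world forms:
field_names = [
    "ctl00$ContentPlaceHolder1$txt47",  # machine-generated (ASP.NET style)
    "",                                  # name attribute missing entirely
    "field1",                            # generic name
    "field1",                            # duplicate of the one above
    "user_email",                        # the one case a regex handles well
]

# A naive regex for "is this an email field?"
EMAIL_PATTERN = re.compile(r"e[-_]?mail", re.IGNORECASE)

matches = [bool(EMAIL_PATTERN.search(name)) for name in field_names]
# Only the last, conventionally named field is identified; the other four
# carry no usable signal for a pattern matcher, however clever the regex.
```

Each regex you add only covers one naming convention, which is why per-field accuracy plateaus well below what a learned model can reach.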

This all points to the approach taken here: run inference in native code (likely the same stack that trained the model) and interface with the page via an extension that ships the HTML there and back. It's all local, so the data never leaves the machine.
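The local round trip described above can be sketched with Chrome's native messaging protocol (length-prefixed JSON over stdio between the extension and a native host). The `score_fields` model call below is a placeholder, not the actual product's interface:

```python
import json
import struct
import sys

def encode_message(obj):
    """Frame a JSON message the way Chrome native messaging expects:
    a 4-byte little-endian length prefix followed by UTF-8 JSON."""
    body = json.dumps(obj).encode("utf-8")
    return struct.pack("<I", len(body)) + body

def decode_message(buf):
    """Inverse of encode_message; returns (object, bytes_consumed)."""
    (length,) = struct.unpack("<I", buf[:4])
    return json.loads(buf[4:4 + length].decode("utf-8")), 4 + length

def score_fields(html):
    # Placeholder for the real native inference engine.
    return {"fields": [], "source_bytes": len(html)}

def serve(stdin=sys.stdin.buffer, stdout=sys.stdout.buffer):
    """Read one framed request from the extension, reply with predictions."""
    raw_len = stdin.read(4)
    if len(raw_len) < 4:
        return
    (length,) = struct.unpack("<I", raw_len)
    request = json.loads(stdin.read(length).decode("utf-8"))
    stdout.write(encode_message(score_fields(request["html"])))
    stdout.flush()
```

The extension's content script grabs the page HTML, posts it to this host, and the predictions come back the same way; nothing crosses the network.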

It's very high stakes to take user-controlled HTML into unsafe memory, though a large attack surface doesn't automatically mean an insecure implementation.



These are good points but:

> JavaScript is not a great language for building high-speed, low-resource inference engines.

This is a browser plugin analyzing HTML downloaded at human browsing speed. I doubt performance is a primary requirement.

> Even if achieved in JavaScript, the model would be simple to pull out and reuse elsewhere, possibly to learn how to game it.

The binary model is obfuscated but still distributed. I expect that the added difficulty of working with the binary model is small compared to the overall challenge of gaming it.

> Good models will be built by good data scientists, whose favorite tools likely don't produce models that can be serialized for use in a JavaScript inference engine.

Models built in research-optimized environments can be translated after the fact to match production needs. (It is getting easier with, e.g., standard interchange formats for neural models.) Kaspersky is a resource-rich org working on security software -- exactly the folks who, if diligent and well-intentioned, should invest in such hardening.
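The "translate after the fact" idea can be illustrated with a toy sketch: many classical models reduce to plain parameters that can be re-serialized for a different runtime. The feature names and weights below are invented for illustration, standing in for a hypothetical field classifier whose JSON export a JavaScript shim could load directly.

```python
import json
import math

# Hypothetical logistic-regression field classifier, as it might come
# out of a research stack after training.
model = {
    "type": "logistic_regression",
    "features": ["has_name_attr", "name_token_count", "looks_generated"],
    "weights": [1.7, -0.4, -2.1],
    "bias": 0.3,
}

# Re-serialize for the production runtime (e.g. shipped with an extension).
payload = json.dumps(model)

def predict(m, x):
    """Score one feature vector with the exported parameters."""
    z = m["bias"] + sum(w * xi for w, xi in zip(m["weights"], x))
    return 1 / (1 + math.exp(-z))  # sigmoid

restored = json.loads(payload)
score = predict(restored, [1.0, 0.2, 0.0])  # high score: likely a named field
```

Real pipelines do the same thing at larger scale via interchange formats (ONNX and the like): the training framework and the inference runtime only have to agree on the serialized parameters, not share a language.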


Getting a sparkML serialized model to run in a JavaScript interpreter is not possible today from what I can see (and certainly wasn't the case two years ago). There does seem to be good progress in the field with tensorflow.js and ml.js, but nothing I'd put in production with a few million users. In native Scala, the inference engine with a loaded model takes a few hundred MB of memory; I'd imagine transpiling to JavaScript with emscripten or similar would balloon that quite a bit.

I'd be really glad if there were a viable method to do this without murdering the end user's device.




