Very interesting idea. However, having dabbled with Bayesian classifiers myself, I suspect that your decision to break the article into sentences may be a fatal flaw...both the Mturkers and the classifier lose out on valuable context that way. I know because I took a similar approach in one of my own projects, and the classifier could never gain enough 'confidence' to classify things, even after thousands of training examples.
The approach might work pretty well if you broke it down into paragraphs, where some context is preserved, but generally the more context you provide, the better the classifier will perform, often for surprising reasons. For example, I built one classifier that performed poorly until I re-trained it without stripping HTML and headers; after that it performed splendidly. On examination I found that tokens such as domains and IP addresses in headers and links were hugely influential (and rightly so) in the classification.
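To make the point concrete, here's a minimal multinomial naive Bayes sketch in pure Python. The training strings and the "spamhost" domain are entirely made up for illustration (this is not my actual classifier), but they show the mechanism: when header tokens are kept, a domain token can become the most discriminative feature.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Naive whitespace tokenizer; real preprocessing would be richer.
    return text.lower().split()

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()          # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def train(self, text, label):
        self.class_counts[label] += 1
        for tok in tokenize(text):
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / total_docs)  # log prior
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                # Laplace-smoothed log likelihood
                lp += math.log((self.token_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
# With headers kept, the domain token does the heavy lifting.
nb.train("received: from mail.spamhost.example buy pills now", "spam")
nb.train("received: from mx.corp.example meeting agenda attached", "ham")
nb.train("received: from mail.spamhost.example cheap offer", "spam")
nb.train("received: from mx.corp.example quarterly report", "ham")

print(nb.classify("received: from mail.spamhost.example hello"))  # → spam
```

Strip the "received:" headers from those toy examples and the body tokens alone carry much less signal, which is the effect I saw at larger scale.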
Just something to consider. :) Apologies if I incorrectly assumed that your classifier is Bayesian.
No need to apologize. I'm actually not using a generative classifier like naive Bayes; I'm using Conditional Random Fields, which are discriminative models. On my training set, CRFs produced higher accuracy.
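For anyone following along, the generative/discriminative split is easiest to see in the non-sequential case: a linear-chain CRF is to naive Bayes roughly as logistic regression is to naive Bayes, modeling p(label | features) directly instead of p(features | label). Here's a toy logistic regression on bag-of-words with invented "sensational" headlines; it is only an illustration of the discriminative idea, not the model I actually trained.

```python
import math
from collections import defaultdict

def featurize(text):
    # Bag-of-words feature counts.
    feats = defaultdict(float)
    for tok in text.lower().split():
        feats[tok] += 1.0
    return feats

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    # examples: list of (text, label) with label in {0, 1}.
    # Plain stochastic gradient descent on the log loss.
    w, b = defaultdict(float), 0.0
    for _ in range(epochs):
        for text, y in examples:
            feats = featurize(text)
            z = b + sum(w[f] * v for f, v in feats.items())
            err = sigmoid(z) - y  # gradient of log loss w.r.t. z
            b -= lr * err
            for f, v in feats.items():
                w[f] -= lr * err * v
    return w, b

def predict(w, b, text):
    feats = featurize(text)
    return sigmoid(b + sum(w[f] * v for f, v in feats.items()))

data = [
    ("shocking secret they hide", 1),        # 1 = sensational
    ("you will not believe this", 1),
    ("quarterly earnings were flat", 0),     # 0 = neutral
    ("committee publishes annual report", 0),
]
w, b = train(data)
print(predict(w, b, "shocking annual secret") > 0.5)  # → True
```

The CRF adds transition features between adjacent labels on top of this, which is what makes it a sequence model rather than a per-item classifier.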
That being said, I've also thought of several ways to improve how the articles are processed. One idea uses an LSTM, and another would use an RNN to map a sensational article to a non-sensational one. The problem with some of these approaches is that they would require someone to read each article and write an unbiased version of it as training data.
I do like your approach. Do you have any results from your tinkering?