I'm going to describe some failed experiments from my research on sentiment analysis. I am using latent Dirichlet allocation (LDA) and supervised LDA (sLDA). In future iterations, I will develop other custom models that incorporate blog comments, and I will train them using stochastic optimization.
Data
NYT: A corpus of medium-length (500-3000 words) articles from the New York Times, covering nearly every article from January 1, 2008 through September 2011. It contains 115,586 documents, 118,028,937 words, and over 100,000 unique words.
YELP: A corpus of short (10-500 words) local business reviews, almost exclusively of restaurants, from the Yelp.com website. Each review is labeled by its author with 1, 2, 3, 4, or 5 stars to indicate the quality of the restaurant the text describes. It contains 152,327 documents and 19,753,615 words. It also contains over 100,000 unique words, many of which are misspellings.
Experiments
I ran several experiments to find out what information about sentiment can be extracted from the New York Times article dataset (NYT).
I created a vocabulary, nytimes_med_common, based on the NYT dataset, using words that appear in fewer than 40% of the documents and more than 0.1% of the documents. This removes very common and very rare words, which aren't informative about the document collection in general.
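The document-frequency filter described above can be sketched as follows. This is a minimal illustration, not the code used to build nytimes_med_common: the function name build_vocabulary, the assumption that documents are already tokenized into lists of words, and the toy documents are all hypothetical.

```python
from collections import Counter

def build_vocabulary(documents, min_df=0.001, max_df=0.40):
    """Keep words whose document frequency lies strictly between min_df and max_df.

    documents: list of token lists (tokenization is assumed to be done already).
    min_df, max_df: fractions of documents; defaults mirror 0.1% and 40%.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word at most once per document
    return sorted(
        word for word, count in doc_freq.items()
        if min_df < count / n_docs < max_df
    )

# Toy example with looser thresholds so the effect is visible on four documents.
docs = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "mayor", "vetoed", "the", "bill"],
    ["the", "team", "won", "the", "game"],
    ["the", "senate", "debated", "the", "budget"],
]
vocab = build_vocabulary(docs, min_df=0.25, max_df=0.75)
```

In the toy example, "the" is dropped for appearing in every document and the words that appear in only one document are dropped as too rare, leaving the words shared by exactly two documents.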
First, I ran LDA on the NYT dataset using the nytimes_med_common vocabulary. On the most recent 2000 articles, I extracted 40 topics, represented below. The topics closely follow the lines of politics, education, international news, and so on; in other words, they closely model the different sections of the newspaper (lda_c_2011_10_16).
I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary. This excludes many features of the YELP dataset that are specific to restaurant reviews, as well as misspellings (e.g. "terrrrrible"). On the first 10000 reviews of the dataset, I extracted 50 topics. A few of the computed topics consist of negative words. Many of the topics describe specific kinds of restaurants (ice cream shops, Thai food) in detail, in generally neutral or positive terms. There is a Chinese food topic with generally negative terms. The topics with the most extreme coefficients do seem to give a good sense of the polarity of the words they contain. Based on informal analysis, it looks like the topics would have good word intrusion and document intrusion properties (yelp_slda_2011_10_17).
I ran LDA on the NYT dataset starting from the model and the topics extracted by sLDA on the YELP dataset. This did not work very well: it produced about the same topics as LDA from scratch. Perhaps a better experiment would be to take only the 5-10 topics with the most predictive coefficients and run LDA starting from those (yelptopics_nytimes_lda_c_2011_10_17).
More interestingly, I created a lexicon of the words with high coefficients for predicting the polarity of Yelp reviews using Naive Bayes (yelp_lexicon and yelp_lexicon_small). I ran LDA on the NYT dataset using yelp_lexicon as the vocabulary. This brought out a few topics that did not strictly follow the newspaper sections. For example, there is an epidemic/disease topic. There is a "corrections" topic with words like the following: incorrectly, misidentified, erroneously, incorrect, correction. The topic on employment reveals a strong motivator: paid, contract, negotiations, wages, executives, employees, unions, manager, compensation. Many of the topics do match up with sections, like baseball, football, music, food, and books, but overall it is a much noisier set of topics. It is easier to recover a section's topic when that section uses a lot of review-style words (as in food, music, and book reviews). Many of the topics are unidentifiable; perhaps I used too many topics. But some are interesting, such as topic 029 using yelp_lexicon_small: winner, favorite, amazing, perfect, fantastic, outstanding, along with other words from various sections of the newspaper.
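One way such a polarity lexicon could be built is to rank words by the log-likelihood ratio from a multinomial Naive Bayes model. The sketch below is an assumption about the general approach, not the procedure that produced yelp_lexicon: the function name polarity_lexicon, the binary positive/negative labels (the text does not say how star ratings were mapped to polarity), the Laplace smoothing constant, and the toy reviews are all hypothetical.

```python
import math
from collections import Counter

def polarity_lexicon(reviews, top_k=100, alpha=1.0):
    """Rank words by |log P(w | positive) - log P(w | negative)|.

    reviews: list of (tokens, is_positive) pairs; the binary label is an
             assumption, e.g. derived from the star rating.
    alpha:   Laplace smoothing constant (assumed value).
    Returns the top_k (word, score) pairs; score > 0 means positive polarity.
    """
    counts = {True: Counter(), False: Counter()}
    for tokens, is_positive in reviews:
        counts[is_positive].update(tokens)
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    scores = {}
    for word in vocab:
        # Smoothed multinomial Naive Bayes word likelihoods for each class.
        p_pos = (counts[True][word] + alpha) / (totals[True] + alpha * len(vocab))
        p_neg = (counts[False][word] + alpha) / (totals[False] + alpha * len(vocab))
        scores[word] = math.log(p_pos) - math.log(p_neg)
    # Strongest polarity first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-abs(scores[w]), w))
    return [(w, scores[w]) for w in ranked[:top_k]]

# Toy reviews, purely illustrative.
reviews = [
    (["amazing", "food", "amazing", "service"], True),
    (["perfect", "food"], True),
    (["terrible", "food", "terrible", "service"], False),
    (["rude", "service"], False),
]
lexicon = polarity_lexicon(reviews, top_k=4)
```

Words such as "food" and "service" occur in both classes and score near zero, so they fall out of the lexicon, while strongly polarized words rise to the top.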
For a final experiment, I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary to generate topics with coefficients. I then ran inference on the NYT articles using these topics and coefficients. The distribution of predicted ratings looks Gaussian with mean 3.5 and standard deviation 0.25. Nearly all the documents are predicted to have between 3 and 4 stars, with less than 5% below 3 or above 4. Even the extremes are not meaningful: the documents with the highest predicted ratings include many death-related and terrorism-related articles, and the documents at the negative extreme are not consistent either.
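To make the inference step concrete: in sLDA, the expected rating of a document is the dot product of the topic coefficients with the document's topic proportions. The sketch below illustrates only that final step under stated assumptions; the function name predict_rating, the coefficient values, and the two example documents are hypothetical and are not taken from the trained model.

```python
def predict_rating(topic_proportions, coefficients):
    """Expected sLDA response: sum over topics of proportion * coefficient.

    topic_proportions: a document's inferred topic mixture (must sum to 1).
    coefficients:      one regression coefficient per topic.
    """
    if len(topic_proportions) != len(coefficients):
        raise ValueError("need one coefficient per topic")
    if abs(sum(topic_proportions) - 1.0) > 1e-9:
        raise ValueError("topic proportions must sum to 1")
    return sum(p * c for p, c in zip(topic_proportions, coefficients))

# Hypothetical coefficients on the star scale: one negative topic,
# two neutral topics, one positive topic.
coefficients = [1.5, 3.5, 3.5, 5.0]

focused = [0.0, 0.0, 0.1, 0.9]      # mass concentrated on the positive topic
diffuse = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly over all topics

focused_rating = predict_rating(focused, coefficients)
diffuse_rating = predict_rating(diffuse, coefficients)
```

Because the prediction is a weighted average of the coefficients, a document that spreads its mass over many topics is pulled toward the middle of the coefficient range. That may be part of why the predicted ratings for news articles cluster so tightly, although the text above does not establish the cause.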
My next experiment will be to try to isolate topics that relate specifically to sentiment, independent of domain. One idea I have relates to fixing topics during training (an idea Chong Wang introduced to me). My idea is to run LDA on the YELP dataset to generate domain topics. Then, I will run sLDA with those topics fixed, plus 2-10 extra topics that are left free. I predict that the fixed topics will act as background with middling coefficients, and that the remaining trained topics will end up with extreme coefficients and will contain strong sentiment words independent of the topics in the domain.
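A minimal sketch of what "fixing" topics could look like inside a variational EM loop: during the topic update, fixed topics keep their current word distribution while free topics are re-estimated from expected counts. This is an assumption about one possible mechanism, not a description of any existing sLDA implementation; the function name update_topics, its arguments, and the toy numbers are hypothetical.

```python
def update_topics(expected_counts, current_topics, fixed):
    """One M-step-style topic update that leaves fixed topics untouched.

    expected_counts[k][w]: expected number of times word w was assigned to topic k.
    current_topics[k][w]:  current P(word w | topic k).
    fixed[k]:              True if topic k must keep its current distribution.
    """
    new_topics = []
    for counts, topic, is_fixed in zip(expected_counts, current_topics, fixed):
        total = sum(counts)
        if is_fixed or total == 0:
            # Fixed (or unused) topic: carry the distribution over unchanged.
            new_topics.append(list(topic))
        else:
            # Free topic: maximum-likelihood estimate from expected counts.
            new_topics.append([c / total for c in counts])
    return new_topics

# Toy example: topic 0 is a fixed domain topic from LDA, topic 1 is free.
current = [[0.7, 0.2, 0.1], [1 / 3, 1 / 3, 1 / 3]]
counts = [[5, 5, 0], [1, 1, 6]]
updated = update_topics(counts, current, fixed=[True, False])
```

In this sketch only the word distributions are frozen. The regression coefficients for all topics, fixed and free, would still be fit in the usual way, which is what would let the fixed topics settle on middling coefficients if the prediction above holds.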