Tech and Business Pearls

Sentiment analysis using transfer learning from reviews to news

I'm going to describe some failed experiments from my research in sentiment analysis. I am using LDA (latent Dirichlet allocation) and supervised LDA (sLDA). In future iterations I will be developing other custom models that incorporate blog comments, and I will be training using stochastic optimization.


NYT: A corpus of medium-length (500-3000 words) articles from the New York Times, containing nearly every article from January 1, 2008 through September 2011: 115,586 documents, 118,028,937 words, and over 100,000 unique words.

YELP: A corpus of short (10-500 words) local business reviews, almost exclusively of restaurants, from the Yelp website. Each review is labeled with 1, 2, 3, 4, or 5 stars by its author to indicate the quality of the business the text describes. It contains 152,327 documents and 19,753,615 words, and also over 100,000 unique words, many of which are misspellings.


I ran several experiments to figure out what sentiment information can be extracted from the New York Times articles dataset (NYT).

I created a vocabulary, nytimes_med_common, from the NYT dataset using words that appear in less than 40% of the documents and more than 0.1% of the documents. This removes very common words and very rare words, neither of which is informative about the document collection in general.
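The document-frequency filter above can be sketched in a few lines of Python (the function name and thresholds here are my own illustration; the actual nytimes_med_common vocabulary was built from the full corpus):

```python
from collections import Counter

def build_vocabulary(documents, min_df=0.001, max_df=0.40):
    """Keep words whose document frequency falls strictly between
    min_df and max_df (as fractions of the corpus).

    documents: list of tokenized documents (lists of word strings).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word at most once per document
    return {
        word for word, df in doc_freq.items()
        if min_df < df / n_docs < max_df
    }

# toy corpus: "the" appears in every document, so it is filtered out
docs = [
    ["the", "mayor", "announced", "the", "budget"],
    ["the", "team", "won", "the", "game"],
    ["the", "budget", "vote", "failed"],
]
vocab = build_vocabulary(docs, min_df=0.1, max_df=0.9)
```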

First, I ran LDA on the NYT dataset using the nytimes_med_common vocabulary. From the most recent 2,000 articles, I extracted 40 topics, represented below. The topics closely follow the lines of politics, education, international news, and so on; they closely model the different sections of the newspaper. (lda_c_2011_10_16)
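For readers unfamiliar with the model: the experiments above use lda-c (variational EM), but the same posterior can be approximated with a toy collapsed Gibbs sampler. The sketch below is my own illustration of LDA inference, not the code used in the experiments:

```python
import random

def lda_gibbs(docs, vocab, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of lists of word ids (ints in range(len(vocab))).
    Returns topic-word counts and per-document topic counts.
    """
    rng = random.Random(seed)
    V = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]      # document-topic counts
    n_kw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                       # topic totals
    z = []                                     # topic assignment per token
    for d, doc in enumerate(docs):             # random initialization
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # resample proportional to the full conditional
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(n_topics)
                ]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_kw, n_dk

# tiny synthetic corpus of word ids over a 5-word vocabulary
docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1]]
n_kw, n_dk = lda_gibbs(docs, vocab=range(5), n_topics=2, n_iters=20)
```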

I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary. This excludes many features of the YELP dataset that are specific to restaurant reviews, as well as misspellings (e.g. "terrrrrible"). From the first 10,000 reviews of the dataset, I extracted 50 topics. The computed topics include a few that describe negative words. Many of the topics describe specific kinds of restaurants (ice cream shops, Thai food) in detail, in generally neutral or positive terms; there is a Chinese food topic with generally negative terms. The topics with the most extreme coefficients do seem to give a good sense of the polarity of the words they contain. Based on informal analysis, it looks like the topics would have good word-intrusion and document-intrusion properties. (yelp_slda_2011_10_17)

I ran LDA on the NYT dataset starting from the model and topics extracted by sLDA on the YELP dataset. This did not work very well: it converged to about the same topics as LDA trained from scratch. Perhaps a better experiment would be to take only the 5-10 topics with the most predictive coefficients and run LDA starting with those. (yelptopics_nytimes_lda_c_2011_10_17)
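One way to warm-start LDA from previously learned topics is to convert each topic's word distribution back into pseudo-counts before training resumes. This helper is a hypothetical sketch of that seeding step, not the initialization lda-c performs internally:

```python
def seed_topic_counts(prior_topics, pseudo_count=100.0):
    """Turn learned topic-word distributions into initial pseudo-counts.

    prior_topics: list of topics, each a list of word probabilities.
    A larger pseudo_count makes the prior topics harder to move away from.
    """
    return [[p * pseudo_count for p in topic] for topic in prior_topics]
```

With only the 5-10 most predictive topics seeded this way, the remaining topics would start from scratch while the seeded ones anchor the sentiment-bearing directions.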

More interestingly, I created a lexicon of the words with high coefficients for predicting the polarity of Yelp reviews using Naive Bayes (yelp_lexicon and yelp_lexicon_small). I then ran LDA on the NYT dataset using yelp_lexicon as the vocabulary. This brought out a few topics that did not strictly follow the newspaper sections. For example, there is an epidemic/disease topic. There is a "corrections" topic with words like the following: incorrectly, misidentified, erroneously, incorrect, correction. The topic on employment reveals a strong motivator: paid, contract, negotiations, wages, executives, employees, unions, manager, compensation. Many of the topics do match up with sections, like baseball, football, music, food, and books, but it is a much noisier set of topics. It is easier to find the same section topics when that section uses a lot of review-style words (as food, music, and book reviews do). Many of the topics are unidentifiable; perhaps I used too many topics. But some are interesting, such as topic 029 using yelp_lexicon_small: winner, favorite, amazing, perfect, fantastic, outstanding, alongside words from various sections of the newspaper.
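Extracting a polarity lexicon from Naive Bayes amounts to ranking words by their class log-odds. The sketch below is a minimal illustration of that idea (the function name, the 4-vs-2-star cutoffs, and the smoothing are my assumptions, not the exact procedure behind yelp_lexicon):

```python
import math
from collections import Counter

def naive_bayes_lexicon(reviews, top_k=5, smoothing=1.0):
    """Score words by their Naive Bayes log-odds of appearing in
    positive (4-5 star) vs negative (1-2 star) reviews.

    reviews: list of (tokens, stars) pairs. Returns the top_k words
    by absolute log-odds, i.e. the most polar words.
    """
    pos, neg = Counter(), Counter()
    for tokens, stars in reviews:
        if stars >= 4:
            pos.update(tokens)
        elif stars <= 2:
            neg.update(tokens)
    vocab = set(pos) | set(neg)
    # Laplace-smoothed class-conditional word probabilities
    n_pos = sum(pos.values()) + smoothing * len(vocab)
    n_neg = sum(neg.values()) + smoothing * len(vocab)
    scores = {
        w: math.log((pos[w] + smoothing) / n_pos)
           - math.log((neg[w] + smoothing) / n_neg)
        for w in vocab
    }
    return sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:top_k]

reviews = [
    (["amazing", "food"], 5),
    (["amazing", "service"], 5),
    (["terrible", "food"], 1),
    (["terrible", "wait"], 1),
]
lexicon = naive_bayes_lexicon(reviews, top_k=2)
```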

A final experiment used the Yelp dataset with the nytimes_med_common vocabulary. I ran sLDA on the Yelp dataset to generate topics with coefficients, then ran inference on the news articles using those topics and coefficients. The distribution of predicted ratings looks Gaussian with mean 3.5 and standard deviation 0.25. Nearly all the documents are predicted to fall between 3 and 4 stars, with fewer than 5% below 3 or above 4. Even at the extremes the predictions are unreliable: the documents with the highest predicted labels include many death-related and terrorism-related articles, and the negative extremes are not consistent either.
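The clustering toward the middle makes sense given how sLDA predicts: the response is a linear function of a document's empirical topic proportions, so a news article spread across many middling-coefficient topics averages out near the mean. A minimal sketch of that prediction step, with hypothetical topic proportions and coefficients:

```python
def predict_rating(topic_proportions, coefficients):
    """sLDA-style response prediction: rating ~ eta . z_bar, the dot
    product of per-topic regression coefficients with the document's
    empirical topic proportions."""
    return sum(p * c for p, c in zip(topic_proportions, coefficients))

# hypothetical document: 70% a neutral topic (coefficient 3.5),
# 30% a strongly negative topic (coefficient 1.0)
rating = predict_rating([0.7, 0.3], [3.5, 1.0])  # 0.7*3.5 + 0.3*1.0 = 2.75
```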

My next experiment will try to isolate topics that relate specifically to sentiment, independent of domain. One idea relates to fixing topics during training (an idea Chong Wang introduced to me). My plan is to run LDA on the Yelp dataset to generate domain topics, then run sLDA with those topics held fixed plus 2-10 extra topics which are unfixed. I predict the fixed topics will act as background with middling coefficients, while the remaining trained topics will end up with extreme coefficients and will contain strong sentiment words independent of the topics in the domains.

over 7 years ago on October 21 at 5:41 am by Joseph Perla in tech, hacks, research


Hi, my business card says Joseph Perla. Former VP of Technology, founding team. My first college startup was in the education space. My second was Labmeeting, a cross between Google, LinkedIn, and Facebook for scientists. I dropped out of Princeton (twice).

I love to advise and help startups. My code on Github powers many websites and iPhone apps. I give talks about startup tech around the US and also internationally, at conferences in Florence, incubators in Paris, and startups in Budapest.

Twitter: @jperla
