research Pearls

Mobile conference badge project

We developed a new system that uses the camera to scan for nearby devices directionally. Ordinary scanning is expensive and not directional, so this system works much better for applications such as mobile conference badges.

over 7 years ago on May 14 at 8:55 pm by Joseph Perla in research


How to win a Nobel Prize

Richard Hamming, who made Hamming Codes among other things, once gave a famous talk about how to do good research and become a great scientist. His talk is often passed around research circles to introduce new grad students to the process.

The printed version is an exact transcript of his talk, so it rambles. It's not as tight as a proper essay, and it's a little long and repetitive for today's blog age. It is, however, a classic, so I am posting some notes with the major points here. If you like it, I recommend you read the whole thing since it has many interesting anecdotes.

You might notice that a lot of this advice works well in business and startups, and even life in general as well (like "don't get angry").

You and Your Research Notes:

First, decide that you want to do significant, Nobel Prize-level work. It's okay to reach.

It's not all about luck, since lots of great scientists (Einstein, Shannon) made many great contributions. They got many hits, so it doesn't seem like pure luck. "Luck favors the prepared mind."

One of the characteristics you see, and many people have it including great scientists, is that usually when they were young they had independent thoughts and had the courage to pursue them. Einstein challenged ideas about the speed of light when he was 12.

You need brains, but only a certain amount, and you probably have enough.

You need courage to dare to think through some impossible thoughts and follow through. Persevere.

People worry about age, but that might be a social effect. It is hard to work on small problems after you win a Nobel Prize young. You need to plant acorns that will become oak trees.

What most people think are the best working conditions are not. One of the better times of the Cambridge Physical Laboratories was when they had practically shacks - they did some of the best physics ever. Not having enough programmers can force you to invent automatic programming.

You have to have drive, work hard. Knowledge grows like compound interest.

So, effort is important, but you have to apply effort sensibly, or you just spin wheels.

Great scientists can tolerate ambiguity. They know that a theory works, why it works, but also where it doesn't work, and they live in a balance between believing it and not believing it. Darwin wrote down everything that contradicted his beliefs, lest he forget.

Great scientists are committed to their problems, emotionally, so as to not drop them.

Creativity comes out of the subconscious, so focus all your conscious efforts on a problem so that your subconscious also works on the problem for you.

You should be working on the most important problems in your field. Why aren't you? Important problems have a method of attack (unlike, say, teleportation). Most scientists work on problems they do not believe to be important.

Great scientists keep 10 or 20 important problems in their heads and are prepared to attack them when they come across new techniques.

Keep your office door open. You have less short-run efficiency, but achieve more in the long run by learning more from others.

By changing a problem slightly you can often do great work rather than merely good work. Instead of attacking isolated problems, I made the resolution that I would never again solve an isolated problem except as characteristic of a class. The mathematician knows that the business of abstraction frequently makes things simple.

You need to sell your work. There are three things you have to do in selling. You have to learn to write clearly and well so that people will read it, you must learn to give reasonably formal talks, and you also must learn to give informal talks.

You can get what you want in spite of top management. You have to sell your ideas there also.

Drive and commitment. The people who do great work with less ability but who are committed to it get more done than those who have great skill and dabble in it, who work during the day and go home and do other things and come back and work the next day.

One problem is the problem of personality defects, like trying to control everything yourself.

You find this happening again and again; good scientists will fight the system rather than learn to work with the system and take advantage of all the system has to offer.

You should dress according to the expectations of the audience spoken to. If I am going to give an address at the MIT computer center, I dress with a bolo and an old corduroy jacket or something else. I know enough not to let my clothes, my appearance, my manners get in the way of what I care about. An enormous number of scientists feel they must assert their ego and do their thing their way. They have got to be able to do this, that, or the other thing, and they pay a steady price.

Now you are going to tell me that somebody has to change the system. I agree; somebody has to. Which do you want to be? The person who changes the system or the person who does first-class science? Which person is it that you want to be?

On the other hand, we can't always give in. There are times when a certain amount of rebellion is sensible. Originality is being different. You can't be an original scientist without having some other original characteristics. I'm not against all ego assertion; I'm against some.

Don't get angry.

Another thing you should look for is the positive side of things instead of the negative.

Don't give alibis for why you can't do something. To yourself try to be honest.

If you really want to be a first-class scientist you need to know yourself, your weaknesses, your strengths, and your bad faults, like my egotism. How can you convert a fault to an asset? How can you convert a situation where you haven't got enough manpower to move into a direction when that's exactly what you need to do?

In summary, I claim that some of the reasons why so many people who have greatness within their grasp don't succeed are: they don't work on important problems, they don't become emotionally involved, they don't try and change what is difficult to some other situation which is easily done but is still important, and they keep giving themselves alibis why they don't. They keep saying that it is a matter of luck. I've told you how easy it is; furthermore I've told you how to reform. Therefore, go forth and become great scientists!

Q&A Section:

If you read all the time what other people have done you will think the way they thought. If you want to think new thoughts that are different, then do what a lot of creative people do - get the problem reasonably clear and then refuse to look at any answers until you've thought the problem through carefully how you would do it, how you could slightly change the problem to be the correct one.

How to avoid the Nobel Prize effect: somewhere around every seven years make a significant, if not complete, shift in your field.

The moment that physics table I always ate at lost the best people, I left. The moment I saw that the same was true of the chemistry table, I left. I tried to go with people who had great ability so I could learn from them and who would expect great results out of me. By deliberately managing myself, I think I did much better than laissez faire.

over 7 years ago on November 21 at 1:46 pm by Joseph Perla in research, life, philosophy


Sentiment analysis using transfer learning from reviews to news

I'm going to describe some failed experiments in my research in sentiment analysis. I am using LDA and supervised LDA. I will be developing other custom models that incorporate blog comments, and I will be training using stochastic optimization in future iterations.

Data

NYT: A corpus of medium-length (500-3000 words) articles from the New York Times. It contains nearly every article from January 1, 2008 through September 2011. It contains 115,586 documents and 118,028,937 words. It contains over 100,000 unique words.

YELP: A corpus of short (10-500 words) local business reviews, almost exclusively of restaurants, from the Yelp.com website. Each review is labeled with 1, 2, 3, 4, or 5 stars by the author of the review to indicate the quality of the restaurant the text describes. It contains 152,327 documents and 19,753,615 words. It also contains over 100,000 unique words, many of which are misspellings.

Experiments

I ran several experiments to figure out what information can be extracted about sentiment in the New York Times articles dataset, NYT.

I created a vocabulary, nytimes_med_common, based on the NYT dataset using words that appear in less than 40% of the documents and more than 0.1% of the documents. This removes very common words and very rare words, neither of which are informative about the document collection in general.
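As a rough illustration of this kind of document-frequency filter, here is a minimal sketch. The function name build_vocabulary, the toy corpus, and the looser thresholds used in the example are all made up for illustration; the actual nytimes_med_common vocabulary was built from the full NYT corpus with the 0.1% and 40% cutoffs.

```python
from collections import Counter


def build_vocabulary(documents, min_df=0.001, max_df=0.40):
    """Keep words whose document frequency lies strictly between min_df and max_df.

    `documents` is a list of token lists. The defaults mirror the 0.1% and 40%
    cutoffs described in the post; the function itself is a hypothetical helper.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        word
        for word, count in doc_freq.items()
        if min_df < count / n_docs < max_df
    )


# Toy corpus (invented), with looser thresholds so the effect is visible.
docs = [
    "the senate passed the budget".split(),
    "the team won the game".split(),
    "the budget cut hit the team".split(),
    "the game went to overtime".split(),
]
vocab = build_vocabulary(docs, min_df=0.25, max_df=0.75)
print(vocab)
```

In the toy corpus, "the" is dropped for appearing in every document, and words that appear in only one document are dropped as too rare.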

First, I ran LDA on the NYT dataset using the nytimes_med_common vocabulary. On the most recent 2000 articles, I extracted 40 topics represented below. The topics closely follow the lines of politics, education, international news, and so on. They closely model the different sections of the newspaper. (lda_c_2011_10_16).

I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary. This excludes many features of the YELP dataset that are specific to restaurant reviews, as well as misspellings (e.g. "terrrrrible"). On the first 10,000 reviews of the dataset, I extracted 50 topics. A few of the computed topics consist of negative words. Many of the topics describe specific kinds of restaurants (ice cream shops, Thai food) in detail, in generally neutral or positive terms. There is a Chinese food topic with generally negative terms. The topics with the most extreme coefficients do seem to give a good sense of the polarity of the words contained within. Based on informal analysis, it looks like the topics would have good word intrusion and document intrusion properties. (yelp_slda_2011_10_17)
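To see why the coefficients matter, recall that in the standard sLDA formulation the expected response for a document is the dot product of its topic proportions with the per-topic coefficients. The sketch below shows only that final prediction step. The function name, the three topics, and every number in it are hypothetical and are not taken from the fitted yelp_slda model.

```python
def predict_rating(topic_proportions, coefficients):
    """Expected sLDA response: dot product of topic proportions and coefficients."""
    if len(topic_proportions) != len(coefficients):
        raise ValueError("need one coefficient per topic")
    if abs(sum(topic_proportions) - 1.0) > 1e-9:
        raise ValueError("topic proportions must sum to 1")
    return sum(p * c for p, c in zip(topic_proportions, coefficients))


# Hypothetical coefficients for a negative, a neutral, and a positive topic.
coefficients = [1.5, 3.5, 4.8]

# A review dominated by the positive topic gets an extreme prediction.
focused = predict_rating([0.0, 0.1, 0.9], coefficients)

# A document spread across topics is pulled toward the middle.
mixed = predict_rating([0.3, 0.4, 0.3], coefficients)

print(round(focused, 2), round(mixed, 2))
```

A topic with an extreme coefficient only moves the prediction when a document puts most of its weight on that topic, which is why the most extreme topics are the ones worth reading for polarity.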

I ran LDA on the NYT dataset starting from the model and the topics extracted from the sLDA on the YELP dataset. This did not work very well, and got about the same topics as LDA from scratch. Perhaps a better experiment would be to take the topics with the most predictive coefficients, the 5-10 of them, and run LDA starting with those. (yelptopics_nytimes_lda_c_2011_10_17).

More interestingly, I created a lexicon of the words with high coefficients for predicting the polarity of Yelp reviews using Naive Bayes (yelp_lexicon and yelp_lexicon_small). I ran LDA on the NYT dataset using the yelp_lexicon as a vocabulary. This brought out a few topics that did not strictly follow the newspaper sections. For example, there is an epidemic/disease topic. There is a "corrections" topic with words like the following: incorrectly, misidentified, erroneously, incorrect, correction. The topic on employment reveals a strong motivator: paid, contract, negotiations, wages, executives, employees, unions, manager, compensation. Many of the topics do match up with sections, like baseball and football and music and food and books, but it is a much noisier set of topics. It is easier to find the same section topics when that section uses a lot of review-style words (as in food, music, and book reviews). Many of the topics are unidentifiable; perhaps I used too many topics. But some are interesting, such as topic 029 using yelp_lexicon_small: winner, favorite, amazing, perfect, fantastic, outstanding, with other words in various sections of the newspaper.
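One plausible way to build such a lexicon, sketched below, is to score each word by its smoothed Naive Bayes log-odds between positive and negative reviews and keep the words with the largest absolute scores. This is an assumption about the general approach, not a record of how yelp_lexicon was actually built: the function name, the binary "pos"/"neg" labels, the smoothing constant, and the toy reviews are all invented for illustration.

```python
import math
from collections import Counter


def naive_bayes_lexicon(labeled_docs, top_k=3, alpha=1.0):
    """Return the top_k words ranked by absolute Naive Bayes log-odds.

    `labeled_docs` is a list of (tokens, label) pairs with label "pos" or "neg".
    Scores are log P(word | pos) - log P(word | neg) with add-alpha smoothing.
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labeled_docs:
        counts[label].update(tokens)

    vocab = set(counts["pos"]) | set(counts["neg"])
    totals = {
        label: sum(counts[label].values()) + alpha * len(vocab)
        for label in counts
    }

    def log_odds(word):
        p_pos = (counts["pos"][word] + alpha) / totals["pos"]
        p_neg = (counts["neg"][word] + alpha) / totals["neg"]
        return math.log(p_pos / p_neg)

    scores = {word: log_odds(word) for word in vocab}
    # Sort by strength of polarity; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda w: (-abs(scores[w]), w))
    return {word: scores[word] for word in ranked[:top_k]}


# Toy reviews (invented); real input would be Yelp reviews mapped to pos/neg.
reviews = [
    ("amazing food amazing service".split(), "pos"),
    ("fantastic food".split(), "pos"),
    ("terrible food rude service".split(), "neg"),
    ("terrible terrible wait".split(), "neg"),
]
lexicon = naive_bayes_lexicon(reviews, top_k=3)
print(lexicon)
```

Words that occur about equally often in both classes, such as "service" here, score near zero and fall out of the lexicon, leaving mostly polarity-bearing words as the vocabulary.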

For a final experiment, I again used the YELP dataset with the nytimes_med_common vocabulary. I ran sLDA on the YELP dataset to generate topics with coefficients, then ran inference on the news articles using these topics and coefficients. The distribution of predicted ratings looks Gaussian with mean 3.5 and standard deviation 0.25. Nearly all the documents are predicted to fall between 3 and 4 stars, with less than 5% below 3 or over 4. Even at the extremes, the predictions are not meaningful: the documents with the highest predicted labels include many death-related and terrorism-related articles. The negative extreme is also not consistent.
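The summary figures above are consistent with one another: 3 and 4 stars each sit two standard deviations from the mean, and a Gaussian places a little under 5% of its mass beyond two standard deviations. A quick check, assuming the predictions are exactly normal (the post only says they look Gaussian):

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


mean, std = 3.5, 0.25  # values reported for the predicted ratings

inside = normal_cdf(4.0, mean, std) - normal_cdf(3.0, mean, std)
outside = 1.0 - inside

print(f"fraction predicted between 3 and 4 stars: {inside:.4f}")
print(f"fraction predicted below 3 or above 4:    {outside:.4f}")
```

Under that assumption roughly 4.6% of articles land outside the 3 to 4 star band, so the predicted ratings carry very little spread for a 1 to 5 star scale.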

My next experiment will be to try to isolate topics that relate specifically to sentiment, independent of domain. One idea I have relates to fixing topics when training (an idea Chong Wang introduced to me). My idea is to run LDA on the YELP dataset to generate domain topics. Then, I will run sLDA with those topics fixed plus 2-10 extra topics that are unfixed. I predict that the fixed topics will act as background with middling coefficients, and that the remaining trained topics will end up with extreme coefficients and will contain strong sentiment words independent of the domain topics.

over 7 years ago on October 21 at 5:41 am by Joseph Perla in tech, hacks, research


Howdy, my name is Joseph Perla. Former VP of Technology, founding team, Turntable.fm. Entrepreneur. Actor. Writer. Art historian. Economist. Investor. Comedian. Researcher. EMT. Philosophe

Twitter: @jperla
