tech Pearls

Venture Capital is broken

VC is broken. We need more operators making faster, knowledgeable investment decisions and actually supporting innovation rather than coasting on reputation. LPs at big institutions allocate $200 billion of AUM to VC (on the order of $20 billion a year). These LPs do not have enough context to judge who understands how to build businesses and foster innovative companies.

As a result, much of the total venture AUM ends up being captured by financiers or others who do not know how to create or properly grow innovation as operators and experienced founders do. They make sub-optimal decisions and capture rather than create value. This lamentable situation is especially acute outside of the US, where VCs provide almost no seed-stage funding and there is a corresponding dearth of startups. Furthermore, the growing number of accelerators in Europe are often run by non-founders.

Various new firms have appeared in the last decade (including A16Z, Union Square Ventures, First Round, and even YC) that are moving toward a stronger, more informed, and more entrepreneur-friendly model. We expect their returns to outpace those of the rest of the industry (which are negative). We need more. Many more, across seed, A, and growth.

We believe strongly that we aren't fighting for a slice of the same pie, but that a focused concentration on properly fostering innovation at the earliest stages will grow the pie for everyone. We want to harness the $200 billion of institutional portfolio allocation and redirect it more efficiently.

11 months ago on October 10 at 11:46 pm by Joseph Perla in tech, startups


The nature of intelligence: brain bowls, cogniphysics, and prochines

Consider the machine, which takes inputs and produces outputs, embodied theoretically as the Turing Machine, but taking many possible forms.

Your brain is just a soup bowl of machines. For clarity, we will distinguish the machines of the brain by calling them "prochines" (a portmanteau of proteins and machines). The way prochines interact is studied in the field of cogniphysics.

We can describe the machines, write out their source codes precisely, and predict how they interact. The defining movement of the 21st century will be the understanding, discovery, explicit codification, and creation of myriad machines and prochines.

This process has been fomenting subliminally since the beginning of time. Your body, your tools, morality, the Internet, philosophical ideas, and emotions are all examples of machines. Your dreams and ideas are reality.

A few immediate and many more distant conclusions follow from this theoretical foundation. In particular, IQ tests lack theoretical grounding and The Bell Curve book is fundamentally wrong and misguided.

11 months ago on October 9 at 11:59 pm by Joseph Perla in brain bowls, cogniphysics, prochines, intelligence, philosophy, tech, overview


Bitcoin: A call-to-arms for technologists

I asked a friend making millions trading commodities in New York what he thought of bitcoin. His one-word SMS reply: "scam."

I can see why he might think that. Finance is his world. If Goldman Sachs had created bitcoin, then it would be obvious that they were merely creating another house of cards.

But bitcoin is different. Bitcoin is based on math and deep computer science. It was created by a technologist, and adopted early by technologists. Technologists build things: Apple, Facebook, Google. They created the modern world around us. With the support of a few dozen engineers, companies worldwide can begin accepting bitcoin practically overnight.

I personally know several growing startups with bitcoin integration nearly complete. Who would stop employees from integrating with a payment system with a lower transaction fee that lets any free person on the planet pay?

This is the true test of bitcoin: do engineers have the power to remake the world and a parallel economy overnight? Do you?

over 1 year ago on April 10 at 4:17 pm by Joseph Perla in tech, bitcoin


Stanford is startups

New York made a huge mistake by not choosing Stanford as the engineering partner for its new New York tech university. Stanford is startups. It's their culture. Startups are not just accepted and encouraged; those who want to do startups know to go there. I asked many people why they go to Stanford for graduate school, and the main motivation is to get ideas to do a startup. They infect the rest of the school. These ideas jump around from brain to brain there. Just as ideas of finance jump into Princeton brains when they arrive, Stanford kids with no interest in entrepreneurship quickly catch the same bug. Professors teach classes about and get rich off entrepreneurship. It permeates the atmosphere.

over 1 year ago on March 15 at 4:54 am by Joseph Perla in tech


Today is Internet Freedom Day! DRM-free book about Aaron Swartz's causes

My good friend Marvin Ammori, a board member of Demand Progress, the foundation that Aaron Swartz founded, just published a book on the causes that ended up putting immense pressure on him. Aaron Swartz is on the first page, and his work is mentioned throughout. You need to learn about the legal issues, the history, and what you can do to be a part of this important conversation. Get and read it: On Internet Freedom. Marvin's been working on the book for months, and on the cause for a decade. It's free today only, because today is Internet Freedom Day (the 1-year anniversary of stopping SOPA). You might want to buy it tomorrow, though, because all the proceeds go to Fight for the Future and Demand Progress.

I've been thinking about this all week. I'm tearing up just typing this. I've had anxiety all week. I've never known anyone who died. I don't know how to handle it. Aaron Swartz felt like a good old friend to me. I have never met him, but I've been following his life, his startups, and his blog for years. He is my inspiration for this very blog. I saw another kid, my age, writing eloquent, thoughtful prose and getting great feedback. He began his life on the Internet, in suburban areas far away from the centers of power. The Internet and his blog gave him access and purpose. He dedicated his life to ensuring that everyone else has that very same access.

I reread his article How to be more productive once every year lest I forget. I remember when he renamed his blog Raw Thought.

I can't believe I won't read anything else he writes. He's never replied to my cold emails (he's written many times about his email overflow and busyness). I'm disappointed by that, so in many ways he's not even an acquaintance. But I know he's read my writing, my blog, my Quora posts. He's commented. In some ways we've had conversations in this cosmic universe.

I was in no rush to meet him because I felt I had decades to get the opportunity to know him. Maybe I felt I needed to learn more, build more, accomplish more in order to deserve the opportunity. I built weby because of the inspiration of his web.py framework. I've imitated his articles, scraped millions of academic papers, followed him in startups, absorbed his ideas. He is a part of me. I sometimes feel like just an echo or shadow of his work.

I hope that, in some way, by promoting the causes that he so dearly loved right now, I can help continue his legacy and continue his spirit and life.

over 1 year ago on January 18 at 3:49 am by Joseph Perla in tech, news, life


I help startups around the world

I love helping young companies figure out their strategy and grow. Having started up many companies, I understand a lot of the issues and stress involved in the pre-seed stage. I was in Budapest recently helping out some entrepreneurs, and the #1 newspaper in Hungary interviewed me about my work there. I'm told it's quite flattering, although I don't read Hungarian, which is a beautiful but complicated language!

over 1 year ago on November 26 at 6:36 pm by Joseph Perla in tech, entrepreneurship


MailPlus simplifies your inbox

MailPlus asks you, your colleagues, and your friends to add a simple expected action to every email you send. MailPlus makes it clear what you are supposed to do with every email you get (read something, reply with details, forward, etc).

MailPlus is simple. Just add "MailPlus: read" or "MailPlus: event" to each email on its own line at the bottom or top. There is no set template, but most common actions are suggested below.

MailPlus encourages you to send emails which require only one action by the recipient. If you have multiple actions required, please send multiple emails.

MailPlus makes it easier to filter urgent emails from non-urgent emails. This means you can better spend your time going through emails in the right order and the right frame of mind. MailPlus makes it easy to sort which actions you can do now versus later.
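As a rough sketch of how a mail filter could use this convention, the Python below scans a message body for a line of the form "MailPlus: action" and falls back to "idle" when none is present. The function name `mailplus_action` and the pattern are assumptions made for illustration; they are not part of MailPlus itself.

```python
import re

# A line that consists only of "MailPlus: <action>", e.g. "MailPlus: read".
MAILPLUS_LINE = re.compile(r"^\s*MailPlus:\s*(.+?)\s*$", re.IGNORECASE)


def mailplus_action(body, default="idle"):
    """Return the first MailPlus action found in an email body.

    Emails with no MailPlus line are treated as "idle", matching the
    default described in the suggested actions.
    """
    for line in body.splitlines():
        match = MAILPLUS_LINE.match(line)
        if match:
            return match.group(1).lower()
    return default
```

A filter could then sort messages so that, say, "call" and "yes or no" come before "fyi" and "idle"; which actions count as urgent is left to the reader.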

We recommend that you copy and paste the following as part of your email signature to (1) remind you to add a MailPlus action to all emails you send (change "idle" to anything else), and (2) tell other people about MailPlus.


MailPlus: idle


MailPlus is simple. MailPlus states the action expected for each email.

Learn more about MailPlus: http://bit.ly/mail-plus


Suggested Actions

  • read

    • I thought this article/link would be interesting to you. Please read it. Let me know your thoughts, if you'd like.

  • yes or no

    • Please decide yes or no and reply to this message.

  • event

    • I am inviting you to this event. Please reply on the event page yes/no/maybe or reply to me if there is no Facebook, EventBrite, or other event link.

  • call

    • Please call me now (or whatever time is stated in the message). My phone number is detailed in the email or my signature.

  • fyi

    • For your information. Just letting you know about this information or link. There is no need to reply.

  • confirmation

    • Just archive this message for your records. No need to reply. This is useful for credit card purchases or other notifications.

  • introduction

    • I am introducing you to this person, who is cc'd on the email. Please reply promptly and decide to communicate more over email, phone, or in person. Please bcc me so that I know you received the email, but I don't receive all your correspondence.

  • idle

    • Just checking in. I haven't talked to you in a while, or I am bored. I'd like to talk to you, but it's not urgent. Feel free to reply when you have time. Emails by default, with no MailPlus line, are expected to be "idle". Add a different line if you have a more urgent or specific action.

over 1 year ago on October 12 at 1:11 pm by Joseph Perla in life, tech


Apple Maps will cause huge problems

Apple Maps is gorgeous, and I love using their navigation system. However, it has a huge problem: it has wrong locations. It is missing a lot of basic places like restaurants, but more fundamentally, it can take you to the wrong city.

I was going to a meeting in Palo Alto, and the app listed a Hawthorne address in both Palo Alto and Los Altos. I selected the Palo Alto one, but it instead took me to an address in Los Altos near the Palo Alto border, nowhere near the real Palo Alto address.

This will cause huge problems when it launches, given that hundreds of millions of people have iOS devices. People will revolt and want Google Maps back. Apple invested a lot in design, but they are missing the quality and utility that are a key part of Google's philosophy and expertise.

over 2 years ago on September 7 at 1:50 pm by Joseph Perla in tech


You need to learn User Experiences

Blog posts and other resources are great ways to get introduced to an idea, or expand the scope of what you know, but they generally are very shallow. You want to learn as much as you can about a topic efficiently. Blog posts are inefficient, but good books are excellent. Avoid bad books.

Here are great books about UI Design that will make you as good as you can be.

Donald Norman: Design of Everyday Things

Good introduction to the idea of being observant about the world around you. Its examples are about everyday objects like faucet handles, but the concepts apply to websites and mobile too.

Steve Krug: Don't Make Me Think

This is a fantastic introduction to website usability. It focuses on the modern web, and it covers a lot of great common patterns we see. It shows you how to have an eye for good web design.

Nielsen: Web Usability

Very deep research-backed information on designing interfaces. Look at all of his books.

Jef Raskin: The Humane Interface

Amazing book that opened my eyes to how simple and powerful computer interfaces can be. Jef Raskin designed the Mac, and his ideas are brilliant. Also check out his son Aza's stuff.

Edward Tufte: The Visual Display of Quantitative Information

Classic, brilliant books about representing information to people in the most informative way.

I actually haven't read this, but I totally believe the thesis. Great idea. Maybe it has little to do with design, though.

More books:

A friend goes to a program that recommends these books: Reading Recommendations

Crumlish: Designing Social Interfaces

Summer Bedard (great UX designer of Turntable.fm) recommends this book, if you're into social networks.

over 2 years ago on May 2 at 12:16 am by Joseph Perla in tech


Google thinks it's smarter than you

Google has always been a great search engine. Since 1998, whenever anyone felt lucky, they found what they wanted instantly. On the other hand, Altavista, Excite, Yahoo, and all of the other search engines were frustrating. You search for "Bob Carpenter" and it gives you results for woodworking.

Recently, however, you may have noticed that Google changes your queries. When you search for "boosting" it changes to "boost" which has very different results since boosting is a theoretical machine learning idea. Google changed "pyquery" to "jquery" when I searched for it.

Google thinks it's smarter than you. That's a problem because the results for non-popular queries look increasingly like Altavista's. I want Google back.

over 2 years ago on April 26 at 8:53 pm by Joseph Perla in tech


Python in Italy

I was in Europe for a bit this summer. I wanted to go to a technology conference to meet fellow hackers internationally. I saw that EuroPython (the Python language conference of Europe) was in Florence, Italy this year. They were offering scholarships for students for free tickets to the conference plus hotel. I applied, pointing to my many contributions to Python like my Python web framework.

I got it, and had a fantastic time. It was incredibly well-organized, and I met some brilliant hackers. My room-mates made the Kivy touch platform, psyco, and the Italian Pirate Bay. I met Armin Rigo, the hero-genius behind PyPy.

I gave a talk about my minimalist with-statement based Python templating system.

Google also hosted a programming competition, Google Code Jam. I got 2nd place among all contestants at the conference and I won a nice new Android phone!

It was a good trip that paid for itself.

over 2 years ago on January 8 at 1:12 am by Joseph Perla in travel, tech


Teach yourself Git in 2 minutes

Git is very simple. It's very powerful, but fundamentally very logical and very simple. If you try to learn everything you can do with git, then the information will flood your brain and drown you. That's true of any powerful tool like Photoshop and Unix.

But if you just want to use git to back up your code changes, develop new branches, and share your source, git is actually as straightforward as SVN. Avoid complex and dangerous commands like git-rebase. I've worked on large codebases with distributed teams and I've never needed anything more than basic commit, branch/merge, and push/pull. Git also has useful log, diff, and grep tools for quickly finding out information about your code.

Git Flow: Git for Humans

To use git without brain augmentation surgery, you should make a simple, consistent system for yourself with a handful of commands. Or just use my system.

You want to commit often, so it's good to create bash shortcuts so that you use git more often. The 2-letter shortcuts encourage you to commit more often and keep everyone's code up to date. You'll never be afraid of losing code.

Here's my main workflow, commit and push:

$ # make changes, fix bugs...

$ cm "fixed bug 214 in the UI"

$ ph

I'm constantly checking the status to see if I forgot to add files or commit something:

$ sl

# On branch master

nothing to commit (working directory clean)

If I'm branching, I create a branch, make my changes, and then merge.

$ ct -b newfeature

$ # make changes

$ ct master

$ me newfeature

And then I can push my changes and delete the branch.

$ ph

$ bh -d newfeature

You want to commit often, so always cm (git commit -a -m) and ph (git push) after even small changes.

The codes are easy to remember because they are consistent. The codes are always 2 letters, composed of precisely the first letter of the command and the last letter (including all of the options). By using the last letter of the command including options, the shortcut tricks your mind into thinking of the full command every time you type it. Normally, you forget commands with abbreviations of the first letters, but with my system you remember the whole command every time, so you can still use git on other systems and other people's computers.
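The naming rule is mechanical enough to write down as code. Here is a small Python sketch of it; the function name `shortcut` is made up for illustration, and it assumes the command is written with the leading "git" as in the aliases in this post.

```python
def shortcut(command):
    """Build the 2-letter code for a git command line.

    The code is the first letter of the git subcommand followed by the
    last letter of the whole command, options included.
    """
    words = command.split()
    if words and words[0] == "git":
        words = words[1:]  # the leading "git" is not part of the code
    if not words:
        raise ValueError("expected a git subcommand")
    return (words[0][0] + words[-1][-1]).lower()
```

For example, shortcut("git commit -a -m") gives "cm" and shortcut("git status -uall") gives "sl", which match the aliases used throughout this post.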

Git Flow Examples

Where is this variable myVar declared (git grep)?

$ gp myVar

How is my branch different from master (git diff --ignore-space-change)?

$ de master

I forgot, did I commit all my changes, what files did I forget to add (git status -uall)?

$ sl

How do I make a new branch (git checkout -b)?

$ ct -b mybranch

How do I merge it back (git merge)?

$ #ensure you've committed all changes in your branch

$ ct master

$ me mybranch

How do I delete my branch after I've merged changes (git branch -d)?

$ bh -d mybranch

How do I pull and push my changes (git pull, git push)?

$ pl

$ ph

What changes were made recently (git log)?

$ lg

What branches exist and which branch am I on (git branch)?

$ bh

What if I screwed up and want to remove all the code in my branch without merging (git branch -D, since caps are harder)?

$ bh -D mybranch

How do I make a new repository?

$ git init

I just added new files to my code, how do I add them to my git repository?

$ ad .

Below are my bash aliases. Add these to your ~/.bashrc file so that you can use these shortcuts too:

alias ad='git add'

alias pl='git pull'

alias ph='git push'

alias cm='git commit -a -m'

alias sl='git status -uall'

alias lg='git log'

alias gp='git grep'

alias de='git diff --ignore-space-change'

alias me='git merge'

alias bh='git branch'

alias ct='git checkout'

over 2 years ago on January 6 at 1:12 am by Joseph Perla in tech, hacks


Don't write on the whiteboard

I recently interviewed at a major technology company. I won't mention the name because, honestly, I can't remember whether I signed an NDA, much less how strong it was.

I did well. Mostly because of luck. I normally trip over myself when I interview. I guess I've improved over the years. Here are a few tips to ace your own interview.

1. Don't write on the whiteboard

When I interviewed at Palantir around 5 years ago, I had a lot of trouble with this. Yes, I knew next to nothing about computer science then, but I should have been able to answer many of those questions. For example, Palantir asked me to write an API for a hash table, and I forgot set key and get key, the most basic operations. The alien situation of the whiteboard contributed to my nervousness. I didn't get an offer from Palantir.

Most people think you have to write on the whiteboard. Steve Yegge recommends that you practice writing code on a whiteboard and even buy and bring your own marker to the interview. That's pretty extreme and truly conveys the capriciousness of modern-day tech interviewing.

The interviewer started by asking me to code up a simple recursive calculation, using any language I wanted. "I don't like to write on whiteboards," I said. "It feels unnatural and distracting. I'd prefer to write on paper." "Okay," he shrugged.

The interviewers don't care. Use paper.

2. Bring your own paper and pen

So I asked for some paper and a pen. But there was no paper around, only some post-it notes. My mistake.

You should always have paper and pen anyway to write down ideas. On the subway, in line for movie tickets. Or you can keep a few sheets of paper with your resume and folder you brought to the interview (you did that, right?). Moleskines are excellent notebooks.

Some of the best programmers figure out the high-level overview on paper before they write a single line of new code.

3. Use Python

Even if you are a C++ systems guru. Even if you only know how to use Eclipse to program Java. Learn Python and use it during your interview. Python's philosophy is very simple and consistent. It's largely composed of a subset of the ideas in Java and C++. 80% of Python is based around the dictionary (HashMap). It will take you a few days to learn and not much longer to master.

In other languages, you will waste a lot of time writing string manipulation code and initialization code that you can do in one line of Python. Get to the algorithm.

All I had was a post-it note, a tiny amount of space. So I wrote down the algorithm in Python line-by-line on the post-it note.

4. Write short algorithms, then make them half as long, then make them shorter, then ask an expert how they would make it even shorter

Five minutes later, I had my algorithm. It took up only a few lines. He looked at it and said, "Yea, that looks correct." He added, "Normally people write it in Java and it takes them a while and it takes up a lot more space on the whiteboard. They spend a lot of time manipulating the input string."

Since my first interview at Palantir, I had done a lot of practice problems.

The highest value problems I know of are on Project Euler. The site posts a sequence of problems of increasing difficulty which you have to solve with increasing efficiency. To become a great hacker, just do those problems in order (in Python!). Then take your solution and do it in half the lines. Now, read more of the Python docs (maybe read about generators and list comprehensions and decorators) and make it even shorter. Finally, look at the solutions posted on the Project Euler site. Stand in awe of the 1-line solutions. Weep in joy over the solutions of the guy who answered the problems using just pen and paper and his brain.

The highest value courses I took are algorithms and advanced algorithms courses. I was lucky enough to study under Robert Tarjan. But I also did every problem in CLRS, the standard (and very well-written) algorithms textbook.

If you do all that, you won't be nervous at your big interview. You'll be bored.

5. Write tests on your own code, sometimes

I say sometimes instead of always because, first, it is impossible to test every case. 100% test coverage is a myth. Any non-trivial program is going to have too many edge cases to check, computationally. You need to test the high-value parts. You need to test the parts that you keep breaking.

Finally, as you finish, the interviewer will look at your code and ask you to write tests for it. So, pre-empt him and describe the tests that you would write yourself. Write edge case tests. Run the tests in your mind. Does your algorithm work? Remember, the interviewer will ask you to do this anyway, so just do it yourself and you will be one step ahead and score well.

The highest value book I read which taught me practical programming techniques is Programming Pearls. It also teaches you the importance of tests and how to write them in a pain-free way. Read this book, it's very short.

He asked me to write tests for my code, find corner cases. He then asked me 3 other problems. They were Dan Tunkelang type problems. He ran out of problems and there were 15 minutes left. "Normally there's not enough time to ask more than 1 or 2", he said. So we just talked about VMs for 15 minutes. He taught me a lot about virtual machines. This brings up lesson 6:

6. Understand what you don't know, why you don't know it, have an interest in it

Read random Wikipedia articles. You don't have to understand it all, just know enough to be able to ask someone who is knowledgeable. Usually, they can teach you a lot from the seed of what you read. But you need that seed, that germ of interest.

This will make your questions good and your conversations interesting. People like to talk about themselves. Always carry some knowledge and some ignorance as fuel for them to talk about themselves and teach you. The interviewer will leave with a positive impression of you.

Don't just read blogs. Read research papers published by successful companies. Read Google's MapReduce, GFS, and BigTable papers. Read Yahoo's Hadoop and PNUTS papers. Read Amazon's Dynamo paper. Big companies have big systems, and they will expect at least some familiarity with how they work. These papers are hard. If you have no systems experience, it may take you a day to read through a single one. In the end, you will understand not just how these systems work, but how to think about these systems and design one yourself.

7. Use esoteric tricks you know, teach the interviewer

You're not supposed to use libraries in these coding exercises. But if you know something cool, just use it. In the worst case, the interviewer will tell you to rewrite it. In the best case, he will be interested and ask more questions about it. You will teach him.

I taught the next interviewer about the memoization decorator in Python. Memoization takes a complicated dynamic programming problem and makes it blazingly fast. He asked me to solve a problem. I wrote an O(N^2) algorithm, then made it O(N) with just one more line of code on my post-it note (still no paper).

I taught him about how I often write a file-backed cached version of @memoized that writes to a file so that I can persist the quick results between runs from the shell. He's a C++ guy, so I taught him about Python decorators as well.
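For readers who haven't seen the trick, here is a minimal sketch of an in-memory memoization decorator. It is an illustration rather than the exact @memoized used in the interview, and the Fibonacci function is just a stand-in problem. The standard library's functools.lru_cache offers the same idea ready-made.

```python
import functools


def memoized(func):
    """Cache the results of func, keyed by its positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


@memoized
def fib(n):
    """n-th Fibonacci number; exponential time without the decorator."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

The single @memoized line is the whole change: each fib(k) is computed once and then looked up, so the number of calls grows linearly in n instead of exponentially. A file-backed variant would store the cache dictionary on disk between runs instead of keeping it only in memory.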

8. Think how you think, go with it.

Studies on creativity have shown that if you tell people to be more creative, they end up less creative. If you split two painters up and tell them that the most creative person wins a prize, the paintings will be boring or the same. People forced into creativity think in the same way.

So, don't think outside the box. Just think how you think to the extreme. Go with it.

The third interviewer asked me to design a game API. He wanted a low-level design, but I misunderstood, so I started adding artificial intelligence features that would suggest moves for you to make. I wasn't sure what he was asking, and I kind of tried to clarify, but then I just went with it. My main interest is AI and machine learning. He was impressed by the originality, and I eventually answered his original questions too. You can usually tell when someone is being genuinely sincere by the manner in which they go out of their way to tell you, versus when they are just mouthing the words. He honestly implored, "it was really great talking to you."

9. Give reasons for your opinions, not just opinions

I asked the next interviewer how much he liked mobile development and he said he liked it. I was learning iPhone development. I played with Android too, but I told him I found the XML for UI distasteful. He asked, "Why?" I said you always want to put as much as you can into code because it gives you power. For example, you want to create and manage a repetitive UI element in code to avoid repetition. Avoiding repetition is the whole point of good coding.

He said that Android has a solution for the repetition in the form of XML templates or something. "Oh, I didn't know that," I said. I thought about it more. "Yea," I observed, "the solution to the problem of XML, in enterprise, is often more XML." He chortled.

At a previous interview, a CTO was looking for an employee #1. He was just leaving Google to start a new startup, and he asked me what my thoughts were about Amazon Web Services. I said, "they're good."

I didn't realize that he did not have much experience with them (having been at Google), and wanted some real constructive feedback. I didn't realize that he was testing to see my experience and familiarity with AWS. More importantly, he was testing my reasoning ability and judgment. I have been running an EC2 instance continuously to run this blog for the past 5 years. It scales well and is very simple. I've used nearly every single one of their technologies since they came out. I invest all of my savings in Amazon because I know this technology will net them billions of dollars. I should have said this. Instead, all I said was "they're good."

10. Interview the interviewer

The last interviewer asked me some simple questions that reminded me of a harder problem I had in one of the companies I started. So I asked him how he would solve my problem.

I was trying to find the phrases in a document that correspond to scientific terms in order to link them to Wikipedia automatically. The vocabulary is composed of words and sequences of words like "DNA", "p53ase", "phospholipid bilayer", and "congenital determined myoglasia peptide". Documents are research papers, which can be long. There are a lot of terms (a hundred thousand biological terms, and many more if we include other sciences). How would you find and label all of the phrases in a document efficiently and what is the big-O running time?

He got the O(vocabulary size * document size) algorithm pretty easily, but I told him that there is an O(document size) solution. Can you solve it? Try it out. It's a fun, practical problem.

I pushed him a little bit but he didn't get it. I interviewed the interviewer and stumped him (although I'm sure he would get it if he thought about it longer).
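If you want to compare notes after trying it yourself, here is one possible approach, sketched in Python. It is not necessarily the solution I had in mind above. It stores the vocabulary in a word-level trie, so the work at each position in the document is bounded by the length of the longest phrase rather than by the size of the vocabulary. The function names and the simple regex tokenizer are assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_trie(phrases):
    """Build a word-level trie; the key None marks the end of a phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in tokenize(phrase):
            node = node.setdefault(word, {})
        node[None] = phrase
    return root


def find_phrases(document, trie):
    """Return (word_index, phrase) for the longest phrase starting at each word."""
    words = tokenize(document)
    found = []
    for start in range(len(words)):
        node = trie
        longest = None
        i = start
        # Walk down the trie for as long as the document keeps matching.
        while i < len(words) and words[i] in node:
            node = node[words[i]]
            if None in node:
                longest = node[None]
            i += 1
        if longest is not None:
            found.append((start, longest))
    return found
```

Each starting position walks at most as many words as the longest phrase, so the running time is O(document size x longest phrase length) and does not depend on the number of terms. With phrases of a few words at most, that is linear in the document. An Aho-Corasick automaton would remove the phrase-length factor as well.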

11. Mention your projects and passions first

After finishing up all of the problems in 40 minutes, I had 5 minutes left with the last interviewer. I started telling him about my minimalist Python web framework and he said, "That's so interesting, that's what we should have been talking about instead of going over these questions."

I got the offer this time, but I would much rather do a PhD instead. I want to learn how to push the boundaries of knowledge, not just apply what I learned in these books.

over 2 years ago on January 2 at 1:12 am by Joseph Perla in tech, hacks


Sentiment analysis using transfer learning from reviews to news

I'm going to describe some failed experiments in my research in sentiment analysis. I am using LDA and supervised LDA. I will be developing other custom models that incorporate blog comments, and I will be training using stochastic optimization in future iterations.

Data

NYT: A corpus of medium-length (500-3000 words) articles from the New York Times. It contains nearly every article from January 1, 2008 through September 2011. It contains 115,586 documents and 118,028,937 words. It contains over 100,000 unique words.

YELP: A corpus of short (10-500 words) local business reviews, almost exclusively restaurants, from the Yelp.com website. Each review is labeled with 1, 2, 3, 4, or 5 stars by the author of the review to indicate the quality of the restaurant the text describes. It contains 152,327 documents and 19,753,615 words. It also contains over 100,000 unique words, many of which are misspellings.

Experiments

I ran several experiments to figure out what information can be extracted about sentiments in the New York Times articles dataset NYT.

I created a vocabulary nytimes_med_common based on the NYT dataset using words that appear in less than 40% of the documents and more than 0.1% of the documents. This removes very common and very rare words, which aren't informative about the document collection in general.
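The filtering step can be sketched in a few lines of Python. This is an illustration of the document-frequency rule, not the script used for these experiments; the function name `build_vocabulary` and the assumption that documents arrive as lists of tokens are mine.

```python
from collections import Counter


def build_vocabulary(documents, min_fraction=0.001, max_fraction=0.4):
    """Keep words whose document frequency lies strictly between the bounds.

    documents is a list of token lists. A word is counted at most once
    per document, however often it occurs there.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    return {
        word
        for word, count in doc_freq.items()
        if min_fraction * n_docs < count < max_fraction * n_docs
    }
```

The defaults correspond to the 0.1% and 40% thresholds above.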

First, I ran LDA on the NYT dataset using the nytimes_med_common vocabulary. On the most recent 2000 articles, I extracted 40 topics represented below. The topics closely follow the lines of politics, education, international news, and so on. They closely model the different sections of the newspaper. (lda_c_2011_10_16).

I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary. This excludes many features of the YELP dataset which are specific to restaurant reviews, and misspellings (e.g. "terrrrrible"). On the first 10,000 reviews of the dataset, I extracted 50 topics. The topics computed include a few topics which describe negative words. Many of the topics generally describe specific kinds of restaurants (ice cream shops, Thai food) in detail in generally neutral or positive terms. There is a Chinese food topic with generally negative terms. The topics with the most extreme coefficients do seem to give a good sense of the polarity of the words contained within. Based on informal analysis, it looks like the topics would have good word intrusion and document intrusion properties. (yelp_slda_2011_10_17)

I ran LDA on the NYT dataset starting from the model and the topics extracted from the sLDA on the YELP dataset. This did not work very well, and got about the same topics as LDA from scratch. Perhaps a better experiment would be to take the topics with the most predictive coefficients, the 5-10 of them, and run LDA starting with those. (yelptopics_nytimes_lda_c_2011_10_17).

More interestingly, I created a lexicon of the words with high coefficients for predicting the polarity of Yelp reviews using Naive Bayes (yelp_lexicon and yelp_lexicon_small). I ran LDA on the NYT dataset using the yelp_lexicon as a vocabulary. This brought out a few topics that did not strictly follow along with the newspaper sections. For example, there is an epidemic/disease topic. There is a "corrections" topic with words like the following: incorrectly, misidentified, erroneously, incorrect, correction. The topic on employment reveals a strong motivator: paid, contract, negotiations, wages, executives, employees, unions, manager, compensation. Many of the topics do match up, like baseball and football and music and food and books, but it is a much noisier set of topics. It is easier to find the same section topics when that section uses a lot of review-filled words (like food, music, and book reviews). Many of the topics are unidentifiable; perhaps I used too many topics. But some are interesting, such as topic 029 using yelp_lexicon_small: winner, favorite, amazing, perfect, fantastic, outstanding, with other words in various sections of the newspaper.

For a final experiment, I ran sLDA on the Yelp dataset using the nytimes_med_common vocabulary to generate topics with coefficients. I then ran inference on the news articles using these generated topics and coefficients. The distribution of predicted ratings looks Gaussian with mean 3.5 and standard deviation 0.25. Nearly all the documents are clustered between 3 and 4 stars, with less than 5% below 3 or over 4. Even at the extremes the predictions are unreliable: the documents with the highest predicted labels include many death-related and terrorism-related articles. The negative extremes are also not consistent.
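As a quick consistency check on those numbers: 3 and 4 stars each sit two standard deviations from a mean of 3.5 when the standard deviation is 0.25, so an exactly Gaussian distribution would leave a bit under 5% of predictions outside that band. The sketch below computes this with the standard library; it checks the arithmetic only and does not use the experimental data.

```python
from statistics import NormalDist

# Idealized model of the predicted ratings described in the text.
predicted = NormalDist(mu=3.5, sigma=0.25)

inside = predicted.cdf(4.0) - predicted.cdf(3.0)
outside = 1.0 - inside

print(f"share between 3 and 4 stars: {inside:.4f}")
print(f"share below 3 or above 4:    {outside:.4f}")
```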

My next experiment will be to try to isolate topics which relate specifically to sentiment, independent of domain. One idea I have relates to fixing topics when training (an idea Chong Wang introduced to me). My idea is to run LDA on the yelp dataset to generate domain topics. Then, I will run sLDA with those topics fixed plus 2-10 extra topics which are unfixed. The fixed topics will act as background with middling coefficients, I predict, and the remaining trained topics will end up with extreme coefficients and will contain strong sentiment words independent of topics in the domains.

over 2 years ago on October 21 at 5:41 am by Joseph Perla in tech, hacks, research


Your website is unviral

Your website is probably unviral.

Everybody wants his or her website to go viral. As web designers and entrepreneurs it is our goal to create buzz; an unstoppable avalanche of traffic; a self-feeding hurricane.

For many entrepreneurs PayPal, YouTube and Facebook are the alpha and omega of marketing and strategic growth. The goal is to emulate the distinguishing characteristics of these products in an attempt to achieve similar heights.

Some websites succeed and grow on the same trajectory or faster. They usually exist in fundamentally social businesses like email, payment services or social networking in places without such networks.

However, there is also a special cadre of websites which lies in a no-man's land untouched by virality. Not only is it difficult for these sites to become viral, but the nature of the business actively fights its own online growth, behaving much like a tumor suppressor gene. These businesses never experience exponential growth and all of their growth paths, even if strong, are linear. Some examples quickly come to mind: male enhancement pills, adult diapers, schizophrenia medicine.

The online world is in the habit of thinking itself as viral by nature. Virality, some think, is built into the very fabric of the Internet. It is not. In fact, one of the most profitable online businesses is unviral: online dating.

Dating websites carry such a stigma that some couples successfully matched online invent a fictionalized romantic encounter to conceal the fact that they were mouse-selected by a filtering process and a geographic search. Although dating websites provide immense value, arguably more than almost any other online service, users will often strive to hide their enrollment from even their closest friends. Maybe especially their closest friends.

Dating websites are unviral. They do not spread by word of mouth. As a matter of fact, they actively suppress this form of growth due to the nature of their service. Many an experienced entrepreneur has made the mistake of underestimating the unvirality of online dating.

Your website may be unviral too, although it may be less than obvious. Perhaps it only demonstrates certain elements of unvirality; for instance, would your users tell all of their friends about your service, or only some? Would they actively deny using or even knowing about your website to certain friends, acquaintances, or co-workers if asked? Is it embarrassing? That would be pretty bad for your virality.

If this is the case you must face the facts: your website is unviral.

Look at examples of purely viral websites: PayPal, the old Hotmail, YouTube and the current Facebook (not the old version that limited itself to college students) all display or once displayed growth without any symptoms of unvirality. I told everyone about Hotmail when it was launched: cousins, teachers, pen-pals. Users of contagious sites like this may not actively evangelize to literally everyone (as they do with, say, YouTube) but they certainly wouldn't avoid a discussion or hold back praise once the topic was broached. In contrast, there are many sites that quickly provoke responses of "yeah, that's weird" when mentioned. In these instances unvirality dominates and kills growth.

Though unvirality is not a death sentence, it does limit a potential for greater growth. In some instances, unvirality is inherent to the service or structure of the business. But if this is the case, why should an entrepreneur involve himself with it and how can he manage to save it from the depths of unvirality?

Two Answers: Anonymity and Covert Transformation

  1. The Internet supports anonymity which allows users to praise a product they love to others--thousands or even millions of other strangers--while avoiding the embarrassment and reluctance to share a product or website that often results in unvirality. Anonymous forums and reviews abound for even the most unviral of products.
  2. The web entrepreneur can covertly transform an unviral business into a viral one by euphemizing or disguising its true purpose. For example: create a website with dating tools but market it as a social network. Facebook at Harvard worked this way with its subtle but pivotal "relationship status." Give users a guise under which to refer their friends and thus avoid the focus on unviral traits, such as the embarrassment of endorsing a dating site, that would otherwise prevent the expansion of your user base.

Utility vs. Virality

Many people confuse utility with virality, but these are actually very independent qualities. Entrepreneurs may believe that if they have developed a good product it will naturally become viral. This could not be further from the truth.

Often, people will want to tell their friends about a product they find useful, but this is not necessarily so. Some of the most useful online services actually discourage such talk. Something can be useful and viral (such as Facebook), useful and not viral (dating websites; most products ever created), not useful and viral (lolcatz; chia pets; almost all 4chan memes) and something can most definitely be neither useful nor viral (almost everything). Utility and virality are therefore orthogonal. They represent two different dimensions which can intersect but do not necessarily do so.

In fact, something extremely useful--something that you and many others may pay thousands of dollars for to bring yourselves joy for a lifetime--may be strictly unviral. As I mentioned earlier, people will actively go out of their way to not talk about some very useful products. Many medical products fall into this category. Utility is one dimension of a service and it is clearly distinct from virality, though by no means mutually exclusive. The realm of influence between utility and virality is vast and depends mostly on the nature of the business.

The lesson: just because your website is useful, does not mean it will be viral. Just because it is viral, it may not be useful and thus will die once the virus finishes spreading. First, solve the utility problem: build something useful. Then, you have to solve the distribution problem, and I gave 2 techniques for doing so: anonymity and covert transformations. Do you have other ideas?

over 2 years ago on October 4 at 4:42 am by Joseph Perla in tech, hacks, entrepreneurship


How to hack Silicon Valley, meet CEO's, make your own adventure

It was my sophomore year. Everyone was making plans for fall break. What are you going to do? You don't know? There are only 4 days left before break.

Before this particular fall break, I was busy with classes and had thus neglected to make plans. Some students were going skiing, others on class trips, others to homes nearby. Where are you going? I had no idea.

However, around this time, I was reading a lot about California. I read work by entrepreneur and essayist, Paul Graham, in which he says that the San Francisco Bay Area is the best place to start a company. He described the energy, but I couldn't palpate it. If I were to take his word, it's an ethereal, magical place.

That day, James Currier, internet entrepreneur, stood before me and a packed class full of eager students. His eyes were shot open, a purple glaze lit them afire. His wavy hair burst out atop his skinny head. Gaunt and fearless, he embraced the air as he swung his arms widely to make his point.

“Silicon Valley is absolutely the place to be,” he said. “It’s where all technology happens. It’s where Google started, it’s where Apple, Yahoo, Intel, Oracle, and so many other technology companies started. Some of the smartest people in the world lived there at Stanford, Berkeley, and Xerox PARC. It is a magical forever-sunny wonderland where dreams come true and it rains investments and acquisitions.”

He went on to make even more grandiose claims. Startups? Risky? Not at all when you do things right. Moreover, they are nothing compared to the risks of a financial job.

Everyone laughed. This room in Princeton was filled with students who had already accepted offers at investment banks or who would be applying soon. In 2006, finance was booming with big bonuses and strong growth prospects. Derivatives opened up whole new worlds for trading and speculation. Operations research quants donned their glasses in pride. They were respected.

So everyone laughed. He said, "No, really. They can fire you any time. They don't care about you. The market can turn the other way in a heartbeat. You have no job security. Your firm can go bankrupt."

To the students at the time, this all seemed ludicrous. They all envied these corporate finance jobs, never mind that many would lose their finance jobs less than 2 years later.

He inspired. He didn't have charm so much as hurricane-force energy. He was insightful and learned. He talked about his great times. He talked about his learning moments.

And so he inspired me to see it. I had to see it. What is so special about Silicon Valley, the San Francisco Bay Area? How can it actually be that great? What exactly gives the air such power to breathe life into world-changing tech empires?

I knew what I had to do, but Fall Break was just two days away. How would I fly there without paying outrageous fees? Where would I stay? What would I do there? How would I meet the minds behind these great startups? I was a sophomore from Florida. I had no network in California.

I searched online for cheap tickets, no luck--that is until I noticed an ad for Hotwire. If you have yet to try this site, Hotwire buys leftover seats in bulk, and then sells them to users blind such that they don't know exactly which flight on which airline at what time until they buy. I snagged a very cheap ticket for 3 days later.

Now, where would I stay? I knew exactly three people from my high school in the bay area, 2 at Berkeley, and 1 at Stanford. I sent them all an email and hoped they would get back to me in time. I could always get a hotel somewhere.

Finally, how could I reach the top startups in Silicon Valley and find entrepreneurs who could meet with me on such short notice? I didn’t know any CEO's. How do I hack Silicon Valley itself?

TechCrunch always covers the hottest new funded startups. Every day, they publish dozens of new articles on the latest technology. I should just pick a few of the best and email them. But how do I choose the best? What if they don't get back to me? Will I waste a trip?

I noticed half of all of the articles listed the location of each company. Some in California, some not. So I enumerated every city in the Bay Area: Redwood City, Palo Alto, Berkeley, San Francisco, Menlo Park, etc. I wrote a program to crawl all of the TechCrunch archives to find all of the articles about companies in one of these cities. I then parsed out the name and URL automatically as well.

I looked through the list of companies and I picked the most interesting. Some invented a new technology, and others just came out of a new incubator called YCombinator.

These days, you can do this easily yourself with Crunchbase, a useful database of every startup in existence.

I wrote another script to send an email to every single one: jobs@azureus.com, jobs@youos.com, and so on. In each email, I wrote: I am a student who will be graduating soon, and I would be very interested in learning more about your startup since I saw it in Techcrunch and I think your Company is very innovative. I'm from Princeton and I'd be interested in potentially working for your company. I am visiting California next week, can we please meet?

I sent out dozens of emails, and then I waited. Not everyone replied, but many did. I flew in, finished my homework on the plane, and crashed with my friends (all three throughout the week).

I met with many CEOs. In startup land the companies are all very small. Everyone in the company has to wear many different hats. Therefore, when you send an email to jobs@startup.com, the CEO reads it. Few people realize that you can easily get direct access to startup CEOs.

One company I reached out to was YouOS. YouOS was in the first class of YCombinator. They are incredibly good hackers. We just talked over pizza and they joked about how they've written and rewritten servers from scratch so many times that they can do it in 5 minutes while sleeping. YouOS did not work out, but the founders continued innovating. A couple of them made and then sold Project Wedding. Another went on to create thesixtyone.com and Aweditorium, two of the most innovative music apps in the world.

Walking around Palo Alto, I saw several startups on each block. If you can imagine another planet where the Internet is turned into physical locations with storefronts, with Facebook next to Dropbox next to Shopkick, then you have a pretty good idea of what Silicon Valley looks like.

You see zetok, jlingo, and any conceivable combination of letters plastered everywhere. I noticed a small blue frog in one corner; it looked familiar. I tried the door and walked upstairs. The offices were in fact the offices of Azureus, the BitTorrent app that made torrents popular. I went up to the exhausted man walking hurriedly by the front desk, and I began with: hi. I'm Joseph Perla. I am a student looking for a job. I am visiting just for a week, can I talk to you for just a few minutes?

He was taken aback at first, a little flustered. He said, yes, sure, but not today, I'm a little stressed because I'm signing papers. We’re raising 4 million dollars right now. Can you come tomorrow?

Absolutely.

The next day, I spent 3 hours talking to the CEO one-on-one about Azureus, raising money, Silicon Valley, BitTorrent, technology, France (he's French), and the French technology industry.

I ended up meeting with half a dozen other startup founders. I toured the Golden Gate Bridge and many parts of the Bay Area, Berkeley, and Stanford.

The Valley is very friendly, and everyone does everything they can to help you because, at some point, someone definitely went out of their way to help them succeed. I started building a network from nothing. I directly used the connections I made on my spontaneous trip to start my next company, Labmeeting.

David Tisch, of Techstars NYC, made a great point recently. Startups are very difficult. The odds are against you. Your competitors are twofold. On the one hand you compete with the biggest companies in the world. Even more difficult, you compete with inertia and ignorance and apathy. Everyone in the startup industry knows how hard it is, so we all do what we can to help each other to beat the odds. That's the only way it works at all. That's how we succeed against all odds. Silicon Valley is one big mega-commune of startup capitalists.

Make the most of what you have (friends in new places), trust in people, and find out what the ethos of Silicon Valley is really like. I know scores of startups who would love to have smart students, especially students looking for jobs, visit their offices and see what they have built. I can point you in the right direction. CEO's love to tell their stories more than you like to listen to them! Let me know if you plan to make your own adventure, and please tell me about your trip when you get back.

over 2 years ago on October 3 at 12:00 pm by Joseph Perla in tech, hacks, entrepreneurship


Write bug-free javascript with Pebbles

Github: https://github.com/jperla/pebbles

We actively seek contributors!

Goals of Pebbles

  • so easy that designers and non-programmers use it to write complicated AJAX!
  • 0 lines of javascript
  • complicated ajax websites
  • 0 lines of javascript
  • no bugs
  • 0 lines of javascript
  • very fast speed and optimality.
  • backwards compatibility with clients who have javascript off (this was more important 4 years ago when I first made this)
  • Memrise loves it!

Plus, you don't even have to write one line of javascript!

The basic idea is that almost every complicated AJAX interaction can be reduced to a handful of fundamental actions which can be composed (remind you of UNIX?). So, all you have to do with this library is add a few lines of HTML to elements of a page to describe the Pebbles response that happens when someone clicks that element. Maybe you submit a form, maybe you fetch some content and update part of the page.

Most current websites write and rewrite slightly different versions of these same basic patterns in javascript. This separates the HTML which has information about AJAX interactions and the Javascript which has other information. But you want it all in one place!

Pebbles uses the jQuery.live function. Very heavy pages with tens of thousands of elements take 0 time to load, since almost no javascript is executed.

Javascript can be tricky to write even for an experienced programmer. Moreover, a lot of this stuff is repeated, and it shouldn't be. Pebbles brings more of a declarative style of programming (a la Haskell, Prolog) to the web in the simplest of ways.

FAQ

Couldn't you just write javascript functions that you call that do the same thing?

You might but then you introduce the opportunity of syntax and other programming errors, thus not achieving 0 bugs. You would also have to figure out how to make it fast yourself. In practice, this library is so straightforward to use that once you define a complicated action, which only takes a few seconds, you can move it around and it just always works.

Moreover, it's easier to auto-generate correct readable html (e.g. from Django templates). Many of your pages won't need *any* javascript even if highly dynamic. All the custom logic is in one place rather than spread over the html and the javascript. Basically, writing javascript is harder than what amounts to a DSL in HTML.

I need more complicated action-handlers than just these 3, can you please make them?

The code is open source and on Github at jperla/pebbles. Feel free to add your own enhancements. Be careful, because you want to keep your app simple, and, in my experience, these 3 actions cover the vast majority of user ajax paradigms. With a little thinking you can probably do what you want using either "submit-form" or "replace" with the right response html.

Technical Documentation

Pebbles accepts a spinner URL (pointing to an animated GIF of a spinner shown during waits). Pebbles sets up a live listener on divs with the class "actionable".

Each actionable div contains a hidden div which has the class "kwargs".

.actionable .kwargs { display: none; }

The kwargs div contains a number of <input> HTML elements, each with a name and value. The name is the key name, the value is the value for that key. In this way, in HTML, we specify a dictionary of keyword arguments to the actionable.
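Since the markup is regular, it can be generated on the server. Below is a rough Python sketch of a helper that renders the structure just described: an "actionable" div holding a hidden "kwargs" div with one <input> per argument. The helper name render_actionable, the type="hidden" attribute on the inputs, and the example URL and selector are assumptions for illustration; only the class names and the argument names (type, url, target, target-type) come from this documentation.

```python
from html import escape


def render_actionable(label, **kwargs):
    """Render an "actionable" div whose hidden "kwargs" div holds one
    <input> per keyword argument.

    Python identifiers cannot contain hyphens, so underscores in argument
    names are written out as hyphens (target_type becomes target-type).
    """
    inputs = "\n".join(
        '    <input type="hidden" name="{}" value="{}">'.format(
            escape(name.replace("_", "-"), quote=True),
            escape(str(value), quote=True),
        )
        for name, value in kwargs.items()
    )
    return (
        '<div class="actionable">\n'
        '  <div class="kwargs">\n'
        f"{inputs}\n"
        "  </div>\n"
        f"  {escape(label)}\n"
        "</div>"
    )


# Hypothetical usage: clicking "Full Text" replaces #abstract with the
# response from the given URL.
markup = render_actionable(
    "Full Text",
    type="replace",
    url="/papers/42/fulltext",
    target="#abstract",
    target_type="absolute",
)
print(markup)
```

Escaping every name and value keeps the generated HTML well-formed even when a selector or URL contains quotes or angle brackets.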

Here are some self-explanatory examples:


It fails loudly if misconfigured. It's hard to write buggy code and not notice in quick testing. It is easy to do everything right and it is easy for you to write a complex ajax website with no extra javascript code.

Full arguments are below:
===========================
Arguments:
  type: replace, open-close, submit-form
        replace replaces the target with the url
        open-close will toggle hide/display the target, 
                which also may dynamically lazily load content from an url
        submit-form submits a form via ajax which is a child of the actionable,
                or may be specified in form argument; 
                the response of the ajax replaces target

  url: url string of remote page contents

  target: CSS3 selector of element on page to update

  target-type: absolute, parent, sibling, closest, or child-of
                Absolute just executes the target selector in jQuery.
                Parent executes target selector on jQuery.parents().
                Sibling the same on siblings.
                Closest looks at children and children of children and so on.
                child-of looks at target's children

  closest: selector used in combination with target-type:child-of to get target's children
  form: selector used in combination with type:submit-form to find the form

If you use the open-close type, then the actionable can have two child divs with classes "when-open" and "when-closed". Fill when-open with what the actionable looks like when the target is toggled open (for example, a minus sign), and fill when-closed with what it looks like when the target is toggled closed (for example, a plus sign).

over 3 years ago on September 22 at 5:41 am by Joseph Perla in tech, hacks


How to launch in a month, scale to a million users

These are case studies.

I will talk about my last two startups, which I built quickly and then scaled up. The techniques I used to architect them for scale are quite simple, but anyone who is not familiar with building systems may find them useful when building his or her own scalable site.

This is based on the outline of a paper by Lampson with more modern web-based examples:

http://research.microsoft.com/en-us/um/people/blampson/33-Hints/WebPage.html

Labmeeting was a search engine for biomedical literature and a social network for scientists. http://www.crunchbase.com/company/labmeeting.

The same principles in Lampson can be applied to any large system such as Google or http://www.turntable.fm.

Functionality

Keep it simple.

We built APIs before making the website at Labmeeting. That means that the design of data access, security, and data flow happened long before the first interfaces were created. Simplicity in a small interface is key, with well-defined and single-purpose functions coming from each module and submodule of the API. The whole front-end interface uses fewer than 30 methods in 5 modules of the API, and nothing else.

Get it right.

From day one, we built automated tests into Labmeeting to catch any conceivable and subtle bugs that we may introduce during development. In advance, we knew that it would be a complex, dynamic site with hard to reproduce state. This made it all the more important that each simple method in the API performed exactly as it needed to both in the edge cases and in the normal case. We had individual function tests, module level tests, and full integration tests that automatically started a full chatserver and tested real requests. The tests were run on every commit and no bugs were allowed to persist before writing new code.

Don't hide power

Labmeeting was a very dynamic website using a lot of AJAX to speed up page requests and minimize initial page download size. You could click on a fragment of an abstract, or a button that said Full Text, which then made a request to the server and replaced information on the page with the response. A lot of javascript is repetitive, and can be very buggy. We implemented a library that could create complicated AJAX interactions by writing 0 javascript, instead just adding a few extra HTML tags to code. The library virtually eliminated bugs and increased speed on the site by eliminating javascript execution time and centralizing code. Despite just requiring HTML tags, the library allows for maximum flexibility by the user to submit full CSS3 jQuery selectors as arguments if desired. Despite normalizing an interface, it does not hide the power of jQuery. Memrise.com now uses this library enthusiastically.

You can see the docs and use this library yourself at the Pebbles introduction.

Use procedure arguments to provide flexibility in an interface

We created a system for filtering through news articles. The system has many basic parameters that can be passed that are very simple, but the parameters are simply procedures. Therefore, if someone had a special complicated need, they could write their own function that returned a boolean value of whether to filter the news and pass that through the interface.
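A minimal sketch of the pattern, assuming articles are plain dictionaries. The function names and fields here are hypothetical, not Labmeeting's actual interface. The simple cases are covered by small predicate factories, and a caller with a special need passes any function that returns a boolean.

```python
def filter_news(articles, keep=None):
    """Return the articles for which keep(article) is true.

    keep is any callable taking an article and returning a boolean;
    with no predicate, every article is kept.
    """
    if keep is None:
        return list(articles)
    return [article for article in articles if keep(article)]


def in_section(section):
    """Simple built-in parameter: a predicate matching one section."""
    return lambda article: article["section"] == section


def recent_genomics(article):
    """A caller's own special-purpose predicate."""
    return article["year"] >= 2010 and "genome" in article["title"].lower()


articles = [
    {"title": "Genome sequencing gets cheaper", "section": "science", "year": 2011},
    {"title": "Budget talks stall", "section": "politics", "year": 2011},
    {"title": "Mapping the human genome", "section": "science", "year": 2001},
]

science = filter_news(articles, in_section("science"))
custom = filter_news(articles, recent_genomics)
print([a["title"] for a in science])
print([a["title"] for a in custom])
```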

Leave it to the client

The interface at Labmeeting was very simple, and we expect the client to perform complicated manipulations of the many elements of the interface and keep track of all those states. This allowed the backend to be developed very quickly, although it meant that frontends, like an iPad or Android app, take a little longer to develop.

Continuity

Keep basic interfaces stable. Keep a place to stand if you do have to change interfaces.

The API of Labmeeting and Stickybits is versioned. They can thus offer full compatibility with previous functionality, but enhancements and changes can be made in newer versions.
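One common way to version an API is to put the version in the path and keep the old handlers untouched. The sketch below is an illustration under that assumption; the paths, handler names, and fields are hypothetical and do not describe how Labmeeting or Stickybits actually routed requests.

```python
# Hypothetical handlers: v2 adds a field, v1 keeps its original response shape.
def get_user_v1(user):
    return {"name": user["name"]}


def get_user_v2(user):
    return {"name": user["name"], "lab": user["lab"]}


HANDLERS = {
    ("v1", "user"): get_user_v1,
    ("v2", "user"): get_user_v2,
}


def dispatch(path, user):
    """Route a path such as "/v1/user" to the handler for that version."""
    version, method = path.strip("/").split("/")
    return HANDLERS[(version, method)](user)


ada = {"name": "Ada", "lab": "Genomics"}
print(dispatch("/v1/user", ada))
print(dispatch("/v2/user", ada))
```

Old clients keep calling the v1 path and see no change, while new clients opt in to the v2 response.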

Making implementations work

Plan to throw one away.

Many of the routines in the initial prototypes were written very quickly and with an eye to throwing them out once in full production mode. For example, the first version of the PDF search feature pulled in the whole user's collection and did a search in memory. A fully optimized version would be a little more complicated, but many routines were designed that way with an eye to throwing the inner part of the function out and rewriting once the bottlenecks are identified.

Keep secrets of the implementation

We built Labmeeting up from separate silo'd modules that, while decreasing performance a bit, allowed them to operate independently and with maximum flexibility to respond to changes in requirements in the interface. For example, the PDF manager knew nothing about how users were stored or queried. The User api could store users in memory, on disk, in Postgres, or halfway across the world. Lab groups only knew that it could call the same API external methods used to look up a user or set of users.
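A rough sketch of that separation, with hypothetical class and method names. The lab-group code depends only on a get_user method, so the storage behind it (memory here; disk, Postgres, or a remote service elsewhere) can change without touching any caller.

```python
from typing import Iterable, List, Protocol


class UserStore(Protocol):
    """The only thing other modules may know about user storage."""

    def get_user(self, user_id: str) -> dict: ...


class InMemoryUserStore:
    """One possible implementation; a database-backed store would expose
    the same get_user method."""

    def __init__(self, users: dict):
        self._users = dict(users)

    def get_user(self, user_id: str) -> dict:
        return self._users[user_id]


def lab_member_names(store: UserStore, member_ids: Iterable[str]) -> List[str]:
    """Lab-group code: it calls the public method and nothing else."""
    return [store.get_user(user_id)["name"] for user_id in member_ids]


store = InMemoryUserStore({
    "u1": {"name": "Ada"},
    "u2": {"name": "Grace"},
})
names = lab_member_names(store, ["u2", "u1"])
print(names)
```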

Use a good idea again instead of generalizing it

At Labmeeting, we had to extract author names from PDFs. We realized that we could do decently well at extracting the names using machine learning techniques, but never perfectly. However, by indexing a gazette, a complete database of every possible PDF, we could simply make some guesses (possibly using machine learning) and then look up those guesses in the gazette to see if there was a match. It becomes a problem of efficient enumeration. We didn't generalize it; instead, we used the idea again in a slightly different context. Each PDF has a scientific abstract with various complicated terms from biology and physics. We wanted to identify those important terms to allow further exploration. Again, some indicators could point us in the right direction, but we did not get everything. So, we crawled Wikipedia to compile a gazette of biological terms, then kept only those terms in the abstracts that appear in the gazette, modulo very frequent words like DNA. This was highly accurate again. We linked these extracted entities to Wikipedia to provide further information for the curious.
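The term-extraction variant can be sketched as a set lookup. This is a simplified illustration: the gazette contents, the stop list, and the function name are made up, and it only handles single-word terms, whereas a real gazette built from Wikipedia would also contain multi-word entries.

```python
import re


def extract_terms(abstract, gazette, too_common):
    """Return the gazette terms found in the abstract, in order of first
    appearance, skipping terms that are too frequent to be interesting."""
    found = []
    for word in re.findall(r"[A-Za-z][A-Za-z0-9-]*", abstract):
        key = word.lower()
        if key in gazette and key not in too_common and key not in found:
            found.append(key)
    return found


# Hypothetical gazette and stop list.
gazette = {"dna", "ribosome", "mitochondria", "kinase"}
too_common = {"dna"}

abstract = (
    "The ribosome reads RNA copied from DNA; kinase activity "
    "in mitochondria was unchanged."
)
terms = extract_terms(abstract, gazette, too_common)
print(terms)
```

Every candidate word is a guess, and the gazette lookup decides whether to keep it, which is why accuracy depends on the gazette's coverage rather than on the guessing step.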

Handle all the cases

Handle normal and worst cases separately as a rule

At Labmeeting, we analyzed PDFs to extract the title, publication date, and other information. A PDF that was encrypted and unparseable, so that no text could be extracted, went straight to a separate method. A more general-purpose text extraction algorithm could possibly be made to produce the right answer for this special case, but it is more straightforward to handle it separately. Anyone reading the code could see it plainly, rather than having to think through the special case in more complicated parsing code.
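In code, the idea looks roughly like the sketch below. The dictionary layout, the function names, and what the worst-case handler returns are all assumptions for illustration; the text above only says that such PDFs went to a separate method.

```python
def extract_metadata(pdf):
    """Dispatch: the worst case is checked first and handled on its own path."""
    if pdf["encrypted"] or not pdf["text"].strip():
        return handle_unparseable(pdf)
    return parse_text(pdf["text"])


def handle_unparseable(pdf):
    # Assumption: flag the paper so that metadata can be supplied another way.
    return {"title": None, "needs_manual_entry": True}


def parse_text(text):
    # Stand-in for the real parser: treat the first non-empty line as the title.
    title = next(line.strip() for line in text.splitlines() if line.strip())
    return {"title": title, "needs_manual_entry": False}


normal = extract_metadata({"encrypted": False, "text": "\nA Study of Kinases\nAbstract..."})
worst = extract_metadata({"encrypted": True, "text": ""})
print(normal)
print(worst)
```

The parser never has to consider empty input, and a reader can see the special case in the first two lines of extract_metadata.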

Speed

Split resources in a fixed way if in doubt

At Labmeeting, we put the database index on a separate machine from the Solr search index. We had millions of search queries coming into the search system, and we didn't want those queries to slow down the db, and thus normal operation of the site. Writes take much longer than reads, and are more important for logged in users. External users of the site using the search engine just hit the index, performing exclusively reads on the index. This allowed us to scale up the search index independently from the database.

Use static analysis if you can

At Labmeeting, before every commit, I had a version of PyFlakes run on all of my new code. PyFlakes is a static analysis tool for Python that finds common errors that can be detected before run-time. For example, PyFlakes can find an incorrect number of arguments to a function call and references to variable names that are not in scope (like typos). Static analysis finds a lot of bugs that might otherwise appear in production only rarely, in edge cases. It is most useful in a language like Python that is dynamic and thus doesn't have a lot of the normal safety features available to a statically typed language.
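
One way to wire this up is a small script run from a git pre-commit hook. This is a sketch under assumptions, not my original setup: it assumes git and the stock `pyflakes` package are installed, and the function names are invented for the example.

```python
import subprocess
import sys


def select_python_files(paths):
    """Keep only the paths that look like Python source files."""
    return [p for p in paths if p.endswith(".py")]


def staged_python_files():
    """Ask git for files that are added, copied, or modified in the index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return select_python_files(result.stdout.splitlines())


def main():
    files = staged_python_files()
    if not files:
        return 0
    # pyflakes exits non-zero when it reports any problem,
    # which is what makes the hook block the commit.
    return subprocess.run([sys.executable, "-m", "pyflakes", *files]).returncode
```

Calling `sys.exit(main())` from an executable `.git/hooks/pre-commit` file is enough to stop a commit that contains, say, a misspelled variable name.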

Cache answers to expensive computations

Obviously, we did this all of the time at Labmeeting. For example, when we showed an individual paper, we performed a document similarity search to find "Related Papers" that a scientist may also want to read. The vector computation and search for this is quite expensive, so we cached the results for a month. Another example: we had to open up a PDF file containing a research publication, perform text extraction, and then do an information extraction step on the text to identify the title, authors, publication date and other information. This is a difficult problem and involves searching a gazette of 30 million documents and querying the PubMed database at least once. Once this process was completed for a PDF, we saved the result to the paper metadata so that we would not have to calculate it again for that PDF each time. The flip side of caching is that for quickly changing data, one needs to be careful about cache invalidation.
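
As an illustration, a time-limited cache can be sketched in a few lines of Python. The `TTLCache` class below is hypothetical and keeps everything in memory; the clock is passed in only to make expiry easy to test.

```python
import time


class TTLCache:
    """Remember computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now, value)
        return value

    def invalidate(self, key):
        """Drop an entry early, e.g. when the underlying data changes."""
        self._store.pop(key, None)
```

Something like `TTLCache(30 * 24 * 3600)` would give the one-month lifetime described above, and `invalidate` is the hook for the quickly-changing-data case.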

When in doubt, use brute force

We wanted to get the first version of Labmeeting finished very quickly. There are many ways to optimize a system to improve performance, but they come at the cost of decreasing modularity, making more assumptions, and, most directly costly, developer time. The first implementation of the PDF search algorithm used a brute-force linear search: it pulled the name of every scientific paper and then searched each one for the substring. This takes a few minutes to write and does not require a complicated separate hosted index. Moreover, for the small number of papers used during testing, it ends up being much faster than doing a network query to a search index!
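
For a sense of how little code brute force needs, here is a sketch in Python. The function name and the case-insensitive matching are my assumptions for the example, not a record of the original implementation.

```python
def search_titles(titles, query):
    """Linear scan: check every title for the query as a substring."""
    needle = query.lower()
    return [title for title in titles if needle in title.lower()]
```

This is O(total title length) per query, which is perfectly fine for a few thousand papers and easy to swap out for a real index later.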

Compute in background when possible

After a user uploads a PDF to Labmeeting, a process must go through the PDF, analyze it and normalize it, perhaps convert it to a standard format, extract the metadata, and deduplicate it. This can take a while, so we keep it from blocking the web server by pushing it to a queue. When the queued job completes, it sends a message back to the user, and the PDF is added to the person's collection.
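
The shape of this can be sketched with Python's standard `queue` and `threading` modules. This is an in-process toy under stated assumptions: a real deployment would more likely use a separate worker process and a persistent queue, and `process_pdf`, `handle_upload`, and `collections` are invented names.

```python
import queue
import threading

jobs = queue.Queue()
collections = {}  # user -> list of processed paper metadata


def process_pdf(pdf_bytes):
    # Stand-in for the slow work: normalize, extract metadata, deduplicate.
    return {"size": len(pdf_bytes)}


def worker():
    while True:
        user, pdf_bytes = jobs.get()
        try:
            metadata = process_pdf(pdf_bytes)
            # "Message back": the paper shows up in the user's collection.
            collections.setdefault(user, []).append(metadata)
        finally:
            jobs.task_done()


# Daemon thread so the worker never keeps the process alive on exit.
threading.Thread(target=worker, daemon=True).start()


def handle_upload(user, pdf_bytes):
    """Called by the web request: enqueue and return immediately."""
    jobs.put((user, pdf_bytes))
    return "queued"
```

The request handler returns as soon as the job is enqueued, so a slow PDF only delays that one paper and not the web server. Note that in this sketch an exception in `process_pdf` would kill the worker thread, so real code would catch and log it.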

over 3 years ago on September 21 at 5:41 am by Joseph Perla in tech, hacks, entrepreneurship


Google Creates Humanoid Robot, Programs Itself

by Joseph Perla. Associated Press. May 13, 2011.

MOUNTAIN VIEW, Calif. — Anyone enjoying talks at Google's recent I/O Conference at Moscone West in San Francisco may have glimpsed some engineers wearing curiously thick belts or backpacks. Harder to notice was that the person carrying those items was not actually a person.


(Computer hardware inside one of the seven autonomous electronic engineers.)

The robots are a project of Google, which has been working in secret but in plain view on robot engineers that can program themselves, using artificial-intelligence software that can reason about programming and mimic the decisions made by a human engineer.

With a technician nearby with root access to monitor the robot talk, seven test engineers have given over 1,000 tech talks without human intervention and written more than 140,000 lines of code with only occasional human debugging. One even programmed itself to learn product management, a task that requires creative and analytical thinking. The only accident, engineers said, was when one robot engineer released a product that was far too technical for human engineers and users at Google I/O last year.

Autonomous electronic programmers are years from mass production, but technologists who have long dreamed of them believe that they can transform society as profoundly as the Internet has.

Robot employees fix bugs faster than humans, have infinite memory and do not get distracted, sleepy or intoxicated, the engineers argue. They speak in terms of products shipped and bugs avoided — more than 37,000 bug patches were released by software development shops in the United States in 2009. The engineers say the technology could double the capacity of the Internet by re-engineering every line of code in legacy routers. Because the robot engineers would eventually require less office space and energy than a human, they would reduce Google’s carbon footprint. But of course, to be truly better, the robots must be far more reliable than, say, today’s personal computers, which crash on occasion and are frequently infected.

The Google research program using artificial intelligence to revolutionize programming is proof that the company’s ambitions reach beyond the search engine business. The program is also a departure from the mainstream of innovation in Silicon Valley, which has veered toward social networks and Hollywood-style digital media.

During a half-hour talk last Monday at Moscone West, a convention center in the heart of San Francisco, a robot engineer equipped with a variety of sensors and following a PowerPoint projected onto the screen nimbly discussed the finer details of unsupervised machine learning with several thousand developers from the heart of Silicon Valley. Little did the attendees know that the code he was projecting and editing was his own.


(A robot engineer developed and outfitted by Google, with advanced backup on belt, lecturing on the new Chromebook at the Google I/O conference in San Francisco, Calif.)

Later that day, the robot engineer announced the new Chromebook computer at the Google conference on Tuesday. He developed and programmed the software on his own. “We’re terribly proud of Sundar, the most successful of our electronic colleagues,” said a Google engineer. Sundar, as they call it, taught himself product management and has risen through the famously meritocratic ranks of Google’s hierarchy to the level of Vice-President.

The autonomous developer can be programmed for different personalities — from cautious, in which it is more likely to write more code to avoid bugs and security breaches, to aggressive, where it is more likely to quickly write brief code and use expletives in documentation.

Christopher Urmson, a Carnegie Mellon University robotics scientist, was pair programming with a robot engineer but not typing. To gain control, he had to do one of three things: hit a red button near his right hand, move the mouse, or press a key. He did so twice, once when a robot almost removed colorful themes from Gmail and again when another human engineer was launching a new feature simultaneously. But the robot developer seemed likely to have prevented the accidents itself.

When he returned to automated "plugged in" mode, the robot slouched and made a grim face meant to evoke going into a deep meditative zone, and Dr. Urmson was able to take his hands off the keyboard and gesticulate when talking to a colleague. He said the engineers did attract attention, but people seem to think they are just some dorky young engineers that Google hired out of MIT.

The project is the brainchild of Sebastian Thrun, the 44-year-old director of the Stanford Artificial Intelligence Laboratory, the co-inventor of the Street View mapping service, and director of Google’s autonomous car project.

In 2005, he led a team of Stanford students and faculty members in designing the Stanley robot car, winning the second Grand Challenge of the Defense Advanced Research Projects Agency, a $2 million Pentagon prize for driving autonomously over 132 miles in the desert. Last year, he announced the Google driverless car project, which has recorded thousands of miles of driving on highways from San Francisco to Los Angeles. Google is currently lobbying Nevada to be the first state to legally allow autonomous vehicles.

Besides the team of 15 engineers working on the current project, Google created seven robot engineers, each working as an employee on the team to program themselves. Google is using six hundred Intel processors and one AMD processor in the project.

The Google researchers said the company did not yet have a clear plan to create a business from the experiments. Dr. Thrun is known as a passionate promoter of the potential to use robotics to make software more secure and lower the nation’s energy costs. It is a commitment shared by Larry Page, Google’s co-founder, according to several people familiar with the project.

Google first publicly experimented with human-less engineers at the Google I/O conference in 2010. Lars Rasmussen, another engineer at Google, worked with Thrun to create the robot engineer they named Jens which covertly played Lars Rasmussen’s brother. Jens presented a talk on stage to launch Google Wave with Lars Rasmussen supervising as his human operator. They notified authorities beforehand.


(Lars Rasmussen and Jens launching Google Wave last year)

Google Wave was conceived, developed, programmed, and launched entirely by Jens. Recently, however, Google Wave was deemed too technically complex and cancelled, one of the many failed projects created by robot engineers in the past year. The self-programming engineer initiative is an example of Google’s willingness to gamble on technology that may not pay off for years, Dr. Thrun said. Even the most optimistic predictions put the deployment of the technology more than eighteen years away. "The engineering quality is currently at Microsoft-level, but not Google-level, quality."

Late last year, Lars Rasmussen left Google for Facebook. Sources close to Google say that Lars attempted to assert his legal right to his robot brother Jens. “Despite flaws, he is such an invaluable colleague,” Lars remarked.

“The technology is ahead of the law in many areas,” said Bernard Liu, senior staff counsel for the California Human Rights Center. “If you look at the legal code, there are scores of laws pertaining to the rights of individuals, and they all presume to have a human being operating under contract.”

The Google researchers said they had carefully examined California’s legal regulations and determined that because the electronic engineers were wholly created at Google, the experimental employees are Google’s property. Mr. Liu agreed.

Scientists and engineers have been designing robots since the mid-1960s, but crucial innovation happened in 2005 when Thrun and colleagues achieved successes with their autonomous vehicles in the DARPA Grand Challenge. Peter Norvig, Director of Research at Google and Artificial Intelligence expert, and colleagues quickly translated their success in the complex task of driving into their own field of programming computers.

The original codename of the project was Android, the fictional human-like robot of Philip K. Dick stories, but that was scrapped with the increasing popularity of the Android mobile operating system also developed by Google. Since changing the name to Project Watson and starting collaboration with IBM, the technology has been steadily improving as the robot engineers work alongside Thrun’s human team to improve themselves.


(Smarter Than You Think: Guided by Computers and Sensors, 3 robot Google employees)

Advances have been so encouraging that Dr. Thrun sounds like an evangelist when he speaks of robot engineers. There is their potential to reduce energy use by eliminating the Google chefs and cafeterias, given the reduced need for amenities, and to ultimately build a smaller Googleplex.

There is even the farther-off prospect of employees that do not need any upper management. That would allow the robot engineers to manage themselves, so that they can get more work done. Fewer employees would then be needed, reducing the need for office space, which consumes valuable land.

And, of course, the robots could save engineers from themselves. "Can we program twice as much while playing video games at work, without the guilt?" Dr. Thrun said in a recent talk. "Yes, we can. Now, if only Droid apps would write themselves."

over 3 years ago on May 13 at 9:04 am by Joseph Perla in news, tech, google


Howdy, my name is Joseph Perla. Former VP of Technology, founding team, Turntable.fm. Entrepreneur. Actor. Writer. Art historian. Economist. Investor. Comedian. Researcher. EMT. Philosophe

Twitter: @jperla
