Hackers fly for free
I was in Europe for a bit this summer. I wanted to go to a technology conference to meet fellow hackers internationally. I saw that EuroPython (the Python language conference of Europe) was in Florence, Italy this year. They were offering scholarships for students for free tickets to the conference plus hotel. I applied, pointing to my many contributions to Python like my Python web framework. I got it, and had a fantastic time. It was incredibly well-organized, and I met some brilliant hackers. My room-mates made Kivy touch platform, psyco, and the Italian Pirate Bay. I met Armin Rigo, the hero-genius behind PyPy. I gave a talk about my minimalist with-statement based Python templating system. Google also hosted a programming competition, Google Code Jam. I got 2nd place among all contestants at the conference and I won a nice new Android phone! It was a good trip that paid for itself.
Teach yourself Git in 2 minutes
Git is very simple. It's very powerful, but fundamentally very logical and very simple. If you try to learn everything you can do with git, then the information will flood your brain and drown you. That's true of any powerful tool like Photoshop and Unix. But if you just want to use git to backup your code changes, develop new branches, and share your source, git is actually as straightforward as SVN. Avoid complex and dangerous commands like git-rebase. I've worked on large codebases with distributed teams and I've never needed anything more than basic commit, branch/merge, and push/pull. Git also has useful log, diff, and grep tools for quickly finding out information about your code. Git Flow: Git for HumansTo use git without brain augmentation surgery, you should make a simple, consistent system for yourself with a handful of commands. Or just use my system. You want to commit often, so it's good to create bash shortcuts so that you use git more often. The 2-letter shortcuts encourage you to commit more often and keep everyone's code up to date. You'll never be afraid of losing code. Here's my main workflow, commit and push:
$ # make changes, fix bugs... $ cm "fixed bug 214 in the UI" $ ph I'm constantly checking the status to see if I forgot to add files or commit something:
$ sl # On branch master nothing to commit (working directory clean) If I'm branching, I create a branch, make my changes, and then merge.
$ ct -b newfeature # make changes
$ ct master $ me newfeature And then I can push my changes and delete the branch.
$ ph $ bh -d newfeature You want to commit often, so always cm (git commit -a -m) and ph (git push) after even small changes. The 2-letter shortcuts encourage you to commit more often and keep everyone's code up to date. The codes are easy to remember because they are consistent. The codes are always 2 letters, composed of precisely the first letter of the command and the last letter (including all of the options). By using the last letter of the command including options, the shortcut tricks your mind into thinking of the full command every time you type it. Normally, you forget commands with abbreviations of the first letters, but with my system you remember the whole command every time so you can still use git on other systems and other peoples' computers. Git Flow ExamplesWhere is this variable myVar declared (git grep)?
$ gp myVar How is my branch different from master (git diff --ignore-space-change)?
$ de master I forgot, did I commit all my changes, what files did I forget to add (git status -uall)?
$ sl How do I make a new branch (git checkout -b)?
$ ct -b mybranch How do I merge it back (git merge)?
$ #ensure you've committed all changes in your branch $ ct master $ me mybranch How do I delete my branch after I've merged changes (git branch -d)?
$ bh -d mybranch How do I pull and push my changes (git pull, git push)?
$ pl $ ph What changes were made recently (git log)?
$ lg What branches exist and branch am I on (git branch)?
$ bh What if I screwed up and want to remove all the code in my branch without merging (git branch -D, since caps are harder)?
$ bh -D mybranch How do I make a new repository?
$ git init I just added new files to my code, how do I add them to my git repository?
$ ad . Below are my bash aliases. Add these to your ~/.bashrc file so that you can use these shortcuts too:
alias ad='git add' alias pl='git pull' alias ph='git push' alias cm='git commit -a -m' alias sl='git status -uall' alias lg='git log' alias gp='git grep' alias de='git diff --ignore-space-change' alias me='git merge' alias bh='git branch' alias ct='git checkout'
Don't write on the whiteboard
I recently interviewed at a major technology company. I won't mention the name because, honestly, I can't remember whether I signed an NDA, much less how strong it was. I did well. Mostly because of luck. I normally step over myself when I interview. I guess I've improved over the years. Here are a few tips to ace your own interview. 1. Don't write on the whiteboardWhen I interviewed at Palantir around 5 years ago, I had a lot of trouble with this. Yes, I knew next to nothing about computer science then, but I should have been able to answer many of those questions. For example, Palantir asked me to write an API for a hash table, and I forgot set key and get key, the most basic operations. The alien situation of the whiteboard contributed to my nervousness. I didn't get an offer from Palantir. Most people think you have to write on the whiteboard. Steve Yegge recommends that you practice writing code on a whiteboard and even buy and bring your own marker to the interview. That's pretty extreme and truly conveys the capriciousness of modern-day tech interviewing. The interviewer started by asking me to code up a simple recursive calculation, using any language I wanted. "I dont like to write on whiteboards," I said. "It feels unnatural and distracting. I'd prefer to write on paper." "Okay," he shrugged. The interviewers don't care. Use paper. 2. Bring your own paper and penSo I asked for some paper and a pen. But there was no paper around, only some post-it notes. My mistake. You should always have paper and pen anyway to write down ideas. On the subway, in line for movie tickets. Or you can keep a few sheets of paper with your resume and folder you brought to the interview (you did that, right?). Moleskines are excellent notebooks. Some of the best programmers figure out the high-level overview on paper before they write a single line of new code. 3. Use PythonEven if you are a C++ systems guru. Even if you only know how to use Eclipse to program Java. Learn Python and use it during your interview. Python's philosophy is very simple and consistent. It's largely composed of a subset of the ideas in Java and C++. 80% of Python is based around the dictionary (HashMap). It will take you a few days to learn and not much longer to master. You will waste a lot of time writing string manipulation code and initialization code that you can do in one line of Python. Get to the algorithm. All I had was a post-it note, a tiny amount of space. So I wrote down the algorithm in Python line-by-line on the post-it note. 4. Write short algorithms, then make them half as long, then make them shorter, then ask an expert how they would make it even shorterFive minutes later, I had my algorithm. It took up less than a few lines. He looked at it, yea, that looks correct. "Normally people write it in Java and it takes them a while and it takes up a lot more space on the whiteboard. They spend a lot of time manipulating the input string." Since my first interview at Palantir, I had done a lot of practice problems. The highest value problems I know of are on Project Euler. The site posts a sequence of problems of increasing difficulty which you have to solve with increasing efficiency. To become a great hacker, just do those problems in order (in Python!). Then take your solution and do it in half the lines. Now, read more of the Python docs (maybe read about generators and list comprehensions and decorators) and make it even shorter. Finally, look at the solutions posted on the Project Euler site. Stand in awe of the 1-line solutions. Weep in joy over the solutions of the guy who answered the problems using just pen and paper and his brain. The highest value courses I took are algorithms and advanced algorithms courses. I was lucky enough to study under Robert Tarjan. But I also did every problem in CLRS, the standard (and very well-written) algorithms textbook. If you do all that, you won't be nervous at your big interview. You'll be bored. 5. Write tests on your own code, sometimesI say sometimes instead of always because, first, it is impossible to test every case. 100% test coverage is a myth. Any non-trivial program is going to have too many edge cases to check, computationally. You need to test the high-value parts. You need to test the parts that you keep breaking. Finally, as you finish, the interviewer will look at your code and ask you to write tests for it. So, pre-empt him and describe the tests that you would write yourself. Write edge case tests. Run the tests in your mind. Does your algorithm work? Remember, the interviewer will ask you to do this anyway, so just do it yourself and you will be one step ahead and score well. The highest value book I read which taught me practical programming techniques is Programming Pearls. It also teaches you the importance of tests and how to write them in a pain-free way. Read this book, it's very short. He asked me to write tests for my code, find corner cases. He then asked me 3 other problems. They were Dan Tunkelang type problems. He ran out of problems and there were 15 minutes left. "Normally there's not enough time to ask more than 1 or 2", he said. So we just talked about VMs for 15 minutes. He taught me a lot about virtual machines. This brings up lesson 6: 6. Understand what you don't know, why you don't know it, have an interest in itRead random Wikipedia articles. You don't have to understand it all, just know enough to be able to ask someone who is knowledgeable. Usually, they can teach you a lot from the seed of what you read. But you need that seed, that germ of interest. This will make your questions good and your conversations interesting. People like to talk about themselves. Always carry some knowledge and some ignorance as fuel for them talk about themselves and teach you. The interviewer will leave with a positive impression of you. Don't just read blogs. Read research papers published by successful company. Read Google's MapReduce, GFS, and BigTable papers. Read Yahoo's Hadoop and PNUTS papers. Read Amazon's Dynamo paper. Big companies have big systems, and they will expect at least some familiarity with how they work. These papers are hard. If you have no systems experience, it may take you a day to read through a single one. In the end, you will understand not just how these systems work, but how to think about these systems and design one yourself. 7. Use esoteric tricks you know, teach the interviewerYou're not supposed to use libraries in these coding exercises. But if you know something cool, just use it. In the worst case, the interviewer will tell you to rewrite it. In the best case, he will be interested and ask more questions about it. You will teach him. I taught the next interviewer about the memoization decorator in Python. Memoization takes a complicated dynamic programming problem and makes it blazingly fast. He asked me to solve a problem. I wrote an O(N^2) algorithm, then made it O(N) with just one more line of code on my post-it note (still no paper). I taught him about how I often write a file-backed cached version of @memoized that writes to a file so that I can persist the quick results between runs from the shell. He's a C++ guy, so I taught him about Python decorators as well. 8. Think how you think, go with it.Studies on creativity have shown that if you tell people to be more creative, they end up less creative. If you split two painters up and tell them that the most creative person wins a prize, the paintings will be boring or the same. People forced into creativity think in the same way. So, don't think outside the box. Just think how you think to the extreme. Go with it. The third interviewer asked me to design a game API. He wanted a low-level design, but I misunderstood, so I started adding artificial intelligence features that would suggest moves for you to make. I wasn't sure what he was asking, and I kind of tried to clarify, but then I just went with it. My main interest is AI and machine learning. He was impressed by the originality, and I eventually answered his original questions too. You can usually tell when someone is being genuinely sincere by the manner in which they go out of their way to tell you, versus when they are just mouthing the words. He honestly implored, "it was really great talking to you." 9. Give reasons for your opinions, not just opinionsI asked the next interviewer how much he liked mobile development and he said he liked it. I was learning iPhone development. I played with Android too, but I told him I found the XML for UI distasteful. He asked, "Why?" I said you always want to put as much as you can into code because it gives you power. For example, you want to create and manage a repetitive UI element in code to avoid repetition. Avoiding repetition is the whole point of good coding. He said that Android has a solution for the repetition in the form of XML templates or something. "Oh, I didn't know that," I said. I thought about it more. "Yea," I observed, "the solution to the problem of XML, in enterprise, is often more XML." He chortled. At a previous interview, a CTO was looking for an employee #1. He was just leaving Google to start a new startup, and he asked me what my thoughts were about Amazon Web Services. I said, "they're good." I didn't realize that he did not have much experience with them (having been at Google), and wanted some real constructive feedback. I didn't realize that he was testing to see my experience and familiarity with AWS. More importantly, he was testing my reasoning ability and judgment. I have been running an EC2 instance continuously to run this blog for the past 5 years. It scales well and is very simple. I've used nearly every single one of their technologies since they came out. I invest all of my savings in Amazon because I know this technology will net them billions of dollars. I should have said this. Instead, all I said was "they're good." 10. Interview the interviewerThe last interviewer asked me some simple questions that reminded me of a harder problem I had in one of the companies I started. So I asked him how he would solve my problem. I was trying to find the phrases in a document that correspond to scientific terms in order to link them to Wikipedia automatically. The vocabulary is composed of words and sequences of words like "DNA", "p53ase", "phospholipid bilayer", and "congenital determined myoglasia peptide". Documents are research papers, which can be long. There are a lot of terms (a hundred thousand biological terms, and many more if we include other sciences). How would you find and label all of the phrases in a document efficiently and what is the big-O running time? He got the O(vocabulary size * document size) algorithm pretty easily, but I told him that there is an O(document size) solution. Can you solve it? Try it out. It's a fun, practical problem. I pushed him a little bit but he didn't get it. I interviewed the interviewer and stumped him (although I'm sure he would get it if he thought about it longer). 11. Mention your projects and passions firstAfter finishing up all of the problems in 40 minutes, I had 5 minutes left with the last interviewer. I started telling him about my minimalist python web framework and he said, "That's so interesting, that's what we should have been talking about instead of going over these questions." I got the offer this time, but I would much rather do a PhD instead. I want to learn how to push the boundaries of knowledge, not just apply what I learned in these books.
Sentiment analysis using transfer learning from reviews to news
I'm going to describe some failed experiments in my research in sentiment analysis. I am using LDA and supervised LDA. I will be developing other custom models that incorporate blog comments, and I will be training using stochastic optimization in future iterations. DataNYT: A corpus of medium-length (500-3000 words) articles from the New York Times. It contains nearly every article from January 1, 2008 through September 2011. It contains 115,586 documents and 118,028,937 words. It contains over 100,000 unique words. YELP: A corpus of short (10-500 words) local business reviews, almost exclusively restaurants, from the Yelp.com website. Each review is labeled with 1,2,3,4, or 5 stars by the author of the review to indicate the quality of the restaurant the text describes. It contains 152327 documents and 19,753,615 words. It also contains over 100,000 unique words, many of which are misspellings. ExperimentsI ran several experiments to figure out what information can be extracted about sentiments in the new york times articles dataset NYT. I created a vocabulary nytimes_med_common based on the NYT dataset using words that appear in less than 40% of the documents and more than 0.1% of the documents. This removes very common words and very rare which aren't informative about the document collection in general. First, I ran LDA on the NYT dataset using the nytimes_med_common vocabulary. On the most recent 2000 articles, I extracted 40 topics represented below. The topics closely follow the lines of politics, education, international news, and so on. They closely model the different sections of the newspaper. (lda_c_2011_10_16). I ran sLDA on the YELP dataset using the nytimes_med_common vocabulary. This excludes many features of the YELP dataset which are specific to restaurant reviews, and misspellings (e.g. "terrrrrible"). On the first 10000 reviews of the dataset, I extracted 50 topics. The topics computed include a few topics which describe negative words. Many of the topics generally describe specific kinds of restaurants (ice cream shops, thai foods) in detail in generally neutral or positive terms. There is a chinese food topic with generally negative terms. The topics with the most extreme coefficients do seem to give a good sense of the polarity of the words contained within. Based on informal analysis, it looks like the topics would have good word intrusion and document intrusion properties. (yelp_slda_2011_10_17) I ran LDA on the NYT dataset starting from the model and the topics extracted from the sLDA on the YELP dataset. This did not work very well, and got about the same topics as LDA from scratch. Perhaps a better experiment would be to take the topics with the most predictive coefficients, the 5-10 of them, and run LDA starting with those. (yelptopics_nytimes_lda_c_2011_10_17). More interestingly, I created a lexicon of the words with high coefficients for predicting the polarity of Yelp reviews using Naive Bayes (yelp_lexicon and yelp_lexicon_small). I ran LDA on the NYT dataset using the yelp_lexicon as a vocabulary. This brought out a few topics that did not strictly follow along with the newspaper sections. For example, there is an epidemic/disease topic. There is a "corrections" topic with words like the following: incorrectly, misidentified, erroneously, incorrect, correction. The topic on employment reveals a strong motivator: paid, contract, negotiations, wages, executives, employees, unions, manager, compensation. Many of the topics do match up, like baseball and football and music and food and books, but it is just a much more noisy set of topics. It is easier to find the same section topics when that section uses a lot of review-filled words (like food, music, and book reviews). Many of the topics are unidentifiable, perhaps I used too many topics. But some are interesting, such as topic 029 using yelp_lexicon_small: winner, favorite, amazing, perfect, fantastic, outstanding, with other words in various sections of the newspaper. A final experiment I ran on the Yelp dataset using nytimes_med_common vocabulary. I ran sLDA on the Yelp dataset to generate topics with coefficients. I then ran inference on the news articles using these generated topics and coefficients. The distribution of predicted ratings looks Gaussian with mean 3.5 and standard deviation .25 . Nearly all the documents are clustered to be labeled between 3 and 4 stars, with less than 5% below 3 or over 4. Even at the extremes, the documents with the highest predicted label have many death-related and terrorism-related articles. The negative extremes are also not consistent. My next experiment will be to try to isolate topics which relate specifically to sentiment, independent of domain. One idea I have relates to fixing topics when training (an idea Chong Wang introduced to me). My idea is to run LDA on the yelp dataset to generate domain topics. Then, I will run sLDA with those topics fixed plus 2-10 extra topics which are unfixed. The fixed topics will act as background with middling coefficients, I predict, and the remaining trained topics will end up with extreme coefficients and will contain strong sentiment words independent of topics in the domains.
Your website is unviral
Your website is probably unviral. Everybody wants his or her website to go viral. As web designers and entrepreneurs it is our goal to create buzz; an unstoppable avalanche of traffic; a self-feeding hurricane. For many entrepreneurs PayPal, YouTube and Facebook are the alpha and omega of marketing and strategic growth. The goal is to emulate the distinguishing characteristics of these products in an attempt to achieve similar heights. Some websites succeed and grow on the same trajectory or faster. They usually exist in fundamentally social businesses like email, payment services or social networking in places without such networks. However, there is also a special cadre of websites which lies in a no-man's land untouched by virality. Not only is it difficult for these sites to become viral, but the nature of the business actively fights its own online growth, behaving much like a tumor suppressor gene. These businesses never experience exponential growth and all of their growth paths, even if strong, are linear. Some examples quickly come to mind: male enhancement pills, adult diapers, schizophrenia medicine. The online world is in the habit of thinking itself as viral by nature. Virality, some think, is built into the very fabric of the Internet. It is not. In fact, one of the most profitable online businesses is unviral: online dating. Dating websites carry such a stigma that some couples successfully matched online invent a fictionalized romantic encounter to conceal the fact that they were mouse-selected by a filtering process and a geographic search. Although dating websites provide immense value, arguably more than almost any other online service, users will often strive to hide their enrollment from even their closest friends. Maybe especially their closest friends. Dating websites are unviral. They do not spread by word of mouth. As a matter of fact, they actively suppress this form of growth due to the nature of their service. Many an experienced entrepreneur has made the mistake of underestimating the unvirality of online dating. Your website may be unviral too, although it may be less than obvious. Perhaps it only demonstrates certain elements of unvirality; for instance, would your users tell all of their friends about your service, or only some? Would they actively deny using or even knowing about your website to certain friends, acquaintances, or co-workers if asked? Is it embarrassing? That would be pretty bad for your virality. If this is the case you must face the facts: your website is unviral. Look at examples of of purely viral websites: PayPal, the old Hotmail, YouTube and the current Facebook (not the old version that limited itself to college students) all display or once displayed growth without any symptoms of unvirality. I told everyone about Hotmail when it was launched: cousins, teachers, pen-pals. Users of contagious sites like this may not actively evangelize to literally everyone (as they do with, say, YouTube) but they certainly wouldn't avoid a discussion or hold back praise once the topic was broached. In contrast, there are many sites that quickly provoke responses of "yea that's weird," when mentioned. In these instances unvirality dominates and kills growth. Though unvirality is not a death sentence, it does limit a potential for greater growth. In some instances, unvirality is inherent to the service or structure of the business. But if this is the case, why should an entrepreneur involve himself with it and how can he manage to save it from the depths of unvirality? Two Answers: Anonymity and Covert Transformation- The Internet supports anonymity which allows users to praise a product they love to others--thousands or even millions of other strangers--while avoiding the embarrassment and reluctance to share a product or website that often results in unvirality. Anonymous forums and reviews abound for even the most unviral of products.
- Secondly: The web entrepreneur can covertly transform an unviral business into a viral one by euphemizing or disguising its true purpose. For example: Create a website with dating tools but market it as a social network. Facebook at Harvard worked this way with its subtle but pivotal "relationship status." Give users a guise under which to refer their friends and thus avoid the focus on unviral traits, such as the embarrassment endorsing a dating site, that would otherwise prevent the expansion of your user base.
Utility vs. ViralityMany people confuse utility with virality, but these are actually very independent qualities. Entrepreneurs may believe that if they have developed a good product it will naturally become viral. This could not be further from the truth. Often, people will want to tell their friends about a product they find useful but this is not a necessarily so. Some of the most useful online services actually discourage such talk. Something can be useful and viral (such as Facebook), useful and not viral (dating websites; most products ever created), not useful and viral (lolcatz; chia pets; almost all 4chan memes) and something can most definitely be neither useful nor viral (almost everything). Utility and virality are therefore orthogonal. They represent two different dimensions which can intersect but do not necessarily do so. In fact, something extremely useful--something that you and many others may pay thousands of dollars for to bring yourselves joy for a lifetime--may be strictly unviral. As I mentioned earlier, people will actively go out of their way to not talk about some very useful products. Many medical products fall into this category. Utility is one dimension of a service and it is clearly distinct from virality, though by no means mutually exclusive. The realm of influence between utility and virality is vast and depends mostly on the nature of the business. The lesson: just because your website is useful, does not mean it will be viral. Just because it is viral, it may not be useful and thus will die once the virus finishes spreading. First, solve the utility problem: build something useful. Then, you have to solve the distribution problem, and I gave 2 techniques for doing so: anonymity and covert transformations. Do you have other ideas?
How to hack Silicon Valley, meet CEO's, make your own adventure
It was my sophomore year. Everyone was making plans for fall break. What are you going to do? You don't know? There are only 4 days left before break. Before this particular fall break, I was busy with classes and had thus neglected to make plans. Some students were going skiing, others on class trips, others to homes nearby. Where are you going? I had no idea. However, around this time, I was reading a lot about California. I read work by entrepreneur and essayist, Paul Graham, in which he says that the San Francisco Bay Area is the best place to start a company. He described the energy, but I couldn't palpate it. If I were to take his word, it's an ethereal, magical place. That day, James Currier, internet entrepreneur, stood before me and a packed class full of eager students. His eyes were shot open, a purple glaze lit them afire. His wavy hair burst out atop his skinny head. Gaunt and fearless, he embraced the air as he swung his arms widely to make his point. “Silicon Valley is absolutely the place to be,” he said. “It’s where all technology happens. It’s where Google started, it’s where Apple, Yahoo, Intel, Oracle, and so many other technology companies started. Some of the smartest people in the world lived there at Stanford, Berkeley, and Xerox PARC. It is a magical forever-sunny wonderland where dreams come true and it rains investments and acquisitions.” He went on to make even more grandiose claims. Startups? Risky? Not at all when you do things right. Moreover, they are nothing compared to the risks of a financial job. Everyone laughed. This room in Princeton was filled with students who had already accepted offers at investment banks or who would be applying soon. In 2006, finance was booming with big bonuses and strong growth prospects. Derivatives opened up whole new worlds for trading and speculation. Operations research quants donned their glasses in pride. They were respected. So everyone laughed. He said, "No, really. They can fire you any time. They don't care about you. The market can turn the other way in a heartbeat. You have no job security. Your firm can go bankrupt." To the students at the time, this all seemed ludicrous. They all envied these corporate finance jobs, nevermind that many would lose their finance jobs less than 2 years later. He inspired. He didn't have charm so much as hurricane-force energy. He was insightful and learned. He talked about his great times. He talked about his learning moments. And so he inspired me to see it. I had to see it. What is so special about the Silicon Valley, the San Francisco Bay Area? How can it actually be that great? What exactly gives the air such power to breath life into world-changing tech empires? I knew what I had to do, but Fall Break was just two days away. How would I fly there without paying outrageous fees? Where would I stay? What would I do there? How would I meet the minds behind these great startups? I was a sophomore from Florida. I had no network in California. I searched online for cheap tickets, no luck--that is until I noticed an ad for Hotwire. If you have yet to try this site, Hotwire buys leftover seats in bulk, and then sells them to users blind such that they don't know exactly which flight on which airline at what time until they buy. I snagged a very cheap ticket for 3 days later. Now, where would I stay? I knew exactly three people from my high school in the bay area, 2 at Berkeley, and 1 at Stanford. I sent them all an email and hoped they would get back to me in time. I could always get a hotel somewhere. Finally, how could I reach the top startups in Silicon Valley and find entrepreneurs who could meet with me on such short notice? I didn’t know any CEO's. How do I hack Silicon Valley itself? TechCrunch always covers the hottest new funded startups. Every day, they publish dozens of new articles on the latest technology. I should just pick a few of the best and email them. But how do I choose the best? What if they don't get back to me? Will I waste a trip? I noticed half of all of the articles listed the location of each company. Some in California, some not. So I enumerated every neighborhood in the bay area: Redwood City, Palo Alto, Berkeley, San Francisco, Menlo Park, etc. I wrote a program to crawl all of the Techcrunch archives to find all of the articles about companies in one of these cities. I then parse out the name and URL automatically as well. I looked through the list of companies and I picked the most interesting. Some invented a new technology, and others just came out of a new incubator called YCombinator. These days, you can do this easily yourself with Crunchbase, a useful database of every startup in existence. I wrote another script to send an email to every single one: jobs@azureus.com, jobs@youos.com, and so on. In each email, I wrote: I am a student who will be graduating soon, and I would be very interested in learning more about your startup since I saw it in Techcrunch and I think your Company is very innovative. I'm from Princeton and I'd be interested in potentially working for your company. I am visiting California next week, can we please meet? I sent out dozens of emails, and then I waited. Not everyone replied, but many did. I flew in, finished my homework on the plane, and crashed with my friends (all three through through the week). I met with many CEOs. In startup land the companies are all very small. Everyone in the company has to wear many different hats. Therefore, when you send an email to jobs@startup.com, the CEO reads it. Few people realize that you can easily get direct access to startup CEOs. One company I reached out to was YouOS. YouOS was in the first class of YCombinator. They are incredibly good hackers. We just talked over pizza and they joked about how they've written and rewritten servers from scratch so many times that they can do it in 5 minutes while sleeping. YouOS did not work out, but the founders continued innovating. A couple of them made and then sold Project Wedding. Another went on to create thesixtyone.com and Aweditorium, two of the most innovative music apps in the world. Walking around Palo Alto, I saw several startups on each block. If you can imagine another planet where the Internet is turned into physical locations with storefronts, with Facebook next to Dropbox next to Shopkick, then you have a pretty good idea of what Silicon Valley looks like. You see zetok, jlingo, and any conceivable combination of letters plastered everywhere. I noticed a small blue frog in one corner, it looked familiar. I tried the door and walked upstairs. The offices were in fact the offices of Azureus, the Bittorrent app that made torrents popular. I went up to the exhausted man walking hurriedly by the front desk, and I began with: hi. I'm Joseph Perla. I am a student looking for a job. I am visiting just for a week, can I talk to you for just a few minutes? He was taken aback at first, a little flustered. He said, yes, sure, but not today, I'm a little stressed because I'm signing papers. We’re raising 4 million dollars right now. Can you come tomorrow? Absolutely. The next day, I spent 3 hours talking to the CEO one-on-one about Azureus, raising money, silicon valley, bittorrent, technology, France (he's French), and the french technology industry. I ended up meeting with half a dozen other startup founders. I toured the golden gate bridge and many parts of the bay area, Berkeley, and Stanford. The Valley is very friendly, and everyone does everything they can to help you because, at some point, someone definitely went out of their way to help them succeed. I started building a network from nothing. I directly used the connections I made on my spontaneous trip to start my next company, Labmeeting. David Tisch, of Techstars NYC, made a great point recently. Startups are very difficult. The odds are against you. Your competitors are twofold. On the one hand you compete with the biggest companies in the world. Even more difficult, you compete with inertia and ignorance and apathy. Everyone in the startup industry knows how hard it is, so we all do what we can to help each other to beat the odds. That's the only way it works at all. That's how we succeed against all odds. Silicon Valley is one big mega-commune of startup capitalists. Make the most of what you have (friends in new places), trust in people, and find out what the ethos of Silicon Valley is really like. I know scores of startups who would love to have smart students, especially students looking for jobs, visit their offices and see what they have built. I can point you in the right direction. CEO's love to tell their stories more than you like to listen to them! Let me know if you plan to make your own adventure, and please tell me about your trip when you get back.
Write bug-free javascript with Pebbles
Github: https://github.com/jperla/pebbles
We actively seek contributors!
Goals of Pebbles
- so easy that designers and non-programmers use it to write complicated AJAX!
- 0 lines of javascript
- complicated ajax websites
- 0 lines of javascript
- no bugs
- 0 lines of javascript
- very fast speed and optimality.
- backwards compatibility with clients who have javascript off (this was more important 4 years ago when I first made this)
- Memrise loves it!
Plus, you don't even have to write one line of javascript!
The basic idea is that almost every complicated AJAX interaction can be reduced to a handful of fundamental actions which can be composed (remind you of UNIX?). So, all you have to do with this library is add few lines of HTML to elements of a page to describe the Pebbles response that happens when someone clicks that element. Maybe you submit a form, maybe you fetch some content and update part of the page.
Most current websites write and rewrite slightly different versions of these same basic patterns in javascript. This separates the HTML which has information about AJAX interactions and the Javascript which has other information. But you want it all in one place!
Pebbles uses the jQuery.live function. Very heavy pages with tens of thousands of elements take 0 time to load, since almost no javascript is executed.
Javascript can be tricky to write even for an experience programmer. Moreover, a lot of this stuff is repeated, and it shouldn't be. Pebbles brings more of a descriptive style programming (a la Haskell, Prolog) to the web in the simplest of ways.
FAQ
Couldn't you just write javascript functions that you call that do the same thing?
You might but then you introduce the opportunity of syntax and other programming errors, thus not achieving 0 bugs. You would also have to figure out how to make it fast yourself. In practice, this library is so straightforward to use that once you define a complicated action, which only takes a few seconds, you can move it around and it just always works.
Moreover, it's easier to auto-generate correct readable html (e.g. from Django templates). Many of your pages won't need *any* javascript even if highly dynamic. All the custom logic is in one place rather than spread over the html and the javascript. Basically, writing javascript is harder than what amounts to a DSL in HTML.
I need more complicated action-handlers than just these 3, can you please make them?
The code is open source and on Github on jperla/pebbles. Feel free to add your own enhancements. Be careful because you want to keep your app simple, and, in my experience, these 3 actions comprise the vast majority of user ajax paradigms. With a little thinking you can probably do what you want using either "form-submit" or "replace" with the right response html.
Technical Documentation
Pebbles accepts spinner url (to an animated gif of a spinner for waits).
Pebbles sets up a live listener on divs with classes of type "actionable".
Classes of type actionable contain a hidden div which has class "kwargs".
.actionable .kwargs { display: none; }
kwargs div contains a number of <input> html elements, each with a name and value. The name is the key name, the value is the value for that key. In this way, in HTML, we specify a dictionary of keyword arguments to the actionable.
Here are some self-explanatory examples:
It fails loudly if misconfigured. It's hard to write buggy code and not notice in quick testing. It is easy to do everything right and it is easy for you to write a complex ajax website with no extra javascript code.
Full arguments are below:
===========================
Arguments:
type: replace, open-close, submit-form
replace replaces the target with the url
open-close will toggle hide/display the target,
which also may dynamically lazily load content from an url
submit-form submits a form via ajax which is a child of the actionable,
or may be specified in form argument;
the response of the ajax replaces target
url: url string of remote page contents
target: CSS3 selector of element on page to update
target-type: absolute, parent, sibling, closest, or child-of
Absolute just executes the target selector in jQuery.
Parent executes target selector on jQuery.parents().
Sibling the same on siblings.
Closest looks at children and children of children and so on.
child-of looks at target's children
closest: selector used in combination with target-type:child-of to get target's children
form: selector used in combination with type:submit-form to find the form
If you use the open-close type, then the actionable can have two child divs with classes "when-open" and "when-closed". Fill when-open with what the actionable looks like when the target is toggled open (for example, a minus sign), and fill when-closed with what the it looks like when the target is toggled closed (for example, a plus sign).
How to launch in a month, scale to a million users
These are case studies. I will talk about my last two startups where I used a lot of techniques to build them quickly and scale them up. Here I explore different techniques I used to architect them to scale which are quite simple, but someone who is not familiar with building systems may be interested in learning how to build his or her own scalable site. This is based on the outline of a paper by Lampson with more modern web-based examples: http://research.microsoft.com/en-us/um/people/blampson/33-Hints/WebPage.html Labmeeting was a search engine for biomedical literature and a social network for scientists. http://www.crunchbase.com/company/labmeeting. The same principles in Lampson can be applied to any large system such as Google or http://www.turntable.fm. FunctionalityKeep it simple.We built API's before making the website at Labmeeting. That means that the design of data access, security, and data flow happens long before the first interfaces are created. Simplicity in a small interface is key, with well-defined and single-purpose functions coming from each module and submodule of the API. The whole front-end interface uses exclusively less than 30 methods in 5 modules available in the API. Get it right.From day one, we built automated tests into Labmeeting to catch any conceivable and subtle bugs that we may introduce during development. In advance, we knew that it would be a complex, dynamic site with hard to reproduce state. This made it all the more important that each simple method in the API performed exactly as it needed to both in the edge cases and in the normal case. We had individual function tests, module level tests, and full integration tests that automatically started a full chatserver and tested real requests. The tests were run on every commit and no bugs were allowed to persist before writing new code. Don't hide powerLabmeeting was a very dynamic website using a lot of AJAX to speed up page requests and minimize initial page download size. You could click on a fragment of an abstract, or a button that said Full Text, which then made a request to the server and replaced information on the page with the response. A lot of javascript is repetitive, and can be very buggy. We implemented a library that could create complicated AJAX interactions by writing 0 javascript, instead just adding a few extra HTML tags to code. The library virtually eliminated bugs and increased speed on the site by eliminating javascript execution time and centralizing code. Despite just requiring HTML tags, the library allows for maximum flexibility by the user to submit full CSS3 jQuery selectors as arguments if desired. Despite normalizing an interface, it does not hide the power of jQuery. Memrise.com now uses this library enthusiastically. You can see the docs and use this library yourself at the Pebbles introduction. Use procedure arguments to provide flexibility in an interfaceWe created a system for filtering through news articles. The system has many basic parameters that can be passed that are very simple, but the parameters are simply procedures. Therefore, if someone had a special complicated need, they could write their own function that returned a boolean value of whether to filter the news and pass that through the interface. Leave it to the clientThe interface at Labmeeting was very simple, and we expect the client to perform complicated manipulations of the many elements of the interface and keep track of all those states. This allowed the backend to be developed very quickly, although it meant that frontends, like an iPad or Android app, take a little longer to develop. ContinuityKeep basic interfaces stable. Keep a place to stand if you do have to change interfaces. The API of Labmeeting and Stickybits is versioned. They can thus offer full compatibility with previous functionality, but enhancements and changes can be made in newer versions. Making implementations workPlan to throw one away.Many of the routines in the initial prototypes were written very quickly and with an eye to throwing them out once in full production mode. For example, the first version of the PDF search feature pulled in the whole user's collection and did a search in memory. A fully optimized version would be a little more complicated, but many routines were designed that way with an eye to throwing the inner part of the function out and rewriting once the bottlenecks are identified. Keep secrets of the implementationWe built Labmeeting up from separate silo'd modules that, while decreasing performance a bit, allowed them to operate independently and with maximum flexibility to respond to changes in requirements in the interface. For example, the PDF manager knew nothing about how users were stored or queried. The User api could store users in memory, on disk, in Postgres, or halfway across the world. Lab groups only knew that it could call the same API external methods used to look up a user or set of users. Use a good idea again instead of generalizing itAt Labmeeting, we had to extract author names from PDFs. We realized that we could do decently well at extracting the names using machine learning techniques, but never perfectly. However, by indexing a gazette, a complete database of every possible PDF, then we could simply make some guesses (possibly using machine learning) and then just look up those guesses in the gazette to see if there is a match. It becomes a problem of efficient enumeration. We didn't generalize it, and used the idea again in a slightly different context. Each PDF has a scientific abstract with various complicated terms from biology and physics. We wanted to identify those important terms to allow further exploration. Again, some indicators could point us in the right direction, but we did not get everything. So, we crawled Wikipedia to compile a gazette of biological terms, then merely used those terms in the abstracts that appear in the gazette modulo very frequent words like DNA. This was highly accurate again. We linked these extracted entities to Wikipedia to provide further information for the curious. Handle all the casesHandle normal and worst cases separately as a ruleAt Labmeeting, we analyzed PDFs to extract the title, publication date, and other information. The special case of a PDF which is encrypted and unparseable and no text can be extracted went straight to a separate method. The special case could possibly be handled by a more general-purpose algorithm for text extraction that happens to special case to a right answer, but it is more straightforwardly handled separately. Anyone reading the code could see it plainly, rather than having to think through the special case in more complicated parsing code. SpeedSplit resources in a fixed way if in doubtAt Labmeeting, we put the database index on a separate machine from the Solr search index. We had millions of search queries coming into the search system, and we didn't want those queries to slow down the db, and thus normal operation of the site. Writes take much longer than reads, and are more important for logged in users. External users of the site using the search engine just hit the index, performing exclusively reads on the index. This allowed us to scale up the search index independently from the database. Use static analysis if you canAt Labmeeting, before every commit, I had a version of PyFlakes run on all of my new code. PyFlakes is a static analysis tool for Python that finds common errors that can be detected before run-time. For example, PyFlakes can find improper number of arguments to a function call and references to variable names that are not in scope (like typos). Static analysis finds a lot of bugs that might appear in production only rarely in edge cases. It is most useful in a language like Python that is dynamic and thus doesn't have a lot of the normal safety features available to a statically typed language. Cache answers to expensive computationsObvious we did this all of the time at Labmeeting. For example, we performed a document similarity search to find "Related Papers" when we showed one individual paper to recommend other papers a scientist may want to read. The vector computation and search for this is quite expensive so we cache the results for a month. Another example: we had to open up a PDF file which has a research publication, perform text extraction, and then do an information extraction step from the text to analyze the title, authors, publication date and other information. This is a difficult problem to do and involves searching a gazette of 30 million documents and querying the PubMed database at least once. Once this process was completed for one step we saved it to the paper metadata so that we would not have to calculate it again for that PDF each time. The flip-side of caching is that for quickly changing data then one needs to be careful about cache invalidation. When in doubt, use brute forceWe wanted to get the first version of Labmeeting finished very quickly. There are many ways to optimize a system to improve performance, but they come at the cost of decreasing modularity, making more assumptions, and, most directly costly, developer time. The first implementations of the pdf search algorithm used brute force linear search by pulling the name of every scientific paper and then searching each one for the substring. This takes a few minutes to write and does not require a complicated separate hosted index. Moreover, for the small number of papers used during testing, it ends up being much faster than doing a network query to a search index! Compute in background when possibleAfter a user uploads a PDF to Labmeeting, a process must go through the PDF, analyze it and normalize it, perhaps convert it to a standard format, extract the metadata, and deduplicate it. This process can take a while, so we avoid this process from blocking the web server by pushing it to a queue. When the queue completes, it sends a message back to the user, which adds the PDF to the person's collection.
Google Creates Humanoid Robot, Programs Itself
by Joseph Perla. Associated Press. May 13, 2011.
MOUNTAIN VIEW, Calif. — Anyone enjoying talks at Google's recent I/O Conference at Moscone West in San Francisco may have glimpsed some engineers wearing curiously thick belts or backpacks. Harder to notice was that the person carrying those items was not actually a person.
 (Computer hardware in the inside of one of the seven autonomous electronic engineers.)
The robots are a project of Google, which has been working in secret but in plain view on robot engineers that can program themselves, using artificial-intelligence software that can reason about programming and mimic the decisions made by a human engineer.
With a technician nearby with root access to monitor the robot talk, seven test engineers have given over 1,000 tech talks without human intervention and written more than 140,000 lines of code with only occasional human debugging. One even programmed itself to learn product management, a task that requires creative and analytical thinking. The only accident, engineers said, was when one robot engineer released a product that was far too technical for human engineers and users at Google I/O last year.
Autonomous electronic programmers are years from mass production, but technologists who have long dreamed of them believe that they can transform society as profoundly as the Internet has.
Robot employees fix bugs faster than humans, have infinite memory and do not get distracted, sleepy or intoxicated, the engineers argue. They speak in terms of products shipped and bugs avoided — more than 37,000 bug patches were released by software development shops in the United States in 2009. The engineers say the technology could double the capacity of the Internet by re-engineering every line of code in legacy routers. Because the robot engineers would eventually require less office space and energy than a human, they would reduce Google’s carbon footprint. But of course, to be truly better, the robots must be far more reliable than, say, today’s personal computers, which crash on occasion and are frequently infected.
The Google research program using artificial intelligence to revolutionize programming is proof that the company’s ambitions reach beyond the search engine business. The program is also a departure from the mainstream of innovation in Silicon Valley, which has veered toward social networks and Hollywood-style digital media.
During a half-hour talk beginning Moscone West, a convention center in the heart of San Francisco last Monday, a robot engineer equipped with a variety of sensors and following a Powerpoint projected onto the screen nimbly discussed the finer details of unsupervised machine learning to several thousand developers from the heart of Silicon Valley. Little did the attendees know that the code he was projecting and editing was his own.

(A robot engineer developed and outfitted by Google, with advanced backup on belt, lecturing on the new Chromebook at the Google I/O conference in San Francisco, Calif.)
Later that day, the robot engineer announced the new Chromebook computer at the Google conference on Tuesday. He developed and programmed the software on his own. “We’re terribly proud of Sundar, the most successful of our electronic colleagues,” said a Google engineer. Sundar, as they call it, taught himself product management and has risen through the famously meritocratic ranks of Google’s hierarchy to the level of Vice-President.
The autonomous developer can be programmed for different personalities — from cautious, in which it is more likely to write more code to avoid bugs and security breaches, to aggressive, where it is more likely to quickly write brief code and use expletives in documentation.
Christopher Urmson, a Carnegie Mellon University robotics scientist, was pair programming with a robot engineer but not typing. To gain control, he has to do one of three things: hit a red button near his right hand, move the mouse, or press a key. He did so twice, once when a robot almost removed colorful themes from Gmail and again when another human engineer was launching a new feature simultaneously. But the robot developer seemed likely to have prevented the accidents itself.
When he returned to automated "plugged in" mode, the robot slouched and made a grim face meant to evoke going into a deep meditative zone and Dr. Urmson was able to take his hands off the keyboard and gesticulate when talking to a colleague. He said the engineers did attract attention, but people seem to think they are just the some dorky young engineers that Google just hired out of MIT.
The project is the brainchild of Sebastian Thrun, the 44-year-old director of the Stanford Artificial Intelligence Laboratory, the co-inventor of the Street View mapping service, and director of Google’s autonomous car project.
In 2005, he led a team of Stanford students and faculty members in designing the Stanley robot car, winning the second Grand Challenge of the Defense Advanced Research Projects Agency, a $2 million Pentagon prize for driving autonomously over 132 miles in the desert. Last year, he announced the Google driverless car project which has recorded thousands of miles of driving on hlghways from San Francisco to Los Angeles. Google is currently lobbying Nevada to be the first state to allow autonomous vehicles legally.
Besides the team of 15 engineers working on the current project, Google created seven robot engineers, each working as employees on the team to program themselves. Google is using six hundred Intel and one AMD processor in the project.
The Google researchers said the company did not yet have a clear plan to create a business from the experiments. Dr. Thrun is known as a passionate promoter of the potential to use robotics to make software more secure and lower the nation’s energy costs. It is a commitment shared by Larry Page, Google’s co-founder, according to several people familiar with the project.
Google first publicly experimented with human-less engineers at the Google I/O conference in 2010. Lars Rasmussen, another engineer at Google, worked with Thrun to create the robot engineer they named Jens which covertly played Lars Rasmussen’s brother. Jens presented a talk on stage to launch Google Wave with Lars Rasmussen supervising as his human operator. They notified authorities beforehand.
 (Lars Rasmussen and Jens launching Google Wave last year)
Google Wave was conceived, developed, programmed, and launched entirely by Jens. Recently, however, Google Wave was deemed too technically complex and cancelled, one of the many failed projects created by robot engineers in the past year. The self-programming engineer initiative is an example of Google’s willingness to gamble on technology that may not pay off for years, Dr. Thrun said. Even the most optimistic predictions put the deployment of the technology more than eighteen years away. "The engineering quality is currently at Microsoft-level, but not Google-level, quality."
Late last year, Lars Rasmussen left Google for Facebook. Sources close to Google say that Lars attempted to assert his legal right to his robot brother Jens. “Despite flaws, he is such an invaluable colleague,” Lars remarked.
“The technology is ahead of the law in many areas,” said Bernard Liu, senior staff counsel for the California Human Rights Center. “If you look at the legal code, there are scores of laws pertaining to the rights of individuals, and they all presume to have a human being operating under contract.”
The Google researchers said they had carefully examined California’s legal regulations and determined that because the electronic engineers were wholly created at Google, the experimental employees are Google’s property. Mr. Liu agreed.
Scientists and engineers have been designing robots since the mid-1960s, but crucial innovation happened in 2005 when Thrun and colleagues achieved successes with their autonomous vehicles in the DARPA Grand Challenge. Peter Norvig, Director of Research at Google and Artificial Intelligence expert, and colleagues quickly translated their success in the complex task of driving into their own field of programming computers.
The original codename of the project was Android, the fictional human-like robot of Philip K. Dick stories, but that was scrapped with the increasing popularity of the Android mobile operating system also developed by Google. Since changing the name to Project Watson and starting collaboration with IBM, the technology has been steadily improving as the robot engineers work alongside Thrun’s human team to improve themselves.

(Smarter Than You Think: Guided by Computers and Sensors, 3 robot Google employees)
Advances have been so encouraging that Dr. Thrun sounds like an evangelist when he speaks of robot engineers. There is their potential to reduce energy use by eliminating the Google chefs and cafeterias, given the reduced need for amenities, and to ultimately build a smaller Googleplex.
There is even the farther-off prospect of employees that do not need any upper management. That would allow the robot engineers to manage themselves, so that they can get more work done. Fewer employees would then be needed, reducing the need for office space, which consumes valuable land.
And, of course, the robots could save engineers from themselves. "Can we program twice as much while playing video games at work, without the guilt?" Dr. Thrun said in a recent talk. "Yes, we can. Now, if only Droid apps would write themselves."
|