Tech and Business Pearls

How to launch in a month, scale to a million users

These are case studies.

I will talk about my last two startups where I used a lot of techniques to build them quickly and scale them up. Here I explore different techniques I used to architect them to scale which are quite simple, but someone who is not familiar with building systems may be interested in learning how to build his or her own scalable site.

This is based on the outline of a paper by Lampson with more modern web-based examples:

http://research.microsoft.com/en-us/um/people/blampson/33-Hints/WebPage.html

Labmeeting was a search engine for biomedical literature and a social network for scientists. http://www.crunchbase.com/company/labmeeting.

The same principles in Lampson can be applied to any large system such as Google or http://www.turntable.fm.

Functionality

Keep it simple.

We built API's before making the website at Labmeeting. That means that the design of data access, security, and data flow happens long before the first interfaces are created. Simplicity in a small interface is key, with well-defined and single-purpose functions coming from each module and submodule of the API. The whole front-end interface uses exclusively less than 30 methods in 5 modules available in the API.

Get it right.

From day one, we built automated tests into Labmeeting to catch any conceivable and subtle bugs that we may introduce during development. In advance, we knew that it would be a complex, dynamic site with hard to reproduce state. This made it all the more important that each simple method in the API performed exactly as it needed to both in the edge cases and in the normal case. We had individual function tests, module level tests, and full integration tests that automatically started a full chatserver and tested real requests. The tests were run on every commit and no bugs were allowed to persist before writing new code.

Don't hide power

Labmeeting was a very dynamic website using a lot of AJAX to speed up page requests and minimize initial page download size. You could click on a fragment of an abstract, or a button that said Full Text, which then made a request to the server and replaced information on the page with the response. A lot of javascript is repetitive, and can be very buggy. We implemented a library that could create complicated AJAX interactions by writing 0 javascript, instead just adding a few extra HTML tags to code. The library virtually eliminated bugs and increased speed on the site by eliminating javascript execution time and centralizing code. Despite just requiring HTML tags, the library allows for maximum flexibility by the user to submit full CSS3 jQuery selectors as arguments if desired. Despite normalizing an interface, it does not hide the power of jQuery. Memrise.com now uses this library enthusiastically.

You can see the docs and use this library yourself at the Pebbles introduction.

Use procedure arguments to provide flexibility in an interface

We created a system for filtering through news articles. The system has many basic parameters that can be passed that are very simple, but the parameters are simply procedures. Therefore, if someone had a special complicated need, they could write their own function that returned a boolean value of whether to filter the news and pass that through the interface.

Leave it to the client

The interface at Labmeeting was very simple, and we expect the client to perform complicated manipulations of the many elements of the interface and keep track of all those states. This allowed the backend to be developed very quickly, although it meant that frontends, like an iPad or Android app, take a little longer to develop.

Continuity

Keep basic interfaces stable. Keep a place to stand if you do have to change interfaces.

The API of Labmeeting and Stickybits is versioned. They can thus offer full compatibility with previous functionality, but enhancements and changes can be made in newer versions.

Making implementations work

Plan to throw one away.

Many of the routines in the initial prototypes were written very quickly and with an eye to throwing them out once in full production mode. For example, the first version of the PDF search feature pulled in the whole user's collection and did a search in memory. A fully optimized version would be a little more complicated, but many routines were designed that way with an eye to throwing the inner part of the function out and rewriting once the bottlenecks are identified.

Keep secrets of the implementation

We built Labmeeting up from separate silo'd modules that, while decreasing performance a bit, allowed them to operate independently and with maximum flexibility to respond to changes in requirements in the interface. For example, the PDF manager knew nothing about how users were stored or queried. The User api could store users in memory, on disk, in Postgres, or halfway across the world. Lab groups only knew that it could call the same API external methods used to look up a user or set of users.

Use a good idea again instead of generalizing it

At Labmeeting, we had to extract author names from PDFs. We realized that we could do decently well at extracting the names using machine learning techniques, but never perfectly. However, by indexing a gazette, a complete database of every possible PDF, then we could simply make some guesses (possibly using machine learning) and then just look up those guesses in the gazette to see if there is a match. It becomes a problem of efficient enumeration. We didn't generalize it, and used the idea again in a slightly different context. Each PDF has a scientific abstract with various complicated terms from biology and physics. We wanted to identify those important terms to allow further exploration. Again, some indicators could point us in the right direction, but we did not get everything. So, we crawled Wikipedia to compile a gazette of biological terms, then merely used those terms in the abstracts that appear in the gazette modulo very frequent words like DNA. This was highly accurate again. We linked these extracted entities to Wikipedia to provide further information for the curious.

Handle all the cases

Handle normal and worst cases separately as a rule

At Labmeeting, we analyzed PDFs to extract the title, publication date, and other information. The special case of a PDF which is encrypted and unparseable and no text can be extracted went straight to a separate method. The special case could possibly be handled by a more general-purpose algorithm for text extraction that happens to special case to a right answer, but it is more straightforwardly handled separately. Anyone reading the code could see it plainly, rather than having to think through the special case in more complicated parsing code.

Speed

Split resources in a fixed way if in doubt

At Labmeeting, we put the database index on a separate machine from the Solr search index. We had millions of search queries coming into the search system, and we didn't want those queries to slow down the db, and thus normal operation of the site. Writes take much longer than reads, and are more important for logged in users. External users of the site using the search engine just hit the index, performing exclusively reads on the index. This allowed us to scale up the search index independently from the database.

Use static analysis if you can

At Labmeeting, before every commit, I had a version of PyFlakes run on all of my new code. PyFlakes is a static analysis tool for Python that finds common errors that can be detected before run-time. For example, PyFlakes can find improper number of arguments to a function call and references to variable names that are not in scope (like typos). Static analysis finds a lot of bugs that might appear in production only rarely in edge cases. It is most useful in a language like Python that is dynamic and thus doesn't have a lot of the normal safety features available to a statically typed language.

Cache answers to expensive computations

Obvious we did this all of the time at Labmeeting. For example, we performed a document similarity search to find "Related Papers" when we showed one individual paper to recommend other papers a scientist may want to read. The vector computation and search for this is quite expensive so we cache the results for a month. Another example: we had to open up a PDF file which has a research publication, perform text extraction, and then do an information extraction step from the text to analyze the title, authors, publication date and other information. This is a difficult problem to do and involves searching a gazette of 30 million documents and querying the PubMed database at least once. Once this process was completed for one step we saved it to the paper metadata so that we would not have to calculate it again for that PDF each time. The flip-side of caching is that for quickly changing data then one needs to be careful about cache invalidation.

When in doubt, use brute force

We wanted to get the first version of Labmeeting finished very quickly. There are many ways to optimize a system to improve performance, but they come at the cost of decreasing modularity, making more assumptions, and, most directly costly, developer time. The first implementations of the pdf search algorithm used brute force linear search by pulling the name of every scientific paper and then searching each one for the substring. This takes a few minutes to write and does not require a complicated separate hosted index. Moreover, for the small number of papers used during testing, it ends up being much faster than doing a network query to a search index!

Compute in background when possible

After a user uploads a PDF to Labmeeting, a process must go through the PDF, analyze it and normalize it, perhaps convert it to a standard format, extract the metadata, and deduplicate it. This process can take a while, so we avoid this process from blocking the web server by pushing it to a queue. When the queue completes, it sends a message back to the user, which adds the PDF to the person's collection.

over 3 years ago on September 21 at 5:41 am by Joseph Perla in tech, hacks, entrepreneurship


blog comments powered by Disqus

Hi, my business card says Joseph Perla. Former VP of Technology, founding team, Turntable.fm. My first college startup was in the education space. My second was Labmeeting, a cross between Google, LinkedIn, and Facebook for scientists. I dropped out of Princeton (twice).

I love to advise and help startups. My code on Github powers many websites and iPhone apps. I give talks about startup tech around the US and also internationally at conferences in Florence. incubators in Paris, and startups in Budapest.

Twitter: @jperla

Sign up to my Tech and Business Blog

Favorite Posts

Y Combinator Application Guide
What to do in Budapest
How to hack Silicon Valley, meet CEO's, make your own adventure
Your website is unviral
The Face that Launched a Thousand Startups
Google Creates Humanoid Robot, Programs Itself

Popular Posts

How to launch in a month, scale to a million users
Weby templates are easier, faster, and more flexible
Write bug-free javascript with Pebbles
How to Ace an IQ Test
Capturing frames from a webcam on Linux
A Clean Python Shell Script
Why Plant Rights?

Recent Posts

Venture Capital is broken
The nature of intelligence: brain bowls, cogniphysics, and prochines
Bitcoin: A call-to-arms for technologists
Stanford is startups
Today is Internet Freedom Day! DRM-free book about Aaron Swartz's causes
I help startups around the world

More...