These are case studies.
I will talk about my last two startups, which I built quickly and then scaled up. The techniques I used to architect them for scale are quite simple, but someone who is not familiar with building systems may find them useful when building his or her own scalable site.
This is based on the outline of a paper by Lampson with more modern web-based examples:
Labmeeting was a search engine for biomedical literature and a social network for scientists. http://www.crunchbase.com/company/labmeeting.
Keep it simple.
We built APIs before making the website at Labmeeting. That meant the design of data access, security, and data flow happened long before the first interfaces were created. Simplicity in a small interface is key, with well-defined, single-purpose functions coming from each module and submodule of the API. The whole front-end interface uses fewer than 30 methods in 5 modules of the API.
Get it right.
From day one, we built automated tests into Labmeeting to catch any bugs, obvious or subtle, that we might introduce during development. We knew in advance that it would be a complex, dynamic site with hard-to-reproduce state. This made it all the more important that each simple method in the API performed exactly as needed, both in the edge cases and in the normal case. We had individual function tests, module-level tests, and full integration tests that automatically started a full chat server and tested real requests. The tests were run on every commit, and no known bug was allowed to persist before new code was written.
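As a rough illustration of a function-level test that covers both the normal case and an edge case, here is a minimal sketch using Python's standard unittest module. The normalize_author function is a hypothetical API method invented for this example, not part of the Labmeeting code.

```python
import unittest


def normalize_author(name):
    """Hypothetical API method: collapse whitespace and title-case a name."""
    cleaned = " ".join(name.split())
    if not cleaned:
        raise ValueError("author name is empty")
    return cleaned.title()


class NormalizeAuthorTest(unittest.TestCase):
    def test_normal_case(self):
        self.assertEqual(normalize_author("rosalind  franklin"), "Rosalind Franklin")

    def test_edge_case_empty(self):
        # The edge case is tested explicitly instead of being left to chance.
        with self.assertRaises(ValueError):
            normalize_author("   ")


def run_tests():
    """Run the suite programmatically, as a commit hook might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeAuthorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A commit hook could call run_tests() and refuse the commit when the result is not successful.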
Don't hide power
You can see the docs and use this library yourself at the Pebbles introduction.
Use procedure arguments to provide flexibility in an interface
We created a system for filtering through news articles. The system accepts many basic parameters that are very simple, but the parameters are themselves procedures. Therefore, if someone had a special, complicated need, they could write their own function that returned a boolean indicating whether to filter an article, and pass that through the interface.
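A minimal sketch of the idea, assuming articles are plain dicts with "title" and "source" keys and that a predicate returning True means "keep this article". The names filter_news, from_source, and crispr_from_techblog are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable


def filter_news(articles: Iterable[dict], keep: Callable[[dict], bool]) -> list:
    """Return the articles for which the caller-supplied procedure is true."""
    return [article for article in articles if keep(article)]


def from_source(source: str) -> Callable[[dict], bool]:
    """A simple built-in parameter: it is itself just a procedure."""
    def predicate(article: dict) -> bool:
        return article["source"] == source
    return predicate


def crispr_from_techblog(article: dict) -> bool:
    """A caller's own special-purpose rule, passed through the same interface."""
    return article["source"] == "techblog" and "CRISPR" in article["title"]


articles = [
    {"title": "New CRISPR results", "source": "nature"},
    {"title": "Funding round announced", "source": "techblog"},
    {"title": "CRISPR patent dispute", "source": "techblog"},
]

simple = filter_news(articles, from_source("nature"))
custom = filter_news(articles, crispr_from_techblog)
```

The interface never needs to grow a new option for each special case; the caller supplies the logic.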
Leave it to the client
The interface at Labmeeting was very simple, and we expected the client to perform complicated manipulations of the many elements of the interface and to keep track of all those states. This allowed the backend to be developed very quickly, although it meant that frontends, like an iPad or Android app, took a little longer to develop.
Keep basic interfaces stable. Keep a place to stand if you do have to change interfaces.
The APIs of Labmeeting and Stickybits are versioned. They can thus offer full compatibility with previous functionality, while enhancements and changes are made in newer versions.
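One way to picture a versioned API is a dispatch table keyed by version and method name. This is only a sketch under assumed names (get_user_v1, get_user_v2, dispatch, and the response fields are all hypothetical); the actual versioning scheme is not described here.

```python
def get_user_v1(user_id):
    # Original response shape; kept unchanged for existing clients.
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id):
    # A newer version may change the shape without breaking v1 callers.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


HANDLERS = {
    ("v1", "get_user"): get_user_v1,
    ("v2", "get_user"): get_user_v2,
}


def dispatch(version, method, *args):
    """Route a call to the handler registered for that API version."""
    try:
        handler = HANDLERS[(version, method)]
    except KeyError:
        raise LookupError(f"unknown API method {version}/{method}") from None
    return handler(*args)
```

Old clients keep calling "v1" and see the old behavior, which is the place to stand while the interface changes.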
Making implementations work
Plan to throw one away.
Many of the routines in the initial prototypes were written very quickly, with an eye to throwing them out once we were in full production mode. For example, the first version of the PDF search feature pulled in the user's whole collection and did a search in memory. A fully optimized version would be a little more complicated, but many routines were designed this way, so that the inner part of the function could be thrown out and rewritten once the bottlenecks were identified.
Keep secrets of the implementation
We built Labmeeting up from separate, siloed modules. This decreased performance a bit, but it allowed the modules to operate independently and with maximum flexibility to respond to changes in the requirements of the interface. For example, the PDF manager knew nothing about how users were stored or queried. The User API could store users in memory, on disk, in Postgres, or halfway across the world. The lab groups module knew only that it could call the same external API methods used to look up a user or a set of users.
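A minimal sketch of this kind of separation, using typing.Protocol to describe the only thing a caller is allowed to know. UserStore, InMemoryUserStore, and LabGroup are hypothetical names for illustration; a Postgres-backed or remote store would only need to provide the same get_users method.

```python
from typing import Protocol


class UserStore(Protocol):
    """The only thing other modules know about users."""
    def get_users(self, user_ids: list) -> list: ...


class InMemoryUserStore:
    """One possible implementation; where users live stays a secret."""
    def __init__(self, users: dict):
        self._users = dict(users)

    def get_users(self, user_ids: list) -> list:
        return [self._users[uid] for uid in user_ids if uid in self._users]


class LabGroup:
    """Knows member ids and the store's interface, nothing about storage."""
    def __init__(self, name: str, member_ids: list, store: UserStore):
        self.name = name
        self._member_ids = list(member_ids)
        self._store = store

    def member_names(self) -> list:
        return [user["name"] for user in self._store.get_users(self._member_ids)]


store = InMemoryUserStore({1: {"name": "Ada"}, 2: {"name": "Grace"}})
group = LabGroup("Systems Lab", [2, 1, 99], store)
names = group.member_names()
```

Swapping the store for a different backend would not require touching LabGroup.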
Use a good idea again instead of generalizing it
At Labmeeting, we had to extract author names from PDFs. We realized that we could do decently well at extracting the names using machine learning techniques, but never perfectly. However, by indexing a gazette (a complete database of every possible PDF), we could simply make some guesses, possibly using machine learning, and then look up those guesses in the gazette to see if there was a match. It becomes a problem of efficient enumeration. We didn't generalize the idea; we used it again in a slightly different context. Each PDF has a scientific abstract with various complicated terms from biology and physics. We wanted to identify those important terms to allow further exploration. Again, some indicators could point us in the right direction, but they did not catch everything. So we crawled Wikipedia to compile a gazette of biological terms, then kept only the terms in the abstracts that appear in the gazette, excluding very frequent words like DNA. This was again highly accurate. We linked these extracted entities to Wikipedia to provide further information for the curious.
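The "guess, then look up in the gazette" pattern can be sketched in a few lines. This is an illustration under stated assumptions: the gazette is a tiny in-memory set rather than a crawled index, and GAZETTE, STOPLIST, normalize, and extract_terms are hypothetical names.

```python
def normalize(text):
    """Lowercase and collapse whitespace so lookups are consistent."""
    return " ".join(text.lower().split())


# A toy gazette; the real one would be compiled from a crawl.
GAZETTE = {normalize(t) for t in [
    "messenger RNA", "DNA", "zinc finger", "polymerase chain reaction",
]}

# Very frequent terms that are in the gazette but not worth linking.
STOPLIST = {"dna"}


def extract_terms(abstract, max_words=3):
    """Enumerate every 1..max_words phrase and keep those in the gazette."""
    words = abstract.replace(",", " ").replace(".", " ").split()
    found = []
    for n in range(max_words, 0, -1):          # longest phrases first
        for i in range(len(words) - n + 1):
            candidate = normalize(" ".join(words[i:i + n]))
            if (candidate in GAZETTE
                    and candidate not in STOPLIST
                    and candidate not in found):
                found.append(candidate)
    return found


terms = extract_terms(
    "We amplified DNA by polymerase chain reaction "
    "and measured messenger RNA levels."
)
```

The candidate generator can be crude because the gazette lookup is what decides whether a guess is accepted.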
Handle all the cases
Handle normal and worst cases separately as a rule
At Labmeeting, we analyzed PDFs to extract the title, publication date, and other information. The special case of a PDF that is encrypted or unparseable, so that no text can be extracted, went straight to a separate method. The special case could possibly be handled by a more general-purpose text extraction algorithm that happens to produce the right answer for it, but it is more straightforward to handle it separately. Anyone reading the code could see it plainly, rather than having to think through the special case inside more complicated parsing code.
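A minimal sketch of keeping the worst case on its own visible path. It assumes a PDF has already been reduced to a dict with "filename", "encrypted", and "text" keys; that representation and the function names are hypothetical, and no real PDF library is involved.

```python
def extract_metadata(pdf):
    """Dispatch: the worst case gets its own clearly separate path."""
    if pdf["encrypted"] or not pdf["text"].strip():
        return handle_unparseable(pdf)
    return parse_text(pdf)


def handle_unparseable(pdf):
    """Worst case: no text at all, so fall back to the filename."""
    return {"title": pdf["filename"], "needs_manual_entry": True}


def parse_text(pdf):
    """Normal case: treat the first non-empty line as the title."""
    lines = [line.strip() for line in pdf["text"].splitlines() if line.strip()]
    return {"title": lines[0], "needs_manual_entry": False}


normal = extract_metadata(
    {"filename": "a.pdf", "encrypted": False, "text": "\nProtein Folding\nJ. Doe\n"}
)
locked = extract_metadata({"filename": "b.pdf", "encrypted": True, "text": ""})
```

The normal-case parser never has to reason about empty input, because that case never reaches it.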
Split resources in a fixed way if in doubt
At Labmeeting, we put the database on a separate machine from the Solr search index. We had millions of search queries coming into the search system, and we didn't want those queries to slow down the database and thus the normal operation of the site. Writes take much longer than reads and matter more for logged-in users. External users of the site who used the search engine hit only the index, performing reads exclusively. This allowed us to scale up the search index independently of the database.
Use static analysis if you can
At Labmeeting, before every commit, I had a version of PyFlakes run on all of my new code. PyFlakes is a static analysis tool for Python that finds common errors that can be detected before run-time. For example, PyFlakes can find an improper number of arguments to a function call and references to variable names that are not in scope (like typos). Static analysis finds a lot of bugs that might otherwise appear in production only rarely, in edge cases. It is most useful in a dynamic language like Python, which lacks many of the safety features that a statically typed language provides.
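To make the "rare edge case" point concrete, here is a small function with a deliberate typo. The function name and its arguments are made up for this illustration. The normal path works, so casual testing passes, while the undefined name only fails when the rarely used branch runs. An undefined name like this is the kind of error a static checker such as PyFlakes reports without running the code.

```python
def format_author(first, last, middle=None):
    """Hypothetical helper with a deliberate bug in the rare branch."""
    if middle is None:
        # Normal case: exercised constantly, works fine.
        return f"{first} {last}"
    # Edge case: "midle" is a typo for "middle". Python only notices
    # when this line actually executes, which raises NameError.
    return f"{first} {midle} {last}"
```

Calling format_author("Rosalind", "Franklin") succeeds, while passing a middle name raises NameError at run-time.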
Cache answers to expensive computations
Obviously, we did this all of the time at Labmeeting. For example, when we showed an individual paper, we performed a document similarity search to find "Related Papers" that a scientist might also want to read. The vector computation and search for this are quite expensive, so we cached the results for a month. Another example: we had to open a PDF file containing a research publication, perform text extraction, and then run an information extraction step on the text to determine the title, authors, publication date, and other information. This is a difficult problem that involves searching a gazette of 30 million documents and querying the PubMed database at least once. Once this process was completed for a PDF, we saved the result to the paper metadata so that we would not have to calculate it again for that PDF. The flip side of caching is that, for quickly changing data, one needs to be careful about cache invalidation.
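A minimal sketch of a cache with an expiry time, in the spirit of caching related-paper results for a month. TTLCache and related_papers are hypothetical names, the "expensive" computation is a stand-in, and a production system would more likely use an external cache than an in-process dict.

```python
import time


class TTLCache:
    """Cache results of an expensive function for a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the computation
        value = compute(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry early when the underlying data changes."""
        self._entries.pop(key, None)


calls = []


def related_papers(paper_id):
    """Stand-in for the expensive similarity search."""
    calls.append(paper_id)
    return [f"{paper_id}-related-1", f"{paper_id}-related-2"]


THIRTY_DAYS = 30 * 24 * 60 * 60
cache = TTLCache(THIRTY_DAYS)
first = cache.get_or_compute("paper-42", related_papers)
second = cache.get_or_compute("paper-42", related_papers)  # served from cache
```

The invalidate method is the hook for the cache invalidation concern: call it whenever the cached answer may have gone stale.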
When in doubt, use brute force
We wanted to get the first version of Labmeeting finished very quickly. There are many ways to optimize a system to improve performance, but they come at the cost of decreased modularity, more assumptions, and, most directly, developer time. The first implementation of the PDF search algorithm used a brute-force linear search: it pulled the name of every scientific paper and then searched each one for the substring. This takes a few minutes to write and does not require a complicated, separately hosted index. Moreover, for the small number of papers used during testing, it ends up being much faster than doing a network query to a search index!
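The brute-force version really is only a few lines. This sketch assumes the paper titles are already loaded into a list; search_titles is a hypothetical name, and the case-insensitive match is my own assumption.

```python
def search_titles(titles, query):
    """Linear scan: check every title for the query substring."""
    needle = query.lower()
    return [title for title in titles if needle in title.lower()]


titles = [
    "Structure of DNA",
    "Zinc finger proteins in yeast",
    "DNA repair pathways",
]
hits = search_titles(titles, "dna")
```

Because the search sits behind one function, the body can later be replaced by a query to a real index without changing any caller.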
Compute in background when possible
After a user uploads a PDF to Labmeeting, a process must go through the PDF, analyze and normalize it, perhaps convert it to a standard format, extract the metadata, and deduplicate it. This can take a while, so we keep it from blocking the web server by pushing the work to a queue. When the queued job completes, it sends a message back to the user, which adds the PDF to that person's collection.
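A minimal in-process sketch of the pattern, using the standard queue and threading modules. A real deployment would more likely use a separate job queue service and worker processes; process_pdf, upload, and the collections dict are hypothetical stand-ins for the real pipeline.

```python
import queue
import threading

jobs = queue.Queue()
collections = {}                      # user -> list of processed PDFs
collections_lock = threading.Lock()


def process_pdf(filename):
    """Stand-in for the slow analyze/normalize/extract/deduplicate steps."""
    return {"filename": filename, "title": filename.rsplit(".", 1)[0]}


def worker():
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            user, filename = job
            record = process_pdf(filename)
            with collections_lock:
                collections.setdefault(user, []).append(record)
        finally:
            jobs.task_done()


def upload(user, filename):
    """What the web request does: enqueue and return immediately."""
    jobs.put((user, filename))
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
status = upload("ada", "protein-folding.pdf")
jobs.join()        # for the demo only: wait until the job is processed
jobs.put(None)     # tell the worker to stop
thread.join()
```

The request handler returns as soon as the job is queued; the user's collection is updated later by the worker.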