The following are important engineering building blocks for internet companies.
A friend of mine has worked on Win32 applications for about ten years and is moving to an internet company. I was telling him that these are the computer-science advances that people at internet companies consider core competencies. They are a must-read for anyone who has not worked at an internet company.
Internet companies can be built quickly because they use these open-source components and then write "callbacks" or "plugins" to make the components solve their customers' problems. The components are stable because they have had years, sometimes a decade, of improvements from large companies like Yahoo, etc.
I consider this the 4th generation of programming languages. (Generation #1 is assembly, Generation #2 is C/C++, Generation #3 is Java/C#.)
- Ruby on Rails: Creating an internet company is all about speed. That means dropping the time spent building UI as close to zero as you can. The real value is in building the content that drives the company (YouTube, Flickr, eBay), the algorithms that add value for the customer (TalentSpring, Zillow, Farecast, PayScale, PayPal), or the customer service (Craigslist, Zappos). Rails gets you the development efficiency you need to compete.
- Hadoop: This is a common framework for submitting a task to be completed across a farm of servers. Each server picks up the task and runs it scoped to solve one partition of the problem. This pattern is used so often that it feeds into MapReduce (see below). People who want to write an internet crawler run Nutch as the task. Each task is a JAR file (a bundled batch of Java class files). This is similar to the OS thread scheduler, where each server is a thread, except that it helps coordinate as the problem is partitioned to be completed separately.
- Nutch: This is the open-source web crawler component.
- HBase: When you need a database and it doesn't need to scale beyond one machine, MySQL/Postgres/MS SQL is fine. When you need a database that scales across a series of machines, HBase is a good solution. It plays the role of Google's BigTable, one database driving all of their products, or Amazon's SimpleDB driving all of theirs.
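The core idea behind scaling a table across machines can be shown in a toy sketch: rows are kept sorted by key and each server owns a contiguous key range. The `ShardedTable` class and its names are illustrative only, not HBase's real design or API.

```python
# Toy sketch of a BigTable/HBase-style layout: the key space is cut
# at split_keys, and server i owns one contiguous range of keys.
import bisect

class ShardedTable:
    def __init__(self, split_keys):
        self.split_keys = sorted(split_keys)
        # one in-memory dict stands in for each "server"
        self.servers = [dict() for _ in range(len(split_keys) + 1)]

    def _server_for(self, key):
        # binary search finds which range (and thus which server) owns key
        return bisect.bisect_right(self.split_keys, key)

    def put(self, key, value):
        self.servers[self._server_for(key)][key] = value

    def get(self, key):
        return self.servers[self._server_for(key)].get(key)

table = ShardedTable(split_keys=["m"])   # keys < "m" on server 0, rest on server 1
table.put("apple", 1)
table.put("zebra", 2)
value = table.get("apple")  # routed to server 0 by the key range
```

Because ranges are contiguous, a scan over adjacent keys touches few servers; that locality is a large part of why BigTable-style stores scale.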
- PageRank: Google uses this at its core to drive the quality of its search results. We here at TalentSpring use it in powerful ways.
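The heart of PageRank fits in a short power-iteration sketch. The toy graph, damping factor, and function names below are illustrative (and the sketch assumes every page has at least one outgoing link), not Google's implementation.

```python
# Minimal power-iteration sketch of PageRank on a tiny link graph
# with no dangling pages. Purely illustrative.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # every page keeps a base amount, plus shares from its in-links
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += damping * share
        rank = new
    return rank

# "a" is linked to by both "b" and "c", so it ends up ranked highest
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
```

The intuition: a page is important if important pages link to it, and the iteration converges to that self-consistent weighting.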
- MapReduce: MapReduce is a core competency for internet companies (Google, Amazon, Yahoo, Windows Live Search, etc.). It is like window messages for Win32 programmers or the CLR framework for MSFT server programmers.
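The model can be sketched on a single machine: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step folds each group. The function names below are illustrative; the classic example is counting words across documents.

```python
# Single-machine sketch of the MapReduce model: map, shuffle, reduce.
from collections import defaultdict

def map_phase(doc):
    # emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group all emitted values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # fold one group down to a single result
    return key, sum(values)

docs = ["the quick fox", "the lazy dog"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2
```

On a real cluster the map and reduce calls run in parallel on different servers and the shuffle moves data between them; the programmer only writes the two small functions.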
- Carrot2: Google is a keyword search engine (with PageRank helping the weightings). The generation of search beyond keyword search is semantic search. Carrot2 is a search engine that uses clustering to generate its results.
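To give a flavor of result clustering, here is a toy that groups search-result snippets by their most widely shared word. This is only loosely in the spirit of Carrot2, whose real algorithms (such as Lingo) are far more sophisticated; all names here are illustrative.

```python
# Toy illustration of clustering search results by a shared term.
from collections import defaultdict

STOPWORDS = frozenset({"the", "a", "of", "in"})

def cluster_by_term(snippets):
    clusters = defaultdict(list)
    for snippet in snippets:
        words = [w.lower() for w in snippet.split() if w.lower() not in STOPWORDS]
        # label each snippet with the word shared by the most snippets
        label = max(words, key=lambda w: sum(w in s.lower() for s in snippets))
        clusters[label].append(snippet)
    return dict(clusters)

results = ["jaguar the car", "jaguar speed records", "jaguar in the jungle"]
clusters = cluster_by_term(results)
# all three results land in one "jaguar" cluster
```

Real clustering engines label groups with multi-word phrases and handle ambiguity (car vs. animal), but the core move is the same: group results by shared vocabulary instead of showing a flat ranked list.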
- Lucene: This is the open-source search component. It is a search engine for a local database of content. (You need to add Nutch, which goes along with Lucene, to create an internet search engine.)
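The data structure at the core of Lucene is the inverted index: each term maps to the set of documents containing it, so a query becomes a set intersection. The sketch below is illustrative, not Lucene's API.

```python
# Minimal inverted index: term -> set of document ids containing it.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND-query: return documents that contain every query term
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

docs = {1: "open source search", 2: "search engine", 3: "open data"}
index = build_index(docs)
matches = search(index, "open search")  # -> {1}
```

Lucene layers tokenization, ranking, and on-disk storage on top of this, but the term-to-postings map is the reason lookups stay fast as the document collection grows.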
- Drupal: This is a framework to build a website immediately. Building internet companies is about not re-inventing the wheel, and Drupal gives the web site sign-in, user pictures, and all of the basic support. A wiki takes a web page to the next level with instant editability and a low cost of ownership; the parallel is that Drupal is to a web site what a wiki is to a single page. Drupal gives you the toolbars, pages you can create in edit mode, and plug-in modules with features similar to Digg, Flickr, YouTube, etc. If you create a web site, it is ridiculous to do so without seriously considering starting it with Drupal; if you create an internet company, you should seriously consider building it on top of Drupal. Why re-invent a toolbar and navigation system, a help system, simple pages for marketing, sign-on support with user pictures and forgotten-password emails, and search that works across the entire site?
- EC2: Do you want a server farm of 300 computers for two days, and to pay only a few dollars because that is all the time you need? EC2 is the solution from Amazon. We built an internet crawler very quickly using Hadoop+Nutch+EC2. Crawling a significant part of the internet is easy with a hundred servers, and it takes only a day or so; EC2 makes it very inexpensive. Once you start thinking with EC2 available, your horizons of creativity expand.
- And of course everyone knows about WordPress, MySQL, Postgres.
- These are also interesting but lower priority: Google Analytics, OpenID, Mechanical Turk, oDesk, S3, OpenSocial, JOONE, GRETL, and Ning.