Engineering Building Blocks for a Startup Company
The following are important engineering building blocks for internet companies.
I have a friend who has worked on Win32 applications for about 10 years and is moving to an internet company. I told him that these are the computer science advances that people at internet companies consider core competencies. They are a must-read for anyone who has not worked at an internet company.
Internet companies are quick to build because they use these open source components and then write "callbacks" or "plugins" that make the components solve their customers' problems. The components are stable because they have had years, sometimes a decade, of improvements from large companies such as Yahoo.
- Rails: I consider Ruby on Rails the 4th generation of programming languages. (Generation #1 is assembly, generation #2 is C/C++, generation #3 is Java/C#.) Creating an internet company is all about speed. That means driving the time spent building UI as close to zero as you can. The real value is in the content that drives the company (YouTube, Flickr, eBay), the algorithms that add value for the customer (TalentSpring, Zillow, Farecast, PayScale, PayPal), or customer service (Craigslist, Zappos). Rails gives you the development efficiency you need to compete.
- Hadoop: This is a common framework for submitting a job and having it completed across a farm of servers. The job is split into tasks that are handed out to the servers, and each server runs its task against one partition of the problem. Hadoop is the open source implementation of MapReduce (see below). People who want to write an internet crawler run Nutch jobs on it. Each job is packaged as a JAR file (a bundle of compiled Java classes). This is similar to an OS thread scheduler where each server is a thread, except that Hadoop also coordinates how the problem is partitioned so the pieces can be completed separately. Also see here and here.
- Nutch: This is the open source web crawler component.
- HBase: When you need a database and it doesn’t need to scale beyond one machine, MySQL/Postgres/MS SQL is fine. When you need a database that scales across many machines, HBase is a good solution. It is modeled on Google’s BigTable, the kind of single database that can sit behind many products, and it is comparable to Amazon’s SimpleDB. Also see here.
- PageRank: Google uses this at its core to drive the quality of its search results. We here at TalentSpring use it in powerful ways. Also see here.
- MapReduce: MapReduce is a core competency for internet companies (Google, Amazon, Yahoo, Windows Live Search, etc.). It is as fundamental to them as window messages are to Win32 programmers or the CLR is to Microsoft server programmers. This is also useful.
- Carrot2: Google is a keyword search engine (with PageRank helping with the weightings). The generation of search beyond keyword search is semantic search. Carrot2 is a search results clustering engine: it groups results into topics instead of returning one flat list. Also see the demo here.
- Lucene: This is the open source search component. It is a search engine for a local collection of content. (You need to add Nutch to create an internet search engine.)
- Solr: This is a search server built on top of Lucene.
- Drupal: This is a framework for building a website immediately. Building internet companies is about not reinventing the wheel. Drupal gives the site sign-in, user pictures, and all of the basic support. A wiki takes a web page to the next level with instant editability and low cost of ownership; Drupal is to a website what a wiki is to a single page. It gives you the toolbars, pages that you can create while in edit mode, and plug-in modules with features similar to Digg, Flickr, YouTube, etc. If you create a website, it is ridiculous not to seriously consider starting with Drupal, and the same goes if you create an internet company. Why reinvent a toolbar and navigation system, a help system, simple pages for marketing, sign-on support with user pictures and forgot-password emails, and search that works across the entire site?
- EC2: Do you want a server farm of 300 computers for 2 days, and to pay only for the time you actually use? EC2 is the solution from Amazon. We built an internet crawler very quickly using Hadoop+Nutch+EC2. Crawling a significant part of the internet is easy with a hundred servers, and it takes only a day or so. EC2 makes it very inexpensive. Once you start planning with EC2 available, your horizons of creativity expand.
- And of course everyone knows about WordPress, MySQL, and Postgres.
- These are also interesting but lower priority: Google Analytics, OpenID, Mechanical Turk, oDesk, S3, OpenSocial, JOONE, GRETL, and Ning.
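
To make the Hadoop and MapReduce items above more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the example problem. It is plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for illustration. In a real cluster each document would be a partition handled by a separate server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for one partition of the problem.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, three tiny "partitions".
counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

The point of the split is that `map_phase` and `reduce_phase` never need to see the whole data set, which is what lets a framework spread them across many servers.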
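
For the PageRank item above, the following is a rough sketch of the textbook power-iteration form of the algorithm on a tiny link graph. It is not Google's or TalentSpring's implementation; the damping factor of 0.85, the iteration count, the `pagerank` function name, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score to everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked to by every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Pages that many other pages link to end up with the highest score, and the scores always sum to 1, so they can be read as relative weights.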
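
For the Lucene item above, search libraries of this kind are built around an inverted index: a map from each term to the documents that contain it. The sketch below shows that idea in plain Python. It does not use Lucene's API, and the `build_index` and `search` names and the three sample documents are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical local collection of content.
docs = {
    1: "Hadoop runs MapReduce jobs",
    2: "Nutch crawls the web with Hadoop",
    3: "Lucene indexes local content",
}
index = build_index(docs)
hits = search(index, "hadoop nutch")
print(hits)
```

A real search library adds tokenizing, stemming, and relevance scoring on top of this, but the lookup structure is the same basic idea.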



Yep! Internet websites are a dime a dozen but only a few are worth checking out...
Posted by: K.C. | October 24, 2008 at 07:22 AM
These are very valuable and vital tools for an architect.
Posted by: Rafiq | November 06, 2008 at 10:33 AM
well I thought I would leave my first comment. I don't know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.
Posted by: jeff paul | December 19, 2008 at 10:19 PM
Well, I must say the things you mentioned above were really informative, and we should take care of these things when opening a company.
Posted by: john beck foreclosure | January 21, 2009 at 11:00 PM
I started a course on computer programming a while back and just couldn't handle all the craziness that came along with it...LOL but I do understand some of these things that you have mentioned (cause I still tinker with it here and there) and think you are definitely right on.
Posted by: Joel "Cheaters Guide" Gutierrez | January 24, 2009 at 10:43 AM
Wow, great information about engineering building blocks or infrastructure for internet companies. Will learn these things while I am looking for a job.
Posted by: Job Seeker | February 14, 2009 at 03:27 AM
Interesting blog, feel free to check out mine.
-Scott
Posted by: Scott | February 27, 2009 at 01:28 AM
hey how you doing! Nice posting. I enjoyed reading it. I too run a blog on internet and traffic building, and checking out what others may have written.
Posted by: Traffic Building | March 26, 2009 at 09:30 PM