Engineering Building Blocks for a Startup Company


The following are important engineering building blocks for internet companies.

A friend of mine has worked on Win32 applications for about 10 years and is moving to an internet company.  I told him that these are the computer science advances that people at internet companies treat as core competencies.  They are a must-read for anyone who has not worked at an internet company.

Internet companies are quick to build because they start from these components, most of them open source, and then write “callbacks” or “plugins” that adapt them to their customers’ problems.  The components are stable because they have had years, or even a decade, of improvements from large companies such as Yahoo.

  • Rails:
    Rails is a web framework built on the Ruby language, and I consider it the
    4th generation of programming.  (Generation #1 is assembly, Generation #2 is
    C/C++, Generation #3 is Java/C#.)  Creating an internet company is all about
    speed.  That means cutting the time spent building UI to as close to zero as
    you can.  The real value is in building the content that drives the company
    (YouTube, Flickr, eBay), the algorithms that add value for the customer
    (TalentSpring, Zillow, Farecast, PayScale, PayPal), or customer service
    (Craigslist, Zappos).  Rails gives you the development efficiency that you
    need to compete.
  • Hadoop:
    This is a common framework for submitting a task so that it can be completed
    across a farm of servers.  The task is split up and sent across a series of
    servers, and each server runs it scoped to one partition of the problem.
    Hadoop is the open source implementation of MapReduce (see below).  People
    who want to write an internet crawler run Nutch as the task.  Each task is
    packaged as a JAR file (a bundle of compiled Java classes).  This is similar
    to an OS thread scheduler where each server is a thread, except that it also
    coordinates how the problem is partitioned so that the pieces can be
    completed separately.  Also see here and here.
  • Nutch:
    This is the open source web crawler component.
  • HBase:
    When you need a database and it doesn’t need to scale beyond one machine,
    MySQL/Postgres/MS SQL is fine.  When you need a database that scales across
    a series of machines, HBase is a good solution.  It is modeled on Google’s
    BigTable, the one database that drives many of their products, and it is
    comparable to Amazon’s SimpleDB.  Also see here.
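  • PageRank: Google uses this as the core algorithm that drives the quality of their search results.  It ranks each page by the links pointing to it, weighted by the rank of the pages those links come from.  We here at TalentSpring use it in powerful ways.  Also see here.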
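  • MapReduce: MapReduce is
    a core competency for internet companies (Google, Amazon, Yahoo, Windows Live
    Search, etc.).  You write a map function that processes each piece of the
    input and a reduce function that combines the results, and the framework
    spreads the work across many machines.  It is as fundamental as window
    messages are for Win32 programmers or the CLR framework is for MSFT server
    programmers.  This is also useful.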
  • Carrot2:
    Google is a keyword search engine (with PageRank helping the weightings).
    The next generation of search beyond keyword search is semantic search.
    Carrot2 is a search results clustering engine: it groups the results into
    topics instead of returning one flat list.  Also see the demo here.
  • Lucene:
    This is the open source search component.  It is a search engine for a
    local database of content.  (You need to add Nutch to create an internet
    search engine.)
  • SOLR:
    This goes along with Lucene: it is a search server built on top of it.
  • Drupal:
    This is a framework for building a website immediately.  Building internet
    companies is about not re-inventing the wheel.  Drupal gives the web site
    sign-in, user pictures, and all of the basic support.  A wiki takes a web
    page to the next level with instant editability and low cost of ownership.
    The parallel: Drupal is to a web site what a wiki is to a single page.
    Drupal gives you the toolbars, pages that you can create when in edit mode,
    and plug-in modules that give you features similar to Digg, Flickr, YouTube,
    etc.  If you create a web site, it is ridiculous to do so without seriously
    considering starting it with Drupal.  If you create an internet company,
    you should seriously consider starting it on top of Drupal.  Why re-invent
    a toolbar and navigation system, a help system, simple pages for marketing,
    sign-on support with user pictures and forgot-password emails, and search
    that works across the entire site?
  • EC2: Do you want
    a server farm of 300 computers for 2 days?  And to pay only for the time
    you use, since you only need it for a few days?  EC2 is the solution from
    Amazon.  We built an internet crawler very quickly using
    Hadoop+Nutch+EC2.  Crawling a significant part of the internet is easy
    with a hundred servers, and it only takes a day or so.  EC2 makes it very
    inexpensive.  Imagine having EC2 available, and your creative horizons
    expand.
  • And of course everyone knows about WordPress, MySQL, Postgres.
  • These are also interesting but lower priority: Google Analytics, OpenID, Mechanical Turk, oDesk, S3, OpenSocial, JOONE, GRETL, and Ning.
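
For readers coming from Win32 who have never seen MapReduce, here is a minimal single-machine sketch of the idea in Python, using the classic word-count problem. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process. On a real cluster, each document would be mapped on a different server and the framework would do the shuffling for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = []
    for document in documents:
        # Each document is one partition of the problem; a cluster would
        # hand each partition to a different server.
        pairs.extend(map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cat sat", "the dog sat"]))
```

The point of the split is that map_phase only ever looks at one document and reduce_phase only ever looks at one key, so both steps can be spread across as many servers as you have.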