Narayanan Shivakumar on Google's Software Infrastructure

The following notes are part 2 of 3 from Narayanan 'Shiva' Shivakumar's presentation at the July 27th VanHPC meeting. These notes cover Shivakumar's discussion of the Google software infrastructure.

Infrastructure consists of:

  • Google File System (GFS)
  • MapReduce
  • BigTable

Google File System (GFS)

  • Why another distributed file system? - We had to have reliability over thousands of nodes
  • We control libraries and everything
  • Master system is responsible for a variety of different chunks
  • Chunk servers manage files
  • Master handles meta data
  • Master resolves the chunk location
  • 50+ clusters
  • Redundancies, chunks on different machines
  • Handles variety of clients
  • Running on petabyte file systems
  • Master has to manage how chunks are being transferred, watching for bottlenecks and maintaining redundancy
  • Since there are multiple chunks on multiple servers serving multiple requests, requests need to be distributed appropriately across chunk servers
  • Question: Is data distribution and load balancing automated?
    Answer: Correct. And chunk servers also do other computation
  • Question: Do you ever lose data?
    Answer: Great question. We've lost one 64 MB chunk. Meanwhile machines have been dying and traffic growing
  • Question: How did you know you lost exactly 1 chunk?
    Answer: We had a disaster a couple years back. Enough logic wasn't built in to automatically recover but people had enough knowledge to get it back
  • Question: It seems you'll need to assign value to chunks because of worthless data?
    Answer: Do we have an idea of priority for chunks? We have replication weights
  • Question: Is it data type specific?
    Answer: No, it's just a blob. Google File System is data type neutral
  • Question: How do you check integrity?
    Answer: We have error checking, some of it happens at the client end. Chunk master will keep track of how often things happen and route accordingly. Clearly not everything was thought through early on.
  • Question: Are chunk masters stronger than chunk servers?
    Answer: I expect those machines to be more reliable than the others, but the same failure logic applies. Chunk servers typically have more disks.
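
The division of labour described above (master holds metadata and resolves chunk locations, chunk servers hold the replicas, requests are spread across them) can be sketched roughly as follows. This is only an illustration, not Google's implementation: the class and function names, the file name, the three-replica layout and the least-loaded selection policy are all assumptions. Only the 64 MB chunk size comes from the talk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned in the talk


class Master:
    """Toy master: holds metadata only, never file contents."""

    def __init__(self):
        # (filename, chunk index) -> names of chunk servers holding a replica
        self.locations = {}

    def add_chunk(self, filename, index, servers):
        self.locations[(filename, index)] = list(servers)

    def locate(self, filename, offset):
        """Resolve a byte offset to (chunk index, replica servers)."""
        index = offset // CHUNK_SIZE
        return index, self.locations[(filename, index)]


def pick_replica(servers, load):
    """Choose the least-loaded replica (a hypothetical balancing policy)."""
    return min(servers, key=lambda s: load.get(s, 0))


master = Master()
# Hypothetical file with two chunks, each replicated on three machines.
master.add_chunk("crawl.dat", 0, ["cs1", "cs2", "cs3"])
master.add_chunk("crawl.dat", 1, ["cs2", "cs4", "cs5"])

# A client asks the master where byte offset 70 MB lives, then reads
# from whichever replica currently has the fewest outstanding requests.
index, replicas = master.locate("crawl.dat", 70 * 1024 * 1024)
chosen = pick_replica(replicas, {"cs2": 9, "cs4": 2, "cs5": 5})
print(index, chosen)
```

In this sketch the master is only consulted for the lookup; the data itself would be read directly from the chosen chunk server.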

MapReduce

  • A programming model and library to simplify large-scale computations on large clusters
  • Kind of comes out of functional programming
  • Mapping step
  • Reduction step
  • Word frequencies example given
  • Graph shown of execution on thousands of machines (Similar slides can be found here: http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)
  • This makes it easy to build applications without having to worry about underlying
  • Need to locate task close to data
  • Pipelining, have more tasks per machine
  • MR_grep 1 terabyte in 100 seconds (using 1800 machines)
  • Search Google on MapReduce for paper
  • Gave example of 3000 lines of C++ code down to 100 lines
  • Question: Could MapReduce be decoupled from GFS?
    Answer: Pretty hard. We publish techniques so people can create open source, as our competitors are.
  • Question: Are servers dedicated to a specific task?
    Answer: Certain tasks are natural choices for machines (disks), otherwise no

BigTable

  • For semi-structured data
  • Like a really big hash table
  • Scale is too large for commercial databases; they didn't scale
  • Cost for commercial database isn't nearly as interesting
  • Since we control the entire layer we know about the application we need to support
  • Multi-layer map
  • Fault tolerant, terabyte in memory, petabyte on disk
  • Chart shown on data model
  • Distributed multi-dimensional sparse map
  • Good match for most of our applications
  • We need versions, so values are kept over time, by timestamp
  • Locality of versions desirable for deltas
  • The key is a string, the value a blob
  • Storage up to application
  • This lets you cat the entire web
  • Rows are broken into tablets at row boundaries
  • Divided among tablet servers
  • Beneath tablet servers is GFS
  • 100 machines each pick up 1 tablet from a failed machine
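
As a rough sketch of the ideas above (a sparse, versioned map from string keys to blobs, split into tablets at row boundaries, with a failed server's tablets spread over the survivors), one could write something like the following. All names here are hypothetical, the column dimension is an assumption about what "multi-dimensional" refers to, and the round-robin reassignment is just one possible policy.

```python
import bisect


class SparseTable:
    """Toy sparse map: (row, column, timestamp) -> blob."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by time
        self.cells = {}

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        versions = self.cells.get((row, column), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None


def tablet_for(row, split_keys):
    """Index of the tablet whose row range contains this row key."""
    return bisect.bisect_right(split_keys, row)


def reassign(failed_tablets, live_servers):
    """Spread a failed server's tablets across the live servers in turn."""
    return {t: live_servers[i % len(live_servers)]
            for i, t in enumerate(failed_tablets)}


table = SparseTable()
table.put("example.com/index", "contents", 1, b"<html>v1</html>")
table.put("example.com/index", "contents", 2, b"<html>v2</html>")
latest = table.get("example.com/index", "contents")
older = table.get("example.com/index", "contents", timestamp=1)

splits = ["f", "p"]  # hypothetical row boundaries giving three tablets
plan = reassign(["t0", "t1", "t2"], ["s1", "s2"])
print(latest, older, tablet_for("google.com", splits), plan)
```

Because tablets cover contiguous row ranges, a scan over sorted row keys touches one tablet after another, and recovery after a failure is parallel because many servers each take a small share.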