2005-03-24

Google talk by Rob Pike

Yesterday Rob Pike, from the Google Systems Lab (he's also worked at Bell labs and done a lot of work on Unix), came to do a seminar at the University of Sydney.

The main topic was the language Google uses to handle all its data processing, querying etc, though there were lots of other interesting things in there too. He started off by going through his past work in designing concurrent programming languages and then onto his work at Google.

Google uses a filesystem called GFS to store their data. I was surprised to find out that all those zillions of bytes are stored as flat files - not as SQL databases. To query etc they use a special language devised primarily by Rob Pike, though other people were of course involved. :)

Basically, to facillitate the distribution of the workload across all their machines, jobs are carried out in the following way:
- First, the information is stored using IDL (information descriptive language I think (?)) which he said was "like XML, but without all the XML stuff". So pretty much fulfilling the same purpose as XML - storing data in a structured format
- This structured information is transmitted through a number of protocols: Google's system relies a lot on message passing to accomplish its work, so the protocols are important. Basically, this is just so the program that receives the information knows what form it is in, what data/variables it can access etc.
- The program that receives the information, written in Google's own language (which I never did catch the name of unfortunately), deals with one record only. There is no way to access what was done on previous records, no inter-record operability at all
- The results from each record are emitted to the aggregator, where all the inter-record stuff happens. The aggregator collates all the data, combining it into the final output.

In order to do this, the operations on each record have to be commutative - i.e. order doesn't matter. This means there is no need for Google to store its data in an ordered way: the files are all indexed to make them easy to get, but not in any particular order. Also, the operations performed by the aggregator have to be assosciative - i.e. it doesn't matter which records' results it gets first, it can collate data in any order. So for instance, it can aggregate the data from machines X and Y, and then combine the data from machine Z, or it can do X and Z and then add in Y. This is important as then each machine is independent, so the work can be easily distributed.

The language itself is designed specifically for Google's purposes, so it has some features not present in other languages and is missing other features that are unnecessary. Some of the interesting points are:
- Data types are appropriate to Google's purposes, e.g. string, url, time, int.
- Syntax makes it easy to express operations Google would want to do, such as compiling tables indexed by time, country, etc; getting the geographical location of a query from the ip address; getting information about a search query such as country, search terms, etc.
- Special function called sawzall() that uses regex to do all sorts of funky parsing stuff, like splitting a string into tokens or lines in different formats
- No standard I/O. This means Google's sensitive data is protected since it's actually impossible to view anything the protocol doesn't allow. Even if protecte data is present in a file, if the protocol used doesn't deliver it, it can't be accessed.
- Parsing of data is primarily done by preprocessing using the protocols. At the top of the program is a line [code]proto: "some_protocol.proto"[/code] which is a bit like a #include line and tells the program what data is present to be worked with. The programmer then doesn't have to worry about parsing and can just use the data, since all this work has been hidden at a lower level.
- Data type 'undef' indicates an error - there aren't any exceptions in the language. There is a flag that says either stop the program if undef is found, or ignore it and drop that data. During debugging, they stop the program, but when they are actually running it, the data is simply dropped. Since Google has such an enormous amount of data, it is statistically insignificant to drop the few cases where there are errors.
- Syntax of the language is kind of like C, but much more concise: it is an interpretive scripting type of language.

A notable point about this language is that there are no threads, locks, sychronisations etc. Considering how much concurrent processing Google does, that's pretty amazing. However, this is achieved by building these ideas into the fundamental design of the language. The program gets run once for each record and order doesn't matter, so it doesn't have to worry about multiple threads or anything. It just emits the result to the aggregator, then gets the next record and runs again. The aggregator doesn't care which order it gets the data in because all its operations are assosciative, so it doesn't particularly have to worry either.

The work load is distributed by Google's work queue, which he didn't go into much detail about since it wasn't the topic of the talk, but essentially it just works out which jobs get done on which machine and when.

So here's an example of how powerful yet simple this language is. Rob Pike showed us a program that he'd written which was about 15 lines long. It got all Google searches for a day and built a table of them, indexed by time they occured, and lattitude and longitude they came from (using ip address to get the location). This was then plotted as a world map which varied over time. The output was a black background, with white dots of varying intensity showing how many searches there were in each location. So it was like looking at the world at night time, with the lights indicating use of Google. As the hour of the day changed, you could see night time going across the world by the decrease in activity. It took 9 minutes to generate that map from all Google's searches for 24 hours.

One other anecdote I thought was interesting, which just illustrates the sheer scale of Google's operation: As in any good system, there is redundancy to protect their data. Once a rack holding about 60 TB, iirc, got completely wiped, and they didn't lost any data at all. Not a single byte.

We also got to talk to him after the seminar and he explained GFS a bit more and told us about Pagerank and how Google decides which results are most relevant. All in all, definately a worthwhile lecture to attend. :)