You can think of Google as the research lab for the internet.
Every so often, the company releases a research paper describing one of the sweeping software platforms that help drive its online empire, and a few years later, this paper will spawn an open source software project that seeks to share Google’s creation with the rest of the world.
Papers describing the Google File System and Google MapReduce spawned Hadoop, an open source platform that lets you spread data across thousands of dirt-cheap computer servers and then crunch it into something useful. Google BigTable gave rise to an army of “NoSQL” databases that can juggle unusually large amounts of information. Google Pregel delivered multiple “graph” databases that can map the many online relationships between people and things.
Some have complained that the outside world takes far too long in rebuilding these groundbreaking Google creations. And that includes Mike Olson, the CEO of Cloudera, a Silicon Valley startup that brought Hadoop to the business world. But this time is different.
On Wednesday, Cloudera uncloaked a software platform known as Impala. Under development for the past two years, Impala is a means of instantly analyzing the massive amounts of data stored in Hadoop, and it’s based on a sweeping Google database known as F1. Google only revealed F1 this past May, with a presentation delivered at a conference in Arizona, and it has yet to release a full paper describing the technology. Two years ago, Cloudera hired away one of the main Google engineers behind the project, a database guru named Marcel Kornacker.
Hadoop is now widely used across the web, driving such big-name operations as Facebook, Yahoo, and Twitter, and it’s spreading into traditional businesses as well. According to market research outfit IDC, it will fuel an $813 million software market by the year 2016.
It was originally designed as a “batch processing” platform. You give it a data-crunching task, and it takes several minutes — or several hours — to complete that task. It can build you, say, an index for the entire internet. With open source tools such as Hive, you can also analyze Hadoop data in much the same way you would query a traditional database using the common Structured Query Language, or SQL. If you’ve collected data describing a collection of digital books, for instance, you could run a query asking for a list of authors. But this too takes time.
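To make the book example concrete, here is a minimal sketch of that kind of query. It runs against an in-memory SQLite database from Python's standard library as a stand-in, not against Hive or Hadoop, and the `books` table, its columns, and the `list_authors` helper are all hypothetical names chosen for illustration. The point is only the shape of the SQL, which tools like Hive accept in broadly similar form.

```python
import sqlite3


def list_authors(rows):
    """Return the distinct author names, sorted, from (title, author) rows.

    SQLite is a stand-in here; the table and column names are assumptions.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
        conn.executemany(
            "INSERT INTO books (title, author) VALUES (?, ?)", rows
        )
        # The query itself: ask for each author once, in alphabetical order.
        cursor = conn.execute(
            "SELECT DISTINCT author FROM books ORDER BY author"
        )
        return [author for (author,) in cursor]
    finally:
        conn.close()


# Hypothetical sample data describing a small collection of books.
sample_books = [
    ("Moby-Dick", "Herman Melville"),
    ("Emma", "Jane Austen"),
    ("Persuasion", "Jane Austen"),
]

print(list_authors(sample_books))
```

On a laptop-sized table this returns at once. The delay described above comes from running the same kind of query as a batch job over data spread across many servers.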
Impala lets you query the same data “in real-time” — i.e., in seconds. According to Cloudera, it’s 10 times faster than a tool like Hive.
Cloudera is now four years old. But Jeff Hammerbacher — who helped found Cloudera after overseeing the rise of Hadoop at Facebook — refers to Impala as the company’s “version 1.0.” In other words, it’s the beginning. “We’re getting to the point,” he says, “where we’re building what I wanted to build when we started the company.”
Google’s F1 is a massive relational database management system, or RDBMS, that helps run the company’s online ad system. It sits atop Spanner, a much ballyhooed Google creation that lets the company store information across its worldwide network of data centers. “Spanner stores records and data,” says Kornacker. “F1 gives you access to those records. It runs queries. And it correlates them.”
At Google, Marcel Kornacker oversaw the development of the F1 “query engine,” the system that lets the company instantly analyze information stored in the database. Then Hammerbacher brought him to Cloudera, and he basically rebuilt this query engine for use with Hadoop and HBase, the NoSQL database built to work in tandem with Hadoop.
Kornacker left Google, he tells us, in large part because he wanted to build something everyone could use. “I wanted to work on something similar to what I had been doing,” he says, “but in a more publicly accessible context.” With today’s announcement, this wish is a reality. And in keeping with its mission, Cloudera has open sourced the code behind Impala. The company makes its money by providing services and various proprietary tools to businesses that use Hadoop and its sister platforms.
According to Cloudera, Impala is already being used by various pilot customers, including the online travel operation Expedia. Expedia did not immediately respond to our questions about the tool.
Cloudera isn’t the only one bringing real-time queries to Hadoop. MapR — a conspicuous competitor to Cloudera — is currently building an open source version of Dremel, another sweeping Google platform. Google released a paper describing Dremel in 2010, saying it could query multiple petabytes of data — aka millions of gigabytes — in a matter of seconds.
Kornacker says that F1 and Dremel are two very different animals. Whereas Dremel was designed primarily for instant data analysis, F1 also handles online transaction processing, or OLTP — meaning it shuttles data to and from live applications. It was built for more than just speedy queries.
But speedy queries are part of it, and this has spawned Impala. Two years ago, when Google released its Dremel paper, Jeff Hammerbacher told the web that Hadoop would one day offer a real-time query engine along similar lines. And now it does.