Drinking Science from the Nozzle of a Firehose

Yesterday, I heard a presentation on the Large Synoptic Survey Telescope, a planned telescope that will take a wide-angle photo of the sky, read the data out, and then move on to the next patch of sky. This is in marked contrast to the carefully chosen targets and relatively long observing times that are more traditional.

The LSST will generate 30 terabytes of data nightly and build up a 70 petabyte catalog (that's a spicy meatball -- 70,000,000 gigabytes). They expect to pick up 1,000,000 transient events per night. You could set every astronomer and student in the world to the task, and they still couldn't manually keep up with the data flow.
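
Just to make the scale concrete, here's a quick back-of-the-envelope calculation; the ten-hour observing night is my own rough assumption, not a figure from the talk.

```python
# Rough numbers from the talk; the 10-hour night is my own assumption.
nightly_data_tb = 30
catalog_pb = 70
transients_per_night = 1_000_000
observing_hours = 10  # assumed

# How many nights of raw data it takes to reach the catalog size,
# and how fast the transient alerts arrive while the telescope is observing.
nights_to_fill_catalog = catalog_pb * 1000 / nightly_data_tb
events_per_second = transients_per_night / (observing_hours * 3600)

print(f"~{nights_to_fill_catalog:,.0f} nights of data to reach 70 PB")
print(f"~{events_per_second:.0f} transient events per second, all night long")
```

That works out to a couple of thousand nights of data and roughly thirty transient events every second the telescope is on the sky.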

To make things even more challenging, the game-changing science is going to come from the most unusual stuff -- the faintest stuff, the most short-lived stuff, the rarest stuff -- since if it were bright, long-lived, and common, someone might have noticed it already.

The solution will call for wonderfully powerful parallelized machine learning systems. Watson ain't in it, although it's interesting to think how Watson-like analysis of publications might suggest starting points for partitioning the data.
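
To give a flavor of what even the most basic "parallelized machine learning" looks like, here's a minimal sketch -- the feature names, the synthetic data, and the stock random forest are all my own placeholders, not anything the LSST project has actually committed to.

```python
# A toy parallelized classifier for transient alerts.
# Everything here is hypothetical: real pipelines would use measured
# light-curve features and far more sophisticated models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Pretend each alert arrives as a small feature vector:
# brightness change, duration, color, position (all synthetic here).
n_alerts = 50_000
features = rng.normal(size=(n_alerts, 4))
labels = rng.integers(0, 3, size=n_alerts)  # e.g. supernova / variable star / artifact

# n_jobs=-1 spreads the work across every available core,
# both when building the trees and when scoring new alerts.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(features, labels)

# Score a night's worth of fresh alerts in parallel.
new_alerts = rng.normal(size=(1_000_000, 4))
predictions = clf.predict(new_alerts)
print(predictions[:10])
```

The real systems will be far fancier, but the shape of the problem is the same: turn each event into features, classify it automatically, and only send the interesting ones to a human.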

Particle physics has already faced this enormous onslaught of data, so it didn't surprise me to see Fermilab in the LSST subscriber list. Nor, when I heard of the data volumes and the computational challenges, was it surprising that another subscriber was Google.