Brian's Waste of Time

Thu, 11 Oct 2007

Really Big Apps (Initial Braindump)

Martin and I had a fantastic discussion earlier today -- well, a week or so ago by now (it was "earlier today" when I started writing this; holy crap, having a two month old on hand stops the blogging) -- about really big applications. It was fantastic partly because we hashed out some good ideas (which I want to start sharing), but also because it reinforced something Pat Helland has been saying, and which I strongly agree with: we lack a good language for talking about really big apps.

There has been a bit of a blogging theme going on around this. The aforementioned Pat talks about it, Werner Vogels certainly talks a lot about it, the High Scalability blog has been posting lots of profiles, there was lots of talk around the Seattle Conference on Scalability, and most recently there has been much chatter about Amazon's Dynamo.

Once you get past the operational issues (which are extremely non-trivial), the scaling bottleneck for big apps generally comes down to handling data. At JavaOne this past summer, Diego narrowed it specifically to handling update rates, and I agree in principle, but I want to expand that into enough detail to be meaningful. Of the folks with really large, industrial datasets, not many really talk about how they handle it.
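To make the update-rate point a bit more concrete, here is a toy sketch -- mine, not anything Martin, Diego, or the Dynamo paper actually spells out this way -- of the consistent hashing trick Dynamo-style systems use to spread the write load for a keyspace across nodes. All the names are made up, the hash is deliberately cheap, and a real system would layer replication and failure handling on top:

    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // Toy consistent-hash ring: each node is hashed onto the ring many
    // times (virtual nodes), and a key's updates go to the first node at
    // or after the key's own hash. Adding a node remaps only a slice of
    // the keyspace instead of rehashing everything.
    public class Ring {
        private final SortedMap<Long, String> ring = new TreeMap<Long, String>();
        private final int replicas;

        public Ring(int replicas) {
            this.replicas = replicas;
        }

        private long hash(String s) {
            CRC32 crc = new CRC32(); // weak but cheap; fine for a sketch
            crc.update(s.getBytes());
            return crc.getValue();
        }

        public void addNode(String node) {
            for (int i = 0; i < replicas; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        // Route a key to the first node clockwise from its hash,
        // wrapping back to the start of the ring if we fall off the end.
        public String nodeFor(String key) {
            if (ring.isEmpty()) throw new IllegalStateException("no nodes");
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        public static void main(String[] args) {
            Ring ring = new Ring(100); // 100 virtual nodes apiece smooths the spread
            ring.addNode("node-a");
            ring.addNode("node-b");
            ring.addNode("node-c");
            System.out.println("user:1234 -> " + ring.nodeFor("user:1234"));
        }
    }

The point is that update handling becomes a routing problem: no single box sees the whole write stream, and growing the cluster moves only a fraction of the keys.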

Google has done the most talking about how they handle large contiguous datasets -- first with GFS, MapReduce, and Sawzall, then BigTable and Chubby. This has led lots of folks to think about these problems in the same terms -- not a bad thing, but not the only way. Hadoop is basically reverse engineering this stack, and I know of a couple of other private implementations modeling the same thing. It is kind of funny, actually. I am super-jazzed to have Amazon talking about different angles on the big-data stuff as well :-)
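For anyone who hasn't stared at this stuff yet, the canonical word count gives the flavor of the programming model everyone is converging on. This is a rough cut against the Hadoop mapred API, trimmed down from the standard example that ships with it -- treat it as a sketch, not gospel:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    out.collect(word, ONE);
                }
            }
        }

        // Reduce phase: the framework has already grouped values by word,
        // so all that is left is summing the counts.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                out.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class); // pre-sum on the map side
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

The interesting part is everything the framework does between those two methods -- partitioning, sorting, shuffling, and retrying over GFS/HDFS-style storage -- which is exactly the machinery the Google papers describe.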

Going to post this before another week sneaks by. There are a lot more thoughts here, but not enough time to brain dump them all. Hopefully I'll get a chance soon.
