Saturday, November 12, 2011

Q: How much data is Big Data?

A: More than a single human being can type in a lifetime

(PS: This post is like a box of Cracker Jacks. There's a prize at the end.)

Recently I've developed more than a passing interest in working with Big Data. I made the obligatory pilgrimage to Hadoopsville: got a little bit of HDFS and MapReduce under my belt, taught a few intro classes. I even did an interview with Doug Cutting.

Then, about a month ago a colleague at the Day Job suggested that I take a look at MongoDB as a data store for a project I was imagining. So, I did.

As I began to piddle around the innards and work with the on-demand, distributed features of the product, I found myself in need of a lot of structured data to do some advanced piddling: about 45 gigs' worth.

(I wanted to perform operations that exceeded the memory capacity of any one machine, or any two machines for that matter.)

As I said before, I'm interested in Big Data. So allow me a moment to share with you just how big "big" is. The sort of structured data I figured I'd use is a typical log file entry, about 100 characters long, which looks like this:

20204,2011-11-11 19:19:07.123,Leslee,Falleti,777-64-9738,625 Orange Terr.,Suite 87,MT,59430,16413,A

The structure of the line above is:

unique_id,datetime,firstname,lastname,ssn,address_1,address_2,state,zip,a_random_number,a_random_character
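To make that concrete, here's a minimal sketch of how one such line could be generated in Java. The class and the tiny name, street and state lists are hypothetical stand-ins, not the project's actual code (that comes later in the post):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

// A sketch of a generator for one log entry in the format above. The
// name, street and state lists here are tiny hypothetical stand-ins;
// a real randomizer would draw from much larger predefined lists.
public class LogEntrySketch {

    private static final String[] FIRST_NAMES = { "Leslee", "Pat", "Morgan" };
    private static final String[] LAST_NAMES = { "Falleti", "Smith", "Ng" };
    private static final String[] STATES = { "MT", "CA", "NY" };
    private static final Random RANDOM = new Random();

    static String makeEntry(long uniqueId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        String ssn = String.format("%03d-%02d-%04d",
                RANDOM.nextInt(1000), RANDOM.nextInt(100), RANDOM.nextInt(10000));
        StringBuilder sb = new StringBuilder();
        sb.append(uniqueId).append(',');
        sb.append(fmt.format(new Date())).append(',');
        sb.append(FIRST_NAMES[RANDOM.nextInt(FIRST_NAMES.length)]).append(',');
        sb.append(LAST_NAMES[RANDOM.nextInt(LAST_NAMES.length)]).append(',');
        sb.append(ssn).append(',');
        sb.append(100 + RANDOM.nextInt(900)).append(" Orange Terr.").append(',');
        sb.append("Suite ").append(1 + RANDOM.nextInt(99)).append(',');
        sb.append(STATES[RANDOM.nextInt(STATES.length)]).append(',');
        sb.append(String.format("%05d", RANDOM.nextInt(100000))).append(',');
        sb.append(RANDOM.nextInt(100000)).append(',');
        sb.append((char) ('A' + RANDOM.nextInt(26)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints something like the sample line above.
        System.out.println(makeEntry(20204));
    }
}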

I figure a structured log entry like the above will provide a lot of flexibility: I can do some MongoDB indexing, run some MapReduce analysis and do some benchmark comparisons between Hadoop and MongoDB.

So let's take a look at what 45 gigs of log data looks like. Say I want to make a ~1 MB text file full of unique log entries. At ~100 characters a line, such a file will contain ~10,000 lines of text.

45 gigs translates into forty-five thousand 1 MB files.
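The arithmetic, for the record:

~100 bytes a line x ~10,000 lines = ~1 MB per file
45 gigs / ~1 MB per file = 45,000 files
45,000 files x ~10,000 lines per file = ~450,000,000 lines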

Now, this is not the type of data that you keep around on a hard drive next to the pictures of your dog and grandkids. In fact, getting your hands on this sort of data is kinda hard.

So, I figured I'd make some.

I could type out 450,000,000 lines of structured data at a rate of one line every thirty seconds, which translates to 120 lines an hour. Thus, I could get out 2,400 lines in a day, provided I forsook sleep and went for twenty hours without a mistake. But since I'm the world's worst typist, we're more likely looking at a lot less than the optimal. Anyway, typing as required, I could do ~12,000 lines during a work week, which means I could get my 45 gigs of data in something like 37,500 work weeks, which translates into ~720 years.
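For the record, the back-of-the-envelope math:

1 line / 30 seconds = 120 lines an hour
120 lines an hour x 20 hours = 2,400 lines a day
2,400 lines a day x 5 days = 12,000 lines a work week
450,000,000 lines / 12,000 lines a week = 37,500 weeks, or ~720 years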

Not an option.

Or, I could create a website and get a grant from the Federal Government as a jobs initiative to get everybody in the US to go to the site and enter some data according to the structure I need. Actually, upon reflection, that works out to about 150 bytes of data from every man, woman and child in the US--a line and a half apiece.

Again, not an option. Government cutbacks are rampant, and few innocent bystanders have the patience to enter even one structured line of data, let alone a line and a half.

So, I decided to automate.

I made a cute little project in Java under Maven. The program uses an address and name randomizer I made a few years ago. You can download a zip file containing the project--data generator, randomizer and all--here. (The zip also includes a Post Office module that is more relevant to the article describing the randomizer.)

One of the modules, DataGen, kicks off files of unique log entries. You can set it to kick off a variable number of files of a variable size: 10 files of 100 bytes of unique log entries each, or 1,000 files of 1,000 bytes each.

In terms of rate, DataGen presently kicks off a ~1 MB file of unique log entries (random names, addresses and US zip codes drawn from predefined lists loaded into memory) in about 2 minutes.
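To give you the idea, here's a minimal sketch of that kind of generator loop, reusing the makeEntry() sketch from earlier. The class name, method signature and file-naming scheme are hypothetical; the real DataGen's API may differ:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// A sketch of the DataGen idea: write fileCount files of roughly
// targetBytes each, filled with unique log entries. The names here
// are hypothetical; see the downloadable project for the real code.
public class DataGenSketch {

    public static void generate(int fileCount, long targetBytes)
            throws IOException {
        long uniqueId = 0;
        for (int f = 0; f < fileCount; f++) {
            BufferedWriter out = new BufferedWriter(
                    new FileWriter("datagen-" + f + ".log"));
            long written = 0;
            while (written < targetBytes) {
                // makeEntry() is the line generator sketched earlier.
                String line = LogEntrySketch.makeEntry(uniqueId++);
                out.write(line);
                out.newLine();
                written += line.length() + 1; // +1 for the newline
            }
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // For example: 10 files of ~1 MB of unique log entries each.
        generate(10, 1000000L);
    }
}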

I configured DataGen to kick off 45,000 ~1 MB files of log entries and started the generation last night on my Ubuntu laptop. At ~2 minutes a file, that's 90,000 minutes of generation time, or about 62 days. So in about 2 months I'll have the 45 gigs I need.

Sorta makes plausible the notion that if you chained a chimp to a typewriter for eternity, eventually the beast would produce Hamlet.