{ by david linsin }

November 17, 2008

Let Pig Crunch Your Data

A couple of days ago I read about a new language called Pig Latin, which was developed by Yahoo and contributed to Apache. The Pig Wiki describes it as follows:

  • A language for processing data, called Pig Latin.
  • A set of evaluation mechanisms for evaluating a Pig Latin program. Current evaluation mechanisms include (a) local evaluation in a single JVM, (b) evaluation by translation into one or more Map-Reduce jobs, executed using Hadoop.

    Since I had just downloaded my confusing monthly Amazon S3 usage report, I thought I'd give Pig a try and see how well it could shed some light on that data. My first step was to download the tutorial, which contains, among other things, a JAR file called pig.jar, which is essentially all you need to run Pig scripts.

    Pig Latin lets you query data, e.g. load it from a file, filter, group or join it, and eventually apply some function to it. It reminds me of SQL, in that you query sets of data and apply functions to them. Pig Latin knows literals, tuples, bags (which are basically collections of tuples) and maps. You use statements to create so-called relations, which have aliases that you can refer to in subsequent statements.
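As a rough analogy only (Pig's types are not Python's), the data model can be pictured with Python built-ins. The values are borrowed from my S3 usage report and shortened; the variable names are made up for illustration.

```python
# Rough Python analogy for Pig Latin's data model (an illustration, not Pig's real types).

# A tuple is an ordered set of fields.
row = ("AmazonS3", "GetObject", "dlinsin-archives", 272)

# A bag is a collection of tuples; duplicates are allowed.
bag = [
    ("AmazonS3", "GetObject", "dlinsin-archives", 272),
    ("AmazonS3", "GetObject", "dlinsin-downloads", 269),
]

# A map associates keys with values.
usage_by_bucket = {"dlinsin-archives": 272, "dlinsin-downloads": 269}

# A relation is a bag with an alias that later statements can refer to.
clean = [t for t in bag if t[0] == "AmazonS3"]
print(len(clean))
```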

    For my S3 usage report I came up with a little Pig Latin script that aggregates the usage data in a more useful manner:

    raw = LOAD 'report.csv' USING PigStorage(',') AS (Service, Operation, UsageType, Resource, StartTime, EndTime, UsageValue);
    clean = FILTER raw BY Service eq 'AmazonS3';
    ops = GROUP clean BY (Operation, Resource);
    op_counts = FOREACH ops GENERATE group, SUM(clean.UsageValue)/1000000;
    dump op_counts;

    The first line loads the data from the file 'report.csv', which is, as the name implies, comma-separated. Each column of that file gets an identifier, so that I can refer to it in subsequent statements. Since S3 usage reports come with a header line that we don't want to analyze, we get rid of it with the FILTER statement and its condition, as shown in the second line. Here is an excerpt of the report:

    Service, Operation, UsageType, Resource, StartTime, EndTime, UsageValue
    AmazonS3,GetObject,EU-DataTransfer-Out-Bytes,dlinsin-archives,11/01/08 00:00:00,11/01/08 01:00:00,272
    AmazonS3,GetObject,EU-DataTransfer-Out-Bytes,dlinsin-downloads,11/01/08 01:00:00,11/01/08 02:00:00,269
    AmazonS3,GetObject,EU-Requests-Tier2,dlinsin-downloads,11/01/08 03:00:00,11/01/08 04:00:00,37
    AmazonS3,GetObject,EU-DataTransfer-Out-Bytes,dlinsin-downloads,11/01/08 03:00:00,11/01/08 04:00:00,116276888
    AmazonS3,GetObject,EU-Requests-Tier2,dlinsin-downloads,11/01/08 05:00:00,11/01/08 06:00:00,58
    AmazonS3,GetObject,EU-DataTransfer-Out-Bytes,dlinsin-downloads,11/01/08 05:00:00,11/01/08 06:00:00,179015922
    AmazonS3,StandardStorage,StorageObjectCount,dlinsin-archives,11/01/08 07:00:00,11/01/08 08:00:00,137
    AmazonS3,StandardStorage,EU-TimedStorage-ByteHrs,dlinsin-downloads,11/01/08 07:00:00,11/01/08 08:00:00,9809386248
    AmazonS3,StandardStorage,EU-TimedStorage-ByteHrs,dlinsin-archives,11/01/08 07:00:00,11/01/08 08:00:00,89743752
    AmazonS3,StandardStorage,StorageObjectCount,dlinsin-downloads,11/01/08 07:00:00,11/01/08 08:00:00,47
    AmazonS3,StandardStorage,StorageObjectCount,dlinsin-archives,11/02/08 07:00:00,11/02/08 08:00:00,137
    AmazonS3,StandardStorage,EU-TimedStorage-ByteHrs,dlinsin-downloads,11/02/08 07:00:00,11/02/08 08:00:00,9809386248
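To make the first two statements of the script more concrete, here is a minimal Python sketch of what they roughly amount to. It is not how Pig works internally; the helper names `load` and `keep_s3` are hypothetical, and the data is the header plus two rows copied from the excerpt above.

```python
import csv
import io

# Header plus two rows copied from the report excerpt above.
REPORT = """\
Service, Operation, UsageType, Resource, StartTime, EndTime, UsageValue
AmazonS3,GetObject,EU-DataTransfer-Out-Bytes,dlinsin-archives,11/01/08 00:00:00,11/01/08 01:00:00,272
AmazonS3,GetObject,EU-Requests-Tier2,dlinsin-downloads,11/01/08 03:00:00,11/01/08 04:00:00,37
"""

FIELDS = ["Service", "Operation", "UsageType", "Resource",
          "StartTime", "EndTime", "UsageValue"]


def load(text):
    """Roughly LOAD ... USING PigStorage(',') AS (...): split on commas, name the columns."""
    return [dict(zip(FIELDS, row)) for row in csv.reader(io.StringIO(text)) if row]


def keep_s3(rows):
    """Roughly FILTER raw BY Service eq 'AmazonS3'."""
    return [r for r in rows if r["Service"] == "AmazonS3"]


raw = load(REPORT)
clean = keep_s3(raw)
print(len(raw), len(clean))  # the header row is in raw but not in clean
```

The header line is dropped simply because its first field is the string 'Service' and not 'AmazonS3'.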

    I wanted to see the amount of traffic for each operation on each of my S3 buckets. As you can see, the report looks quite cluttered, so I grouped the data by 'Operation' and 'Resource', which is what line 3 does. I then applied the SUM function to the 'UsageValue' field of each group, which results in the amount of traffic in bytes. For better readability I converted the results to megabytes. Finally, the last statement sends the results to stdout.
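The GROUP and FOREACH steps of the script can be approximated in plain Python as follows. This is only a sketch of the idea, not Pig's implementation; the function name `op_counts` is borrowed from the alias in the script, and the input is a handful of (Operation, Resource, UsageValue) values taken from the report excerpt.

```python
from collections import defaultdict

# (Operation, Resource, UsageValue) triples taken from the report excerpt,
# already filtered down to data rows.
clean = [
    ("GetObject", "dlinsin-downloads", 269),
    ("GetObject", "dlinsin-downloads", 37),
    ("GetObject", "dlinsin-downloads", 116276888),
    ("StandardStorage", "dlinsin-archives", 137),
    ("StandardStorage", "dlinsin-archives", 89743752),
]


def op_counts(rows):
    """Roughly GROUP BY (Operation, Resource), then SUM(UsageValue) / 1000000."""
    totals = defaultdict(int)
    for operation, resource, usage_value in rows:
        totals[(operation, resource)] += usage_value
    return {group: total / 1000000 for group, total in totals.items()}


for group, value in sorted(op_counts(clean).items()):
    print(group, value)
```

Note that the sum covers every UsageType within a group, so request counts and byte counts end up in the same number, just as they do in the Pig script.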

    crusty:pigtmp dlinsin$ java -cp pig.jar org.apache.pig.Main -x local amazons3.pig
    ((DeleteObject, dlinsin-downloads), 2.0E-6)
    ((GetObject, dlinsin-archives), 1.785656)
    ((GetObject, dlinsin-downloads), 11194.995918)
    ((HeadBucket, dlinsin-downloads), 0.032442)
    ((HeadObject, dlinsin-downloads), 5.42E-4)
    ((ListAllMyBuckets, ), 0.00448)
    ((ListBucket, dlinsin-downloads), 0.108617)
    ((PutObject, dlinsin-downloads), 0.098595)
    ((StandardStorage, dlinsin-archives), 1346.158335)
    ((StandardStorage, dlinsin-downloads), 147140.794425)

    If you are interested, you can download the Pig tutorial together with my script and the Amazon S3 usage report as a pre-packaged archive. Simply unzip it and start Pig with:

    java -cp pig.jar org.apache.pig.Main -x local amazons3.pig

    I think this little script already shows the potential of Pig Latin and its evaluation mechanism, but there is more. You can run your scripts not only in a local JVM, but also on a Hadoop cluster in MapReduce style. There's also an API which lets you embed Pig in a Java application.

    We have a Lucene index of about 150GB at work, and I could imagine running all sorts of statistical analyses on it. Unfortunately I couldn't find any Lucene integration that would let me simply load an index and access the Lucene documents, which are basically simple sets of tuples. I would really love to see how Pig performs when it crunches a big bunch of data.


