A critical piece of that architecture is Terracotta BigMemory's ability to act as a fast in-memory buffer, accessible by both the real-time world and the batch world, effectively bridging the gap between the two.
The Terracotta BigMemory-Hadoop connector is at the center of that piece, allowing Hadoop to write seamlessly to BigMemory.
For general information, please refer to existing writings about this connector:
- http://terracotta.org/resources/whitepapers/bigmemory-hadoop?set=1 (White paper)
I'll be using as a guide the code I put together for the "AFCEA Cyber Symposium Plugfest", available on GitHub at https://github.com/lanimall/cyberplugfest
Master Step 1 - Get the software components up and running
1 - Let's download the needed components:
- Hadoop connector: http://terracotta.org/downloads/hadoop-connector?set=1
- Bigmemory Max: http://terracotta.org/downloads/bigmemorymax?set=1
- We'll test here with the latest 4.0.2 version, but it works fine with TC 3.7.x as well.
- Hadoop: The connector documentation mentions hadoop version 0.20.203.0 (http://archive.apache.org/dist/hadoop/core/hadoop-0.20.203.0/), but I tested successfully with Hadoop 1.1.2, and Cloudera CDH (http://www.cloudera.com/content/cloudera/en/products/cdh.html) packages as well. Pick the one you want. I chose the standard Apache Hadoop 1.1.2
2 - Clone the git repository to get the cyberplugfest code:
git clone https://github.com/lanimall/cyberplugfest
In the rest of the article, we will assume that $CYBERPLUGFEST_CODE_HOME is the root install directory for the code.
3 - Extract the hadoop connector somewhere on your development box.
If you want to explore and follow the default instructions plus the sample word count program, that works fine, but please note that I took some liberties with my setup, which I'll explain in this article.
In the rest of the article, we will assume that $TC_HADOOP_HOME is the root install directory of the terracotta hadoop connector.
4 - Install, configure, and start BigMemory Max
In the rest of the article, we will assume that $TC_HOME is the root directory of BigMemory Max.
I added a sample tc-config.xml at https://github.com/lanimall/cyberplugfest/blob/master/configs/tc-config.xml.
To get bigmemory-max started with that configuration file on your local machine, run:
export CYBERPLUGFEST_CODE_HOME=<root path to cyberplugfest code cloned from github>
export TC_HOME=<root path to terracotta install>
$TC_HOME/server/bin/start-tc-server.sh -f $CYBERPLUGFEST_CODE_HOME/configs/tc-config.xml -n Server1
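For reference, a minimal tc-config.xml looks roughly like this (the actual file in the repo may differ; the host, port, and data/log paths below are illustrative assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tc:tc-config xmlns:tc="http://www.terracotta.org/config">
  <servers>
    <!-- name="Server1" matches the -n argument passed to start-tc-server.sh -->
    <server host="localhost" name="Server1">
      <data>/tmp/terracotta/server-data</data>
      <logs>/tmp/terracotta/server-logs</logs>
      <tsa-port>9510</tsa-port>
    </server>
  </servers>
</tc:tc-config>
```

9510 is the default TSA port clients connect to; keep note of it, since the cache configuration later needs to point at it.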
5 - Install Hadoop
In the rest of the article, we will assume that $HADOOP_INSTALL is the root install directory of Apache Hadoop
6 - Add the needed terracotta libraries to the Hadoop class path
- The hadoop connector library: bigmemory-hadoop-0.1.jar
- The ehcache client library: ehcache-ee-<downloaded version>.jar
- The terracotta toolkit library: terracotta-toolkit-runtime-ee-<downloaded version>.jar
Edit $HADOOP_INSTALL/conf/hadoop-env.sh and add the following towards the top (replacing the default empty HADOOP_CLASSPATH= line). Adjust the jar locations to match your actual install layout:
export TC_HOME=<root path to terracotta install>
export TC_HADOOP_HOME=<root path to hadoop connector install>
export HADOOP_CLASSPATH=$TC_HADOOP_HOME/lib/bigmemory-hadoop-0.1.jar:$TC_HOME/apis/ehcache/lib/ehcache-ee-<downloaded version>.jar:$TC_HOME/apis/toolkit/lib/terracotta-toolkit-runtime-ee-<downloaded version>.jar
7 - Start Hadoop in pseudo-distributed mode
Master Step 2 - Write the Map/Reduce job using spring-hadoop and Terracotta BigMemory output connector
The goal of the job is to compute the mean transaction amount per vendor, producing output of the form:
| Vendor A | Mean A |
| Vendor B | Mean B |
| Vendor N | Mean N |
And since I really like Spring (http://www.springsource.org) and wanted to extend the simple hadoop wordCount example, I decided to use Spring-Data Hadoop (http://www.springsource.org/spring-data/hadoop) to build the map reduce job for the plugfest.
There are some really good tutorials for Spring Hadoop out there, so I don't want to duplicate them here. One I liked for its simplicity and clarity is http://www.petrikainulainen.net/programming/apache-hadoop/creating-hadoop-mapreduce-job-with-spring-data-apache-hadoop/
Rather, I'll concentrate on the specifics of writing output to Terracotta BigMemory.
Code available at: https://github.com/lanimall/cyberplugfest/tree/master/HadoopJobs
1 - Let's explore the application-context.xml
a - Specify the output cache name for the BigMemory hadoop job
In the <hdp:configuration> section, make sure to add the "bigmemory.output.cache" entry that specifies the output cache. Since our output cache is "vendorAvgSpend", it should basically be: bigmemory.output.cache=vendorAvgSpend
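For illustration, that entry sits in the job configuration roughly like this (Spring Hadoop's <hdp:configuration> element accepts properties in its body; the fs.default.name and mapred.job.tracker values below are placeholders for your own environment):

```xml
<hdp:configuration>
  fs.default.name=hdfs://localhost:9000
  mapred.job.tracker=localhost:9001
  bigmemory.output.cache=vendorAvgSpend
</hdp:configuration>
```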
NOTE: I use the Maven resources plugin, so this value is actually specified in the pom.xml (in the property "hadoop.output.cache")
b - Check the difference between hadoop jobs
- hdjob-vendoraverage=standard M/R job that outputs to HDFS
- hdjob-vendoraverage-bm=the same M/R job that outputs to BigMemory
- For the hadoop BigMemory job (hdjob-vendoraverage-bm), output-format value is: "org.terracotta.bigmemory.hadoop.BigmemoryOutputFormat"
- For hdjob-vendoraverage, it's the standard "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
- An HDFS output path is not needed for the hadoop BigMemory job since it does not write onto HDFS...
- For the hadoop BigMemory job, a different reducer implementation is needed (org.terracotta.pocs.cyberplugfest.VendorSalesAvgReducerBigMemory) because you need to return an object of type "BigmemoryElementWritable"
- For the hdjob-vendoraverage job, the reducer returns an object of type "Text".
- In the hdjob-vendoraverage-bm job, you need to add the terracotta license file so the hadoop job can connect to the terracotta bigmemory (enterprise feature)
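Put together, the BigMemory job declaration looks roughly like the sketch below (attribute names follow the Spring for Apache Hadoop schema; the input path and mapper class name are illustrative guesses, not the repo's actual values):

```xml
<!-- mapper class name is hypothetical; see the repo for the real one -->
<hdp:job id="hdjob-vendoraverage-bm"
    input-path="/user/hadoop/flume/events/"
    mapper="org.terracotta.pocs.cyberplugfest.VendorSalesMapper"
    reducer="org.terracotta.pocs.cyberplugfest.VendorSalesAvgReducerBigMemory"
    output-format="org.terracotta.bigmemory.hadoop.BigmemoryOutputFormat"/>
```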
c - Specify the job to run.
This is done in the <hdp:job-runner ...> tag. You can switch back and forth between the two jobs to see the difference...
2 - Now, let's look at the reducers
- hdjob-vendoraverage-bm reducer class: org.terracotta.pocs.cyberplugfest.VendorSalesAvgReducerBigMemory
- hdjob-vendoraverage: org.terracotta.pocs.cyberplugfest.VendorSalesAvgReducer
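Stripped of the Hadoop types, the averaging that both reducers perform boils down to something like this plain-Java sketch (the class and method names here are mine, not the repo's):

```java
import java.util.Arrays;
import java.util.List;

public class VendorAvg {
    // For one vendor key: sum the transaction amounts
    // and divide by the number of transactions.
    static double average(List<Double> amounts) {
        double sum = 0.0;
        for (double a : amounts) {
            sum += a;
        }
        return amounts.isEmpty() ? 0.0 : sum / amounts.size();
    }

    public static void main(String[] args) {
        // e.g. three transactions for one vendor -> mean of 20.0
        System.out.println(average(Arrays.asList(10.0, 20.0, 30.0)));
    }
}
```

The only real difference between the two reducer classes is what wraps this result: the HDFS job emits it as a "Text" value, while the BigMemory job wraps it in a "BigmemoryElementWritable" so the output format can write it to the cache.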
3 - Include the cache configuration (Ehcache.xml) in your M/R project
In this ehcache.xml file, you'll notice our hadoop output cache (as well as several other caches that are NOT used by the hadoop jobs). The one thing that is needed is that it must be a "distributed" cache - in other words, the data will be stored on the BigMemory Max Server instance that should already be running on your development box.
For more info on that, go to http://terracotta.org/documentation/4.0/bigmemorymax/get-started/server-array
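A minimal sketch of the relevant part of ehcache.xml would look like this (the sizing values are illustrative, and 9510 is the default TSA port assumed for the local server):

```xml
<ehcache>
  <!-- Points the client at the BigMemory Max server started earlier -->
  <terracottaConfig url="localhost:9510"/>
  <!-- The hadoop output cache; the nested <terracotta/> element
       is what makes it a distributed cache -->
  <cache name="vendorAvgSpend"
         maxEntriesLocalHeap="1000"
         eternal="true">
    <terracotta/>
  </cache>
</ehcache>
```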
Master Step 3 - Prepare the sample data
Extract the sample at: $CYBERPLUGFEST_CODE_HOME/HadoopJobs/SampleTransactionsData/sample-data.zip.
It should create a "flume" folder with the following hierarchy:
- flume/events/13-07-17/events.* (those are the files with the comma-separated data)
Navigate to $CYBERPLUGFEST_CODE_HOME/HadoopJobs/SampleTransactionsData/
Run the hadoop shell "put" command to add all these files into HDFS:
$HADOOP_INSTALL/bin/hadoop dfs -put flume/ .
Once done, verify that the data is in HDFS by running the following (this should list a lot of event files):
$HADOOP_INSTALL/bin/hadoop dfs -ls flume/events/13-07-17/
Master Step 4 - Compile and Run the hadoop job
Before that though, make sure the maven properties are right for your environment (i.e. hadoop name node URL, hadoop job tracker URL, terracotta URL, cache name, etc.). These properties are specified towards the end of the pom file, in the maven profiles I created for the event (the dev profile is for my local setup; the prod profile is for deploying to the Amazon EC2 cloud).
Then, navigate to the $CYBERPLUGFEST_CODE_HOME/HadoopJobs folder and run:
mvn clean package appassembler:assemble
This should build without issue and create a "PlugfestHadoopApp" executable script (the maven appassembler plugin helps with that) in the $CYBERPLUGFEST_CODE_HOME/HadoopJobs/target/appassembler/bin folder.
Depending on your platform (Windows or *nix), choose the right script (bat or sh) and run:
sh $CYBERPLUGFEST_CODE_HOME/HadoopJobs/target/appassembler/bin/PlugfestHadoopApp
Your hadoop job should now be running.
Master Step 5 - Verify data is written to Terracotta BigMemory
sh $CYBERPLUGFEST_CODE_HOME/HadoopJobs/target/appassembler/bin/VerifyBigmemoryData
You should see 6 entries printed for the cache vendorAvgSpend.
Hope you find this hands-on post useful.