Tuesday, December 6, 2011

Running a first Hadoop job with Cloudera's Tutorial VM

After watching the O'Reilly video on MapReduce, I decided that I'd like to know more about Hadoop. After doing some Googling, I found that a firm by the name of Cloudera has pre-populated VMs available for playing around with, located here.

The tutorial located here looks very much like the Apache Hadoop tutorial.

I was running through the tutorial and hit a number of roadblocks. I eventually got around them, but I thought it would be handy to document the issues I had in case I ever revisit this tutorial in the future.

The first issue I ran into was that I couldn't get the Cloudera VM up and running with VirtualBox. After some trial and error, I found that I could get the CentOS-based VM up and running after selecting the IO APIC checkbox in the configuration for the VM:



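For what it's worth, the same setting can be flipped from the command line with VBoxManage. A minimal sketch, assuming a VM named "Cloudera-Demo-VM" (a placeholder; substitute whatever your imported VM is actually called):

# run with the VM powered off; "Cloudera-Demo-VM" is a placeholder name
VBoxManage modifyvm "Cloudera-Demo-VM" --ioapic on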
The second issue was that I couldn't install the VirtualBox Guest Additions since CentOS didn't have all the kernel source installed. What a bummer.

I was able to install the necessary kernel files by issuing the following command:

yum install kernel-devel-2.6.18-274.7.1.el5

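The exact package version has to match the running kernel. A more general form of that command, assuming the matching package is still in the repos, is:

# match kernel-devel to the kernel that is actually running
yum install kernel-devel-$(uname -r)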
Then I found that the VirtualBox Guest Additions still couldn't be installed as there was no GCC compiler. That was easily fixed with:

yum install gcc

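For completeness, once GCC and the kernel files are in place, the install itself is just running the script off the mounted Guest Additions ISO. A sketch, with the mount point name being an assumption (it varies by VirtualBox version):

# mount point name varies; check /media after Devices > Install Guest Additions
cd /media/VBOXADDITIONS_*
sh ./VBoxLinuxAdditions.run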
After that, the VirtualBox Guest Additions installed just fine, and with a reboot I now have much better control over the VM. The default 640x480 is a bit stifling. Now I have it running full screen at 1920x1600 and it is much, much nicer.

The next step was working on the WordCount example.

As a side note: in the O'Reilly video we did a very similar word-counting MapReduce job, and I have to say that I much prefer the terse Python code over the Java solution. But that is just personal preference.

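(If you share that preference, Hadoop can run Python mappers and reducers over STDIN and STDOUT via its Streaming jar. A hedged sketch, with the jar path assumed from the CDH3 contrib layout and mapper.py/reducer.py as hypothetical stand-ins for your own scripts:

# jar path is an assumption; locate yours with: ls /usr/lib/hadoop-0.20/contrib/streaming/
# mapper.py and reducer.py are hypothetical scripts that read STDIN and write STDOUT
/usr/bin/hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -input /usr/joe/wordcount/input -output /usr/joe/wordcount/output_py \
  -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

This post sticks with the Java version from the tutorial, though.)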
The tutorial provides the source code for WordCount.java, and I ran into some issues where the tutorial deviates from the VM. The tutorial gives a tip on the proper environment variables for HADOOP_HOME and HADOOP_VERSION, but it is out of sync with the Cloudera VM.

The tutorial states that the proper version information is "0.20.2-cdh3u1" when it is actually "0.20.2-cdh3u2". Not really a big deal, but when following a tutorial on a brand-new subject, this kind of thing can be frustrating.

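For reference, the two variables end up looking something like this on the VM. The HADOOP_HOME path is an assumption based on where the CDH3 packages install, so verify it on your own system:

# values assumed for the CDH3u2 VM; verify with `hadoop version` and `ls /usr/lib`
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_VERSION=0.20.2-cdh3u2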
The next issue that I ran into was due to my forgetting most of my Java development skills. Java development is not something that I do day in and day out, so some of that information had been garbage collected off of my mental heap to make way for other information (probably due to memorizing useless movie quotes).

I created a sub-dir for the WordCount.java code and compiled and created a .jar as provided by the instructions, but my first attempts at executing a Hadoop job failed with Hadoop complaining that it couldn't find "org.myorg.WordCount", as seen below.

Compiling the WordCount.java code:

[root@localhost wordcount_classes]# javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java
[root@localhost wordcount_classes]# ls -al
total 24
drwxr-xr-x 2 root root 4096 Dec 6 18:34 .
drwxr-xr-x 3 root root 4096 Dec 6 18:32 ..
-rw-r--r-- 1 root root 1546 Dec 6 18:34 WordCount.class
-rw-r--r-- 1 root root 1869 Dec 6 18:33 WordCount.java
-rw-r--r-- 1 root root 1938 Dec 6 18:34 WordCount$Map.class
-rw-r--r-- 1 root root 1611 Dec 6 18:34 WordCount$Reduce.class
[root@localhost wordcount_classes]# cd ..
[root@localhost wordcount]# jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: WordCount.java(in = 1869) (out= 644)(deflated 65%)
adding: WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
[root@localhost wordcount]# jar tf wordcount.jar
META-INF/
META-INF/MANIFEST.MF
WordCount$Map.class
WordCount.java
WordCount.class
WordCount$Reduce.class
[root@localhost wordcount]#


When I attempted to run my first Hadoop job I got this output:


[root@localhost bad.wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output_1
Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)


What got me was the "Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount."

I looked over the code and, as far as I could tell, everything was just fine and there was no reason why the WordCount class shouldn't be found. After mulling the problem for a while, my brain went into its persistence layer and pulled out the proper way of taking care of the issue.

The source code defines the package name as org.myorg but I hadn't created the proper sub-dirs to reflect the org.myorg package.

I created the org subdir, then the myorg subdir inside it, and compiled the code again so the generated .class files would sit under org/myorg/.

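As an aside, one way to avoid creating the package directories by hand is javac's -d flag, which writes the .class files into the matching package structure for you. A minimal sketch, reusing the environment variables from earlier:

# -d creates org/myorg/ under wordcount_classes to match the package declaration
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .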

I re-created the .jar, this time pulling in all the files under the org subdir, and found that Hadoop was much happier and would finally find my org.myorg.WordCount class.

The next issue that I ran into was due to not understanding where Hadoop would pull the input files for the word count example from. In the O'Reilly MapReduce video, STDIN and STDOUT were used, and I just figured that I'd be able to specify the input and output subdirs from the local file system. I was incorrect.

I attempted to execute the Hadoop job with the following parameters referencing the input and output subdirs:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /home/cloudera/Desktop/wordcount/input /home/cloudera/Desktop/wordcount/output/
11/12/06 18:57:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 18:57:51 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 18:57:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 18:57:51 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 18:57:51 INFO mapred.JobClient: Cleaning up the staging area hdfs://0.0.0.0/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201112061431_0006
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0/home/cloudera/Desktop/wordcount/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


Notice that I used the path to my Desktop, referencing the /input subdir I had created (yeah, I know, I shouldn't be putting the files under the Desktop subdir, but it makes it easy to access the files via the GUI; obviously I would never do this on a real development system).

What I didn't realize at the time was that the paths given to Hadoop are relative to the HDFS file system and not the local ext3 file system. After reading the Cloudera Quick Start Guide PDF, it all started to make sense.

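A quick way to convince yourself of this is to list what HDFS actually contains; it looks nothing like the local ext3 layout:

# lists the root of HDFS, not the local file system
/usr/bin/hadoop dfs -ls /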
I needed to populate HDFS with the input and output subdirs, along with the input files noted in the tutorial. The tutorial references the paths /usr/joe/wordcount/input and /usr/joe/wordcount/output.

I created the input subdir using the proper DFS command:


/usr/bin/hadoop dfs -mkdir /usr/joe/wordcount/input


And I copied the previously created input files, file01 and file02:


/usr/bin/hadoop dfs -put file01 /usr/joe/wordcount/input
/usr/bin/hadoop dfs -put file02 /usr/joe/wordcount/input

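To sanity-check the copy, you can list the directory and cat the files back out of HDFS. For reference, the tutorial's input files contain "Hello World Bye World" (file01) and "Hello Hadoop Goodbye Hadoop" (file02), which lines up with the counts further down:

# confirm the files landed in HDFS and read one back
/usr/bin/hadoop dfs -ls /usr/joe/wordcount/input
/usr/bin/hadoop dfs -cat /usr/joe/wordcount/input/file01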

Now it was time for the big event. With everything in place, I tried the tutorial command again:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
11/12/06 19:13:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 19:13:33 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 19:13:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 19:13:33 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 19:13:33 INFO mapred.FileInputFormat: Total input paths to process : 2
11/12/06 19:13:34 INFO mapred.JobClient: Running job: job_201112061431_0007
11/12/06 19:13:35 INFO mapred.JobClient: map 0% reduce 0%
11/12/06 19:13:46 INFO mapred.JobClient: map 33% reduce 0%
11/12/06 19:13:47 INFO mapred.JobClient: map 66% reduce 0%
11/12/06 19:13:52 INFO mapred.JobClient: map 100% reduce 0%
11/12/06 19:14:06 INFO mapred.JobClient: map 100% reduce 100%
11/12/06 19:14:09 INFO mapred.JobClient: Job complete: job_201112061431_0007
11/12/06 19:14:09 INFO mapred.JobClient: Counters: 23
11/12/06 19:14:09 INFO mapred.JobClient: Job Counters
11/12/06 19:14:09 INFO mapred.JobClient: Launched reduce tasks=1
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=25035
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Launched map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: Data-local map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19860
11/12/06 19:14:09 INFO mapred.JobClient: FileSystemCounters
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_READ=79
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_READ=348
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=215844
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
11/12/06 19:14:09 INFO mapred.JobClient: Map-Reduce Framework
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input groups=5
11/12/06 19:14:09 INFO mapred.JobClient: Combine output records=6
11/12/06 19:14:09 INFO mapred.JobClient: Map input records=2
11/12/06 19:14:09 INFO mapred.JobClient: Reduce shuffle bytes=91
11/12/06 19:14:09 INFO mapred.JobClient: Reduce output records=5
11/12/06 19:14:09 INFO mapred.JobClient: Spilled Records=12
11/12/06 19:14:09 INFO mapred.JobClient: Map output bytes=82
11/12/06 19:14:09 INFO mapred.JobClient: Map input bytes=50
11/12/06 19:14:09 INFO mapred.JobClient: Combine input records=8
11/12/06 19:14:09 INFO mapred.JobClient: Map output records=8
11/12/06 19:14:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=294
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input records=6


Wait? What is this? Could it be? Yes! It worked! YES! YES! YES!

My dancing around the room was enough to wake up my Basset Hound. He looked up at me and cocked his head as if to say, "Hey, why are you being so goofy? Make yourself useful and get me another doggie snack."

Checking the results of the job I see the following:


[root@localhost wordcount]# /usr/bin/hadoop dfs -ls /usr/joe/wordcount/output
Found 3 items
-rw-r--r-- 1 root supergroup 0 2011-12-06 19:14 /usr/joe/wordcount/output/_SUCCESS
drwxr-xr-x - root supergroup 0 2011-12-06 19:13 /usr/joe/wordcount/output/_logs
-rw-r--r-- 1 root supergroup 41 2011-12-06 19:14 /usr/joe/wordcount/output/part-00000


Look at that! _SUCCESS!

Checking the part-00000 file:


[root@localhost wordcount]# /usr/bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


Sure, it was a lot of work just to count a few words in a couple of text files, but it was a really good learning experience this afternoon. After banging my head against the brick wall for long enough, I got around the potholes that I ran into, and I feel that I'll be able to continue the tutorial and learn the basics of Hadoop.

Not a bad bit of afternoon vacation learning.

7 comments:

  1. Great Work.. After working for around 5 hours, not able to solve problems... I faced same problems.. thnks for post

  2. Thanks a lot for the post.. It helped.. Darn new to Java after reading post couple of times I managed to fix "Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount."..

  3. Glad this post was of some service!

  4. Awesome ... thanks to your blog i was able to run it.

  5. Your post was a tremendous help as I worked through getting the Cloudera WordCount sample to work. Thank you.

  6. Thanks so much! the input files were giving me fits as well - your tip on moving them to the dfs file system saved me... Thanks again!!

  7. Good tutorial!
    It helps me so much!
    thanks for the tips.