The tutorial located here looks very much like the Apache Hadoop tutorial.
I was running through the tutorial and ran into a number of roadblocks. I eventually got around them but thought it would be handy to document the issues that I had in case I ever revisit this tutorial in the future.
The first issue that I ran into was that I couldn't get the Cloudera VM up and running with VirtualBox. After some trial and error I found that I could get the CentOS-based VM up and running after selecting the IO APIC checkbox in the configuration for the VM.
The second issue was that I couldn't install the VirtualBox Guest Additions since CentOS didn't have all the kernel source installed. What a bummer.
I was able to install the necessary kernel files by issuing the following command:
yum install kernel-devel-2.6.18-274.7.1.el5
Then I found that the VirtualBox Guest Additions still couldn't be installed as there was no GCC compiler. That was easily fixed with:
yum install gcc
After that, the VirtualBox Guest Additions installed just fine, and with a reboot I now have much better control over the VM. The default 640x480 is a bit stifling. Now I have it running full screen at 1920x1600 and it is much, much nicer.
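For a future revisit, the two installs above can be combined, and the kernel-devel version can be derived from the running kernel rather than typed by hand. This is a minimal sketch, assuming the package is named kernel-devel-<release>, where <release> is what uname -r prints (that holds for the version above, but not necessarily for variant kernels such as PAE or Xen builds). The function name kernel_devel_pkg is made up for illustration.

```shell
# Hypothetical helper: build the kernel-devel package name for a kernel release.
# Assumption: the package follows the kernel-devel-<release> naming pattern.
kernel_devel_pkg() {
    # Use the release passed as $1, or fall back to the running kernel.
    printf 'kernel-devel-%s\n' "${1:-$(uname -r)}"
}

# Print the command instead of running it, so it can be reviewed first.
echo "yum install $(kernel_devel_pkg) gcc"
```

Printing the command first makes it easy to confirm that the package name matches the running kernel before anything gets installed.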
Next step was working on the WordCount example.
As a side note: In the O'Reilly video we did a very similar word counting map reduce job and I have to say that I much prefer the terse Python code over the Java solution. But that is just personal preference.
The tutorial provides the source code for WordCount.java, and I ran into some issues where the VM deviates from the tutorial. The tutorial gives a tip on the proper environment variables for HADOOP_HOME and HADOOP_VERSION, but it is out of sync with the Cloudera VM.
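One way to sidestep a stale version string is to read it from the core jar that is actually installed. This is a minimal sketch, assuming the jar follows the hadoop-<version>-core.jar naming used in the compile command later in this post. The function name and the example path /usr/lib/hadoop are my own assumptions, not something the tutorial provides.

```shell
# Hypothetical helper: extract the version from a hadoop-<version>-core.jar path.
hadoop_version_from_jar() {
    jar_name=$(basename "$1")               # e.g. hadoop-0.20.2-cdh3u2-core.jar
    jar_name=${jar_name#hadoop-}            # strip the leading "hadoop-"
    printf '%s\n' "${jar_name%-core.jar}"   # strip the trailing "-core.jar"
}

# Example with an assumed install path; substitute the real HADOOP_HOME.
hadoop_version_from_jar /usr/lib/hadoop/hadoop-0.20.2-cdh3u2-core.jar
```

With HADOOP_HOME set, and assuming exactly one core jar lives there, something like export HADOOP_VERSION=$(hadoop_version_from_jar "${HADOOP_HOME}"/hadoop-*-core.jar) would pick up whatever version the VM actually ships.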
The tutorial states that the proper version information is "0.20.2-cdh3u1" when it is actually "0.20.2-cdh3u2". Not really a big deal but when following a tutorial on a subject that is brand new, this can be frustrating.
The next issue that I ran into was due to my forgetting most of my Java development skills. Java development is not something that I do day in and out so some of that information had been garbage collected off of my mental heap to make way for other information (probably due to memorizing useless movie quotes).
I created a sub-dir for the WordCount.java code, then compiled it and created a .jar as described in the instructions, but my first attempts at executing a Hadoop job failed with Hadoop complaining that it couldn't find "org.myorg.WordCount", as seen below.
Compiling the WordCount.java code:
[root@localhost wordcount_classes]# javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java
[root@localhost wordcount_classes]# ls -al
total 24
drwxr-xr-x 2 root root 4096 Dec 6 18:34 .
drwxr-xr-x 3 root root 4096 Dec 6 18:32 ..
-rw-r--r-- 1 root root 1546 Dec 6 18:34 WordCount.class
-rw-r--r-- 1 root root 1869 Dec 6 18:33 WordCount.java
-rw-r--r-- 1 root root 1938 Dec 6 18:34 WordCount$Map.class
-rw-r--r-- 1 root root 1611 Dec 6 18:34 WordCount$Reduce.class
[root@localhost wordcount_classes]# cd ..
[root@localhost wordcount]# jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: WordCount.java(in = 1869) (out= 644)(deflated 65%)
adding: WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
[root@localhost wordcount]# jar tf wordcount.jar
META-INF/
META-INF/MANIFEST.MF
WordCount$Map.class
WordCount.java
WordCount.class
WordCount$Reduce.class
[root@localhost wordcount]#
When I attempted to run my first Hadoop job I got this output:
[root@localhost bad.wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output_1
Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)
What got me was the "Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount" line.
I looked over the code and, as far as I could tell, everything was just fine and there was no reason why the WordCount class shouldn't be found. After mulling over the problem for a while, my brain dug into its persistence layer and pulled out the proper way of taking care of the issue.
The source code defines the package name as org.myorg but I hadn't created the proper sub-dirs to reflect the org.myorg package.
I created the org subdir, then myorg inside the org subdir, and then compiled the code again.
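Here is a minimal sketch of this step, taking a slightly different route than creating the subdirs by hand: javac's -d option writes class files into a directory tree that matches the package declaration. The function names are hypothetical, and HADOOP_HOME and HADOOP_VERSION are assumed to be set as in the earlier compile command.

```shell
# Hypothetical helper: map a Java package name to its directory path.
package_to_dir() {
    printf '%s\n' "$1" | tr '.' '/'
}

# Hypothetical helper: compile WordCount.java and package it.
# Defined only; call it from the directory that holds WordCount.java.
# The -d option makes javac create org/myorg/ under wordcount_classes.
build_wordcount_jar() {
    mkdir -p wordcount_classes &&
    javac -classpath "${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar" \
          -d wordcount_classes WordCount.java &&
    jar -cvf wordcount.jar -C wordcount_classes/ .
}

# Where the class file for package org.myorg is expected inside the jar:
echo "$(package_to_dir org.myorg)/WordCount.class"
```

After running build_wordcount_jar, jar tf wordcount.jar should list org/myorg/WordCount.class instead of a WordCount.class at the top level, which is the layout the class loader needs in order to resolve org.myorg.WordCount.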
I re-created the .jar, pulling in all the files under the org subdir. With the new jar file, Hadoop was much happier and finally found my org.myorg.WordCount class.
The next issue that I ran into was due to not understanding where Hadoop would pull the input files from for the word count example. In the O'Reilly Map Reduce video STDIN and STDOUT were used, and I just figured that I'd be able to specify the input and output subdirs from the local file system. I was incorrect.
I attempted to execute the Hadoop job with the following parameters referencing the input and output subdirs:
[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /home/cloudera/Desktop/wordcount/input /home/cloudera/Desktop/wordcount/output/
11/12/06 18:57:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 18:57:51 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 18:57:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 18:57:51 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 18:57:51 INFO mapred.JobClient: Cleaning up the staging area hdfs://0.0.0.0/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201112061431_0006
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0/home/cloudera/Desktop/wordcount/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Notice that I used the path to the input subdir I created under my Desktop (yeah, I know, I shouldn't be putting the files under the Desktop subdir, but it makes it easy to access the files via the GUI. Obviously I would never do this on a real development system).
What I didn't realize at the time was that the paths given to Hadoop are resolved against the HDFS file system and not the local ext3 file system. After reading the Cloudera Quick Start Guide PDF it all started to make sense.
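The error message above shows what happened to my path: it was prefixed with the default file system, hdfs://0.0.0.0. The toy function below only mimics that prefixing for absolute paths, as an illustration. It is not Hadoop's code, and the name qualify_path is invented here.

```shell
# Toy illustration only: mimic how a scheme-less absolute path is qualified
# against the default file system before the job looks it up.
qualify_path() {
    default_fs=$1   # e.g. hdfs://0.0.0.0 (taken from the error message)
    path=$2
    case $path in
        *://*) printf '%s\n' "$path" ;;                  # already contains ://
        *)     printf '%s%s\n' "$default_fs" "$path" ;;  # prepend the default
    esac
}

qualify_path hdfs://0.0.0.0 /home/cloudera/Desktop/wordcount/input
```

This prints hdfs://0.0.0.0/home/cloudera/Desktop/wordcount/input, the same path the exception complained about. Running /usr/bin/hadoop dfs -ls on a path before submitting a job is a quick way to see whether it exists in HDFS.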
I needed to populate HDFS with the input subdir and the input files noted in the tutorial; the output subdir is created by the job itself and must not already exist. The tutorial uses the paths /usr/joe/wordcount/input and /usr/joe/wordcount/output.
I created the input subdir using the proper DFS command:
/usr/bin/hadoop dfs -mkdir /usr/joe/wordcount/input
And I copied the previously created input files, file01 and file02:
/usr/bin/hadoop dfs -put file01 /usr/joe/wordcount/input
/usr/bin/hadoop dfs -put file02 /usr/joe/wordcount/input
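For completeness, here is a sketch of preparing the two input files before uploading them. The file contents are my recollection of the Apache tutorial's sample input, so treat them as an assumption; they do agree with the word counts and the 50 map input bytes that the job reports further down. The function name upload_inputs and the use of a temporary directory are illustrative.

```shell
# Create the sample input files in a scratch directory.
# Assumption: contents follow the Apache WordCount tutorial's sample input.
workdir=$(mktemp -d)
printf 'Hello World Bye World\n' > "$workdir/file01"
printf 'Hello Hadoop Goodbye Hadoop\n' > "$workdir/file02"

# Hypothetical helper: copy both files into HDFS (defined only, not run here).
upload_inputs() {
    /usr/bin/hadoop dfs -mkdir /usr/joe/wordcount/input &&
    for f in "$workdir/file01" "$workdir/file02"; do
        /usr/bin/hadoop dfs -put "$f" /usr/joe/wordcount/input || return 1
    done
}

# Total size of the local input files in bytes.
cat "$workdir/file01" "$workdir/file02" | wc -c
```

Calling upload_inputs on the VM performs the same mkdir and put steps shown above, stopping at the first command that fails.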
Now it was time for the big event. With everything in place, I tried the tutorial command again:
[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
11/12/06 19:13:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 19:13:33 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 19:13:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 19:13:33 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 19:13:33 INFO mapred.FileInputFormat: Total input paths to process : 2
11/12/06 19:13:34 INFO mapred.JobClient: Running job: job_201112061431_0007
11/12/06 19:13:35 INFO mapred.JobClient: map 0% reduce 0%
11/12/06 19:13:46 INFO mapred.JobClient: map 33% reduce 0%
11/12/06 19:13:47 INFO mapred.JobClient: map 66% reduce 0%
11/12/06 19:13:52 INFO mapred.JobClient: map 100% reduce 0%
11/12/06 19:14:06 INFO mapred.JobClient: map 100% reduce 100%
11/12/06 19:14:09 INFO mapred.JobClient: Job complete: job_201112061431_0007
11/12/06 19:14:09 INFO mapred.JobClient: Counters: 23
11/12/06 19:14:09 INFO mapred.JobClient: Job Counters
11/12/06 19:14:09 INFO mapred.JobClient: Launched reduce tasks=1
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=25035
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Launched map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: Data-local map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19860
11/12/06 19:14:09 INFO mapred.JobClient: FileSystemCounters
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_READ=79
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_READ=348
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=215844
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
11/12/06 19:14:09 INFO mapred.JobClient: Map-Reduce Framework
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input groups=5
11/12/06 19:14:09 INFO mapred.JobClient: Combine output records=6
11/12/06 19:14:09 INFO mapred.JobClient: Map input records=2
11/12/06 19:14:09 INFO mapred.JobClient: Reduce shuffle bytes=91
11/12/06 19:14:09 INFO mapred.JobClient: Reduce output records=5
11/12/06 19:14:09 INFO mapred.JobClient: Spilled Records=12
11/12/06 19:14:09 INFO mapred.JobClient: Map output bytes=82
11/12/06 19:14:09 INFO mapred.JobClient: Map input bytes=50
11/12/06 19:14:09 INFO mapred.JobClient: Combine input records=8
11/12/06 19:14:09 INFO mapred.JobClient: Map output records=8
11/12/06 19:14:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=294
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input records=6
Wait? What is this? Could it be? Yes! It worked! YES! YES! YES!
My dancing around the room was enough to wake up my Basset Hound. He looked up at me and cocked his head as if to say, "Hey, why are you being so goofy? Make yourself useful and get me another doggie snack."
Checking the results of the job I see the following:
[root@localhost wordcount]# /usr/bin/hadoop dfs -ls /usr/joe/wordcount/output
Found 3 items
-rw-r--r-- 1 root supergroup 0 2011-12-06 19:14 /usr/joe/wordcount/output/_SUCCESS
drwxr-xr-x - root supergroup 0 2011-12-06 19:13 /usr/joe/wordcount/output/_logs
-rw-r--r-- 1 root supergroup 41 2011-12-06 19:14 /usr/joe/wordcount/output/part-00000
Look at that! _SUCCESS!
Checking the part-00000 file:
[root@localhost wordcount]# /usr/bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
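As a cross-check, the same counts can be reproduced locally with standard Unix tools. This is only a sketch for small inputs, not a replacement for the Hadoop job, and the input text is assumed to match file01 and file02 from the tutorial. The function name local_wordcount is made up.

```shell
# Hypothetical helper: count words read from standard input.
# Steps: put one word per line, sort so identical words are adjacent,
# count each run with uniq -c, then reorder the columns to "word count".
local_wordcount() {
    tr -s ' ' '\n' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2, $1 }'
}

printf 'Hello World Bye World\nHello Hadoop Goodbye Hadoop\n' | local_wordcount
```

The five lines this prints match the contents of part-00000 above, which is a handy sanity check that the job counted what I expected.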
Sure, it was a lot of work just to count a few words in a text file, but it was a really good learning experience this afternoon. After banging my head against the brick wall for long enough, I got around the potholes that I ran into and feel that I'll be able to continue the tutorial and learn the basics of Hadoop.
Not a bad bit of afternoon vacation learning.