Sunday, April 29, 2012

Creating stacked area plots with GD::Graph::area in Perl

I recently came across a need to generate an area plot showing the utilization of a series of virtual machines on an IBM pSeries frame. I had already written this perl utility, which works with TeamQuest's tqharvc.exe to extract TeamQuest physc data. That routine dumps out both .CSVs and scatter plots for the requested metrics, which are specified by a saved .rpt file created with TeamQuest View.

I decided to write another perl utility to stitch individual .CSV files together into a single .CSV that I could then import into Excel to create area plots. That worked like a champ, but there was a problem: the hostnames of the virtual machines running on the frame are cryptic, and they don't make any sense unless you have all of them memorized. This really frustrated me.

I had been creating the single .CSV and then adding an extra tab to the spreadsheet with each cryptic hostname and a simple description of the purpose of the VM so that my stacked area plots would be easier to read.

That also worked, but the extra time spent adding the lookup information and the =vlookup() formulas to the spreadsheet got to be annoying. Once again, I decided to use perl to solve the problem for me.

First, the routine that I wrote to combine the individual .CSV files into a single .CSV. It takes a directory listing of the individual .CSV files that I want to stitch together, which is simple to generate from the Windows command line with `dir *.csv /b > csv.list`, where csv.list is the list of .CSVs to combine. All of the individual .CSV files need to have a common date/time stamp as their first column, otherwise the merge won't line up correctly. That isn't an issue here since my tqharvc perl routine already writes the date/time stamp as the first column.

The code:

1:  use strict;  
2:    
3:  my $csvList = shift;  
4:  open(LIST, "$csvList") || die "$!";  
5:    
6:  my @csvArray = ();  
7:  my %matrixHash = ();  
8:    
9:  while (<LIST>) {  
10:   chomp($_);  
11:   push(@csvArray, $_);  
12:  };  
13:    
14:  foreach my $csvFile (@csvArray) {  
15:   open(CSV, "$csvFile") || die "$!";  
16:   print STDERR "Working on $csvFile\n";  
17:   while(<CSV>) {  
18:    if ($. > 1) {  
19:     chomp($_);  
20:     my @columns = split(/,/, $_);  
21:     $matrixHash{$columns[0]}{$csvFile} = $columns[1];  
22:    };  
23:   };  
24:   close(CSV);  
25:  };  
26:    
27:  # header line  
28:    
29:  print STDOUT "timeStamp";  
30:  foreach my $csvFile (@csvArray) {  
31:   my @columns = split(/_/, $csvFile);  
32:   print STDOUT ",$columns[0]";  
33:  };  
34:  print STDOUT "\n";  
35:    
36:  # actual data into CSV matrix  
37:    
38:  foreach my $timeStamp (sort(keys(%matrixHash))) {  
39:   print STDOUT "$timeStamp";  
40:   foreach my $csvFile (@csvArray) {  
41:    print STDOUT ",$matrixHash{$timeStamp}{$csvFile}"  
42:   };  
43:   print STDOUT "\n";  
44:  };  
45:    

The beauty of this code is that I can take tens, hundreds, even thousands of individual .CSV files and generate a single .CSV that can be imported into Excel for number crunching.
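
The script writes the combined matrix to STDOUT, so I just redirect it to a file. Assuming it is saved as combineCSV.pl (a placeholder name of my choosing; call it whatever you like), the invocation looks something like this:

 perl combineCSV.pl csv.list > combined.csv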


Next I needed the perl code to take the single .CSV file and generate a stacked area plot. For that I turned to the tried and true GD::Graph module, specifically GD::Graph::area. The code accepts three command line arguments: the path of the single .CSV, the path to a text file that maps the hostnames to simple English descriptions in CSV format, and the name of the image to produce. Here is the lookup file I used in my example, which I simply called "lpar.list":
 vm1,Virtual Machine Number 1  
 vm2,Virtual Machine Number 2  
 vm3,Virtual Machine Number 3  
 vm4,Virtual Machine Number 4  
 vm5,Virtual Machine Number 5  
 vm6,Virtual Machine Number 6  
 vm7,Virtual Machine Number 7  
 vm8,Virtual Machine Number 8  
 vm9,Virtual Machine Number 9  
 vm10,Virtual Machine Number 10   
 vm11,Virtual Machine Number 11  
 vm12,Virtual Machine Number 12  
 vmvio_pri,Production VIO  
 vmvio_sec,Production VIO  
 vmvio_nonprod_pri,Non-Production VIO  
 vmvio_nonprod_sec,Non-Production VIO  
Fortunately, GD::Graph::area will automagically generate a stacked area plot if the proper option is set in the graph's options hash, which is "cumulate => 1," on line #89.

And now the code that actually generates the area plot.
1:  use strict;  
2:  use GD::Graph::area;  
3:    
4:  my $csvData  = shift;  
5:  my $lookupData = shift;  
6:  my $graphName = shift;  
7:    
8:  $graphName .= ".gif";  
9:    
10:  open(DATA, "$csvData")   || die "$!";  
11:  open(LOOKUP, "$lookupData") || die "$!";  
12:    
13:  my %lookupHash = ();  
14:  my %areaHash  = ();  
15:  my @columnNames = ();  
16:    
17:  while(<LOOKUP>) {  
18:   chomp($_);  
19:   my ($lpar, $name) = split(/,/, $_);  
20:   $lookupHash{$lpar} = $name;  
21:  };  
22:    
23:  while(<DATA>) {  
24:   chomp($_);  
25:   my @columns = split(/,/, $_);  
26:   if ($. == 1) {  
27:    @columnNames = @columns;  
28:    foreach my $index (1..$#columns) {  
29:     # $columnNames[$index] = $lookupHash{$columnNames[$index]};  
30:     $columnNames[$index] .= " - $lookupHash{$columnNames[$index]}";  
31:    };  
32:   } else {  
33:    foreach my $i (1..$#columns) {  
34:     $areaHash{$columns[0]}{$columnNames[$i]} = $columns[$i];  
35:    };  
36:   };  
37:  };  
38:    
39:  my %timeStampsToDelete = ();  
40:    
41:  foreach my $timeStamp (keys(%areaHash)) {  
42:   foreach my $columnName (keys(%{$areaHash{$timeStamp}})) {  
43:    if ($areaHash{$timeStamp}{$columnName} < 0 || $areaHash{$timeStamp}{$columnName} eq '') {  
44:     $timeStampsToDelete{$timeStamp}++;  
45:    };  
46:   };  
47:  };  
48:    
49:  foreach my $timeStamp (sort(keys(%timeStampsToDelete))) {  
50:   print STDERR "Deleting $timeStamp from areaHash\n";  
51:   delete $areaHash{$timeStamp};  
52:  };  
53:    
54:  my @dataArray = ();  
55:  my @dataLine = ();  
56:    
57:  my $elementCount = 0;  
58:  my @legendArray = ();  
59:  my $i = 0;  
60:    
61:  foreach my $columnName (@columnNames) {  
62:   my @outputArray = ();  
63:   if ($columnName !~ /timeStamp/) {  
64:    my $j= 0;  
65:    foreach my $timeStamp (sort(keys(%areaHash))) {  
66:     $dataLine[$j++] = $areaHash{$timeStamp}{$columnName};  
67:    };  
68:    @outputArray = @dataLine;  
69:    push(@legendArray, $columnName);  
70:   } else {  
71:    foreach my $timeStamp (sort(keys(%areaHash))) {  
72:     push(@outputArray, $timeStamp);  
73:     $elementCount++;  
74:    };  
75:   };  
76:   $dataArray[$i] = \@outputArray;  
77:   $i++;  
78:  };  
79:     
80:  my $mygraph = GD::Graph::area->new(1280, 1024);  
81:    
82:  $mygraph->set(x_label_skip   => int($elementCount/40),  
83:         x_labels_vertical => 1,  
84:         y_label      => "physc",  
85:         y_min_value    => 0,  
86:         y_max_value    => 16,  
87:         y_tick_number   => 16,  
88:         title       => "Stacked physc utilization",  
89:         cumulate     => 1,  
90:  ) or warn $mygraph->error;  
91:    
92:  $mygraph->set_legend(@legendArray);  
93:    
94:  $mygraph->set_legend_font(GD::gdMediumBoldFont);  
95:  $mygraph->set_x_axis_font(GD::gdMediumBoldFont);  
96:  $mygraph->set_y_axis_font(GD::gdMediumBoldFont);  
97:    
98:  my $myimage = $mygraph->plot(\@dataArray) or die $mygraph->error;  
99:    
100:  open(PICTURE, ">$graphName");  
101:  binmode PICTURE;  
102:  print PICTURE $myimage->gif;  
103:  close(PICTURE);  
104:    
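
To actually produce a graph, I pass in the combined .CSV, the lookup file, and a base name for the image. Assuming the script above is saved as stackedArea.pl (again, just a placeholder name), the call looks something like:

 perl stackedArea.pl combined.csv lpar.list frame_physc

which writes out frame_physc.gif, since the code appends the .gif extension to the third argument.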

The output of the code is exactly what I wanted:

I don't believe that the area plot looks as nice as the Excel output, but now it is effortless to generate a stacked area plot of VM physc utilization as part of a daily cron job for reporting purposes.
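
For the daily reporting piece, the whole thing can be wired up as a single cron entry along these lines (the schedule, directory, and script names are just placeholders for whatever you actually use):

 30 6 * * * cd /path/to/reports && perl combineCSV.pl csv.list > combined.csv && perl stackedArea.pl combined.csv lpar.list frame_physc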

Saturday, January 21, 2012

Extracting data from CA Wily APM via web service API


I found an interesting gotcha with the CA Wily APM web services that bit me in the buttocks this week, and I figured anybody else who wants to access CA Wily APM via web services might like to know about it.

I wrote some sample web service client code in Python using the suds library to pull the list of agents running on some LPARs and then list out the available metrics for each agent.

8<------------
from suds.client import Client


userName = 'someAccount'
passWord = 'supersupersekrit'


serverName = 'someserver'
serverPort = 1234


metricsDataServiceURL = 'http://%s:%d/introscope-web-services/services/…' % (serverName, serverPort)
agentListURL = 'http://%s:%d/introscope-web-services/services/…' % (serverName, serverPort)


agentListClient = Client(agentListURL, username=userName, password=passWord)
agentList = agentListClient.service.listAgents('.*servername.*')


for agentName in agentList:
  print "Working on Agent %s" % (agentName,)
  metricList = agentListClient.service.listMetrics(agentName, '.*')
  for metricName in metricList:
    print "\t%s" % (metricName,)
------------->8

What I failed to take into account is that the agent names returned from the system contain special regex metacharacters (fully qualified agent names can include characters such as "|"), and when an agent name is passed back in the call to listMetrics, those metacharacters need to be escaped with a back-whack.

I was able to fix the problem by importing the re regular expression library and wrapping the agent name in a call to re.escape, changing the line

8<--------
metricList = agentListClient.service.listMetrics(agentName, '.*')
--------->8

to

8<--------
metricList = agentListClient.service.listMetrics(re.escape(agentName), '.*')
--------->8

After that, I was able to retrieve a list of all the available metrics for the captured agents. Any API call that takes a resolved agent name obtained from a regex query just needs that name escaped first, and all is well.
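
If you want to see what the escaping actually does, a quick one-liner makes it obvious (the agent name below is made up, but Introscope agent names are typically pipe-delimited like this):

 python -c "import re; print re.escape('HostName|ProcessName|AgentName')"
 HostName\|ProcessName\|AgentName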

Tuesday, December 6, 2011

Running a first Hadoop job with Cloudera's Tutorial VM

After watching the O'Reilly video on Map Reduce I decided that I'd like to know more about Hadoop. After doing some Googling I found that a firm by the name of Cloudera has pre-populated VMs available for playing around with, located here.

The tutorial located here looks very much like the Apache Hadoop tutorial.

I was running through the tutorial and ran into a number of road blocks. I eventually got around them, but thought it would be handy to document the issues I hit in case I ever revisit this tutorial in the future.

The first issue that I ran into was that I couldn't get the Cloudera VM up and running with VirtualBox. After some trial and error I found that the CentOS based VM would boot once I selected the IO APIC checkbox in the VM's configuration:



The second issue was that I couldn't install the VirtualBox Guest Additions since CentOS didn't have all the kernel source installed. What a bummer.

I was able to install the necessary kernel files by issuing the following command:

yum install kernel-devel-2.6.18-274.7.1.el5

Then I found that the VirtualBox Guest Additions still couldn't be installed as there was no GCC compiler. That was easily fixed with:

yum install gcc

After that, the VirtualBox Guest Additions installed just fine, and with a reboot I now have much better control over the VM. The default 640x480 is a bit stifling. Now I have it running full screen at 1920x1600 and it is much, much nicer.

Next step was working on the WordCount example.

As a side note: In the O'Reilly video we did a very similar word counting map reduce job and I have to say that I much prefer the terse Python code over the Java solution. But that is just personal preference.

The tutorial provides the source code for WordCount.java, and I ran into some issues where the tutorial deviates from the VM. The tutorial gives a tip on the proper values for the HADOOP_HOME and HADOOP_VERSION environment variables, but it is out of sync with the Cloudera VM.

The tutorial states that the proper version information is "0.20.2-cdh3u1" when it is actually "0.20.2-cdh3u2". Not really a big deal but when following a tutorial on a subject that is brand new, this can be frustrating.
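
If you end up setting those environment variables by hand before compiling, something along these lines works (adjust HADOOP_HOME to wherever Hadoop actually lives on your copy of the VM):

 export HADOOP_HOME=/usr/lib/hadoop
 export HADOOP_VERSION=0.20.2-cdh3u2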

The next issue that I ran into was due to my forgetting most of my Java development skills. Java development is not something that I do day in and out so some of that information had been garbage collected off of my mental heap to make way for other information (probably due to memorizing useless movie quotes).

I created a sub-dir for the WordCount.java code, compiled it, and created a .jar as described in the instructions, but my first attempts at executing a Hadoop job failed with Hadoop complaining that it couldn't find "org.myorg.WordCount", as seen below.

Compiling the WordCount.java code:

[root@localhost wordcount_classes]# javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java
[root@localhost wordcount_classes]# ls -al
total 24
drwxr-xr-x 2 root root 4096 Dec 6 18:34 .
drwxr-xr-x 3 root root 4096 Dec 6 18:32 ..
-rw-r--r-- 1 root root 1546 Dec 6 18:34 WordCount.class
-rw-r--r-- 1 root root 1869 Dec 6 18:33 WordCount.java
-rw-r--r-- 1 root root 1938 Dec 6 18:34 WordCount$Map.class
-rw-r--r-- 1 root root 1611 Dec 6 18:34 WordCount$Reduce.class
[root@localhost wordcount_classes]# cd ..
[root@localhost wordcount]# jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: WordCount.java(in = 1869) (out= 644)(deflated 65%)
adding: WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
[root@localhost wordcount]# jar tf wordcount.jar
META-INF/
META-INF/MANIFEST.MF
WordCount$Map.class
WordCount.java
WordCount.class
WordCount$Reduce.class
[root@localhost wordcount]#


When I attempted to run my first Hadoop job I got this output:


[root@localhost bad.wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output_1
Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)


What got me was the "Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount."

I looked over the code and, as far as I could tell, everything was just fine and there was no reason why the WordCount class shouldn't be found. After mulling the problem over for a while, my brain dug into its persistence layer and pulled out the proper way of taking care of the issue.

The source code defines the package name as org.myorg, but I hadn't created the matching org/myorg directory structure, so the class files ended up at the root of the .jar instead of under org/myorg/ where the class loader expects to find them.

I created the org subdir, then the myorg subdir inside it, and compiled the code again:



I re-created the .jar, this time pulling in all the files under the org subdir, and found that Hadoop was much happier and could finally locate my org.myorg.WordCount class.
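
As a side note, javac will create the package directory structure for you if you hand it an output directory with -d (the wordcount_classes directory just has to exist first), which avoids the manual subdir shuffle:

 javac -d wordcount_classes -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java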

The next issue that I ran into was due to not understanding where Hadoop would pull the input files for the word count example from. In the O'Reilly Map Reduce video STDIN and STDOUT were used, and I just figured that I'd be able to specify input and output subdirs on the local file system. I was incorrect.

I attempted to execute the Hadoop job with the following parameters referencing the input and output subdirs:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /home/cloudera/Desktop/wordcount/input /home/cloudera/Desktop/wordcount/output/
11/12/06 18:57:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 18:57:51 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 18:57:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 18:57:51 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 18:57:51 INFO mapred.JobClient: Cleaning up the staging area hdfs://0.0.0.0/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201112061431_0006
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0/home/cloudera/Desktop/wordcount/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


Notice that I used the path to the input subdir I had created under my Desktop (yeah, I know, I shouldn't be putting files under the Desktop subdir, but it makes it easy to access them via the GUI; obviously I would never do this on a real development system).

What I didn't realize at the time was that the paths given to Hadoop are relative to the HDFS file system and not the local ext3 file system. After reading the Cloudera Quick Start Guide PDF it all started to make sense.
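
An easy way to see what Hadoop is actually looking at is to list HDFS itself rather than the local disk:

 /usr/bin/hadoop dfs -ls /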

I needed to populate HDFS with the input subdir and the input files noted in the tutorial (the output subdir is created by the job itself). The tutorial references the paths /usr/joe/wordcount/input and /usr/joe/wordcount/output.

I created the input subdir using the proper DFS command:


/usr/bin/hadoop dfs -mkdir /usr/joe/wordcount/input


And I copied the previously created input files, file01 and file02:


/usr/bin/hadoop dfs -put file01 /usr/joe/wordcount/input
/usr/bin/hadoop dfs -put file02 /usr/joe/wordcount/input


Now it was time for the big event. With everything in place, I tried the tutorial command again:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
11/12/06 19:13:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 19:13:33 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 19:13:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 19:13:33 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 19:13:33 INFO mapred.FileInputFormat: Total input paths to process : 2
11/12/06 19:13:34 INFO mapred.JobClient: Running job: job_201112061431_0007
11/12/06 19:13:35 INFO mapred.JobClient: map 0% reduce 0%
11/12/06 19:13:46 INFO mapred.JobClient: map 33% reduce 0%
11/12/06 19:13:47 INFO mapred.JobClient: map 66% reduce 0%
11/12/06 19:13:52 INFO mapred.JobClient: map 100% reduce 0%
11/12/06 19:14:06 INFO mapred.JobClient: map 100% reduce 100%
11/12/06 19:14:09 INFO mapred.JobClient: Job complete: job_201112061431_0007
11/12/06 19:14:09 INFO mapred.JobClient: Counters: 23
11/12/06 19:14:09 INFO mapred.JobClient: Job Counters
11/12/06 19:14:09 INFO mapred.JobClient: Launched reduce tasks=1
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=25035
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Launched map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: Data-local map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19860
11/12/06 19:14:09 INFO mapred.JobClient: FileSystemCounters
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_READ=79
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_READ=348
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=215844
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
11/12/06 19:14:09 INFO mapred.JobClient: Map-Reduce Framework
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input groups=5
11/12/06 19:14:09 INFO mapred.JobClient: Combine output records=6
11/12/06 19:14:09 INFO mapred.JobClient: Map input records=2
11/12/06 19:14:09 INFO mapred.JobClient: Reduce shuffle bytes=91
11/12/06 19:14:09 INFO mapred.JobClient: Reduce output records=5
11/12/06 19:14:09 INFO mapred.JobClient: Spilled Records=12
11/12/06 19:14:09 INFO mapred.JobClient: Map output bytes=82
11/12/06 19:14:09 INFO mapred.JobClient: Map input bytes=50
11/12/06 19:14:09 INFO mapred.JobClient: Combine input records=8
11/12/06 19:14:09 INFO mapred.JobClient: Map output records=8
11/12/06 19:14:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=294
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input records=6


Wait? What is this? Could it be? Yes! It worked! YES!YES!YES!

My dancing around the room was enough to wake up my Basset Hound. He looked up at me, cocked his head as if to say, "Hey, why are you being so goofy? Make yourself useful and get me another doggie snack."

Checking the results of the job I see the following:


[root@localhost wordcount]# /usr/bin/hadoop dfs -ls /usr/joe/wordcount/output
Found 3 items
-rw-r--r-- 1 root supergroup 0 2011-12-06 19:14 /usr/joe/wordcount/output/_SUCCESS
drwxr-xr-x - root supergroup 0 2011-12-06 19:13 /usr/joe/wordcount/output/_logs
-rw-r--r-- 1 root supergroup 41 2011-12-06 19:14 /usr/joe/wordcount/output/part-00000


Look at that! _SUCCESS!

Checking the part-00000 file:


[root@localhost wordcount]# /usr/bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


Sure, it was a lot of work just to count a few words in a text file, but it was a really good learning experience this afternoon. After banging my head against the brick wall long enough, I got around the potholes I ran into, and I feel that I'll be able to continue the tutorial and learn the basics of Hadoop.

Not a bad bit of afternoon vacation learning.