Tuesday, December 6, 2011

Running a first Hadoop job with Cloudera's Tutorial VM

After watching the O'Reilly video on MapReduce I decided that I'd like to know more about Hadoop. After some Googling I found that a firm by the name of Cloudera has pre-populated VMs available for playing around with, located here.

The tutorial located here looks very much like the Apache Hadoop tutorial.

I was running through the tutorial and ran into a number of roadblocks. I eventually got around them, but thought it would be handy to document the issues I hit in case I ever revisit this tutorial in the future.

The first issue that I ran into was that I couldn't get the Cloudera VM up and running with VirtualBox. After some trial and error I found that I could get the CentOS-based VM up and running after selecting the IO APIC checkbox in the VM's configuration:



The second issue was that I couldn't install the VirtualBox Guest Additions since CentOS didn't have the kernel development packages installed. What a bummer.

I was able to install the necessary kernel files by issuing the following command:

yum install kernel-devel-2.6.18-274.7.1.el5

Then I found that the VirtualBox Guest Additions still couldn't be installed as there was no GCC compiler. That was easily fixed with:

yum install gcc

After that, the VirtualBox Guest Additions installed just fine and, after a reboot, I now have much better control over the VM. The default 640x480 is a bit stifling. Now I have it running full screen at 1920x1600 and it is much, much nicer.

Next step was working on the WordCount example.

As a side note: In the O'Reilly video we did a very similar word counting map reduce job and I have to say that I much prefer the terse Python code over the Java solution. But that is just personal preference.

The tutorial provides the source code for WordCount.java, and I ran into some issues where the tutorial deviates from the VM. The tutorial gives a tip on the proper environment variables for HADOOP_HOME and HADOOP_VERSION, but it is out of sync with the Cloudera VM.

The tutorial states that the proper version information is "0.20.2-cdh3u1" when it is actually "0.20.2-cdh3u2". Not really a big deal, but when following a tutorial on a brand-new subject this can be frustrating.
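
One way to avoid the guesswork is to let the VM tell you what it has installed. Here is a minimal Python sketch of that check; the jar naming pattern comes from the tutorial's classpath, while the /usr/lib/hadoop fallback path is just an assumption to adjust for your install.

# Quick check: what version string is actually baked into the core jar on this VM?
import glob
import os
import re

hadoop_home = os.environ.get('HADOOP_HOME', '/usr/lib/hadoop')  # fallback path is an assumption
for jar in glob.glob(os.path.join(hadoop_home, 'hadoop-*-core.jar')):
    match = re.search(r'hadoop-(.+)-core\.jar$', os.path.basename(jar))
    if match:
        print('%s -> HADOOP_VERSION should be %s' % (jar, match.group(1)))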

The next issue that I ran into was due to my forgetting most of my Java development skills. Java development is not something that I do day in and out so some of that information had been garbage collected off of my mental heap to make way for other information (probably due to memorizing useless movie quotes).

I created a sub-dir for the WordCount.java code, compiled it, and created a .jar as provided by the instructions, but my first attempts at executing a Hadoop job failed with Hadoop complaining that it couldn't find "org.myorg.WordCount", as seen below.

Compiling the WordCount.java code:

[root@localhost wordcount_classes]# javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar WordCount.java
[root@localhost wordcount_classes]# ls -al
total 24
drwxr-xr-x 2 root root 4096 Dec 6 18:34 .
drwxr-xr-x 3 root root 4096 Dec 6 18:32 ..
-rw-r--r-- 1 root root 1546 Dec 6 18:34 WordCount.class
-rw-r--r-- 1 root root 1869 Dec 6 18:33 WordCount.java
-rw-r--r-- 1 root root 1938 Dec 6 18:34 WordCount$Map.class
-rw-r--r-- 1 root root 1611 Dec 6 18:34 WordCount$Reduce.class
[root@localhost wordcount_classes]# cd ..
[root@localhost wordcount]# jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: WordCount.java(in = 1869) (out= 644)(deflated 65%)
adding: WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
[root@localhost wordcount]# jar tf wordcount.jar
META-INF/
META-INF/MANIFEST.MF
WordCount$Map.class
WordCount.java
WordCount.class
WordCount$Reduce.class
[root@localhost wordcount]#


When I attempted to run my first Hadoop job I got this output:


[root@localhost bad.wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output_1
Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)


What got me was the "Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount."

I looked over the code and, as far as I could tell, everything was just fine and there was no reason why the WordCount class shouldn't be found. After mulling the problem over for a while, my brain went into its persistence layer and pulled out the proper way of taking care of the issue.

The source code defines the package name as org.myorg but I hadn't created the proper sub-dirs to reflect the org.myorg package.

I created the org subdir, then myorg in the org subdir and then compiled the code again:



I re-created the .jar, pulling in all the files under the org subdir, and found that Hadoop was much happier and could finally find my org.myorg.WordCount class.
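
For future reference, here is a rough sketch of the corrected build sequence driven from Python (the tool invocations could just as easily be typed by hand). The key is javac's -d flag, which drops the class files into directories matching the package name; the HADOOP_HOME default and version string here are assumptions.

# Sketch of the WordCount build: compile with -d so the classes land in
# package directories (org/myorg), then jar everything up from that root.
import os
import subprocess

hadoop_home = os.environ.get('HADOOP_HOME', '/usr/lib/hadoop')       # assumption
hadoop_version = os.environ.get('HADOOP_VERSION', '0.20.2-cdh3u2')
core_jar = '%s/hadoop-%s-core.jar' % (hadoop_home, hadoop_version)

if not os.path.isdir('wordcount_classes'):
    os.makedirs('wordcount_classes')

# javac -d writes the class files under wordcount_classes/org/myorg/
subprocess.check_call(['javac', '-classpath', core_jar,
                       '-d', 'wordcount_classes', 'WordCount.java'])

# Package everything rooted at wordcount_classes so the jar contains org/myorg/...
subprocess.check_call(['jar', '-cvf', 'wordcount.jar', '-C', 'wordcount_classes/', '.'])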

The next issue that I ran into was due to not understanding where Hadoop would pull the input files from for the word count example. In the O'Reilly MapReduce video STDIN and STDOUT were used, and I just figured that I'd be able to specify the input and output subdirs from the local file system. I was incorrect.

I attempted to execute the Hadoop job with the following parameters referencing the input and output subdirs:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /home/cloudera/Desktop/wordcount/input /home/cloudera/Desktop/wordcount/output/
11/12/06 18:57:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 18:57:51 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 18:57:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 18:57:51 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 18:57:51 INFO mapred.JobClient: Cleaning up the staging area hdfs://0.0.0.0/var/lib/hadoop-0.20/cache/mapred/mapred/staging/root/.staging/job_201112061431_0006
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://0.0.0.0/home/cloudera/Desktop/wordcount/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.WordCount.main(WordCount.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


Notice that I used paths under my Desktop (yeah, I know, I shouldn't be putting files under the Desktop subdir, but it makes it easy to access them via the GUI; obviously I would never do this on a real development system), referencing the input subdir I had created.

What I didn't realize at the time was that the paths given to Hadoop are relative to the HDFS file system and not the local ext3 file system. After reading the Cloudera Quick Start Guide PDF it all started to make sense.

I needed to populate HDFS with the input and output subdirs along with the input files noted in the tutorial. The tutorial references the paths /usr/joe/wordcount/input and /usr/joe/wordcount/output.

I created the input subdir using the proper DFS command:


/usr/bin/hadoop dfs -mkdir /usr/joe/wordcount/input


And I copied the previously created input files, file01 and file02:


/usr/bin/hadoop dfs -put file01 /usr/joe/wordcount/input
/usr/bin/hadoop dfs -put file02 /usr/joe/wordcount/input


Now it was time for the big event. With everything in place, I tried the tutorial command again:


[root@localhost wordcount]# /usr/bin/hadoop jar wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
11/12/06 19:13:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/06 19:13:33 WARN snappy.LoadSnappy: Snappy native library is available
11/12/06 19:13:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/06 19:13:33 INFO snappy.LoadSnappy: Snappy native library loaded
11/12/06 19:13:33 INFO mapred.FileInputFormat: Total input paths to process : 2
11/12/06 19:13:34 INFO mapred.JobClient: Running job: job_201112061431_0007
11/12/06 19:13:35 INFO mapred.JobClient: map 0% reduce 0%
11/12/06 19:13:46 INFO mapred.JobClient: map 33% reduce 0%
11/12/06 19:13:47 INFO mapred.JobClient: map 66% reduce 0%
11/12/06 19:13:52 INFO mapred.JobClient: map 100% reduce 0%
11/12/06 19:14:06 INFO mapred.JobClient: map 100% reduce 100%
11/12/06 19:14:09 INFO mapred.JobClient: Job complete: job_201112061431_0007
11/12/06 19:14:09 INFO mapred.JobClient: Counters: 23
11/12/06 19:14:09 INFO mapred.JobClient: Job Counters
11/12/06 19:14:09 INFO mapred.JobClient: Launched reduce tasks=1
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=25035
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/06 19:14:09 INFO mapred.JobClient: Launched map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: Data-local map tasks=3
11/12/06 19:14:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19860
11/12/06 19:14:09 INFO mapred.JobClient: FileSystemCounters
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_READ=79
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_READ=348
11/12/06 19:14:09 INFO mapred.JobClient: FILE_BYTES_WRITTEN=215844
11/12/06 19:14:09 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
11/12/06 19:14:09 INFO mapred.JobClient: Map-Reduce Framework
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input groups=5
11/12/06 19:14:09 INFO mapred.JobClient: Combine output records=6
11/12/06 19:14:09 INFO mapred.JobClient: Map input records=2
11/12/06 19:14:09 INFO mapred.JobClient: Reduce shuffle bytes=91
11/12/06 19:14:09 INFO mapred.JobClient: Reduce output records=5
11/12/06 19:14:09 INFO mapred.JobClient: Spilled Records=12
11/12/06 19:14:09 INFO mapred.JobClient: Map output bytes=82
11/12/06 19:14:09 INFO mapred.JobClient: Map input bytes=50
11/12/06 19:14:09 INFO mapred.JobClient: Combine input records=8
11/12/06 19:14:09 INFO mapred.JobClient: Map output records=8
11/12/06 19:14:09 INFO mapred.JobClient: SPLIT_RAW_BYTES=294
11/12/06 19:14:09 INFO mapred.JobClient: Reduce input records=6


Wait? What is this? Could it be? Yes! It worked! YES!YES!YES!

My dancing around the room was enough to wake up my Basset Hound. He looked up at me, cocked his head as to say, "Hey, why are you being so goofy? Make yourself useful and get me another doggie snack."

Checking the results of the job I see the following:


[root@localhost wordcount]# /usr/bin/hadoop dfs -ls /usr/joe/wordcount/output
Found 3 items
-rw-r--r-- 1 root supergroup 0 2011-12-06 19:14 /usr/joe/wordcount/output/_SUCCESS
drwxr-xr-x - root supergroup 0 2011-12-06 19:13 /usr/joe/wordcount/output/_logs
-rw-r--r-- 1 root supergroup 41 2011-12-06 19:14 /usr/joe/wordcount/output/part-00000


Look at that! _SUCCESS!

Checking the part-00000 file:


[root@localhost wordcount]# /usr/bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


Sure, it was a lot of work just to count a few words in a text file, but it was a really good learning experience this afternoon. After banging my head against the brick wall long enough, I got around the potholes that I ran into and feel that I'll be able to continue the tutorial and learn the basics of Hadoop.

Not a bad bit of afternoon vacation learning.

Saturday, December 3, 2011

I'm dreaming of a geeky Xmas

Black Friday and Cyber Monday have come and gone and the team and I kicked ass... We kicked major ass. All that work over the past 11 months paid off over that single week of retail Hell and our systems simply kicked ass...

But now it is time for all the vacation that I didn't take during the year while prepping for the busiest few days of the year. I am taking off the entire month of December (and still rolling 40 hours of vacation into next year!)

What to do with all this time?

On Cyber Monday O'Reilly had a 60% off sale on books and videos, so I decided to purchase a few videos to get up to speed on some technologies that I haven't fiddled around with.

Over December I hope to go over the following videos:

An Introduction to MapReduce with Pete Warden
Erlang by Example with Cesarini and Thompson
Hilary Mason: An Introduction to Machine Learning with Web Data

I've had a very brief introduction to Machine Learning during a GDAT class with Dr. Gunther and found it very interesting and something that could be cool for automated monitoring of systems to determine when something is going astray.

I also plan on covering some Haskell with the Channel9 lecture series on Haskell and functional programming. I've done a little bit of FP with R and it is time to wrap my head around FP more with Haskell and Erlang.

The current plan on the Haskell front is to gather with fellow Geeks at the Addison, TX location of Thoughtworks for Geeknight to watch the Channel9 videos and learn the ways of FP first hand.

I'm sure that good times will ensue!

Wednesday, November 23, 2011

Wrapping perl around tqharvc for generating scatter plots

TeamQuest View is a handy utility for looking at metrics collected on servers by TeamQuest. But one thing that I haven't liked about TeamQuest View is creating graphs from collected metrics. Fortunately, with the installation of TeamQuest View a command line utility by the name of tqharvc is also installed. It is possible to execute tqharvc.exe from the command line to collect metrics and create plots from the output.

I got the idea of writing some perl code that slices and dices the TeamQuest .RPT files that can be generated by TeamQuest View, passes the resulting queries to tqharvc, and then feeds the output to a graphing routine to generate both plots and .CSV files for later processing if need be.

For the graphing of data I utilize GD::Graph for generating a scatter plot.

For the example in this blog entry I have a very simple .RPT file by the name of physc.RPT and, as the name implies, it simply reports physc usage from an LPAR:


[General]
Report Name = "physc"
View = Line
Format = Date+Time,
Parameter

[dParm1]
System = "[?]"
Category Group = "CPU"
Category = "by LPAR"
Subcategory = ""
Statistic = "physc"
Value Types = "Average"


Simple enough, isn't it?

My perl routine builds up a query for tqharvc by slicing and dicing the .RPT file and iterates through all the sliced and diced metrics, collects the output data and generates the .CSV output and plot image.

The tqharvc command line query is formed as:

tqharvc.exe -m [hostName] -s [startDate]:[startTime] -e [endDate]:[endTime] -a 10-minute -k [metric]

Say that I want to query "serverA" for the metric "CPU:by LPAR:physc" from midnight on 11/01/2011 to 11:59:59 PM on 11/07/2011, with a metric sample every 10 minutes.

I call tqharvc with the following command line:

tqharvc.exe -m serverA -s 11012011:000000 -e 11072011:235959 -a 10-minute -k "CPU:by LPAR:physc"
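
The string munging that turns dates and metrics into that command line is simple enough to show in isolation. Below is a stripped-down sketch, in Python rather than the perl further down, purely to illustrate the idea; the binary path is the assumed install location and the flags are just the ones shown above.

# Minimal sketch: build the tqharvc command line shown above from its parts.
TQHARVC = r'C:\Program Files\TeamQuest\manager\bin\tqharvc.exe'  # assumed install path

def tq_command(host, start_date, start_time, end_date, end_time, metric,
               resample='10-minute'):
    # Dates arrive as MM/DD/YYYY and times as HH:MM:SS; tqharvc wants them
    # squashed together as MMDDYYYY:HHMMSS.
    start = '%s:%s' % (start_date.replace('/', ''), start_time.replace(':', ''))
    end = '%s:%s' % (end_date.replace('/', ''), end_time.replace(':', ''))
    return '"%s" -m %s -s %s -e %s -a %s -k "%s"' % (
        TQHARVC, host, start, end, resample, metric)

print(tq_command('serverA', '11/01/2011', '00:00:00',
                 '11/07/2011', '23:59:59', 'CPU:by LPAR:physc'))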

My routine takes several command line parameters:

tqGatherMetrics.pl server.list physc.rpt 11/01/2011 00:00:00 11/06/2011 23:59:59

server.list is simply a text file with a list of servers to query:


serverA
serverB
serverC
serverD
serverE
serverF


physc.rpt is the aforementioned .RPT file. 11/01/2011 is the start date and 00:00:00 is the start time. 11/06/2011 is the end date and 23:59:59 is the end time of the queries.

Below is the output to STDERR from the perl routine as it crunches through the data from the servers:


Running query against serverA.
Executing TQ Query for serverA:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverA : CPU:by LPAR : 7
Generating graph for "serverA_CPU_by_LPAR_physc.gif"
Running query against serverB.
Executing TQ Query for serverB:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverB : CPU:by LPAR : 7
Generating graph for "serverB_CPU_by_LPAR_physc.gif"
Running query against serverC.
Executing TQ Query for serverC:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverC : CPU:by LPAR : 7
Generating graph for "serverC_CPU_by_LPAR_physc.gif"
Running query against serverD.
Executing TQ Query for serverD:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverD : CPU:by LPAR : 7
Generating graph for "serverD_CPU_by_LPAR_physc.gif"
Running query against serverE.
Executing TQ Query for serverE:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverE : CPU:by LPAR : 7
Generating graph for "serverE_CPU_by_LPAR_physc.gif"
Running query against serverF.
Executing TQ Query for serverF:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverF : CPU:by LPAR : 7
Generating graph for "serverF_CPU_by_LPAR_physc.gif"


The output of the plot appears as:



With the above command line I was able to generate plots for six servers in the server.list file.

If the .RPT file had listed multiple metrics, there would have been one plot generated per server for each metric. I chose the single metric physc as a simple example.

The perl code to generate the plot follows:

1:  use GD::Graph::points;  
2: use Statistics::Descriptive;
3: use Time::Local;
4: use POSIX qw/ceil/;
5:
6: use strict;
7:
8: my $sourceList = shift;
9: my $tqReport = shift;
10: my $startDate = shift;
11: my $startTime = shift;
12: my $endDate = shift;
13: my $endTime = shift;
14:
15: $startDate =~ s/\///g;
16: $endDate =~ s/\///g;
17:
18: $startTime =~ s/\://g;
19: $endTime =~ s/\://g;
20:
21: if (length($sourceList) == 0) {
22: die "Usage:tqGatherMerics.pl <sourceList> <tqReport> <startDate> <stareTime> <endDate> <endTime>\n";
23: };
24:
25: my $tqHarvestBinary = "C:\\Program Files\\TeamQuest\\manager\\bin\\tqharvc.exe";
26:
27: if ( -f "$tqHarvestBinary" ) {
28: if ( -f "$sourceList" ) {
29: if ( -f "$tqReport" ) {
30:
31: my @hostNames = ();
32: my %metricHash = ();
33:
34: open(HOSTNAMES, "$sourceList") || die "$!";
35: while (<HOSTNAMES>) {
36: chomp($_);
37: if ($_ !~ /^#/) {
38: push(@hostNames, $_);
39: };
40: };
41: close(HOSTNAMES);
42:
43: open(REPORT, "$tqReport") || die "$!";
44:
45: my $catGroup = "::";
46: my $catName = "::";
47: my $subCat = "::";
48:
49: while (<REPORT>) {
50:
51: my @statArray = ();
52:
53: chomp($_);
54:
55: if ($_ =~ /Category Group = \"(.+?)\"/) {
56: $catGroup = $1;
57: };
58:
59: if ($_ =~ /Category = \"(.+?)\"/) {
60: $catName = $1;
61: };
62:
63: if ($_ =~ /Subcategory = \"(.+?)\"/) {
64: $subCat = $1;
65: };
66:
67: if ($_ =~ /Statistic =/) {
68: my $tmpString = "";
69: $_ =~ s/Statistic =//g;
70: $tmpString = $_;
71: do {
72: $_ = <REPORT>;
73: chomp($_);
74: if ($_ !~ /^Resource|^Value Types/) {
75: $tmpString .= $_;
76: };
77: } until ($_ =~ /^Resource|^Value Types/);
78: my @statArray = split(/\,/, $tmpString);
79: $metricHash{"${catGroup}:${catName}"} = \@statArray;
80: };
81:
82: };
83:
84: close(REPORT);
85:
86: foreach my $hostName (@hostNames) {
87: print STDERR "Running query against $hostName.\n";
88: my %metricData = ();
89: foreach my $paramName (sort(keys(%metricHash))) {
90: my %columnHash = ();
91: my $linesExtracted = 0;
92: my $shellCmd = "\"$tqHarvestBinary\" -m $hostName -s $startDate:$startTime -e $endDate:$endTime -a 10-minute -k \"$paramName\"";
93: # my $shellCmd = "\"$tqHarvestBinary\" -m $hostName -s $startDate:$startTime -e $endDate:$endTime -a 1-minute -k \"$paramName\"";
94: print STDERR "\t\tExecuting TQ Query for $hostName:$paramName.\n";
95:
96: open(OUTPUT, "$shellCmd |") || die "$!";
97:
98: print STDERR "\t\tProcessing query results.\n";
99:
100: my $totalColumns = 0;
101:
102: while (<OUTPUT>) {
103: chomp($_);
104: if ($_ =~ /^Time:/) {
105: my @columns = split(/\,/, $_);
106: my $statName = "";
107: for (my $index = 0; $index < $#columns; $index++) {
108: foreach $statName (@{$metricHash{$paramName}}) {
109: $statName =~ s/^\s+//g; # ltrim
110: $statName =~ s/\s+$//g; # rtrim
111: $statName =~ s/\"//g;
112:
113: my $columnName = $columns[$index];
114: if (index($columnName, $statName, 0) >= 0) {
115: $columnHash{$index} = $columns[$index];
116: $totalColumns++;
117: };
118: };
119: };
120: } else {
121: if ($_ =~ /^[0-9]/) {
122: chomp($_);
123: my @columns = split(/\,/, $_);
124: foreach my $index (sort(keys(%columnHash))) {
125: $metricData{"$columns[0] $columns[1]"}{$columnHash{$index}} = $columns[$index];
126: };
127: $linesExtracted++;
128: };
129: };
130: };
131:
132: close(OUTPUT);
133:
134: if (($linesExtracted > 0) && ($totalColumns > 0)) {
135: print STDERR "\tExtracted $linesExtracted lines of data from query.\n";
136: my @domainData = ();
137:
138: foreach my $timeStamp (sort dateSort keys(%metricData)) {
139: push(@domainData, $timeStamp);
140: };
141:
142: foreach my $metricIndex (sort(keys(%columnHash))) {
143: print STDERR "\t\tProcessing results for $hostName : $paramName : $metricIndex\n";
144:
145: my $metricName = $columnHash{$metricIndex};
146: my @rangeData = ();
147: my $stat = Statistics::Descriptive::Full->new();
148:
149: foreach my $timeStamp (@domainData) {
150: push(@rangeData, $metricData{$timeStamp}{$columnHash{$metricIndex}});
151: $stat->add_data($metricData{$timeStamp}{$columnHash{$metricIndex}});
152: };
153:
154: my $graphName = "${hostName}_${paramName}_${metricName}";
155: my $csvName = "";
156:
157: $graphName =~ s/\\/_/g;
158: $graphName =~ s/\//_/g;
159: $graphName =~ s/\%/_/g;
160: $graphName =~ s/\:/_/g;
161: $graphName =~ s/\s/_/g;
162:
163: $csvName = $graphName;
164: $graphName .= ".gif";
165: $csvName .= ".csv";
166:
167: print STDERR "\t\tGenerating graph for \"$graphName\"\n";
168:
169: open(CSVOUTPUT, ">$csvName");
170: print CSVOUTPUT "Timestamp,$paramName:$metricName\n";
171:
172: my $i = 0;
173: foreach my $timeStamp (@domainData) {
174: print CSVOUTPUT "$domainData[$i],$rangeData[$i]\n";
175: $i++;
176: };
177:
178: close(CSVOUTPUT);
179:
180: my $dataMax = $stat->max();
181: my $dataMin = $stat->min();
182:
183: if ($dataMax < 1) {
184: $dataMax = 0.5;
185: };
186:
187: if ($dataMin > 0) {
188: $dataMin = 0;
189: };
190:
191: for (my $rangeIndex = 0; $rangeIndex < $#rangeData; $rangeIndex++) {
192: if ($rangeData[$rangeIndex] == 0) {
193: $rangeData[$rangeIndex] = $dataMax * 5;
194: };
195: };
196:
197: my @data = (\@domainData, \@rangeData);
198: my $tqGraph = GD::Graph::points->new(1024, int(768/2));
199: my $totalMeasurements = $#{$data[0]} + 1;
200:
201: $tqGraph->set(x_label_skip => int($#domainData/40),
202: x_labels_vertical => 1,
203: markers => [6],
204: marker_size => 2,
205: y_label => "$metricName",
206: y_min_value => $dataMin,
207: y_max_value => ceil($dataMax * 1.1),
208: title => "${hostName}:${paramName}:${metricName}, N = " . $stat->count(). "; avg = " . $stat->mean() . "; SD = " . $stat->standard_deviation() . "; 90th = " . $stat->percentile(90) . ".",
209: line_types => [1],
210: line_width => 1,
211: dclrs => ['red'],
212: ) or warn $tqGraph->error;
213:
214: $tqGraph->set_legend("TQ Measurement");
215: $tqGraph->set_legend_font(GD::gdMediumBoldFont);
216: $tqGraph->set_x_axis_font(GD::gdMediumBoldFont);
217: $tqGraph->set_y_axis_font(GD::gdMediumBoldFont);
218:
219: my $tqImage = $tqGraph->plot(\@data) or die $tqGraph->error;
220:
221: open(PICTURE, ">$graphName");
222: binmode PICTURE;
223: print PICTURE $tqImage->gif;
224: close(PICTURE);
225:
226: };
227:
228: } else {
229: print STDERR "#################### Nothing extracted for Hostname \"$hostName\" and metric \"$paramName\" ####################\n";
230: };
231:
232: };
233: };
234:
235: } else {
236: print STDERR "Could not find the TeamQuest Report file.\n";
237: };
238: } else {
239: print STDERR "Could not find the list of hostnames to run against.\n";
240: };
241: } else {
242: print STDERR "Could not find the TeamQuest Manager TQHarvest binary at \"$tqHarvestBinary\". Cannot continue.\n";
243: };
244:
245: sub dateSort {
246: my $a_value = dateToEpoch($a);
247: my $b_value = dateToEpoch($b);
248:
249: if ($a_value > $b_value) {
250: return 1;
251: } else {
252: if ($b_value > $a_value) {
253: return -1;
254: } else {
255: return 0;
256: };
257: };
258:
259: };
260:
261: sub dateToEpoch {
262: my ($timeStamp) = @_;
263: my ($dateString, $timeString) = split(/ /, $timeStamp);
264: my ($month, $day, $year) = split(/\//, $dateString);
265: my ($hour, $min, $sec) = split(/:/, $timeString);
266:
267: $year += 2000;
268: $month -= 1;
269:
270: return timegm($sec,$min,$hour,$day,$month,$year);
271:
272: };
273:

Sunday, November 20, 2011

Discrete Event Simulation of the Three Tier eBiz with SimPy

In this post I blogged on how I wrote a modeling solution in PDQ-R for a case study used in the book, Performance By Design.

I also wanted to expand my horizons and write a Discrete Event Simulation solution to go along with the heuristic solution in PDQ-R. The firm that I was working for when I came up with this solution had a DES package by the name of SimScript II, but they weren't willing to cough up a license so that I could learn their product. So I searched the web and found a Python package by the name of SimPy that can be used for Discrete Event Simulation.

There are some major differences between the heuristic solution and the DES solution. With the heuristic solution I was able to use some linear algebra to determine how many hits per second to send off to the various pages that consumed resources. With my SimPy solution I did not have that luxury, as I am simulating individual hits to each page one at a time. To get a proper distribution of hits I came up with this simple routine to dispatch the hits to pages:
1:  class RandomPath:
2:    def RowSum(self, Vector):
3:      rowSum = 0.0
4:      for i in range(len(Vector)):
5:        rowSum += Vector[i]
6:      return rowSum
7:    def NextPage(self, T, i):
8:      rowSum = self.RowSum(T[i])
9:      randomValue = G.Rnd.uniform(0, rowSum)
10:     sumT = 0.0
11:     for j in range(len(T[i])):
12:       sumT += T[i][j]
13:       if randomValue < sumT:
14:         break
15:     return j
With this routine I take the sum of the probabilities in a row of the hard-coded TransitionMatrix and then generate a random value between zero and that sum. I then walk the row, summing the probabilities until the running sum is greater than the random value. The column index where this happens determines the next page.


For example, below I have the TransitionMatrix as used in my python code:


88: # 0 1 2 3 4 5 6 7
89: TransitionMatrix = [ [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00], # 0
90: [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20], # 1
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2
92: [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60], # 3
93: [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15], # 4
94: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 5
95: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 6
96: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00] ]# 7


Say we are currently on the Search page, as represented by row #2 of the above stochastic matrix:


0 1 2 3 4 5 6 7
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2


Column 0 is 0, my sum is 0.
Column 1 is 0, my sum is 0.
Column 2 is 0.45, my sum is 0.45. 0.45 is less than my random value, 0.67.
Column 3 is 0.15, my sum is 0.60. 0.60 is less than my random value, 0.67.
Column 4 is 0.10, my sum is 0.70. 0.70 is greater than my random value, 0.67. I pass back 4 to my routine and the next row in the random walk will be row 4 (which represents the login page).

What we end up with is a Markov random walk over the transition matrix. Multiple comparisons of random walks versus the heuristic analysis came out fairly close. Of course, with a random variable it is never exactly the same, but it comes out fairly close with each iteration.
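
A quick way to convince yourself the walk behaves is to drive the same NextPage logic for a large number of transitions and look at the resulting page mix; the visit fractions should line up with the linear-algebra answer. A standalone sketch (the matrix is the Type B one from the listing below; the 100,000-step count is arbitrary):

# Drive the random walk for many transitions and tally how often each page comes up.
import random

PAGE_NAMES = ["Entry", "Home", "Search", "View", "Login", "Create", "Bid", "Exit"]
T = [[0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
     [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20],
     [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30],
     [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]]

def next_page(T, i, rnd):
    # Same idea as RandomPath.NextPage: pick the column where the running
    # probability sum passes a uniform random draw.
    row_sum = sum(T[i])
    draw = rnd.uniform(0, row_sum)
    running = 0.0
    for j, p in enumerate(T[i]):
        running += p
        if draw < running:
            break
    return j

rnd = random.Random(12345)
counts = [0] * len(PAGE_NAMES)
page = 1                                                 # start at Home, as the Generator does
for _ in range(100000):
    counts[page] += 1
    page = 1 if page == 7 else next_page(T, page, rnd)   # wrap Exit back around to Home

for name, hits in zip(PAGE_NAMES, counts):
    print('%-8s %d' % (name, hits))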

The great thing about this algorithm is that it can be used both for Discrete Event Simulation and for load testing as well. At an eCommerce site where I worked, I had created a perl routine to slice and dice IIS logs, and by using the unique shopper ID in the logged cookies I created a Markov transition matrix of actual customer traffic. The resulting matrix wasn't small. For our site, it was over 500 x 500 elements. That's quite a lot of different random paths to take.

The idea was to take this information and modify our LoadRunner HTTP Vusers to make use of it, so that our scripts would properly emulate actual customers as opposed to a crude page summary analysis. With this mechanism our load tests would properly represent customers, and we would have been able to re-run the analysis once a week to make sure that our scenarios always reflected how customers browsed the site.

Unfortunately, I never got to put my ideas into action and only got as far as creating the stochastic transition matrix and some crude perl code that demonstrated random walks representing customers. All in all, that was the easy part. The real hard part of the project would have been coding the HTTP Vusers to properly handle the state transitions from page to page for the non-happy routes that customers sometimes follow.
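
The matrix-building half of that abandoned project is straightforward to sketch, though: once the IIS logs are grouped into per-shopper page sequences, the transition matrix is just normalized pair counts. A hypothetical outline with made-up session data:

# Build a Markov transition matrix from per-session page sequences.
from collections import defaultdict

# Hypothetical sessions, keyed by shopper ID, already pulled out of the web logs.
sessions = {
    'shopper1': ['Home', 'Search', 'Search', 'View', 'Exit'],
    'shopper2': ['Home', 'Login', 'Bid', 'Exit'],
}

counts = defaultdict(lambda: defaultdict(int))
for pages in sessions.values():
    for here, there in zip(pages, pages[1:]):
        counts[here][there] += 1

# Normalize each row so it becomes a probability distribution.
matrix = {}
for here, row in counts.items():
    total = float(sum(row.values()))
    matrix[here] = dict((there, n / total) for there, n in row.items())

print(matrix['Home'])   # e.g. {'Search': 0.5, 'Login': 0.5}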

Below is the SimPy solution that I came up with for the case study presented:

1:  #!/usr/bin/env python
2: from SimPy.Simulation import *
3: from random import Random, expovariate, uniform
4: class Metrics:
5: metrics = dict()
6: def Add(self, metricName, frameNumber, value):
7: if self.metrics.has_key(metricName):
8: if self.metrics[metricName].has_key(frameNumber):
9: self.metrics[metricName][frameNumber].append(value)
10: else:
11: self.metrics[metricName][frameNumber] = list()
12: self.metrics[metricName][frameNumber].append(value)
13: else:
14: self.metrics[metricName] = dict()
15: self.metrics[metricName][frameNumber] = list()
16: self.metrics[metricName][frameNumber].append(value)
17: def Keys(self):
18: return self.metrics.keys()
19: def Mean(self, metricName):
20: valueArray = list()
21: if self.metrics.has_key(metricName):
22: for frame in self.metrics[metricName].keys():
23: for values in range(len(self.metrics[metricName][frame])):
24: valueArray.append(self.metrics[metricName][frame][values])
25: sum = 0.0
26: for i in range(len(valueArray)):
27: sum += valueArray[i]
28: if len(self.metrics[metricName][frame]) != 0:
29: return sum/len(self.metrics[metricName])
30: else:
31: return 0 # Need to learn python throwing exceptions
32: else:
33: return 0
34: class RandomPath:
35: def RowSum(self, Vector):
36: rowSum = 0.0
37: for i in range(len(Vector)):
38: rowSum += Vector[i]
39: return rowSum
40: def NextPage(self, T, i):
41: rowSum = self.RowSum(T[i])
42: randomValue = G.Rnd.uniform(0, rowSum)
43: sumT = 0.0
44: for j in range(len(T[i])):
45: sumT += T[i][j]
46: if randomValue < sumT:
47: break
48: return j
49: class G:
50: numWS = 1
51: numAS = 1
52: numDS = 2
53: Rnd = random.Random(12345)
54: PageNames = ["Entry", "Home", "Search", "View", "Login", "Create", "Bid", "Exit" ]
55: Entry = 0
56: Home = 1
57: Search = 2
58: View = 3
59: Login = 4
60: Create = 5
61: Bid = 6
62: Exit = 7
63: WS = 0
64: AS = 1
65: DS = 2
66: CPU = 0
67: DISK = 1
68: WS_CPU = 0
69: WS_DISK = 1
70: AS_CPU = 2
71: AS_DISK = 3
72: DS_CPU = 4
73: DS_DISK = 5
74: metrics = Metrics()
75: # e h s v l c b e
76: HitCount = [0, 0, 0, 0, 0, 0, 0, 0]
77: Resources = [[ Resource(1), Resource(1) ], # WS CPU and DISK
78: [ Resource(1), Resource(1) ], # AS CPU and DISK
79: [ Resource(1), Resource(1) ]] # DS CPU and DISK
80: # Enter Home Search View Login Create Bid Exit
81: ServiceDemand = [ [0.000, 0.008, 0.009, 0.011, 0.060, 0.012, 0.015, 0.000], # WS_CPU
82: [0.000, 0.030, 0.010, 0.010, 0.010, 0.010, 0.010, 0.000], # WS_DISK
83: [0.000, 0.000, 0.030, 0.035, 0.025, 0.045, 0.040, 0.000], # AS_CPU
84: [0.000, 0.000, 0.008, 0.080, 0.009, 0.011, 0.012, 0.000], # AS_DISK
85: [0.000, 0.000, 0.010, 0.009, 0.015, 0.070, 0.045, 0.000], # DS_CPU
86: [0.000, 0.000, 0.035, 0.018, 0.050, 0.080, 0.090, 0.000] ] # DS_DISK
87: # Type B shopper
88: # 0 1 2 3 4 5 6 7
89: TransitionMatrix = [ [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00], # 0
90: [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20], # 1
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2
92: [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60], # 3
93: [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15], # 4
94: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 5
95: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 6
96: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00] ] # 7
97: class DoWork(Process):
98: def __init__(self, i, resource, serviceDemand, nodeName, pageName):
99: Process.__init__(self)
100: self.frame = i
101: self.resource = resource
102: self.serviceDemand = serviceDemand
103: self.nodeName = nodeName
104: self.pageName = pageName
105: def execute(self):
106: StartUpTime = now()
107: yield request, self, self.resource
108: yield hold, self, self.serviceDemand
109: yield release, self, self.resource
110: R = now() - StartUpTime
111: G.metrics.Add(self.pageName, self.frame, R)
112: class CallPage(Process):
113: def __init__(self, i, node, pageName):
114: Process.__init__(self)
115: self.frame = i
116: self.StartUpTime = 0.0
117: self.currentPage = node
118: self.pageName = pageName
119: def execute(self):
120: if self.currentPage != G.Exit:
121: print >> sys.stderr, "Working on Frame # ", self.frame, " @ ", now() , " for page ", self.pageName
122: self.StartUpTime = now()
123: if G.ServiceDemand[G.WS_CPU][self.currentPage] > 0.0:
124: wsCPU = DoWork(self.frame, G.Resources[G.WS][G.CPU], G.ServiceDemand[G.WS_CPU][self.currentPage]/G.numWS, "wsCPU", self.pageName)
125: activate(wsCPU, wsCPU.execute())
126: if G.ServiceDemand[G.WS_DISK][self.currentPage] > 0.0:
127: wsDISK = DoWork(self.frame, G.Resources[G.WS][G.DISK], G.ServiceDemand[G.WS_DISK][self.currentPage]/G.numWS, "wsDISK", self.pageName)
128: activate(wsDISK, wsDISK.execute())
129: if G.ServiceDemand[G.AS_CPU][self.currentPage] > 0.0:
130: asCPU = DoWork(self.frame, G.Resources[G.AS][G.CPU], G.ServiceDemand[G.AS_CPU][self.currentPage]/G.numAS, "asCPU", self.pageName)
131: activate(asCPU, asCPU.execute())
132: if G.ServiceDemand[G.AS_DISK][self.currentPage] > 0.0:
133: asDISK = DoWork(self.frame, G.Resources[G.AS][G.DISK], G.ServiceDemand[G.AS_DISK][self.currentPage]/G.numAS, "asDISK", self.pageName)
134: activate(asDISK, asDISK.execute())
135: if G.ServiceDemand[G.DS_CPU][self.currentPage] > 0.0:
136: dsCPU = DoWork(self.frame, G.Resources[G.DS][G.CPU], G.ServiceDemand[G.DS_CPU][self.currentPage]/G.numDS, "dsCPU", self.pageName)
137: activate(dsCPU, dsCPU.execute())
138: if G.ServiceDemand[G.DS_DISK][self.currentPage] > 0.0:
139: dsDISK = DoWork(self.frame, G.Resources[G.DS][G.DISK], G.ServiceDemand[G.DS_DISK][self.currentPage]/G.numDS, "dsDISK", self.pageName)
140: activate(dsDISK, dsDISK.execute())
141: G.HitCount[self.currentPage] += 1
142: yield hold, self, 0.00001 # Needed to prevent an error. Doesn't add any blocking to the six queues above
143: class Generator(Process):
144: def __init__(self, rate, maxT, maxN):
145: Process.__init__(self)
146: self.name = "Generator"
147: self.rate = rate
148: self.maxN = maxN
149: self.maxT = maxT
150: self.g = Random(11335577)
151: self.i = 0
152: self.currentPage = G.Home
153: def execute(self):
154: while (now() < self.maxT):
155: self.i += 1
156: p = CallPage(self.i,self.currentPage,G.PageNames[self.currentPage])
157: activate(p,p.execute())
158: yield hold,self,self.g.expovariate(self.rate)
159: randomPath = RandomPath()
160: if self.currentPage == G.Exit:
161: self.currentPage = G.Home
162: else:
163: self.currentPage = randomPath.NextPage(G.TransitionMatrix, self.currentPage)
164: def main():
165: maxWorkLoad = 10000
166: Lambda = 4.026*float(sys.argv[1])
167: maxSimTime = float(sys.argv[2])
168: initialize()
169: g = Generator(Lambda, maxSimTime, maxWorkLoad)
170: activate(g,g.execute())
171: simulate(until=maxSimTime)
172: print >> sys.stderr, "Simulated Seconds : ", maxSimTime
173: print >> sys.stderr, "Page Hits :"
174: for i in range(len(G.PageNames)):
175: print >> sys.stderr, "\t", G.PageNames[i], " = ", G.HitCount[i]
176: print >> sys.stderr, "Throughput : "
177: for i in range(len(G.PageNames)):
178: print >> sys.stderr, "\t", G.PageNames[i], " = ", G.HitCount[i]/maxSimTime
179: print >> sys.stderr, "Mean Response Times:"
180: for i in G.metrics.Keys():
181: print >> sys.stderr, "\t", i, " = ", G.metrics.Mean(i)
182: print G.HitCount[G.Home]/maxSimTime, ",", G.metrics.Mean("Home"), ",", G.metrics.Mean("View"), ",", G.metrics.Mean("Search"), ",", G.metrics.Mean("Login"), ",", G.metrics.Mean("Create"), ",", G.metrics.Mean("Bid")
183: if __name__ == '__main__': main()


In a future blog post I will compare the results of the Discrete Event Simulation versus my PDQ-R solution and finally compare both of those against a work up done with TeamQuest Model just for good measure.

Tuesday, November 15, 2011

Using PDQ-R to model a three tier eBiz Application

This post is a continuation of this post that I had done earlier.

Continuing the solution for Case Study IV: An E-Business Service from the book, Performance by Design: Computer Capacity Planning By Example.

In the previous post I discuss how to take the transition probability matrix and work backwards toward the original series of linear equations for solving the number of visits to a series of web pages. In this case study there are actually two types of visitors, which results in two transition probability matrices that must be utilized: 25% of visitors are Type A and 75% are Type B.

Each tier of the hypothetical e-biz service is made up of a single CPU and a single disk drive. A matrix is supplied with the total service demand placed on each component by each page that is hit by visitors.

While it is simple to write some code that analyzes web logs to generate the transition probability matrix based upon customer traffic, it is very difficult to isolate the total demand at each component under chaotic customer traffic. But that is why we have load testing tools available to us. In a pseudo-production environment we are capable of simulating customer traffic to one page at a time and calculating the total demand on each component. In this particular case only the CPU and disk drives are being modeled, but for a real service we'd want to model the CPU, disk drives, memory system, network system, etc.

After running simulated customer traffic against isolated page hits we could generate a similar demand matrix for components and use it for what-if analysis.
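
The measurement side of that is just the Service Demand Law, D = U/X: drive a single page at a steady throughput, measure each component's utilization, and divide. A tiny sketch with made-up numbers:

# Service Demand Law: D = U / X, computed per component from an isolated
# load test against a single page. All figures below are made up.
throughput = 50.0                     # page hits per second during the test
utilization = {'WS_CPU': 0.40,        # measured busy fractions (0..1)
               'WS_Disk': 0.50,
               'AS_CPU': 0.15}

demand = dict((component, busy / throughput)
              for component, busy in utilization.items())

for component in sorted(demand):
    print('%s demand = %.4f seconds per hit' % (component, demand[component]))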

For this solution I had opted to use PDQ-R with the R programming language.

My R code for the solution:

 # Solution parameters   
gamma <- 10.96; # Rate into system
numWS <- 1; # Number of Web Servers
numAS <- 1; # Number of Application Servers
numDS <- 1; # Number of Database Servers
# external library
library("pdq");
# Constants #
E <- 1;
H <- 2;
S <- 3;
V <- 4;
G <- 5;
C <- 6;
B <- 7;
X <- 8;
PAGE_NAMES <- c("Enter", "HomePage", "Search", "ViewBids", "Login", "CreateAuction", "PlaceBid", "Exit");
COMPONENTS <- c("CPU", "Disk");
SERVER_TYPES <- c("WS", "AS", "DS");
WS_CPU <- 1;
WS_DISK <- 2;
AS_CPU <- 3;
AS_DISK <- 4;
DS_CPU <- 5;
DS_DISK <- 6;
# Functions used in solution
VisitsByTransitionMatrix <- function(M, B) {
A <- t(M);
A <- -1 * A;
for (i in 1:sqrt(length(A))) {
j <- i;
A[i,j] <- A[i,j] + 1;
};
return(solve(A,B));
};
CalculateLambda <- function(gamma, f_a, f_b, V_a, V_b, index) {
return (
gamma*((f_a*V_a[index]) + (f_b*V_b[index]))
);
};
f_a <- 0.25; # Fraction of TypeA users
f_b <- 1 - f_a; # Fraction of TypeB users
lambda <- 1:X; # Array of lambda for each page
SystemInput <- matrix(c(1,0,0,0,0,0,0,0),nrow=8,ncol=1) # 8.3, Figure 8.2, page 208
TypeA <- matrix(c(0,1,0,0,0,0,0,0,
                  0,0,0.7,0,0.1,0,0,0.2,
                  0,0,0.4,0.2,0.15,0,0,0.25,
                  0,0,0,0,0.65,0,0,0.35,
                  0,0,0,0,0,0.3,0.6,0.1,
                  0,0,0,0,0,0,0,1,
                  0,0,0,0,0,0,0,1,
                  0,0,0,0,0,0,0,0), ncol=8, nrow=8, byrow=TRUE); # 8.4, Table 8.1, page 209
TypeB <- matrix(c(0,1,0,0,0,0,0,0,
                  0,0,0.7,0,0.1,0,0,0.2,
                  0,0,0.45,0.15,0.1,0,0,0.3,
                  0,0,0,0,0.4,0,0,0.6,
                  0,0,0,0,0,0.3,0.55,0.15,
                  0,0,0,0,0,0,0,1,
                  0,0,0,0,0,0,0,1,
                  0,0,0,0,0,0,0,0), nrow=8, ncol=8, byrow=TRUE); # 8.4, Table 8.2, page 210
DemandTable <- matrix(c(0,0.008,0.009,0.011,0.06,0.012,0.015,0,   # WS_CPU
                        0,0.03,0.01,0.01,0.01,0.01,0.01,0,        # WS_DISK
                        0,0,0.03,0.035,0.025,0.045,0.04,0,        # AS_CPU
                        0,0,0.008,0.08,0.009,0.011,0.012,0,       # AS_DISK
                        0,0,0.01,0.009,0.015,0.07,0.045,0,        # DS_CPU
                        0,0,0.035,0.018,0.05,0.08,0.09,0), ncol=8, nrow=6, byrow=TRUE); # 8.4, Table 8.4, page 212 (with modifications)
VisitsA <- VisitsByTransitionMatrix(TypeA, SystemInput);
VisitsB <- VisitsByTransitionMatrix(TypeB, SystemInput);
lambda[E] <- 0; # Not used in calculations
lambda[H] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, H);
lambda[S] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, S);
lambda[V] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, V);
lambda[G] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, G);
lambda[C] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, C);
lambda[B] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, B);
lambda[X] <- 0 # Not used in calculations
Init("e_biz_service");
# Define workstreams
for (n in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[n]);
CreateOpen(workStreamName, lambda[n]);
};
# Define Web Server Queues
for (i in 1:numWS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("WS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Define Application Server Queues
for (i in 1:numAS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("AS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Define Database Server Queues
for (i in 1:numDS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("DS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Set Demand for the Web Servers
for (i in 1:numWS) {
demandIndex <- WS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("WS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numWS);
};
};
};
# Set Demand for the App Servers
for (i in 1:numAS) {
demandIndex <- AS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("AS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numAS);
};
};
};
# Set Demand for the Database Servers
for (i in 1:numDS) {
demandIndex <- DS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("DS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numDS);
};
};
};
SetWUnit("Trans");
SetTUnit("Second");
Solve(CANON);
print("Arrival Rates for each page:");
for (i in H:B) {
print(sprintf("%s = %f", PAGE_NAMES[i], lambda[i]));
};
print("[-------------------------------------------------]");
print("Page Response Times");
for (i in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[i]);
print(sprintf("%s = %f seconds.", PAGE_NAMES[i], GetResponse(TRANS, workStreamName)));
};
print("[-------------------------------------------------]");
print("Component Utilizations");
for (i in 1:numWS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("WS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};
for (i in 1:numAS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("AS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};
for (i in 1:numDS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("DS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};


Here is a bit of sample output from the R code, with 10.96 users entering the system per second:

[1] "Arrival Rates for each page:"
[1] "HomePage = 10.960000"
[1] "Search = 13.658485"
[1] "ViewBids = 2.208606"
[1] "Login = 3.664958"
[1] "CreateAuction = 1.099487"
[1] "PlaceBid = 2.074180"
[1] "[-------------------------------------------------]"
[1] "Page Response Times"
[1] "HomePage = 0.083517 seconds."
[1] "Search = 1.612366 seconds."
[1] "ViewBids = 1.044683 seconds."
[1] "Login = 2.323417 seconds."
[1] "CreateAuction = 3.622690 seconds."
[1] "PlaceBid = 3.983755 seconds."
[1] "[-------------------------------------------------]"
[1] "Component Utilizations"
[1] "WS_1_CPU = 49.91 %"
[1] "WS_1_Disk = 55.59 %"
[1] "AS_1_CPU = 71.11 %"
[1] "AS_1_Disk = 35.59 %"
[1] "DS_1_CPU = 38.17 %"
[1] "DS_1_Disk = 97.57 %"


In this analysis we can see that the database disk IO is approaching 100%. A simple solution (for this analysis at least) is to add another database server to spread the load evenly.
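
Before re-running the model, the expected effect is easy to estimate: node utilization in an open network follows the utilization law, U = lambda * D, so spreading the same demand across N identical database servers divides each server's utilization by N. A quick back-of-the-envelope check in Python:

# Splitting the database demand across two servers should roughly halve the
# per-server utilization (queueing changes the response times, not this).
ds_disk_util = 97.57          # percent, from the single-server run above
num_ds = 2
print('expected per-server DS disk utilization: %.2f %%' % (ds_disk_util / num_ds))
# close to the 48.78 % reported by the two-server run below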

I modify the line that reads "numDS <- 1; # Number of Database Servers" to read "numDS <- 2; # Number of Database Servers" and re-run the analysis:

 [1] "Arrival Rates for each page:"  
[1] "HomePage = 10.960000"
[1] "Search = 13.658485"
[1] "ViewBids = 2.208606"
[1] "Login = 3.664958"
[1] "CreateAuction = 1.099487"
[1] "PlaceBid = 2.074180"
[1] "[-------------------------------------------------]"
[1] "Page Response Times"
[1] "HomePage = 0.083517 seconds."
[1] "Search = 0.237452 seconds."
[1] "ViewBids = 0.336113 seconds."
[1] "Login = 0.358981 seconds."
[1] "CreateAuction = 0.462042 seconds."
[1] "PlaceBid = 0.440903 seconds."
[1] "[-------------------------------------------------]"
[1] "Component Utilizations"
[1] "WS_1_CPU = 49.91 %"
[1] "WS_1_Disk = 55.59 %"
[1] "AS_1_CPU = 71.11 %"
[1] "AS_1_Disk = 35.59 %"
[1] "DS_1_CPU = 19.09 %"
[1] "DS_1_Disk = 48.78 %"
[1] "DS_2_CPU = 19.09 %"
[1] "DS_2_Disk = 48.78 %"


Not only do we bring the database server disk IO utilization down, we get an appreciable decrease in page response time for the end user. The Search page went from 1.61 seconds to 0.24 seconds. Not too shabby. The biggest difference was with the PlaceBid page, which went from 3.98 seconds to 0.44 seconds, a change of over three and a half seconds.

In a future post I will go over a modified PDQ-R solution that also generates graphs showing the changes in resource utilization and page response times with increasing load. I also have a Discrete Event Simulation solution to the aforementioned eBiz application, done with Python and the SimPy library, to compare the output of the heuristic analysis against event simulation.

Monday, October 24, 2011

Easier parsing of Microsoft perfmon logs with regex

In this blog post I discussed perl code used to slice and dice perfmon logs by passing the Excel column names (i.e., A, B, XC, AAZ, etc.) on the command line. That made slicing and dicing perfmon log files a lot easier, but it was still missing something.

With a large perfmon capture I was having to write down a lot of column names to slice out. I decided there had to be a better way and adapted my code to use regular expressions for column selection.

For example, I recently had a capacity study of a server and all that I wanted was the disk IO and processor usage.

With the new code (below) I can now pull that quite easily:

perfMonLogRegEx.pl somePerfMonLog.csv 14:00:00 15:00:00 "disk|proc"

Now I automagically get the columns that contain the keywords "disk" and "proc." That pulls out the following from my perfmon log:


LogicalDisk(C:)\% Disk Time
LogicalDisk(D:)\% Disk Time
LogicalDisk(_Total)\% Disk Time
LogicalDisk(C:)\% Idle Time
LogicalDisk(D:)\% Idle Time
LogicalDisk(_Total)\% Idle Time
LogicalDisk(C:)\Avg. Disk Queue Length
LogicalDisk(D:)\Avg. Disk Queue Length
LogicalDisk(_Total)\Avg. Disk Queue Length
LogicalDisk(C:)\Avg. Disk sec/Read
LogicalDisk(D:)\Avg. Disk sec/Read
LogicalDisk(_Total)\Avg. Disk sec/Read
LogicalDisk(C:)\Avg. Disk sec/Write
LogicalDisk(D:)\Avg. Disk sec/Write
LogicalDisk(_Total)\Avg. Disk sec/Write
LogicalDisk(C:)\Disk Reads/sec
LogicalDisk(D:)\Disk Reads/sec
LogicalDisk(_Total)\Disk Reads/sec
LogicalDisk(C:)\Disk Writes/sec
LogicalDisk(D:)\Disk Writes/sec
LogicalDisk(_Total)\Disk Writes/sec


Want just the LogicalDisk(_Total) columns? Easy enough:

perfMonLogRegEx.pl somePerfMonLog.csv 14:00:00 15:00:00 "LogicalDisk\(_Total\)"

Notice that I have to escape the parentheses since I want to match them literally rather than have them treated as a capture group in the regex.

And I get:


LogicalDisk(_Total)\% Disk Time
LogicalDisk(_Total)\% Idle Time
LogicalDisk(_Total)\Avg. Disk Queue Length
LogicalDisk(_Total)\Avg. Disk sec/Read
LogicalDisk(_Total)\Avg. Disk sec/Write
LogicalDisk(_Total)\Disk Reads/sec
LogicalDisk(_Total)\Disk Writes/sec


Want all columns between the upper and lower time limits? Easy enough:

perfMonLogRegEx.pl somePerfMonLog.csv 14:00:00 15:00:00 ".*"
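
The same column-selection idea ports to just about any language with regex and CSV support. Purely as an illustration (it is not the perl below), here is a bare-bones Python sketch that keeps the timestamp column plus any column whose header matches a pattern; the file name is hypothetical and there is no time filtering:

# Keep the timestamp column plus any column whose header matches the regex.
import csv
import re
import sys

pattern = re.compile(r'disk|proc', re.IGNORECASE)   # same example regex as above

with open('somePerfMonLog.csv', 'rb') as log:       # hypothetical file name
    reader = csv.reader(log)
    header = next(reader)
    keep = [0] + [i for i, name in enumerate(header) if pattern.search(name)]
    writer = csv.writer(sys.stdout)
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])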

The modified perl code for this handy-dandy slicing and dicing is found below and uses the same command line params as the previous code, replacing the Excel column names with a regex for column selection.

1:  use strict;  
2: my $logFile = shift;
3: my $startTime = &timeIntoSeconds(shift);
4: my $endTime = &timeIntoSeconds(shift);
5: my $keyWordRegEx = shift;
6: my @targetColumns = ();
7: my %columnHash = ();
8: my %columnNames = ();
9: open(LOG, "$logFile") || die "$!";
10: while (my $input = <LOG>) {
11: chomp($input);
12: $input =~ s/\"//g;
13: my @columns = split(/\,/, $input);
14: if ($input =~ /PDH-CSV 4\.0/i) {
15: my $maxColumns = $#columns+1;
16: for (my $i = 0; $i < $maxColumns; $i++) {
17: if ($columns[$i] =~ qr/$keyWordRegEx/i) {
18: push(@targetColumns, ($i+1));
19: };
20: };
21: foreach my $columnNumber (@targetColumns) {
22: my @dataArray = ();
23: $columnHash{$columnNumber} = \@dataArray;
24: };
25: print STDOUT "$columns[0]";
26: foreach my $columnNumber (sort {$a <=> $b} (keys(%columnHash))) {
27: $columnNames{$columnNumber} = $columns[$columnNumber-1];
28: print STDOUT ",$columns[$columnNumber-1]";
29: };
30: print STDOUT "\n";
31: } else {
32: my ($dateStamp, $timeStamp) = split(/ /, $columns[0]);
33: my $metricSeconds = &timeIntoSeconds($timeStamp);
34: if ( $metricSeconds >= $startTime && $metricSeconds <= $endTime) {
35: print STDOUT "$columns[0]";
36: foreach my $columnNumber (sort {$a <=> $b} (keys(%columnHash))) {
37: print STDOUT ",$columns[$columnNumber-1]";
38: };
39: print STDOUT "\n";
40: };
41: };
42: if (($. % 1000) == 0) {
43: print STDERR "Working on line $.\n";
44: };
45: };
46: sub timeIntoSeconds() {
47: my ($timeStamp) = @_;
48: my ($hours, $mins, $sec) = split(/:/, $timeStamp);
49: return (($hours * 3600) + ($mins * 60)) + $sec;
50: };

Thursday, August 25, 2011

Low Power Consumption Server Gotcha

Something interesting happened this week. I've been fighting an issue where newer 12 and 16 core servers couldn't compete with older 8-core Windows 2003 servers. Turns out there were two gotchas.

The first one is fairly straightforward. A recent Microsoft security patch was installed that affected the way .NET handles garbage collection. By default the .NET 2.0 CLR could use up to 8 cores for garbage collection, and subsequent updates increased that number. But the patch in question hobbled .NET's ability to properly handle garbage collection with more than 8 cores, resulting in % Time in GC shooting up and negatively affecting server capacity. The hotfix for our problem can be found here. The GC work was being passed around to the (n-8) extra cores in the box.

We put the hotfix in place and, voila, the service demand measurements dropped by about 40%.

But the capacity of the 12- and 16-core machines was still lagging behind the original 8-core machines. We suspected that the .NET stack needed further tuning, but then something interesting happened this week.

It was found that our servers were shipped from the factory in power-saving mode. Under low core utilization the core clocks were running anywhere from 800 MHz to 1.1 GHz; it's not until the cores are pushed past 60% utilization that the clocks are ramped up to the maximum of 2.4 GHz. Most of my capacity studies of production traffic were in the 40 to 50% CPU range, not quite enough to force the core clocks to start jacking up the clock rate.

The result? It appeared that the capacity of the larger machines was less than the 8-core machines. After the core clock was set at the environmentally less friendly rate of 2.4 GHz the measured service demand under production traffic dropped a significant 56.5%. Not too shabby. However, it would have been nice to know about the server issue to begin with to eliminate a lot of confusion and wasted fundage.

I have been calculating capacity requirements based upon the low power consumption clock speed and a lot of extra boxes have been ordered. Now that we know about the power savings issue we can configure the boxes for maximum performance. And the extra boxes? That'll just add a bunch of extra capacity to the application.

Good times!


Sunday, June 19, 2011

The Value of Log Analysis

Something that I run into time and time again is capacity planners and load testers that simply have no concept of how to perform log analysis. There is simply no excuse for this issue. Perl has been available for 20+ years and is my bread and butter when it comes to slicing and dicing logs. But if you are not a perl guru, there are plenty of alternatives available to assist the analyst that wants to crunch some log files.

I've heard a lot of good things about Microsoft LogParser. No perl required.

Personally, I prefer perl for slicing and dicing logs but I've been coding perl routines for 13+ years.

But back to log analysis: Example that happened to me earlier this week.

The load test team is testing a major install that is going into production shortly, and management is hot and heavy to get it tested. Load testing is finally done and their results show that the CPU utilization of servers in a particular tier increased by 2% while the arrival rate decreased by 15%. The result? Measured CPU service demand increased by roughly 22%, which has an effect on the number of servers required for production.

This increase in CPU utilization and decrease in arrival rate went completely unnoticed by the load testing team but was fortunately caught by me. The question, though, is this: what is causing the extra 2% CPU utilization with a 15% decrease in arrival rate?
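
For what it's worth, the "roughly 22%" falls straight out of the Service Demand Law, D = U/X. A quick sketch, assuming a baseline of about 50% CPU at 100 requests per second (the exact baseline numbers are assumptions; the ratio is what matters):

# Service demand D = U / X. Utilization up ~2 points, arrival rate down 15%.
u_before, x_before = 0.50, 100.0       # assumed baseline: 50% CPU at 100 req/s
u_after,  x_after  = 0.52, 85.0        # ~2 points more CPU at 15% fewer req/s

d_before = u_before / x_before
d_after = u_after / x_after
print('service demand change: %.1f %%' % (100 * (d_after / d_before - 1)))
# prints roughly 22.4 %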

The load testers have no idea and simply scratch their heads. After watching them flail helplessly I decided to jump into the issue. After all, I did identify the 22% increase in CPU service demand between release versions.

The first place that I looked was the web logs. Because I'm a fanatic about log analysis I've written various perl routines over the years for slicing and dicing log files. I crunched the logs for the baseline and after tests and immediately saw the problem: a major web service that was being called in the baseline was no longer being called in the after test. The difference in hits between the two load tests? 15%. The same as the difference in arrival rate between tests. Somewhere, something went kaput. The load testers haven't found out what yet, as we've been busy all week (and this weekend) hunting down other issues.
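For the curious, the baseline-versus-after comparison itself is nothing fancy. Below is a minimal sketch of the idea in Python (my real routines are perl, and the field positions depend on your log format); it assumes an Apache-style access log where the requested path is the seventh whitespace-delimited field, and the file names baseline.log and after.log are hypothetical:

# Minimal sketch: compare per-URL hit counts between two web logs.
# Assumes a common/combined log format where the request line looks like
# "GET /some/path HTTP/1.1", i.e. the path is whitespace field 7 (index 6).
def hits_by_path(fileName):
    counts = {}
    with open(fileName) as logFile:
        for line in logFile:
            fields = line.split()
            if len(fields) < 7:
                continue
            path = fields[6].split('?')[0]   # drop any query string
            counts[path] = counts.get(path, 0) + 1
    return counts

baseline = hits_by_path('baseline.log')      # hypothetical file names
after    = hits_by_path('after.log')

for path in sorted(set(baseline) | set(after)):
    b = baseline.get(path, 0)
    a = after.get(path, 0)
    if b != a:
        print("%-40s baseline=%8d after=%8d delta=%8d" % (path, b, a, a - b))

Sort that output by the delta column and a web service that has gone missing between tests jumps right out.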

But the net result is this: If you are a load tester or capacity analyst, you damned well better be able to perform log analysis, even if it is rudimentary. Whether it is via perl or some other system like splunk, you need to have the tools in your toolbox to get the job done.

In the above example, the load testers did not analyze the log files to see what was up; no attempt was made at all. Heck, they didn't even compare the output of Load Runner to ensure that the number of transactions between the two tests was the same. A load tester that cannot perform rudimentary log analysis is useless to me.

Saturday, June 4, 2011

Flexible Modeling with PyDQ

A while back I ran into a situation where I needed to model the capacity of a server based upon a set of page hits into the server. Fortunately, during testing it was a simple task to determine the per-page service demands, and I developed a bit of Python code around the PyDQ API to create a flexible model of the system.

This bit of code pulls in a .CSV file named on the command line. The .CSV file defines the queueing nodes of the system and the pages with their related service demands at each queueing node. The nodes are not limited to a single system; in my particular use I had two servers in the mix, a web server on the front end and a back-end database server.

Example model.CSV file:

workstream,coefficient,s1.cpu,s1.nic.sent,s1.nic.recv,dat.cpu
somePageA.jsp,0.1,0.029,0.01296,0.0000599,0.00137853
somePageB.jsp,0.8,0.00725,0.00324,0.0000599,0.000981735
somePageC.jsp,0.1,0.00525,0.01324,0.0000599,0.000981735

workstream is the name of the pages that will be hitting the system.
coefficient is the fraction of the total arrival rate attributed to each page. In the example above, if the arrival rate of page hits is 100 pages per second, 10 hits per second will be attributed to somePageA.jsp, 80 page hits per second to somePageB.jsp and, finally, the remaining 10 hits per second to somePageC.jsp.

The rest of the columns define the per-hit service demand at each of the identified queueing nodes (CPU, NICs, etc.).

The code will crunch the numbers to find the maximum arrival rate for requests in the pre-defined ratios and then generate .PNG graphs along with .CSV output. The graphs do a good job of letting management-type folks see what the system will be doing under load. I found the real response times of the system in question to be very close to the numbers PyDQ produced.

Response times per page by total arrival rate into the system:


Queueing node utilization by total arrival rate into the system:



And finally, the Python code itself:

#!/usr/bin/python

from pylab import *
import pdq
import sys

# Load the model workload .CSV file

workLoadFile = open(sys.argv[1], 'r')

serviceDemandByPage = {}
serverQueues = []

lineCount = 0

for inputLine in workLoadFile:
    inputLine = inputLine.rstrip('\r\n')
    workLoadElements = inputLine.split(',')

    if (lineCount == 0):
        serverQueues = workLoadElements[2:len(workLoadElements)]
    else:
        serviceDemand = []
        workLoadName = workLoadElements[0]
        coefficient = workLoadElements[1]
        serviceDemandValues = workLoadElements[2:len(workLoadElements)]

        for s in serviceDemandValues:
            s = float(s)
            serviceDemand.append(s)

        serviceDemandByPage[workLoadName] = {}
        serviceDemandByPage[workLoadName]['%'] = float(coefficient)

        for queueName in serverQueues:
            serviceDemandByPage[workLoadName][queueName] = serviceDemand[serverQueues.index(queueName)]

    lineCount = lineCount + 1

# Initial model settings

workLoadMatrix = []

for workLoadName in serviceDemandByPage.keys():
    workLoadMatrix.append([serviceDemandByPage[workLoadName]['%'], workLoadName])
    del(serviceDemandByPage[workLoadName]['%'])

workingData = {}
finalResults = {}

arrivalRate = 0.1

contLoop = True

while contLoop:

    workLoadArray = []
    workingData[arrivalRate] = {}

    pdq.Init("arrivalRate = %f" % arrivalRate)

    for queueName in serverQueues:
        pdq.CreateNode(queueName, pdq.CEN, pdq.FCFS)

    for workLoad in workLoadMatrix:

        coefficient = workLoad[0]
        pageName = workLoad[1]

        workLoadName = pageName

        workLoadArray.append([workLoadName, coefficient])

        workingData[arrivalRate][workLoadName] = {}

        workLoadLambda = coefficient * arrivalRate

        pdq.streams = pdq.CreateOpen(workLoadName, workLoadLambda)
        pdq.SetWUnit("Hits")
        pdq.SetTUnit("Seconds")

        for queueName in serverQueues:
            totalServiceDemand = serviceDemandByPage[workLoadName][queueName]
            workingData[arrivalRate][workLoadName][queueName] = 0
            pdq.SetDemand(queueName, workLoadName, totalServiceDemand)

    pdq.Solve(pdq.CANON)

    for workLoad in workLoadArray:
        workLoadName = workLoad[0]
        for queueName in serverQueues:
            workingData[arrivalRate][workLoadName][queueName] = pdq.GetUtilization(queueName, workLoadName, pdq.TRANS)

    for workLoadData in workLoadArray:
        workLoadName = workLoadData[0]
        coefficient = workLoadData[1]
        workingData[arrivalRate][workLoadName]['R'] = pdq.GetResponse(pdq.TRANS, workLoadName)
        workingData[arrivalRate][workLoadName]['%'] = coefficient

    for queueName in serverQueues:
        queueTotal = 0.0
        for workLoad in serviceDemandByPage.keys():
            queueTotal += workingData[arrivalRate][workLoad][queueName]
        if queueTotal > 0.99:
            contLoop = False

    arrivalRate += 0.1

for arrivalRate in sorted(workingData.keys()):

    finalResults[arrivalRate] = {}
    finalResults[arrivalRate]['R'] = {}
    finalResults[arrivalRate]['%'] = {}

    arrivalData = workingData[arrivalRate]

    workLoadArray = []

    for queueName in serverQueues:
        finalResults[arrivalRate][queueName] = {}
        rho = 0
        for workLoad in arrivalData.keys():
            rho = rho + arrivalData[workLoad][queueName]
            workLoadArray.append(workLoad)
        finalResults[arrivalRate][queueName]['rho'] = rho

    for workLoad in workLoadArray:
        finalResults[arrivalRate]['R'][workLoad] = arrivalData[workLoad]['R']
        finalResults[arrivalRate]['%'][workLoad] = arrivalData[workLoad]['%']

# .CSV output

arrivalRateArray = []
queueUtilization = {}
responseTimes = {}

reportHeader = "lambda"

for queueName in serverQueues:
    reportHeader = reportHeader + ",%s" % queueName

for workLoad in workLoadMatrix:
    reportHeader = reportHeader + ",%s" % workLoad[1]

print reportHeader

for arrivalRate in sorted(finalResults.keys()):
    arrivalRateArray.append(arrivalRate)
    outputLine = "%0.3f" % arrivalRate

    for queueName in serverQueues:
        outputLine = outputLine + ",%0.3f" % finalResults[arrivalRate][queueName]['rho']
        if not queueUtilization.has_key(queueName):
            queueUtilization[queueName] = []
        queueUtilization[queueName].append(finalResults[arrivalRate][queueName]['rho'])

    for workLoad in finalResults[arrivalRate]['R'].keys():
        if not responseTimes.has_key(workLoad):
            responseTimes[workLoad] = []
        responseTimes[workLoad].append(finalResults[arrivalRate]['R'][workLoad])
        outputLine = outputLine + ",%0.3f" % finalResults[arrivalRate]['R'][workLoad]

    print outputLine

# graph output

F = gcf()
dpi = F.get_dpi()
xInches = 1024 / dpi
yInches = (768/2) / dpi

rcParams['figure.figsize'] = xInches, yInches

for queueName in serverQueues:
    figure()
    domain = arrivalRateArray
    range = queueUtilization[queueName]
    plot(domain, range, linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Utilization of %s" % queueName)
    title("Utilization of queue %s by arrival rate" % queueName)
    grid(True)
    savefig("%s.utilization.png" % queueName)

figure()

for queueName in serverQueues:
    domain = arrivalRateArray
    range = queueUtilization[queueName]
    plot(domain, range, label="%s" % queueName, linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Utilization")
    title("Utilization of queues by arrival rate")
    grid(True)

legend(loc=2)
savefig("utilization.by.arrival.rate.png")

for workLoad in workLoadMatrix:
    figure()
    domain = arrivalRateArray
    range = responseTimes[workLoad[1]]
    plot(domain, range, linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Response Time, seconds, for %s" % workLoad[1])
    title("Response Time for %s by arrival rate" % workLoad[1])
    grid(True)
    savefig("%s.response.time.png" % workLoad[1])

figure()

for workLoad in workLoadMatrix:
    domain = arrivalRateArray
    range = responseTimes[workLoad[1]]
    plot(domain, range, label="%s" % workLoad[1], linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Response Times, seconds")
    title("Response Times by arrival rate")
    grid(True)

legend(loc=2)
savefig("response.times.by.arrival.rate.png")

Sunday, May 29, 2011

Converting a Probability Matrix into a set of Linear Equations in R

This entry is from another blog of mine, but I have migrated it over here because it is relevant to something I am working on for eCommerce capacity analysis, where probability matrices might work out very nicely.

The case study that I have started working on is in Chapter 8: Case Study IV: An E-Business Service.

The author provides us with a Customer Behavior Model Graph of a website:



From this graph we can create a set of linear equations that describe the number of visits to each page of the system presented in the CBMG (seen above). Taking a closer look at the graph we can see that there are probabilities of transitioning from one page to another. For example, the probability of moving from the Home Page (h) to the Login (g) page is Phg, and the probability of moving from the Search (s) page to the Login (g) page is Psg.

Using these probabilities we can create a set of linear equations to describe the number of visits to each page in the system based upon the number of visits into the entry of the system. In this case, there is only one way into the system and that is via the Entry (e) page.

We can look at the graph and manually derive the linear equations. For example, we can easily see that Vh = Peh*Ve. We can also see that Vg = Phg*Vh + Psg*Vs + Pvg*Vv. Doing this we eventually come up with eight linear equations that we can then solve. (The equations are listed on page 208 of Performance by Design: Computer Capacity Planning By Example).

The authors of the book provide us with all the probabilities in the form of a Matrix of Transition Probabilities on page 209. Here is that matrix:


                    (e)   (h)   (s)   (v)   (g)   (c)   (b)   (x)
Entry (e)            0     1     0     0     0     0     0     0
Home (h)             0     0     0.7   0     0.1   0     0     0.2
Search (s)           0     0     0.4   0.2   0.15  0     0     0.25
View Bids (v)        0     0     0     0     0.65  0     0     0.35
Login (g)            0     0     0     0     0     0.3   0.6   0.1
Create Auction (c)   0     0     0     0     0     0     0     1
Place Bid (b)        0     0     0     0     0     0     0     1
Exit (x)             0     0     0     0     0     0     0     0


So we can take the values found in the transition matrix and eventually gonkulate all the visits to each page. Although the transition matrix is provided by the authors, it is easy enough to re-create this matrix from the web logs for a site.
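A rough sketch of what that re-creation might look like in Python (any language would do): assume the web logs have already been grouped into per-visitor, time-ordered page sequences (the session reconstruction is the real work), and then it is just a matter of counting transitions and normalizing each row. The sessions and page names below are made up for illustration:

# Minimal sketch: estimate a transition probability matrix from
# per-session page sequences pulled out of the web logs.
# These sessions are hypothetical; in practice they would be rebuilt
# from the logs by session id or by client IP plus a time window.
sessions = [
    ['entry', 'home', 'search', 'search', 'exit'],
    ['entry', 'home', 'login', 'placebid', 'exit'],
]

counts = {}   # counts[fromPage][toPage] = observed transitions
for pages in sessions:
    for fromPage, toPage in zip(pages, pages[1:]):
        counts.setdefault(fromPage, {})
        counts[fromPage][toPage] = counts[fromPage].get(toPage, 0) + 1

# Normalize each row so the outbound probabilities sum to 1.
probabilities = {}
for fromPage, row in counts.items():
    total = float(sum(row.values()))
    probabilities[fromPage] = dict((toPage, n / total) for toPage, n in row.items())

print(probabilities)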

There's bound to be a way of backing the set of linear equations out of the transition matrix so that it can easily be solved in R (or any other package that can solve linear equations).

This is how I backed out the equations by hand. The transition matrix has the "from" pages down the rows and the "to" pages along the columns. So, for example, if I wanted to figure out the equation for Vs, I would work down the (s) column and use the coefficients of the transition matrix as multipliers for the V variables.

Working with the third column to solve for Vs:

The third column of the transition matrix is:


   (s)

(e) 0
(h) 0.7
(s) 0.4
(v) 0
(g) 0
(c) 0
(b) 0
(x) 0


So to solve for Vs:

Vs = 0*Ve + 0.7*Vh + 0.4*Vs + 0*Vv + 0*Vg + 0*Vc + 0*Vb + 0*Vx

Notice that we have a Vs on both sides of the equal sign. This corresponds to the loop on the Search node.

We can drop out the variables that are multiplied by zero.

Vs = 0.7*Vh + 0.4*Vs

Repeat the steps for all the columns of the transition matrix and you end up with the equations to solve. But most packages want the equations set up in matrix form, so let's get all the variables on the left side of the equals sign.

Vs - 0.4*Vs - 0.7*Vh = 0

Simplifies into:

0.6*Vs - 0.7*Vh = 0

If we do this with all the equations we can finally put everything into matrix form. But we can't solve it yet; we need another vector to solve against. This vector contains the input into the system. In this case the only entry into the system is via the Entry page, so the vertical vector is:

(e) 1
(h) 0
(s) 0
(v) 0
(g) 0
(c) 0
(b) 0
(x) 0

If we call the matrix that we created from the transition matrix A, the vector of visit counts V, and our input vertical vector B, we have the classic A*V = B system that we can solve for V.

Now we get into the R solution for the above. As I was working out the algorithm to set up the problem for calling the R solve() routine, it hit me how simple the solution was!

Let's call the transition matrix supplied by the authors M.

We can take the transpose of M and call it A.

Take A and multiply by the scalar value of -1. This is the act of moving all the variables to the left side of the equals sign.

Now we fold in the variable being solved for in each row of A by adding 1 to each diagonal element of A (that is, adding the identity matrix).

And there we have our final form of A. We can now solve A*V = B for the visits vector V!
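As a quick sanity check on that construction, take the Search row of A. The transpose puts Phs = 0.7 (under the Home column) and Pss = 0.4 (on the diagonal) into that row, negating gives -0.7 and -0.4, and adding 1 to the diagonal turns the -0.4 into 0.6. So the Search row of A*V = B reads:

-0.7*Vh + 0.6*Vs = 0

which is exactly the 0.6*Vs - 0.7*Vh = 0 equation we derived by hand above. (The right hand side is 0 because the only non-zero entry in B is in the Entry row.)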

Here is my solution in R:
# Define the column information for easy access by name


e <- 1;
h <- 2;
s <- 3;
v <- 4;
g <- 5;
c <- 6;
b <- 7;
x <- 8;

# M will be the supplied transition matrix

M <- matrix(c(0,1,0,0,0,0,0,0,
              0,0,0.7,0,0.1,0,0,0.2,
              0,0,0.4,0.2,0.15,0,0,0.25,
              0,0,0,0,0.65,0,0,0.35,
              0,0,0,0,0,0.3,0.6,0.1,
              0,0,0,0,0,0,0,1,
              0,0,0,0,0,0,0,1,
              0,0,0,0,0,0,0,0), ncol=8, nrow=8, byrow=TRUE);

# B is the vector that describes entry into the system
B <- matrix(c(1,0,0,0,0,0,0,0),nrow=8,ncol=1);

VisitsByTransitionMatrix <- function(M, B) {
    A <- t(M);
    A <- -1 * A;
    for (i in 1:sqrt(length(A))) {
        j <- i;
        A[i,j] <- A[i,j] + 1;
    };
    return(solve(A, B));
};

Visits <- VisitsByTransitionMatrix(M, B);

> Visits
[,1]
[1,] 1.0000000
[2,] 1.0000000
[3,] 1.1666667
[4,] 0.2333333
[5,] 0.4266667
[6,] 0.1280000
[7,] 0.2560000
[8,] 1.0000000

> Visits[e]
[1] 1
> Visits[h]
[1] 1
> Visits[s]
[1] 1.166667
> Visits[v]
[1] 0.2333333
> Visits[g]
[1] 0.4266667
> Visits[c]
[1] 0.128
> Visits[b]
[1] 0.256
> Visits[x]
[1] 1