Wednesday, November 23, 2011

Wrapping perl around tqharvc for generating scatter plots

TeamQuest View is a handy utility for looking at metrics collected on servers by TeamQuest, but one thing I have never liked about it is creating graphs from the collected metrics. Fortunately, installing TeamQuest View also installs a command line utility by the name of tqharvc. It is possible to execute tqharvc.exe from the command line to collect metrics and then create plots from its output.

I got the idea of writing some perl code that slices and dices the .RPT files generated by TeamQuest View, passes the resulting queries to tqharvc, and hands the output to a graphing routine that produces both plots and .CSV files for later processing if need be.

For graphing the data I use GD::Graph to generate a scatter plot.

For the example in this blog entry I have a very simple .RPT file by the name of physc.RPT which, as the name implies, simply reports physc usage from an LPAR:


[General]
Report Name = "physc"
View = Line
Format = Date+Time,
Parameter

[dParm1]
System = "[?]"
Category Group = "CPU"
Category = "by LPAR"
Subcategory = ""
Statistic = "physc"
Value Types = "Average"


Simple enough, isn't it?

My perl routine builds up queries for tqharvc by slicing and dicing the .RPT file, iterates through all of the extracted metrics, collects the output data, and generates the .CSV output and plot image for each one.

The tqharvc command line query is formed as:

tqharvc.exe -m [hostName] -s [startDate]:[startTime] -e [endDate]:[endTime] -a 10-minute -k [metric]

Say that I want to query "serverA" for the metric "CPU:by LPAR:physc" from midnight on 11/01/2011 to 11:59:59 PM on 11/07/2011, with samples every 10 minutes.

I call tqharvc with the following command line:

tqharvc.exe -m serverA -s 11012011:000000 -e 11072011:235959 -a 10-minute -k "CPU:by LPAR:physc"

My routine takes several command line parameters:

tqGatherMetrics.pl server.list physc.rpt 11/01/2011 00:00:00 11/06/2011 23:59:59

server.list is simply a text file with a list of servers to query:


serverA
serverB
serverC
serverD
serverE
serverF


physc.rpt is the aforementioned .RPT file. 11/01/2011 is the start date and 00:00:00 is the start time; 11/06/2011 is the end date and 23:59:59 is the end time of the queries.

Below is the output to STDERR from the perl routine as it crunches through the data from the servers:


Running query against serverA.
Executing TQ Query for serverA:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverA : CPU:by LPAR : 7
Generating graph for "serverA_CPU_by_LPAR_physc.gif"
Running query against serverB.
Executing TQ Query for serverB:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverB : CPU:by LPAR : 7
Generating graph for "serverB_CPU_by_LPAR_physc.gif"
Running query against serverC.
Executing TQ Query for serverC:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverC : CPU:by LPAR : 7
Generating graph for "serverC_CPU_by_LPAR_physc.gif"
Running query against serverD.
Executing TQ Query for serverD:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverD : CPU:by LPAR : 7
Generating graph for "serverD_CPU_by_LPAR_physc.gif"
Running query against serverE.
Executing TQ Query for serverE:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverE : CPU:by LPAR : 7
Generating graph for "serverE_CPU_by_LPAR_physc.gif"
Running query against serverF.
Executing TQ Query for serverF:CPU:by LPAR.
Processing query results.
Extracted 864 lines of data from query.
Processing results for serverF : CPU:by LPAR : 7
Generating graph for "serverF_CPU_by_LPAR_physc.gif"


The output of the plot appears as:

[scatter plot image: serverA_CPU_by_LPAR_physc.gif]

With the above command line I was able to generate plots for six servers in the server.list file.

If the .RPT file had listed multiple metrics, one plot would have been generated per server for each metric; I chose the single metric physc as a simple example.
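
For instance, a hypothetical .RPT file with two statistics in the same category could look like the fragment below; "entc" is just an assumed second LPAR metric for illustration, not something from the original report:

[General]
Report Name = "lpar_cpu"
View = Line
Format = Date+Time,
Parameter

[dParm1]
System = "[?]"
Category Group = "CPU"
Category = "by LPAR"
Subcategory = ""
Statistic = "physc",
"entc"
Value Types = "Average"

The parsing loop in the perl code below concatenates the continuation lines of the Statistic field and splits on commas, so each listed statistic becomes its own column match, .CSV file, and plot.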

The perl code to generate the plot follows:

1:  use GD::Graph::points;  
2: use Statistics::Descriptive;
3: use Time::Local;
4: use POSIX qw/ceil/;
5:
6: use strict;
7:
8: my $sourceList = shift;
9: my $tqReport = shift;
10: my $startDate = shift;
11: my $startTime = shift;
12: my $endDate = shift;
13: my $endTime = shift;
14:
15: $startDate =~ s/\///g;
16: $endDate =~ s/\///g;
17:
18: $startTime =~ s/\://g;
19: $endTime =~ s/\://g;
20:
21: if (length($sourceList) == 0) {
22: die "Usage: tqGatherMetrics.pl <sourceList> <tqReport> <startDate> <startTime> <endDate> <endTime>\n";
23: };
24:
25: my $tqHarvestBinary = "C:\\Program Files\\TeamQuest\\manager\\bin\\tqharvc.exe";
26:
27: if ( -f "$tqHarvestBinary" ) {
28: if ( -f "$sourceList" ) {
29: if ( -f "$tqReport" ) {
30:
31: my @hostNames = ();
32: my %metricHash = ();
33:
34: open(HOSTNAMES, "$sourceList") || die "$!";
35: while (<HOSTNAMES>) {
36: chomp($_);
37: if ($_ !~ /^#/) {
38: push(@hostNames, $_);
39: };
40: };
41: close(HOSTNAMES);
42:
43: open(REPORT, "$tqReport") || die "$!";
44:
45: my $catGroup = "::";
46: my $catName = "::";
47: my $subCat = "::";
48:
49: while (<REPORT>) {
50:
51: my @statArray = ();
52:
53: chomp($_);
54:
55: if ($_ =~ /Category Group = \"(.+?)\"/) {
56: $catGroup = $1;
57: };
58:
59: if ($_ =~ /Category = \"(.+?)\"/) {
60: $catName = $1;
61: };
62:
63: if ($_ =~ /Subcategory = \"(.+?)\"/) {
64: $subCat = $1;
65: };
66:
67: if ($_ =~ /Statistic =/) {
68: my $tmpString = "";
69: $_ =~ s/Statistic =//g;
70: $tmpString = $_;
71: do {
72: $_ = <REPORT>;
73: chomp($_);
74: if ($_ !~ /^Resource|^Value Types/) {
75: $tmpString .= $_;
76: };
77: } until ($_ =~ /^Resource|^Value Types/);
78: my @statArray = split(/\,/, $tmpString);
79: $metricHash{"${catGroup}:${catName}"} = \@statArray;
80: };
81:
82: };
83:
84: close(REPORT);
85:
86: foreach my $hostName (@hostNames) {
87: print STDERR "Running query against $hostName.\n";
88: my %metricData = ();
89: foreach my $paramName (sort(keys(%metricHash))) {
90: my %columnHash = ();
91: my $linesExtracted = 0;
92: my $shellCmd = "\"$tqHarvestBinary\" -m $hostName -s $startDate:$startTime -e $endDate:$endTime -a 10-minute -k \"$paramName\"";
93: # my $shellCmd = "\"$tqHarvestBinary\" -m $hostName -s $startDate:$startTime -e $endDate:$endTime -a 1-minute -k \"$paramName\"";
94: print STDERR "\t\tExecuting TQ Query for $hostName:$paramName.\n";
95:
96: open(OUTPUT, "$shellCmd |") || die "$!";
97:
98: print STDERR "\t\tProcessing query results.\n";
99:
100: my $totalColumns = 0;
101:
102: while (<OUTPUT>) {
103: chomp($_);
104: if ($_ =~ /^Time:/) {
105: my @columns = split(/\,/, $_);
106: my $statName = "";
107: for (my $index = 0; $index < $#columns; $index++) {
108: foreach $statName (@{$metricHash{$paramName}}) {
109: $statName =~ s/^\s+//g; # ltrim
110: $statName =~ s/\s+$//g; # rtrim
111: $statName =~ s/\"//g;
112:
113: my $columnName = $columns[$index];
114: if (index($columnName, $statName, 0) >= 0) {
115: $columnHash{$index} = $columns[$index];
116: $totalColumns++;
117: };
118: };
119: };
120: } else {
121: if ($_ =~ /^[0-9]/) {
122: chomp($_);
123: my @columns = split(/\,/, $_);
124: foreach my $index (sort(keys(%columnHash))) {
125: $metricData{"$columns[0] $columns[1]"}{$columnHash{$index}} = $columns[$index];
126: };
127: $linesExtracted++;
128: };
129: };
130: };
131:
132: close(OUTPUT);
133:
134: if (($linesExtracted > 0) && ($totalColumns > 0)) {
135: print STDERR "\tExtracted $linesExtracted lines of data from query.\n";
136: my @domainData = ();
137:
138: foreach my $timeStamp (sort dateSort keys(%metricData)) {
139: push(@domainData, $timeStamp);
140: };
141:
142: foreach my $metricIndex (sort(keys(%columnHash))) {
143: print STDERR "\t\tProcessing results for $hostName : $paramName : $metricIndex\n";
144:
145: my $metricName = $columnHash{$metricIndex};
146: my @rangeData = ();
147: my $stat = Statistics::Descriptive::Full->new();
148:
149: foreach my $timeStamp (@domainData) {
150: push(@rangeData, $metricData{$timeStamp}{$columnHash{$metricIndex}});
151: $stat->add_data($metricData{$timeStamp}{$columnHash{$metricIndex}});
152: };
153:
154: my $graphName = "${hostName}_${paramName}_${metricName}";
155: my $csvName = "";
156:
157: $graphName =~ s/\\/_/g;
158: $graphName =~ s/\//_/g;
159: $graphName =~ s/\%/_/g;
160: $graphName =~ s/\:/_/g;
161: $graphName =~ s/\s/_/g;
162:
163: $csvName = $graphName;
164: $graphName .= ".gif";
165: $csvName .= ".csv";
166:
167: print STDERR "\t\tGenerating graph for \"$graphName\"\n";
168:
169: open(CSVOUTPUT, ">$csvName");
170: print CSVOUTPUT "Timestamp,$paramName:$metricName\n";
171:
172: my $i = 0;
173: foreach my $timeStamp (@domainData) {
174: print CSVOUTPUT "$domainData[$i],$rangeData[$i]\n";
175: $i++;
176: };
177:
178: close(CSVOUTPUT);
179:
180: my $dataMax = $stat->max();
181: my $dataMin = $stat->min();
182:
183: if ($dataMax < 1) {
184: $dataMax = 0.5;
185: };
186:
187: if ($dataMin > 0) {
188: $dataMin = 0;
189: };
190:
191: for (my $rangeIndex = 0; $rangeIndex < $#rangeData; $rangeIndex++) {
192: if ($rangeData[$rangeIndex] == 0) {
193: $rangeData[$rangeIndex] = $dataMax * 5;
194: };
195: };
196:
197: my @data = (\@domainData, \@rangeData);
198: my $tqGraph = GD::Graph::points->new(1024, int(768/2));
199: my $totalMeasurements = $#{$data[0]} + 1;
200:
201: $tqGraph->set(x_label_skip => int($#domainData/40),
202: x_labels_vertical => 1,
203: markers => [6],
204: marker_size => 2,
205: y_label => "$metricName",
206: y_min_value => $dataMin,
207: y_max_value => ceil($dataMax * 1.1),
208: title => "${hostName}:${paramName}:${metricName}, N = " . $stat->count(). "; avg = " . $stat->mean() . "; SD = " . $stat->standard_deviation() . "; 90th = " . $stat->percentile(90) . ".",
209: line_types => [1],
210: line_width => 1,
211: dclrs => ['red'],
212: ) or warn $tqGraph->error;
213:
214: $tqGraph->set_legend("TQ Measurement");
215: $tqGraph->set_legend_font(GD::gdMediumBoldFont);
216: $tqGraph->set_x_axis_font(GD::gdMediumBoldFont);
217: $tqGraph->set_y_axis_font(GD::gdMediumBoldFont);
218:
219: my $tqImage = $tqGraph->plot(\@data) or die $tqGraph->error;
220:
221: open(PICTURE, ">$graphName");
222: binmode PICTURE;
223: print PICTURE $tqImage->gif;
224: close(PICTURE);
225:
226: };
227:
228: } else {
229: print STDERR "#################### Nothing extracted for Hostname \"$hostName\" and metric \"$paramName\" ####################\n";
230: };
231:
232: };
233: };
234:
235: } else {
236: print STDERR "Could not find the TeamQuest Report file.\n";
237: };
238: } else {
239: print STDERR "Could not find the list of hostnames to run against.\n";
240: };
241: } else {
242: print STDERR "Could not find the TeamQuest Manager TQHarvest binary at \"$tqHarvestBinary\". Cannot continue.\n";
243: };
244:
245: sub dateSort {
246: my $a_value = dateToEpoch($a);
247: my $b_value = dateToEpoch($b);
248:
249: if ($a_value > $b_value) {
250: return 1;
251: } else {
252: if ($b_value > $a_value) {
253: return -1;
254: } else {
255: return 0;
256: };
257: };
258:
259: };
260:
261: sub dateToEpoch {
262: my ($timeStamp) = @_;
263: my ($dateString, $timeString) = split(/ /, $timeStamp);
264: my ($month, $day, $year) = split(/\//, $dateString);
265: my ($hour, $min, $sec) = split(/:/, $timeString);
266:
267: $year += 2000;
268: $month -= 1;
269:
270: return timegm($sec,$min,$hour,$day,$month,$year);
271:
272: };
273:

Sunday, November 20, 2011

Discrete Event Simulation of the Three Tier eBiz with SimPy

In an earlier post I blogged about how I wrote a PDQ-R solution for a case study used in the book, Performance by Design.

I also wanted to expand my horizons and write a Discrete Event Simulation solution to go along with the heuristic PDQ-R solution. The firm that I was working for at the time had a DES package by the name of SimScript II, but they weren't willing to cough up a license so that I could learn the product. So I searched the web and found a python package by the name of SimPy that can be used for Discrete Event Simulation.

There are some major differences between the heuristic solution and the DES solution. With the heuristic solution I was able to use some linear algebra to determine how many hits per second to send off to the various pages that consumed resources. With my SimPy solution I did not have that luxury, as I am simulating individual hits to each page one at a time. To get a proper distribution of hits I came up with this simple routine to dispatch the hits to pages:
class RandomPath:
    def RowSum(self, Vector):
        rowSum = 0.0
        for i in range(len(Vector)):
            rowSum += Vector[i]
        return rowSum

    def NextPage(self, T, i):
        rowSum = self.RowSum(T[i])
        randomValue = G.Rnd.uniform(0, rowSum)
        sumT = 0.0
        for j in range(len(T[i])):
            sumT += T[i][j]
            if randomValue < sumT:
                break
        return j
With this routine I take the sum of the probabilities from a row of the TransitionMatrix (hard coded in class G) and then generate a random value between zero and that sum. I then walk the row, summing the probabilities, until the running sum exceeds the random value. The column index where this happens determines the next page.


For example, below I have the TransitionMatrix as used in my python code:


88: # 0 1 2 3 4 5 6 7
89: TransitionMatrix = [ [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00], # 0
90: [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20], # 1
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2
92: [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60], # 3
93: [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15], # 4
94: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 5
95: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 6
96: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00] ]# 7


Say we are currently on the Search page, represented by row #2 of the above stochastic matrix:


0 1 2 3 4 5 6 7
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2


Assume the random value drawn is 0.67:

Column 0 is 0.00; my running sum is 0.00.
Column 1 is 0.00; my running sum is 0.00.
Column 2 is 0.45; my running sum is 0.45, which is less than my random value, 0.67.
Column 3 is 0.15; my running sum is 0.60, which is still less than 0.67.
Column 4 is 0.10; my running sum is 0.70, which is greater than 0.67. The routine returns 4, and the next row in the random walk will be row 4 (which represents the Login page).

What we end up with is a Markov random walk over the transition matrix. Multiple comparisons of the random walks versus the heuristic analysis came out fairly close. Of course, with a random variable it is never exactly the same, but it comes out fairly close with each iteration.
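
One quick way to make that comparison is to tally page hits over a long standalone walk and compare the fractions against the visit ratios from the analytic solution. Below is a minimal sketch of that idea in plain Python, outside of SimPy; the matrix is the Type B TransitionMatrix shown above, while the 100,000-step walk length and the restart-at-Home behaviour are my own assumptions for the illustration.

import random

# Type B transition matrix (pages: Entry, Home, Search, View, Login, Create, Bid, Exit)
T = [[0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
     [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20],
     [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30],
     [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
     [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]]
pageNames = ["Entry", "Home", "Search", "View", "Login", "Create", "Bid", "Exit"]

def nextPage(row, rnd):
    # Walk the row, summing probabilities until the running sum exceeds the draw.
    draw = rnd.uniform(0, sum(row))
    sumT = 0.0
    for j in range(len(row)):
        sumT += row[j]
        if draw < sumT:
            return j
    return len(row) - 1   # all-zero row (the Exit page); stay put

rnd = random.Random(12345)
hits = [0] * len(pageNames)
page = 1                      # every session starts on the Home page
for step in range(100000):    # arbitrary walk length
    hits[page] += 1
    page = nextPage(T[page], rnd)
    if page == 7:             # reaching Exit starts a new session at Home
        page = 1

for name, count in zip(pageNames, hits):
    print("%-8s %.4f" % (name, float(count) / sum(hits)))

For a pure Type B workload the tallied fractions come out close to the relative visit counts that the linear algebra predicts, which is the kind of comparison described above.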

The great thing about this algorithm is that it can be used both for Discrete Event Simulation and for load testing. At an eCommerce site where I worked, I had created a perl routine to slice and dice IIS logs, and by using the unique shopper ID in the logged cookies I created a Markov transition matrix of actual customer traffic. The resulting matrix wasn't small. For our site it was over 500 x 500 elements. That's quite a lot of different random paths to take.
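
I no longer have that perl routine handy, but the core of it is small. Below is a minimal sketch of the idea in Python (to stay in one language for this post); the shopper IDs and page sequences are made up, and in practice they would be reconstructed from the IIS logs by grouping requests on the shopper ID cookie.

from collections import defaultdict

# Hypothetical sessions: one ordered list of page names per shopper ID.
sessions = {
    "shopper-001": ["Home", "Search", "Search", "View", "Exit"],
    "shopper-002": ["Home", "Search", "View", "Login", "Bid", "Exit"],
    "shopper-003": ["Home", "Login", "Create", "Exit"],
}

counts = defaultdict(lambda: defaultdict(int))
pages = set()

# Count the observed page-to-page transitions across all sessions.
for path in sessions.values():
    pages.update(path)
    for here, there in zip(path, path[1:]):
        counts[here][there] += 1

pages = sorted(pages)

# Normalize each row into a probability distribution (rows with no
# outbound transitions, such as Exit, stay all zero).
matrix = []
for here in pages:
    total = float(sum(counts[here].values()))
    matrix.append([counts[here][there] / total if total else 0.0 for there in pages])

for name, row in zip(pages, matrix):
    print("%-8s %s" % (name, ["%.2f" % p for p in row]))

The resulting row-stochastic matrix can then be walked exactly like the hard coded TransitionMatrix above, whether the walker is a SimPy process or a LoadRunner Vuser.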

The idea was to take this information and modify our LoadRunner HTTP Vusers to make use of the data, so that our scripts would properly emulate actual customers as opposed to a crude page summary analysis. With this mechanism our load tests would properly represent customers, and we would have been able to re-run the analysis once a week to make sure that our scenarios always reflected how customers browsed the site.

Unfortunately, I never got to put my ideas into action and only got as far as creating the stochastic transition matrix and some crude perl code that demonstrated random walks representing customers. All in all, that was the easy part. The really hard part of the project would have been coding the HTTP Vusers so that they properly handled the state transitions from page to page for the non-happy paths that customers sometimes follow.

Below is the SimPy solution that I came up with for the case study presented:

1:  #!/usr/bin/env python
2: from SimPy.Simulation import *
3: from random import Random, expovariate, uniform
4: class Metrics:
5: metrics = dict()
6: def Add(self, metricName, frameNumber, value):
7: if self.metrics.has_key(metricName):
8: if self.metrics[metricName].has_key(frameNumber):
9: self.metrics[metricName][frameNumber].append(value)
10: else:
11: self.metrics[metricName][frameNumber] = list()
12: self.metrics[metricName][frameNumber].append(value)
13: else:
14: self.metrics[metricName] = dict()
15: self.metrics[metricName][frameNumber] = list()
16: self.metrics[metricName][frameNumber].append(value)
17: def Keys(self):
18: return self.metrics.keys()
19: def Mean(self, metricName):
20: valueArray = list()
21: if self.metrics.has_key(metricName):
22: for frame in self.metrics[metricName].keys():
23: for values in range(len(self.metrics[metricName][frame])):
24: valueArray.append(self.metrics[metricName][frame][values])
25: sum = 0.0
26: for i in range(len(valueArray)):
27: sum += valueArray[i]
28: if len(self.metrics[metricName][frame]) != 0:
29: return sum/len(self.metrics[metricName])
30: else:
31: return 0 # Need to learn python throwing exceptions
32: else:
33: return 0
34: class RandomPath:
35: def RowSum(self, Vector):
36: rowSum = 0.0
37: for i in range(len(Vector)):
38: rowSum += Vector[i]
39: return rowSum
40: def NextPage(self, T, i):
41: rowSum = self.RowSum(T[i])
42: randomValue = G.Rnd.uniform(0, rowSum)
43: sumT = 0.0
44: for j in range(len(T[i])):
45: sumT += T[i][j]
46: if randomValue < sumT:
47: break
48: return j
49: class G:
50: numWS = 1
51: numAS = 1
52: numDS = 2
53: Rnd = random.Random(12345)
54: PageNames = ["Entry", "Home", "Search", "View", "Login", "Create", "Bid", "Exit" ]
55: Entry = 0
56: Home = 1
57: Search = 2
58: View = 3
59: Login = 4
60: Create = 5
61: Bid = 6
62: Exit = 7
63: WS = 0
64: AS = 1
65: DS = 2
66: CPU = 0
67: DISK = 1
68: WS_CPU = 0
69: WS_DISK = 1
70: AS_CPU = 2
71: AS_DISK = 3
72: DS_CPU = 4
73: DS_DISK = 5
74: metrics = Metrics()
75: # e h s v l c b e
76: HitCount = [0, 0, 0, 0, 0, 0, 0, 0]
77: Resources = [[ Resource(1), Resource(1) ], # WS CPU and DISK
78: [ Resource(1), Resource(1) ], # AS CPU and DISK
79: [ Resource(1), Resource(1) ]] # DS CPU and DISK
80: # Enter Home Search View Login Create Bid Exit
81: ServiceDemand = [ [0.000, 0.008, 0.009, 0.011, 0.060, 0.012, 0.015, 0.000], # WS_CPU
82: [0.000, 0.030, 0.010, 0.010, 0.010, 0.010, 0.010, 0.000], # WS_DISK
83: [0.000, 0.000, 0.030, 0.035, 0.025, 0.045, 0.040, 0.000], # AS_CPU
84: [0.000, 0.000, 0.008, 0.080, 0.009, 0.011, 0.012, 0.000], # AS_DISK
85: [0.000, 0.000, 0.010, 0.009, 0.015, 0.070, 0.045, 0.000], # DS_CPU
86: [0.000, 0.000, 0.035, 0.018, 0.050, 0.080, 0.090, 0.000] ] # DS_DISK
87: # Type B shopper
88: # 0 1 2 3 4 5 6 7
89: TransitionMatrix = [ [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00], # 0
90: [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20], # 1
91: [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30], # 2
92: [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60], # 3
93: [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15], # 4
94: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 5
95: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00], # 6
96: [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00] ] # 7
97: class DoWork(Process):
98: def __init__(self, i, resource, serviceDemand, nodeName, pageName):
99: Process.__init__(self)
100: self.frame = i
101: self.resource = resource
102: self.serviceDemand = serviceDemand
103: self.nodeName = nodeName
104: self.pageName = pageName
105: def execute(self):
106: StartUpTime = now()
107: yield request, self, self.resource
108: yield hold, self, self.serviceDemand
109: yield release, self, self.resource
110: R = now() - StartUpTime
111: G.metrics.Add(self.pageName, self.frame, R)
112: class CallPage(Process):
113: def __init__(self, i, node, pageName):
114: Process.__init__(self)
115: self.frame = i
116: self.StartUpTime = 0.0
117: self.currentPage = node
118: self.pageName = pageName
119: def execute(self):
120: if self.currentPage != G.Exit:
121: print >> sys.stderr, "Working on Frame # ", self.frame, " @ ", now() , " for page ", self.pageName
122: self.StartUpTime = now()
123: if G.ServiceDemand[G.WS_CPU][self.currentPage] > 0.0:
124: wsCPU = DoWork(self.frame, G.Resources[G.WS][G.CPU], G.ServiceDemand[G.WS_CPU][self.currentPage]/G.numWS, "wsCPU", self.pageName)
125: activate(wsCPU, wsCPU.execute())
126: if G.ServiceDemand[G.WS_DISK][self.currentPage] > 0.0:
127: wsDISK = DoWork(self.frame, G.Resources[G.WS][G.DISK], G.ServiceDemand[G.WS_DISK][self.currentPage]/G.numWS, "wsDISK", self.pageName)
128: activate(wsDISK, wsDISK.execute())
129: if G.ServiceDemand[G.AS_CPU][self.currentPage] > 0.0:
130: asCPU = DoWork(self.frame, G.Resources[G.AS][G.CPU], G.ServiceDemand[G.AS_CPU][self.currentPage]/G.numAS, "asCPU", self.pageName)
131: activate(asCPU, asCPU.execute())
132: if G.ServiceDemand[G.AS_DISK][self.currentPage] > 0.0:
133: asDISK = DoWork(self.frame, G.Resources[G.AS][G.DISK], G.ServiceDemand[G.AS_DISK][self.currentPage]/G.numAS, "asDISK", self.pageName)
134: activate(asDISK, asDISK.execute())
135: if G.ServiceDemand[G.DS_CPU][self.currentPage] > 0.0:
136: dsCPU = DoWork(self.frame, G.Resources[G.DS][G.CPU], G.ServiceDemand[G.DS_CPU][self.currentPage]/G.numDS, "dsCPU", self.pageName)
137: activate(dsCPU, dsCPU.execute())
138: if G.ServiceDemand[G.DS_DISK][self.currentPage] > 0.0:
139: dsDISK = DoWork(self.frame, G.Resources[G.DS][G.DISK], G.ServiceDemand[G.DS_DISK][self.currentPage]/G.numDS, "dsDISK", self.pageName)
140: activate(dsDISK, dsDISK.execute())
141: G.HitCount[self.currentPage] += 1
142: yield hold, self, 0.00001 # Needed to prevent an error. Doesn't add any blocking to the six queues above
143: class Generator(Process):
144: def __init__(self, rate, maxT, maxN):
145: Process.__init__(self)
146: self.name = "Generator"
147: self.rate = rate
148: self.maxN = maxN
149: self.maxT = maxT
150: self.g = Random(11335577)
151: self.i = 0
152: self.currentPage = G.Home
153: def execute(self):
154: while (now() < self.maxT):
155: self.i += 1
156: p = CallPage(self.i,self.currentPage,G.PageNames[self.currentPage])
157: activate(p,p.execute())
158: yield hold,self,self.g.expovariate(self.rate)
159: randomPath = RandomPath()
160: if self.currentPage == G.Exit:
161: self.currentPage = G.Home
162: else:
163: self.currentPage = randomPath.NextPage(G.TransitionMatrix, self.currentPage)
164: def main():
165: maxWorkLoad = 10000
166: Lambda = 4.026*float(sys.argv[1])
167: maxSimTime = float(sys.argv[2])
168: initialize()
169: g = Generator(Lambda, maxSimTime, maxWorkLoad)
170: activate(g,g.execute())
171: simulate(until=maxSimTime)
172: print >> sys.stderr, "Simulated Seconds : ", maxSimTime
173: print >> sys.stderr, "Page Hits :"
174: for i in range(len(G.PageNames)):
175: print >> sys.stderr, "\t", G.PageNames[i], " = ", G.HitCount[i]
176: print >> sys.stderr, "Throughput : "
177: for i in range(len(G.PageNames)):
178: print >> sys.stderr, "\t", G.PageNames[i], " = ", G.HitCount[i]/maxSimTime
179: print >> sys.stderr, "Mean Response Times:"
180: for i in G.metrics.Keys():
181: print >> sys.stderr, "\t", i, " = ", G.metrics.Mean(i)
182: print G.HitCount[G.Home]/maxSimTime, ",", G.metrics.Mean("Home"), ",", G.metrics.Mean("View"), ",", G.metrics.Mean("Search"), ",", G.metrics.Mean("Login"), ",", G.metrics.Mean("Create"), ",", G.metrics.Mean("Bid")
183: if __name__ == '__main__': main()


In a future blog post I will compare the results of the Discrete Event Simulation against my PDQ-R solution, and finally compare both of those against a workup done with TeamQuest Model just for good measure.

Tuesday, November 15, 2011

Using PDQ-R to model a three tier eBiz Application

This post is a continuation of a post that I had done earlier.

Continuing the solution for Case Study IV: An E-Business Service from the book, Performance by Design: Computer Capacity Planning By Example.

In the previous post I discussed how to take the transition probability matrix and work backwards to the original series of linear equations that solve for the number of visits to each web page. In this case study there are actually two types of visitors, which results in two transition probability matrices that must be used: 25% of visitors are Type A and 75% are Type B.
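
The visit calculation itself reduces to one linear solve: each page's visits equal the probability-weighted visits of the pages that lead into it, plus one visit to the Entry page, so V = transpose(P) * V + b, or (I - transpose(P)) V = b. That is exactly what the VisitsByTransitionMatrix function in the R code below does; here is a quick cross-check of the same computation with numpy (numpy is my own choice here, not part of the original solution), using the Type B matrix from the book:

import numpy as np

# Type B transition probability matrix (Entry, Home, Search, View, Login, Create, Bid, Exit).
P = np.array([
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20],
    [0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30],
    [0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])

# Every visitor enters the system exactly once, at the Entry page.
b = np.zeros(8)
b[0] = 1.0

# Solve (I - P^T) V = b for the average visits per session to each page.
V = np.linalg.solve(np.eye(8) - P.T, b)
print(V)

Blending the Type A and Type B visits (25% and 75%) and multiplying by the arrival rate into the system gives the per-page arrival rates that feed the PDQ-R workstreams.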

Each tier of the hypothetical e-biz service is made up of a single CPU and a single disk drive. A matrix is supplied with the total service demand placed on each component by each page that visitors hit.

While it is simple to write some code that analyzes web logs to generate the transition probability matrix from customer traffic, it is very difficult to isolate the total demand at each component under chaotic customer traffic. But that is why we have load testing tools available to us. In a pseudo-production environment we can drive simulated customer traffic at one page at a time and calculate the total demand on each component. In this particular case only the CPUs and disk drives are being modeled, but for a real service we'd want to model the CPU, disk drives, memory system, network system, etc.

After running simulated customer traffic against each page in isolation we could generate a similar demand matrix for the components and use it for what-if analysis.
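
The arithmetic behind that demand matrix is just the Service Demand Law, D = U / X: drive one page in isolation, measure the utilization U of each device and the page throughput X, then divide. A minimal sketch, again in Python; the utilization and throughput numbers are made up for illustration, chosen so the results line up with the Search column of the book's demand table:

# Service Demand Law: D = U / X.
# U = measured device utilization while a single page is driven in isolation,
# X = measured page throughput. All numbers below are made up for illustration.
measurements = {
    # (page, device): (utilization, throughput in pages/sec)
    ("Search", "WS_CPU"):  (0.18, 20.0),
    ("Search", "AS_CPU"):  (0.60, 20.0),
    ("Search", "DS_DISK"): (0.70, 20.0),
}

for (page, device), (utilization, throughput) in sorted(measurements.items()):
    demand = utilization / throughput   # seconds of service per page hit
    print("%-8s %-8s D = %.3f sec/hit" % (page, device, demand))

Repeating that measurement for every page and every device fills in the same kind of matrix that the case study supplies.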

For this solution I opted to use PDQ-R with the R programming language.

My R code for the solution:

 # Solution parameters   
gamma <- 10.96; # Rate into system
numWS <- 1; # Number of Web Servers
numAS <- 1; # Number of Application Servers
numDS <- 1; # Number of Database Servers
# external library
library("pdq");
# Constants #
E <- 1;
H <- 2;
S <- 3;
V <- 4;
G <- 5;
C <- 6;
B <- 7;
X <- 8;
PAGE_NAMES <- c("Enter", "HomePage", "Search", "ViewBids", "Login", "CreateAuction", "PlaceBid", "Exit");
COMPONENTS <- c("CPU", "Disk");
SERVER_TYPES <- c("WS", "AS", "DS");
WS_CPU <- 1;
WS_DISK <- 2;
AS_CPU <- 3;
AS_DISK <- 4;
DS_CPU <- 5;
DS_DISK <- 6;
# Functions used in solution
VisitsByTransitionMatrix <- function(M, B) {
A <- t(M);
A <- -1 * A;
for (i in 1:sqrt(length(A))) {
j <- i;
A[i,j] <- A[i,j] + 1;
};
return(solve(A,B));
};
CalculateLambda <- function(gamma, f_a, f_b, V_a, V_b, index) {
return (
gamma*((f_a*V_a[index]) + (f_b*V_b[index]))
);
};
f_a <- 0.25; # Fraction of TypeA users
f_b <- 1 - f_a; # Fraction of TypeB users
lambda <- 1:X; # Array of lambda for each page
SystemInput <- matrix(c(1,0,0,0,0,0,0,0),nrow=8,ncol=1) # 8.3, Figure 8.2, page 208
TypeA <- matrix(c(0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20,
0.00, 0.00, 0.40, 0.20, 0.15, 0.00, 0.00, 0.25,
0.00, 0.00, 0.00, 0.00, 0.65, 0.00, 0.00, 0.35,
0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.60, 0.10,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00),
ncol=8, nrow=8, byrow=TRUE); # 8.4, Table 8.1, page 209
TypeB <- matrix(c(0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.70, 0.00, 0.10, 0.00, 0.00, 0.20,
0.00, 0.00, 0.45, 0.15, 0.10, 0.00, 0.00, 0.30,
0.00, 0.00, 0.00, 0.00, 0.40, 0.00, 0.00, 0.60,
0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.55, 0.15,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00),
nrow=8, ncol=8, byrow=TRUE); # 8.4, Table 8.2, page 210
DemandTable <- matrix(c(0.000, 0.008, 0.009, 0.011, 0.060, 0.012, 0.015, 0.000, # WS_CPU
0.000, 0.030, 0.010, 0.010, 0.010, 0.010, 0.010, 0.000, # WS_DISK
0.000, 0.000, 0.030, 0.035, 0.025, 0.045, 0.040, 0.000, # AS_CPU
0.000, 0.000, 0.008, 0.080, 0.009, 0.011, 0.012, 0.000, # AS_DISK
0.000, 0.000, 0.010, 0.009, 0.015, 0.070, 0.045, 0.000, # DS_CPU
0.000, 0.000, 0.035, 0.018, 0.050, 0.080, 0.090, 0.000), # DS_DISK
ncol=8, nrow=6, byrow=TRUE); # 8.4, Table 8.4, page 212 (with modifications)
VisitsA <- VisitsByTransitionMatrix(TypeA, SystemInput);
VisitsB <- VisitsByTransitionMatrix(TypeB, SystemInput);
lambda[E] <- 0; # Not used in calculations
lambda[H] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, H);
lambda[S] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, S);
lambda[V] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, V);
lambda[G] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, G);
lambda[C] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, C);
lambda[B] <- CalculateLambda(gamma, f_a, f_b, VisitsA, VisitsB, B);
lambda[X] <- 0 # Not used in calculations
Init("e_biz_service");
# Define workstreams
for (n in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[n]);
CreateOpen(workStreamName, lambda[n]);
};
# Define Web Server Queues
for (i in 1:numWS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("WS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Define Application Server Queues
for (i in 1:numAS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("AS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Define Database Server Queues
for (i in 1:numDS) {
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("DS_%d_%s", i, COMPONENTS[j]);
CreateNode(nodeName, CEN, FCFS);
};
};
# Set Demand for the Web Servers
for (i in 1:numWS) {
demandIndex <- WS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("WS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numWS);
};
};
};
# Set Demand for the App Servers
for (i in 1:numAS) {
demandIndex <- AS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("AS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numAS);
};
};
};
# Set Demand for the Database Servers
for (i in 1:numDS) {
demandIndex <- DS_CPU;
for (j in 1:length(COMPONENTS)) {
nodeName <- sprintf("DS_%d_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
SetDemand(nodeName, workStreamName, (DemandTable[demandIndex + (j-1), k])/numDS);
};
};
};
SetWUnit("Trans");
SetTUnit("Second");
Solve(CANON);
print("Arrival Rates for each page:");
for (i in H:B) {
print(sprintf("%s = %f", PAGE_NAMES[i], lambda[i]));
};
print("[-------------------------------------------------]");
print("Page Response Times");
for (i in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[i]);
print(sprintf("%s = %f seconds.", PAGE_NAMES[i], GetResponse(TRANS, workStreamName)));
};
print("[-------------------------------------------------]");
print("Component Utilizations");
for (i in 1:numWS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("WS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};
for (i in 1:numAS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("AS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};
for (i in 1:numDS) {
for (j in 1:length(COMPONENTS)) {
totalUtilization <- 0;
nodeName <- sprintf("DS_%s_%s", i, COMPONENTS[j]);
for (k in H:B) {
workStreamName <- sprintf("%s", PAGE_NAMES[k]);
totalUtilization <- totalUtilization + GetUtilization(nodeName, workStreamName, TRANS);
};
print(sprintf("%s = %3.2f %%", nodeName, totalUtilization * 100));
};
};


Here is some sample output from the R code, with 10.96 users entering the system per second:

[1] "Arrival Rates for each page:"
[1] "HomePage = 10.960000"
[1] "Search = 13.658485"
[1] "ViewBids = 2.208606"
[1] "Login = 3.664958"
[1] "CreateAuction = 1.099487"
[1] "PlaceBid = 2.074180"
[1] "[-------------------------------------------------]"
[1] "Page Response Times"
[1] "HomePage = 0.083517 seconds."
[1] "Search = 1.612366 seconds."
[1] "ViewBids = 1.044683 seconds."
[1] "Login = 2.323417 seconds."
[1] "CreateAuction = 3.622690 seconds."
[1] "PlaceBid = 3.983755 seconds."
[1] "[-------------------------------------------------]"
[1] "Component Utilizations"
[1] "WS_1_CPU = 49.91 %"
[1] "WS_1_Disk = 55.59 %"
[1] "AS_1_CPU = 71.11 %"
[1] "AS_1_Disk = 35.59 %"
[1] "DS_1_CPU = 38.17 %"
[1] "DS_1_Disk = 97.57 %"


In this analysis we can see that the database server disk I/O is approaching 100% utilization. A simple solution (for this analysis at least) is to add another database server to spread the load evenly.

I modify the line that reads "numDS <- 1; # Number of Database Servers" to read "numDS <- 2; # Number of Database Servers" and re-run the analysis:

 [1] "Arrival Rates for each page:"  
[1] "HomePage = 10.960000"
[1] "Search = 13.658485"
[1] "ViewBids = 2.208606"
[1] "Login = 3.664958"
[1] "CreateAuction = 1.099487"
[1] "PlaceBid = 2.074180"
[1] "[-------------------------------------------------]"
[1] "Page Response Times"
[1] "HomePage = 0.083517 seconds."
[1] "Search = 0.237452 seconds."
[1] "ViewBids = 0.336113 seconds."
[1] "Login = 0.358981 seconds."
[1] "CreateAuction = 0.462042 seconds."
[1] "PlaceBid = 0.440903 seconds."
[1] "[-------------------------------------------------]"
[1] "Component Utilizations"
[1] "WS_1_CPU = 49.91 %"
[1] "WS_1_Disk = 55.59 %"
[1] "AS_1_CPU = 71.11 %"
[1] "AS_1_Disk = 35.59 %"
[1] "DS_1_CPU = 19.09 %"
[1] "DS_1_Disk = 48.78 %"
[1] "DS_2_CPU = 19.09 %"
[1] "DS_2_Disk = 48.78 %"


Not only do we alleviate the database server disk I/O utilization, we also get an appreciable decrease in page response time for the end user. The Search page went from 1.61 seconds to 0.24 seconds. Not too shabby. The biggest difference was with the Place Bid page, which went from 3.98 seconds to 0.44 seconds, a change of over three and a half seconds.

In a future post I will go over a modified PDQ-R solution that also generates graphs to show changes in resource utilization and page response times with increasing load. I also have a Discrete Event Simulation solution to the aforementioned eBiz application, done with Python and the SimPy library, to compare the output of the heuristic analysis against the event simulation.