Sunday, June 19, 2011

The Value of Log Analysis

Something that I run into time and time again is capacity planners and load testers who simply have no concept of how to perform log analysis. There is no excuse for this. Perl has been available for 20+ years and is my bread and butter when it comes to slicing and dicing logs. But if you are not a perl guru, there are plenty of alternatives available to assist the analyst who wants to crunch some log files.

I've heard a lot of good things about Microsoft LogParser. No perl required.

Personally, I prefer perl for slicing and dicing logs, but then I've been coding perl routines for 13+ years.

But back to log analysis: here's an example that happened to me earlier this week.

A load test team is testing a major install that is going into production shortly, and management is hot and heavy to get this install tested. Load testing is finally done, and the results show that the CPU utilization of servers in a particular tier increased by 2% while the arrival rate decreased by 15%. The result? The measured CPU service demand increased by roughly 22%, which has an effect on the number of servers required for production.
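For the record, the arithmetic is just the Utilization Law: service demand D = U / lambda. A quick back-of-the-envelope sketch follows; the absolute numbers are made up for illustration (a baseline of 50% CPU at 100 hits per second is assumed, since the relative changes are all that matter here):

#!/usr/bin/python
# Utilization Law: service demand D = U / lambda.
# The 50% / 100 hits-per-second baseline is an assumption for illustration;
# only the +2 point and -15% changes come from the actual tests.
uBefore, uAfter = 0.50, 0.52                # CPU busy fraction, up 2 points
lambdaBefore, lambdaAfter = 100.0, 85.0     # hits per second, down 15%

dBefore = uBefore / lambdaBefore            # CPU-seconds per hit, baseline
dAfter = uAfter / lambdaAfter               # CPU-seconds per hit, after

print "service demand change: %+.0f%%" % (100 * (dAfter / dBefore - 1.0))
# prints: service demand change: +22%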

This 2% increase in utilization against a 15% decrease in arrival rate was missed by the load testing team completely but fortunately was caught by me. The question, though, is this: what is causing the extra 2% CPU utilization with a 15% decrease in arrival rate?

The load testers have no idea and simply scratch their heads. After watching them flail helplessly I decided to jump into the issue. After all, I did identify the 22% increase in CPU service demand between release versions.

The first place that I looked was the web logs. Because I'm a fanatic about log analysis, I've written various perl routines over the years for slicing and dicing log files. I crunch the logs for the baseline and after tests and immediately see the problem: a major web service that was being called in the baseline is no longer being called in the after test. The difference in hits between the two load tests? 15%. The same as the difference in arrival rate between the tests. Somewhere, something went kaput. The load testers haven't found out what yet, as we've been busy all week (and this weekend) hunting down other issues.
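If you're wondering what "crunching the logs" amounts to, the sketch below is the general idea: count hits per URL in the baseline and after logs and print whatever changed. I'd normally knock this out in perl, but it's shown in Python to match the rest of the code on this blog, and it assumes a common/combined access log format where the request is the first quoted field, so adjust the regex for your own logs:

#!/usr/bin/python
# Count hits per URL in two access logs (baseline, after) and print the
# URLs whose hit counts differ. Assumes common/combined log format with
# the request line in double quotes; adjust requestPattern to taste.
import re
import sys

requestPattern = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

def hitCounts(logPath):
    counts = {}
    for line in open(logPath):
        match = requestPattern.search(line)
        if match:
            url = match.group(1).split('?')[0]  # drop the query string
            counts[url] = counts.get(url, 0) + 1
    return counts

baseline = hitCounts(sys.argv[1])
after = hitCounts(sys.argv[2])

for url in sorted(set(baseline.keys()) | set(after.keys())):
    baselineHits = baseline.get(url, 0)
    afterHits = after.get(url, 0)
    if baselineHits != afterHits:
        print "%-50s %8d -> %8d" % (url, baselineHits, afterHits)

A diff like that would have flagged the missing web service calls in about a minute.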

But the net result is this: if you are a load tester or capacity analyst, you damned well better be able to perform log analysis, even if it is rudimentary. Whether it is via perl or some other system like Splunk, you need to have the tools in your toolbox to get the job done.

In the above example, the load testers did not properly analyze the log files to see what was up. No attempt was made. Heck, they didn't even compare the output of LoadRunner to ensure that the number of transactions between the two tests was the same. A load tester who cannot perform rudimentary log analysis is useless to me.

Saturday, June 4, 2011

Flexible Modeling with PyDQ

A while back I ran into a situation where I needed to model capacity on a server based upon a set of page hits into the server. Fortunately, during testing it was a simple task to determine the service demand of each page hit, and I developed a bit of Python code around the PyDQ API to create a flexible model of the system.

This bit of code pulls in a .CSV file named on the command line; the .CSV file defines the queueing nodes of the system, the pages hitting it, and each page's service demand at each queueing node. The nodes are not limited to a single system. In my particular use, I had two servers in the mix: a web server on the front end and a back-end database server.

Example model.CSV file:

workstream,coefficient,s1.cpu,s1.nic.sent,s1.nic.recv,dat.cpu
somePageA.jsp,0.1,0.029,0.01296,0.0000599,0.00137853
somePageB.jsp,0.8,0.00725,0.00324,0.0000599,0.000981735
somePageC.jsp,0.1,0.00525,0.01324,0.0000599,0.000981735

workstream is the name of each page that will be hitting the system.
coefficient is the fraction of the total arrival rate attributed to each page. In the example above, if the arrival rate of page hits is 100 pages per second, 10 hits per second will be attributed to somePageA.jsp, 80 hits per second to somePageB.jsp and, finally, the remaining 10 hits per second to somePageC.jsp.

The rest of the columns define the service demand, in seconds per hit, at each identified queueing node such as CPU, NICs, etc.

The code will crunch the numbers and find the maximum arrival rate for requests in the pre-defined ratios, then generate .PNG graphs along with .CSV output. The graphs do a good job of letting management-type folks see what the system will be doing under load. I found the real response times of the system in question to be very close to the numbers PyDQ predicted.
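As a sanity check on the model, the saturation point can also be had with pencil and paper: in an open model like this one, queue k saturates when lambda * sum over pages of (coefficient * demand at k) reaches 1.0, so the bottleneck queue caps the whole system. Plugging in the model.CSV numbers from above:

#!/usr/bin/python
# Back-of-the-envelope bottleneck check using the model.CSV values above:
# lambda_max = 1 / max over queues of sum(coefficient * demand).
queues = ['s1.cpu', 's1.nic.sent', 's1.nic.recv', 'dat.cpu']
pages = [   # (coefficient, demands per queue), straight from model.CSV
    (0.1, [0.029, 0.01296, 0.0000599, 0.00137853]),
    (0.8, [0.00725, 0.00324, 0.0000599, 0.000981735]),
    (0.1, [0.00525, 0.01324, 0.0000599, 0.000981735]),
]

weightedDemands = []
for k in range(len(queues)):
    demand = sum(c * d[k] for (c, d) in pages)  # weighted demand per hit
    weightedDemands.append((demand, queues[k]))

demandMax, bottleneck = max(weightedDemands)
print "bottleneck: %s, lambda_max ~ %.0f hits/sec" % (bottleneck, 1.0 / demandMax)
# prints: bottleneck: s1.cpu, lambda_max ~ 108 hits/sec

which is right about where the while loop in the code below stops stepping the arrival rate.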

Response times per page by total arrival rate into the system:


Queueing node utilization by total arrival rate into the system:



And finally, the Python code itself:

#!/usr/bin/python

from pylab import *
import pdq
import sys

# Load the model workload .CSV file

workLoadFile = open(sys.argv[1], 'r')

serviceDemandByPage = {}
serverQueues = []

lineCount = 0

for inputLine in workLoadFile:
    inputLine = inputLine.rstrip('\r\n')
    workLoadElements = inputLine.split(',')

    if lineCount == 0:
        # Header row: every column after workstream,coefficient names a queue
        serverQueues = workLoadElements[2:]
    else:
        workLoadName = workLoadElements[0]
        coefficient = workLoadElements[1]
        serviceDemand = [float(s) for s in workLoadElements[2:]]

        serviceDemandByPage[workLoadName] = {}
        serviceDemandByPage[workLoadName]['%'] = float(coefficient)

        for queueName in serverQueues:
            serviceDemandByPage[workLoadName][queueName] = serviceDemand[serverQueues.index(queueName)]

    lineCount = lineCount + 1

# Initial model settings

workLoadMatrix = []

for workLoadName in serviceDemandByPage.keys():
    workLoadMatrix.append([serviceDemandByPage[workLoadName]['%'], workLoadName])
    del serviceDemandByPage[workLoadName]['%']

workingData = {}
finalResults = {}

arrivalRate = 0.1

contLoop = True

# Step the total arrival rate upward until some queue saturates

while contLoop:

    workLoadArray = []
    workingData[arrivalRate] = {}

    pdq.Init("arrivalRate = %f" % arrivalRate)

    for queueName in serverQueues:
        pdq.CreateNode(queueName, pdq.CEN, pdq.FCFS)

    for workLoad in workLoadMatrix:

        coefficient = workLoad[0]
        workLoadName = workLoad[1]

        workLoadArray.append([workLoadName, coefficient])

        workingData[arrivalRate][workLoadName] = {}

        # Each page gets its coefficient's share of the total arrival rate
        workLoadLambda = coefficient * arrivalRate

        pdq.streams = pdq.CreateOpen(workLoadName, workLoadLambda)
        pdq.SetWUnit("Hits")
        pdq.SetTUnit("Seconds")

        for queueName in serverQueues:
            totalServiceDemand = serviceDemandByPage[workLoadName][queueName]
            workingData[arrivalRate][workLoadName][queueName] = 0
            pdq.SetDemand(queueName, workLoadName, totalServiceDemand)

    pdq.Solve(pdq.CANON)

    # Pull per-workload utilization and response time out of the solved model
    for workLoad in workLoadArray:
        workLoadName = workLoad[0]
        for queueName in serverQueues:
            workingData[arrivalRate][workLoadName][queueName] = pdq.GetUtilization(queueName, workLoadName, pdq.TRANS)

    for workLoadData in workLoadArray:
        workLoadName = workLoadData[0]
        coefficient = workLoadData[1]
        workingData[arrivalRate][workLoadName]['R'] = pdq.GetResponse(pdq.TRANS, workLoadName)
        workingData[arrivalRate][workLoadName]['%'] = coefficient

    # Stop once any queue's total utilization approaches 1.0
    for queueName in serverQueues:
        queueTotal = 0.0
        for workLoad in serviceDemandByPage.keys():
            queueTotal += workingData[arrivalRate][workLoad][queueName]
        if queueTotal > 0.99:
            contLoop = False

    arrivalRate += 0.1

# Roll the working data up: total rho per queue, R and % per workload

for arrivalRate in sorted(workingData.keys()):

    finalResults[arrivalRate] = {}
    finalResults[arrivalRate]['R'] = {}
    finalResults[arrivalRate]['%'] = {}

    arrivalData = workingData[arrivalRate]

    workLoadArray = arrivalData.keys()

    for queueName in serverQueues:
        finalResults[arrivalRate][queueName] = {}
        rho = 0.0
        for workLoad in workLoadArray:
            rho = rho + arrivalData[workLoad][queueName]
        finalResults[arrivalRate][queueName]['rho'] = rho

    for workLoad in workLoadArray:
        finalResults[arrivalRate]['R'][workLoad] = arrivalData[workLoad]['R']
        finalResults[arrivalRate]['%'][workLoad] = arrivalData[workLoad]['%']

# .CSV output

arrivalRateArray = []
queueUtilization = {}
responseTimes = {}

reportHeader = "lambda"

for queueName in serverQueues:
    reportHeader = reportHeader + ",%s" % queueName

for workLoad in workLoadMatrix:
    reportHeader = reportHeader + ",%s" % workLoad[1]

print reportHeader

for arrivalRate in sorted(finalResults.keys()):
    arrivalRateArray.append(arrivalRate)
    outputLine = "%0.3f" % arrivalRate

    for queueName in serverQueues:
        outputLine = outputLine + ",%0.3f" % finalResults[arrivalRate][queueName]['rho']
        if queueName not in queueUtilization:
            queueUtilization[queueName] = []
        queueUtilization[queueName].append(finalResults[arrivalRate][queueName]['rho'])

    # Walk workLoadMatrix so the columns stay in the same order as the header
    for workLoad in workLoadMatrix:
        workLoadName = workLoad[1]
        if workLoadName not in responseTimes:
            responseTimes[workLoadName] = []
        responseTimes[workLoadName].append(finalResults[arrivalRate]['R'][workLoadName])
        outputLine = outputLine + ",%0.3f" % finalResults[arrivalRate]['R'][workLoadName]

    print outputLine

# graph output

F = gcf()
dpi = F.get_dpi()
xInches = 1024 / dpi
yInches = (768 / 2) / dpi

rcParams['figure.figsize'] = xInches, yInches

# One utilization graph per queue
for queueName in serverQueues:
    figure()
    plot(arrivalRateArray, queueUtilization[queueName], linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Utilization of %s" % queueName)
    title("Utilization of queue %s by arrival rate" % queueName)
    grid(True)
    savefig("%s.utilization.png" % queueName)

# All queues on one graph
figure()

for queueName in serverQueues:
    plot(arrivalRateArray, queueUtilization[queueName], label="%s" % queueName, linewidth=1.0)

xlabel("Arrival Rate, hits per second")
ylabel("Utilization")
title("Utilization of queues by arrival rate")
grid(True)
legend(loc=2)
savefig("utilization.by.arrival.rate.png")

# One response-time graph per page
for workLoad in workLoadMatrix:
    figure()
    plot(arrivalRateArray, responseTimes[workLoad[1]], linewidth=1.0)
    xlabel("Arrival Rate, hits per second")
    ylabel("Response Time, seconds, for %s" % workLoad[1])
    title("Response Time for %s by arrival rate" % workLoad[1])
    grid(True)
    savefig("%s.response.time.png" % workLoad[1])

# All pages on one graph
figure()

for workLoad in workLoadMatrix:
    plot(arrivalRateArray, responseTimes[workLoad[1]], label="%s" % workLoad[1], linewidth=1.0)

xlabel("Arrival Rate, hits per second")
ylabel("Response Times, seconds")
title("Response Times by arrival rate")
grid(True)
legend(loc=2)
savefig("response.times.by.arrival.rate.png")
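Invocation is nothing fancy: the script reads the model .CSV named on the command line, prints the results .CSV to stdout, and drops the .PNG graphs into the current directory. Assuming you saved it as pydq_model.py (the name is up to you):

./pydq_model.py model.csv > model.results.csv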