I Crush Servers: The Value of Log Analysis

Something I run into time and time again is capacity planners and load testers who have no concept of how to perform log analysis. There is no excuse for this. Perl has been available for 20+ years and is my bread and butter when it comes to slicing and dicing logs. But if you are not a Perl guru, there are plenty of alternatives for the analyst who wants to crunch some log files.

I've heard a lot of good things about Microsoft LogParser. No Perl required.

Personally, I prefer Perl, but then I've been coding Perl routines for 13+ years.

But back to log analysis. Here is an example from earlier this week.

The load test team was testing a major install that is going into production shortly, and management was hot and heavy to get it tested. When load testing was finally done, the results showed that the CPU utilization of servers in a particular tier had increased by 2% while the arrival rate had decreased by 15%. The result? Measured CPU service demand increased by roughly 22%, which has an effect on the number of servers required for production.
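For anyone who wants to check this kind of arithmetic, here is a minimal Python sketch of the utilization law, D = U / X (service demand equals utilization divided by throughput). The baseline figures are made up for illustration and are not the numbers from this test, and both percentages are treated as relative changes. Under those assumptions the increase comes out near 20%, so the exact figure depends on the real measurements.

```python
def service_demand(cpu_utilization, arrival_rate):
    """Utilization law: CPU-seconds consumed per request (D = U / X)."""
    return cpu_utilization / arrival_rate


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Hypothetical baseline: 50% CPU busy at 100 requests per second.
baseline_util, baseline_rate = 0.50, 100.0

# Assumed reading of the results: utilization up 2%, arrival rate down 15%,
# both taken as relative changes against the baseline.
after_util = baseline_util * 1.02
after_rate = baseline_rate * 0.85

demand_before = service_demand(baseline_util, baseline_rate)
demand_after = service_demand(after_util, after_rate)
demand_change = pct_change(demand_before, demand_after)

print(f"Service demand change: {demand_change:.1f}%")
```

The point is that a small rise in utilization combined with a drop in arrival rate is a large rise in per-request cost, which is easy to miss if you only look at the utilization graph.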

This 2% increase in CPU utilization alongside the drop in arrival rate was missed completely by the load testing team, but fortunately I caught it. The question is this: what is causing the extra 2% CPU utilization with a 15% decrease in arrival rate?

The load testers had no idea and simply scratched their heads. After watching them flail helplessly, I decided to jump into the issue. After all, I was the one who identified the 22% increase in CPU service demand between release versions.

The first place I looked was the web logs. Because I'm a fanatic about log analysis, I've written various Perl routines over the years for slicing and dicing log files. I crunched the logs for the baseline and after tests and immediately saw the problem: a major web service that was being called in the baseline was no longer being called in the after test. The difference in hits between the two load tests? 15%, the same as the drop in arrival rate between tests. Somewhere, something went kaput. The load testers haven't found out what yet, as we've been busy all week (and this weekend) hunting down other issues.
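The comparison itself does not take much code. I use Perl, but as a rough sketch for the non-Perl folks, the Python below counts hits per URL in two sets of access log lines and lists the URLs that vanished in the second set. It assumes Common Log Format style lines containing a quoted request such as "GET /path HTTP/1.1". The sample lines, paths, and function names are hypothetical and not taken from the actual test.

```python
import re
from collections import Counter

# Matches the quoted request in a Common Log Format style line (assumed format).
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/')


def count_hits(lines):
    """Count hits per URL path, ignoring query strings and unparseable lines."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            path = match.group(1).split("?", 1)[0]
            hits[path] += 1
    return hits


def missing_in_after(baseline, after):
    """Paths hit in the baseline test but never hit in the after test."""
    return sorted((path, n) for path, n in baseline.items() if after[path] == 0)


# Hypothetical sample data standing in for the two tests' web logs.
baseline_lines = [
    '10.0.0.1 - - [19/Jun/2011:10:00:00 -0500] "GET /app/home HTTP/1.1" 200 512',
    '10.0.0.1 - - [19/Jun/2011:10:00:01 -0500] "POST /ws/quote?id=7 HTTP/1.1" 200 128',
    '10.0.0.2 - - [19/Jun/2011:10:00:02 -0500] "GET /app/home HTTP/1.1" 200 512',
]
after_lines = [
    '10.0.0.1 - - [19/Jun/2011:11:00:00 -0500] "GET /app/home HTTP/1.1" 200 512',
    '10.0.0.2 - - [19/Jun/2011:11:00:02 -0500] "GET /app/home HTTP/1.1" 200 512',
]

baseline_hits = count_hits(baseline_lines)
after_hits = count_hits(after_lines)

print("Total hits:", sum(baseline_hits.values()), "->", sum(after_hits.values()))
for path, n in missing_in_after(baseline_hits, after_hits):
    print(f"{path}: {n} hits in baseline, none in after test")
```

With real logs you would feed each function an open file handle instead of a list, since both only need something that yields lines.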

But the net result is this: if you are a load tester or capacity analyst, you damned well better be able to perform log analysis, even if it is rudimentary. Whether it is via Perl or some other system like Splunk, you need to have the tools in your toolbox to get the job done.

In the above example, the load testers did not analyze the log files to see what was up. No attempt was made. Heck, they didn't even compare the LoadRunner output to ensure that the number of transactions was the same in both tests. A load tester who cannot perform rudimentary log analysis is useless to me.

Sunday, June 19, 2011
