From PROGRESSIONS #56 Fall 2003
One of the most significant challenges of managing complex systems is determining how the system will behave as changes are being made. Changes in the business environment, the user load, hardware and software; changes to the schema or configuration of the database and even small and seemingly innocuous changes to tunable parameters all have enormous potential for unexpected (and decidedly negative) consequences. Properly testing the impact of such changes is often seen as "too complex" or "too expensive."
Load simulation is a simple and effective technique that can help to bring this complexity under control and provide meaningful answers to questions about the likely impact of such changes. This article presents a straightforward approach to defining a realistic user load and 4GL-based tools for simulating it.
Typical questions that can be answered by load simulation include:
Perhaps the most difficult hurdle to implementing load simulation is simply taking the time to characterize the load. In many organizations there are no formal metrics that are routinely used to communicate the concept of "load". To some extent this is because there are many valid measurements of system load and no single number can cover all aspects of the question. Another key difficulty is that many of the questions that one wants to pose via load simulation have a "business" component – and even fewer organizations have tied key business drivers to the technical metrics that reflect load.
So where can we start? At a fundamental level business applications accept input, process it in some way and output results. The details, of course, vary. But these components are always present and some general observations can be made about them. Broadly speaking:
There are, of course, specific processes that are exceptions to these points. These exceptions are probably "well known" in your organization. In fact the exceptions are probably the areas (if any) where you have measurements pertaining to performance. You may know, for instance, that the month end trial balance takes 17 hours and 20 minutes to run and that nobody else can get any work done if it runs during the business day.
Historical database performance data is essential to a successful load testing environment. You need to know what is "normal" for your systems. The most critical metrics to collect are:
You need to collect this data and be familiar with it in order to understand the load on your systems. Notice that this data is all about the load – none of it speaks to the efficiency of the system. It only characterizes what is being asked of the system. It is certainly interesting and useful to know about things like OS reads, buffer hit ratios, latch timeouts, buffers flushed and so forth but they speak to efficiency – not load.
Scripts to collect this data and more can be reviewed and downloaded at:
http://www.greenfieldtech.com/downloads.shtml
In addition to historical data regarding database load you should also quantify business load. Every business has key metrics that indicate to management how well things are going. Perhaps it is sales volume in dollars, number of orders processed or widgets produced. There are some numbers somewhere in the database that quantify load from a business perspective. Correlating this metric with the database load gives you a very powerful tool for capacity planning and load simulation. Without it you're just guessing anytime the question is related to a business driver.
Given the assumptions above the main thing that we need is a method of realistically reading records in a manner that reflects what real users do with the system. To do that we need to know the distribution of reads among users, the distribution of reads among tables and the degree of "think time" between requests.
Examining the output of the PROMON "IO by User" screen often reveals that users do not uniformly access the database. A profile such as this is common:
10/22/03 I/O Operations by Process
10:01:54
-------- Database ----- ---- BI ----- ---- AI -----
Usr Name Access Read Write Read Write Read Write
0 tom 4513 22 2 271 140 0 0
5 1 0 0 0 9899 0 0
6 1 0 20 0 0 0 0
7 1 0 18 0 0 0 0
8 tom 16707 271 0 0 0 0 0
9 tom 2094413 8 285 723 6765 0 0
10 tom 740648 0 89 213 1660 0 0
11 jami 81990 542 0 0 0 0 0
12 julia 78902 50 0 0 0 0 0
13 peter 81290 531 0 0 0 0 0
14 emily 74588 251 0 0 0 0 0
15 tucker 42662 227 0 0 0 0 0
16 granite 28290 0 0 0 0 0 0
17 tiger 15786 0 0 0 0 0 0
18 jami 9085 4 0 0 0 0 0
…
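As a rough illustration, the rows shown in the listing above can be ranked by Database Access to see how concentrated the load is. This is a minimal Python sketch, not part of the PACE toolkit; the function name `cumulative_share` is made up for the example, and because the listing is truncated the shares cover only the visible rows.

```python
# Database Access counts per user, copied from the rows shown in the listing
# above. The listing is truncated, so the shares cover only the visible rows.
access_by_usr = {
    0: 4513, 5: 1, 6: 1, 7: 1, 8: 16707, 9: 2094413, 10: 740648,
    11: 81990, 12: 78902, 13: 81290, 14: 74588, 15: 42662,
    16: 28290, 17: 15786, 18: 9085,
}

def cumulative_share(counts):
    """Return (usr, access, running share of total), busiest user first."""
    total = sum(counts.values())
    running = 0
    rows = []
    for usr, access in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += access
        rows.append((usr, access, running / total))
    return rows

ranked = cumulative_share(access_by_usr)
for usr, access, share in ranked[:5]:
    print(f"usr {usr:>3}  access {access:>9}  cumulative {share:6.1%}")
```

For the rows shown, users 9 and 10 together account for roughly 87% of the database accesses.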
Sorted, this output would reveal that a very small number of users account for the bulk of logical IO operations (Database Access).
[Figure: database access per user, sorted. The vertical axis is db access converted to the rate per second.]
In part this is because it is also typical to see a lot of "think time" on systems:
$ idlx
User activity & system load
09:40AM up 44 days, 10:07, 1058 users, load average: 2.47, 2.21, 2.20
Currently Active: 218
Idle Users:
0:01 73
0:02 59
0:03 42
0:04 27
0:05 29
0:06 22
0:07 25
0:08 30
0:09 26
10m - 1hr: 366
Hour+ old: 128
Day+ old: 13
As you can see, of the 1058 users logged on to this system only 218 have been active in the last minute; an additional 73 have been inactive for only a minute, and so forth. The application shown here obviously involves considerable "think time".
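The proportions behind that observation can be worked out directly from the idlx output. The sketch below is Python used purely for the arithmetic; the variable names are illustrative and the counts are copied from the sample above.

```python
# Counts copied from the idlx output above: users active within the last
# minute, users idle for 1 to 9 minutes, and the three longer idle buckets.
active = 218
idle_minutes = {1: 73, 2: 59, 3: 42, 4: 27, 5: 29, 6: 22, 7: 25, 8: 30, 9: 26}
idle_long = {"10m - 1hr": 366, "Hour+ old": 128, "Day+ old": 13}

total = active + sum(idle_minutes.values()) + sum(idle_long.values())
active_share = active / total
long_idle_share = sum(idle_long.values()) / total

print(f"total users: {total}")
print(f"active in the last minute: {active_share:.1%}")
print(f"idle ten minutes or more: {long_idle_share:.1%}")
```

The buckets add up to the 1058 logged-in users. About one user in five was active in the last minute and nearly half had been idle for ten minutes or more.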
Table access in the database is also non-uniform – some tables are much more active than others:
TableRead:
                                 Total             Rate          Percentage
 Tbl Table                 Cumulative Interval Accum/s Inter/s  Accum%  Inte
---- ------------------------- ---------- -------- ------- ------- ------- -----
   4 OrderLine                    2270392   398229       0    1327  52.00%  46.0
  24 POLine                        782050   192252       0     641  18.00%  22.0
  18 Order                         639465   151586       0     505  15.00%  17.0
  23 PurchaseOrder                 324253    61855       0     206   7.00%   7.0
   2 Customer                      157219    41531       0     138   4.00%   5.0
  21 Bin                           115240    13134       0      44   3.00%   2.0
  10 Employee                       16246    10475       0      35   0.00%   1.0
  22 InventoryTrans                 20427     1304       0       4   0.00%   0.0
   8 RefCall                          980      948       0       3   0.00%   0.0
  25 SupplierItemXref                8190      511       0       2   0.00%   0.0
  13 Family                         14000      369       0       1   0.00%   0.0
  15 Benefits                        3708       80       0       0   0.00%   0.0
   3 Item                            3297       76       0       0   0.00%   0.0
   1 Invoice                        10090       35       0       0   0.00%   0.0
   6 State                           2056       17       0       0   0.00%   0.0
  12 Vacation                        8550       17       0       0   0.00%   0.0
  11 TimeSheet                       1463        0       0       0   0.00%   0.0
   9 Feedback                           3        0       0       0   0.00%   0.0
   7 LocalDefault                     264        0       0       0   0.00%   0.0
   5 Salesrep                           0        0       0       0   0.00%   0.0
  14 Department                         0        0       0       0   0.00%   0.0
  20 Warehouse                      13332        0       0       0   0.00%   0.0
  16 ShipTo                             0        0       0       0   0.00%   0.0
It would be very useful to know table access by user but, unfortunately, that data is not available to us. If the data above is "typical" of the load that we wish to simulate (you need to examine your load history to determine that) then we have everything we need to create a profile for a basic load simulation.
Programming the Simulator
You could go out and spend hundreds of thousands of dollars on fancy load simulation software. Or you can download:
http://www.greenfieldtech.com/downloads/files/pace.tar
It's up to you, but I'm going to explain how to program the PACE toolkit. The heart of the PACE toolkit is a simple include file, z.i:
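/* z.i: read j records from table {1}; k counts the records read so far */
do while k < j:
  for each {1} no-lock:
    k = k + 1.
    if k >= j then leave.   /* stop once exactly j records have been read */
  end.
end.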
This include file simply reads J records from the table which is passed in argument {1}. J is a random number selected in the parent program (pace.p) according to parameters that you provide. This simple loop provides the load for one session in your simulation. The simplicity of this is attractive but it does have some weaknesses. In particular, the sequential nature of the FOR EACH construct is potentially troublesome. It might also be unrealistic to start every "burst" at the beginning of the table, since your actual "working set" of records is more likely to be at the end. If those are significant considerations for your case the code can be readily modified to reflect your situation.
The next layer up from the basic load loop is the table selection logic. In order to spread the load across the database tables in a fashion similar to actual data access a "switch" statement (filename x.i) is used:
if x <= 9 then {z.i Salesrep}
else if x <= 19 then {z.i Local-Default}
else if x <= 32 then {z.i Ref-Call}
else if x <= 83 then {z.i State}
else if x <= 138 then {z.i Item}
else if x <= 221 then {z.i Customer}
else if x <= 368 then {z.i Invoice}
else if x <= 575 then {z.i Order}
else if x <= 1448 then {z.i Order-Line}
This example distributes requests across the tables of the "sports" database. The variable X is generated by the parent program (pace.p). It is a random number between 1 and an upper limit obtained by summing the read operations in the TableStat data of the load profile. For example:
Table          Rate   Sum
=============  ====  ====
Salesrep          9     9
Local-Default    10    19
Ref-Call         13    32
State            51    83
Item             55   138
Customer         83   221
Invoice         147   368
Order           207   575
Order-Line      873  1448
The data is sorted with the least active table first. The "rate" column is the observed number of record reads for the interval and the "sum" column is a running total of the read rate. (Observant readers will notice that this faked up sample data is actually the number of records per table from the "sports" database – but it serves to illustrate the technique.)
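The running total and the table lookup can be sketched outside the 4GL to check a profile before typing it into x.i. The following Python is an illustration only; `pick_table` is a made-up helper name, and the rates are the sample figures from the table above.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Per-table read rates from the sample profile above, least active first.
rates = [
    ("Salesrep", 9), ("Local-Default", 10), ("Ref-Call", 13),
    ("State", 51), ("Item", 55), ("Customer", 83),
    ("Invoice", 147), ("Order", 207), ("Order-Line", 873),
]

tables = [name for name, _ in rates]
thresholds = list(accumulate(rate for _, rate in rates))  # the "Sum" column

def pick_table(x):
    """Map x in 1..thresholds[-1] to a table, like the if/else chain in x.i."""
    return tables[bisect_left(thresholds, x)]

# Print the lines of x.i that correspond to this profile.
for n, (name, limit) in enumerate(zip(tables, thresholds)):
    prefix = "if" if n == 0 else "else if"
    print(f"{prefix} x <= {limit} then {{z.i {name}}}")

# Draw x the way pace.p does and tally which tables get chosen.
rng = random.Random(2)  # fixed seed so the sketch is repeatable
tally = {name: 0 for name in tables}
for _ in range(10_000):
    tally[pick_table(rng.randint(1, thresholds[-1]))] += 1
print(tally)
```

Because each table owns a slice of the range 1 to 1448 that is as wide as its read rate, a uniformly random X lands on each table in proportion to that rate.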
Once the "switch" statement has been populated the main program, pace.p, must be configured:
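/* main loop of pace.p: r = target reads per second, i = running total of reads, s = start time */
do while true:
  j = random( r / 20 , r * 20 ).   /* size of the next burst of reads */
  i = i + j.
  k = 0.
  x = random( 1, 1448 ).           /* choose a table; 1448 is the final "Sum" of the profile */
  {x.i}
  t = time - s.                    /* elapsed seconds */
  do while (( i / t ) > r ):       /* ahead of the target rate, so nap */
    pause 1 no-message.
    t = time - s.
  end.
end.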
Simply change "1448" to whatever the final summation of your data profile actually is.
The value R is passed in to pace.p as a startup parameter (-param) by the control script. It is the number of reads per second that you wish the session to perform.
You can consider modifying the "20" used in the calculation of J. That calculation controls how "bursty" the individual loads need to be. I've found 20 to work well but you may settle on different values (the limits do not need to use the same factor – nor are they required to be related to R.)
The DO loop at the bottom checks to see if the session is outrunning its target read rate and, if it is, will take a nap until it is back within the desired rate. This simulates both the reading rate profile and the think time of the system. Modifying the calculation of R will impact how often the session needs to sleep and therefore the amount of think time that is simulated.
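The effect of the pacing loop can be checked with a small simulation against a fake clock. This Python sketch is not the toolkit code: it assumes reads take no time at all, starts the clock at one second to avoid dividing by zero, and uses a lower burst limit of at least 1. The name `simulate_session` and its parameters are made up for the illustration.

```python
import random

def simulate_session(r, seconds, seed=2):
    """Simulate one paced session; return (total reads, elapsed seconds, naps)."""
    rng = random.Random(seed)
    clock = 1.0        # fake elapsed time; starts at 1 to avoid dividing by zero
    total_reads = 0    # plays the role of "i" in pace.p
    naps = 0
    while clock < seconds:
        # Burst size, the "j" of pace.p. Lower limit forced to at least 1 (assumption).
        burst = rng.randint(max(1, r // 20), r * 20)
        total_reads += burst
        # Reads are assumed to be instantaneous, so only naps advance the clock.
        while total_reads / clock > r:
            clock += 1.0   # stands in for "pause 1 no-message"
            naps += 1
    return total_reads, clock, naps

reads, elapsed, naps = simulate_session(r=100, seconds=3600)
print(f"{reads} reads in {elapsed:.0f}s = {reads / elapsed:.1f}/s after {naps} naps")
```

However large an individual burst is, the average rate settles at or just under the target R, and the naps are where the simulated think time comes from.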
The final step before running the simulation is to define the number of sessions and their individual read rates. This data will be read by the following script (pace2.sh):
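# pace2.sh: start one batch session per read rate listed in ioload
cat ioload | while read RATE
do
    sleep 1
    echo $RATE
    # "&" puts each session in the background so the loop can start the next one
    mbpro $DBNAME -p pace.p -rand 2 -param $RATE >> pace.log &
done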
The "ioload" data file is simply a list, one per line, of desired read rates (per second) derived from the PROMON "IO By User" screen (or from a VST-based 4GL program). For example:
500
250
125
63
32
16
8
4
2
1
This configuration will start 10 sessions with an aggregate read rate of 1,001 record reads per second. (The "IO By User" screen uses "Database Accesses" as its metric rather than record reads. A general approximation is that there are 2 database access operations for every record read.)
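The arithmetic behind an ioload file can be sketched as follows. This is illustrative Python rather than part of the toolkit; `access_to_read_rate` is a made-up helper, the factor of 2 is the general approximation quoted above, and the 300,000 accesses over five minutes is a hypothetical observation.

```python
# Desired per-session read rates, as listed in the sample ioload file above.
ioload = [500, 250, 125, 63, 32, 16, 8, 4, 2, 1]

# Rule of thumb from the text: 2 database accesses per record read.
ACCESSES_PER_READ = 2

def access_to_read_rate(db_access, interval_seconds):
    """Convert a Database Access count over an interval into record reads per second."""
    return round(db_access / ACCESSES_PER_READ / interval_seconds)

print(len(ioload), "sessions,", sum(ioload), "record reads per second in aggregate")

# Hypothetical observation: 300,000 accesses by one user over a 5 minute sample.
print(access_to_read_rate(300_000, 300), "reads per second for that user")
```

If you can measure the ratio of accesses to record reads on your own system, substitute it for the factor of 2.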
You're now ready to launch the load simulator!
Now that you have a functioning load simulation what do you do with it? How can you apply this tool to your environment?
1) Make it a standard part of acceptance testing to run N users (where N is your maximum expected user count) for some period of time at a realistic load. Run this test whenever changes are promoted to production, whenever Progress is upgraded, whenever database parameters are changed (client or server) and whenever OS level changes are made. What will this do for you?
2) Run the load simulator at all times in your development and test environments. This will help you to better understand how your programs will behave in a production environment where they have to compete for resources.
3) Always run the load simulator when evaluating new hardware or new releases of Progress. Use it as a "burn in" test to exercise all of the components of your database, to encourage "flaky" hardware to fail and to expose problems that build over time (such as memory leaks and counter overflows).
4) Use the load simulator to provide "background noise" to accompany more detailed tests or processes that you suspect might be affected by the load on the system.
5) Run (carefully controlled) experiments in production to gauge the impact of proposed increases in business volume – go ahead and simulate those 100 extra users that management wants to add next week. (This isn't for everyone – many companies would cringe at the thought of doing this in production. But it's quick and effective for those who are willing.)
6) In your test environment compare and contrast different parameter settings and configurations to establish a predicted impact and justification for making the change to production.
7) Justify putting a realistic copy of the production database on your test system so that you can run meaningful test cases before you promote new code to production.
8) Provide a stable test bed for learning how to work with monitoring tools such as ProTop!
9) You can impress all of your friends circled around the PEG Bar at Exchange with the sophistication of your IT process!
Routine load simulation doesn't have to be an impossibly complex or inordinately expensive task. Taking a few simple steps can put this powerful technique into your toolkit and bring immediately useful benefits to your organization.