
Load Simulation for Everyone

From PROGRESSIONS #56 Fall 2003

Overview

One of the most significant challenges of managing complex systems is determining how the system will behave as changes are being made. Changes in the business environment, the user load, hardware and software; changes to the schema or configuration of the database and even small and seemingly innocuous changes to tunable parameters all have enormous potential for unexpected (and decidedly negative) consequences. Properly testing the impact of such changes is often seen as "too complex" or "too expensive."

Load simulation is a simple and effective technique that can help to bring this complexity under control and provide meaningful answers to questions about the likely impact of such changes. This article presents a straightforward approach to defining a realistic user load and 4GL-based tools for simulating it.

Typical questions that can be answered by load simulation include:

Where to Start?

Perhaps the most difficult hurdle to implementing load simulation is simply taking the time to characterize the load. In many organizations there are no formal metrics that are routinely used to communicate the concept of "load". To some extent this is because there are many valid measurements of system load and no single number can cover all aspects of the question. Another key difficulty is that many of the questions that one wants to pose via load simulation have a "business" component – and even fewer organizations have tied key business drivers to the technical metrics that reflect load.

So where can we start? At a fundamental level business applications accept input, process it in some way and output results. The details, of course, vary. But these components are always present and some general observations can be made about them. Broadly speaking:

There are, of course, specific processes that are exceptions to these points. These exceptions are probably "well known" in your organization. In fact the exceptions are probably the areas (if any) where you have measurements pertaining to performance. You may know, for instance, that the month end trial balance takes 17 hours and 20 minutes to run and that nobody else can get any work done if it runs during the business day.

Historical database performance data is essential to a successful load testing environment. You need to know what is "normal" for your systems. The most critical metrics to collect are:

You need to collect this data and be familiar with it in order to understand the load on your systems. Notice that this data is all about the load – none of it speaks to the efficiency of the system. It only characterizes what is being asked of the system. It is certainly interesting and useful to know about things like OS reads, buffer hit ratios, latch timeouts, buffers flushed and so forth but they speak to efficiency – not load.

Scripts to collect this data and more can be reviewed and downloaded at:

http://www.greenfieldtech.com/downloads.shtml

In addition to historical data regarding database load you should also quantify business load. Every business has key metrics that indicate to management how well things are going. Perhaps it is sales volume in dollars, number of orders processed or widgets produced. There are some numbers somewhere in the database that quantify load from a business perspective. Correlating this metric with the database load gives you a very powerful tool for capacity planning and load simulation. Without it you're just guessing anytime the question is related to a business driver.

Now What?

Given the assumptions above the main thing that we need is a method of realistically reading records in a manner that reflects what real users do with the system. To do that we need to know the distribution of reads among users, the distribution of reads among tables and the degree of "think time" between requests.

Examining the output of the PROMON "IO by User" screen often reveals that users do not uniformly access the database. A profile such as this is common:

 
10/22/03        I/O Operations by Process
10:01:54        

                -------- Database -----     ---- BI -----     ---- AI -----
Usr Name        Access    Read    Write     Read    Write     Read    Write

  0 tom          4513       22        2      271      140        0        0
  5                 1        0        0        0     9899        0        0
  6                 1        0       20        0        0        0        0
  7                 1        0       18        0        0        0        0
  8 tom         16707      271        0        0        0        0        0
  9 tom       2094413        8      285      723     6765        0        0
 10 tom        740648        0       89      213     1660        0        0
 11 jami        81990      542        0        0        0        0        0
 12 julia       78902       50        0        0        0        0        0
 13 peter       81290      531        0        0        0        0        0
 14 emily       74588      251        0        0        0        0        0
 15 tucker      42662      227        0        0        0        0        0
 16 granite     28290        0        0        0        0        0        0
 17 tiger       15786        0        0        0        0        0        0
 18 jami         9085        4        0        0        0        0        0

Sorted, this output would reveal that a very small number of users account for the bulk of logical IO operations (Database Access):


(The vertical axis is db access converted to the rate per second.) In part this is because it is also typical to see a lot of "think time" on systems:

$ idlx

User activity & system load

  09:40AM   up 44 days,  10:07,  1058 users,  load average: 2.47, 2.21, 2.20

Currently Active:  218

Idle Users:

    0:01  73
    0:02  59
    0:03  42
    0:04  27
    0:05  29
    0:06  22
    0:07  25
    0:08  30
    0:09  26

10m - 1hr:  366
Hour+ old:  128
 Day+ old:  13

As you can see, of the 1058 users logged on to this system only 218 have been active in the last minute – an additional 73 have been inactive for only a minute, and so forth. The application shown here obviously involves considerable "think time".

Table access in the database is also non-uniform – some tables are much more active than others:

 
TableRead:                              Total            Rate      Percentage
 Tbl Table                     Cumulative Interval Accum/s Inter/s  Accum%  Inte
---- ------------------------- ---------- -------- ------- ------- ------- -----
   4 OrderLine                    2270392   398229       0    1327  52.00%  46.0
  24 POLine                        782050   192252       0     641  18.00%  22.0
  18 Order                         639465   151586       0     505  15.00%  17.0
  23 PurchaseOrder                 324253    61855       0     206   7.00%   7.0
   2 Customer                      157219    41531       0     138   4.00%   5.0
  21 Bin                           115240    13134       0      44   3.00%   2.0
  10 Employee                       16246    10475       0      35   0.00%   1.0
  22 InventoryTrans                 20427     1304       0       4   0.00%   0.0
   8 RefCall                          980      948       0       3   0.00%   0.0
  25 SupplierItemXref                8190      511       0       2   0.00%   0.0
  13 Family                         14000      369       0       1   0.00%   0.0
  15 Benefits                        3708       80       0       0   0.00%   0.0
   3 Item                            3297       76       0       0   0.00%   0.0
   1 Invoice                        10090       35       0       0   0.00%   0.0
   6 State                           2056       17       0       0   0.00%   0.0
  12 Vacation                        8550       17       0       0   0.00%   0.0
  11 TimeSheet                       1463        0       0       0   0.00%   0.0
   9 Feedback                           3        0       0       0   0.00%   0.0
   7 LocalDefault                     264        0       0       0   0.00%   0.0
   5 Salesrep                           0        0       0       0   0.00%   0.0
  14 Department                         0        0       0       0   0.00%   0.0
  20 Warehouse                      13332        0       0       0   0.00%   0.0
  16 ShipTo                             0        0       0       0   0.00%   0.0

It would be very useful to know table access by user but, unfortunately, that data is not available to us. If the data above is "typical" of the load that we wish to simulate (you need to examine your load history to determine that) then we have everything we need to create a profile for a basic load simulation.

Programming the Simulator

You could go out and spend hundreds of thousands of dollars on fancy load simulation software. Or you can download:

http://www.greenfieldtech.com/downloads/files/pace.tar

It's up to you – but I'm going to explain how to program the PACE toolkit. The heart of the PACE toolkit is a simple include file – z.i:

 
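/* z.i - read j records from table {1}; k counts the records read so far */
do while k < j:                 /* restart the scan if the table runs out */
  for each {1} no-lock:
    k = k + 1.
    if k >= j then leave.       /* stop once j records have been read */
  end.
end.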

This include file simply reads J records from the table which is passed in argument {1}. J is a random number selected in the parent program (pace.p) according to parameters that you provide. This simple loop provides the load for one session in your simulation. The simplicity of this is attractive but it does have some weaknesses – in particular the sequential nature of the FOR EACH construct is potentially troublesome. It might also be unrealistic to start every "burst" at the beginning – your actual "working set" of records is more likely to be at the end of the table. If those are significant considerations for your case the code can be readily modified to reflect your situation.

The next layer up from the basic load loop is the table selection logic. In order to spread the load across the database tables in a fashion similar to actual data access a "switch" statement (filename x.i) is used:

 
if       x <=    9 then {z.i Salesrep}
 else if x <=   19 then {z.i Local-Default}
 else if x <=   32 then {z.i Ref-Call}
 else if x <=   83 then {z.i State}
 else if x <=  138 then {z.i Item}
 else if x <=  221 then {z.i Customer}
 else if x <=  368 then {z.i Invoice}
 else if x <=  575 then {z.i Order}
 else if x <= 1448 then {z.i Order-Line}

This example distributes requests across the tables of the "sports" database. The variable X is generated by the parent program (pace.p). It is a random number between 1 and an upper limit, where the limit is the sum of the read operations per table observed in the TableStat data of the load profile. For example:

 
Table			Rate	 Sum
=============		====	====
Salesrep		   9	   9
Local-Default		  10	  19
Ref-Call		  13	  32
State			  51	  83
Item			  55	 138
Customer		  83	 221
Invoice			 147	 368
Order			 207	 575
Order-Line		 873	1448

The data is sorted with the least active table first. The "rate" column is the observed number of record reads for the interval and the "sum" column is a running total of the read rate. (Observant readers will notice that this faked up sample data is actually the number of records per table from the "sports" database – but it serves to illustrate the technique.)
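The running total and the matching "switch" lines can be produced mechanically. What follows is only a minimal sketch and is not part of the PACE toolkit: make_switch is a hypothetical shell function, and the "table reads" input format is an assumption made for illustration. It sorts the pairs with the least active table first and prints lines in the style of x.i.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PACE toolkit.
# Reads "table reads" pairs on stdin and prints x.i-style lines with a
# running total, least active table first.
make_switch() {
    sort -k2,2n | awk '
        {
            sum += $2                                   # running total of reads
            kw = (NR == 1) ? "if      " : " else if"    # first line has no "else"
            printf "%s x <= %4d then {z.i %s}\n", kw, sum, $1
        }'
}

# Sample profile from the table above, deliberately out of order.
make_switch <<'EOF'
Order-Line 873
Customer 83
Salesrep 9
Item 55
Order 207
Local-Default 10
State 51
Invoice 147
Ref-Call 13
EOF
```

The last threshold printed (1448 for this sample) is the value to use as the upper limit of the random number X in pace.p.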

Once the "switch" statement has been populated the main program, pace.p, must be configured:

 
do while true:        

  j = random( r / 20 , r * 20 ).  
  
  i = i + j.
  k = 0.
  
  x = random( 1, 1448 ).
  
  {x.i}
  
  t = time - s.
  do while (( i / t ) > r ):
    pause 1 no-message.
    t = time - s.
  end.

end.

Simply change "1448" to whatever the final summation of your data profile actually is.

The value R is passed in to pace.p as a startup parameter (-param) by the control script. It is the number of reads per second that you wish each session to perform.

You can consider modifying the "20" used in the calculation of J. That calculation controls how "bursty" the individual loads need to be. I've found 20 to work well but you may settle on different values (the limits do not need to use the same factor – nor are they required to be related to R.)

The DO loop at the bottom checks to see if the session is outrunning its target read rate and, if it is, will take a nap until it is back within the desired rate. This simulates both the reading rate profile and the think time of the system. Modifying the calculation of R will impact how often the session needs to sleep and therefore the amount of think time that is simulated.
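To get a feel for how much think time the throttle produces, the arithmetic can be worked through outside of the 4GL. The sketch below is an illustration only: nap_seconds is a hypothetical helper, not part of the PACE toolkit. It assumes whole seconds, as pace.p does with TIME and PAUSE 1, and computes how long a session must sleep before its total reads divided by elapsed seconds falls back to R.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PACE toolkit.
# Seconds a session must sleep so that i / t is no longer greater than r.
#   $1 = records read so far (i)
#   $2 = elapsed seconds     (t)
#   $3 = target reads/second (r)
nap_seconds() {
    needed=$(( ($1 + $3 - 1) / $3 ))   # smallest whole t with i / t <= r
    nap=$(( needed - $2 ))
    if [ "$nap" -lt 0 ]; then
        nap=0                          # already at or below the target rate
    fi
    echo "$nap"
}

# R = 500 and the largest possible first burst, J = 500 * 20 = 10000,
# completed in 2 seconds: the session sleeps until 20 seconds have elapsed.
nap_seconds 10000 2 500    # prints 18

# A small burst of 400 reads in 1 second is already within the target.
nap_seconds 400 1 500      # prints 0
```

Larger bursts therefore translate directly into longer naps, which is why the factor used in the calculation of J shapes the simulated think time.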

The final step before running the simulation is to define the number of sessions and their individual read rates. This data will be read by the following script (pace2.sh):

 
cat ioload | while read RATE
do
        sleep 1
        echo $RATE
        mbpro $DBNAME -p pace.p -rand 2 -param $RATE >> pace.log
done

The "ioload" data file is simply a list, one per line, of desired read rates (per second) derived from the PROMON "IO By User" screen (or from a VST-based 4GL program). For example:

 
500
250
125
63
32
16
8
4
2
1

This configuration will start 10 sessions with an aggregate read rate of 1,001 record reads per second. (The "IO By User" screen uses "Database Accesses" as its metric rather than record reads. A general approximation is that there are 2 database access operations for every record read.)
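Deriving the rates by hand is tedious, so a small filter can help. The following is a rough sketch under stated assumptions, not part of the PACE toolkit: promon_to_ioload and total_rate are hypothetical function names, the input is assumed to be text saved from the "I/O Operations by Process" screen in the layout shown earlier, and the number of seconds that the counts cover is something you must supply yourself. It applies the approximation of 2 database accesses per record read.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the PACE toolkit.

# Convert "I/O Operations by Process" text on stdin into read rates.
#   $1 = seconds covered by the counts (assumption: you know the interval)
promon_to_ioload() {
    awk -v secs="$1" '
        $1 ~ /^[0-9]+$/ && NF >= 8 {
            access = $(NF - 6)                    # Access column; Name may be blank
            rate = int(access / secs / 2 + 0.5)   # 2 accesses per record read
            if (rate > 0) print rate              # skip sessions that round to 0
        }' | sort -rn
}

# Sum a list of rates (one per line) to check the aggregate load.
total_rate() {
    awk '{ s += $1 } END { print s + 0 }'
}

# A few rows from the sample screen, assuming the counts span one hour.
promon_to_ioload 3600 <<'EOF'
Usr Name        Access    Read    Write     Read    Write     Read    Write
  5                 1        0        0        0     9899        0        0
  9 tom       2094413        8      285      723     6765        0        0
 10 tom        740648        0       89      213     1660        0        0
 11 jami        81990      542        0        0        0        0        0
EOF

# Aggregate of the example ioload file shown above.
printf '%s\n' 500 250 125 63 32 16 8 4 2 1 | total_rate
```

Redirecting the output of promon_to_ioload to a file named ioload gives pace2.sh its input, and total_rate confirms the aggregate (1001 for the example list).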

You're now ready to launch the load simulator!

What Can I Do With This Thing?

Now that you have a functioning load simulation what do you do with it? How can you apply this tool to your environment?

1) Make it a standard part of acceptance testing to run N users (where N is your maximum expected user count) for some period of time at a realistic load. Run this test whenever changes are promoted to production, whenever Progress is upgraded, whenever database parameters are changed (client or server) and whenever OS level changes are made. What will this do for you?

2) Run the load simulator at all times in your development and test environments. This will help you to better understand how your programs will behave in a production environment where they have to compete for resources.

3) Always run the load simulator when evaluating new hardware or new releases of Progress. Use it as a "burn in" test to exercise all of the components of your database, to encourage "flaky" hardware to fail and to expose problems that build over time (such as memory leaks and counter overflows).

4) Use the load simulator to provide "background noise" to accompany more detailed tests or processes that you suspect might be affected by the load on the system.

5) Run (carefully controlled) experiments in production to gauge the impact of proposed increases in business volume – go ahead and simulate those 100 extra users that management wants to add next week. (This isn't for everyone – many companies would cringe at the thought of doing this in production. But it's quick and effective for those who are willing.)

6) In your test environment compare and contrast different parameter settings and configurations to establish a predicted impact and justification for making the change to production.

7) Justify putting a realistic copy of the production database on your test system so that you can run meaningful test cases before you promote new code to production.

8) Provide a stable test bed for learning how to work with monitoring tools such as ProTop!

9) You can impress all of your friends circled around the PEG Bar at Exchange with the sophistication of your IT process!

Conclusion

Routine load simulation doesn't have to be an impossibly complex or inordinately expensive task. Taking a few simple steps can put this powerful technique into your toolkit and bring immediately useful benefits to your organization.



