Development Team Blog

Structs, Arrays and Optimization Strategies (Part 1 of 2)

A common programming problem is the need to process sets of data that are not part of your database; typically, these sets of data are stored in arrays. If the data is structured then you would use arrays of a struct type. For example...

Code:
Struct tNameData
    String sName
    Date dDateOfBirth
End_Struct

tNameData[] NameData    // declares an array of type tNameData
Wherever you have arrays of data like this you will inevitably need to search the array. There are many different searching techniques: looping, array search functions, compare functions, etc. What is the best way to search an array?

As usual, the answer to this question is: it depends. It depends on the structure of the data that you are storing (is it simple or complex), how big your array is, what kind of operations you are performing on the array members and so on.

The Problem Description

To explore this question I will describe a specific problem and then compare various solutions.

Our test application will be required to process lists of globally unique names (or IDs if you prefer). It fetches the names from some external source (left to your imagination). For each name, the application is required to determine the 'date of birth'. This operation is arbitrary and does not really matter for the purposes of our tests. There are two special features of this particular problem:

  1. Each list of names that is passed to the application is sourced from a finite pool of names. Any particular list will mostly contain names that have been processed before; only 100 of the names in each list will be new. In reality, each list would contain fewer and fewer 'new' names, but for the purposes of our test we will assume each list has 100 new names (except the first list, which is all new).
  2. Each Date of Birth lookup will take a fixed amount of time. We might imagine that the application has to perform some costly operation such as calling a web service to get the date of birth.

We will test three different scales, where the application is expected to process lists of 1,000 names, 10,000 names or 50,000 names.

We will also be able to vary the time that it takes to look up the date of birth for each name. The default latency for a lookup will be 5 milliseconds.



For each test, the application will be passed a string array of names and is expected to return an array of type tNameData (as declared above) containing names and corresponding dates of birth.

Generating the Test Data

To perform a series of tests the application must simulate being passed an array of names from a large, finite pool of unique names. I decided to create a pool of 60,000 names. For each test the application fetches X names from the pool and marks its position in the pool. The next test advances the position by 100 names and fetches the next list of X names.
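
Purely as an illustration of that mechanism, the fetching could look something like the sketch below. The property and function names are mine (they are not part of the test application); the point is simply that each call returns a window of names and advances the start position by 100.

Code:
// Hypothetical sketch: return the next window of iCount names from the pool
// and advance the starting position by 100 for the following test run.
Property String[] pNamePool        // the 60,000 generated names
Property Integer piPoolPosition 0  // where the current list starts

Function GetNextNameList Integer iCount Returns String[]
    Integer i iPos
    String[] sPool sNames

    Get pNamePool to sPool
    Get piPoolPosition to iPos

    For i from 0 to (iCount - 1)
        Move sPool[iPos + i] to sNames[i]
    Loop

    Set piPoolPosition to (iPos + 100)
    Function_Return sNames
End_Function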

I generated the names by randomly selecting 5 letters from A-Z. The names are all uppercase to keep things simple. I discovered that generating 60,000 unique names like this takes a long time. I did not want to spend time optimizing this process as it has nothing to do with the tests I am trying to conduct. Instead, the application writes the list of names to a text file. The next time the application runs, the names are simply read back from this text file, which saved me quite a bit of time while I refined the test application.
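
Generating a single name is trivial; it is checking that each new name is unique against the names generated so far that makes the process slow. A minimal sketch of just the generation step (the function name is mine) might look like this:

Code:
// Hypothetical sketch: build one random 5-letter uppercase name.
// Random(25) returns a value from 0 to 25; adding 65 maps it onto the ASCII codes for A-Z.
Function RandomName Returns String
    Integer iChar
    String sName

    Move "" to sName
    For iChar from 1 to 5
        Move (sName + Character(Random(25) + 65)) to sName
    Loop

    Function_Return sName
End_Function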

Simulating the Tests

The application simulates looking up the date of birth for each name by generating a random date and then pausing the application for a fixed time.

Code:
Function GetDateOfBirth String sName Returns Date
    Integer iTestLatency iVoid iDay iMonth iYear
        
    // simulate waiting for the specified time....
    // (the latency value is read from a form on the test panel; Sleep is presumably
    // the Windows Sleep API declared as an External_Function)
    Get Value of oLatency_frm to iTestLatency
    Move (Sleep(iTestLatency)) to iVoid
    Move (Random(11)) to iDay
    Move (Random(11)) to iMonth
    Move (Random(9)) to iYear
    Increment iDay
    Increment iMonth
    Move (iYear + 2000) to iYear
        
    Function_Return (Date(String(iDay) + "/" + String(iMonth) + "/" + String(iYear)))
End_Function  // GetDateOfBirth
The date that is generated never has a day greater than 12. This ensures that each generated date can be used in European or American date format. The actual date does not matter.

First Strategy: The Simple Approach

The first strategy we will test is the simplest possible approach. We simply iterate the list of names, perform the date lookup for each name and return the array of tNameData containing the names and matching dates of birth.

I don't need to show you the code that does this work, but I will. Mostly I want to illustrate the difference in coding complexity for each strategy starting with this, the very simplest case.

Code:
Function LookupNamesTest1 String[] sNames Integer ByRef iTime Returns tNameData[]
    Integer iStart iStop
    Integer iName icName
    tNameData[] NameData
        
    Move (GetTickCount()) to iStart
        
    Move (SizeOfArray(sNames)) to icName
    Decrement icName
        
    For iName from 0 to icName
        Move sNames[iName]               to NameData[iName].sName
        Get GetDateOfBirth sNames[iName] to NameData[iName].dDateOfBirth
    Loop
        
    Move (GetTickCount()) to iStop
    Move (iStop - iStart) to iTime
    Function_Return NameData
End_Function  // LookupNamesTest1
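For reference, calling this function (from within the same object) looks something like the following; note the (&iTime) syntax, which passes the timing variable by reference:

Code:
String[] sNames
tNameData[] NameData
Integer iTime

// ... fill sNames with the next list of names from the pool ...

Get LookupNamesTest1 sNames (&iTime) to NameData
// iTime now contains the elapsed time in milliseconds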
When you want to start testing different optimization strategies it is important to establish a baseline to compare against, i.e. how does your optimization compare to a no-optimization approach?

It is also important to measure the different parts of your baseline algorithm to establish which parts are slow and which parts require no optimization at all. Don't do this by simply reading the code and guessing where it will be slow! You will usually waste a lot of time optimizing things that take 20 milliseconds in total so that they now only take 10 milliseconds (wow, a 50% improvement!).
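
One cheap way to measure in DataFlex is to bracket the suspect section with GetTickCount calls, exactly as the test functions in this article do for each routine as a whole:

Code:
Integer iStart iElapsed

Move (GetTickCount()) to iStart
// ... the section of code you suspect is slow ...
Move (GetTickCount() - iStart) to iElapsed   // elapsed time in milliseconds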

A prime example of unnecessary optimization can be seen in LookupNamesTest1 above (I just can't help myself sometimes). Notice that before the FOR loop I get the size of the array and decrement it by 1. This means that the FOR loop runs from the 0th member to the (size of the array - 1)th member. I could have written the loop like this...

Code:
Move (SizeOfArray(sNames)) to icName
        
For iName from 0 to (icName - 1)
    ...
Loop
However this is 'slower' because (icName - 1) has to be calculated on every iteration of the loop. But how much slower is it? Well, I tested that. For 1,000 loops the difference is < 1 millisecond (that's less than 1,000th of a second); for 10,000 loops the test was inconclusive: sometimes I got a difference of 16 ms, sometimes I did not. Probably the difference was due to Windows processes running in the background. Only at 50,000 loops did I notice a real difference... a 50% saving!!!! 15 ms compared to 30 ms. This optimization is clearly a waste of effort. On the other hand it wasn't much effort, so what the heck, but it does illustrate a point.

So how long did it take? On my laptop I got the following results with a 5 millisecond latency time for each date lookup:

1,000 names -> 5.008 seconds
10,000 names -> 50.123 seconds
50,000 names -> 253.158 seconds

As I expected, the time increases linearly with the number of names that are processed. If a user is waiting for this process, then anything above about 2 seconds becomes annoying. 50 seconds will be considered very long and 253 seconds is going to be unacceptable.

This means that we will need to optimize this code. If you remove the latency time for each lookup then the time which remains is:

1,000 names -> 8 milliseconds
10,000 names -> 123 milliseconds
50,000 names -> 3.158 seconds

Clearly any optimization we make should concentrate on reducing the time spent waiting on lookups, and since the latency of each lookup is fixed, that means performing fewer of them. Anything else is really a waste of time.

Second Strategy: Using a Cache

We know that each list of names only contains about 100 new names. This means that most of the lookups that are performed are unnecessary repetitions. We will store a cache of each name and date looked up so far. Instead of calling the web service for each name, we will first look at our cache. We will only call the web service for names that are not currently cached.

What do we expect from this strategy? Well, when the first list of names is processed it will take the same amount of time as the non-optimized approach, because the cache will be empty and so every name will need to be looked up. Actually the first list of names should take slightly longer because we have the additional overhead of updating the cache.

The time required for subsequent runs will depend on how long it takes to search for a name in the cache, plus the latency time multiplied by 100 (the number of names that are not in the cache), i.e. 100 x 5 ms = 500 ms.

This means that no matter how good our caching optimization is, we cannot expect to get below 500 ms per run. That is a pretty optimistic target.

In this strategy we will implement the cache searching in the simplest way possible. The cache will be stored as an array property of type tNameData.

Code:
Property tNameData[] pNameDataCache
The cache is searched in the simplest possible way by iterating the NameDataCache array for each name in the list of names to be processed. Here is the code:

Code:
Function LookupNamesTest2 String[] sNames Integer ByRef iTime Returns tNameData[]
    Integer iStart iStop
    Integer iName icName
    Integer iCache icCache iTopOfCache iFound
    Boolean bFound bChanged
    tNameData[] NameData NameCacheOld NameCacheNew
        
    Move (GetTickCount()) to iStart
        
    // Get the cached data....
    Move False to bChanged
    Get pNameDataCache to NameCacheOld
    Move NameCacheOld to NameCacheNew           // make a copy of the cache
    Move (SizeOfArray(NameCacheOld)) to icCache
    Move icCache to iTopOfCache
    Decrement icCache
        
    // Iterate the set of names....
    Move (SizeOfArray(sNames)) to icName
    Decrement icName
      
    For iName from 0 to icName
        Move False to bFound
            
        For iCache from 0 to icCache
            Move (sNames[iName] = NameCacheOld[iCache].sName) to bFound
              
            If (bFound) Begin
                Move iCache to iFound
            End
            
            If (bFound) Break    // stop looking
        Loop
            
        Move sNames[iName] to NameData[iName].sName
            
        If (bFound) Begin
            // If found, then use the cached value....
            Move NameCacheOld[iFound].dDateOfBirth to NameData[iName].dDateOfBirth
        End
        Else Begin
            // Otherwise we have to look it up and add the value to the cache....
            Get GetDateOfBirth sNames[iName] to NameData[iName].dDateOfBirth
                
            Move NameData[iName].sName        to NameCacheNew[iTopOfCache].sName
            Move NameData[iName].dDateOfBirth to NameCacheNew[iTopOfCache].dDateOfBirth
            Increment iTopOfCache
            Move True to bChanged
        End
    Loop
        
    // If the cache has been updated then we need to save the changes....
    If (bChanged) Begin
        Set pNameDataCache to NameCacheNew
    End
        
    Move (GetTickCount()) to iStop
    Move (iStop - iStart) to iTime
    Function_Return NameData
End_Function  // LookupNamesTest2
We take two copies of the cache: one is used for searching and the other is used for updating (to handle adding new names to the cache). After each name is searched for in the cache, we either use the cached date (if the name was found) or we perform the date lookup and then add the name and date to the new (expanded) cache.
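
The repeated runs below could be driven by a loop along these lines (again just a sketch; GetNextNameList is the hypothetical pool-window function from earlier):

Code:
Integer iRun iTime
String[] sNames
tNameData[] NameData

For iRun from 1 to 6
    Get GetNextNameList 1000 to sNames                // fetch the next list of names
    Get LookupNamesTest2 sNames (&iTime) to NameData  // run the strategy under test
    // record iTime for this run
Loop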

So how does this strategy compare to our baseline? Since the processing time now varies based on how many names have been previously processed, we will need to perform multiple runs of each test:

1,000 names
1st run: 5.023 sec
2nd run: 1.373 sec
3rd - 6th runs: 1.528, 1.684, 1.732, 1.982 sec

The first (cache-building) run took 5.023 sec. This means that the extra overhead required to build the cache was only about 15 ms, which is a good result.

The second run (which uses the cache) took 1.373 sec which is an acceptable reduction if your data size is 1,000 names or less.

When we continue processing lists of names the time keeps growing linearly, even though the number of names being processed does not increase. This is not acceptable, as the total time would soon begin to exceed the baseline. We should have expected this. Each time we process a list of names the cache grows by another 100 entries, so every linear search of the cache has more members to examine. To put rough numbers on it: on the second run the cache already holds 1,000 names, so finding a name takes around 500 comparisons on average, which is roughly 500,000 string comparisons for the run, and each later run adds another 100 names to that cost.

How well does this technique scale with larger lists of names?

10,000 names
1st run: 50.248 sec
2nd run: 72.978 sec
3rd run: 83.086 sec

50,000 names
1st run: 250.802 sec
2nd run: 35 minutes

The first runs confirm that the time required to set up and maintain the cache is insignificant, and that this scales well.

The second runs show that the time spent searching the cache grows roughly quadratically with the number of names to process. We should also have expected this result. We have a loop inside a loop, and both loops grow with the number of names to be processed. For the 50,000-name lists, the second run searches a cache of 50,000 entries for each of the 50,000 names, which is on the order of 50,000 x 25,000 = 1.25 billion string comparisons (each name is compared against half the cache on average); that is why it takes 35 minutes. Added to this, the time taken still increases linearly each time we process a list of names.

What's Next

We have seen that the cost of maintaining a cache is very small, but our technique for searching the cache produced worse results overall than our baseline. In part 2 of this series we will look at using various array searching techniques to try to lower the cache searching time.