Structs, Arrays and Optimization Strategies (Part 1 of 2)
by
, 14-Oct-2009 at 09:52 AM (4306 Views)
A common programming problem is the need to process sets of data that are not part of your database. Typically sets of data that are stored in arrays. If the data is structured then you would use arrays of type struct. For example...
Wherever you have arrays of data like this you will inevitably need to search the array. There are many different searching techniques: looping, array search functions, compare functions, etc. What is the best way to search an array?Code:Struct tNameData String sName Date dDateOfBirth End_Struct tNameData[] NameData // declares an array of type tNameData
As usual, the answer to this question is: it depends. It depends on the structure of the data that you are storing (is is simple or complex), how big your array is, what kind of operations you are performing on the array members and so on.
The Problem Description
To explore this question I will describe a specific problem and then compare various solutions.
Our test application will be required to process lists of globally unique names (or ID's if you prefer). It fetches the names from some external source (up to your imagination). For each name, the application is required to determine the 'date of birth'. This operation is arbitrary and does not really matter for the purposes of our tests. There are two special features of this particular problem:
- Each list of names that is passed to the application is sourced from a finite pool of names. Any particular list will mostly contain names that have been processed before, only 100 of the names in each list will be different. In reality, each list would contain fewer and fewer 'new' names but for the purposes of our test we will assume each list has 100 new names (except the first list which is all new).
- Each Date of Birth lookup will take a fixed amount of time. We might imagine that the application has to perform some costly operation such as calling a web service to get the date of birth.
We will test three different scales where the application is expected to process: 1,000 name lists; 10,000 name lists; or 50,000 name lists.
We will also be able to vary the time that it takes to lookup the date of birth for each name. The default latency for a lookup will be 5 milliseconds.
For each test, the application will be passed a string array of names and is expected to return an array of type tNameData (as declared above) containing names and corresponding dates of birth.
Generating the Test Data
To perform a series of tests the application must simulate being passed an array of names from a large finite pool of unique names. I decided to create a pool of 60,000 names. For each test the application fetches X names from the list and marks its position in the list. The next test advances the position by 100 names and fetches the next list of X names.
I generated the names by randomly selecting 5 letters between A-Z. The names are all uppercase to keep things simple. I discovered that generating 60,000 unique names like this takes a long time. I did not want to spend time optimizing this process as it has nothing to do with the tests I am trying to conduct. Instead, the application writes the list of names to a text file. The next time I run the application the names are just read from this text file which saved me quite a bit of time while I refined the test application.
Simulating the Tests
The application simulates looking up the date of birth for each name by generating a random date and then pausing the application for a fixed time.
The date that is generated never has a day greater than 12. This ensures that each generated date can be used in European or American date format. The actual date does not matter.Code:Function GetDateOfBirth String sName Returns Date Integer iTestLatency iVoid iDay iMonth iYear // simulate waiting for the specified time.... Get Value of oLatency_frm to iTestLatency Move (Sleep(iTestLatency)) to iVoid Move (Random(11)) to iDay Move (Random(11)) to iMonth Move (Random(9)) to iYear Increment iDay Increment iMonth Move (iYear + 2000) to iYear Function_Return (Date(String(iDay) + "/" + String(iMonth) + "/" + String(iYear))) End_Function // GetDateOfBirth
First Strategy: The Simple Approach
The first strategy we will test is the simplest possible approach. We simply iterate the list of names, perform the date lookup for each name and return the array of tNameData containing the names and matching dates of birth.
I don't need to show you the code that does this work, but I will. Mostly I want to illustrate the difference in coding complexity for each strategy starting with this, the very simplest case.
When you want to start testing different optimization strategies it is important to establish a baseline to compare to. i.e. how does your optimization compare to a no-optimization approach.Code:Function LookupNamesTest1 String[] sNames Integer ByRef iTime Returns tNameData[] Integer iStart iStop Integer iName icName tNameData[] NameData Move (GetTickCount()) to iStart Move (SizeOfArray(sNames)) to icName Decrement icName For iName from 0 to icName Move sNames[iName] to NameData[iName].sName Get GetDateOfBirth sNames[iName] to NameData[iName].dDateOfBirth Loop Move (GetTickCount()) to iStop Move (iStop - iStart) to iTime Function_Return NameData End_Function // LookupNamesTest1
It is also important to measure different parts of your baseline algorithm to establish which parts are slow and which parts require no optimization at all. Don't do this by simply reading the code and guessing where it will be slow! You will usually waste a lot of time optimizing things that take 20 milliseconds in total so that they now only take 10 milliseconds (wow a 50% improvement!)
A prime example of unnecessary optimization can be seen in the above code (I just can't help myself sometimes). Notice that in the FOR loop I get the size of the array and decrement it by 1. The means that the FOR loop goes from the 0th member to the size of the array - 1 th member. I could have written the loop like this...
However this is 'slower' because (icName - 1) has to be calculated for each loop. But how much slower is it? Well I tested that. For 1000 loops the difference is < 1 millisecond (that's less than 1,000th of a second); for 10,000 loops the test was inconclusive. Sometimes I got a difference of 16 ms sometimes I did not. Probably the difference was due to windows processes running in the background. Only at 50,000 loops did I notice a real difference... a 50% savings!!!! 15ms compared to 30ms. This optimization is clearly a waste of effort. On the other hand it wasn't much effort so what the heck, but it does illustrate a point.Code:Move (SizeOfArray(sNames)) to icName For iName from 0 to (icName - 1) ... Loop
So how long did it take? On my laptop I got the following results with a 5 millisecond latency time for each date lookup:
1,000 names -> 5.008 seconds
10,000 names -> 50.123 seconds
50,000 names -> 253.158 seconds
As I expected, the time increases linearly with the number of names that are processed. If a user is waiting for this process, then anything above about 2 seconds becomes annoying. 50 seconds will be considered very long and 253 seconds is going to be unacceptable.
This means that we will need to optimize this code. If you remove the latency time for each lookup then the time which remains is:
1,000 names -> 8 milliseconds
10,000 names -> 123 milliseconds
50,000 names -> 3.158 seconds
Clearly any optimizations we make should concentrate on reducing the latency time for each lookup. Anything else is really a waste of time.
Second Strategy: Using a Cache
We know that each list of names only contains about 100 new names. This means that most of the lookups that are performed are unnecessary repetitions. We will store a cache of each name and date looked up so far. Instead of calling the web service for each name, we will first look at our cache. We will only call the web service for names that are not currently cached.
What do we expect from this strategy? Well when the first list of names is processed, then it will take the same amount of time as the non-optimized approach because the cache will be empty and so every name will need to be looked up. Actually the first list of names should take slightly longer because we have the additional overhead of updating the cache.
The time required for subsequent lookups will depend on how long it takes to search for a name in the cache. Plus the latency time multiplied by 100 (the number of names that are not in the cache) i.e. 500ms.
This means that no matter how good our caching optimization is we cannot expect to get below 500ms per run. That is a pretty optimistic target.
In this strategy we will implement the cache searching in the simplest way possible. The cache will be stored as an array property of type tNameData.
The cache is searched in the simplest possible way by iterating the NameDataCache array for each name in the list of names to be processed. Here is the code:Code:Property tNameData[] pNameDataCache
We retrieve two copies of the cache. One is used for searching and the other one is used for updating (to handle adding new names to the cache). After each name is searched in the cache, we either use the cached date (if the name was found) or we need to perform a date lookup and then add the name + date to the new (expanded) cache.Code:Function LookupNamesTest2 String[] sNames Integer ByRef iTime Returns tNameData[] Integer iStart iStop Integer iName icName Integer iCache icCache iTopOfCache iFound Boolean bFound bChanged tNameData[] NameData NameCacheOld NameCacheNew Move (GetTickCount()) to iStart // Get the cached data.... Move False to bChanged Get pNameDataCache to NameCacheOld Move NameCacheOld to NameCacheNew // make a copy of the cache Move (SizeOfArray(NameCacheOld)) to icCache Move icCache to iTopOfCache Decrement icCache // Iterate the set of names.... Move (SizeOfArray(sNames)) to icName Decrement icName For iName from 0 to icName Move False to bFound For iCache from 0 to icCache Move (sNames[iName] = NameCacheOld[iCache].sName) to bFound If (bFound) Begin Move iCache to iFound End If (bFound) Break // stop looking Loop Move sNames[iName] to NameData[iName].sName If (bFound) Begin // If found, then use the cached value.... Move NameCacheOld[iFound].dDateOfBirth to NameData[iName].dDateOfBirth End Else Begin // Otherwise we have to look it up and add the value to the cache.... Get GetDateOfBirth sNames[iName] to NameData[iName].dDateOfBirth Move NameData[iName].sName to NameCacheNew[iTopOfCache].sName Move NameData[iName].dDateOfBirth to NameCacheNew[iTopOfCache].dDateOfBirth Increment iTopOfCache Move True to bChanged End Loop // If the cached has been updated then we need to save the changes.... If (bChanged) Begin Set pNameDataCache to NameCacheNew End Move (GetTickCount()) to iStop Move (iStop - iStart) to iTime Function_Return NameData End_Function // LookupNamesTest2
So how does this strategy compare to our baseline? Since the processing time now varies based on how many names have been previously processed, we will need to perform multiple runs of each test:
1,000 names
1st run: 5.023 sec
2nd run: 1.373 sec
3rd - 6th runs: 1.528, 1.684, 1.732, 1.982 sec
The first (cache building) run took 5.023 sec. This means that the extra overhead required to build the cache was only 17ms which is a good result.
The second run (which uses the cache) took 1.373 sec which is an acceptable reduction if your data size is 1,000 names or less.
When we continue processing lists of names the time starts growing linearly even though the number of names being processed does not increase. This is not acceptable as the total time would soon begin to exceed the baseline. We should have expected this. Each time we process a list of names, the size of the cache grows. This means that the number of cache searches per name increases.
How well does this technique scale with larger lists of names?
10,000 names
1st run: 50.248 sec
2nd run: 72.978 sec
3rd run: 83.086 sec
50,000 names
1st run: 250.802 sec
2nd run: 35 minutes
The first runs confirm that the time required to setup/maintain the cache is insignificant and it scales well.
The second runs show that the time to search the cache increases exponentially with the number of names to process. We should also have expected this result. We have a loop inside a loop. Both loops increase with the number of names to be processed. Added to this, the time taken still increases linearly each time we process a list of names.
What's Next
We have seen that the cost of maintaining a cache is very small, but our technique for searching the cache produced worse results overall than our baseline. In part 2 of this series we will look at using various array searching techniques to try to lower the cache searching time.