
Development Team Blog

Visual DataFlex 15.1 - XML Optimizations

In his article, Visual DataFlex 15.1 Sneak Peek - XML Changes, Sonny Falk explained that Visual DataFlex 15.1 will use a newer XML parser (msxml6), which will allow you to perform schema validations on your documents. In addition to this, we’ve made some changes in our packages that optimize XML processing. These changes, which are not related to the new XML parser, were instigated by a recent forum discussion dealing with XML performance. Andrew Kennard uploaded an XML document and some parsing examples that he hoped could be made faster. Anders Ohrt mentioned that he was working with very large XML documents that seemed to take a very long time to process. We looked into this and found some surprising results, which we used to make some significant optimizations. Here we will discuss what those changes are and what they will mean to you.

First let’s make sure we understand what we are talking about when we refer to XML performance. We can break down what takes time during XML processing into four areas. To handle XML in an application you must:

1. Transfer/Load the document – Typically this is some kind of HTTP transfer.

2. Parse the document – This is an internal step performed entirely by the XML processor. It takes your XML document, parses it and stores it in a DOM (document object model).

3. Process the document – This is the time it takes to step through the XML DOM and move the data into DataFlex variables and structures. Often this is code that you write, although if you are using web services, the client web-service class does most of this for you.

4. Work with the data – After you’ve taken the data out of the XML DOM you now need to do something useful with it. At this point, you are just dealing with data in your application and this has nothing directly to do with XML.

(Note: Throughout this article we will talk about reading, as opposed to creating, an XML document. It turns out that the significant optimizations were made on the read side.)

We are only interested in the performance of Parse (step 2) and Process (step 3) because those are the only steps that are directly related to XML. However, as you interpret the performance results keep in mind that the other steps (1 and 4) will often be the steps that take most of your time. If it takes a very long time to transfer the XML document (because the server or network is slow) or if it takes a very long time to work with the data (because you have to do a lot of processing), the time it takes to parse and process may not be significant.

Our sample XML data is pretty typical of the kind of data you might deal with:
Code:
<changesdatadl>
    <recorddata>
        <fileno>111</fileno> 
        <dmsrecid>10052013</dmsrecid> 
        <saveseq>10244971</saveseq> 
        <mode>1</mode> 
        :
        :  more data elements
        :  
    </recorddata>
    :
    :  more RecordData records
    :
</changesdatadl>
The size of the XML document is determined by how many <recorddata> records are added to the document. The processing code looked like this:
Code:
Get Create (RefClass(cXMLDOMDocument)) to hoXML
Set psDocumentName of hoXML to "SomeName.xml"
Get LoadXmlDocument of hoXML to bOk                    // Parse (step 2)
Get DocumentElement of hoXML to hoRoot                 // Process (step 3) starts here
Get FirstChild of hoRoot to hoRecord                   // first <recorddata> record
While (hoRecord)
    Get FirstChild of hoRecord to hoNodeRecordData     // first data element in the record
    While (hoNodeRecordData)
        Get psText of hoNodeRecordData to sNodeValue   // read the element's value
        Get NextNode of hoNodeRecordData to hoNodeRecordData
    Loop
    Get NextNode of hoRecord to hoRecord               // next <recorddata> record
Loop
Send Destroy of hoXML
The parsing part of this code occurs in one line, "Get LoadXmlDocument of hoXML to bOk" (technically this is both loading and parsing, but the load is from a file, which we assume is fast). The processing part is the rest of the code, where we step through the entire document one <recorddata> record at a time, reading each data element in the record. Because this is just a test, we don't actually do anything with the data we read. We kept this simple so we could measure only the XML processing time. Once again, keep in mind that we don’t work with this data (step 4 above), and often that will be the most time-consuming step.
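If you want to time the parse and process phases separately in your own tests, a rough sketch along these lines will do it. The GetTickCount declaration and the Info_Box reporting are just one way to measure; any millisecond timer will do, and if your workspace already declares GetTickCount you can drop the declaration.
Code:
// Rough timing sketch; GetTickCount and Info_Box are used here only for illustration.
External_Function GetTickCount "GetTickCount" Kernel32.Dll Returns DWord

Handle hoXML hoRoot hoRecord hoNodeRecordData
Boolean bOk
String sNodeValue
Integer iStart iParsed iDone

Move (GetTickCount()) to iStart

Get Create (RefClass(cXMLDOMDocument)) to hoXML
Set psDocumentName of hoXML to "SomeName.xml"
Get LoadXmlDocument of hoXML to bOk              // Parse (step 2)

Move (GetTickCount()) to iParsed

Get DocumentElement of hoXML to hoRoot           // Process (step 3)
Get FirstChild of hoRoot to hoRecord
While (hoRecord)
    Get FirstChild of hoRecord to hoNodeRecordData
    While (hoNodeRecordData)
        Get psText of hoNodeRecordData to sNodeValue
        Get NextNode of hoNodeRecordData to hoNodeRecordData
    Loop
    Get NextNode of hoRecord to hoRecord
Loop
Send Destroy of hoXML

Move (GetTickCount()) to iDone
Send Info_Box ("Parse: " + String(iParsed - iStart) + " ms, Process: " + String(iDone - iParsed) + " ms")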

We tested this on two documents:

Small Test is 2.1 megabytes and contains 1,853 records.
Big Test is 13.1 megabytes and contains 18,530 records.

We started with Small Test. The parse time was 100ms and the process time was 2,387ms.

Relative to processing, parsing is very fast, which is good because there is not much we can do about it. Since most of the time was consumed by processing, we focused our attention there. We noticed that Get NextNode was called many times (the number of records times the number of items in a record). We decided to look at the NextNode method and see if it could be optimized. We found a nice optimization and were able to reduce the 2,387ms down to 1,592ms, a 33% improvement. This seemed like a worthwhile change.

Next we tested this with Big Test. Since it has 10 times more data, we were expecting a 10x increase. That's not what happened, as shown below:
Code:
XML Processing Test 1: NextNode Optimizations

           Parse  Old-Process   New-Process   
Small Test   100        2,387         1,592
Big Test   1,000      233,626       225,434

Time is in milliseconds
Small Test uses a 2.1 meg file with 1,853 records
Big Test uses a 13.1 meg file with 18,530 records
Old-Process is  15.0 Flexml processing
New-Process was a proposed 15.1 Flexml optimization for NextNode
The parse time made perfect sense: the larger the file, the more time it takes to parse, and the increase was linear. Process time, however, was not linear. With the larger file our processing optimization dropped from a 33% improvement to 3%. Something else was slowing this down, and our NextNode optimization seemed a little less important. Unlike the parse time, which increased in a linear fashion, the processing time increased in a geometric fashion: ten times the data took roughly a hundred times as long to process (233,626ms versus 2,387ms). While this was not what we expected, it did explain why some of our developers noted that processing very large XML files was slow. To confirm this, we tried doubling the size of the Big Test XML document. It took so long that we did not even wait for the results.

It turns out that Get psText is the culprit. If we remove the Get psText call from the sample, everything becomes linear. Unfortunately, psText is an internal msxml message, which means there is not much we can do about it – or is there? With a little bit of experimentation we discovered that we could augment psText so that it obtains the element's value using a different mechanism (we get the psNodeValue of the element's first child node). We tried this and things got much better.
Code:
XML Processing Test 2: NextNode and psText Optimizations

           Parse  Old-Process   New-Process   
Small Test   100        2,387           702
Big Test   1,000      233,626         6,474

Time is in Milliseconds
Small Test uses a 2.1 meg file with 1,853 records
Big Test uses a 13.1 meg file with 18,530 records
Old-Process is 15.0 Flexml processing
New-Process is 15.1 Flexml with optimizations for NextNode and psText
The Small Test results are better and the Big Test results are much better. With Big Test the processing went from nearly 4 minutes down to about 6.5 seconds!
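To make the change concrete, here is a rough sketch of the idea (an illustration of the technique, not the actual Flexml package code): instead of asking an element for its psText, you ask its first child – the text node inside the element – for its psNodeValue.
Code:
// Sketch only: reading an element's value through its first child text
// node instead of psText. hoNodeRecordData is an element node such as
// <fileno>111</fileno>.
Handle hoTextNode
String sNodeValue

Get FirstChild of hoNodeRecordData to hoTextNode   // the text node inside the element
If (hoTextNode) Begin
    Get psNodeValue of hoTextNode to sNodeValue    // "111"
End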

So why is psText so slow? We don’t know – you’d have to ask Microsoft. It is actually faster in msxml6, but it is still quite slow. Why hasn’t someone discovered this issue before? The main reason is that it only matters with really large XML files. Even Small Test is a rather large document (2.1 megabytes) and its performance was acceptable. Also, when added to the time it takes to transfer and work with the data, the process time still may not be significant.

What’s Changed in 15.1

In 15.1 we’ve added optimizations to Get psText, Get NextNode and Get AttributeNS. You take advantage of these optimizations simply by using the new classes. Since the client web-service class uses the XML classes, you will get these same optimizations in your web-service objects.

You get these optimizations without changing any code! There must be a catch, and there is. Technically, we did not optimize Get psText; we changed the way it works. The exact change is discussed shortly. We expect that the new behavior will work the way most developers, maybe even all developers, expect it to work, and therefore there will be no compatibility issues. Normally we avoid changing the behavior of an existing message, and we did consider creating a new message (e.g., Get psTextEx). If we had done that, developers would have had to change their code. A bigger problem is that developers might not discover the change and continue to use the older, slower mechanism. We wanted to make sure that everyone benefited from this optimization.

The New psText Behavior

Here is a technical description of what has changed in Get psText. The built-in msxml behavior of psText is designed to either: 1) return the text content of the node, or 2) return the concatenated text representing the node and its descendants. In actual use, psText is mostly used for the first purpose, and the new psText now supports only this behavior. Let’s look at an example using this XML segment:
Code:
<contact>
    <name>Sifu Nick</name>
    <phone>415 555 1213</phone> 
    <email>sifunick@mingfoobakery.com</email> 
</contact>
If you wanted to get the values for <name>, <phone> and <email> you might write the following:
Code:
// assume that hoContact is the <contact> xml node:
Get FirstChild of hoContact to hoData // <name>
Get psText of hoData to sName // "Sifu Nick"

Get NextNode of hoData to hoData // <phone>
Get psText of hoData to sPhone // "415 555 1213"

Get NextNode of hoData to hoData // <email>
Get psText of hoData to sEMail // "sifunick@mingfoobakery.com"

Send Destroy of hoData
This is typically how psText is used. It returns the text content of a node. When used this way the behaviors of psText in 15.0 and 15.1 are identical - except, of course, the 15.1 version is faster.

You can also use psText to return concatenated text representing the node and its descendants. For example:
Code:
// assume that hoContact is the <contact> xml node:
Get psText of hoContact to sData

// sData should be "Sifu Nick415 555 1213sifunick@mingfoobakery.com"
When used this way, psText is not all that useful. What are you going to do with "Sifu Nick415 555 1213sifunick@mingfoobakery.com"? We made the assumption that very few developers are intentionally using psText in this fashion, and in 15.1 this behavior is no longer supported. That is the price paid for making psText faster. If for some strange reason you actually need this behavior, we’ve provided an alternate way to do it: a new function called AllChildNodesText does exactly what psText used to do. We doubt anyone will ever use it. We hope we’ve made the right call here – the beta process will determine that.
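For those who do need it, the call would look something like the sketch below (the exact usage of AllChildNodesText is sketched from the description above):
Code:
// Sketch: getting the old concatenated-text behavior in 15.1 using the
// new AllChildNodesText function (usage assumed from the description above)
Get AllChildNodesText of hoContact to sData
// sData would again be "Sifu Nick415 555 1213sifunick@mingfoobakery.com"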

Summary

To summarize, we’ve made a number of optimizations that will speed up XML processing in Visual DataFlex 15.1. The most important change is in psText where non-linear processing time is now linear. No changes are required on your part unless you are using psText in an unusual fashion. Depending on what else your program is doing, you may see varying performance improvements. The larger the XML document, the more dramatic the difference will be.

Finally, I'd like to thank Andrew Kennard and Anders Ohrt for their persistence in raising this issue and their assistance in testing these changes.

Comments

  1. Focus
    Thank you to you too John for your persistence in resolving the problem for us once we had collectively figured out where all the time was being "lost".

    We hope that other people will see the benefit of these changes too.

    I have to say though, I'm glad that's not my email address in the last example .... I wouldn't like to have to say that after a few too many beers !!

    Cheers!