View RSS Feed

Development Team Blog

Constraints and Performance

Rate this Entry
Constraints again, who knew there could be so much to say about that. In some comments to an earlier post about constraints, the question about ordering of constraints and the impact on performance came up. John Tuohy then also mentioned in his follow-up article about constraints and properties that "The constraints are split up into multiple internal expressions...", I'll expand on that a little bit here, which will also reveal the answers to the questions raised in the earlier comments.

The Order of Constraints in Source Code
To the question about whether order of constraints matter, the real simple answer is that it doesn't matter, but you should probably specify the constraints in the order you intended them to be evaluated anyway. While the order usually makes no difference for performance, there's no reason to ever specify them out of order, other than to confuse things for anyone reading the code.

The longer answer is that it's quite complicated. The runtime divides constraints into three groups, jump-out, pre-relate and post-relate. There's actually also a fourth pseudo group, jump-in, but it's not really a separate group, its actually a subset of the pre-relate group that also seeds the buffer before the first find, effectively performing a jump-in.

When you perform the first find, all the constraints are examined and "built". At this point the runtime will look at each constraint and divide them all into the three groups, and compile each group into separate expressions. The buffers are seeded with the values that qualify for jump-in given the specified index, so that it effectively jumps to the first candidate record.

The Constraint Groups
The first group, jump-out, is basically anything that matches the index so that when it's no longer satisfied, there cannot be any more records that would satisfy the constraint, and it might as well stop looking. For example, if you have a constraint for Customer_Number LT 20, and you do a Find GT on the Customer_Number index. As long as Customer_Number is less than 20, it's OK, but as soon as it finds a record that is not less than 20, it can stop looking because there will not be any more records matching the criteria. Any constraints that fall into this group are tested first immediately after finding each record. If this group expression evaluates to false, it stops looking for more records.

The second group is pre-relate. Most constraints tend to fall into this group, and can include things like Customer.Email_Address Ne "". If this group expression evaluates to false, it immediately moves on to the next record. This is different from jump-out, which would cause it stop looking for more records. In this case it just stops evaluating the constraints for the current record, finds the next record and goes back to test the jump-out constraints for that record.

The third group then is obviously post-relate. If the pre-relate group expression evaluates to true, the runtime then performs a complete Relate, and then after that it evaluates the post-relate group expression. Most constraints fall into the pre-relate group, and only those constraints that refer to parent tables fall into the post-relate group. So if for example you're finding OrderHea records, and you have a constraint for the Customer table, it would fall into the post-relate group, as that constraint obviously cannot be tested until the runtime has first done a Relate to locate the corresponding Customer record. And as with the pre-relate group, if this group expression evaluates to false, it moves on to the next record. And finally if it evaluates to true, then the record satisfies all constraints and it's a successful Find operation.

OK, now we have all the information about the separate group constraints. If one particular constraint belongs to the jump-out group, and another one belongs to the post-relate group, we can now safely conclude that the order of those two constraints in code doesn't matter. The runtime will always evaluate the jump-out constraints before both the pre-relate and post-relate constraints.

Constrain As
What about Constrain As then? Well, as we learned earlier, they can never belong to the jump-out group(nor can they be used for jump-in), so no matter what order it is in the code, it will always be tested after the jump-out group. But would it belong to the pre-relate or post-relate group? Remember that if a constraint refers to a parent table, it must be post-relate. And since Constrain As is an expression that we know nothing about, it doesn't know if it's referring to a parent table, thus Constrain As always fall into the post-relate group.

This again means that the order of Constrain As in code with respect to any other Constrain that falls into the jump-out or pre-relate group doesn't matter for performance, as the runtime will always evaluate both jump-out and pre-relate constraints before post-relate constraints.

Order doesn't matter...
What about the order of the constraints within each group then? That's where the order in the code finally matters. Two constraints that fall into the same group are evaluated in the order they are defined within the code. If this would make any difference it's so tiny it's usually irrelevant. It's also complicated by the constraints in the DDO structure, where you don't really have full control over the final order within the post-relate group anyway.

Summary
So in summary, if you define the constraints in their natural optimal order, you'll have the best chance of getting it in the most possible optimal order when evaluated. And if you've defined the constraints in an unnatural order, it will probably work out anyway as it doesn't matter most of the time(and when it does, the difference is usually so tiny it's negligible), but it's of course not as obvious when reading the source code.

The trick to good performance is to make sure you have good jump-in and jump-out constraints matching the index. As long as that's the case, the rest(including Constrain As) is usually just noise, and the order of the constraints in code usually makes no difference. Consequently, whenever you see a performance problem with constraints, 99% of the time it's because it was not able to utilize jump-in and jump-out optimizations due to a mismatching index, or there are no pre-relate constraints at all, so it ends up testing all records in the table.

One last thing, what about the magic with the runtime finding the best index to use? In many cases this voodoo never comes into play. The index is usually determined by the DDO or the DEO, with the Ordering property or the main index of the specified column assigned to the DEO. When the index has not otherwise been specified, the runtime automatically tries to find the best index matching most of the columns used in the constraints. Typically in this case the best index is glaringly obvious, such as with a relates constraint where there's one index that exactly matches the related columns. The exact details for picking the best matching index is quite complicated, and it even examines things such as the length of the respective columns etc.

This is all based on the best matching index structure, the runtime of course cannot examine the actual data in the table to determine which index would happen to be the most effective index given the current data. Obviously doing so correctly would require finding all the matching records using each index and then see which one was the most effective. While such techniques could potentially be utilized by SQL server backends, caching the results for more optimal finding later(I don't know whether any SQL servers actually do something like that), the VDF runtime would not have the information or means to do so. If you really have multiple indexes that structurally are equally(or extremely closely) matching the constraints, then the best solution is to specify the index in code to remove the ambiguity, or you may be in for a surprise when making minor seemingly innocent structural changes which suddenly tips the scale one way or another. Basically, if relying on the auto-index mechanism, make sure it's an easy obvious choice for the runtime (such as with primary indexes for relationships etc.), in all other cases it's best to specify the index to avoid any surprises later on.

Finally, whenever in doubt, there's an excellent debugging tool available, the Data Dictionary Inspector, a.k.a. dd_debug.dg. To quote the documentation "The DataDictionary Inspector (DD_Debug.dg) is a small package utility that can be used to aid in the debugging of Data Dictionaries. It was created to make it easier to inspect and change the state of any DataDictionary object and to test the efficiency of your find constraints." This dialog keeps track of how many records were tested and how many of those records were found (constraint valid). You can use this to test your own theories about constraint optimization.
Categories
Uncategorized

Comments

  1. Jake Moffatt's Avatar
    Great writeup Sonny. This blog is great - these are the kinds of posts that help to clarify how the language works to a newcomer. Great job!
  2. Mk@p3rfect's Avatar
    Many thanks Sonny, a great insite into the inner workings. Exactly the sort of stuff we need at any level.