What's a crash, anyway?
1-Dec-2009 at 01:02 PM
A program crashing is something we're all painfully familiar with, but what does it really mean? Crash can mean different things to different people, and sometimes people even use the term for different types of problems at different times. It's often used loosely to mean anything from "the operating system rudely terminated my program", to "an error message was displayed (fatal or non-fatal)", to "the program stopped responding to user input", or even the mundane and generic "the program isn't working correctly". I have learned the hard way never to assume that other people mean the same thing I do. Nowadays I always ask for the exact error message or a screenshot; it's much easier than trying to make everyone else agree with my use of the term crash.
I will concentrate on "the operating system rudely terminated my program" as the meaning of crash, because it's the type of crash that is often the most mysterious. Why does the program crash sometimes, what does it mean, and why can't you just ignore it so the program can continue?
Different versions of Windows typically alert the user to a crash in different ways, from the famous GPF (General Protection Fault) to Invalid Page Fault and the more modern variations of "The program has performed an invalid operation and must be terminated". In most cases a situation has occurred where the CPU is unable to execute the current instruction, and therefore simply cannot continue running the program. This sounds very technical, but it usually means that the current instruction refers to a memory address that is not available. Nothing can be done to correct the problem, so the operating system has no choice but to terminate the program.
With Visual DataFlex programs, the crash most often occurs somewhere in vdfvm15.dll or a similar module. That doesn't necessarily mean there's something wrong with the VDF runtime; vdfvm15.dll executes code on behalf of your VDF program, so it's simply the most likely module for the crash to occur in. The underlying cause is usually something related to the VDF code executed around the time of the crash.
Buffer Overrun/Memory Overwrite and Corrupt Pointers
The majority of all crashes are caused by either a corrupt/dangling pointer or a buffer overrun. If a pointer contains a memory address that's not available and the CPU tries to dereference that pointer, the program will crash.
A buffer overrun occurs when the pointer references a valid memory address, but the program reads or writes more data than the memory buffer holds (indirectly this may turn into the same problem, pointing to a memory address that's not available). A buffer overrun doesn't always cause a crash; it could, for example, just overwrite other data stored after the specified buffer (which of course is bad). Such overwrites are in fact the most common cause of corrupt pointers: an earlier buffer overrun overwrites a pointer variable stored in memory after the buffer, and when the CPU later tries to dereference that pointer it crashes, because the pointer now contains garbage.
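To make the silent overwrite concrete, here is a minimal sketch in C rather than VDF. Every name in it (struct arena, arena_init, careless_write, neighbor_intact) is made up for the illustration. To keep the example itself well defined, the "buffer" and its "neighbor" are two regions of one array, so the overrun never actually leaves memory the program owns. A real overrun gives no such guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One array stands in for a stretch of process memory. The first
   BUF_SIZE bytes play the role of "the buffer"; the bytes right after
   it play the role of an unrelated variable stored next to it. */
enum { BUF_SIZE = 8, NEIGHBOR_SIZE = 4 };

struct arena {
    unsigned char bytes[BUF_SIZE + NEIGHBOR_SIZE];
};

/* Puts known values in place: zeros in the buffer, 0xAA in the neighbor. */
void arena_init(struct arena *a)
{
    memset(a->bytes, 0x00, BUF_SIZE);
    memset(a->bytes + BUF_SIZE, 0xAA, NEIGHBOR_SIZE);
}

/* Simulates a careless copy: writes len bytes starting at the buffer
   without comparing len against BUF_SIZE. */
void careless_write(struct arena *a, unsigned char fill, size_t len)
{
    assert(len <= sizeof a->bytes);  /* keeps the demo itself in bounds */
    memset(a->bytes, fill, len);
}

/* Returns 1 if the neighbor still holds its original 0xAA bytes, else 0. */
int neighbor_intact(const struct arena *a)
{
    for (size_t i = 0; i < NEIGHBOR_SIZE; i++) {
        if (a->bytes[BUF_SIZE + i] != 0xAA)
            return 0;
    }
    return 1;
}
```

After arena_init, a call to careless_write with a length of 8 leaves the neighbor intact, while a length of 10 changes the neighbor's first two bytes. Nothing crashes at that moment, which is exactly why this kind of bug tends to surface later and somewhere else.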
If you think this all sounds very low level, it's because it is low level. Most crashes occur due to fiddling around with pointers and copying memory buffers in VDF code, often using the Win32 API directly. Using advanced features such as External_Function and pointers in VDF can be very powerful, but it also comes with greater responsibility. Another common cause is manipulation of very large strings without adjusting argument_size, which as you might guess can cause buffer overruns. There are lots of other reasons for crashes, but these are the most common problems related to VDF programs.
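The defensive habit that goes with this responsibility is to compare the length against the destination's capacity before every copy. Below is a minimal sketch of that idea in C. The helper copy_checked is hypothetical, not part of VDF or the Win32 API, and the same check applies in whatever language does the copying.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies len bytes from src into dst only if they
   fit in dst_capacity. Returns 0 on success, -1 if the copy was refused. */
int copy_checked(void *dst, size_t dst_capacity, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len > dst_capacity)
        return -1;              /* would overrun: refuse instead of writing */
    memcpy(dst, src, len);
    return 0;
}
```

With a 4-byte destination, copying the 4 bytes of "abc" (including the terminating zero) succeeds, while an 8-byte copy is refused and leaves the destination untouched. Returning an error forces the caller to decide what to do, instead of letting the damage spread silently.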
Crashes and the Debugger
The debugger is your best friend in tracking down and fixing these crashes. It can be very tempting to dismiss crashes when debugging, especially if you don't see the crash in normal operation. Unfortunately, that usually comes back to bite you. Crashes are symptoms of problems with the program. And while the presence of a crash is proof of the presence of a problem, the absence of a crash is not proof of the absence of a problem. As I mentioned above, a buffer overrun does not necessarily cause a crash by itself, but the damage is done: other variables are now corrupt, and it's a ticking time bomb.
The debugger can sometimes bump things around in memory, and add additional barriers to detect common problems, making buffer overruns more likely to crash while debugging. That's a good thing, because it's a lot easier to find the buffer overrun if the crash occurs around that code, rather than much later in some unrelated piece of code. If your program crashes in the debugger, don't dismiss it, because it's most likely real. Finding the problem now can save you many hours of debugging later.
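One way to picture such a barrier is a block of guard bytes placed directly after a buffer and checked later. The sketch below, in C, is only an illustration of the general idea. It is not how the VDF debugger works internally, and the names guarded_alloc and guard_intact, as well as the guard value, are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { GUARD_SIZE = 8, GUARD_BYTE = 0xFD };  /* arbitrary illustrative values */

/* Allocates size usable bytes followed by GUARD_SIZE guard bytes.
   Returns NULL if the allocation fails. The caller releases it with free(). */
unsigned char *guarded_alloc(size_t size)
{
    unsigned char *p = malloc(size + GUARD_SIZE);
    if (p == NULL)
        return NULL;
    memset(p, 0, size);                        /* usable area */
    memset(p + size, GUARD_BYTE, GUARD_SIZE);  /* barrier after it */
    return p;
}

/* Returns 1 if the guard bytes after the usable area are untouched, else 0. */
int guard_intact(const unsigned char *p, size_t size)
{
    assert(p != NULL);
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (p[size + i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

Writing exactly 16 bytes into a 16-byte guarded buffer leaves the guard intact. Writing 17 bytes changes the first guard byte, and the next call to guard_intact reports it. The overrun is caught close to the code that caused it, rather than much later in unrelated code.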
When the program crashes while running in the debugger, the debugger notifies you of the crash and then puts you right at the line of code causing the crash. You can then inspect variables and the callstack to get a better understanding of why and how it crashed. Remember that you can click on the various lines in the callstack to go to the line of code for each method and to view the respective local variables. Finding the cause of the crash can be a challenge, so I've listed a few things to think about when trying to understand the crash.
- Check your Watches window in the debugger. Some expressions execute in the context of the program and can cause side effects, changing the state of the program. It's not unusual for such watch expressions to cause a crash directly, or to alter the program state in a way that later causes a crash. Clear the Watches window to see if it makes any difference.
- Use the callstack to see how your program got to the point of the crash, and examine variables to see if something doesn't seem quite right or out of the ordinary. Often the real bug could have occurred earlier and propagated through many layers in the callstack before it finally crashed. For example, given a pointer to a buffer and length, the length could have been incorrectly calculated and then passed on through many layers before finally being applied in such a way that would crash.
- If the callstack is very long, and perhaps displays a repetitive pattern of methods, you should suspect a recursion problem. Recursion problems can easily occur when you augment an event or method in a way that, perhaps indirectly, causes the event to occur again. Focus changes, for example, can be very tricky in this respect.
- Consider other seemingly unrelated actions that have been taken before the crash. A buffer overflow/memory overwrite could have occurred somewhere else, and only manifested itself in a crash now.
- Looking for the cause can be overwhelming, so a divide and conquer approach is often the best. Start with the most likely possibilities, which are any specific and direct manipulation of pointers and memory allocations, External_Function etc. Consider any third party controls/libraries you're using, and if possible try to exclude them from your test to see if it changes the behavior and removes the crash. Statistically speaking, code that uses advanced pointer and memory manipulations, large strings, and other esoteric techniques is more likely to be the culprit. Similarly, code that is very common, used in many different types of programs and by many different developers in varying situations, is less likely to be the culprit. Knowing this will help you prioritize your testing and understand how to apply a divide and conquer technique.
- Keep in mind that the cause of the crash may not necessarily be directly in your code. It could be indirectly related to an unusual, esoteric use of something that's otherwise very commonly used (just not in that particular way). Perhaps almost all VDF developers are using the Foo technique of DDs, but it would be very unusual to use it in combination with the Bar technique while at the same time activating a view or something. An unusual combination of different techniques is suspicious because the techniques perhaps haven't been tested together, or perhaps have conflicting internal designs that are not directly apparent.
- Another thing to keep in mind is that crashes are very rarely random (unless you're exposed to multi-threading). In fact, computers are notoriously precise, and repeating the exact same steps is bound to cause the exact same behavior. The problem is finding out the exact steps required. Crashes are often dismissed as a fluke or random act, but deep down we know that's not the case; we just haven't isolated the exact steps required yet. Sometimes a crash is dismissed because it doesn't appear to be a problem all of the time (the exact steps required may be so convoluted that it's not something the user is expected to do anyway), or there's a simple workaround, and we may have more important problems to fix. That's OK, you always have to put things in perspective.
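The recursion problem from the list above, where a handler indirectly raises its own event again, can be sketched in C as follows. This is not VDF code, and handler_state, move_focus and on_focus_change are hypothetical names. It shows one common remedy, a re-entrancy flag, under the assumption that it is acceptable to ignore the nested event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one event handler. */
struct handler_state {
    int in_handler;  /* nonzero while the handler is already running */
    int calls;       /* how many times the handler body actually ran */
    int blocked;     /* nested calls that were refused by the guard */
};

void on_focus_change(struct handler_state *s);

/* Stand-in for an action that indirectly raises the same event again. */
static void move_focus(struct handler_state *s)
{
    on_focus_change(s);
}

void on_focus_change(struct handler_state *s)
{
    assert(s != NULL);
    if (s->in_handler) {   /* re-entered through move_focus: stop here */
        s->blocked++;
        return;
    }
    s->in_handler = 1;
    s->calls++;
    move_focus(s);         /* without the guard this would recurse until the stack is exhausted */
    s->in_handler = 0;
}
```

Calling on_focus_change once on a zero-initialized handler_state runs the body once and blocks one nested call. Without the in_handler check, the callstack would show the repeating on_focus_change / move_focus pattern described above.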
Finding the cause of a crash can be very time consuming and difficult. Effective use and a thorough understanding of all the tools at your disposal, such as the callstack, as well as a systematic application of divide and conquer techniques to track down the problem, can make a huge difference. And of course, patience. It can be very challenging, but nobody ever said software development is easy.