Wednesday, June 4, 2008

The Day of Bugs

Ok, I can honestly say that today was one of the weirder days that I've had in a long while.  I don't know about others, but I can say with confidence that I've never personally identified a bug in Visual Studio in my career.  I've seen plenty of them mentioned by other folks, I've seen 'features' that I'd be inclined to call a bug (but could be interpreted either way), but I've never really found a bug myself.

Today, I found two.  I guess it's a case of 'when it rains, it pours'.  One of them was known long before I 'found' it, but obviously not known to me.  The other, I'm pretty confident, is still unknown to 'everyone'.

Bug #1 - Dynamic Version vs. BAML.

Ok...  So we've done a fair bit of playing with WPF on my project, and we've done some custom control development (user controls) in WPF for use in our application.  WPF is very convenient for being able to prototype and design the UX/UI of something without being bogged down by all the crap you have to do to customize WinForms (our users seem to never like the 'way it is').  I can say that I feel pretty comfortable that my skills with WPF, while not the best on the block, are probably up there along with most of the folks currently doing WPF development.  I've done a ton of data binding work, and feel pretty confident that I know most of the tricks there - especially thanks to the wonderful work of Bea Costa!

So, I was very puzzled when one of our developers started having issues with running our application's forms that use one of my user controls.  The user control was pretty simple - it was a list of names that had alternating highlights for the rows (one row was white, one was gray, etc.).  It did a few other things, but mostly that was the gist of it.  This is one of the simpler controls we have.  Anyway, the weird thing was that the 'bug' that we kept seeing only appeared when running the debug build of our application, and it only appeared when our UI was being used from the APL application.  It never appeared in release builds, and it never appeared when running debug mode in our UI test harness.

So, naturally, I looked first at the APL runtime, thinking it was a bad install on this dev's machine.  We then took his build and his APL workspace and ran it on my machine.  To my surprise, it crashed on my machine too.  Then, we tried running one of my builds on his machine - it worked (also to my surprise!).  So then, I concluded it was a problem with his machine.

Two days later, after he had gotten some other work done and managed to uninstall all of .NET 2.0 through 3.5, VS2005 and VS2008, and then reinstall all of them (carefully in order), he tried it again.  BOOM!  It still didn't work.  I brought him my old laptop, and I had IT set it up for him to be able to use it instead of his desktop, thinking we'd be rebuilding his desktop from scratch.  All the while, I was still puzzled by the fact that the behavior moved around between machines and environments and was so skittish.

Later that day, he came over to my desk and told me the problem started appearing on his release builds too.  I thought, "oh great - a viral bug!".  He then said that the problem also started appearing on my builds.  At this point, I thought - "ok, there's gotta be something else going on here".

The bad part about this bug was this - whenever you ran the application, it would look like it wanted to pop up an error dialog.  In fact, it would briefly show the thread exception dialog (System.Windows.Forms.ThreadExceptionDialog), actually several of them on top of each other, but then the application would disappear before you could do anything.  Looking back, the problem was apparently on one of WPF's "special" threads, and APL doesn't react very nicely when threads other than the main UI thread throw exceptions in the .NET AppDomain.

Finally recognizing that I might be able to do something about this, I went into the code and added an exception handler with a plain-old message box in it (e.ToString()).  Looking at the exception text, I saw that it said something about a XAML parse error and that my ValueConverter couldn't be loaded (I had a ValueConverter as a static resource in my XAML for getting the backcolor brush for doing the highlighting).  This error message pointed me to: Rob Relyea's blog post (along with several MSDN forums posts).  The only thing was that his post didn't apply completely to my issue.  But, the workarounds did.  It turned out that I was using AssemblyVersion("1.0.*") in my files (which I really like for our 'in dev' work), but it was causing problems.  It seems that the reason the bug was so 'fleeting' was that there must have been a timing issue on the fourth part of the version number (the revision), since it's based on a timestamp.
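For anyone wondering why a wildcard version is a moving target: as far as I know, with "1.0.*" the compiler fills in the build number with the number of days since January 1, 2000 and the revision with the number of seconds since local midnight divided by two.  Here's a rough sketch of that arithmetic in C++.  The names (AutoVersion, expandWildcard, daysSince2000) are made up for illustration; this is not the compiler's actual code, just the documented rule written out.

```cpp
// Hypothetical illustration of how a "1.0.*" wildcard gets expanded.
struct AutoVersion {
    int build;     // days since 2000-01-01 (local time)
    int revision;  // seconds since local midnight, divided by 2
};

static bool isLeap(int y) {
    return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
}

// Counts whole days from 2000-01-01 up to the given date (year >= 2000).
static int daysSince2000(int year, int month, int day) {
    static const int daysInMonth[] = {31, 28, 31, 30, 31, 30,
                                      31, 31, 30, 31, 30, 31};
    int days = 0;
    for (int y = 2000; y < year; ++y) {
        days += isLeap(y) ? 366 : 365;
    }
    for (int m = 1; m < month; ++m) {
        days += daysInMonth[m - 1];
        if (m == 2 && isLeap(year)) {
            ++days;  // February 29th
        }
    }
    return days + (day - 1);
}

// Build and revision for a compile that happens at the given local time.
static AutoVersion expandWildcard(int year, int month, int day,
                                  int hour, int minute, int second) {
    AutoVersion v;
    v.build = daysSince2000(year, month, day);
    v.revision = (hour * 3600 + minute * 60 + second) / 2;
    return v;
}
```

Two compiles a couple of seconds apart, e.g. expandWildcard(2008, 6, 4, 14, 30, 0) and expandWildcard(2008, 6, 4, 14, 30, 2), end up with different revisions, so anything that recorded the earlier version number no longer matches the assembly.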

Apparently, my computer is too fast (most of the time), so I didn't see this bug on my builds, just on my colleague's!  As I said, this bug has been known about for a long time, and while I'm not thrilled with the workaround, it's there and working, so I'll live with it.

BTW, I spent nearly 4 days chasing this bug, off and on (my colleague did most of the legwork).  It wasn't much fun, being that we couldn't get an error message for the first 3 of those.

Bug #2 - VS2008 STL vs. NUnit, C++/CLI, and ME!

So, I spent the last 4 days chasing what I thought must have been a bug in our code, only to figure out that I'm pretty sure it's a bug in MS's STL implementation (for Debug builds).  However, this bug is really hard to find (though it, at least, is VERY consistent - i.e. happens every time and is very reproducible).

First, some background on our design.  Our application consists of two major pieces, a UI, and a backend calculation engine.  The UI is written in .NET 3.0/3.5 (C#), and the backend is written in C++ (native).  However, these systems need to be able to share a common file format.  To meet this need, we developed a generic file library in native C++ code that can be used to read/write files for our system.  The files are basically like MS's structured storage, but with the features and interfaces that we desired for our system (along with a design oriented towards meeting our required performance characteristics).

All of our file structures are built upon this file library, along with several other native libraries that support it and some of the other 'shared' features.  We also have a generic 'key' library (this is a domain concept for us - you can think of it as a property bag with some special 'matching' features).

In order to support using these libraries from both C++ and C# code, we decided we'd write the implementations in native C++ using traditional object-oriented design principles (many of which were lifted from my C#/Java experiences), and then write a thin C++/CLI (the managed C++ language) wrapper around this library using the IJW (it just works) interop supported by C++/CLI.  Then, our C# clients would call into the C++/CLI managed library (not even knowing that it's implemented in C++) just as they would call into managed C# code, but be able to use the underlying data structures and implementation of the native libraries.  I think this design is pretty elegant, and we solved quite a few interesting issues when developing it.  We've been using it now for some time and it's working quite well.
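To give a feel for the shape of that wrapper, here's a minimal sketch in plain C++ (not C++/CLI, and not our real code).  NativeKey and KeyWrapper are hypothetical names.  In actual C++/CLI the wrapper would be a 'ref class' holding a native pointer, where ~KeyWrapper() is what C# sees as Dispose and !KeyWrapper() is the finalizer; the release() method below stands in for that shared cleanup.

```cpp
#include <string>
#include <utility>

// Hypothetical native class standing in for a real implementation type.
class NativeKey {
public:
    explicit NativeKey(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Plain C++ stand-in for the thin wrapper that owns the native object.
class KeyWrapper {
public:
    explicit KeyWrapper(const std::string& name)
        : native_(new NativeKey(name)) {}

    // The wrapper is the sole owner, so copying is disabled.
    KeyWrapper(const KeyWrapper&) = delete;
    KeyWrapper& operator=(const KeyWrapper&) = delete;

    ~KeyWrapper() { release(); }

    // Cleanup that is safe to call more than once, because both the
    // deterministic path and the fallback path may end up here.
    void release() {
        delete native_;
        native_ = nullptr;
    }

    bool released() const { return native_ == nullptr; }

    // Forwarding call: the wrapper adds no logic of its own.
    std::string name() const {
        return native_ ? native_->name() : std::string();
    }

private:
    NativeKey* native_;
};
```

The point of keeping the wrapper this thin is that all the real behavior lives in the native class, so the C++ backend and the C# UI exercise exactly the same implementation.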

So...  now, the bug.  Just recently we needed to add a new feature to the 'key' library.  This library is very simple and the feature was also quite simple.  I added it to the native code, added it in the managed C++/CLI library, and added the C# unit test code to exercise it.  By convention, we only run our unit tests in Release mode, unless we're debugging them, since we only want to take the time to test in one build environment and it makes most sense to test the bits that are going out the door...

Anyway, so I tested in release mode, the tests all passed, and I checked in.  I was happy and I went home for the day.  The next day, I happened to be compiling in Debug mode and I decided to run what I was working on.  I had forgotten that my startup project in VS2008 was set to the unit tests for the 'key' library, so the unit tests ran instead of what I was intending to test.  To my surprise, the unit tests for the 'key' library 'exploded' (they didn't fail, they caused an AccessViolationException that was caught by VS2008 and popped up in the debugger!).  The AV was showing up on the destructor call for one of our unmanaged C++ objects (native library).

To make matters worse, the bug only showed up when the finalizer was called for the class, and even weirder, only showed up when the variables were not deterministically destroyed (i.e. when they were not cleaned up via IDisposable and 'using').  Since my unit test code wasn't using 'using' anywhere, I saw the 'explosion'.  But, I saw it only when the finalizer was called (much later than the offending code, obviously).  I invested some time writing code to track which object was the culprit, and after figuring out which (using 'value numbering', a technique I use a lot in single-threaded debugging of applications with lots of object instances and no unique identifiers in them), I followed the code.  It was not at all obvious why there was a problem.  In fact, I looked at it for several days and couldn't figure out what the problem was.
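In case 'value numbering' isn't familiar, the idea is roughly this: give every instance a sequence number when it's constructed and log it, so that when something blows up in a destructor you know which instance it was and can set a conditional breakpoint on that number in the next run.  A minimal sketch, assuming single-threaded code (TrackedKey is a hypothetical class name, not one of ours):

```cpp
#include <cstdio>

// Hypothetical class showing instance numbering for debugging.
// Single-threaded only: the counter is not synchronized.
class TrackedKey {
public:
    TrackedKey() : id_(nextId()) { log("created"); }

    // A copy is a new instance, so it gets its own number.
    TrackedKey(const TrackedKey&) : id_(nextId()) { log("created by copy"); }

    // Assignment keeps the target's own number.
    TrackedKey& operator=(const TrackedKey&) { return *this; }

    ~TrackedKey() { log("destroyed"); }

    long id() const { return id_; }

private:
    // Hands out 1, 2, 3, ... in construction order.
    static long nextId() {
        static long counter = 0;
        return ++counter;
    }

    void log(const char* what) const {
        std::fprintf(stderr, "TrackedKey #%ld %s\n", id_, what);
    }

    long id_;
};
```

As long as the run is repeatable, the same object gets the same number every time, which is what makes it possible to break on "instance #N" before the crash instead of after it.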

I then posted here, and called my Arch. Evangelist at MS and talked with him about it, and still couldn't figure out what it was (the spoiler is the last post in the thread).  I finally had to resort to the 'commenting' technique to track down the bug.  I first commented out all unit tests and started adding them back in one by one.  Once I found which unit test failed, I commented out the entire test and started adding code back in block by block.  Once I found the offending block, I looked down into the C++ code and STILL couldn't find anything wrong with it.

At that point, I decided I needed to think outside the box.  I looked at what was different between the various method calls that worked and didn't work, and decided that the throwing of the exception might be a problem.  First, I removed all the code in the offending method.  Now, my test failed, but it didn't explode.  So, I put the code back, and looked at the exception more thoroughly - I decided to move it to the head of the function.  That also caused my test to fail, but it didn't explode.  So then I concentrated on the code before the exception (in the original function).  I kept saying to myself - there's NOTHING wrong with this code!  (you can see the code in the MSDN post).  Finally, I thought - "it has to be this code, so let's assume it's broken and figure out when and why".  I then restructured the code (as described in the MSDN post) and determined that it was absolutely something in the 'begin()' or 'end()' STL vector calls.  There's no way I messed that up - that's their code.  Voilà - bug #2.

For this one, I'm going to have to figure out how to submit it to MS.  I'm sure nobody's seen this one (at least as far as I can tell from searching).

Whew!  Looking forward (hopefully) to not having very many more of those days!
