Sunday 20 March 2011

Achieving Transactions with a Database and File System

:: The Overarching Problem

Recently I was tasked with solving an interesting software design problem. The existing software wrote many files to a file system through a web service, and the requirement was to add a mechanism for storing metadata about specific files. The files still needed to be written to the file system and were not to be modified in any way. Other than that, it was up to me to devise the best way to store the file metadata. Here is a diagram illustrating the basic architecture (the red question mark represents the unknown: how to store the file metadata):



:: Design Approaches

I opted early on to use a database to store the file metadata rather than writing additional metadata files to the file system (which would very quickly become unmanageable). Given the use of a database, the difficult aspect of this problem was how to guarantee that both the file gets written to the file system and its associated metadata is stored in the database table. I could not get into a situation where one of the operations succeeded while the other failed... it was all or nothing. While researching various solutions and talking to other developers, I came up with the three approaches outlined below.

   1. Microsoft SQL Server FILESTREAM Data Type

I won't go into too much detail about the SQL Server FILESTREAM data type since it is explained thoroughly here (and an example here), but essentially it allows structured data to be stored in a database while the associated unstructured data is stored on an NTFS file system. A single SQL operation can then store both the file on the file system and the file's metadata in the database. With regard to the transaction around this operation, here is what I found in the first article:

FILESTREAM is the only solution that provides transactional consistency of structured and unstructured data as well as integrated management, security, low-cost, and excellent streaming performance. This is accomplished by storing the structured data in the database files and the unstructured BLOB data in the file system, while maintaining transactional consistency between the two stores. More details of the FILESTREAM architecture are given in the “Overview of FILESTREAM” section later in this white paper.
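To make this concrete, here is a minimal sketch of the usual ADO.NET pattern (the Documents table, its columns and the connection string are assumptions for illustration, not part of the original system). The metadata row and the file contents are written inside one SQL Server transaction, so they commit or roll back together:

using System;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

class FileStreamSketch
{
    // Assumed schema: Documents(Id UNIQUEIDENTIFIER ROWGUIDCOL PRIMARY KEY,
    //                           FileName NVARCHAR(260), Contents VARBINARY(MAX) FILESTREAM)
    static void StoreDocument(string connectionString, string sourcePath)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (SqlTransaction tran = conn.BeginTransaction())
            {
                // Insert the metadata row with an empty FILESTREAM value, then ask SQL Server
                // for the BLOB's NTFS path and the current transaction context.
                var cmd = new SqlCommand(
                    @"INSERT INTO Documents (Id, FileName, Contents) VALUES (@id, @name, 0x);
                      SELECT Contents.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT()
                      FROM Documents WHERE Id = @id;", conn, tran);
                cmd.Parameters.AddWithValue("@id", Guid.NewGuid());
                cmd.Parameters.AddWithValue("@name", Path.GetFileName(sourcePath));

                string blobPath;
                byte[] txContext;
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    reader.Read();
                    blobPath = reader.GetString(0);
                    txContext = (byte[])reader[1];
                }

                // Stream the file's bytes through the FILESTREAM API; they are committed
                // or rolled back together with the metadata row when the transaction ends.
                using (var dest = new SqlFileStream(blobPath, txContext, FileAccess.Write))
                using (var src = File.OpenRead(sourcePath))
                {
                    src.CopyTo(dest);
                }

                tran.Commit();
            }
        }
    }
}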

I also came across these two issues that may restrict the ability to use this FILESTREAM data type:
  1. SQL Server maintains a link of sorts from table rows to the FILESTREAM files associated with them. This means that deleting or renaming any FILESTREAM files directly through the file system will result in database corruption.
  2. The FILESTREAM data type is only available in SQL Server 2008 and later.
Using the FILESTREAM data type would definitely be a viable solution to explore further, but I continued to investigate alternatives.

   2. Modified File Query Architecture

Given that it's not trivial to guarantee that a file system operation and a SQL operation either succeed or fail together, another approach is to let each operation happen independently, with only the final operation in the sequence "committing/registering" the overall action (in my case, the overall action being that both the file and its metadata are stored successfully, or not at all).

The thought here was to perform the operation that stores the file to the file system first. If this fails, I know the overall action has failed. If it succeeds, I can move on to the next action in the sequence, which is the SQL operation.

For the SQL operation, I would store the file's metadata in a row within a database table. This row would have an identifier for the file stored on the file system (whether its location on the file system or simply its unique filename). At this point, if the SQL operation succeeds, the overall action succeeds. The tricky part comes if the SQL operation fails: I would then be left with a file on the file system that is not referenced by any row in the database table (the beginnings of a data integrity problem). Given that the files stored on the file system are queried by another process, this could very easily lead to a corrupted system where files have no associated metadata.

So what to do if the SQL operation fails? The simplest (but naive) solution would be to delete the file from the file system whenever the SQL operation fails (see the sketch after this list). This has two problems:
  1. What if the deletion of the file fails? You could build in some retry logic, but if the deletion kept failing, the file would sit on the file system forever as an orphan with no associated metadata.
  2. What if the other process queried for the file on the file system before its metadata was inserted into the database? Given that the file is stored first and its metadata second, it is theoretically possible for the other process to find a file on the file system that does not yet have its metadata stored in the database. This could be solved with sufficient error handling (if the metadata doesn't exist, try again until it does), but that would not be a very elegant solution.
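Here is a minimal sketch of that naive sequence (the FileMetadata table, its columns and the connection string are assumptions for illustration); the compensating delete at the end is exactly the part that can fail and leave an orphan behind:

using System;
using System.Data.SqlClient;
using System.IO;
using System.Threading;

static class NaiveFileStore
{
    static void Store(string connectionString, string filePath, byte[] contents)
    {
        File.WriteAllBytes(filePath, contents);               // step 1: write the file first

        try
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "INSERT INTO FileMetadata (FileName, StoredAtUtc) VALUES (@name, @at)", conn);
                cmd.Parameters.AddWithValue("@name", Path.GetFileName(filePath));
                cmd.Parameters.AddWithValue("@at", DateTime.UtcNow);
                cmd.ExecuteNonQuery();                        // step 2: store the metadata
            }
        }
        catch
        {
            // Compensate by deleting the file. This best-effort clean-up can itself fail
            // (problem 1 above), in which case an orphaned file is left on the file system.
            for (int attempt = 0; attempt < 3; attempt++)
            {
                try { File.Delete(filePath); break; }
                catch (IOException) { Thread.Sleep(500); }
            }
            throw;
        }
    }
}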
To use this approach effectively, the other process's query mechanism would need to be updated so that it queried the database table first (to determine what metadata exists) and then only queried the file system for files that have metadata.

The drawbacks of this approach are that the other process needs to be modified (this could be done, but the fewer code changes the better) and that orphaned files with no metadata would still accumulate on the file system. A separate clean-up routine could be built to periodically take care of (i.e. delete) the orphaned files, but again, I'd have to write a whole other application just to perform clean-up... there has to be a better way.

   3. Transactional NTFS & The .NET TransactionScope Class

I knew that the .NET Framework has the ability to wrap SQL operations in a transaction, but that this doesn't work for traditional file system operations. What I wanted to guarantee was that if any of the file operations or SQL operations failed, none of them would be committed. I needed an atomic transaction spanning both the file system write and the SQL insert statement.

I posted this question on Stack Overflow and got a ton of responses. It turns out that since Windows Vista, NTFS has shipped with a component called Transactional NTFS (TxF). According to Wikipedia, Transactional NTFS "brings the concept of atomic transactions to the NTFS file system, allowing Windows application developers to write file output routines that are guaranteed either to succeed completely or to fail completely". I now had a mechanism to enable transactions for both SQL and file system operations.

While researching how to add this transactional support to file system operations within my application, I came across this Transactional NTFS screencast by Jason Olsen. Not only does he highlight the difference between a regular file system operation and a transacted one, but he also dives deeper into the underlying code that makes the transaction possible. As you will notice from the screencast, the underlying code supporting the transactions is very low-level and uses Win32 API functions. I'd prefer to interact with managed code (at the architectural layer where this is being implemented), so I found a more convenient managed wrapper that allows me to use the well-known .NET TransactionScope class (the same class that SQL operations can be wrapped in to form transactions).
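The end result looks roughly like the sketch below. TransactedFile stands in for whichever managed TxF wrapper is used (its WriteAllBytes method is assumed to enlist in the ambient transaction), and the FileMetadata table and connection string are likewise assumptions:

using System;
using System.Data.SqlClient;
using System.IO;
using System.Transactions;

static class TransactionalStore
{
    static void Store(string connectionString, string filePath, byte[] contents)
    {
        using (var scope = new TransactionScope())
        {
            // Transacted NTFS write via the managed TxF wrapper (assumed API); the file
            // is not visible to other readers until the transaction commits.
            TransactedFile.WriteAllBytes(filePath, contents);

            // The SQL connection opened inside the scope enlists in the same ambient transaction.
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "INSERT INTO FileMetadata (FileName) VALUES (@name)", conn);
                cmd.Parameters.AddWithValue("@name", Path.GetFileName(filePath));
                cmd.ExecuteNonQuery();
            }

            // If either operation throws before this point, both are rolled back together.
            scope.Complete();
        }
    }
}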

Note that using Transactional NTFS requires developing and running the application on Windows Vista or above. Additionally, transacted file operations can be performed on network shares but only if their underlying file systems support Transactional NTFS themselves.

:: Conclusion

After evaluating these three approaches, I decided that "Transactional NTFS & The .NET TransactionScope Class" would solve the underlying problem most effectively while minimizing the code base changes needed to get this feature operational.

Saturday 12 March 2011

Continuous Integration: Tooling, Processes & Implicit Communication

Continuous integration (CI) is a process and set of tools that development teams can use to continually manage the software they develop, test and deploy. In this post I will highlight the tooling I have used to harness CI on various .NET projects, discuss some key processes that make CI most effective, and present some of the more subtle ways CI enables implicit communication between team members, which isn't so obvious when you're starting out with the practice.

Hudson is a really good open-source, Java-based web application that can be used as a CI build system on Windows or Unix/Linux. What I find particularly compelling about Hudson is how easy it is to install and get started on a new project. First, download the Hudson WAR from the homepage. Once you have started it from the command line using java -jar hudson.war, almost everything else is done from its GUI-based menus; there is no need to configure any XML or to have an intimate knowledge of its API (unlike CruiseControl, which requires both).


:: Tooling

Redsolo has a pretty complete post on setting up Hudson for a .NET project and getting the unit tests and some static analysis (FxCop) to run. I initially configured my own Hudson instances very much like his post describes, but over time I have altered the setup in the following ways:
  • The use of a source control management (SCM) system is crucial when building a CI infrastructure. I use Subversion (SVN), but you could just as easily use Git if you're more the distributed version control type. It is critical for CI to have a central place where all developers can regularly commit their code and configuration changes for a software project. Applications may work in isolation (when worked on independently), but time and time again I have found that integrating a team's code and configuration changes can be very challenging. CI promotes integrating early and often, and therefore tries, and I believe succeeds, at avoiding the headaches associated with integrating at the end of a project (which, by the way, is very difficult to estimate against, as you have no idea what kinds of bugs are lurking when it comes time to fully integrate).
  • I use NAnt to specify and execute a range of tasks: project compilation, database management, unit and acceptance test execution, static analysis, documentation generation and deployment. Hudson has a NAnt plugin which allows it to use NAnt as a build step. Thus, when Hudson notices that a developer has checked in code, it does a full checkout of the Subversion tree and executes a specific NAnt target to begin compilation or any other task required for that particular project.
  • I generally use MSTest rather than NUnit for unit testing. Both can be integrated with Hudson, but out of the box Hudson only knows how to interpret JUnit and NUnit test results. Because I still want a graph of the test results to show up on the project's main page, I install the MSTest plugin, which converts the test report (a TRX file) into a format that Hudson understands natively (which happens to be JUnit XML).
  • I use FxCop, StyleCop and NCover for static analysis of my projects. FxCop and StyleCop are used more for helping the development team adhere to design and implementation best practices as well as coding standards throughout development, whereas NCover is used more for "after the fact" analysis of the code base (e.g. it highlights production code our team has not tested adequately). I use the Violations plugin to visually capture the number of FxCop and StyleCop violations within the code base after each new build. I also use a combination of a NAnt target (to generate the NCover report) and the HTML publisher plugin to publish the NCover report to Hudson so it is accessible via a link on the main page.
  • I automatically generate Doxygen documentation from the XML comments associated with the classes and methods within my code base. I use a NAnt target along with the HTML publisher plugin to publish the HTML to Hudson so it is accessible via a link on the main page. This way the development team always has a reference to the most up-to-date API documentation for the code base.
  • I find the Task Scanner plugin particularly helpful to keep track of various keywords in the code base that developers may leave in to remind themselves or others of what to fix or attend to later. Any custom tags such as FIXME, TODO, @deprecated, ??? can all be tracked and are shown visually on the main project page.
  • I use the Claim plugin which allows a specific developer to claim a build break and comment on what the problem/solution might be. The "claim" mechanism is sticky which means that if a developer continually breaks the build they will be continually responsible for fixing it. This plugin is great for helping a distributed team communicate implicitly about who is responsible for build breaks (instead of email or instant message communication between the whole team - less communication spam is always a good thing).
  • I also archive specific build artifacts from each CI build that I may want to analyze in the future. That way if I ever want to go back in time and discover where a particular bug entered the code base (and by whom) I can do so. It is much easier to simply download the executables from the CI build server and run them than to checkout a specific revision from source control, compile it and then run it to verify it does or does not have the bug I'm looking for.
Hudson's website also has some best practices to follow for Build Management, Release Management, Deployment Automation, and Test Orchestration.

:: Key Processes

Here are some key processes that I use on teams to harness the benefits of CI:
  • Only one Commit per Build - Finding and solving software bugs that cause a build to break can be like searching for a needle in a haystack. Finding that needle is much easier if your software uses logging (such as log4net), if a team member has deep insight into the parts of the application that tend to cause problems, if a specific set of code and configuration changes can be isolated and labelled as the changes that may have introduced the bug, or if your application has a testing suite that can pinpoint failure areas. Each of these can be harnessed more effectively and precisely if a CI build is executed after each and every developer's commit (i.e. a single CI build does not encompass multiple commits). The reason is that, if a build failure does occur, it is far easier to analyze a single change set from one developer than to analyze multiple change sets from multiple developers. In the latter case it is very hard to determine whose change actually broke the build without all the developers working together, instead of just the single developer whose change broke it. Caveat: it becomes difficult to keep to this requirement as a team grows in size and the build takes longer and longer to execute. For example, if a single build takes an hour to execute, a development team will only be able to run 8-10 CI builds in a given working day. The more developers on a team, the more commits your team will make each day. At some point your team won't be able to keep to this "one commit per build" policy, but it is definitely something to strive for if you can.
  • Start a Build from Scratch - When setting up your CI build, make sure to re-checkout the FULL source controlled tree (do not update the local copy of the tree) and fully rebuild (compile) all executables and libraries from their raw source code. Simply updating your local copy of the tree, or rebuilding only the executables/libraries that are needed, allows artifacts from a previous build to leak into the current build. To eliminate unintended or corrupted build breaks and, more importantly, to truly test how everything integrates from your raw source code, the CI build should start from scratch: delete everything, start from a freshly created directory and then check out your source code from source control. Compilation, testing and deployment can then be performed on the raw source code and not on artifacts still lying around in the directory from a previous build.
  • Multiple Builds - On my projects I usually need multiple builds. I need one build, called CI Build, that runs automatically after a single source control commit by a developer. I then have another build called Nightly Build, which runs each and every night and therefore builds against all the source control commits from the previous day. Then I may have a QA Build that has a different deployment procedure, targeting some QA server for the testers; the QA Build is usually set up to run after a manual step that one of the testers executes. Lastly, I have a Production Build that has a deployment procedure specifically geared towards deploying the application into a production environment and doing all the configuration and tests required to validate that it is stable and functional in production. Of course, the Production Build is very important to get right and should be protected against unauthorized execution (i.e. password protect its usage or give only a select few individuals the permissions to execute it).
  • Build Both the Release and Debug Configurations - When compiling .NET projects in Visual Studio you have the ability to compile in Release or Debug (or other configurations if you have defined your own). I would highly recommend having both a CI Release Build and a CI Debug Build. I have found very subtle differences between the two when it comes to compilation, timing of unit tests, configuration and static analysis results. I've had situations where the debug build passes but the release build does not; it was usually the result of some misconfiguration, but occasionally I catch a really nasty timing issue in a unit or acceptance test. Note that I care about both build configurations because developers routinely develop and test an application compiled in debug mode, whereas the software actually released to production is compiled in release mode (see the short example after this list).
  • Keep the Main Source Control Tree Pristine - Each and every developer works from and modifies the main source control tree. Each modification to the tree affects all other developers, so it is imperative that it be kept clean and at the highest quality. The software that a development team produces is what is in the tree; therefore, whatever quality the tree has, the software has also. This is a slightly more concrete way of thinking about the intangible aspects of software. It is hard to visualize what software really is and what quality it has, but it is much easier to visualize the files checked in to a source control tree and the quality of each of those files. Therefore, keep the files checked in to the source control tree pristine and at the highest quality.
  • Priority Number 1: Fix the Broken Build - When a build breaks during CI it affects the whole team. Developers should be immediately informed that the source control tree is unstable, and they then know NOT to commit any changes or to update their local code to the broken/unstable version. Developers can of course continue to develop their features or resolve their bugs, but eventually they're going to want to commit code changes or update their local versions with everyone else's code changes. With a broken build, developers are forced to work in isolation, which is the very thing CI is trying to avoid. A broken build delays integration and should therefore be avoided at all costs.
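To illustrate the point above about building both configurations, here is a small, assumed example of one common source of divergence: Debug-only constructs are compiled out of Release builds, so tests can pass in one configuration yet behave or time differently in the other.

using System;
using System.Diagnostics;

static class ReleaseVsDebugExample
{
    static int NextId(int currentId)
    {
        // Debug.Assert is marked [Conditional("DEBUG")], so this check disappears in Release builds.
        Debug.Assert(currentId >= 0, "currentId should never be negative");

#if DEBUG
        // Debug-only logging; also absent from the Release build.
        Console.WriteLine("NextId called with {0}", currentId);
#endif

        return currentId + 1;
    }

    static void Main()
    {
        Console.WriteLine(NextId(41)); // prints 42 in both configurations; the checks above run only in Debug
    }
}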

:: Implicit Communication

Some projects can be routinely bogged down by communication overhead between team members, which makes it difficult to be productive (less communication overhead means more time building and improving software). Questions such as: why does this code not compile, who's responsible for this piece of code, what effect will this code change have, what exactly are we testing, what are the exact files I need in order to test this on another machine, are all valid questions developers can ask. But if there is an easier way to communicate the answers than direct developer-to-developer interaction, productivity can be greatly improved. This is particularly useful for development teams distributed by both location and time, but even small teams located in the same office can see marked improvements by reducing communication overhead through CI practices.

Having a single, repeatable process, which is what CI fosters, allows entire development teams to be on the same page. At any point in time you can verify the state of the code base: is it stable, is it broken or is it in transition (i.e. currently building, testing and deploying). If things are stable then I can concentrate on developing features for my software application. If it's broken then I know something is wrong, and the CI build system will most likely help me out by informing me why it's broken or at least point me in the right direction (i.e. which commit by which developer broke the build, did a test fail, did deployment fail, is it a configuration issue or a code change... etc). If it's in transition I know that it's potentially unstable and I should wait until I know more. Following these basic principles, development teams keep in sync by knowing at any point in time whether the lifeblood of their software application (its source code and configuration) is stable or whether there is a problem that needs fixing.

Extending this CI process with the tools mentioned above (static analysis reports, the ability for specific developers to claim broken builds and comment on them, auto-generated API documentation and code base task scanners), a development team has relevant information at its fingertips that helps it communicate without direct face-to-face or email communication. I call this implicit communication. If, as a developer, I can quickly determine who broke the build, whether they have taken responsibility for it and what they have decided to do to fix it, I can quickly go back to the feature I was previously building, or I can get in touch with the person responsible and help them by providing deeper insight into the reason for the failure. Apart from build breaks, I can access the latest API documentation and verify that I understand the contract of a method I am about to use. I can also quickly check the cyclomatic complexity of a method I am refactoring and decide whether it has enough test coverage. All of this can be done from my own desk with no direct communication with another developer. I have found that on projects without this CI infrastructure there is always a "guru" developer who seems to know the most about the project (usually because he's been there the longest) and everyone needs his insight on each and every software-related issue. You can imagine what kind of bottleneck that causes and what might happen if the "guru" leaves the project... chaos.

At the end of the day, this implicit communication helps me build better software, as I have the information I need to design and implement software features and fix bugs. With the reduction in communication overhead that CI tools and processes enable, I am more productive at developing, improving and delivering the software customers pay for.

-------------------------------

:: UPDATE

Recently the team that was building Hudson left the project and forked it into what is now called Jenkins (this was due to some political issues with Oracle over the Hudson name and who had control over it). If you already have a Hudson instance and wish to switch to Jenkins (as most of the team is over there now), this Jenkins wiki article will help you make the switch.