Sunday 20 January 2013

Software is about to eat the NYSE

The renowned VC firm Andreessen Horowitz formulated a hypothesis in 2011 that "software is eating the world". Their hypothesis has since spawned many variations, such as software eating fashion, software feeding the world and, my personal favorite, Jack Dorsey eating payments. Software, however, can have less of an appetite depending on where in the world you do business.

New York City's financial businesses have always been harder for software startups to access, due to a number of factors (of which, I'm sure, culture and incentives play some part). One man, Jeffrey Sprecher, appears to have broken through that divide and was profiled in the New York Times on Sunday, January 20th, 2013. The NYT article is a great story of business success, but embedded within it is the notion that software is currently "eating trading" (that is, the art of human trading within exchanges) and will eventually eat the vast majority of this human activity. An important consequence of all this trading-exchange gastronomy is that wealth gets redistributed more quickly to the victor... in this case, Mr. Sprecher and his firm ICE: IntercontinentalExchange.

After reading the NYT article, I'm reminded that the best software products aren't always enough for success. Technology alone conquers nothing; timing and deep market insight are the other essential pillars which, when combined with technology, create a powerful catalyst for creative destruction. As markets begin to shift, new opportunities arise. Those of us who see these opportunities more clearly (and that's really the hardest part) have the ability to act by finding a software-based solution that fits that opportunity. This software then has the ability to eat just about anything in its path towards profitability.

Thursday 10 January 2013

Avoiding Downtime For Azure Web/Worker Role Instances when the Azure Infrastructure Upgrades the OS

I discovered a problem the other day where a typical web application hosted on two Azure web role instances experienced downtime for approximately 6 minutes. I was initially alerted to the problem by Pingdom, which informed me that every single page across my web application had gone offline. The screenshot below is for one of these pages and indicates two downtime events: the first was the actual problem, and the second was something else.

Pingdom's root-cause analysis of the downtime simply indicated a "Timeout (> 30s)" from multiple locations around the globe. Given that every page monitored by Pingdom showed the same problem, I was pretty sure the entire site had gone down. I quickly logged into the Azure Management Portal during the downtime event to check the status of my web role instances and noticed that one of them (the second of the two) was currently rebooting. I suspected that an Azure OS update had been initiated and that I was watching the second of my web role instances being rebooted after the update. Note that Azure is a PaaS offering, so it automatically handles OS and infrastructure upgrades. This makes it easier to focus on core application development (instead of administering machines), but it comes at a small, unexpected price, which I will explain below.

I wanted to confirm my hypothesis that an OS upgrade had occurred at around the same time, so I contacted Azure support to ask whether one had been initiated by the Azure infrastructure on December 20th around 15:20 PST. This is the reply I got:
Thank you for your patience. The behavior you perceived for the update is correct. One of the instances was brought online after an update and showed when the role was set to “started” it moved to the other machine in the update domain to update the node.

The behavior is by design, we wait for the machine to display the result as role started for the machine in order to start updating the other instance.

The ideal will be to try to lower the startup time for the application. [Unfortunately] this will happen every month for the machines since we just count on the role status to update the role instance in the other domain.
Azure support also sent me a link to documentation on managing multi-instance Windows Azure applications.

The web application I run does take a bit of time to boot up, due to JIT compilation along with New Relic's .NET agent profiler hooking into IIS, but this usually takes several seconds, not minutes, to complete. What seems to be going on is this: although the two web role instances are in different upgrade domains (upgrade domains 1 and 2 respectively), which ensures updates happen on a non-overlapping schedule, in practice the updates can occur immediately one after the other, which makes sense from an infrastructure perspective. And because the Azure infrastructure relies on the status of the role instance itself, NOT your application's own status, it's entirely possible that when a freshly updated web role instance (in upgrade domain 1) appears as Ready to the Azure infrastructure, the web application running on it is still initializing. If it is still initializing when the second web role instance (in upgrade domain 2) is rebooted for the OS upgrade, there is no longer a live instance responding to web requests.
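To make the timing gap concrete, here is a toy timeline in Python. All the durations are invented (they only mirror the shape of the incident, not the actual measurements), but they show how the site downtime ends up being exactly the application's warmup window:

```python
# Toy timeline, in minutes, for the two-instance case. All numbers are
# hypothetical; they only illustrate the shape of the incident.

REBOOT = 4.0   # instance offline while the host OS is patched
WARMUP = 2.0   # JIT + profiler hook-in before the app actually serves traffic

# Upgrade domain 1 reboots at t=0. The fabric marks the role Ready at
# t=REBOOT and immediately starts on upgrade domain 2, even though the app
# on the first instance keeps warming up until t=REBOOT + WARMUP.
ud1_down = (0.0, REBOOT + WARMUP)          # reboot + warmup
ud2_down = (REBOOT, 2 * REBOOT + WARMUP)   # starts the moment UD1 reports Ready

# The site is down only while BOTH instances are inside their outage windows.
overlap_start = max(ud1_down[0], ud2_down[0])
overlap_end = min(ud1_down[1], ud2_down[1])
site_downtime = max(0.0, overlap_end - overlap_start)
print(site_downtime)  # 2.0 -> the entire warmup window, with nothing serving
```

The point of the sketch is that the outage length is governed by the warmup time, not the reboot time, which is why a slow-starting app turns a routine rolling upgrade into visible downtime.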

There were really only three ways around this:

  1. Turn off automatic OS upgrades (but then you have to apply them manually, which defeats the point of a PaaS service).
  2. Figure out exactly what is causing the web role to come alive more slowly than expected, and why, and then spend the engineering resources to reduce this time substantially. Given that we'd never be able to drive that time down to zero, there would always be a small but noticeable wait (probably seconds rather than minutes).
  3. Run three web role instances instead of two. This way, OS upgrades rolled across upgrade domains only ever affect two of the instances at any given time (with a minor overlap, which in my case lasted 1-2 minutes). An additional instance does cost more money, but it's a really easy fix and, depending on the size of the role instance, it might be the cheapest and easiest solution by far.
I chose option 3. In a startup, time is usually more precious than money, so if something can be fixed easily with money instead of time... that's generally a good route to take.
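The arithmetic behind option 3 can be sketched in a few lines of Python. The timings below are invented, but the structure follows how rolling upgrades proceed: each upgrade domain reboots only after the previous one reports Ready, so the only exposure is the warmup overlap, and a third instance covers exactly that window:

```python
# Hypothetical sketch: compare two vs. three instances under a rolling OS
# upgrade. Each upgrade domain starts rebooting the moment the previous one
# reports Ready (after REBOOT minutes), while the app needs WARMUP more
# minutes before it actually serves traffic. Durations are invented.

REBOOT, WARMUP = 4.0, 2.0  # minutes

def outage(start):
    # Window during which an instance is effectively not serving requests.
    return (start, start + REBOOT + WARMUP)

def site_downtime(n_instances):
    # The site is down only while ALL instances sit inside an outage window.
    windows = [outage(i * REBOOT) for i in range(n_instances)]
    start = max(s for s, _ in windows)
    end = min(e for _, e in windows)
    return max(0.0, end - start)

for n in (2, 3):
    print(n, site_downtime(n))  # 2 -> 2.0 minutes down; 3 -> 0.0
```

With two instances the warmup window leaves nothing serving; with three, the third instance is untouched while the first warms up and the second reboots, so the modeled downtime drops to zero.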


Update: Since switching to three web role instances, I have never seen this downtime issue happen again.