Showing posts with label Microsoft. Show all posts

Thursday, 10 January 2013

Avoiding Downtime For Azure Web/Worker Role Instances when the Azure Infrastructure Upgrades the OS

I discovered a problem the other day where a typical web application hosted on 2 Azure web roles experienced downtime for approximately 6 minutes. I was initially alerted to the problem by PingDom, which informed me that every single page across my web application had gone offline. The screenshot below is for one of these pages and it indicates 2 downtime events. The first was the actual problem and the second was something else:


PingDom's root cause analysis of the downtime simply indicated a "Timeout (> 30s)" from multiple locations around the globe. Given that every page monitored by PingDom indicated the same problem, I was pretty sure the entire site had gone down. I quickly logged into the Azure Management Portal during the downtime event to observe the status of my web role instances and noticed that one of them (actually the 2nd of two instances) was rebooting. I immediately got the idea that an Azure OS update must have been initiated and that I was observing the second of my web role instances being rebooted after an update. Note that Azure is a PaaS service, so it automatically handles OS and infrastructure upgrades. This makes it easier to focus on whatever core application development you are doing (instead of administering machines), but it does come at a small unexpected price, which I will explain further.

I wanted to confirm my hypothesis that the OS upgrade had occurred around the same time. Thus I got in touch with Azure support to find out if in fact an OS upgrade had been initiated by the Azure infrastructure on December 20th around 15:20 PST. This is the reply I got:
Thank you for your patience. The behavior your perceived for the update is correct. One of the instances was brought online after an update and showed when the role was set to “started” it moved to the other machine in the update domain to update the node.

The behavior is by design, we wait for the machine to display the result as role started for the machine in order to start updating the other instance.

The ideal will be to try to lower the startup time for the application. [Unfortunately] this will happen every month for the machines since we just count on the role status to update the role instance in the other  domain.
I was also sent the following link by Azure support which talks about managing multi-instance Windows Azure applications.

The web application I have running does take a bit of time to boot up due to JIT compilation along with New Relic's .NET agent profile hook-in to IIS, but this usually takes several seconds, not minutes, to complete. What seems to be going on is this: although the 2 web role instances are in different upgrade domains (upgrade domain 1 and 2 respectively), which causes any updates to happen on a non-overlapping schedule, in reality the updates can occur immediately one after the other, which makes sense from an infrastructure perspective. And because the Azure infrastructure relies on the status of the role instance itself and NOT your own application's status, it's entirely possible that when a web role instance that was just updated (in upgrade domain 1) appears as Ready to the Azure infrastructure, the web application that runs on the instance is still initializing. If it's still initializing when the second web role instance (in upgrade domain 2) is rebooted for the OS upgrade, there is no longer a live web role responding to web requests.
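To make the timing concrete, here is a minimal sketch in Python of the two-instance case described above. It is only a model: the reboot and warm-up durations, and all the names in the code, are hypothetical and are not measurements from the incident. It assumes the second upgrade domain starts the moment the first role reports Ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Half-open interval [start, end) in seconds during which an instance serves no traffic."""
    start: float
    end: float


def unavailable_window(upgrade_start: float, reboot_s: float, warmup_s: float) -> Window:
    # The instance stops serving when its upgrade starts and only serves
    # again once the reboot AND the application warm-up have both finished.
    return Window(upgrade_start, upgrade_start + reboot_s + warmup_s)


def overlap_seconds(a: Window, b: Window) -> float:
    # Length of the intersection of two half-open intervals (0 if disjoint).
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


# Hypothetical timings, chosen only for illustration.
REBOOT_S = 300.0  # time until the role status reports Ready
WARMUP_S = 90.0   # extra time until the application actually answers requests

first = unavailable_window(0.0, REBOOT_S, WARMUP_S)
# Assumption: the infrastructure moves on as soon as the first role reports
# Ready (after REBOOT_S), without waiting for the application warm-up.
second = unavailable_window(REBOOT_S, REBOOT_S, WARMUP_S)

total_outage = overlap_seconds(first, second)
print(f"Both instances unavailable for {total_outage:.0f} s")
```

Under these assumptions the site-wide outage equals the application warm-up time, because that is exactly the period in which the first instance is Ready but not yet serving while the second is already rebooting.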

There were really only 3 ways around this:

  1. Turn off automatic upgrades of the OS (but who wants to do that manually, given that web roles are a PaaS service after all?).
  2. Figure out exactly what is causing the web role to come alive more slowly than expected, and why, and then spend the engineering resources to reduce this time substantially. Given that we'd never be able to drive that time down to zero, there would always be a small but noticeable wait time (probably in seconds rather than minutes).
  3. Have 3 web roles instead of 2. This way OS upgrades that are rolled across upgrade domains will only ever affect 2 of the web roles at any given time (with a minor overlap, which in my case was 1-2 minutes). It does cost more money to have an additional web role, but it's a really easy fix, and depending on the size of the role instance it might be the cheapest and easiest solution by far.
I chose option 3. In a startup, time is usually more precious than money, so if something can be fixed easily with money instead of time... that's generally a good route to take.
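As a rough check on option 3, the sketch below counts how many seconds nobody is serving during a rolling upgrade for different instance counts. The model is an assumption on my part, not documented Azure behaviour: one instance per upgrade domain, domains upgraded in order, and the next domain starting as soon as the previous role reports Ready. The function name and the timings are hypothetical.

```python
def outage_seconds(instances: int, reboot_s: int, warmup_s: int) -> int:
    """Count the seconds of a rolling upgrade in which no instance is serving.

    Assumed model: instance i starts its upgrade at i * reboot_s (as soon as
    the previous role reports Ready) and serves again only after its own
    reboot plus application warm-up.
    """
    # Unavailability window [start, end) for each instance.
    windows = [
        (i * reboot_s, i * reboot_s + reboot_s + warmup_s)
        for i in range(instances)
    ]
    rollout_end = windows[-1][1]
    outage = 0
    for t in range(rollout_end):
        # An instance is serving at second t if t falls outside its window.
        serving = sum(1 for start, end in windows if not (start <= t < end))
        if serving == 0:
            outage += 1
    return outage


# Hypothetical timings in seconds, chosen only for illustration.
for n in (2, 3, 4):
    print(n, outage_seconds(n, reboot_s=300, warmup_s=90))
```

In this model a third instance removes the outage as long as the application warms up faster than a reboot takes. If the warm-up were longer than the reboot (for example `reboot_s=60, warmup_s=90`), three instances would still leave a short gap, so the extra instance is a cheap fix rather than a guarantee.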

------

Update: Since switching to using more than 2 web roles I have never seen this downtime issue happen again.

Saturday, 17 November 2012

RESAAS Wins Prestigious Microsoft Award

Yesterday my company, RESAAS, was announced as the winner of the 2012 Microsoft Impact Award in the category of Windows Azure Platform ISV Partner of the Year.

Our press release can be found on Bloomberg, and Microsoft has a few more details about our category and who we were up against on the Microsoft Partner Network.

Tuesday, 14 August 2012

RESAAS Windows 8 Application Approved by Microsoft

The engineering team at RESAAS recently started working on a Windows 8 styled application (using the Metro design language) for our real-time Question and Answer feed. I'm happy to say that Microsoft has approved the application and it is now available in the Microsoft Windows 8 Store.

RESAAS also pushed out a press release which found its way onto Yahoo Finance: http://finance.yahoo.com/news/resaas-enterprise-platform-real-estate-160000716.html

A PDF version of the press release can be found on the CNSX, where our company's stock is listed: http://www.cnsx.ca/Storage/1472/134962_NRAug92012.pdf

Tuesday, 17 July 2012

Interviewed by Jonathan Rozenblit for "Canada Does Windows Azure"

Jonathan Rozenblit, who is a Developer Advisor at Microsoft, interviewed me for a series he does called "Canada does Windows Azure". We spent 17 minutes talking about how RESAAS uses Azure, why we decided to use Azure, and how our developers ramped up on the platform when we first started.


Here are links to the posts Jonathan put on his blogs: