VM Management Failure in Azure

When deploying a new VM from the Azure Portal, I came across a bizarre error. The VM deployment was interrupted and left in a failed state. I tried deploying the VM several times to the same resource group, but without any luck.

The Provisioning State was Failed in my case. Surprisingly, the recommendation to remediate the issue seemed a little odd. I was trying to deploy a single Windows VM of the DS2 v2 SKU, nothing out of the ordinary.

After examining the deployment properties, I noticed that the issue was with the Virtual Machine resource in Azure Resource Manager. But the level of detail for the error was still the same.

Although the article Understand common error messages when you manage Windows virtual machines in Azure lists common Windows VM management errors, it doesn’t contain any troubleshooting steps on how to recover. The error I received was:

Allocation failed. Please try reducing the VM size or number of VMs, retry later, or try deploying to a different Availability Set or different Azure location.

As a last resort, I tried starting the failed VM manually from the portal. Well, that worked. I then deployed a couple of VMs into the same resource group, and none of them reported any errors.
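For reference, the same manual start can be done with Azure PowerShell; a minimal sketch, assuming the AzureRM module and hypothetical resource group and VM names:

```powershell
# Hypothetical names; start the failed VM and re-check its provisioning state
$rgName = 'myResourceGroup'
$vmName = 'myFailedVM'

Start-AzureRmVM -ResourceGroupName $rgName -Name $vmName

# Should now report 'Succeeded' if the allocation went through
(Get-AzureRmVM -ResourceGroupName $rgName -Name $vmName).ProvisioningState
```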

Have you faced a similar error before? For me, the mystery remains.

Migrating Azure VMs | From Storage Accounts to Managed Disks

Microsoft now recommends using managed disks for all your Azure IaaS VMs. However, to benefit from the performance improvements that managed disks offer, we first need to convert existing VMs that use unmanaged disks to managed disks.

Keep in Mind

  • The conversion process requires the VM to be restarted.
  • The process is irreversible, meaning that you cannot go back to unmanaged disks once converted.
  • The conversion process deallocates the VM, so you may lose any dynamic private IPs assigned to it. Assign static private IPs to the VMs before you start the conversion.
  • The source VHDs and the storage account used by the unmanaged disks are left behind after migration. You need to delete these manually.
  • The conversion process requires a minimum version of the Azure VM agent, so check the agent version first.
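The agent version check in the last point can also be done with Azure PowerShell; a minimal sketch using the VM instance view, with placeholder resource group and VM names:

```powershell
# Hypothetical names; read the VM agent version from the instance view
$vmStatus = Get-AzureRmVM -ResourceGroupName 'myUnmanagedRG' -Name 'myUnmanagedVM' -Status
$vmStatus.VMAgent.VmAgentVersion
```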

Converting single VMs without an availability set

  • The following cmdlets deallocate the VM first.
$rgName = "myUnmanagedRG"
$vmName = "myUnmanagedVM"
Stop-AzureRmVM -ResourceGroupName $rgName -Name $vmName -Force
  • The following cmdlet converts the VM, including the OS disk and any data disks, to managed disks and starts the VM.
ConvertTo-AzureRmVMManagedDisk -ResourceGroupName $rgName -VMName $vmName
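To confirm the conversion succeeded, you can inspect the VM's storage profile; a sketch, reusing the variables above:

```powershell
# After conversion the OS disk should reference a managed disk resource
$vm = Get-AzureRmVM -ResourceGroupName $rgName -Name $vmName
$vm.StorageProfile.OsDisk.ManagedDisk.Id   # managed disk resource ID
$vm.StorageProfile.OsDisk.Vhd              # empty once the source VHD is no longer used
```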

Converting VMs in an availability set

  • The process for this one is slightly different, as you first need to convert the availability set to a managed (Aligned) availability set. The following cmdlets do the trick.
$rgName = 'myUnmanagedResourceGroup'
$avSetName = 'myUnmanagedAvailabilitySet'

$avSet = Get-AzureRmAvailabilitySet -ResourceGroupName $rgName -Name $avSetName
Update-AzureRmAvailabilitySet -AvailabilitySet $avSet -Sku Aligned
  • However, you may encounter an error saying “The specified fault domain count 3 must fall in the range 1 to 2.” This is because the region where your availability set resides may support only 2 managed fault domains, while the number of unmanaged fault domains is 3. You need to update the fault domain count to 2 and update the SKU to Aligned as below:
$avSet.PlatformFaultDomainCount = 2
Update-AzureRmAvailabilitySet -AvailabilitySet $avSet -Sku Aligned
  • The following cmdlets deallocate and convert each VM in the availability set; the VMs are restarted automatically after conversion.
$avSet = Get-AzureRmAvailabilitySet -ResourceGroupName $rgName -Name $avSetName

foreach($vmInfo in $avSet.VirtualMachinesReferences)
{
    $vm = Get-AzureRmVM -ResourceGroupName $rgName | Where-Object {$_.Id -eq $vmInfo.id}
    Stop-AzureRmVM -ResourceGroupName $rgName -Name $vm.Name -Force
    ConvertTo-AzureRmVMManagedDisk -ResourceGroupName $rgName -VMName $vm.Name
}
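Once all VMs are converted, the leftover VHD blobs mentioned earlier can be cleaned up; a hedged sketch, assuming a hypothetical storage account name and the default 'vhds' container (verify the blobs are no longer needed before deleting):

```powershell
$saName = 'myunmanagedsa'   # hypothetical storage account name
# Index [0] assumes a recent AzureRM version; older versions return keys differently
$saKey  = (Get-AzureRmStorageAccountKey -ResourceGroupName $rgName -Name $saName)[0].Value
$ctx    = New-AzureStorageContext -StorageAccountName $saName -StorageAccountKey $saKey

# Delete every blob in the 'vhds' container
Get-AzureStorageBlob -Container 'vhds' -Context $ctx | Remove-AzureStorageBlob
```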

Azure Site Recovery updates | Managed Disks & Availability Sets

The Azure Site Recovery team has made some significant improvements to the service over the past couple of months. Recently, Microsoft announced support for managed disks and availability sets in ASR.

Managed Disks in ASR

Managed disks allow simplified disk management for Azure IaaS VMs, and users no longer have to leverage storage accounts to store the VHD files. With ASR, you can attach managed disks to your IaaS VMs during a failover or migration to Azure. Additionally, using managed disks ensures reliability for VMs placed in availability sets by guaranteeing that failed-over VMs are automatically placed in different storage scale units (stamps) to avoid a single point of failure.

Availability Sets in ASR

Site Recovery now supports configuring VMs into availability sets in the ASR VM settings. Previously, users had to integrate a script into the recovery plan to achieve this. Now you can configure availability sets before the failover, so no manual intervention is required.

Below are some considerations when using these two features.

  • Managed disks are supported only in the Resource Manager deployment model.
  • VMs with managed disks can only be part of availability sets that have the “Use managed disks” property set to Yes.
  • Creation of managed disks will fail if the replication storage account is encrypted with Storage Service Encryption (SSE). If this happens during a failover, you can either set “Use managed disks” to “No” in the Compute and Network settings for the VM and retry the failover, or disable protection for the VM and re-protect it to a storage account without Storage Service Encryption enabled.
  • For SCVMM-managed or unmanaged Hyper-V VMs, use this option only if you plan to migrate to Azure; failback from Azure to an on-premises Hyper-V environment is not currently supported for VMs with managed disks.
  • Disaster recovery of Azure IaaS machines with managed disks is currently not supported.

VM extensions installed in Azure VMs | Where do they all come from?

I have noticed that no matter what deployment method I use (Portal or ARM template) to deploy an Azure VM in my subscription, I always get two additional VM extensions installed by default. I have no intention of using OMS or Security Center for certain workloads I run on Azure, but having these two extensions hanging around in a VM looked weird to me, as I never installed them in the first place.

As you can see in the image below, the MicrosoftMonitoringAgent and Monitoring VM extensions are installed after a VM has been provisioned.

Security Center Data Collection from Azure VMs

When you enable Azure Security Center for your subscriptions, data collection is turned on by default. This setting provisions the Microsoft Monitoring Agent (which explains the auto-installed extension) on all Azure VMs in the subscription(s) and on any new VMs that you create. The Microsoft Monitoring Agent scans for security-related configurations and posts them into Event Tracing for Windows (ETW) traces. Any event logs raised by the guest OS are also read by the Microsoft Monitoring Agent and posted back to OMS.

If you are using the Free tier of Azure Security Center, data collection can be disabled in the Security Policy as shown below. Enabling data collection is required for any subscription that uses the Azure Security Center Standard tier.

Once I disabled data collection, the OMS agent that had been auto-provisioned in my Azure VMs was removed. Also note that even though data collection has been turned off, VM disk snapshots and artifact collection will still be enabled. However, Microsoft does recommend enabling data collection regardless of the Security Center tier your subscription is on.
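If you want to remove the auto-provisioned extension from an individual VM yourself, this can be done with Azure PowerShell; a sketch with hypothetical names:

```powershell
# Hypothetical resource group and VM names; removes the monitoring extension
Remove-AzureRmVMExtension -ResourceGroupName 'myRG' -VMName 'myVM' `
    -Name 'MicrosoftMonitoringAgent' -Force
```

Keep in mind that while data collection is still enabled on the subscription, Security Center's auto-provisioning may simply reinstall the extension.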


APM in SCOM 2016 | Doomed or Saved?

There is a scary bug in SCOM 2016. If you are using the Application Performance Management (APM) feature with SCOM 2016, you may run into an issue where the SCOM 2016 agent causes a crash in IIS application pools running under the .NET 2.0 runtime. The underlying cause of this issue is that the APM code of the SCOM 2016 Microsoft Monitoring Agent uses a memory allocation that is incompatible with the .NET 2.0 runtime, resulting in a crash if this memory is later accessed in a certain way. The SCOM 2012 R2 agent doesn’t have this issue, since the code that causes this behavior is not present in that version.

Microsoft provided a fix for this issue with SCOM 2016 Update Rollup 3. Unfortunately, that hotfix does not rectify the issue, and Microsoft is working on another hotfix to correct the behavior.

There are several workarounds that you can perform in order to remediate this issue.

  • Migrating the application pool to the .NET 4.0 runtime;
  • Installing the SCOM 2012 R2 agent, as it is forward-compatible with the SCOM 2016 server and the APM feature will continue to work with the older binaries;
  • Reinstalling the SCOM 2016 Microsoft Monitoring Agent with the NOAPM=1 switch on the msiexec.exe setup command line to exclude the APM feature from setup.
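For the last workaround, the setup command line looks roughly like this, assuming the agent installer is the usual MOMAgent.msi (the file name and path may differ for your installation media):

```
msiexec.exe /i MOMAgent.msi NOAPM=1 /qn
```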

There are some additional issues caused by this bug.

SharePoint Central Administration site stops working when SCOM 2016 Agent is installed onto the server

Even though the APM feature is disabled by default when you install the SCOM 2016 agent, it adds a registry setting that loads inactive APM into IIS application pools. If you don’t configure APM on the SharePoint servers, the application pools will have APM loaded in an inactive state without monitoring. It has been reported that the inactive APM may crash the SharePoint Central Administration v4 application pool and prevent the application from starting.

Known Workarounds

  • Install SCOM 2012 R2 agent if APM is needed.
  • If APM is not needed, re-install the SCOM 2016 agent with “NOAPM=1” from the command line.

Web Site crashes during startup when SCOM 2016 Agent is installed

As described above, APM adds a registry setting that loads inactive APM into IIS application pools even when APM is disabled (but installed). The application pool account needs access to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\System Center Operations Manager\12\APMAgent registry key. If the account cannot access this key, the inactive APM configuration cannot be read and the application pool process may crash. As a workaround, you can grant the application pool account “Read” permission on this registry key.
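The permission grant can also be scripted in PowerShell on the affected server; a sketch assuming a hypothetical application pool identity:

```powershell
# Hypothetical application pool identity; grants ReadKey on the APMAgent key
$keyPath = 'HKLM:\SOFTWARE\Microsoft\System Center Operations Manager\12\APMAgent'

$acl  = Get-Acl -Path $keyPath
$rule = New-Object System.Security.AccessControl.RegistryAccessRule(
    'IIS AppPool\MyAppPool', 'ReadKey', 'Allow')
$acl.AddAccessRule($rule)
Set-Acl -Path $keyPath -AclObject $acl
```

Replace 'IIS AppPool\MyAppPool' with the actual identity your application pool runs under.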