Winter is here | Azure Stack is now GA

Microsoft Azure Stack is an extension of Azure that brings Azure services to your datacenter. This revolutionary product is a true hybrid cloud offering, allowing you to deploy Azure services across both public Azure and your own local Azure Stack environments. At the Microsoft Inspire event, Microsoft announced the general availability of Azure Stack along with its roadmap.

Azure Stack Integrated Systems

The Azure Stack production release is delivered as a multi-server integrated system. Dell EMC, HPE & Lenovo are the launch OEM partners for Azure Stack and will start shipping Azure Stack hardware in September 2017. Cisco & Huawei will join the OEM programme in 2018.

Azure Stack Development Kit (ASDK)

Previously known as the technical previews, the ASDK allows you to test Azure Stack in a single-server deployment on your own hardware. This is useful for building and validating your workloads before running them in a production Azure Stack deployment, or for proof-of-concept scenarios.

I have listed below some key resources that are available to help you learn and embrace the Azure Stack momentum.

MVPs Kerrie Meyler, Jakob Svendsen, Steve Buchanan, Mark Scholman and I have been working on a book on Microsoft Hybrid Cloud, which will take you through the journey of mastering the skills required to manage Azure Stack. The book is expected to be released in Q4 2017 and you can pre-order your copy via Amazon using this link.

VM Management Failure in Azure

While deploying a new VM from the Azure Portal, I came across a bizarre error. The VM deployment was interrupted and left in a failed state. I tried deploying the VM several times to the same resource group, but without any luck.

The provisioning state was Failed in my case. Surprisingly, the recommendation to remediate this issue seemed a little odd. I was trying to deploy a single Windows VM of the DS2 v2 SKU, nothing out of the ordinary.

After examining the deployment properties, I noticed that the issue was with the Virtual Machine resource in Azure Resource Manager. But the level of detail in the error was still the same.

Although the article Understand common error messages when you manage Windows virtual machines in Azure lists common Windows VM management errors in Azure, it doesn’t contain any troubleshooting steps on how to recover.

Allocation failed. Please try reducing the VM size or number of VMs, retry later, or try deploying to a different Availability Set or different Azure location.

As a last resort, I tried starting the failed VM manually from the portal. Well, that worked anyway. I then deployed a couple of VMs into the same resource group, and they did not report any errors.
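If you prefer PowerShell over the portal, a minimal sketch like the one below (assuming the AzureRM module and placeholder resource group/VM names) can be used to check the provisioning state and start the failed VM, which retries the allocation:

# Check the provisioning and power state of the VM (placeholder names)
Get-AzureRmVM -ResourceGroupName "MyResourceGroup" -Name "MyVM" -Status

# Start the failed VM manually, forcing a new allocation attempt
Start-AzureRmVM -ResourceGroupName "MyResourceGroup" -Name "MyVM"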

Have you faced a similar error before? For me, the mystery still remains.

Azure Site Recovery updates | Managed Disks & Availability Sets

The Azure Site Recovery team has made some significant improvements to the service over the past couple of months. Microsoft has recently announced support for managed disks and availability sets with ASR.

Managed Disks in ASR

Managed disks simplify disk management for Azure IaaS VMs, as users no longer have to manage storage accounts to store the VHD files. With ASR, you can attach managed disks to your IaaS VMs during a failover or migration to Azure. Additionally, using managed disks ensures reliability for VMs placed in availability sets by guaranteeing that the failed-over VMs are automatically placed in different storage scale units (stamps) to avoid any single point of failure.

Availability Sets in ASR

Site Recovery now supports configuring availability sets in the ASR VM settings. Previously, users had to leverage a script integrated into the recovery plan to achieve this. Now you can configure availability sets before the failover, so you do not need to rely on any manual intervention.

Below are some considerations to be made when you are using these two features.

  • Managed disks are supported only in the Resource Manager deployment model.
  • VMs with managed disks can only be part of availability sets that have the “Use managed disks” property set to Yes.
  • Creation of managed disks will fail if the replication storage account is encrypted with Storage Service Encryption (SSE). If this happens during a failover, you can either set “Use managed disks” to “No” in the Compute and Network settings for the VM and retry the failover, or disable protection for the VM and protect it to a storage account without Storage Service Encryption enabled.
  • Use the “Use managed disks” option only if you plan to migrate to Azure. For SCVMM-managed and unmanaged Hyper-V VMs, failback from Azure to an on-premises Hyper-V environment is not currently supported for VMs with managed disks.
  • Disaster recovery of Azure IaaS machines with managed disks is not currently supported.

VM extensions installed in Azure VMs | Where do they all come from?

I have noticed that no matter what deployment method I use (portal or ARM template) to deploy an Azure VM in my subscription, I always get two additional VM extensions installed by default. I have no intention of using OMS or Security Center for certain workloads I run on Azure, but having these two extensions hanging around in a VM looked weird to me, because I never installed them in the first place.

As you can see in the below image, the MicrosoftMonitoringAgent & Monitoring VM extensions are installed after a VM has been provisioned.
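You can also check which extensions are attached to a VM without the portal. A minimal sketch, assuming the AzureRM module and placeholder resource group/VM names:

# List the extensions installed on a VM (placeholder names)
$vm = Get-AzureRmVM -ResourceGroupName "MyResourceGroup" -Name "MyVM"
$vm.Extensions | Select-Object Name, Publisher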

Security Center Data Collection from Azure VMs

When you enable Azure Security Center for your subscriptions, data collection is turned on by default. This setting provisions the Microsoft Monitoring Agent (which explains the auto-installed extension) on all Azure VMs in the subscription(s) and on any new VMs that you create. The Microsoft Monitoring Agent scans for security-related configurations and posts them into Event Tracing for Windows (ETW) traces. Also, any event logs raised by the guest OS will be read by the Microsoft Monitoring Agent and posted back to OMS.

If you are using the free tier of Azure Security Center, data collection can be disabled in the Security Policy as shown below. Enabling data collection is required for any subscription that uses the Azure Security Center Standard tier.

Once I disabled data collection, I got rid of the OMS agent that had been auto-provisioned in my Azure VMs. Also note that even though data collection has been turned off, VM disk snapshots and artifact collection will still be enabled. However, Microsoft does recommend enabling data collection regardless of the Security Center tier your subscription is on.

 

APM in SCOM 2016 | Doomed or Saved?

There is a scary bug in SCOM 2016. If you are using the Application Performance Monitoring (APM) feature with SCOM 2016, you may run into an issue where the SCOM 2016 agent causes a crash in IIS application pools running under the .NET 2.0 runtime. The underlying cause for this issue is that the APM code of the SCOM 2016 agent uses a memory allocation within the APM code of the Microsoft Monitoring Agent that is incompatible with the .NET 2.0 runtime, which results in a crash if this memory is later accessed in a certain way. The SCOM 2012 R2 agent doesn’t have this issue, since the code that causes this behavior is not present in that version.

Microsoft provided a fix for this issue in SCOM 2016 Update Rollup 3. Unfortunately, that hotfix does not appear to rectify the issue, and Microsoft is working on another hotfix to correct this behavior.

There are several workarounds that you can perform in order to remediate this issue.

  • Migrating the application pool to the .NET 4.0 runtime;
  • Installing the SCOM 2012 R2 agent, as it is forward-compatible with the SCOM 2016 server and the APM feature will continue to work with the older binaries;
  • Reinstalling the SCOM 2016 Microsoft Monitoring Agent with the NOAPM=1 switch on the msiexec.exe setup command line to exclude the APM feature from setup (see the sketch below).
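As a rough illustration of the last workaround, the agent MSI can be reinstalled with the APM feature excluded. The MSI path below is a placeholder; point it at your actual agent installation media and run it from an elevated prompt:

# Reinstall the SCOM 2016 agent without the APM component (MSI path is a placeholder)
msiexec.exe /i "C:\Temp\MOMAgent.msi" NOAPM=1 /qn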

There are some additional issues caused by this bug.

SharePoint Central Administration site stops working when SCOM 2016 Agent is installed onto the server

Even though the APM feature is disabled by default when you install the SCOM 2016 agent, it adds a registry setting that loads the inactive APM module into IIS application pools. If you don’t configure APM on the SharePoint servers, the application pools will still have APM loaded in an inactive state without monitoring. It has been reported that the inactive APM module may crash the SharePoint Central Administration v4 application pool and prevent the application from starting.

Known Workarounds

  • Install the SCOM 2012 R2 agent if APM is needed.
  • If APM is not needed, re-install the SCOM 2016 agent with "NOAPM=1" from the command line.

Web site crashes during startup when the SCOM 2016 Agent is installed

As described above, APM adds a registry setting that loads the inactive APM module into IIS application pools regardless of whether APM is disabled (but installed). The application pool account needs access to the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\System Center Operations Manager\12\APMAgent registry key. If it cannot access this registry key, the inactive APM configuration cannot be read and the application pool process may crash. As a workaround, you can grant “Read” permission on this registry key to the application pool account, as sketched below.
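A minimal PowerShell sketch of that workaround, assuming an application pool named "MyAppPool" running under the built-in IIS AppPool identity (adjust the account to whatever identity your pool actually uses) and an elevated prompt:

# Grant read access on the APMAgent registry key to the application pool identity
$keyPath  = "HKLM:\SOFTWARE\Microsoft\System Center Operations Manager\12\APMAgent"
$identity = "IIS AppPool\MyAppPool"   # placeholder application pool account

$acl  = Get-Acl -Path $keyPath
$rule = New-Object -TypeName System.Security.AccessControl.RegistryAccessRule -ArgumentList $identity, "ReadKey", "ContainerInherit", "None", "Allow"
$acl.AddAccessRule($rule)
Set-Acl -Path $keyPath -AclObject $acl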

502 Bad Gateway error | Azure Application Gateway Troubleshooting

I was setting up an Azure Application Gateway for a project a couple of days back. The intended workload was git running on nginx. But when I tried to reach the git URL, I noticed that it was failing with a 502 Bad Gateway error.

Initial Troubleshooting Steps

  • We tried to access the backend servers from the Application Gateway IP 10.62.124.6; the backend server IPs are 10.62.66.4 and 10.62.66.5. The Application Gateway is configured for SSL offload.
  • We were able to access the backend servers directly on port 80, but the issue occurs when they are accessed via the Application Gateway.
  • We rebooted the Application Gateway and the backend servers, and configured a custom probe as well. However, the issue was with the request timeout value, which is configured for 30 seconds by default.
  • This means that when a user request is received on the Application Gateway, it is forwarded to the backend pool and the gateway waits 30 seconds for a response; if it fails to get a response back within that period, users receive a 502 error.
  • The issue was temporarily resolved after the timeout on the Backend HTTP settings was changed to 120 seconds (see the sketch below).
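For reference, a rough PowerShell sketch of that timeout change, assuming the AzureRM module; the resource group and backend HTTP settings names are placeholders, and the port, protocol and affinity values must match your existing settings:

# Fetch the gateway and raise the request timeout on its backend HTTP settings
$gw = Get-AzureRmApplicationGateway -Name "gitapp" -ResourceGroupName "MyResourceGroup"

$gw = Set-AzureRmApplicationGatewayBackendHttpSettings -ApplicationGateway $gw -Name "appGatewayBackendHttpSettings" -Port 80 -Protocol Http -CookieBasedAffinity Disabled -RequestTimeout 120

# Commit the change back to Azure
Set-AzureRmApplicationGateway -ApplicationGateway $gw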

Real Deal

Increasing the timeout value was only a temporary fix, as we were unable to find a permanent one. I reached out to Microsoft Support, and they wanted us to run the below diagnostics.

  • Execute the below cmdlet and share the results.

$getgw = Get-AzureRmApplicationGateway -Name <application gateway name> -ResourceGroupName <resource group name>

  • Collect simultaneous network traces:
    1. Start network captures on the on-premises machine (source client machine) and the Azure VMs (backend servers)
      • Windows: Netsh
        1. Command to start the trace:  Netsh trace start capture=yes report=yes maxsize=4096 tracefile=C:\Nettrace.etl
        2. Command to stop the trace:  Netsh trace stop 
      • Linux: TCPdump
        1. TCP DUMP command: sudo tcpdump -i eth0 -s 0 -X -w vmtrace.cap 
    2. Reproduce the behavior.
    3. Stop network captures.

Analysis

The network traces collected on the client machine and the destination servers while the issue was reproduced indicate that, during the period the traces were collected, the backend servers responded with an HTTP “Status: Forbidden” response to every HTTP GET request (default probe) sent by the Application Gateway instances.

This resulted in the Application Gateway marking the backend servers as unhealthy, as the expected response is HTTP 200 OK.
 
The Application Gateway “gitapp” is configured with 2 instances (internal instance IPs: 10.62.124.4 and 10.62.124.5).
 
Trace collected on backend server 10.62.66.4
 
12:50:18 PM 1/3/2017    10.62.124.4         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:18 PM 1/3/2017    10.62.66.4            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.4            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
 
Trace collected on backend server 10.62.66.5
 
12:50:48 PM 1/3/2017    10.62.124.4         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:48 PM 1/3/2017    10.62.66.5            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.5            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /

Root Cause

The ‘rack_attack’ security feature enabled on the backend servers had blacklisted the Application Gateway instance IPs, so the servers were not responding to the Application Gateway, causing it to mark the backend servers as unhealthy.

Fix

Once this feature was disabled on the backend web servers (nginx), the issue was resolved and we could successfully access the web application through the Application Gateway.
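Once the backend was reachable again, the health reported by the gateway could be confirmed from PowerShell as well; a small sketch assuming the AzureRM module and a placeholder resource group name:

# Query the backend health as seen by the Application Gateway
Get-AzureRmApplicationGatewayBackendHealth -Name "gitapp" -ResourceGroupName "MyResourceGroup"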

 

Microsoft Monitoring Agent | New Update Rollup

Microsoft has unveiled a new update rollup, 8.0.11030.0, for the Microsoft Monitoring Agent (MMA) that fixes issues in the previous version of MMA. The fixes in this version include the following.

  • Improved logging for HTTP connection issues
  • Fix for high CPU utilization when you’re reading a Windows event that has an invalid message description
  • Support for Azure US Government cloud

How to get update rollup version 8.0.11030.0 for Microsoft Monitoring Agent (KB3206063)?

This package is available as a manual download from the Microsoft Update Catalog. Search for Microsoft Monitoring Agent and the available updates will appear in the search results.
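To confirm that the update has landed on an agent-managed machine, one rough approach (assuming the default MMA installation path) is to read the file version of the HealthService binary:

# Report the installed Microsoft Monitoring Agent version (default install path assumed)
$healthService = "C:\Program Files\Microsoft Monitoring Agent\Agent\HealthService.exe"
(Get-Item $healthService).VersionInfo.FileVersion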

Hotfix 1 for SCVMM 2016 Update Rollup 1

Microsoft has published a new hotfix, 3208888, for those who are running SCVMM 2016 Update Rollup 1. It includes a fix for an issue where, when you use VMM to live migrate a VM from a host running one version of Windows Server 2016 to a host running a different version of Windows Server 2016 (e.g. Datacenter edition to Standard edition), the placement page assigns a zero rating to the target host. This issue happens only when you try to live migrate between two versions of Windows Server 2016, not when you try a live migration between hosts that are running 2012 R2 and 2016.

This blocks the live migration with the below error message:

Unable to migrate or clone the virtual machine VM_name because the version of virtualization software on the host does not match the version of virtual machine’s virtualization software on source version_number. To be migrated or cloned, the virtual machine must be stopped and should not contain any saved state.

Installing KB3208888

Note that this is applicable only to those who are running SCVMM 2016 Update Rollup 1.

  • Download the KB package from here.
  • Use an elevated Command Prompt to install the KB manually.

msiexec.exe /update kb3208888_vmmserver_amd64.msp

 

Rolling back to a previous version of a SCOM Management Pack

In Operations Manager, you may sometimes need to revert to an older version of a management pack for a particular workload. The Operations Manager UI allows you to delete and re-import MPs from the “Installed Management Packs” screen. The problem arises when there are multiple, multi-level dependencies on the MP that you are trying to delete.

However, there is now an enhanced version of a script available on TechNet (developed by Microsoft employee Christopher Crammond) that will help you revert management packs with a single command.

Using the Script

  • Open the Operations Manager Command Shell prompt as an Administrator.
  • Download the script to remove a management pack with dependencies from here.
  • Execute the script as below. 

 .\RecursiveRemove.ps1 <ID or System Name of the MP>

  • For example, if you want to remove the SQL Server 2014 Discovery MP, run the script as below.

 .\RecursiveRemove.ps1 Microsoft.SQLServer.2014.Discovery

How to get the ID or System Name of an MP?

  • Select the MP that you want to delete in the Installed Management Packs view and click Properties in the Actions pane.
  • Copy the content of the ID text box on the General tab (or use the PowerShell sketch below).
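Alternatively, you can look up the system name from the Operations Manager Shell; a quick sketch where the display name filter is just an example:

# Find the system name (ID) of a management pack by its display name
Get-SCOMManagementPack | Where-Object { $_.DisplayName -like "*SQL Server 2014*" } | Select-Object Name, DisplayName, Version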

VM Version Upgrade | Windows Server 2016 & Windows 10

If you have recently upgraded your datacentre infrastructure to Windows Server 2016 (or your client devices to Windows 10), you can benefit from the latest Hyper-V features on your virtual machines by upgrading their configuration version. Before you upgrade to the latest VM version, make sure that:

  • Your Hyper-V hosts are running the latest version of Windows or Windows Server and you have upgraded the cluster functional level (see the sketch after this list).
  • You are not going to move the VMs back to a Hyper-V host that is running a previous version of Windows or Windows Server.
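For clustered Hyper-V hosts, the cluster functional level is raised after all nodes have been upgraded. A minimal sketch, assuming it is run on one of the cluster nodes:

# Check the current cluster functional level
Get-Cluster | Select-Object Name, ClusterFunctionalLevel

# Raise the functional level once every node runs Windows Server 2016 (this cannot be undone)
Update-ClusterFunctionalLevel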

The process is fairly simple and involves only four steps. First check the current VM configuration version.

  • Run Windows PowerShell as an administrator.
  • Run the Get-VM cmdlet as below and check the versions of your Hyper-V VMs. Alternatively, the configuration version can be obtained by selecting the virtual machine and looking at the Summary tab in Hyper-V Manager.

Get-VM * | Format-Table Name, Version

  • Shut down the VM.
  • Select Action > Upgrade Configuration Version. If you don’t see this option for a VM, it means that it’s already at the highest configuration version supported by that particular Hyper-V host.

If you prefer PowerShell, you can run the below command to upgrade the configuration version.

Update-VMVersion <vmname>
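Putting the steps together, a small end-to-end sketch (the VM name is a placeholder):

# Check the current configuration version, then shut down, upgrade and restart the VM
Get-VM -Name "MyVM" | Format-Table Name, Version

Stop-VM -Name "MyVM"
Update-VMVersion -Name "MyVM" -Confirm:$false
Start-VM -Name "MyVM"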