All posts by Janaka Rangama

502 Bad Gateway error | Azure Application Gateway Troubleshooting

I was setting up an Azure Application Gateway for a project a couple of days back. The intended workload was Git hosted on nginx. But when I tried to reach the Git URL, it failed with a 502 Bad Gateway error.

Initial Troubleshooting Steps

  • We tried to access the backend servers from the Application Gateway IP 10.62.124.6; the backend server IPs are 10.62.66.4 and 10.62.66.5. The Application Gateway is configured for SSL offload.
  • We were able to access the backend servers directly on port 80, but when they were accessed via the Application Gateway the 502 error occurred.
  • We rebooted the Application Gateway and the backend servers and configured a custom probe as well. The issue turned out to be related to the request timeout value, which is set to 30 seconds by default.
  • This means that when a user request is received on the Application Gateway, it is forwarded to the backend pool and the gateway waits 30 seconds for a response; if no response comes back within that period, the user receives a 502 error.
  • The issue was temporarily resolved after the timeout on the Backend HTTP settings was changed to 120 seconds, as shown in the sketch below.
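If you prefer to script the change rather than use the portal, the timeout can be updated with the AzureRM PowerShell module. The snippet below is a minimal sketch: the resource group name and the backend HTTP settings name, port and protocol are assumptions, so match them to your own gateway configuration.

# Load the gateway and update the request timeout on its backend HTTP settings (names are examples)
$gw = Get-AzureRmApplicationGateway -Name gitapp -ResourceGroupName gitapp-rg

Set-AzureRmApplicationGatewayBackendHttpSettings -ApplicationGateway $gw `
    -Name appGatewayBackendHttpSettings -Port 80 -Protocol Http `
    -CookieBasedAffinity Disabled -RequestTimeout 120

# Commit the updated configuration back to Azure
Set-AzureRmApplicationGateway -ApplicationGateway $gw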

Real Deal

Increasing the timeout value was only a temporary workaround, as we were unable to find a permanent fix. I reached out to Microsoft Support, and they wanted us to run the diagnostics below.

  • Execute the below cmdlet and share the results.

$getgw = Get-AzureRmApplicationGateway -Name <application gateway name> -ResourceGroupName <resource group name>

  • Collect simultaneous Network traces:
    1. Start network captures on On-premises machine(Source client machine) and Azure VMs (Backend servers)
      • Windows: Netsh
        1. Command to start the trace:  Netsh trace start capture=yes report=yes maxsize=4096 tracefile=C:\Nettrace.etl
        2. Command to stop the trace:  Netsh trace stop 
      • Linux: TCPdump
        1. TCP DUMP command: sudo tcpdump -i eth0 -s 0 -X -w vmtrace.cap 
    2. Reproduce the behavior.
    3. Stop network captures.

Analysis

The network traces collected on the client machine and the destination servers while the issue was reproduced indicate that, during the capture period, for every HTTP GET request (the default probe) sent by the Application Gateway instances to the backend servers, the servers responded with an HTTP “Status: Forbidden” response.

This resulted in the Application Gateway marking the backend servers as unhealthy, since the expected probe response is HTTP 200 OK.
 
The Application Gateway “gitapp” is configured for 2 instances (internal instance IPs: 10.62.124.4 and 10.62.124.5).
 
Trace collected on backend server 10.62.66.4
 
12:50:18 PM 1/3/2017    10.62.124.4         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:18 PM 1/3/2017    10.62.66.4            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.4            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
 
Trace collected on backend server 10.62.66.5
 
12:50:48 PM 1/3/2017    10.62.124.4         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:48 PM 1/3/2017    10.62.66.5            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.5            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /

Root Cause

The security feature ‘rack_attack’ enabled on the backend servers had blacklisted the Application Gateway instance IPs. As a result, the servers were not responding to the Application Gateway probes, causing it to mark the backend servers as unhealthy.

Fix

Once this feature was disabled on the backend web servers (nginx), the issue was resolved and we could successfully access the web application through the Application Gateway.

 

Microsoft Monitoring Agent | New Update Rollup

Microsoft has released update rollup 8.0.11030.0 for the Microsoft Monitoring Agent (MMA), which fixes issues found in the previous version of MMA. The fixes in this version include the following.

  • Improved logging for HTTP connection issues
  • Fix for high CPU utilization when you’re reading a Windows event that has an invalid message description
  • Support for Azure US Government cloud

How to get update rollup version 8.0.11030.0 for Microsoft Monitoring Agent (KB3206063)?

This package is available as a manual download from the Microsoft Update Catalog. Search for “Microsoft Monitoring Agent” and the available updates will appear in the search results.
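To confirm which agent build is currently installed before (or after) applying the rollup, you can check the file version of the agent's HealthService binary. This is just a quick sketch and assumes the default MMA installation path.

# Check the installed Microsoft Monitoring Agent version (assumes the default install path)
$healthService = "$env:ProgramFiles\Microsoft Monitoring Agent\Agent\HealthService.exe"
if (Test-Path $healthService) {
    (Get-Item $healthService).VersionInfo.FileVersion
} else {
    Write-Warning "Microsoft Monitoring Agent was not found at the default path."
}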

Hotfix 1 for SCVMM 2016 Update Rollup 1

Microsoft has published a new hotfix, KB3208888, for those who are running SCVMM 2016 Update Rollup 1. It includes a fix for an issue where, when you use VMM to live migrate a VM from a host running one version of Windows Server 2016 to a host running a different version of Windows Server 2016 (for example, Datacenter edition to Standard edition), the placement page assigns a zero rating to the target host. This issue happens only when you try to live migrate between two versions of Windows Server 2016, not when you live migrate between hosts running Windows Server 2012 R2 and 2016.

This blocks the live migration with the error message below:

Unable to migrate or clone the virtual machine VM_name because the version of virtualization software on the host does not match the version of virtual machine’s virtualization software on source version_number. To be migrated or cloned, the virtual machine must be stopped and should not contain any saved state.

Installing KB3208888

Note that this is applicable only to those who are running SCVMM 2016 Update Rollup 1.

  • Download the KB package from here.
  • Use an elevated Command Prompt to install the KB manually.

msiexec.exe /update kb3208888_vmmserver_amd64.msp
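After the MSP has been applied and the VMM service has restarted, you can verify that the server reports the patched build from the VMM PowerShell module. The check below is only a sketch; it assumes you run it on the VMM management server itself and that the connection object exposes a ProductVersion property.

# Connect to the local VMM management server and show the product version
$vmmServer = Get-SCVMMServer -ComputerName localhost
$vmmServer.ProductVersion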

 

Rolling back to a previous version of a SCOM Management Pack

In Operations Manager you may sometimes need to revert to an older version of a Management Pack for a particular workload. The Operations Manager UI allows you to delete and re-import MPs from the “Installed Management Packs” screen. The problem arises when there are multiple, multi-level dependencies on the MP that you are trying to delete.

However, there is now an enhanced version of a script available on TechNet (developed by Microsoft employee Christopher Crammond) that will help you revert Management Packs with a single command.

Using the Script

  • Open the Operations Manager Command Shell prompt as an Administrator.
  • Download the script to remove a management pack with dependencies from here.
  • Execute the script as below. 

 .\RecursiveRemove.ps1 <ID or System Name of the MP>

  • For example, if you want to remove the SQL Server 2014 Discovery MP, run the script as below.

 .\RecursiveRemove.ps1 Microsoft.SQLServer.2014.Discovery

How to get the ID or System Name of an MP?

  • Select the MP that you want to delete in the Installed Management Packs view and click Properties in the Actions pane.
  • Copy the value from the ID: text box on the General tab (or use the PowerShell lookup below).
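If you would rather stay in the Command Shell, the system name can also be looked up with Get-SCOMManagementPack. This is a sketch; the display-name filter below is just an example.

# Look up the system name (ID) of a management pack by its display name
Get-SCOMManagementPack -DisplayName "*SQL Server 2014*" |
    Select-Object Name, DisplayName, Version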

VM Version Upgrade | Windows Server 2016 & Windows 10

If you have recently upgraded your datacentre infrastructure to Windows Server 2016 (or your client device to Windows 10), you can benefit from the latest Hyper-V features on your virtual machines by upgrading their configuration version. Before you upgrade to the latest VM version, make sure that:

  • Your Hyper-V hosts are running the latest version of Windows or Windows Server and you have upgraded the cluster functional level.
  • You are not going to move the VMs back to a Hyper-V host that is running a previous version of Windows or Windows Server.

The process is fairly simple and involves only four steps. First, check the current VM configuration version.

  • Run Windows PowerShell as an administrator.
  • Run the Get-VM cmdlet as below and check the versions of the Hyper-V VMs. Alternatively, the configuration version can be obtained by selecting the virtual machine and looking at the Summary tab in Hyper-V Manager.

Get-VM * | Format-Table Name, Version

  • Shutdown the VM.
  • Select Action > Upgrade Configuration Version. If you don’t see this option for a VM, it is already at the highest configuration version supported by that particular Hyper-V host.

If you prefer PowerShell you can run the below command to upgrade the configuration version.

Update-VMVersion <vmname> 
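If you want to upgrade several VMs at once, something along the lines of the snippet below works as well. It is only a sketch: it assumes the VMs are already shut down and that 8.0 (the Windows Server 2016 level) is the configuration version you are targeting, so adjust the filter to your environment.

# Upgrade every powered-off VM that is still below configuration version 8.0
Get-VM |
    Where-Object { $_.State -eq 'Off' -and [decimal]$_.Version -lt 8.0 } |
    Update-VMVersion -Confirm:$false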

Tenant VM Startup Failure in Azure Stack TP2 Refresh

Microsoft has recently released a refresh build for Azure Stack Technical Preview 2. My colleague, CDM MVP Nirmal Thewarathanthri, and I were eager to deploy the new build from day one. This post is about a known issue in this build which prevents tenant VMs from starting automatically after a host power failure.

Scenario

After we created a tenant VM in the Azure Stack portal and verified that it was running properly, we decided to turn off the host for the day and start the load testing the next day. When the host was turned on the next day, the tenant VM was missing in Hyper-V Manager and was in a failed state in the Azure Stack portal. Not only that, the VMs used by the MySQL and SQL PaaS RPs had also disappeared.

[Screenshots: 1-mas-tp2-vms-running, 2-mas-tp2-vms-running-hvm]

As you can see below, neither deleting the VM nor deleting its resource group works in the portal. The VM status is also shown as Unknown in the portal.

[Screenshots: 3-mas-tp2-vms-unkown-portal, 4-mas-tp2-rg-deletion-failure, 4-mas-tp2-vm-deletion-failure]

The Azure Stack TP2 management VMs, however, started automatically after the power failure.

[Screenshot: 5-mas-tp2-vms-missing-hvm]

Solution

We noticed in Failover Cluster Manager on the MAS TP2 host that all the tenant VMs, including the PaaS RP VMs, were in a saved state after the power failure. Once we started these VMs, they came online in both Hyper-V Manager and the Azure Stack portal, and we could then successfully delete the affected resource group or tenant VM.

[Screenshot: 6-mas-tp2-vms-saved-fcm]
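If you prefer to bring the saved VMs back from PowerShell instead of Failover Cluster Manager, a one-liner like the following will do it. This is a sketch intended to be run on the MAS TP2 host.

# Start every VM on the host that is currently in a saved state
Get-VM | Where-Object { $_.State -eq 'Saved' } | Start-VM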

RCA

This seems to be a known bug in the TP2 refresh: the management VMs start up automatically (after taking some time), whereas tenant VMs, including the PaaS RP VMs, do not start automatically after a power failure. The workaround is to start them manually after all 14 management VMs are up and running.

You can refer to this link for a list of known issues in Azure Stack TP.

 

Storage Spaces Direct | Deploying S2D in Azure

This post explores how to build a Storage Spaces Direct lab in Azure. Bear in mind that S2D in Azure is not yet a supported scenario for production workloads.

The following are the high-level steps that need to be followed in order to provision an S2D lab in Azure. For this lab, I’m using DS1 v2 VMs with Windows Server 2016 Datacenter edition for all the roles, and two P20 512 GB Premium SSD disks in each storage node.

Create a VNET

In my Azure tenant I have created a VNET called s2d-vnet with the 10.0.0.0/24 address space and a single subnet, as below.

[Screenshot: 1-s2d-create-vnet]
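If you would rather script this step than use the portal, the VNET can be created with the AzureRM module along these lines. This is a sketch only; the resource group name, location, and subnet name are assumptions, so change them to suit your environment.

# Create a resource group, a subnet configuration and the VNET (names and location are examples)
New-AzureRmResourceGroup -Name s2d-rg -Location "West Europe"

$subnet = New-AzureRmVirtualNetworkSubnetConfig -Name default -AddressPrefix 10.0.0.0/24

New-AzureRmVirtualNetwork -Name s2d-vnet -ResourceGroupName s2d-rg `
    -Location "West Europe" -AddressPrefix 10.0.0.0/24 -Subnet $subnet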

Create a Domain Controller

I have deployed a domain controller called jcb-dc in a new Windows Active Directory domain, jcb.com, with the DNS role installed. Once the DNS role was installed, I changed the DNS server IP address on s2d-vnet to my domain controller’s IP address. You may wonder what the second DNS IP address is: it is the default Azure DNS IP address, added as a redundant DNS server in case we lose connectivity to the domain controller. This provides Internet name resolution to the VMs if the domain controller is no longer functional.

[Screenshot: 1-s2d-vnet-dns]

Create the Cluster Nodes

Here I have deployed 3 Windows Server VMs, jcb-node01, jcb-node02 and jcb-node03, and joined them to the jcb.com domain. All 3 nodes are deployed in a single availability set.

Configure Failover Clustering

Now we have to configure the failover cluster. I’m installing the Failover Clustering feature on all 3 nodes using the PowerShell snippet below.

$nodes = ("jcb-node01", "jcb-node02", "jcb-node03")

icm $nodes {Install-WindowsFeature Failover-Clustering -IncludeAllSubFeature -IncludeManagementTools}

[Screenshot: 3-s2d-install-fc]

Then I’m going to create the failover cluster by executing the snippet below on any of the three nodes. This will create a failover cluster called JCB-CLU.

$nodes = ("jcb-node01", "jcb-node02", "jcb-node03")

New-Cluster -Name JCB-CLU -Node $nodes -StaticAddress 10.0.0.10

[Screenshot: 4-s2d-create-fc]

Deploying S2D

When I execute the Enable-ClusterS2D cmdlet, it enables Storage Spaces Direct and starts creating a storage pool automatically, as shown in the screenshots below.
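The command itself is a one-liner run from any of the cluster nodes; suppressing the confirmation prompt, as in this sketch, is optional.

# Enable Storage Spaces Direct on the cluster (run from any cluster node)
Enable-ClusterS2D -Confirm:$false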

[Screenshots: 5-s2d-enable-1, 5-s2d-enable-2, 12-s2d-csv]

You can see that the storage pool has been created.

[Screenshots: 7-s2d-pool-fcm, 8-s2d-pool]

Creating a Volume

Now we can create a volume in our new S2D setup.

New-Volume -StoragePoolFriendlyName S2D* -FriendlyName JCBVDisk01 -FileSystem CSVFS_REFS -Size 800GB

[Screenshot: 9-s2d-create-volume]

Implementing Scale-out File Server Role

Now we can proceed with installing the File Server role, followed by adding the Scale-Out File Server cluster role.

icm $nodes {Install-WindowsFeature FS-FileServer}

Add-ClusterScaleOutFileServerRole -Name jcb-sofs

[Screenshots: 10-s2d-sofs-install, 11-s2d-sofs-enable]

Finally I have created an SMB share called Janaka in the newly created CSV.
[Screenshot: 13-s2d-smb-share]
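Creating the share can also be scripted. The snippet below is a sketch: the CSV mount-point path and the access list are assumptions, so substitute the actual folder path of the JCBVDisk01 volume and the groups you want to grant.

# Create a folder on the CSV and share it over SMB (path and permissions are examples)
New-Item -Path C:\ClusterStorage\Volume1\Janaka -ItemType Directory

New-SmbShare -Name Janaka -Path C:\ClusterStorage\Volume1\Janaka -FullAccess "jcb\Domain Admins"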

Automating S2D Deployment in Azure with ARM Templates

If you want to automate the entire deployment of the S2D lab, you can use the ARM template below by Keith Mayer, which will create a 2-node S2D cluster.

Create a Storage Spaces Direct (S2D) Scale-Out File Server (SOFS) Cluster with Windows Server 2016 on an existing VNET

This template requires you to have an active VNET and a domain controller deployed first, which you can automate using the ARM template below.

Create a 2 new Windows VMs, create a new AD Forest, Domain and 2 DCs in an availability set

We will discuss how to use DISKSPD and VMFleet to perform load and stress testing in an S2D deployment in our next post.

New Security Features in Azure Backup

Microsoft has recently introduced new security capabilities in Azure Backup which allow you to protect your backups against data compromise and attacks. These features are now built into the Recovery Services vault, and you can enable and start using them in a matter of 5 minutes.

Prevention

For critical operations such as deleting backup data or changing the passphrase, Azure Backup now adds an additional authentication layer: you need to provide a Security PIN, which is available only to users with valid Azure credentials who can access the backup vault.

Alerting

You can now configure email notifications to be sent to specified users for operations that have an impact on the availability of the backup data.

Recovery

You can configure Azure Backup to retain deleted backup data for 14 days, allowing you to recover the deleted data from the existing recovery points. When enabled, it always maintains more than one recovery point, so that there are enough recovery points from which the deleted data can be recovered.

How do I enable security features in Azure Backup?

These security features are now built into the recovery services vault where you can enable all of them with a single click.

[Screenshot: 1-enable-azure-backup-security]

Following are the requirements and considerations that you should be aware of when you enable these new security features.

  • The minimum MAB (Microsoft Azure Backup) agent version is 2.0.9052; if you are on an older agent, upgrade to this version immediately after you have enabled these features.
  • If you are using Azure Backup Server, the minimum MAB agent version is 2.0.9052 together with Azure Backup Server upgrade 1.
  • Currently these settings won’t work with Data Protection Manager and will only be enabled with future Update Roll-ups.
  • Currently these settings won’t work with IaaS VM Backups.
  • Enabling these settings is a one-time action which is irreversible.

Testing new security features

In the video below I try to change the passphrase of my Azure Backup agent and save it. Note that I have to provide a Security PIN in order to proceed; otherwise the operation fails.

Next, I set up backup alerts for my Recovery Services vault. Once I create an alert subscription, I delete my previous backup schedule. I then have the chance of restoring the data within 14 days of deletion.

Storage Spaces Direct | Architecture

In my last post I explained the basics of Storage Spaces Direct in Windows Server 2016. This post explores the internals of S2D and its architecture in a much simpler context.

S2D Architecture & Design

(Image Courtesy) Microsoft TechNet

S2D is designed to provide nearly 600K read IOPS and 1 Tbps of throughput in its ultimate configuration with RDMA adapters and NVMe SSD drives. S2D is all about software-defined storage, so let’s dissect the pieces that make up the S2D paradigm one by one.

Physical Disks – You can deploy S2D on anywhere from 2 to 16 servers with locally attached SATA, SAS, or NVMe drives. Keep in mind that each server should have at least 2 SSDs and at least 4 additional drives, which can be SAS or SATA HDDs. These commodity SATA and SAS devices should be attached via a host-bus adapter (HBA) and SAS expander.

Software Storage Bus – Think of this as the Fibre Channel or shared SAS cabling in your SAN solution. The Software Storage Bus spans the storage cluster to establish a software-defined storage fabric, allowing every server to see all the local drives in every host in the cluster.

Failover Cluster & Networking – For server communication, S2D leverages the native clustering feature in Windows Server and uses SMB3, including SMB Direct and SMB Multichannel, over Ethernet. Microsoft recommends using 10+ GbE network cards (e.g. Mellanox) and switches with remote direct memory access (RDMA), either iWARP or RoCE.

Storage Pool & Storage Spaces – The storage pool (with one pool per cluster recommended) consists of the drives that form the S2D deployment and is created by automatically discovering and adding all eligible drives. Storage Spaces are your software-defined RAID built on top of the storage pool. With S2D, data can tolerate up to two simultaneous drive or server failures, with chassis and rack fault tolerance available as well.
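For example, the resiliency is chosen when you create a volume from the pool. The one-liner below is a sketch (the volume name and size are placeholders) that creates a mirror volume; on a cluster with three or more nodes this defaults to a three-way mirror, which is what provides tolerance for two simultaneous failures.

# Create a mirror CSV volume from the S2D pool (name and size are examples)
New-Volume -StoragePoolFriendlyName S2D* -FriendlyName Mirror01 `
    -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror -Size 1TB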

Storage Bus Layer Cache – The duty of the Software Storage Bus is to dynamically bind the fastest drives present to the slower drives (for example, SSD to HDD), which provides server-side read/write caching to accelerate IO and boost throughput.

Resilient File System (ReFS) & Cluster Shared Volumes – ReFS is a file system that has been built to enhance the server virtualization experience in Windows Server. With the Accelerated VHDX Operations feature in ReFS, the creation, expansion, and checkpoint merging of virtual disks are significantly faster. Cluster Shared Volumes consolidate all the ReFS volumes into a single namespace which you can access from any server, so the volumes become shared storage.

Scale-Out File Server (SOFS) – If your S2D deployment is a converged solution, you must implement SOFS, which provides remote file access over the SMB3 protocol to clients such as a Hyper-V compute cluster. In a hyper-converged S2D solution, both storage and compute reside in the same cluster, so there is no need to introduce SOFS.

In my next post I’m going to explore how we can deploy S2D in Azure. This will be a Converged setup as Azure doesn’t allow nested virtualization. 

Storage Spaces Direct | Introduction

What is Storage Spaces Direct?

Storage Spaces Direct (S2D) is a new storage feature in Windows Server 2016 which allows you to leverage the locally attached disk drives of the servers in your datacentre to build highly available, highly scalable software-defined storage solutions. S2D helps you save on investments in expensive SAN or NAS solutions by allowing you to combine your existing NVMe, SSD or SAS drives to provide high-performing, simple storage for your datacentre workloads.

S2D Deployment Choices

There are two deployment options available with S2D.

Converged

In a converged, or disaggregated, S2D architecture, Scale-Out File Servers (SOFS) built on top of S2D provide shared storage over SMB3 file shares. Like traditional NAS systems, this separates the storage layer from compute, and this option is ideal for large-scale enterprise deployments such as Hyper-V VMs hosted by a service provider.

(Image Courtesy) Microsoft TechNet

Hyper Converged

With hyper-converged S2D deployments, both the compute and storage layers reside on the same servers. This further reduces hardware cost and is ideal for SMEs.

(Image Courtesy) Microsoft TechNet

S2D is the successor to Storage Spaces, introduced in Windows Server 2012, and it is the underlying storage system for Microsoft Azure and Azure Stack. In my next post I will explain the S2D architecture and the key components of an S2D solution in more detail.

The following video explains the core concepts of S2D, its internals, and its use cases.