Data corruption issue with NTFS sparse files in Windows Server 2016

Microsoft has released a new patch, KB4025334, which prevents a critical data corruption issue with NTFS sparse files in Windows Server 2016. This patch prevents possible data corruption that could occur when using Data Deduplication in Windows Server 2016. However, the update also prevents this issue in all applications and Windows components that leverage sparse files on NTFS in Windows Server 2016.

Although this is an optional update, Microsoft recommends installing this KB to avoid any corruption in Data Deduplication, although the KB doesn’t provide a way to recover from existing data corruption. The reason is that NTFS incorrectly removes in-use clusters from the file, and there is no way to identify afterwards which clusters were incorrectly removed. Furthermore, this update will become a mandatory patch in the “Patch Tuesday” release cycle in August 2017.

Since this issue is hard to notice, you won’t be able to detect it by monitoring the weekly Dedup integrity scrubbing job. To overcome this challenge, the KB also includes an update to chkdsk which will allow you to identify which files are already corrupted.

Identifying corrupted NTFS sparse files with chkdsk in KB4025334

  • First, install KB4025334 on the affected servers and restart them. Keep in mind that if your servers are in a failover cluster, this patch needs to be applied to all the servers in the cluster.
  • Execute chkdsk in read-only mode, which is the default mode for chkdsk.
  • For any possibly corrupted files, chkdsk will provide output similar to the below. Here 20000000000f3 is the file ID; make a note of all the file IDs in the output.
The total allocated size in attribute record (128, "") of file 20000000000f3 is incorrect.
  • Then you can use fsutil to query the corrupted files by their IDs, as in the example below.
D:\affectedfolder> fsutil file queryfilenamebyid D:\ 0x20000000000f3
  • Once you run the above command, you should get output similar to the below. D:\affectedfolder\TEST.0 is the corrupted file in this case.
A random link name to this file is \\?\D:\affectedfolder\TEST.0
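
If chkdsk flags more than a handful of file IDs, a small PowerShell loop can resolve them all in one pass. This is only a sketch; the drive letter and the file ID list are placeholders based on the example above.

$volume = "D:\"
$fileIds = @("0x20000000000f3")   # add every file ID noted from the chkdsk output

foreach ($id in $fileIds) {
    # fsutil maps each NTFS file ID back to a path on the affected volume
    fsutil file queryFileNameById $volume $id
}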

VM Inception | Nested Virtualization in Azure

I bet that most of you have watched the movie “Inception”, where a group of people build a dream within a dream within a dream. Before Windows Server 2016 you couldn’t deploy a VM within a VM in Hyper-V, and a lot of people were encouraged to use VMware as it supported this capability, called “Nested Virtualization”. With the release of Windows Server 2016 & Hyper-V Server 2016 this functionality has been introduced. It is especially useful when you don’t have a lot of hardware to run your lab environments or want to deploy a PoC system without burning thousands of dollars.

Microsoft announced support for nested virtualization in Azure IaaS VMs using the newly announced Dv3 and Ev3 VM sizes. This capability allows you to create nested VMs in an Azure VM and also run Hyper-V containers in Azure by using nested VM hosts. Now let’s have a look at how this is implemented in the Azure compute fabric.

Image Courtesy Build 2017

As you can see in the above diagram, on top of the Azure hardware layer Microsoft has deployed the Windows Server 2016 Hyper-V hypervisor. Microsoft then adds vCPUs on top of that to expose the Azure IaaS VMs that you would normally get. With nested virtualization, you can enable Hyper-V inside those Azure IaaS VMs running Windows Server 2016, and then run any number of Hyper-V 2016 supported guest operating systems inside these nested VM hosts.
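
As a rough sketch, turning one of these Dv3/Ev3 Azure IaaS VMs running Windows Server 2016 into a nested Hyper-V host is the same as enabling Hyper-V on any physical server. The command below assumes you run it inside the Azure VM and are happy for the VM to restart.

# Run inside the Azure IaaS VM; installs the Hyper-V role plus management tools and restarts the VM
Install-WindowsFeature -Name Hyper-V -IncludeManagementTools -Restart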

The following references from Microsoft provide more information on how you can get started with nested virtualization in Azure.


502 Bad Gateway error | Azure Application Gateway Troubleshooting

I was setting up an Azure Application Gateway for a project a couple of days back. The intended workload was git on nginx. But when I tried to reach the git URL, I noticed that it was failing with a 502 Bad Gateway error.

Initial Troubleshooting Steps

  • We tried to access the backend servers from the Application Gateway IP 10.62.124.6; the backend server IPs are 10.62.66.4 and 10.62.66.5. The Application Gateway is configured for SSL offload.
  • We were able to access the backend servers directly on port 80, but the issue occurs when they are accessed via the Application Gateway.
  • We rebooted the Application Gateway and the backend servers and configured a custom probe as well, but the issue turned out to be with the request timeout value, which is configured for 30 seconds by default.
  • This means that when a user request is received on the Application Gateway, it forwards it to the backend pool and waits 30 seconds; if it fails to get a response back within that period, users will receive a 502 error.
  • The issue was temporarily resolved after the timeout on the Backend HTTP settings was changed to 120 seconds, as shown in the sketch below.
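
For reference, below is a rough sketch of how that timeout change can be made with the AzureRM PowerShell module. The resource group, gateway and backend HTTP settings names are placeholders; the remaining values are copied from the existing settings.

$gw = Get-AzureRmApplicationGateway -Name "<application gateway name>" -ResourceGroupName "<resource group name>"
$settings = Get-AzureRmApplicationGatewayBackendHttpSettings -ApplicationGateway $gw -Name "<backend http settings name>"

# Re-apply the existing settings with a 120-second request timeout
Set-AzureRmApplicationGatewayBackendHttpSettings -ApplicationGateway $gw -Name $settings.Name -Port $settings.Port -Protocol $settings.Protocol -CookieBasedAffinity $settings.CookieBasedAffinity -RequestTimeout 120

# Commit the change back to the Application Gateway
Set-AzureRmApplicationGateway -ApplicationGateway $gw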

Real Deal

Increasing the timeout value was only a temporary fix, as we were unable to find a permanent one. I reached out to Microsoft Support and they wanted us to run the diagnostics below.

  • Execute the below cmdlet and share the results.

$getgw = Get-AzureRmApplicationGateway -Name <application gateway name> -ResourceGroupName <resource group name>

  • Collect simultaneous Network traces:
    1. Start network captures on On-premises machine(Source client machine) and Azure VMs (Backend servers)
      • Windows: Netsh
        1. Command to start the trace:  Netsh trace start capture=yes report=yes maxsize=4096 tracefile=C:\Nettrace.etl
        2. Command to stop the trace:  Netsh trace stop 
      • Linux: TCPdump
        1. TCP DUMP command: sudo tcpdump -i eth0 -s 0 -X -w vmtrace.cap 
    2. Reproduce the behavior.
    3. Stop network captures.

Analysis

The network traces collected on the client machine and the destination servers while the issue was reproduced indicate that, during the capture period, the servers responded with an HTTP “Status: Forbidden” response to every HTTP GET request (default probing) from the Application Gateway instances.

This has resulted in the Application Gateway marking the backend servers as unhealthy, as the expected response is HTTP 200 OK.
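
If you want to see the health status the Application Gateway currently holds for each backend server without waiting for 502s, the AzureRM module exposes it directly (a sketch; the names below follow the same placeholder convention as earlier).

# Shows the current probe/health status for each backend server behind the gateway
Get-AzureRmApplicationGatewayBackendHealth -Name "<application gateway name>" -ResourceGroupName "<resource group name>"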
 
The Application Gateway “gitapp” is configured with 2 instances (internal instance IPs: 10.62.124.4, 10.62.124.5).
 
Trace collected on backend server 10.62.66.4
 
12:50:18 PM 1/3/2017    10.62.124.4         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:18 PM 1/3/2017    10.62.66.4            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.4            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
 
Trace collected on backend server 10.62.66.5
 
12:50:48 PM 1/3/2017    10.62.124.4         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:48 PM 1/3/2017    10.62.66.5            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.5            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /

Root Cause

The security feature ‘rack_attack’ enabled on the backend servers had blacklisted the Application Gateway instance IPs, so the servers were not responding to the Application Gateway, causing it to mark the backend servers as unhealthy.

Fix

Once this feature was disabled on the backend web servers (nginx), the issue was resolved and we could successfully access the web application through the Application Gateway.

 

Microsoft Monitoring Agent | New Update Rollup

Microsoft has unveiled a new update rollup, 8.0.11030.0, for the Microsoft Monitoring Agent (MMA) that fixes issues in the previous version of MMA. The fixes in this version include the following.

  • Improved logging for HTTP connection issues
  • Fix for high CPU utilization when you’re reading a Windows event that has an invalid message description
  • Support for Azure US Government cloud

How to get update rollup version 8.0.11030.0 for Microsoft Monitoring Agent (KB3206063)?

This package is available as a manual download in the Microsoft Update Catalog. You can search for Microsoft Monitoring Agent and the available updates will appear in the search results.

Hotfix 1 for SCVMM 2016 Update Rollup 1

Microsoft has published a new hotfix, 3208888, for those who are running SCVMM 2016 Update Rollup 1. It includes a fix for the issue where, when you use VMM to live migrate a VM from a host running one version of Windows Server 2016 to another host running a different version of Windows Server 2016 (e.g. Datacenter edition to Standard edition), the placement page assigns a zero rating to the target host. This issue happens only when you try to live migrate between two versions of Windows Server 2016, not when you live migrate between hosts that are running 2012 R2 and 2016.

This blocks the live migration with the error message below:

Unable to migrate or clone the virtual machine VM_name because the version of virtualization software on the host does not match the version of virtual machine’s virtualization software on source version_number. To be migrated or cloned, the virtual machine must be stopped and should not contain any saved state.

Installing KB3208888

Note that this is applicable only to those who are running SCVMM 2016 Update Rollup 1.

  • Download the KB package from here.
  • Use an elevated Command Prompt to install the KB manually.

msiexec.exe /update kb3208888_vmmserver_amd64.msp

 

Rolling back to a previous version of a SCOM Management Pack

In Operations Manager you may sometimes need to revert to an older version of a Management Pack for a particular workload. The Operations Manager UI allows you to delete and re-import MPs from the “Installed Management Packs” screen. The problem arises when there are multiple, multi-level dependencies on the MP that you are trying to delete.

However, there is now an enhanced version of a script available on TechNet (developed by Microsoft employee Christopher Crammond) that will help you revert Management Packs with a single command.

Using the Script

  • Open the Operations Manager Command Shell prompt as an Administrator.
  • Download the script to remove a management pack with dependencies from here.
  • Execute the script as below. 

 .\RecursiveRemove.ps1 <ID or System Name of the MP>

  • For example, if you want to remove the SQL Server 2014 Discovery MP, run the script as below.

 .\RecursiveRemove.ps1 Microsoft.SQLServer.2014.Discovery

How to get the ID or System Name of an MP?

  • Select the MP that you want to delete in the Installed Management Packs view and click Properties in the Actions pane.
  • Copy the content of the ID text box on the General tab.
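
If you prefer not to click through the console, a quick way to look up an MP’s system name is from the Operations Manager Command Shell. This is only a sketch, and the display name filter below is just an example.

# Lists installed MPs whose display name mentions SQL Server 2014; the Name column is the system name
Get-SCOMManagementPack | Where-Object { $_.DisplayName -like "*SQL Server 2014*" } | Select-Object Name, DisplayName, Version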

VM Version Upgrade | Windows Server 2016 & Windows 10

If you have recently upgraded your datacentre infrastructure to Windows Server 2016 (or your client device to Windows 10), you can benefit from the latest Hyper-V features on your virtual machines by upgrading their configuration version. Before you upgrade to the latest VM version, make sure:

  • Your Hyper-V hosts are running the latest version of Windows or Windows Server and you have upgraded the cluster functional level.
  • You are not going to move the VMs back to a Hyper-V host that is running a previous version of Windows or Windows Server.

The process is fairly simple and involves only four steps. First check the current VM configuration version.

  • Run Windows PowerShell as an administrator.
  • Run the Get-VM cmdlet as below and check the versions of the Hyper-V VMs. Alternatively, the configuration version can be obtained by selecting the virtual machine and looking at the Summary tab in Hyper-V Manager.

Get-VM * | Format-Table Name, Version

  • Shut down the VM.
  • Select Action > Upgrade Configuration Version. If you don’t see this option for a VM, it’s already at the highest configuration version supported by that particular Hyper-V host.

If you prefer PowerShell you can run the below command to upgrade the configuration version.

Update-VMVersion <vmname> 
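
If you have a host full of VMs to bring up to the latest configuration version, something along these lines can batch the process. It is only a sketch and assumes every VM can tolerate a brief shutdown.

# Upgrade every VM on this host that is below the newest configuration version the host supports
$latest = (Get-VMHostSupportedVersion -Default).Version

Get-VM | Where-Object { [version]$_.Version -lt $latest } | ForEach-Object {
    Stop-VM -VM $_                      # the VM must be off before the upgrade
    Update-VMVersion -VM $_ -Confirm:$false
    Start-VM -VM $_
}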

Tenant VM Startup Failure in Azure Stack TP2 Refresh

Microsoft has recently released a refresh build for Azure Stack Technical Preview 2. My colleague CDM MVP Nirmal Thewarathanthri and I were eager to deploy the new build from day one. This post is about a known issue in this build that prevents tenant VMs from starting automatically after a host power failure.

Scenario

After we created a tenant VM in the Azure Stack Portal and verified that it was running properly, we decided to turn off the host for the day and start load testing the next day. When the host was turned on the next day, the tenant VM was missing in Hyper-V Manager and showed a failed status in the Azure Stack portal. Not only that, the VMs used by the PaaS RPs (MySQL & SQL) had also disappeared.

1-mas-tp2-vms-running

2-mas-tp2-vms-running-hvm

As you can see below, neither deleting the VM nor deleting the resource group of that VM works in the portal. Also, the VM status is set to Unknown in the portal.

3-mas-tp2-vms-unkown-portal

4-mas-tp2-rg-deletion-failure

4-mas-tp2-vm-deletion-failure

However, the Azure Stack TP2 management VMs started automatically after the power failure.

5-mas-tp2-vms-missing-hvm

Solution

We noticed in Failover Cluster Manager on the MAS TP2 host that all the tenant VMs, including the PaaS RP VMs, are in a saved state after a power failure. Once we start these VMs, they come online in both Hyper-V Manager and the Azure Stack Portal. We can then successfully delete the concerned resource group or tenant VM.

6-mas-tp2-vms-saved-fcm
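
If you would rather not click through Failover Cluster Manager, a rough PowerShell equivalent from the MAS TP2 host is below. It assumes the tenant and PaaS RP VMs are clustered virtual machine groups that are simply not online yet.

# Bring any clustered VM group that is not online (e.g. left in a saved state) back online
Get-ClusterGroup | Where-Object { $_.GroupType -eq "VirtualMachine" -and $_.State -ne "Online" } | Start-ClusterGroup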

RCA

This seems to be a known bug in the TP2 refresh where the management VMs start up automatically (after taking some time) while tenant VMs, including PaaS RP VMs, do not start automatically after a power failure. The workaround is to start them manually after all 14 management VMs are up and running.

You can refer to this link for a list of known issues in Azure Stack TP.

 

Storage Spaces Direct | Deploying S2D in Azure

This post explores how to build a Storage Spaces Direct (S2D) lab in Azure. Bear in mind that S2D in Azure is not yet a supported scenario for production workloads.

The following are the high-level steps to provision an S2D lab in Azure. For this lab, I’m using DS1 v2 VMs with Windows Server 2016 Datacenter edition for all the roles and two P20 512 GB Premium SSD disks in each storage node.

Create a VNET

In my Azure tenant I have created a VNET called s2d-vnet with the 10.0.0.0/24 address space and a single subnet, as below.

1-s2d-create-vnet

Create a Domain Controller

I have deployed a domain controller called jcb-dc in a new Active Directory domain, jcb.com, with the DNS role installed. Once the DNS role had been installed, I changed the DNS server IP address in s2d-vnet to my domain controller’s IP address. You may wonder what the second DNS IP address is. It is the default Azure DNS IP address, added as a redundant DNS server in case we lose connectivity to the domain controller. This will provide Internet name resolution to the VMs if the domain controller is no longer functional.

1-s2d-vnet-dns
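
For reference, the DNS change on the VNET can also be made with the AzureRM module. This is a sketch: the resource group name and the domain controller’s IP (10.0.0.4) are assumptions, while 168.63.129.16 is Azure’s own DNS resolver address.

$vnet = Get-AzureRmVirtualNetwork -Name "s2d-vnet" -ResourceGroupName "<resource group name>"

# Use the DC for DNS first, and keep the Azure-provided resolver as a fallback
$vnet.DhcpOptions.DnsServers = @("10.0.0.4", "168.63.129.16")

Set-AzureRmVirtualNetwork -VirtualNetwork $vnet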

Create the Cluster Nodes

Here I have deployed 3 Windows Server VMs, jcb-node01, jcb-node02 and jcb-node03, and joined them to the jcb.com domain. All 3 nodes are deployed in a single availability set.

Configure Failover Clustering

Now we have to configure the failover cluster. I’m installing the Failover Clustering feature on all 3 nodes using the PowerShell snippet below.

$nodes = ("jcb-node01", "jcb-node02", "jcb-node03")

icm $nodes {Install-WindowsFeature Failover-Clustering -IncludeAllSubFeature -IncludeManagementTools}

3-s2d-install-fc

Then I’m going to create the failover cluster by executing the snippet below on any of the three nodes. This will create a failover cluster called JCB-CLU.

$nodes = ("jcb-node01", "jcb-node02", "jcb-node03")

New-Cluster -Name JCB-CLU -Node $nodes -StaticAddress 10.0.0.10

4-s2d-create-fc

Deploying S2D

When I execute the Enable-ClusterS2D cmdlet, it enables Storage Spaces Direct and starts creating a storage pool automatically, as below.
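
Worth noting for the Azure case (this is an assumption on my part rather than something shown in the screenshots): since the premium data disks expose no cache devices, Enable-ClusterS2D is typically run with the cache disabled and the eligibility checks skipped, along these lines.

# Azure VM data disks expose no cache devices, so disable the S2D cache and skip the eligibility checks
Enable-ClusterS2D -CacheState Disabled -SkipEligibilityChecks -Confirm:$false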

5-s2d-enable-1

5-s2d-enable-2

12-s2d-csv

You can see that the storage pool has been created.

7-s2d-pool-fcm

8-s2d-pool

Creating a Volume

Now we can create a volume in our new S2D setup.

New-Volume -StoragePoolFriendlyName S2D* -FriendlyName JCBVDisk01 -FileSystem CSVFS_REFS -Size 800GB

9-s2d-create-volume

Implementing Scale-out File Server Role

Now we can proceed with the SOFS role installation, followed by adding the SOFS cluster role.

icm $nodes {Install-WindowsFeature FS-FileServer}

Add-ClusterScaleOutFileServerRole -Name jcb-sofs

10-s2d-sofs-install

11-s2d-sofs-enable

Finally I have created an SMB share called Janaka in the newly created CSV.
13-s2d-smb-share
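
For completeness, the share can also be created from PowerShell on one of the cluster nodes. The folder path below is an assumption (use whatever your CSV mount point is), and the access list is just an example.

# Create a folder on the CSV and publish it as an SMB share scoped to the SOFS name
New-Item -Path "C:\ClusterStorage\Volume1\Shares\Janaka" -ItemType Directory
New-SmbShare -Name "Janaka" -Path "C:\ClusterStorage\Volume1\Shares\Janaka" -ScopeName "jcb-sofs" -FullAccess "jcb\Domain Admins"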

Automating S2D Deployment in Azure with ARM Templates

If you want to automate the entire deployment of the S2D lab, you can use the ARM template below by Keith Mayer, which will create a 2-node S2D cluster.

Create a Storage Spaces Direct (S2D) Scale-Out File Server (SOFS) Cluster with Windows Server 2016 on an existing VNET

This template requires you to have an active VNET and a domain controller deployed first, which you can automate using the ARM template below.

Create 2 new Windows VMs, create a new AD Forest, Domain and 2 DCs in an availability set

We will discuss how to use DISKSPD & VMFleet to perform load and stress testing in an S2D deployment in our next post.

New Security Features in Azure Backup

Microsoft has recently introduced new security capabilities in Azure Backup that allow you to secure your backups against data compromise and attacks. These features are built into the Recovery Services vault, and you can enable and start using them within a matter of minutes.

Prevention

For critical operations such as deleting backup data or changing the passphrase, Azure Backup now adds an additional authentication layer: you need to provide a Security PIN, which is available only to users with valid Azure credentials and access to the backup vault.

Alerting

You can now configure email notifications to be sent to specified users for operations that have an impact on the availability of the backup data.

Recovery

You can configure Azure Backup to retain deleted backup data for 14 days, allowing you to recover the deleted data from its recovery points. When enabled, this feature always maintains more than one recovery point so that there are enough recovery points from which to recover the deleted data.

How do I enable security features in Azure Backup?

These security features are now built into the recovery services vault where you can enable all of them with a single click.

1-enable-azure-backup-security

Following are the requirements and considerations that you should be aware of when you enable these new security features.

  • The minimum MAB agent version is 2.0.9052; you should upgrade to this agent version immediately after you enable these features.
  • If you are using Azure Backup Server, the minimum MAB agent version is 2.0.9052 with Azure Backup Server Upgrade 1.
  • Currently these settings don’t work with Data Protection Manager and will only be enabled in future Update Rollups.
  • Currently these settings won’t work with IaaS VM Backups.
  • Enabling these settings is a one-time action which is irreversible.

Testing new security features

In the video below, I’m trying to change the passphrase of my Azure Backup agent and save it. Note that I have to provide a Security PIN in order to proceed; otherwise the operation fails.

Next, I’m going to set up backup alerts for my Recovery Services vault. Once I create an alert subscription, I’m going to delete my previous backup schedule. I will then have the chance to restore the data within 14 days of deletion.