Tag Archives: Troubleshooting

Data corruption issue with NTFS sparse files in Windows Server 2016

Microsoft has released a new patch, KB4025334, which prevents a critical data corruption issue with NTFS sparse files in Windows Server 2016. This patch prevents possible data corruption that could occur when using Data Deduplication in Windows Server 2016. It also prevents this issue in all other applications and Windows components that leverage sparse files on NTFS in Windows Server 2016.

Although this is an optional update, Microsoft recommends installing this KB to avoid any corruption in Data Deduplication, even though the KB doesn’t provide a way to recover from existing data corruption. This is because NTFS incorrectly removes in-use clusters from the file, and there is no way to identify afterwards which clusters were incorrectly removed. Furthermore, this update will become a mandatory patch in the “Patch Tuesday” release cycle in August 2017.

Since this issue is hard to notice, you won’t be able to detect it by monitoring the weekly Dedup integrity scrubbing job. To overcome this challenge, this KB also includes an update to chkdsk which will allow you to identify which files are already corrupted.

Identifying corrupted NTFS sparse files with chkdsk in KB4025334

  • First, install KB4025334 on the affected servers and restart them. Keep in mind that if your servers are in a failover cluster, this patch needs to be applied to all servers in the cluster.
  • Execute chkdsk in read-only mode, which is the default mode for chkdsk.
  • For any possibly corrupted files, chkdsk will provide output similar to the example below. Here 20000000000f3 is the file ID; make a note of all the file IDs in the output.
The total allocated size in attribute record (128, "") of file 20000000000f3 is incorrect.
  • Then you can use fsutil to query the corrupted files by their IDs, as in the example below.
D:\affectedfolder> fsutil file queryfilenamebyid D:\ 0x20000000000f3
  • Once you run the above command, you should see output similar to the example below. In this case, D:\affectedfolder\TEST.0 is the corrupted file.
A random link name to this file is \\?\D:\affectedfolder\TEST.0
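Putting the steps above together, the detection workflow can be sketched as follows (the drive letter and file ID are taken from the example above; substitute your own values):

```batch
:: Run chkdsk in read-only mode (the default when no /f or /r switch is given)
chkdsk D:

:: For each file ID reported as corrupted, resolve it to a file path
fsutil file queryfilenamebyid D:\ 0x20000000000f3
```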

VM Management Failure in Azure

When deploying a new VM from the Azure Portal, I came across a bizarre error. The VM deployment was interrupted and left in a failed state. I tried deploying the VM several times to the same resource group, but without any luck.

The provisioning state was Failed in my case. Surprisingly, the recommendation to remediate this issue seemed a little odd. I was trying to deploy a single Windows VM of the DS2 v2 SKU, nothing out of the ordinary.

After examining the deployment properties, I noticed that the issue was with the Virtual Machine resource in Azure Resource Manager. But the level of detail in the error was still the same.

Although the article Understand common error messages when you manage Windows virtual machines in Azure lists common Windows VM management errors in Azure, it doesn’t contain any troubleshooting steps on how to recover.

Allocation failed. Please try reducing the VM size or number of VMs, retry later, or try deploying to a different Availability Set or different Azure location.

As a last resort, I tried starting the failed VM manually from the portal. Surprisingly, that worked. I then deployed a couple more VMs into the same resource group, and none of them reported any errors.
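If you hit the same state, you can also try starting the failed VM from Azure PowerShell instead of the portal. A minimal sketch, assuming the AzureRM module of that era and placeholder resource group and VM names:

```powershell
# Placeholder names for illustration; replace with your own
Login-AzureRmAccount
Start-AzureRmVM -ResourceGroupName "MyResourceGroup" -Name "MyFailedVM"
```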

Have you faced a similar error before? For me, the mystery remains.

Tenant VM Startup Failure in Azure Stack TP2 Refresh

Microsoft has recently released a refresh build for Azure Stack Technical Preview 2. My colleague CDM MVP Nirmal Thewarathanthri and I were eager to deploy the new build from day one. This post is about a known issue in this build which prevents tenant VMs from being automatically started after a host power failure.

Scenario

After creating a tenant VM in the Azure Stack Portal and verifying that it was running properly, we decided to turn off the host for the day and start load testing the next day. When the host was turned on the next day, the tenant VM was missing in Hyper-V Manager and was in a failed state in the Azure Stack portal. Not only that, the VMs used by the MySQL & SQL PaaS RPs had also disappeared.

1-mas-tp2-vms-running

2-mas-tp2-vms-running-hvm

As you can see below, neither deleting the VM nor deleting its resource group works in the portal. The VM status is also shown as Unknown in the portal.

3-mas-tp2-vms-unkown-portal

4-mas-tp2-rg-deletion-failure

4-mas-tp2-vm-deletion-failure

The Azure Stack TP2 Management VMs, however, had automatically started after the power failure.

5-mas-tp2-vms-missing-hvm

Solution

We noticed in Failover Cluster Manager on the MAS TP2 host that all the tenant VMs, including the PaaS RP VMs, were in a saved state after the power failure. Once we started these VMs, they came back online in both Hyper-V Manager and the Azure Stack Portal, and we could then successfully delete the affected resource group or tenant VM.
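On the MAS TP2 host, the saved-state VMs can also be resumed in one go from PowerShell. A sketch, assuming it is run elevated on the host itself:

```powershell
# Start every VM currently in a saved state on this Hyper-V host
Get-VM | Where-Object { $_.State -eq 'Saved' } | Start-VM
```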

6-mas-tp2-vms-saved-fcm

RCA

This seems to be a known bug in the TP2 refresh where the Management VMs start up automatically (after taking some time), whereas tenant VMs, including the PaaS RP VMs, do not automatically start after a power failure. The workaround is to manually start them after all 14 Management VMs are up and running.

You can refer to this link for a list of known issues in Azure Stack TP.

 

Fix It | October 2016 Cumulative Windows Updates Crash SCOM Console

In my last post I shared the console crashing issue you may face after installing the security updates in MS16-118 and MS16-126. Microsoft has now published new KBs to fix the console crashing issue in SCOM after applying these updates.

Individual hotfixes are available for the following operating systems, which you can download from here.

  • Windows Vista
  • Windows 7
  • Windows 8.1
  • Windows Server 2008
  • Windows Server 2008R2
  • Windows Server 2012
  • Windows Server 2012 R2

For Windows 10 and Server 2016, the fix was applied to the latest cumulative updates.

October 2016 Cumulative Windows Updates Crash SCOM Console

It seems that the October 2016 cumulative Windows updates (KB3194798, KB3192392, KB3185330 & KB3185331) cause the SCOM 2012/2016 consoles to crash regularly on all Windows versions from Windows Server 2008 R2 up to 2016 and from Windows 7 up to Windows 10.

According to Microsoft Germany’s SCOM PFE Dirk Brinkmann, who has blogged about this issue here, the SCOM team is currently working on a fix, and no ETA for a resolution has been provided yet.

Once a fix is available, you will be able to see it via the SCOM team blog.

EWS not working in Exchange Hybrid

Deploying a hybrid Exchange environment should be a carefully planned process, though you might still encounter unexpected hiccups here and there. In a recent hybrid effort, after executing the HCW for Exchange 2013, I noticed two tedious issues in the environment.

  1. Free/Busy information is not visible in Outlook Clients (any version). Scheduling Assistant throws an error saying that it cannot locate the server. Outlook Web App works just fine.
  2. Unable to create a migration batch from the EAC, which throws the error below.

The ExchangeRemote endpoint settings could not be determined from the autodiscover response. No MRSProxy was found running at <FQDN of CAS server>.

I decided to have a look at the properties of the EWS virtual directory on the CAS servers. EWS is responsible for remote moves & free/busy information.

The Enable MRS Proxy endpoint option was selected, as it should be. I could also access the EWS service from outside, so the URL was correct.

EWS (1)

When I checked the authentication section, the Integrated Windows Authentication option was not enabled, so I enabled it. This option allows domain-joined computers to use the session credentials of their Windows logins with Outlook clients.

EWS (2)

After running iisreset on all CAS servers, I noticed that free/busy information was now available via Outlook, and I could successfully create and complete migration batches as well.
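The same authentication change can also be made from the Exchange Management Shell instead of the GUI. A sketch, assuming <Server Name> is replaced with each CAS server in turn:

```powershell
# Enable Integrated Windows Authentication on the EWS virtual directory
Get-WebServicesVirtualDirectory -Server "<Server Name>" |
    Set-WebServicesVirtualDirectory -WindowsAuthentication $true
```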

This TechNet article explains different scenarios in an Exchange Hybrid deployment which can result in such an error, and it is worth a look if you cannot see free/busy information from one tenant to the other (Exchange Online users cannot see Exchange On-Premises free/busy info and vice versa).

Outlook 2007 Clients are disconnected in Exchange Hybrid

Deploying an Exchange Hybrid setup requires a lot of effort, especially around DNS records. In a recent deployment, I noticed that Outlook 2007 clients were in a disconnected state once the Hybrid feature was enabled for an Exchange 2013 single-forest organization.

We were using a new Subject Alternative Name (SAN) certificate from a third-party CA which included the AutoDiscover.domain.com record. An autodiscover record is vital for sharing free/busy information between the Exchange Online & on-premises organizations, as well as for clients to determine where their mailboxes reside.

The problem was not present in Outlook 2010/2013 clients, and creating new user profiles was neither an option nor a remedy. Renewing the SSL certificate has always been troublesome if it is not done the proper way.

Solution

I decided to take a look at the Autodiscover virtual directories first. To do that, open the Exchange Management Shell as administrator and execute the cmdlet below.

Get-AutoDiscoverVirtualDirectory | fl internalurl,externalurl

The virtual directory was not pointing to the correct URL, so the cmdlet below was used to rectify it.

Get-AutodiscoverVirtualDirectory -Server <Server Name> | Set-AutodiscoverVirtualDirectory -ExternalUrl 'https://<Server FQDN>/Autodiscover/Autodiscover.xml' -InternalUrl 'https://<Server FQDN>/Autodiscover/Autodiscover.xml'

I had three MBX/CAS servers to run this cmdlet against; the <Server Name> placeholder represents each of them. Then, in IIS Manager on each MBX/CAS server, right-click the MSExchangeAutoDiscoverAppPool application pool and click Recycle. This recycles the AutoDiscover application pool.
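If you prefer the command line over IIS Manager, the application pool can also be recycled with appcmd. A sketch, to be run on each MBX/CAS server:

```batch
:: Recycle the AutoDiscover application pool
%windir%\system32\inetsrv\appcmd recycle apppool /apppool.name:"MSExchangeAutoDiscoverAppPool"
```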

When I checked the connectivity of the Outlook 2007 clients, most of them showed as connected and e-mails were downloading without any issue. For the clients that didn’t respond we had to reconfigure the Outlook profiles, but those were only a few.

Recommendation

I’m not really comfortable with Outlook 2007, especially if it involves Office 365. If you intend to use Outlook 2007 with this kind of setup, I strongly suggest that you update to the latest service pack or, if possible, upgrade to Outlook 2010 at minimum. Service packs play a vital role, and in this case none of those Outlook clients were patched.

DPM 2012 R2 UR7 Re-released

It’s been a while since my last blog post. I’m working on a DPM 2012 R2 test lab these days, which I’ve planned to update to the latest UR version. When I checked for the latest UR7, I learned that the bits had been re-released.

According to the DPM team, there is an issue in the DPM 2012 R2 UR7 released on 28.07.2015 which causes expired recovery points on disk not to be cleaned up, resulting in growth of the DPM recovery point volume after installing UR7. This re-release addresses that concern, and you can download the updated bits via the DPM 2012 R2 UR7 KB or the Microsoft Update Catalog as of today.

OK, I updated to UR7 before 21.08.2015. Now what?

Those facing this dilemma should know that the re-released UR7 is not pushed via Microsoft Update; you are advised to manually install the new package on any DPM servers that have the older UR7 package installed. The installation process will automatically execute the pruneshadowcopiesDpm2010.ps1 PowerShell script, which contains the fix.

Post-deployment Tips

There is no change to the DPM version (4.2.1338.0) in this re-release, and it will remain the same after the update. You will also have to update the Azure Backup Agent to the latest version (2.0.8719.0) prior to installing DPM UR7 to ensure the integrity of your cloud backups after this release.

For those who, like me, update to UR7 the old-fashioned way (wait a month or two, look out for bugs, and then update), you’ve got nothing to worry about.

Library Server Failure in SCVMM 2012 R2

A few days back I was working with my colleague Law Wen Feng on an SCVMM-managed Hyper-V cluster. The idea was to update the environment from SCVMM 2012 R2 UR2 to UR7. We noticed a strange issue where the library server (the VMM server itself) was complaining about a refresh failure. It seemed the VMM agent was no longer functioning properly on the VMM management server.

WinRM Issue  (1)

As a poor man’s alternative, we removed the library server from VMM. We then tried to re-add the same VMM server as a library server, which resulted in bizarre output. VMM also rejected another file share on a different server which we were hoping to add as an alternative.

WinRM Issue  (2)

The error read that the VMM agent was no longer functional on the target server. But it was in fact running without any issue.

WinRM Issue  (3)

WinRM Issue  (4)

I reached out to my fellow MVP colleagues Kristian Nese, Stanislav Zhelyazkov & Daniel Neumann for suggestions. They suggested that we re-associate the VMM agent with the VMM server. Yes, it sounds like a chicken-and-egg situation, but this is no ordinary Hyper-V host; it is the VMM server itself.

The Register-SCVMMManagedComputer cmdlet can be used to re-associate a managed computer on which the VMM agent software is installed with a different VMM management server. In our case, we chose to register it against the same VMM server.
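For reference, the re-association looks roughly like this (the server name is a placeholder for illustration; run it from the VMM command shell with an account that has administrator rights on the target computer):

```powershell
# Re-associate this managed computer with the VMM management server
$cred = Get-Credential
Register-SCVMMManagedComputer -ComputerName "vmm01.contoso.com" -Credential $cred
```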

WinRM Issue  (5)

Now it was complaining that WinRM was no longer functional. For those unfamiliar, WinRM is a necessary component for remotely managing Windows Server. By default, SCVMM takes care of enabling and running the WinRM service during installation. Rebuilding the VMM server with the retain-DB option was not viable, as we were in the middle of preparing a demo lab and, as I always believe, we needed to get to the bottom of it.

The evil WinRM GPO

We checked the GPO settings for the domain and found that WinRM settings were enforced on all computers in our domain by a GPO. We moved the VMM server to a test OU, disabled inheritance for that particular GPO, and, guess what, after a gpupdate /force on the VMM server we were able to add the library server back again.

WinRM Issue  (6)

Is that all? No, it is not.

But I suspected this couldn’t be the whole story. After some digging into the default WinRM behavior in SCVMM, I noticed that, in fact, a configuration item had been missed in the GPO itself.

According to Microsoft, there are some considerations for WinRM when you add a Hyper-V host to a VMM environment. The following has been extracted from the above TechNet article; the highlighted section focuses on configuring WinRM listeners for both IPv4 & IPv6.

If you use Group Policy to configure Windows Remote Management (WinRM) settings, understand the following before you add a Hyper-V host to VMM management:

  • VMM supports only the configuration of WinRM Service settings through Group Policy, and only on hosts that are in a trusted Active Directory domain. Specifically, VMM supports the configuration of the Allow automatic configuration of listeners, Turn On Compatibility HTTP Listener, and Turn on Compatibility HTTPS Listener Group Policy settings. VMM does not support configuration of the other WinRM Service policy settings.
  • If you enable the Allow automatic configuration of listeners policy setting, you must configure it to allow messages from any IP address. To verify this configuration, view the policy setting and make sure that the IPv4 filter and IPv6 filter (depending on whether you use IPv6) are set to *.
  • VMM does not support the configuration of WinRM Client settings through Group Policy. If you configure WinRM Client Group Policy settings, these policy settings may override client properties that VMM requires for the VMM agent to work correctly.

I had a look at the Allow automatic configuration of listeners policy setting under the Computer Configuration\Administrative Templates\Windows Components\Windows Remote Management node in the GPO, and the IPv6 filter was set to null. We changed it to accept connections from any IP address by entering an asterisk (*). Of course, IPv6 was enabled on all Hyper-V hosts and the VMM server by default.
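After the policy change, you can confirm which listeners WinRM actually ended up with. A quick check, run on a host after a gpupdate:

```batch
:: List the WinRM listeners and the addresses they accept
winrm enumerate winrm/config/listener
```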

WinRM Issue  (7)

Now it was time to move the VMM server back to its original OU with the GPO applied and execute a gpupdate /force. Surprisingly, it did the trick. We were able to re-add the library server (in VMM) and a couple of other file shares as library shares without any issue.

WinRM Issue  (8)

Amazing, isn’t it? We may never gaze upon TechNet for such trivial issues when they happen, but it was worth all the trouble to avoid rebuilding the VMM server. I must thank all who helped by sharing their ideas to sort this issue out. That is what I love about the community: when all is lost somewhere far away in the world, there will always be good people to help you out.

CSV Access Redirected in Hyper-V Cluster

I’ve been working with Hyper-V for quite some time. During a recent Hyper-V cluster deployment that my colleague Hasitha Willarachchi (Enterprise Client Management MVP) and I were working on, we came across an issue that was really interesting to troubleshoot.

For some odd reason one of three Cluster Disks in a 3-Node Hyper-V 2012 R2 Cluster was in Redirected Access status.

CSV GFI Filter 1

When we were going through the cluster events, we noticed a bunch of 5125 events complaining about an active system filter driver that is not compatible with CSV. Essentially, I/O access to that volume was being redirected through another Hyper-V node.

CSV GFI Filter 2

We tried changing the ownership of the particular CSV to another node, followed by trying to turn off redirected access by right-clicking the CSV and selecting that option. Changing the ownership was no success, and to our surprise, the operation to turn off redirected access always failed with a Set Operation Failed error.

After doing some research, we decided to check the CSV state and the active system filters on that particular volume, so we ran the commands below on the node currently owning the CSV.

CSV GFI Filter 3

We noticed that a filter called esecdrv60 had a frame value of Legacy. The next command confirmed that CSV access was redirected on all three nodes. We then immediately checked the rest of the nodes with the fltmc instances command and found that the same legacy filter was present there as well.
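The checks we ran were along these lines (the CSV name is a placeholder for illustration; run them on the node owning the CSV):

```powershell
# List active minifilter instances; a 'Legacy' frame value flags an incompatible filter
fltmc instances

# Show the CSV state (Direct vs. redirected) as seen from every node
Get-ClusterSharedVolumeState -Name "Cluster Disk 2"
```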

The Culprit aka GFI EndPoint Security

The esecdrv60 filter actually belongs to the GFI EndPoint Security software, which was installed and running on all three Hyper-V nodes. The software had been pushed through its default policies, and somehow the Hyper-V cluster was not excluded from the deployment list.

CSV GFI Filter 4

Uninstalling GFI was not possible locally, so we worked with the GFI administrator to uninstall the software from all three hosts. Remember, uninstalling GFI requires a reboot, so we had to live migrate all the VMs and reboot one server at a time.

After uninstalling GFI and rebooting all three hosts, we executed fltmc instances again to check whether the GFI legacy filters were still present. As you can see below, all legacy filters were gone and the CSV was back in normal operation mode without any errors.

CSV GFI Filter 5

The following references were really helpful in identifying and rectifying the issue.

  1. Troubleshooting ‘Redirected Access’ on a Cluster Shared Volume (CSV)
  2. Cluster Shared Volume Diagnostics