Tag Archives: Troubleshooting

Tenant VM Startup Failure in Azure Stack TP2 Refresh

Microsoft has recently released a refresh build for Azure Stack Technical Preview 2. My colleague, CDM MVP Nirmal Thewarathanthri, and I were eager to deploy the new build from day one. This post is about a known issue in this build that prevents tenant VMs from starting automatically after a host power failure.

Scenario

After we created a tenant VM in the Azure Stack portal and verified that it was running properly, we decided to turn off the host for the day and start load testing the next day. When the host was turned on the next day, the tenant VM was missing in Hyper-V Manager and was in a failed state in the Azure Stack portal. Not only that, the VMs used by the MySQL and SQL PaaS RPs had also disappeared.

1-mas-tp2-vms-running

2-mas-tp2-vms-running-hvm

As you can see below, neither deleting the VM nor deleting its resource group works in the portal. The VM status is also shown as Unknown in the portal.

3-mas-tp2-vms-unkown-portal

4-mas-tp2-rg-deletion-failure

4-mas-tp2-vm-deletion-failure

However, the Azure Stack TP2 management VMs started automatically after the power failure.

5-mas-tp2-vms-missing-hvm

Solution

We noticed that in Failover Cluster Manager on the MAS TP2 host, all the tenant VMs, including the PaaS RP VMs, are in a saved state after a power failure. Once we start these VMs, they come online in both Hyper-V Manager and the Azure Stack portal. We can then successfully delete the affected resource group or tenant VM.
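The manual start can also be scripted. The following is only a minimal sketch, not something we ran as part of the original fix. It assumes an elevated PowerShell session on the MAS TP2 host with the FailoverClusters module available, and it assumes that the saved tenant VMs appear as cluster groups of type VirtualMachine that are not online.

```powershell
# Assumption: run on the MAS TP2 host, and only after all management VMs are up.
Import-Module FailoverClusters

# Collect the VM cluster roles that are not online
# (assumption: the saved tenant and PaaS RP VMs fall into this set).
$stoppedVms = Get-ClusterGroup |
    Where-Object { $_.GroupType -eq 'VirtualMachine' -and $_.State -ne 'Online' }

# Review the list before starting anything.
$stoppedVms | Format-Table Name, OwnerNode, State -AutoSize

# Start every VM role in the list.
$stoppedVms | Start-ClusterGroup
```

Reviewing the table first is deliberate: if any management VM is still coming up, it would also appear in the list, and it is safer to wait than to start everything blindly.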

6-mas-tp2-vms-saved-fcm

RCA

This seems to be a known bug in the TP2 refresh: the management VMs start up automatically “after taking some time”, whereas tenant VMs, including the PaaS RP VMs, do not start automatically after a power failure. The workaround is to start them manually after all 14 management VMs are up and running.

You can refer to this link for a list of known issues in Azure Stack TP.

 

Fix It | October 2016 Cumulative Windows Updates Crash SCOM Console

In my last post I described the console crash you face after installing the security updates in MS16-118 and MS16-126. Microsoft has now published new KBs to fix the SCOM console crash that occurs after applying these updates.

Individual hotfixes, which you can download from here, are available for the following operating systems.

  • Windows Vista
  • Windows 7
  • Windows 8.1
  • Windows Server 2008
  • Windows Server 2008 R2
  • Windows Server 2012
  • Windows Server 2012 R2

For Windows 10 and Server 2016, the fix was applied to the latest cumulative updates.

October 2016 Cumulative Windows Updates Crash SCOM Console

It seems that the October 2016 cumulative Windows updates (KB3194798, KB3192392, KB3185330 & KB3185331) cause the SCOM 2012/2016 consoles to crash regularly on all Windows versions from Windows Server 2008 R2 up to 2016 and from Windows 7 up to Windows 10.

According to Microsoft Germany’s SCOM PFE Dirk Brinkmann, who has blogged about this issue here, the SCOM team is currently working on a fix and no ETA for a resolution has been provided yet.

Once a fix is available, you will be able to find it on the SCOM team blog.

EWS not working in Exchange Hybrid

Deploying a hybrid Exchange environment should be a carefully planned process, though you might still encounter unexpected hiccups here and there. In a recent hybrid effort, after running the HCW for Exchange 2013, I noticed two tedious issues in the environment.

  1. Free/Busy information is not visible in Outlook Clients (any version). Scheduling Assistant throws an error saying that it cannot locate the server. Outlook Web App works just fine.
  2. Unable to create a migration batch from the EAC. It throws the error below.

The ExchangeRemote endpoint settings could not be determined from the autodiscover response. No MRSProxy was found running at <FQDN of CAS server>.

I decided to have a look at the properties of the EWS virtual directory on the CAS servers. EWS is responsible for remote moves and free/busy information.

The Enable MRS Proxy endpoint option was selected, as it should be. I could also access the EWS service from outside, so the URL was correct.

EWS (1)

When I checked the authentication section, the Integrated Windows Authentication option was not enabled, so I enabled it. This option allows domain-joined computers to use the session credentials of their Windows logins with Outlook clients.

EWS (2)

After running iisreset on all CAS servers, I noticed that free/busy information was now available in Outlook and that I could successfully create and complete migration batches as well.
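The same check and change can be made from the Exchange Management Shell instead of the EAC. This is a minimal sketch under assumptions: CAS01 is a hypothetical server name, and the commands are run once per CAS server.

```powershell
# Run in the Exchange Management Shell. 'CAS01' is a placeholder, not a real server name.
$server = 'CAS01'

# Inspect the current EWS virtual directory settings.
Get-WebServicesVirtualDirectory -Server $server |
    Format-List Identity, MRSProxyEnabled, WindowsAuthentication, ExternalUrl

# Enable Integrated Windows Authentication on the EWS virtual directory.
Get-WebServicesVirtualDirectory -Server $server |
    Set-WebServicesVirtualDirectory -WindowsAuthentication $true

# Restart IIS so the change takes effect (run locally on each CAS server).
iisreset
```

Listing the settings first makes it easy to confirm that MRSProxyEnabled is already True and that only the authentication setting needs to change.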

This TechNet article explains the different scenarios in an Exchange hybrid deployment that can result in such an error. It is worth a look if you cannot see free/busy information from one tenant to the other (Exchange Online users cannot see Exchange on-premises free/busy information and vice versa).

Outlook 2007 Clients are disconnected in Exchange Hybrid

Deploying an Exchange hybrid setup requires a lot of effort, especially around DNS records. In a recent deployment I noticed that Outlook 2007 clients went into a disconnected state once the hybrid feature was enabled for an Exchange 2013 single-forest organization.

We were using a new Subject Alternative Name (SAN) certificate from a third-party CA, which included the AutoDiscover.domain.com record. An Autodiscover record is vital for sharing free/busy information between the Exchange Online and on-premises organizations, as well as for clients to determine where their mailboxes reside.

The problem was not present in Outlook 2010/2013 clients, and creating new user profiles was neither an option nor a remedy. Renewing the SSL certificate is always troublesome if we don’t do it the proper way.

Solution

I decided to take a look at the Autodiscover virtual directories first. To do that, I opened Exchange PowerShell as administrator and ran the cmdlet below.

Get-AutoDiscoverVirtualDirectory | fl internalurl,externalurl

The virtual directory was not pointing to the correct URL, so I used the cmdlet below to fix it.

Get-AutodiscoverVirtualDirectory -Server <Server Name> | Set-AutodiscoverVirtualDirectory -ExternalUrl 'https://<Server FQDN>/Autodiscover/Autodiscover.xml' -InternalUrl 'https://<Server FQDN>/Autodiscover/Autodiscover.xml'

I had three MBX/CAS servers to run this cmdlet against; the <Server Name> placeholder stands for each of them. Next, in IIS Manager on each MBX/CAS server, right-click the MSExchangeAutoDiscoverAppPool application pool and click Recycle. This recycles the Autodiscover application pool.
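If you prefer the shell over IIS Manager, the recycle can be sketched as below. It assumes the WebAdministration module that ships with IIS is present and that the commands are run locally, in an elevated session, on each MBX/CAS server.

```powershell
# Assumption: run locally on each MBX/CAS server in an elevated PowerShell session.
Import-Module WebAdministration

# Recycle the Autodiscover application pool (same effect as Recycle in IIS Manager).
Restart-WebAppPool -Name 'MSExchangeAutoDiscoverAppPool'

# Confirm that the pool is running again.
Get-WebAppPoolState -Name 'MSExchangeAutoDiscoverAppPool'
```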

When I checked the connectivity of the Outlook 2007 clients, most of them were showing as connected and e-mails were downloading without any issue. For the clients that didn’t respond, we had to reconfigure the Outlook profiles, and there were quite a few of those.

Recommendation

I’m not really comfortable with Outlook 2007, especially where Office 365 is involved. If you intend to use Outlook 2007 with this kind of setup, I strongly suggest that you update it to the latest service pack or, if possible, upgrade to Outlook 2010 at minimum. Service packs play a vital role, and in this case none of those Outlook clients were patched.

DPM 2012 R2 UR7 Re-released

It’s been a while since my last blog post. These days I’m working on a DPM 2012 R2 test lab, which I had planned to update to the latest UR version. When I checked for the latest UR7, I learned that the bits have been re-released.

According to the DPM team, there is an issue in the DPM 2012 R2 UR7 released on 28.07.2015 which causes expired recovery points on disk not to be cleaned up, resulting in growth of the DPM recovery point volume after installing UR7. The re-release addresses this issue, and as of today you can download the updated bits via the DPM 2012 R2 UR7 KB or the Microsoft Update Catalog.

OK I have updated to UR7 before 21.08.2015. Now what?

Those facing this dilemma should know that the re-released UR7 is not pushed via Microsoft Update; you are advised to manually install the new package on DPM servers that have the older UR7 package installed. The installation process automatically executes the pruneshadowcopiesDpm2010.ps1 PowerShell script, which contains the fix.

Post-deployment Tips

There is no change in the DPM version (4.2.1338.0) in this re-release; it remains the same after the update. You will also have to update the Azure Backup Agent to the latest version (2.0.8719.0) prior to installing DPM UR7 to ensure the integrity of your cloud backups after this release.

Those who, like me, update to UR7 the old-fashioned way (wait for a month or two, look out for bugs and then update) have nothing to worry about.

Library Server Failure in SCVMM 2012 R2

A few days back I was working with my colleague Law Wen Feng on an SCVMM-managed Hyper-V cluster. The idea was to update the environment from SCVMM 2012 R2 UR 2 to UR 7. We noticed a strange issue where the library server (the VMM server itself) was complaining about a refresh failure. It seemed that the VMM agent was no longer functioning properly on the VMM management server.

WinRM Issue  (1)

As a poor man’s alternative, we removed the library server from VMM. We then tried to re-add the same VMM server as a library server, which resulted in bizarre output. On top of that, VMM also rejected a file share on a different server that we were hoping to add as an alternative.

WinRM Issue  (2)

The error stated that the VMM agent was no longer functional on the target server, but it was in fact running without any issue.

WinRM Issue  (3)

WinRM Issue  (4)

I reached out to my fellow MVP colleagues Kristian Nese, Stanislav Zhelyazkov & Daniel Neuman for suggestions. They suggested that we re-associate the VMM agent with the VMM server. Yes, it sounds like a chicken-and-egg situation, because this is no ordinary Hyper-V host but the VMM server itself.

The Register-SCVMMManagedComputer cmdlet can be used to re-associate a managed computer on which the VMM agent software is installed with a different VMM management server. Here, however, we chose to add it to the same VMM server.

WinRM Issue  (5)

Now it complained that WinRM was no longer functional. For those unfamiliar with it, WinRM is a necessary component for remotely managing Windows Server. By default, SCVMM takes care of enabling and running the WinRM service during installation. Rebuilding the VMM server with the retain DB option was not an option, as we were in the middle of preparing a demo lab and, as I always believe, we needed to get to the bottom of it.

The evil WinRM GPO

We checked the GPO settings for the domain and found that WinRM settings were being enforced on all computers in our domain by a GPO. We moved the VMM server to a test OU, disabled inheritance for that particular GPO and, guess what, after a gpupdate /force on the VMM server we were able to add the library server back again.

WinRM Issue  (6)

Is that All? No it is not.

But I suspected that this couldn’t be the whole story. After some digging into the default WinRM behavior in SCVMM, I noticed that there was in fact a configuration item that had been missed in the GPO itself.

According to Microsoft, there are some considerations for WinRM when you add a Hyper-V host to a VMM environment. The following is extracted from the TechNet article; the relevant part is the one about configuring WinRM listeners for both IPv4 & IPv6.

If you use Group Policy to configure Windows Remote Management (WinRM) settings, understand the following before you add a Hyper-V host to VMM management:

  • VMM supports only the configuration of WinRM Service settings through Group Policy, and only on hosts that are in a trusted Active Directory domain. Specifically, VMM supports the configuration of the Allow automatic configuration of listeners, Turn On Compatibility HTTP Listener, and Turn on Compatibility HTTPS Listener Group Policy settings. VMM does not support configuration of the other WinRM Service policy settings.
  • If you enable the Allow automatic configuration of listeners policy setting, you must configure it to allow messages from any IP address. To verify this configuration, view the policy setting and make sure that the IPv4 filter and IPv6 filter (depending on whether you use IPv6) are set to *.
  • VMM does not support the configuration of WinRM Client settings through Group Policy. If you configure WinRM Client Group Policy settings, these policy settings may override client properties that VMM requires for the VMM agent to work correctly.

I had a look at the Allow automatic configuration of listeners policy setting under the Computer Configuration\Administrative Templates\Windows Components\Windows Remote Management node in the GPO, and the IPv6 filter was empty. We changed it to accept messages from any IP address by entering an asterisk (*). IPv6 was, of course, enabled by default on all Hyper-V hosts and the VMM server.

WinRM Issue  (7)

Now it was time to move the VMM server back to its original OU with the GPO applied and run gpupdate /force. Surprisingly, it did the trick. We were able to re-add the library server (in VMM) and a couple of other file shares as library shares without any issue.
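To confirm what the GPO actually applied, the effective WinRM service settings can be inspected on the VMM server itself. This is a small sketch, assuming an elevated PowerShell session on that server; it only reads configuration, apart from the final policy refresh.

```powershell
# Show the effective WinRM service configuration, including IPv4Filter and IPv6Filter.
winrm get winrm/config/service

# The same two filter values through the WSMan drive.
Get-ChildItem WSMan:\localhost\Service |
    Where-Object { $_.Name -like 'IPv*Filter' }

# List the configured WinRM listeners.
winrm enumerate winrm/config/listener

# Re-apply Group Policy after correcting the GPO.
gpupdate /force
```

If the IPv6 filter still comes back empty after the policy refresh, the corrected GPO has most likely not reached the server yet.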

WinRM Issue  (8)

Amazing, isn’t it? We may never think to check TechNet for such trivial issues when they happen, but it was worth all the trouble to avoid rebuilding the VMM server. I must thank all who helped by sharing their ideas to sort this issue out. That is what I love about the community. When all is lost, somewhere far away in the world there will always be good people to help you out.

CSV Access Redirected in Hyper-V Cluster

I’ve been working with Hyper-V for quite some time. During a recent Hyper-V cluster deployment, my colleague Hasitha Willarachchi (Enterprise Client Management MVP) and I came across an issue that was really interesting to troubleshoot.

For some odd reason one of three Cluster Disks in a 3-Node Hyper-V 2012 R2 Cluster was in Redirected Access status.

CSV GFI Filter 1

While going through the cluster events, we noticed a bunch of 5125 events complaining about an active system filter driver that is not compatible with CSV. Basically, I/O access to that volume was being redirected through another Hyper-V node.

CSV GFI Filter 2

We tried changing the ownership of the CSV to another node, and then tried to turn off redirected access mode by right-clicking the CSV and selecting that option. Changing the ownership did not help and, to our surprise, the operation to turn off redirected access mode always failed with a Set Operation Failed error.

After doing some research, we decided to check the CSV state and the active system filters on that volume, so we ran the commands below on the node that currently owned the CSV.

CSV GFI Filter 3

We noticed that a filter called esecdrv60 had a frame value of Legacy. The next command confirmed that CSV access was redirected on all three nodes. We then immediately checked the rest of the nodes with the fltmc instances command and found that the same legacy filter was present there as well.
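The exact commands we ran are only visible in the screenshot. As a rough sketch of commands that report the same information, assuming an elevated PowerShell session on a Windows Server 2012 R2 cluster node with the FailoverClusters module:

```powershell
Import-Module FailoverClusters

# Show the CSV state per node, including whether and why I/O is redirected.
Get-ClusterSharedVolumeState |
    Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize

# List the file system filter instances attached to each volume on this node.
# Check the Frame column for legacy filters such as esecdrv60.
fltmc instances
```

Get-ClusterSharedVolumeState reports every node in one pass, whereas fltmc instances only covers the node it runs on, so it has to be repeated on each node.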

The Culprit aka GFI EndPoint Security

The esecdrv60 filter actually belongs to GFI EndPoint Security software, which was installed and running on all three Hyper-V nodes. This software was pushed through its default policies and somehow the Hyper-V cluster was not excluded from the deployment list.

CSV GFI Filter 4

Uninstalling GFI locally was not possible, so we worked with the GFI administrator to uninstall the software from all three hosts. Remember that uninstalling GFI requires a reboot, so we had to live migrate all the VMs and reboot one server at a time.

After uninstalling GFI and rebooting all three hosts, we ran fltmc instances again to see whether the GFI legacy filters were still present. As you can see below, all legacy filters were gone and the CSV was back to normal operation without any errors.

CSV GFI Filter 5

The following references were really helpful in identifying and rectifying the issue.

  1. Troubleshooting ‘Redirected Access’ on a Cluster Shared Volume (CSV)
  2. Cluster Shared Volume Diagnostics

Network Discovery Rule Failure in SCOM 2012 R2

Although most of my time is now spent on Azure, I still love and work on SCOM, the best monitoring platform that I’ve ever worked with. Some say it’s noisy, but that’s not true if you know how to tune your SCOM deployment. In a recent adventure I came across another SCOM mystery, and today I’m going to tell you how to solve it.

I’ve got a SCOM deployment with two management servers and one database server, all part of the same management group. The second management server was implemented solely for the purpose of network device monitoring. For those who don’t know, Microsoft recommends having a separate management server for that.

First things first: I created a network discovery rule targeting the second management server as the one that actually performs the discovery. If you do not know how to do that, you can refer to this TechNet article.

The Problem

Though the network discovery rule was created successfully, I noticed that the rule status was always IDLE and it discovered nothing, even though I tried to run it manually a couple of times. I did all I could possibly fathom: restarting services and management servers, recreating the rules, hell, even deleting the management pack itself (the unsealed management pack Microsoft.SystemCenter.NetworkDiscovery.Internal, which stores the discovery rule) and re-importing it. The weirdest thing was that if I recreated the rule with the first management server selected, I could discover the network devices, but not with the second server. I noticed the error below in the second management server’s event log.

SCOM Network Discovery Failure 1

It seems the management server was having trouble updating the network discovery script, and yes, obviously I tried again after 3600 seconds like they say. 😉

The Solution

The regular Google search led me to two invaluable posts, one from my fellow MVP colleague Daniele Grandini and the other from TechNet, which described the exact issue I faced. As Daniele’s post nicely explains, there are a couple of events that you can look for in case of a successful or unsuccessful discovery of network devices. But even after performing the steps in both articles I was still at square one with no results.

Those who are familiar with my friend and MVP colleague Tao Yang, one of the SCOM gurus we have in this part of the world, know how he does his magic with management packs. Tao had come across the same issue in the past while helping out a friend, and he suggested a nice little trick that I had missed.

The Trick

Tao suggested flushing the health service state and cache of the ailing management server. This is the last beacon of hope for us SCOM admins, and it performs the tasks below.

  1. Stops the System Center Management service.
  2. Deletes the health service store files.
  3. Resets the state of the agent, including all rules, monitors, outgoing data, and cached management packs.
  4. Starts the System Center Management service.

This task leaves no reference to itself as it deletes the cached data in the health service store files, including the record of this task itself.

All you have to do is follow steps 1 > 2 > 3 > 4 in the screenshot below.

SCOM Network Discovery Failure 2
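If the console task is not available, a commonly used manual equivalent is to stop the service, move the health service store aside and start the service again. This is a sketch and not the console task itself. It assumes the management service is registered under the service name HealthService and that SCOM 2012 R2 is installed in its default location; verify both on your server before running it.

```powershell
# Assumptions: service name 'HealthService' and the default SCOM 2012 R2 install path.
$storePath = 'C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Health Service State'

# Stop the System Center Management service.
Stop-Service -Name HealthService

# Rename rather than delete, so the old store can still be inspected if needed.
$backupName = 'Health Service State.old-{0:yyyyMMddHHmmss}' -f (Get-Date)
Rename-Item -Path $storePath -NewName $backupName

# Start the service again; it rebuilds the store and downloads the management packs afresh.
Start-Service -Name HealthService
```

Renaming keeps a copy of the old store on disk, so remember to delete the renamed folder once discovery is working again.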

Having done that, I created a brand new network discovery rule for the second management server, let it run for the first time and waited. It really worked, and to my great joy I could see the discovered devices.

SCOM Network Discovery Failure 4

Now looking back at the event log I could see the traces of a successful network discovery.

SCOM Network Discovery Failure 3 revised

Now let’s hear a big round of applause for Master Tao Yang the hero that saved my day.

Health Explorer missing in SCOM Console

During a recent adventure to SCOM world I’ve faced one of the strangest of issues.

I installed the SCOM console on a Windows 8 Professional x64 laptop for a customer. The console seemed to be fully functional, but right-clicking any alert and selecting Open > Health Explorer did not work. Below are the steps that I immediately took to check the issue.

  • Installed the same console on a Windows 8.1 x64 PC and there was nothing wrong with it.
  • Checked whether the Operations Console UR was updated. On both PCs the console had been upgraded with UR4 to match the UR version on the management servers.
  • Checked the logs and found nothing out of the ordinary.
  • Checked the usual suspect, .NET Framework compatibility. According to MSFT, SCOM supports the .NET versions below.
System Center 2012 R2 component .NET 3.5 SP1 .NET 4 .NET 4.5 .NET 4.5.1
Operations Manager Management Server
Operations Manager Data Warehouse Management Server
Operations Manager Gateway Server
Operations Manager Web Console
Operations Manager Reporting Server
Operations Manager Operations Console

An MVP friend of mine Dieter Wijckmans suggested one important check that I missed.

Clearing the SCOM console cache

You can always clear the OpsMgr cache if things go AWOL in the console. To do that, enter the command below in a Run window.

"C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Console\Microsoft.EnterpriseManagement.Monitoring.Console.exe" /clearcache

The /clearcache option clears the cache and reopens the SCOM console.

This small step saved many hours of troubleshooting for me.