Sunday, November 10, 2013

Exchange 2013 and Managed Availability

Microsoft Exchange 2013 has a new monitoring and alerting engine built into the product called Managed Availability.  Managed Availability detects problems, raises alerts, and attempts to recover from those problems as they occur within the product.

In previous versions of Exchange, such as 2007 and 2010, Microsoft recommended that administrators use System Center Operations Manager (SCOM) to monitor an Exchange environment.  In Exchange 2013, the product has its own monitoring engine, which companies can use to gain insight into their email infrastructure.

Note: SCOM integration with Exchange 2013 will still be supported.

Main components of Managed Availability

Managed Availability is built into both server roles in Exchange 2013. It includes three main asynchronous components. The first is the probe engine, which takes measurements on the server. Those measurements flow into the second component, the monitor, which contains the business logic that encodes what is considered healthy. The third is the responder engine. When something is unhealthy, its first action is to attempt to recover that component. Managed Availability provides multi-stage recovery actions: the first attempt might be to restart the application pool, the second to restart the service, the third to restart the server, and the final attempt to take the server offline so that it no longer accepts traffic. If these attempts fail, Managed Availability escalates the issue to a human through an event log notification.
  • Probe engine:  The Probe Engine takes measurements on the server.
  • Monitoring probe engine (the monitor):  The Monitoring Probe Engine stores the business logic about what constitutes a healthy state. It functions like a pattern recognition engine, looking for patterns and measurements that differ from a healthy state, and then evaluating whether a component or feature is unhealthy.
  • Responder engine:  When the Responder Engine is alerted about an unhealthy component, its first action is to try to recover that component. Managed Availability enables multi-stage recovery actions. The first attempt may be to restart the application pool, the second attempt may be to restart the corresponding service, and the third attempt may be to restart the server. The final attempt may be to take the server offline, so that it no longer accepts traffic. If all of these actions fail, an alert is sent to the help desk.
All of the above are controlled by the Exchange Health Manager Service (MSExchangeHMHost.exe) and the Exchange Health Manager Worker process (MSExchangeHMWorker.exe).
The relationship between these components is:
Probe (takes measurements and detects failures) --> Monitor changes state --> Responder takes action
So, to find the root cause and understand why a responder invoked a specific action, work through the chain in reverse:
Responder took action --> Which monitor triggered it? --> Which probe is failing?
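
The flow between the three components, and the reverse walk used for troubleshooting, can be sketched in a few lines of Python. This is only an illustrative model, not how Exchange implements it: the class names, the probe, monitor and responder names (OwaSelfTestProbe, OwaMonitor, RestartAppPool), and the "unhealthy after two consecutive failures" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProbeResult:
    probe_name: str
    succeeded: bool


@dataclass
class Monitor:
    """Holds the rule for what counts as healthy.

    Assumed rule for this sketch: unhealthy after `threshold`
    consecutive probe failures.
    """
    name: str
    threshold: int
    consecutive_failures: int = 0
    failing_probe: str = ""

    def observe(self, result: ProbeResult) -> bool:
        """Record a probe result; return True while the monitor is healthy."""
        if result.succeeded:
            self.consecutive_failures = 0
            self.failing_probe = ""
        else:
            self.consecutive_failures += 1
            self.failing_probe = result.probe_name
        return self.consecutive_failures < self.threshold


@dataclass
class Responder:
    name: str
    monitor: Monitor
    log: List[str] = field(default_factory=list)

    def react(self, healthy: bool) -> None:
        """Take (here: just log) a recovery action when the monitor is unhealthy."""
        if not healthy:
            self.log.append(f"{self.name} invoked by {self.monitor.name}")


def trace_back(responder: Responder) -> str:
    """Reverse walk: responder -> monitor -> failing probe."""
    m = responder.monitor
    return f"{responder.name} <- {m.name} <- {m.failing_probe}"


# Hypothetical names, chosen only for illustration.
monitor = Monitor(name="OwaMonitor", threshold=2)
responder = Responder(name="RestartAppPool", monitor=monitor)

# One successful probe run followed by two failures.
for ok in (True, False, False):
    healthy = monitor.observe(ProbeResult("OwaSelfTestProbe", ok))
    responder.react(healthy)

print(responder.log)
print(trace_back(responder))
```

The point of the sketch is the direction of the links: data flows probe to monitor to responder, while troubleshooting starts at the responder and follows the references backwards.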

Recovery Sequences

It is important to understand the recovery sequence for a monitor. For example, let’s say the probe data for the OWA protocol (the Protocol Self-Test) causes the monitor to become unhealthy. At this point the current time is saved, and the monitor starts a recovery pipeline that is measured from that time. The monitor can define recovery actions at named time intervals within the recovery pipeline. In the case of the OWA protocol monitor on the Mailbox server, the recovery sequence is:
  1. At Time=0, the Reset IIS Application Pool responder is executed.
  2. If at Time=5 minutes the monitor hasn’t reverted to a healthy state, the Failover responder is initiated and databases are moved off the server.
  3. If at Time=8 minutes the monitor hasn’t reverted to a healthy state, the Bugcheck responder is initiated and the server is forcibly rebooted.
  4. If at Time=15 minutes the monitor still hasn’t reverted to a healthy state, the Escalate responder is triggered.
The recovery sequence pipeline will stop when the monitor becomes healthy.
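
As a rough illustration of that timeline, the sketch below encodes the four OWA steps listed above and works out which responders would have run by a given minute. The function name and its parameters (`healthy_at`, `now`) are made up for this example, and it assumes whole minutes for simplicity.

```python
from typing import List, Optional, Tuple

# (minutes after the monitor turned unhealthy, responder), from the OWA example.
OWA_RECOVERY_SEQUENCE: List[Tuple[int, str]] = [
    (0, "Reset IIS Application Pool"),
    (5, "Failover"),
    (8, "Bugcheck"),
    (15, "Escalate"),
]


def responders_fired(
    sequence: List[Tuple[int, str]],
    healthy_at: Optional[int],
    now: int,
) -> List[str]:
    """Return the responders that have run by minute `now`.

    `healthy_at` is the minute at which the monitor became healthy again,
    or None if it is still unhealthy. A step scheduled at minute t only
    runs if the monitor has not recovered by t; once the monitor is
    healthy, the pipeline stops.
    """
    fired = []
    for minute, responder in sequence:
        if minute > now:
            break  # this step is not due yet
        if healthy_at is not None and healthy_at <= minute:
            break  # monitor recovered, pipeline stops
        fired.append(responder)
    return fired


# App pool reset fixed the problem after 3 minutes: nothing else runs.
print(responders_fired(OWA_RECOVERY_SEQUENCE, healthy_at=3, now=20))

# Still unhealthy 10 minutes in: reset, failover and bugcheck have all run.
print(responders_fired(OWA_RECOVERY_SEQUENCE, healthy_at=None, now=10))
```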

Health Determination

Monitors that are similar or are tied to a particular component’s architecture are grouped together to form health sets. The health of a health set is always determined by the “worst of” evaluation of the monitors within the health set.
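
The “worst of” rule is easy to picture in code. The sketch below is a simplified model: the three state names and their ordering are assumptions for the example (Exchange reports more states than these), and the monitor names are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Health(IntEnum):
    # Assumption for this sketch: a higher value means a worse state.
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def health_set_state(monitors: Dict[str, Health]) -> Health:
    """'Worst of' evaluation: the health set takes the worst monitor state."""
    return max(monitors.values(), default=Health.HEALTHY)


# Hypothetical monitors belonging to one health set.
owa_health_set = {
    "OwaSelfTestMonitor": Health.HEALTHY,
    "OwaDeepTestMonitor": Health.UNHEALTHY,
    "OwaProxyTestMonitor": Health.DEGRADED,
}

# A single unhealthy monitor is enough to mark the whole set unhealthy.
print(health_set_state(owa_health_set).name)
```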
To view health, you use the Get-ServerHealth and Get-HealthReport cmdlets. Get-ServerHealth is used to retrieve the raw health data, while Get-HealthReport operates on the raw health data and provides a current snapshot of the health. These cmdlets can operate at several layers:
  • They can show the health for a given server, breaking it down by health set.
  • They can be used to dive into a particular health set and see the status of each monitor.
  • They can be used to summarize the health of a given set of servers (DAG members, or a load-balanced array of Client Access servers).
Health sets are further grouped into functional units called Health Groups. There are four Health Groups and they are used for reporting within the SCOM Management Portal:
  1. Customer Touch Points – components with direct real-time, customer interactions (e.g., OWA).
  2. Service Components – components without direct, real-time, customer interaction (e.g., OAB generation).
  3. Server Components – physical resources of a server (e.g., disk, memory).
  4. Dependency Availability – server’s ability to call out to dependencies (e.g., Active Directory).
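
To show how that grouping might look in a report, here is a small sketch that sorts components into the four Health Groups. The mapping uses only the examples given in the list above; the dictionary, the function name and the "Unknown" fallback are illustrative and not part of Exchange or SCOM.

```python
from collections import defaultdict
from typing import Dict, List

# Example components from the list above, mapped to their Health Group.
HEALTH_GROUP_OF: Dict[str, str] = {
    "OWA": "Customer Touch Points",
    "OAB generation": "Service Components",
    "Disk": "Server Components",
    "Memory": "Server Components",
    "Active Directory": "Dependency Availability",
}


def group_for_report(health_sets: List[str]) -> Dict[str, List[str]]:
    """Bucket health set names by Health Group, keeping input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name in health_sets:
        # Anything not in the example mapping lands in a catch-all bucket.
        grouped[HEALTH_GROUP_OF.get(name, "Unknown")].append(name)
    return dict(grouped)


print(group_for_report(["OWA", "Disk", "Memory", "Active Directory"]))
```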


Managed Availability performs a variety of health assessments within each server. The end result is that Managed Availability focuses on the user experience and ensures that, while issues may occur, the experience is minimally impacted, if at all.
