Operations Management Suite (OMS): Network Performance Monitor

Introduction

In the hybrid cloud era, network monitoring is no longer restricted to your data centers and offices. The modern enterprise network spans your data centers and offices, and most likely extends to multiple cloud services. The hybrid network scenario brings with it new requirements for administrators in several areas, including availability, security, and performance. It often brings a more complex network structure, including several locations and networks, with varying connectivity and security requirements.

You also no longer have full control over each part of the network. For example, if you extend your local network to Azure, you cannot see all details of the VPN gateway on the Azure side. From a network security perspective, you now have many more endpoints to protect and more points of ingress into your network. From a performance perspective, you can no longer rely on a stable, high-speed local network, as your network now contains many different types of networks and connections of varying quality.

In this chapter, we will cover the capabilities of the Network Performance Monitor (NPM) solution in OMS, including:

  • Installing and configuring the solution
  • Agent and network configuration
  • How to interpret and use NPM data
  • Recent improvements to NPM

NPM supports near real-time monitoring of network performance, collecting information such as packet loss and latency, so that you can diagnose and troubleshoot network issues. OMS can be configured to generate an alert when data collected by the NPM solution reaches a configured threshold. The thresholds can be learned and configured automatically based on collected data, or configured manually by an administrator. The NPM dashboard displays a summary of your network health. As with other solutions, you can drill into the data to find details about issues such as unhealthy network links or packet loss.

The primary method of data collection in NPM is synthetic transactions. NPM uses TCP ping or ICMP ECHO to track latency and packet loss. Only control packets are exchanged, not data packets, so the solution has a negligible impact on bandwidth. The tests run every 5 seconds, and the results are sent to OMS every 3 minutes. As with other OMS solutions, all data collected by NPM is accessible with Log Analytics search.
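
To make the relationship between the 5-second tests and the 3-minute upload concrete, here is a minimal Python sketch of how one upload window could be summarized. It is an illustration only, not NPM's actual implementation: the names summarize_window and WindowSummary are hypothetical, and the sketch assumes a lost probe is recorded as None.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence

PROBE_INTERVAL_S = 5       # one test every 5 seconds (from the text)
UPLOAD_INTERVAL_S = 180    # results sent every 3 minutes (from the text)
PROBES_PER_UPLOAD = UPLOAD_INTERVAL_S // PROBE_INTERVAL_S


@dataclass
class WindowSummary:
    """Hypothetical summary of one upload window."""
    probes_sent: int
    loss_percent: float
    avg_latency_ms: Optional[float]  # None when every probe was lost


def summarize_window(rtts_ms: Sequence[Optional[float]]) -> WindowSummary:
    """Summarize one window; None marks a probe that received no reply."""
    if not rtts_ms:
        raise ValueError("window contains no probes")
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    lost = len(rtts_ms) - len(replies)
    return WindowSummary(
        probes_sent=len(rtts_ms),
        loss_percent=100.0 * lost / len(rtts_ms),
        avg_latency_ms=mean(replies) if replies else None,
    )


# Example: 33 replies at 10 ms and 3 lost probes in one window.
window = [10.0] * 33 + [None] * 3
summary = summarize_window(window)
```

With these intervals, each upload covers 36 probes per monitored pair, which is why only small summaries rather than raw packets need to reach OMS.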

Installation and configuration

We will begin with a walkthrough of how to configure the NPM solution. Installing and configuring the NPM solution is divided into four major steps:

  1. Enable the solution. The first step is to enable Network Performance Monitor in the Solutions Gallery in the OMS portal.
  2. Install Agents. The second step is to make sure there is at least one agent (although two are recommended) on each subnet that you will be monitoring with NPM. If you have connected SCOM to OMS, then all the required management packs for NPM will be automatically installed in SCOM and downloaded by the agents connected to SCOM.
  3. Configure Agents. NPM uses synthetic transactions to monitor the network. This step verifies that all agents can communicate with each other on the chosen protocol and port. If NPM uses the TCP protocol for tests, this step also adds the registry keys required to use the agent as a monitoring node.
  4. Set up Networks. The last step is to set up networks. You define networks as containers for subnets. Once you have saved the network configuration, the agents will start collecting data and light up the dashboard.

Enable the solution and install agents

Enabling the NPM solution is done in the same way as all other OMS solutions. Simply navigate to the Solutions Gallery and add the solution. Figure 1 shows the Network Performance Monitor solution in the Solutions Gallery.

FIGURE 1. THE NETWORK PERFORMANCE MONITOR SOLUTION

When you first navigate to the NPM dashboard, you will see the configuration page, shown in Figure 2. On this page, you can choose to use either ICMP or TCP for synthetic transactions. If you select ICMP, an ICMP ECHO message will be sent to estimate network latency. ICMP ECHO uses the same message as Ping. If you select TCP, a TCP SYN packet will be sent to the other NPM agent. The second NPM agent will reply to complete the TCP handshake, and the connection will then be torn down with RST packets. There are some things to think about before choosing a protocol:

  • Accuracy. TCP can be more accurate when discovering multiple routes between agents. ICMP can achieve similar results, but it will require more agents. For example, if you have three routes between two subnets, then ICMP will require 5 × 3 = 15 agents in either the source or the destination subnet.
  • Priority. Routers and switches often assign lower priority to ICMP ECHO packets compared with TCP packets. In these scenarios, ICMP provides less accurate results than TCP.
  • Connectivity. The default TCP port used by NPM is 8084. ICMP does not operate using a port. Many networks allow ICMP, but some block it. If you can ping between two NPM agents, then you can use ICMP.
  • Configuration effort. The TCP protocol requires manual configuration, whereas ICMP does not. If you have limited access to each NPM agent machine, cannot get necessary network changes enacted, or have an urgent need to get network monitoring in place quickly, ICMP may be the best option.

In this chapter, we will use TCP for synthetic transactions.
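
The TCP option can be approximated by timing how long a TCP handshake takes to complete. The Python sketch below is an assumption-laden illustration, not how NPMDAgent.exe works: the function name tcp_connect_latency_ms is hypothetical, and a normal socket close is used instead of the RST teardown described above.

```python
import socket
import time
from typing import Optional


def tcp_connect_latency_ms(host: str, port: int,
                           timeout_s: float = 2.0) -> Optional[float]:
    """Return the time taken to complete a TCP handshake, in milliseconds.

    Returns None when the connection is refused, filtered or times out,
    which a monitoring tool would count as a lost probe.
    """
    start = time.perf_counter()
    try:
        # create_connection returns once the three-way handshake is done.
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_s = time.perf_counter() - start
    except OSError:
        return None
    return elapsed_s * 1000.0


# Example call against a hypothetical agent address on the default NPM port:
#   tcp_connect_latency_ms("10.1.4.10", 8084)
```

A None result on port 8084 between two agents usually points to a firewall or security group that still blocks the traffic, which is the situation the connectivity consideration above warns about.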

FIGURE 2. NPM CONFIGURATION PAGE

If you choose TCP for synthetic transactions, you will see a configuration steps overview page, shown in Figure 3. This page outlines all the steps needed to configure the agent, which are also described later in this chapter.

FIGURE 3. TCP CONFIGURATION

NPM uses the Microsoft Monitoring Agent to perform synthetic tests and collect data. It is recommended to have at least two agents on each subnet that you plan to monitor. More agents running the synthetic tests means more collected data and more required storage, but also more granular data when you are troubleshooting network issues.

Configure Agent

On the first configuration page, we chose to use TCP for synthetic tests. These tests are executed on port 8084. You can change the default port, but then you need to change the port on all agents. OMS provides a PowerShell script that opens the required port in the local Windows Firewall and configures a couple of registry keys. If you have other firewalls in your network, or if you use Network Security Groups (NSGs) in Azure, you must manually configure them to allow the NPM traffic.

NPM uses the NPMDAgent.exe application to run synthetic transactions. The application is downloaded to the agent machine as soon as the NPM solution is added to the OMS workspace, even if the agent is not configured as an NPM node.

Note: Tao Yang, a Microsoft MVP, has created a management pack for SCOM that contains several agent tasks to administrate NPM settings on agents. For more information see

If you select the ICMP protocol for synthetic transactions, no manual configuration is needed on each agent.

To configure a Windows Server to allow the NPM TCP traffic and be discovered as a monitoring node, perform the following steps:

  1. Navigate to TechNet Gallery and find "OMS Network Performance Monitor Agent Configuration Script" (https://gallery.technet.microsoft.com/OMS-Network-Performance-04a66634). There is a direct link to the download page on the TCP configuration page shown in Figure 3.
  2. Download the script and copy it to the Windows server.
  3. On the Windows server, start Windows PowerShell as Administrator.
  4. In Windows PowerShell, run the script EnableRules.ps1, as shown in Figure 4. As you can see in Figure 4, the script also adds the registry keys used to discover the OMS agent as a monitoring node for NPM. If you neither run the PowerShell script nor add the registry keys manually, the agent cannot be used as a monitoring node.

    It is possible to run the EnableRules.ps1 script with the portNumber parameter to specify another port. It is also possible to run the script with the DisableRule parameter to delete firewall rules.

FIGURE 4. RUNNING THE ENABLERULES.PS1 SCRIPT

The PowerShell script creates five inbound firewall rules, shown in Figure 5.

FIGURE 5. FIREWALL RULES CREATED BY THE ENABLERULES.PS1 SCRIPT

Configure Networks

Once the solution is added, and the agents are deployed and configured, it is time to configure networks. In NPM, a network is one or more subnets, logically grouped together in OMS. You will notice there is a default network, which contains all subnets that are not assigned to a user-defined network. The networks you create can have any structure; they do not have to reflect your current network layout. For example, you can base the networks on services instead of the actual network structure.
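
The grouping model described above can be sketched in a few lines of Python. This is only an illustration of the concept: the function group_subnets is hypothetical, the /24 prefix lengths are assumed for the example, and 192.168.1.0/24 is an invented subnet.

```python
import ipaddress
from typing import Dict, List

DEFAULT_NETWORK = "Default"


def group_subnets(discovered: List[str],
                  user_networks: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each network name to its subnets.

    Discovered subnets that are not listed in any user-defined network
    end up in the default network.
    """
    assigned = {}
    for name, subnets in user_networks.items():
        for subnet in subnets:
            assigned[ipaddress.ip_network(subnet)] = name

    grouped: Dict[str, List[str]] = {name: [] for name in user_networks}
    grouped[DEFAULT_NETWORK] = []
    for subnet in discovered:
        network = ipaddress.ip_network(subnet)
        grouped[assigned.get(network, DEFAULT_NETWORK)].append(str(network))
    return grouped


# Example: two subnets grouped into a user-defined network, one left over.
discovered = ["10.1.4.0/24", "172.16.200.0/24", "192.168.1.0/24"]
user_networks = {"North Production Network": ["10.1.4.0/24", "172.16.200.0/24"]}
grouped = group_subnets(discovered, user_networks)
```

Because the grouping is purely logical, the same subnets could just as well be grouped by service or by location without changing anything on the agents.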

When you navigate to the NPM Configuration dashboard, you will see five categories on the left side: TCP Setup, Networks, Subnetworks, Nodes, and Monitors. TCP Setup is the main configuration page, shown in Figure 6. The TCP Setup page describes the steps needed to enable and configure NPM. It also contains a direct link to the PowerShell script for agent configuration.

FIGURE 6. TCP SETUP PAGE

On the Networks tab, you can configure and review networks. There is also a list of unallocated subnetworks, which are discovered subnetworks not yet included in a network. In Figure 7, you can see that all subnetworks are in a network named Default.

FIGURE 7. CONFIGURATION OF NETWORKS

The Subnetworks tab shows all discovered subnetworks. For each subnetwork, you can enable or disable the option "Use for monitoring", as shown in Figure 8. In the Subnetworks view you can also configure which nodes within the subnetwork you want to use for monitoring.

FIGURE 8. SUBNETWORKS CONFIGURATION

On the Networks page, you can click "Add network" and define a network based on the discovered subnetworks. Figure 9 shows how a network with two subnetworks is configured. In Figure 9, you can also see a description for each subnetwork. The description is configured per subnetwork on the Subnetworks page.

FIGURE 9. CREATING A NEW NETWORK

In the Nodes view, you work with your monitoring nodes. In Figure 10, you can see the Nodes view, including the "Use for monitoring" option. With this option, you can disable or enable nodes for monitoring. If you disable the only node on a subnetwork, monitoring of that subnetwork is also disabled.
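
The effect of the "Use for monitoring" option on subnetworks can be expressed as a simple rule: a subnetwork is monitored only while at least one of its nodes is enabled. The following Python sketch illustrates that rule under assumed names (Node, monitored_subnetworks) and with invented node names; it does not reflect how OMS stores this configuration.

```python
from typing import Iterable, NamedTuple, Set


class Node(NamedTuple):
    """Hypothetical record for one monitoring node."""
    name: str
    subnetwork: str
    use_for_monitoring: bool


def monitored_subnetworks(nodes: Iterable[Node]) -> Set[str]:
    """Return the subnetworks that still have at least one enabled node."""
    return {node.subnetwork for node in nodes if node.use_for_monitoring}


# Example: web02 is disabled but web01 keeps its subnetwork monitored;
# db01 is the only node on its subnetwork, so that subnetwork drops out.
nodes = [
    Node("web01", "10.1.4.0/24", True),
    Node("web02", "10.1.4.0/24", False),
    Node("db01", "172.16.200.0/24", False),
]
still_monitored = monitored_subnetworks(nodes)
```

This is one reason the chapter recommends at least two agents per subnet: a single disabled or failed node does not silently remove the subnetwork from monitoring.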

FIGURE 10. CONFIGURATION OF NODES

On the Monitors tab, shown in Figure 11, you can configure monitoring rules. There is a default rule that monitors connectivity between all networks and subnetworks. This default rule cannot be deleted, but you can disable it. In Figure 11, you can also see how you can configure the protocol per rule.

FIGURE 11. CONFIGURATION OF MONITORING RULES

Figure 12 shows the page for configuring a new rule. In Figure 12, you can see that user-defined networks show up in this view. A good naming convention will make it easier to build new rules.

FIGURE 12. CONFIGURATION OF MONITORING RULES

The network you select in the first drop-down list will be the source network for the tests, such as the 'North Production Network' shown in Figure 12. Figures 13 and 14 show two different monitoring rules. The difference is the network selected in the first drop-down list.
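
Because the first selected network is the source, swapping the two selections produces a different set of directed tests. The Python sketch below illustrates this under a loud assumption: that a rule is expanded into one test from every source subnetwork to every destination subnetwork. The function expand_rule and the subnet 10.1.5.0/24 are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_rule(source_network: str, destination_network: str,
                networks: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand a rule into directed (source subnet, destination subnet) pairs."""
    sources = networks[source_network]
    destinations = networks[destination_network]
    # Skip pairs where a subnet would be tested against itself.
    return [(src, dst) for src, dst in product(sources, destinations)
            if src != dst]


# Example: the same two networks selected in opposite order.
networks = {
    "North Production Network": ["10.1.4.0/24", "10.1.5.0/24"],
    "Default": ["172.16.200.0/24"],
}
forward = expand_rule("North Production Network", "Default", networks)
reverse = expand_rule("Default", "North Production Network", networks)
```

Here forward starts every pair in the North Production Network, while reverse starts every pair in the Default network, which mirrors how the source network appears first in the collected data.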

FIGURE 13. CONFIGURATION OF NEW MONITORING RULE

FIGURE 14. CONFIGURATION OF NEW MONITORING RULE

Figure 15 shows the log for network node links. You can see that the source network is always the first selected network.

FIGURE 15. REVIEW COLLECTED NETWORK NODE LINK DATA

When configuring networks in the portal UI, you will see a link named "Create Alerts". This link can be used to enable alerts for the monitor rule you have configured. If you click the "Create Alerts" link, an alert rule is automatically created based on the monitor rule, as shown in Figure 16. Once you have clicked "Create Alerts", the link changes to "Manage Alerts". You can also find the new alert under Alerts on the Settings page. On the page shown in Figure 16, you can reconfigure alert settings if needed.

FIGURE 16. ALERT RULE FOR NETWORK LINK MONITORING

Review network performance

When all deployment and configuration is complete, it is time to review and monitor collected data. The NPM solution includes a default dashboard, shown in Figure 17. The default dashboard provides an overview of network health and connectivity.

FIGURE 17. DEFAULT NPM DASHBOARD

The default NPM dashboard, shown in Figure 17, includes the following blades:

  • The Network Summary blade shows a summary of the network, such as the number of networks and subnetworks.
    • Current Subnetwork Distribution shows all subnetworks and the number of subnetworks per network.
    • Current Networks shows the number of networks, including the default network.
    • Network Links shows the number of network links, which are the connections between two networks.
    • Subnetwork Links displays all tests between subnets and their status.
    • Paths. The path view, shown in Figure 18, displays all hops between two agents.
  • Top Network Health Events shows the most recent events and alerts in the network. Alerts and events are generated when packet loss or latency on a link between networks or subnetworks exceeds a threshold. These thresholds can be learned automatically by the system, or you can configure them with custom alert rules.
  • Unhealthy Network Links lists unhealthy network links, that is, network links with at least one active health event.
  • Top Subnetwork Links and Subnetwork Links with Most Latency show statistics for subnetwork links, such as those with the highest and lowest latency and those with packet loss.
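
The idea of a learned versus a manually configured threshold can be illustrated with a short Python sketch. The text does not describe how NPM learns its thresholds, so the baseline used here (mean plus three standard deviations of recent samples) is purely an assumption, as are the names learned_threshold and is_unhealthy.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def learned_threshold(history_ms: Sequence[float], k: float = 3.0) -> float:
    """Hypothetical learned threshold: mean plus k standard deviations."""
    if len(history_ms) < 2:
        raise ValueError("need at least two samples to learn a threshold")
    return mean(history_ms) + k * stdev(history_ms)


def is_unhealthy(latency_ms: float, history_ms: Sequence[float],
                 manual_threshold_ms: Optional[float] = None) -> bool:
    """Flag a measurement that exceeds the manual or the learned threshold."""
    if manual_threshold_ms is not None:
        threshold = manual_threshold_ms
    else:
        threshold = learned_threshold(history_ms)
    return latency_ms > threshold


# Example: recent latency samples for one link, in milliseconds.
history = [10.0, 12.0, 11.0, 9.0, 13.0]
spike = is_unhealthy(20.0, history)                        # learned threshold
strict = is_unhealthy(14.0, history, manual_threshold_ms=12.0)  # manual
```

A learned threshold adapts to links that are normally slow, while a manual threshold is the better fit when a service has a fixed latency requirement.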

FIGURE 18. NETWORK PATHS

In most of the default NPM views, you can click the Action tab and then enable auto-refresh, as shown in Figure 19. Auto-refresh will automatically update the view with the latest information.

Note: It is important to know that auto-refresh is configured per view. In some scenarios, it can be misleading when one view has auto-refresh enabled and another does not.

The ability to select a snapshot on the same tab is a great capability to have when troubleshooting. You can easily review the status at a specified point in time.

FIGURE 19. CONFIGURATION OF TIMEFRAME

When there is a health event or alert, you can click on it and drill down for deeper analysis and troubleshooting. Figure 20 shows an unhealthy network link: it seems that agents on the 172.16.200.0 subnetwork cannot connect to the 10.1.4.0 subnetwork, but other agents can. This is also an example of how important it is to use multiple agents to test connectivity between subnetworks. In this example, we can see that it is most likely an isolated incident, as other communication is working in and out of both affected subnets.
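
The reasoning in this example can be written down as a small Python sketch: a failed link is treated as an isolated incident when both of its subnetworks still have other healthy links. The functions link_status and is_isolated are hypothetical, the third subnetwork 192.168.1.0 is invented for the example, and the test tuples are made-up data rather than real NPM records.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Link = Tuple[str, str]          # (source subnetwork, destination subnetwork)
Test = Tuple[str, str, bool]    # (source, destination, test succeeded)


def link_status(tests: Iterable[Test]) -> Dict[Link, str]:
    """Classify each link as healthy, degraded or down from its test results."""
    outcomes = defaultdict(list)
    for src, dst, ok in tests:
        outcomes[(src, dst)].append(ok)
    status = {}
    for link, results in outcomes.items():
        if all(results):
            status[link] = "healthy"
        elif not any(results):
            status[link] = "down"
        else:
            status[link] = "degraded"
    return status


def is_isolated(link: Link, status: Dict[Link, str]) -> bool:
    """True when a link is down but both ends have another healthy link."""
    if status.get(link) != "down":
        return False

    def has_other_healthy(subnet: str) -> bool:
        return any(state == "healthy" and other != link and subnet in other
                   for other, state in status.items())

    return has_other_healthy(link[0]) and has_other_healthy(link[1])


tests = [
    ("172.16.200.0", "10.1.4.0", False),
    ("172.16.200.0", "10.1.4.0", False),
    ("172.16.200.0", "192.168.1.0", True),
    ("192.168.1.0", "10.1.4.0", True),
]
status = link_status(tests)
isolated = is_isolated(("172.16.200.0", "10.1.4.0"), status)
```

With only one agent pair, the same failure could not be told apart from a subnetwork that is completely unreachable, which is why the additional test paths matter.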

We can click on the different blades, shown in Figure 20, to drill deeper into this information, and use the sample queries to drill into the raw data collected. In Figure 21 we can see when this problem started, and we can also see details about which tests are currently working to and from the affected subnetworks.

FIGURE 20. ERROR SHOWN IN THE DEFAULT DASHBOARD

These servers were running in Azure, and OMS can collect activity logs from Azure too. Figure 21 lists activity logs from Azure. There we can see all changes, and in this example, someone had changed a firewall rule, which blocked the traffic.

FIGURE 21. DETAILS ABOUT NETWORK SECURITY GROUP CHANGE

Recent Improvements in NPM

Microsoft has delivered several improvements in NPM in recent months. A few of the key enhancements are described here.

Agent diagnostics

Many customers are using NPM in complex networks, with Microsoft Monitoring Agents installed on multiple nodes. In the past, it has been difficult to determine why an agent is not working as expected. Microsoft has now added agent diagnostic capabilities to the solution, which will help you keep tabs on any health and configuration problems with NPM agents in your network. You can now view the health of all NPM agents in a single view, find those that are misconfigured or unresponsive, and get actionable diagnostic information to resolve the issues.

Hop-by-hop latency breakdown

NPM now provides a hop-by-hop breakdown of latency between two points in your network, on the topology map. This ability complements the other capabilities of the topology map, such as fault localization, path filters, hop compression slider, and advanced search. With latency data on each hop, you can now isolate network latency by identifying problem spots that occur along the network path.
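
To show how per-hop latency helps isolate a problem spot, here is a minimal Python sketch. It assumes that latency is available as a cumulative value at each hop along the path, which the text does not state; the functions per_hop_latency and slowest_hop and the hop names are hypothetical.

```python
from typing import List, Tuple

Hop = Tuple[str, float]  # (hop name, latency in milliseconds)


def per_hop_latency(cumulative_ms: List[Hop]) -> List[Hop]:
    """Convert cumulative latency at each hop into the latency each hop adds."""
    breakdown = []
    previous = 0.0
    for hop, total in cumulative_ms:
        # Clamp at zero: jitter can make a later hop appear faster.
        breakdown.append((hop, max(total - previous, 0.0)))
        previous = total
    return breakdown


def slowest_hop(cumulative_ms: List[Hop]) -> Hop:
    """Return the hop that contributes the most latency to the path."""
    return max(per_hop_latency(cumulative_ms), key=lambda item: item[1])


# Example path where the third hop adds most of the delay.
path = [("hop1", 1.0), ("hop2", 3.0), ("hop3", 25.0), ("hop4", 26.0)]
breakdown = per_hop_latency(path)
worst = slowest_hop(path)
```

In this example the end-to-end latency of 26 ms looks like a single number, but the breakdown shows that 22 ms of it is added at one hop.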

Availability in the Azure portal

NPM is now available in the Azure portal. You can now add NPM from the Azure Marketplace, and use the solution in the Azure portal itself to monitor your environment. You can also continue to use the solution in the OMS portal.

Summary

Slow networks can lead to slow applications and affect business-critical services. Network Performance Monitor is a solution for near real-time monitoring of your network that provides monitoring, diagnostics, and troubleshooting for network-related issues with minimal configuration effort. As the solution does not require access to network devices, it is easy to get started. In this chapter, we have looked at what Network Performance Monitor is and how to get started. We walked through the configuration steps on the agent and how to model your network in the OMS portal. Finally, we looked at how to analyze the data collected by NPM.