Monitor Alarm Events

Two event workflows, the Alarms card workflow and the Info card workflow, provide a view into the events occurring in the network. The Alarms card workflow tracks critical severity events, whereas the Info card workflow tracks all warning, info, and debug severity events.

To focus on events from a single device perspective, refer to Monitor Switches. To monitor informational alarms, refer to Monitor Informational Events.

Monitor All Alarms

The Alarms card workflow enables you to view and track critical severity alarms occurring anywhere in your network.

Alarms Card Workflow Summary

The small Alarms card displays:

  • total number of alarms

  • distribution of alarms

  • performance indicator

<insert image>

The medium Alarms card displays:

  • total number of alarms

  • total number, distribution, and trend of alarms triggered by network protocols and services

  • total number, distribution, and trend of alarms triggered by interfaces

  • total number, distribution, and trend of alarms triggered by trace

  • total number, distribution, and trend of alarms triggered by system

<insert image>

The large Alarms card contains two tabs.

  • Network Services tab which displays:

    • total number, distribution, and trend of alarms triggered by the BGP service

    • total number, distribution, and trend of alarms triggered by the CLAG service

    • total number, distribution, and trend of alarms triggered by the EVPN service

    • total number, distribution, and trend of alarms triggered by the LLDP service

    • total number, distribution, and trend of alarms triggered by OSPF? LNV? VLAN, VXLAN, NTP, Sensors, MTU, PTM, licenses?

    • alarms by most recent

    • devices by most alarms

  • System, Trace, and Interfaces tab which displays: ( no wires )

    • total number, distribution, and trend of alarms triggered by links

    • total number, distribution, and trend of alarms triggered by ports

    • total number, distribution, and trend of alarms triggered by ???

    • interface alarms by most recent

    • devices by most interface alarms

    • total number, distribution, and trend of traces with warnings

    • total number, distribution, and trend of failed traces

    • trace alarms by most recent

    • devices by most trace alarms

    • total number, distribution, and trend of alarms triggered by the NTP service

    • total number, distribution, and trend of alarms triggered by NetQ Agents

    • total number, distribution, and trend of alarms triggered by invalid licenses

    • total number, distribution, and trend of alarms triggered by device sensors

    • system alarms by most recent

<insert images>

The full screen Alarms card provides tabs for all events and all devices.

<insert image>

View Alarm Status Summary

A summary of the critical alarms in the network includes the number of alarms, a trend indicator, a performance indicator, and a distribution of those alarms. The trend indicator is based on the count of alarms that have occurred compared to the count in the last two time periods:

  • Upward facing arrow: alarm count is higher than in the last two time periods, an increasing trend

  • Downward facing arrow: alarm count is lower than in the last two time periods, a decreasing trend

  • No arrow: alarm count is unchanged, a steady trend
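The trend rules above can be sketched in a few lines of Python. This is an illustration only, not the NetQ implementation: the function name and arguments are hypothetical, and treating mixed results (higher than one prior period but not the other) as steady is an assumption, because the text does not cover that case.

```python
def alarm_trend(current, previous_1, previous_2):
    """Return 'up', 'down', or 'steady' for the trend indicator.

    current     -- alarm count in the current time period
    previous_1  -- alarm count in the most recent prior period
    previous_2  -- alarm count in the period before that
    """
    if current > previous_1 and current > previous_2:
        return "up"      # upward facing arrow
    if current < previous_1 and current < previous_2:
        return "down"    # downward facing arrow
    # Unchanged counts show no arrow. Mixed results also land here,
    # which is an assumption of this sketch.
    return "steady"
```

For example, a current count of 12 compared with 10 and 8 in the two prior periods yields an increasing trend.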

The performance indicator is based on a set of pre-defined thresholds, where:

  • Low: alarm count is less than x

  • Med: alarm count is between x and y

  • High: alarm count is more than y
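As a rough sketch, the threshold rules above map to the following classification. The thresholds x and y are not specified in this document, so they are passed in as parameters. The function name is hypothetical, and treating counts exactly equal to x or y as Med is an assumption.

```python
def performance_indicator(alarm_count, x, y):
    """Classify an alarm count as 'Low', 'Med', or 'High'.

    x and y are the pre-defined thresholds (values not given in this guide).
    """
    if x > y:
        raise ValueError("x must not exceed y")
    if alarm_count < x:
        return "Low"
    if alarm_count <= y:
        # Counts equal to x or y are treated as Med (an assumption).
        return "Med"
    return "High"
```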

To view the summary, open the small Alarms card.

View the Distribution of Alarms

It is helpful to know where and when alarms are occurring in your network. The Alarms card workflow enables you to see the distribution of alarms based on their source: network services, interfaces, traces, or other system services. You can also view the trend of alarms in each source category.

To view the alarm distribution, open the medium Alarms card. Scroll down to view all of the charts.
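If you export alarm data and want to reproduce a similar breakdown outside the GUI, the grouping might look like the sketch below. The mapping from alarm type to source category is an assumption based on the card descriptions in this topic, not a definitive list, and the function name is hypothetical.

```python
from collections import Counter

# Assumed mapping of alarm types to the source categories shown on the
# medium Alarms card. Adjust to match your deployment.
SOURCE_CATEGORY = {
    "bgp": "network services",
    "clag": "network services",
    "evpn": "network services",
    "lldp": "network services",
    "ospf": "network services",
    "lnv": "network services",
    "link": "interfaces",
    "trace": "traces",
    "ntp": "system",
    "agent": "system",
    "license": "system",
    "sensor": "system",
}


def alarm_distribution(alarm_types):
    """Count alarms per source category from an iterable of alarm types."""
    counts = Counter()
    for alarm_type in alarm_types:
        # Types missing from the mapping are counted separately.
        counts[SOURCE_CATEGORY.get(alarm_type, "uncategorized")] += 1
    return dict(counts)
```

Types that are not in the mapping are reported as uncategorized rather than dropped, so gaps in the mapping stay visible.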

Monitor Network Services Alarms

The Alarms card workflow enables users to easily view and track critical severity alarms triggered by network services.

View All Network Services Alarms

You can view only the alarms associated with network services using the Alarms card workflow. Network services alarms are broken into the following categories: BGP, OSPF, EVPN, LNV, CLAG, and LLDP. You can sort alarms based on their occurrence or view devices with the most network services alarms.

To view network services alarms, open the large Alarms card. Network Services alarms are shown by default.

From this card, you can view the distribution of alarms for each of the categories over time. Scroll down to view any hidden charts. A list of the associated individual network services alarms is also displayed.

View Devices with the Most Network Services Alarms

By default, the list of alarms for all network services is displayed when viewing the large card. You can filter instead for the devices that have the most network services alarms.

To view devices with the most alarms, open the large Alarms card, and then select Devices by Most Issues from the dropdown.

From this card, you can:

  • Hover over an individual chart to filter the list on the right to only those devices associated with that category. Click a chart to persist the table changes; the category is highlighted and a checkmark is shown next to its title, while other category charts are faded.

  • Click and drag the vertical lines left and right on the charts to narrow the time period even further.

  • Change the time period for the data to compare with a prior time. If the same devices are consistently indicating the most alarms, you might want to look more carefully at those devices using the Switches card workflow.

  • Click Show All Events to investigate all events with Network Services alarms in the full screen card.

View All Alarms for a Specific Network Service

You can view the alarms for a given network service instead of alarms for all services.

To view all alarms for a given service:

  1. Open the large Alarms card.

  2. Hover over an individual service graph. This causes the alarm list to be filtered by the service type.

  3. Optionally, click the checkbox next to a given service to retain the filtered list.

    You can select more than one service by clicking the checkbox next to multiple services.

View Devices with the Most Alarms for a Specific Network Service

You can view devices that have the most alarms associated with a given network service instead of alarms for all services.

To view devices with the most alarms for a given service:

  1. Open the large Alarms card.

  2. Select Devices by Most Issues from the dropdown.

  3. Hover over an individual service graph.
    This causes the device list to be filtered by the service type.

  4. Optionally, click the checkbox next to a given service to retain the filtered list.

    You can select more than one service by clicking the checkbox next to multiple services.

Monitor Interface Alarms

The Alarms card workflow enables users to easily view and track critical severity alarms triggered by interfaces.

No wires to work from.

View All Interface Alarms

View Interfaces with the Most Link Alarms

View Interfaces with the Most Port Alarms

Monitor Trace Alarms

The Alarms card workflow enables users to easily view and track critical severity alarms triggered by trace.

No wires to work from.

View All Trace Alarms

View Traces with the Most Warnings

View Traces with the Most Failures

Monitor System Alarms

The Alarms card workflow enables users to easily view and track critical severity alarms triggered by system services.

No wires to work from.

View All System Alarms

View NTP Service Alarms

View Devices with the Most NTP Service Alarms

View NetQ Agent Alarms

View Devices with the Most NetQ Agent Alarms

View License Alarms

View Device Sensor Alarms

View Devices with the Most Sensor Alarms

Alarms Reference

The following table lists alarm messages organized by message type, by default. Click the column header to sort the list by that characteristic. Click the sort icon in any column header to toggle the sort order between A-Z and Z-A. Recommended actions suggest NetQ CLI commands and Cumulus Linux NCLU commands for further investigation.

The messages can be viewed in syslog or through third-party notification applications. For details about configuring notifications using the GUI, refer to Notification Management. For details about configuring notifications using the NetQ CLI, refer to the Deployment Guide, Configure Optional NetQ Capabilities.

| Type | Trigger | Severity | Message Format | Example |
| --- | --- | --- | --- | --- |
| agent | NetQ Agent on device has not been heard from in over 15 seconds | Critical | Rotten Agent | Rotten Agent |
| bgp | BGP session with remote peer failed to establish due to reasons such as link down or peer not enabled | Critical | BGP session with peer @peerhost (@peer vrf @vrf) failed, reason: @reason | BGP session with peer spine02 (swp3 vrf default) failed, reason: link down |
| bgp | BGP session state changed from established to failed | Critical | BGP session with peer @peer @peerhost @neighbor vrf @vrf session state changed from established to failed | BGP session with peer swp3 leaf12 leaf13 vrf mgmt session state changed from established to failed |
| bgp | Address Family Identifiers/Subsequent AFIs enabled on remote peer but not local peer | Critical | BGP session with peer @peerhost @peer: AFI/SAFI @families not activated on node | BGP session with peer server3 swp6: AFI/SAFI EVPN not activated on node |
| bgp | Address Family Identifiers/Subsequent AFIs enabled on local peer but not on remote peer | Critical | BGP session with peer @peerhost @peer: AFI/SAFI @families not activated on peer | BGP session with peer leaf27 swp2: AFI/SAFI ipv6 not activated on peer |
| bgp | Address Family Identifiers/Subsequent AFIs not enabled on either the local or remote peers | Critical | BGP session with peer @peerhost @peer: AFI/SAFI @families not activated on session | BGP session with peer spine02 swp9: AFI/SAFI ipv4 not activated on session |
| bgp | Router id conflict detected between two hosts | Critical | Router id @router_id conflict detected between @sess_peer and @router | Router id 13467 conflict detected between router3 and router5 |
| cable | Local port is missing physical connector | Critical | Port cage empty on @ifname, peer @peer @peer_if | Port cage empty on swp16, peer leaf17 swp15 |
| cable | Peer port is missing physical connector | Critical | Peer port cage empty on @ifname, peer @peer @peer_if | Peer port cage empty on @ifname, peer @peer @peer_if |
| cable | Administrative state of remote peer does not match the state of local peer | Critical | @ifname admin state @state, mismatched with peer @peer @peer_if state @peer_state | swp3 admin state up, mismatched with peer spine04 swp2 state down |
| cable | Interface operational state on the two ends of the link is not the same | Critical | @ifname oper state @state, mismatched with peer @peer @peer_if state @peer_state | swp5 oper state up, mismatched with peer leaf11 swp29 state down |
| cable | Link speed is not the same on both ends of the link | Critical | @ifname speed @speed, mismatched with peer @peer @peer_if speed @peer_speed | swp2 speed 10, mismatched with peer server02 speed 40 |
| cable | Auto-negotiation setting on remote peer does not match setting on local peer | Critical | @ifname autoneg @autoneg, mismatched with peer @peer @peer_if autoneg @peer_autoneg | swp12 autoneg on, mismatched with peer spine01 swp04 autoneg off |
| cable | Link is flapping | Critical | @ifname @msg | swp8 Link flapped 6 times in last 5 mins |
| clag | CLAG backup IP address on local peer is not also an address on the remote peer of this CLAG session | Critical | Backup IP @ip does not belong to peer @peer | Backup IP 192.168.33 does not belong to peer leaf13 |
| clag | CLAG sysmac of current session is a duplicate across multiple nodes | Critical | Duplicate sysmac with @node_name | Duplicate sysmac with leaf01 |
| clag | MSTP (multiple spanning tree protocol) is not running | Critical | MSTP not running | MSTP not running |
| clag | Spanning Tree bridge ID is not the same on the local CLAG node and its remote peer | Critical | Bridge ID mismatch with peer | Bridge ID mismatch with peer |
| clag | Connectivity with CLAG peer failed | Critical | Session connectivity with peer failed | Session connectivity with peer failed |
| clag | CLAG peerlink is not part of Spanning Tree | Critical | Peerlink @peerlink not in MSTP | Peerlink swp4 not in MSTP |
| clag | CLAG peerlink is not a bridge member port | Critical | Peerlink @peerlink not in bridge | Peerlink swp2 not in bridge |
| clag | CLAG bond is in Conflicted state | Critical | Bond @bond in Conflicted state due to @reason | Bond 4 in Conflicted state due to peerlink down |
| | | | | Bond 3 in Conflicted state due to lacp partner mac mismatch |
| clag | CLAG bond is in protodown state | Critical | Bond @bond in protodown state due to @reason | Bond 2 in protodown state due to startup-delay |
| | | | | Bond 6 in protodown state due to isl-down |
| clag | MSTP daemon, mstpd, and the CLAG daemon, clagd, have different views of a bond's dual-connected state | Critical | Bond @bond dual-connected state mismatched with MSTP | Bond 7 dual-connected state mismatched with MSTP |
| clag | A CLAG bond on each switch of the CLAG pair has inconsistent maximum transmission unit (MTU) | Critical | Dually connected bond @bond MTU mismatch with peer @peer:@peer_if | Dually connected bond 3 MTU mismatch with peer leaf12:swp45 |
| clag | A CLAG bond on each switch of the CLAG pair has mismatched port VLAN ID (PVID) | Critical | Dually connected bond @bond PVID mismatch with peer @peer:@peer_if | Dually connected bond 5 PVID mismatch with peer spine04:swp2 |
| clag | A CLAG bond on each switch of the CLAG pair has mismatched VLAN membership | Critical | Dually connected bond @bond VLANs mismatch with peer @peer:@peer_if | Dually connected bond 23 VLANs mismatch with peer leaf02:swp4 |
| clag | A VXLAN interface has mismatched VXLAN ID (VNID) on each switch of the CLAG pair | Critical | VXLAN @vxlan_if VNI @vni mismatched with peer @peer:@peer_if | VXLAN xx VNI 12 mismatched with peer TOR-13:swp6 |
| clag | VXLAN anycast gateway IP address is mismatched between the two switches of a CLAG pair | Critical | VXLAN anycast address mismatched on peer @peer | VXLAN anycast address mismatched on peer leaf31 |
| clag | Local CLAG node role changed from primary to secondary or vice versa | Critical | Role changed from @old_role to @new_role | Role changed from primary to secondary |
| clag | CLAG remote peer role changed from secondary to primary or vice versa | Critical | Peer role changed from @old_role to @new_role | Peer role changed from secondary to primary |
| clag | CLAG remote peer state changed from up to down | Critical | Peer state changed to down | Peer state changed from up to down |
| configdiff | Configuration file deleted on a device | Critical | @hostname config file @type was deleted | spine03 config file /etc/frr/frr.conf was deleted |
| evpn | Advertise-All-VNI flag disabled | Critical | VNI @vni advertise-all-vni flag not enabled | VNI 3 advertise-all-vni flag not enabled |
| evpn | VTEP missing from replication list | Critical | VNI @vni VTEP @ip not in replication list | VNI 13 VTEP 192.168.22 not in replication list |
| evpn | Same MAC address appears on multiple hosts | Critical | Duplicate Mac @mac VLAN @vlan at @h1: @lk1 and @h2: @lk2 | Duplicate Mac A0:00:00:00:00:32 VLAN 13 at leaf02:swp3 and leaf04: swp24 |
| evpn | A VLAN MAC address is not the same for two remote destinations | Critical | Mac @mac VLAN @vlan remote dest @vtep1 inconsistent with @vtep2 | Mac A0:00:00:00:00:11 VLAN 4 remote dest 10.0.0.11 inconsistent with 10.0.0.8 |
| evpn | Requested VNI is not detected in the Cumulus Linux kernel | Critical | VNI @vni not in kernel | VNI 11 not in kernel |
| evpn | VTEP's IP address is either not reachable or a duplicate across multiple VTEPs | Critical | VTEP @vtep: @alert | VTEP 10.0.0.4: No route to VTEP |
| | | | | VTEP 10.0.0.4: IP claimed by more than 2 nodes {leaf04, leaf11, spine02, spine04} |
| | | | | VTEP 10.0.0.4: IP claimed by 2 unconnected VTEPs {10.0.0.4, 10.0.0.7} |
| evpn | A remote destination is unknown | Critical | Mac @mac VLAN @vlan unknown remote dest @vtep | Mac A0:00:00:00:00:33 VLAN 4 unknown remote dest 10.0.0.12 |
| license | License state is missing or invalid | Critical | License check failed, name @lic_name state @state | License check failed, name agent.lic state invalid |
| lnv | VXLAN service node daemon, vxsnd, is not running | Critical | vxsnd service not running | vxsnd service not running |
| lnv | vxsnd peer membership is inconsistent among two or more peers in a cluster | Critical | vxsnd peer membership inconsistent | vxsnd peer membership inconsistent |
| lnv | VNI database is inconsistent among peers in VXLAN service node cluster | Critical | vxsnd vni database inconsistent | vxsnd vni database inconsistent |
| lnv | VXLAN replication mode is inconsistent among peers in VXLAN service node cluster | Critical | vxsnd replication mode @mode inconsistent | vxsnd replication mode HER inconsistent |
| | | | | vxsnd replication mode SVC inconsistent |
| lnv | VXLAN registration daemon, vxrd, is not running | Critical | vxrd service not running | vxrd service not running |
| lnv | VXLAN registration daemon is configured to point to a VXLAN service node daemon IP address that is unknown | Critical | vxrd points to unknown vxsnd @snd_ip | vxrd points to unknown vxsnd 192.168.54 |
| lnv | VXLAN registration daemon's VNI database is inconsistent with database of the VXLAN service node daemon | Critical | VNI @vni database inconsistent with vxsnd | VNI 24 database inconsistent with vxsnd |
| lnv | A VNI in the VXLAN registration daemon's database is not found in the service node daemon's database | Critical | VNI @vni not in vxsnd database | VNI 5 not in vxsnd database |
| lnv | VXLAN interface is not in Up state | Critical | vxlan @vxlan vni @vni in @state state | vxlan 1003 vni 6 in down state |
| link | Link operational state changed from up to down | Critical | HostName @hostname changed state from @old_state to @new_state Interface:@ifname | HostName leaf01 changed state from up to down Interface:swp34 |
| mtu | MTU mismatch detected between device pair | Critical | Interface @link mtu @mtu mismatch with @peer interface @peer_if mtu @peer_mtu | Interface swp4 mtu 9600 mismatch with server02 interface swp3 mtu 1500 |
| mtu | Missing bond information on peer node | Critical | Bond @bond mtu @mtu, No peer bond info | Bond 3 mtu 1500, No peer bond info |
| mtu | Missing CLAG peerlink information on peer node | Critical | Clag bond @bond mtu @mtu, peer @peer, no peerlink info | Clag bond 4 mtu 9600, peer leaf13, no peerlink info |
| mtu | Missing link information on peer node | Critical | Link @link mtu @mtu peer @peer, no peer link info | Link swp35 mtu 1500 peer spine01, no peer link info |
| ntp | NTP is not synchronized on the device; protocol is not in Sync state | Critical | Sync state changed from @old_state to @new_state for @hostname | Sync state changed from not sync to in sync for leaf11 |
| ospf | OSPF router id of this host conflicts with another host | Critical | @ifname Router ID conflict with @id | swp5 Router ID conflict with leaf4 |
| ospf | OSPF HELLO time is not the same on the local host and its remote peer | Critical | @ifname hello time mismatch with peer @peer | swp16 hello time mismatch with peer leaf21 |
| ospf | OSPF DEAD time is not the same on the local host and its remote peer | Critical | @ifname dead time mismatch with peer @peer | swp13 dead time mismatch with peer spine02 |
| ospf | Link MTU is not the same for the local host and its remote OSPF peer | Critical | @ifname mtu mismatch with peer @peer | swp4 mtu mismatch with peer server04 |
| ospf | OSPF Area ID is not the same on the local host and its remote peer | Critical | @ifname area ID mismatch with peer @peer | swp14 area ID mismatch with peer leaf34 |
| ospf | OSPF Network type is not the same on the local host and its remote peer | Critical | @ifname network type mismatch with peer @peer | swp12 network type mismatch with peer leaf6 |
| ospf | OSPF service is not configured on the peer node | Critical | @ifname no OSPF config on peer @peer | swp2 no OSPF config on peer leaf7 |
| ospf | A particular peer interface does not have the OSPF service enabled | Critical | @ifname OSPF service not enabled on peer @peer | swp9 OSPF service not enabled on peer spine04 |
| ospf | OSPF service is in error state on the peer node | Critical | @ifname OSPF service error on peer @peer | swp4 OSPF service error on peer leaf11 |
| ospf | OSPF service is in shutdown state on the peer node | Critical | @ifname OSPF service shutdown on peer @peer | swp17 OSPF service shutdown on peer spine01 |
| sensor | A temperature, fan, or power supply unit sensor has passed a critical threshold | Critical | Sensor @sensor state @state value @value msg @msg | Sensor temp state critical value x °F msg @msg |
| | | | | Sensor fan state bad value x msg msg |
| | | | | Sensor psu state bad value x msg msg |
| sensor | A temperature, fan, or power supply unit sensor has changed from low or warning to critical | Critical | Sensor @sensor state changed from @old_s_state to @new_s_state | Sensor temperature state changed from low to critical |
| sensor | A temperature, fan, or power supply unit sensor has crossed the maximum threshold for that sensor | Critical | Sensor @sensor max value @new_s_max exceeds threshold @new_s_crit | Sensor fan max value some value exceeds the threshold some value |
| sensor | A temperature, fan, or power supply unit sensor has crossed the minimum threshold for that sensor | Critical | Sensor @sensor min value @new_s_lcrit fall behind threshold @new_s_min | Sensor psu min value some value fell below threshold some value |
| services | The process ID for a service changed and the service status changed from down to up (why is this critical and not info?) | Critical | Service @name with old pid @old_pid changed to @new_pid, status changed from @old_status to @new_status | Service bgp with old pid 12323 changed to 27651, status changed from down to up |
| services | The process ID for a service changed and the service status changed from up to down | Critical | Service @name with old pid @old_pid changed to @new_pid, status changed from @old_status to @new_status | Service lldp with old pid 32846 changed to 17493, status changed from up to down |
| trace | Unable to make connection along the trace path | Critical | Path incomplete, ends at node @hostname | Path incomplete, ends at node spine04 |
| trace | Unable to connect to destination device | Critical | No valid path to destination | No valid path to destination |
| trace | Interface on path is down | Critical | Link @hostname:@link is down | Link leaf03:swp6 is down |
| trace | Routing loop detected | Critical | Routing loop: node @hostname vrf @vrf visited twice | Routing loop: node leaf23 vrf default visited twice |
| trace | Bridging loop detected | Critical | Bridging loop: node @hostname vlan @vlan visited twice | Bridging loop: node spine02 vlan 11 visited twice |
| trace | Node along path was unreachable | Critical | Tracing stopped at rotten node @hostname | Tracing stopped at rotten node leaf17 |
| trace | Source and destination addresses are of different scope and there is no path between them | Critical | No valid paths between link local and non link-local IP addresses | No valid paths between link local and non link-local IP addresses |
| trace | There is no valid path between a pair of VTEPs | Critical | No underlay path from @src_ip to @dst_ip for vxlan @vxlan | No underlay path from 192.168.35 to 192.168.12 for vxlan 1005 |
| vlan | VLAN membership is not the same for both ends of the interface | Critical | @link VLAN set (@vlans) mismatch with peer @peerhost:@peer_ifname (@peer_vlans) | swp3 VLAN set (1002 1005 1230) mismatch with peer leaf06:swp4 (1002 1016 1230) |
| vlan | PVID is not the same for both ends of the interface | Critical | @link PVID (@pvid) mismatch with peer @peerhost:@peer_ifname (@peer_pvid) | swp7 PVID (10) mismatch with peer spine01:swp3 (9) |
| vxlan | Broadcast, unknown unicast, and multicast (BUM) replication list of a VNI is inconsistent among all the VTEPs in the network | Critical | VNI @vni replication list inconsistent | VNI 14 replication list inconsistent |
| vxlan | A VNI is associated with different VLANs on different VTEPs in the network | Critical | VNI @vni mapped to inconsistent VLAN @vlan1 | VNI 6 mapped to inconsistent VLAN 7 |
| vxlan | A VXLAN interface on a node has changed state from up to down | Critical | vxlan device @vxlan in @state state | vxlan device 1002 in down state |
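Because these messages can also be read from syslog, it can be useful to pull the variable fields back out of a received message. The sketch below turns an entry from the Message Format column into a regular expression, assuming that each `@name` token is a placeholder for one field. The function names are hypothetical and are not part of NetQ, and the sketch does not handle a format that repeats the same placeholder name.

```python
import re


def template_to_regex(template):
    """Compile a message format such as 'VNI @vni not in kernel' into a regex."""
    # With one capturing group, re.split alternates literal text (even
    # indexes) and placeholder names (odd indexes).
    parts = re.split(r"@(\w+)", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal text, matched exactly
        else:
            pattern += f"(?P<{part}>.+?)"    # placeholder, captured by name
    return re.compile(pattern)


def parse_alarm(template, message):
    """Return a dict of placeholder values, or None if the message does not match."""
    match = template_to_regex(template).fullmatch(message)
    return match.groupdict() if match else None


fields = parse_alarm(
    "BGP session with peer @peerhost (@peer vrf @vrf) failed, reason: @reason",
    "BGP session with peer spine02 (swp3 vrf default) failed, reason: link down",
)
print(fields)
```

Values are returned as strings. The placeholders match lazily, so a field such as the failure reason can contain spaces.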