SNMP Troubles on an Alcatel 7750

ravadac

SNMP Troubles on an Alcatel 7750

Post by ravadac »

All,

First post here. I'm seeing an odd issue and hoping someone can help. We gather usage statistics every 5 minutes from all of our network gear via SNMP. We have a lot of Alcatel 7750s: half of them poll correctly and the other half don't. SNMP works against all of them, so we know it's not a credential issue.

On the half that don't work, we see the following: exactly at the 5-minute mark, as session counts build (up to about 6000), the 7750 stops responding to SNMP and acts almost as if it's queuing our SNMP requests. Our agents time out unless we increase our timeouts to almost 60 seconds. In fact, no polls to the 'broken' boxes are honored unless we increase our timeouts. On the 'good' boxes this never happens; we can poll them with the default timeouts all through the 5-minute interval.

Our 'good' 7750s are actually busier and have higher CPU loads than the 'broken' ones.

We've not been able to pin down any difference in code, software, etc. between the two groups. We can poll other device types (Junipers, Ciscos, etc.) on the same network segment as the Alcatels during the time that the 'broken' Alcatels don't respond. Ping also gets a 100% response rate while our SNMP polls are timing out.

We've tried polling from multiple network segments.
We've tried tracing the polls (on the 'broken' routers it looks like the Alcatel receives the SNMP packet 50 seconds after we send it, but then replies immediately).
We've had Alcatel on the phone, but they haven't given us any clear direction other than insisting that the Alcatel is not queuing the requests. They've asked us to get on a system and trace the SNMP call through the last hop before it reaches the Alcatel, but our network configuration doesn't allow that.

Any other things to try? We're completely at a loss.

VARIOUS OUTPUTS LISTED BELOW :

BROKEN : SNMPv2-MIB::sysDescr.0 = STRING: TiMOS-C-10.0.R4 cpm/hops ALCATEL SR 7750 Copyright (c) 2000-2012 Alcatel-Lucent.
GOOD : SNMPv2-MIB::sysDescr.0 = STRING: TiMOS-C-10.0.R4 cpm/hops ALCATEL SR 7750 Copyright (c) 2000-2012 Alcatel-Lucent.



*Broken router*
A:BAD 7750# show snmp counters
==============================================================================
SNMP counters:
==============================================================================
in packets : 97160430
------------------------------------------------------------------------------
in gets : 756371
in getnexts : 2338
in getbulks : 96389555
in sets : 11952

out packets: 97160216
------------------------------------------------------------------------------
out get responses : 97160216
out traps : 0

variables requested: -1085720426
variables set : 21294

*Working router*
*A: GOOD 7750 # show snmp counters
==============================================================================
SNMP counters:
==============================================================================
in packets : 96646877
------------------------------------------------------------------------------
in gets : 2929839
in getnexts : 765
in getbulks : 93694878
in sets : 19687

out packets: 96645179
------------------------------------------------------------------------------
out get responses : 96645169
out traps : 0

variables requested: 947820440
variables set : 38794
==============================================================================
mivens
Member
Posts: 262
Joined: 28 Sep 2012 06:34

Re: SNMP Troubles on an Alcatel 7750

Post by mivens »

Your description makes it sound like the "bad" routers are getting hammered with requests every 5 mins.

Maybe try and narrow it down to either a particular agent or MIB being polled?

I.e. keep excluding agents till the problem goes away on the "bad" routers or keep excluding subtrees in the SNMP view until the problem goes away.

On the 5 min interval where the "bad" routers stop responding, what does "show system cpu" say? For example, does it show that the SNMP group is using 100% of its allowed capacity? When a node stops responding, does "show system connections" show a large RecvQ compared to the "good" routers?

With regard to the "show snmp counters" output, it's hard to draw a conclusion from the absolute values of the counters. Maybe you can compare how much they increase on each in a 10 min interval, between "good" and "bad" routers.
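
One way to make that comparison is to capture "show snmp counters" twice on each router, about 10 minutes apart, and diff the values. A minimal Python sketch follows. It assumes the counters are 32-bit values that the CLI may print as signed numbers (which might explain the negative "variables requested" on the broken router, but that is a guess), and the function names are made up for illustration.

```python
import re

# Assumption: counters are 32 bits wide and may be displayed as signed values.
COUNTER_BITS = 32


def parse_counters(text):
    """Return {name: value} for every 'name : number' line in the CLI output."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([A-Za-z ]+?)\s*:\s*(-?\d+)\s*$", line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters


def delta(before, after, bits=COUNTER_BITS):
    """Increase between two samples, tolerating signed display and one wrap."""
    return (after - before) % (1 << bits)


def compare(first_snapshot, second_snapshot):
    """Per-counter increase between two captures taken on the same router."""
    a = parse_counters(first_snapshot)
    b = parse_counters(second_snapshot)
    return {name: delta(a[name], b[name]) for name in a if name in b}
```

Running compare() on two captures from a "bad" router and two from a "good" router taken at the same times gives per-interval increases that can be set side by side.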
ravadac

Re: SNMP Troubles on an Alcatel 7750

Post by ravadac »

Thanks mivens. The good and bad routers in question all show their RecvQ rise by about the same amount when the pollers kick off their polling requests (they all rise to between 6,000 and 8,000).
The CPU never reaches 100% on either the good or the bad routers (it tops out at ~55%). In fact, the CPU is actually a little higher on the "good" routers.
lyndhurst

Re: SNMP Troubles on an Alcatel 7750

Post by lyndhurst »

Hi,

This definitely sounds like an issue with the number of requests within a certain period of time. The 7750 probably treats an excessive number of requests as an attack and stops responding.

1) Run "show log event-control" and look for SNMP-related counters. I'm pretty sure there are some related counters there.

2) Go to the snmp context under configure system and run the "info detail" command to see the default values.

Thanks.
mivens

Re: SNMP Troubles on an Alcatel 7750

Post by mivens »

I think you may be confusing SNMP traps with SNMP polling. "show log event-control" will show the number of events being generated by the 7750 which depending on your configuration may be sent out as SNMP traps. It will not tell you what SNMP objects are being polled.

The SNMP Daemon group is limited in the amount of resources it can consume. When the Capacity Usage value for that group in the output of "show system cpu" reaches 100%, it is not allowed any more CPU. This is not the same as the CPU usage. You can hit 100% capacity usage with lots of CPU still left over for other processes. This is by design. See the example below from the lab.

Code: Select all

*A:7750#  show system cpu | match 0.00 invert-match 

===============================================================================
CPU Utilization (Sample period: 1 second)
===============================================================================
Name                                   CPU Time       CPU Usage        Capacity
                                         (uSec)                           Usage
-------------------------------------------------------------------------------
BFD                                       2,167           0.02%           0.15%
CFLOWD                                    6,743           0.07%           0.34%
Cards & Ports                            41,509           0.46%           0.80%
ICC                                      13,869           0.15%           1.39%
IGMP/MLD                                    937           0.01%           0.08%
IP Stack                                136,581           1.52%          13.28%
IS-IS                                     3,787           0.04%           0.19%
ISA                                      10,243           0.11%           0.31%
LDP                                       2,765           0.03%           0.27%
MPLS/RSVP                                 6,072           0.06%           0.48%
Management                               34,524           0.38%           2.02%
OAM                                       7,958           0.08%           0.43%
OSPF                                        948           0.01%           0.04%
Redundancy                                7,567           0.08%           0.47%
SNMP Daemon                             900,630          10.06%          90.71%
Services                                  3,646           0.04%           0.08%
Subscriber Mgmt                           2,242           0.02%           0.05%
System                                  142,184           1.58%           3.31%
VRRP                                      1,041           0.01%           0.07%
-------------------------------------------------------------------------------
   Idle                               7,626,001          85.18%                
   Usage                              1,326,119          14.81%                
Busiest Core Utilization                415,782          41.79%                
===============================================================================
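
To watch for this during the 5-minute poll, the output can be captured repeatedly and checked for any group whose Capacity Usage is near its limit. A rough Python sketch, assuming the column layout shown above; the 90% threshold and the function names are arbitrary choices for illustration.

```python
import re

# A data row is: group name, CPU time in uSec, CPU usage %, capacity usage %.
# Summary rows (Idle, Usage, Busiest Core) carry only one percentage and are skipped.
ROW = re.compile(
    r"^(?P<name>\S.*?)\s+(?P<usec>[\d,]+)\s+(?P<cpu>[\d.]+)%\s+(?P<cap>[\d.]+)%\s*$"
)


def capacity_usage(text):
    """Return {group: (cpu_usage_pct, capacity_usage_pct)} from 'show system cpu'."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group("name")] = (float(m.group("cpu")), float(m.group("cap")))
    return rows


def saturated(text, threshold=90.0):
    """Names of groups whose capacity usage is at or above the threshold."""
    return [name for name, (_, cap) in capacity_usage(text).items() if cap >= threshold]
```

Against the lab output above, saturated() would report only the SNMP Daemon group, even though overall CPU usage is under 15%.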
evans

Re: SNMP Troubles on an Alcatel 7750

Post by evans »

If possible, maybe try another OS release, e.g. 10.0 R8. We had a bug on this release in our network some time back, though it showed up mostly as a CPU issue.
ravadac

Re: SNMP Troubles on an Alcatel 7750

Post by ravadac »

Looking at the SNMP Daemon group, I'm only seeing about 3% capacity usage. I can literally count the SNMP responses as they come back from a walk of the ifTable.

Name                                   CPU Time       CPU Usage        Capacity
                                         (uSec)                           Usage
-------------------------------------------------------------------------------
BGP                                      18,888           0.21%           1.90%
BGP PE-CE                                 1,566           0.01%           0.08%
CFLOWD                                    6,181           0.06%           0.33%
Cards & Ports                            59,057           0.66%           0.89%
ICC                                      68,108           0.76%           6.92%
IP Stack                                 48,769           0.54%           4.14%
IS-IS                                     3,476           0.03%           0.18%
ISA                                       9,101           0.10%           0.32%
LDP                                      19,765           0.22%           2.10%
MPLS/RSVP                                 5,845           0.06%           0.46%
Management                               26,004           0.29%           1.05%
OAM                                      19,943           0.22%           0.89%
PIM                                      35,779           0.39%           3.10%
Redundancy                               26,518           0.29%           1.35%
SNMP Daemon                              33,985           0.37%           3.62%
Services                                  6,025           0.06%          15.78%
Subscriber Mgmt                           2,818           0.03%           0.06%
System                                  161,807           1.80%           4.93%
VRRP                                      1,252           0.01%           0.09%
-------------------------------------------------------------------------------
   Idle                               8,388,445          93.77%
   Usage                                557,056           6.22%
Busiest Core Utilization                149,974          15.08%
mkk

Re: SNMP Troubles on an Alcatel 7750

Post by mkk »

Hi ravadac,

We are having exactly the same issue you described above, with the same symptoms and the same results.

I'd really appreciate it if you could share what the solution was.

The 7750 release on which we see the issue is 10.0R4.

Thanks.
mivens

Re: SNMP Troubles on an Alcatel 7750

Post by mivens »

It's not clear yet whether anyone has tried narrowing down if queries for a particular MIB/subtree are responsible. For example, there have been bugs in the past when SNMP objects like the number of PPP subscribers on a node are queried.

For example as a simple test, apply an SNMP view which only allows polls for the IF-MIB (.1.3.6.1.2.1.2) and the SNMPv2-MIB (.1.3.6.1.2.1.1) and anything else is denied. Does that make the problem go away?
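
The poller side of this test can be sketched as timing one walk per subtree, so that a slow subtree stands out. This assumes the Net-SNMP snmpbulkwalk tool is installed on the polling host and that SNMPv2c is in use; the host name, community string and function names are placeholders, not anything taken from this thread.

```python
import subprocess
import time


def walk_command(host, community, oid, timeout_s=60, retries=0):
    """Build a Net-SNMP bulk walk command line (assumes SNMPv2c)."""
    return [
        "snmpbulkwalk", "-v2c", "-c", community,
        "-t", str(timeout_s), "-r", str(retries),
        host, oid,
    ]


def time_walks(host, community, oids, run=subprocess.run, clock=time.monotonic):
    """Walk each subtree once; return {oid: (elapsed_seconds, return_code)}."""
    results = {}
    for oid in oids:
        start = clock()
        proc = run(walk_command(host, community, oid), capture_output=True, text=True)
        results[oid] = (clock() - start, proc.returncode)
    return results


if __name__ == "__main__":
    # Placeholder host and community; the OIDs are the two subtrees named above.
    subtrees = [".1.3.6.1.2.1.1", ".1.3.6.1.2.1.2"]
    for oid, (secs, rc) in time_walks("router.example", "public", subtrees).items():
        print(f"{oid}: {secs:.1f}s (exit {rc})")
```

Running it at the 5-minute mark against a "bad" and a "good" router, and then adding further subtrees to the list one at a time, is one way to see which query coincides with the delay.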