SNMP Troubles on an Alcatel 7750

Posted: 19 Jun 2014 16:25
by ravadac
All,

First post here. We're having an odd issue and hoping someone can help. We gather usage statistics every 5 minutes on all of our network gear via SNMP. We have a lot of Alcatel 7750s: half of them poll correctly, whereas the other half don't. SNMP works against all of them, so we know it's not a bad-credential issue. On the half that don't work, we see the following: at exactly the 5-minute mark, as session counts build (up to about 6000), the Alcatel 7750 stops responding to SNMP and almost acts as if it's queuing our SNMP requests. Our agents experience timeouts unless we increase our timeouts to almost 60 seconds. As a matter of fact, no polls to the 'broken' boxes are honored unless we increase our timeouts. On the 'good' boxes we never see this happen; we can poll them with the default timeouts all through the 5-minute interval.

Our 'good' 7750s are actually busier and have higher CPU loads than the 'broken' ones.

We've not been able to pin down any difference code-wise, software-wise, etc. We can poll other device types (Junipers, Ciscos, etc.) on the same network segment as the Alcatels during the time that the 'broken' Alcatels don't respond. Ping also gets a response 100% of the time while our SNMP polls are timing out.

We've tried polling from multiple network segments.
We've tried tracing the polls (on the 'broken' routers, the Alcatel appears to receive the SNMP packet about 50 s after we send it, but then answers immediately).
We've had Alcatel on the phone, but they've not really given us any clear direction, other than to insist that the Alcatel is not queuing up the requests. They've asked us to get on a system and trace the SNMP call through the last hop before it gets to the Alcatel, but due to our network configuration we can't do that.

Any other things to try? We're completely at a loss.

VARIOUS OUTPUTS LISTED BELOW :

BROKEN : SNMPv2-MIB::sysDescr.0 = STRING: TiMOS-C-10.0.R4 cpm/hops ALCATEL SR 7750 Copyright (c) 2000-2012 Alcatel-Lucent.
GOOD : SNMPv2-MIB::sysDescr.0 = STRING: TiMOS-C-10.0.R4 cpm/hops ALCATEL SR 7750 Copyright (c) 2000-2012 Alcatel-Lucent.



*Broken router*
A:BAD 7750# show snmp counters
==============================================================================
SNMP counters:
==============================================================================
in packets : 97160430
------------------------------------------------------------------------------
in gets : 756371
in getnexts : 2338
in getbulks : 96389555
in sets : 11952

out packets: 97160216
------------------------------------------------------------------------------
out get responses : 97160216
out traps : 0

variables requested: -1085720426
variables set : 21294

*Working router*
*A: GOOD 7750 # show snmp counters
==============================================================================
SNMP counters:
==============================================================================
in packets : 96646877
------------------------------------------------------------------------------
in gets : 2929839
in getnexts : 765
in getbulks : 93694878
in sets : 19687

out packets: 96645179
------------------------------------------------------------------------------
out get responses : 96645169
out traps : 0

variables requested: 947820440
variables set : 38794
==============================================================================

Re: SNMP Troubles on an Alcatel 7750

Posted: 24 Jun 2014 03:47
by mivens
Your description makes it sound like the "bad" routers are getting hammered with requests every 5 mins.

Maybe try and narrow it down to either a particular agent or MIB being polled?

I.e. keep excluding agents till the problem goes away on the "bad" routers or keep excluding subtrees in the SNMP view until the problem goes away.

On the 5 min interval where the "bad" routers stop responding, what does "show system cpu" say? For example, does it show that the SNMP group is using 100% of its allowed capacity? When a node stops responding, does "show system connections" show a large RecvQ compared to the "good" routers?

With regard to the "show snmp counters" output, it's hard to draw a conclusion from the absolute values of the counters. Maybe you can compare how much they increase over a 10-minute interval on the "good" and "bad" routers.
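
The comparison suggested above boils down to taking two snapshots per router and subtracting them. Here is a minimal Python sketch of that. It assumes the counters are 32-bit and that a negative value (such as the "variables requested: -1085720426" on the broken router) is simply a 32-bit counter printed as a signed number. The first snapshot uses values from the thread; the second snapshot is made up for illustration.

```python
WRAP = 2 ** 32  # assumption: counters are 32 bits wide and wrap at 2**32


def counter_delta(before, after):
    """Increase between two samples, tolerating signed display and one wrap."""
    return (after - before) % WRAP


# "before" values are from the broken router's output in this thread.
# "after" values are hypothetical, standing in for a sample taken 10 minutes later.
before = {"in getbulks": 96389555, "variables requested": -1085720426}
after = {"in getbulks": 96400355, "variables requested": -1085000426}

INTERVAL_S = 600  # 10-minute interval, as suggested above

for name in before:
    delta = counter_delta(before[name], after[name])
    print(f"{name}: +{delta} in 10 min ({delta / INTERVAL_S:.1f}/s)")
```

Running the same calculation against a "good" and a "bad" router over the same interval gives rates that can be compared directly, which the absolute totals cannot.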

Re: SNMP Troubles on an Alcatel 7750

Posted: 24 Jun 2014 12:53
by ravadac
Thanks Mivens. The good and bad routers in question all show RecvQ rising by about the same amount when the pollers kick off their polling requests (they all rise to between 6,000 and 8,000).
The CPU never reaches 100% on either the good or bad routers (it tops out at ~55%); in fact, the CPU is actually a little higher on the "good" routers.

Re: SNMP Troubles on an Alcatel 7750

Posted: 24 Jul 2014 04:35
by lyndhurst
Hi,

This definitely sounds like an issue with the number of requests within a certain period of time. The 7750 probably treats the excessive number of requests as an attack and stops responding.

1) Run "show log event-control" and look for snmp related counters. I'm pretty sure there are some related counters there.

2) Go to the snmp context under configuration--> system and run "info detail" command to see default values.

Thanks.

Re: SNMP Troubles on an Alcatel 7750

Posted: 24 Jul 2014 11:13
by mivens
I think you may be confusing SNMP traps with SNMP polling. "show log event-control" will show the number of events being generated by the 7750 which depending on your configuration may be sent out as SNMP traps. It will not tell you what SNMP objects are being polled.

The SNMP Daemon group is limited in the amount of resources it can consume. When the capacity usage of the SNMP Daemon group in the output of "show system cpu" reaches 100%, it is not allowed any more CPU. This is not the same as the CPU usage. You can hit 100% capacity usage with lots of CPU still left over for other processes. This is by design. See example below from the lab.

*A:7750#  show system cpu | match 0.00 invert-match 

===============================================================================
CPU Utilization (Sample period: 1 second)
===============================================================================
Name                                   CPU Time       CPU Usage        Capacity
                                         (uSec)                           Usage
-------------------------------------------------------------------------------
BFD                                       2,167           0.02%           0.15%
CFLOWD                                    6,743           0.07%           0.34%
Cards & Ports                            41,509           0.46%           0.80%
ICC                                      13,869           0.15%           1.39%
IGMP/MLD                                    937           0.01%           0.08%
IP Stack                                136,581           1.52%          13.28%
IS-IS                                     3,787           0.04%           0.19%
ISA                                      10,243           0.11%           0.31%
LDP                                       2,765           0.03%           0.27%
MPLS/RSVP                                 6,072           0.06%           0.48%
Management                               34,524           0.38%           2.02%
OAM                                       7,958           0.08%           0.43%
OSPF                                        948           0.01%           0.04%
Redundancy                                7,567           0.08%           0.47%
SNMP Daemon                             900,630          10.06%          90.71%
Services                                  3,646           0.04%           0.08%
Subscriber Mgmt                           2,242           0.02%           0.05%
System                                  142,184           1.58%           3.31%
VRRP                                      1,041           0.01%           0.07%
-------------------------------------------------------------------------------
   Idle                               7,626,001          85.18%                
   Usage                              1,326,119          14.81%                
Busiest Core Utilization                415,782          41.79%                
===============================================================================
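
If you capture this output from several routers, a small script can pick out the groups that are close to their capacity limit. The following Python sketch assumes the three-column row layout shown above (CPU time, CPU usage, capacity usage). The 90% threshold and the function name are my own choices, not anything defined by the router.

```python
import re

# A data row: group name, CPU time in uSec, CPU usage %, capacity usage %.
# Summary rows (Idle, Usage, Busiest Core) have only two numeric columns
# and therefore do not match.
ROW = re.compile(
    r"^(?P<name>\S.*?)\s+(?P<usec>[\d,]+)\s+(?P<cpu>[\d.]+)%\s+(?P<cap>[\d.]+)%\s*$"
)


def near_capacity(cli_output, threshold=90.0):
    """Return (group, cpu_usage, capacity_usage) for groups at or above threshold."""
    hits = []
    for line in cli_output.splitlines():
        m = ROW.match(line)
        if m and float(m.group("cap")) >= threshold:
            hits.append((m.group("name"), float(m.group("cpu")), float(m.group("cap"))))
    return hits


# A few rows copied from the lab output above.
sample = """\
IP Stack                                136,581           1.52%          13.28%
SNMP Daemon                             900,630          10.06%          90.71%
System                                  142,184           1.58%           3.31%
   Idle                               7,626,001          85.18%
"""

print(near_capacity(sample))
```

The point of checking the capacity column instead of the CPU column is the one made above: the SNMP Daemon row shows only about 10% CPU usage while sitting at over 90% of what it is allowed.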

Re: SNMP Troubles on an Alcatel 7750

Posted: 18 Aug 2014 05:23
by evans
If possible, maybe try another OS release, such as 10.0 R8. We had a bug on this release in our network some time back, though it showed up mostly as a CPU issue.

Re: SNMP Troubles on an Alcatel 7750

Posted: 06 Jul 2015 14:43
by ravadac
Looking at the SNMP daemon I'm only seeing about 3% usage. I can literally count the SNMP responses as they come back from a walk of the ifTable.

Name                                   CPU Time       CPU Usage        Capacity
                                         (uSec)                           Usage
-------------------------------------------------------------------------------
BGP                                      18,888           0.21%           1.90%
BGP PE-CE                                 1,566           0.01%           0.08%
CFLOWD                                    6,181           0.06%           0.33%
Cards & Ports                            59,057           0.66%           0.89%
ICC                                      68,108           0.76%           6.92%
IP Stack                                 48,769           0.54%           4.14%
IS-IS                                     3,476           0.03%           0.18%
ISA                                       9,101           0.10%           0.32%
LDP                                      19,765           0.22%           2.10%
MPLS/RSVP                                 5,845           0.06%           0.46%
Management                               26,004           0.29%           1.05%
OAM                                      19,943           0.22%           0.89%
PIM                                      35,779           0.39%           3.10%
Redundancy                               26,518           0.29%           1.35%
SNMP Daemon                              33,985           0.37%           3.62%
Services                                  6,025           0.06%          15.78%
Subscriber Mgmt                           2,818           0.03%           0.06%
System                                  161,807           1.80%           4.93%
VRRP                                      1,252           0.01%           0.09%
-------------------------------------------------------------------------------
   Idle                               8,388,445          93.77%
   Usage                                557,056           6.22%
Busiest Core Utilization                149,974          15.08%

Re: SNMP Troubles on an Alcatel 7750

Posted: 20 Jul 2015 13:09
by mkk
Hi ravadac,

We are having exactly the same issue you described above, with the same symptoms and the same results...

We'd really appreciate it if you could share what the solution was.

The 7750 release on which we see the issue is 10.0R4.

Thanks.

Re: SNMP Troubles on an Alcatel 7750

Posted: 30 Jul 2015 16:42
by mivens
It's not clear yet whether anyone has tried narrowing down whether queries for a particular MIB/subtree are responsible. For example, there have been bugs in the past when SNMP objects like the number of PPP subscribers on a node are queried.

As a simple test, apply an SNMP view that only allows polls for the IF-MIB (.1.3.6.1.2.1.2) and the SNMPv2-MIB (.1.3.6.1.2.1.1) and denies anything else. Does that make the problem go away?
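
Before applying such a view, it can help to check which of the OIDs your pollers request would still be answered. The Python sketch below models only the two-subtree include list described above; it is not the router's view configuration, and the function names and sample OIDs are illustrative.

```python
# The two subtrees named in the suggested test view.
ALLOWED = (".1.3.6.1.2.1.2", ".1.3.6.1.2.1.1")


def parse(oid):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def in_view(oid, allowed=ALLOWED):
    """True if the OID sits under one of the allowed subtrees.

    Comparison is per OID component, so .1.3.6.1.2.1.25 is not mistaken
    for something under .1.3.6.1.2.1.2 the way a string prefix test would.
    """
    target = parse(oid)
    return any(target[: len(sub)] == sub for sub in map(parse, allowed))


polled = [
    ".1.3.6.1.2.1.1.1.0",         # sysDescr.0
    ".1.3.6.1.2.1.2.2.1.10.1",    # ifInOctets.1 (ifTable)
    ".1.3.6.1.2.1.31.1.1.1.6.1",  # ifHCInOctets.1 (ifXTable, under .1.3.6.1.2.1.31)
]

for oid in polled:
    print(oid, "allowed" if in_view(oid) else "denied")
```

Note that the ifXTable lives under .1.3.6.1.2.1.31, so pollers that rely on the 64-bit interface counters would be denied by this particular view unless that subtree is included as well.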