• May 22, 2012, 09:10:41 PM

Author Topic: Troubleshoot CPU spikes on an 8310  (Read 490 times)

Offline sbenninger

  • Rookie
  • **
  • Posts: 5
Troubleshoot CPU spikes on an 8310
« on: January 10, 2012, 11:01:15 AM »
Hello everyone,
I found this site while looking for guidance, and any suggestions would be welcome. I am not very familiar with troubleshooting this device and not well versed in the CLI either; I mostly use JDM.

We have a 7-year-old Passport (ERS now, I guess) 8310. It has five 8348TX, one 8324GTX, and two 8348TX-PWR cards, plus two 8393SF CPU cards. It has not been restarted for several years and is running an old 3.0.2.0 OS release. A few BayStack stacks are connected via GBIC.

For the past several months we have been experiencing layer 3 issues. At random times throughout the day and night the CPU spikes to 70-90% (normal is 5-10%) for 10-15 s. When that happens, layer 3 latencies go way up and any Nortel/Avaya IP phones on a different subnet (e.g. a remote site with a BCM, or a different VLAN) reboot.

There is not much in the log (see below), just the ports going up and down as the phones reboot. I have been trying to see which process is spiking but can't find out how to do that.

What else should I be looking for? Could it be network traffic? If I graph the chassis in JDM I don't see anything drastic happening at the time of the CPU spikes. I am collecting some performance metrics with Zenoss, but they are not giving much insight. A copy of some of the counters is below. They cover a couple of minutes, as the counters (except the absolute value) sometimes start over when there is a CPU spike.

I am not sure whether a reboot would correct the issue, but I would like to know the cause first. I am sure it needs an OS upgrade as well, but we do not currently have support, so that is not an option. Could it be too much network traffic for the device? Should we be looking at replacing it? Any help would be greatly appreciated.

Thanks in advance!

Scott

Counters:

Counter          AbsoluteValue  Cumulative  Average/sec           Minimum/sec  Maximum/sec          LastVal/sec
InReceives       948,925,508    2237.0      14.814569536423841    12.2         19.7                 12.8
InHdrErrors      1,651,596      2.0         0.013245033112582781  0.0          0.2                  0.0
InAddrErrors     261,157        0.0         0.0                   0.0          0.0                  0.0
ForwDatagrams    808,229,428    1609.0      10.655629139072847    3.2          14.8                 9.0
InUnknownProtos  0              0.0         0.0                   0.0          0.0                  0.0
InDiscards       6,435,153      2.0         0.013245033112582781  0.0          0.09523809523809523  0.0
InDelivers       41,286,263     330.0       2.185430463576159     1.9          3.0                  1.9
OutRequests      42,500,660     331.0       2.19205298013245      1.9          2.9                  1.9
OutDiscards      0              0.0         0.0                   0.0          0.0                  0.0
OutNoRoutes      0              0.0         0.0                   0.0          0.0                  0.0
FragOKs          0              0.0         0.0                   0.0          0.0                  0.0
FragFails        0              0.0         0.0                   0.0          0.0                  0.0
FragCreates      0              0.0         0.0                   0.0          0.0                  0.0
ReasmReqds       0              0.0         0.0                   0.0          0.0                  0.0
ReasmOKs         0              0.0         0.0                   0.0          0.0                  0.0
ReasmFails       0              0.0         0.0                   0.0          0.0                  0.0

Log:
2012-01-10 08:37:36   Local7.Info   172.16.10.101   CPU5 [01/10/12 09:31:35] SNMP INFO Spanning Tree Topology Change(StgId=1, PortNum=2/13, MacAddr=00:11:f9:b7:c0:01)<000>
2012-01-10 08:57:53   Local7.Info   172.16.10.101   CPU5 [01/10/12 09:51:58] HW INFO portLinkUpEvent starting 01/10/12 09:51:58 on ports 2/27<000>
2012-01-10 08:58:27   Local7.Info   172.16.10.101   CPU5 [01/10/12 09:52:32] SNMP INFO Spanning Tree Topology Change(StgId=1, PortNum=2/27, MacAddr=00:11:f9:b7:c0:01)<000>
2012-01-10 10:47:19   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:41:24] HW INFO portLinkDownEvent starting 01/10/12 11:41:24 on ports 3/20<000>
2012-01-10 10:47:21   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:41:26] HW INFO portLinkUpEvent starting 01/10/12 11:41:26 on ports 3/20<000>
2012-01-10 10:47:49   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:41:55] SNMP INFO Spanning Tree Topology Change(StgId=1, PortNum=3/20, MacAddr=00:11:f9:b7:c0:01)<000>
2012-01-10 10:55:11   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:49:16] HW INFO portLinkDownEvent starting 01/10/12 11:49:16 on ports 8/6<000>
2012-01-10 10:55:14   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:49:19] HW INFO portLinkUpEvent starting 01/10/12 11:49:19 on ports 8/6<000>
2012-01-10 10:55:43   Local7.Info   172.16.10.101   CPU5 [01/10/12 11:49:48] SNMP INFO Spanning Tree Topology Change(StgId=1, PortNum=8/6, MacAddr=00:11:f9:b7:c0:01)<000>


Editor: updated post with tt tags for readability
« Last Edit: January 12, 2012, 10:11:20 PM by Michael McNamara »


Offline Flintstone

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 584
Re: Troubleshoot CPU spikes on an 8310
« Reply #1 on: January 11, 2012, 03:59:14 AM »
Hi sbenninger and welcome to the forum,

Check whether the spanning tree changes in your log occur at the same time as the CPU spikes.

It sounds like you could also be experiencing something like a broadcast storm. I would use a sniffer to see if you can capture the offending traffic.

CheerZ and good luck

Offline sbenninger

  • Rookie
  • **
  • Posts: 5
Re: Troubleshoot CPU spikes on an 8310
« Reply #2 on: January 11, 2012, 04:28:18 PM »
Thanks for the reply.

I believe the spanning tree changes are the ports going down/up when an IP phone reboots due to the CPU spike/high latency, since the ports have STP enabled. I only see those messages when the phones reboot. When doing a packet capture I see the normal BPDUs until a phone reboots, then I see the topology change BPDU.

I also do not see any indication of a current broadcast storm, but this has jogged my memory. In Oct/Nov last year we had an issue where the internal switch inside an IP phone went bad and caused a broadcast storm. The port the phone was plugged into was flapping, which caused the CPU on the 8310 to run very high continuously. It seems as though the CPU spikes have been happening since that time. Could something left over from that event be causing the issue?

Is there any way to see what processes are on the 8310 to see what is spiking?

Thanks very much.


Offline Flintstone

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 584
Re: Troubleshoot CPU spikes on an 8310
« Reply #3 on: January 12, 2012, 05:14:25 AM »
Hi sbenninger,

I would compare all your port statistics to see if any ports show an excessive number of broadcasts/multicasts. If that is your problem, I believe you can also set up broadcast/multicast rate limiting per port.
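As a rough sketch of that comparison: once the per-port counters have been copied out of JDM or a poller, a few lines of Python can rank ports by how much of their traffic is broadcast or multicast. The dictionary layout, the field names (`unicast`, `multicast`, `broadcast`) and the `min_packets` cut-off are assumptions made for illustration, not names used by the switch.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: port name -> packet counts by type.
# The field names are illustrative, not the switch's own MIB names.
PortStats = Dict[str, int]


def rank_ports_by_flood_share(
    stats: Dict[str, PortStats], min_packets: int = 1000
) -> List[Tuple[str, float]]:
    """Return (port, share) pairs sorted by broadcast+multicast share, highest first.

    Ports with fewer than min_packets total packets are skipped so that
    nearly idle ports do not dominate the ranking.
    """
    ranked = []
    for port, counts in stats.items():
        total = counts["unicast"] + counts["multicast"] + counts["broadcast"]
        if total < min_packets:
            continue
        share = (counts["multicast"] + counts["broadcast"]) / total
        ranked.append((port, share))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    sample = {
        "2/27": {"unicast": 9000, "multicast": 500, "broadcast": 500},
        "3/20": {"unicast": 1000, "multicast": 4000, "broadcast": 5000},
    }
    for port, share in rank_ports_by_flood_share(sample):
        print(f"{port}: {share:.0%} broadcast/multicast")
```

A high share on an access port is only a hint about where to look next; uplinks to other switches will naturally carry more flooded traffic.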

CheerZ

Offline Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2517
    • Michael McNamara
Re: Troubleshoot CPU spikes on an 8310
« Reply #4 on: January 12, 2012, 10:22:56 PM »
Hi sbenninger and welcome to the forums!

I was going to ask what was connected to ports 2/27, 3/20, 8/6  but I believe you answered it in your follow-up reply to @Flintstone.

A CPU spike can definitely be caused by traffic; a surge of multicast or broadcast packets is the usual suspect in most networks.

Are you relying on Spanning Tree to block any loops in your network? If so you might want to check and see what switch is acting as the root in your Spanning Tree topology.

What I usually do is set up a desktop/laptop to perform a continuous packet trace (WireShark) and save it to the hard disk every 10 MB or so. Later you can go back and examine the general traffic levels during the event to see if there are any surges in broadcast or multicast packets. The difficulty with intermittent problems is capturing the event while it's in progress; hopefully, with that data, you'll find the next breadcrumb.
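A minimal sketch of that after-the-fact review, assuming the capture has already been exported to a list of (timestamp in seconds, destination MAC) pairs. The export step itself, the function names, and the "five times the median" threshold are assumptions chosen for illustration. A destination MAC is treated as multicast when the low bit of its first octet is set, and as broadcast when it is ff:ff:ff:ff:ff:ff.

```python
from collections import Counter, defaultdict


def classify(dst_mac):
    """Classify a destination MAC as 'broadcast', 'multicast' or 'unicast'."""
    mac = dst_mac.lower()
    if mac == "ff:ff:ff:ff:ff:ff":
        return "broadcast"
    first_octet = int(mac.split(":")[0], 16)
    # The least significant bit of the first octet marks a group address.
    return "multicast" if first_octet & 1 else "unicast"


def bucket_traffic(frames, bucket_seconds=1.0):
    """Count frames per traffic class in fixed-width time buckets.

    frames is an iterable of (timestamp_seconds, dst_mac) pairs.
    Returns {bucket_start_time: Counter({class: count})}.
    """
    buckets = defaultdict(Counter)
    for timestamp, dst_mac in frames:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start][classify(dst_mac)] += 1
    return dict(buckets)


def surges(buckets, kind, factor=5.0):
    """Return bucket start times where `kind` exceeds factor x the median count."""
    counts = sorted(b[kind] for b in buckets.values())
    median = counts[len(counts) // 2]
    threshold = factor * max(median, 1)
    return sorted(start for start, b in buckets.items() if b[kind] > threshold)
```

Comparing the surge timestamps against the times of the CPU spikes shows whether flooded traffic lines up with the events or can be ruled out.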

You're running a pretty old version of software, but I believe you should have CP-Limit and rate-limiting functionality. It's enabled per port, so you might want to check it out; it might help alleviate the impact of the problem while you hunt down the issue.

I would definitely take @Flintstone's advice and review the port statistics... that might give you a hint on where to look, perhaps another switch.

It sounds like the IP phones are tripping their watchdog timer (loss of communications to the Communication Server) and rebooting themselves to try and recover.

Let us know what you find.

Good Luck!
We've been helping network engineers, system administrators and technology professionals since June 2009.
If you've found this site useful or helpful, please help me spread the word. Link to us in your blog or homepage - Thanks!

Offline sbenninger

  • Rookie
  • **
  • Posts: 5
Re: Troubleshoot CPU spikes on an 8310
« Reply #5 on: January 16, 2012, 02:12:33 PM »
Thank you for the replies!

I have been doing some packet captures, and using Statistics->IO Graphs I can see the 2-second STP intervals if I apply STP as a filter. When a CPU spike happens I see delays in the STP packets, so it is very easy to pinpoint the CPU spikes in the captured data. Only occasionally do some of the IP sets reboot when the watchdog timer trips, and that is when I see the STP topology change in the syslog, but not for every CPU spike.

Knowing that I can pinpoint the CPU spikes on the IO graph, I compare the STP graph to all captured data as well as to individual protocols like ARP, UDP, etc., and nothing really stands out, at least on the primary subnet/VLAN. I will have to put my laptop onto the other VLANs to see if it is coming from one of the other subnets. The timing between the CPU spikes is random, and on this VLAN there are no similarities in the traffic prior to each CPU event.
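The same "look for late STP packets" trick can be sketched in code, assuming the BPDU arrival times have been exported from the capture as a sorted list of seconds. The 2-second hello interval comes from the capture described above; the 0.5-second tolerance and the function name are assumptions for illustration.

```python
def find_hello_gaps(bpdu_times, hello_seconds=2.0, tolerance=0.5):
    """Return (time_of_last_on_time_bpdu, gap_seconds) for every late BPDU.

    bpdu_times must be sorted arrival times in seconds. A gap is reported
    when the spacing between two consecutive BPDUs exceeds the hello
    interval plus the tolerance.
    """
    gaps = []
    for previous, current in zip(bpdu_times, bpdu_times[1:]):
        interval = current - previous
        if interval > hello_seconds + tolerance:
            gaps.append((previous, interval))
    return gaps


if __name__ == "__main__":
    arrivals = [0.0, 2.0, 4.0, 9.5, 11.5]
    for start, gap in find_hello_gaps(arrivals):
        print(f"BPDU delayed after t={start:.1f}s, gap {gap:.1f}s")
```

The list of gap start times gives a set of spike timestamps that can then be lined up against other traffic or against the syslog.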

Still hunting. I have been looking at port stats and I have a question. I have tried to clear the port stats to get a clean baseline, since the stats cover the entire time the switch has been up, but they never clear (should they?). It looks like some stats are at their maximum values, as many ports show the exact same high value that is no longer increasing.
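If the counters cannot be cleared, one workaround is a software baseline: take two snapshots and subtract. The sketch below assumes the counters are 32-bit and wrap at 2^32, as SNMP Counter32 values do; whether every statistic on this platform behaves that way is an assumption, and the function names are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed counter width; adjust if the counters are 64-bit


def counter_delta(earlier, later, modulus=COUNTER32_MODULUS):
    """Difference between two counter readings, allowing for a single wrap."""
    return (later - earlier) % modulus


def baseline_deltas(before, after):
    """Per-counter deltas between two snapshots taken as {name: value} dicts."""
    return {
        name: counter_delta(before[name], after[name])
        for name in before
        if name in after
    }


if __name__ == "__main__":
    before = {"InReceives": 4294967290, "InDiscards": 100}
    after = {"InReceives": 5, "InDiscards": 250}
    print(baseline_deltas(before, after))
```

A counter that is stuck at a ceiling shows up as a delta of zero between snapshots, and a counter that wraps more than once between snapshots gives an ambiguous delta, so the snapshots should be taken reasonably close together.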

Thanks again!

Scott

Offline sbenninger

  • Rookie
  • **
  • Posts: 5
Re: Troubleshoot CPU spikes on an 8310
« Reply #6 on: January 16, 2012, 02:56:40 PM »
Actually, hold on a second... I previously thought the CPU spikes were random, but after looking closer it seems the spikes are quite consistently 300 s (5 min) apart. I am looking to see if we have anything on the network scheduled to fire every 5 min.
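One way to check for that kind of regularity without eyeballing a graph is to compute the spacing between spike timestamps and see how much it varies. This is only a sketch: the spike times are assumed to have been collected already (for example from the delayed STP packets), and the 10-second jitter limit is an arbitrary illustrative threshold.

```python
from statistics import mean, pstdev


def spike_period(spike_times):
    """Return (mean_interval, jitter) for a sorted list of spike times in seconds.

    Jitter is the population standard deviation of the intervals.
    """
    intervals = [later - earlier for earlier, later in zip(spike_times, spike_times[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three spike times to judge periodicity")
    return mean(intervals), pstdev(intervals)


def looks_periodic(spike_times, max_jitter_seconds=10.0):
    """True when the intervals between spikes vary by less than the jitter limit."""
    _, jitter = spike_period(spike_times)
    return jitter <= max_jitter_seconds


if __name__ == "__main__":
    times = [0, 301, 599, 900, 1200]
    period, jitter = spike_period(times)
    print(f"mean interval {period:.0f}s, jitter {jitter:.1f}s, periodic={looks_periodic(times)}")
```

A mean interval close to a round number such as 300 s with small jitter points towards a timer or a scheduled job rather than user traffic.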

From the Nortel side of things is there anything on the switch side that would fire every 300s?

Thanks again for the assistance!

Scott

Offline bylie

  • Sr. Member
  • ****
  • Posts: 120
Re: Troubleshoot CPU spikes on an 8310
« Reply #7 on: January 16, 2012, 03:31:11 PM »
Hmmm, 300 seconds sounds like the default FDB timeout value. If I recall correctly, the default ARP timeout is 21600 seconds, but if the MAC address disappears from the FDB the ARP entry also disappears, which would cause the CPU to re-ARP for the MAC address if it's needed again. On multiple occasions (in some of the official design guides) Avaya has actually recommended using a non-default FDB timeout of 21601 (1 second longer than the default ARP timeout) to counteract certain problems (bugs?) with this.
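The arithmetic behind the 21601 figure can be sketched as follows. This is a deliberately simplified worst-case model, assuming a host whose FDB entry is never refreshed by its own traffic; the function name is illustrative and real aging behaviour on the switch will be more nuanced.

```python
DEFAULT_ARP_TIMEOUT = 21600  # seconds, the default ARP timeout quoted above


def fdb_expiries_per_arp_lifetime(fdb_timeout, arp_timeout=DEFAULT_ARP_TIMEOUT):
    """Worst-case number of FDB age-outs during one ARP entry lifetime.

    Simplified model: the FDB entry is never refreshed by traffic from the
    host, so it ages out every fdb_timeout seconds.
    """
    return arp_timeout // fdb_timeout


if __name__ == "__main__":
    for fdb_timeout in (300, 21601):
        count = fdb_expiries_per_arp_lifetime(fdb_timeout)
        print(f"FDB timeout {fdb_timeout}s -> up to {count} age-outs per ARP lifetime")
```

With the 300-second default the FDB entry can age out many times before the ARP entry would expire on its own, while a timeout one second longer than the ARP timeout means the ARP entry always expires first.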
« Last Edit: January 16, 2012, 03:47:44 PM by bylie »

Offline Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2517
    • Michael McNamara
Re: Troubleshoot CPU spikes on an 8310
« Reply #8 on: January 17, 2012, 07:01:40 PM »
If you are using the default MAC/FDB timers you'll see the switch re-ARP every 300 seconds.

Are you graphing any MIB values using MRTG/RRD or similar tools? Any management workstations doing SNMP polling?

I would also advise you to review your configuration and make sure everything is neat and tidy. If you have VLANs that are no longer in use but they still have IP interfaces, clean them up. I've found over the years that sometimes dirty configurations can create some issues for the software. Cleaning up the configuration usually helps.

Offline sbenninger

  • Rookie
  • **
  • Posts: 5
Re: Troubleshoot CPU spikes on an 8310
« Reply #9 on: January 30, 2012, 03:19:30 PM »
Thanks everyone for the replies. It was indeed the default FDB timers causing the issue. The CPU spikes are now gone, and I am doing a full review of the infrastructure to clean up old config items, etc.

Thanks,

Scott

Offline Flintstone

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 584
Re: Troubleshoot CPU spikes on an 8310
« Reply #10 on: January 30, 2012, 03:50:43 PM »
Hi Scott,
Thanks for letting us know  ;)

CheerZ