• February 11, 2012, 09:53:04 AM

Author Topic: Nortel 8600 4.1.4 CPU Overload  (Read 2038 times)


Offline reptou

  • Rookie
  • **
  • Posts: 3
Nortel 8600 4.1.4 CPU Overload
« on: January 29, 2010, 05:13:32 PM »
Hi !

My backbone consists mainly of two Nortel 8600s running version 4.1.4 (I know, not the most recent, but I am not able to upgrade at the moment) with only one 8692SF CPU card. Everything worked without problems for many years, but for the past few days the CPU of one of these 8600s has been running very high during the day, staying at 100% for hours for no apparent reason.

Here are the steps I have taken to debug this issue, without any result so far:

1) The 8600 log file shows that the IST links between the two 8600s time out when CPU utilization remains at 100% for more than 50 seconds. We run OSPF too, and adjacencies are recomputed when this occurs.
2) The log file does not show any "CPU Warning" related to broadcast or multicast storms. Ext CP-Limit is configured and the ports are in SoftMode, but once the CPU utilization threshold is reached nothing more happens, which suggests that no link is being used abnormally.
3) We obtained the privilege password from Nortel in order to check the process table, and only the MainTask and IDLE are consuming CPU cycles. I assume an IDLE process should not lead to a CPU overload.
4) It is quite difficult to run CPU traces because of the CPU utilization.
5) We tried replacing the CPU card with another (identical) one, but the same issue occurs.
6) The defective 8600 now processes only a small part of the overall network traffic and does not perform any layer 3 tasks. The CPU load is still huge for what it is doing.



I have no more ideas about what tests I could run to resolve this. Maybe upgrading the version would be a solution, but that is not possible for the moment.

Have you already faced this kind of problem, and is there any other action I could take?

Many thanks in advance !








Online Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2164
    • Michael McNamara
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #1 on: January 29, 2010, 06:45:26 PM »
Hi reptou and welcome to the forums!

I'm sorry to hear that you're having problems. I never had issues with the 4.1.4.x software branch so I wouldn't jump to upgrading until you've exhausted some other options. Are you having issues passing network traffic when the CPU is at 100%? I would guess yes but you didn't really say. How did you find out about this problem?

Before I go too far, here's a document that might help explain some of the things you can look at;
cpuload_v14.pdf

In my experience CPU load issues are usually caused by a bad configuration or by some abnormal network traffic. Since it doesn't sound like you've made any configuration changes recently I would tend to focus on abnormal network traffic. Examples of abnormal network traffic include a virus/trojan infected personal computer, a misbehaving network switch/router and the inevitable network loop caused by some end user or inattentive network analyst/engineer.

I would immediately focus on the possibility that you have a loop somewhere in your network. This should be the easiest to rule out and eliminate from the list of possible causes. You say you are using ext-cp-limit; are you using cp-limit? What values have you configured for broadcast and multicast for cp-limit? Have you enabled cp-limit on all uplinks/downlinks in and out of your ERS 8600 switch? Are you using rate-limiting on your uplinks/downlinks (actually you can use it on all ports)?

I would suggest you start with rate-limiting because it's accomplished in hardware and has no reliance on how busy the CPU is. Unfortunately cp-limit and ext-cp-limit both rely on the CPU, so if they don't catch the problem fast enough those mechanisms will never kick in because the CPU is too far gone by the time they try (that's where the values you are using for multicast and broadcast come into play).

Unless you have some monster multicast applications you should be pretty safe enabling rate-limiting. See if that makes any difference; if it improves the performance of your CPU but your network is still performing poorly or you're having connectivity issues, then you most definitely have a loop somewhere.

Have a look at this post a few weeks back for some standard values; http://forums.networkinfrastructure.info/nortel-ethernet-switching/broadcast-and-multicast-rate-limit-values-best-practice/

I can dig up the CLI commands if you need them (don't have an ERS 8600 in front of me right now).

There are quite a few folks here that are very knowledgeable and happy to help.

If possible please keep us updated on your progress.

Good Luck!
If you've found this site useful and helpful, please help me spread the word. Link to us in your blog or homepage or Tweet about us! - Thanks!

Offline qazzie

  • Full Member
  • ***
  • Posts: 92
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #2 on: January 31, 2010, 05:35:48 PM »
Hmm interesting...

What information does -show mlt ist info stat- provide? Run this command several times to see how many MAC address entries are learnt over the IST. Since you're using an IST, you probably use SMLT as well... in these situations, break the redundancy links and go back to normal d-mlt's for a period of time.

You said the IST is timing out; that's pretty hairy. There are known issues within the 4.1 streams related to smlt/ist. Have you or anyone else made some radical changes, like rolling out WLAN?

Can you say something about directed broadcast: is it used or enabled? I think it was enabled by default in the early 4.1 stream. That could result in programs actively using it, with CPU spikes as a result.

You ran a spyreport and saw that tMainTask was the CPU-intensive task. Then you really have to focus on making CPU traces. It has to be somewhere in there. Levels 15 3 and 8 3 give some good information most of the time.

If you do show sys perf, you see high CPU. But that command also gives information about switchfabric and bufferutil... What are those values? A bufferutil higher than 1% can give some indication. Repeat that command at least 5 times with a 3 sec interval.
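
The repeated sampling suggested above could be scripted along these lines. This is only a rough sketch: `read_perf` is a hypothetical callable that you would have to write yourself to run show sys perf and return the three values, and the dictionary keys `cpu_util`, `switch_fabric_util` and `buffer_util` are made-up names, since the exact output format of the command is not shown in this thread.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, float]


def collect_samples(
    read_perf: Callable[[], Sample],
    count: int = 5,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Call read_perf() `count` times, waiting `interval_s` between calls.

    read_perf is a hypothetical helper (not part of any Nortel tool) that
    returns one reading as a dict of percentages.
    """
    samples: List[Sample] = []
    for i in range(count):
        samples.append(read_perf())
        if i < count - 1:  # no need to wait after the last reading
            sleep(interval_s)
    return samples


def buffer_warnings(samples: List[Sample], threshold_pct: float = 1.0) -> List[Sample]:
    """Return the readings whose buffer utilization exceeds the threshold."""
    return [s for s in samples if s["buffer_util"] > threshold_pct]
```

With the defaults this takes five readings three seconds apart; any reading returned by `buffer_warnings` has a buffer utilization above 1% and is worth a closer look.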

q

Offline reptou

  • Rookie
  • **
  • Posts: 3
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #3 on: February 01, 2010, 08:13:57 AM »
Hi Michael, Qazzie,

First of all, thank you for your help !

Even though no broadcast storm has been detected on the 8600, I have set up all the ports except the IST one to use rate limiting as described in the link posted by Michael.

Qazzie, to answer your question: there is no utilization of the switch fabric (sometimes 1%, no more) or of the buffer when the CPU goes high.

However, I have moved one step further toward resolving this issue. We noticed that it comes from the combination of two facts: the defective 8600 is Master of one particular VRRP, and Backup Master is enabled for this VRRP. When we force the second 8600 to become Master by increasing its priority and deactivating the Backup Master functionality (which had been activated for a long time), the CPU utilization goes down immediately and does not increase on the other side.

The behavior is quite strange, and we think the issue could come from some unicast traffic coming from the subnet attached to this VRRP, which acts as the gateway. What we cannot explain at the moment is why the issue does not occur on the second 8600 when the VRRP transition is performed.

Once again, thanks for your help guys !

Online Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2164
    • Michael McNamara
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #4 on: February 01, 2010, 07:09:55 PM »
How many VRRP interfaces do you have configured on your switch cluster?

Are you using unique VRRP IDs across your switch? There is a bug when you use the same VRRP ID for multiple VLANs, say VRRP ID 1 for VLAN 1, VRRP ID 1 for VLAN 2, VRRP ID 1 for VLAN 3, etc.
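
A quick way to check for reused VRRP IDs, assuming you can export a VLAN-to-VRRP-ID mapping from your configuration, might look like the sketch below. The function name `shared_vrids` and the input mapping are purely illustrative; how you extract the mapping from the switch config is left open.

```python
from collections import defaultdict
from typing import Dict, List


def shared_vrids(vlan_to_vrid: Dict[int, int]) -> Dict[int, List[int]]:
    """Return every VRRP ID used by more than one VLAN.

    The input maps VLAN ID -> VRRP ID (a hypothetical export of the config).
    The result maps each reused VRRP ID -> sorted list of VLANs using it.
    """
    by_vrid: Dict[int, List[int]] = defaultdict(list)
    for vlan, vrid in vlan_to_vrid.items():
        by_vrid[vrid].append(vlan)
    return {vrid: sorted(vlans) for vrid, vlans in by_vrid.items() if len(vlans) > 1}


# VRRP ID 1 reused on VLANs 1, 2 and 3; VLAN 10 has its own ID
print(shared_vrids({1: 1, 2: 1, 3: 1, 10: 10}))  # {1: [1, 2, 3]}
```

An empty result means every VLAN has its own VRRP ID.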

We discovered this software bug when we tried to upgrade to 4.1.8.x software. While it worked in the past it created some serious issues with the switch where CPU performance would go to 100% and the switch would essentially run itself into the ground and become unresponsive.

We've since started converting the majority of our interfaces to RSMLT as opposed to using VRRP. RSMLT is easier on the CPU and scales better when you have 200+ VLANs. I'm not sure RSMLT is available in 4.1.4.x software so that might not be an option for you.

Good Luck!
« Last Edit: February 01, 2010, 07:13:29 PM by Michael McNamara »

Offline qazzie

  • Full Member
  • ***
  • Posts: 92
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #5 on: February 02, 2010, 04:00:43 PM »
I agree that RSMLT is the better way to go than VRRP nowadays. However, it is not recommended on older (aka 4.1.4) code. There is a -bug-, or whatever one calls it, where the config doesn't save the peer's MAC address, which is used with RSMLT. VRRP with different vr-id's, as mentioned by Michael, is great to use in a backup-master setup.

Offline stauftm

  • Full Member
  • ***
  • Posts: 68
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #6 on: February 02, 2010, 06:00:13 PM »
With the rate-limiting, I apply this to my edge 5500 switches.

#interface fastethernet all
#rate-limit both 10

On the core and server ports do you recommend this also?

Thanks in advance
Todd

Online Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2164
    • Michael McNamara
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #7 on: February 02, 2010, 06:42:22 PM »
You should be more restrictive, since 10% of a 1000Mbps link is 100Mbps and that's a lot of broadcast/multicast traffic.

That amount of traffic could easily flood your network and take everything down including your core switches.

I currently use 5% on all my edge switches and some folks would argue that's still a lot of traffic.
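
To make the arithmetic concrete, here is a small sketch that converts a rate-limit percentage into the broadcast/multicast bandwidth it allows. The helper name `rate_limit_mbps` is just for illustration and assumes the limit is expressed as a percentage of link speed, as in the figures above.

```python
def rate_limit_mbps(link_speed_mbps: float, percent: float) -> float:
    """Bandwidth (in Mbps) allowed by a percentage-based rate limit."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return link_speed_mbps * percent / 100.0


print(rate_limit_mbps(1000, 10))  # 100.0 Mbps on a gigabit link
print(rate_limit_mbps(1000, 5))   # 50.0 Mbps on a gigabit link
print(rate_limit_mbps(100, 10))   # 10.0 Mbps on a 100Mbps link
```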

Refer to the post below for additional discussions on the topic of rate-limiting;
http://forums.networkinfrastructure.info/nortel-ethernet-switching/broadcast-and-multicast-rate-limit-values-best-practice/

Cheers!

Offline reptou

  • Rookie
  • **
  • Posts: 3
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #8 on: March 02, 2010, 06:22:42 PM »
Hi !

Sorry for this - slight - delay in answering! First of all, I would like to thank you for suggesting some paths to follow in order to troubleshoot this issue. This will surely be useful for preventing any looping issues in the future!

For information, this CPU load increase was due to an interface of one of the ERS core switches using two links to each Nortel 8610 with OSPF and ECMP enabled. One of these links was flapping, and the core 8610 on which the CPU load was critical was not able to correctly compute the new SPF paths. After physically changing the port during the RMA process, the issue was no longer present on the system.


Online Michael McNamara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2164
    • Michael McNamara
Re: Nortel 8600 4.1.4 CPU Overload
« Reply #9 on: March 02, 2010, 07:44:56 PM »
I'm glad to hear you were able to resolve the problem. Thanks for posting back here with an update!