Hi Michael,
The saga is nearly over.
Apologies for the late reply. As will become clear, I was dealing with the fallout yesterday and have only now found some time to update you.
On Wednesday evening, we upgraded the firmware/software to 5.1.2.035 on two stacks (Stack B and Stack C) with no issues. The upgrade on Stack A would not work: Stack A accepted the diags image but not the software image. We then decided to make another unit in Stack A the base unit and tried again. The new base unit accepted the new software but seemed to push it to only 2 of the 7 switches. Suffice it to say, it all came crashing down and the stack could no longer be seen. What we noticed was that while the stack was broken, the spanning tree option appeared again.
In the end we had to upgrade each switch individually. We also rebuilt the entire stack by reconfiguring the uplinks at the back of the switches so that everything cascaded down from the new base unit, adding one switch at a time to the stack. We were hoping this would isolate the dodgy switch, but in the event everything worked, even after a full reboot of the stack.
A few things stood out:
Switch 1 (the old base) did not accept the software image initially. After a reboot of the switch (while disconnected from the stack) it did. This made us believe that this switch was faulty, but it appears not.
Switches 4 and 5 had huge amounts of latency. Ping requests to the switch IP (when disconnected from the stack) had 300-500ms response times. On Switch 4, it took 7 minutes just to transfer the diags image across (via TFTP), though once it rebooted it was all fine! After our experience with Switch 4, we rebooted Switch 5 first and it upgraded normally (i.e. quickly!).
Once the stack had been rebuilt, the original config still stood, but we lost all MLT information (8 MLT Trunks on Switch A) and Switch 5 lost all its config (rate limiting, etc). I think the MLT link information was lost because we were adding one switch at a time to the stack, and since an MLT could span three switches, the Nortel software was clever enough to know it could not be built.
We decided to rebuild the MLTs from scratch (using our backup config file as reference) and all seems to be fine now! The performance issues seem to have disappeared.
We also now have spanning tree on Stack A. Spanning tree has been activated on all switch ports except for MLT links.
It took 7 hours to get this working! We left the office in the early hours of the morning!
I do have a few more questions that I would like your advice on.
1) In your blog post
http://blog.michaelfmcnamara.com/2009/01/hp-nic-teaming-with-nortel-switches/ you say that MLT links from servers to the switch are now considered "old tech". Our ESX servers have 6 NICs, of which 3 are dedicated to virtual machine guest traffic. These three NICs are bonded into an MLT (Trunk) uplinked to the switch, since our ESX servers could host guests belonging to any of 3 VLANs. When we had this issue, my biggest fear was that we might need to reconfigure all our ESX servers if we had to permanently remove a faulty switch, hence my thinking that MLT links are not the way to go. Should I be using Link Aggregation Groups instead?
2) Our stacks are connected via MLT (fibre) as follows:
STACK A -> STACK B (3 Fibre MLT)
STACK A -> STACK C (2 Fibre MLT)
Our load balance type is Basic. Would we be better off using Advanced for these ISLs? (Our ESX MLTs are Advanced as per the Nortel docs.) I ask because I notice that when there is heavy traffic between Stack A and Stack B, one MLT link member seems to be used more than the others. Could this be overloading that switch?
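To illustrate what I suspect is happening, here is a rough Python sketch. It rests on my assumptions, not on anything confirmed: that Basic picks the link member from the source/destination MAC pair, that Advanced picks it from the source/destination IP pair, and that most of the heavy traffic between the stacks shares a small number of MAC pairs. The XOR-and-modulo hash and all the addresses below are made up for illustration and are not Nortel's actual algorithm.

```python
from collections import Counter


def pick_link(src: int, dst: int, n_links: int) -> int:
    """Toy hash (assumption): XOR the two addresses, modulo the link count."""
    return (src ^ dst) % n_links


def distribution(flows, n_links):
    """Return how many flows land on each link member, indexed 0..n_links-1."""
    counts = Counter(pick_link(s, d, n_links) for s, d in flows)
    return [counts.get(i, 0) for i in range(n_links)]


# Hypothetical traffic: 12 flows that all share one MAC pair,
# but come from 12 different source IPs going to one destination IP.
mac_flows = [(0x0A, 0x0B)] * 12
ip_flows = [(100 + i, 200) for i in range(12)]

basic = distribution(mac_flows, 3)     # MAC-based choice on a 3-fibre MLT
advanced = distribution(ip_flows, 3)   # IP-based choice on the same MLT

print("Basic (MAC pair):  ", basic)
print("Advanced (IP pair):", advanced)
```

In this toy case every flow under the MAC-based choice lands on the same link member, while the IP-based choice spreads the same flows over all three. If the real hashing behaves anything like this, it might explain the uneven usage I am seeing.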
Many thanks once again for all your help.
Kind regards