Embedded Flash Write Error on N7K

I came across an unusual issue the other day while attempting to save my changes to startup-config on an N7004. I had never seen this one in the wild before, so it was great exposure and a troubleshooting exercise I enjoyed. Here are my findings.

Each Nexus 7000 Supervisor 2/2E is equipped with two identical onboard eUSB flash devices in a RAID1 configuration. Over years in service, one of these devices may get disconnected from the USB bus, which causes the RAID software to drop the affected device from its configuration. The system can still function normally with the remaining working device.

However, if the second flash device experiences a similar issue and also drops out of the RAID array, the bootflash devices are remounted as read-only, which prevents configuration copying.

 

N7k# wr
[########################################] 100%
Configuration update aborted: request was aborted

So apparently I cannot save my configuration. Why? (No, write mem or wr is not supported in NX-OS, but I’m lazy and created an alias for it.)

Executing the ‘show module’ command, I can see that my supervisor isn’t so happy.
N7k# show module

Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
2    0      Supervisor module-2                 N7K-SUP2           active *
3    24     10 Gbps Ethernet Module             N7K-M224XP-23L     ok
4    48     10/100/1000 Mbps Ethernet XL Module N7K-M148GT-11L     ok

Mod  Online Diag Status
---  ------------------
2    Fail
3    Pass
4    Pass
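
If you capture this output from many switches, the diag check can be scripted. Here is a minimal Python sketch that picks out modules whose Online Diag Status is anything other than Pass. The function name failed_diag_modules is my own, and the sketch assumes the output layout shown above; it is not anything shipped with NX-OS.

```python
# Sample copied from the `show module` output above (trimmed to three modules).
SAMPLE = """\
Mod  Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
2    0      Supervisor module-2                 N7K-SUP2           active *
3    24     10 Gbps Ethernet Module             N7K-M224XP-23L     ok
4    48     10/100/1000 Mbps Ethernet XL Module N7K-M148GT-11L     ok

Mod  Online Diag Status
---  ------------------
2    Fail
3    Pass
4    Pass
"""


def failed_diag_modules(show_module_text):
    """Return the slot numbers whose Online Diag Status is not 'Pass'."""
    failed = []
    in_diag = False
    for line in show_module_text.splitlines():
        fields = line.split()
        # The diag section starts at its own "Mod  Online Diag Status" header.
        if line.strip().startswith("Mod") and "Online Diag Status" in line:
            in_diag = True
            continue
        if in_diag and len(fields) == 2 and fields[0].isdigit():
            if fields[1].lower() != "pass":
                failed.append(int(fields[0]))
        elif in_diag and fields and fields[0] == "Mod":
            # Another "Mod ..." header means a different section has started.
            in_diag = False
    return failed


if __name__ == "__main__":
    print("Modules failing online diagnostics:", failed_diag_modules(SAMPLE))
```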

Let’s dig a bit deeper to see what the failure is related to.

I know that my supervisor is in module number 2. So with the command ‘show module internal exceptionlog module 2’ I can view diagnostic information about module 2.

N7k# show module internal exceptionlog module 2

********* Exception info for module 2 ********

exception information --- exception instance 1 ----
Module Slot Number: 2
Device Id         : 0
Device Name       : undef
Device Errorcode  : 0x00000000
Device ID         : 00 (0x00)
Device Instance   : 00 (0x00)
Dev Type (HW/SW)  : 00 (0x00)
ErrNum (devInfo)  : 00 (0x00)
System Errorcode  : 0x418b001e The compact flash power test failed
Error Type        : Warning
PhyPortLayer      : 0x0
Port(s) Affected  : none
Error Description : Compact Flash test failed
DSAP              : 0 (0x0)
UUID              : 483 (0x1e3)
Time              : Wed Oct 19 21:32:20 2016
(Ticks: 58081EA4 jiffies)

This is interesting. The flash on the supervisor is clearly having an issue.

To check the status of the RAID, enter the show system internal file /proc/mdstat command. If the system has a standby supervisor, attach to it and run the command there as well. I am executing this on a single-sup N7004.

N7k# show system internal file /proc/mdstat
Personalities : [raid1]
md6 : active raid1 sdc6[2](F) sdb6[1]
77888 blocks [2/1] [_U]

md5 : active raid1 sdc5[2](F) sdb5[1]
78400 blocks [2/1] [_U]

md4 : active raid1 sdc4[2](F) sdb4[1]
39424 blocks [2/1] [_U]

md3 : active raid1 sdc3[2](F) sdb3[1]
1802240 blocks [2/1] [_U]

unused devices: <none>

In the previous output there are four partitions, md3 through md6, mounted to store boot images and other persistent configuration data. For each partition, a status of [2/2] indicates that two disks are configured and two are currently running. [UU] shows the state of each disk, with “U” marking a disk that is up and running.

Any status other than [2/2] [UU] might indicate a degraded RAID array, with each failed disk displayed as “_”. For example, a status of “[2/1] [_U]” or “[2/1] [U_]” indicates a degraded RAID array. In my output, all four arrays show [2/1] [_U], and the sdc partitions carry the (F) faulty flag.
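
The [configured/running] and [U_] fields are easy to check in a script. Here is a minimal Python sketch, assuming the mdstat layout shown above, that lists every array that is not fully up. The name find_degraded and the regular expressions are my own illustration, not part of any Cisco tooling.

```python
import re

# Sample copied from the `show system internal file /proc/mdstat` output above.
SAMPLE = """\
Personalities : [raid1]
md6 : active raid1 sdc6[2](F) sdb6[1]
77888 blocks [2/1] [_U]

md5 : active raid1 sdc5[2](F) sdb5[1]
78400 blocks [2/1] [_U]

md4 : active raid1 sdc4[2](F) sdb4[1]
39424 blocks [2/1] [_U]

md3 : active raid1 sdc3[2](F) sdb3[1]
1802240 blocks [2/1] [_U]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+)\s*:")                    # e.g. "md6 : active raid1 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")  # e.g. "[2/1] [_U]"


def find_degraded(mdstat_text):
    """Return {array: (configured, running, flags)} for arrays that are not fully up."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            configured = int(status.group(1))
            running = int(status.group(2))
            flags = status.group(3)
            # Degraded when fewer disks run than are configured, or any slot shows "_".
            if running < configured or "_" in flags:
                degraded[current] = (configured, running, flags)
            current = None
    return degraded


if __name__ == "__main__":
    for name, (configured, running, flags) in sorted(find_degraded(SAMPLE).items()):
        print(f"{name}: {running}/{configured} disks running [{flags}]")
```

A healthy supervisor, where every array reports [2/2] [UU], gives an empty result.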

It is recommended to recover the offline disks and get them added back into a RAID array as soon as possible.

N7k# show system internal raid | grep -A 1 "Current RAID status info"

Current RAID status info:
RAID data from CMOS = 0xa5 0xc3

The last byte of the RAID data indicates which flash devices have failed.

0xf0 ==>> No failures reported
0xe1 ==>> Primary flash failed
0xd2 ==>> Mirror flash failed
0xc3 ==>> Both primary and mirror failed

In my case, both flash devices have failed.
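
The code-to-meaning mapping above is simple enough to script. Here is a minimal Python sketch that pulls the last byte out of the "RAID data from CMOS" line and translates it. The function name decode_raid_status is my own, and the table holds only the four codes listed above; anything else is reported as unknown rather than guessed at.

```python
import re

# Status codes exactly as listed above; no other values are assumed.
RAID_STATUS = {
    0xF0: "No failures reported",
    0xE1: "Primary flash failed",
    0xD2: "Mirror flash failed",
    0xC3: "Both primary and mirror failed",
}

CMOS_RE = re.compile(
    r"RAID data from CMOS\s*=\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def decode_raid_status(output):
    """Translate the last byte of the 'RAID data from CMOS' line into text."""
    match = CMOS_RE.search(output)
    if not match:
        raise ValueError("no 'RAID data from CMOS' line found")
    code = int(match.group(2), 16)  # the second (last) byte carries the status
    return RAID_STATUS.get(code, f"Unknown status 0x{code:02x}")


if __name__ == "__main__":
    sample = "Current RAID status info:\nRAID data from CMOS = 0xa5 0xc3\n"
    print(decode_raid_status(sample))
```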

Cisco Field Notice FN - 63975 describes this issue and its resolution. The field notice associates the issue with a documented bug (CSCus22805). With the help of the FN and the bug documentation, I was able to resolve the issue. If you have a CCO account, you can download the Flash Recovery Tool noted in the FN. The download also includes a really well-written readme file.
Here is the readme file if you are interested: flash_recovery_tool_readme

 

Mike

 

 
