NetApp SVM : Non-Operational


A diagnostic opportunity presented itself yesterday and, as with many things, I grabbed the metaphorical bull 🐂 by the horns to take a look 👀 at damage control and remediation.
NetApp - what's that?

A NetApp appliance consists of a filer and disk shelves: the filer does all the heavy lifting and manages the storage virtual machines, and the disk shelves, well, are disk shelves.

If you have never seen one before, it looks much like any other storage appliance. In the picture below, the green box is the "filer" and the red boxes are the "disk shelf" units.


Quick NetApp filer overview

As with many enterprise products, you will usually need to manage some form of cluster for resilience, so you typically have a management address for the filer as a whole and then individual management addresses for each cluster management processor.

The concept is pretty simple: the cluster service manages the disks and network cards, and on top of that you have your SVMs (Storage Virtual Machines) running on the cluster - essentially like virtual machines on a hypervisor - which provide, in this case, SMB shares.

Yes, you can also do NFS, iSCSI and direct-attached storage. Regardless of which protocols the SVM serves, I noticed that the SVMs (in this case, both of ours) were offline - and we need to be careful with the terminology here, as there is a difference between being offline and not running.

Do you have a diagram of the NetApp storage controller?

Absolutely - a picture is worth 1000 words, as they say. In the diagram below, the bottom section shows the disks and the aggregates, and the top section shows the storage controller, which is where your SVMs run:

In this example, the storage controller was partially online, but the SVMs were offline.


How was it determined it was offline?

The offline status was determined by the fact that the SMB port (TCP 445) was not responding and the SVM address could not be pinged.

Logically, I therefore drew the conclusion that the SVM was not running; as you will find out later on, that conclusion was incorrect.
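Those two checks (a TCP connect on port 445, plus a ping) are easy to script. Here is a minimal sketch of the port test in Python, using the example SVM address from this post - swap in your own address:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the SMB port on the SVM address used in this post.
# A False result here is what first suggested the SVM was "offline".
print("SMB (TCP 445) reachable:", is_port_open("10.70.19.50", 445, timeout=2.0))
```

A `False` here only tells you the service is unreachable, not *why* - which is exactly the trap described below.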

Connecting to the Cluster Node IP

To look at the problem, you have the cluster management interface (which is not required here) and then a cluster node IP, which is the management address for each "controller" in the cluster - this is what we want to look at for more information.

In this example we will use these names and IP addresses:

SMBController-01 with the IP 10.70.19.50
SMBController-02 with the IP 10.70.19.51

I use good old "PuTTY" to connect to these devices, but there are many other SSH clients. From the client, connect to the IP address (unless you have DNS records for these hosts), set the port to 22 (for SSH) and then click Open as below:


This will then show you a login status screen like this:

login as: admin
admin@10.70.19.50's password:
SP SMBController-01>

You are now logged into the SP on SMBController-01; you now need to enter "system console" and then view the SVM status with these commands:

system console
vserver show

This will return the status as below:

SMBController::> vserver show

                               Admin      Operational Root
Vserver     Type    Subtype    State      State       Volume     Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SMBController    admin   -          -          -           -          -
SMBController-01 node    -          -          -           -          -
SMBController-02 node    -          -          -           -          -

Warning: Unable to list entries on node SMBController-02. RPC: Couldn't make
         connection [from mgwd on node "SMBController-01" (VSID: -1) to mgwd at
         x.x.x.x]

SMB01       data    default    running    -           SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    -           SMB02_root aggr1_
                                                                 node2
5 entries were displayed.

Immediately you will notice that the SVMs, here called SMB01 and SMB02, are actually running but are not operational - which means technically they are offline. Also notice the error above the SVM entries: it shows that SMBController-01 cannot talk to SMBController-02:

Warning: Unable to list entries on node SMBController-02. RPC: Couldn't make
         connection [from mgwd on node "SMBController-01" (VSID: -1) to mgwd at
         x.x.x.x]

SMB01       data    default    running    -           SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    -           SMB02_root aggr1_
                                                                 node2
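Illustratively, that admin-state vs operational-state distinction can be picked out of the `vserver show` output with a little parsing. A rough sketch - the field positions are an assumption based on the sample output above, and wrapped continuation rows (like the aggregate name) are simply skipped:

```python
def find_degraded_svms(vserver_show_output: str) -> list[str]:
    """Return names of data vservers whose admin state is 'running'
    but whose operational state is not 'running'."""
    degraded = []
    for line in vserver_show_output.splitlines():
        fields = line.split()
        # Data-SVM rows in the sample look like:
        # name, "data", subtype, admin-state, operational-state, root-volume, ...
        if len(fields) >= 5 and fields[1] == "data":
            name, admin_state, op_state = fields[0], fields[3], fields[4]
            if admin_state == "running" and op_state != "running":
                degraded.append(name)
    return degraded

sample = """\
SMB01       data    default    running    -           SMB01_root aggr1_
SMB02       data    default    running    -           SMB02_root aggr1_
"""
print(find_degraded_svms(sample))  # → ['SMB01', 'SMB02']
```

Both SVMs show up as degraded: running at the admin level, but with no operational state.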

This means we need to connect to the other controller, SMBController-02, to see what it can see (keep your PuTTY session to SMBController-01 open - it will be needed later). However, when you connect to the other controller, you get the same login:

login as: admin
admin@10.70.19.51's password:
SP SMBController-02>

However when we try to run this command:

system console

That does not return the ONTAP console; it returns a "LOADER-B" prompt, which is not right:

SP SMBController-02> system console
Type Ctrl-D to exit.
LOADER-B> 

We also have a limited command set here, which suggests the cluster node is not in a happy state:

Available commands: netboot, boot_diags, boot_backup, boot_primary, boot_ontap, update_flash, sp, flash, version, bye, set, lsmod, autoboot, go, boot, load, ndp, ping, arp, ifconfig, show, savenv, saveenv, unsetenv, set-defaults, setenv, printenv, undi, help

So if we try to run 

vserver show

We get this from the command processor, which is not correct:

vserver show
Invalid command: "vserver"
Available commands: netboot, boot_diags, boot_backup, boot_primary, boot_ontap, update_flash, sp, flash, version, bye, set, lsmod, autoboot, go, boot, load, ndp, ping, arp, ifconfig, show, savenv, saveenv, unsetenv, set-defaults, setenv, printenv, undi, help

This is not right - the node should not be sitting at "LOADER-B" - so I would imagine that the device has not booted correctly. Let's start the boot process with this command:

LOADER-B> bye

Once you issue this command you will see that the controller starts its boot process as below:

BIOS version: 9.8
Portions Copyright (c) 2011-2017 NetApp. All Rights Reserved
Phoenix SecureCore Tiano(TM)
Copyright 1985-2024 Phoenix Technologies Ltd.
All Rights Reserved

Build Date: 09/16/2019
**********************************************
*                    9.8                     *
*     ==================================     *
*           PHOENIX SC-T 2009-2024           *
**********************************************
CPU = 1 Processor(s) Detected, Cores per Processor = 6
Intel(R) Xeon(R) CPU E5-2620 @ 2.00GHz
24576 MB System RAM Installed
256 KB L2 Cache
System BIOS shadowed
Video BIOS shadowed
USB Device: MICRON eUSB DISK

Boot Loader version 5.7
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2017 NetApp, Inc. All Rights Reserved.

Starting AUTOBOOT press Ctrl-C to abort...
Loading X86_64/freebsd/image2/kernel:0x200000/14337056 0xfac420/13552104 Entry at 0xffffffff802cdc30
Loading X86_64/freebsd/image2/platform.ko:0x1c99000/2805016 0x1f46000/455200 0x1fb5220/558928
Starting program at 0xffffffff802cdc30
NetApp Data ONTAP 9.3P18

Copyright (C) 1992-2019 NetApp.
All rights reserved.
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
cryptomod_fips: Executing Crypto FIPS Self Tests.
cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.

Aug 21 20:25:35 Battery charge capacity: 3584 mA*hr. Power outage protection flash de-staging cycles: 110

You will know the boot has completed when the prompt looks like this:

Wed Aug 21 20:28:03 BST 2024
SP-login: login:

Once you get here, you can log in with your account again and the controller should be back online. Now that this node has booted, switch back to the PuTTY session on SMBController-01 (reconnect if disconnected) and run this command:

SMBController-01::> vserver show

This will now show that the SVMs are running and operational, as below, and you will notice that the error about SMBController-02 is no longer present:

                               Admin      Operational Root
Vserver     Type    Subtype    State      State       Volume     Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SMBController    admin   -          -          -           -          -
SMBController-01 node    -          -          -           -          -
SMBController-02 node    -          -          -           -          -

SMB01       data    default    running    running     SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    running     SMB02_root aggr1_
                                                                 node2
5 entries were displayed.

This means the controller is now back online, but the question remains: why did it go offline and fail in the first place?

Identify the cause of the failure

When SMBController-02 was booting up, I noticed this in the event log, which gives a clue to the problem at hand - it would appear to be broken disks:

Aug 21 20:27:01 [SMBController-02:monitor.brokenDisk.notice:notice]: When two disks are broken in raid_dp volume, the system shuts down automatically every 2400 hours to encourage you to replace the disk. If you reboot the system, it will run for another 2400 hours before shutting down.

This means that if the disks are not replaced, we have 2400 hours (100 days) before this event occurs again. So, do we have failed disks?
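As a quick sanity check on that arithmetic, and to work out when the next automatic shutdown would land (the boot completion time is taken from the log above):

```python
from datetime import datetime, timedelta

RAID_TIMEOUT_HOURS = 2400  # raid.timeout value reported by the event above

# The shutdown fires raid.timeout hours after a reboot, so from the boot
# completion time in this post we can work out the next automatic shutdown.
boot_time = datetime(2024, 8, 21, 20, 28)
next_shutdown = boot_time + timedelta(hours=RAID_TIMEOUT_HOURS)

print(RAID_TIMEOUT_HOURS // 24)                   # → 100 (days)
print(next_shutdown.strftime("%Y-%m-%d %H:%M"))   # → 2024-11-29 20:28
```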

Aug 21 20:26:52 [SMBController-02:raid.config.spare.disk.failed:error]: Spare Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] S/N [KPJ6VRJF] UID [5000CCA0:167D3640:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.

Aug 21 20:26:52 [SMBController-02:disk.failmsg:notice]: Disk 0b.04.11 (KPJ6VRJF): Predictive Failure PFA (0x01), ASC(0x5d), ASCQ(0x90), FRU(0x90). 0 Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] S/N [KPJ6VRJF] UID [5000CCA0:167D3640:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]

Yes - we have a failed disk with a predictive failure (PFA) that needs to be replaced to stop this shutdown happening every 2400 hours.
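If you have to pull these details out of many event-log lines, a small regex helps. This is a sketch keyed to the exact message format shown above; ONTAP message formats can differ between releases, so treat the pattern as an assumption:

```python
import re

# Pattern based on the disk-failure log lines shown above; message formats
# can vary between ONTAP releases, so treat this as a sketch.
DISK_FAIL_RE = re.compile(
    r"Disk (?P<disk>\S+) Shelf (?P<shelf>\d+) Bay (?P<bay>\d+) "
    r"\[(?P<model>[^\]]+)\] S/N \[(?P<serial>\w+)\]"
)

log_line = (
    "Aug 21 20:26:52 [SMBController-02:raid.config.spare.disk.failed:error]: "
    "Spare Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] "
    "S/N [KPJ6VRJF] failed."
)

m = DISK_FAIL_RE.search(log_line)
if m:
    print(f"Replace disk {m['disk']} (shelf {m['shelf']}, bay {m['bay']}, "
          f"serial {m['serial']})")
# → Replace disk 0b.04.11 (shelf 4, bay 11, serial KPJ6VRJF)
```

The shelf, bay and serial number are exactly what you need when raising the hardware replacement.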

Check the "timeout" value before shutdown

If you wish to check the timeout value before shutdown, connect to SMBController-01, log in, and get to the system console with this command:

system console

You then need to run this command to view the timeout value, which is in hours, as below:

SMBController-01::> storage raid-options show -node SMBController-01 raid.timeout

That will return the current runtime values:

Node: SMBController-01
Option Name: raid.timeout
Option Value: 2400
Option Constraint: only_one

You also need to check the other controller to make sure it has the same value:

SMBController-01::> storage raid-options show -node SMBController-02 raid.timeout

That should return the same value for SMBController-02, as it does below:

Node: SMBController-02
Option Name: raid.timeout
Option Value: 2400
Option Constraint: only_one

Need to amend the "timeout" value?

If you need to amend this timeout value (the maximum is 2400 hours), you can run these commands to set it to your desired value:

Warning : This is no substitute for replacing failed or predictively failed disks, but it does mean you get more time to react to a failed disk. You can lose data if too many disks fail in these arrays, so please keep maintenance and housekeeping strict.

storage raid-options modify -node SMBController-01 option raid.timeout -value 2200
storage raid-options modify -node SMBController-02 option raid.timeout -value 2200