NetApp SVM : Non-Operational


A diagnostic opportunity presented itself yesterday and, as with many things, I grabbed the metaphorical bull 🐂 by the horns to take a look 👀 at damage control and remediation.
NetApp - what's that?

A NetApp appliance consists of a filer and disk shelves: the filer does all the heavy lifting and manages the storage virtual machines, and the disk shelves, well, are disk shelves.

If you have never seen one before, it looks much like any other storage appliance. In the picture below, the green box is the "filer" and the red boxes are the "disk shelf" units.


Quick NetApp filer overview

As with many enterprise products, you will usually need to manage some form of cluster for resilience, so you typically have a management address for the filer as a whole and then individual management addresses for each cluster management processor.

The concept is pretty simple: the cluster service manages the disks and network cards, and on top of that you have your SVMs (Storage Virtual Machines) running on the cluster - essentially like virtual machines on a hypervisor - which provide, in this case, SMB shares.

Yes, you can also do NFS, iSCSI and direct-attached storage. Regardless of which protocols the SVM serves, I noticed that the SVMs (in this case, both of ours) were offline - and we need to be careful with the terminology here, as there is a difference between being offline and not running.

Do you have a diagram of the NetApp storage controller?

Absolutely - a picture is worth 1000 words, as they say. In the diagram below, the bottom section shows the disks and the aggregates, and the top section shows the storage controller, which is where your SVMs run:

In this example, the storage controller was partially online, but the SVMs were offline.


How was it determined it was offline?

The offline status was determined by the fact that the SMB port (TCP 445) was not responding and the SVM address could not be pinged.

Logically, I therefore drew the conclusion that the SVM was not running; as you will find out later on, that conclusion was incorrect.
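Those two checks (a TCP connect on port 445, plus a ping) are easy to script. Here is a minimal sketch of the port test in Python, using the example SVM address from this post - swap in your own address:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the SMB port on the SVM address used in this post.
# A False result here is what first suggested the SVM was "offline".
print("SMB (TCP 445) reachable:", is_port_open("10.70.19.50", 445, timeout=2.0))
```

A `False` here only tells you the service is unreachable, not *why* - which is exactly the trap described below.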

Connecting to the Cluster Node IP

To look at the problem, you have the cluster management interface (which is not required here) and then a cluster node IP, which is the management address for each "controller" in the cluster - this is what we want to look at for more information.

In this example we will use these names and IP addresses:

SMBController-01 with the IP 10.70.19.50
SMBController-02 with the IP 10.70.19.51

I use good old "PuTTY" to connect to these devices, but there are many other SSH clients. From the client, connect to the IP address (unless you have DNS records for these hosts), set the port to 22 (for SSH) and then click Open as below:


This will then show you a login status screen like this:

login as: admin
admin@10.70.19.50's password:
SP SMBController-01>

You are now logged into the SP on SMBController-01; you now need to enter "system console" and then view the SVM status with these commands:

system console
vserver show

This will return the status as below:

SMBController::> vserver show

                               Admin      Operational Root
Vserver     Type    Subtype    State      State       Volume     Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SMBController    admin   -          -          -           -          -
SMBController-01 node    -          -          -           -          -
SMBController-02 node    -          -          -           -          -

Warning: Unable to list entries on node SMBController-02. RPC: Couldn't make
         connection [from mgwd on node "SMBController-01" (VSID: -1) to mgwd at
         x.x.x.x]

SMB01       data    default    running    -           SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    -           SMB02_root aggr1_
                                                                 node2
5 entries were displayed.

Immediately you will notice that the SVMs, here called SMB01 and SMB02, are actually running but are not operational - which means technically they are offline. Also notice the error above the SVM entries: it shows that SMBController-01 cannot talk to SMBController-02:

Warning: Unable to list entries on node SMBController-02. RPC: Couldn't make
         connection [from mgwd on node "SMBController-01" (VSID: -1) to mgwd at
         x.x.x.x]

SMB01       data    default    running    -           SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    -           SMB02_root aggr1_
                                                                 node2
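Illustratively, that admin-state vs operational-state distinction can be picked out of the `vserver show` output with a little parsing. A rough sketch - the field positions are an assumption based on the sample output above, and wrapped continuation rows (like the aggregate name) are simply skipped:

```python
def find_degraded_svms(vserver_show_output: str) -> list[str]:
    """Return names of data vservers whose admin state is 'running'
    but whose operational state is not 'running'."""
    degraded = []
    for line in vserver_show_output.splitlines():
        fields = line.split()
        # Data-SVM rows in the sample look like:
        # name, "data", subtype, admin-state, operational-state, root-volume, ...
        if len(fields) >= 5 and fields[1] == "data":
            name, admin_state, op_state = fields[0], fields[3], fields[4]
            if admin_state == "running" and op_state != "running":
                degraded.append(name)
    return degraded

sample = """\
SMB01       data    default    running    -           SMB01_root aggr1_
SMB02       data    default    running    -           SMB02_root aggr1_
"""
print(find_degraded_svms(sample))  # → ['SMB01', 'SMB02']
```

Both SVMs show up as degraded: running at the admin level, but with no operational state.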

This means we need to connect to the other controller, SMBController-02, to see what it can see (keep your PuTTY session to SMBController-01 open - it will be needed later). However, when you connect to the other controller, you get the same login:

login as: admin
admin@10.70.19.51's password:
SP SMBController-02>

However when we try to run this command:

system console

That does not return the ONTAP console; it returns a "LOADER-B" prompt, which is not right:

SP SMBController-02> system console
Type Ctrl-D to exit.
LOADER-B> 

We also have a limited command set here, which suggests the cluster node is not in a happy state:

Available commands: netboot, boot_diags, boot_backup, boot_primary, boot_ontap, update_flash, sp, flash, version, bye, set, lsmod, autoboot, go, boot, load, ndp, ping, arp, ifconfig, show, savenv, saveenv, unsetenv, set-defaults, setenv, printenv, undi, help

So if we try to run 

vserver show

We get this from the command processor, which is not correct:

vserver show
Invalid command: "vserver"
Available commands: netboot, boot_diags, boot_backup, boot_primary, boot_ontap, update_flash, sp, flash, version, bye, set, lsmod, autoboot, go, boot, load, ndp, ping, arp, ifconfig, show, savenv, saveenv, unsetenv, set-defaults, setenv, printenv, undi, help

This is not right - the node should not be sitting at "LOADER-B" - so I would imagine that the device has not booted correctly. Let's start the boot process with this command:

LOADER-B> bye

Once you issue this command you will see that the controller starts its boot process as below:

BIOS version: 9.8
Portions Copyright (c) 2011-2017 NetApp. All Rights Reserved
Phoenix SecureCore Tiano(TM)
Copyright 1985-2024 Phoenix Technologies Ltd.
All Rights Reserved

Build Date: 09/16/2019
**********************************************
*                    9.8                     *
*     ==================================     *
*           PHOENIX SC-T 2009-2024           *
**********************************************
CPU = 1 Processor(s) Detected, Cores per Processor = 6
Intel(R) Xeon(R) CPU E5-2620 @ 2.00GHz
24576 MB System RAM Installed
256 KB L2 Cache
System BIOS shadowed
Video BIOS shadowed
USB Device: MICRON eUSB DISK

Boot Loader version 5.7
Copyright (C) 2000-2003 Broadcom Corporation.
Portions Copyright (C) 2002-2017 NetApp, Inc. All Rights Reserved.

Starting AUTOBOOT press Ctrl-C to abort...
Loading X86_64/freebsd/image2/kernel:0x200000/14337056 0xfac420/13552104 Entry at 0xffffffff802cdc30
Loading X86_64/freebsd/image2/platform.ko:0x1c99000/2805016 0x1f46000/455200 0x1fb5220/558928
Starting program at 0xffffffff802cdc30
NetApp Data ONTAP 9.3P18

Copyright (C) 1992-2019 NetApp.
All rights reserved.
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
cryptomod_fips: Executing Crypto FIPS Self Tests.
cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.

Aug 21 20:25:35 Battery charge capacity: 3584 mA*hr. Power outage protection flash de-staging cycles: 110

You will know the boot has completed when the prompt looks like this:

Wed Aug 21 20:28:03 BST 2024
SP-login: login:

Once you get here, you can log in with your account again and the controller should be back online. Now that this node has booted, switch back to the PuTTY session on SMBController-01 (reconnect if disconnected) and run this command:

SMBController-01::> vserver show

This will now show that the SVMs are running and operational, as below, and you will notice that the error about SMBController-02 is no longer present:

                               Admin      Operational Root
Vserver     Type    Subtype    State      State       Volume     Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SMBController    admin   -          -          -           -          -
SMBController-01 node    -          -          -           -          -
SMBController-02 node    -          -          -           -          -

SMB01       data    default    running    running     SMB01_root aggr1_
                                                                 node1
SMB02       data    default    running    running     SMB02_root aggr1_
                                                                 node2
5 entries were displayed.

This means the controller is now back online, but the question remains: why did it go offline and fail in the first place?

Identify the cause of the failure

When SMBController-02 was booting up, I noticed this in the event log, which gives a clue to the problem at hand - it would appear to be broken disks:

Aug 21 20:27:01 [SMBController-02:monitor.brokenDisk.notice:notice]: When two disks are broken in raid_dp volume, the system shuts down automatically every 2400 hours to encourage you to replace the disk. If you reboot the system, it will run for another 2400 hours before shutting down.

This means that if the disks are not replaced, we have 2400 hours (100 days) before this event occurs again. So, do we have failed disks?
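As a quick sanity check on that arithmetic, and to work out when the next automatic shutdown would land (the boot completion time is taken from the log above):

```python
from datetime import datetime, timedelta

RAID_TIMEOUT_HOURS = 2400  # raid.timeout value reported by the event above

# The shutdown fires raid.timeout hours after a reboot, so from the boot
# completion time in this post we can work out the next automatic shutdown.
boot_time = datetime(2024, 8, 21, 20, 28)
next_shutdown = boot_time + timedelta(hours=RAID_TIMEOUT_HOURS)

print(RAID_TIMEOUT_HOURS // 24)                   # → 100 (days)
print(next_shutdown.strftime("%Y-%m-%d %H:%M"))   # → 2024-11-29 20:28
```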

Aug 21 20:26:52 [SMBController-02:raid.config.spare.disk.failed:error]: Spare Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] S/N [KPJ6VRJF] UID [5000CCA0:167D3640:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.

Aug 21 20:26:52 [SMBController-02:disk.failmsg:notice]: Disk 0b.04.11 (KPJ6VRJF): Predictive Failure PFA (0x01), ASC(0x5d), ASCQ(0x90), FRU(0x90). 0 Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] S/N [KPJ6VRJF] UID [5000CCA0:167D3640:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]

Yes - we have a failed disk with a predictive failure (PFA) that needs to be replaced to stop this shutdown happening every 2400 hours.
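If you have to pull these details out of many event-log lines, a small regex helps. This is a sketch keyed to the exact message format shown above; ONTAP message formats can differ between releases, so treat the pattern as an assumption:

```python
import re

# Pattern based on the disk-failure log lines shown above; message formats
# can vary between ONTAP releases, so treat this as a sketch.
DISK_FAIL_RE = re.compile(
    r"Disk (?P<disk>\S+) Shelf (?P<shelf>\d+) Bay (?P<bay>\d+) "
    r"\[(?P<model>[^\]]+)\] S/N \[(?P<serial>\w+)\]"
)

log_line = (
    "Aug 21 20:26:52 [SMBController-02:raid.config.spare.disk.failed:error]: "
    "Spare Disk 0b.04.11 Shelf 4 Bay 11 [NETAPP   X423_HCOBE900A10 NA02] "
    "S/N [KPJ6VRJF] failed."
)

m = DISK_FAIL_RE.search(log_line)
if m:
    print(f"Replace disk {m['disk']} (shelf {m['shelf']}, bay {m['bay']}, "
          f"serial {m['serial']})")
# → Replace disk 0b.04.11 (shelf 4, bay 11, serial KPJ6VRJF)
```

The shelf, bay and serial number are exactly what you need when raising the hardware replacement.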

Check the "timeout" value before shutdown

If you wish to check the timeout value before shutdown, connect to SMBController-01, log in, and get to the system console with this command:

system console

You then need to run this command to view the timeout value, which is in hours, as below:

SMBController-01::> storage raid-options show -node SMBController-01 raid.timeout

That will return the current runtime values:

Node: SMBController-01
Option Name: raid.timeout
Option Value: 2400
Option Constraint: only_one

You also need to check the other controller to make sure it has the same value:

SMBController-01::> storage raid-options show -node SMBController-02 raid.timeout

That should return the same value for SMBController-02, as it does below:

Node: SMBController-02
Option Name: raid.timeout
Option Value: 2400
Option Constraint: only_one

Need to amend the "timeout" value?

If you need to amend this timeout value (the maximum is 2400 hours), you can run these commands to set it to your desired value:

Warning : This is no substitute for replacing failed or predictively failed disks, but it does mean you get more time to react to a failed disk. You can lose data if too many disks fail in these arrays, so please keep maintenance and housekeeping strict.

storage raid-options modify -node SMBController-01 option raid.timeout -value 2200
storage raid-options modify -node SMBController-02 option raid.timeout -value 2200