Tuesday 30 May 2017

AIX rootvg failure monitoring

AIX has a new “critical volume group” capability that monitors for the loss or failure of a volume group. You can apply this to any volume group, including rootvg; if applied to rootvg, you can monitor for the loss of the root volume group.
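Enabling it is a one-line change per volume group. For example, to mark a hypothetical data volume group named datavg as critical (the rootvg case is covered below):

# chvg -r y datavg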

This feature may be useful if your AIX LPAR experiences a loss of SAN connectivity, e.g. a total loss of access to SAN storage and/or all SAN switches. Typically, when this happens, AIX will continue to run in memory for a period of time and will not immediately crash. Often you can still log on to the AIX system, but if you attempt to write a file you’ll see an I/O error. Even then the system may (potentially) remain up. When the SAN issue is resolved, the AIX system may continue running with file systems in read-only mode (or not, it depends), but to really resolve the issue you would still need to reboot the AIX LPAR for it to regain access to its disks. This can result in the need to run fsck against file systems. Note that the behaviour you encounter will be influenced by a variety of factors, such as the length and type of the outage. As always, your mileage may vary!
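On the fsck point above: if a file system was left dirty by the outage, you may need to check it after the reboot before it can be trusted again. A minimal illustration, assuming the affected file system is /tmp (backed by the hd3 logical volume) and can be unmounted:

# umount /tmp
# fsck -y /dev/hd3
# mount /tmp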

You can encounter this behaviour with both VSCSI and NPIV SAN-booted LPARs. This new AIX VG option, which caters for the scenario described above, is not enabled by default. From the chvg man page:

“-r y | n    Changes the critical volume group (VG) option of the volume group.

n            Disables the critical VG option of the volume group.

y            Enables the critical VG option of the volume group. If the volume group is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group loses access to the quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console.”
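As the man page text makes clear, the crash behaviour depends on whether quorum is enabled for the volume group, so it is worth confirming that setting as well:

# lsvg rootvg | grep -i quorum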

PowerHA also caters for and supports this now, and the corresponding system event should already be enabled by default. You want this feature enabled for your HA clusters so that they respond appropriately to the loss of the root volume group and initiate a failover.

smitty sysmirror -> Custom Cluster Configuration -> Events -> System Events -> Change/Show Event Response (smitty cm_change_show_sys_event)

                     Change/Show Event Response

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                        [Entry Fields]
* Event Name            ROOTVG                 +
* Response              Log event and reboot   +
* Active                Yes                    +

"Exploitation of LVM rootvg failure monitoring

AIX LVM has recently added the capability to change a volume group to be a known as critical volume group. Though PowerHA has allowed critical volume groups in the past, that
only applied to non-operating system/data volume groups. PowerHA v7.2 now also takes advantage of this functionality specifically for rootvg. If the volume group is set to the critical VG, any I/O request failure starts the Logical Volume Manager (LVM) metadata write operation to check the state of the disk before returning the I/O failure. If the critical VG option is set to rootvg and if the volume group losses access to quorum set of disks (or all disks if quorum is disabled), instead of moving the VG to an offline state, the node is crashed and a message is displayed on the console. You can set and validate rootvg as a critical volume group by executing the commands shown below. The command has to run once since we are using the CAA distributed command clcmd.

# clcmd chvg -r y rootvg
# clcmd lsvg rootvg |grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: yes
DISK BLOCK SIZE: 512 CRITICAL VG: yes"

To test this new feature in my lab, I simulated a disk "failure": the accidental unmapping/removal of a rootvg disk from an LPAR.

On the AIX LPAR, prior to the disk failure simulation, I turn on the “CRITICAL VG” option for rootvg.

# oslevel -s
7200-00-00-0000

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE:    512                      CRITICAL VG:    no

# chvg -r y rootvg

# lsvg rootvg | grep CRIT
DISK BLOCK SIZE:    512                      CRITICAL VG:    yes

On the VIOS, I unmap the rootvg disk from the corresponding vhost adapter:

$ lsmap -vadapter vhost30
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30         U8286.42A.214F58V-V2-C38                     0x0000001d

VTD                   vtscsi59
Status                Available
LUN                   0x8100000000000000
Backing device        volume-AIX71LabImage-b8588f28-00000033-boot--26bf124d-40b3.acc7e518f3f561d4719fcecfff88b0f1
Physloc
Mirrored              N/A

VTD                   vtscsi60
Status                Available
LUN                   0x8200000000000000
Backing device        volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored              N/A

$ lu -list | grep volume-A | grep Lab
volume-AIX71LabImage-b~ 10240       7790        acc7e518f3f561d4719fcecfff88b0f1
volume-AIX71LabImage-b~ 1024        1021        81823bdb90537250b95b3b2f35bb55f9

$ lu -unmap -luudid acc7e518f3f561d4719fcecfff88b0f1

$ lsmap -vadapter vhost30
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30         U8286.42A.214F58V-V2-C38                     0x0000001d

VTD                   vtscsi60
Status                Available
LUN                   0x8200000000000000
Backing device        volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored              N/A

On the AIX LPAR, I attempt to create (write) a file in /tmp (which resides in rootvg):

# touch /tmp/mynewfile

The LPAR stops responding immediately. I can no longer connect to it or log in.
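If you have HMC command-line access, you could also confirm the partition state from there. A sketch, assuming a hypothetical managed system name derived from the machine details above:

$ lssyscfg -r lpar -m Server-8286-42A-SN214F58V -F name,state | grep aix72lab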

I then remap the disk and restart the LPAR. The AIX error report shows that the system halted due to a critical VG going offline.

$ lu -map -luudid acc7e518f3f561d4719fcecfff88b0f1 -vadapter vhost30

$ lsmap -vadapter vhost30
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30         U8286.42A.214F58V-V2-C38                     0x0000001d

VTD                   vtscsi59
Status                Available
LUN                   0x8100000000000000
Backing device        volume-AIX71LabImage-b8588f28-00000033-boot--26bf124d-40b3.acc7e518f3f561d4719fcecfff88b0f1
Physloc
Mirrored              N/A

VTD                   vtscsi60
Status                Available
LUN                   0x8200000000000000
Backing device        volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored              N/A

# errpt -a
...
LABEL:          KERNEL_PANIC
IDENTIFIER:     225E3B63

Date/Time:       Wed Mar  1 14:27:51 AEDT 2017
Sequence Number: 215
Machine Id:      00F94F584C00
Node Id:         aix72lab
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
Critical VG Force off, halting.
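If you need to find these entries quickly after an incident, you can filter the error report by the KERNEL_PANIC error identifier shown above; adding -a returns the full detail:

# errpt -j 225E3B63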

This feature is available with AIX 6.1 or later. For AIX 7.1, it was delivered via APAR IV52743: ADD CRITICAL VG SUPPORT FOR ROOTVG (applies to AIX 7100-03).
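You can check whether a given system already has this APAR (or a superseding fix) installed with instfix:

# instfix -ik IV52743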
