AIX rootvg failure monitoring
AIX has a new “critical volume group”
capability that monitors for the loss or failure of a volume group.
You can apply it to any volume group, including rootvg. When applied to
rootvg, AIX monitors for the loss of the root volume group.
This feature may be useful if your AIX LPAR experiences a
loss of SAN connectivity, e.g. total loss of access to SAN storage and/or
all SAN switches. Typically, when this happens, AIX will continue to
run in memory for a period of time and will not immediately crash.
Often you can still log on to the AIX system, but if you attempt to write
a file you’ll see an I/O error. Even then the system may
(potentially) remain up. When the SAN issue is resolved, the AIX system
may continue running, with file systems in read-only mode (or not, it
depends), but to really resolve the issue you would still need to reboot
the AIX LPAR in order for it to regain access to its disks. This can result
in the need to run fsck against file systems. Note that the behaviour
you encounter will be impacted by a variety of factors, such as the length
and type of outage. As always, your mileage may vary!
You can encounter this behaviour with both VSCSI and NPIV
SAN booted LPARs. This new AIX VG option, which caters for the scenario
described above, is not enabled by default. From the chvg man page:
“-r y | n Changes the critical volume group (VG) option of the volume group.
n Disables the critical VG option of the volume group.
y Enables the critical VG option of the volume group. If the volume group is set
to the critical VG, any I/O request failure starts the Logical Volume Manager
(LVM) metadata write operation to check the state of the disk before
returning the I/O failure. If the critical VG option is set to rootvg and if the
volume group losses access to quorum set of disks (or all disks if quorum
disabled), instead of moving the VG to an offline state, the node is crashed
and a message is displayed on the console.”
PowerHA now caters for and supports this as well, and it should already be enabled by default.
You want this feature enabled for your HA clusters so that they respond
appropriately to loss of the root volume group and initiate a failover.
smitty sysmirror -> Custom Cluster Configuration
-> Events -> System Events -> Change/Show Event Response
(smitty cm_change_show_sys_event)
Change/Show Event Response
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Event Name ROOTVG +
* Response Log event and reboot +
* Active Yes +
"Exploitation of LVM rootvg failure monitoring
AIX LVM has recently added the capability to change a
volume group to be a known as critical volume group. Though PowerHA has
allowed critical volume groups in the past, that
only applied to non-operating system/data volume
groups. PowerHA v7.2 now also takes advantage of this functionality
specifically for rootvg. If the volume group is set to the critical VG,
any I/O request failure starts the Logical Volume Manager (LVM) metadata
write operation to check the state of the disk before returning the I/O
failure. If the critical VG option is set to rootvg and if the volume
group losses access to quorum set of disks (or all disks if quorum is
disabled), instead of moving the VG to an offline state, the node is
crashed and a message is displayed on the console. You can set and
validate rootvg as a critical volume group by executing the commands
shown below. The command has to run once since we are using the CAA
distributed command clcmd.
# clcmd chvg -r y rootvg
# clcmd lsvg rootvg |grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: yes
DISK BLOCK SIZE: 512 CRITICAL VG: yes"
To test this new feature in my lab, I simulated a disk "failure" or accidental unmapping/removal of a rootvg disk from an LPAR.
On the AIX LPAR, prior to disk failure simulation, I turn on the “CRITICAL VG” option for rootvg.
# oslevel -s
7200-00-00-0000
# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: no
# chvg -r y rootvg
# lsvg rootvg | grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: yes
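The check-then-enable sequence above could be wrapped in a small script. What follows is only a sketch: the function names critical_vg_state and ensure_critical_vg are made up for illustration, and the parsing assumes the "CRITICAL VG:" field appears in lsvg output exactly as shown above.

```shell
# Hypothetical helper: read `lsvg <vg>` output on stdin and print the value
# of the CRITICAL VG field ("yes" or "no"). Exits non-zero if the field is
# absent, e.g. on an AIX level without critical VG support.
critical_vg_state() {
    awk '/CRITICAL VG:/ { sub(/.*CRITICAL VG:[ \t]*/, ""); print $1; found = 1 }
         END { if (!found) exit 1 }'
}

# Hypothetical wrapper: enable the option only when it is not already set.
# Uses the same lsvg and chvg -r y commands shown in this post.
ensure_critical_vg() {
    vg=$1
    state=$(lsvg "$vg" | critical_vg_state) || return 2
    if [ "$state" != "yes" ]; then
        chvg -r y "$vg"
    fi
}
```

Running ensure_critical_vg rootvg as root would leave an already-critical rootvg untouched. A return code of 2 means the CRITICAL VG field could not be found at all.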
On the VIOS, I unmap the rootvg disk from the corresponding vhost adapter:
$ lsmap -vadapter vhost30
SVSA Physloc Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30 U8286.42A.214F58V-V2-C38 0x0000001d
VTD vtscsi59
Status Available
LUN 0x8100000000000000
Backing device volume-AIX71LabImage-b8588f28-00000033-boot--26bf124d-40b3.acc7e518f3f561d4719fcecfff88b0f1
Physloc
Mirrored N/A
VTD vtscsi60
Status Available
LUN 0x8200000000000000
Backing device volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored N/A
$ lu -list | grep volume-A | grep Lab
volume-AIX71LabImage-b~ 10240 7790 acc7e518f3f561d4719fcecfff88b0f1
volume-AIX71LabImage-b~ 1024 1021 81823bdb90537250b95b3b2f35bb55f9
$ lu -unmap -luudid acc7e518f3f561d4719fcecfff88b0f1
$ lsmap -vadapter vhost30
SVSA Physloc Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30 U8286.42A.214F58V-V2-C38 0x0000001d
VTD vtscsi60
Status Available
LUN 0x8200000000000000
Backing device volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored N/A
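To confirm the unmap without reading the listing by eye, the lsmap output can be searched for the LU UDID. This is a rough sketch: lu_is_mapped is a made-up helper name, it assumes the UDID appears as part of the "Backing device" name as in the listings above, and it assumes awk is available wherever the lsmap output is piped or captured.

```shell
# Hypothetical helper: succeed (exit 0) if the given LU UDID appears in any
# "Backing device" line of `lsmap -vadapter <vhost>` output read from stdin.
lu_is_mapped() {
    udid=$1
    awk -v udid="$udid" '
        $1 == "Backing" && $2 == "device" && index($3, udid) { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, lsmap -vadapter vhost30 | lu_is_mapped acc7e518f3f561d4719fcecfff88b0f1 would be expected to fail after the unmap and succeed again once the LU is remapped.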
On the AIX LPAR, I attempt to create (write) a file in /tmp (which resides in rootvg):
# touch /tmp/mynewfile
The LPAR stops responding immediately. I can no longer connect or log in to it.
I then remap the disk and restart the LPAR. The AIX error report shows that the system halted due to a critical VG going offline.
$ lu -map -luudid acc7e518f3f561d4719fcecfff88b0f1 -vadapter vhost30
$ lsmap -vadapter vhost30
SVSA Physloc Client Partition ID
--------------- -------------------------------------------- ------------------
vhost30 U8286.42A.214F58V-V2-C38 0x0000001d
VTD vtscsi59
Status Available
LUN 0x8100000000000000
Backing device volume-AIX71LabImage-b8588f28-00000033-boot--26bf124d-40b3.acc7e518f3f561d4719fcecfff88b0f1
Physloc
Mirrored N/A
VTD vtscsi60
Status Available
LUN 0x8200000000000000
Backing device volume-AIX71LabImage-b8588f28-00000033-data--1830bd26-ba42.81823bdb90537250b95b3b2f35bb55f9
Physloc
Mirrored N/A
# errpt -a
...
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63
Date/Time: Wed Mar 1 14:27:51 AEDT 2017
Sequence Number: 215
Machine Id: 00F94F584C00
Node Id: aix72lab
Class: S
Type: TEMP
WPAR: Global
Resource Name: PANIC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
ASSERT STRING
PANIC STRING
Critical VG Force off, halting.
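After a reboot, the error report can be scanned for this panic automatically. The sketch below uses a made-up function name, critical_vg_panics, and assumes the field labels (LABEL:, Date/Time:) and the panic string look the same as in the entry above.

```shell
# Hypothetical helper: read `errpt -a` output on stdin and print the
# Date/Time of every KERNEL_PANIC entry whose detail data contains the
# "Critical VG Force off" panic string.
critical_vg_panics() {
    awk '
        /^LABEL:/      { label = $2; when = "" }
        /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        /Critical VG Force off/ && label == "KERNEL_PANIC" { print when }
    '
}
```

It would be run as errpt -a | critical_vg_panics. Empty output means no matching entry was found in the error log.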
This feature is available with AIX 6.1 or later.
IV52743: ADD CRITICAL VG SUPPORT FOR ROOTVG APPLIES TO AIX 7100-03