Tuesday, 30 May 2017

Help! My mksysb is only backing up 4 files!


I enjoy it when I open my email in the morning and find a new message with a subject line of “weird one….”! I immediately prepare myself for whatever challenge awaits. I delight in helping others with their AIX challenges, so I usually open these emails first and start diagnosing and troubleshooting the problem!

This week I was contacted by someone who was having a little trouble with a mksysb backup on one of their AIX systems.

“Hi Chris,

This one has me stumped, any ideas? I’ll have to log a call I think as I’m not sure why this is happening. I run a mksysb and it backs up just 4 files! An alt_disk_copy also fails.

My /etc/exclude.rootvg is empty.

# cat /etc/exclude.rootvg
# mksysb -i /mksysb/aixlpar1-mksysb

Creating information file (/image.data) for rootvg.

Creating list of files to back up.

Backing up 4 files

4 of 4 files (100%)
0512-038 mksysb: Backup Completed Successfully.

# lsmksysb -f /mksysb/aixlpar1-mksysb
New volume on /mksysb/aixlpar1-mksysb:
Cluster size is 51200 bytes (100 blocks).
The volume number is 1.
The backup date is: Wed Oct 21 22:12:04 EST 2015
Files are backed up by name.
The user is root.
5911 ./bosinst.data
11 ./tmp/vgdata/rootvg/image.info
11837 ./image.data
270567 ./tmp/vgdata/rootvg/backup.data

The total size is 288326 bytes.
The number of archived files is 4.”

 
Yep, that is a weird one! :-) I replied with the following tips.

“Hi,

A couple of things to check and try.....

1. Check the mount command is OK, i.e.

# ls -ltr /usr/sbin/mount
-r-sr-xr-x 3 root system 67040 Aug 18 15:52 /usr/sbin/mount << Not zero size?


# mount

2. Check the find command is OK, i.e.

# ls -ltr /usr/bin/find
-r-xr-xr-x 1 bin bin 58608 Aug 18 15:44 /usr/bin/find << Not symlink'ed?

3. Delete the /etc/exclude.rootvg file. Try mksysb again. Does it work?

4. Run mkszfile and check image.data. Are fs data stanzas listed for each file system?

5. Try with verbose and debug enabled on mksysb/mkszfile commands.

# cd /usr/bin
# cp -p mksysb mksysb.orig
# cp -p mkszfile mkszfile.orig

# vi mksysb

; search for main

#################### main ##########################

#
# set up environment
#
PATH=/usr/bin:/usr/sbin:/sbin:/etc: ; export PATH
export ODMDIR="/etc/objrepos"
NAME=`/usr/bin/basename $0`
PLATFORM=`/usr/sbin/bootinfo -a` # Needed for IA64 (value 4)

Add the following:

#################### main ##########################

set -x
for F in $(typeset +f); do typeset -ft $F; done

#
# set up environment
#

; Do the same (above) for mkszfile.

; Save the files and run mkszfile and mksysb with script (to capture the output).

# script /tmp/mksysb.out
# mkszfile
# mksysb -i -v /mksysb/aixlpar-mksysbimg
# exit”
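One housekeeping note on step 5: once the debug output has been captured, the original (untraced) commands can be restored from the copies taken earlier. A quick sketch:

# cd /usr/bin
# cp -p mksysb.orig mksysb
# cp -p mkszfile.orig mkszfile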
 
A short time later I received another email stating that the problem had been resolved! Curiously, the find command had been (somehow) destroyed on this system.

“LLLLLLLLLLLEEEEEEEEEEEGGGGGGGGGGGGEEEEEEEEEEENNNNNNNNNNNDDDDDDDDDDDDDDDDD!!!!!!!!!!!!!!!!!!!!!!!!

/usr/bin/find was a zero size file!!! WTF!

I just copied the file from another LPAR and it worked perfect.

THANKS!!!!”

I thought I’d share this information here, just in case anyone else comes across a similar problem in the future.
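As a side note, when a core command like find turns up damaged, AIX can tell you which fileset owns the file and verify that fileset's checksums before you copy a replacement from another LPAR. A sketch only (the fileset name is whatever lslpp reports on your system; <fileset_name> below is a placeholder):

# lslpp -w /usr/bin/find
# lppchk -c <fileset_name>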

What are those AIX fcstat error numbers all about?


When troubleshooting Fibre Channel adapter issues on AIX (or VIOS), I often use the fcstat command to assist me in the process. The command may display error numbers similar to the following:

$ fcstat fcs0

Error opening device: /dev/fscsi0
errno: 00000045

The error number (errno) displayed can (in some cases) be used to identify the root cause of a problem. In the example above, the error number is 45. This number is in hex. If we convert this to decimal, the number is 69. Now, if we look for 69 in the /usr/include/errno.h file (on AIX), we discover that this error number relates to a "Network is down" event.

# fcstat fcs0
Error opening device: /dev/fscsi0
errno: 00000045

# echo "ibase=16; 45"|bc
69

# grep 69 /usr/include/errno.h
#define ENETDOWN        69      /* Network is down */
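If you do this conversion often, a small ksh helper can combine the hex-to-decimal step and the errno.h lookup. A sketch only; "fcerrno" is my own script name, not a system command:

# cat /usr/local/bin/fcerrno
#!/usr/bin/ksh
# Convert an fcstat hex errno to decimal and look it up in errno.h
dec=$(echo "ibase=16; $(echo $1 | tr 'a-f' 'A-F')" | bc)
grep -w "$dec" /usr/include/errno.h

# fcerrno 45
#define ENETDOWN        69      /* Network is down */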

The AIX error report (errpt or errlog on VIOS) also tells me that there's some type of link error. This helps me focus my investigation toward the most likely problem area. In this case, I suspect either a physical link or cable problem between my FC adapter and the SAN switch.

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
7BFEEA1F   1116121816 T H fcs0           LINK ERROR

# errpt -aN fcs0 | grep -p Desc
...
Description
LINK ERROR

We can use the same process for other errors displayed by the fcstat command. For example:

# fcstat fcs0
Error opening device: /dev/fscsi0
errno: 00000046

# echo "ibase=16; 46"|bc
70

# grep 70 /usr/include/errno.h
#define ENETUNREACH     70      /* Network is unreachable */

This particular error may direct us to the solution outlined in the following tech note:

Fibre channel workspace takes over 2 minutes to load but returns no data


fcstat Command (AIX 6.1)

fcstat Command (AIX 7.1)

fcstat Command (AIX 7.2)

AIX WPAR tips and tricks


Here are some handy tips and tricks for managing WPARs in your AIX environment.

  1. How to share a global file system with a WPAR.

- In the Global.

[root@750lpar11]/ # uname -W
0

- Global File System that you want to share with the WPAR (750wpar2).

[root@750lpar11]/ # df -g /cg
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/cglv          1.98      0.30   85%       64     1% /cg

[root@750lpar11]/ # lswpar
Name      State  Type  Hostname  Directory        RootVG WPAR
--------------------------------------------------------------
750wpar1  A      S     750wpar1  /wpars/750wpar1  yes
750wpar2  A      S     750wpar2  /wpars/750wpar2  yes

- Create "namefs" file system from the Global. /cg file system will be mounted under /stuff in the WPAR.

[root@750lpar11]/ # crfs -v namefs -A yes -d /cg -m /wpars/750wpar2/stuff -u 750wpar2

[root@750lpar11]/ # lsfs /wpars/750wpar2/stuff
Name            Nodename   Mount Pt               VFS   Size    Options    Auto Accounting
/cg             --         /wpars/750wpar2/stuff  namefs --      --         no   no

[root@750lpar11]/ # grep -p /wpars/750wpar2/stuff /etc/filesystems
/wpars/750wpar2/stuff:
        dev             = /cg
        vfs             = namefs
        mount           = true
        type            = 750wpar2
        account         = false

- Mount the file system, in the Global.

[root@750lpar11]/ # mount /wpars/750wpar2/stuff

[root@750lpar11]/ # mount | grep stuff
         /cg              /wpars/750wpar2/stuff namefs Jun 23 10:27 rw

- In the WPAR.

[root@750wpar2]/stuff # uname -W
8
[root@750wpar2]/stuff # df -g /stuff
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
Global             1.98      0.30   85%       64     1% /stuff
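To reverse the share later, unmount the namefs file system in the Global; rmfs should then remove its stanza from /etc/filesystems. A sketch using the same names as above:

# umount /wpars/750wpar2/stuff
# rmfs /wpars/750wpar2/stuff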

  2. Don't remove the IPv6 address from the global LPAR, or else you may not be able to start your WPARs.

[root@gibopvc1]/ # chdev -l lo0 -a netaddr6=''
lo0 changed

[root@gibopvc1]/ # stopwpar -Fv p8wpar1
Stopping workload partition 'p8wpar1'.
Stopping workload partition subsystem 'cor_p8wpar1'.
0513-044 The cor_p8wpar1 Subsystem was requested to stop.
Shutting down all workload partition processes.
WPAR='p8wpar1' CID=1
sysV mq ID=2097153 key=0x4107001c uid=0 gid=9
sysV sem ID=5242893 key=0x62023457 uid=0 gid=0
Unmounting all workload partition file systems.
Umounting '/wpars/p8wpar1/etc/objrepos/wboot'.
Umounting '/wpars/p8wpar1/opt'.
Umounting '/wpars/p8wpar1/usr'.
Umounting '/wpars/p8wpar1'.
Return Status = SUCCESS.
[root@gibopvc1]/ #

[root@gibopvc1]/ # startwpar -v p8wpar1
Starting workload partition 'p8wpar1'.
Mounting all workload partition file systems.
Mounting '/wpars/p8wpar1'.
Mounting '/wpars/p8wpar1/etc/objrepos/wboot'.
Mounting '/wpars/p8wpar1/opt'.
Mounting '/wpars/p8wpar1/usr'.
Loading workload partition.
startwpar: 0960-231 ATTENTION: '/usr/lib/wpars/loadwpar' failed with return code 13.
startwpar: 0960-244 Error loading workload partition.
Unmounting all workload partition file systems.
Umounting '/wpars/p8wpar1/usr'.
Umounting '/wpars/p8wpar1/opt'.
Umounting '/wpars/p8wpar1/etc/objrepos/wboot'.
Umounting '/wpars/p8wpar1'.
Return Status = FAILURE.
[root@gibopvc1]/ #

[root@gibopvc1]/ # chdev -l lo0 -a netaddr6='::1'
lo0 changed
[root@gibopvc1]/ #
[root@gibopvc1]/ #
[root@gibopvc1]/ # startwpar -v p8wpar1
Starting workload partition 'p8wpar1'.
Mounting all workload partition file systems.
Mounting '/wpars/p8wpar1'.
Mounting '/wpars/p8wpar1/etc/objrepos/wboot'.
Mounting '/wpars/p8wpar1/opt'.
Mounting '/wpars/p8wpar1/usr'.
Loading workload partition.
en1 net default: gateway 10.1.1.10
Exporting workload partition devices.
Exporting workload partition kernel extensions.
Starting workload partition subsystem 'cor_p8wpar1'.
0513-059 The cor_p8wpar1 Subsystem has been started. Subsystem PID is 9896106.
Verifying workload partition startup.
Return Status = SUCCESS.
[root@gibopvc1]/ #
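Before stopping or restarting WPARs, it's worth a quick check that the loopback IPv6 address is still in place; it should report ::1. A sketch:

# lsattr -El lo0 -a netaddr6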

  3. Follow this procedure if you are unable to clogin to a WPAR because root logins are not allowed.

[root@750lpar11]/ GLOBAL # uname -W
0

[root@750lpar11]/ GLOBAL # clogin 750wpar1
Remote logins are not allowed for this account.

[root@750lpar11]/ GLOBAL # chroot /wpars/750wpar1 /usr/bin/ksh
750lpar11 : / # chuser rlogin=true root
750lpar11 : / # exit

[root@750lpar11]/ GLOBAL # clogin 750wpar1
750wpar1 : / # oslevel -s
5300-12-09-1341
750wpar1 : / # uname -W
4
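The same fix can also be applied non-interactively from the Global in one line, a sketch:

[root@750lpar11]/ GLOBAL # chroot /wpars/750wpar1 /usr/bin/chuser rlogin=true root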

  4. How to back up your WPAR to a NIM master.

Backing up WPAR to NIM master over NFS mounted file system.
--------------------------------------------------------------------------------------

[root@750lpar11]/ GLOBAL # mount nimmast:/export/images/wpars /mnt
[root@750lpar11]/ GLOBAL # savewpar -if /mnt/750wpar1-savewpar-image
Creating information file (/image.data) for 750wpar1-savewpar-image.
Creating list of files to back up
Backing up 92766 files......................
92766 of 92766 files backed up (100%)
0512-038 savewpar: Backup Completed Successfully.
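Should you ever need it, the image can be restored from the same NFS mount with the restwpar command. A sketch only; check the restwpar documentation for flags such as an alternate WPAR name or base directory before relying on it:

# restwpar -f /mnt/750wpar1-savewpar-image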

############################################################################

Backing up WPAR with NIM.
--------------------------------------
On the NIM master, check that the Global LPAR is already configured as a NIM client.

# nim -o showlog 750lpar11
BEGIN:Thu Jan 30 17:25:47 2014:013017254714
Command line is:
/usr/sbin/installp -acNgXY -e /var/adm/ras/nim.installp -f \
/tmp/.workdir.12583118.12255434_1/.genlib.installp.list.12255434-d \
/tmp/_nim_dir_11862198/mnt0
+-----------------------------------------------------------------------------+
                    Pre-installation Verification...
+-----------------------------------------------------------------------------+
...etc...
Installation Summary
--------------------
Name                        Level           Part        Event       Result
-------------------------------------------------------------------------------
openssh.man.en_US           6.0.0.6100      USR         APPLY       SUCCESS

END:Thu Jan 30 17:25:51 2014:013017255114

# lsnim -l 750lpar11
750lpar11:
   class          = machines
   type           = standalone
   connect        = nimsh
   platform       = chrp
   netboot_kernel = 64
   if1            = 10_1_50 750lpar11 FAFAC00EF002
   cable_type1    = N/A
   Cstate         = ready for a NIM operation
   prev_state     = not running
   Mstate         = currently running
   cpuid          = 00F603CD4C00
   Cstate_result  = success

Check that the WPAR hostname/IP can be resolved from the NIM master.

# host 750wpar2
750wpar2 is 10.1.50.44

# host 10.1.50.44
750wpar2 is 10.1.50.44

Define the WPAR in the NIM database. Where 750lpar11 is the Global partition NIM client name and 750wpar2 is the WPAR name.

# nim -o define -t wpar -a mgmt_profile1="750lpar11 750wpar2" -a if1="find_net 750wpar2 0" 750wpar2

Verify that the NIM master can now query the status of the WPAR.

# nim -o lswpar 750wpar2
Name      State  Type  Hostname  Directory        RootVG WPAR
--------------------------------------------------------------
750wpar2  A      S     750wpar2  /wpars/750wpar2  yes

# lsnim -l 750lpar11
750lpar11:
   class          = machines
   type           = standalone
   connect        = nimsh
   platform       = chrp
   netboot_kernel = 64
   if1            = 10_1_50 750lpar11 FAFAC00EF002
   cable_type1    = N/A
   Cstate         = ready for a NIM operation
   prev_state     = not running
   Mstate         = currently running
   cpuid          = 00F603CD4C00
   Cstate_result  = success
   manages        = 750wpar2

# lsnim -l 750wpar2
750wpar2:
   class         = machines
   type          = wpar
   connect       = shell
   platform      = chrp
   if1           = 10_1_50 750wpar2 0
   mgmt_profile1 = 750lpar11 750wpar2
   Cstate        = ready for a NIM operation
   prev_state    = managed system defined but not yet created
   Mstate        = currently running

Back up the WPAR from the NIM master.

# nim -o define -t savewpar -a server=master -a location=/export/mksysb/cg/750wpar2-savewpar-image -a source=750wpar2 -a mk_image=yes 750wpar2-savewpar-image

Creating list of files to back up.
.
Backing up 64560 files..........

64560 of 64560 files (100%)
0512-038 savewpar: Backup Completed Successfully.
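Once the backup completes, you can confirm the new savewpar resource is defined in the NIM database. A sketch:

# lsnim -l 750wpar2-savewpar-image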
  

  5. Moving a Rootvg Versioned Workload Partition (VWPAR) to Another Disk

"I tested in the lab, for a versioned wpar you can use it more like a regular LPAR - extend a disk into rootvg, mirror on it, unmirror from the old one, reduce out and rmdev.  No bosboot or bootlist needed.  There is no bootset in a versioned wpar for some reason.”

Example: a 5.3 versioned WPAR, with rootvg on hdisk11, seen from the global environment.

In the global environment, add a new disk (hdisk10) to the VWPAR:

# chwpar -D devname=hdisk10 rootvg=yes 53rvg

chwpar: 0960-721 The workload partition's rootvg must be extended with the new disk(s). Please run extendvg inside the workload partition prior to stopping or rebooting it.


In the versioned wpar:

# clogin 53rvg
# cfgmgr
# extendvg rootvg hdisk1
# mirrorvg rootvg hdisk1
# unmirrorvg rootvg hdisk0
# reducevg rootvg hdisk0
# rmdev -dl hdisk0
# exit (go back to global env)


Now in the global deallocate the old rootvg disk.

# chwpar -K -D devname=hdisk11 53rvg

Now you can shut down and start the wpar, it will boot from the other disk.

# stopwpar 53rvg
# startwpar 53rvg


Apparently bootset commands don't work with a versioned rootvg wpar:

# lswpar -Br 53rvg
Name   Device Name      Virtual Device  RootVG  Bootset
--------------------------------------------------------

# chwpar -B devname=hdisk10 rootvg=yes 53rvg

chwpar: 0960-808 Cannot add bootsets to versioned workload partitions.
 

  6. Moving a Rootvg Workload Partition (WPAR) to Another Disk

Technote (FAQ)

Question
I have a rootvg WPAR that is on one disk, is there a method to move it to a new disk?

Answer
There may be an occasion where you have created a rootvg WPAR on a specific disk and you want to move the entire WPAR to another disk. One example might be that the original disk is from an older storage enclosure, and you wish to move the WPAR to newly purchased storage, connected to the system.

You can do this by means of an alternate bootset. Similar to how using the alt_disk_copy command in a global LPAR will create a copy of rootvg on another disk, an alternate bootset is a copy of a WPAR's rootvg on another disk.

The example in this technote will use a rootvg wpar that is on a single disk (hdisk11), and has private /opt and /usr filesystems (AKA a "detached" WPAR). This was initially created using these options:

# mkwpar -D devname=hdisk11 rootvg=yes -l -n rootvgwpar
# startwpar rootvgwpar 


1. List the bootsets for the rootvg WPAR (-Br) we are interested in:

# lswpar -Br rootvgwpar
Name     Device Name    Virtual Device  RootVG  Bootset
-------------------------------------------------------------
rootvgwpar  hdisk11      hdisk0      yes   0 


So we can see that hdisk11 is being used for the rootvg in the WPAR, as internal hdisk0.

2a. Now we allocate a new unused disk to the WPAR, and set a bootset on it in the same action:

# chwpar -B devname=hdisk9 rootvgwpar
Creating a bootset for WPAR rootvgwpar. Please wait... 


After this is finished we can see that a new bootset has been created on hdisk9:

# lswpar -Br rootvgwpar
Name     Device Name    Virtual Device  RootVG  Bootset
-------------------------------------------------------------
rootvgwpar  hdisk9       hdisk1          no      1
rootvgwpar  hdisk11      hdisk0          yes     0

2b. If we log in to the WPAR we'll see a rootvg and an alternate rootvg:

# clogin rootvgwpar

# lspv
hdisk0     c7a96733e1ebfc      rootvg      active
hdisk1     c7a9673433f0a7      altvg.1 


3. From the global environment again, we can set the bootlist for the WPAR to boot from the bootset on the new disk:

# chwpar -b bootlist=1 rootvgwpar
Check that it was set properly:

# lswpar -b rootvgwpar
Name     Bootlist
---------------------
rootvgwpar  1

4. Reboot the wpar on the new bootset.

# rebootwpar rootvgwpar

Stopping workload partition rootvgwpar.
Stopping workload partition subsystem cor_rootvgwpar.
0513-044 The cor_rootvgwpar Subsystem was requested to stop.
stopwpar: 0960-261 Waiting up to 600 seconds for workload partition to halt.
Shutting down all workload partition processes.
Unmounting all workload partition file systems.

Starting workload partition rootvgwpar.
Mounting all workload partition file systems.
Loading workload partition.
Exporting workload partition devices.
Exporting workload partition kernel extensions.
Starting workload partition subsystem cor_rootvgwpar.
0513-059 The cor_rootvgwpar Subsystem has been started. Subsystem PID is 8323226.
Verifying workload partition startup.


If we look at the bootsets we'll see that the 2nd disk now has rootvg officially on it:

# lswpar -Br rootvgwpar
Name     Device Name    Virtual Device  RootVG  Bootset
-------------------------------------------------------------
rootvgwpar  hdisk9       hdisk0          yes     1
rootvgwpar  hdisk11      hdisk1          no      0

That bootset also believes hdisk9 is "hdisk0" for it, and the other disk is hdisk1. Notice the bootset ID has not changed, bootset 0 is still on (global) disk hdisk11 and bootset 1 on (global) disk hdisk9.

If we log into the WPAR we can see this:

# lspv
hdisk0     c7a9673433f0a7      rootvg      active
hdisk1     c7a96733e1ebfc      altvg.0 


The volume group "altvg.0" is our original rootvg.

5. At this point if everything is looking good, we can remove the original bootset from the global environment:

# chwpar -K -B bootset=0 rootvgwpar

# lswpar -Br rootvgwpar
Name     Device Name    Virtual Device  RootVG  Bootset
-------------------------------------------------------------
rootvgwpar  hdisk9      hdisk0      yes   1


# lswpar -D rootvgwpar | grep disk
rootvgwpar  disk   hdisk9      yes   EXPORTED
rootvgwpar  disk   hdisk11      no    EXPORTED


6. Now we can remove the original rootvg disk:

# chwpar -K -D devname=hdisk11 rootvgwpar 

7. Remember to run cfgmgr to put the disk in an Available state again so it can be used by another WPAR or volume group:

# lsdev -xc disk -l hdisk11
hdisk11 Defined 00-01-02 MPIO IBM 2076 FC Disk

# cfgmgr

# lsdev -xc disk -l hdisk11
hdisk11 Available 00-01-02 MPIO IBM 2076 FC Disk


DLPAR event history on the HMC


You can use the HMC lssvcevents command to view DLPAR event history. In the following example, I’m looking for “Add processor” related DLPAR events, over the last 30 days, for a specific LPAR.

$ lssvcevents -t console -d 30 | egrep 'Add processor|DLPAR' | grep s82861p22_nim

time=12/01/2016 00:48:33,text=HSCE2209 User name hscroot: DLPAR Add processor resources to partition s82861p22_nim succeeded on managed system Server-8286-42A-SN214F55V

Search for all DLPAR events over the last 30 days.

$ lssvcevents -t console -d 30 | grep -i dlpar
time=12/01/2016 00:52:27,text=HSCE2211 User name hscroot: DLPAR Remove processor resources from partition s82861p22_nim  succeeded on managed system Server-8286-42A-SN214F55V
time=12/01/2016 00:48:33,text=HSCE2209 User name hscroot: DLPAR Add processor resources to partition s82861p22_nim succeeded on managed system Server-8286-42A-SN214F55V
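The same approach works for other DLPAR operations; for example, a sketch that narrows the search to memory-related events:

$ lssvcevents -t console -d 30 | grep -i dlpar | grep -i memory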

Minimum memory requirement for AIX 7.2 Live Update


AIX 7.2 requires 2 GB of memory to boot, but this minimum is not enforced in the LPAR profile except by Live Update (to ensure we'll be able to boot the surrogate LPAR). You can check the Minimum Memory setting in your LPAR's profile by running the lparstat command (as shown below).

# lparstat -i | grep Memory
Online Memory                              : 4096 MB
Maximum Memory                             : 8192 MB
Minimum Memory                             : 2048 MB

If your partition does not meet the minimum memory profile requirement, you’ll receive the following error message when you perform a Live Update preview (with geninstall -p -k).

Checking lpar minimal memory size:
------------------------------------------
Required memory size: 2048 MB
Current memory size: 1024 MB
1430-119 FAILED: the lpar minimal memory size is not sufficient

You’ll need to change the partition profile so that the minimum memory setting is at least 2048 MB (2 GB), and then stop and start the partition for the profile update to take effect.
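For reference, here's a hedged sketch of what that profile change might look like from the HMC command line; the managed system, profile, and partition names below are placeholders for your own:

$ chsyscfg -r prof -m Server-8286-42A-SN214F55V -i 'name=default,lpar_name=p8tlc-lvup,min_mem=2048'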

The /var/adm/ras/liveupdate/logs/lvupdlog log file will also contain error messages indicating the problem:

LVUPD 11/28/2016-18:52:00.716 DEBUG lvupdate_utils32.c - 6713 - lvup_check_minmem_size: Partition minimum memory size (1024 MB) on p8tlc-lvup is lower than the minimum memory size required (2048 MB).

LVUPD 11/28/2016-18:52:00.716 ERROR lvupdate_utils32.c - 8647 - lvup_preview: ERROR(s) while checking the current mimimal memory size against the computed required size.

J2pg high CPU usage on AIX


Recently I've come across an odd issue at two different customers. I thought I'd share the experience, in case others also come across this strange behaviour.

In both cases they reported j2pg high CPU usage.


And, in both cases, we discovered many /usr/sbin/update processes running. Unexpectedly.

When we stopped these processes, j2pg's CPU consumption dropped to nominal levels.

The j2pg process is responsible for, among other things, flushing data to disk and is called by the sync(d) process.

The /usr/sbin/update command is just a script that calls sync in a loop. Its purpose is to "periodically update the super block ... execute a sync subroutine every 30 seconds. This action ensures the file system is up-to-date in the event of a system crash".


# cat /usr/sbin/update
...BLAH...
PATH=/usr/bin:/usr/sbin:/etc

while true
do
sync
sleep 30
done &
exit 0

Because of the large number of /usr/sbin/update (sync) processes (in some cases over 150 of them), j2pg was constantly kept busy, assisting with flushing data to disk.

It appears that the application team (in both cases) was attempting to perform some sort of SQL "update", but due to an issue with their shell environment/PATH setting they were calling /usr/sbin/update instead of the intended update (or UPDATE) command. And yes, a non-root user can call /usr/sbin/update - no problem. So, in the "ps -ef" output we found processes that looked similar to this:

fred 50791260 1 0 Jan 09 - 0:04 /usr/bin/bsh /usr/sbin/update prd_ctl_q_and_p.af_data_obj set as_of_dt=2016-12-09 00:00:00.000000 where DATA_OBJ_NM like %LOANIQ%20161209%

# ls -ltr /usr/sbin/update
-r-xr-xr-x    1 bin      bin             943 Aug 17 2011  /usr/sbin/update

The application teams were directed to fix their scripts so that they no longer called /usr/sbin/update and instead called the correct command.
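If you suspect the same issue, here's a quick sketch to count the rogue processes and, once the application team has confirmed they are safe to stop, kill them (the bracketed grep pattern stops grep from matching itself; verify the PID list before killing anything in production):

# ps -ef | grep '[/]usr/sbin/update' | wc -l
# ps -ef | grep '[/]usr/sbin/update' | awk '{print $2}' | xargs kill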

And here’s some information (more than you’ll probably ever need to know) about j2pg on AIX.


"j2pg - Kernel process integral to processing JFS2 I/O requests.

The kernel thread is responsible for managing I/Os in JFS2 filesystems,
so it is normal to see it running when there is a lot of I/O or syncd activity.

And we could see that j2pg runs syncHashList() very often. The sync is
done in syncHashList(): all inodes are extracted from the hash list,
and whether each inode needs to be synchronized is then judged
by iSyncNeeded().

** note that a sync() call will cause the system to scan *all* the
memory currently used for filecaching to see which pages are dirty
and have to be synced to disk

Therefore, the cause of j2pg having this spike is determined by the
two calls that were being made (iSyncNeeded ---> syncHashList). What is
going on here is a flush/sync of the JFS2 metadata to disk. Apparently
some program went recursively through the filesystem accessing files,
forcing the inode access timestamps to change. These changes would have
to be propagated to the disk.

Here are a few reasons why j2pg would be active and consume high CPU:
1. If there are several processes issuing sync, the j2pg process will be
very active, using CPU resources.
2. If there is file system corruption, j2pg will use more CPU
resources.
3. If the storage is not moving data fast enough, the j2pg process
will use a high amount of CPU resources.

j2pg will get started for any JFS2 directory activity.
Another event that can cause j2pg activity is syncd.
If the system experiences a lot of JFS2 directory activity, the j2pg process
will also be active handling the I/O.
Since syncd flushes I/O from real memory to disk, any JFS2 directories
with files in the buffer will also be hit."

"Checking the syncd...

From data, we see:
$ grep -c sync psb.elfk
351 << this is high
$ grep sync psb.elfk | grep -c oracle
348 << syncd called by Oracle user only

It appears that the number of sync calls, which cause j2pg
to run, is causing the spikes.

We see:
/usr/sbin/syncd 60

J2pg is responsible for flushing data to disk and is
usually called by the syncd process. If you have a
large number of sync processes running on the system,
that would explain the high CPU for j2pg.

The syncd setting determines the frequency with which
the I/O disk-write buffers are flushed.

The AIX default value for syncd as set in /sbin/rc.boot
is 60. It is recommended to change this value to 10.

This will cause the syncd process to run more often
and not allow the dirty file pages to accumulate,
so it runs more frequently but for a shorter period of
time. If you wish to make this permanent, edit
the /sbin/rc.boot file and change the 60 to 10.

You may consider mounting all of the
non-rootvg file systems with the 'noatime' option.
This can be done without any outage:

However, selecting non-peak production hours is better:

Use the commands below, for example:
# mount -o remount,noatime /oracle
Then use chfs to make it persistent:
# chfs -a options=noatime /oracle

- noatime -
Turns off access-time updates. Using this option can
improve performance on file systems where a large
number of files are read frequently and seldom updated.
If you use the option, the last access time for a file
cannot be determined. If neither atime nor noatime is
specified, atime is the default value."
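To apply noatime across every file system in a non-rootvg volume group, a small loop like this sketch can help; "datavg" is a placeholder volume group name, and as the quote above suggests, run it outside peak hours:

# for fs in $(lsvgfs datavg); do
>   mount -o remount,noatime $fs
>   chfs -a options=noatime $fs
> done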

NIM Master, NIMSH and SSL on AIX 7.1 TL4 SP3


Whilst working with one of my AIX customers recently I discovered a problem with NIMSH and SSL. The customer had updated their NIM master from AIX 7.1 TL4 SP1 to AIX 7.1 TL4 SP3. After the SP update, any attempt to connect to a NIM client (over NIMSH+SSL), from the NIM master, would simply hang. For example, we tried to list the filesets on the NIM client with this command, which never returned any output.

[root@750lpar4]/ # nim -o lslpp 750lpar9

The /var/adm/ras/nimsh.log file, on the NIM client, showed that the NIMSH session stopped here:

Thu Jan 12 14:31:49 2017        Loading certificates..
Thu Jan 12 14:31:49 2017        Loading private key file..
Thu Jan 12 14:31:49 2017        create BIO

NIM master: 750lpar4
7100-04-03-1543

NIM client: 750lpar9
7100-04-02-1614

[root@750lpar4]/ # lsnim -l 750lpar9
750lpar9:
   class          = machines
   type           = standalone
   connect        = nimsh (secure)
   platform       = chrp
   netboot_kernel = 64
   if1            = 10_1_50 750lpar9 0
   cable_type1    = N/A
   Cstate         = ready for a NIM operation
   prev_state     = not running
   Mstate         = currently running
   cpuid          = 00F603CD4C00
   Cstate_result  = success

The root cause of the problem became apparent when we ran truss against the nim -o command.

[root@750lpar4]/ # truss -adef -o truss.lsnim.out -w all nim -o lslpp 750lpar9

[root@750lpar4]/ # cat truss.lsnim.out
13959372: C o u l d   n o t   l o a d   m o d u l e   / u s r / l i b / l
13959372: i b s s l . s o .\n S y s t e m   e r r o r :   N o   s u c h
13959372: f i l e   o r   d i r e c t o r y
19267612: C o u l d   n o t   l o a d   m o d u l e   / u s r / l i b / l
19267612: i b c r y p t o . s o .\n S y s t e m   e r r o r :   N o   s u
19267612: c h   f i l e   o r   d i r e c t o r y

The required shared library object files were missing on the NIM master.

[root@750lpar4]/usr/lib # ls -ltr libssl.so libcrypto.so
libssl.so not found
libcrypto.so not found

We fixed this issue by extracting the missing files from the (existing) /usr/lib/libssl.a and /usr/lib/libcrypto.a archives.

[root@750lpar4]/usr/lib # slibclean

[root@750lpar4]/usr/lib # /bin/ar -v -x /usr/lib/libssl.a /usr/lib/libssl.so
x - /usr/lib/libssl.so

[root@750lpar4]/usr/lib # /bin/ar -v -x /usr/lib/libcrypto.a /usr/lib/libcrypto.so
x - /usr/lib/libcrypto.so

[root@750lpar4]/usr/lib # ls -ltr libssl.so libcrypto.so
-rwxr-xr-x    1 root     system       724913 Jan 18 09:08 libssl.so
-rwxr-xr-x    1 root     system      3031337 Jan 18 09:08 libcrypto.so

After that, the nim -o commands started working again.

[root@750lpar4]/usr/lib # nim -o showlog 750lpar9
HELLO

So, the question is: why did this happen? In the past, the libssl.so.0.9.8 shared object was extracted by NIM, but more recent OpenSSL updates have forced IBM to move to libssl.so. Usually, the extracted shared library object is added (if not already present) when nimconfig -c is run. But given that this was an existing NIM master, we did not want to run that again (as we would lose all of the current SSL key access), so extracting the objects is preferred. The problem is that the libssl.so and libcrypto.so files are not populated when the AIX 7100-04-03 update is applied. This is a bug and will be officially addressed soon, under APAR IV93152: NIM push operation to client hang on nimsh over SSL.
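Until the APAR fix is applied, a quick sanity check after any update on the NIM master can catch this early. A sketch:

# for lib in /usr/lib/libssl.so /usr/lib/libcrypto.so; do [ -s "$lib" ] || echo "$lib missing or empty - extract it from the matching .a archive"; done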

I believe this issue may also occur when you migrate your NIM master from AIX 7.1 to 7.2 (with nimadm for example). But I need to do more testing to reproduce and confirm the issue.

Here’s one good reason to set up NIMSH over SSL.

NIMSH, SSL and LPM

The following link is a great reference guide for configuring NIMSH over SSL.

NIMSH over SSL