Tuesday, 30 May 2017

J2pg high CPU usage on AIX

J2pg high CPU usage on AIX


Recently I've come across an odd issue at two different customers. I thought I'd share the experience, in case others also come across this strange behaviour.

In both cases they reported j2pg high CPU usage.

Similar to this...

image

And, in both cases, we discovered many /usr/sbin/update processes running. Unexpectedly.

When we stopped these processes, j2pg's CPU consumption dropped to nominal levels.

The j2pg process is responsible for, among other things, flushing data to disk and is called by the sync(d) process.

The /usr/sbin/update command is just a script that calls sync in a loop. Its purpose it to "..periodically update the super block., ..execute a sync subroutine every 30 seconds. This action ensures the file system is up-to-date in the event of a system crash".


# cat /usr/sbin/update
...BLAH...
PATH=/usr/bin:/usr/sbin:/etc

while true
do
sync
sleep 30
done &
exit 0

Because of the large number of /usr/sbin/update (sync) processes (in some cases over 150 of them), j2pg was constantly kept busy, assisting with flushing data to disk.

It appears that the application team (in both cases) was attempting to perform some sort of SQL "update" but due to an issue with their shell environment/PATH setting they were calling /usr/sbin/update instead of the intended update (or UPDATE) command. And yes, a non-root user can call /usr/sbin/update - no problem. So, in the "ps -ef" output we found processes that looked similar to this:

fred 50791260 1 0 Jan 09 - 0:04 /usr/bin/bsh /usr/sbin/update prd_ctl_q_and_p.af_data_obj set as_of_dt=2016-12-09 00:00:00.000000 where DATA_OBJ_NM like %LOANIQ%20161209%

# ls -ltr /usr/sbin/update
-r-xr-xr-x    1 bin      bin             943 Aug 17 2011  /usr/sbin/update

The application teams were directed to fix their scripts to prevent them calling /usr/sbin/update and instead call the correct command.

And here’s some information (more than you’ll probably ever need to know) about j2pg on AIX.


"j2pg - Kernel process integral to processing JFS2 I/O requests.

The kernel thread is responsible of managing I/Os in JFS2 filesystems,
so it is normal to see it running in case of lot of I/Os or syncd.

And we could see that j2pg runs syncHashList() very often.The sync is
done in syncHashList(). In syncHashList(), all inodes are extracted from hash
list. And whether the inode needs to synchronize or not is then judged
by iSyncNeeded().

** note that a sync() call will cause the system to scan *all* the
memory currently used for filecaching to see which pages are dirty
and have to be synced to disk

Therefore, the cause of j2pg having this spike is determined by the
two calls that were being made (iSyncNeeded ---> syncHashList). What is
going on here is a flush/sync of the JFS2 metadata to disk. Apparently
some program went recursively through the filesystem accessing files
forcing the inode access timestamp to change. These changes would have
to propogated to the disk.

Here's a few reasons why j2pg would be active and consume high CPU:
1. If there several process issuing sync then the j2pg process will be
very active using cpu resources.
2. If there is file system corruption then the j2pg will use more cpu
resources.
3. If the storage is not running data fast enough then the j2pg process
will be using high amount of cpu resources.

j2pg will get started for any JFS2 dir activity.
Another event that can cause j2pg activity, is syncd.
If the system experiences a lot of JFS2 dir activity, the j2pg process
will also be active handling the I/O.
Since syncd flushes I/O from real memory to disk, then any JFS2 dir's
with files in the buffer will also be hit."

"Checking the syncd...

From data, we see:
$ grep -c sync psb.elfk
351 << this is high
$ grep sync psb.elfk | grep -c oracle
348 << syncd called by Oracle user only

It appears that the number of sync which causes j2pg
to run is causing spikes.

We see:
/usr/sbin/syncd 60

J2pg is responsible for flushing data to disk and is
usually called by the syncd process. If you have a
large number of sync processes running on the system,
that would explain the high CPU for j2pg

The syncd setting determines the frequency with which
the I/O disk-write buffers are flushed.

The AIX default value for syncd as set in /sbin/rc.boot
is 60. It is recommended to change this value to 10.

This will cause the syncd process to run more often
and not allow the dirty file pages to accumulate,
so it runs more frequently but for shorter period of
time. If you wish to make this permanent then edit
the /sbin/rc.boot file and change to the 60 to 10.

You may consider mounting all of the
non-rootvg file systems with the 'noatime' option.
This can be done without any outage:

However selecting a non-peak production hours is better:

Use the commands below: For Example:
# mount -o remount,noatime /oracle
Then use chfs to make is persistent:
# chfs -a options=noatime /oracle

- noatime -
Turns off access-time updates. Using this option can
improve performance on file systems where a large
number of files are read frequently and seldom updated.
If you use the option, the last access time for a file
cannot be determined. If neither atime nor noatime is
specified, atime is the default value."

No comments:

Post a Comment