Friday, 19 May 2017

Optimizing AIX 7 performance: Part 3

 Optimizing AIX 7 performance: Part 3, Tune with ioo, filemon, fileplace, JFS and JFS2

Summary: Part 3 of the AIX 7 performance series covers how to improve overall file system performance, how to tune your systems with the ioo command, and how to use the filemon and fileplace utilities. You will also learn about JFS and JFS2, which are available in AIX 7.

About this series

This three-part series (see Resources) on the AIX® disk and I/O subsystem focuses on the challenges of optimizing disk I/O performance. While disk tuning is arguably less exciting than CPU or memory tuning, it is a crucial component in optimizing server performance. In fact, partly because disk I/O is your weakest subsystem link, there is more you can do to improve disk I/O performance than on any other subsystem.

Introduction

The first and second installments of this series discussed the importance of architecting your systems, the impact it can have on overall system performance, and a new I/O tuning tool, lvmo, which you can use to tune logical volumes. In this installment, you will examine how to tune your systems using the ioo command, which configures the majority of I/O tuning parameters and displays the current or next boot values for all of them. You will also learn how and when to use the filemon and fileplace tools. Because the enhanced journaled file system (JFS2) is the default file system within AIX, improving your overall file system performance, tuning your file systems, and getting the best out of JFS2 are all important parts of your tuning toolkit. You'll even examine some file system attributes, such as sequential and random access, which can affect performance.

File system overview

This section discusses JFS2, file system performance, and specific performance improvements over JFS. As you know, there are two types of kernels in AIX: a 32-bit kernel and a 64-bit kernel. While they share some common libraries and most commands and utilities, it is important to understand their differences and how the kernels relate to overall performance tuning. JFS2 has been optimized for the 64-bit kernel, while JFS is optimized for the 32-bit kernel. Journaling file systems, while much more secure, have historically been associated with performance overheads. In a Performance Rules shop (at the expense of availability), you would disable metadata logging to increase performance with JFS. With JFS2, you can also disable logging (in AIX 6.1 and higher) to help increase performance. You can disable logging at the point of mounting the filesystem, which means that you don't need to worry about changing or reconfiguring the filesystem. You can instead just modify your mount options. For example, to disable logging on a filesystem, you would use the following: mount -o log=NULL /database.

Although JFS2 was optimized to improve the performance of metadata operations, that is, those normally handled by the logging framework, switching logging off can have a significant performance benefit for filesystems where there is a high proportion of file changes and newly created or deleted files. For example, filesystems on development systems may see an increase in performance. For databases where the files used are static, the performance improvement may be less significant.

However, you should be careful when making use of compression. Although compression can save disk space (and disk reads and writes, since less data is physically read from or written to the disk), the overhead on systems with a heavy CPU load can actually slow performance down.

Enhanced JFS2 uses a binary tree representation while performing inode searches, which is a much better method than the linear method used by JFS. Furthermore, you do not need to assign inodes anymore when creating file systems, as they are now dynamically allocated by JFS2 (meaning you won't be running out of them).

While concurrent I/O was covered in the first installment of the series, it's worth another mention here. Concurrent I/O allows multiple threads to read and write data concurrently to the same file. This is due to the way in which JFS2 is implemented with write-exclusive inode locks, which allow multiple users to read the same file simultaneously. This increases performance dramatically when multiple users read from the same data file. To turn concurrent I/O on, you just need to mount the file system with the appropriate flags (see Listing 1). We recommend that you look at using concurrent I/O when using databases such as Oracle.

Listing 1. Turning on concurrent I/O
                
root@lpar29p682e_pub[/] > mount -o cio /test
root@lpar29p682e_pub[/] > df -k /test
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00        131072    130724    1%        4     1% /test
Table 1 illustrates the various enhancements of JFS2 and how they relate to systems performance. It's also important to understand that when tuning your I/O systems, many of the tunables themselves (you'll get into that later) differ, depending on whether you are using JFS or JFS2.

Table 1. Enhancements of JFS2
Function | JFS | JFS2
Compression | Yes | No
Quotas | Yes | Yes
Deferred update | Yes | No
Direct I/O support | Yes | Yes
Optimization | 32-bit | 64-bit
Max file system size | 1 terabyte | 4 petabytes
Max file size | 64 gigabytes | 4 petabytes
Number of inodes | Fixed when creating f/s | Dynamic
Large file support | As mount option | Default
On-line defragmentation | Yes | Yes
Namefs | Yes | Yes
DMAPI | No | Yes

Filemon and fileplace

This section introduces two important I/O tools, filemon and fileplace, and discusses how you can use them during systems administration each day.

Filemon uses a trace facility to report on the I/O activity of physical and logical storage, including your actual files. The I/O activity monitored is based on the time interval that is specified when running the trace. It reports on all layers of file system utilization, including the Logical Volume Manager (LVM), virtual memory, and physical disk layers. Without any flags, it runs in the background while application programs or system commands are being run and monitored. The trace starts automatically and runs until it is stopped. At that time, the command generates an I/O activity report and exits. It can also process a trace file that has been recorded by the trace facility, and reports can then be generated from this file. Because reports generated to standard output usually scroll past your screen, it's recommended that you use the -o option to write the output to a file (see Listing 2).

Listing 2. Using filemon with the -o option
                  
l488pp065_pub[/] > filemon -o dbmon.out -O all

Run trcstop command to signal end of trace.
Thu Aug 12 09:07:06 2010
System: AIX 7.1 Node: l488pp065_pub Machine: 00F604884C00
l488pp065_pub[/] > trcstop

l488pp065_pub[/] > cat dbmon.out
Thu Aug 12 09:10:09 2010
System: AIX 7.1 Node: l488pp065_pub Machine: 00F604884C00
Cpu utilization:  72.8%
Cpu allocation:  100.0%

21947755 events were lost.  Reported data may have inconsistencies or errors.

Most Active Files
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                     volume:inode
------------------------------------------------------------------------
   0.4      1    101      0  unix                     /dev/hd2:82241
   0.0      9     10      0  vfs                      /dev/hd4:9641
   0.0      4      6      1  db.sql                 
   0.0      3      6      2  ksh.cat                  /dev/hd2:111192
   0.0      1      2      0  cmdtrace.cat             /dev/hd2:110757
   0.0     45      1      0  null                   
   0.0      1      1      0  dd.cat                   /dev/hd2:110827
   0.0      9      2      0  SWservAt                 /dev/hd4:9156
   0.0      1      0      3  db2.sql                
   0.0      9      2      0  SWservAt.vc              /dev/hd4:9157

Most Active Segments
------------------------------------------------------------------------
  #MBs  #rpgs  #wpgs  segid  segtype                  volume:inode
------------------------------------------------------------------------
   0.1      2     13   8359ba  client                 

Most Active Logical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume                   description
------------------------------------------------------------------------
  0.04      0     32    0.3  /dev/hd9var              /var
  0.00      0     48    0.5  /dev/hd8                 jfs2log
  0.00      0      8    0.1  /dev/hd4                 /

Most Active Physical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume                   description
------------------------------------------------------------------------
  0.00      0     72    0.7  /dev/hdisk0              N/A

Most Active Files Process-Wise
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                     PID(Process:TID)
------------------------------------------------------------------------
   0.0      3      6      0  db.sql                  7667828(ksh:9437345)
   0.0      1      2      0  ksh.cat                 7667828(ksh:9437345)
   0.0      1      0      3  db2.sql                 7667828(ksh:9437345)
   0.0      1      0      1  db.sql                  7733344(ksh:7405633)
   0.4      1    101      0  unix                    7667830(ksh:9437347)
   0.0      1      2      0  cmdtrace.cat            7667830(ksh:9437347)
   0.0      1      2      0  ksh.cat                 7667830(ksh:9437347)
   0.0      9      2      0  SWservAt                7667830(ksh:9437347)
   0.0      9      2      0  SWservAt.vc             7667830(ksh:9437347)
   0.0      1      0      0  systrctl                7667830(ksh:9437347)
   0.0     44      0     44  null                    4325546(slp_srvreg:8585241)
   0.0      1      2      2  ksh.cat                 7667826(ksh:23527615)
   0.0      1      1      0  dd.cat                  7667826(ksh:23527615)
   0.0      1      1      0  null                    7667826(ksh:23527615)
   0.0      1      0      0  test                    7667826(ksh:23527615)
   0.0      8      8      0  vfs                     3473482(topasrec:13566119)
   0.0      1      0      0  CuAt.vc                 3473482(topasrec:13566119)
   0.0      1      0      0  CuAt                    3473482(topasrec:13566119)
   0.0      1      2      0  vfs                     2097252(syncd:2490503)
   0.0      1      0      0  installable             4260046(java:15073489)

Most Active Files Thread-Wise
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                     TID(Process:PID)
------------------------------------------------------------------------
   0.0      3      6      0  db.sql                  9437345(ksh:7667828)
   0.0      1      2      0  ksh.cat                 9437345(ksh:7667828)
   0.0      1      0      3  db2.sql                 9437345(ksh:7667828)
   0.0      1      0      1  db.sql                  7405633(ksh:7733344)
   0.4      1    101      0  unix                    9437347(ksh:7667830)
   0.0      1      2      0  cmdtrace.cat            9437347(ksh:7667830)
   0.0      1      2      0  ksh.cat                 9437347(ksh:7667830)
   0.0      9      2      0  SWservAt                9437347(ksh:7667830)
   0.0      9      2      0  SWservAt.vc             9437347(ksh:7667830)
   0.0      1      0      0  systrctl                9437347(ksh:7667830)
   0.0     44      0     44  null                    8585241(slp_srvreg:4325546)
   0.0      1      2      2  ksh.cat                 23527615(ksh:7667826)
   0.0      1      1      0  dd.cat                  23527615(ksh:7667826)
   0.0      1      1      0  null                    23527615(ksh:7667826)
   0.0      1      0      0  test                    23527615(ksh:7667826)
   0.0      8      8      0  vfs                     13566119(topasrec:3473482)
   0.0      1      0      0  CuAt.vc                 13566119(topasrec:3473482)
   0.0      1      0      0  CuAt                    13566119(topasrec:3473482)
   0.0      1      2      0  vfs                     2490503(syncd:2097252)
   0.0      1      0      0  installable             15073489(java:4260046)
dbmon.out: END
Look for long seek times, as they can result in decreased application performance. By looking at the read and write sequence counts in detail, you can further determine if the access is sequential or random. This helps you when it is time to do your I/O tuning. This output clearly illustrates that there is no I/O bottleneck visible. Filemon provides a tremendous amount of information and, truthfully, we've found there is too much information at times. Further, there can be a performance hit using filemon, depending on how much general file activity there is while filemon is running. Let's look at the topas results while running filemon (see Figure 1).

Figure 1. topas results while running filemon

In the figure above, filemon is taking up almost 60 percent of the CPU! This is actually less than in previous AIX versions but still a significant impact on your overall system performance. We don't typically like to recommend performance tools that have such a substantial overhead, so we'll reiterate that while filemon certainly has a purpose, you need to be very careful when using it.
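Because a filemon report is long, it can help to pull out only the section you care about. The following is a minimal sketch, assuming the report layout shown in Listing 2; the function name lv_activity is hypothetical and not part of AIX. It reads a report on standard input and prints each logical volume with its utilization and KB/s, busiest first.

```shell
# lv_activity: read a filemon report on stdin and print one line per logical
# volume in the form "<volume> <util> <KB/s>", highest KB/s first.
# Illustrative helper only; it assumes the column layout seen in Listing 2.
lv_activity() {
    awk '
        /^Most Active Logical Volumes/ { in_lv = 1; next }   # section of interest starts
        /^Most Active/                 { in_lv = 0 }          # any other section ends it
        in_lv && $5 ~ /^\/dev\//       { print $5, $1, $4 }   # keep data rows only
    ' | LC_ALL=C sort -k3,3nr
}
```

With a report saved by the -o option, this could be used as lv_activity < dbmon.out, which avoids rerunning filemon and paying its overhead a second time.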

What about fileplace? Fileplace reports the placement of a file's blocks within a file system. It is commonly used to examine and assess the efficiency of a file's placement on disk. For what purposes do you use it? One reason would be to help determine if some of the heavily utilized files are substantially fragmented. It can also help you determine the physical volume with the highest utilization and whether or not the drive or I/O adapter is causing the bottleneck.

Let's look at an example of a frequently accessed file in Listing 3.

Listing 3. Frequently accessed file
                
fileplace -pv /tmp/logfile
File: /tmp/logfile  Size: 63801540 bytes  Vol: /dev/hd3
Blk Size: 4096  Frag Size: 4096  Nfrags: 15604 
Inode: 7  Mode: -rw-rw-rw-  Owner: root  Group: system  

  Physical Addresses (mirror copy 1)                                    Logical Extent
  ----------------------------------                                  ----------------
  02884352-02884511  hdisk0        160 frags    655360 Bytes,   1.0%  00000224-00000383
  02884544-02899987  hdisk0      15444 frags  63258624 Bytes,  99.0%  00000416-00015859
        unallocated           -27 frags      -110592 Bytes      0.0%

  15604 frags over space of 15636 frags:   space efficiency = 99.8%
  2 extents out of 15604 possible:   sequentiality = 100.0%
  
You should be interested in space efficiency and sequentiality here. Higher space efficiency means files are less fragmented and provide better sequential file access. A higher sequentiality tells you that the files are more contiguously allocated, which will also be better for sequential file access. In this case, both values are high: space efficiency is 99.8% and sequentiality is 100.0%. If space efficiency and sequentiality are too low, you might want to consider file system reorganization. You would do this with the reorgvg command, which can improve logical volume utilization and efficiency. You may also want to consider using the defragfs command, which can help ensure that the free space on your filesystem is contiguous, which will help with future writes and file creates. Defragmentation can occur in the background while you are using your system.
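The two percentages in Listing 3 can be reproduced from the extent rows themselves. The sketch below is a rough illustration, not the fileplace implementation: it assumes space efficiency is the fragments used divided by the span between the lowest and highest physical address, and sequentiality is (fragments - extents + 1) divided by fragments, which agrees with the numbers in Listing 3. The function name fileplace_summary is hypothetical, and only a single mirror copy is assumed.

```shell
# fileplace_summary: read "fileplace -pv" output on stdin and recompute
# space efficiency and sequentiality from the physical extent rows.
# Assumes one mirror copy and rows such as:
#   02884352-02884511  hdisk0  160 frags  655360 Bytes,  1.0%  00000224-00000383
fileplace_summary() {
    awk '
        $1 ~ /^[0-9]+-[0-9]+$/ {
            split($1, range, "-")
            first = range[1] + 0            # first fragment of this extent
            last  = range[2] + 0            # last fragment of this extent
            if (extents == 0 || first < lo) lo = first
            if (last > hi) hi = last
            frags += last - first + 1       # fragments in this extent
            extents++
        }
        END {
            if (extents == 0) exit 1        # no extent rows found
            span = hi - lo + 1              # address range the file is spread over
            printf "space efficiency = %.1f%%\n", 100 * frags / span
            printf "sequentiality = %.1f%%\n", 100 * (frags - extents + 1) / frags
        }
    '
}
```

On a live system this could be fed directly, for example fileplace -pv /tmp/logfile | fileplace_summary, to compare several files without reading each full report.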

Tuning with ioo

This section discusses the use of the ioo command, which is used for virtually all I/O-related tuning parameters.

As with vmo, you need to be extremely careful when changing ioo parameters, as changing parameters on the fly can cause severe performance degradation. Table 2 details specific tuning parameters that you use often for JFS and JFS2 file systems. As you can see, the majority of the tuning commands for I/O use the ioo utility.

Table 2. Specific tuning parameters

Function | JFS tuning parameter | Enhanced JFS tuning parameter
Sets max amount of memory for caching files | vmo -o maxperm=value | vmo -o maxclient=value (< or = maxperm)
Sets min amount of memory for caching | vmo -o minperm=value | n/a
Sets a limit (hard) on memory for caching | vmo -o strict_maxperm | vmo -o maxclient (hard limit)
Sets max pages used for sequential read ahead | ioo -o maxpgahead=value | ioo -o j2_maxPageReadAhead=value
Sets min pages used for sequential read ahead | ioo -o minpgahead=value | ioo -o j2_minPageReadAhead=value
Sets max number of pending write I/Os to a file | chdev -l sys0 -a maxpout=value | chdev -l sys0 -a maxpout=value
Sets min number of pending write I/Os to a file at which programs blocked by maxpout might proceed | chdev -l sys0 -a minpout=value | chdev -l sys0 -a minpout=value
Sets the amount of modified data cache for a file with random writes | ioo -o maxrandwrt=value | ioo -o j2_maxRandomWrite=value, ioo -o j2_nRandomCluster=value
Controls gathering of I/Os for sequential write behind | ioo -o numclust=value | ioo -o j2_nPagesPerWriteBehindCluster=value
Sets the number of f/s bufstructs | ioo -o numfsbufs=value | ioo -o j2_nBufferPerPagerDevice=value
Let's further discuss some of the more important parameters below, as we've already discussed all the vmo tuning parameters in the memory tuning series (see Resources).
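When moving tuning notes from JFS to JFS2, the ioo rows of Table 2 amount to a simple name mapping. The following sketch captures just those rows; the function name jfs2_equivalent is hypothetical, and the mapping is taken from Table 2 rather than from any AIX tool.

```shell
# jfs2_equivalent: print the Enhanced JFS (JFS2) ioo tunable that Table 2
# lists alongside a given JFS ioo tunable. Illustrative helper only.
jfs2_equivalent() {
    case "$1" in
        maxpgahead) echo j2_maxPageReadAhead ;;
        minpgahead) echo j2_minPageReadAhead ;;
        maxrandwrt) echo j2_maxRandomWrite ;;
        numclust)   echo j2_nPagesPerWriteBehindCluster ;;
        numfsbufs)  echo j2_nBufferPerPagerDevice ;;
        *)
            # Anything else is not covered by the ioo rows of Table 2.
            echo "no JFS2 equivalent listed for $1" >&2
            return 1
            ;;
    esac
}
```

For example, jfs2_equivalent numclust prints j2_nPagesPerWriteBehindCluster. The values themselves do not carry over one-to-one, since the defaults and units differ between the two file systems.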

There are several ways you can determine the existing ioo values on your system. The long display listing for ioo clearly gives you the most information (see Listing 4). It lists the current value, default value, reboot value, range, unit, type, and dependencies of all tunable parameters managed by ioo.

Listing 4. Display for ioo
                
root@lpar29p682e_pub[/] > ioo -L
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT           TYPE
     DEPENDENCIES

j2_atimeUpdateSymlink     0      0      0      0      1      boolean           D
j2_dynamicBufferPreallo   16     16     16     0      256    16K slabs         D
j2_inodeCacheSize         400    400    400    1      1000                     D
j2_maxPageReadAhead       128    128    128    0      64K    4KB pages         D
j2_maxRandomWrite         0      0      0      0      64K    4KB pages         D
j2_maxUsableMaxTransfer   512    512    512    1      4K     pages             M
j2_metadataCacheSize      400    400    400    1      1000                     D
j2_minPageReadAhead       2      2      2      0      64K    4KB pages         D
j2_nBufferPerPagerDevice  512    512    512    512    256K                     M
j2_nPagesPerWriteBehindC  32     32     32     0      64K                      D
j2_nRandomCluster         0      0      0      0      64K    16KB clusters     D
j2_nonFatalCrashesSystem  0      0      0      0      1      boolean           D
j2_syncModifiedMapped     1      1      1      0      1      boolean           D
j2_syncdLogSyncInterval   1      1      1      0      4K     iterations        D
jfs_clread_enabled        0      0      0      0      1      boolean           D
jfs_use_read_lock         1      1      1      0      1      boolean           D
lvm_bufcnt                9      9      9      1      64     128KB/buffer      D
maxpgahead                8      8      8      0      4K     4KB pages         D
     minpgahead
maxrandwrt                0      0      0      0      512K   4KB pages         D
memory_frames             512K          512K                 4KB pages         S
minpgahead                2      2      2      0      4K     4KB pages         D
     maxpgahead
numclust                  1      1      1      0      2G-1   16KB/cluster      D
numfsbufs                 196    196    196    1      2G-1                     M
pd_npages                 64K    64K    64K    1      512K   4KB pages         D
pgahd_scale_thresh        0      0      0      0      419430 4KB pages         D
pv_min_pbuf               512    512    512    512    2G-1                     D
sync_release_ilock        0      0      0      0      1      boolean           D

n/a means parameter not supported by the current platform or kernel

Parameter types:
    S = Static: cannot be changed
    D = Dynamic: can be freely changed
    B = Bosboot: can only be changed using bosboot and reboot
    R = Reboot: can only be changed during reboot
    C = Connect: changes are only effective for future socket connections
    M = Mount: changes are only effective for future mountings
    I = Incremental: can only be incremented
    d = deprecated: deprecated and cannot be changed
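A long listing like the one in Listing 4 is easier to audit if you filter it down to the tunables whose current value differs from the default. The following is a minimal sketch, assuming the column order NAME, CUR, DEF shown above; the function name ioo_changed is hypothetical.

```shell
# ioo_changed: read "ioo -L" output on stdin and print tunables whose CUR
# value differs from DEF, together with the parameter type (last column).
# Values are expected to look like 128, 64K or 2G-1; header, dependency and
# legend lines do not match that pattern and are skipped.
ioo_changed() {
    awk '
        NF >= 5 &&
        $2 ~ /^[0-9]+[KMG]?(-1)?$/ &&
        $3 ~ /^[0-9]+[KMG]?(-1)?$/ &&
        $2 != $3 {
            printf "%s cur=%s def=%s type=%s\n", $1, $2, $3, $NF
        }
    '
}
```

It could be run as ioo -L | ioo_changed. The type column matters when reading the result: a changed tunable of type M, for instance, only affects file systems mounted after the change.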
Listing 5 below shows you how to change a tunable.

Listing 5. Changing a tunable
                
root@lpar29p682e_pub[/] > ioo -o maxpgahead=32
Setting maxpgahead to 32
root@lpar29p682e_pub[/] >
This parameter is used for JFS only. For JFS2, there are additional file system performance enhancements, including sequential page read ahead and sequential and random write behind. The Virtual Memory Manager (VMM) of AIX anticipates page requirements by observing the patterns in which files are accessed. When a program accesses two successive pages of a file, VMM assumes that the program will keep accessing the file sequentially. The number of pages to be read ahead can be configured using VMM thresholds. With JFS2, make note of these two important parameters:

  • j2_minPageReadAhead: Determines the number of pages read ahead when VMM initially detects a sequential pattern.
  • j2_maxPageReadAhead: Determines the maximum number of pages that VMM can read ahead in a sequential file.
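To get a feel for how these two values interact, the sketch below prints a possible read-ahead progression. It assumes that the read-ahead size starts at the minimum and doubles on each further sequential access until it reaches the maximum; that doubling behavior is an assumption of this illustration, and the function name readahead_schedule is hypothetical.

```shell
# readahead_schedule MIN MAX: print the assumed sequence of read-ahead sizes
# (in 4KB pages), starting at MIN and doubling until capped at MAX.
readahead_schedule() {
    min=$1
    max=$2
    if [ "$min" -lt 1 ]; then
        # Doubling from zero would never grow, so reject it in this sketch.
        echo "minimum must be at least 1 page" >&2
        return 1
    fi
    n=$min
    while [ "$n" -lt "$max" ]; do
        printf '%s ' "$n"
        n=$(( n * 2 ))
    done
    printf '%s\n' "$max"
}
```

With the defaults from Listing 4, readahead_schedule 2 128 prints 2 4 8 16 32 64 128, which shows that under this assumption a sequential reader needs several accesses before the maximum read-ahead size is reached.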
Sequential and random write behind relate to writing modified pages in memory to disk after a certain threshold is reached, rather than waiting for syncd to flush the pages to disk. The reason for this is to limit the number of dirty pages in memory, which further reduces I/O overhead and disk fragmentation. The two types of write behind are sequential and random. With sequential write behind, pages do not stay in memory until the syncd daemon runs, which could otherwise cause real bottlenecks. With random write behind, once the number of pages in memory exceeds a specified amount, all subsequent pages are written to disk.

For sequential write behind, you specify the number of pages to be scheduled for writing with the j2_nPagesPerWriteBehindCluster parameter. By default the value is 32 (that is, 128KB). For modern disks and high-write environments, such as databases, you may want to increase this parameter so that more data is written in a single block when the data needs to be synced to disk.
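Because this tunable is expressed in 4KB pages, it helps to convert between pages and kilobytes when choosing a value. The two helpers below are a small sketch of that arithmetic; their names are hypothetical.

```shell
# pages_to_kb PAGES: size in KB of a cluster made of PAGES 4KB pages.
pages_to_kb() {
    echo $(( $1 * 4 ))
}

# kb_to_pages KB: number of 4KB pages needed to cover KB kilobytes,
# rounded up to a whole page.
kb_to_pages() {
    echo $(( ($1 + 3) / 4 ))
}
```

For example, pages_to_kb 32 gives 128, matching the default described above, and kb_to_pages 512 gives 128, the value you would pass to ioo -o j2_nPagesPerWriteBehindCluster=128 for a 512KB cluster.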

Random write behind can be configured by changing the values of j2_nRandomCluster and j2_maxRandomWrite. The j2_maxRandomWrite parameter specifies the number of pages of a file that are allowed to stay in memory. The default is 0 (meaning that information is written out as quickly as possible), and this is used to ensure data integrity. If you are willing to sacrifice some integrity in the event of a system failure, you can increase these values for better write performance. The modified pages then stay in cache longer, so after a system failure some data may not have been written to disk. The j2_nRandomCluster parameter defines how many clusters apart two writes must be to be considered random. Increasing this value can lower the write frequency if you have a high number of files being modified at the same time.
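The cluster-distance rule can be illustrated with a few lines of arithmetic. This is only a sketch of the idea, not the JFS2 algorithm: it assumes 16KB clusters (the unit shown for j2_nRandomCluster in Listing 4) and assumes that a write counts as random when its distance from the previous write exceeds the threshold. The function name is_random_write is hypothetical.

```shell
# Assumed cluster size in bytes (16KB, per the unit shown in Listing 4).
CLUSTER_BYTES=16384

# is_random_write PREV_OFFSET THIS_OFFSET THRESHOLD
# Succeeds (exit status 0) when the two byte offsets fall more than
# THRESHOLD clusters apart, which this sketch treats as a random write.
is_random_write() {
    prev_cluster=$(( $1 / CLUSTER_BYTES ))
    this_cluster=$(( $2 / CLUSTER_BYTES ))
    distance=$(( this_cluster - prev_cluster ))
    if [ "$distance" -lt 0 ]; then
        distance=$(( 0 - distance ))     # distance is direction-independent
    fi
    [ "$distance" -gt "$3" ]
}
```

With a threshold of 4, a write at offset 163840 following one at offset 0 is 10 clusters away and would be treated as random, while a write at offset 16384 is only 1 cluster away and would not.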

Another important area worth mentioning is large sequential I/O processing. When there is too much simultaneous I/O to your file systems, the I/O can bottleneck at the file system level. In this case, you should increase the j2_nBufferPerPagerDevice parameter (numfsbufs with JFS). If you use raw I/O as opposed to file systems, this same type of bottleneck can occur through LVM. Here is where you might want to tune the lvm_bufcnt parameter.

Summary

This article focused on file system performance. You examined the enhancements in JFS2 and why it would be the preferred file system. Further, you used tools, such as filemon and fileplace, to gather more detailed information about the actual file structures and how they relate to I/O performance. Finally, you tuned your I/O subsystem by using the ioo command. You learned about the j2_minPageReadAhead and j2_maxPageReadAhead parameters in an effort to increase performance when encountering sequential I/O.

During this three-part series on I/O you learned that, perhaps more so than any other subsystem, your tuning must start prior to stress testing your systems. Architecting the systems properly can do more to increase performance than anything you can do with tuning I/O parameters. This includes strategic disk placement and making sure you have enough adapters to handle the throughput of your disks. Further, while this series focused on I/O, understand that the VMM is also very tightly linked with I/O performance and must also be tuned to receive optimum I/O performance.
