IBM Tivoli Storage Manager (ITSM)
Performance Tuning

Introduction

These are some of the notes I have collected, based on experience and on posts to the ADSM-L mailing list, on the intricate task of tuning the IBM Tivoli Storage Manager server. Much of it stems from my background as a UNIX sysadmin, and can be applied to software other than TSM.

I know some of the advice below differs from that found elsewhere, whether it be advice on the ADSM-L mailing list, SHARE notes (e.g. seminar 5722) or even IBM Redbooks. Some of the advice below is based on pure theory, but much of it is based on experience. I'm not saying I'm right and they're wrong; we just have a difference of opinion. Feel free to tell me your thoughts.

This is a work in progress and will be updated as time permits.

Filesystem vs. Raw Volumes

My personal opinion is that, if you don't mind the slight additional "complexity" of raw volumes, they are worth it from a performance standpoint. I/O to raw volumes (disk slices, logical volumes, etc.) bypasses the operating system's buffer cache, which means the operating system wastes fewer CPU cycles on cache management, and the data takes a shorter path through the kernel.

Using raw volumes also allows more RAM to be dedicated to the TSM buffer pool, which has much better caching algorithms. In any case, any node running as a TSM server should already have its operating system buffer cache tuned to a minimum; see the next topic.
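
As a sketch of what that shift looks like in practice, the RAM freed from the OS file cache could be handed to TSM via the BUFPOOLSIZE server option. The value below is purely an assumed example:

    * dsmserv.opt: give the RAM saved from the OS file cache to TSM instead.
    * BUFPOOLSIZE is in kilobytes; 262144 (256 MB) is just an example value -
    * size it to the memory actually available on your server.
    BUFPOOLSIZE 262144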

Operating System Buffer Cache

On modern Operating Systems, the kernel maintains a buffer cache of recently accessed file data. By default, some kernels bias memory usage so heavily towards the buffer cache that the system will thrash to maintain a large set of cached file data.

AIX is one such OS. By default, AIX v4 will attempt to use up to 80% of physical memory for file data. Since TSM does such a huge amount of disk I/O, AIX will quite happily page out the TSM server process to maintain an 80% file buffer cache. This can be fixed by running the vmtune command (installed as /usr/samples/kernel/vmtune from the bos.adt.samples fileset). For large-memory AIX database systems, I usually run vmtune -p 3 -P 5 in the system startup scripts to shrink the buffer cache to a more reasonable size. This sets minperm, the lower file cache size limit, to 3%, and maxperm, the upper limit, to 5%. Another useful parameter is -h, or strict_maxperm. Setting this to 1 stops AIX using free memory for the file cache, which, on a system doing large amounts of I/O, should reduce the size of the file cache AIX has to manage, and in turn reduce kernel (system) CPU usage.
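
For example, something along these lines in a startup script (vmtune settings do not survive a reboot, so they need to be re-applied at boot):

    # Print the current VMM tuning parameters (vmtune with no arguments):
    /usr/samples/kernel/vmtune
    # Shrink the file cache: minperm=3%, maxperm=5%, strict_maxperm on:
    /usr/samples/kernel/vmtune -p 3 -P 5 -h 1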

Note: the page replacement scan rate (the sr column in vmstat) is often used as an indicator of memory pressure. Be aware that decreasing the file cache size may actually increase the scan rate, especially when setting strict_maxperm.

AIX 5 introduced the O_DIRECT flag to the open(2) system call, which bypasses the file buffer cache so that data moves directly between the application's buffers and disk. This is configurable from TSM version 5.1 via the AIXDIRECTIO server option, which is enabled by default. This should allow file system files to be used while retaining some of the performance benefits of raw volumes. Note, however, that according to IBM Redbook SG24-6554, direct I/O is only used on non-compressed, non-large-file-enabled file systems, and only by TSM on storage pool volumes (not DB or log volumes). Given these limitations, I would still recommend raw volumes be used.
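
For reference, the option lives in dsmserv.opt and, as noted, should already default to YES; a line like the following only makes the setting explicit:

    * dsmserv.opt
    AIXDIRECTIO YES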

Update (2003-10-26): I've recently been involved in setting up a new Oracle DB server under AIX 5.1. We ran into some interesting performance "features" with JFS2. Our solution was to mount all the Oracle datafile JFS2 filesystems with the "dio" mount option. This option became available in a certain maintenance release of AIX 5.1 - I'm not sure which. The database now flies, DB backups and restores run very well, and CPU usage is down (the AIX kernel thread "lrud" rarely uses any CPU).
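
For anyone wanting to try the same thing, a mount along these lines is all it takes (the device and mount point are made-up examples); adding dio to the options line of the /etc/filesystems stanza makes it permanent:

    # One-off mount with direct I/O:
    mount -o dio /dev/oradatalv /u02/oradata
    # Or, in the /etc/filesystems stanza for that filesystem:
    #     options = dio,rw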

For Solaris 2.6 (with kernel patch 105181-10 or later) and Solaris 7, it is recommended that the line set priority_paging=1 be added to /etc/system. On Solaris 8 and newer, the reworked page cache makes this setting unnecessary.
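
The relevant /etc/system fragment is simply the following (a reboot is required for /etc/system changes to take effect):

    * /etc/system - enable priority paging (Solaris 2.6 with patch, Solaris 7)
    set priority_paging=1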

TSM Database

Database Volumes (DBVols)

I/O to the TSM DBVols is usually random, with a fairly high proportion of reads. The I/O size to DBVols is 4 KiB (confirmed via truss(1) on Solaris 8), the TSM database block size.
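
If you want to verify this on your own system, attaching truss to the running dsmserv process is one way to watch the I/O sizes go by (the PID below is obviously an example):

    # Trace only the read/write family of system calls on a running dsmserv:
    truss -t read,write,pread,pwrite -p 1234 2>&1 | head -50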

Given the critical nature of DBVols, some form of mirroring or redundancy should be used. Whether to use OS RAID support, hardware RAID or TSM mirroring is somewhat a matter of taste. From a pure performance standpoint, hardware RAID should be best, provided the rules on the number of DBVols are obeyed (see below). The next best method should be TSM mirroring, as it allows TSM to submit I/Os to both halves of a mirror simultaneously, which will not happen with OS mirroring.
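
A minimal sketch of TSM mirroring, assuming raw logical volumes with made-up names; if memory serves, the MIRRORWRITE DB PARALLEL server option is what lets writes go to both halves of the mirror at once:

    define dbvolume /dev/rtsmdb01
    define dbcopy /dev/rtsmdb01 /dev/rtsmdb01m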

Within TSM, threads wishing to do I/O to a DBVol must obtain that DBVol's mutex, making I/O to each DBVol essentially synchronous. Parallelism is therefore directly controlled by the number of DBVols - the more DBVols, the more I/Os that may be in flight at any moment.

Modern (single) disks, be they SCSI, FC-AL or SSA (untested), allow multiple I/Os to be outstanding; in SCSI terminology, this is "tagged command queuing". It allows the firmware on the disk to dynamically re-order I/O requests to minimize seek times. To benefit from this feature, a reasonable number of requests must be outstanding against the disk, so that the firmware has something to re-order.

Under AIX, the queue size may be specified on a per-hdisk basis. For some disk types (e.g. HDS) the queue depth should be increased from its default value; see the queue_depth attribute of hdisks.
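
For example (the hdisk and the value of 16 are illustrative; check your storage vendor's recommendation):

    # Show the current queue depth for a disk:
    lsattr -El hdisk4 -a queue_depth
    # Raise it; -P defers the change to the next reboot if the disk is in use:
    chdev -l hdisk4 -a queue_depth=16 -P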

From a TSM point of view, this means setting up 3 to 4 DBVols per physical disk. If you wish to use stripes (RAID0), you should still have 3 to 4 DBVols per physical disk: for a 5-disk RAID0, that means between 15 and 20 DBVols on the stripe, with an absolute minimum of 5 DBVols. With only 5 DBVols, if you're lucky, you have one I/O going to each physical disk.
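
On AIX this might look something like the following (the volume group, LV names, sizes and hdisk are all assumptions), after which each raw LV is defined to TSM as shown earlier:

    # Carve four raw logical volumes for DBVols out of a single disk:
    for n in 01 02 03 04; do
        mklv -y tsmdb$n -t raw tsmvg 64 hdisk4
    done
    # Then, from an administrative client: define dbvolume /dev/rtsmdb01 (etc.)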

Anyone wishing to use hardware RAID5 should make sure that the hardware supports write-back caching, otherwise random write performance will suffer. Additionally, the rule regarding the number of DBVols still stands: random I/O performance on Shark, HDS, etc. storage is still limited by the number of I/Os in flight. Don't let large caches fool you: they are great for writes, but reads will rarely hit the cache on a busy disk farm (a read must first miss the TSM buffer pool and the OS cache (unless raw), and you are usually competing with other nodes for that shared cache). Performance will not increase once the number of DBVols exceeds the queue depth of the (logical) disk in question. In some cases, this means having a maximum of 8 DBVols per LUN.

Log Volumes (LOGVols)

I/O to the LOGVols is almost always sequential. I/O size appears to vary from 4 KiB upwards, in multiples of 4 KiB. During normal TSM operation, all I/O will be writes. During the part of a TSM database backup that backs up the transaction log, I/O to the log volumes will quite obviously be sequential reads.

For the same reasons as with DBVols, LOGVols should be made redundant. For simplicity's sake, whatever was used for the DBVols should also be used for the LOGVols.
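
If TSM mirroring is being used, the log side looks just like the database side (again, the volume names are placeholders):

    define logvolume /dev/rtsmlog01
    define logcopy /dev/rtsmlog01 /dev/rtsmlog01m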

As with other volume types, I/O to LOGVols is made synchronous by the use of a mutex. However, given that the I/O is predominantly sequential, I do not see any benefit in increasing the number of LOGVols.

Disk Storage Pool Volumes (STGVols)

In a TSM environment using disk storage pool volumes as the initial backup destination, with fast networks or fast tape drives, the layout of disk storage pool volumes is critical. These volumes will see millions of I/Os, for terabytes of data.

I/O to STGVols is of a "multi-sequential" nature. During node backups, multiple client sessions may each be writing sequentially to different locations within the same STGVol. If migrations are in progress, each migration process will also be reading sequentially from a STGVol. Under Solaris 8, I/Os have been observed (via truss) to always be 256 KB; I do not believe this would vary between implementations.

It is a matter of policy and recoverability requirements as to whether or not redundancy is required for STGVols. Since this article focuses on performance, those arguments will not be discussed here.

If redundancy is not required, the obvious choice is RAID0 (striping). It is important to note that, for best performance, the stripe unit size or block size should be made 256 KB. This allows a single I/O, whether read or write, to be satisfied by a single disk, and with sufficient activity (even from a single client backup session over a fast network) the disks will be kept fairly busy.
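
As an illustration only, on Solaris with DiskSuite/SVM a five-disk stripe with a 256 KB interlace might be built like this (the metadevice and slice names are made up):

    # One stripe of 5 slices, 256 KB interlace:
    metainit d10 1 5 c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 -i 256k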

Where redundancy is required, the normally large size of disk storage pools may rule out pure RAID1 (mirroring), although the ever-increasing size of hard disks may yet make mirroring practical. If using a hardware RAID solution, the wins that careful tuning can bring are smaller, but they do still exist. For RAID5, to minimize the overheads in the RAID controller, the performance suggestions for software RAID5 below could apply, although it may also be possible to treat the RAID5 like RAID0 and use those suggestions instead. Although untested, hardware RAID3 may be a suitable alternative to RAID5.

Normally, software RAID5 solutions cannot easily be used where high performance is required. However, with careful thought and tuning, it may be possible to get reasonable performance from software RAID5. Since STGVols must perform for both reads and writes, some thought is required. To optimize writes, the RAID5 performance killer, the partial stripe write, must be prevented. To do this, the data portion of a full stripe must divide evenly into the 256 KB I/O size - ideally it should be exactly 256 KB. Tools will often allow only the stripe unit size or block size to be specified; in that case, the stripe unit size should be a fraction of 256 KB, and the number of disks in the RAID5 array should be one greater than a power of two (3, 5, 9, etc.). This makes the data portion of an entire stripe exactly 256 KB, and avoids the performance penalty of the read, parity update, write cycle of a partial stripe write.
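
To make the arithmetic concrete: a 5-disk RAID5 has 4 data disks per stripe, so a 64 KB stripe unit gives a 4 x 64 KB = 256 KB data stripe, and every 256 KB write is a full stripe write. With DiskSuite/SVM that would look roughly like this (again, names are made up):

    # RAID5 over 5 slices with a 64 KB interlace (4 data + 1 parity per stripe):
    metainit d20 -r c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 -i 64k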

As for the number of STGVols: as before, TSM uses a mutex to serialize I/Os to each STGVol. Since the I/Os are much larger, and the I/O is sequential, the number of volumes is not as critical as it is with DBVols. For hardware RAID solutions and RAID0 (striping), it does make sense to have more volumes. Even for software solutions, it makes sense to have more than one volume on a set of disks, to keep the disks sufficiently busy.
