The complete explanation for await, svctm and %util in iostat
iostat is a Linux I/O performance monitoring utility. It’s very commonly used to analyze device utilization.
/proc/diskstats
The statistics fields in iostat are calculated based on the I/O statistics of block devices in /proc/diskstats.
Each line in the /proc/diskstats file contains the following 14 fields (kernels 4.18 and later append additional fields):

```
 1  major number
 2  minor number
 3  device name
 4  reads completed successfully
 5  reads merged
 6  sectors read
 7  time spent reading (ms)
 8  writes completed
 9  writes merged
10  sectors written
11  time spent writing (ms)
12  I/Os currently in progress
13  time spent doing I/Os (ms)
14  weighted time spent doing I/Os (ms)
```
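As a quick illustration of the layout, here is a minimal Python sketch that maps one /proc/diskstats line onto named counters. The field names mirror the variable names used later in the iostat source snippets (`rd_ios`, `wr_ios`, `ticks`, etc.); the sample line is illustrative, not taken from a real device.

```python
# Minimal sketch: parse the 14 classic /proc/diskstats fields from one line.
# Field names follow the kernel's iostats documentation and the iostat source.

FIELDS = [
    "major", "minor", "name",
    "rd_ios", "rd_merges", "rd_sectors", "rd_ticks",
    "wr_ios", "wr_merges", "wr_sectors", "wr_ticks",
    "ios_in_progress", "ticks", "weighted_ticks",
]

def parse_diskstats_line(line):
    """Map one /proc/diskstats line to a dict of the first 14 fields."""
    parts = line.split()
    stats = dict(zip(FIELDS, parts[:14]))
    # Everything except the device name is an integer counter.
    return {k: (v if k == "name" else int(v)) for k, v in stats.items()}

# Sample line (illustrative values only):
sample = "259 0 nvme1n1 1200 10 96000 340 654180 20 5233440 58800 0 59000 59140"
print(parse_diskstats_line(sample)["wr_ios"])  # -> 654180
```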
We will need to know these statistics later when we learn how iostat calculates its fields.
iostat
Now, let's use the fio load generator to benchmark an AWS EBS gp3 volume and examine the iostat report.

```
$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=1 --name=fiojob1 --filename=/dev/nvme1n1
```
```
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
```
As we can see above, it achieves 10903 IOPS and the disk is 100% utilized.

You may naturally conclude that the disk has hit a bottleneck since it's 100% busy. Is this really true? Before we answer this question, let's first understand how iostat computes its statistics fields.
How are the iostat fields calculated?
Since the basic fields, such as r/s and w/s, are straightforward, we will focus on the following three extended fields, because they are commonly used to identify a disk bottleneck:
- await
- svctm
- %util
From the iostat source code, the total number of read and write I/Os is calculated as:

```
n_ios = blkio.rd_ios + blkio.wr_ios;
```
To calculate the total amount of time (ms) spent waiting in the queue:

```
n_ticks = blkio.rd_ticks + blkio.wr_ticks;
```
To calculate the average I/O wait time (ms):

```
wait = n_ios ? n_ticks / n_ios : 0.0;
```
To calculate the average I/O service time (ms):

```
svc_t = n_ios ? blkio.ticks / n_ios : 0.0;
```
Note: blkio.ticks comes from field 13, "time spent doing I/Os (ms)", in /proc/diskstats.
To calculate the disk utilization:

```
busy = 100.0 * blkio.ticks / deltams; /* percentage! */
```
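The formulas above can be sketched end to end in a few lines. The helper below is a hypothetical illustration, not the actual iostat source; it takes two snapshots of a device's counters (using the same field names as the /proc/diskstats counters) plus the elapsed interval, and the snapshot values are chosen to roughly match the one-job run (about 10903 w/s at 100% utilization over a 1-second interval).

```python
# Sketch of iostat's arithmetic, given two snapshots of a device's
# /proc/diskstats counters taken `deltams` milliseconds apart.
# (Hypothetical helper for illustration; not the actual iostat code.)

def iostat_fields(prev, curr, deltams):
    n_ios = (curr["rd_ios"] - prev["rd_ios"]) + (curr["wr_ios"] - prev["wr_ios"])
    n_ticks = (curr["rd_ticks"] - prev["rd_ticks"]) + (curr["wr_ticks"] - prev["wr_ticks"])
    ticks = curr["ticks"] - prev["ticks"]           # field 13: time spent doing I/Os
    await_ms = n_ticks / n_ios if n_ios else 0.0    # await: avg queue + service time
    svctm = ticks / n_ios if n_ios else 0.0         # svctm: device busy time per I/O
    util = 100.0 * ticks / deltams                  # %util
    return await_ms, svctm, util

# Illustrative counters over a 1-second interval (not real measurements):
prev = {"rd_ios": 0, "rd_ticks": 0, "wr_ios": 0, "wr_ticks": 0, "ticks": 0}
curr = {"rd_ios": 0, "rd_ticks": 0, "wr_ios": 10903, "wr_ticks": 127000, "ticks": 1000}
print(iostat_fields(prev, curr, 1000))  # util is 100.0, svctm is roughly 0.09
```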
Two traps in iostat
Given the implementation above, there are actually two traps when using these fields to identify a disk bottleneck:
- svctm - average I/O service time(ms)
- %util - disk utilization
Let's try to increase the fio numjobs from 1 to 4.

```
$ fio --ioengine=libaio --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --iodepth=128 --direct=1 --group_reporting --time_based --runtime=60 --numjobs=4 --name=fiojob1 --filename=/dev/nvme1n1
```

```
Device:  rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
```
Comparing the 1 fio job and 4 jobs runs:
- The w/s increases from 10903 to 16001 even though the disk utilization is 100% in both runs. This means that with one job, the disk was not yet fully saturated.
- The average I/O service time (svctm) drops from 0.09 to 0.06.

Taken literally, the disk responded to each request in 0.09 ms under lighter load and in 0.06 ms under heavier load. That seems unlikely, and it is not what iostat's average disk service time is supposed to tell us.
And the iostat man page carries this warning:

> svctm
> The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
For a traditional spinning hard disk, I/O has to be serialized because of the nature of disk head movement from one platter location to another: only one I/O can be serviced at a time. In that case, svctm does reflect how fast each I/O is served.
For a modern SSD, this no longer holds, because the disk can service multiple I/Os at once. Even at 100% utilization, it only means the disk was busy servicing I/O requests during that period. It does not necessarily mean the disk is saturated.
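A toy calculation (with assumed numbers, not measurements) makes the trap concrete: because svctm divides the device's wall-clock busy time by the I/O count, any internal parallelism shrinks it below the real per-I/O latency.

```python
# Toy illustration (assumed numbers): why svctm understates true latency
# on a device that services I/Os in parallel.

per_io_latency_ms = 1.0  # assume each I/O really takes 1 ms inside the device
n_ios = 100

# Serial device (e.g. a spinning disk): one I/O in flight at a time,
# so the device is busy for the full 100 ms.
ticks_serial = n_ios * per_io_latency_ms
svctm_serial = ticks_serial / n_ios       # 1.0 ms: matches the real latency

# Parallel device (e.g. an SSD) servicing 4 I/Os at once: the same 100 I/Os
# keep the device busy for only 25 ms of wall-clock time.
concurrency = 4
ticks_parallel = n_ios * per_io_latency_ms / concurrency
svctm_parallel = ticks_parallel / n_ios   # 0.25 ms: NOT the real latency

print(svctm_serial, svctm_parallel)       # 1.0 0.25
```

The same effect explains the drop from 0.09 to 0.06 in the two fio runs: more jobs means more I/Os in flight, so the busy time is spread over more completions.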
The only way to identify whether the disk is completely saturated (peaked) is to offer it more work in parallel. In the example above, increasing fio numjobs from 1 to 4 peaked w/s at 16001, which aligns with the provisioned IOPS of the EBS gp3 volume.
In short, both the svctm and %util fields in iostat can be misleading for modern SSD storage. Use them with extra care.