blktrace - A block layer IO tracing utility
What can blktrace do?
Let’s dive into the block device layer and see how the I/O are handled in disk queues.
The following stack shows the I/O paths including the block device layer. The application can issue I/O requests directly to block device or through file systems. In the following sections, let’s dig into the block device layer with blktrace in order to understand the I/O pattern and disk queue activities.
Applications
| | |
V | |
File systems | |
| | |
V | |
Page Cache <--| |
| |
V V
Block I/O Layer: Request Queues
|
V
SCSI Drivers
|
V
Physical Devices
Don’t forget iostat
iostat is always the first place to understand the I/O characteristics before we turn to other advanced utilities like blktrace.
It provides the following information for the disk IOs.
- Number of read/write merges per second
- Number of reads/writes per second
- Average I/O request size(in sectors)
- Average request queue size
- Average I/O wait time
If any of the above metrics indicates disk I/O performance concerns and it’s not sufficient to help us explain the performance issue, we can turn to blktrace or other tracing utilities for more insights.
blktrace and blkparse
To trace the target block device:
$ blktrace -d /dev/<sd-device-name> -D <trace-raw-data-save-dir> -w <trace-time-in-seconds>
To parse the blktrace data:
$ blkparse -i <sd-device-name> -D <trace-raw-data-save-dir> -o blkparse.<sd-device-name>.out -d blktrace.bin
blkparse output snippet:
8,0 7 3 0.992335623 4180 A WS 680911952 + 8 <- (8,2) 679885904
8,0 7 4 0.992336407 4180 Q WS 680911952 + 8 [jbd2/dm-7-8]
8,0 7 5 0.992338784 4180 G WS 680911952 + 8 [jbd2/dm-7-8]
8,0 7 6 0.992339977 4180 I WS 680911952 + 8 [jbd2/dm-7-8]
8,0 7 7 0.992341444 4180 D WS 680911952 + 8 [jbd2/dm-7-8]
8,0 56 1 0.992499505 0 C WS 680911952 + 8 [0]
8,0 47 7 0.991930131 4180 A WS 680911920 + 8 <- (8,2) 679885872
8,0 47 8 0.991930522 4180 Q WS 680911920 + 8 [jbd2/dm-7-8]
8,0 47 9 0.991932697 4180 M WS 680911920 + 8 [jbd2/dm-7-8]
The columns are Dev major,minor, CPU id, Sequence number, Timestamp, PID, Event, Operation, Start block + number of blocks(offset), Process name.
In the above example, The first IO starts at block 680911952 with the offset of 8 blocks. It is handled in the following sequence.
- Remapped to a different device(8,2)
- Handled by the request queue code
- Get the request
- Inserted to the request queue
- Dispatch to device driver
- Completion
The second IO starts at block 680911920 with the offset of 8 blocks. It’s handled in the following sequence.
- Remapped to a different device(8,2)
- Handled by the request queue code
- Back merged with request on queue
blkparse output also includes a summary to explain the number of I/Os in each queuing phase. In the following example, there are 86 writes handled by request queue. 19 out of 86 writes are merged with request on queue. Thus, only 67 writes are issued to device to complete the front end requests.
Total (sda):
Reads Queued: 0, 0KiB Writes Queued: 86, 628KiB
Read Dispatches: 0, 0KiB Write Dispatches: 67, 628KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 67, 628KiB
Read Merges: 0, 0KiB Write Merges: 19, 88KiB
IO unplugs: 21 Timer unplugs: 0
btt
btt is a post-processing tool for blktrace. blktrace is capable of producing tremendous amounts of output in the form of multiple individual traces per IO executed during the traced run. It is also capable of producing some general statistics concerning IO rates and the like. btt goes further and produces a variety of overall statistics about each of the individual handling of IOs, and provides data we believe is useful to plot to provide visual comparisons for evaluation.
btt processes the binary file produced by blkparse.The major areas of output measured by btt include:
- Q2Q : Queue-to-Queue time
- Q2G : Queue-to-GetRequest time
- S2G : Sleep-to-GetRequest time
- G2I : GetRequest-to-Insert time
- Q2M : Queue-to-Merge time
- I2D : Insert-to-Issue time
- M2D : Merge-to-Issue time
- D2C : Issue-to-Complete time
- Q2C : Queue-to-Complete time
For the D2C, it includes the driver and device time. It’s the average time from when the actual IO was issued to the driver until is completed (completion trace) back to the block IO layer. The D2C time should be greater than the actual physcial disk I/O latency which is usually measured in disk(array) side.
In the following exampl, 98.9265% of time is spent on D2C which is expected. The average IO time serviced by the disk is 1.65ms. The max IO time is 5.10ms. We may measure the I2D metric for different I/O scheduler, like noop in SSD case.
$ btt -i blktrace.bin -B offset -o btt.out
$ ls btt.*.out*
btt.sda.30s.out.avg btt.sda.30s.out.dat btt.sda.30s.out_dhist.dat btt.sda.30s.out.msg btt.sda.30s.out_qhist.dat
$ cat btt.out.avg
==================== All Devices ====================
ALL MIN AVG MAX N
--------------- ------------- ------------- ------------- -----------
Q2Q 0.000001053 0.304967428 5.071043004 85
Q2G 0.000000338 0.000001928 0.000008143 67
G2I 0.000000161 0.000008231 0.000126276 67
Q2M 0.000000178 0.000000664 0.000002175 19
I2D 0.000000223 0.000003979 0.000024517 67
M2D 0.000002771 0.000030058 0.000116136 19
D2C 0.000089761 0.001640496 0.005096214 86
Q2C 0.000093453 0.001658297 0.005098018 86
==================== Device Overhead ====================
DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
( 8, 0) | 0.0906% 0.3867% 0.0088% 0.1869% 98.9265%
---------- | --------- --------- --------- --------- ---------
Overall | 0.0906% 0.3867% 0.0088% 0.1869% 98.9265%
[..]