What can blktrace do?

Let’s dive into the block device layer and see how I/O requests are handled in the disk queues.

The following stack shows the I/O paths, including the block device layer. An application can issue I/O requests directly to the block device or through a file system. In the following sections, we will dig into the block device layer with blktrace to understand the I/O patterns and disk queue activity.

        Applications
        |      |   |  
        V      |   | 
  File systems |   |
        |      |   |
        V      |   |
 Page Cache <--|   | 
        |          | 
        V          V
Block I/O Layer: Request Queues
        |
        V 
  SCSI Drivers
        |
        V
 Physical Devices               

Don’t forget iostat

iostat is always the first place to look at the I/O characteristics before turning to more advanced utilities like blktrace.

It provides the following information for disk I/O.

  • Number of read/write merges per second
  • Number of reads/writes per second
  • Average I/O request size (in sectors)
  • Average request queue size
  • Average I/O wait time

If any of the above metrics indicates a disk I/O performance concern but is not sufficient to explain the issue, we can turn to blktrace or other tracing utilities for more insight.
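For example, the extended per-device statistics covering the metrics above can be collected with a command like the following (the device name and the 5-second interval are placeholders): the merge counts show up as rrqm/s and wrqm/s, the request size as avgrq-sz, the queue size as avgqu-sz, and the wait time as await.

$ iostat -dx sdd 5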

blktrace and blkparse

To trace the target block device:

$ blktrace -d /dev/<sd-device-name> -D <trace-raw-data-save-dir> -w <trace-time-in-seconds>

To parse the blktrace data:

$ blkparse -i <sd-device-name> -D <trace-raw-data-save-dir> -o blkparse.<sd-device-name>.out -d blktrace.bin
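For quick interactive checks, blktrace can also stream events directly to blkparse instead of saving raw trace files first:

$ blktrace -d /dev/<sd-device-name> -o - | blkparse -i -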

blkparse output snippet:

  8,0    7        3     0.992335623  4180  A  WS 680911952 + 8 <- (8,2) 679885904
  8,0    7        4     0.992336407  4180  Q  WS 680911952 + 8 [jbd2/dm-7-8]
  8,0    7        5     0.992338784  4180  G  WS 680911952 + 8 [jbd2/dm-7-8]
  8,0    7        6     0.992339977  4180  I  WS 680911952 + 8 [jbd2/dm-7-8]
  8,0    7        7     0.992341444  4180  D  WS 680911952 + 8 [jbd2/dm-7-8]
  8,0   56        1     0.992499505     0  C  WS 680911952 + 8 [0]
  8,0   47        7     0.991930131  4180  A  WS 680911920 + 8 <- (8,2) 679885872
  8,0   47        8     0.991930522  4180  Q  WS 680911920 + 8 [jbd2/dm-7-8]
  8,0   47        9     0.991932697  4180  M  WS 680911920 + 8 [jbd2/dm-7-8]

The columns are: device major,minor; CPU id; sequence number; timestamp; PID; event; operation (RWBS field, e.g. WS for a synchronous write); start block + number of blocks; and process name.

In the above example, the first I/O starts at block 680911952 with a length of 8 blocks. It is handled in the following sequence.

  • Remapped to a different device (8,2) (event A)
  • Queued by the request queue code (event Q)
  • Got a free request descriptor (event G)
  • Inserted into the request queue (event I)
  • Dispatched to the device driver (event D)
  • Completed (event C)

The second I/O starts at block 680911920 with a length of 8 blocks. It is handled in the following sequence.

  • Remapped to a different device (8,2) (event A)
  • Queued by the request queue code (event Q)
  • Back merged with a request already on the queue (event M)

The blkparse output also includes a summary of the number of I/Os in each queuing phase. In the following example, 86 writes are queued at the block layer. 19 of the 86 writes are merged with requests already on the queue, so only 67 writes are dispatched to the device to complete the front-end requests.

Total (sda):
 Reads Queued:           0,        0KiB  Writes Queued:          86,      628KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:       67,      628KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:       67,      628KiB
 Read Merges:            0,        0KiB  Write Merges:           19,       88KiB
 IO unplugs:            21               Timer unplugs:           0

btt

btt is a post-processing tool for blktrace. blktrace is capable of producing tremendous amounts of output in the form of multiple individual traces per I/O executed during the traced run. It is also capable of producing some general statistics concerning I/O rates and the like. btt goes further and produces a variety of overall statistics about each phase of I/O handling, and provides data that is useful to plot for visual comparison and evaluation.

btt processes the binary file produced by blkparse. The major areas of output measured by btt include:

  • Q2Q : Queue-to-Queue time
  • Q2G : Queue-to-GetRequest time
  • S2G : Sleep-to-GetRequest time
  • G2I : GetRequest-to-Insert time
  • Q2M : Queue-to-Merge time
  • I2D : Insert-to-Issue time
  • M2D : Merge-to-Issue time
  • D2C : Issue-to-Complete time
  • Q2C : Queue-to-Complete time

D2C includes the driver and device time. It is the time from when the I/O was issued to the driver until its completion is reported back to the block I/O layer. The D2C time should be greater than the actual physical disk I/O latency, which is usually measured on the disk (array) side.

In the following example, 98.9265% of the time is spent in D2C, which is expected. The average I/O service time at the disk is about 1.64ms, and the maximum is about 5.10ms. We may also compare the I2D metric across different I/O schedulers, for example noop in the SSD case.

$ btt -i blktrace.bin -B offset -o btt.out

$ ls btt.*.out*
btt.sda.30s.out.avg  btt.sda.30s.out.dat  btt.sda.30s.out_dhist.dat  btt.sda.30s.out.msg  btt.sda.30s.out_qhist.dat

$ cat btt.out.avg
==================== All Devices ====================

ALL               MIN           AVG           MAX                  N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000001053   0.304967428   5.071043004          85
Q2G               0.000000338   0.000001928   0.000008143          67
G2I               0.000000161   0.000008231   0.000126276          67
Q2M               0.000000178   0.000000664   0.000002175          19
I2D               0.000000223   0.000003979   0.000024517          67
M2D               0.000002771   0.000030058   0.000116136          19
D2C               0.000089761   0.001640496   0.005096214          86
Q2C               0.000093453   0.001658297   0.005098018          86

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   0.0906%   0.3867%   0.0088%   0.1869%  98.9265%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0906%   0.3867%   0.0088%   0.1869%  98.9265%
[..]   

What’s the meaning of await in iostat?

The following is the description of the await field in the iostat man page.

$ man iostat
await
The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.

It is a measure of disk I/O latency in milliseconds, from the time a request enters the I/O scheduler queue to the time it completes.

I/O path

The I/O path mainly includes the following steps from the block layer down to the underlying storage device.

  • Get the I/O requests from the application (filesystem)
  • Merge the I/O requests into the existing device queue
  • Dispatch the I/O requests (by the I/O scheduler) to the device driver
  • Hypervisor scheduling, in virtualized environments
  • Multipathing, if any
  • Hardware handling
  • HBA driver
  • Transport (bus)
  • FC switch routing, if any
  • Storage controller queuing, caching and processing
  • Actual disk latency

How is the await time calculated?

await is the average time per I/O, measured in milliseconds. It mainly includes the time spent in the I/O scheduler queue and the time spent by the storage servicing the request, assuming the HBA/SAN latency is relatively marginal.

There are two queues involved in the I/O processing path.

  • The queue in I/O scheduler
  • The queue on the storage side (e.g. the controller)

nr_requests limits the maximum number of I/Os in the sorted request queue. The front-end thread will be blocked if the I/O cannot be merged or inserted into the scheduler queue because the queue is full. Note that nr_requests applies to reads and writes separately.

After the I/O is passed to the driver, it is no longer in the scheduler queue and does not count toward the nr_requests limit. However, it still counts toward avgqu-sz, so avgqu-sz can reach the sum of nr_requests and the LUN queue_depth.
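Both limits can be inspected through sysfs; the device name below is an example, and the paths can differ for device-mapper or NVMe devices:

$ cat /sys/block/sda/queue/nr_requests
$ cat /sys/block/sda/device/queue_depth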

How is the svctm time measured?

await measures the I/O latency on a per-I/O basis, while svctm takes parallel I/O into account. For example, if 100 I/Os are submitted to the I/O scheduler in parallel and queued onto the storage (say queue_depth=50), and the 100 I/Os complete in 10ms, the await time would be 10ms while the svctm time could be 2ms.

Follow up

Since await includes both the time spent in the I/O scheduler and the storage service time, we may want to see a breakdown of the two phases by using blktrace. It would tell us the overhead in the disk queue (I2D) and the actual I/O service latency (D2C). For further study of blktrace, you can read this article.

Buffered and Direct I/O

VxFS responds to sequential read I/O with read-ahead, which results in buffered I/O. The data is prefetched and retained in buffers in anticipation of the application asking for it. The data buffers are commonly referred to as the VxFS buffer cache. This is the default VxFS behavior.

Direct I/O, on the other hand, does not buffer the data when the I/O to the underlying device is completed. This saves system resources like memory and CPU usage. Direct I/O is possible only when alignment and sizing criteria are satisfied.

All the supported platforms have a VxFS buffer cache. Each platform also has either a page cache (AIX/Solaris/Linux) or its own buffer cache (HP-UX). These caches are commonly known as the file system caches.

Direct I/O does not use these caches. The memory used for direct I/O is discarded after the I/O is complete, and is therefore not buffered.

Direct I/O

Direct I/O is an unbuffered form of I/O. If the VX_DIRECT advisory is set, the user is requesting direct data transfer between the disk and the user-supplied buffer for reads and writes. This bypasses the kernel buffering of data, and reduces the CPU overhead associated with I/O by eliminating the data copy between the kernel buffer and the user’s buffer. This also avoids taking up space in the buffer cache that might be better used for something else. The direct I/O feature can provide significant performance gains for some applications.

The direct I/O and VX_DIRECT advisories are maintained on a per-file-descriptor basis.

Direct I/O requirements

For an I/O operation to be performed as direct I/O, it must meet certain alignment criteria. The alignment constraints are usually determined by the disk driver, the disk controller, and the system memory management hardware and software.

The requirements for direct I/O are as follows:

The starting file offset must be aligned to a 512-byte boundary.

The ending file offset must be aligned to a 512-byte boundary, or the length must be a multiple of 512 bytes.

The memory buffer must start on an 8-byte boundary.
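As a rough illustration of these rules, a direct write can be exercised from the command line with dd; the target path is a placeholder, and the 8 KB record size is chosen as a multiple of 512 bytes so the request qualifies for direct I/O:

$ dd if=/dev/zero of=/testmnt1/dio_test bs=8k count=1024 oflag=direct

If the transfer size or offset does not meet the alignment criteria, the request cannot be performed as direct I/O and, depending on the filesystem, may fail or fall back to buffered or data synchronous I/O.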

Direct I/O vs. synchronous I/O

Because direct I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If a direct I/O request does not allocate storage or extend the file, the inode is not immediately written.

Direct I/O CPU overhead

The CPU cost of direct I/O is about the same as a raw disk transfer. For sequential I/O to very large files, using direct I/O with large transfer sizes can provide the same speed as buffered I/O with much less CPU overhead.

If the file is being extended or storage is being allocated, direct I/O must write the inode change before returning to the application. This eliminates some of the performance advantages of direct I/O.

Discovered Direct I/O

Discovered Direct I/O is a file system tunable you can set using the vxtunefs command. When the file system gets an I/O request larger than the discovered_direct_iosz, it tries to use direct I/O on the request. For large I/O sizes, Discovered Direct I/O can perform much better than buffered I/O.

Discovered Direct I/O behavior is similar to direct I/O and has the same alignment constraints, except writes that allocate storage or extend the file size do not require writing the inode changes before returning to the application.
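The tunable can be viewed and adjusted per mounted filesystem with vxtunefs; the mount point and value below are examples only:

$ vxtunefs /testmnt1 | grep discovered_direct_iosz
$ vxtunefs -o discovered_direct_iosz=1048576 /testmnt1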

Unbuffered I/O

If the VX_UNBUFFERED advisory is set, I/O behavior is the same as direct I/O with the VX_DIRECT advisory set, so the alignment constraints that apply to direct I/O also apply to unbuffered I/O. For unbuffered I/O, however, if the file is being extended, or storage is being allocated to the file, inode changes are not updated synchronously before the write returns to the user. The VX_UNBUFFERED advisory is maintained on a per-file-descriptor basis.

For information on how to set the discovered_direct_iosz, see Tuning I/O.

Data synchronous I/O

If the VX_DSYNC advisory is set, the user is requesting data synchronous I/O. In synchronous I/O, the data is written, and the inode is written with updated times and (if necessary) an increased file size. In data synchronous I/O, the data is transferred to disk synchronously before the write returns to the user. If the file is not extended by the write, the times are updated in memory, and the call returns to the user. If the file is extended by the operation, the inode is written before the write returns.

The direct I/O and VX_DSYNC advisories are maintained on a per-file-descriptor basis.

Data synchronous I/O vs. synchronous I/O

Like direct I/O, the data synchronous I/O feature can provide significant application performance gains. Because data synchronous I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If the data synchronous I/O does not allocate storage or extend the file, the inode is not immediately written. The data synchronous I/O does not have any alignment constraints, so applications that find it difficult to meet the alignment constraints of direct I/O should use data synchronous I/O.

If the file is being extended or storage is allocated, data synchronous I/O must write the inode change before returning to the application. This case eliminates the performance advantage of data synchronous I/O.


Introduction

Receive ring buffers are shared between the device driver and NIC. The card assigns a transmit (TX) and receive (RX) ring buffer. As the name implies, the ring buffer is a circular buffer where an overflow simply overwrites existing data. It should be noted that there are two ways to move data from the NIC to the kernel, hardware interrupts and software interrupts, also called SoftIRQs.

The RX ring buffer is used to store incoming packets until they can be processed by the device driver. The device driver drains the RX ring, typically via SoftIRQs, which puts the incoming packets into a kernel data structure called an sk_buff or “skb” to begin its journey through the kernel and up to the application which owns the relevant socket. The TX ring buffer is used to hold outgoing packets which are destined for the wire.

These ring buffers reside at the bottom of the stack and are a crucial point at which packet drop can occur, which in turn will adversely affect network performance.

You can increase the size of the Ethernet device RX ring buffer if the packet drop rate causes applications to report:

  • a loss of data
  • cluster fence
  • slow performance
  • timeouts
  • failed backups

Interrupts and Interrupt Handlers

Interrupts from the hardware are known as “top-half” interrupts. When a NIC receives incoming data, it copies the data into kernel buffers using DMA. The NIC notifies the kernel of this data by raising a hard interrupt. These interrupts are processed by interrupt handlers which do minimal work, as they have already interrupted another task and cannot be interrupted themselves. Hard interrupts can be expensive in terms of CPU usage, especially when holding kernel locks. The hard interrupt handler then leaves the majority of packet reception to a software interrupt, or SoftIRQ, process which can be scheduled more fairly.

Hard interrupts can be seen in /proc/interrupts where each queue has an interrupt vector in the 1st column assigned to it. These are initialized when the system boots or when the NIC device driver module is loaded. Each RX and TX queue is assigned a unique vector, which informs the interrupt handler as to which NIC/queue the interrupt is coming from. The columns represent the number of incoming interrupts as a counter value:

$ egrep "CPU0|eth2" /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
 105: 141606       0       0       0       0       0   IR-PCI-MSI-edge   eth2-rx-0
 106:      0  141091       0       0       0       0   IR-PCI-MSI-edge   eth2-rx-1
 107:      2       0  163785       0       0       0   IR-PCI-MSI-edge   eth2-rx-2
 108:      3       0       0  194370       0       0   IR-PCI-MSI-edge   eth2-rx-3
 109:      0       0       0       0       0       0   IR-PCI-MSI-edge   eth2-tx

SoftIRQs

Also known as “bottom-half” interrupts, software interrupt requests (SoftIRQs) are kernel routines which are scheduled to run at a time when other tasks will not be interrupted. The SoftIRQ’s purpose is to drain the network adapter receive ring buffers. These routines run in the form of ksoftirqd/cpu-number processes and call driver-specific code functions. They can be seen in process monitoring tools such as ps and top.
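For example, the per-CPU SoftIRQ threads can be listed with ps:

$ ps -e | grep ksoftirqd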

The following call stack, read from the bottom up, is an example of a SoftIRQ polling a Mellanox card. The functions marked [mlx4_en] are the Mellanox polling routines in the mlx4_en.ko driver kernel module, called by the kernel’s generic polling routines such as net_rx_action. After moving from the driver to the kernel, the traffic being received will then move up to the socket, ready for the application to consume:

 mlx4_en_complete_rx_desc [mlx4_en]
 mlx4_en_process_rx_cq [mlx4_en]
 mlx4_en_poll_rx_cq [mlx4_en]
 net_rx_action
 __do_softirq
 run_ksoftirqd
 smpboot_thread_fn
 kthread
 kernel_thread_starter
 kernel_thread_starter
 1 lock held by ksoftirqd

SoftIRQs can be monitored as follows. Each column represents a CPU:

$ watch -n1 grep RX /proc/softirqs
$ watch -n1 grep TX /proc/softirqs

Displaying the number of dropped packets

The ethtool utility enables administrators to query, configure, or control network driver settings.

Exhaustion of the RX ring buffer causes counters such as "discard" or "drop" to increment in the output of ethtool -S interface_name. The discarded packets indicate that the available buffer is filling up faster than the kernel can process the packets.

To display drop counters for the enp1s0 interface, enter:

$ ethtool -S enp1s0

Increasing the RX ring buffer to reduce a high packet drop rate

The ethtool utility helps to increase the RX buffer to reduce a high packet drop rate.

  1. To view the maximum RX ring buffer size:

    $ ethtool -g nic0
    Ring parameters for nic0:
    Pre-set maximums:
    RX: 4078
    RX Mini: 0
    RX Jumbo: 0
    TX: 4078
    Current hardware settings:
    RX: 2048
    RX Mini: 0
    RX Jumbo: 0
    TX: 2048

  2. If the values in the Pre-set maximums section are higher than in the Current hardware settings section, increase the RX ring buffer:

  • To temporarily change the RX ring buffer of the nic0 device to 4078, enter:

    $ ethtool -G nic0 rx 4078

  • To permanently change the RX ring buffer, create a NetworkManager dispatcher script, for example the sketch below.
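A minimal dispatcher script might look like the following; the interface name, ring size, and script name are assumptions, and NetworkManager passes the interface and the action as the first two arguments:

$ cat /etc/NetworkManager/dispatcher.d/30-ringsize
#!/bin/bash
# Re-apply the RX ring size whenever nic0 comes up (interface and value are examples)
if [ "$1" = "nic0" ] && [ "$2" = "up" ]; then
    ethtool -G nic0 rx 4078
fi

$ chmod +x /etc/NetworkManager/dispatcher.d/30-ringsize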

Understanding the maximum RX/TX ring buffer

From the ethtool source code, we can find that the following driver callback is used by the command "ethtool -g <nic>".

* @get_ringparam: Report ring sizes

The implementation may differ between NIC vendors' drivers.

In this example, the NIC driver is bnx2x.

$ ethtool -i nic0
driver: bnx2x

So, we can check the source code as below.

static void bnx2x_get_ringparam(struct net_device *dev,
                struct ethtool_ringparam *ering)
{
    struct bnx2x *bp = netdev_priv(dev);

    ering->rx_max_pending = MAX_RX_AVAIL;

    /* If size isn't already set, we give an estimation of the number
     * of buffers we'll have. We're neglecting some possible conditions
     * [we couldn't know for certain at this point if number of queues
     * might shrink] but the number would be correct for the likely
     * scenario.
     */
    if (bp->rx_ring_size)
        ering->rx_pending = bp->rx_ring_size;
    else if (BNX2X_NUM_RX_QUEUES(bp))
        ering->rx_pending = MAX_RX_AVAIL / BNX2X_NUM_RX_QUEUES(bp);
    else
        ering->rx_pending = MAX_RX_AVAIL;

    ering->tx_max_pending = IS_MF_FCOE_AFEX(bp) ? 0 : MAX_TX_AVAIL;
    ering->tx_pending = bp->tx_ring_size;
}

MAX_RX_AVAIL defines the maximum RX ring buffer size. We can expand the formula as below.

#define MAX_RX_AVAIL		(MAX_RX_DESC_CNT * NUM_RX_RINGS - 2)

#define NUM_RX_RINGS		8

#define MAX_RX_DESC_CNT		(RX_DESC_CNT - NEXT_PAGE_RX_DESC_CNT)
#define RX_DESC_CNT		(BCM_PAGE_SIZE / sizeof(struct eth_rx_bd))
#define NEXT_PAGE_RX_DESC_CNT	2

#define BCM_PAGE_SIZE		(1 << BCM_PAGE_SHIFT)
#define BCM_PAGE_SHIFT		12

/*
 * The eth Rx Buffer Descriptor
 */
struct eth_rx_bd {
    __le32 addr_lo;
    __le32 addr_hi;
};

So, based on the formula above, we can calculate the maximum RX ring buffer as below.

rx_max = MAX_RX_AVAIL 
       = MAX_RX_DESC_CNT * NUM_RX_RINGS - 2
       = (RX_DESC_CNT - NEXT_PAGE_RX_DESC_CNT) * NUM_RX_RINGS - 2
       = ((BCM_PAGE_SIZE / sizeof(struct eth_rx_bd)) - 2) * 8 - 2
       = (((1 << BCM_PAGE_SHIFT) / sizeof(struct eth_rx_bd)) - 2) * 8 - 2
       = ((4096 / 8 ) - 2) * 8 - 2
       = 4078


What is a semaphore?

A semaphore is a very relaxed type of lockable object. A given semaphore has a predefined maximum count, and a current count. You take ownership of a semaphore with a wait operation, also referred to as decrementing the semaphore, or even just abstractly called P. You release ownership with a signal operation, also referred to as incrementing the semaphore, a post operation, or abstractly called V. The single-letter operation names are from Dijkstra’s original paper on semaphores.

Every time you wait on a semaphore, you decrease the current count. If the count was greater than zero then the decrement just happens, and the wait call returns. If the count was already zero then it cannot be decremented, so the wait call will block until another thread increases the count by signaling the semaphore.

Semaphore tuning

To display the semaphore limits:

$ cat /proc/sys/kernel/sem
300	307200	32	1024

$ sysctl -a | grep sem
kernel.sem = 300	307200	32	1024

$ ipcs -l | grep -i "sem"
------ Semaphore Limits --------
max semaphores per array = 300
max semaphores system wide = 307200
max ops per semop call = 32
semaphore max value = 32767

The values of the semaphore parameters are displayed in the following order.

  • SEMMSL - The maximum number of semaphores in a semaphore set.
  • SEMMNS - A system-wide limit on the number of semaphores in all semaphore sets (the maximum number of semaphores in the system).
  • SEMOPM - The maximum number of operations in a single semop call.
  • SEMMNI - A system-wide limit on the maximum number of semaphore identifiers (semaphore sets).

To display the current semaphore status:

$ ipcs -u | egrep -i "used arrays|sem"
------ Semaphore Status --------
used arrays = 3
allocated semaphores = 3

To display the active semaphore sets info:

$ ipcs -s
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 0          root       600        1
0x00000000 32769      root       600        1
0x00005653 229380     root       666        1
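Details of a specific semaphore set can be displayed by its id; the semid below is taken from the listing above:

$ ipcs -s -i 229380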

To adjust the semaphore values on the fly:

$ echo 300   307200   32   1024 > /proc/sys/kernel/sem

To modify system semaphore values permanently:

$ echo "kernel.sem = 300  307200  32  1024" >> /etc/sysctl.conf
$ sysctl -p 

System wide open files limit

To check the system-wide open files limit:

$ cat /proc/sys/fs/file-max
4875932
$ sysctl -a | grep file-max
fs.file-max = 4875932

To change the system-wide open files limit:

$ echo "fs.file-max = 4875932" >> /etc/sysctl.conf
$ sysctl -p /etc/sysctl.conf

User level open files limit

To check hard/soft limits:

$ ulimit -Hn
40960
$ ulimit -Sn
40960

To change hard/soft limits:

$ vi /etc/security/limits.conf
*  hard nofile 40960
*  soft nofile 40960

Process level open files limit

To check the max open files per process:

$ cat /proc/sys/fs/nr_open
1048576

To check a specific process's max open files limit:

$ cat /proc/`pidof <process-name>`/limits | egrep "Limit |Max open files"
Limit                     Soft Limit           Hard Limit           Units
Max open files            524352               524352               files

Sometimes an application process may need to change its max open files limit on the fly.
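On a host where we have root access, this can be done from outside the process with prlimit (part of util-linux); the PID and limit values below are examples:

$ prlimit --pid 12345 --nofile=65536:65536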

In a Docker container, however, the process does not have permission to do so by default.

Docker provides the following two ways to extend Linux capabilities for the container processes.

  • --cap-add Add Linux capabilities
  • --cap-drop Drop Linux capabilities
  • --privileged Give extended privileges to this container

When using the "--privileged" option is not allowed for security reasons, we can have fine-grained control over the capabilities using --cap-add and --cap-drop.

For example, if we want to grant the container process the permission to change the max open files limit on the fly, we can use the following capability.

  • SYS_RESOURCE - Override resource limits.

We can pass this capability to the target Docker container.

$ docker run --cap-add=SYS_RESOURCE ...


Introduction

IOzone is a filesystem benchmark tool. The benchmark generates and measures a variety of file operations.

Iozone is useful for performing a broad filesystem analysis of a vendor’s computer platform. The benchmark tests file I/O performance for the following operations:

  • Read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, aio_write

Example for throughput benchmark

In this example, the target is to measure the filesystem throughput (KB/s) for different workloads with a 4 KB I/O size. The workload operations include sequential read/write, random read/write, and mixed random read/write.

We run iozone against 6 mounted filesystems, and a 1 GB file is read and written in each filesystem.

$ /opt/iozone/bin/iozone -h
    -r #  record size in Kb
    -s #  file size in Kb
    -t #  Number of threads or processes to use in throughput test
    -I  Use VxFS VX_DIRECT, O_DIRECT, or O_DIRECTIO for all file operations
    -F filenames  for each process/thread in throughput test
    -i #  Test to run (0=write/rewrite, 1=read/re-read, 2=random-read/write
        3=Read-backwards, 4=Re-write-record, 5=stride-read, 6=fwrite/re-fwrite
        7=fread/Re-fread, 8=random_mix, 9=pwrite/Re-pwrite, 10=pread/Re-pread
        11=pwritev/Re-pwritev, 12=preadv/Re-preadv)
    [...]    

$ /opt/iozone/bin/iozone -i 0 -i 1 -i 2 -i 8 -r 4k -s 1g -t 6 -I -F /testmnt1/testfile1 /testmnt2/testfile1 /testmnt3/testfile1 /testmnt4/testfile1 /testmnt5/testfile1 /testmnt6/testfile1 > iozone.out

$ cat iozone.out
Iozone: Performance Test of File I/O
        Version $Revision: 3.489 $
    Compiled for 64 bit mode.
    Build: linux-AMD64
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
             Vangel Bojaxhi, Ben England, Vikentsi Lapa,
             Alexey Skidanov, Sudhir Kumar.
Run began: Mon Mar  1 12:25:11 2021
Record Size 4 kB
File size set to 1048576 kB
O_DIRECT feature enabled
Command line used: /opt/iozone/bin/iozone -i 0 -i 1 -i 2 -i 8 -r 4k -s 1g -t 6 -I -F /testmnt1/testfile1 /testmnt2/testfile1 /testmnt3/testfile1 /testmnt4testfile1 /testmnt5/testfile1 /testmnt6/testfile1
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 6 processes
Each process writes a 1048576 kByte file in 4 kByte records
Children see throughput for  6 initial writers 	=  151721.80 kB/sec
Parent sees throughput for  6 initial writers 	=  146366.56 kB/sec
Min throughput per process 			=   24712.40 kB/sec
Max throughput per process 			=   25707.35 kB/sec
Avg throughput per process 			=   25286.97 kB/sec
Min xfer 					= 1007996.00 kB
Children see throughput for  6 rewriters 	=  152089.88 kB/sec
Parent sees throughput for  6 rewriters 	=  152084.55 kB/sec
Min throughput per process 			=   25109.69 kB/sec
Max throughput per process 			=   25674.81 kB/sec
Avg throughput per process 			=   25348.31 kB/sec
Min xfer 					= 1025500.00 kB
Children see throughput for  6 readers 		=    7618.06 kB/sec
Parent sees throughput for  6 readers 		=    7618.04 kB/sec
Min throughput per process 			=    1268.31 kB/sec
Max throughput per process 			=    1270.73 kB/sec
Avg throughput per process 			=    1269.68 kB/sec
Min xfer 					= 1046580.00 kB
Children see throughput for 6 re-readers 	=    7629.77 kB/sec
Parent sees throughput for 6 re-readers 	=    7629.74 kB/sec
Min throughput per process 			=    1270.79 kB/sec
Max throughput per process 			=    1273.63 kB/sec
Avg throughput per process 			=    1271.63 kB/sec
Min xfer 					= 1046240.00 kB
Children see throughput for 6 random readers 	=    7605.91 kB/sec
Parent sees throughput for 6 random readers 	=    7605.89 kB/sec
Min throughput per process 			=    1266.91 kB/sec
Max throughput per process 			=    1268.54 kB/sec
Avg throughput per process 			=    1267.65 kB/sec
Min xfer 					= 1047228.00 kB
Children see throughput for 6 mixed workload 	=   79687.92 kB/sec
Parent sees throughput for 6 mixed workload 	=   78974.22 kB/sec
Min throughput per process 			=    1275.41 kB/sec
Max throughput per process 			=   25449.38 kB/sec
Avg throughput per process 			=   13281.32 kB/sec
Min xfer 					=   52552.00 kB
Children see throughput for 6 random writers 	=  146210.38 kB/sec
Parent sees throughput for 6 random writers 	=  143822.17 kB/sec
Min throughput per process 			=   24206.99 kB/sec
Max throughput per process 			=   24653.06 kB/sec
Avg throughput per process 			=   24368.40 kB/sec
Min xfer 					= 1029604.00 kB
iozone test complete.   

iostat

In the above test, we use a flash array with a single 500 TB LUN. There are 4 active paths from the host to the LUN. Six logical volumes and filesystems are created on the LUN.

The following is a snippet of iostat output for one of the four disks (paths) while the mixed random read/write workload is running. The read throughput is ~940 KB/s and the write throughput is ~19 MB/s.

To measure the maximum throughput, we need to keep increasing the number of read/write threads until the throughput (KB/s) is capped. The I/O size also has a big impact on throughput, so we may test with different I/O sizes.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     0.00  189.80 4255.20   759.20 17020.80     8.00     1.13    0.25    3.07    0.13   0.18  81.80
sdd               0.00     0.00  237.20 4708.20   948.80 18832.80     8.00     1.35    0.27    3.08    0.13   0.20  96.94
sdd               0.00     0.00  235.40 4898.20   941.60 19592.80     8.00     1.37    0.27    3.06    0.13   0.19  99.54
sdd               0.00     0.00  229.60 4575.60   918.40 18302.40     8.00     1.32    0.27    3.09    0.13   0.20  94.28
sdd               0.00     0.00  229.94 4822.55   919.76 19290.22     8.00     1.32    0.26    3.06    0.13   0.19  96.89
sdd               0.00     0.00  228.80 4512.40   915.20 18049.60     8.00     1.30    0.27    3.07    0.13   0.20  94.40
sdd               0.00     0.00  234.20 4810.20   936.80 19240.80     8.00     1.33    0.26    3.08    0.13   0.19  97.10
sdd               0.00     0.00  246.60 4497.40   986.40 17989.60     8.00     1.35    0.28    3.06    0.13   0.20  97.04
sdd               0.00     0.00  106.60 1665.60   426.40  6661.70     8.00     0.55    0.31    3.05    0.14   0.22  38.26
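The extended statistics above can be collected with a command along these lines; the four path device names and the 5-second interval are assumptions:

$ iostat -dx sdc sdd sde sdf 5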


Kubernetes and Gluster Intro

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

Gluster is a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace.

Gluster performance study

In this article, we will discuss Gluster performance in a Docker container environment built on Kubernetes.

Configuration

We use three Red Hat Linux servers to form a Kubernetes cluster in this study. A Kubernetes cluster that handles production traffic should have a minimum of three nodes.

We have three Docker containers (application instances) provisioned within the Kubernetes cluster. Each container instance has its own Gluster storage pool.

[node1:root]~> kubectl get services
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   6d14h
 
[node1:root]~> kubectl get nodes
NAME    STATUS   ROLES    AGE     VERSION
node1   Ready    master   6d14h   v1.19.3
node2   Ready    master   6d14h   v1.19.3
node3   Ready    master   6d14h   v1.19.3
 
[node1:root]~> kubectl get pods --namespace ns-1
NAME         READY   STATUS    RESTARTS   AGE
container1   1/1     Running   0          6d14h
container2   1/1     Running   0          6d14h
container3   1/1     Running   0          6d14h
[...]
 
[node1:root]~> gluster pool list
UUID                    Hostname    State
45d8ec04-4e7a-4442-bbb4-557256b864d6    10.10.1.3 Connected
875be270-ae69-45ea-b38e-2768b7c6ce05    10.10.1.4 Connected
f2696790-b305-4099-8dba-b31d23b0beac    localhost Connected

Workload and performance

We keep increasing the number of workload processes across the three container instances and measure the throughput in MB/s. Each workload process ingests data from multiple clients through a bonded 10GbE network and writes the data to the mounted Gluster filesystem.

There are two kinds of workloads. One is very I/O intensive and the other is CPU bound.

Observation

  • For the CPU bound workload, the performance scales very well.
  • For the I/O bound workload, the performance does not scale when the number of workload processes increases.

Analysis

As we increase the number of workload processes, the workload can be distributed evenly across the three instances (on three nodes). Thus, the CPU capacity of all three nodes is available for application computation.

Although the storage from all three nodes is usable by the three container instances, the I/O performance may not be optimal with the default disk allocation for the Gluster filesystems. It depends on how the disks are allocated to the Gluster bricks and how the bricks are assigned to the Gluster filesystems. Also, writing to a remote disk performs worse due to network latency.

The following is the disk-to-brick mapping for one of the three Gluster filesystems. It shows that four bricks are created on the same disk, /dev/sde, on node 10.10.1.4. Obviously, the I/O performance could degrade if multiple processes write to it.

Brick size (GB)	Device	Node
3200	/dev/sdd	10.10.1.2
3200	/dev/sde	10.10.1.2
3200	/dev/sdd	10.10.1.2
3200	/dev/sdd	10.10.1.2
3200	/dev/sdc	10.10.1.3
3200	/dev/sdc	10.10.1.3
3200	/dev/sdb	10.10.1.3
3200	/dev/sdb	10.10.1.3
3200	/dev/sde	10.10.1.4
3200	/dev/sde	10.10.1.4
3200	/dev/sde	10.10.1.4
3200	/dev/sde	10.10.1.4
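A mapping like the one above can be gathered from the volume's detailed status on any node in the pool; the volume name below is an assumption:

[node1:root]~> gluster volume status vol1 detail | egrep 'Brick|Device|Total Disk Space'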

Conclusion

From this case study, we gained a basic understanding of how the Gluster file system works with storage across multiple nodes. The default Gluster filesystem layout does not fit all use cases; a custom storage layout may be needed to meet performance requirements.

What is Gluster?

Gluster is a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace.

Advantages

  • Scales to several petabytes
  • Handles thousands of clients
  • POSIX compatible
  • Uses commodity hardware
  • Can use any on-disk filesystem that supports extended attributes
  • Accessible using industry standard protocols like NFS and SMB
  • Provides replication, quotas, geo-replication, snapshots and bitrot detection
  • Allows optimization for different workloads
  • Open Source

Installation and configuration

  1. To install Gluster and start the glusterd service:

    [root@centos83-1 ~]# cat /etc/centos-release
    CentOS Linux release 8.3.2011

    [root@centos83-1 ~]# systemctl stop firewalld
    [root@centos83-1 ~]# systemctl disable firewalld

    [root@centos83-1 ~]# yum install -y centos-release-gluster
    [root@centos83-1 ~]# yum install -y glusterfs-server
    [root@centos83-1 ~]# rpm -qa |grep gluster
    glusterfs-cli-8.3-1.el8.x86_64
    libvirt-daemon-driver-storage-gluster-6.0.0-28.module_el8.3.0+555+a55c8938.x86_64
    glusterfs-client-xlators-8.3-1.el8.x86_64
    qemu-kvm-block-gluster-4.2.0-34.module_el8.3.0+555+a55c8938.x86_64
    libglusterd0-8.3-1.el8.x86_64
    glusterfs-8.3-1.el8.x86_64
    pcp-pmda-gluster-5.1.1-3.el8.x86_64
    glusterfs-fuse-8.3-1.el8.x86_64
    centos-release-gluster8-1.0-1.el8.noarch
    libglusterfs0-8.3-1.el8.x86_64
    glusterfs-server-8.3-1.el8.x86_64

    [root@centos83-1 ~]# systemctl enable glusterd
    [root@centos83-1 ~]# systemctl restart glusterd

    [root@centos83-1 ~]# systemctl status glusterd
    glusterd.service - GlusterFS, a clustered file-system server
    Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2021-01-05 17:28:42 PST; 1 months 22 days ago
    Docs: man:glusterd(8)
    Main PID: 1420 (glusterd)
    Tasks: 26 (limit: 409792)
    Memory: 152.3M
    CGroup: /system.slice/glusterd.service

  2. To form a trusted storage pool with the second server:

    [root@centos83-1 ~]# gluster peer probe centos83-2
    [root@centos83-1 ~]# gluster peer status
    Number of Peers: 1

    Hostname: centos83-2
    Uuid: b07d3d6e-4d6e-42a9-ad21-018223843fd5
    State: Peer in Cluster (Connected)

  3. To create a brick on the first server:

    [root@centos83-1 ~]# lsblk | egrep "NAME|sdb"
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 1T 0 disk

    [root@centos83-1 ~]# pvcreate /dev/sdb
    [root@centos83-1 ~]# vgcreate vg_bricks /dev/sdb
    [root@centos83-1 ~]# lvcreate -L 800g -n gfslv1 vg_bricks
    [root@centos83-1 ~]# mkfs.xfs /dev/vg_bricks/gfslv1
    [root@centos83-1 ~]# mkdir -p /bricks/vm1_brick1
    [root@centos83-1 ~]# vim /etc/fstab
    /dev/vg_bricks/gfslv1 /bricks/vm1_brick1 xfs defaults 0 0
    [root@centos83-1 ~]# mount -a
    [root@centos83-1 ~]# df -h |grep gfs
    /dev/mapper/vg_bricks-gfslv1 800G 5.7G 794G 1% /bricks/vm1_brick1

  4. To create a brick on the second server:

    [root@centos83-2 ~]# lsblk | egrep "NAME|sdb"
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 1T 0 disk

    [root@centos83-2 ~]# pvcreate /dev/sdb
    [root@centos83-2 ~]# vgcreate vg_bricks /dev/sdb
    [root@centos83-2 ~]# lvcreate -L 800g -n gfslv1 vg_bricks
    [root@centos83-2 ~]# mkfs.xfs /dev/vg_bricks/gfslv1
    [root@centos83-2 ~]# mkdir -p /bricks/vm2_brick1
    [root@centos83-2 ~]# vim /etc/fstab
    /dev/vg_bricks/gfslv1 /bricks/vm2_brick1 xfs defaults 0 0
    [root@centos83-2 ~]# mount -a
    [root@centos83-2 ~]# df -h |grep gfs
    /dev/mapper/vg_bricks-gfslv1 800G 5.7G 794G 1% /bricks/vm2_brick1

  5. To create a distributed volume with the two bricks created on the two nodes:

    [root@centos83-1 ~]# gluster volume create gv0 centos83-1:/bricks/vm1_brick1/gv0 centos83-2:/bricks/vm2_brick1/gv0
    [root@centos83-1 ~]# gluster volume start gv0

  6. To verify the volume status:

    [root@centos83-1 ~]# gluster volume info gv0

    Volume Name: gv0
    Type: Distribute
    Volume ID: ee08d16a-f940-4ec2-aba8-5f1fcfe41bd4
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2
    Transport-type: tcp
    Bricks:
    Brick1: centos83-1:/bricks/vm1_brick1/gv0
    Brick2: centos83-2:/bricks/vm2_brick1/gv0
    Options Reconfigured:
    storage.fips-mode-rchecksum: on
    transport.address-family: inet
    nfs.disable: on

  7. To mount the distributed volume on one of the servers (treating it as a client for simple demonstration):

    [root@centos83-1 ~]# mkdir /testmnt
    [root@centos83-1 ~]# mount -t glusterfs centos83-2:/gv0 /testmnt
    [root@centos83-1 ~]# df -h | grep testmnt
    centos83-2:/gv0 1.6T 28G 1.6T 2% /testmnt

As shown above, the usable storage size is the sum of the brick sizes from the two nodes.
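Note that gv0 is a Distribute volume, so files are spread across the bricks and the capacities add up. If redundancy is required instead, a replicated volume can be created from bricks on both nodes, at the cost of usable capacity; the volume name below is an example:

[root@centos83-1 ~]# gluster volume create gv1 replica 2 centos83-1:/bricks/vm1_brick1/gv1 centos83-2:/bricks/vm2_brick1/gv1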


Introduction

Changed Block Tracking is an incremental backup technology for virtual machines. It helps create faster and smaller backups. It has the following advantages.

  • Reduce backup time
  • Save disk space by storing only the data changed since the previous backup

Block changes are tracked in the virtualization layer, outside the virtual machines. During a backup, only the blocks changed since the last backup are transmitted. For VMware, the vSphere APIs can be used to request that the VMkernel return the blocks changed since the last snapshot backup. Microsoft provides Resilient Change Tracking (RCT) as the native CBT feature for Hyper-V.

Veritas NetBackup Accelerator reduces the backup time for VMware backups. NetBackup uses VMware Changed Block Tracking (CBT) to identify the changes that were made within a virtual machine. Only the changed data blocks are sent to the NetBackup media server, to significantly reduce the I/O and backup time. The media server combines the new data with previous backup data and produces a traditional full NetBackup image that includes the complete virtual machine files.

