Rationality Island is a city whose citizens choose to live happy and successful lives. The city earned its name from its mission statement: “Using rationality to solve your problems is the best way to enhance your life.” The citizens demonstrate the values of rationality, self-reliance, and determination. They solve their own problems by analyzing them and finding their causes. Instead of feeling disheartened when they face challenges, the citizens rationally find ways to overcome them. They persevere through hardships and help each other. The city of Rationality Island is progressing every day, and its people work hard to live happily.

Because Rationality Island has a variety of landforms, a wide range of natural resources is available. The city is surrounded by water full of exotic fish, which traders from all over the world come to buy. To the north of the main city lie the Crystal Mountains, where miners extract an abundance of gold, silver, and other precious metals. These metals are either crafted into jewelry or traded in the market. On the opposite side of the island, woodchoppers harvest a large amount of lumber, which is used for building trading ships as well as houses and furniture. Trade is also an important factor in the economy. Goods are imported at the East Harbor and exported at the North Harbor. Hundreds of ships arrive every day, and merchants trade spices and extraordinary furnishings like Persian rugs for the supreme jewelry that the craftsmen produce. The citizens of Rationality Island are productive entrepreneurs who use their resources to enhance their lives.

To live on Rationality Island, the citizens must follow one guiding principle: respect everyone's natural rights. Everyone is equal regardless of race or skin color because all enjoy the same natural rights. The citizens must also be honest and honor their contracts. Those who lack integrity and steal others' property are given a fair trial and, if convicted, sentenced to jail. Even when someone is caught committing a dishonorable act, the judges must listen to the defendant's argument before passing judgment. The citizens of Rationality Island believe that rationality should be used to enhance one's life; therefore, everyone must go to school to gain valuable knowledge. The citizens are free to produce and trade in a place where violence is unacceptable and contracts are reliable.

The flag of Rationality Island is a symbol of the values that the citizens exhibit. The gold bar on the island represents the discoveries and opportunities that people make and receive. The purple mountains indicate the precious metals that are mined in the Crystal Mountains. The lush, green tree and bushes symbolize the large amount of lumber that is chopped. The sun symbolizes hope for citizens to improve and enhance their lives. The fish stands for the exotic salmon and tuna available. The blue ocean represents the peace and freedom that people enjoy. Finally, the light blue sky symbolizes the trade and success in the city. Rationality Island is a prosperous city, for its people demonstrate the values of rationality and independence.

All the continents of planet Earth rest on giant slabs of rock called tectonic plates. As these plates moved, the continents drifted apart from the supercontinent Pangaea. Over millions of years, the continents moved to where they are now. They are still moving today, and their positions will be very different millions of years from now.

Alfred Wegener proposed the theory of continental drift. He noticed that the continents fit together like pieces of a jigsaw puzzle. Mountain ranges seemed to start on one continent and continue on another; for example, the Appalachian Mountains in North America matched up neatly with the Scottish Highlands. Fossils in various places showed that the climate there had once been different. Fossils of the freshwater reptile Mesosaurus were discovered in both Africa and South America; since it could not have swum across the salty ocean, scientists concluded that the two continents were once joined.

Wegener's idea was later supported by the theory of plate tectonics, which explains how forces deep within Earth cause ocean floors to spread and continents to move. The lithosphere is made up of huge plates of solid rock, and the continents rest on these plates. The asthenosphere beneath them consists of partially melted rock and acts as a slippery surface on which the plates move. Where plates move apart, magma is pushed up from the mantle; this upward movement creates tension that spreads the ocean floor and separates the plates, carrying the continents that rest on them apart as well.

Based on fossils, rocks, and other geological evidence, scientists concluded that the continents were once part of the supercontinent Pangaea. Over time, the continents spread apart due to the movement of tectonic plates. Even now, the continents are still moving, and North America is drifting closer to Asia and Australia.

Check the pools, images and OSDs

[ceph: root@host1 /]$ ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         83.83411  root default
-3         27.94470      host host1
 0    ssd   3.49309          osd.0             up   1.00000  1.00000
 1    ssd   3.49309          osd.1             up   1.00000  1.00000
 2    ssd   3.49309          osd.2             up   1.00000  1.00000
 3    ssd   3.49309          osd.3             up   1.00000  1.00000
 4    ssd   3.49309          osd.4             up   1.00000  1.00000
 5    ssd   3.49309          osd.5             up   1.00000  1.00000
 6    ssd   3.49309          osd.6             up   1.00000  1.00000
 7    ssd   3.49309          osd.7             up   1.00000  1.00000
-5         27.94470      host host2
 8    ssd   3.49309          osd.8             up   1.00000  1.00000
 9    ssd   3.49309          osd.9             up   1.00000  1.00000
10    ssd   3.49309          osd.10            up   1.00000  1.00000
11    ssd   3.49309          osd.11            up   1.00000  1.00000
12    ssd   3.49309          osd.12            up   1.00000  1.00000
13    ssd   3.49309          osd.13            up   1.00000  1.00000
14    ssd   3.49309          osd.14            up   1.00000  1.00000
15    ssd   3.49309          osd.15            up   1.00000  1.00000
-7         27.94470      host host3
16    ssd   3.49309          osd.16            up   1.00000  1.00000
17    ssd   3.49309          osd.17            up   1.00000  1.00000
18    ssd   3.49309          osd.18            up   1.00000  1.00000
19    ssd   3.49309          osd.19            up   1.00000  1.00000
20    ssd   3.49309          osd.20            up   1.00000  1.00000
21    ssd   3.49309          osd.21            up   1.00000  1.00000
22    ssd   3.49309          osd.22            up   1.00000  1.00000
23    ssd   3.49309          osd.23            up   1.00000  1.00000

[ceph: root@host1 /]$ ceph osd lspools
1 device_health_metrics
2 datapool

[ceph: root@host1 /]$ rbd showmapped
id  pool      namespace  image    snap  device
0   datapool             rbdvol1  -     /dev/rbd0
1   datapool             rbdvol2  -     /dev/rbd1
2   datapool             rbdvol3  -     /dev/rbd2
3   datapool             rbdvol4  -     /dev/rbd3

Remove the images and pools

[ceph: root@host1 /]$ rbd unmap /dev/rbd0
[ceph: root@host1 /]$ rbd unmap /dev/rbd1
[ceph: root@host1 /]$ rbd unmap /dev/rbd2
[ceph: root@host1 /]$ rbd unmap /dev/rbd3

[ceph: root@host1 /]$ rbd showmapped

[ceph: root@host1 /]$ rbd rm datapool/rbdvol1
Removing image: 100% complete...done.
[ceph: root@host1 /]$ rbd rm datapool/rbdvol2
Removing image: 100% complete...done.
[ceph: root@host1 /]$ rbd rm datapool/rbdvol3
Removing image: 100% complete...done.
[ceph: root@host1 /]$ rbd rm datapool/rbdvol4
Removing image: 100% complete...done.

[ceph: root@host1 /]$ ceph osd pool rm datapool datapool --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool

[ceph: root@host1 /]$ ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
mon.host1: mon_allow_pool_delete = 'true'
mon.host1: {}
mon.host3: mon_allow_pool_delete = 'true'
mon.host3: {}
mon.host2: mon_allow_pool_delete = 'true'
mon.host2: {}
[ceph: root@host1 /]$ ceph osd pool rm datapool datapool --yes-i-really-really-mean-it
pool 'datapool' removed
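
Note that injecting the argument with ceph tell only changes the value on the running monitors. A persistent alternative (not used here, assuming a release that supports the centralized configuration database) would be to set the option before attempting the pool removal:

ceph config set mon mon_allow_pool_delete true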

Remove the OSDs

[ceph: root@host1 /]$ for i in `seq 0 23`
> do
> ceph osd down $i && ceph osd destroy $i --force
> done
marked down osd.0.
destroyed osd.0
[omitted...]
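
Note that ceph osd destroy leaves the OSD IDs in the CRUSH map in the destroyed state, as the tree below shows. Since the whole cluster is being torn down anyway, ceph osd purge could be used instead to remove the CRUSH entries, auth keys, and OSD IDs in one step; a minimal sketch (not run here):

for i in $(seq 0 23); do ceph osd down $i && ceph osd purge $i --yes-i-really-mean-it; done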

[ceph: root@host1 /]$  ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME             STATUS     REWEIGHT  PRI-AFF
-1         83.83411  root default
-3         27.94470      host host1
 0    ssd   3.49309          osd.0         destroyed   1.00000  1.00000
 1    ssd   3.49309          osd.1         destroyed   1.00000  1.00000
 2    ssd   3.49309          osd.2         destroyed   1.00000  1.00000
 3    ssd   3.49309          osd.3         destroyed   1.00000  1.00000
 4    ssd   3.49309          osd.4         destroyed   1.00000  1.00000
 5    ssd   3.49309          osd.5         destroyed   1.00000  1.00000
 6    ssd   3.49309          osd.6         destroyed   1.00000  1.00000
 7    ssd   3.49309          osd.7         destroyed   1.00000  1.00000
-5         27.94470      host host2
 8    ssd   3.49309          osd.8         destroyed   1.00000  1.00000
 9    ssd   3.49309          osd.9         destroyed   1.00000  1.00000
10    ssd   3.49309          osd.10        destroyed   1.00000  1.00000
11    ssd   3.49309          osd.11        destroyed   1.00000  1.00000
12    ssd   3.49309          osd.12        destroyed   1.00000  1.00000
13    ssd   3.49309          osd.13        destroyed   1.00000  1.00000
14    ssd   3.49309          osd.14        destroyed   1.00000  1.00000
15    ssd   3.49309          osd.15        destroyed   1.00000  1.00000
-7         27.94470      host host3
16    ssd   3.49309          osd.16        destroyed   1.00000  1.00000
17    ssd   3.49309          osd.17        destroyed   1.00000  1.00000
18    ssd   3.49309          osd.18        destroyed   1.00000  1.00000
19    ssd   3.49309          osd.19        destroyed   1.00000  1.00000
20    ssd   3.49309          osd.20        destroyed   1.00000  1.00000
21    ssd   3.49309          osd.21        destroyed   1.00000  1.00000
22    ssd   3.49309          osd.22        destroyed   1.00000  1.00000
23    ssd   3.49309          osd.23               up   1.00000  1.00000

Remove the cluster hosts

[ceph: root@host1 /]$ ceph orch host rm host3
Removed host 'host3'
[ceph: root@host1 /]$ ceph orch host rm host2
Removed host 'host2'
[ceph: root@host1 /]$ ceph orch host rm host1
Removed host 'host1'

Check that no ceph daemons are running

[ceph: root@host1 /]$ ceph orch ps host3
No daemons reported
[ceph: root@host1 /]$ ceph orch ps host2
No daemons reported
[ceph: root@host1 /]$ ceph orch ps host1
No daemons reported

Remove the ceph storage cluster

[root@host1 ~]$ cephadm rm-cluster --fsid fec2332e-1b0b-11ec-abbe-ac1f6bc8d268 --force
[root@host1 ~]$ cephadm ls
[]
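
cephadm rm-cluster removes the Ceph daemons and data only on the node where it runs. If host2 and host3 still have Ceph containers for this cluster, the same command presumably needs to be repeated on each of them with the same fsid, for example:

ssh host2 cephadm rm-cluster --fsid fec2332e-1b0b-11ec-abbe-ac1f6bc8d268 --force
ssh host3 cephadm rm-cluster --fsid fec2332e-1b0b-11ec-abbe-ac1f6bc8d268 --force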

Cleanup the ceph configuration files

[root@host1 ~]$ rm -rf /etc/ceph
[root@host1 ~]$ rm -rf /var/lib/ceph*

Cleanup the ceph block devices

Do the following on each cluster node.

[root@host1 ~]$ lsblk
NAME                                                                                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1                                                                                               259:0    0  3.5T  0 disk
├─nvme0n1p3                                                                                           259:4    0  3.5T  0 part
│ ├─vgroot-lvswap01                                                                                   253:1    0    4G  0 lvm
│ └─vgroot-lvroot                                                                                     253:0    0  3.5T  0 lvm  /
├─nvme0n1p1                                                                                           259:2    0    1G  0 part /boot/efi
└─nvme0n1p2                                                                                           259:3    0  500M  0 part /boot
nvme3n1                                                                                               259:6    0  3.5T  0 disk
└─ceph--ab144c40--73d6--49bc--921b--65025c383bb1-osd--block--2b965e29--b194--4363--8c96--20ab5b97db33 253:3    0  3.5T  0 lvm
nvme2n1                                                                                               259:5    0  3.5T  0 disk
└─ceph--b1ffe76d--1043--43a2--848b--6ba117e71a75-osd--block--0d6ff85d--9c49--43a0--98a3--c519fbb20b9c 253:4    0  3.5T  0 lvm
nvme1n1                                                                                               259:1    0  3.5T  0 disk

[root@host1 ~]$ for i in `seq 2 9`; do dd if=/dev/zero of=/dev/nvme${i}n1 bs=1M count=1000; done
[root@host1 ~]$ reboot
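
The dd above simply clears the beginning of each NVMe device so that the LVM volumes created by ceph-volume (visible in the lsblk output) are gone after the reboot. An alternative sketch, assuming the VG names shown by lsblk on this host, is to remove the LVM metadata explicitly and wipe the remaining signatures:

vgremove -y ceph-b1ffe76d-1043-43a2-848b-6ba117e71a75
vgremove -y ceph-ab144c40-73d6-49bc-921b-65025c383bb1
wipefs -a /dev/nvme2n1 /dev/nvme3n1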

fio directory and filename options

To run fio benchmarks on multiple files or devices, we should understand the following fio options.

  • directory=str

Prefix filenames with this directory. Used to place files in a different location than ./. You can specify a number of directories by separating the names with a ‘:’ character. These directories will be assigned equally distributed to job clones created by numjobs as long as they are using generated filenames. If specific filename(s) are set fio will use the first listed directory, and thereby matching the filename semantic (which generates a file for each clone if not specified, but lets all clones use the same file if set).

  • filename=str

Fio normally makes up a filename based on the job name, thread number, and file number (see filename_format). If you want to share files between threads in a job or several jobs with fixed file paths, specify a filename for each of them to override the default. If the ioengine is file based, you can specify a number of files by separating the names with a ‘:’ colon. So if you wanted a job to open /dev/sda and /dev/sdb as the two working files, you would use filename=/dev/sda:/dev/sdb. This also means that whenever this option is specified, nrfiles is ignored. The size of regular files specified by this option will be size divided by number of files unless an explicit size is specified by filesize.
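
Both options can also be placed in an ini-style fio job file instead of on the command line. A minimal sketch that mirrors the command-line example in the next section (the directory and job names are taken from that example):

[global]
ioengine=libaio
direct=1
rw=write
bs=4k
filesize=10G
iodepth=128
end_fsync=1
group_reporting

[4kwrite]
directory=dir1
numjobs=4

Running fio against this job file is equivalent to passing the same options as command-line flags.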

Run fio on single directory

The following example runs four fio jobs on a single directory, dir1. Four different files are laid out automatically before the benchmark starts.

$ fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
4kwrite: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.7
Starting 4 processes
4kwrite: Laying out IO file (1 file / 10240MiB)
4kwrite: Laying out IO file (1 file / 10240MiB)
4kwrite: Laying out IO file (1 file / 10240MiB)
4kwrite: Laying out IO file (1 file / 10240MiB)
bs: 4 (f=4): [W(4)][4.5%][r=0KiB/s,w=394MiB/s][r=0,w=101k IOPS][eta 01m:46s]
<...>

$ ps -ef |grep fio | grep -v grep
root     25940 27212 23 21:10 pts/1    00:00:00 fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     25976 25940 27 21:10 ?        00:00:01 fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     25977 25940 28 21:10 ?        00:00:01 fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     25978 25940 28 21:10 ?        00:00:01 fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     25979 25940 27 21:10 ?        00:00:01 fio --name=4kwrite --ioengine=libaio --directory=dir1 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ lsof | egrep "dir1"
fio       25976          root    3u      REG              253,2 10737418240   23782491 /dir1/4kwrite.0.0
fio       25977          root    3u      REG              253,2 10737418240   23782492 /dir1/4kwrite.3.0
fio       25978          root    3u      REG              253,2 10737418240   23782495 /dir1/4kwrite.2.0
fio       25979          root    3u      REG              253,2 10737418240    5234528 /dir1/4kwrite.1.0

$ ls -la dir1 | grep write
-rw-r--r-- 1 root root 10737418240 Mar  1 21:11 4kwrite.0.0
-rw-r--r-- 1 root root 10737418240 Mar  1 21:11 4kwrite.1.0
-rw-r--r-- 1 root root 10737418240 Mar  1 21:11 4kwrite.2.0
-rw-r--r-- 1 root root 10737418240 Mar  1 21:11 4kwrite.3.0

Run fio on multiple directories

The following example runs four fio jobs on two directories dir1 and dir2. Two files are laid out automatically under each directory.

$ fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ ps -ef |grep fio | grep write
root     27362 27212  3 21:13 pts/1    00:00:01 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     27396 27362 29 21:13 ?        00:00:08 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     27397 27362 30 21:13 ?        00:00:08 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     27398 27362 31 21:13 ?        00:00:09 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     27399 27362 30 21:13 ?        00:00:08 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ lsof | egrep "dir1|dir2"
fio       27396          root    3u      REG              253,2 10737418240   23782491 /dir1/4kwrite.0.0
fio       27397          root    3u      REG              253,2 10737418240  538334779 /dir2/4kwrite.3.0
fio       27398          root    3u      REG              253,2 10737418240   23782492 /dir1/4kwrite.2.0
fio       27399          root    3u      REG              253,2 10737418240  538334780 /dir2/4kwrite.1.0

$ ls -ltr dir*/
dir2/:
total 20971520
-rw-r--r-- 1 root root 10737418240 Mar  1 21:13 4kwrite.3.0
-rw-r--r-- 1 root root 10737418240 Mar  1 21:13 4kwrite.1.0

dir1/:
total 20971520
-rw-r--r-- 1 root root 10737418240 Mar  1 21:13 4kwrite.2.0
-rw-r--r-- 1 root root 10737418240 Mar  1 21:13 4kwrite.0.0

If the option filename is specified, only the first listed directory will be used to create files.

$ fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ ps -ef |grep fio | grep write
root     29764 27212  8 21:17 pts/1    00:00:00 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     29798 29764 33 21:17 ?        00:00:04 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     29799 29764 35 21:17 ?        00:00:04 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     29800 29764 35 21:17 ?        00:00:04 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     29801 29764 33 21:17 ?        00:00:03 fio --name=4kwrite --ioengine=libaio --directory=dir1:dir2 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ lsof | egrep "dir1|dir2"
fio       29798          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       29799          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       29800          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       29801          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile

$ ls -la dir*/
dir1/:
total 10485760
drwxr-xr-x 2 root root          22 Mar  1 21:17 .
drwxr-xr-x 7 root root         225 Mar  1 20:28 ..
-rw-r--r-- 1 root root 10737418240 Mar  1 21:18 testfile

dir2/:
total 0
drwxr-xr-x 2 root root   6 Mar  1 21:16 .
drwxr-xr-x 7 root root 225 Mar  1 20:28 ..

Run multiple fio jobs on single file

The following example runs four jobs on a single file.

$ fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ ps -ef |grep fio | grep write
root     28819 27212  9 21:16 pts/1    00:00:00 fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     28884 28819 34 21:16 ?        00:00:03 fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     28885 28819 33 21:16 ?        00:00:02 fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     28886 28819 36 21:16 ?        00:00:03 fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting
root     28887 28819 35 21:16 ?        00:00:03 fio --name=4kwrite --ioengine=libaio --directory=dir1 --filename=testfile --blocksize=4k --readwrite=write --filesize=10G --end_fsync=1 --numjobs=4 --iodepth=128 --direct=1 --group_reporting

$ lsof | egrep "dir1"
fio       28884          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       28885          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       28886          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile
fio       28887          root    3u      REG              253,2 10737418240   23782491 /dir1/testfile

Run fio on multiple files from different directories

One job writes two files

In this example, one fio job writes two files located in two different directories. The total iodepth across the two files is 128. Note that the iodepth on each file is only about 64, half of the iodepth specified on the fio command line, because the single job spreads its 128 outstanding I/Os across its two files.

$ fio --blocksize=4k --filename=/mnt/dir1/testfile:/mnt/dir2/testfile --ioengine=libaio --readwrite=write --size=50G --name=test --numjobs=1 --group_reporting --direct=1 --iodepth=128 --end_fsync=1
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.7
Starting 1 process
<...>

$ lsof | egrep "/mnt/dir"
fio       74145                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       74145                 root    4u      REG              252,2 53687091200         11 /mnt/dir2/testfile

$ iostat -ktdx 2
Device:                     rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
pxd!pxd90182615933154185     0.00     0.00    0.00 98857.00     0.00 395428.00     8.00    62.86    0.64    0.00    0.64   0.01 100.00
pxd!pxd798820514973607815     0.00     0.00    0.00 98858.00     0.00 395432.00     8.00    62.84    0.64    0.00    0.64   0.01 100.00

Three jobs write two files

In this example, there are three fio jobs and each job writes the same two files. The actual iodepth on each file is ~184 (roughly 128/2 × 3), which is the iodepth accumulated from the three jobs.

$ fio --blocksize=4k --filename=/mnt/dir1/testfile:/mnt/dir2/testfile --ioengine=libaio --readwrite=write --size=50G --name=test --numjobs=3 --group_reporting --direct=1 --iodepth=128 --end_fsync=1
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.7
Starting 3 processes
<...>

$ lsof | egrep "/mnt/dir"
fio       85081                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       85081                 root    4u      REG              252,2 53687091200         11 /mnt/dir2/testfile
fio       85082                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       85082                 root    4u      REG              252,2 53687091200         11 /mnt/dir2/testfile
fio       85083                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       85083                 root    4u      REG              252,2 53687091200         11 /mnt/dir2/testfile

$ iostat -ktdx 2
pxd!pxd90182615933154185     0.00     0.50    0.00 99324.50     0.00 397300.00     8.00   184.13    1.85    0.00    1.85   0.01 100.00
pxd!pxd798820514973607815     0.00     0.50    0.00 99324.50     0.00 397300.00     8.00   184.02    1.85    0.00    1.85   0.01 100.00

Using a dedicated job to write each file

In this example, there are two fio jobs and each job writes a different file. The actual iodepth on each file is ~128, the same as the iodepth specified on the fio command line. This is usually the expected pattern for a benchmark.

$ fio --blocksize=4k --ioengine=libaio --readwrite=write --size=50G --direct=1 --iodepth=128 --end_fsync=1 --group_reporting --numjobs=1 --name=job1 --filename=/mnt/dir1/testfile --name=job2 --filename=/mnt/dir2/testfile
job1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
job2: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.7
Starting 2 processes
<...>

$ lsof | egrep "/mnt/dir"
fio       79794                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       79795                 root    3u      REG              252,2 53687091200         11 /mnt/dir2/testfile

$ iostat -ktdx 2
pxd!pxd90182615933154185     0.00     0.00    0.00 94151.00     0.00 376604.00     8.00   127.01    1.35    0.00    1.35   0.01 100.00
pxd!pxd798820514973607815     0.00     0.00    0.00 94152.50     0.00 376610.00     8.00   127.01    1.35    0.00    1.35   0.01 100.00

In this example, there are four fio jobs and each file is written by two of them. The actual iodepth on each file is ~256, twice the iodepth specified on the fio command line.

$ fio --blocksize=4k --ioengine=libaio --readwrite=write --size=50G --direct=1 --iodepth=128 --end_fsync=1 --group_reporting --numjobs=2 --name=job1 --filename=/mnt/dir1/testfile --name=job2 --filename=/mnt/dir2/testfile
job1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
job2: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.7
Starting 4 processes
<...>

$ lsof | egrep "/mnt/dir"
fio       81972                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       81973                 root    3u      REG              252,1 53687091200         11 /mnt/dir1/testfile
fio       81974                 root    3u      REG              252,2 53687091200         11 /mnt/dir2/testfile
fio       81975                 root    3u      REG              252,2 53687091200         11 /mnt/dir2/testfile

$ iostat -ktdx 2
pxd!pxd90182615933154185     0.00     0.50    0.00 93394.50     0.00 373580.00     8.00   254.94    2.73    0.00    2.73   0.01 100.00
pxd!pxd798820514973607815     0.00     0.50    0.00 93408.00     0.00 373634.00     8.00   254.96    2.73    0.00    2.73   0.01 100.00

Run fio on multiple devices

In this part, we study how to run fio benchmarks on multiple devices, and how the iodepth is reflected on each device.

We start with a single device; the following global parameters are used throughout:

  • blocksize=16k
  • filesize=50G (write/read 50G data on each device)
  • iodepth=64 (discussed further in the experiments below)
  • end_fsync=1
  • group_reporting

Write single device

Using one job to write a single device, /dev/nvme2n1:

$ fio --ioengine=libaio --direct=1 --readwrite=write --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1
job1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=1906MiB/s][r=0,w=122k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=32600: Fri Apr 22 22:19:32 2022
  write: IOPS=116k, BW=1820MiB/s (1908MB/s)(50.0GiB/28134msec)
    slat (nsec): min=1362, max=64140, avg=2467.92, stdev=1052.40
    clat (usec): min=4, max=4503, avg=546.60, stdev=554.19
     lat (usec): min=12, max=4505, avg=549.15, stdev=554.20
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   18], 10.00th=[   23], 20.00th=[   34],
     | 30.00th=[   50], 40.00th=[   90], 50.00th=[  474], 60.00th=[  619],
     | 70.00th=[  775], 80.00th=[ 1029], 90.00th=[ 1418], 95.00th=[ 1631],
     | 99.00th=[ 1942], 99.50th=[ 2057], 99.90th=[ 2311], 99.95th=[ 2474],
     | 99.99th=[ 3228]
   bw (  MiB/s): min= 1664, max= 1966, per=100.00%, avg=1821.04, stdev=86.43, samples=56
   iops        : min=106554, max=125860, avg=116546.80, stdev=5531.48, samples=56
  lat (usec)   : 10=0.01%, 20=7.69%, 50=22.50%, 100=10.42%, 250=4.17%
  lat (usec)   : 500=5.92%, 750=17.86%, 1000=10.58%
  lat (msec)   : 2=20.14%, 4=0.71%, 10=0.01%
  cpu          : usr=12.36%, sys=36.52%, ctx=1591388, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,3276800,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1820MiB/s (1908MB/s), 1820MiB/s-1820MiB/s (1908MB/s-1908MB/s), io=50.0GiB (53.7GB), run=28134-28134msec

Disk stats (read/write):
  nvme2n1: ios=88/3276800, merge=0/0, ticks=10/1784551, in_queue=1784561, util=99.65%

In the iostat output, w/s (writes per second) on nvme2n1 is ~116k, which matches the fio IOPS. The avgqu-sz (average queue size) is roughly equal to the fio iodepth of 64.

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116077.60     0.00 1857241.60    32.00    63.36    0.55    0.00    0.55   0.01 100.00
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116236.40     0.00 1859782.40    32.00    63.48    0.55    0.00    0.55   0.01 100.02
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116508.00     0.00 1864128.00    32.00    63.50    0.55    0.00    0.55   0.01  99.98
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 115832.60     0.00 1853321.60    32.00    63.49    0.55    0.00    0.55   0.01 100.02
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 117750.00     0.00 1884000.00    32.00    63.48    0.54    0.00    0.54   0.01 100.00
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Write two devices

Using two jobs to write two devices /dev/nvme2n1 and /dev/nvme3n1 separately:

$ fio --ioengine=libaio --direct=1 --readwrite=write --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1 --name=job2 --filename=/dev/nvme3n1
job1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
job2: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][100.0%][r=0KiB/s,w=2904MiB/s][r=0,w=186k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=2): err= 0: pid=32892: Fri Apr 22 22:22:05 2022
  write: IOPS=233k, BW=3648MiB/s (3825MB/s)(100GiB/28072msec)
    slat (nsec): min=1356, max=57113, avg=2474.06, stdev=784.80
    clat (usec): min=6, max=4165, avg=539.57, stdev=563.55
     lat (usec): min=12, max=4167, avg=542.13, stdev=563.57
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   16], 10.00th=[   21], 20.00th=[   32],
     | 30.00th=[   44], 40.00th=[   67], 50.00th=[  490], 60.00th=[  619],
     | 70.00th=[  775], 80.00th=[ 1037], 90.00th=[ 1450], 95.00th=[ 1647],
     | 99.00th=[ 1926], 99.50th=[ 2024], 99.90th=[ 2278], 99.95th=[ 2376],
     | 99.99th=[ 2900]
   bw (  MiB/s): min= 1611, max= 1981, per=50.57%, avg=1844.83, stdev=83.99, samples=109
   iops        : min=103146, max=126830, avg=118069.08, stdev=5375.32, samples=109
  lat (usec)   : 10=0.01%, 20=9.06%, 50=24.62%, 100=11.34%, 250=2.22%
  lat (usec)   : 500=2.86%, 750=18.76%, 1000=10.17%
  lat (msec)   : 2=20.38%, 4=0.59%, 10=0.01%
  cpu          : usr=12.43%, sys=36.55%, ctx=3200368, majf=0, minf=32
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,6553600,0,2 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=3648MiB/s (3825MB/s), 3648MiB/s-3648MiB/s (3825MB/s-3825MB/s), io=100GiB (107GB), run=28072-28072msec

Disk stats (read/write):
  nvme2n1: ios=88/3276800, merge=0/0, ticks=10/1782540, in_queue=1782549, util=99.67%
  nvme3n1: ios=88/3276800, merge=0/0, ticks=13/1745688, in_queue=1745702, util=97.57%

Note that each job writes its own device. The IOPS doubles compared to the single-device write.

In the iostat output, the w/s on each device is ~116k and the total w/s across the two devices is ~233k. The avgqu-sz on each device is ~64, which is expected and equal to the fio iodepth of 64.

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116848.20     0.00 1869571.20    32.00    63.56    0.54    0.00    0.54   0.01 100.00
nvme3n1           0.00     0.00    0.00 119530.00     0.00 1912480.00    32.00    63.57    0.53    0.00    0.53   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116253.00     0.00 1860048.00    32.00    63.57    0.55    0.00    0.55   0.01 100.00
nvme3n1           0.00     0.00    0.00 119619.80     0.00 1913916.80    32.00    63.58    0.53    0.00    0.53   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116381.20     0.00 1862099.20    32.00    63.56    0.55    0.00    0.55   0.01 100.08
nvme3n1           0.00     0.00    0.00 118331.00     0.00 1893296.00    32.00    63.57    0.54    0.00    0.54   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 116712.20     0.00 1867395.20    32.00    63.57    0.54    0.00    0.54   0.01 100.00
nvme3n1           0.00     0.00    0.00 119082.40     0.00 1905318.40    32.00    63.56    0.53    0.00    0.53   0.01 100.00

Read single device

Using one job to read a single device, /dev/nvme2n1:

$ fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1
job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2825MiB/s,w=0KiB/s][r=181k,w=0 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=33037: Fri Apr 22 22:24:12 2022
   read: IOPS=181k, BW=2820MiB/s (2957MB/s)(50.0GiB/18153msec)
    slat (nsec): min=1274, max=52836, avg=1738.48, stdev=799.29
    clat (usec): min=75, max=2997, avg=352.48, stdev=89.63
     lat (usec): min=76, max=2999, avg=354.28, stdev=89.64
    clat percentiles (usec):
     |  1.00th=[  192],  5.00th=[  229], 10.00th=[  245], 20.00th=[  273],
     | 30.00th=[  302], 40.00th=[  322], 50.00th=[  351], 60.00th=[  371],
     | 70.00th=[  392], 80.00th=[  416], 90.00th=[  461], 95.00th=[  506],
     | 99.00th=[  627], 99.50th=[  676], 99.90th=[  807], 99.95th=[  889],
     | 99.99th=[  988]
   bw (  MiB/s): min= 2781, max= 2826, per=100.00%, avg=2820.55, stdev= 7.65, samples=36
   iops        : min=178016, max=180900, avg=180515.03, stdev=489.56, samples=36
  lat (usec)   : 100=0.01%, 250=11.85%, 500=82.74%, 750=5.23%, 1000=0.17%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=14.06%, sys=44.38%, ctx=2003314, majf=0, minf=271
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=3276800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=2820MiB/s (2957MB/s), 2820MiB/s-2820MiB/s (2957MB/s-2957MB/s), io=50.0GiB (53.7GB), run=18153-18153msec

Disk stats (read/write):
  nvme2n1: ios=3274022/0, merge=0/0, ticks=1149877/0, in_queue=1149877, util=99.53%

In the iostat output, r/s (reads per second) on nvme2n1 is ~181k, which matches the fio IOPS. The avgqu-sz (average queue size) is roughly equal to the fio iodepth of 64.

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180561.80    0.00 2888988.80     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.02
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180625.60    0.00 2890009.60     0.00    32.00    63.42    0.35    0.35    0.00   0.01 100.00
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180660.80    0.00 2890572.80     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.04
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Read two devices

Using two jobs to read two devices /dev/nvme2n1 and /dev/nvme3n1 separately:

$ fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1  --name=job2 --filename=/dev/nvme3n1
job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
job2: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 2 processes
Jobs: 2 (f=2): [R(2)][100.0%][r=5639MiB/s,w=0KiB/s][r=361k,w=0 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=2): err= 0: pid=33148: Fri Apr 22 22:25:16 2022
   read: IOPS=360k, BW=5628MiB/s (5901MB/s)(100GiB/18195msec)
    slat (nsec): min=1272, max=54671, avg=1803.20, stdev=748.72
    clat (usec): min=70, max=1344, avg=353.14, stdev=87.70
     lat (usec): min=73, max=1352, avg=355.00, stdev=87.70
    clat percentiles (usec):
     |  1.00th=[  186],  5.00th=[  225], 10.00th=[  245], 20.00th=[  277],
     | 30.00th=[  306], 40.00th=[  330], 50.00th=[  351], 60.00th=[  371],
     | 70.00th=[  392], 80.00th=[  416], 90.00th=[  461], 95.00th=[  502],
     | 99.00th=[  611], 99.50th=[  660], 99.90th=[  775], 99.95th=[  873],
     | 99.99th=[  979]
   bw (  MiB/s): min= 2779, max= 2819, per=50.02%, avg=2814.82, stdev= 6.34, samples=72
   iops        : min=177878, max=180456, avg=180148.69, stdev=405.55, samples=72
  lat (usec)   : 100=0.01%, 250=11.87%, 500=83.07%, 750=4.92%, 1000=0.12%
  lat (msec)   : 2=0.01%
  cpu          : usr=14.32%, sys=45.95%, ctx=3778567, majf=0, minf=541
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6553600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=5628MiB/s (5901MB/s), 5628MiB/s-5628MiB/s (5901MB/s-5901MB/s), io=100GiB (107GB), run=18195-18195msec

Disk stats (read/write):
  nvme2n1: ios=3265872/0, merge=0/0, ticks=1149825/0, in_queue=1149825, util=99.51%
  nvme3n1: ios=3267478/0, merge=0/0, ticks=1149712/0, in_queue=1149712, util=99.52%

Note that each job reads its own device. The IOPS doubles compared to the single-device read.

In the iostat output, the r/s on each device is ~180k and the total r/s across the two devices is ~360k. The avgqu-sz on each device is ~64, which is expected and equal to the fio iodepth of 64.

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180121.20    0.00 2881939.20     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.04
nvme3n1           0.00     0.00 180181.60    0.00 2882905.60     0.00    32.00    63.43    0.35    0.35    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180242.20    0.00 2883875.20     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.00
nvme3n1           0.00     0.00 180260.00    0.00 2884160.00     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 180221.40    0.00 2883542.40     0.00    32.00    63.44    0.35    0.35    0.00   0.01 100.00
nvme3n1           0.00     0.00 180278.40    0.00 2884454.40     0.00    32.00    63.43    0.35    0.35    0.00   0.01 100.00

Incorrect way to write/read multiple devices

Using one job to write two devices:

$ fio --ioengine=libaio --direct=1 --readwrite=write --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
job1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=2): [W(1)][100.0%][r=0KiB/s,w=3537MiB/s][r=0,w=226k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=33535: Fri Apr 22 22:59:18 2022
  write: IOPS=210k, BW=3284MiB/s (3444MB/s)(100GiB/31177msec)
    slat (nsec): min=1376, max=52293, avg=2321.83, stdev=1055.59
    clat (usec): min=2, max=3190, avg=301.76, stdev=424.79
     lat (usec): min=12, max=3192, avg=304.15, stdev=424.78
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   16], 10.00th=[   20], 20.00th=[   26],
     | 30.00th=[   32], 40.00th=[   40], 50.00th=[   57], 60.00th=[  112],
     | 70.00th=[  334], 80.00th=[  635], 90.00th=[  979], 95.00th=[ 1254],
     | 99.00th=[ 1663], 99.50th=[ 1811], 99.90th=[ 2073], 99.95th=[ 2147],
     | 99.99th=[ 2343]
   bw (  MiB/s): min= 2935, max= 3785, per=99.93%, avg=3282.15, stdev=221.08, samples=62
   iops        : min=187876, max=242266, avg=210057.40, stdev=14148.71, samples=62
  lat (usec)   : 4=0.01%, 10=0.01%, 20=11.44%, 50=35.69%, 100=11.67%
  lat (usec)   : 250=8.20%, 500=8.42%, 750=8.43%, 1000=6.59%
  lat (msec)   : 2=9.38%, 4=0.17%
  cpu          : usr=19.43%, sys=52.57%, ctx=1390981, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,6553600,0,2 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=3284MiB/s (3444MB/s), 3284MiB/s-3284MiB/s (3444MB/s-3444MB/s), io=100GiB (107GB), run=31177-31177msec

Disk stats (read/write):
  nvme2n1: ios=59/3276800, merge=0/0, ticks=6/1239109, in_queue=1239116, util=99.68%
  nvme3n1: ios=57/3276800, merge=0/0, ticks=7/649690, in_queue=649696, util=99.70%

In the iostat output, the w/s on each device is ~102k. The avgqu-sz on the two devices is quite different (about 40 vs. 22), and the total queue depth adds up to about 64. This is not what we want in a benchmark: we usually expect the queue depth to be identical on all the devices under the benchmark workload.

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 102307.80     0.00 1636924.80    32.00    39.91    0.39    0.00    0.39   0.01 100.00
nvme3n1           0.00     0.00    0.00 102309.40     0.00 1636950.40    32.00    22.12    0.22    0.00    0.22   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 101318.40     0.00 1621094.40    32.00    40.48    0.40    0.00    0.40   0.01 100.00
nvme3n1           0.00     0.00    0.00 101295.80     0.00 1620732.80    32.00    21.48    0.21    0.00    0.21   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 100715.00     0.00 1611440.00    32.00    39.92    0.40    0.00    0.40   0.01 100.00
nvme3n1           0.00     0.00    0.00 100736.60     0.00 1611785.60    32.00    22.21    0.22    0.00    0.22   0.01 100.02
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 101647.20     0.00 1626355.20    32.00    40.23    0.40    0.00    0.40   0.01 100.00
nvme3n1           0.00     0.00    0.00 101632.20     0.00 1626115.20    32.00    21.81    0.21    0.00    0.21   0.01  99.98
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 109138.00     0.00 1746208.00    32.00    43.43    0.40    0.00    0.40   0.01 100.04
nvme3n1           0.00     0.00    0.00 109157.60     0.00 1746521.60    32.00    16.06    0.15    0.00    0.15   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 114420.80     0.00 1830732.80    32.00    36.36    0.32    0.00    0.32   0.01 100.00
nvme3n1           0.00     0.00    0.00 114415.40     0.00 1830646.40    32.00    21.52    0.19    0.00    0.19   0.01 100.00

Using two cloned jobs (numjobs=2), each writing both devices:

$ fio --ioengine=libaio --direct=1 --readwrite=write --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
job1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
...
fio-3.7
Starting 2 processes
Jobs: 2 (f=4): [W(2)][100.0%][r=0KiB/s,w=3139MiB/s][r=0,w=201k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=2): err= 0: pid=33670: Fri Apr 22 23:09:45 2022
  write: IOPS=216k, BW=3378MiB/s (3542MB/s)(200GiB/60623msec)
    slat (nsec): min=1361, max=55969, avg=2845.77, stdev=1030.55
    clat (usec): min=5, max=6608, avg=588.58, stdev=938.21
     lat (usec): min=11, max=6610, avg=591.51, stdev=938.21
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   14], 10.00th=[   16], 20.00th=[   21],
     | 30.00th=[   25], 40.00th=[   29], 50.00th=[   34], 60.00th=[   40],
     | 70.00th=[  330], 80.00th=[ 1663], 90.00th=[ 2311], 95.00th=[ 2540],
     | 99.00th=[ 2966], 99.50th=[ 3097], 99.90th=[ 3458], 99.95th=[ 3621],
     | 99.99th=[ 4047]
   bw (  MiB/s): min= 1498, max= 1895, per=50.01%, avg=1689.53, stdev=95.93, samples=242
   iops        : min=95908, max=121316, avg=108129.62, stdev=6139.49, samples=242
  lat (usec)   : 10=0.01%, 20=19.61%, 50=45.92%, 100=2.35%, 250=1.43%
  lat (usec)   : 500=2.03%, 750=1.82%, 1000=1.70%
  lat (msec)   : 2=9.28%, 4=15.82%, 10=0.01%
  cpu          : usr=12.52%, sys=35.52%, ctx=4384091, majf=0, minf=42
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,13107200,0,4 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=3378MiB/s (3542MB/s), 3378MiB/s-3378MiB/s (3542MB/s-3542MB/s), io=200GiB (215GB), run=60623-60623msec

Disk stats (read/write):
  nvme2n1: ios=118/6553600, merge=0/0, ticks=11/5128161, in_queue=5128173, util=99.87%
  nvme3n1: ios=90/6553600, merge=0/0, ticks=10/2544319, in_queue=2544330, util=99.86%

In this experiment, setting numjobs=2 creates two cloned jobs that run the same workload, and each job writes both devices.

In the iostat output, the w/s on each device is ~105k and the total w/s is ~210k, which is close to the fio IOPS. However, the avgqu-sz on each device is very different (113 vs. 14). The total avgqu-sz is about 126, which is close to the combined iodepth of the two jobs (2 × 64 = 128).

Even though the total w/s is close to the earlier experiment in which two separate jobs each wrote their own device, the avgqu-sz on each device does not match the fio iodepth of 64.

So, when benchmarking multiple devices, we prefer to use a separate, dedicated job for each device (a job-file sketch of this pattern follows the iostat output below).

$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 104810.00     0.00 1676963.20    32.00   112.91    1.08    0.00    1.08   0.01 100.04
nvme3n1           0.00     0.00    0.00 104813.00     0.00 1677008.00    32.00    13.72    0.13    0.00    0.13   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 103548.00     0.00 1656764.80    32.00   112.87    1.09    0.00    1.09   0.01 100.04
nvme3n1           0.00     0.00    0.00 103550.60     0.00 1656809.60    32.00    13.74    0.13    0.00    0.13   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 104651.80     0.00 1674428.80    32.00   112.75    1.08    0.00    1.08   0.01 100.00
nvme3n1           0.00     0.00    0.00 104644.20     0.00 1674307.20    32.00    13.89    0.13    0.00    0.13   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 105734.20     0.00 1691747.20    32.00   113.22    1.07    0.00    1.07   0.01 100.00
nvme3n1           0.00     0.00    0.00 105744.40     0.00 1691910.40    32.00    13.40    0.13    0.00    0.13   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 105221.00     0.00 1683536.00    32.00   117.70    1.12    0.00    1.12   0.01 100.00
nvme3n1           0.00     0.00    0.00 105215.60     0.00 1683449.60    32.00     8.93    0.08    0.00    0.08   0.01 100.02
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 114836.60     0.00 1837388.80    32.00    82.89    0.72    0.00    0.72   0.01 100.00
nvme3n1           0.00     0.00    0.00 114786.80     0.00 1836588.80    32.00    43.73    0.38    0.00    0.38   0.01  99.98
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 115051.20     0.00 1840816.00    32.00    78.09    0.68    0.00    0.68   0.01 100.04
nvme3n1           0.00     0.00    0.00 115083.60     0.00 1841337.60    32.00    48.55    0.42    0.00    0.42   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 114826.60     0.00 1837225.60    32.00    81.10    0.71    0.00    0.71   0.01 100.00
nvme3n1           0.00     0.00    0.00 114844.60     0.00 1837513.60    32.00    45.49    0.40    0.00    0.40   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 111133.00     0.00 1778128.00    32.00    48.09    0.43    0.00    0.43   0.01 100.02
nvme3n1           0.00     0.00    0.00 111080.80     0.00 1777292.80    32.00    78.49    0.71    0.00    0.71   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 106740.00     0.00 1707840.00    32.00    26.91    0.25    0.00    0.25   0.01 100.00
nvme3n1           0.00     0.00    0.00 106743.40     0.00 1707894.40    32.00    99.64    0.93    0.00    0.93   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00 106593.60     0.00 1705497.60    32.00    31.85    0.30    0.00    0.30   0.01 100.00
nvme3n1           0.00     0.00    0.00 106640.80     0.00 1706252.80    32.00    94.76    0.89    0.00    0.89   0.01 100.04

Using one job to read two devices:

$ fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.7
Starting 1 process
Jobs: 1 (f=2): [R(1)][100.0%][r=4056MiB/s,w=0KiB/s][r=260k,w=0 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=33378: Fri Apr 22 22:29:06 2022
   read: IOPS=258k, BW=4035MiB/s (4231MB/s)(100GiB/25375msec)
    slat (nsec): min=1268, max=80634, avg=1910.78, stdev=947.79
    clat (usec): min=51, max=2289, avg=245.54, stdev=92.47
     lat (usec): min=53, max=2291, avg=247.52, stdev=92.47
    clat percentiles (usec):
     |  1.00th=[   85],  5.00th=[  113], 10.00th=[  133], 20.00th=[  151],
     | 30.00th=[  172], 40.00th=[  204], 50.00th=[  239], 60.00th=[  289],
     | 70.00th=[  314], 80.00th=[  338], 90.00th=[  363], 95.00th=[  383],
     | 99.00th=[  441], 99.50th=[  457], 99.90th=[  498], 99.95th=[  529],
     | 99.99th=[  709]
   bw (  MiB/s): min= 3783, max= 4101, per=99.99%, avg=4034.89, stdev=43.41, samples=50
   iops        : min=242150, max=262484, avg=258233.02, stdev=2778.43, samples=50
  lat (usec)   : 100=2.81%, 250=48.78%, 500=48.31%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=17.61%, sys=59.69%, ctx=1442924, majf=0, minf=274
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6553600,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=4035MiB/s (4231MB/s), 4035MiB/s-4035MiB/s (4231MB/s-4231MB/s), io=100GiB (107GB), run=25375-25375msec

Disk stats (read/write):
  nvme2n1: ios=3245464/0, merge=0/0, ticks=801718/0, in_queue=801719, util=99.67%
  nvme3n1: ios=3245472/0, merge=0/0, ticks=762968/0, in_queue=762969, util=99.67%


$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 128705.40    0.00 2059286.40     0.00    32.00    31.86    0.25    0.25    0.00   0.01  99.98
nvme3n1           0.00     0.00 128703.00    0.00 2059248.00     0.00    32.00    30.47    0.24    0.24    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 129154.60    0.00 2066473.60     0.00    32.00    31.82    0.25    0.25    0.00   0.01 100.02
nvme3n1           0.00     0.00 129157.80    0.00 2066524.80     0.00    32.00    30.53    0.24    0.24    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 129703.80    0.00 2075260.80     0.00    32.00    31.93    0.25    0.25    0.00   0.01 100.00
nvme3n1           0.00     0.00 129702.40    0.00 2075238.40     0.00    32.00    30.42    0.23    0.23    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 129521.60    0.00 2072345.60     0.00    32.00    32.04    0.25    0.25    0.00   0.01 100.04
nvme3n1           0.00     0.00 129523.60    0.00 2072377.60     0.00    32.00    30.32    0.23    0.23    0.00   0.01 100.02

Using two jobs to read two devices:

$ fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
...
fio-3.7
Starting 2 processes
Jobs: 2 (f=4): [R(2)][100.0%][r=5606MiB/s,w=0KiB/s][r=359k,w=0 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=2): err= 0: pid=33809: Fri Apr 22 23:18:37 2022
   read: IOPS=358k, BW=5597MiB/s (5869MB/s)(200GiB/36593msec)
    slat (nsec): min=1260, max=52904, avg=1967.07, stdev=877.86
    clat (usec): min=63, max=9900, avg=355.00, stdev=150.12
     lat (usec): min=65, max=9901, avg=357.03, stdev=150.12
    clat percentiles (usec):
     |  1.00th=[  165],  5.00th=[  198], 10.00th=[  219], 20.00th=[  245],
     | 30.00th=[  269], 40.00th=[  297], 50.00th=[  334], 60.00th=[  371],
     | 70.00th=[  412], 80.00th=[  453], 90.00th=[  510], 95.00th=[  562],
     | 99.00th=[  685], 99.50th=[  766], 99.90th=[ 2212], 99.95th=[ 2704],
     | 99.99th=[ 3195]
   bw (  MiB/s): min= 2725, max= 2811, per=50.00%, avg=2798.56, stdev= 8.75, samples=146
   iops        : min=174406, max=179932, avg=179107.78, stdev=559.92, samples=146
  lat (usec)   : 100=0.01%, 250=22.38%, 500=66.31%, 750=10.74%, 1000=0.32%
  lat (msec)   : 2=0.13%, 4=0.12%, 10=0.01%
  cpu          : usr=14.17%, sys=46.00%, ctx=5209795, majf=0, minf=550
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=13107200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=5597MiB/s (5869MB/s), 5597MiB/s-5597MiB/s (5869MB/s-5869MB/s), io=200GiB (215GB), run=36593-36593msec

Disk stats (read/write):
  nvme2n1: ios=6516709/0, merge=0/0, ticks=2254035/0, in_queue=2254035, util=99.79%
  nvme3n1: ios=6516726/0, merge=0/0, ticks=2345468/0, in_queue=2345468, util=99.82%


$ iostat -ktdx 5 | egrep "Device|nvme2n1|nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179103.00    0.00 2865648.00     0.00    32.00    61.69    0.34    0.34    0.00   0.01 100.00
nvme3n1           0.00     0.00 179101.20    0.00 2865619.20     0.00    32.00    64.77    0.36    0.36    0.00   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179355.00    0.00 2869680.00     0.00    32.00    61.83    0.34    0.34    0.00   0.01 100.08
nvme3n1           0.00     0.00 179356.40    0.00 2869702.40     0.00    32.00    64.64    0.36    0.36    0.00   0.01 100.08
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179112.80    0.00 2865804.80     0.00    32.00    61.77    0.34    0.34    0.00   0.01 100.00
nvme3n1           0.00     0.00 179112.40    0.00 2865798.40     0.00    32.00    64.69    0.36    0.36    0.00   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179088.00    0.00 2865408.00     0.00    32.00    61.85    0.35    0.35    0.00   0.01 100.02
nvme3n1           0.00     0.00 179087.20    0.00 2865395.20     0.00    32.00    64.61    0.36    0.36    0.00   0.01 100.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179096.20    0.00 2865539.20     0.00    32.00    62.14    0.35    0.35    0.00   0.01  99.98
nvme3n1           0.00     0.00 179095.00    0.00 2865520.00     0.00    32.00    64.33    0.36    0.36    0.00   0.01 100.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00 179127.40    0.00 2866038.40     0.00    32.00    62.21    0.35    0.35    0.00   0.01 100.02
nvme3n1           0.00     0.00 179128.80    0.00 2866060.80     0.00    32.00    64.26    0.36    0.36    0.00   0.01 100.00

From the output of the ps and lsof commands, we can see that each fio job process opens both devices for reading.

$ ps -ef |grep fio
root     33827 30166 63 23:22 pts/0    00:00:07 fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
root     33925 33827 57 23:22 ?        00:00:06 fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
root     33926 33827 57 23:22 ?        00:00:06 fio --ioengine=libaio --direct=1 --readwrite=read --blocksize=16k --filesize=50G --end_fsync=1 --iodepth=64 --numjobs=2 --group_reporting --name=job1 --filename=/dev/nvme2n1:/dev/nvme3n1
root     33945 30233  0 23:22 pts/1    00:00:00 grep --color=auto fio

$ lsof | grep nvme | grep fio
fio       33925                 root    3r      BLK              259,0       0t0      33809 /dev/nvme2n1
fio       33925                 root    4r      BLK             259,11       0t0      33820 /dev/nvme3n1
fio       33925                 root    5r      BLK             259,11       0t0      33820 /dev/nvme3n1
fio       33926                 root    3r      BLK              259,0       0t0      33809 /dev/nvme2n1
fio       33926                 root    4r      BLK              259,0       0t0      33809 /dev/nvme2n1
fio       33926                 root    5r      BLK             259,11       0t0      33820 /dev/nvme3n1

Backup a database

mysqldump is a command-line utility that can be used to generate backups of a MySQL database.

$ mysqldump -u root --password=<db_password> mydb > mydb_dump_`date +"%Y%m%d_%H%M%S"`.sql
$  ls -ltr | grep mydb
-rw-r--r--.   1 root root 4834575 Sep 28 21:11 mydb_dump_20210928_144610.sql
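
For larger databases, the dump can be piped through gzip to reduce its size. This is only a sketch reusing the command above; the compression step is not part of the original workflow.

$ mysqldump -u root --password=<db_password> mydb | gzip > mydb_dump_`date +"%Y%m%d_%H%M%S"`.sql.gz

The compressed dump can later be restored by piping it through gunzip into the mysql client.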

Restore a database

Create an empty database before restoring, as below:

$ mysql -u root -p

mysql> create database mydb;
mysql> show databases;
mysql> exit

Restore the database:

$ mysql -u root -p mydb < mydb_dump_20210928_144610.sql

Check the database size as below:

mysql> SELECT table_schema "DB Name", ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"  FROM information_schema.tables  GROUP BY table_schema;
+--------------------+---------------+
| DB Name            | DB Size in MB |
+--------------------+---------------+
| mydb               |           8.1 |
+--------------------+---------------+

An image gallery can be easily built in Jekyll by using the LightBox and Image Gallery scripts.

  1. LightBox

Lightbox is a solution that automatically loads your image, YouTube, and Vimeo links in a minimalistic and responsive pseudo window/overlay. No adjustment to your links is required; just follow the instructions to install the CSS and JS scripts.

  2. Image Gallery

The script Image Gallery creates an image gallery. The script reads all images from a specific (user-defined) folder in Jekyll, automatically crops them to 300px squares using an image resize proxy service, and shows them in rows of five. Just follow the very easy instructions to install it and we are good to go.

Intro to Sysbench

sysbench is a scriptable multi-threaded benchmark tool based on LuaJIT. It is most frequently used for database benchmarks, but can also be used to create arbitrarily complex workloads that do not involve a database server.

sysbench comes with the following bundled benchmarks:

  • oltp_*.lua: a collection of OLTP-like database benchmarks
  • fileio: a filesystem-level benchmark
  • cpu: a simple CPU benchmark
  • memory: a memory access benchmark
  • threads: a thread-based scheduler benchmark
  • mutex: a POSIX mutex benchmark

Below is a description of typical test commands and their purpose; a short usage example with the bundled fileio test follows the list:

  • prepare: performs preparative actions for those tests which need them, e.g. creating the necessary files on disk for the fileio test, or filling the test database for database benchmarks.
  • run: runs the actual test specified with the testname argument. This command is provided by all tests.
  • cleanup: removes temporary data after the test run in those tests which create one.
  • help: displays usage information for the test specified with the testname argument. This includes the full list of commands provided by the test, so it should be used to get the available commands.
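
As a hedged illustration of how these commands chain together, the bundled fileio test can be driven as below. The file size, test mode and run time are arbitrary example values, not taken from the original notes.

$ sysbench fileio --file-total-size=4G prepare
$ sysbench fileio --file-total-size=4G --file-test-mode=rndrw --time=60 run
$ sysbench fileio --file-total-size=4G cleanup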

Install sysbench on CentOS 7.5

$ cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)
$ uname -r
5.7.12-1.el7.elrepo.x86_64

$ curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
$ sudo yum -y install sysbench

$ sysbench --version
sysbench 1.0.20

$ sysbench --help
Usage:
  sysbench [options]... [testname] [command]

Commands implemented by most tests: prepare run cleanup help

General options:
  --threads=N                     number of threads to use [1]
  --events=N                      limit for total number of events [0]
  --time=N                        limit for total execution time in seconds [10]
  --forced-shutdown=STRING        number of seconds to wait after the --time limit before forcing shutdown, or 'off' to disable [off]
  --thread-stack-size=SIZE        size of stack per thread [64K]
  --rate=N                        average transactions rate. 0 for unlimited rate [0]
  --report-interval=N             periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]
  --report-checkpoints=[LIST,...] dump full statistics and reset all counters at specified points in time. The argument is a list of comma-separated values representing the amount of time in seconds elapsed from start of test when report checkpoint(s) must be performed. Report checkpoints are off by default. []
  --debug[=on|off]                print more debugging info [off]
  --validate[=on|off]             perform validation checks where possible [off]
  --help[=on|off]                 print help and exit [off]
  --version[=on|off]              print version and exit [off]
  --config-file=FILENAME          File containing command line options
  --tx-rate=N                     deprecated alias for --rate [0]
  --max-requests=N                deprecated alias for --events [0]
  --max-time=N                    deprecated alias for --time [0]
  --num-threads=N                 deprecated alias for --threads [1]

Pseudo-Random Numbers Generator options:
  --rand-type=STRING random numbers distribution {uniform,gaussian,special,pareto} [special]
  --rand-spec-iter=N number of iterations used for numbers generation [12]
  --rand-spec-pct=N  percentage of values to be treated as 'special' (for special distribution) [1]
  --rand-spec-res=N  percentage of 'special' values to use (for special distribution) [75]
  --rand-seed=N      seed for random number generator. When 0, the current time is used as a RNG seed. [0]
  --rand-pareto-h=N  parameter h for pareto distribution [0.2]

Log options:
  --verbosity=N verbosity level {5 - debug, 0 - only critical messages} [3]

  --percentile=N       percentile to calculate in latency statistics (1-100). Use the special value of 0 to disable percentile calculations [95]
  --histogram[=on|off] print latency histogram in report [off]

General database options:

  --db-driver=STRING  specifies database driver to use ('help' to get list of available drivers) [mysql]
  --db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]
  --db-debug[=on|off] print database-specific debug information [off]


Compiled-in database drivers:
  mysql - MySQL driver
  pgsql - PostgreSQL driver

mysql options:
  --mysql-host=[LIST,...]          MySQL server host [localhost]
  --mysql-port=[LIST,...]          MySQL server port [3306]
  --mysql-socket=[LIST,...]        MySQL socket
  --mysql-user=STRING              MySQL user [sbtest]
  --mysql-password=STRING          MySQL password []
  --mysql-db=STRING                MySQL database name [sbtest]
  --mysql-ssl[=on|off]             use SSL connections, if available in the client library [off]
  --mysql-ssl-cipher=STRING        use specific cipher for SSL connections []
  --mysql-compression[=on|off]     use compression, if available in the client library [off]
  --mysql-debug[=on|off]           trace all client library calls [off]
  --mysql-ignore-errors=[LIST,...] list of errors to ignore, or "all" [1213,1020,1205]
  --mysql-dry-run[=on|off]         Dry run, pretend that all MySQL client API calls are successful without executing them [off]

pgsql options:
  --pgsql-host=STRING     PostgreSQL server host [localhost]
  --pgsql-port=N          PostgreSQL server port [5432]
  --pgsql-user=STRING     PostgreSQL user [sbtest]
  --pgsql-password=STRING PostgreSQL password []
  --pgsql-db=STRING       PostgreSQL database name [sbtest]

Compiled-in tests:
  fileio - File I/O test
  cpu - CPU performance test
  memory - Memory functions speed test
  threads - Threads subsystem performance test
  mutex - Mutex performance test

See 'sysbench <testname> help' for a list of options for each test.

$ ls -la /usr/share/sysbench/tests/include/oltp_legacy/
total 56
drwxr-xr-x 2 root root  284 Sep  7 20:53 .
drwxr-xr-x 3 root root 4096 Sep  7 20:53 ..
-rw-r--r-- 1 root root 1195 Apr 24  2020 bulk_insert.lua
-rw-r--r-- 1 root root 4696 Apr 24  2020 common.lua
-rw-r--r-- 1 root root  366 Apr 24  2020 delete.lua
-rw-r--r-- 1 root root 1171 Apr 24  2020 insert.lua
-rw-r--r-- 1 root root 3004 Apr 24  2020 oltp.lua
-rw-r--r-- 1 root root  368 Apr 24  2020 oltp_simple.lua
-rw-r--r-- 1 root root  527 Apr 24  2020 parallel_prepare.lua
-rw-r--r-- 1 root root  369 Apr 24  2020 select.lua
-rw-r--r-- 1 root root 1448 Apr 24  2020 select_random_points.lua
-rw-r--r-- 1 root root 1556 Apr 24  2020 select_random_ranges.lua
-rw-r--r-- 1 root root  369 Apr 24  2020 update_index.lua
-rw-r--r-- 1 root root  578 Apr 24  2020 update_non_index.lua

MariaDB vs. MySQL

MariaDB is a community-developed, commercially supported fork of the MySQL relational database management system (RDBMS), intended to remain free and open-source software under the GNU General Public License. Development is led by some of the original developers of MySQL, who forked it due to concerns over its acquisition by Oracle Corporation in 2009. Refer to the Wikipedia article for more information.

Create the MariaDB database

Provision the MariaDB docker instance

In this example, we use Portworx to manage the disk storage. A volume testVol is created to store MariaDB data.

$ pxctl v create testVol --size 1024 --repl 1

$ docker run --name mariadbtest -v testVol:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=password -p 3306:3306 -d docker.io/library/mariadb:latest

$ docker ps | egrep "CONTAINER|mariadbtest"
CONTAINER ID   IMAGE                        COMMAND                  CREATED          STATUS          PORTS                                       NAMES
2e5fe8ca177d   mariadb:latest               "docker-entrypoint.s…"   39 seconds ago   Up 37 seconds   0.0.0.0:3306->3306/tcp, :::3306->3306/tcp   mariadbtest

Create a database

$ docker exec -it mariadbtest bash

root@2e5fe8ca177d:/# ip a | grep eth
139: eth0@if140: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.5/16 brd 172.17.255.255 scope global eth0

root@2e5fe8ca177d:/# df -h 
Filesystem                       Size  Used Avail Use% Mounted on
overlay                           50G   23G   28G  45% /
tmpfs                             64M     0   64M   0% /dev
tmpfs                            126G     0  126G   0% /sys/fs/cgroup
shm                               64M     0   64M   0% /dev/shm
/dev/mapper/centos-root           50G   23G   28G  45% /etc/hosts
/dev/pxd/pxd1020609855122786711 1007G  209M  956G   1% /var/lib/mysql
tmpfs                            126G     0  126G   0% /proc/acpi
tmpfs                            126G     0  126G   0% /proc/scsi
tmpfs                            126G     0  126G   0% /sys/firmware

root@2e5fe8ca177d:/# ls -la /var/lib/mysql
total 123332
drwxr-xr-x. 5 mysql mysql      4096 Sep  7 21:02 .
drwxr-xr-x  1 root  root         68 Aug 31 03:44 ..
-rw-rw----  1 mysql mysql    417792 Sep  7 21:02 aria_log.00000001
-rw-rw----  1 mysql mysql        52 Sep  7 21:02 aria_log_control
-rw-rw----  1 mysql mysql         9 Sep  7 21:02 ddl_recovery.log
-rw-rw----  1 mysql mysql       946 Sep  7 21:02 ib_buffer_pool
-rw-rw----  1 mysql mysql 100663296 Sep  7 21:02 ib_logfile0
-rw-rw----  1 mysql mysql  12582912 Sep  7 21:02 ibdata1
-rw-rw----  1 mysql mysql  12582912 Sep  7 21:02 ibtmp1
-rw-rw----  1 mysql mysql         0 Sep  7 21:00 multi-master.info
drwx------  2 mysql mysql      4096 Sep  7 21:00 mysql
drwx------  2 mysql mysql      4096 Sep  7 21:00 performance_schema
drwx------  2 mysql mysql     12288 Sep  7 21:00 sys

root@2e5fe8ca177d:/#  mysql -u root -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.6.4-MariaDB-1:10.6.4+maria~focal mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW VARIABLES LIKE "%version%";
+-----------------------------------+------------------------------------------+
| Variable_name                     | Value                                    |
+-----------------------------------+------------------------------------------+
| in_predicate_conversion_threshold | 1000                                     |
| innodb_version                    | 10.6.4                                   |
| protocol_version                  | 10                                       |
| slave_type_conversions            |                                          |
| system_versioning_alter_history   | ERROR                                    |
| system_versioning_asof            | DEFAULT                                  |
| tls_version                       | TLSv1.1,TLSv1.2,TLSv1.3                  |
| version                           | 10.6.4-MariaDB-1:10.6.4+maria~focal      |
| version_comment                   | mariadb.org binary distribution          |
| version_compile_machine           | x86_64                                   |
| version_compile_os                | debian-linux-gnu                         |
| version_malloc_library            | system                                   |
| version_source_revision           | 2db692f5b4d6bb31a331dab44544171c455f6aca |
| version_ssl_library               | OpenSSL 1.1.1f  31 Mar 2020              |
| wsrep_patch_version               | wsrep_26.22                              |
+-----------------------------------+------------------------------------------+
15 rows in set (0.002 sec)

MariaDB [(none)]> SHOW VARIABLES WHERE Variable_Name LIKE "%dir";
+---------------------------+----------------------------+
| Variable_name             | Value                      |
+---------------------------+----------------------------+
| aria_sync_log_dir         | NEWFILE                    |
| basedir                   | /usr                       |
| character_sets_dir        | /usr/share/mysql/charsets/ |
| datadir                   | /var/lib/mysql/            |
| innodb_data_home_dir      |                            |
| innodb_log_group_home_dir | ./                         |
| innodb_tmpdir             |                            |
| lc_messages_dir           | /usr/share/mysql           |
| plugin_dir                | /usr/lib/mysql/plugin/     |
| slave_load_tmpdir         | /tmp                       |
| tmpdir                    | /tmp                       |
| wsrep_data_home_dir       | /var/lib/mysql/            |
+---------------------------+----------------------------+
12 rows in set (0.002 sec)

MariaDB [(none)]> CREATE DATABASE sbtest;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> CREATE USER sbtest@localhost;
Query OK, 0 rows affected (0.004 sec)

MariaDB [(none)]> GRANT ALL PRIVILEGES ON sbtest.* TO sbtest@localhost;
Query OK, 0 rows affected (0.002 sec)

MariaDB [(none)]> use sbtest;
Database changed

MariaDB [sbtest]> select database();
+------------+
| database() |
+------------+
| sbtest     |
+------------+
1 row in set (0.000 sec)

MariaDB [sbtest]> show tables;
Empty set (0.000 sec)

MariaDB [(none)]>  exit
Bye
root@2e5fe8ca177d:/# exit
exit

root@2e5fe8ca177d:/# ls -la /var/lib/mysql
total 123336
drwxr-xr-x. 6 mysql mysql      4096 Sep  7 21:07 .
drwxr-xr-x  1 root  root         68 Aug 31 03:44 ..
-rw-rw----  1 mysql mysql    417792 Sep  7 21:07 aria_log.00000001
-rw-rw----  1 mysql mysql        52 Sep  7 21:02 aria_log_control
-rw-rw----  1 mysql mysql         9 Sep  7 21:02 ddl_recovery.log
-rw-rw----  1 mysql mysql       946 Sep  7 21:02 ib_buffer_pool
-rw-rw----  1 mysql mysql 100663296 Sep  7 21:02 ib_logfile0
-rw-rw----  1 mysql mysql  12582912 Sep  7 21:02 ibdata1
-rw-rw----  1 mysql mysql  12582912 Sep  7 21:02 ibtmp1
-rw-rw----  1 mysql mysql         0 Sep  7 21:00 multi-master.info
drwx------  2 mysql mysql      4096 Sep  7 21:00 mysql
drwx------  2 mysql mysql      4096 Sep  7 21:00 performance_schema
drwx------  2 mysql mysql      4096 Sep  7 21:07 sbtest
drwx------  2 mysql mysql     12288 Sep  7 21:00 sys
root@2e5fe8ca177d:/# ls -la /var/lib/mysql/sbtest/
total 12
drwx------  2 mysql mysql 4096 Sep  7 21:07 .
drwxr-xr-x. 6 mysql mysql 4096 Sep  7 21:07 ..
-rw-rw----  1 mysql mysql   67 Sep  7 21:07 db.opt

Build the database

On the host, use sysbench to create the tables and insert data rows into the database. We need to know roughly how much data will be created: 1 million rows result in about 240 MB of data, so 32 tables with 2 million rows each (roughly 480 MB per table) produce about 15 GB of data.

$ sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --threads=1 --mysql-host=172.17.0.5 --mysql-password=password  --mysql-user=root --mysql-db=sbtest --oltp-tables-count=32 --oltp-table-size=2000000 prepare
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Creating table 'sbtest1'...
Inserting 2000000 records into 'sbtest1'
Creating secondary indexes on 'sbtest1'...
[omitted...]

In the MariaDB container, we can check the created data and the table sizes.

root@2e5fe8ca177d:/#  mysql -u root -p
MariaDB [sbtest]> select * from sbtest1 limit 6;
+----+---------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id | k       | c                                                                                                                       | pad                                                         |
+----+---------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
|  1 |  998567 | 83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330 | 67847967377-48000963322-62604785301-91415491898-96926520291 |
|  2 | 1003937 | 38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630 | 23183251411-36241541236-31706421314-92007079971-60663066966 |
|  3 | 1008521 | 33973744704-80540844748-72700647445-87330233173-87249600839-07301471459-22846777364-58808996678-64607045326-48799346817 | 38615512647-91458489257-90681424432-95014675832-60408598704 |
|  4 | 1004027 | 37002370280-58842166667-00026392672-77506866252-09658311935-56926959306-83464667271-94685475868-28264244556-14550208498 | 63947013338-98809887124-59806726763-79831528812-45582457048 |
|  5 |  999625 | 44257470806-17967007152-32809666989-26174672567-29883439075-95767161284-94957565003-35708767253-53935174705-16168070783 | 34551750492-67990399350-81179284955-79299808058-21257255869 |
|  6 | 1001169 | 37216201353-39109531021-11197415756-87798784755-02463049870-83329763120-57551308766-61100580113-80090253566-30971527105 | 05161542529-00085727016-35134775864-52531204064-98744439797 |
+----+---------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
6 rows in set (0.004 sec)

MariaDB [sbtest]> SELECT   TABLE_NAME AS `Table`,   ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)` FROM   information_schema.TABLES WHERE   TABLE_SCHEMA = "sbtest" ORDER BY   (DATA_LENGTH + INDEX_LENGTH) DESC;
+---------+-----------+
| Table   | Size (MB) |
+---------+-----------+
| sbtest3 |       459 |
| sbtest1 |       459 |
| sbtest4 |       459 |
| sbtest2 |       459 |
| sbtest5 |       459 |
| sbtest6 |       146 |
+---------+-----------+
6 rows in set (0.002 sec)

MariaDB [sbtest]> SELECT table_schema "DB Name", ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"  FROM information_schema.tables  GROUP BY table_schema;
+--------------------+---------------+
| DB Name            | DB Size in MB |
+--------------------+---------------+
| information_schema |           0.2 |
| mysql              |          10.5 |
| performance_schema |           0.0 |
| sbtest             |       14702.0 |
| sys                |           0.0 |
+--------------------+---------------+
5 rows in set (0.033 sec)

We can also check the running processes in MariaDB.

MariaDB [sbtest]> show processlist;
+----+------+------------------+--------+---------+------+----------
| Id | User | Host             | db     | Command | Time | State    | Info                                                                                                 | Progress |
+----+------+------------------+--------+---------+------+----------
|  7 | root | 172.17.0.1:55000 | sbtest | Query   |    0 | Update   | INSERT INTO sbtest13(k, c, pad) VALUES(1185731, '26498931212-26730519067-66264645428-09623019003-787' |    0.000 |
| 11 | root | localhost        | sbtest | Query   |    0 | starting | show processlist                                                                                     |    0.000 |
+----+------+------------------+--------+---------+------+----------
2 rows in set (0.000 sec)

Run sysbench benchmark

$ threads=1; seconds=1800; interval=60
$ sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --threads=$threads --mysql-host=172.17.0.5 --mysql-password=password  --mysql-user=root  --oltp-tables-count=32 --oltp-table-size=2000000 --events=0 --time=$seconds --report-interval=$interval --delete_inserts=10 --index_updates=10 --non_index_updates=10 --db-ps-mode=disable run

SQL statistics:
    queries performed:
        read:                            1315888
        write:                           375968
        other:                           187984
        total:                           1879840
    transactions:                        93992  (52.22 per sec.)
    queries:                             1879840 (1044.35 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          1800.0066s
    total number of events:              93992

Latency (ms):
         min:                                    6.52
         avg:                                   19.14
         max:                                 1018.82
         95th percentile:                       25.28
         sum:                              1799473.52

Threads fairness:
    events (avg/stddev):           93992.0000/0.00
    execution time (avg/stddev):   1799.4735/0.00
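
To see how throughput scales with concurrency, the same run command can be wrapped in a small shell loop that sweeps the thread count. This is only a sketch reusing the options above; the log file name is arbitrary.

$ for threads in 1 2 4 8 16; do
>   sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --threads=$threads --mysql-host=172.17.0.5 --mysql-password=password --mysql-user=root --oltp-tables-count=32 --oltp-table-size=2000000 --events=0 --time=$seconds --report-interval=$interval --delete_inserts=10 --index_updates=10 --non_index_updates=10 --db-ps-mode=disable run | tee oltp_${threads}threads.log
> done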


Kubernetes can be installed with the following deployment tools.

  • Bootstrapping clusters with kubeadm
  • Installing Kubernetes with kops
  • Installing Kubernetes with Kubespray

In this article, we learn how to install a Kubernetes cluster with kubeadm.

Prepare the cluster nodes

We have three CentOS nodes on which to install the Kubernetes cluster.

$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
$ uname -r
3.10.0-1160.11.1.el7.x86_64

Configuring the network, firewall and SELinux

We disable the firewall and SELinux to make the deployment easier; this is for study purposes only.
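
A typical way to do this on CentOS 7 is shown below. These exact commands are an assumption based on common practice, not taken from the original notes; SELinux is set to permissive rather than fully disabled, which is what the kubeadm documentation suggests.

$ sudo systemctl stop firewalld && sudo systemctl disable firewalld
$ sudo setenforce 0
$ sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config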

You can configure the network by following the official documentation.
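
For reference, the network-related prerequisites mainly amount to loading the br_netfilter module and letting iptables see bridged traffic. The snippet below is a sketch following the official documentation of that time.

$ cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
$ sudo modprobe br_netfilter
$ cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
$ sudo sysctl --system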

Installing container runtime

To run containers in Pods, Kubernetes uses a container runtime. We need to install a container runtime into each node in the cluster so that Pods can run there. The following are the common container runtimes with Kubernetes on Linux:

  • containerd
  • CRI-O
  • Docker

Install Docker runtime

On each node, install Docker Engine as below:

$ yum install -y yum-utils
$ yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
$ yum install docker-ce docker-ce-cli containerd.io
$ systemctl start docker
$ systemctl status docker

Configure Docker daemon

On each node, configure the Docker daemon, in particular to use systemd for the management of the container’s cgroups.

$ sudo mkdir /etc/docker
$ cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

$ sudo systemctl enable docker
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker

Note: overlay2 is the preferred storage driver for systems running Linux kernel version 4.0 or higher, or RHEL or CentOS using version 3.10.0-514 and above.

Installing kubeadm, kubelet and kubectl

We need to install the following packages on all of the cluster nodes:

  • kubeadm: the command to bootstrap the cluster.
  • kubelet: the component that runs on all of the machines in the cluster and does things like starting pods and containers.
  • kubectl: the command-line utility to talk to the cluster.
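
A sketch of the installation on CentOS follows, based on the official yum instructions that were current when these notes were written; the upstream package repository has since moved, so verify the repository definition against the current documentation.

$ cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kubelet kubeadm kubectl
EOF
$ sudo yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
$ sudo systemctl enable --now kubelet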

Configuring a cgroup driver

Both the container runtime and the kubelet have a property called “cgroup driver”, which is important for the management of cgroups on Linux machines.

kubeadm allows you to pass a KubeletConfiguration structure during kubeadm init. This KubeletConfiguration can include the cgroupDriver field which controls the cgroup driver of the kubelet.

A minimal example of configuring the field explicitly:

[root@node1 ~]# cat kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.22.1
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd

Such a configuration file can then be passed to the kubeadm command:

[root@node1 ~]# kubeadm init --config kubeadm-config.yaml
[init] Using Kubernetes version: v1.22.1
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [node1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 <node1-ip>]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [node1 localhost] and IPs [<node1-ip> 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [node1 localhost] and IPs [<node1-ip> 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 9.002713 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.22" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node node1 as control-plane by adding the labels: [node-role.kubernetes.io/master(deprecated) node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node node1 as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: un7mhw.i9enhg84xl2tpgup
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join <node1-ip>:6443 --token un7mhw.i9enhg84xl2tpgup \
    --discovery-token-ca-cert-hash sha256:5553ba3acbbec95383fc4a274e4f21126ac8101c39dfe5262718a9f0fd1b3c32

Creating a cluster with kubeadm

Using kubeadm, you can create a minimum viable Kubernetes cluster that conforms to best practices.

The kubeadm tool is good if you need:

  • A simple way for you to try out Kubernetes, possibly for the first time.
  • A way for existing users to automate setting up a cluster and test their application.
  • A building block in other ecosystem and/or installer tools with a larger scope.

Initializing the control-plane node

The control-plane node is the machine where the control plane components run, including etcd (the cluster database) and the API Server (which the kubectl command line tool communicates with).

  1. (Recommended) If you have plans to upgrade this single control-plane kubeadm cluster to high availability, you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
  2. Choose a Pod network add-on, and verify whether it requires any arguments to be passed to kubeadm init. Depending on which third-party provider you choose, you might need to set the --pod-network-cidr to a provider-specific value.
  3. (Optional) Since version 1.14, kubeadm tries to detect the container runtime on Linux by using a list of well known domain socket paths. To use a different container runtime, or if there is more than one installed on the provisioned node, specify the --cri-socket argument to kubeadm init.
  4. (Optional) Unless otherwise specified, kubeadm uses the network interface associated with the default gateway to set the advertise address for this particular control-plane node's API server. To use a different network interface, specify the --apiserver-advertise-address=<ip-address> argument to kubeadm init. To deploy an IPv6 Kubernetes cluster using IPv6 addressing, you must specify an IPv6 address, for example --apiserver-advertise-address=fd00::101.
  5. (Optional) Run kubeadm config images pull prior to kubeadm init to verify connectivity to the gcr.io container image registry.

To initialize the control-plane node, run "kubeadm init <args>".

$ kubeadm init --pod-network-cidr=192.168.0.0/16 

kubeadm init first runs a series of prechecks to ensure that the machine is ready to run Kubernetes. These prechecks expose warnings and exit on errors. kubeadm init then downloads and installs the cluster control plane components. This may take several minutes.

In the previous section, Configuring a cgroup driver, we already ran the command to initialize the control-plane node.

If you need to run kubeadm init again, you must first tear down the cluster.

[root@node1 ~]# kubeadm reset

Execute the following command to configure kubectl (as also printed by kubeadm init) if you are the root user.

[root@node1 ~]# export KUBECONFIG=/etc/kubernetes/admin.conf

Installing a Pod network add-on

In this practice, we install Calico, an open source networking and network security solution for containers, virtual machines, and native host-based workloads.

[root@node1 ~]# kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
[root@node1 ~]# kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml

Note: Before creating this manifest, read its contents and make sure its settings are correct for your environment. For example, you may need to change the default IP pool CIDR to match your pod network CIDR.

[root@node1 ~]# kubectl get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
node1   Ready    control-plane,master   11m   v1.22.1

[root@node1 ~]# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-868b656ff4-gv2tq   1/1     Running   0          2m41s
calico-node-wclb2                          1/1     Running   0          2m41s
calico-typha-d8c5c85c5-kldfh               1/1     Running   0          2m42s

[root@node1 ~]# kubectl get pods --all-namespaces
NAMESPACE          NAME                                        READY   STATUS    RESTARTS   AGE
calico-apiserver   calico-apiserver-554fbf9554-45d6l           1/1     Running   0          15m
calico-system      calico-kube-controllers-868b656ff4-gv2tq    1/1     Running   0          16m
calico-system      calico-node-wclb2                           1/1     Running   0          16m
calico-system      calico-typha-d8c5c85c5-kldfh                1/1     Running   0          16m
kube-system        coredns-78fcd69978-lq9pp                    1/1     Running   0          18m
kube-system        coredns-78fcd69978-nm29f                    1/1     Running   0          18m
kube-system        etcd-node1                                  1/1     Running   1          19m
kube-system        kube-apiserver-node1                        1/1     Running   1          19m
kube-system        kube-controller-manager-node1               1/1     Running   0          19m
kube-system        kube-proxy-m48qn                            1/1     Running   0          18m
kube-system        kube-scheduler-node1                        1/1     Running   1          19m
tigera-operator    tigera-operator-698876cbb5-dghgv            1/1     Running   0          17m

You can install only one Pod network per cluster.

Control plane node isolation

Untaint the master so that it will be available for scheduling workloads:

[root@node1 ~]# kubectl taint nodes --all node-role.kubernetes.io/master-
node/node1 untainted

Joining your nodes

The nodes are where your workloads (containers and Pods, etc) run. To add new nodes to your cluster do the following for each machine:

  • SSH to the machine

  • Become root (e.g. sudo su -)

  • Run the command that was output by kubeadm init

    [root@node2 ~]# kubeadm join <node1-ip>:6443 --token un7mhw.i9enhg84xl2tpgup --discovery-token-ca-cert-hash sha256:5553ba3acbbec95383fc4a274e4f21126ac8101c39dfe5262718a9f0fd1b3c32
    [root@node3 ~]# kubeadm join <node1-ip>:6443 --token un7mhw.i9enhg84xl2tpgup --discovery-token-ca-cert-hash sha256:5553ba3acbbec95383fc4a274e4f21126ac8101c39dfe5262718a9f0fd1b3c32
    [preflight] Running pre-flight checks
    [preflight] Reading configuration from the cluster...
    [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
    [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
    [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
    [kubelet-start] Starting the kubelet
    [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

    This node has joined the cluster:

    • Certificate signing request was sent to apiserver and a response was received.
    • The Kubelet was informed of the new secure connection details.

    Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

If you do not have the token, you can get it by running the following command on the control-plane node:

$ kubeadm token list

By default, tokens expire after 24 hours. If you are joining a node to the cluster after the current token has expired, you can create a new token by running the following command on the control-plane node:

$ kubeadm token create
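
The new token can also be printed together with the full join command, which is a convenient shortcut:

$ kubeadm token create --print-join-command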

If you don't have the value of --discovery-token-ca-cert-hash, you can get it by running the following command chain on the control-plane node:

$ openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //' 

We can check the cluster nodes as below.

[root@node1 ~]#  kubectl get nodes
NAME    STATUS   ROLES                  AGE     VERSION
node1   Ready    control-plane,master   37m     v1.22.1
node2   Ready    <none>                 4m56s   v1.22.1
node3   Ready    <none>                 7m49s   v1.22.1

[root@node1 ~]# kubectl get pods --all-namespaces
NAMESPACE          NAME                                        READY   STATUS    RESTARTS   AGE
calico-apiserver   calico-apiserver-554fbf9554-45d6l           1/1     Running   0          34m
calico-system      calico-kube-controllers-868b656ff4-gv2tq    1/1     Running   0          35m
calico-system      calico-node-cl5kt                           1/1     Running   0          7m51s
calico-system      calico-node-rtgcs                           1/1     Running   0          4m58s
calico-system      calico-node-wclb2                           1/1     Running   0          35m
calico-system      calico-typha-d8c5c85c5-7knvv                1/1     Running   0          7m46s
calico-system      calico-typha-d8c5c85c5-kldfh                1/1     Running   0          35m
calico-system      calico-typha-d8c5c85c5-qflvv                1/1     Running   0          4m56s
kube-system        coredns-78fcd69978-lq9pp                    1/1     Running   0          37m
kube-system        coredns-78fcd69978-nm29f                    1/1     Running   0          37m
kube-system        etcd-node1                                  1/1     Running   1          37m
kube-system        kube-apiserver-node1                        1/1     Running   1          37m
kube-system        kube-controller-manager-node1               1/1     Running   0          37m
kube-system        kube-proxy-d55xr                            1/1     Running   0          7m51s
kube-system        kube-proxy-m48qn                            1/1     Running   0          37m
kube-system        kube-proxy-m7drg                            1/1     Running   0          4m58s
kube-system        kube-scheduler-node1                        1/1     Running   1          37m
tigera-operator    tigera-operator-698876cbb5-dghgv            1/1     Running   0          35m


File system testing - Method 1

Vdbench filesystem testing terminology:

  • Anchor - A directory or a filesystem mount point. A file system structure will be created by specifying the structure information, including directory depth, width, number of files and file size. Multiple anchors can be defined and used by filesystem workloads.
  • Operation - File system operations. For example, directory create/delete, file create/delete, file read/write, file open/close, setattr and getattr.

Vdbench parameters for filesystem benchmark:

  • File system definition (FSD) - Describes the directory structure.
  • File system workload definition (FWD) - Describes the workload parameters.
  • Run definition (RD) - Describes how the workload will be run.

The following is an example of the vdbench job file.

hd=default,vdbench=/home/tester/vdbench_test,shell=ssh,user=root
hd=host1,jvms=1,system=192.168.1.50
fsd=fsd1,anchor=/mnt/testdir1,depth=1,width=1,files=4,size=50g,openflag=o_direct
fsd=fsd2,anchor=/mnt/testdir2,depth=1,width=1,files=4,size=50g,openflag=o_direct
fwd=fwd1,fsd=fsd1,host=host1,fileio=random,operation=write,xfersize=4k,fileselect=random,threads=$th
fwd=fwd2,fsd=fsd2,host=host1,fileio=random,operation=write,xfersize=4k,fileselect=random,threads=$th
rd=rd1,fwd=fwd*,fwdrate=max,format=yes,elapsed=180,interval=30

Explanation:

  1. hd - it specifies which host runs the filesystem workload. The number of JVMs defaults to 1; it can be increased if a single JVM cannot handle very high IOPS on a fast system.

  2. fsd - it specifies under which directory to create the filesystem structures. “depth” defines how many levels of directories will be created. “width” defines how many sub-directories will be created under each parent directory. “files” defines how many files will be created under each directory. “size” defines the file size. “openflag” controls how the file will be opened.

  3. fwd - it specifies what workload will be run on the target filesystems. In this example, it will run random writes with a 4k blocksize. The specified number of threads will be used to write the corresponding number of files, and the files will be randomly selected for the workload to run. Note that the number of threads should be less than or equal to the number of files, and that writes to a file are single-threaded unless “fileio=(random,shared)” is specified.

  4. rd - it controls how the workload will be run. In this example, the workload will run for 3 minutes. The “format” option is very useful: it recreates the filesystem structure before the workload runs, which gives more repeatable results. “fwdrate=max” indicates that the I/O rate is unlimited, in order to stress the system as much as possible.

  5. In this example, we run the 4k random write workload concurrently on two filesystems. The number of threads used to write each filesystem can be controlled by passing a shell variable at run time, as below.

    ./vdbench jobfile/vdb.job th=2

File system testing - Method 2

A workload similar to the one in Method 1 can also be run as below.

hd=default,vdbench=/home/tester/vdbench_test,shell=ssh,user=root
hd=host1,system=192.168.1.50
sd=sd1,host=host1,lun=/mnt/testdir1/testfile,hitarea=1m,openflag=o_direct,size=50g
sd=sd2,host=host1,lun=/mnt/testdir2/testfile,hitarea=1m,openflag=o_direct,size=50g
wd=wd1,sd=sd*,seekpct=100
rd=rd1,wd=wd1,iorate=max,rdpct=0,xfersize=4K,elapsed=180,interval=30,th=$th

