Normally, SystemTap scripts can only be run on systems where SystemTap is deployed together with the following required kernel packages:

  • kernel-devel-$(uname -r)
  • kernel-debuginfo-$(uname -r)
  • kernel-debuginfo-common-$(uname -m)-$(uname -r)

When it is neither feasible nor desirable to install the dependent packages on all the target systems, cross-instrumentation offers a workaround. Cross-instrumentation is the process of generating a SystemTap instrumentation module from a SystemTap script on one system to be used on another system. This process offers the following benefits:

  • The kernel information packages for various machines can be installed on a single host machine.
  • Each target machine only needs one package to be installed to use the generated SystemTap instrumentation module: systemtap-runtime.

To build the instrumentation module, run the following on the host that has the dependent kernel packages installed. The -r option selects the kernel release to compile against, -m names the generated module, and -p4 stops SystemTap after pass 4 (compilation), so the module is built but not run:

[root@host1 ~]# uname -r
5.18.10-1.el7.elrepo.x86_64

[root@host1 ~]# rpm -qa | egrep "kernel-debug|kernel-devel|systemtap"
systemtap-client-4.0-13.el7.x86_64
systemtap-runtime-4.0-13.el7.x86_64
kernel-devel-3.10.0-1160.el7.x86_64
systemtap-devel-4.0-13.el7.x86_64
kernel-debuginfo-common-x86_64-3.10.0-1160.el7.x86_64
kernel-debuginfo-3.10.0-1160.el7.x86_64
systemtap-4.0-13.el7.x86_64

[root@host1 ~]# stap -r 3.10.0-1160.el7.x86_64 -e 'probe vfs.read {printf("read performed\n");} probe timer.s(5){exit()}' -m ported_stap -p4
ported_stap.ko

[root@host1 ~]# file ported_stap.ko
ported_stap.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=1ad58adccafc9bf629f1edcbc285e9090e84dc35, not stripped

[root@host1 ~]# ls -la ported_stap.ko
-rw-r--r-- 1 root root 103608 Jul 20 22:33 ported_stap.ko

Once the instrumentation module is compiled, copy it to the target system and then run it using:

[root@host1 ~]# scp ported_stap.ko host2:/root

[root@host2 ~]# uname -r
3.10.0-1160.el7.x86_64

[root@host2 ~]# rpm -qa | egrep "kernel-debug|kernel-devel|systemtap"
systemtap-runtime-4.0-13.el7.x86_64

[root@host2 ~]# staprun ported_stap.ko
read performed
read performed
read performed
<...>

When a user runs a SystemTap script, a kernel module is built from that script. SystemTap then loads the module into the kernel, allowing it to extract the specified data directly from the kernel. The following output shows that the SystemTap kernel module is loaded only while the script runs: the second lsmod was issued after the 5-second timer expired, by which point the module had been unloaded.

[root@host2 ~]# lsmod | grep ported_stap
ported_stap           139049  2
[root@host2 ~]# lsmod | grep ported_stap
[root@host2 ~]#

In this post, we continue exploring how to use SystemTap to analyze the latency of a kernel-module function. In the following example, we analyze the latency of the function “nfsd_vfs_write” from the kernel module “nfsd”.

Deploy the SystemTap packages

Refer to this post to deploy SystemTap and its required packages.

Check the nfsd kernel module info

[root@host1 ~]# uname -r
3.10.0-1160.el7.x86_64

[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/
arch  crypto  drivers  fs  kernel  lib  mm  net  sound  virt
[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/fs
binfmt_misc.ko.xz  btrfs  cachefiles  ceph  cifs  cramfs  dlm  exofs  ext4  fat  fscache  fuse  gfs2  isofs  jbd2  lockd  mbcache.ko.xz  nfs  nfs_common  nfsd  nls  overlayfs  pstore  squashfs  udf  xfs

[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/fs/ext4
ext4.ko.xz
[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/fs/xfs
xfs.ko.xz
[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/fs/btrfs
btrfs.ko.xz
[root@host1 ~]# ls /lib/modules/`uname -r`/kernel/fs/nfs
blocklayout  filelayout  flexfilelayout  nfs.ko.xz  nfsv3.ko.xz  nfsv4.ko.xz  objlayout


[root@host1 ~]# lsmod | grep "nfsd "
nfsd                  351321  13
[root@host1 ~]# modinfo nfsd
filename:       /lib/modules/3.10.0-1160.el7.x86_64/kernel/fs/nfsd/nfsd.ko.xz
license:        GPL
author:         Olaf Kirch <okir@monad.swb.de>
alias:          fs-nfsd
retpoline:      Y
rhelversion:    7.9
srcversion:     61A6390CD82AA4A7492CB06
depends:        auth_rpcgss,sunrpc,grace,lockd,nfs_acl
intree:         Y
vermagic:       3.10.0-1160.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        E1:FD:B0:E2:A7:E8:61:A1:D1:CA:80:A2:3D:CF:0D:BA:3A:A4:AD:F5
sig_hashalgo:   sha256
parm:           cltrack_prog:Path to the nfsdcltrack upcall program (string)
parm:           cltrack_legacy_disable:Disable legacy recoverydir conversion. Default: false (bool)
parm:           nfs4_disable_idmapping:Turn off server's NFSv4 idmapping when using 'sec=sys' (bool)

[root@host1 ~]# cat /proc/kallsyms |  egrep -i -w "nfsd_vfs_write"
ffffffffc094fdd0 t nfsd_vfs_write	[nfsd] 

SystemTap script

In the following SystemTap script, we implement:

  • a probe on the function “nfsd_vfs_write” in the kernel module “nfsd”

  • a probe on the return of the function “nfsd_vfs_write”

  • handlers that record the execname, pid, tid, and timestamp in each probe

  • a timer probe that stops tracing after 5 seconds

  • an “end” probe that summarizes the collected runtimes of “nfsd_vfs_write” at the end of tracing

    [root@host1 ~]# cat nfs_trace.stp
    global count
    global start_t, diff_t[1000000]

    probe module("nfsd").function("nfsd_vfs_write") {
        count++
        e = execname()
        p = pid()
        t = tid()
        start_t[e,p,t] = gettimeofday_us()
    }

    probe module("nfsd").function("nfsd_vfs_write").return {
        e = execname()
        p = pid()
        t = tid()
        start_ts = start_t[e,p,t]
        end_ts = gettimeofday_us()
        if (start_ts > 0)
            diff_t[e,p,t,start_ts] = end_ts - start_ts
    }

    probe timer.s(5) { exit() }

    probe end {
        count = 1
        total_time = 0
        foreach ([e,p,t,ts] in diff_t) {
            printf("nfsd_vfs_write(%d %s %d %d %d) call time: %d\n", count, e, p, t, ts, diff_t[e,p,t,ts])
            count++
            total_time += diff_t[e,p,t,ts]
        }
        count--
        printf("nfsd_vfs_write total call time(us): %d\n", total_time)
        printf("nfsd_vfs_write calls: %d\n", count)
        printf("nfsd_vfs_write average call time(us): %d\n", total_time/count)
    }

Run the SystemTap script

[root@host1 ~]# cat stp.sh
#!/bin/bash
runid=$1
sleep 10
stap -D MAXACTION=1000000 nfs_trace.stp > nfs_trace.${runid}.out

[root@host1 ~]# ./stp.sh kernel-3.10-run1

Note that the script waits 10 seconds before tracing starts. This is to make sure the network workload to be monitored is running stably.

Get the tracing result

[root@host1 ~]# cat stp_3.10/nfs_trace.1.out | head -3
nfsd_vfs_write(1 nfsd 3597 3597 1658189341055849) call time: 139
nfsd_vfs_write(2 nfsd 3598 3598 1658189341055864) call time: 136
nfsd_vfs_write(3 nfsd 3596 3596 1658189341055977) call time: 149
[root@host1 ~]# cat stp_3.10/nfs_trace.1.out | tail -5
nfsd_vfs_write(116340 nfsd 3595 3595 1658189346055536) call time: 182
nfsd_vfs_write(116341 nfsd 3598 3598 1658189346055621) call time: 237
nfsd_vfs_write total call time(us): 23050967
nfsd_vfs_write calls: 116341
nfsd_vfs_write average call time(us): 198

As we can see, there are 116341 calls to the function “nfsd_vfs_write” in total, and the average call time is 198 us. This gives us a clear sense of the function-call latency, so we can compare the same measurement across different system configurations (e.g., different kernel versions).

SystemTap runtime errors

When we ran the SystemTap script, we added the option “-D MAXACTION=1000000” to fix the following runtime error.

[root@host1 ~]# stap nfs_trace.stp > nfs_trace.out
ERROR: MAXACTION exceeded near identifier 'printf' at nfs_trace.stp:27:3
WARNING: Number of errors: 1, skipped probes: 0
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]

What does “MAXACTION exceeded” mean?

The probe handler attempted to execute too many statements. By default, at most 1000 actions are allowed in a single probe handler; the -D MAXACTION=N option raises this limit.
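
Raising the limit, as we did earlier, lets the same run complete:

[root@host1 ~]# stap -D MAXACTION=1000000 nfs_trace.stp > nfs_trace.out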

Intro to SystemTap

SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the operating system (particularly, the kernel) in fine detail. It provides information similar to the output of tools like netstat, ps, top, and iostat; however, SystemTap is designed to provide more filtering and analysis options for collected information.

SystemTap can be used by system administrators as a performance monitoring tool for Red Hat Enterprise Linux 5 or later. It is most useful when other similar tools cannot precisely pinpoint a bottleneck in the system, thus requiring a deep analysis of system activity. In the same manner, application developers can also use SystemTap to monitor, in fine detail, how their application behaves within the Linux system.

SystemTap was originally developed to provide functionality for Red Hat Enterprise Linux similar to previous Linux probing tools such as dprobes and the Linux Trace Toolkit. SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the infrastructure to track kernel activity. In addition, SystemTap combines this capability with two attributes:

  • Flexibility: SystemTap’s framework allows users to develop simple scripts for investigating and monitoring a wide variety of kernel functions, system calls, and other events that occur in kernel space. With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.
  • Ease-of-Use: as mentioned earlier, SystemTap allows users to probe kernel-space events without having to go through the lengthy instrument-recompile-install-reboot cycle for the kernel.

Understanding how SystemTap works

SystemTap allows users to write and reuse simple scripts to deeply examine the activities of a running Linux system. These scripts can be designed to extract data, filter it, and summarize it quickly (and safely), enabling the diagnosis of complex performance (or even functional) problems.

The essential idea behind a SystemTap script is to name events and give them handlers. When SystemTap runs the script, it monitors for the events; once an event occurs, the Linux kernel runs the corresponding handler as a quick sub-routine, then resumes normal operation.

There are several kinds of events: entering or exiting a function, timer expiration, session termination, and so on. A handler is a series of script-language statements that specify the work to be done whenever the event occurs. This work normally includes extracting data from the event context, storing it in internal variables, and printing results.

Setting up SystemTap and its required kernel packages

To deploy SystemTap, install the SystemTap packages along with the corresponding kernel -devel, -debuginfo, and -debuginfo-common-arch packages. To use SystemTap on a system with multiple kernels installed, install the -devel and -debuginfo packages for each of those kernel versions.

SystemTap needs information about the kernel in order to place instrumentation in it (probe it). This information, which allows SystemTap to generate the code for the instrumentation, is contained in the matching kernel-devel, kernel-debuginfo, and kernel-debuginfo-common-arch packages.

To install SystemTap packages:

[root@host1 ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
[root@host1 ~]# uname -r
3.10.0-1160.el7.x86_64

[root@host1 ~]# yum install -y systemtap systemtap-runtime

[root@host1 ~]# rpm -qa | grep systemtap
systemtap-runtime-4.0-13.el7.x86_64
systemtap-devel-4.0-13.el7.x86_64
systemtap-4.0-13.el7.x86_64
systemtap-client-4.0-13.el7.x86_64

To install the devel and debuginfo packages on CentOS, first enable the debuginfo repository (set enabled=1):

[root@host1 ~]# vim /etc/yum.repos.d/CentOS-Debuginfo.repo
[base-debuginfo]
name=CentOS-7 - Debuginfo
baseurl=http://debuginfo.centos.org/7/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Debug-7
enabled=1

[root@host1 ~]# yum install -y kernel-devel-$(uname -r) \
> kernel-debuginfo-$(uname -r) \
> kernel-debuginfo-common-$(uname -m)-$(uname -r)

[root@host1 ~]# rpm -qa | grep debuginfo
kernel-debuginfo-common-x86_64-3.10.0-1160.el7.x86_64
kernel-debuginfo-3.10.0-1160.el7.x86_64

The devel package was not installed because it is not available in the configured CentOS repositories; this could be fixed by adding a repository that provides it. The devel package is required by SystemTap; without it, the following error is seen.

[root@host1 ~]# stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
Checking "/lib/modules/3.10.0-1160.el7.x86_64/build/.config" failed with error: No such file or directory
Incorrect version or missing kernel-devel package, use: yum install kernel-devel-3.10.0-1160.el7.x86_64

Here we simply install it from a downloaded RPM, as below.

[root@host1 ~]# wget https://rpmfind.net/linux/centos/7.9.2009/os/x86_64/Packages/kernel-devel-3.10.0-1160.el7.x86_64.rpm

[root@host1 ~]# rpm -ivh kernel-devel-3.10.0-1160.el7.x86_64.rpm

[root@host1 ~]# rpm -qa | grep kernel | grep  3.10.0
kernel-devel-3.10.0-1160.el7.x86_64
kernel-tools-libs-3.10.0-1160.el7.x86_64
kernel-tools-3.10.0-1160.el7.x86_64
kernel-debuginfo-common-x86_64-3.10.0-1160.el7.x86_64
kernel-debuginfo-3.10.0-1160.el7.x86_64
kernel-headers-3.10.0-1160.59.1.el7.x86_64
kernel-3.10.0-1160.el7.x86_64

To verify the SystemTap setup again:

[root@host1 ~]# stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
Pass 1: parsed user script and 474 library scripts using 271960virt/69264res/3504shr/65852data kb, in 640usr/30sys/672real ms.
Pass 2: analyzed script: 1 probe, 1 function, 7 embeds, 0 globals using 439304virt/232180res/4884shr/233196data kb, in 2180usr/950sys/2977real ms.
Pass 3: translated to C into "/tmp/stap6kYO8U/stap_cc0f60b74db3020f09599659b9758c89_2771_src.c" using 439304virt/232436res/5140shr/233196data kb, in 10usr/50sys/67real ms.
Pass 4: compiled C into "stap_cc0f60b74db3020f09599659b9758c89_2771.ko" in 8040usr/1720sys/9477real ms.
Pass 5: starting run.
read performed
Pass 5: run completed in 30usr/90sys/442real ms.

SystemTap scripts

For the most part, SystemTap scripts are the foundation of each SystemTap session. SystemTap scripts instruct SystemTap on what type of information to collect, and what to do once that information is collected. SystemTap scripts are made up of two components: events and handlers. Once a SystemTap session is underway, SystemTap monitors the operating system for the specified events and executes the handlers as they occur.

SystemTap scripts allow insertion of instrumentation code without recompiling the kernel, and allow more flexibility with regard to handlers. Events serve as the triggers for handlers to run; handlers can be specified to record selected data and print it in a certain manner.

SystemTap scripts use the .stp file extension and contain probes written in the following format:

probe event {statements}

SystemTap allows you to write functions to factor out code used by multiple probes. Rather than repeatedly writing the same series of statements in several probes, you can place the instructions in a function, as in:

function function_name(arguments){statements}
probe event {function_name(arguments)}

The statements in function_name are executed when the probe for event executes. The arguments are optional values passed into the function.
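
As a minimal illustration (the file name and the function report below are our own, not part of SystemTap’s tapset library), a shared function can be used like this:

[root@host1 ~]# cat reportfunc.stp
function report(msg) {
    printf("%s performed by %s(%d)\n", msg, execname(), pid())
}
probe vfs.read {
    report("read")
    exit()
}

Running it with “stap reportfunc.stp” should print a single line identifying the first process that performs a VFS read.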

Running SystemTap Scripts

SystemTap scripts are run through the command stap. stap can run SystemTap scripts from the standard input or from a file.

We saw how to run SystemTap from the standard input when verifying the installation in the previous section.

We can also run it from a file, as below.

[root@host1 ~]# cat runfromfile.stp
probe vfs.read {
    printf("read performed\n");
    exit()
}

[root@host1 ~]# stap runfromfile.stp
read performed

At this point, we know what SystemTap is and how to deploy it. We will explore more meaningful uses of it in future posts.

In this example, we study the latency of the function “nfsd_vfs_write” from the kernel module “nfsd”.

ftrace configuration

The following ftrace options are used in this example; there are 8 nfsd processes to be traced. A sketch of the commands used to apply these options follows the list.

  • current_tracer: function_graph
  • pids: 3591 3592 3593 3594 3595 3596 3597 3598
  • filters: nfsd_vfs_write [nfsd]
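
The following is a minimal sketch of how these options can be applied through the tracefs interface. The mount point /sys/kernel/debug/tracing and the 5-second tracing window are assumptions, not taken from the original run:

[root@host1 ~]# cd /sys/kernel/debug/tracing
[root@host1 tracing]# echo function_graph > current_tracer
[root@host1 tracing]# echo nfsd_vfs_write > set_ftrace_filter
[root@host1 tracing]# echo 3591 3592 3593 3594 3595 3596 3597 3598 > set_ftrace_pid
[root@host1 tracing]# echo 1 > tracing_on
[root@host1 tracing]# sleep 5; echo 0 > tracing_on
[root@host1 tracing]# cat trace > /root/ftrace.out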

ftrace result

The following trace result shows the call time of the function “nfsd_vfs_write”. This can be very helpful when we need to analyze the latency of a function in a kernel module.

$ cat ftrace.out | grep nfsd_vfs_write | grep "nfsd_vfs_write \[nfsd\]();" | head -5
  1)  ! 185.011 us  |  nfsd_vfs_write [nfsd]();
  2)  ! 161.237 us  |  nfsd_vfs_write [nfsd]();
  3)  ! 200.954 us  |  nfsd_vfs_write [nfsd]();
  4)  ! 255.285 us  |  nfsd_vfs_write [nfsd]();
  5)  ! 171.537 us  |  nfsd_vfs_write [nfsd]();

In this post, we are going to explore how to use perf, the official Linux profiler, for dynamic tracing with a user-defined tracepoint. By dynamic tracing we mean that the kernel event (function) to be traced is not predefined in perf; instead, we add the tracepoint event manually.

We will study it based on a real-world NFS write performance issue. With 4k sequential writes from fio, the throughput is much lower on a newer kernel (5.x) than on an older kernel (3.x). Profiling the system makes it clear that the call stacks and sample counts of nfsd are very different between the two kernel versions.

[Figure: perf call stacks and sample counts of nfsd on the 3.x and 5.x kernels]

Add tracepoint in perf

At first, I wanted to add a tracepoint for the kernel function “nfsd_vfs_write”, since it appears in the main nfsd code path for both kernel versions. But perf complains with the error “out of .text”, as below.

[root@host1 ~]# perf probe --add nfsd_vfs_write 
nfsd_vfs_write is out of .text, skip it.
  Error: Failed to add events. Reason: No such file or directory (Code: -2)

By checking the exported kernel symbols from /proc/kallsyms, the symbol type is lowercase “t” for the function “nfsd_vfs_write”.

[root@host1 ~]# cat /proc/kallsyms |  egrep -i -w "nfsd_vfs_write"
ffffffffc094fdd0 t nfsd_vfs_write	[nfsd] 

[root@host1 ~]# perf probe -F | egrep -i "nfsd_vfs_write$"
nfsd_vfs_write

Based on the manual page of nm, lowercase means the symbol is local. It is likely that a local symbol cannot be added as a probe event.

The symbol type.

If lowercase, the symbol is usually local; if uppercase, the symbol is global (external).

“T” “t” The symbol is in the text (code) section.

If we still want to trace and understand the overhead of the function “nfsd_vfs_write”, ftrace is a way to go.

In this post, we want to discuss how to add a probe event in perf, so we try a different function, “vfs_fsync_range”, whose symbol type is global.

With the following commands, the probe for “vfs_fsync_range” is added to perf.

[root@host1 ~]# uname -r
3.10.0-1160.el7.x86_64
[root@host1 ~]# cat /proc/kallsyms | awk '{if($2=="T")print}' | grep -i vfs_fsync_range
ffffffff960837a0 T vfs_fsync_range

[root@host1 ~]# perf list | grep probe

[root@host1 ~]# perf probe vfs_fsync_range
Added new event:
  probe:vfs_fsync_range (on vfs_fsync_range)

You can now use it in all perf tools, such as:

    perf record -e probe:vfs_fsync_range -aR sleep 1

[root@host1 ~]# perf list | grep probe
  probe:vfs_fsync_range                              [Tracepoint event]

Trace the user-defined probe event

At this point, we have the tracepoint for the function “vfs_fsync_range” added in perf. We can sample stack traces and check the function runtime from the perf script output. This is useful because, without dynamic tracing like this, we cannot identify the overhead (runtime) of the target function. Similar tracing can be done with ftrace; based on our experiments, however, ftrace seems more powerful than perf for identifying kernel-function overhead.

The following commands show how to record stack traces for the target tracepoint and how to extract the call time from the result (a sketch of the extraction step follows the perf script output below). By comparing the function call time across different kernels, we can identify the possible issue in the nfsd call stack.

[root@host1 ~]#  perf record -e probe:vfs_fsync_range -aR sleep 10
[ perf record: Woken up 57 times to write data ]
[ perf record: Captured and wrote 16.284 MB perf.data (233480 samples) ]

[root@host1 ~]# ls -la perf.data
-rw------- 1 root root 17120542 Jul 13 23:50 perf.data

[root@host1 ~]#  perf record -e probe:vfs_fsync_range -g -aR sleep 10
[ perf record: Woken up 146 times to write data ]
[ perf record: Captured and wrote 38.179 MB perf.data (248546 samples) ]

[root@host1 ~]# ls -la perf.data
-rw------- 1 root root 40079506 Jul 13 23:52 perf.data

[root@host1 ~]# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 248K of event 'probe:vfs_fsync_range'
# Event count (approx.): 248546
#
# Children      Self  Trace output
# ........  ........  ..................
#
   100.00%   100.00%  (ffffffff960837a0)
            |
            ---ret_from_fork_nospec_end
               kthread
               nfsd
               svc_process
               svc_process_common
               nfsd_dispatch
               nfsd4_proc_compound
               nfsd4_write
               vfs_fsync_range

[root@host1 ~]# perf script
nfsd  3592 [025]   677.464643: probe:vfs_fsync_range: (ffffffff960837a0)
        ffffffff960837a1 vfs_fsync_range+0x1 ([kernel.kallsyms])
        ffffffffc07a0b0f nfsd4_write+0x1cf ([kernel.kallsyms])
        ffffffffc07a267d nfsd4_proc_compound+0x3dd ([kernel.kallsyms])
        ffffffffc078d810 nfsd_dispatch+0xe0 ([kernel.kallsyms])
        ffffffffc0f61850 svc_process_common+0x400 ([kernel.kallsyms])
        ffffffffc0f61d13 svc_process+0xf3 ([kernel.kallsyms])
        ffffffffc078d16f nfsd+0xdf ([kernel.kallsyms])
        ffffffff95ec5c21 kthread+0xd1 ([kernel.kallsyms])
        ffffffff96593df7 ret_from_fork_nospec_end+0x0 ([kernel.kallsyms])
<...>
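
The record above captures only function entry. One way to estimate per-call time (our own sketch, not part of the original post; the awk field positions assume the default perf script layout shown above) is to add a matching return probe and subtract entry/return timestamps per thread:

[root@host1 ~]# perf probe --add 'vfs_fsync_range%return'
[root@host1 ~]# perf record -e probe:vfs_fsync_range -e probe:vfs_fsync_range__return -aR sleep 10
[root@host1 ~]# perf script | awk '
    { ts = $4; sub(/:$/, "", ts) }                     # strip trailing colon from timestamp
    $5 == "probe:vfs_fsync_range:" { entry[$2] = ts }  # remember entry time per tid
    $5 == "probe:vfs_fsync_range__return:" && ($2 in entry) {
        printf "tid %s call time: %.1f us\n", $2, (ts - entry[$2]) * 1000000
        delete entry[$2]
    }'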

Tuned

Tuned is a daemon that uses udev to monitor connected devices and statically and dynamically tunes system settings according to a selected profile. Tuned is distributed with a number of predefined profiles for common use cases like high throughput, low latency, or powersave. It is possible to modify the rules defined for each profile and customize how to tune a particular device. To revert all changes made to the system settings by a certain profile, you can either switch to another profile or deactivate the tuned service.
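
As a minimal sketch of such customization (the profile name myprofile and the sysctl setting are illustrative, not taken from this system), a custom profile can extend a predefined one and override individual settings:

[root@host1 ~]# mkdir /etc/tuned/myprofile
[root@host1 ~]# cat /etc/tuned/myprofile/tuned.conf
[main]
include=throughput-performance

[sysctl]
vm.dirty_ratio=40

[root@host1 ~]# tuned-adm profile myprofile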

Check the tuned service and active profile

[root@host1 ~]# systemctl status tuned
● tuned.service - Dynamic System Tuning Daemon
   Loaded: loaded (/usr/lib/systemd/system/tuned.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-07-12 04:52:53 UTC; 13h ago
     Docs: man:tuned(8)
           man:tuned.conf(5)
           man:tuned-adm(8)
 Main PID: 2592 (tuned)
    Tasks: 5
   Memory: 33.9M
   CGroup: /system.slice/tuned.service
           └─2592 /usr/bin/python2 -Es /usr/sbin/tuned -l -P

Jul 12 04:52:52 host1 systemd[1]: Starting Dynamic System Tuning Daemon...
Jul 12 04:52:53 host1 systemd[1]: Started Dynamic System Tuning Daemon.

[root@host1 ~]# tuned-adm list
Available profiles:
- balanced                    - General non-specialized tuned profile
- desktop                     - Optimize for the desktop use-case
- hpc-compute                 - Optimize for HPC compute workloads
- latency-performance         - Optimize for deterministic performance at the cost of increased power consumption
- network-latency             - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
- network-throughput          - Optimize for streaming network throughput, generally only necessary on older CPUs or 40G+ networks
- powersave                   - Optimize for low power consumption
- throughput-performance      - Broadly applicable tuning that provides excellent performance across a variety of common server workloads
- virtual-guest               - Optimize for running inside a virtual guest
- virtual-host                - Optimize for running KVM guests
Current active profile: balanced

[root@host1 ~]# tuned-adm active
Current active profile: balanced

[root@host1 ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | head
powersave
powersave
<omitted..>

Change the tuned profile

[root@host1 ~]# tuned-adm profile throughput-performance

[root@host1 ~]# tuned-adm active
Current active profile: throughput-performance

[root@host1 ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
<omitted..>

Install debuginfo package

On CentOS, we can install the debuginfo package as follows.

  • Modify /etc/yum.repos.d/CentOS-Debuginfo.repo by setting “enabled=1”

  • Run “yum install kernel-debuginfo”

    [root@host1 ~]# uname -r
    3.10.0-1160.el7.x86_64

    [root@host1 ~]# vim /etc/yum.repos.d/CentOS-Debuginfo.repo
    [base-debuginfo]
    name=CentOS-7 - Debuginfo
    baseurl=http://debuginfo.centos.org/7/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Debug-7
    enabled=1

    [root@host1 ~]# yum install kernel-debuginfo
    Installed:
    kernel-debuginfo.x86_64 0:4.19.113-300.el7
    Dependency Installed:
    kernel-debuginfo-common-x86_64.x86_64 0:4.19.113-300.el7
    [root@host1 ~]# rpm -qa | grep debuginfo
    kernel-debuginfo-4.19.113-300.el7.x86_64
    kernel-debuginfo-common-x86_64-4.19.113-300.el7.x86_64

Notice that the installed kernel-debuginfo package version does not match the running kernel version. We can use the following command to install the version identical to the kernel.

[root@host1 ~]# yum install -y kernel-devel-$(uname -r) \
kernel-debuginfo-$(uname -r) \
kernel-debuginfo-common-$(uname -m)-$(uname -r)

Installed:
  kernel-debuginfo.x86_64 0:3.10.0-1160.el7  kernel-debuginfo-common-x86_64.x86_64 0:3.10.0-1160.el7

Remove debuginfo package

[root@host1 ~]# rpm -qa | grep debuginfo
kernel-debuginfo-4.19.113-300.el7.x86_64
kernel-debuginfo-common-x86_64-4.19.113-300.el7.x86_64

[root@host1 ~]# yum remove kernel-debuginfo
[root@host1 ~]# yum remove kernel-debuginfo-common-x86_64-4.19.113-300.el7.x86_64

[root@host1 ~]# rpm -qa | grep debug

About sadc and sar

sadc is known as the system activity data collector. It samples system data a specified number of times (count) at a specified interval measured in seconds (interval), and writes the output in binary format to the specified outfile or to standard output. The sadc command is intended to be used as a backend to the sar command, which can collect, report, or save system activity information.
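
For illustration, sadc can also be invoked directly (a sketch; on CentOS 7 the binary lives under /usr/lib64/sa/, which may differ on other distributions):

[root@host1 ~]# /usr/lib64/sa/sadc 1 2 /tmp/sa.out
[root@host1 ~]# sar -f /tmp/sa.out -u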

The following are examples of some often-used options for understanding system activity, including CPU, memory, disk I/O, and network.

Report CPU utilization

The ALL keyword with “-u” indicates that all the CPU fields should be displayed.

[root@host1 ~]# sar -u 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:47:11 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:47:12 PM     all      0.00      0.00      0.01      0.00      0.00     99.99
06:47:13 PM     all      0.01      0.00      0.00      0.00      0.00     99.99
Average:        all      0.01      0.00      0.01      0.00      0.00     99.99

[root@host1 ~]# sar -u ALL 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:47:16 PM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
06:47:17 PM     all      0.02      0.00      0.02      0.00      0.00      0.00      0.00      0.00      0.00     99.96
06:47:18 PM     all      0.00      0.00      0.01      0.00      0.00      0.00      0.00      0.00      0.00     99.99
Average:        all      0.01      0.00      0.02      0.00      0.00      0.00      0.00      0.00      0.00     99.97

Report memory statistics

[root@host1 ~]# sar -R 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

07:00:19 PM   frmpg/s   bufpg/s   campg/s
07:00:20 PM    -10.00      0.00      0.00
07:00:21 PM      8.00      0.00      0.00
Average:        -1.00      0.00      0.00

Report memory utilization statistics

[root@host1 ~]# sar -r 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:47:59 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
06:48:00 PM 1052993760   3499308      0.33    155084    226796   1805324      0.17    343248    178520       184
06:48:01 PM 1052993776   3499292      0.33    155084    226796   1805324      0.17    343464    178520       184
Average:    1052993768   3499300      0.33    155084    226796   1805324      0.17    343356    178520       184

Report swap space utilization statistics

[root@host1 ~]# sar -S 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:47:25 PM kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
06:47:26 PM   4194300         0      0.00         0      0.00
06:47:27 PM   4194300         0      0.00         0      0.00
Average:      4194300         0      0.00         0      0.00

Report swapping statistics

[root@host1 ~]# sar -W 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:50:10 PM  pswpin/s pswpout/s
06:50:11 PM      0.00      0.00
06:50:12 PM      0.00      0.00
Average:         0.00      0.00

Report paging statistics

[root@host1 ~]# sar -B 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:51:29 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
06:51:30 PM      0.00      0.00     22.00      0.00     48.00      0.00      0.00      0.00      0.00
06:51:31 PM      0.00      0.00     19.00      0.00     47.00      0.00      0.00      0.00      0.00
Average:         0.00      0.00     20.50      0.00     47.50      0.00      0.00      0.00      0.00

Report hugepages utilization statistics

[root@host1 ~]# sar -H 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:54:34 PM kbhugfree kbhugused  %hugused
06:54:35 PM         0         0      0.00
06:54:36 PM         0         0      0.00
Average:            0         0      0.00

Report task creation and system switching activity

[root@host1 ~]# sar -w 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:50:34 PM    proc/s   cswch/s
06:50:35 PM      0.00    524.00
06:50:36 PM      0.00    721.00
Average:         0.00    622.50

Report I/O and transfer rate statistics

[root@host1 ~]# sar -b 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:52:14 PM       tps      rtps      wtps   bread/s   bwrtn/s
06:52:15 PM      0.00      0.00      0.00      0.00      0.00
06:52:16 PM      0.00      0.00      0.00      0.00      0.00
Average:         0.00      0.00      0.00      0.00      0.00

Report activity for each block device

[root@host1 ~]# sar -d 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:53:04 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
06:53:05 PM  dev259-0      1.00      0.00      8.00      8.00      0.00      0.00      1.00      0.10
06:53:05 PM  dev259-4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev259-5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev259-6      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev259-7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev259-8      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev259-9      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM dev259-10      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM dev259-11      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM dev259-12      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:05 PM  dev253-0      1.00      0.00      8.00      8.00      0.00      0.00      1.00      0.10
06:53:05 PM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

06:53:05 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
06:53:06 PM  dev259-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-6      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-8      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev259-9      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM dev259-10      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM dev259-11      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM dev259-12      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev253-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:53:06 PM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:     dev259-0      0.50      0.00      4.00      8.00      0.00      0.00      1.00      0.05
Average:     dev259-4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev259-5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev259-6      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev259-7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev259-8      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev259-9      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    dev259-10      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    dev259-11      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    dev259-12      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev253-0      0.50      0.00      4.00      8.00      0.00      0.00      1.00      0.05
Average:     dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Report network statistics

[root@host1 ~]# sar -n DEV 1 2
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

06:56:34 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
06:56:35 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:35 PM      eth0     15.00      1.00      0.89      0.14      0.00      0.00      0.00
06:56:35 PM      eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:35 PM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:35 PM      eth2      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:35 PM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

06:56:35 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
06:56:36 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:36 PM      eth0     16.00      1.00      0.94      0.74      0.00      0.00      0.00
06:56:36 PM      eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:36 PM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:36 PM      eth2      0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:56:36 PM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0     15.50      1.00      0.92      0.44      0.00      0.00      0.00
Average:         eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth2      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:      docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Save and extract records from file

Save the readings in a file in binary form; each reading is a separate record. The default value of the filename parameter is the current daily data file, /var/log/sa/sadd (where dd is the current day of the month). The -o option is exclusive of the -f option. All the data available from the kernel are saved in the file (in fact, sar calls its data collector sadc with the option “-S ALL”).

[root@host1 ~]# sar -o sar.out.1.2

[root@host1 ~]# sar -f sar.out.1.2 -u
Linux 5.7.12-1.el7.elrepo.x86_64 (host1) 	07/06/2022 	_x86_64_	(96 CPU)

07:04:45 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:04:46 PM     all      0.02      0.00      0.05      0.00      0.00     99.93
07:04:47 PM     all      0.06      0.00      0.11      0.00      0.00     99.82
Average:        all      0.04      0.00      0.08      0.00      0.00     99.87

The nfs mount option “nconnect=n” exists in all Linux distributions with kernel 5.3 or higher. nconnect enables multiple TCP connections for a single NFS mount.

From the nfs manual page:

nconnect=n

When using a connection oriented protocol such as TCP, it may sometimes be advantageous to set up multiple connections between the client and server. For instance, if your clients and/or servers are equipped with multiple network interface cards (NICs), using multiple connections to spread the load may improve overall performance. In such cases, the nconnect option allows the user to specify the number of connections that should be established between the client and server up to a limit of 16.

The nconnect option can easily be added at NFS mount time, as in the following command.

$ mount -t nfs -o nconnect=16 nfs-server-hostname:/mnt/nfsshare /mnt/nfsmnt1
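
To confirm that the extra connections are actually established, one can count the client’s TCP connections to the NFS server (a sketch; this assumes NFS over TCP on the default port 2049):

$ ss -tn | grep -c ':2049'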

The following is a 4k sequential write performance comparison when increasing nconnect to 16 over a single 10GbE link.

[Figure: 4k sequential write performance with nconnect increased to 16 over a single 10GbE link]

We need to install the NFS packages on the NFS server as well as on the NFS client machine. On RHEL/CentOS, we can install them via “yum”.
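
For example (nfs-utils is the standard NFS package name on RHEL/CentOS):

$ yum install -y nfs-utils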

On the NFS server side, run the following commands to set up the NFS server.

$ systemctl status nfs
$ mkfs.ext4 /dev/nvme9n1
$ mkdir /mnt/nfsshare

$ vim /etc/exports
/mnt/nfsshare 10.10.10.1(rw,no_root_squash)

$ systemctl restart nfs
$ exportfs -s
/mnt/nfsshare  10.10.10.1(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

On the NFS client side, run the following commands to mount the exported NFS share.

$ mkdir /mnt/nfs1
$ mount -t nfs nfs-server-hostname:/mnt/nfsshare /mnt/nfs1
$ mount  | grep nfs1
nfs-server-hostname:/mnt/nfsshare on /mnt/nfs1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.2,local_lock=none,addr=10.10.10.1)