This post shows how to migrate Docker data from its existing directory to a new target directory when the old directory runs out of space.

Stop the docker daemon

$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
    
$ systemctl stop docker.service
$ ps aux | grep -i docker | grep -v grep

Add a configuration file to tell Docker the new location of the data

$ mkdir /data/var_lib_docker/
$ vim /etc/docker/daemon.json
{
    "data-root": "/data/var_lib_docker"
}

Copy the Docker data to the new directory (this can take a while)

$ sudo rsync -aP /var/lib/docker/ /data/var_lib_docker

Verify that the migration works

$ mv /var/lib/docker/ /var/lib/docker.old

$ sudo systemctl start docker

$ ps aux | grep -i docker | grep -v grep
root     29227  0.2  0.1 1244076 28448 ?       Ssl  10:43   0:01 /usr/bin/dockerd
root     29243  0.0  0.0 984268  7696 ?        Ssl  10:43   0:00 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc

$ docker info  | grep Root
    Docker Root Dir: /data/var_lib_docker

$ docker images
$ docker inspect 1cd20ecd897d | grep RootDir
"RootDir": "/data/var_lib_docker/overlay/90021ce8266c3f717e2d30e258311e850b50e946b7f68d505f504b008378414c/root"

Young Author Gold Award, 4th grade, 2021

It was a fine summer morning and Zachory Taylor was enjoying the fresh air and sunshine. He was on summer vacation to Azogia, one of the richest countries which controlled most of the gold mines in the world. Azogia was located next to Zemonia, which controlled most of the silver mines. Azog, who was the ruler of Azogia, was a cousin of Zemo, the ruler of Zemonia. The two cousins had been competing against each other for a few decades on whose country was the most powerful. Azog was building golden walls to protect Azogopolis, the capital of Azogia, in case Zemo tried to launch a surprise attack on Azogia.


In a Docker container environment, we won’t get a valid stack trace for a containerized process directly on the container host, as shown below.

$ ps -ef | grep smbd
root 171118 167977 0 Apr22 ? 00:00:02 /usr/sbin/smbd --foreground --no-process-group
root 171166 171118 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
root 171168 171118 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
root 171208 171118 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
root 190574 186140 0 15:01 pts/3 00:00:00 grep --color=auto smbd

$ gstack 171118
#0 0x00007fb66768afb3 in ?? () from /lib64/libc.so.6
#1 0x00007fb667b70f1b in ?? ()
#2 0x0000000000000060 in ?? ()
#3 0x00007fb668a92f5c in ?? ()
#4 0x0000000000000000 in ?? ()

To print the stack trace for a process running inside a container, we can do the following.

  1. Get the process ID inside the container

    $ docker exec -it iNfcP_9-0-1 bash
    bash-4.2# ps -ef | grep smbd
    root 1864 1 0 Apr22 ? 00:00:02 /usr/sbin/smbd --foreground --no-process-group
    root 1908 1864 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
    root 1910 1864 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
    root 1950 1864 0 Apr22 ? 00:00:00 /usr/sbin/smbd --foreground --no-process-group
    root 177346 177338 0 15:02 ? 00:00:00 grep smbd
  2. Get the container instance PID

    $ /bin/docker inspect --format '{{ .State.Pid }}' e2590333640e
    167977
  3. Print the stack trace with gstack, as shown below

    $  /bin/nsenter -Z -m -n -p -t 167977 /bin/gstack 1864
    #0 0x00007fb66768afb3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    #1 0x00007fb667b70f1b in epoll_event_loop_once () from /lib64/libtevent.so.0
    #2 0x00007fb667b6f057 in std_event_loop_once () from /lib64/libtevent.so.0
    #3 0x00007fb667b6a25d in _tevent_loop_once () from /lib64/libtevent.so.0
    #4 0x00007fb667b6a4bb in tevent_common_loop_wait () from /lib64/libtevent.so.0
    #5 0x00007fb667b6eff7 in std_event_loop_wait () from /lib64/libtevent.so.0
    #6 0x00005588a9175a98 in main ()

We use nsenter to enter the container's namespaces and get the stack trace of the target process.

$ man nsenter

NAME
nsenter - run program with namespaces of other processes

SYNOPSIS
nsenter [options] [program [arguments]]

DESCRIPTION
Enters the namespaces of one or more other processes and then executes the specified program. Enterable namespaces are:

mount namespace
Mounting and unmounting filesystems will not affect the rest of the system (CLONE_NEWNS flag), except for filesystems which are explicitly marked as
shared (with mount --make-shared; see /proc/self/mountinfo for the shared flag).

UTS namespace
Setting hostname or domainname will not affect the rest of the system. (CLONE_NEWUTS flag)

IPC namespace
The process will have an independent namespace for System V message queues, semaphore sets and shared memory segments. (CLONE_NEWIPC flag)

network namespace
The process will have independent IPv4 and IPv6 stacks, IP routing tables, firewall rules, the /proc/net and /sys/class/net directory trees, sockets,
etc. (CLONE_NEWNET flag)

PID namespace
Children will have a set of PID to process mappings separate from the nsenter process (CLONE_NEWPID flag). nsenter will fork by default if changing the
PID namespace, so that the new program and its children share the same PID namespace and are visible to each other. If --no-fork is used, the new
program will be exec'ed without forking.

user namespace
The process will have a distinct set of UIDs, GIDs and capabilities. (CLONE_NEWUSER flag)

See clone(2) for the exact semantics of the flags.

If program is not given, then ``${SHELL}'' is run (default: /bin/sh).

Network bonding enables the combination of two or more network interfaces into a single-bonded (logical) interface, which increases the bandwidth and provides redundancy. If a specific network interface card (NIC) experiences a problem, communications are not affected significantly as long as the other slave NICs remain active.

Bonding modes supported by RHEL and CentOS operating systems

The behavior of the bonded interfaces depends on the mode that is selected. RHEL supports the following common bonding modes:

  • Mode 0 (balance-rr): This mode is also known as round-robin mode. Packets are sequentially transmitted and received through each interface one by one. This mode provides load balancing functionality.
  • Mode 1 (active-backup): This mode has only one interface set to active, while all other interfaces are in the backup state. If the active interface fails, a backup interface replaces it as the only active interface in the bond. The media access control (MAC) address of the bond interface in mode 1 is visible on only one port (the network adapter), which prevents confusion for the switch. Mode 1 provides fault tolerance.
  • Mode 2 (balance-xor): The source MAC address uses exclusive or (XOR) logic with the destination MAC address. This calculation ensures that the same slave interface is selected for each destination MAC address. Mode 2 provides fault tolerance and load balancing.
  • Mode 3 (broadcast): All transmissions are sent to all the slaves. This mode provides fault tolerance.
  • Mode 4 (802.3ad): This mode creates aggregation groups that share the same speed and duplex settings, and it requires a switch that supports an IEEE 802.3ad dynamic link. Mode 4 uses all interfaces in the active aggregation group. For example, you can aggregate three 1 Gbps ports into a 3 Gbps trunk port, which is equivalent to having one interface with 3 Gbps of speed. It provides fault tolerance and load balancing.
  • Mode 5 (balance-tlb): This mode ensures that the outgoing traffic distribution is set according to the load on each interface and that the current interface receives all the incoming traffic. If the assigned interface fails to receive traffic, another interface is assigned to the receiving role. It provides fault tolerance and load balancing.
  • Mode 6 (balance-alb): This mode is supported only in x86 environments. The receiving packets are load balanced through Address Resolution Protocol (ARP) negotiation. This mode provides fault tolerance and load balancing.

Before we explore LACP configuration, we should understand the IEEE 802.3ad link aggregation policy and LACP bonding, which allows us to aggregate multiple ports into a single group. This process combines the bandwidth into a single connection.

IEEE 802.3ad link aggregation enables us to group Ethernet interfaces at the physical layer to form a single link layer interface, also known as a link aggregation group (LAG) or bundle.

Some users require more bandwidth in their network than a single fast Ethernet link can provide. Using IEEE 802.3ad link aggregation in this situation provides increased port density and bandwidth at a lower cost.

For example, if you need 2 Gbps of bandwidth to transmit data and have only 1 Gbps Ethernet links installed on your system, creating a LAG bundle containing two 1 Gbps Ethernet links is more cost-effective than purchasing a single 2 Gbps Ethernet link.

[Diagram: IEEE 802.3ad link aggregation policy]

LACP is a mechanism for exchanging port and system information to create and maintain LAG bundles. The LAG bundle distributes MAC clients across the link layer interface and collects traffic from the links to present to the MAC clients of the LAG bundle.

LACP identifies the MAC address of the Ethernet link that has the highest port priority and is of the lowest value, and it assigns that MAC address to the LAG bundle.

This bonding mode requires a switch that supports IEEE 802.3ad dynamic links.

Install sshpass package

To install the sshpass package on the source server, from which we will enable passwordless SSH login to the remote servers:

-bash-4.2# rpm -ivh sshpass-1.06-1.el7.x86_64.rpm

Enable ssh passwordless login to remote servers

To enable passwordless SSH login to a large number of remote servers, we can use the following loop to automate the setup without having to provide the password for each server.

-bash-4.2# for n in `seq 0 239`
do
    sshpass -p "password" ssh-copy-id -i /root/.ssh/id_rsa.pub -o StrictHostKeyChecking=no Client-hostname$n
done

Configure namespaced kernel parameters(sysctl) at runtime

The --sysctl option sets namespaced kernel parameters (sysctls) in the container. For example, to turn on IP forwarding in the container's network namespace, run this command:

$ docker run --sysctl net.ipv4.ip_forward=1 someimage

Note

Not all sysctls are namespaced. Docker does not support changing sysctls inside of a container that also modify the host system. As the kernel evolves we expect to see more sysctls become namespaced.

CURRENTLY SUPPORTED SYSCTLS

IPC Namespace:

  • kernel.msgmax, kernel.msgmnb, kernel.msgmni, kernel.sem, kernel.shmall, kernel.shmmax, kernel.shmmni, kernel.shm_rmid_forced.
  • Sysctls beginning with fs.mqueue.*
  • If you use the --ipc=host option, these sysctls are not allowed.

Network Namespace:

  • Sysctls beginning with net.*
  • If you use the --network=host option, these sysctls are not allowed.

Hard and soft ulimit settings

There are two types of ulimit settings:

  • The hard limit is the maximum value that is allowed for the soft limit. Any changes to the hard limit require root access.
  • The soft limit is the value that Linux uses to limit the system resources for running processes. The soft limit cannot be greater than the hard limit.
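
As a side note (not part of the original configuration steps), the hard/soft distinction is also visible programmatically through getrlimit(2)/setrlimit(2). The following minimal C sketch is my own illustration and uses the nofile (open files) limit as an example:

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* RLIMIT_NOFILE is the "open files" (nofile) limit. */
        getrlimit(RLIMIT_NOFILE, &rl);
        printf("soft: %llu  hard: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* An unprivileged process may raise its soft limit, but only up to
           the hard limit; raising the hard limit requires root. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit");

        return 0;
    }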

Updating hard and soft ulimit settings in Linux

To change the open files value on your operating system:
On RHEL and CentOS, edit the /etc/security/limits.d/91-nofile.conf file as shown in the following example:
@streamsadmin - nofile open-files-value
On SLES, edit the /etc/security/limits.conf file as shown in the following example:
@streamsadmin - nofile open-files-value

To change the max user processes value on your operating system:
On RHEL and CentOS, edit the /etc/security/limits.d/90-nproc.conf file as shown in the following example:
@streamsadmin hard nproc max-user-processes-value
@streamsadmin soft nproc max-user-processes-value
On SLES, edit the /etc/security/limits.conf file as shown in the following example:
@streamsadmin hard nproc max-user-processes-value
@streamsadmin soft nproc max-user-processes-value

To set the hard stack and soft stack values, add the following lines to the /etc/security/limits.conf file:
@streamsadmin hard stack unlimited 
@streamsadmin soft stack 20480

Use the following ulimit commands to verify the updated settings:
To verify the updated hard limit, enter the following command:
ulimit -aH
To verify the updated soft limit, enter the following command:
ulimit -aS

Set ulimits in container (–ulimit)

Since setting ulimit settings in a container requires extra privileges not available in the default container, you can set these using the --ulimit flag. --ulimit is specified with a soft and hard limit as <type>=<soft limit>[:<hard limit>], for example:

$ docker run --ulimit nofile=1024:1024 --rm debian sh -c "ulimit -n"
1024

Note

If you do not provide a hard limit, the soft limit is used for both values. If no ulimits are set, they are inherited from the default ulimits set on the daemon. The as option is disabled now. In other words, the following script is not supported:

$ docker run -it --ulimit as=1024 fedora /bin/bash

The values are sent to the appropriate syscall as they are set. Docker doesn’t perform any byte conversion. Take this into account when setting the values.

This post includes the following sorting algorithms. The code is self-explanatory.

  • selection sort

  • insertion sort

  • bubble sort

  • quick sort

  • merge sort

    import java.util.Arrays;

    public class Sort {
        public static void selection_sort(int[] arr) {
            for (int i = 0; i < arr.length - 1; i++) {
                int min_idx = i;

                // find the minimum value to the right of arr[i] and swap it with arr[i]
                for (int j = i + 1; j < arr.length; j++) {
                    if (arr[j] < arr[min_idx]) {
                        min_idx = j;
                    }
                }

                // move the min value to the beginning of the unsorted part
                int temp = arr[i];
                arr[i] = arr[min_idx];
                arr[min_idx] = temp;
            }
        }

        public static void insert_sort(int[] arr) {
            // start from the second element since the first one is already sorted by itself
            for (int i = 1; i < arr.length; i++) {
                int curr = arr[i];
                int j = i - 1;

                // move all values greater than curr one position to the right
                while (j >= 0 && arr[j] > curr) {
                    arr[j + 1] = arr[j];
                    j--;
                }

                arr[j + 1] = curr;
            }
        }

        public static void bubble_sort(int[] arr) {
            // bubble sort for n - 1 rounds
            for (int i = 0; i < arr.length - 1; i++) {
                boolean swapped = false;
                // for each round, bubble up the maximum element to the right
                for (int j = 0; j < arr.length - i - 1; j++) {
                    if (arr[j] > arr[j + 1]) {
                        int temp = arr[j];
                        arr[j] = arr[j + 1];
                        arr[j + 1] = temp;
                        swapped = true;
                    }
                }

                if (!swapped)
                    break;
            }
        }

        public static void quick_sort(int[] arr, int left, int right) {
            if (left < right) {
                int pivot = partition(arr, left, right);
                quick_sort(arr, left, pivot - 1);
                quick_sort(arr, pivot + 1, right);
            }
        }

        private static int partition(int[] arr, int left, int right) {
            int pivot = arr[right];
            int curr = left - 1;

            for (int i = left; i < right; i++) {
                if (arr[i] <= pivot) {
                    curr++;
                    int temp = arr[curr];
                    arr[curr] = arr[i];
                    arr[i] = temp;
                }
            }

            // place the pivot right after the last element that is <= pivot
            curr++;
            int temp = arr[curr];
            arr[curr] = pivot;
            arr[right] = temp;
            return curr;
        }

        public static void merge_sort(int[] arr, int left, int right) {
            if (left < right) {
                //int mid = (left + right) / 2;
                int mid = left + (right - left) / 2; // avoid overflow
                merge_sort(arr, left, mid);
                merge_sort(arr, mid + 1, right);
                merge(arr, left, mid, right);
            }
        }

        private static void merge(int[] arr, int left, int mid, int right) {
            int l1 = mid - left + 1;
            int l2 = right - mid;

            // copy the left and right halves to temporary arrays
            int[] a1 = new int[l1];
            int[] a2 = new int[l2];

            for (int i = 0; i < l1; i++) {
                a1[i] = arr[left + i];
            }

            for (int i = 0; i < l2; i++) {
                a2[i] = arr[mid + 1 + i];
            }

            // merge back into the original array
            int i = 0, j = 0, k = left;
            while (i < l1 && j < l2) {
                if (a1[i] <= a2[j]) {
                    arr[k++] = a1[i++];
                } else {
                    arr[k++] = a2[j++];
                }
            }

            while (i < l1) {
                arr[k++] = a1[i++];
            }
            while (j < l2) {
                arr[k++] = a2[j++];
            }
        }

        public static void main(String[] args) {
            int[] arr = { 11, 25, 12, 22, 64 };
            selection_sort(arr);
            System.out.println(Arrays.toString(arr));

            int[] arr1 = { 11, 25, 12, 22, 64 };
            insert_sort(arr1);
            System.out.println(Arrays.toString(arr1));

            int[] arr2 = { 11, 25, 12, 22, 64 };
            bubble_sort(arr2);
            System.out.println(Arrays.toString(arr2));

            int[] arr3 = { 11, 25, 12, 22, 64 };
            quick_sort(arr3, 0, arr3.length - 1);
            System.out.println(Arrays.toString(arr3));

            int[] arr4 = { 11, 25, 12, 22, 64 };
            merge_sort(arr4, 0, arr4.length - 1);
            System.out.println(Arrays.toString(arr4));
        }
    }

Mutexes

A mutex is basically a lock that we set (lock) before accessing a shared resource and release (unlock) when we’re done. While it is set, any other thread that tries to set it will block until we release it. If more than one thread is blocked when we unlock the mutex, then all threads blocked on the lock will be made runnable, and the first one to run will be able to set the lock. The others will see that the mutex is still locked and go back to waiting for it to become available again. In this way, only one thread will proceed at a time.

This mutual-exclusion mechanism works only if we design our threads to follow the same data-access rules. The operating system doesn’t serialize access to data for us. If we allow one thread to access a shared resource without first acquiring a lock, then inconsistencies can occur even though the rest of our threads do acquire the lock before attempting to access the shared resource.
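
To make the locking pattern concrete, here is a minimal pthread sketch of my own (not from the text above); the shared counter, thread count, and iteration count are arbitrary:

    /* compile with: cc mutex_demo.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    /* shared data protected by a mutex */
    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&counter_lock);   /* block until the lock is ours */
            counter++;                           /* critical section */
            pthread_mutex_unlock(&counter_lock); /* let another thread proceed */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[4];

        for (int i = 0; i < 4; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(tids[i], NULL);

        printf("counter = %ld\n", counter); /* 400000 when every access is locked */
        return 0;
    }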

Reader/Writer Locks

Reader–writer locks are similar to mutexes, except that they allow for higher degrees of parallelism. With a mutex, the state is either locked or unlocked, and only one thread can lock it at a time. Three states are possible with a reader–writer lock: locked in read mode, locked in write mode, and unlocked. Only one thread at a time can hold a reader–writer lock in write mode, but multiple threads can hold a reader–writer lock in read mode at the same time.

When a reader–writer lock is write locked, all threads attempting to lock it block until it is unlocked. When a reader–writer lock is read locked, all threads attempting to lock it in read mode are given access, but any threads attempting to lock it in write mode block until all the threads have released their read locks. Although implementations vary, reader–writer locks usually block additional readers if a lock is already held in read mode and a thread is blocked trying to acquire the lock in write mode. This prevents a constant stream of readers from starving waiting writers.

Reader–writer locks are well suited for situations in which data structures are read more often than they are modified. When a reader–writer lock is held in write mode, the data structure it protects can be modified safely, since only one thread at a time can hold the lock in write mode. When the reader–writer lock is held in read mode, the data structure it protects can be read by multiple threads, as long as the threads first acquire the lock in read mode.

Reader–writer locks are also called shared–exclusive locks. When a reader–writer lock is read locked, it is said to be locked in shared mode. When it is write locked, it is said to be locked in exclusive mode.
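
A small illustrative sketch of the read/write pattern with POSIX reader–writer locks might look like the following; the table array and the function names are made up for the example:

    #include <pthread.h>
    #include <stdio.h>

    /* example shared table protected by a reader-writer lock */
    static int table[16];
    static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

    int read_entry(int idx)
    {
        pthread_rwlock_rdlock(&table_lock);  /* many readers may hold this at once */
        int value = table[idx];
        pthread_rwlock_unlock(&table_lock);
        return value;
    }

    void write_entry(int idx, int value)
    {
        pthread_rwlock_wrlock(&table_lock);  /* exclusive: waits for all readers */
        table[idx] = value;
        pthread_rwlock_unlock(&table_lock);
    }

    int main(void)
    {
        write_entry(3, 42);
        printf("entry 3 = %d\n", read_entry(3));
        return 0;
    }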

Spin Locks

A spin lock is like a mutex, except that instead of blocking a process by sleeping, the process is blocked by busy-waiting (spinning) until the lock can be acquired. A spin lock could be used in situations where locks are held for short periods of times and threads don’t want to incur the cost of being descheduled.

Spin locks are often used as low-level primitives to implement other types of locks. Depending on the system architecture, they can be implemented efficiently using test- and-set instructions. Although efficient, they can lead to wasting CPU resources: while a thread is spinning and waiting for a lock to become available, the CPU can’t do anything else. This is why spin locks should be held only for short periods of time.
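
For reference, POSIX exposes this primitive through the pthread_spin_* interfaces; a minimal sketch of my own (with an arbitrary shared counter) looks like this:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_spinlock_t slock;
    static long hits;

    void bump(void)
    {
        pthread_spin_lock(&slock);   /* busy-waits (spins) instead of sleeping */
        hits++;                      /* keep the critical section very short */
        pthread_spin_unlock(&slock);
    }

    int main(void)
    {
        /* PTHREAD_PROCESS_PRIVATE: the lock is shared only between threads
           of this process. */
        pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
        bump();
        printf("hits = %ld\n", hits);
        pthread_spin_destroy(&slock);
        return 0;
    }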

Barriers

Barriers are a synchronization mechanism that can be used to coordinate multiple threads working in parallel. A barrier allows each thread to wait until all cooperating threads have reached the same point, and then continue executing from there. We’ve already seen one form of barrier—the pthread_join function acts as a barrier to allow one thread to wait until another thread exits.

Barrier objects are more general than this, however. They allow an arbitrary number of threads to wait until all of the threads have completed processing, but the threads don’t have to exit. They can continue working after all threads have reached the barrier.
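
A short pthread_barrier sketch, assuming four cooperating threads (the thread count and the phase printouts are just for illustration):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld: phase 1 done\n", id);

        /* wait here until all NTHREADS threads have reached this point */
        pthread_barrier_wait(&barrier);

        printf("thread %ld: phase 2 starts\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }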

Reference

  • Advanced Programming in the UNIX Environment

Pipes

Pipes are the oldest form of UNIX System IPC and are provided by all UNIX systems. Pipes have two limitations.

  1. Historically, they have been half duplex (i.e., data flows in only one direction). Some systems now provide full-duplex pipes, but for maximum portability, we should never assume that this is the case.
  2. Pipes can be used only between processes that have a common ancestor. Normally, a pipe is created by a process, that process calls fork, and the pipe is used between the parent and the child.

FIFOs get around the second limitation, and UNIX domain sockets get around both limitations.
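
A minimal sketch of the usual parent/child pipe pattern described above (my own example; the message text is arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        char buf[64];

        /* fd[0] is the read end, fd[1] is the write end (half duplex) */
        if (pipe(fd) < 0)
            return 1;

        if (fork() == 0) {              /* child: reads from the pipe */
            close(fd[1]);
            ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("child read: %s\n", buf);
            }
            _exit(0);
        }

        close(fd[0]);                   /* parent: writes to the pipe */
        const char *msg = "hello from parent";
        write(fd[1], msg, strlen(msg));
        close(fd[1]);
        wait(NULL);
        return 0;
    }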

FIFOs

FIFOs are sometimes called named pipes. Unnamed pipes can be used only between related processes when a common ancestor has created the pipe. With FIFOs, however, unrelated processes can exchange data.

Creating a FIFO is similar to creating a file. Indeed, the pathname for a FIFO exists in the file system.
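
A small illustrative writer, assuming an example FIFO path of /tmp/demo_fifo (the path is arbitrary); an unrelated reader such as cat /tmp/demo_fifo in another shell would receive the data:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* the FIFO is a pathname in the file system; unrelated processes
           can open it and exchange data (the path is just an example) */
        const char *path = "/tmp/demo_fifo";

        if (mkfifo(path, 0644) < 0 && errno != EEXIST) {
            perror("mkfifo");
            return 1;
        }

        /* writer side: open blocks until some reader opens the FIFO */
        int fd = open(path, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char *msg = "hello over a named pipe\n";
        write(fd, msg, strlen(msg));
        close(fd);
        return 0;
    }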

Message Queues

A message queue is a linked list of messages stored within the kernel and identified by a message queue identifier. We’ll call the message queue just a queue and its identifier a queue ID.

A new queue is created or an existing queue opened by msgget. New messages are added to the end of a queue by msgsnd. Every message has a positive long integer type field, a non-negative length, and the actual data bytes (corresponding to the length), all of which are specified to msgsnd when the message is added to a queue. Messages are fetched from a queue by msgrcv. We don’t have to fetch the messages in a first-in, first-out order. Instead, we can fetch messages based on their type field.
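
A minimal System V message queue sketch along the lines described above (illustrative only; the message type and text are arbitrary, and IPC_PRIVATE keeps the example self-contained):

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    struct msgbuf {
        long mtype;         /* positive message type */
        char mtext[64];     /* message data */
    };

    int main(void)
    {
        /* private queue just for this example; a real application would
           normally derive a key with ftok() so other processes can find it */
        int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
        if (qid < 0) {
            perror("msgget");
            return 1;
        }

        struct msgbuf out = { .mtype = 2 };
        strcpy(out.mtext, "hello");
        msgsnd(qid, &out, strlen(out.mtext) + 1, 0);   /* append to the queue */

        struct msgbuf in;
        /* fetch by type rather than FIFO order: type 2 here */
        msgrcv(qid, &in, sizeof(in.mtext), 2, 0);
        printf("received type %ld: %s\n", in.mtype, in.mtext);

        msgctl(qid, IPC_RMID, NULL);                   /* remove the queue */
        return 0;
    }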

Semaphores

A semaphore isn’t a form of IPC similar to the others that we’ve described (pipes, FIFOs, and message queues). A semaphore is a counter used to provide access to a shared data object for multiple processes.

To obtain a shared resource, a process needs to do the following:

  1. Test the semaphore that controls the resource.
  2. If the value of the semaphore is positive, the process can use the resource. In this case, the process decrements the semaphore value by 1, indicating that it has used one unit of the resource.
  3. Otherwise, if the value of the semaphore is 0, the process goes to sleep until the semaphore value is greater than 0. When the process wakes up, it returns to step 1.

When a process is done with a shared resource that is controlled by a semaphore, the semaphore value is incremented by 1. If any other processes are asleep, waiting for the semaphore, they are awakened.

To implement semaphores correctly, the test of a semaphore’s value and the decrementing of this value must be an atomic operation. For this reason, semaphores are normally implemented inside the kernel.
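
A minimal System V semaphore sketch of the test/decrement/increment cycle described above (illustrative only; a single semaphore initialized to one unit of the resource):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* the caller must define union semun (per X/OPEN) */
    union semun {
        int val;
        struct semid_ds *buf;
        unsigned short *array;
    };

    int main(void)
    {
        /* one semaphore, initialized to 1, i.e. one unit of the resource */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0) {
            perror("semget");
            return 1;
        }
        union semun arg = { .val = 1 };
        semctl(semid, 0, SETVAL, arg);

        struct sembuf lock   = { 0, -1, 0 };  /* test and decrement (may sleep) */
        struct sembuf unlock = { 0, +1, 0 };  /* increment, waking any sleepers */

        semop(semid, &lock, 1);
        printf("holding the resource\n");     /* use the shared resource here */
        semop(semid, &unlock, 1);

        semctl(semid, 0, IPC_RMID);           /* remove the semaphore set */
        return 0;
    }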

Shared Memory

Shared memory allows two or more processes to share a given region of memory. This is the fastest form of IPC, because the data does not need to be copied between the client and the server. The only trick in using shared memory is synchronizing access to a given region among multiple processes. If the server is placing data into a shared memory region, the client shouldn’t try to access the data until the server is done. Often, semaphores are used to synchronize shared memory access. But record locking or mutexes can also be used.
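
A small illustrative sketch that shares one region between a parent and child; here wait() stands in for the synchronization that a real application would do with semaphores, record locking, or mutexes:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* a small segment shared between parent and child */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (shmid < 0) {
            perror("shmget");
            return 1;
        }
        char *region = shmat(shmid, NULL, 0);   /* map it into our address space */

        if (fork() == 0) {                      /* child: writes into the region */
            strcpy(region, "written by the child");
            _exit(0);
        }

        wait(NULL);                             /* crude synchronization for the demo */
        printf("parent reads: %s\n", region);

        shmdt(region);
        shmctl(shmid, IPC_RMID, NULL);          /* remove the segment */
        return 0;
    }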

Network IPC: Sockets

The socket network IPC interface can be used by processes to communicate with other processes, regardless of where they are running, on the same machine or on different machines. Indeed, this was one of the design goals of the socket interface. The same interfaces can be used for both intermachine communication and intramachine communication.

A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access files, applications use socket descriptors to access sockets. Socket descriptors are implemented as file descriptors in the UNIX System. Indeed, many of the functions that deal with file descriptors, such as read and write, will work with a socket descriptor.

Normally, the recv functions will block when no data is immediately available. Similarly, the send functions will block when there is not enough room in the socket’s output queue to send the message. This behavior changes when the socket is in nonblocking mode.
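
As a minimal self-contained illustration of blocking send/recv (my own sketch, using a UNIX domain socket pair instead of a network connection):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* a connected pair of UNIX domain sockets: sv[0] and sv[1] behave
           like the two ends of a full-duplex pipe */
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        const char *msg = "hello over a socket";
        send(sv[0], msg, strlen(msg), 0);       /* would block if the output
                                                   queue were full */

        char buf[64];
        ssize_t n = recv(sv[1], buf, sizeof(buf) - 1, 0);  /* blocks until data */
        if (n > 0) {
            buf[n] = '\0';
            printf("received: %s\n", buf);
        }

        /* read()/write() also work here, since socket descriptors are
           file descriptors */
        close(sv[0]);
        close(sv[1]);
        return 0;
    }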

Reference

  • Advanced Programming in the UNIX Environment

In Red Hat Enterprise Linux 8.1, Red Hat ships a set of dynamic kernel tracing tools, called bcc-tools, that are fully supported on x86_64 and make use of a kernel technology called extended Berkeley Packet Filter (eBPF). With these tools, you can quickly gain insight into certain aspects of system performance that would previously have required more time and effort from the system and operator.

The eBPF technology allows dynamic kernel tracing without requiring kernel modules (like systemtap) or rebooting of the kernel (as with debug kernels). eBPF accomplishes this while maintaining minimal overhead for each trace point, making these tools an ideal way to instrument running kernels in production.

To ensure that an eBPF program will not harm the running kernel, tools built on eBPF go through the following process when instantiated by root on the command line:

  • The program is compiled into eBPF bytecode.
  • The bytecode is loaded into the kernel.
  • The bytecode is run through a technology called the eBPF verifier to ensure that the program will not harm the running kernel.
  • Upon passing the verifier, the program begins execution. If it does not pass the verifier, the code is unloaded and does not execute.

That said, bear in mind that you are still inserting tracing and some system calls are called significantly more than others, so depending on what you are tracing, there may be increased overhead.

If you are interested in more information on eBPF in general, please see Stanislav Kozina’s blog: Introduction to eBPF in Red Hat Enterprise Linux 7.

Installation

With RHEL 8.1, bcc-tools became fully supported on x86_64. To install bcc-tools on RHEL 7 (7.6+) and RHEL 8, run yum install as root:

$ uname -r
3.10.0-1062.40.1.el7.x86_64
$  cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
$ yum install bcc-tools
Installed:
  bcc-tools.x86_64 0:0.10.0-1.el7
Dependency Installed:
  bcc.x86_64 0:0.10.0-1.el7                kernel-devel.x86_64 0:3.10.0-1160.21.1.el7
  llvm-private.x86_64 0:7.0.1-1.el7        python-bcc.x86_64 0:0.10.0-1.el7

$ pwd
/usr/share/bcc/tools
$  ls
argdist       drsnoop         memleak         pythonstat   tclobjnew
bashreadline  execsnoop       mountsnoop      reset-trace  tclstat
biolatency    ext4dist        mysqld_qslower  rubycalls    tcpaccept
biosnoop      ext4slower      nfsdist         rubyflow     tcpconnect
biotop        filelife        nfsslower       rubygc       tcpconnlat
bitesize      fileslower      nodegc          rubyobjnew   tcpdrop
bpflist       filetop         nodestat        rubystat     tcplife
btrfsdist     funccount       offcputime      runqlat      tcpretrans
btrfsslower   funclatency     offwaketime     runqlen      tcpsubnet
cachestat     funcslower      oomkill         runqslower   tcptop
cachetop      gethostlatency  opensnoop       shmsnoop     tcptracer
capable       hardirqs        perlcalls       slabratetop  tplist
cobjnew       javacalls       perlflow        sofdsnoop    trace
cpudist       javaflow        perlstat        softirqs     ttysnoop
cpuunclaimed  javagc          phpcalls        solisten     vfscount
dbslower      javaobjnew      phpflow         sslsniff     vfsstat
dbstat        javastat        phpstat         stackcount   wakeuptime
dcsnoop       javathreads     pidpersec       statsnoop    xfsdist
dcstat        killsnoop       profile         syncsnoop    xfsslower
deadlock      lib             pythoncalls     syscount
deadlock.c    llcstat         pythonflow      tclcalls
doc           mdflush         pythongc        tclflow  

bcc-tools Framework

Before we dive into the different types of tools that are included in bcc-tools, it’s important to note a few things:

  • All of these tools live in /usr/share/bcc/tools.
  • These tools must run as the root user as any eBPF program can read kernel data. As such, injecting eBPF bytecode as a regular user is not allowed in RHEL 8.1.
  • Each tool has a man page. To view the man page, run man <tool name>. These man pages include descriptions of the tools, provide the options that can be called, and have information on the expected overhead of the specific tool.

Since there are a lot of tools in bcc-tools, I’m going to divide the tools into the following classes and then we’ll dive into each class:

  • The Snoops
  • Latency Detectors
  • Slower
  • Top Up with bcc-tools
  • Java/Perl/Python/Ruby
