Using cgroups to limit CPU utilization
Intro to cgroups
Cgroups (control groups) make it possible to allocate system resources such as CPU time, memory, disk I/O, and network bandwidth, or combinations of them, among groups of tasks (processes) running on a system.
The following commands list the available subsystems (resource controllers) for cgroups and where their hierarchies are mounted. Each subsystem exposes a set of tunables that control how its resource is allocated.
$ lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids
rdma /sys/fs/cgroup/rdma
$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
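Each process belongs to exactly one cgroup in every mounted hierarchy. The membership of a process can be inspected through /proc; for example, for the current shell (the paths in the output depend on the system):
$ cat /proc/self/cgroup
Each line of the output has the form hierarchy-ID:controller-list:cgroup-path.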
CPU subsystem and tunables
Ceiling enforcement parameters
cpu.cfs_period_us
specifies a period of time in microseconds (µs, represented here as “us”) for how regularly a cgroup’s access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of the cpu.cfs_period_us parameter is 1 second and the lower limit is 1000 microseconds.
cpu.cfs_quota_us
specifies the total amount of time in microseconds (µs, represented here as “us”) for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000.
Setting the value in cpu.cfs_quota_us to -1 indicates that the cgroup does not adhere to any CPU time restrictions. This is also the default value for every cgroup (except the root cgroup).
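Both files live directly in the cgroup’s directory under the mounted cpu hierarchy, so they can also be set without any helper tools. A minimal sketch, assuming a cgroup named demo has already been created under /sys/fs/cgroup/cpu (the name and mount point are examples):
$ echo 1000000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
$ echo 200000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
With these values, the tasks in demo may run for at most 0.2 seconds of CPU time in every 1-second period, i.e. 20% of a single CPU.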
Relative shares parameter
cpu.shares
contains an integer value that specifies a relative share of CPU time available to the tasks in a cgroup. For example, tasks in two cgroups that have cpu.shares set to 100 will receive equal CPU time, but tasks in a cgroup that has cpu.shares set to 200 receive twice the CPU time of tasks in a cgroup where cpu.shares is set to 100. The value specified in the cpu.shares file must be 2 or higher.
Note that shares of CPU time are distributed across all CPU cores on multi-core systems. Even if a cgroup is limited to less than 100% of total CPU time on a multi-core system, it may still use 100% of each individual CPU core.
Using relative shares to specify CPU access has two implications for resource management that should be considered:
Because the CFS does not demand equal usage of CPU, it is hard to predict how much CPU time a cgroup will be allowed to utilize. When tasks in one cgroup are idle and are not using any CPU time, the leftover time is collected in a global pool of unused CPU cycles. Other cgroups are allowed to borrow CPU cycles from this pool.
The actual amount of CPU time that is available to a cgroup can vary depending on the number of cgroups that exist on the system. If a cgroup has a relative share of 1000 and two other cgroups have a relative share of 500, the first cgroup receives 50% of all CPU time in cases when processes in all cgroups attempt to use 100% of the CPU. However, if another cgroup is added with a relative share of 1000, the first cgroup is only allowed 33% of the CPU (the other cgroups receive roughly 16.7%, 16.7%, and 33% of the CPU time).
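As a sketch of how shares interact, two cgroups competing for the same saturated CPUs would split CPU time 2:1 if configured as follows (using the libcgroup tools introduced in the next section; the cgroup names are examples):
$ cgcreate -g cpu:/high -g cpu:/low
$ cgset -r cpu.shares=1024 high
$ cgset -r cpu.shares=512 low
Remember that this ratio only takes effect when both groups are actually contending for CPU time; an idle group’s unused cycles remain available to the other.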
Using libcgroup tools
Install the libcgroup packages to manage cgroups:
$ yum install libcgroup libcgroup-tools
List the cgroups:
$ lscgroup
hugetlb:/
cpu,cpuacct:/
cpuset:/
blkio:/
memory:/
freezer:/
net_cls,net_prio:/
pids:/
rdma:/
perf_event:/
devices:/
devices:/system.slice
devices:/system.slice/irqbalance.service
devices:/system.slice/systemd-udevd.service
devices:/system.slice/polkit.service
devices:/system.slice/chronyd.service
devices:/system.slice/auditd.service
devices:/system.slice/tuned.service
devices:/system.slice/systemd-journald.service
devices:/system.slice/sshd.service
devices:/system.slice/crond.service
devices:/system.slice/NetworkManager.service
devices:/system.slice/rsyslog.service
devices:/system.slice/abrtd.service
devices:/system.slice/lvm2-lvmetad.service
devices:/system.slice/postfix.service
devices:/system.slice/dbus.service
devices:/system.slice/system-getty.slice
devices:/system.slice/systemd-logind.service
devices:/system.slice/abrt-oops.service
$ ls /sys/fs/cgroup
blkio cpuacct cpuset freezer memory net_cls,net_prio perf_event rdma
cpu cpu,cpuacct devices hugetlb net_cls net_prio pids systemd
Create the cgroup:
$ cgcreate -g cpu:/cpulimited
$ lscgroup | grep cpulimited
cpu,cpuacct:/cpulimited
$ ls cpulimited/
cgroup.clone_children cpuacct.usage_percpu cpu.cfs_period_us cpu.stat
cgroup.procs cpuacct.usage_percpu_sys cpu.cfs_quota_us notify_on_release
cpuacct.stat cpuacct.usage_percpu_user cpu.rt_period_us tasks
cpuacct.usage cpuacct.usage_sys cpu.rt_runtime_us
cpuacct.usage_all cpuacct.usage_user cpu.shares
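cgcreate is essentially a convenience wrapper: it creates a directory with the given name under the mounted hierarchy, and the kernel populates it with the controller’s tunable files shown above. A roughly equivalent manual step, assuming the mount layout shown earlier, would be:
$ mkdir /sys/fs/cgroup/cpu/cpulimited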
Limit CPU utilization by percentage:
$ lscpu | grep ^CPU\(s\):
CPU(s): 96
$ cgset -r cpu.cfs_quota_us=200000 cpulimited
Check the cgroup settings:
$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000
$ cgget -g cpu:cpulimited
cpulimited:
cpu.cfs_period_us: 100000
cpu.stat: nr_periods 2
nr_throttled 0
throttled_time 0
cpu.shares: 1024
cpu.cfs_quota_us: 200000
cpu.rt_runtime_us: 0
cpu.rt_period_us: 1000000
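cgset and cgget simply write to and read from the files inside the cgroup directory, so the same settings can be changed and checked directly; for example (paths assume the mount layout shown earlier):
$ echo 200000 > /sys/fs/cgroup/cpu/cpulimited/cpu.cfs_quota_us
$ cat /sys/fs/cgroup/cpu/cpulimited/cpu.cfs_quota_us
200000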
Delete the cgroup:
$ cgdelete cpu,cpuacct:/cpulimited
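cgdelete removes the cgroup’s directory from the hierarchy. Provided the cgroup no longer contains tasks or child cgroups, the low-level equivalent is simply:
$ rmdir /sys/fs/cgroup/cpu/cpulimited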
Verify the CPU utilization with a fio workload
Create a fio job file:
$ cat burn_cpu.job
[burn_cpu]
# Don't transfer any data, just burn CPU cycles
ioengine=cpuio
# Stress the CPU at 100%
cpuload=100
# Make 4 clones of the job
numjobs=4
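As a side note, the same workload can be expressed directly on the fio command line instead of a job file:
$ fio --name=burn_cpu --ioengine=cpuio --cpuload=100 --numjobs=4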
Run the fio jobs without a CPU limit (recreate the cpulimited cgroup with cgcreate if it was deleted above; a newly created cgroup has the default quota of -1):
$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: -1
$ cgexec -g cpu:cpulimited fio burn_cpu.job
Check the CPU usage:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13775 root 20 0 1079912 4016 2404 R 100.0 0.0 0:11.65 fio
13776 root 20 0 1079916 4004 2392 R 100.0 0.0 0:11.65 fio
13777 root 20 0 1079920 4004 2392 R 100.0 0.0 0:11.65 fio
13778 root 20 0 1079924 4004 2392 R 100.0 0.0 0:11.65 fio
The CPU utilization is 400% for the 4 fio jobs when no CPU limit is set. Note that with 96 CPUs there is 9600% of CPU bandwidth available in total.
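cgexec starts a command inside the given cgroup. An already-running process can instead be moved into the cgroup, either with cgclassify -g cpu:cpulimited <PID> or by writing its PID into the cgroup’s tasks file. For example, placing the current shell into the cgroup before launching fio (path assumes the mount layout shown earlier):
$ echo $$ > /sys/fs/cgroup/cpu/cpulimited/tasks
$ fio burn_cpu.job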
Limit the CPU utilization to 200%:
$ cgset -r cpu.cfs_quota_us=200000 cpulimited
$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000
Run the fio jobs again:
$ cgexec -g cpu:cpulimited fio burn_cpu.job
Check the CPU usage:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12908 root 20 0 1079916 3948 2336 R 50.3 0.0 0:06.91 fio
12909 root 20 0 1079920 3948 2336 R 50.0 0.0 0:06.88 fio
12910 root 20 0 1079924 3948 2336 R 50.0 0.0 0:06.93 fio
12907 root 20 0 1079912 3948 2336 R 49.3 0.0 0:06.86 fio
With the limit in place, the total CPU utilization of the 4 fio jobs is 200% (cpu.cfs_quota_us 200000 / cpu.cfs_period_us 100000 = 2 CPUs’ worth of time), about 50% per job.
Check which CPU cores the processes are running on:
$ mpstat -P ALL 5 | awk '{if ($3=="CPU" || $NF<99)print;}'
12:40:32 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:40:37 AM all 2.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 97.89
12:40:37 AM 0 20.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 79.48
12:40:37 AM 1 50.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.40
12:40:37 AM 2 50.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.90
12:40:37 AM 24 29.74 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 70.26
12:40:37 AM 87 50.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.80
12:40:37 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:40:42 AM all 2.11 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 97.88
12:40:42 AM 0 11.49 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 88.31
12:40:42 AM 1 50.60 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.40
12:40:42 AM 2 50.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.70
12:40:42 AM 24 38.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 61.03
12:40:42 AM 87 49.90 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.10
The 4 fio jobs are spread across 5 CPU cores with a total utilization of 200%. This shows that the quota limits the total CPU utilization across all CPU cores; it does not restrict how many, or which, cores the tasks may run on.
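If restricting which cores the tasks may run on is also desired, the cpuset controller listed earlier can be combined with the cpu quota. A minimal sketch (the cgroup name, core range, and memory node are examples; cpuset.cpus and cpuset.mems must both be set before tasks can be attached):
$ cgcreate -g cpuset:/pinned
$ cgset -r cpuset.cpus=0-1 pinned
$ cgset -r cpuset.mems=0 pinned
$ cgexec -g cpuset:pinned -g cpu:cpulimited fio burn_cpu.job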
References
- https://docs.kernel.org/admin-guide/cgroup-v1/cgroups.html
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/chap-using_libcgroup_tools
- https://scoutapm.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups