What is Lambda Architecture

When working with very large data sets, it can take a long time to run the sort of queries that clients need. These queries can’t be performed in real time, and often require algorithms such as MapReduce that operate in parallel across the entire data set. The results are then stored separately from the raw data and used for querying.

One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old. Ideally, you would like to get some results in real time (perhaps with some loss of accuracy), and combine these results with the results from the batch analytics.

The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data flow. All data coming into the system goes through these two paths:

  • A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
  • A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.

The batch layer feeds into a serving layer that indexes the batch view for efficient querying. The speed layer updates the serving layer with incremental updates based on the most recent data.

[Image: lambda architecture diagram showing the batch (cold) and speed (hot) paths feeding a serving layer]

Data that flows into the hot path is constrained by latency requirements imposed by the speed layer, so that it can be processed as quickly as possible. Often, this requires a tradeoff of some level of accuracy in favor of data that is ready as quickly as possible. For example, consider an IoT scenario where a large number of temperature sensors are sending telemetry data. The speed layer may be used to process a sliding time window of the incoming data.
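As a rough illustration, the hot path might maintain such a sliding window with a query like the following SQL sketch (the table and column names are hypothetical):

-- hot path: aggregate only the most recent window of telemetry
SELECT sensor_id,
       avg(temperature) AS avg_temp_last_5m,
       count(*)         AS samples
FROM sensor_readings
WHERE event_time > now() - INTERVAL '5 minutes'
GROUP BY sensor_id;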

Data flowing into the cold path, on the other hand, is not subject to the same low latency requirements. This allows for high accuracy computation across large data sets, which can be very time intensive.

Eventually, the hot and cold paths converge at the analytics client application. If the client needs to display timely, yet potentially less accurate data in real time, it will acquire its result from the hot path. Otherwise, it will select results from the cold path to display less timely but more accurate data. In other words, the hot path has data for a relatively small window of time, after which the results can be updated with more accurate data from the cold path.

The raw data stored at the batch layer is immutable. Incoming data is always appended to the existing data, and the previous data is never overwritten. Any changes to the value of a particular datum are stored as a new timestamped event record. This allows for recomputation at any point in time across the history of the data collected. The ability to recompute the batch view from the original raw data is important, because it allows for new views to be created as the system evolves.
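As a minimal sketch of this append-only model (table and column names are hypothetical), an update is simply a new timestamped event, and a batch view can be recomputed from the full history at any time:

-- cold path: every change is appended as a new event; nothing is overwritten
INSERT INTO sensor_events (sensor_id, event_time, temperature)
VALUES ('sensor-42', now(), 21.7);

-- a batch view recomputed from the raw events
SELECT sensor_id,
       avg(temperature) AS avg_temp,
       max(event_time)  AS last_seen
FROM sensor_events
GROUP BY sensor_id;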

Batch Layer

New data arrives continuously as a feed to the data system and is fed to the batch layer and the speed layer simultaneously. The batch layer looks at all of the data at once and eventually corrects the approximate results produced in the stream layer. This is typically where ETL jobs and a traditional data warehouse live. The layer runs on a predefined schedule, usually once or twice a day. The batch layer has two very important functions:

  • To manage the master dataset
  • To pre-compute the batch views.

Serving Layer

The outputs from the batch layer (batch views) and from the speed layer (near-real-time views) are forwarded to the serving layer. This layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis.

Speed Layer (Stream Layer)

This layer handles data that has not yet been delivered to the batch view because of the latency of the batch layer. It deals only with recent data, creating real-time views that, combined with the batch views, give the user a complete view of the data.

Benefits of lambda architectures

Here are the main benefits of lambda architectures:

  • No Server Management – you do not have to install, maintain, or administer any software.
  • Flexible Scaling – your application can be scaled automatically, or by adjusting its capacity.
  • Automated High Availability – serverless applications have built-in availability and fault tolerance, so every request is guaranteed a response indicating whether it succeeded.
  • Business Agility – react in real time to changing business and market scenarios.

Challenges with lambda architectures

  • Complexity – lambda architectures can be highly complex. Administrators must typically maintain two separate code bases for batch and streaming layers, which can make debugging difficult.


Intro

Cockroach uses TPC-C as its official OLTP workload benchmark since it is a more realistic measurement, modeling real-world applications.

However, sysbench is a straightforward throughput/latency benchmarking tool. It is a scriptable, multi-threaded benchmark tool based on LuaJIT, most frequently used for database benchmarks, but it can also be used to create arbitrarily complex workloads that do not involve a database server. It ships with the following tests (a quick example of the non-database tests follows the list):

  • oltp_*.lua: a collection of OLTP-like database benchmarks
  • fileio: a filesystem-level benchmark
  • cpu: a simple CPU benchmark
  • memory: a memory access benchmark
  • threads: a thread-based scheduler benchmark
  • mutex: a POSIX mutex benchmark
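For example, the non-database tests can be run directly with the general options; the thread count and duration below are illustrative:

[root@node0 ~]# sysbench cpu --threads=4 --time=10 run
[root@node0 ~]# sysbench memory --threads=4 --time=10 run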

This Cockroach blog explains why we use sysbench for competitive benchmarks.

Sysbench contains a collection of simple SQL workloads. These workloads perform low-level SQL operations; for example, they run concurrent INSERT or UPDATE statements on rows as fast as possible. This gives us a picture of system performance under different access patterns. Unlike TPC-C, sysbench does not attempt to model a real application.

Sysbench includes the following workloads:

  • oltp_point_select: single-row point selects
  • oltp_insert: single-row inserts
  • oltp_delete: single-row deletes
  • oltp_update_index: single-row update on a column that requires an update to a secondary index
  • oltp_update_non_index: single-row update on a column that does not require an update to a secondary index
  • oltp_read_only: transactions that run a collection of small scans
  • oltp_read_write: transactions that run a collection of small scans and writes
  • oltp_write_only: transactions that run a collection of writes

Sysbench supports the following two database drivers.

  • mysql - MySQL driver
  • pgsql - PostgreSQL driver

As noted in this Cockroach blog, CockroachDB is compatible with PostgreSQL.

In this article, we will explore how to run sysbench against CockroachDB using the pgsql driver.

Install the CockroachDB cluster

Refer to this post on how to deploy a CockroachDB cluster.

[root@crdb_node1 ~]# cockroach version
Build Tag:        v22.1.6
Build Time:       2022/08/23 17:05:04
Distribution:     CCL
Platform:         linux amd64 (x86_64-pc-linux-gnu)
Go Version:       go1.17.11
C Compiler:       gcc 6.5.0
Build Commit ID:  760a8253ae6478d69da0330133e3efec8e950e4e
Build Type:       release

Interact with CockroachDB

Use the CockroachDB built-in client

CockroachDB comes with a built-in client for executing SQL statements from an interactive shell or directly from the command line.

To use this client, run the cockroach sql command as follows:

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e 'create database testdb'

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e 'show databases'
  database_name | owner | primary_region | regions | survival_goal
----------------+-------+----------------+---------+----------------
  defaultdb     | root  | NULL           | {}      | NULL
  postgres      | root  | NULL           | {}      | NULL
  system        | node  | NULL           | {}      | NULL
  testdb        | root  | NULL           | {}      | NULL
(4 rows)

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e 'show tables from testdb'
SHOW TABLES 0

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e 'drop database testdb'

Use the Postgres client

CockroachDB can also be accessed using the Postgres client.

To install the Postgres client:

[root@node0 ~]# yum -y install postgresql postgresql-libs
Installed:
  postgresql.x86_64 0:9.2.24-8.el7_9

To create the database and user:

[root@node0 ~]# psql -h node1 -U root -p 26257
psql (9.2.24, server 13.0.0)

root=> create database testdb;
root=> create user tester;
root=> grant all on database testdb to tester;

root=> show databases;
 database_name | owner | primary_region | regions | survival_goal
---------------+-------+----------------+---------+---------------
 defaultdb     | root  |                | {}      |
 postgres      | root  |                | {}      |
 system        | node  |                | {}      |
 testdb        | root  |                | {}      |
(4 rows)

root=> show users;
 username | options | member_of
----------+---------+-----------
 admin    |         | {}
 root     |         | {admin}
 tester   |         | {}
(3 rows)

root=> \c testdb;
psql (9.2.24, server 13.0.0)
WARNING: psql version 9.2, server version 13.0.
         Some psql features might not work.
You are now connected to database "testdb" as user "root".

testdb=> \dt;
                          List of relations
    Schema    |       Name        | Type  |          Owner
--------------+-------------------+-------+--------------------------
 pg_extension | geography_columns | table | unknown (OID=3233629770)
 pg_extension | geometry_columns  | table | unknown (OID=3233629770)
 pg_extension | spatial_ref_sys   | table | unknown (OID=3233629770)
(3 rows)

testdb=>  SELECT * FROM pg_catalog.pg_tables where schemaname != 'pg_catalog' AND schemaname != 'information_schema' and schemaname != 'crdb_internal';
  schemaname  |     tablename     | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity
--------------+-------------------+------------+------------+------------+----------+-------------+-------------
 pg_extension | geography_columns | node       |            | f          | f        | f           | f
 pg_extension | geometry_columns  | node       |            | f          | f        | f           | f
 pg_extension | spatial_ref_sys   | node       |            | f          | f        | f           | f
(3 rows)

Install sysbench

To install sysbench:

[root@node0 ~]# curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
[root@node0 ~]# sudo yum -y install sysbench
[root@node0 ~]# sysbench --version
sysbench 1.0.20

To get familiar with the sysbench parameters:

[root@node0 ~]# sysbench --help
Usage:
  sysbench [options]... [testname] [command]

Commands implemented by most tests: prepare run cleanup help

General options:
  --threads=N                     number of threads to use [1]
  --events=N                      limit for total number of events [0]
  --time=N                        limit for total execution time in seconds [10]
  --forced-shutdown=STRING        number of seconds to wait after the --time limit before forcing shutdown, or 'off' to disable [off]
  --thread-stack-size=SIZE        size of stack per thread [64K]
  --rate=N                        average transactions rate. 0 for unlimited rate [0]
  --report-interval=N             periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]
  --report-checkpoints=[LIST,...] dump full statistics and reset all counters at specified points in time. The argument is a list of comma-separated values representing the amount of time in seconds elapsed from start of test when report checkpoint(s) must be performed. Report checkpoints are off by default. []
  --debug[=on|off]                print more debugging info [off]
  --validate[=on|off]             perform validation checks where possible [off]
  --help[=on|off]                 print help and exit [off]
  --version[=on|off]              print version and exit [off]
  --config-file=FILENAME          File containing command line options
  --tx-rate=N                     deprecated alias for --rate [0]
  --max-requests=N                deprecated alias for --events [0]
  --max-time=N                    deprecated alias for --time [0]
  --num-threads=N                 deprecated alias for --threads [1]

Pseudo-Random Numbers Generator options:
  --rand-type=STRING random numbers distribution {uniform,gaussian,special,pareto} [special]
  --rand-spec-iter=N number of iterations used for numbers generation [12]
  --rand-spec-pct=N  percentage of values to be treated as 'special' (for special distribution) [1]
  --rand-spec-res=N  percentage of 'special' values to use (for special distribution) [75]
  --rand-seed=N      seed for random number generator. When 0, the current time is used as a RNG seed. [0]
  --rand-pareto-h=N  parameter h for pareto distribution [0.2]

Log options:
  --verbosity=N verbosity level {5 - debug, 0 - only critical messages} [3]

  --percentile=N       percentile to calculate in latency statistics (1-100). Use the special value of 0 to disable percentile calculations [95]
  --histogram[=on|off] print latency histogram in report [off]

General database options:

  --db-driver=STRING  specifies database driver to use ('help' to get list of available drivers) [mysql]
  --db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]
  --db-debug[=on|off] print database-specific debug information [off]


Compiled-in database drivers:
  mysql - MySQL driver
  pgsql - PostgreSQL driver

mysql options:
  --mysql-host=[LIST,...]          MySQL server host [localhost]
  --mysql-port=[LIST,...]          MySQL server port [3306]
  --mysql-socket=[LIST,...]        MySQL socket
  --mysql-user=STRING              MySQL user [sbtest]
  --mysql-password=STRING          MySQL password []
  --mysql-db=STRING                MySQL database name [sbtest]
  --mysql-ssl[=on|off]             use SSL connections, if available in the client library [off]
  --mysql-ssl-cipher=STRING        use specific cipher for SSL connections []
  --mysql-compression[=on|off]     use compression, if available in the client library [off]
  --mysql-debug[=on|off]           trace all client library calls [off]
  --mysql-ignore-errors=[LIST,...] list of errors to ignore, or "all" [1213,1020,1205]
  --mysql-dry-run[=on|off]         Dry run, pretend that all MySQL client API calls are successful without executing them [off]

pgsql options:
  --pgsql-host=STRING     PostgreSQL server host [localhost]
  --pgsql-port=N          PostgreSQL server port [5432]
  --pgsql-user=STRING     PostgreSQL user [sbtest]
  --pgsql-password=STRING PostgreSQL password []
  --pgsql-db=STRING       PostgreSQL database name [sbtest]

Compiled-in tests:
  fileio - File I/O test
  cpu - CPU performance test
  memory - Memory functions speed test
  threads - Threads subsystem performance test
  mutex - Mutex performance test

See 'sysbench <testname> help' for a list of options for each test.

Sysbench includes the following lua scripts to simulate OLTP workloads.

[root@node0 ~]# ls -l /usr/share/sysbench/
total 60
-rwxr-xr-x   1 root root  1452 Apr 24  2020 bulk_insert.lua
-rw-r--r--   1 root root 14369 Apr 24  2020 oltp_common.lua
-rwxr-xr-x   1 root root  1290 Apr 24  2020 oltp_delete.lua
-rwxr-xr-x   1 root root  2415 Apr 24  2020 oltp_insert.lua
-rwxr-xr-x   1 root root  1265 Apr 24  2020 oltp_point_select.lua
-rwxr-xr-x   1 root root  1649 Apr 24  2020 oltp_read_only.lua
-rwxr-xr-x   1 root root  1824 Apr 24  2020 oltp_read_write.lua
-rwxr-xr-x   1 root root  1118 Apr 24  2020 oltp_update_index.lua
-rwxr-xr-x   1 root root  1127 Apr 24  2020 oltp_update_non_index.lua
-rwxr-xr-x   1 root root  1440 Apr 24  2020 oltp_write_only.lua
-rwxr-xr-x   1 root root  1919 Apr 24  2020 select_random_points.lua
-rwxr-xr-x   1 root root  2118 Apr 24  2020 select_random_ranges.lua
drwxr-xr-x   4 root root    49 Dec  8 20:39 tests

[root@node0 ~]# ls -l /usr/share/sysbench/tests/include/oltp_legacy/
total 52
-rw-r--r-- 1 root root 1195 Apr 24  2020 bulk_insert.lua
-rw-r--r-- 1 root root 4696 Apr 24  2020 common.lua
-rw-r--r-- 1 root root  366 Apr 24  2020 delete.lua
-rw-r--r-- 1 root root 1171 Apr 24  2020 insert.lua
-rw-r--r-- 1 root root 3004 Apr 24  2020 oltp.lua
-rw-r--r-- 1 root root  368 Apr 24  2020 oltp_simple.lua
-rw-r--r-- 1 root root  527 Apr 24  2020 parallel_prepare.lua
-rw-r--r-- 1 root root  369 Apr 24  2020 select.lua
-rw-r--r-- 1 root root 1448 Apr 24  2020 select_random_points.lua
-rw-r--r-- 1 root root 1556 Apr 24  2020 select_random_ranges.lua
-rw-r--r-- 1 root root  369 Apr 24  2020 update_index.lua
-rw-r--r-- 1 root root  578 Apr 24  2020 update_non_index.lua

To get more options for each specific lua workload:

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_insert.lua help
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

oltp_insert.lua options:
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --range_size=N                Range size for range SELECT queries [100]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --tables=N                    Number of tables [1]

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua help
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

oltp_write_only.lua options:
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --range_size=N                Range size for range SELECT queries [100]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --tables=N                    Number of tables [1]

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_read_only.lua help
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

oltp_read_only.lua options:
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --range_size=N                Range size for range SELECT queries [100]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --tables=N                    Number of tables [1]

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_point_select.lua help
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

oltp_point_select.lua options:
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --range_size=N                Range size for range SELECT queries [100]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --tables=N                    Number of tables [1]

Prepare for the benchmark

Before running the benchmark, the database should be created and the data tables should be populated.

To create the database:

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e 'create database testdb;'

To populate the database:

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua --pgsql-host=node1 --pgsql-port=26257 --pgsql-db=testdb --pgsql-user=tester --pgsql-password= --table_size=100000 --tables=24 --threads=1 --db-driver=pgsql prepare
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
Creating table 'sbtest2'...
Inserting 100000 records into 'sbtest2'
Creating a secondary index on 'sbtest2'...
Creating table 'sbtest3'...
Inserting 100000 records into 'sbtest3'
Creating a secondary index on 'sbtest3'...
Creating table 'sbtest4'...
Inserting 100000 records into 'sbtest4'
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 100000 records into 'sbtest5'
Creating a secondary index on 'sbtest5'...
Creating table 'sbtest6'...
Inserting 100000 records into 'sbtest6'
Creating a secondary index on 'sbtest6'...
Creating table 'sbtest7'...
Inserting 100000 records into 'sbtest7'
Creating a secondary index on 'sbtest7'...
Creating table 'sbtest8'...
Inserting 100000 records into 'sbtest8'
Creating a secondary index on 'sbtest8'...
Creating table 'sbtest9'...
Inserting 100000 records into 'sbtest9'
Creating a secondary index on 'sbtest9'...
Creating table 'sbtest10'...
Inserting 100000 records into 'sbtest10'
Creating a secondary index on 'sbtest10'...
Creating table 'sbtest11'...
Inserting 100000 records into 'sbtest11'
Creating a secondary index on 'sbtest11'...
Creating table 'sbtest12'...
Inserting 100000 records into 'sbtest12'
Creating a secondary index on 'sbtest12'...
Creating table 'sbtest13'...
Inserting 100000 records into 'sbtest13'
Creating a secondary index on 'sbtest13'...
Creating table 'sbtest14'...
Inserting 100000 records into 'sbtest14'
Creating a secondary index on 'sbtest14'...
Creating table 'sbtest15'...
Inserting 100000 records into 'sbtest15'
Creating a secondary index on 'sbtest15'...
Creating table 'sbtest16'...
Inserting 100000 records into 'sbtest16'
Creating a secondary index on 'sbtest16'...
Creating table 'sbtest17'...
Inserting 100000 records into 'sbtest17'
Creating a secondary index on 'sbtest17'...
Creating table 'sbtest18'...
Inserting 100000 records into 'sbtest18'
Creating a secondary index on 'sbtest18'...
Creating table 'sbtest19'...
Inserting 100000 records into 'sbtest19'
Creating a secondary index on 'sbtest19'...
Creating table 'sbtest20'...
Inserting 100000 records into 'sbtest20'
Creating a secondary index on 'sbtest20'...
Creating table 'sbtest21'...
Inserting 100000 records into 'sbtest21'
Creating a secondary index on 'sbtest21'...
Creating table 'sbtest22'...
Inserting 100000 records into 'sbtest22'
Creating a secondary index on 'sbtest22'...
Creating table 'sbtest23'...
Inserting 100000 records into 'sbtest23'
Creating a secondary index on 'sbtest23'...
Creating table 'sbtest24'...
Inserting 100000 records into 'sbtest24'

testdb=>  show tables from testdb;
 schema_name | table_name | type  | owner  | estimated_row_count | locality
-------------+------------+-------+--------+---------------------+----------
 public      | sbtest1    | table | tester |              100000 |
 public      | sbtest10   | table | tester |              100000 |
 public      | sbtest11   | table | tester |              100000 |
 public      | sbtest12   | table | tester |              100000 |
 public      | sbtest13   | table | tester |              100000 |
 public      | sbtest14   | table | tester |              100000 |
 public      | sbtest15   | table | tester |              100000 |
 public      | sbtest16   | table | tester |              100000 |
 public      | sbtest17   | table | tester |              100000 |
 public      | sbtest18   | table | tester |              100000 |
 public      | sbtest19   | table | tester |              100000 |
 public      | sbtest2    | table | tester |              100000 |
 public      | sbtest20   | table | tester |              100000 |
 public      | sbtest21   | table | tester |              100000 |
 public      | sbtest22   | table | tester |              100000 |
 public      | sbtest23   | table | tester |              100000 |
 public      | sbtest24   | table | tester |              100000 |
 public      | sbtest3    | table | tester |              100000 |
 public      | sbtest4    | table | tester |              100000 |
 public      | sbtest5    | table | tester |              100000 |
 public      | sbtest6    | table | tester |              100000 |
 public      | sbtest7    | table | tester |              100000 |
 public      | sbtest8    | table | tester |              100000 |
 public      | sbtest9    | table | tester |              100000 |
(24 rows)

testdb=> select sum(range_size)/1000 from crdb_internal.ranges where database_name='testdb';
       ?column?
-----------------------
 602418.53400000000000
(1 row)

To populate the database with 2 or more threads:

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua --pgsql-host=node1 --pgsql-port=26257 --pgsql-db=testdb --pgsql-user=root --pgsql-password= --table_size=1000000 --tables=24 --threads=2 --db-driver=pgsql prepare
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'
Inserting 1000000 records into 'sbtest2'
Creating a secondary index on 'sbtest2'...
Creating table 'sbtest4'...
Inserting 1000000 records into 'sbtest4'
Creating a secondary index on 'sbtest1'...
Creating table 'sbtest3'...
Inserting 1000000 records into 'sbtest3'
Creating a secondary index on 'sbtest4'...
Creating a secondary index on 'sbtest3'...
Creating table 'sbtest6'...
Inserting 1000000 records into 'sbtest6'
Creating table 'sbtest5'...
Inserting 1000000 records into 'sbtest5'
Creating a secondary index on 'sbtest5'...
Creating a secondary index on 'sbtest6'...
Creating table 'sbtest7'...
Inserting 1000000 records into 'sbtest7'
Creating table 'sbtest8'...
Inserting 1000000 records into 'sbtest8'
Creating a secondary index on 'sbtest7'...
Creating a secondary index on 'sbtest8'...
Creating table 'sbtest9'...
Inserting 1000000 records into 'sbtest9'
Creating table 'sbtest10'...
Inserting 1000000 records into 'sbtest10'
Creating a secondary index on 'sbtest10'...
Creating table 'sbtest12'...
Inserting 1000000 records into 'sbtest12'
Creating a secondary index on 'sbtest9'...
Creating table 'sbtest11'...
Inserting 1000000 records into 'sbtest11'
Creating a secondary index on 'sbtest12'...
Creating table 'sbtest14'...
Inserting 1000000 records into 'sbtest14'
Creating a secondary index on 'sbtest11'...
Creating table 'sbtest13'...
Inserting 1000000 records into 'sbtest13'
Creating a secondary index on 'sbtest14'...
Creating table 'sbtest16'...
Inserting 1000000 records into 'sbtest16'
Creating a secondary index on 'sbtest13'...
Creating table 'sbtest15'...
Inserting 1000000 records into 'sbtest15'
Creating a secondary index on 'sbtest16'...
Creating table 'sbtest18'...
Inserting 1000000 records into 'sbtest18'
Creating a secondary index on 'sbtest15'...
Creating table 'sbtest17'...
Inserting 1000000 records into 'sbtest17'
Creating a secondary index on 'sbtest18'...
Creating table 'sbtest20'...
Inserting 1000000 records into 'sbtest20'
Creating a secondary index on 'sbtest17'...
Creating table 'sbtest19'...
Inserting 1000000 records into 'sbtest19'
Creating a secondary index on 'sbtest20'...
Creating table 'sbtest22'...
Inserting 1000000 records into 'sbtest22'
Creating a secondary index on 'sbtest19'...
Creating table 'sbtest21'...
Inserting 1000000 records into 'sbtest21'
Creating a secondary index on 'sbtest22'...
Creating table 'sbtest24'...
Inserting 1000000 records into 'sbtest24'
Creating a secondary index on 'sbtest21'...
Creating table 'sbtest23'...
Inserting 1000000 records into 'sbtest23'
Creating a secondary index on 'sbtest24'...
Creating a secondary index on 'sbtest23'...

Run sysbench

Now we can run sysbench against the populated database. We can increase the number of threads to put more load on the database and find the system's performance limit (a higher-thread example is shown after the single-threaded run below).

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua --pgsql-host=node1 --pgsql-port=26257 --pgsql-db=testdb --pgsql-user=tester --pgsql-password= --table_size=100000 --tables=24 --threads=1 --time=60 --report-interval=10 --db-driver=pgsql run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Report intermediate results every 10 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 10s ] thds: 1 tps: 80.86 qps: 485.66 (r/w/o: 0.00/84.66/401.00) lat (ms,95%): 20.37 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 1 tps: 82.60 qps: 495.41 (r/w/o: 0.00/92.00/403.41) lat (ms,95%): 20.74 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 1 tps: 81.00 qps: 485.79 (r/w/o: 0.00/94.10/391.69) lat (ms,95%): 20.74 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 1 tps: 81.00 qps: 486.21 (r/w/o: 0.00/102.90/383.31) lat (ms,95%): 20.37 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 1 tps: 79.20 qps: 474.90 (r/w/o: 0.00/103.10/371.80) lat (ms,95%): 21.50 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 1 tps: 78.20 qps: 469.39 (r/w/o: 0.00/107.30/362.09) lat (ms,95%): 21.11 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            0
        write:                           5842
        other:                           23138
        total:                           28980
    transactions:                        4830   (80.48 per sec.)
    queries:                             28980  (482.90 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          60.0081s
    total number of events:              4830

Latency (ms):
         min:                                    6.87
         avg:                                   12.42
         max:                                   32.90
         95th percentile:                       21.11
         sum:                                59978.07

Threads fairness:
    events (avg/stddev):           4830.0000/0.00
    execution time (avg/stddev):   59.9781/0.00
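To push the cluster harder, the same run command can be repeated with a larger --threads value; the thread count below is illustrative:

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua --pgsql-host=node1 --pgsql-port=26257 --pgsql-db=testdb --pgsql-user=tester --pgsql-password= --table_size=100000 --tables=24 --threads=32 --time=60 --report-interval=10 --db-driver=pgsql run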

Clean up the database

Once the sysbench run is done, we can clean up the test data as shown below.

[root@node0 ~]# sysbench /usr/share/sysbench/oltp_write_only.lua --pgsql-host=node1 --pgsql-port=26257 --pgsql-db=testdb --pgsql-user=tester --pgsql-password= --table_size=100000 --tables=24 --threads=1 --db-driver=pgsql cleanup
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Dropping table 'sbtest1'...
Dropping table 'sbtest2'...
Dropping table 'sbtest3'...
Dropping table 'sbtest4'...
Dropping table 'sbtest5'...
Dropping table 'sbtest6'...
Dropping table 'sbtest7'...
Dropping table 'sbtest8'...
Dropping table 'sbtest9'...
Dropping table 'sbtest10'...
Dropping table 'sbtest11'...
Dropping table 'sbtest12'...
Dropping table 'sbtest13'...
Dropping table 'sbtest14'...
Dropping table 'sbtest15'...
Dropping table 'sbtest16'...
Dropping table 'sbtest17'...
Dropping table 'sbtest18'...
Dropping table 'sbtest19'...
Dropping table 'sbtest20'...
Dropping table 'sbtest21'...
Dropping table 'sbtest22'...
Dropping table 'sbtest23'...
Dropping table 'sbtest24'...

testdb=> show tables from testdb;
 schema_name | table_name | type | owner | estimated_row_count | locality
-------------+------------+------+-------+---------------------+----------
(0 rows)

[root@node0 ~]# cockroach sql --insecure --host=node1:26257 -e "drop database testdb"


The block I/O (blkio) cgroup controller can specify upper I/O rate limits on block devices.

$ mount | egrep "/cgroup |/blkio"
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

$ lscgroup  | grep blkio
blkio:/

In this post, we will learn how to use the following control files to limit block device bandwidth for user tasks.

  • blkio.throttle.read_bps_device - Specifies upper limit on READ rate from the device. IO rate is specified in bytes per second. Rules are per device.
  • blkio.throttle.read_iops_device - Specifies upper limit on READ rate from the device. IO rate is specified in IO per second. Rules are per device.
  • blkio.throttle.write_bps_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in bytes per second. Rules are per device.
  • blkio.throttle.write_iops_device - Specifies upper limit on WRITE rate to the device. IO rate is specified in io per second. Rules are per device.

Create blkio control group

Install libcgroup package to manage cgroups:

$ yum install libcgroup libcgroup-tools

Create blkio control group:

$ cgcreate -g blkio:/blkiolimited

$ lscgroup | grep blkio
blkio:/
blkio:/blkiolimited

$ ls /sys/fs/cgroup/blkio/blkiolimited/
blkio.bfq.io_service_bytes            blkio.bfq.weight                 blkio.throttle.io_service_bytes_recursive  blkio.throttle.read_iops_device   cgroup.procs
blkio.bfq.io_service_bytes_recursive  blkio.bfq.weight_device          blkio.throttle.io_serviced                 blkio.throttle.write_bps_device   notify_on_release
blkio.bfq.io_serviced                 blkio.reset_stats                blkio.throttle.io_serviced_recursive       blkio.throttle.write_iops_device  tasks
blkio.bfq.io_serviced_recursive       blkio.throttle.io_service_bytes  blkio.throttle.read_bps_device             cgroup.clone_children

Limit the block device bandwidth

Limit block device bandwidth for root group

To specify a bandwidth rate on a particular device for the root group, we write a policy in the format "major:minor bytes_per_second".

The following example puts a 1 MB/s write bandwidth limit on the root group for the device with major/minor number 259:12.

$ ls -la /dev/ | grep "nvme3n1"
brw-rw----   1 root disk    259,  12 Dec 23 21:44 nvme3n1

$ echo "259:12  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device

Limit block device bandwidth for user defined group

The following examples specify an IOPS limit on device 259:12 for the user-defined cgroup "blkio:/blkiolimited".

Use control files to specify the limit directly:

$ cat /sys/fs/cgroup/blkio/blkiolimited/blkio.throttle.write_iops_device
$ echo "259:12  8192" > /sys/fs/cgroup/blkio/blkiolimited/blkio.throttle.write_iops_device
$ cat /sys/fs/cgroup/blkio/blkiolimited/blkio.throttle.write_iops_device
259:12 8192

Use libcgroup tools to specify the limit:

$ cgset -r blkio.throttle.write_iops_device="259:12 8192" blkiolimited
$ cgget -r blkio.throttle.write_iops_device blkiolimited
blkiolimited:
blkio.throttle.write_iops_device: 259:12 8192
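The read-side throttle files work the same way. For example, to also cap reads from the same device at 1 MB/s (the value is illustrative):

$ cgset -r blkio.throttle.read_bps_device="259:12 1048576" blkiolimited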

Verify the disk bandwidth usage

Use fio to write 50G of data in the root group (unlimited bandwidth)

$ fio --blocksize=4k --ioengine=libaio --readwrite=randwrite --filesize=50G --group_reporting --direct=1 --iodepth=128 --end_fsync=1 --name=job1 --filename=/mnt/fio.dat


$ iostat -ktdx 5 | grep "nvme3n1"
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme3n1           0.00     0.05    0.00    8.02     0.01   508.54   126.83     0.06    7.80    0.08    7.80   0.04   0.03
nvme3n1           0.00     0.00    0.00 133828.60     0.00 535314.40     8.00     1.18    0.01    0.00    0.01   0.00  55.64
nvme3n1           0.00     0.20    0.00 246289.20     0.00 985157.60     8.00     2.21    0.01    0.00    0.01   0.00 100.04
nvme3n1           0.00     0.20    0.00 246518.80     0.00 986076.80     8.00     2.23    0.01    0.00    0.01   0.00 100.04
nvme3n1           0.00     0.20    0.00 244115.60     0.00 976462.40     8.00     2.24    0.01    0.00    0.01   0.00 100.00
nvme3n1           0.00     0.20    0.00 240184.00     0.00 960736.80     8.00     2.23    0.01    0.00    0.01   0.00 100.04
nvme3n1           0.00     0.20    0.00 250391.60     0.00 1001567.20     8.00     2.41    0.01    0.00    0.01   0.00 100.04
nvme3n1           0.00     0.20    0.00 262449.60     0.00 1049799.20     8.00     2.53    0.01    0.00    0.01   0.00 100.00
nvme3n1           0.00     0.20    0.00 252171.20     0.00 1008685.60     8.00     2.30    0.01    0.00    0.01   0.00 100.00
nvme3n1           0.00     0.20    0.00 236467.60     0.00 945872.00     8.00     2.16    0.01    0.00    0.01   0.00 100.00
nvme3n1           0.00     0.20    0.00 255060.80     0.00 1020244.00     8.00    17.04    0.07    0.00    0.07   0.00 100.00
nvme3n1           0.00     0.20    0.00 235199.60     0.00 940798.40     8.00    44.76    0.19    0.00    0.19   0.00 100.00
nvme3n1           0.00     0.20    0.00 18768.00     0.00 75072.80     8.00     5.81    0.31    0.00    0.31   0.00   8.40
nvme3n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
^C

Use fio to write 50G of data in the user-defined group (limited bandwidth)

$ cgget -r blkio.throttle.write_iops_device blkiolimited
blkiolimited:
blkio.throttle.write_iops_device: 259:12 8192

$ cgexec -g blkio:blkiolimited fio --blocksize=4k --ioengine=libaio --readwrite=randwrite --filesize=50G --group_reporting --direct=1 --iodepth=128 --end_fsync=1 --name=job1 --filename=/mnt/fio.dat


$ iostat -ktdx 5 | grep "nvme3n1"
nvme3n1           0.00     0.05    0.00   24.54     0.01   574.51    46.82     0.06    2.57    0.08    2.57   0.01   0.04
nvme3n1           0.00     0.00    0.00 3112.20     0.00 12448.80     8.00     0.04    0.01    0.00    0.01   0.01   4.60
nvme3n1           0.00     0.20    0.00 8190.40     0.00 32762.40     8.00     0.11    0.01    0.00    0.01   0.01  11.92
nvme3n1           0.00     0.20    0.00 8190.40     0.00 32762.40     8.00     0.12    0.01    0.00    0.01   0.01  12.02
nvme3n1           0.00     0.20    0.00 8190.40     0.00 32762.40     8.00     0.12    0.01    0.00    0.01   0.01  12.22
nvme3n1           0.00     0.20    0.00 8190.40     0.00 32762.40     8.00     0.12    0.01    0.00    0.01   0.01  12.10
nvme3n1           0.00     0.20    0.00 8190.40     0.00 32762.40     8.00     0.12    0.01    0.00    0.01   0.02  12.36
^C


The memory controller isolates the memory behaviour of a group of tasks from the rest of the system.

$ mount | egrep "/cgroup |/memory"
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)

$ lssubsys -am | grep memory
memory /sys/fs/cgroup/memory

In this post, we will learn how to use the following control files to limit and monitor memory usage for user tasks.

  • memory.usage_in_bytes - show current usage for memory
  • memory.limit_in_bytes - set/show limit of memory usage

Create memory control group

Install libcgroup package to manage cgroups:

$ yum install libcgroup libcgroup-tools

Create memory control group:

$ cgcreate -g memory:/memlimited
$ lscgroup | grep memory
memory:/
memory:/memlimited

$ ls /sys/fs/cgroup/memory/memlimited/
cgroup.clone_children           memory.kmem.slabinfo                memory.memsw.failcnt             memory.soft_limit_in_bytes
cgroup.event_control            memory.kmem.tcp.failcnt             memory.memsw.limit_in_bytes      memory.stat
cgroup.procs                    memory.kmem.tcp.limit_in_bytes      memory.memsw.max_usage_in_bytes  memory.swappiness
memory.failcnt                  memory.kmem.tcp.max_usage_in_bytes  memory.memsw.usage_in_bytes      memory.usage_in_bytes
memory.force_empty              memory.kmem.tcp.usage_in_bytes      memory.move_charge_at_immigrate  memory.use_hierarchy
memory.kmem.failcnt             memory.kmem.usage_in_bytes          memory.numa_stat                 notify_on_release
memory.kmem.limit_in_bytes      memory.limit_in_bytes               memory.oom_control               tasks
memory.kmem.max_usage_in_bytes  memory.max_usage_in_bytes           memory.pressure_level

Limit the memory usage

Using control files directly

$ echo 32G > /sys/fs/cgroup/memory/memlimited/memory.limit_in_bytes
$ cat /sys/fs/cgroup/memory/memlimited/memory.limit_in_bytes
34359738368

Using libcgroup tools

Limit the memory usage:

$ cgset -r memory.limit_in_bytes=32G memlimited
$ cgget -r memory.limit_in_bytes memlimited
memlimited:
memory.limit_in_bytes: 34359738368

Verify the memory usage

Remove the memory limit

$ cgset -r memory.limit_in_bytes=-1 memlimited
$ cgget -r memory.limit_in_bytes memlimited
memlimited:
memory.limit_in_bytes: 9223372036854771712

Use fio to write 50G data:

$ echo 3 > /proc/sys/vm/drop_caches
$ cgexec -g memory:memlimited fio --blocksize=64k --ioengine=libaio --readwrite=write --filesize=50G --group_reporting --direct=0 --iodepth=128 --end_fsync=1 --name=job1 --filename=/mnt/fio.dat

Verify the memory usage is unlimited:

$ while true; do cat /sys/fs/cgroup/memory/memlimited/memory.usage_in_bytes; sleep 5; done
14143488
14143488
819499008
13288558592
25772953600
38258790400
50776608768
55638511616
55638429696
55638478848
55638528000
55638577152
55210229760
55210229760
^C

Limit the memory usage to 32GB

$ cgset -r memory.limit_in_bytes=32G memlimited
$ cgget -r memory.limit_in_bytes memlimited
memlimited:
memory.limit_in_bytes: 34359738368

Use fio to write 50G data:

$ echo 3 > /proc/sys/vm/drop_caches
$ cgexec -g memory:memlimited fio --blocksize=64k --ioengine=libaio --readwrite=write --filesize=50G --group_reporting --direct=0 --iodepth=128 --end_fsync=1 --name=job1 --filename=/mnt/fio.dat

Verify the memory usage is limited to 32GB:

$ while true; do cat /sys/fs/cgroup/memory/memlimited/memory.usage_in_bytes; sleep 5; done
9134080
6819614720
18048208896
29089763328
34359726080
34359672832
34359607296
34359717888
34359635968
34359619584
34280120320
34280120320
^C

From the vmstat output, the cache usage is limited to 32GB. There is also swap-out activity due to the memory pressure.

$ vmstat 5 -t
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 UTC
 0  0      0 1053891776    968 142992    0    0     0    30    0    0  0  0 100  0  0 2022-12-23 22:04:30
 2  0      0 1045755200    968 8266836    0    0  1697     0 3789  727  2  1 98  0  0 2022-12-23 22:04:35
 1  0      0 1034800320    976 19222628    0    0     0     2 1770  575  0  1 99  0  0 2022-12-23 22:04:40
 1  0      0 1024034304    984 29988256    0    0     0 163843 2999  780  0  1 99  0  0 2022-12-23 22:04:45
 1  1      0 1020353664    992 33665560    0    0     0 985500 8317 2414  0  1 98  0  0 2022-12-23 22:04:50
 1  1 317440 1020354112   1000 33666764    0 63474     3 1518376 18524 2604  0  1 98  1  0 2022-12-23 22:04:55
 1  1 340480 1020353472   1008 33666660    0 4571     0 1602014 15377 2381  0  1 98  1  0 2022-12-23 22:05:00
 2  0 340480 1020355904   1016 33667212   39    0    39 1632866 22166 70618  0  1 98  0  0 2022-12-23 22:05:05
 1  0 340480 1020356032   1020 33667228    0    0     0 1861355 27437 108029  0  1 99  0  0 2022-12-23 22:05:10
 2  0 340480 1020356224   1024 33667236    0    0     0 1874438 28732 111364  0  1 98  0  0 2022-12-23 22:05:15
 0  0    512 1020432704   1024 33599516  212    0   550 915429 13644 56805  0  1 99  0  0 2022-12-23 22:05:20
 0  0    512 1020432960   1028 33599516    0    0     3     4  165  149  0  0 100  0  0 2022-12-23 22:05:25
^C

Limit the memory usage for tasks in the current bash shell

$ cat /sys/fs/cgroup/memory/memlimited/tasks
$ echo $$ > /sys/fs/cgroup/memory/memlimited/tasks
$ cat /sys/fs/cgroup/memory/memlimited/tasks
27875
28889

$ ps -ef | egrep "27875|29023" | grep -v grep
root     27875 27873  0 21:11 pts/0    00:00:17 -bash
root     29026 27875  0 22:11 pts/0    00:00:00 ps -ef

Use fio to write 50G data:

$ fio --blocksize=64k --ioengine=libaio --readwrite=write --filesize=50G --group_reporting --direct=0 --iodepth=128 --end_fsync=1 --name=job1 --filename=/mnt/fio.dat

Verify the memory usage is limited to 32GB:

$ while true; do cat /sys/fs/cgroup/memory/memlimited/memory.usage_in_bytes; sleep 5; done
22835200
22835200
22835200
10150678528
21453983744
32693850112
34359607296
34359607296
34359738368
34359730176
34359730176
34359660544
34280062976
^C


$ vmstat 5 -t
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 UTC
 0  0    512 1053886784   1004 144860    0    0     0    31    0    0  0  0 100  0  0 2022-12-23 22:09:17
 0  0    512 1053886528   1004 144824    0    0     0     0  143  148  0  0 100  0  0 2022-12-23 22:09:22
 1  0    512 1050907776   1004 3109688    0    0  1660     3 2690  566  1  0 98  0  0 2022-12-23 22:09:27
 1  0    512 1039836288   1012 14180528    0    0     0     2 1699  508  0  1 99  0  0 2022-12-23 22:09:32
 1  0    512 1028822720   1020 25194536    0    0     0     2 1625  494  0  1 99  0  0 2022-12-23 22:09:37
 1  1    512 1020348800   1028 33665604    0    0     0 385026 4478 1313  0  1 99  0  0 2022-12-23 22:09:42
 1  1  46080 1020340224   1036 33676796    0 9090     0 1563114 13781 2301  0  1 98  1  0 2022-12-23 22:09:47
 1  1 340480 1020338688   1044 33676544    0 58904     0 1557223 19212 2580  0  1 98  1  0 2022-12-23 22:09:52
 1  1 340480 1020337792   1052 33676376   33    0    33 1481028 14133 5776  0  1 98  1  0 2022-12-23 22:09:57
 1  0 340480 1020342208   1056 33676396    0    0   184 1877767 27447 114955  0  1 98  0  0 2022-12-23 22:10:02
 1  1 340480 1020341056   1056 33677100    0    0     0 1817494 27635 101675  0  1 98  0  0 2022-12-23 22:10:07
 1  0 195144 1020350208   1060 33676988   83    0    83 1872111 27068 113280  0  1 98  0  0 2022-12-23 22:10:12
^C


CockroachDB key concepts

Range - CockroachDB stores all user data (tables, indexes, etc.) and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into “ranges”, contiguous chunks of the keyspace, so that every key can always be found in a single range.

From a SQL perspective, a table and its secondary indexes initially map to a single range, where each key-value pair in the range represents a single row in the table (also called the primary index because the table is sorted by the primary key) or a single row in a secondary index. As soon as that range reaches 512 MiB in size, it splits into two ranges. This process continues for these new ranges as the table and its indexes continue growing.

[root@host1 ~]# tail -f /var/log/cockroachdb_logs/cockroach.log
I220908 17:54:02.220498 5523606 kv/kvserver/pkg/kv/kvserver/replica_command.go:420 ⋮ [n1,split,s1,r499/1:‹/Table/113/1/4{394/6…-693/6…}›] 1700  initiating a split of this range at key ‹/Table/113/1/4402/59627› [r501] (‹512 MiB above threshold size 512 MiB›)‹›

Replica - CockroachDB replicates each range (3 times by default) and stores each replica on a different node.

Refer here for more.
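To see how a table is currently split into ranges and where its replicas are placed, you can use the SHOW RANGES statement; the database and table names below are hypothetical:

[root@host1 ~]# cockroach sql --insecure --host=host1:26257 -e 'SHOW RANGES FROM TABLE mydb.mytable'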

[Image: ranges and their replicas distributed across nodes]

Production checklist

Check here for the important recommendations for production deployments of CockroachDB. The following lists only some of the recommended settings.

Memory

  • Disable Linux memory swapping (see the example after this list). Over-allocating memory on production machines can lead to unexpected performance issues when pages have to be read back into memory.
  • For production deployments, set --cache to 25% or higher. Avoid setting --cache and --max-sql-memory to a combined value of more than 75% of a machine's total RAM. Doing so increases the risk of memory-related failures.
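For example, swapping can be turned off on each node as follows (also remove or comment out any swap entries in /etc/fstab so the change persists across reboots):

$ swapoff -a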

Storage

  • The maximum recommended storage capacity per node is 2.5 TiB, regardless of the number of vCPUs.
  • Use dedicated volumes for the CockroachDB store. Do not share the store volume with any other I/O activity.
  • Store CockroachDB log files in a separate volume from the main data store so that logging is not impacted by I/O throttling.
  • The recommended Linux filesystems are ext4 and XFS.

Disk I/O

  • Use sysbench to benchmark IOPS on your cluster. If IOPS decrease, add more nodes to your cluster to increase IOPS.
  • Do not use LVM in the I/O path. Dynamically resizing CockroachDB store volumes can result in significant performance degradation. Using LVM snapshots in lieu of CockroachDB backup and restore is also not supported.
  • The optimal configuration for striping more than one device is RAID 10. RAID 0 and 1 are also acceptable from a performance perspective.

Network

When starting a node, two main flags are used to control its network connections:

  • --listen-addr determines which address(es) to listen on for connections from other nodes and clients.
  • --advertise-addr determines which address to tell other nodes to use (see the example start command after this list).
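A sketch of a start command using these flags (all addresses are illustrative):

$ cockroach start --insecure --listen-addr=0.0.0.0:26257 --advertise-addr=node1.example.com:26257 --join=node1:26257,node2:26257,node3:26257 --background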

[Image: --listen-addr and --advertise-addr connections between nodes and clients]

Load balancing

Each CockroachDB node is an equally suitable SQL gateway to a cluster, but to ensure client performance and reliability, it’s important to use load balancing:

  • Performance: Load balancers spread client traffic across nodes. This prevents any one node from being overwhelmed by requests and improves overall cluster performance (queries per second).
  • Reliability: Load balancers decouple client health from the health of a single CockroachDB node. To ensure that traffic is not directed to failed nodes or nodes that are not ready to receive requests, load balancers should use CockroachDB's readiness health check (see the probe example after this list).
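For example, a load balancer (or a quick manual check) can probe each node's readiness endpoint over the HTTP port (8080 by default), and cockroach can generate a starter HAProxy config; the host name is illustrative:

$ curl 'http://node1:8080/health?ready=1'
$ cockroach gen haproxy --insecure --host=node1:26257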

Cache and SQL memory size

CockroachDB manages its own memory caches, independently of the operating system. These are configured via the --cache and --max-sql-memory flags (an example start command follows the list below).

Each node has a default cache size of 128MiB that is passively consumed. The default was chosen to facilitate development and testing, where users are likely to run multiple CockroachDB nodes on a single machine. Increasing the cache size will generally improve the node’s read performance.

Each node has a default SQL memory size of 25%. This memory is used as-needed by active operations to store temporary data for SQL queries.

  • Increasing a node’s cache size will improve the node’s read performance.
  • Increasing a node’s SQL memory size will increase the number of simultaneous client connections it allows, as well as the node’s capacity for in-memory processing of rows when using ORDER BY, GROUP BY, DISTINCT, joins, and window functions.
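A sketch of a start command that sets both pools to 25% of system memory (the remaining flags are illustrative):

$ cockroach start --insecure --cache=.25 --max-sql-memory=.25 --join=node1:26257,node2:26257,node3:26257 --background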

You can check the cache size and SQL memory pool size in the log. In the following example output, the reported sizes match the specified 25% cache and SQL memory settings.

$ vim cockroach.log
I220905 17:49:33.671304 1 server/config.go:487 ⋮ [n?] 6  system total memory: 1008 GiB
I220905 17:49:33.671318 1 server/config.go:489 ⋮ [n?] 7  server configuration:
I220905 17:49:33.671318 1 server/config.go:489 ⋮ [n?] 7 +‹max offset             500000000›
I220905 17:49:33.671318 1 server/config.go:489 ⋮ [n?] 7 +‹cache size             252 GiB›
I220905 17:49:33.671318 1 server/config.go:489 ⋮ [n?] 7 +‹SQL memory pool size   252 GiB›

CockroachDB workloads

[Image: cockroach workload subcommands]

bank workload

$ cockroach workload init bank 'postgresql://root@host1:26257?sslmode=disable'
I220831 23:52:14.593170 1 workload/workloadsql/dataload.go:146  [-] 1  imported bank (0s, 1000 rows)
I220831 23:52:14.609421 1 workload/workloadsql/workloadsql.go:136  [-] 2  starting 9 splits


$ cockroach workload run bank --duration=1m 'postgresql://root@host1:26257?sslmode=disable'
I220831 23:52:55.378492 1 workload/cli/run.go:414  [-] 1  creating load generator...
I220831 23:52:55.380594 1 workload/cli/run.go:445  [-] 2  creating load generator... done (took 2.103665ms)
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0         2929.8         2936.4      3.0     15.2     50.3    234.9 transfer
    2.0s        0         3387.3         3161.8      3.4     13.6     23.1     54.5 transfer
    3.0s        0         2755.2         3026.3      3.9     13.6     23.1    302.0 transfer
    4.0s        0         3295.5         3093.6      3.3     13.6     33.6    469.8 transfer
    5.0s        0         3536.5         3182.2      3.5     12.1     18.9     35.7 transfer
    6.0s        0         3558.1         3244.8      3.5     11.0     21.0     39.8 transfer
    7.0s        0         3566.9         3290.8      3.7     11.0     17.8     39.8 transfer
    8.0s        0         3317.6         3294.2      3.7     12.6     24.1     62.9 transfer
    9.0s        0         2992.9         3260.7      4.1     14.7     22.0     39.8 transfer
   10.0s        0         3628.2         3297.4      3.5     11.5     19.9     39.8 transfer
   11.0s        0         3604.5         3325.3      3.5     11.5     17.8     37.7 transfer
   12.0s        0         3668.4         3353.9      3.5     11.0     17.8     31.5 transfer
   13.0s        0         3485.9         3364.1      3.7     11.5     18.9     31.5 transfer
   14.0s        0         3377.9         3365.1      3.7     12.6     21.0     56.6 transfer
   15.0s        0         3084.4         3346.3      4.1     13.6     21.0     92.3 transfer
   16.0s        0         3650.1         3365.3      3.4     11.5     17.8     44.0 transfer
   17.0s        0         3662.7         3382.8      3.5     11.5     17.8     37.7 transfer
   18.0s        0         3461.8         3387.2      3.7     11.0     19.9    159.4 transfer
   19.0s        0         3426.3         3389.3      3.7     12.1     19.9     41.9 transfer
   20.0s        0         3196.0         3379.6      3.9     13.1     19.9     35.7 transfer
<omitted..>

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   60.0s        0         206551         3442.5      4.6      3.7     12.1     19.9    469.8  transfer

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   60.0s        0         206551         3442.5      4.6      3.7     12.1     19.9    469.8

TPC-C workload

--warehouses

The number of warehouses for loading initial data, at approximately 200 MB per warehouse.

Applicable commands: init or run

Default: 1

--workers

The number of concurrent workers.

Applicable commands: init or run

Default: --warehouses * 10

The number of warehouses can be specified with the --warehouses option. By default, only one warehouse is created.
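
For example, loading ten warehouses (a similar invocation is used in the Kubernetes section later in this document):

$ cockroach workload init tpcc 'postgresql://root@host1:26257?sslmode=disable' --warehouses 10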

$ cockroach workload init tpcc 'postgresql://root@host1:26257?sslmode=disable'
Error: failed insert into warehouse: pq: duplicate key value violates unique constraint "warehouse_pkey"

$ cockroach workload init tpcc 'postgresql://root@host1:26257?sslmode=disable' --drop
I220901 00:32:56.365593 1 workload/workloadsql/dataload.go:146  [-] 1  imported warehouse (0s, 1 rows)
I220901 00:32:56.372727 1 workload/workloadsql/dataload.go:146  [-] 2  imported district (0s, 10 rows)
I220901 00:32:57.034349 1 workload/workloadsql/dataload.go:146  [-] 3  imported customer (1s, 30000 rows)
I220901 00:32:57.248382 1 workload/workloadsql/dataload.go:146  [-] 4  imported history (0s, 30000 rows)
I220901 00:32:57.462320 1 workload/workloadsql/dataload.go:146  [-] 5  imported order (0s, 30000 rows)
I220901 00:32:57.490476 1 workload/workloadsql/dataload.go:146  [-] 6  imported new_order (0s, 9000 rows)
I220901 00:32:57.921658 1 workload/workloadsql/dataload.go:146  [-] 7  imported item (0s, 100000 rows)
I220901 00:32:59.267268 1 workload/workloadsql/dataload.go:146  [-] 8  imported stock (1s, 100000 rows)
I220901 00:33:00.916392 1 workload/workloadsql/dataload.go:146  [-] 9  imported order_line (2s, 300343 rows)


$ cockroach workload run tpcc --duration=10m 'postgresql://root@host1:26257?sslmode=disable'
I220901 00:34:17.147411 1 workload/cli/run.go:414  [-] 1  creating load generator...
Initializing 2 connections...
Initializing 0 idle connections...
Initializing 10 workers and preparing statements...
I220901 00:34:17.151817 1 workload/cli/run.go:445  [-] 2  creating load generator... done (took 4.411152ms)
<omitted..>

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0             12            0.0     32.6     31.5     44.0     44.0     44.0  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0            120            0.2     17.9     17.8     26.2     41.9     54.5  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0             13            0.0      6.7      5.8     14.2     15.2     15.2  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0            131            0.2     10.8     11.0     14.7     24.1     29.4  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0             13            0.0     10.8     10.5     15.7     15.7     15.7  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0            289            0.5     14.4     13.1     28.3     41.9     54.5
Audit check 9.2.1.7: SKIP: not enough delivery transactions to be statistically significant
Audit check 9.2.2.5.1: SKIP: not enough orders to be statistically significant
Audit check 9.2.2.5.2: SKIP: not enough orders to be statistically significant
Audit check 9.2.2.5.5: SKIP: not enough payments to be statistically significant
Audit check 9.2.2.5.6: SKIP: not enough order status transactions to be statistically significant
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s       12.0  93.3%     17.9     17.8     23.1     26.2     41.9     54.5

YCSB workload

$ cockroach workload init ycsb 'postgresql://root@host1:26257?sslmode=disable'
I220831 23:57:53.909658 1 workload/workloadsql/dataload.go:146  [-] 1  imported usertable (2s, 10000 rows)


$ cockroach workload run ycsb --duration=10m 'postgresql://root@host1:26257?sslmode=disable'
I220831 23:58:24.410319 1 workload/cli/run.go:414  [-] 1  creating load generator...
I220831 23:58:24.427528 1 workload/cli/run.go:445  [-] 2  creating load generator... done (took 17.212701ms)
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0        11011.5        11243.7      1.1      2.4      3.8     11.0 read
    1.0s        0          581.7          593.9      3.0      5.5      7.3     12.6 update
    2.0s        0        12105.1        11674.4      1.1      1.8      3.4     10.0 read
    2.0s        0          636.9          615.4      2.8      3.9      6.6     11.0 update
    3.0s        0        11590.6        11646.5      1.1      2.0      4.5     14.2 read
    3.0s        0          607.1          612.7      2.8      6.0      9.4     12.1 update
    4.0s        0        11813.5        11688.2      1.1      1.9      3.5     13.1 read
    4.0s        0          622.0          615.0      2.8      4.5      7.6     13.6 update
    5.0s        0        11959.5        11742.5      1.1      2.0      3.3      8.1 read
    5.0s        0          607.0          613.4      2.6      4.7      7.1     12.1 update
    6.0s        0        12186.2        11816.5      1.0      1.8      3.5     12.6 read
    6.0s        0          638.0          617.5      2.6      4.1     10.0     16.8 update
    7.0s        0        11815.8        11816.3      1.1      2.0      3.7      8.4 read
    7.0s        0          646.0          621.6      2.8      4.7      6.8      9.4 update
    8.0s        0        11850.8        11820.6      1.1      2.0      3.3      7.6 read
    8.0s        0          617.9          621.1      2.8      4.5      6.6      9.4 update
    9.0s        0        11713.5        11808.7      1.1      2.0      3.9     11.0 read
    9.0s        0          609.1          619.8      2.8      4.5      7.6     11.5 update
   10.0s        0        11949.6        11822.8      1.1      2.0      3.3     13.1 read
   10.0s        0          622.0          620.0      2.8      4.2      6.0      8.4 update

<omitted..>

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        8576709        14294.5      1.0      1.0      1.6      2.9     62.9  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         451026          751.7      2.5      2.5      3.8      6.6     27.3  update

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0        9027735        15046.2      1.1      1.0      2.2      3.5     62.9

Restart CockroachDB cluster

  1. Stop the process on each node

    $ ps -ef | grep cock | grep -v grep
    root 12217 1 99 Sep06 ? 2-20:28:09 cockroach start --log-dir=/var/log/cockroachdb_logs --store=/mnt/cockroachdb_mnt1 --insecure --advertise-addr=host1 --join=host1,host2,host3 --cache=.25 --max-sql-memory=.25

    $ kill -9 12217
    $ ps -ef | grep cock | grep -v grep

  2. Start the process on each node

    $ cockroach start --store=/mnt/cockroachdb_mnt1 --insecure --advertise-addr=host1 --join=host1,host2,host3 --cache=.25 --max-sql-memory=.25 --background
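
  3. Optionally, verify that every node rejoined the cluster, using the node status command covered in the next section (host1 here stands in for any node address)

    $ cockroach node status --insecure --host=host1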

CockroachDB commands

List node IDs:

$ cockroach node ls --insecure
  id
------
   1
   2
   3
(3 rows)

Show node status:

$ cockroach node status --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+----------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:23:09.022236 |          | true         | true
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-03 16:24:12.04249  | 2022-09-03 22:23:09.048582 |          | true         | true
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-03 16:24:12.390521 | 2022-09-03 22:23:09.402081 |          | true         | true
(3 rows)

Show status and range/replica details:

$ cockroach node status --ranges --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+---------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:23:36.022463 |          | true         | true    |         306 |                   306 |    941 |                  0 |                      0
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-03 16:24:12.04249  | 2022-09-03 22:23:36.048508 |          | true         | true    |              310 |                   310 |    941 |                  0 |                      0
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-03 16:24:12.390521 | 2022-09-03 22:23:36.40421  |          | true         | true    |              325 |                   325 |    941 |                  0 |                      0
(3 rows)

Show status and disk usage details:

$ cockroach node status --stats --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live |  live_bytes  |  key_bytes  | value_bytes  | intent_bytes | system_bytes
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+---------+--------------+-------------+--------------+--------------+---------------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:23:45.022008 |          | true         | true    | 246665681135 | 47930398573 | 220956161393 |        40331 |       812555
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-03 16:24:12.04249  | 2022-09-03 22:23:45.048777 |          | true         | true    | 246665683880 | 47930401894 | 220956192532 |        28025 |       817028
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-03 16:24:12.390521 | 2022-09-03 22:23:45.40156  |          | true         | true    | 246666008478 | 47930560065 | 220957658233 |        74382 |       741028
(3 rows)

Show status and decommissioning details for active and inactive nodes:

$ cockroach node status --decommission --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live | gossiped_replicas | is_decommissioning | membership | is_draining
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+---------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:24:03.043432 |          | true         | true    |               941 | false              | active     | false
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-03 16:24:12.04249  | 2022-09-03 22:24:03.048948 |          | true         | true    |               941 | false              | active     | false
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-03 16:24:12.390521 | 2022-09-03 22:24:03.401953 |          | true         | true    |               941 | false              | active     | false
(3 rows)

Show complete status details for active and inactive nodes:

$ cockroach node status --all --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |  live_bytes  |  key_bytes  | value_bytes  | intent_bytes | system_bytes | gossiped_replicas | is_decommissioning | membership | is_draining
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+---------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:24:07.539077 |          |         true |    true |              306 |                   306 |    942 |                  0 |                      0 | 246677946612 | 47939212836 | 221027696131 |        81398 |       722359 |        941        |              false |   active   |    false
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-03 16:24:12.04249  | 2022-09-03 22:24:07.548886 |          |         true |    true |              311 |                   311 |    942 |                  0 |                      0 | 246677952989 | 47939228198 | 221027812276 |        67816 |       731258 |        941        |              false |   active   |    false
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-03 16:24:12.390521 | 2022-09-03 22:24:07.901765 |          |         true |    true |              325 |                   325 |    942 |                  0 |                      0 | 246678252011 | 47939394591 | 221029217819 |        55249 |       869542 |        941        |              false |   active   |    false
(3 rows)

Show status details for a specific node:

$ cockroach node status 1 --insecure
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+----------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-03 16:24:12.001861 | 2022-09-03 22:24:30.044382 |          | true         | true
(1 row)

Reference

Intro to cgroups

Cgroups (control groups) make it possible to allocate system resources such as CPU time, memory, disk I/O, and network bandwidth, or combinations of them, among groups of tasks (processes) running on a system.

The following commands output the available subsystems (resource controllers) for cgroups. Each subsystem has a set of tunables that control how the resource is allocated.

$ lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids
rdma /sys/fs/cgroup/rdma

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

CPU subsystem and tunables

Ceiling enforcement parameters

cpu.cfs_period_us

specifies a period of time in microseconds (µs, represented here as “us”) for how regularly a cgroup’s access to CPU resources should be reallocated. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. The upper limit of the cpu.cfs_period_us parameter is 1 second and the lower limit is 1000 microseconds.

cpu.cfs_quota_us

specifies the total amount of time in microseconds (µs, represented here as “us”) for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period. If tasks in a cgroup should be able to access a single CPU for 0.2 seconds out of every 1 second, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000. Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000.

Setting the value in cpu.cfs_quota_us to -1 indicates that the cgroup does not adhere to any CPU time restrictions. This is also the default value for every cgroup (except the root cgroup).
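
A minimal sketch of the “0.2 seconds out of every 1 second” example above, using the libcgroup tools introduced later in this document and a hypothetical cgroup named cputest:

$ cgcreate -g cpu:/cputest
$ cgset -r cpu.cfs_period_us=1000000 cputest
$ cgset -r cpu.cfs_quota_us=200000 cputest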

Relative shares parameter

cpu.shares

contains an integer value that specifies a relative share of CPU time available to the tasks in a cgroup. For example, tasks in two cgroups that have cpu.shares set to 100 will receive equal CPU time, but tasks in a cgroup that has cpu.shares set to 200 receive twice the CPU time of tasks in a cgroup where cpu.shares is set to 100. The value specified in the cpu.shares file must be 2 or higher.

Note that shares of CPU time are distributed across all CPU cores on multi-core systems. Even if a cgroup is limited to less than 100% of CPU on a multi-core system, it may use 100% of each individual CPU core.

Using relative shares to specify CPU access has two implications for resource management that should be considered:

Because the CFS (Completely Fair Scheduler) does not demand equal usage of CPU, it is hard to predict how much CPU time a cgroup will be allowed to utilize. When tasks in one cgroup are idle and are not using any CPU time, the leftover time is collected in a global pool of unused CPU cycles. Other cgroups are allowed to borrow CPU cycles from this pool.

The actual amount of CPU time that is available to a cgroup can vary depending on the number of cgroups that exist on the system. If a cgroup has a relative share of 1000 and two other cgroups have a relative share of 500, the first cgroup receives 50% of all CPU time in cases when processes in all cgroups attempt to use 100% of the CPU. However, if another cgroup is added with a relative share of 1000, the first cgroup is only allowed 33% of the CPU (the remaining cgroups receive roughly 16.7%, 16.7%, and 33% of the CPU).
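
As an illustration of relative shares, the following sketch creates two hypothetical cgroups, groupA and groupB, with the libcgroup tools described in the next section; under contention, tasks in groupA would receive roughly twice the CPU time of tasks in groupB:

$ cgcreate -g cpu:/groupA
$ cgcreate -g cpu:/groupB
$ cgset -r cpu.shares=200 groupA
$ cgset -r cpu.shares=100 groupB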

Using libcgroup tools

Install the libcgroup packages to manage cgroups:

$ yum install libcgroup libcgroup-tools

List the cgroups:

$ lscgroup
hugetlb:/
cpu,cpuacct:/
cpuset:/
blkio:/
memory:/
freezer:/
net_cls,net_prio:/
pids:/
rdma:/
perf_event:/
devices:/
devices:/system.slice
devices:/system.slice/irqbalance.service
devices:/system.slice/systemd-udevd.service
devices:/system.slice/polkit.service
devices:/system.slice/chronyd.service
devices:/system.slice/auditd.service
devices:/system.slice/tuned.service
devices:/system.slice/systemd-journald.service
devices:/system.slice/sshd.service
devices:/system.slice/crond.service
devices:/system.slice/NetworkManager.service
devices:/system.slice/rsyslog.service
devices:/system.slice/abrtd.service
devices:/system.slice/lvm2-lvmetad.service
devices:/system.slice/postfix.service
devices:/system.slice/dbus.service
devices:/system.slice/system-getty.slice
devices:/system.slice/systemd-logind.service
devices:/system.slice/abrt-oops.service

$ ls /sys/fs/cgroup
blkio  cpuacct      cpuset   freezer  memory   net_cls,net_prio  perf_event  rdma
cpu    cpu,cpuacct  devices  hugetlb  net_cls  net_prio          pids        systemd

Create the cgroup:

$ cgcreate -g cpu:/cpulimited

$ lscgroup | grep cpulimited
cpu,cpuacct:/cpulimited

$ ls cpulimited/
cgroup.clone_children  cpuacct.usage_percpu       cpu.cfs_period_us  cpu.stat
cgroup.procs           cpuacct.usage_percpu_sys   cpu.cfs_quota_us   notify_on_release
cpuacct.stat           cpuacct.usage_percpu_user  cpu.rt_period_us   tasks
cpuacct.usage          cpuacct.usage_sys          cpu.rt_runtime_us
cpuacct.usage_all      cpuacct.usage_user         cpu.shares

Limit CPU utilization by percentage. With the default cpu.cfs_period_us of 100000 µs, setting cpu.cfs_quota_us to 200000 µs allows the cgroup to use up to 200% of a CPU (two full cores):

$ lscpu | grep ^CPU\(s\):
CPU(s):                96

$ cgset -r cpu.cfs_quota_us=200000 cpulimited

Check the cgroup settings:

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000

$ cgget -g cpu:cpulimited
cpulimited:
cpu.cfs_period_us: 100000
cpu.stat: nr_periods 2
    nr_throttled 0
    throttled_time 0
cpu.shares: 1024
cpu.cfs_quota_us: 200000
cpu.rt_runtime_us: 0
cpu.rt_period_us: 1000000

Delete the cgroup:

$ cgdelete cpu,cpuacct:/cpulimited

Verify the CPU utilization with fio workload

Create a fio job file:

$ cat burn_cpu.job
[burn_cpu]
# Don't transfer any data, just burn CPU cycles
ioengine=cpuio
# Stress the CPU at 100%
cpuload=100
# Make 4 clones of the job
numjobs=4

Run the fio jobs without CPU limit:

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: -1

$ cgexec -g cpu:cpulimited fio burn_cpu.job

Check the CPU usage:

$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13775 root      20   0 1079912   4016   2404 R 100.0  0.0   0:11.65 fio
13776 root      20   0 1079916   4004   2392 R 100.0  0.0   0:11.65 fio
13777 root      20   0 1079920   4004   2392 R 100.0  0.0   0:11.65 fio
13778 root      20   0 1079924   4004   2392 R 100.0  0.0   0:11.65 fio

The CPU utilization is 400% for the 4 fio jobs when there is no CPU limit set. Note that with 96 CPUs there is a total of 9600% CPU bandwidth available.

Limit the CPU utilization to 200%:

$ cgset -r cpu.cfs_quota_us=200000 cpulimited

$ cgget -r cpu.cfs_quota_us cpulimited
cpulimited:
cpu.cfs_quota_us: 200000

Run the fio jobs again:

$ cgexec -g cpu:cpulimited fio burn_cpu.job

Check the CPU usage:

$ top
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12908 root      20   0 1079916   3948   2336 R  50.3  0.0   0:06.91 fio
12909 root      20   0 1079920   3948   2336 R  50.0  0.0   0:06.88 fio
12910 root      20   0 1079924   3948   2336 R  50.0  0.0   0:06.93 fio
12907 root      20   0 1079912   3948   2336 R  49.3  0.0   0:06.86 fio

The CPU utilization is 200% for the 4 fio jobs when the CPU utilization is limited to 200%.

Check which CPU cores the processes are running on (the awk filter keeps the header rows and any row that is less than 99% idle):

$ mpstat -P ALL 5 | awk '{if ($3=="CPU" || $NF<99)print;}'

12:40:32 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:40:37 AM  all    2.11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.89
12:40:37 AM    0   20.52    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   79.48
12:40:37 AM    1   50.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.40
12:40:37 AM    2   50.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.90
12:40:37 AM   24   29.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   70.26
12:40:37 AM   87   50.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.80

12:40:37 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:40:42 AM  all    2.11    0.00    0.01    0.00    0.00    0.00    0.00    0.00    0.00   97.88
12:40:42 AM    0   11.49    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   88.31
12:40:42 AM    1   50.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.40
12:40:42 AM    2   50.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.70
12:40:42 AM   24   38.97    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   61.03
12:40:42 AM   87   49.90    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.10

The 4 fio jobs are spread across 5 CPU cores with a total utilization of 200%. This indicates that the quota limits the total CPU utilization summed across all CPU cores; it does not restrict how many (or which) cores the processes may run on.

Reference

Deploy Kubernetes cluster

$ kubectl get nodes
NAME            STATUS   ROLES    AGE    VERSION
node0   Ready    <none>   115d   v1.19.2
node1   Ready    <none>   115d   v1.19.2
node2   Ready    <none>   115d   v1.19.2
node3   Ready    master   115d   v1.19.2

Start CockroachDB cluster

Start the CockroachDB nodes with a configuration file that has been customized for performance:

$ curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/performance/cockroachdb-statefulset-insecure.yaml

$ kubectl create -f cockroachdb-statefulset-insecure.yaml
service/cockroachdb-public created
service/cockroachdb created
poddisruptionbudget.policy/cockroachdb-budget created
statefulset.apps/cockroachdb created

Confirm that three pods are Running successfully. Note that they will not be considered Ready until after the cluster has been initialized:

$ kubectl get pods
NAME            READY     STATUS    RESTARTS   AGE
cockroachdb-0   0/1       Running   0          2m
cockroachdb-1   0/1       Running   0          2m
cockroachdb-2   0/1       Running   0          2m

Confirm that the persistent volumes and corresponding claims were created successfully for all three pods:

$ kubectl get pvc
NAME                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-cockroachdb-0   Bound    pvc-ac210538-2a20-4d99-9b5d-c5a03d733b4d   1Ti        RWO            px-repl1       35s
datadir-cockroachdb-1   Bound    pvc-54f24a59-a063-46bb-9645-d4292eb483a3   1Ti        RWO            px-repl1       25s
datadir-cockroachdb-2   Bound    pvc-0ebd7fa7-dc36-4c77-ab9d-003d10680f56   1Ti        RWO            px-repl1       15s

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
pvc-0ebd7fa7-dc36-4c77-ab9d-003d10680f56   1Ti        RWO            Delete           Bound    default/datadir-cockroachdb-2   px-repl1                17s
pvc-54f24a59-a063-46bb-9645-d4292eb483a3   1Ti        RWO            Delete           Bound    default/datadir-cockroachdb-1   px-repl1                27s
pvc-ac210538-2a20-4d99-9b5d-c5a03d733b4d   1Ti        RWO            Delete           Bound    default/datadir-cockroachdb-0   px-repl1                27s

Before initializing the CockroachDB cluster, check the network (for example, with the ping command) to make sure each host can communicate with the others. Otherwise, you may run into connection failures between cluster nodes.

In my case, I want to run the cluster over a private network that doesn’t have DNS set up, so I add the private IP addresses, Kubernetes-qualified hostnames, and CockroachDB node names to the /etc/hosts file on each host, as shown below.

10.0.0.1 ip-10-10-0-1 cockroachdb-0.cockroachdb
10.0.0.2 ip-10-10-0-2 cockroachdb-1.cockroachdb
10.0.0.3 ip-10-10-0-3 cockroachdb-2.cockroachdb
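
With those entries in place, a quick connectivity check from each host might look like the following (the hostnames are the ones mapped above):

$ ping -c 1 cockroachdb-0.cockroachdb
$ ping -c 1 cockroachdb-1.cockroachdb
$ ping -c 1 cockroachdb-2.cockroachdb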

Use the provided cluster-init.yaml file to perform a one-time initialization that joins the CockroachDB nodes into a single cluster:

$ kubectl create \
-f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/cluster-init.yaml

Confirm that cluster initialization has completed successfully. The job should be considered successful and the Kubernetes pods should soon be considered Ready:

$ kubectl get job cluster-init
NAME           COMPLETIONS   DURATION   AGE
cluster-init   1/1           7s         27s
$ kubectl get pods
NAME                 READY   STATUS      RESTARTS   AGE
cluster-init-cqf8l   0/1     Completed   0          56s
cockroachdb-0        1/1     Running     0          7m51s
cockroachdb-1        1/1     Running     0          7m51s
cockroachdb-2        1/1     Running     0          7m51s

Use the built-in SQL client

$ kubectl run cockroachdb -it --image=cockroachdb/cockroach:v22.1.7 --rm --restart=Never -- sql --insecure --host=cockroachdb-public

root@cockroachdb-public:26257/defaultdb> show databases;
  database_name | owner | primary_region | regions | survival_goal
----------------+-------+----------------+---------+----------------
  defaultdb     | root  | NULL           | {}      | NULL
  postgres      | root  | NULL           | {}      | NULL
  system        | node  | NULL           | {}      | NULL
(3 rows)


Time: 5ms total (execution 4ms / network 1ms)

Run TPC-C workload on the CockroachDB cluster

$ kubectl run cockroachdb -it --image=cockroachdb/cockroach:v22.1.7 --rm --restart=Never -- bash
[root@cockroachdb cockroach]# cockroach node ls --insecure --host=cockroachdb-public
  id
------
   1
   2
   3
(3 rows)

[root@cockroachdb cockroach]# cockroach workload init tpcc "postgresql://root@cockroachdb-public:26257?sslmode=disable" --warehouses 10 --drop
I220922 19:48:15.543311 1 workload/workloadsql/dataload.go:146  [-] 1  imported warehouse (0s, 10 rows)
I220922 19:48:15.580634 1 workload/workloadsql/dataload.go:146  [-] 2  imported district (0s, 100 rows)
I220922 19:48:20.098986 1 workload/workloadsql/dataload.go:146  [-] 3  imported customer (5s, 300000 rows)
I220922 19:48:21.601978 1 workload/workloadsql/dataload.go:146  [-] 4  imported history (2s, 300000 rows)
I220922 19:48:24.317881 1 workload/workloadsql/dataload.go:146  [-] 5  imported order (3s, 300000 rows)
I220922 19:48:24.540100 1 workload/workloadsql/dataload.go:146  [-] 6  imported new_order (0s, 90000 rows)
I220922 19:48:24.998829 1 workload/workloadsql/dataload.go:146  [-] 7  imported item (0s, 100000 rows)
I220922 19:48:34.737491 1 workload/workloadsql/dataload.go:146  [-] 8  imported stock (10s, 1000000 rows)
I220922 19:48:49.030366 1 workload/workloadsql/dataload.go:146  [-] 9  imported order_line (14s, 3001222 rows)

[root@cockroachdb cockroach]# cockroach sql --insecure --database tpcc --host=cockroachdb-public:26257 -e 'show tables'
  schema_name | table_name | type  | owner | estimated_row_count | locality
--------------+------------+-------+-------+---------------------+-----------
  public      | customer   | table | root  |              300000 | NULL
  public      | district   | table | root  |                 100 | NULL
  public      | history    | table | root  |              300000 | NULL
  public      | item       | table | root  |              100000 | NULL
  public      | new_order  | table | root  |               90000 | NULL
  public      | order      | table | root  |              300000 | NULL
  public      | order_line | table | root  |             3001222 | NULL
  public      | stock      | table | root  |             1000000 | NULL
  public      | warehouse  | table | root  |                  10 | NULL
(9 rows)

[root@cockroachdb cockroach]# cockroach workload run tpcc --warehouses=10 --ramp=10s --duration=20s postgres://root@cockroachdb-public:26257?sslmode=disable
I220922 19:52:41.633625 1 workload/cli/run.go:414  [-] 1  creating load generator...
Initializing 20 connections...
Initializing 0 idle connections...
Initializing 100 workers and preparing statements...
I220922 19:52:51.646000 1 workload/cli/run.go:445  [-] 2  creating load generator... done (took 10.012371773s)

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
   20.0s      153.0 119.0%     38.2     44.0     48.2     48.2     50.3     52.4

Reference

Create docker volume

Create docker volume on each host:

[root@host1 ~]# docker volume create --driver local --opt type=ext4 --opt device=/dev/nvme2n1 vol1
[root@host1 ~]# mkfs.ext4 /dev/nvme2n1
[root@host1 ~]# docker inspect vol1
[
    {
        "CreatedAt": "2022-09-13T01:10:50Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/vol1/_data",
        "Name": "vol1",
        "Options": {
            "device": "/dev/nvme2n1",
            "type": "ext4"
        },
        "Scope": "local"
    }
]

Start CockroachDB instance

Start the CockroachDB instance on each host:

[root@host1 ~]# docker run -d --cpus="0.25" --memory="32g" --name=roach1 --hostname=host1 -p 26257:26257 -p 8080:8080 -v "vol1:/cockroach/cockroach-data" cockroachdb/cockroach:v22.1.6 start --insecure --join=host1,host2,host3

[root@host2 ~]# docker run -d --cpus="0.25" --memory="32g" --name=roach2 --hostname=host2 -p 26257:26257 -p 8080:8080 -v "vol2:/cockroach/cockroach-data" cockroachdb/cockroach:v22.1.6 start --insecure --join=host1,host2,host3

[root@host3 ~]# docker run -d --cpus="0.25" --memory="32g" --name=roach3 --hostname=host3 -p 26257:26257 -p 8080:8080 -v "vol3:/cockroach/cockroach-data" cockroachdb/cockroach:v22.1.6 start --insecure --join=host1,host2,host3

Initialize the CockroachDB cluster

Initialize the cluster on any of the cluster nodes:

[root@host1 ~]# docker exec -it roach1 ./cockroach init --insecure

Check the cluster status:

[root@host1 ~]# docker exec -it roach1 ./cockroach node ls --insecure
  id
------
   1
   2
   3
(3 rows)

[root@host1 ~]# docker exec -it roach1 ./cockroach node status --insecure
  id |     address      |   sql_address    |  build  |         started_at         |         updated_at         | locality | is_available | is_live
-----+------------------+------------------+---------+----------------------------+----------------------------+----------+--------------+----------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-13 01:09:19.857864 | 2022-09-13 01:10:22.870301 |          | true         | true
   2 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-13 01:09:20.668787 | 2022-09-13 01:10:23.671303 |          | true         | true
   3 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-13 01:09:20.910602 | 2022-09-13 01:10:19.415254 |          | true         | true
(3 rows)

[root@host1 ~]# docker exec -it roach1 df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   50G  7.1G   43G  15% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    504G     0  504G   0% /sys/fs/cgroup
shm                       64M     0   64M   0% /dev/shm
/dev/nvme2n1             1.5T  1.6G  1.4T   1% /cockroach/cockroach-data
/dev/mapper/centos-root   50G  7.1G   43G  15% /etc/hosts
tmpfs                    504G     0  504G   0% /proc/acpi
tmpfs                    504G     0  504G   0% /proc/scsi
tmpfs                    504G     0  504G   0% /sys/firmware

[root@host1 ~]# mount | grep nvme2n1
/dev/nvme2n1 on /var/lib/docker/volumes/vol1/_data type ext4 (rw,relatime)

Check the cluster startup logs

[root@host1 ~]# docker exec -it roach1 grep 'node starting' cockroach-data/logs/cockroach.log -A 11
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +CockroachDB node starting at 2022-09-13 00:57:22.532397342 +0000 UTC (took 110.6s)
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +build:               CCL v22.1.6 @ 2022/08/23 17:05:04 (go1.17.11)
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +webui:               ‹http://host1:8080›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +sql:                 ‹postgresql://root@host1:26257/defaultdb?sslmode=disable›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +sql (JDBC):          ‹jdbc:postgresql://host1:26257/defaultdb?sslmode=disable&user=root›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +RPC client flags:    ‹/cockroach/cockroach <client cmd> --host=host1:26257 --insecure›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +logs:                ‹/cockroach/cockroach-data/logs›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +temp dir:            ‹/cockroach/cockroach-data/cockroach-temp1471904785›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +external I/O path:   ‹/cockroach/cockroach-data/extern›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +store[0]:            ‹path=/cockroach/cockroach-data›
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +storage engine:      pebble
I220913 00:57:22.532570 151 1@cli/start.go:1028 ⋮ [n1] 252 +clusterID:           ‹8276f8fb-30f2-4712-ad4e-f008d382372d›

Load the tpcc dataset

[root@host1 ~]# docker exec -it roach1 ./cockroach workload init tpcc "postgresql://root@host1:26257?sslmode=disable" --warehouses 2500 --drop

[root@host2 ~]# iostat -ktdx 2 nvme2n1
09/13/2022 01:14:00 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00   239.50    0.00 2112.00     0.00 167118.00   158.26     0.27    0.13    0.00    0.13   0.25  53.00

09/13/2022 01:14:02 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00   268.00    0.00 2369.50     0.00 185266.00   156.38     0.30    0.13    0.00    0.13   0.24  56.50

The following shows the elapsed times to load the dataset for 2500 warehouses.

I220913 04:28:20.633547 1 workload/workloadsql/dataload.go:146  [-] 1  imported warehouse (0s, 2500 rows)
I220913 04:28:20.900607 1 workload/workloadsql/dataload.go:146  [-] 2  imported district (0s, 25000 rows)
I220913 04:46:27.360882 7 1@util/log/log_flush.go:99  [-] 3  hangup received, flushing logs
I220913 04:59:37.468628 1 workload/workloadsql/dataload.go:146  [-] 4  imported customer (31m17s, 75000000 rows)
I220913 05:06:54.166160 1 workload/workloadsql/dataload.go:146  [-] 5  imported history (7m17s, 75000000 rows)
I220913 09:17:27.487420 1 workload/workloadsql/dataload.go:146  [-] 6  imported order (4h10m33s, 75000000 rows)
I220913 09:18:29.376933 1 workload/workloadsql/dataload.go:146  [-] 7  imported new_order (1m2s, 22500000 rows)
I220913 09:18:29.975510 1 workload/workloadsql/dataload.go:146  [-] 8  imported item (1s, 100000 rows)
I220913 10:07:29.447874 1 workload/workloadsql/dataload.go:146  [-] 9  imported stock (48m59s, 250000000 rows)
I220913 11:18:52.140400 1 workload/workloadsql/dataload.go:146  [-] 10  imported order_line (1h11m23s, 750022630 rows)

The following shows the file system usage for the 2500 warehouses.

[root@host1 ~]# docker exec -it roach1 df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   50G  7.1G   43G  15% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    504G     0  504G   0% /sys/fs/cgroup
shm                       64M     0   64M   0% /dev/shm
/dev/nvme2n1             1.5T   49G  1.4T   4% /cockroach/cockroach-data
/dev/mapper/centos-root   50G  7.1G   43G  15% /etc/hosts
tmpfs                    504G     0  504G   0% /proc/acpi
tmpfs                    504G     0  504G   0% /proc/scsi
tmpfs                    504G     0  504G   0% /sys/firmware

Run the tpcc workload

Run the tpcc workload from a remote host:

[root@host4 ~]# cat addr
postgres://root@host1:26257?sslmode=disable postgres://root@host2:26257?sslmode=disable postgres://root@host3:26257?sslmode=disable
[root@host4 ~]# cockroach workload run tpcc --warehouses=2500 --ramp=5m --duration=30m $(cat addr)

Reference

Each CockroachDB node is an equally suitable SQL gateway to your cluster, but to ensure client performance and reliability, it’s important to use load balancing.

  • Performance: Load balancers spread client traffic across nodes. This prevents any one node from being overwhelmed by requests and improves overall cluster performance (queries per second).

  • Reliability: Load balancers decouple client health from the health of a single CockroachDB node. In cases where a node fails, the load balancer redirects client traffic to available nodes.

HAProxy is one of the most popular open-source TCP load balancers, and CockroachDB includes a built-in command for generating a configuration file that is preset to work with your running cluster.

With a single load balancer, client connections are resilient to node failure, but the load balancer itself is a point of failure. It’s therefore best to make load balancing resilient as well by using multiple load balancing instances, with a mechanism like floating IPs or DNS to select load balancers for clients.

For performance and availability reasons, it is not recommended to run HAProxy on the same node as CockroachDB.

This post shows the detailed steps to deploy an HAProxy load balancer for a CockroachDB cluster.

Deploy the CockroachDB cluster

[root@host1 ~]# ps -ef | grep cockroach | grep -v grep
root      4783     1 99 18:59 ?        2-17:35:59 cockroach start --log-dir=/var/log/cockroachdb_logs/1 --store=/mnt/cockroachdb_mnt1 --insecure --listen-addr=host1:26257 --http-addr=host1:8080 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      4876     1 95 18:59 ?        03:33:42 cockroach start --log-dir=/var/log/cockroachdb_logs/2 --store=/mnt/cockroachdb_mnt2 --insecure --listen-addr=host1:26258 --http-addr=host1:8081 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      4932     1 77 18:59 ?        02:53:13 cockroach start --log-dir=/var/log/cockroachdb_logs/3 --store=/mnt/cockroachdb_mnt3 --insecure --listen-addr=host1:26259 --http-addr=host1:8082 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      5049     1 99 18:59 ?        06:02:17 cockroach start --log-dir=/var/log/cockroachdb_logs/4 --store=/mnt/cockroachdb_mnt4 --insecure --listen-addr=host1:26260 --http-addr=host1:8083 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      5131     1 94 18:59 ?        03:31:51 cockroach start --log-dir=/var/log/cockroachdb_logs/5 --store=/mnt/cockroachdb_mnt5 --insecure --listen-addr=host1:26261 --http-addr=host1:8084 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      5220     1 72 18:59 ?        02:42:39 cockroach start --log-dir=/var/log/cockroachdb_logs/6 --store=/mnt/cockroachdb_mnt6 --insecure --listen-addr=host1:26262 --http-addr=host1:8085 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      5309     1 94 18:59 ?        03:30:22 cockroach start --log-dir=/var/log/cockroachdb_logs/7 --store=/mnt/cockroachdb_mnt7 --insecure --listen-addr=host1:26263 --http-addr=host1:8086 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB
root      5358     1 58 18:59 ?        02:11:20 cockroach start --log-dir=/var/log/cockroachdb_logs/8 --store=/mnt/cockroachdb_mnt8 --insecure --listen-addr=host1:26264 --http-addr=host1:8087 --join=host1:26257,host1:26258,host1:26259,host1:26260,host1:26261,host1:26262,host1:26263,host1:26264,host2:26257,host2:26258,host2:26259,host2:26260,host2:26261,host2:26262,host2:26263,host2:26264,host3:26257,host3:26258,host3:26259,host3:26260,host3:26261,host3:26262,host3:26263,host3:26264 --cache=2GiB --max-sql-memory=2GiB

[root@host1 ~]# cockroach node status --insecure --host=host1:26257
  id |       address       |     sql_address     |  build  |         started_at         |         updated_at         | locality | is_available | is_live
-----+---------------------+---------------------+---------+----------------------------+----------------------------+----------+--------------+----------
   1 | host1:26257 | host1:26257 | v22.1.6 | 2022-09-17 18:59:37.989384 | 2022-09-17 22:43:39.507057 |          | true         | true
   2 | host1:26258 | host1:26258 | v22.1.6 | 2022-09-17 18:59:39.128398 | 2022-09-17 22:43:40.640431 |          | true         | true
   3 | host1:26259 | host1:26259 | v22.1.6 | 2022-09-17 18:59:39.374127 | 2022-09-17 22:43:40.885678 |          | true         | true
   4 | host1:26260 | host1:26260 | v22.1.6 | 2022-09-17 18:59:39.726667 | 2022-09-17 22:43:36.73931  |          | true         | true
   5 | host1:26261 | host1:26261 | v22.1.6 | 2022-09-17 18:59:39.971549 | 2022-09-17 22:43:36.983538 |          | true         | true
   6 | host1:26262 | host1:26262 | v22.1.6 | 2022-09-17 18:59:40.225358 | 2022-09-17 22:43:37.235768 |          | true         | true
   7 | host1:26263 | host1:26263 | v22.1.6 | 2022-09-17 18:59:40.447688 | 2022-09-17 22:43:37.459962 |          | true         | true
   8 | host1:26264 | host1:26264 | v22.1.6 | 2022-09-17 18:59:40.677217 | 2022-09-17 22:43:37.690883 |          | true         | true
   9 | host2:26257 | host2:26257 | v22.1.6 | 2022-09-17 18:59:41.604348 | 2022-09-17 22:43:38.613662 |          | true         | true
  10 | host2:26258 | host2:26258 | v22.1.6 | 2022-09-17 18:59:41.904071 | 2022-09-17 22:43:38.912773 |          | true         | true
  11 | host2:26259 | host2:26259 | v22.1.6 | 2022-09-17 18:59:42.194446 | 2022-09-17 22:43:39.202688 |          | true         | true
  12 | host2:26260 | host2:26260 | v22.1.6 | 2022-09-17 18:59:42.512201 | 2022-09-17 22:43:39.519727 |          | true         | true
  13 | host2:26261 | host2:26261 | v22.1.6 | 2022-09-17 18:59:42.850974 | 2022-09-17 22:43:39.860063 |          | true         | true
  14 | host2:26262 | host2:26262 | v22.1.6 | 2022-09-17 18:59:43.177162 | 2022-09-17 22:43:40.191082 |          | true         | true
  15 | host2:26263 | host2:26263 | v22.1.6 | 2022-09-17 18:59:43.543973 | 2022-09-17 22:43:40.553146 |          | true         | true
  16 | host2:26264 | host2:26264 | v22.1.6 | 2022-09-17 18:59:43.858993 | 2022-09-17 22:43:40.86705  |          | true         | true
  17 | host3:26257 | host3:26257 | v22.1.6 | 2022-09-17 18:59:44.807641 | 2022-09-17 22:43:37.315927 |          | true         | true
  18 | host3:26258 | host3:26258 | v22.1.6 | 2022-09-17 18:59:45.119533 | 2022-09-17 22:43:37.628497 |          | true         | true
  19 | host3:26259 | host3:26259 | v22.1.6 | 2022-09-17 18:59:45.508535 | 2022-09-17 22:43:38.015714 |          | true         | true
  20 | host3:26260 | host3:26260 | v22.1.6 | 2022-09-17 18:59:45.901631 | 2022-09-17 22:43:38.409443 |          | true         | true
  21 | host3:26261 | host3:26261 | v22.1.6 | 2022-09-17 18:59:46.191724 | 2022-09-17 22:43:38.700531 |          | true         | true
  22 | host3:26262 | host3:26262 | v22.1.6 | 2022-09-17 18:59:46.540504 | 2022-09-17 22:43:39.048942 |          | true         | true
  23 | host3:26263 | host3:26263 | v22.1.6 | 2022-09-17 18:59:46.944353 | 2022-09-17 22:43:39.453175 |          | true         | true
  24 | host3:26264 | host3:26264 | v22.1.6 | 2022-09-17 18:59:47.274458 | 2022-09-17 22:43:39.782378 |          | true         | true
(24 rows)

Generate HAProxy configuration file

Install HAProxy on one of the CockroachDB nodes:

[root@host1 ~]# yum install haproxy
[root@host1 ~]# haproxy -v
HA-Proxy version 1.5.18 2016/05/10
Copyright 2000-2016 Willy Tarreau <willy@haproxy.org>

Generate the HAProxy configuration file:

[root@host1 ~]# cockroach gen haproxy --host=host1:26257 --insecure

[root@host1 ~]# cat haproxy.cfg

global
  maxconn 4096

defaults
    mode                tcp

    # Timeout values should be configured for your specific use.
    # See: https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#4-timeout%20connect

    # With the timeout connect 5 secs,
    # if the backend server is not responding, haproxy will make a total
    # of 3 connection attempts waiting 5s each time before giving up on the server,
    # for a total of 15 seconds.
    retries             2
    timeout connect     5s

    # timeout client and server govern the maximum amount of time of TCP inactivity.
    # The server node may idle on a TCP connection either because it takes time to
    # execute a query before the first result set record is emitted, or in case of
    # some trouble on the server. So these timeout settings should be larger than the
    # time to execute the longest (most complex, under substantial concurrent workload)
    # query, yet not too large so truly failed connections are lingering too long
    # (resources associated with failed connections should be freed reasonably promptly).
    timeout client      10m
    timeout server      10m

    # TCP keep-alive on client side. Server already enables them.
    option              clitcpka

listen psql
    bind :26257
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    server cockroach1 host1:26257 check port 8080
    server cockroach2 host1:26258 check port 8081
    server cockroach3 host1:26259 check port 8082
    server cockroach4 host1:26260 check port 8083
    server cockroach5 host1:26261 check port 8084
    server cockroach6 host1:26262 check port 8085
    server cockroach7 host1:26263 check port 8086
    server cockroach8 host1:26264 check port 8087
    server cockroach9 host2:26257 check port 8080
    server cockroach10 host2:26258 check port 8081
    server cockroach11 host2:26259 check port 8082
    server cockroach12 host2:26260 check port 8083
    server cockroach13 host2:26261 check port 8084
    server cockroach14 host2:26262 check port 8085
    server cockroach15 host2:26263 check port 8086
    server cockroach16 host2:26264 check port 8087
    server cockroach17 host3:26257 check port 8080
    server cockroach18 host3:26258 check port 8081
    server cockroach19 host3:26259 check port 8082
    server cockroach20 host3:26260 check port 8083
    server cockroach21 host3:26261 check port 8084
    server cockroach22 host3:26262 check port 8085
    server cockroach23 host3:26263 check port 8086
    server cockroach24 host3:26264 check port 8087

Start HAProxy

It’s not recommended to deploy HAProxy on the same node as CockroachDB for performance and availability reasons. Thus, we are going to start HAProxy on a dedicated node.

Connect to the dedicated HAProxy node and install the proxy on it:

[root@host4 ~]# yum install haproxy

Start HAProxy with the generated configuration file:

[root@host4 ~]# scp host1:/root/haproxy.cfg ./
[root@host4 ~]# haproxy -f haproxy.cfg
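
One quick way to confirm the proxy came up is to check that it is listening on the SQL port; this sketch assumes the ss utility from iproute2 is available on host4:

[root@host4 ~]# ss -ltnp | grep 26257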

Verify the HAProxy

The load balancer distributes the requests sent to the cluster evenly across the nodes.

From one of the cluster nodes (host1), we can check which node each new connection is routed to, as shown below:

[root@host1 ~]# cockroach sql --host=host4 --insecure
root@host4:26257/defaultdb> SHOW node_id;
  node_id
-----------
  1

root@host4:26257/defaultdb> exit
[root@host1 ~]# cockroach sql --host=host4 --insecure
root@host4:26257/defaultdb> SHOW node_id;
  node_id
-----------
  2

[root@host1 ~]# cockroach sql --host=host4 --insecure
root@host4:26257/defaultdb> SHOW node_id;
  node_id
-----------
  3

Run TPCC through the HAProxy load balancer

[root@host4 cockroachdb]# cockroach workload run tpcc --warehouses=5000 --split --scatter --ramp=30s --duration=5m 'postgresql://root@host4:26257/tpcc?sslmode=disable'

Reference
