Surprise! How can write IOPS be 10x faster than the cloud native disk?
In a competitive fio write benchmark between a 3rd-party storage solution and a cloud native disk, we noticed that the write performance of the 3rd-party storage solution was 10x faster than that of the cloud native disk. At first this seems impossible, since we would argue that no one can beat raw disk performance. The following case shows an interesting write performance optimization in the storage solution under test.
Cloud native raw disk performance (fio, 4k, write)
The benchmark result shows that the IOPS limit of the cloud drive is ~7500, which aligns with the spec of the cloud storage in use.
10x faster IOPS is observed!!!
In the fio output, the IOPS (>75k) is 10x higher than the result from the cloud native raw disk (~7500). This can be verified from the iostat output: the IOPS at the logical volume layer aligns with the fio output.
I/O size matters!!!
When the write requests reach the physical disk from the logical volume, the I/O size has changed from 4k to 400k. This indicates that the smaller writes got merged into larger writes.
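A quick back-of-the-envelope check, using the numbers observed in this test, shows why merging lifts the apparent IOPS so dramatically:

```shell
# With 4k writes merged into 400k requests, 75k writes/s at the logical
# volume layer only needs 750 requests/s at the physical disk -- well
# under the ~7500 IOPS cap of the cloud drive.
lv_iops=75000         # 4k writes/s observed at the logical volume layer
lv_iosize_kb=4
disk_iosize_kb=400    # merged request size seen at the physical disk
throughput_kb=$((lv_iops * lv_iosize_kb))
disk_iops=$((throughput_kb / disk_iosize_kb))
echo "disk IOPS needed after merging: $disk_iops"
```

The merge trades many small requests for a few large ones, so the per-request IOPS cap is no longer the bottleneck.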
Conclusion
With limited IOPS on the cloud drive, a larger I/O size really helps improve write performance by reducing the number of write requests sent to the disk. With proper write optimization at the logical volume layer, a 10x write performance boost is no surprise!!!
fio direct I/O error with 1k blocksize
When running an fio write test with a small block size (e.g. 1k) and direct I/O, the following error is seen.
$ fio --blocksize=1k --ioengine=libaio --readwrite=randwrite --filesize=2G --group_reporting --direct=1 --iodepth=128 --randrepeat=1 --end_fsync=1 --name=job1 --numjobs=1 --filename=/mnt/fiomnt/fio.dat
job1: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=libaio, iodepth=128
fio-3.7
Starting 1 process
fio: io_u error on file /mnt/fiomnt/fio.dat: Invalid argument: write offset=129521664, buflen=1024
fio: io_u error on file /mnt/fiomnt/fio.dat: Invalid argument: write offset=1589760000, buflen=1024
fio: pid=93922, err=22/file:io_u.c:1747, func=io_u error, error=Invalid argument
job1: (groupid=0, jobs=1): err=22 (file:io_u.c:1747, func=io_u error, error=Invalid argument): pid=93922: Wed Jan 25 00:42:00 2023
cpu : usr=0.00%, sys=0.00%, ctx=1, majf=0, minf=14
IO depths : 1=0.8%, 2=1.6%, 4=3.1%, 8=6.2%, 16=12.5%, 32=25.0%, >=64=50.8%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,128,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Cause:
For direct I/O, the I/O size has to be a multiple of the filesystem/block device block size. In this case, the filesystem block size is 4k, so the requested I/O size (1k) cannot be aligned with it. To fix this, the filesystem block size should be less than or equal to 1k.
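The alignment rule can be checked with simple arithmetic; the mkfs command at the end is one possible fix (the device path is only an example, substitute your own):

```shell
# Direct I/O requires the request size to be a multiple of the
# filesystem block size: 1k requests on a 4k filesystem are rejected.
io_size=1024
fs_block=4096
if [ $((io_size % fs_block)) -ne 0 ]; then
  echo "io size $io_size is not a multiple of fs block $fs_block"
fi
# One fix: recreate the filesystem with a 1k block size, e.g.
#   mkfs.ext4 -b 1024 /dev/sdX
```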
How to use R to analyze US COVID pandemic waves and peaks
Goal
Using weekly data from 2020 to 2022, we want to understand the waves and peaks of the COVID pandemic in these years.
Download the data
We will continue to use NCHS(National Center for Health Statistics) as our data source.
Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified and search for "Provisional COVID-19 Death Counts by Week" to find the data we are interested in.
On https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab, we can export the data into a csv file, Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv.
Load the data
library("readr")
library("dplyr")
library("janitor")
df <- clean_names(read_csv("Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv", col_names = TRUE))
Identify the data we want to focus on
As we can see, there are 4 different groups, and the data covers both the whole United States and each individual state.
> unique(df$group)
df1 <- df %>%
  filter(state == "United States" & group == "By Week") %>%
  select(start_date, covid_19_deaths)
print(df1, n=20)
A tibble: 158 × 2
start_date covid_19_deaths
1 2019-12-29 0
2 2020-01-05 1
3 2020-01-12 2
4 2020-01-19 3
5 2020-01-26 0
6 2020-02-02 4
7 2020-02-09 6
8 2020-02-16 6
9 2020-02-23 9
10 2020-03-01 38
11 2020-03-08 60
12 2020-03-15 588
13 2020-03-22 3226
14 2020-03-29 10141
15 2020-04-05 16347
16 2020-04-12 17221
17 2020-04-19 15557
18 2020-04-26 13223
19 2020-05-03 11243
20 2020-05-10 9239
…

## Draw the graph to see the wave

library("ggplot2")
library("sjPlot")
p = ggplot(df1, aes(x=start_date, y=covid_19_deaths, group=1)) +
  geom_line(color="blue") +
  theme(axis.text.x=element_text(angle=45, hjust=1, size=5))
save_plot("covid_plot_weekly_wave.svg", fig = p, width=60, height=20)

![Image](/images/covid_plot_weekly_wave.svg)

library("pracma")
## Find the peaks with R and mark them in the graph
From the graph above, we can easily figure out the waves and peaks, but we can also let R do it for us, which is pretty useful when we have to deal with a lot of data and many graphs.
To achieve this, we first call `findpeaks` from the `pracma` library to find the peaks:
peaks = findpeaks(df1$covid_19_deaths, npeaks=5, sortstr=TRUE)
peaks
[,1] [,2] [,3] [,4]
[1,] 26027 54 40 66
[2,] 21364 108 98 121
[3,] 17221 16 8 26
[4,] 15536 88 79 98
[5,] 8308 31 26 38
is_peak <- vector("logical", length(df1$covid_19_deaths))
df1$is_peak = is_peak
for (x in peaks[,2]) {
  df1$is_peak[x] = TRUE
}
> df2 = df1 %>% filter(is_peak == TRUE)
> df2[order(-df2$covid_19_deaths),]
A tibble: 5 × 3
start_date covid_19_deaths is_peak
1 2021-01-03 26027 TRUE
2 2022-01-16 21364 TRUE
3 2020-04-12 17221 TRUE
4 2021-08-29 15536 TRUE
5 2020-07-26 8308 TRUE

## Highlight the peak points

p = ggplot(df1, aes(x=start_date, y=covid_19_deaths, group=1)) +
  geom_line(color="blue") +
  geom_point(data = . %>% filter(is_peak == TRUE), stat="identity", size = 4, color = "red") +
  scale_y_continuous(breaks=seq(0,30000,4000)) +
  theme(axis.text.x=element_text(angle=45, hjust=1, size=5))
save_plot("covid_plot_weekly_peak.svg", fig = p, width=60, height=20)
> sum(df1$covid_19_deaths)
[1] 1089714    # the total COVID-19 death count from 2020 to 2022
> summary(df1$covid_19_deaths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    2223    4428    6897    9862   26027
> df3 <- df %>%
  filter(state == "United States" & group == "By Week") %>%
  select(start_date, total_deaths)
> sum(df3$total_deaths)
[1] 10077273   # the total death count (all causes) from 2020 to 2022
> summary(df3$total_deaths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7100   58522   60451   63780   68610   87415
Using R to analyze the quarterly US COVID data
Goal
Based on the overall death data and COVID-related death data since 2019, we want to study the trend of COVID's impact on overall US population deaths in these years.
Download the data
We will use NCHS(National Center for Health Statistics) as our data source.
Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified and search for "VSRR Quarterly" to find the data we are interested in.
On https://data.cdc.gov/NCHS/NCHS-VSRR-Quarterly-provisional-estimates-for-sele/489q-934x, we can export the data into a csv file as NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv.
Take a quick look at the data
To load the data:
# If "readr" not installed, run install.packages("readr") to install it
library("readr")
df <- readr::read_csv(file.path(getwd(), "NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv"), col_names = TRUE)
To check the first few lines:
> head(df)
# A tibble: 6 × 69
Year a…¹ Time …² Cause…³ Rate …⁴ Unit Overa…⁵ Rate …⁶ Rate …⁷ Rate …⁸ Rate …⁹
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019 Q1 12 mon… All ca… Age-ad… Deat… 712. 600. 844. NA NA
2 2019 Q1 12 mon… Alzhei… Age-ad… Deat… 29.6 33.1 23.8 NA NA
3 2019 Q1 12 mon… COVID-… Age-ad… Deat… NA NA NA NA NA
4 2019 Q1 12 mon… Cancer Age-ad… Deat… 148. 128. 175. NA NA
5 2019 Q1 12 mon… Chroni… Age-ad… Deat… 11 7.7 14.7 NA NA
6 2019 Q1 12 mon… Chroni… Age-ad… Deat… 38.5 35.7 42.4 NA NA
# … with 59 more variables: `Rate Age 15-24` <dbl>, `Rate Age 25-34` <dbl>,
# `Rate Age 35-44` <dbl>, `Rate Age 45-54` <dbl>, `Rate Age 55-64` <dbl>,
# `Rate 65-74` <dbl>, `Rate Age 75-84` <dbl>, `Rate Age 85 plus` <dbl>,
# `Rate Alaska` <dbl>, `Rate Alabama` <dbl>, `Rate Arkansas` <dbl>,
# `Rate Arizona` <dbl>, `Rate California` <dbl>, `Rate Colorado` <dbl>,
# `Rate Connecticut` <dbl>, `Rate District of Columbia` <dbl>,
# `Rate Delaware` <dbl>, `Rate Florida` <dbl>, `Rate Georgia` <dbl>, …
# ℹ Use `colnames()` to see all variable names
To get a summary:
summary(df)
Year and Quarter Time Period Cause of Death Rate Type
Length:1232 Length:1232 Length:1232 Length:1232
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Unit Overall Rate Rate Sex Female Rate Sex Male
Length:1232 Min. : 1.20 Min. : 0.60 Min. : 1.90
Class :character 1st Qu.: 11.40 1st Qu.: 6.95 1st Qu.: 13.20
Mode :character Median : 17.00 Median : 14.80 Median : 23.50
Mean : 80.51 Mean : 70.56 Mean : 91.63
3rd Qu.: 50.60 3rd Qu.: 49.92 3rd Qu.: 65.42
Max. :1142.30 Max. :1067.00 Max. :1219.90
NA's :44 NA's :44 NA's :44
...
To get glimpse from columns point of view:
> library("dplyr")
> glimpse(df)
Rows: 1,232
Columns: 69
$ `Year and Quarter` <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1"…
$ `Time Period` <chr> "12 months ending with quarter", "12 month…
$ `Cause of Death` <chr> "All causes", "Alzheimer disease", "COVID-…
$ `Rate Type` <chr> "Age-adjusted", "Age-adjusted", "Age-adjus…
$ Unit <chr> "Deaths per 100,000", "Deaths per 100,000"…
$ `Overall Rate` <dbl> 712.2, 29.6, NA, 148.1, 11.0, 38.5, 21.3, …
$ `Rate Sex Female` <dbl> 600.3, 33.1, NA, 127.9, 7.7, 35.7, 16.8, 1…
$ `Rate Sex Male` <dbl> 843.7, 23.8, NA, 175.4, 14.7, 42.4, 26.9, …
$ `Rate Age 1-4` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 5-14` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 15-24` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 25-34` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 35-44` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 45-54` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 55-64` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate 65-74` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 75-84` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
...
Clean the column names
As we can see, the column names contain spaces, which may cause trouble when referring to them in R; it is a common suggestion to convert spaces to "_" before doing any R operations.
The good thing is, there is an R package that can help us with this.
> library("janitor")
> df <- clean_names(df)
> glimpse(df)
Rows: 1,232
Columns: 69
$ year_and_quarter <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", …
$ time_period <chr> "12 months ending with quarter", "12 months …
$ cause_of_death <chr> "All causes", "Alzheimer disease", "COVID-19…
$ rate_type <chr> "Age-adjusted", "Age-adjusted", "Age-adjuste…
$ unit <chr> "Deaths per 100,000", "Deaths per 100,000", …
$ overall_rate <dbl> 712.2, 29.6, NA, 148.1, 11.0, 38.5, 21.3, 20…
$ rate_sex_female <dbl> 600.3, 33.1, NA, 127.9, 7.7, 35.7, 16.8, 13.…
$ rate_sex_male <dbl> 843.7, 23.8, NA, 175.4, 14.7, 42.4, 26.9, 27…
$ rate_age_1_4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
As you can see, when we run glimpse(df), the column names have been changed, e.g. from "Year and Quarter" to "year_and_quarter".
Filter and select
The raw data has many columns and rows, but we just want to focus on the ones we are really interested in. We can use filter (applied to rows) and select (applied to columns) in this situation.
> df1 <- df %>%
filter(time_period == "3-month period" & rate_type == "Crude" & cause_of_death %in% c("All causes", "COVID-19")) %>%
select(year_and_quarter, cause_of_death, overall_rate)
> print(df1,n=50)
# A tibble: 28 × 3
year_and_quarter cause_of_death overall_rate
<chr> <chr> <dbl>
1 2019 Q1 All causes 910
2 2019 Q1 COVID-19 NA
3 2019 Q2 All causes 851.
4 2019 Q2 COVID-19 NA
5 2019 Q3 All causes 827.
6 2019 Q3 COVID-19 NA
7 2019 Q4 All causes 891.
8 2019 Q4 COVID-19 NA
9 2020 Q1 All causes 945.
10 2020 Q1 COVID-19 8.2
11 2020 Q2 All causes 1035.
12 2020 Q2 COVID-19 137.
13 2020 Q3 All causes 985.
14 2020 Q3 COVID-19 87.3
15 2020 Q4 All causes 1142.
16 2020 Q4 COVID-19 193.
17 2021 Q1 All causes 1116
18 2021 Q1 COVID-19 191.
19 2021 Q2 All causes 915.
20 2021 Q2 COVID-19 42.4
21 2021 Q3 All causes 1051.
22 2021 Q3 COVID-19 138.
23 2021 Q4 All causes 1092.
24 2021 Q4 COVID-19 131.
25 2022 Q1 All causes 1116
26 2022 Q1 COVID-19 150.
27 2022 Q2 All causes 899.
28 2022 Q2 COVID-19 17.6
%>% may look strange; it works just like | (pipe) in a Linux shell command.
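For comparison, here is the same idea in a shell pipeline, where each command's output becomes the next command's input:

```shell
# Sort the lines, then count duplicates -- data flows left to right,
# just as %>% passes its left-hand result to the next verb.
printf 'b\na\nb\n' | sort | uniq -c
```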
df1 <- df %>%
filter(time_period == "3-month period" & rate_type == "Crude" & cause_of_death %in% c("All causes", "COVID-19")) %>%
select(year_and_quarter, cause_of_death, overall_rate)
It means: only keep the rows that meet the conditions of filter, and the columns that meet the conditions of select.
Deal with NA values
As you can see, the "overall_rate" column contains a few "NA" values; in this context they mean 0, so we want to convert them to 0 for further processing.
We can do it as below.
> df1 <- df1 %>%
mutate_at(c("overall_rate"), ~coalesce(.,0))
This converts all NA values to 0 in the "overall_rate" column.
Now, let's check df1 again; we can see all NA values have been changed to 0.
> df1
# A tibble: 28 × 3
year_and_quarter cause_of_death overall_rate
<chr> <chr> <dbl>
1 2019 Q1 All causes 910
2 2019 Q1 COVID-19 0
3 2019 Q2 All causes 851.
4 2019 Q2 COVID-19 0
5 2019 Q3 All causes 827.
6 2019 Q3 COVID-19 0
7 2019 Q4 All causes 891.
8 2019 Q4 COVID-19 0
9 2020 Q1 All causes 945.
10 2020 Q1 COVID-19 8.2
# … with 18 more rows
Draw diagram for the whole US data
To draw a diagram directly:
> ggplot(df1, aes(fill=cause_of_death, x=year_and_quarter, y=overall_rate)) +
  geom_bar(position="stack", stat="identity") +
  geom_smooth(aes(group=cause_of_death)) +
  scale_y_continuous(breaks=seq(0,1500,100)) +
  theme_bw()
To save the diagram in a file:
library(sjPlot)
p = ggplot(df1, aes(fill=cause_of_death, x=year_and_quarter, y=overall_rate)) +
  geom_bar(position="stack", stat="identity") +
  geom_smooth(aes(group=cause_of_death)) +
  scale_y_continuous(breaks=seq(0,1500,100)) +
  theme_bw()
save_plot("covid_plot.svg", fig = p, width=30, height=20)
The diagram looks like below.
Draw diagram for California data
Do you want to try it by yourself?
Create/calculate a new column for the covid ratio
We want to understand the trend of the covid ratio:
covid_ratio = overall_rate_of_covid / overall_rate_of_all_causes
covid_death_rate <- df1 %>%
  filter(cause_of_death == "COVID-19") %>%
  select(overall_rate)
all_causes_rate <- df1 %>%
  filter(cause_of_death == "All causes") %>%
  select(overall_rate)
covid_ratio <- covid_death_rate / all_causes_rate

df_ratio <- df1 %>%
  filter(cause_of_death == "All causes") %>%
  select(year_and_quarter)
df_ratio["covid_ratio"] = covid_ratio
print(df_ratio)
A tibble: 14 × 2
year_and_quarter covid_ratio
1 2019 Q1 0
2 2019 Q2 0
3 2019 Q3 0
4 2019 Q4 0
5 2020 Q1 0.00868
6 2020 Q2 0.132
7 2020 Q3 0.0886
8 2020 Q4 0.169
9 2021 Q1 0.171
10 2021 Q2 0.0463
11 2021 Q3 0.131
12 2021 Q4 0.120
13 2022 Q1 0.134
14 2022 Q2 0.0196
Draw diagram for covid ratio
Do you want to try it by yourself?
Getting started with R
Installation
The simplest way is to use Homebrew:
$ brew install r
Another way is to download installation package from https://cloud.r-project.org/
“Hello world” of R
Run it from R console
Running the command R starts an R console, where you can run R code.
(base) ➜ benchling git:(b_test_pr) ✗ R
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)
...
> print("hello,world")
[1] "hello,world"
Run it from terminal
Rscript is a binary front-end to R, for use in scripting applications; see https://linux.die.net/man/1/rscript for more detail.
(base) ➜ R git:(b_test_pr) ✗ cat hello.R
print("hello,world")
(base) ➜ R git:(b_test_pr) ✗ Rscript hello.R
[1] "hello,world"
Install commonly used packages
The R installation comes with a lot of useful packages; besides those, many more are available from CRAN.
Here are some of the most important packages in R for data science.
- ggplot2
- data.table
- dplyr
- tidyr
- Shiny
- plotly
- knitr
- mlr3
To install these packages from CRAN, we can simply follow the steps below.
- Start the R console
- Call install.packages("<package name>")
Here is an example:
> install.packages("mlr3")
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors
1: 0-Cloud [https]
2: Australia (Canberra) [https]
3: Australia (Melbourne 1) [https]
....
Selection: 1
also installing the dependencies ‘globals’, ‘listenv’, ‘PRROC’, ‘future’, ‘future.apply’, ‘lgr’, ‘mlbench’, ‘mlr3measures’, ‘mlr3misc’, ‘parallelly’, ‘palmerpenguins’, ‘paradox’
trying URL 'https://cloud.r-project.org/bin/macosx/big-sur-arm64/contrib/4.2/globals_0.16.2.tgz'
...
> library(mlr3)
> ?mlr3
As shown above, after the installation completes, we can run library(<package name>) to verify the package, and ?<package name> to see its documentation.
How to redirect the default "404 page not found" error page in Ghost blog
When users access a non-existent page in a Ghost blog, they may see a page showing the message "404 page not found".
We can redirect them to another page by creating an error-404.hbs file under the theme folder.
In the following example, we redirect users to the website home page when they access a non-existent page.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="0; url='http://localhost:2368/'" />
</head>
<body>
</body>
</html>
How to install GUI desktop on CentOS 7
Install GNOME desktop via the groups option of yum:
$ yum update
$ yum -y groups install "GNOME Desktop"
Tell the startx command which desktop environment to run:
$ echo "exec gnome-session" >> ~/.xinitrc
Manually start the GUI desktop:
$ startx
Automatically start the GUI desktop after reboot:
$ systemctl set-default graphical.target
$ reboot