Using R to analyze the quarterly US COVID data
Goal
Based on the overall death data and COVID realated death data since 2019, we want to study the impact trend of COVID to the overall pupulation death of US in these years.
Download the data
We will use NCHS(National Center for Health Statistics) as our data source.
Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified, and search VSRR Quarterly
, we will find the data we are intrested in.
https://data.cdc.gov/NCHS/NCHS-VSRR-Quarterly-provisional-estimates-for-sele/489q-934x
In this page, we can export data into csv file as NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv
Take a quick look at the data
To load the data:
# If "readr" not installed, run install.packages("readr") to install it
library("readr")
df <- readr::read_csv(file.path(getwd(), "NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv"), col_names = TRUE)
To check the first few lines:
> head(df)
# A tibble: 6 × 69
Year a…¹ Time …² Cause…³ Rate …⁴ Unit Overa…⁵ Rate …⁶ Rate …⁷ Rate …⁸ Rate …⁹
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019 Q1 12 mon… All ca… Age-ad… Deat… 712. 600. 844. NA NA
2 2019 Q1 12 mon… Alzhei… Age-ad… Deat… 29.6 33.1 23.8 NA NA
3 2019 Q1 12 mon… COVID-… Age-ad… Deat… NA NA NA NA NA
4 2019 Q1 12 mon… Cancer Age-ad… Deat… 148. 128. 175. NA NA
5 2019 Q1 12 mon… Chroni… Age-ad… Deat… 11 7.7 14.7 NA NA
6 2019 Q1 12 mon… Chroni… Age-ad… Deat… 38.5 35.7 42.4 NA NA
# … with 59 more variables: `Rate Age 15-24` <dbl>, `Rate Age 25-34` <dbl>,
# `Rate Age 35-44` <dbl>, `Rate Age 45-54` <dbl>, `Rate Age 55-64` <dbl>,
# `Rate 65-74` <dbl>, `Rate Age 75-84` <dbl>, `Rate Age 85 plus` <dbl>,
# `Rate Alaska` <dbl>, `Rate Alabama` <dbl>, `Rate Arkansas` <dbl>,
# `Rate Arizona` <dbl>, `Rate California` <dbl>, `Rate Colorado` <dbl>,
# `Rate Connecticut` <dbl>, `Rate District of Columbia` <dbl>,
# `Rate Delaware` <dbl>, `Rate Florida` <dbl>, `Rate Georgia` <dbl>, …
# ℹ Use `colnames()` to see all variable names
To get a summary:
summary(df)
Year and Quarter Time Period Cause of Death Rate Type
Length:1232 Length:1232 Length:1232 Length:1232
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Unit Overall Rate Rate Sex Female Rate Sex Male
Length:1232 Min. : 1.20 Min. : 0.60 Min. : 1.90
Class :character 1st Qu.: 11.40 1st Qu.: 6.95 1st Qu.: 13.20
Mode :character Median : 17.00 Median : 14.80 Median : 23.50
Mean : 80.51 Mean : 70.56 Mean : 91.63
3rd Qu.: 50.60 3rd Qu.: 49.92 3rd Qu.: 65.42
Max. :1142.30 Max. :1067.00 Max. :1219.90
NA's :44 NA's :44 NA's :44
...
To get glimpse from columns point of view:
> library("dplyr")
> glimpse(df)
Rows: 1,232
Columns: 69
$ `Year and Quarter` <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1"…
$ `Time Period` <chr> "12 months ending with quarter", "12 month…
$ `Cause of Death` <chr> "All causes", "Alzheimer disease", "COVID-…
$ `Rate Type` <chr> "Age-adjusted", "Age-adjusted", "Age-adjus…
$ Unit <chr> "Deaths per 100,000", "Deaths per 100,000"…
$ `Overall Rate` <dbl> 712.2, 29.6, NA, 148.1, 11.0, 38.5, 21.3, …
$ `Rate Sex Female` <dbl> 600.3, 33.1, NA, 127.9, 7.7, 35.7, 16.8, 1…
$ `Rate Sex Male` <dbl> 843.7, 23.8, NA, 175.4, 14.7, 42.4, 26.9, …
$ `Rate Age 1-4` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 5-14` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 15-24` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 25-34` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 35-44` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 45-54` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 55-64` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate 65-74` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Rate Age 75-84` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
...
Clear the columns name
As we can see, the column names has space, it may cause some trouble while refering it in R, it’s a common sugestion to convert space to “_” before doing any R o/p.
A good thing is , there is a R package can help us on it.
> library("janitor")
> df <- clean_names(df)
!> glimpse(df)
Rows: 1,232
Columns: 69
$ year_and_quarter <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", …
$ time_period <chr> "12 months ending with quarter", "12 months …
$ cause_of_death <chr> "All causes", "Alzheimer disease", "COVID-19…
$ rate_type <chr> "Age-adjusted", "Age-adjusted", "Age-adjuste…
$ unit <chr> "Deaths per 100,000", "Deaths per 100,000", …
$ overall_rate <dbl> 712.2, 29.6, NA, 148.1, 11.0, 38.5, 21.3, 20…
$ rate_sex_female <dbl> 600.3, 33.1, NA, 127.9, 7.7, 35.7, 16.8, 13.…
$ rate_sex_male <dbl> 843.7, 23.8, NA, 175.4, 14.7, 42.4, 26.9, 27…
$ rate_age_1_4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
As you can see, while we run glimpse(df)
, the colum name has been changed from “Year and Quarter” to “year_and_quarter”.
Filter and select
The raw data has so many columns and rows, but we just want to focus on those cols/rows we really have intrest.
We can use filter
(applying on rows) and select
(applying on columns) in such condition.
> df1 <- df %>%
filter(time_period == "3-month period" & rate_type == "Crude" & cause_of_death %in% c("All causes", "COVID-19")) %>%
select(year_and_quarter, cause_of_death, overall_rate)
> print(df1,n=50)
# A tibble: 28 × 3
year_and_quarter cause_of_death overall_rate
<chr> <chr> <dbl>
1 2019 Q1 All causes 910
2 2019 Q1 COVID-19 NA
3 2019 Q2 All causes 851.
4 2019 Q2 COVID-19 NA
5 2019 Q3 All causes 827.
6 2019 Q3 COVID-19 NA
7 2019 Q4 All causes 891.
8 2019 Q4 COVID-19 NA
9 2020 Q1 All causes 945.
10 2020 Q1 COVID-19 8.2
11 2020 Q2 All causes 1035.
12 2020 Q2 COVID-19 137.
13 2020 Q3 All causes 985.
14 2020 Q3 COVID-19 87.3
15 2020 Q4 All causes 1142.
16 2020 Q4 COVID-19 193.
17 2021 Q1 All causes 1116
18 2021 Q1 COVID-19 191.
19 2021 Q2 All causes 915.
20 2021 Q2 COVID-19 42.4
21 2021 Q3 All causes 1051.
22 2021 Q3 COVID-19 138.
23 2021 Q4 All causes 1092.
24 2021 Q4 COVID-19 131.
25 2022 Q1 All causes 1116
26 2022 Q1 COVID-19 150.
27 2022 Q2 All causes 899.
28 2022 Q2 COVID-19 17.6
%>%
looks strange, it’s just like | (pipe)
in linux shell command.
df1 <- df %>%
filter(time_period == "3-month period" & rate_type == "Crude" & cause_of_death %in% c("All causes", "COVID-19")) %>%
select(year_and_quarter, cause_of_death, overall_rate)
It means only keep those rows which meet those condition of filter
, and those columns which meet condition of select
.
Deal with NA value
As you can see, in “overall_rate” column, there are few “NA” values, in this context it means 0, so we may want to convert it as 0 for future’s process.
We can do it like as below.
> df1 <- df1 %>%
mutate_at(c("overall_rate"), ~coalesce(.,0))
It means , we want to convert all NA to 0 in “overall_rate” column.
Now, let’s check the df1 again, we can see all NA has been changed to 0.
!+ > df1
# A tibble: 28 × 3
year_and_quarter cause_of_death overall_rate
<chr> <chr> <dbl>
1 2019 Q1 All causes 910
2 2019 Q1 COVID-19 0
3 2019 Q2 All causes 851.
4 2019 Q2 COVID-19 0
5 2019 Q3 All causes 827.
6 2019 Q3 COVID-19 0
7 2019 Q4 All causes 891.
8 2019 Q4 COVID-19 0
9 2020 Q1 All causes 945.
10 2020 Q1 COVID-19 8.2
# … with 18 more rows
Draw diagram for the whole US data
To draw a diagram directly:
> ggplot(df1, aes(fill=cause_of_death, x=year_and_quarter, y=overall_rate)) +
geom_bar(position="stack", stat="identity") +
geom_col() +
geom_smooth(aes(group=cause_of_death)) +
scale_y_continuous(breaks=seq(0,1500,100))
theme_bw()
To save the diagram in a file:
library(sjPlot)
p = ggplot(df1, aes(fill=cause_of_death, x=year_and_quarter, y=overall_rate)) +
geom_bar(position="stack", stat="identity") +
geom_col() +
geom_smooth(aes(group=cause_of_death)) +
scale_y_continuous(breaks=seq(0,1500,100)) +
theme_bw()
save_plot("covid_plot.svg", fig = p, width=30, height=20)
The diagram looks like below.
Draw diagram for California data
Do you want to try it by yourself?
Create/Calculate a new column for covid ratio
Somehow, we want to get to know the trend for covid ratio.
covid_ratio = overall_rate_of_covid / overall_rate_of_all_causes
covid_death_rate <- df1 %>%
filter(cause_of_death == “COVID-19”) %>%
select(“overall_rate”)
all_causes_rate <- df1 %>%
filter(cause_of_death == “All causes”) %>%
select(overall_rate)
covid_ratio <- covid_death_rate / all_causes_ratedf_ratio <- df1 %>%
filter(cause_of_death == “All causes”) %>%
select(year_and_quarter)
df_ratio[“covid_ratio”] = covid_ratioprint(df_ratio)
A tibble: 14 × 2
year_and_quarter covid_ratio
1 2019 Q1 0
2 2019 Q2 0
3 2019 Q3 0
4 2019 Q4 0
5 2020 Q1 0.00868
6 2020 Q2 0.132
7 2020 Q3 0.0886
8 2020 Q4 0.169
9 2021 Q1 0.171
10 2021 Q2 0.0463
11 2021 Q3 0.131
12 2021 Q4 0.120
13 2022 Q1 0.134
14 2022 Q2 0.0196
Draw diagram for covid ratio
Do you want to try it by yourself?