How to use R to analyze US COVID pandemic waves and peaks
Goal
According to weekly data between 2020 to 2022, we want to get to know the waves and peaks of COVID pandemic in these years.
Download the data
We will continue to use NCHS(National Center for Health Statistics) as our data source.
Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified, and search Provisional COVID-19 Death Counts by Week
, we will find the data we are intrest.
https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab, in this page, we can export data into csv file.
With that, we may get the data source csv, Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv
.
Load the data
1 | ibrary("dplyr") |
Identify the data we want to focus
As we can see, there are 4 diffrent groups, and it has the whole United states
and each state’s data
1 | > unique(df$group) |
df1 <- df %>%
filter(state == “United States” & group == “By Week”) %>%
select(start_date, covid_19_deaths)
print(df1,n=20)
A tibble: 158 × 2
start_date covid_19_deaths
1 2019-12-29 0
2 2020-01-05 1
3 2020-01-12 2
4 2020-01-19 3
5 2020-01-26 0
6 2020-02-02 4
7 2020-02-09 6
8 2020-02-16 6
9 2020-02-23 9
10 2020-03-01 38
11 2020-03-08 60
12 2020-03-15 588
13 2020-03-22 3226
14 2020-03-29 10141
15 2020-04-05 16347
16 2020-04-12 17221
17 2020-04-19 15557
18 2020-04-26 13223
19 2020-05-03 11243
20 2020-05-10 9239
…library(“ggplot2”)1
2
3
## Draw the graph to see the wave
library(“sjPlot”)
p = ggplot(df1, aes( x=start_date, y=covid_19_deaths, group=1)) +
geom_line(color=”blue”) +
theme(axis.text.x=element_text(angle=45,hjust=1,size=5))
save_plot(“covid_plot_weekly_wave.svg”, fig = p, width=60, height=20)library(“pracma”)1
2
3
4
5
6
7
8
9
![Image](/images/covid_plot_weekly_wave.svg)
## Find the peak by R mark it in the graph
From above graph, we can easily to figure out the waves and peaks, but we also can let R help us to do it, it's pretty useful if we have to deal with many data and many graphs.
To achive it, firstly we can call `findpeaks` from `pracma` library to find the peaks- peaks = findpeaks(df1$covid_19_deaths, npeaks=5, sortstr=TRUE)
peaks
[,1] [,2] [,3] [,4]
[1,] 26027 54 40 66
[2,] 21364 108 98 121
[3,] 17221 16 8 26
[4,] 15536 88 79 98
[5,] 8308 31 26 38
1 |
|
is_peak <- vector( “logical” , length(df1$covid_19_deaths ))
df1$is_peak = is_peak
for (x in peaks[,2]) {
df1$is_peak[x] = TRUE
}
1 |
|
!> df2 = df1 %>% filter(is_peak == TRUE)
- df2[order(-df2$covid_19_deaths),]
A tibble: 5 × 3
start_date covid_19_deaths is_peak
1 2021-01-03 26027 TRUE
2 2022-01-16 21364 TRUE
3 2020-04-12 17221 TRUE
4 2021-08-29 15536 TRUE
5 2020-07-26 8308 TRUEp = ggplot(df1, aes(x=start_date, y=covid_19_deaths, group=1)) +1
2
3
## Hightlight the peak points
geom_line(color=”blue”) +
geom_point(data = . %>% filter(is_peak == TRUE), stat=”identity”, size = 4, color = “red”) +
scale_y_continuous(breaks=seq(0,30000,4000)) +
theme(axis.text.x=element_text(angle=45,hjust=1,size=5))
save_plot(“covid_plot_weekly_peak.svg”, fig = p, width=60, height=20)
1 |
|
!> > sum(df1$covid_19_deaths)
[1] 1089714 ===> the total covid_19_deaths death number from 2020 to 2022
!> summary(df1$covid_19_deaths)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2223 4428 6897 9862 26027
!> df3 <- df %>%
filter(state == "United States" & group == "By Week") %>%
select(start_date, total_deaths)
- sum(df3$total_deaths)
[1] 10077273 ===> the total death number from 2020 to 2022
- summary(df3$total_deaths)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7100 58522 60451 63780 68610 87415