Using ANOVA in R to analyze US COVID data to understand age impact to death rate

Posted on 2023-01-12 Edited on 2023-10-22 In Tech , Big Data Views:

Goal

We want to analysis the relationship between age and covid death rate in US.

Download the data

We will use NCHS(National Center for Health Statistics) as our data source.

Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified, and search VSRR Quarterly, we will find the data we are intrest.

https://data.cdc.gov/NCHS/NCHS-VSRR-Quarterly-provisional-estimates-for-sele/489q-934x, in this page, we can export data into csv file.

With that, we may can download data source csv, NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv

Load the data

library("dplyr")
library("ggplot2")
library("janitor")
library("tidyr")
library("readr")
library("sjPlot")
df <- readr::read_csv(file.path(getwd(), "NCHS_-_VSRR_Quarterly_provisional_estimates_for_selected_indicators_of_mortality.csv"), col_names = TRUE)
df <- clean_names(df)
# To fix the colname format from origin csv file
df <- df %>% rename("rate_age_65_74" = "rate_65_74")

Filter and Select

df1 <- df %>%
    filter(time_period == "3-month period" & rate_type == "Crude" & cause_of_death %in% c("COVID-19")) %>%
    select(year_and_quarter, rate_age_1_4, rate_age_5_14,rate_age_15_24, rate_age_25_34,rate_age_35_44, rate_age_45_54,rate_age_55_64,rate_age_65_74,rate_age_75_84,rate_age_85_plus)
  # take a quick look from column's point of view
!> glimpse(df1)
 Rows: 14
 Columns: 11
 $ year_and_quarter <chr> "2019 Q1", "2019 Q2", "2019 Q3", "2019 Q4", "2020 Q1"…
 $ rate_age_1_4     <dbl> NA, NA, NA, NA, NA, 0.2, NA, 0.2, 0.2, 0.2, 0.6, 0.4,…
 $ rate_age_5_14    <dbl> NA, NA, NA, NA, NA, 0.1, 0.2, 0.2, 0.2, 0.1, 0.6, 0.5…
 $ rate_age_15_24   <dbl> NA, NA, NA, NA, 0.1, 1.4, 1.5, 1.6, 1.9, 1.0, 5.9, 4.…
 $ rate_age_25_34   <dbl> NA, NA, NA, NA, 0.8, 6.6, 5.4, 6.7, 8.9, 4.6, 23.8, 1…
 $ rate_age_35_44   <dbl> NA, NA, NA, NA, 2.1, 18.8, 16.1, 20.6, 25.1, 11.9, 65…
 $ rate_age_45_54   <dbl> NA, NA, NA, NA, 4.9, 56.1, 42.8, 64.4, 79.9, 33.9, 14…
 $ rate_age_55_64   <dbl> NA, NA, NA, NA, 9.3, 129.8, 96.0, 162.0, 201.9, 67.4,…
 $ rate_age_65_74   <dbl> NA, NA, NA, NA, 19.0, 293.6, 206.7, 413.6, 464.8, 109…
 $ rate_age_75_84   <dbl> NA, NA, NA, NA, 43.9, 725.6, 465.8, 1112.4, 1110.7, 1…
 $ rate_age_85_plus <dbl> NA, NA, NA, NA, 97.7, 2210.8, 1127.2, 3101.9, 2783.8,…

`pivot_longer` to reorg the data for grouping

df2 = df1 %>% pivot_longer(names_to = "rate_type", values_to = "rate_of_10k",cols = -c(year_and_quarter))
# convert NA to 0
df2 <- df2 %>%
    mutate_at(c("rate_of_10k"), ~coalesce(.,0))
!> df2
 # A tibble: 140 × 3
    year_and_quarter rate_type        rate_of_10k
    <chr>            <chr>                  <dbl>
  1 2019 Q1          rate_age_1_4               0
  2 2019 Q1          rate_age_5_14              0
  3 2019 Q1          rate_age_15_24             0
  4 2019 Q1          rate_age_25_34             0
  5 2019 Q1          rate_age_35_44             0
  6 2019 Q1          rate_age_45_54             0
  7 2019 Q1          rate_age_55_64             0
  8 2019 Q1          rate_age_65_74             0
  9 2019 Q1          rate_age_75_84             0
 10 2019 Q1          rate_age_85_plus           0

Draw diagram for each group

plotdata <- df2 %>%
    group_by(rate_type) %>%
    summarize(n = n(),
            mean = mean(rate_of_10k),
            sd = sd(rate_of_10k),
            ci = qt(0.975, df = n - 1) * sd / sqrt(n))
p = ggplot(plotdata,
       aes(x = factor(rate_type, level = c('rate_age_1_4','rate_age_5_14','rate_age_15_24','rate_age_25_34','rate_age_35_44','rate_age_45_54','rate_age_55_64','rate_age_65_74','rate_age_75_84','rate_age_85_plus'))
         , y = mean, group = 1)) +
    geom_line(linetype="dashed", color="darkgrey") +
    geom_errorbar(aes(ymin = mean - ci,
                      ymax = mean + ci),
                  width = .2) +
    geom_point(size = 3, color="red") +
    scale_y_continuous(breaks=seq(0,1800,100)) +
    theme_bw() +
    labs(x="rate_type",
         y="rate_of_10k",
         title="Mean Plot with 95% Confidence Interval")
save_plot("covid_plot_age_impact1.svg", fig = p, width=30, height=20)

ANOVA analysis

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means.

It’s very useful in our case, aov from R make it very easy to use, see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/aov for more detail.

fit <- aov(rate_of_10k ~ rate_type, data=df2)
summary(fit)
!> summary(fit)
              Df   Sum Sq Mean Sq F value   Pr(>F)
 rate_type     9 12988087 1443121   10.21 8.49e-12 ***
 Residuals   130 18370513  141312
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

we saw 8.49e-12 ***, which means age has big impact on covid death rate.

Compare diffrent age ranges

pairwise <- TukeyHSD(fit)
pairwise

 + pairwise
 >   Tukey multiple comparisons of means
     95% family-wise confidence level

 Fit: aov(formula = rate_of_10k ~ rate_type, data = df2)

 $rate_type
                                          diff        lwr       upr     p adj
 rate_age_15_24-rate_age_1_4        1.25000000 -456.16974  458.6697 1.0000000
 rate_age_25_34-rate_age_1_4        5.87857143 -451.54117  463.2983 1.0000000
 rate_age_35_44-rate_age_1_4       16.50714286 -440.91259  473.9269 1.0000000
 rate_age_45_54-rate_age_1_4       43.45714286 -413.96259  500.8769 0.9999996
 rate_age_5_14-rate_age_1_4        -0.02857143 -457.44831  457.3912 1.0000000
 ....

if p adj is close to 1, it means no big difference for this pair

if p adj is close to 0, it means big difference for this pair

Draw diagram for pairs comparisons

plotdata <- as.data.frame(pairwise[[1]])
plotdata$conditions <- row.names(plotdata)
p = ggplot(data=plotdata, aes(x=conditions, y=diff)) +
  geom_errorbar(aes(ymin=lwr, ymax=upr, width=.2)) +
  geom_hline(yintercept=0, color="red", linetype="dashed") +
  geom_point(size=3, color="red") +
  theme_bw() +
  labs(y="Difference in mean levels", x="",
       title="95% family-wise confidence level") +
   coord_flip()
save_plot("covid_plot_age_impact2.svg", fig = p, width=30, height=20)

Conclusion

Age does have big impact for covid death rate.
85+ was impacted the most.
For 35-, the death rate is very low, and no big difference from age 0-35.