## # A tibble: 6 x 8
## country continent year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 0.779 84.3
## 2 Afghanistan Asia 1957 30.3 9240934 821. 0.821 92.4
## 3 Afghanistan Asia 1962 32.0 10267083 853. 0.853 103.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 0.836 115.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 0.740 131.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 0.786 149.
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap gdp1000 pop100k
## Min. :6.001e+04 Min. : 241.2 Min. : 0.2412 Min. : 0.60
## 1st Qu.:2.794e+06 1st Qu.: 1202.1 1st Qu.: 1.2021 1st Qu.: 27.94
## Median :7.024e+06 Median : 3531.8 Median : 3.5318 Median : 70.24
## Mean :2.960e+07 Mean : 7215.3 Mean : 7.2153 Mean : 296.01
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5 3rd Qu.: 9.3255 3rd Qu.: 195.85
## Max. :1.319e+09 Max. :113523.1 Max. :113.5231 Max. :13186.83
##
## [1] Afghanistan Albania Algeria
## [4] Angola Argentina Australia
## [7] Austria Bahrain Bangladesh
## [10] Belgium Benin Bolivia
## [13] Bosnia and Herzegovina Botswana Brazil
## [16] Bulgaria Burkina Faso Burundi
## [19] Cambodia Cameroon Canada
## [22] Central African Republic Chad Chile
## [25] China Colombia Comoros
## [28] Congo, Dem. Rep. Congo, Rep. Costa Rica
## [31] Cote d'Ivoire Croatia Cuba
## [34] Czech Republic Denmark Djibouti
## [37] Dominican Republic Ecuador Egypt
## [40] El Salvador Equatorial Guinea Eritrea
## [43] Ethiopia Finland France
## [46] Gabon Gambia Germany
## [49] Ghana Greece Guatemala
## [52] Guinea Guinea-Bissau Haiti
## [55] Honduras Hong Kong, China Hungary
## [58] Iceland India Indonesia
## [61] Iran Iraq Ireland
## [64] Israel Italy Jamaica
## [67] Japan Jordan Kenya
## [70] Korea, Dem. Rep. Korea, Rep. Kuwait
## [73] Lebanon Lesotho Liberia
## [76] Libya Madagascar Malawi
## [79] Malaysia Mali Mauritania
## [82] Mauritius Mexico Mongolia
## [85] Montenegro Morocco Mozambique
## [88] Myanmar Namibia Nepal
## [91] Netherlands New Zealand Nicaragua
## [94] Niger Nigeria Norway
## [97] Oman Pakistan Panama
## [100] Paraguay Peru Philippines
## [103] Poland Portugal Puerto Rico
## [106] Reunion Romania Rwanda
## [109] Sao Tome and Principe Saudi Arabia Senegal
## [112] Serbia Sierra Leone Singapore
## [115] Slovak Republic Slovenia Somalia
## [118] South Africa Spain Sri Lanka
## [121] Sudan Swaziland Sweden
## [124] Switzerland Syria Taiwan
## [127] Tanzania Thailand Togo
## [130] Trinidad and Tobago Tunisia Turkey
## [133] Uganda United Kingdom United States
## [136] Uruguay Venezuela Vietnam
## [139] West Bank and Gaza Yemen, Rep. Zambia
## [142] Zimbabwe
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
df %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = continent, group = country)) +
labs(title = "Gapminder Life Expectancy By Year")df %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = continent, group = country)) +
geom_smooth(se = T, method = 'loess') +
facet_wrap(~continent) +
labs(title = "Gapminder Life Expectancy By Year")## `geom_smooth()` using formula 'y ~ x'
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
## [7] "gdp1000" "pop100k"
Let’s create a basic model of life expectancy, predicted by GDP per capita, year, and population. But let’s start with just Asia.
# keep only countries within the continent of Asia
mod1_df <- df %>%
filter(continent == "Asia")
# formula outcome ~ pred1 + pred2 + pred3
mod1 <- lm(lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
# can summarise our model with the summary function
summary(mod1)
##
## Call:
## lm(formula = lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1852 -5.2575 0.4857 5.0062 17.9753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.833e+02 4.829e+01 -16.221 < 2e-16 ***
## year 4.251e-01 2.442e-02 17.404 < 2e-16 ***
## pop100k 4.228e-06 2.039e-04 0.021 0.983
## gdp1000 2.510e-01 3.011e-02 8.336 1.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.232 on 392 degrees of freedom
## Multiple R-squared: 0.5222, Adjusted R-squared: 0.5186
## F-statistic: 142.8 on 3 and 392 DF, p-value: < 2.2e-16
We see that year and gdp1000 are significant. We should also check for normality of residuals, heteroscedasticity, etc., which we can do by simply calling the plot function on our model (sketched below).
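The diagnostic-plot chunk isn’t shown, but the base-R diagnostics presumably come from something like:

par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2x2 grid
plot(mod1)           # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1)) # reset the plotting layout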
The Q-Q plot looks somewhat normal, residuals vs. fitted looks all right, there aren’t many high-leverage points… alright, we’re not being too picky here.
We can also look at component-plus-residual plots of our data to check the linearity of our predictors with respect to the outcome (this is only useful for a model with multiple predictors).
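The call that produced these plots isn’t shown; presumably the component-plus-residual plots come from car’s crPlots applied to mod1:

# load car for component-plus-residual (partial residual) plots
library(car)
crPlots(mod1)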
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
Here we can see that gdp1000 is not necessarily linear, and population might stray from linearity as well. We might want to amend this in our model. However, population by itself wasn’t significant, so we’ll leave it untransformed and only add a squared term for gdp1000.
mod2 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2) + pop100k, data = mod1_df)
# I() tells R to evaluate the expression 'as is' (arithmetically squaring gdp1000), rather than interpreting the operator as formula syntax
summary(mod2)
##
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k,
## data = mod1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.670 -4.343 -0.003 5.281 16.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.260e+02 4.722e+01 -13.257 <2e-16 ***
## year 3.442e-01 2.394e-02 14.380 <2e-16 ***
## gdp1000 7.906e-01 6.539e-02 12.092 <2e-16 ***
## I(gdp1000^2) -6.664e-03 7.332e-04 -9.089 <2e-16 ***
## pop100k 2.956e-04 1.882e-04 1.571 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.49 on 391 degrees of freedom
## Multiple R-squared: 0.6055, Adjusted R-squared: 0.6015
## F-statistic: 150.1 on 4 and 391 DF, p-value: < 2.2e-16
If we want to check whether our model has improved, we can use the anova function to conduct a goodness-of-fit F-test between two nested models. The ANOVA below tests whether mod2 fits better than mod1; if p < 0.05, the improvement in fit is statistically significant.
# let's check our model's fit from the first model. The below code
# performs an F-test
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: lifeExp ~ year + pop100k + gdp1000
## Model 2: lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 392 26566
## 2 391 21933 1 4633.6 82.605 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary function is great, but its output is hard to work with directly. The broom package was developed to address this: it lets you fortify your data with fitted values, turn model output into a data frame, and extract goodness-of-fit statistics.
broom has 3 very useful functions called tidy, augment, and glance. tidy returns the same information as your summary table (and more) in the form of a data frame. You can also ask for confidence intervals, p-values, and exponentiated estimates, among other things, depending on the model you’ve fit.
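The chunk that produced the table below isn’t shown; given the conf.low and conf.high columns, it presumably tidied mod2 with confidence intervals:

tidy(mod2, conf.int = TRUE)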
## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -626. 47.2 -13.3 2.20e-33 -719. -533.
## 2 year 0.344 0.0239 14.4 6.19e-38 0.297 0.391
## 3 gdp1000 0.791 0.0654 12.1 8.19e-29 0.662 0.919
## 4 I(gdp1000^2) -0.00666 0.000733 -9.09 5.09e-18 -0.00811 -0.00522
## 5 pop100k 0.000296 0.000188 1.57 1.17e- 1 -0.0000744 0.000666
Augment returns a data frame of fitted values for your data, which you can then use to fit your trend line over your data. However, because we have a multivariable model, a trend line doesn’t fit nicely to any one variable. I’ll make a univariate model so this will make more sense.
mod3 <- lm(lifeExp ~ year, data = mod1_df)
augdata <- augment(mod3, interval = "prediction")
augdata %>%
ggplot(aes(x = year)) +
geom_point(aes(y = lifeExp)) +
geom_line(aes(y = .fitted), col = 'red') +
geom_line(aes(y = .upper), col = 'blue') +
geom_line(aes(y = .lower), col = 'blue') +
geom_smooth(aes(y = lifeExp), se = F, color = 'green') +
labs(title = "LM of LifeExpectancy vs. Year")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Side note: I will also mention that augment has some drawbacks. If you are using a Poisson or negative binomial distribution to fit your data, augment returns fitted values in their non-exponentiated form. An alternative to augment is to get values from the model object itself using the $ operator. Remember, your model object is simply a list, so you can access elements from it like any other list.
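The code isn’t shown, but the three outputs below presumably come from pulling pieces out of mod1 directly:

mod1$coefficients           # named vector of coefficient estimates
summary(mod1$residuals)     # five-number summary (plus mean) of the residuals
summary(mod1$fitted.values) # five-number summary (plus mean) of the fitted values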
## (Intercept) year pop100k gdp1000
## -7.833346e+02 4.250632e-01 4.228480e-06 2.510235e-01
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26.1852 -5.2575 0.4857 0.0000 5.0062 17.9753
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46.47 52.95 59.83 60.06 66.91 82.34
Back to the above visual… As you can see, we have another test of our fit: our fitted line in red, prediction intervals (from the augmented data frame) in blue, and a lowess smoother in green, showing slight deviations from linearity in our model. In all honesty, you should probably plot a lowess smoother over your data before you make a linear model, to check whether the relationship is linear. One shortcut is to plot one geom_smooth of your data using method = 'loess' and another using method = 'lm', which plot a loess smoother and a fitted line, respectively (sketched below).
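A minimal sketch of that shortcut, assuming we check lifeExp against gdp1000 in the Asian subset:

mod1_df %>%
ggplot(aes(x = gdp1000, y = lifeExp)) +
geom_point(alpha = 0.3) +
geom_smooth(method = 'loess', se = FALSE) +            # loess smoother (default blue)
geom_smooth(method = 'lm', se = FALSE, color = 'red')  # fitted straight line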
Another shortcut is pairs.panels() from the psych package, which plots a scatterplot and correlation matrix of all selected variables from your data frame:
mod1_df %>%
dplyr::select(-continent, -country, -pop, -gdpPercap) %>%
psych::pairs.panels(breaks = 40)
Finally, we have glance, which gives us goodness-of-fit statistics for our models.
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5222142 | 0.5185577 | 8.232329 | 142.8171 | 0 | 3 | -1394.685 | 2799.37 | 2819.277 | 26566.33 | 392 | 396 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6055482 | 0.6015129 | 7.489585 | 150.0623 | 0 | 4 | -1356.735 | 2725.47 | 2749.359 | 21932.71 | 391 | 396 |
Sometimes we want our model in a neat output table, without having to reformat the table every time we change our model. We can programmatically make model tables with the tab_model function in sjPlot.
# library(sjPlot)
sjPlot::tab_model(mod1, mod2, mod3, show.intercept = FALSE, show.ci = 0.95, show.se = F, show.p = T,
show.r2 = TRUE, show.fstat = T, show.aic = T, show.obs = T, rm.terms = NULL, order.terms = NULL,
title = "Comparison of Gapminder Models")| life Exp | life Exp | life Exp | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| year | 0.43 | 0.38 – 0.47 | <0.001 | 0.34 | 0.30 – 0.39 | <0.001 | 0.45 | 0.40 – 0.50 | <0.001 |
| pop100k | 0.00 | -0.00 – 0.00 | 0.983 | 0.00 | -0.00 – 0.00 | 0.117 | |||
| gdp1000 | 0.25 | 0.19 – 0.31 | <0.001 | 0.79 | 0.66 – 0.92 | <0.001 | |||
| gdp1000^2 | -0.01 | -0.01 – -0.01 | <0.001 | ||||||
| Observations | 396 | 396 | 396 | ||||||
| R2 / R2 adjusted | 0.522 / 0.519 | 0.606 / 0.602 | 0.436 / 0.434 | ||||||
| AIC | 2799.370 | 2725.470 | 2861.319 | ||||||
Now we can compare models programmatically without having to export the models into an Excel table every time we run our code! This function works with most of R’s built-in modelling functions. There’s another variant called plot_models which will plot the estimates of our models together.
sjPlot::plot_models(mod1, mod2, mod3, show.values = T, colors = "Dark2", ci.lvl = 0.99, axis.lim = c(-0.5, 0.5),
dot.size = 2, show.p = TRUE)
Obviously, this plot is a little underwhelming because we have a very simple model, and our predictors don’t show much spread from zero, but you can see how this type of plot could be useful for logistic, Poisson, negative binomial, or other types of regressions with much wider confidence intervals.
So we’ve made a model for one continent… What if we wanted to make a model for every continent in our data set? Should we go through that same workflow for each one? There is a faster way, but it’s somewhat tricky, so I’ll present it here and leave it up to your discretion. Oftentimes brute-force code takes longer to write and runs to many more lines, but it can be more understandable in the end. I’ll show the programmatic way of making many models, though.
The tidyverse allows the user to nest data within a column to make a new data set (via tidyr’s nest function), and the purrr package implements a programmatic concept called mapping, which is simply applying a function to every element of a list. The combination of nesting a data frame into a column and mapping a function onto that column will allow us to efficiently make many models (though the model specification has to be the same across groups). Here’s how to nest (sketched below).
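The nesting chunk isn’t shown; a minimal sketch, assuming the nested frame is named nest_df (the later code refers to nest_df_mod, so this base name is a guess):

# group by continent, then collapse each group's rows into a list-column of tibbles
nest_df <- df %>%
group_by(continent) %>%
nest()
head(nest_df)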
## # A tibble: 5 x 2
## # Groups: continent [5]
## continent data
## <fct> <list>
## 1 Asia <tibble [396 x 7]>
## 2 Europe <tibble [360 x 7]>
## 3 Africa <tibble [624 x 7]>
## 4 Americas <tibble [300 x 7]>
## 5 Oceania <tibble [24 x 7]>
It’s as easy as that! First you group by the category that you want to model across; in this case we want to apply similar models across continents. Next we have to make a model function that we can apply across the nested data frames. Before that, I’ll show you what’s contained in the data column.
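Assuming the nested frame is called nest_df, the two tibbles below were presumably printed with:

head(nest_df$data[[1]])  # first rows of the first nested tibble (Asia)
tail(nest_df$data[[1]])  # last rows of the same tibble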
## # A tibble: 6 x 7
## country year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 8425333 779. 0.779 84.3
## 2 Afghanistan 1957 30.3 9240934 821. 0.821 92.4
## 3 Afghanistan 1962 32.0 10267083 853. 0.853 103.
## 4 Afghanistan 1967 34.0 11537966 836. 0.836 115.
## 5 Afghanistan 1972 36.1 13079460 740. 0.740 131.
## 6 Afghanistan 1977 38.4 14880372 786. 0.786 149.
## # A tibble: 6 x 7
## country year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Yemen, Rep. 1982 49.1 9657618 1978. 1.98 96.6
## 2 Yemen, Rep. 1987 52.9 11219340 1972. 1.97 112.
## 3 Yemen, Rep. 1992 55.6 13367997 1879. 1.88 134.
## 4 Yemen, Rep. 1997 58.0 15826497 2117. 2.12 158.
## 5 Yemen, Rep. 2002 60.3 18701257 2235. 2.23 187.
## 6 Yemen, Rep. 2007 62.7 22211743 2281. 2.28 222.
The first row of the data column is just a data frame containing the data for all countries in Asia (its head and tail are shown above). Here we have year, lifeExp, pop, and gdpPercap. Now we need to make a function that we can apply across these tibbles, which all take the same variable names (for sake of ease).
cont_mod <- function(df){
mod1 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2), data = df)
return(mod1)
}
Now that we have our model function, we can map the model onto our nested data frame and make a new variable called model (sketched below).
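The mapping chunk isn’t shown; given that the next code block uses nest_df_mod, it presumably looked like:

nest_df_mod <- nest_df %>%
mutate(model = map(data, cont_mod))  # fit cont_mod to each nested tibble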
We’ve mapped our model. Let’s see what one of the models looks like
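Presumably the summary below comes from the first (Asia) model:

summary(nest_df_mod$model[[1]])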
##
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9730 -4.3525 -0.2491 5.2605 16.3478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.396e+02 4.651e+01 -13.754 <2e-16 ***
## year 3.512e-01 2.355e-02 14.912 <2e-16 ***
## gdp1000 7.684e-01 6.395e-02 12.015 <2e-16 ***
## I(gdp1000^2) -6.468e-03 7.238e-04 -8.935 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.504 on 392 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6
## F-statistic: 198.5 on 3 and 392 DF, p-value: < 2.2e-16
Here we have a model predicting life expectancy for all of the Asian countries. We can continue using map to get a tidy model output, augment our output, or look at fit statistics with glance.
nest_df_fortify <- nest_df_mod %>%
mutate(tdy = map(model, tidy, conf.int = T),
aug = map(model, augment),
glnc = map(model, glance))
head(nest_df_fortify)
## # A tibble: 5 x 6
## # Groups: continent [5]
## continent data model tdy aug glnc
## <fct> <list> <list> <list> <list> <list>
## 1 Asia <tibble [396 x~ <lm> <tibble [4 x ~ <tibble [396 x~ <tibble [1 x ~
## 2 Europe <tibble [360 x~ <lm> <tibble [4 x ~ <tibble [360 x~ <tibble [1 x ~
## 3 Africa <tibble [624 x~ <lm> <tibble [4 x ~ <tibble [624 x~ <tibble [1 x ~
## 4 Americas <tibble [300 x~ <lm> <tibble [4 x ~ <tibble [300 x~ <tibble [1 x ~
## 5 Oceania <tibble [24 x ~ <lm> <tibble [4 x ~ <tibble [24 x ~ <tibble [1 x ~
You might be thinking “This is great! But how the hell do we work with this now?” That tripped me up for a while too, but the easiest thing for us to do is something called unnest. It takes a nested column and expands it back to its original tibble size. I’ll demonstrate (sketched below).
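The unnesting chunk isn’t shown; given that the plotting code below uses df_tidy, it presumably looked like:

df_tidy <- nest_df_fortify %>%
dplyr::select(continent, tdy) %>%
unnest(tdy)   # expand each tidy() result back into rows
head(df_tidy)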
## # A tibble: 6 x 8
## # Groups: continent [2]
## continent term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asia (Interce~ -6.40e+2 46.5 -13.8 2.13e-35 -7.31e+2 -5.48e+2
## 2 Asia year 3.51e-1 0.0236 14.9 3.75e-40 3.05e-1 3.98e-1
## 3 Asia gdp1000 7.68e-1 0.0640 12.0 1.58e-28 6.43e-1 8.94e-1
## 4 Asia I(gdp100~ -6.47e-3 0.000724 -8.94 1.61e-17 -7.89e-3 -5.04e-3
## 5 Europe (Interce~ -1.56e+2 20.5 -7.60 2.57e-13 -1.96e+2 -1.15e+2
## 6 Europe year 1.11e-1 0.0104 10.6 3.70e-23 9.05e-2 1.31e-1
df_tidy %>%
filter(term != '(Intercept)') %>%
ggplot(aes(x = estimate, y = term, color = continent)) +
geom_point() +
geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.3) +
facet_wrap(~continent, ncol = 1)
Resource: Tutorial on making error bars from sthda.
Voila! Now we have a data frame of continents with model parameters, including standard errors, p-values, confidence intervals, etc… This may not seem that useful for only 5 categories, but imagine if you’re making models for 100 categories! purrr also keeps this workflow concise: mapping one model function over 100 groups is far less error-prone than writing 100 separate model calls. For demonstration’s sake, I’ll just show the output of the glance and augment unnests.
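The glance unnest below was presumably produced the same way as the tidy one:

nest_df_fortify %>%
dplyr::select(continent, glnc) %>%
unnest(glnc)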
## # A tibble: 5 x 13
## # Groups: continent [5]
## continent r.squared adj.r.squared sigma statistic p.value df logLik
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asia 0.603 0.600 7.50 199. 2.76e- 78 3 -1358.
## 2 Europe 0.754 0.752 2.71 364. 4.87e-108 3 -867.
## 3 Africa 0.489 0.487 6.56 198. 5.53e- 90 3 -2057.
## 4 Americas 0.694 0.691 5.20 224. 9.57e- 76 3 -918.
## 5 Oceania 0.977 0.974 0.613 288. 1.30e- 16 3 -20.1
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
Now we can easily compare fit statistics across continents. We can see that our \(R^2\) value for Africa is much lower than the other continents, suggesting that our model did not fit that continent well. Perhaps this requires some tuning in the future.
df_aug <- nest_df_fortify %>%
select(continent, aug) %>%
unnest(aug)
# illustration of why plotting a single linear trend over a multivariable model
# doesn't work
df_aug %>%
ggplot(aes(x = year)) +
geom_point(aes(y = lifeExp), alpha = 0.3) +
geom_point(aes(y = .fitted), color = 'red', alpha = 0.3) +
facet_wrap(~continent) +
labs(title = "Fitted values of LifeExpectancy in gapminder")R, being a statistical language, has many other types of models that you can implement, most of which can be accessed using the glm function. Let’s load in some data that might allow us to use some different distributions
This is just data cleaning and retrieval, so we aren’t going to cover this in depth
# rm(list = ls())
cases_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
dplyr::select(-Lat, -Long) %>%
pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
names_to = "date",
values_to = "cases") %>%
mutate(date = as.Date(date, format = '%m/%d/%y'))
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
deaths_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") %>%
dplyr::select(-Lat, -Long) %>%
pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
names_to = "date",
values_to = "deaths") %>%
mutate(date = as.Date(date, format = '%m/%d/%y'))
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
COVID_master <- cases_global %>%
left_join(deaths_global, by = c('Province/State', 'Country/Region', 'date')) %>%
rename(country = 'Country/Region',
state = 'Province/State') %>%
group_by(country, date) %>%
summarise(cases = sum(cases, na.rm = T),
deaths = sum(deaths, na.rm = T)) %>%
mutate(daily_cases = ave(cases, country, FUN = function(x) c(0, diff(x))),
daily_deaths = ave(deaths, country, FUN = function(x) c(0, diff(x))),
daily_cases_s = ave(daily_cases, country, FUN = function(x) kza::kz(x, m = 14, k = 2)),
daily_deaths_s = ave(daily_deaths, country, FUN = function(x) kza::kz(x, m = 14, k = 2)))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
COVID_master %>%
ggplot(aes(x = date, y = daily_cases_s, group = country, color = country)) +
geom_line(alpha = 0.4) +
theme(legend.position = 'none')
COVID_master %>%
ggplot(aes(x = date, y = daily_deaths_s, group = country, color = country)) +
geom_line(alpha = 0.4) +
theme(legend.position = 'none')
Let’s look at the Netherlands!
COVID_Ned <- COVID_master %>%
filter(country == "Netherlands")
COVID_Ned %>%
ggplot(aes(x = date)) +
geom_line(aes(y = daily_cases), color = 'black') +
geom_line(aes(y = daily_deaths), color = 'black') +
geom_line(aes(y = daily_cases_s), color = 'blue') +
geom_line(aes(y = daily_deaths_s), color = 'firebrick3')
COVID_Ned %>%
pivot_longer(cols = c('daily_cases', 'daily_deaths'),
names_to = 'outcomes',
values_to = 'values') %>%
ggplot() +
geom_histogram(aes(x = values)) +
facet_wrap(~outcomes)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
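The four values below aren’t paired with code; presumably they are the mean and variance of daily cases, followed by the mean and variance of daily deaths:

mean(COVID_Ned$daily_cases)   # mean daily cases
var(COVID_Ned$daily_cases)    # variance of daily cases
mean(COVID_Ned$daily_deaths)  # mean daily deaths
var(COVID_Ned$daily_deaths)   # variance of daily deaths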
## [1] 2882.309
## [1] 10669766
## [1] 38.58314
## [1] 1938.492
Here we see that the mean value of cases and deaths is much less than the variance. In addition, both series are highly skewed. We could either use a transformation to normalize these data, or use a GLM family that accommodates these features, such as the Poisson or negative binomial distribution. We can do this using the glm() function. Let’s model how COVID case counts change by weekday.
COVID_Ned_wk <- COVID_Ned %>%
mutate(DoW = weekdays(date))
pmod <- glm(daily_cases ~ date + DoW + daily_cases_s + I(daily_cases_s**2), family = 'poisson', data = COVID_Ned_wk)
summary(pmod)
##
## Call:
## glm(formula = daily_cases ~ date + DoW + daily_cases_s + I(daily_cases_s^2),
## family = "poisson", data = COVID_Ned_wk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -25.795 -13.046 -3.805 7.590 31.223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.647e+01 3.107e-01 -53.017 < 2e-16 ***
## date 1.209e-03 1.688e-05 71.643 < 2e-16 ***
## DoWMonday -1.897e-01 3.377e-03 -56.184 < 2e-16 ***
## DoWSaturday -4.044e-02 3.257e-03 -12.417 < 2e-16 ***
## DoWSunday -7.536e-02 3.281e-03 -22.970 < 2e-16 ***
## DoWThursday -1.951e-02 3.249e-03 -6.004 1.92e-09 ***
## DoWTuesday -2.439e-01 3.422e-03 -71.289 < 2e-16 ***
## DoWWednesday -1.041e-01 3.326e-03 -31.297 < 2e-16 ***
## daily_cases_s 6.771e-04 1.723e-06 392.992 < 2e-16 ***
## I(daily_cases_s^2) -3.527e-08 1.414e-10 -249.405 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1606716 on 426 degrees of freedom
## Residual deviance: 82417 on 417 degrees of freedom
## AIC: 85898
##
## Number of Fisher Scoring iterations: 5
We have model output, but Poisson regression coefficients are difficult to interpret without exponentiating them. Let’s bring tidy back into the mix (sketched below).
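The call isn’t shown, but given the exponentiated estimates and confidence-interval columns, it was presumably:

tidy(pmod, conf.int = TRUE, exponentiate = TRUE)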
## # A tibble: 10 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.02e-8 3.11e- 1 -53.0 0. 3.82e-8 1.29e-7
## 2 date 1.00e+0 1.69e- 5 71.6 0. 1.00e+0 1.00e+0
## 3 DoWMonday 8.27e-1 3.38e- 3 -56.2 0. 8.22e-1 8.33e-1
## 4 DoWSaturday 9.60e-1 3.26e- 3 -12.4 2.11e- 35 9.54e-1 9.67e-1
## 5 DoWSunday 9.27e-1 3.28e- 3 -23.0 9.38e-117 9.21e-1 9.33e-1
## 6 DoWThursday 9.81e-1 3.25e- 3 -6.00 1.92e- 9 9.74e-1 9.87e-1
## 7 DoWTuesday 7.84e-1 3.42e- 3 -71.3 0. 7.78e-1 7.89e-1
## 8 DoWWednesday 9.01e-1 3.33e- 3 -31.3 5.14e-215 8.95e-1 9.07e-1
## 9 daily_cases_s 1.00e+0 1.72e- 6 393. 0. 1.00e+0 1.00e+0
## 10 I(daily_case~ 1.00e+0 1.41e-10 -249. 0. 1.00e+0 1.00e+0
augment(pmod) %>%
ggplot(aes(x = date)) +
geom_point(aes(y = daily_cases)) +
geom_line(aes(y = exp(.fitted)), color = 'red')
Negative binomial regression models are generalizations of Poisson models, with an added parameter for overdispersion. In a Poisson distribution, the variance is considered to be equal to the mean, so this is a model assumption of a Poisson model of counts. Sometimes, however, our data do not satisfy this assumption of variance being equal to the mean. In that case, we must estimate the overdispersion with a dispersion parameter, which is accounted for by the glm.nb function from the MASS package. First, let’s calculate the mean and variance of the distribution of cases in our dataset.
m <- mean(COVID_Ned_wk$daily_cases)
v <- var(COVID_Ned_wk$daily_cases)
print(paste0("The mean value of daily_cases is ", m))## [1] "The mean value of daily_cases is 2882.30913348946"
## [1] "The variance of daily_cases is 10669766.336137"
## [1] "The ratio of variance:mean is 3701.81192994372"
Here we can see that the variance is much larger than the mean of daily cases. We can get more specific and look at the ratio of variance to mean on a monthly basis, which might make more sense than looking at the entire time series together.
COVID_Ned_wk %>%
mutate(month = months(date)) %>%
group_by(month) %>%
summarise(mean = mean(daily_cases, na.rm = T),
var = var(daily_cases, na.rm = T),
VMratio = var/mean)
## # A tibble: 12 x 4
## month mean var VMratio
## <chr> <dbl> <dbl> <dbl>
## 1 April 895. 83699. 93.5
## 2 August 601. 29506. 49.1
## 3 December 8918. 5792755. 650.
## 4 February 1956. 4376718. 2238.
## 5 January 4480. 8321530. 1857.
## 6 July 132. 7526. 56.8
## 7 June 128. 2812. 22.0
## 8 March 2590. 7170567. 2769.
## 9 May 230. 8999. 39.1
## 10 November 5810. 1460808. 251.
## 11 October 7473. 5033898. 674.
## 12 September 1759. 716246. 407.
Above we can see that throughout all months, the ratio of variance to mean is extremely high, showing that overdispersion is characteristic of the entire time series, not just one or two segments. Therefore it would make sense to apply a negative binomial generalized linear model (which is extremely easy to do)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
NBmod <- glm.nb(daily_cases ~ date + DoW + daily_cases_s + I(daily_cases_s**2), data = COVID_Ned_wk)
tdyNBmod <- tidy(NBmod, conf.int = T, exponentiate = T)
augNBmod <- augment(NBmod)
gln <- glance(NBmod)
Below, we can plot the fitted values of our model over the data. As you can see, the fit looks almost identical to the Poisson model, but the standard errors and confidence intervals should be different.
plot(x = augNBmod$date, y = augNBmod$daily_cases)
lines(x = augNBmod$date, y = exp(augNBmod$.fitted), col = 'red', lwd = 2)
Our model fit isn’t half bad… Even so, there are portions of the model that don’t seem to fit quite right, and it would be interesting to see the biases in our model fit and whether there are areas that can be improved. Unfortunately, support for glm.nb models isn’t as fully developed as for linear models, so we have to do some of the grunt work ourselves. The .resid component of the augment output for glm.nb models is somewhat unusable, so we can make our own residuals by taking the difference between our original data and the exponentiated fitted values.
augNBmod_fit <- augNBmod %>%
mutate(.resid = daily_cases - exp(.fitted))
augNBmod_fit %>%
ggplot(aes(x = date, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
As we can see, the residuals are still heavily autocorrelated. R has some cool time-series functions that let you see the extent of this correlation. One is called acf(), which simply calculates an autocorrelation function. Here we can calculate both the autocorrelation of our residuals and the autocorrelation of our original series (sketched below).
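The acf() calls aren’t shown; presumably something like:

acf(augNBmod_fit$.resid, main = "ACF of NB model residuals")  # autocorrelation of residuals
acf(COVID_Ned_wk$daily_cases, main = "ACF of daily cases")    # autocorrelation of the raw series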
We see above that there is obvious autocorrelation in our residuals, in addition to our original series, as we would expect with a time series. This somewhat throws a wrench in our analysis…
Logistic regressions are implemented very simply with the glm() function by specifying the argument family = 'binomial'. I’ll adapt the gapminder dataset below to demonstrate.
df <- gapminder %>%
mutate(lifeExpOv75 = ifelse(lifeExp > 75, 1, 0))
# now we can use lifeExpOv75 as our logistic regression outcome variable
logmod <- glm(lifeExpOv75 ~ year + I(gdpPercap/1000) + continent, family = 'binomial', data = df)
tidy(logmod, exponentiate = T, conf.int = T) %>%
mutate(across(where(is.numeric), round, 2))
## # A tibble: 7 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0 27.5 -9.99 0 0 0
## 2 year 1.14 0.01 9.81 0 1.12 1.18
## 3 I(gdpPercap/1000) 1.15 0.01 9.54 0 1.12 1.18
## 4 continentAmericas 17.5 0.76 3.76 0 4.85 112.
## 5 continentAsia 6.94 0.78 2.47 0.01 1.81 45.8
## 6 continentEurope 50.0 0.77 5.11 0 13.8 323.
## 7 continentOceania 79.1 1.02 4.29 0 12.0 732
Here we see in our tidy output that every additional year is significantly associated with 1.14 times the odds of having a life expectancy over 75, and every $1000 increase in gdpPercap is associated with 1.15 times the odds. In addition, Africa is not shown in the model output because it serves as the reference level in our model. If we wanted to change that, we would have to relevel the continent factor (sketched below). Though our confidence intervals are extremely wide, they are still statistically significant; with a different reference level, we might get more interpretable confidence intervals.
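A minimal sketch of releveling, assuming we want Europe as the reference (the names df_relevel and logmod2 are made up for illustration):

df_relevel <- df %>%
mutate(continent = relevel(continent, ref = "Europe"))  # Europe becomes the reference level
logmod2 <- glm(lifeExpOv75 ~ year + I(gdpPercap/1000) + continent,
family = 'binomial', data = df_relevel)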
One final thing: the intercept is not interpretable in this model, as it represents the log-odds of a life expectancy over 75 at year 0, where gdpPercap is 0, and the continent is Africa. If we wanted an interpretable intercept, we would have to mean-standardize all of the variables in our model.
Note: This tutorial is very similar to the “Many Models” chapter (Chapter 25) of R for Data Science by Hadley Wickham and Garrett Grolemund. I do not take credit for the ideas presented here, as this tutorial is an adaptation of theirs using similar data. For those looking for a more formal introduction to building many models, and the mechanics behind it, I’ll leave a link to their book chapter here.