## # A tibble: 6 x 8
## country continent year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 0.779 84.3
## 2 Afghanistan Asia 1957 30.3 9240934 821. 0.821 92.4
## 3 Afghanistan Asia 1962 32.0 10267083 853. 0.853 103.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 0.836 115.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 0.740 131.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 0.786 149.
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap gdp1000 pop100k
## Min. :6.001e+04 Min. : 241.2 Min. : 0.2412 Min. : 0.60
## 1st Qu.:2.794e+06 1st Qu.: 1202.1 1st Qu.: 1.2021 1st Qu.: 27.94
## Median :7.024e+06 Median : 3531.8 Median : 3.5318 Median : 70.24
## Mean :2.960e+07 Mean : 7215.3 Mean : 7.2153 Mean : 296.01
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5 3rd Qu.: 9.3255 3rd Qu.: 195.85
## Max. :1.319e+09 Max. :113523.1 Max. :113.5231 Max. :13186.83
##
## [1] Afghanistan Albania Algeria
## [4] Angola Argentina Australia
## [7] Austria Bahrain Bangladesh
## [10] Belgium Benin Bolivia
## [13] Bosnia and Herzegovina Botswana Brazil
## [16] Bulgaria Burkina Faso Burundi
## [19] Cambodia Cameroon Canada
## [22] Central African Republic Chad Chile
## [25] China Colombia Comoros
## [28] Congo, Dem. Rep. Congo, Rep. Costa Rica
## [31] Cote d'Ivoire Croatia Cuba
## [34] Czech Republic Denmark Djibouti
## [37] Dominican Republic Ecuador Egypt
## [40] El Salvador Equatorial Guinea Eritrea
## [43] Ethiopia Finland France
## [46] Gabon Gambia Germany
## [49] Ghana Greece Guatemala
## [52] Guinea Guinea-Bissau Haiti
## [55] Honduras Hong Kong, China Hungary
## [58] Iceland India Indonesia
## [61] Iran Iraq Ireland
## [64] Israel Italy Jamaica
## [67] Japan Jordan Kenya
## [70] Korea, Dem. Rep. Korea, Rep. Kuwait
## [73] Lebanon Lesotho Liberia
## [76] Libya Madagascar Malawi
## [79] Malaysia Mali Mauritania
## [82] Mauritius Mexico Mongolia
## [85] Montenegro Morocco Mozambique
## [88] Myanmar Namibia Nepal
## [91] Netherlands New Zealand Nicaragua
## [94] Niger Nigeria Norway
## [97] Oman Pakistan Panama
## [100] Paraguay Peru Philippines
## [103] Poland Portugal Puerto Rico
## [106] Reunion Romania Rwanda
## [109] Sao Tome and Principe Saudi Arabia Senegal
## [112] Serbia Sierra Leone Singapore
## [115] Slovak Republic Slovenia Somalia
## [118] South Africa Spain Sri Lanka
## [121] Sudan Swaziland Sweden
## [124] Switzerland Syria Taiwan
## [127] Tanzania Thailand Togo
## [130] Trinidad and Tobago Tunisia Turkey
## [133] Uganda United Kingdom United States
## [136] Uruguay Venezuela Vietnam
## [139] West Bank and Gaza Yemen, Rep. Zambia
## [142] Zimbabwe
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
df %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = continent, group = country)) +
labs(title = "Gapminder Life Expectancy By Year")df %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = continent, group = country)) +
geom_smooth(se = T, method = 'loess') +
facet_wrap(~continent) +
labs(title = "Gapminder Life Expectancy By Year")## `geom_smooth()` using formula 'y ~ x'
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
## [7] "gdp1000" "pop100k"
Let’s create a basic model of life expectancy, predicted by GDP per capita, year, and population. But let’s start with just Asia
# keep only countries within the continent of Asia
mod1_df <- df %>%
filter(continent == "Asia")
mod1 <- lm(lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
# can summarise our model with the summary function
summary(mod1)##
## Call:
## lm(formula = lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1852 -5.2575 0.4857 5.0062 17.9753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.833e+02 4.829e+01 -16.221 < 2e-16 ***
## year 4.251e-01 2.442e-02 17.404 < 2e-16 ***
## pop100k 4.228e-06 2.039e-04 0.021 0.983
## gdp1000 2.510e-01 3.011e-02 8.336 1.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.232 on 392 degrees of freedom
## Multiple R-squared: 0.5222, Adjusted R-squared: 0.5186
## F-statistic: 142.8 on 3 and 392 DF, p-value: < 2.2e-16
We see that year and gdpPercap is significant. We should also check for normality of residuals, heteroscedasticity, etc… Which we can do by just calling the plot function on our model
qq looks somewhat normal, residuals vs fitted look all right, not a lot of leverage points… alright we’re not being too picky here.
We can also look at component plus residual plots of our data to check linearity of our predictors with respect to the outcome (this is only useful for a multivariate linear model)
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
Here we can see that `gdpPercap is not necessarily linear, and population might stray from linear as well. We might want to amend this in our model. However population by itself wasn’t significant, so we’ll remove that from our model
mod2 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2) + pop100k, data = mod1_df)
# the I() tells the model to treat the gdpPercap term 'as is,' and not to adjust it or ignore it
summary(mod2)##
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k,
## data = mod1_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.670 -4.343 -0.003 5.281 16.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.260e+02 4.722e+01 -13.257 <2e-16 ***
## year 3.442e-01 2.394e-02 14.380 <2e-16 ***
## gdp1000 7.906e-01 6.539e-02 12.092 <2e-16 ***
## I(gdp1000^2) -6.664e-03 7.332e-04 -9.089 <2e-16 ***
## pop100k 2.956e-04 1.882e-04 1.571 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.49 on 391 degrees of freedom
## Multiple R-squared: 0.6055, Adjusted R-squared: 0.6015
## F-statistic: 150.1 on 4 and 391 DF, p-value: < 2.2e-16
If we want to check to see if our model has improved, we can use the anova function to conduct a goodness of fit F-test between two models. The below anova is testing if mod2 fits better than mod1. If p < 0.05 the improvement in fit is statistically significant
# let's check our model's fit from the first model. The below code
# performs an F-test
anova(mod1, mod2)## Analysis of Variance Table
##
## Model 1: lifeExp ~ year + pop100k + gdp1000
## Model 2: lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 392 26566
## 2 391 21933 1 4633.6 82.605 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary function is great, but it’s hard to work with the output of it directly. To do this, a package was developed called broom which allows you to fortify your data with fitted values, create a data frame with your model output, and get goodness of fit statistics
broom has 3 very useful functions called tidy, augment, and glance. tidy returns the same data as your summary table (and more) in the form of a data frame. You can also ask for confidence intervals, p values, and exponentiation, among other things, depending on the model you’ve fit
tidy## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -626. 47.2 -13.3 2.20e-33 -719. -533.
## 2 year 0.344 0.0239 14.4 6.19e-38 0.297 0.391
## 3 gdp1000 0.791 0.0654 12.1 8.19e-29 0.662 0.919
## 4 I(gdp1000^2) -0.00666 0.000733 -9.09 5.09e-18 -0.00811 -0.00522
## 5 pop100k 0.000296 0.000188 1.57 1.17e- 1 -0.0000744 0.000666
Augment returns a data frame of fitted values for your data, which you can then use to fit your trend line over your data. However, because we have a multivariable model, a trend line doesn’t fit nicely to any one variable. I’ll make a univariate model so this will make more sense.
augmentmod3 <- lm(lifeExp ~ year, data = mod1_df)
augdata <- augment(mod3, interval = "prediction")
augdata %>%
ggplot(aes(x = year)) +
geom_point(aes(y = lifeExp)) +
geom_line(aes(y = .fitted), col = 'red') +
geom_line(aes(y = .upper), col = 'blue') +
geom_line(aes(y = .lower), col = 'blue') +
geom_smooth(aes(y = lifeExp), se = F, color = 'green') +
labs(title = "LM of LifeExpectancy vs. Year")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Side note: I will also mention that augment has some drawbacks. If you are using a poisson or negative binomial distribution to fit your data, augment returns fitted values in their non-exponentiated form. An alternative to augment is to get values from the model object itself using the $ operator. Remember, your model output is simply a list, so you can access elements from it like any other list.
## (Intercept) year pop100k gdp1000
## -7.833346e+02 4.250632e-01 4.228480e-06 2.510235e-01
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26.1852 -5.2575 0.4857 0.0000 5.0062 17.9753
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 46.47 52.95 59.83 60.06 66.91 82.34
Back to the above visual…As you can see, above we have another test of our fit. We have out fitted line in red, prediction intervals, which we got from the augmented data frame in blue, and a lowess smoother in green, showing slight deviations from linearity about our model. In all honesty you should probably plot a lowess smoother over your data before you make a linear model to check if the relationship is linear. A one shortcut for doing that is to plot a geom_smooth of your data using method = 'loess', and another geom_smooth using method = lm, which will plot a loess smoother and a fitted line, respectively. Another shortcut for doing that is the below code pairs.panels() from the psych package, which plots a scatterplot and correlation matrix of all of selected variables from your data frame
glanceFinally, we have glance, which gives us goodness of fit statistics for our models
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5222142 | 0.5185577 | 8.232329 | 142.8171 | 0 | 3 | -1394.685 | 2799.37 | 2819.277 | 26566.33 | 392 | 396 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6055482 | 0.6015129 | 7.489585 | 150.0623 | 0 | 4 | -1356.735 | 2725.47 | 2749.359 | 21932.71 | 391 | 396 |
Sometimes we want to have our model in a neat output table, and we don’t want to format the table every time we change our model. We can programmatically make model tables with the tab_model function in sjPlot
tab_modelsjPlot::tab_model(mod1, mod2, mod3, show.intercept = FALSE, show.ci = 0.95, show.se = F, show.p = T,
show.r2 = TRUE, show.fstat = T, show.aic = T, show.obs = T, rm.terms = NULL, order.terms = NULL,
title = "Comparison of Gapminder Models")| life Exp | life Exp | life Exp | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| year | 0.43 | 0.38 – 0.47 | <0.001 | 0.34 | 0.30 – 0.39 | <0.001 | 0.45 | 0.40 – 0.50 | <0.001 |
| pop100k | 0.00 | -0.00 – 0.00 | 0.983 | 0.00 | -0.00 – 0.00 | 0.117 | |||
| gdp1000 | 0.25 | 0.19 – 0.31 | <0.001 | 0.79 | 0.66 – 0.92 | <0.001 | |||
| gdp1000^2 | -0.01 | -0.01 – -0.01 | <0.001 | ||||||
| Observations | 396 | 396 | 396 | ||||||
| R2 / R2 adjusted | 0.522 / 0.519 | 0.606 / 0.602 | 0.436 / 0.434 | ||||||
| AIC | 2799.370 | 2725.470 | 2861.319 | ||||||
Now we can compare models programmatically without having to export the models into an Excel table every time we run our code! This function works with most of R’s built-in modelling functions. There’s another variant called plot_model which will plot estimates of our models together
plot_modelssjPlot::plot_models(mod1, mod2, mod3, show.values = T, colors = "Dark2", ci.lvl = 0.99, axis.lim = c(-0.5, 0.5),
dot.size = 2, show.p = TRUE) obviously, this plot is a little underwhelming because we have a very simple model, and out predictors don’t show great amounts of spread from zero, but you could see how this type of plot could be useful for logistic, poisson, negative binomial, or other types of regressions with much wider confidence intervals.
So we’ve made a model for one continent…What if we wanted to make a model for all of the continents in our data set? Should we go through that same workflow for every one? There is a faster way, but it’s somewhat tricky, so I’ll present it here and leave it up to your discretion. Often times if you code by brute force method, it can take longer to do, and be many more lines, but the code may be more understandable in the end. I’ll show the programmatic way of making many models though.
R has a package called purrr, which allows the user to nest data within a colum to make a new data set. purrr also implements a programmatic concept called mapping, which is simply applying a function to every element of a list. The combination of nesting a data frame into a column, and mapping a function onto that column will allow us to efficiently make many models (though the model specifications have to be the same across groups). Here’s code on how to nest
## # A tibble: 5 x 2
## # Groups: continent [5]
## continent data
## <fct> <list>
## 1 Asia <tibble [396 x 7]>
## 2 Europe <tibble [360 x 7]>
## 3 Africa <tibble [624 x 7]>
## 4 Americas <tibble [300 x 7]>
## 5 Oceania <tibble [24 x 7]>
It’s as easy as that! First you group by the category that you want to model across. In this case we want to apply similar models across continents. Now we have to make a function for a model which we can apply across the nested data frames. First I’ll show you what’s contained in the data column
## # A tibble: 6 x 7
## country year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 8425333 779. 0.779 84.3
## 2 Afghanistan 1957 30.3 9240934 821. 0.821 92.4
## 3 Afghanistan 1962 32.0 10267083 853. 0.853 103.
## 4 Afghanistan 1967 34.0 11537966 836. 0.836 115.
## 5 Afghanistan 1972 36.1 13079460 740. 0.740 131.
## 6 Afghanistan 1977 38.4 14880372 786. 0.786 149.
## # A tibble: 6 x 7
## country year lifeExp pop gdpPercap gdp1000 pop100k
## <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Yemen, Rep. 1982 49.1 9657618 1978. 1.98 96.6
## 2 Yemen, Rep. 1987 52.9 11219340 1972. 1.97 112.
## 3 Yemen, Rep. 1992 55.6 13367997 1879. 1.88 134.
## 4 Yemen, Rep. 1997 58.0 15826497 2117. 2.12 158.
## 5 Yemen, Rep. 2002 60.3 18701257 2235. 2.23 187.
## 6 Yemen, Rep. 2007 62.7 22211743 2281. 2.28 222.
The first row of the data column is just a data frame containg data for all countries in Africa. Here we have year, lifeExp, pop, and gdpPercap. Now we need to make a function which we can apply across these tibbles, which take the same variable names (for sake of ease)
cont_mod <- function(df){
mod1 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2), data = df)
return(mod1)
}Now that we have our model function, we can map the model onto our nested data frame and make a new variable called model
We’ve mapped our model. Let’s see what one of the models looks like
##
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9730 -4.3525 -0.2491 5.2605 16.3478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.396e+02 4.651e+01 -13.754 <2e-16 ***
## year 3.512e-01 2.355e-02 14.912 <2e-16 ***
## gdp1000 7.684e-01 6.395e-02 12.015 <2e-16 ***
## I(gdp1000^2) -6.468e-03 7.238e-04 -8.935 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.504 on 392 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6
## F-statistic: 198.5 on 3 and 392 DF, p-value: < 2.2e-16
Here we have a model of all of the African countries predicting life expectancy. We can continue using map to get a tidy model output, augment our output, or look at fit statistics with glance
nest_df_fortify <- nest_df_mod %>%
mutate(tdy = map(model, tidy, conf.int = T),
aug = map(model, augment),
glnc = map(model, glance))
head(nest_df_fortify)## # A tibble: 5 x 6
## # Groups: continent [5]
## continent data model tdy aug glnc
## <fct> <list> <list> <list> <list> <list>
## 1 Asia <tibble [396 x~ <lm> <tibble [4 x ~ <tibble [396 x~ <tibble [1 x ~
## 2 Europe <tibble [360 x~ <lm> <tibble [4 x ~ <tibble [360 x~ <tibble [1 x ~
## 3 Africa <tibble [624 x~ <lm> <tibble [4 x ~ <tibble [624 x~ <tibble [1 x ~
## 4 Americas <tibble [300 x~ <lm> <tibble [4 x ~ <tibble [300 x~ <tibble [1 x ~
## 5 Oceania <tibble [24 x ~ <lm> <tibble [4 x ~ <tibble [24 x ~ <tibble [1 x ~
unnestYou might be thinking “This is great! But how the hell do we work with this now?” That tripped me up for a while too, but the easiest thing for us to do is something called unnest. It takes a nested column, and expands it back to original tibble size. I’ll demonstrate.
## # A tibble: 6 x 8
## # Groups: continent [2]
## continent term estimate std.error statistic p.value conf.low conf.high
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asia (Interce~ -6.40e+2 46.5 -13.8 2.13e-35 -7.31e+2 -5.48e+2
## 2 Asia year 3.51e-1 0.0236 14.9 3.75e-40 3.05e-1 3.98e-1
## 3 Asia gdp1000 7.68e-1 0.0640 12.0 1.58e-28 6.43e-1 8.94e-1
## 4 Asia I(gdp100~ -6.47e-3 0.000724 -8.94 1.61e-17 -7.89e-3 -5.04e-3
## 5 Europe (Interce~ -1.56e+2 20.5 -7.60 2.57e-13 -1.96e+2 -1.15e+2
## 6 Europe year 1.11e-1 0.0104 10.6 3.70e-23 9.05e-2 1.31e-1
df_tidy %>%
filter(term != '(Intercept)') %>%
ggplot(aes(x = estimate, y = term, color = continent)) +
geom_point() +
geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.3) +
facet_wrap(~continent, ncol = 1)Resource: Tutorial on making error bars from sthda.
Voila! Now we have a data frame of continents with model parameters including standard errors, p values, confidence intervals, etc… This may not be useful for only 5 categories, but imagine if you’re making models for 100 categories! Also, purrr implements this type of stuff extremely efficiently, so running models with map is much faster than running 100 different models individually. For demonstration’s sake, I’ll just show the output of the augment and glance unnests
## # A tibble: 5 x 13
## # Groups: continent [5]
## continent r.squared adj.r.squared sigma statistic p.value df logLik
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Asia 0.603 0.600 7.50 199. 2.76e- 78 3 -1358.
## 2 Europe 0.754 0.752 2.71 364. 4.87e-108 3 -867.
## 3 Africa 0.489 0.487 6.56 198. 5.53e- 90 3 -2057.
## 4 Americas 0.694 0.691 5.20 224. 9.57e- 76 3 -918.
## 5 Oceania 0.977 0.974 0.613 288. 1.30e- 16 3 -20.1
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
Now we can easily compare fit statistics across countries. We can see that our \(R^2\) value for Africa is much lower than the other continents, suggesting that our model did not fit that country well. Perhaps this requires some tuning in the future.
augmentdf_aug <- nest_df_fortify %>%
select(continent, aug) %>%
unnest(aug)
# illustration of why plotting a linear regression over a multivariate model
# doesn't work
df_aug %>%
ggplot(aes(x = year)) +
geom_point(aes(y = lifeExp), alpha = 0.3) +
geom_point(aes(y = .fitted), color = 'red', alpha = 0.3) +
facet_wrap(~continent) +
labs(title = "Fitted values of LifeExpectancy in gapminder")R, being a statistical language, has many other types of models that you can implement, most of which can be accessed using the glm function. Let’s load in some data that might allow us to use some different distributions
This is just data cleaning and retrieval, so we aren’t going to cover this in depth
# rm(list = ls())
cases_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
dplyr::select(-Lat, -Long) %>%
pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
names_to = "date",
values_to = "cases") %>%
mutate(date = as.Date(date, format = '%m/%d/%y'))##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
deaths_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") %>%
dplyr::select(-Lat, -Long) %>%
pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
names_to = "date",
values_to = "deaths") %>%
mutate(date = as.Date(date, format = '%m/%d/%y'))##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
COVID_master <- cases_global %>%
left_join(deaths_global, by = c('Province/State', 'Country/Region', 'date')) %>%
rename(country = 'Country/Region',
state = 'Province/State') %>%
group_by(country, date) %>%
summarise(cases = sum(cases, na.rm = T),
deaths = sum(deaths, na.rm = T)) %>%
mutate(daily_cases = ave(cases, country, FUN = function(x) c(0, diff(x))),
daily_deaths = ave(deaths, country, FUN = function(x) c(0, diff(x))),
daily_cases_s = ave(daily_cases, country, FUN = function(x) kza::kz(x, m = 14, k = 2)),
daily_deaths_s = ave(daily_deaths, country, FUN = function(x) kza::kz(x, m = 14, k = 2)))## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
COVID_master %>%
ggplot(aes(x = date, y = daily_cases_s, group = country, color = country)) +
geom_line(alpha = 0.4) +
theme(legend.position = 'none')COVID_master %>%
ggplot(aes(x = date, y = daily_deaths_s, group = country, color = country)) +
geom_line(alpha = 0.4) +
theme(legend.position = 'none')Let’s look at the Netherlands!
COVID_Ned <- COVID_master %>%
filter(country == "Netherlands")
COVID_Ned %>%
ggplot(aes(x = date)) +
geom_line(aes(y = daily_cases), color = 'black') +
geom_line(aes(y = daily_deaths), color = 'black') +
geom_line(aes(y = daily_cases_s), color = 'blue') +
geom_line(aes(y = daily_deaths_s), color = 'firebrick3')COVID_Ned %>%
pivot_longer(cols = c('daily_cases', 'daily_deaths'),
names_to = 'outcomes',
values_to = 'values') %>%
ggplot() +
geom_histogram(aes(x = values)) +
facet_wrap(~outcomes)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 2825.907
## [1] 10587626
## [1] 38.71259
## [1] 1961.939
Here we see that the mean value of cases and deaths is much less than the variance. In addition, both data are highly skewed. We could either use a transformation to normalize these data, or use a glm distribution that accomodates these factors, such as poisson or negative binomial distributions. We can do this using the glm() function. Let’s model how COVID rates change by the weekday
COVID_Ned_wk <- COVID_Ned %>%
mutate(DoW = weekdays(date))
pmod <- glm(daily_cases ~ date +DoW + daily_cases_s, family = 'poisson', data = COVID_Ned_wk)
summary(pmod)##
## Call:
## glm(formula = daily_cases ~ date + DoW + daily_cases_s, family = "poisson",
## data = COVID_Ned_wk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -40.279 -19.440 -5.108 9.926 45.089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.264e+01 2.470e-01 -294.08 < 2e-16 ***
## date 4.284e-03 1.333e-05 321.30 < 2e-16 ***
## DoWMonday -1.859e-01 3.445e-03 -53.95 < 2e-16 ***
## DoWSaturday -4.227e-02 3.325e-03 -12.71 < 2e-16 ***
## DoWSunday -7.365e-02 3.348e-03 -22.00 < 2e-16 ***
## DoWThursday -1.456e-02 3.310e-03 -4.40 1.08e-05 ***
## DoWTuesday -2.372e-01 3.490e-03 -67.99 < 2e-16 ***
## DoWWednesday -1.027e-01 3.359e-03 -30.57 < 2e-16 ***
## daily_cases_s 2.536e-04 3.536e-07 717.17 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1582339 on 420 degrees of freedom
## Residual deviance: 144895 on 412 degrees of freedom
## AIC: 148309
##
## Number of Fisher Scoring iterations: 5
We have model output, but it’s difficult to interpret poisson regression coefficients without exponentiating them. Let’s bring tidy back into the mix
## # A tibble: 9 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.85e-32 0.247 -294. 0. 1.76e-32 4.62e-32
## 2 date 1.00e+ 0 0.0000133 321. 0. 1.00e+ 0 1.00e+ 0
## 3 DoWMonday 8.30e- 1 0.00345 -54.0 0. 8.25e- 1 8.36e- 1
## 4 DoWSaturday 9.59e- 1 0.00333 -12.7 5.14e- 37 9.52e- 1 9.65e- 1
## 5 DoWSunday 9.29e- 1 0.00335 -22.0 3.08e-107 9.23e- 1 9.35e- 1
## 6 DoWThursday 9.86e- 1 0.00331 -4.40 1.08e- 5 9.79e- 1 9.92e- 1
## 7 DoWTuesday 7.89e- 1 0.00349 -68.0 0. 7.83e- 1 7.94e- 1
## 8 DoWWednesday 9.02e- 1 0.00336 -30.6 3.39e-205 8.97e- 1 9.08e- 1
## 9 daily_cases_s 1.00e+ 0 0.000000354 717. 0. 1.00e+ 0 1.00e+ 0
augment(pmod) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = exp(.fitted)), color = 'red') +
geom_point(aes(y = daily_cases))Note: This tutorial is very similar to the tutorial in R for data science by Hadley Wickham and Garret Grolemund in their “Many Models” chapter (Chapter 25). I do not take credit for the ideas presented here, as this tutorial is an adaptation of their tutorial, using similar data. For those looking for a more formal introduction to building many models, and the mechanics behind it, I’ll leave a link to their book chapter here.