Getting our data

df <- gapminder %>%
  mutate(gdp1000 = gdpPercap/1000,
         pop100k = pop/100000)
head(df)
## # A tibble: 6 x 8
##   country     continent  year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.   0.779    84.3
## 2 Afghanistan Asia       1957    30.3  9240934      821.   0.821    92.4
## 3 Afghanistan Asia       1962    32.0 10267083      853.   0.853   103. 
## 4 Afghanistan Asia       1967    34.0 11537966      836.   0.836   115. 
## 5 Afghanistan Asia       1972    36.1 13079460      740.   0.740   131. 
## 6 Afghanistan Asia       1977    38.4 14880372      786.   0.786   149.

Taking a look at our data characteristics

summary(df)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap           gdp1000            pop100k        
##  Min.   :6.001e+04   Min.   :   241.2   Min.   :  0.2412   Min.   :    0.60  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1   1st Qu.:  1.2021   1st Qu.:   27.94  
##  Median :7.024e+06   Median :  3531.8   Median :  3.5318   Median :   70.24  
##  Mean   :2.960e+07   Mean   :  7215.3   Mean   :  7.2153   Mean   :  296.01  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5   3rd Qu.:  9.3255   3rd Qu.:  195.85  
##  Max.   :1.319e+09   Max.   :113523.1   Max.   :113.5231   Max.   :13186.83  
## 
unique(df$country) #lot of countries, let's visualize
##   [1] Afghanistan              Albania                  Algeria                 
##   [4] Angola                   Argentina                Australia               
##   [7] Austria                  Bahrain                  Bangladesh              
##  [10] Belgium                  Benin                    Bolivia                 
##  [13] Bosnia and Herzegovina   Botswana                 Brazil                  
##  [16] Bulgaria                 Burkina Faso             Burundi                 
##  [19] Cambodia                 Cameroon                 Canada                  
##  [22] Central African Republic Chad                     Chile                   
##  [25] China                    Colombia                 Comoros                 
##  [28] Congo, Dem. Rep.         Congo, Rep.              Costa Rica              
##  [31] Cote d'Ivoire            Croatia                  Cuba                    
##  [34] Czech Republic           Denmark                  Djibouti                
##  [37] Dominican Republic       Ecuador                  Egypt                   
##  [40] El Salvador              Equatorial Guinea        Eritrea                 
##  [43] Ethiopia                 Finland                  France                  
##  [46] Gabon                    Gambia                   Germany                 
##  [49] Ghana                    Greece                   Guatemala               
##  [52] Guinea                   Guinea-Bissau            Haiti                   
##  [55] Honduras                 Hong Kong, China         Hungary                 
##  [58] Iceland                  India                    Indonesia               
##  [61] Iran                     Iraq                     Ireland                 
##  [64] Israel                   Italy                    Jamaica                 
##  [67] Japan                    Jordan                   Kenya                   
##  [70] Korea, Dem. Rep.         Korea, Rep.              Kuwait                  
##  [73] Lebanon                  Lesotho                  Liberia                 
##  [76] Libya                    Madagascar               Malawi                  
##  [79] Malaysia                 Mali                     Mauritania              
##  [82] Mauritius                Mexico                   Mongolia                
##  [85] Montenegro               Morocco                  Mozambique              
##  [88] Myanmar                  Namibia                  Nepal                   
##  [91] Netherlands              New Zealand              Nicaragua               
##  [94] Niger                    Nigeria                  Norway                  
##  [97] Oman                     Pakistan                 Panama                  
## [100] Paraguay                 Peru                     Philippines             
## [103] Poland                   Portugal                 Puerto Rico             
## [106] Reunion                  Romania                  Rwanda                  
## [109] Sao Tome and Principe    Saudi Arabia             Senegal                 
## [112] Serbia                   Sierra Leone             Singapore               
## [115] Slovak Republic          Slovenia                 Somalia                 
## [118] South Africa             Spain                    Sri Lanka               
## [121] Sudan                    Swaziland                Sweden                  
## [124] Switzerland              Syria                    Taiwan                  
## [127] Tanzania                 Thailand                 Togo                    
## [130] Trinidad and Tobago      Tunisia                  Turkey                  
## [133] Uganda                   United Kingdom           United States           
## [136] Uruguay                  Venezuela                Vietnam                 
## [139] West Bank and Gaza       Yemen, Rep.              Zambia                  
## [142] Zimbabwe                
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

visualizing our data

df %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = continent, group = country)) +
  labs(title = "Gapminder Life Expectancy By Year")

df %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = continent, group = country)) +
  geom_smooth(se = T, method = 'loess') +
  facet_wrap(~continent) +
  labs(title = "Gapminder Life Expectancy By Year")
## `geom_smooth()` using formula 'y ~ x'

Basic Modelling

names(df)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## [7] "gdp1000"   "pop100k"

Let’s create a basic model of life expectancy, predicted by GDP per capita, year, and population. But let’s start with just Asia

# keep only countries within the continent of Asia
mod1_df <- df %>%
  filter(continent == "Asia")

# formula outcome ~ pred1 + pred2 + pred3

mod1 <- lm(lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
# can summarise our model with the summary function
summary(mod1)
## 
## Call:
## lm(formula = lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1852  -5.2575   0.4857   5.0062  17.9753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.833e+02  4.829e+01 -16.221  < 2e-16 ***
## year         4.251e-01  2.442e-02  17.404  < 2e-16 ***
## pop100k      4.228e-06  2.039e-04   0.021    0.983    
## gdp1000      2.510e-01  3.011e-02   8.336 1.31e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.232 on 392 degrees of freedom
## Multiple R-squared:  0.5222, Adjusted R-squared:  0.5186 
## F-statistic: 142.8 on 3 and 392 DF,  p-value: < 2.2e-16

Model Assumptions

We see that year and gdpPercap is significant. We should also check for normality of residuals, heteroscedasticity, etc… Which we can do by just calling the plot function on our model

par(mfrow = c(2, 2))
plot(mod1)

qq looks somewhat normal, residuals vs fitted look all right, not a lot of leverage points… alright we’re not being too picky here.


We can also look at component plus residual plots of our data to check linearity of our predictors with respect to the outcome (this is only useful for a multivariate linear model)

library(car)
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
car::crPlots(mod1)

Improving our model

Here we can see that gdp1000 is not necessarily linear, and population might stray from linear as well. We might want to amend this in our model. However population by itself wasn’t significant, so we’ll remove that from our model

mod2 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2) + pop100k, data = mod1_df)
# the I() tells the model to treat the gdpPercap term 'as is,' and not to adjust it or ignore it
summary(mod2)
## 
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k, 
##     data = mod1_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.670  -4.343  -0.003   5.281  16.414 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.260e+02  4.722e+01 -13.257   <2e-16 ***
## year          3.442e-01  2.394e-02  14.380   <2e-16 ***
## gdp1000       7.906e-01  6.539e-02  12.092   <2e-16 ***
## I(gdp1000^2) -6.664e-03  7.332e-04  -9.089   <2e-16 ***
## pop100k       2.956e-04  1.882e-04   1.571    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.49 on 391 degrees of freedom
## Multiple R-squared:  0.6055, Adjusted R-squared:  0.6015 
## F-statistic: 150.1 on 4 and 391 DF,  p-value: < 2.2e-16

If we want to check to see if our model has improved, we can use the anova function to conduct a goodness of fit F-test between two models. The below anova is testing if mod2 fits better than mod1. If p < 0.05 the improvement in fit is statistically significant

Checking model fit

# let's check our model's fit from the first model. The below code
# performs an F-test
anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: lifeExp ~ year + pop100k + gdp1000
## Model 2: lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    392 26566                                  
## 2    391 21933  1    4633.6 82.605 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Broom package

The summary function is great, but it’s hard to work with the output of it directly. To do this, a package was developed called broom which allows you to fortify your data with fitted values, create a data frame with your model output, and get goodness of fit statistics

broom has 3 very useful functions called tidy, augment, and glance. tidy returns the same data as your summary table (and more) in the form of a data frame. You can also ask for confidence intervals, p values, and exponentiation, among other things, depending on the model you’ve fit

tidy

tidy(mod2, conf.int = T, statistic = F, std.error = T, p.value = T, conf.level = 0.95)
## # A tibble: 5 x 7
##   term            estimate std.error statistic  p.value     conf.low   conf.high
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>        <dbl>       <dbl>
## 1 (Intercept)  -626.       47.2         -13.3  2.20e-33 -719.        -533.      
## 2 year            0.344     0.0239       14.4  6.19e-38    0.297        0.391   
## 3 gdp1000         0.791     0.0654       12.1  8.19e-29    0.662        0.919   
## 4 I(gdp1000^2)   -0.00666   0.000733     -9.09 5.09e-18   -0.00811     -0.00522 
## 5 pop100k         0.000296  0.000188      1.57 1.17e- 1   -0.0000744    0.000666

Augment returns a data frame of fitted values for your data, which you can then use to fit your trend line over your data. However, because we have a multivariable model, a trend line doesn’t fit nicely to any one variable. I’ll make a univariate model so this will make more sense.

augment

mod3 <- lm(lifeExp ~ year, data = mod1_df)
augdata <- augment(mod3, interval = "prediction")

augdata %>%
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp)) +
  geom_line(aes(y = .fitted), col = 'red') +
  geom_line(aes(y = .upper), col = 'blue') +
  geom_line(aes(y = .lower), col = 'blue') +
  geom_smooth(aes(y = lifeExp), se = F, color = 'green') +
  labs(title = "LM of LifeExpectancy vs. Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Side note: I will also mention that augment has some drawbacks. If you are using a poisson or negative binomial distribution to fit your data, augment returns fitted values in their non-exponentiated form. An alternative to augment is to get values from the model object itself using the $ operator. Remember, your model output is simply a list, so you can access elements from it like any other list.

mod1$coefficients
##   (Intercept)          year       pop100k       gdp1000 
## -7.833346e+02  4.250632e-01  4.228480e-06  2.510235e-01
summary(mod1$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -26.1852  -5.2575   0.4857   0.0000   5.0062  17.9753
summary(mod1$fitted.values)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   46.47   52.95   59.83   60.06   66.91   82.34

Back to the above visual…As you can see, above we have another test of our fit. We have out fitted line in red, prediction intervals, which we got from the augmented data frame in blue, and a lowess smoother in green, showing slight deviations from linearity about our model. In all honesty you should probably plot a lowess smoother over your data before you make a linear model to check if the relationship is linear. A one shortcut for doing that is to plot a geom_smooth of your data using method = 'loess', and another geom_smooth using method = lm, which will plot a loess smoother and a fitted line, respectively. Another shortcut for doing that is the below code pairs.panels() from the psych package, which plots a scatterplot and correlation matrix of all of selected variables from your data frame

Panel Plots

mod1_df %>%
  dplyr::select(-continent, -country, -pop, -gdpPercap) %>%
  psych::pairs.panels(breaks = 40)

glance

Finally, we have glance, which gives us goodness of fit statistics for our models

knitr::kable(glance(mod1))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.5222142 0.5185577 8.232329 142.8171 0 3 -1394.685 2799.37 2819.277 26566.33 392 396
knitr::kable(glance(mod2))
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6055482 0.6015129 7.489585 150.0623 0 4 -1356.735 2725.47 2749.359 21932.71 391 396

Comparing models side-by-side

Sometimes we want to have our model in a neat output table, and we don’t want to format the table every time we change our model. We can programmatically make model tables with the tab_model function in sjPlot

tab_model

# library(sjPlot)
sjPlot::tab_model(mod1, mod2, mod3, show.intercept = FALSE, show.ci = 0.95, show.se = F, show.p = T,
                  show.r2 = TRUE, show.fstat = T, show.aic = T, show.obs = T, rm.terms = NULL, order.terms = NULL,
                  title = "Comparison of Gapminder Models")
Comparison of Gapminder Models
  life Exp life Exp life Exp
Predictors Estimates CI p Estimates CI p Estimates CI p
year 0.43 0.38 – 0.47 <0.001 0.34 0.30 – 0.39 <0.001 0.45 0.40 – 0.50 <0.001
pop100k 0.00 -0.00 – 0.00 0.983 0.00 -0.00 – 0.00 0.117
gdp1000 0.25 0.19 – 0.31 <0.001 0.79 0.66 – 0.92 <0.001
gdp1000^2 -0.01 -0.01 – -0.01 <0.001
Observations 396 396 396
R2 / R2 adjusted 0.522 / 0.519 0.606 / 0.602 0.436 / 0.434
AIC 2799.370 2725.470 2861.319

Now we can compare models programmatically without having to export the models into an Excel table every time we run our code! This function works with most of R’s built-in modelling functions. There’s another variant called plot_model which will plot estimates of our models together

plot_models

sjPlot::plot_models(mod1, mod2, mod3, show.values = T, colors = "Dark2", ci.lvl = 0.99, axis.lim = c(-0.5, 0.5),
                    dot.size = 2, show.p = TRUE)

obviously, this plot is a little underwhelming because we have a very simple model, and out predictors don’t show great amounts of spread from zero, but you could see how this type of plot could be useful for logistic, poisson, negative binomial, or other types of regressions with much wider confidence intervals.


Many Many Many Models with purrr

So we’ve made a model for one continent…What if we wanted to make a model for all of the continents in our data set? Should we go through that same workflow for every one? There is a faster way, but it’s somewhat tricky, so I’ll present it here and leave it up to your discretion. Often times if you code by brute force method, it can take longer to do, and be many more lines, but the code may be more understandable in the end. I’ll show the programmatic way of making many models though.

Nesting in purrr

R has a package called purrr, which allows the user to nest data within a column to make a new data set. purrr also implements a programmatic concept called mapping, which is simply applying a function to every element of a list. The combination of nesting a data frame into a column, and mapping a function onto that column will allow us to efficiently make many models (though the model specifications have to be the same across groups). Here’s code on how to nest

nest_df <- df %>%
  group_by(continent) %>%
  nest()

head(nest_df)
## # A tibble: 5 x 2
## # Groups:   continent [5]
##   continent data              
##   <fct>     <list>            
## 1 Asia      <tibble [396 x 7]>
## 2 Europe    <tibble [360 x 7]>
## 3 Africa    <tibble [624 x 7]>
## 4 Americas  <tibble [300 x 7]>
## 5 Oceania   <tibble [24 x 7]>

It’s as easy as that! First you group by the category that you want to model across. In this case we want to apply similar models across continents. Now we have to make a function for a model which we can apply across the nested data frames. First I’ll show you what’s contained in the data column

head(nest_df$data[[1]])
## # A tibble: 6 x 7
##   country      year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Afghanistan  1952    28.8  8425333      779.   0.779    84.3
## 2 Afghanistan  1957    30.3  9240934      821.   0.821    92.4
## 3 Afghanistan  1962    32.0 10267083      853.   0.853   103. 
## 4 Afghanistan  1967    34.0 11537966      836.   0.836   115. 
## 5 Afghanistan  1972    36.1 13079460      740.   0.740   131. 
## 6 Afghanistan  1977    38.4 14880372      786.   0.786   149.
tail(nest_df$data[[1]])
## # A tibble: 6 x 7
##   country      year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Yemen, Rep.  1982    49.1  9657618     1978.    1.98    96.6
## 2 Yemen, Rep.  1987    52.9 11219340     1972.    1.97   112. 
## 3 Yemen, Rep.  1992    55.6 13367997     1879.    1.88   134. 
## 4 Yemen, Rep.  1997    58.0 15826497     2117.    2.12   158. 
## 5 Yemen, Rep.  2002    60.3 18701257     2235.    2.23   187. 
## 6 Yemen, Rep.  2007    62.7 22211743     2281.    2.28   222.

The first row of the data column is just a data frame containg data for all countries in Africa. Here we have year, lifeExp, pop, and gdpPercap. Now we need to make a function which we can apply across these tibbles, which take the same variable names (for sake of ease)

Mapping

cont_mod <- function(df){
  mod1 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2), data = df)
  return(mod1)
}

Now that we have our model function, we can map the model onto our nested data frame and make a new variable called model

nest_df_mod <- nest_df %>%
  mutate(model = map(data, cont_mod))

We’ve mapped our model. Let’s see what one of the models looks like

summary(nest_df_mod$model[[1]])
## 
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.9730  -4.3525  -0.2491   5.2605  16.3478 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.396e+02  4.651e+01 -13.754   <2e-16 ***
## year          3.512e-01  2.355e-02  14.912   <2e-16 ***
## gdp1000       7.684e-01  6.395e-02  12.015   <2e-16 ***
## I(gdp1000^2) -6.468e-03  7.238e-04  -8.935   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.504 on 392 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:    0.6 
## F-statistic: 198.5 on 3 and 392 DF,  p-value: < 2.2e-16

Here we have a model of all of the African countries predicting life expectancy. We can continue using map to get a tidy model output, augment our output, or look at fit statistics with glance

Mapping broom functions

nest_df_fortify <- nest_df_mod %>%
  mutate(tdy = map(model, tidy, conf.int = T),
         aug = map(model, augment),
         glnc = map(model, glance))

head(nest_df_fortify)
## # A tibble: 5 x 6
## # Groups:   continent [5]
##   continent data            model  tdy            aug             glnc          
##   <fct>     <list>          <list> <list>         <list>          <list>        
## 1 Asia      <tibble [396 x~ <lm>   <tibble [4 x ~ <tibble [396 x~ <tibble [1 x ~
## 2 Europe    <tibble [360 x~ <lm>   <tibble [4 x ~ <tibble [360 x~ <tibble [1 x ~
## 3 Africa    <tibble [624 x~ <lm>   <tibble [4 x ~ <tibble [624 x~ <tibble [1 x ~
## 4 Americas  <tibble [300 x~ <lm>   <tibble [4 x ~ <tibble [300 x~ <tibble [1 x ~
## 5 Oceania   <tibble [24 x ~ <lm>   <tibble [4 x ~ <tibble [24 x ~ <tibble [1 x ~

unnest

You might be thinking “This is great! But how the hell do we work with this now?” That tripped me up for a while too, but the easiest thing for us to do is something called unnest. It takes a nested column, and expands it back to original tibble size. I’ll demonstrate.

df_tidy <- nest_df_fortify %>%
  select(continent, tdy) %>%
  unnest(tdy)

head(df_tidy)
## # A tibble: 6 x 8
## # Groups:   continent [2]
##   continent term       estimate std.error statistic  p.value  conf.low conf.high
##   <fct>     <chr>         <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1 Asia      (Interce~  -6.40e+2 46.5         -13.8  2.13e-35  -7.31e+2  -5.48e+2
## 2 Asia      year        3.51e-1  0.0236       14.9  3.75e-40   3.05e-1   3.98e-1
## 3 Asia      gdp1000     7.68e-1  0.0640       12.0  1.58e-28   6.43e-1   8.94e-1
## 4 Asia      I(gdp100~  -6.47e-3  0.000724     -8.94 1.61e-17  -7.89e-3  -5.04e-3
## 5 Europe    (Interce~  -1.56e+2 20.5          -7.60 2.57e-13  -1.96e+2  -1.15e+2
## 6 Europe    year        1.11e-1  0.0104       10.6  3.70e-23   9.05e-2   1.31e-1
df_tidy %>%
  filter(term != '(Intercept)') %>%
  ggplot(aes(x = estimate, y = term, color = continent)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.3) +
  facet_wrap(~continent, ncol = 1)

Resource: Tutorial on making error bars from sthda.

Voila! Now we have a data frame of continents with model parameters including standard errors, p values, confidence intervals, etc… This may not be useful for only 5 categories, but imagine if you’re making models for 100 categories! Also, purrr implements this type of stuff extremely efficiently, so running models with map is much faster than running 100 different models individually. For demonstration’s sake, I’ll just show the output of the augment and glance unnests

Utilizing Glance

df_glance <- nest_df_fortify %>%
  select(continent, glnc) %>%
  unnest(glnc)

print(df_glance)
## # A tibble: 5 x 13
## # Groups:   continent [5]
##   continent r.squared adj.r.squared sigma statistic   p.value    df  logLik
##   <fct>         <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>
## 1 Asia          0.603         0.600 7.50       199. 2.76e- 78     3 -1358. 
## 2 Europe        0.754         0.752 2.71       364. 4.87e-108     3  -867. 
## 3 Africa        0.489         0.487 6.56       198. 5.53e- 90     3 -2057. 
## 4 Americas      0.694         0.691 5.20       224. 9.57e- 76     3  -918. 
## 5 Oceania       0.977         0.974 0.613      288. 1.30e- 16     3   -20.1
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

Now we can easily compare fit statistics across countries. We can see that our \(R^2\) value for Africa is much lower than the other continents, suggesting that our model did not fit that country well. Perhaps this requires some tuning in the future.

Utilizing augment

df_aug <- nest_df_fortify %>%
  select(continent, aug) %>%
  unnest(aug)

# illustration of why plotting a linear regression over a multivariate model
# doesn't work
df_aug %>%
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp), alpha = 0.3) +
  geom_point(aes(y = .fitted), color = 'red', alpha = 0.3) +
  facet_wrap(~continent) +
  labs(title = "Fitted values of LifeExpectancy in gapminder")

Stopping point for 2021-03-18


Other Basic Regression Types: glm and glm.nb

R, being a statistical language, has many other types of models that you can implement, most of which can be accessed using the glm function. Let’s load in some data that might allow us to use some different distributions

This is just data cleaning and retrieval, so we aren’t going to cover this in depth

# rm(list = ls())

cases_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
  dplyr::select(-Lat, -Long) %>%
  pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
               names_to = "date",
               values_to = "cases") %>%
  mutate(date = as.Date(date, format = '%m/%d/%y'))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
deaths_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") %>%
  dplyr::select(-Lat, -Long) %>%
  pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
               names_to = "date",
               values_to = "deaths") %>%
  mutate(date = as.Date(date, format = '%m/%d/%y'))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.
COVID_master <- cases_global %>%
  left_join(deaths_global, by = c('Province/State', 'Country/Region', 'date')) %>%
  rename(country = 'Country/Region',
         state = 'Province/State') %>%
  group_by(country, date) %>%
  summarise(cases = sum(cases, na.rm = T),
            deaths = sum(deaths, na.rm = T)) %>%
  mutate(daily_cases = ave(cases, country, FUN = function(x) c(0, diff(x))),
         daily_deaths = ave(deaths, country, FUN = function(x) c(0, diff(x))),
         daily_cases_s = ave(daily_cases, country, FUN = function(x) kza::kz(x, m = 14, k = 2)),
         daily_deaths_s = ave(daily_deaths, country, FUN = function(x) kza::kz(x, m = 14, k = 2)))
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
COVID_master %>%
  ggplot(aes(x = date, y = daily_cases_s, group = country, color = country)) +
  geom_line(alpha = 0.4) +
  theme(legend.position = 'none')

COVID_master %>%
  ggplot(aes(x = date, y = daily_deaths_s, group = country, color = country)) +
  geom_line(alpha = 0.4) +
  theme(legend.position = 'none')

Let’s look at the Netherlands!

COVID_Ned <- COVID_master %>%
  filter(country == "Netherlands")

COVID_Ned %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = daily_cases), color = 'black') +
  geom_line(aes(y = daily_deaths), color = 'black') +
  geom_line(aes(y = daily_cases_s), color = 'blue') +
  geom_line(aes(y = daily_deaths_s), color = 'firebrick3')

COVID_Ned %>%
  pivot_longer(cols = c('daily_cases', 'daily_deaths'),
               names_to = 'outcomes',
               values_to = 'values') %>%
  ggplot() +
  geom_histogram(aes(x = values)) +
  facet_wrap(~outcomes)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean(COVID_Ned$daily_cases)
## [1] 2882.309
var(COVID_Ned$daily_cases)
## [1] 10669766
mean(COVID_Ned$daily_deaths)
## [1] 38.58314
var(COVID_Ned$daily_deaths)
## [1] 1938.492

Here we see that the mean value of cases and deaths is much less than the variance. In addition, both data are highly skewed. We could either use a transformation to normalize these data, or use a glm distribution that accomodates these factors, such as poisson or negative binomial distributions. We can do this using the glm() function. Let’s model how COVID rates change by the weekday

COVID_Ned_wk <- COVID_Ned %>%
  mutate(DoW = weekdays(date))

pmod <- glm(daily_cases ~ date +DoW + daily_cases_s + I(daily_cases_s**2), family = 'poisson', data = COVID_Ned_wk)
summary(pmod)
## 
## Call:
## glm(formula = daily_cases ~ date + DoW + daily_cases_s + I(daily_cases_s^2), 
##     family = "poisson", data = COVID_Ned_wk)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -25.795  -13.046   -3.805    7.590   31.223  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)        -1.647e+01  3.107e-01  -53.017  < 2e-16 ***
## date                1.209e-03  1.688e-05   71.643  < 2e-16 ***
## DoWMonday          -1.897e-01  3.377e-03  -56.184  < 2e-16 ***
## DoWSaturday        -4.044e-02  3.257e-03  -12.417  < 2e-16 ***
## DoWSunday          -7.536e-02  3.281e-03  -22.970  < 2e-16 ***
## DoWThursday        -1.951e-02  3.249e-03   -6.004 1.92e-09 ***
## DoWTuesday         -2.439e-01  3.422e-03  -71.289  < 2e-16 ***
## DoWWednesday       -1.041e-01  3.326e-03  -31.297  < 2e-16 ***
## daily_cases_s       6.771e-04  1.723e-06  392.992  < 2e-16 ***
## I(daily_cases_s^2) -3.527e-08  1.414e-10 -249.405  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1606716  on 426  degrees of freedom
## Residual deviance:   82417  on 417  degrees of freedom
## AIC: 85898
## 
## Number of Fisher Scoring iterations: 5

We have model output, but it’s difficult to interpret poisson regression coefficients without exponentiating them. Let’s bring tidy back into the mix

tidy(pmod, conf.int = TRUE, exponentiate = T)
## # A tibble: 10 x 7
##    term             estimate std.error statistic   p.value   conf.low  conf.high
##    <chr>               <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
##  1 (Intercept)       7.02e-8  3.11e- 1    -53.0  0.           3.82e-8    1.29e-7
##  2 date              1.00e+0  1.69e- 5     71.6  0.           1.00e+0    1.00e+0
##  3 DoWMonday         8.27e-1  3.38e- 3    -56.2  0.           8.22e-1    8.33e-1
##  4 DoWSaturday       9.60e-1  3.26e- 3    -12.4  2.11e- 35    9.54e-1    9.67e-1
##  5 DoWSunday         9.27e-1  3.28e- 3    -23.0  9.38e-117    9.21e-1    9.33e-1
##  6 DoWThursday       9.81e-1  3.25e- 3     -6.00 1.92e-  9    9.74e-1    9.87e-1
##  7 DoWTuesday        7.84e-1  3.42e- 3    -71.3  0.           7.78e-1    7.89e-1
##  8 DoWWednesday      9.01e-1  3.33e- 3    -31.3  5.14e-215    8.95e-1    9.07e-1
##  9 daily_cases_s     1.00e+0  1.72e- 6    393.   0.           1.00e+0    1.00e+0
## 10 I(daily_case~     1.00e+0  1.41e-10   -249.   0.           1.00e+0    1.00e+0
augment(pmod) %>%
  ggplot(aes(x = date)) +
  geom_point(aes(y = daily_cases)) +
  geom_line(aes(y = exp(.fitted)), color = 'red')

Negative Binomial glm

Negative binomial regression models are generalizations of poisson-distributed models, where an added parameter for overdispertion is included in the model. In a poisson-distribution, the variance is considered to be equal to the mean of the distribution. This is therefore a model assumption of a poisson model of counts. Sometimes, however, our data does not fit this frame of variance being equal to the mean. In this case, we must estimate the overdispersion with a dispersion parameter, which is accounted for with the glm.nb function from the MASS package. First, let’s calculate the mean and variation of the distribution of cases in our dataset

m <- mean(COVID_Ned_wk$daily_cases)
v <- var(COVID_Ned_wk$daily_cases)

print(paste0("The mean value of daily_cases is ", m))
## [1] "The mean value of daily_cases is 2882.30913348946"
print(paste0("The variance of daily_cases is ", v))
## [1] "The variance of daily_cases is 10669766.336137"
print(paste0("The ratio of variance:mean is ", v/m))
## [1] "The ratio of variance:mean is 3701.81192994372"

Here we can see that the variance is much larger than the mean of daily cases. We can get more specific and look at the ratio of mean to variance on a monthly basis, which might make more sense than looking at the entire time series together

COVID_Ned_wk %>%
  mutate(month = months(date)) %>%
  group_by(month) %>%
  summarise(mean = mean(daily_cases, na.rm = T),
            var = var(daily_cases, na.rm = T),
            VMratio = var/mean)
## # A tibble: 12 x 4
##    month      mean      var VMratio
##    <chr>     <dbl>    <dbl>   <dbl>
##  1 April      895.   83699.    93.5
##  2 August     601.   29506.    49.1
##  3 December  8918. 5792755.   650. 
##  4 February  1956. 4376718.  2238. 
##  5 January   4480. 8321530.  1857. 
##  6 July       132.    7526.    56.8
##  7 June       128.    2812.    22.0
##  8 March     2590. 7170567.  2769. 
##  9 May        230.    8999.    39.1
## 10 November  5810. 1460808.   251. 
## 11 October   7473. 5033898.   674. 
## 12 September 1759.  716246.   407.

Above we can see that throughout all months, the ratio of variance to mean is extremely high, showing that overdispersion is characteristic of the entire time series, not just one or two segments. Therefore it would make sense to apply a negative binomial generalized linear model (which is extremely easy to do)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
NBmod <- glm.nb(daily_cases ~ date +DoW + daily_cases_s + I(daily_cases_s**2), data = COVID_Ned_wk)
tdyNBmod <- tidy(NBmod, conf.int = T, exponentiate = T)
augNBmod <- augment(NBmod)
gln <- glance(NBmod)

below, we can plot the fitted values of our model to the data. As you can see, the fit looks exactly the same as the poisson model, but the standard errors and confidence intervals should be different.

plot(x = augNBmod$date, y = augNBmod$daily_cases)
lines(x = augNBmod$date, y = exp(augNBmod$.fitted), col = 'red', lwd = 2)

our model fit isn’t half bad… Even so, there are portions of the model that don’t seem to fit quite right, and it would be interesting to see the biases in our model fit, and if there are areas that can be improved. Unfortunately, support for glm.nb models isn’t as fully developed as for linear models, so we have to do some of the grunt work ourselves. The .resid component of the augment output for glm.nb models is somewhat unusable, so we can make our own residuals by taking the difference between our original data and the exponent of our fitted values.

augNBmod_fit <- augNBmod %>%
  mutate(.resid = daily_cases - exp(.fitted))

augNBmod_fit %>%
  ggplot(aes(x = date, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

As we can see, it looks like the residuals are still heavily autocorrelated. R has some cool functions for time series that lets you see the extent of this correlation. One is called acf() which just calculates an autocorrelation function. Here we can calculate both the autocorrelation of our residuals and the autocorrelation of our original plot

acf(augNBmod_fit$.resid, lag.max = 40)

acf(augNBmod_fit$daily_cases, lag.max = 40)

We see above that there is obvious autocorrelation in our residuals in addition to our original plot, as we would expect with a time series This somewhat throws a wrench in our analysis…

Logistic regressions

Logistic regressions are implemented very simply with the glm() function, specifying the argument family = 'binomial'. I’ll adapt the gapminder dataset below to demonstrate

df <- gapminder %>%
  mutate(lifeExpOv75 = ifelse(lifeExp > 75, 1, 0))

# now we can use lifeExpOv75 as our logistic regression outcome variable
logmod <- glm(lifeExpOv75 ~ year + I(gdpPercap/1000) + continent, family = 'binomial', data = df)

tidy(logmod, exponentiate = T, conf.int = T) %>%
  mutate(across(where(is.numeric), round, 2))
## # A tibble: 7 x 7
##   term              estimate std.error statistic p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)           0        27.5      -9.99    0        0         0   
## 2 year                  1.14      0.01      9.81    0        1.12      1.18
## 3 I(gdpPercap/1000)     1.15      0.01      9.54    0        1.12      1.18
## 4 continentAmericas    17.5       0.76      3.76    0        4.85    112.  
## 5 continentAsia         6.94      0.78      2.47    0.01     1.81     45.8 
## 6 continentEurope      50.0       0.77      5.11    0       13.8     323.  
## 7 continentOceania     79.1       1.02      4.29    0       12.0     732

here we see in our tidy output that every year increase is significantly associated with 1.14 times the odds of having a life expectancy over 75. Every $1000 increase in gdpPercap is associated with 1.15 times the odds of having a life expectancy over 75. In addition, we can see that Africa is not shown in the model output. This is because Africa serves as the reference country in our model. If we wanted to change that, we would have to relevel the continent factor. We can see that though our confidence intervals are extremely wide, they are still statistically significant. However, if we chose a different reference level, we may be able to have more interprable confidence intervals.

One final thing, we want to mention that intercepts are not interprable in this model, as the intercepts represent life expectancy at year 0, where gdpPercapita is 0, and the continent is Africa. If we wanted interprable intercepts, we would have to mean-standardize all of our variables in our model.

Note: This tutorial is very similar to the tutorial in R for data science by Hadley Wickham and Garret Grolemund in their “Many Models” chapter (Chapter 25). I do not take credit for the ideas presented here, as this tutorial is an adaptation of their tutorial, using similar data. For those looking for a more formal introduction to building many models, and the mechanics behind it, I’ll leave a link to their book chapter here.