Getting our data

df <- gapminder %>%
  mutate(gdp1000 = gdpPercap/1000,
         pop100k = pop/100000)
head(df)

## # A tibble: 6 x 8
##   country     continent  year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.   0.779    84.3
## 2 Afghanistan Asia       1957    30.3  9240934      821.   0.821    92.4
## 3 Afghanistan Asia       1962    32.0 10267083      853.   0.853   103. 
## 4 Afghanistan Asia       1967    34.0 11537966      836.   0.836   115. 
## 5 Afghanistan Asia       1972    36.1 13079460      740.   0.740   131. 
## 6 Afghanistan Asia       1977    38.4 14880372      786.   0.786   149.

Taking a look at our data characteristics

summary(df)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap           gdp1000            pop100k        
##  Min.   :6.001e+04   Min.   :   241.2   Min.   :  0.2412   Min.   :    0.60  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1   1st Qu.:  1.2021   1st Qu.:   27.94  
##  Median :7.024e+06   Median :  3531.8   Median :  3.5318   Median :   70.24  
##  Mean   :2.960e+07   Mean   :  7215.3   Mean   :  7.2153   Mean   :  296.01  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5   3rd Qu.:  9.3255   3rd Qu.:  195.85  
##  Max.   :1.319e+09   Max.   :113523.1   Max.   :113.5231   Max.   :13186.83  
##

unique(df$country) #lot of countries, let's visualize

##   [1] Afghanistan              Albania                  Algeria                 
##   [4] Angola                   Argentina                Australia               
##   [7] Austria                  Bahrain                  Bangladesh              
##  [10] Belgium                  Benin                    Bolivia                 
##  [13] Bosnia and Herzegovina   Botswana                 Brazil                  
##  [16] Bulgaria                 Burkina Faso             Burundi                 
##  [19] Cambodia                 Cameroon                 Canada                  
##  [22] Central African Republic Chad                     Chile                   
##  [25] China                    Colombia                 Comoros                 
##  [28] Congo, Dem. Rep.         Congo, Rep.              Costa Rica              
##  [31] Cote d'Ivoire            Croatia                  Cuba                    
##  [34] Czech Republic           Denmark                  Djibouti                
##  [37] Dominican Republic       Ecuador                  Egypt                   
##  [40] El Salvador              Equatorial Guinea        Eritrea                 
##  [43] Ethiopia                 Finland                  France                  
##  [46] Gabon                    Gambia                   Germany                 
##  [49] Ghana                    Greece                   Guatemala               
##  [52] Guinea                   Guinea-Bissau            Haiti                   
##  [55] Honduras                 Hong Kong, China         Hungary                 
##  [58] Iceland                  India                    Indonesia               
##  [61] Iran                     Iraq                     Ireland                 
##  [64] Israel                   Italy                    Jamaica                 
##  [67] Japan                    Jordan                   Kenya                   
##  [70] Korea, Dem. Rep.         Korea, Rep.              Kuwait                  
##  [73] Lebanon                  Lesotho                  Liberia                 
##  [76] Libya                    Madagascar               Malawi                  
##  [79] Malaysia                 Mali                     Mauritania              
##  [82] Mauritius                Mexico                   Mongolia                
##  [85] Montenegro               Morocco                  Mozambique              
##  [88] Myanmar                  Namibia                  Nepal                   
##  [91] Netherlands              New Zealand              Nicaragua               
##  [94] Niger                    Nigeria                  Norway                  
##  [97] Oman                     Pakistan                 Panama                  
## [100] Paraguay                 Peru                     Philippines             
## [103] Poland                   Portugal                 Puerto Rico             
## [106] Reunion                  Romania                  Rwanda                  
## [109] Sao Tome and Principe    Saudi Arabia             Senegal                 
## [112] Serbia                   Sierra Leone             Singapore               
## [115] Slovak Republic          Slovenia                 Somalia                 
## [118] South Africa             Spain                    Sri Lanka               
## [121] Sudan                    Swaziland                Sweden                  
## [124] Switzerland              Syria                    Taiwan                  
## [127] Tanzania                 Thailand                 Togo                    
## [130] Trinidad and Tobago      Tunisia                  Turkey                  
## [133] Uganda                   United Kingdom           United States           
## [136] Uruguay                  Venezuela                Vietnam                 
## [139] West Bank and Gaza       Yemen, Rep.              Zambia                  
## [142] Zimbabwe                
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

visualizing our data

df %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = continent, group = country)) +
  labs(title = "Gapminder Life Expectancy By Year")

df %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = continent, group = country)) +
  geom_smooth(se = T, method = 'loess') +
  facet_wrap(~continent) +
  labs(title = "Gapminder Life Expectancy By Year")

## `geom_smooth()` using formula 'y ~ x'

Basic Modelling

names(df)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## [7] "gdp1000"   "pop100k"

Let’s create a basic model of life expectancy, predicted by GDP per capita, year, and population. But let’s start with just Asia

# keep only countries within the continent of Asia
mod1_df <- df %>%
  filter(continent == "Asia")

mod1 <- lm(lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
# can summarise our model with the summary function
summary(mod1)

## 
## Call:
## lm(formula = lifeExp ~ year + pop100k + gdp1000, data = mod1_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1852  -5.2575   0.4857   5.0062  17.9753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.833e+02  4.829e+01 -16.221  < 2e-16 ***
## year         4.251e-01  2.442e-02  17.404  < 2e-16 ***
## pop100k      4.228e-06  2.039e-04   0.021    0.983    
## gdp1000      2.510e-01  3.011e-02   8.336 1.31e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.232 on 392 degrees of freedom
## Multiple R-squared:  0.5222, Adjusted R-squared:  0.5186 
## F-statistic: 142.8 on 3 and 392 DF,  p-value: < 2.2e-16

Model Assumptions

We see that year and gdpPercap is significant. We should also check for normality of residuals, heteroscedasticity, etc… Which we can do by just calling the plot function on our model

par(mfrow = c(2, 2))
plot(mod1)

qq looks somewhat normal, residuals vs fitted look all right, not a lot of leverage points… alright we’re not being too picky here.

We can also look at component plus residual plots of our data to check linearity of our predictors with respect to the outcome (this is only useful for a multivariate linear model)

library(car)

## Loading required package: carData

## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

car::crPlots(mod1)

Improving our model

Here we can see that `gdpPercap is not necessarily linear, and population might stray from linear as well. We might want to amend this in our model. However population by itself wasn’t significant, so we’ll remove that from our model

mod2 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2) + pop100k, data = mod1_df)
# the I() tells the model to treat the gdpPercap term 'as is,' and not to adjust it or ignore it
summary(mod2)

## 
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k, 
##     data = mod1_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.670  -4.343  -0.003   5.281  16.414 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.260e+02  4.722e+01 -13.257   <2e-16 ***
## year          3.442e-01  2.394e-02  14.380   <2e-16 ***
## gdp1000       7.906e-01  6.539e-02  12.092   <2e-16 ***
## I(gdp1000^2) -6.664e-03  7.332e-04  -9.089   <2e-16 ***
## pop100k       2.956e-04  1.882e-04   1.571    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.49 on 391 degrees of freedom
## Multiple R-squared:  0.6055, Adjusted R-squared:  0.6015 
## F-statistic: 150.1 on 4 and 391 DF,  p-value: < 2.2e-16

If we want to check to see if our model has improved, we can use the anova function to conduct a goodness of fit F-test between two models. The below anova is testing if mod2 fits better than mod1. If p < 0.05 the improvement in fit is statistically significant

Checking model fit

# let's check our model's fit from the first model. The below code
# performs an F-test
anova(mod1, mod2)

## Analysis of Variance Table
## 
## Model 1: lifeExp ~ year + pop100k + gdp1000
## Model 2: lifeExp ~ year + gdp1000 + I(gdp1000^2) + pop100k
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    392 26566                                  
## 2    391 21933  1    4633.6 82.605 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Broom package

The summary function is great, but it’s hard to work with the output of it directly. To do this, a package was developed called broom which allows you to fortify your data with fitted values, create a data frame with your model output, and get goodness of fit statistics

broom has 3 very useful functions called tidy, augment, and glance. tidy returns the same data as your summary table (and more) in the form of a data frame. You can also ask for confidence intervals, p values, and exponentiation, among other things, depending on the model you’ve fit

`tidy`

tidy(mod2, conf.int = T, statistic = F, std.error = T, p.value = T, conf.level = 0.95)

## # A tibble: 5 x 7
##   term            estimate std.error statistic  p.value     conf.low   conf.high
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>        <dbl>       <dbl>
## 1 (Intercept)  -626.       47.2         -13.3  2.20e-33 -719.        -533.      
## 2 year            0.344     0.0239       14.4  6.19e-38    0.297        0.391   
## 3 gdp1000         0.791     0.0654       12.1  8.19e-29    0.662        0.919   
## 4 I(gdp1000^2)   -0.00666   0.000733     -9.09 5.09e-18   -0.00811     -0.00522 
## 5 pop100k         0.000296  0.000188      1.57 1.17e- 1   -0.0000744    0.000666

Augment returns a data frame of fitted values for your data, which you can then use to fit your trend line over your data. However, because we have a multivariable model, a trend line doesn’t fit nicely to any one variable. I’ll make a univariate model so this will make more sense.

`augment`

mod3 <- lm(lifeExp ~ year, data = mod1_df)
augdata <- augment(mod3, interval = "prediction")

augdata %>%
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp)) +
  geom_line(aes(y = .fitted), col = 'red') +
  geom_line(aes(y = .upper), col = 'blue') +
  geom_line(aes(y = .lower), col = 'blue') +
  geom_smooth(aes(y = lifeExp), se = F, color = 'green') +
  labs(title = "LM of LifeExpectancy vs. Year")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Side note: I will also mention that augment has some drawbacks. If you are using a poisson or negative binomial distribution to fit your data, augment returns fitted values in their non-exponentiated form. An alternative to augment is to get values from the model object itself using the $ operator. Remember, your model output is simply a list, so you can access elements from it like any other list.

mod1$coefficients

##   (Intercept)          year       pop100k       gdp1000 
## -7.833346e+02  4.250632e-01  4.228480e-06  2.510235e-01

summary(mod1$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -26.1852  -5.2575   0.4857   0.0000   5.0062  17.9753

summary(mod1$fitted.values)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   46.47   52.95   59.83   60.06   66.91   82.34

Back to the above visual…As you can see, above we have another test of our fit. We have out fitted line in red, prediction intervals, which we got from the augmented data frame in blue, and a lowess smoother in green, showing slight deviations from linearity about our model. In all honesty you should probably plot a lowess smoother over your data before you make a linear model to check if the relationship is linear. A one shortcut for doing that is to plot a geom_smooth of your data using method = 'loess', and another geom_smooth using method = lm, which will plot a loess smoother and a fitted line, respectively. Another shortcut for doing that is the below code pairs.panels() from the psych package, which plots a scatterplot and correlation matrix of all of selected variables from your data frame

Panel Plots

mod1_df %>%
  dplyr::select(-continent, -country, -pop, -gdpPercap) %>%
  psych::pairs.panels()

`glance`

Finally, we have glance, which gives us goodness of fit statistics for our models

knitr::kable(glance(mod1))

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.5222142	0.5185577	8.232329	142.8171	0	3	-1394.685	2799.37	2819.277	26566.33	392	396

knitr::kable(glance(mod2))

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.6055482	0.6015129	7.489585	150.0623	0	4	-1356.735	2725.47	2749.359	21932.71	391	396

Comparing models side-by-side

Sometimes we want to have our model in a neat output table, and we don’t want to format the table every time we change our model. We can programmatically make model tables with the tab_model function in sjPlot

`tab_model`

sjPlot::tab_model(mod1, mod2, mod3, show.intercept = FALSE, show.ci = 0.95, show.se = F, show.p = T,
                  show.r2 = TRUE, show.fstat = T, show.aic = T, show.obs = T, rm.terms = NULL, order.terms = NULL,
                  title = "Comparison of Gapminder Models")

Comparison of Gapminder Models
	life Exp			life Exp			life Exp
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
year	0.43	0.38 – 0.47	<0.001	0.34	0.30 – 0.39	<0.001	0.45	0.40 – 0.50	<0.001
pop100k	0.00	-0.00 – 0.00	0.983	0.00	-0.00 – 0.00	0.117
gdp1000	0.25	0.19 – 0.31	<0.001	0.79	0.66 – 0.92	<0.001
gdp1000^2				-0.01	-0.01 – -0.01	<0.001
Observations	396			396			396
R² / R² adjusted	0.522 / 0.519			0.606 / 0.602			0.436 / 0.434
AIC	2799.370			2725.470			2861.319

Now we can compare models programmatically without having to export the models into an Excel table every time we run our code! This function works with most of R’s built-in modelling functions. There’s another variant called plot_model which will plot estimates of our models together

`plot_models`

sjPlot::plot_models(mod1, mod2, mod3, show.values = T, colors = "Dark2", ci.lvl = 0.99, axis.lim = c(-0.5, 0.5),
                    dot.size = 2, show.p = TRUE)

obviously, this plot is a little underwhelming because we have a very simple model, and out predictors don’t show great amounts of spread from zero, but you could see how this type of plot could be useful for logistic, poisson, negative binomial, or other types of regressions with much wider confidence intervals.

Many Many Many Models with purrr

So we’ve made a model for one continent…What if we wanted to make a model for all of the continents in our data set? Should we go through that same workflow for every one? There is a faster way, but it’s somewhat tricky, so I’ll present it here and leave it up to your discretion. Often times if you code by brute force method, it can take longer to do, and be many more lines, but the code may be more understandable in the end. I’ll show the programmatic way of making many models though.

Nesting in purrr

R has a package called purrr, which allows the user to nest data within a colum to make a new data set. purrr also implements a programmatic concept called mapping, which is simply applying a function to every element of a list. The combination of nesting a data frame into a column, and mapping a function onto that column will allow us to efficiently make many models (though the model specifications have to be the same across groups). Here’s code on how to nest

nest_df <- df %>%
  group_by(continent) %>%
  nest()

head(nest_df)

## # A tibble: 5 x 2
## # Groups:   continent [5]
##   continent data              
##   <fct>     <list>            
## 1 Asia      <tibble [396 x 7]>
## 2 Europe    <tibble [360 x 7]>
## 3 Africa    <tibble [624 x 7]>
## 4 Americas  <tibble [300 x 7]>
## 5 Oceania   <tibble [24 x 7]>

It’s as easy as that! First you group by the category that you want to model across. In this case we want to apply similar models across continents. Now we have to make a function for a model which we can apply across the nested data frames. First I’ll show you what’s contained in the data column

head(nest_df$data[[1]])

## # A tibble: 6 x 7
##   country      year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Afghanistan  1952    28.8  8425333      779.   0.779    84.3
## 2 Afghanistan  1957    30.3  9240934      821.   0.821    92.4
## 3 Afghanistan  1962    32.0 10267083      853.   0.853   103. 
## 4 Afghanistan  1967    34.0 11537966      836.   0.836   115. 
## 5 Afghanistan  1972    36.1 13079460      740.   0.740   131. 
## 6 Afghanistan  1977    38.4 14880372      786.   0.786   149.

tail(nest_df$data[[1]])

## # A tibble: 6 x 7
##   country      year lifeExp      pop gdpPercap gdp1000 pop100k
##   <fct>       <int>   <dbl>    <int>     <dbl>   <dbl>   <dbl>
## 1 Yemen, Rep.  1982    49.1  9657618     1978.    1.98    96.6
## 2 Yemen, Rep.  1987    52.9 11219340     1972.    1.97   112. 
## 3 Yemen, Rep.  1992    55.6 13367997     1879.    1.88   134. 
## 4 Yemen, Rep.  1997    58.0 15826497     2117.    2.12   158. 
## 5 Yemen, Rep.  2002    60.3 18701257     2235.    2.23   187. 
## 6 Yemen, Rep.  2007    62.7 22211743     2281.    2.28   222.

The first row of the data column is just a data frame containg data for all countries in Africa. Here we have year, lifeExp, pop, and gdpPercap. Now we need to make a function which we can apply across these tibbles, which take the same variable names (for sake of ease)

Mapping

cont_mod <- function(df){
  mod1 <- lm(lifeExp ~ year + gdp1000 + I(gdp1000**2), data = df)
  return(mod1)
}

Now that we have our model function, we can map the model onto our nested data frame and make a new variable called model

nest_df_mod <- nest_df %>%
  mutate(model = map(data, cont_mod))

We’ve mapped our model. Let’s see what one of the models looks like

summary(nest_df_mod$model[[1]])

## 
## Call:
## lm(formula = lifeExp ~ year + gdp1000 + I(gdp1000^2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.9730  -4.3525  -0.2491   5.2605  16.3478 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.396e+02  4.651e+01 -13.754   <2e-16 ***
## year          3.512e-01  2.355e-02  14.912   <2e-16 ***
## gdp1000       7.684e-01  6.395e-02  12.015   <2e-16 ***
## I(gdp1000^2) -6.468e-03  7.238e-04  -8.935   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.504 on 392 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:    0.6 
## F-statistic: 198.5 on 3 and 392 DF,  p-value: < 2.2e-16

Here we have a model of all of the African countries predicting life expectancy. We can continue using map to get a tidy model output, augment our output, or look at fit statistics with glance

Mapping broom functions

nest_df_fortify <- nest_df_mod %>%
  mutate(tdy = map(model, tidy, conf.int = T),
         aug = map(model, augment),
         glnc = map(model, glance))

head(nest_df_fortify)

## # A tibble: 5 x 6
## # Groups:   continent [5]
##   continent data            model  tdy            aug             glnc          
##   <fct>     <list>          <list> <list>         <list>          <list>        
## 1 Asia      <tibble [396 x~ <lm>   <tibble [4 x ~ <tibble [396 x~ <tibble [1 x ~
## 2 Europe    <tibble [360 x~ <lm>   <tibble [4 x ~ <tibble [360 x~ <tibble [1 x ~
## 3 Africa    <tibble [624 x~ <lm>   <tibble [4 x ~ <tibble [624 x~ <tibble [1 x ~
## 4 Americas  <tibble [300 x~ <lm>   <tibble [4 x ~ <tibble [300 x~ <tibble [1 x ~
## 5 Oceania   <tibble [24 x ~ <lm>   <tibble [4 x ~ <tibble [24 x ~ <tibble [1 x ~

`unnest`

You might be thinking “This is great! But how the hell do we work with this now?” That tripped me up for a while too, but the easiest thing for us to do is something called unnest. It takes a nested column, and expands it back to original tibble size. I’ll demonstrate.

df_tidy <- nest_df_fortify %>%
  select(continent, tdy) %>%
  unnest(tdy)

head(df_tidy)

## # A tibble: 6 x 8
## # Groups:   continent [2]
##   continent term       estimate std.error statistic  p.value  conf.low conf.high
##   <fct>     <chr>         <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1 Asia      (Interce~  -6.40e+2 46.5         -13.8  2.13e-35  -7.31e+2  -5.48e+2
## 2 Asia      year        3.51e-1  0.0236       14.9  3.75e-40   3.05e-1   3.98e-1
## 3 Asia      gdp1000     7.68e-1  0.0640       12.0  1.58e-28   6.43e-1   8.94e-1
## 4 Asia      I(gdp100~  -6.47e-3  0.000724     -8.94 1.61e-17  -7.89e-3  -5.04e-3
## 5 Europe    (Interce~  -1.56e+2 20.5          -7.60 2.57e-13  -1.96e+2  -1.15e+2
## 6 Europe    year        1.11e-1  0.0104       10.6  3.70e-23   9.05e-2   1.31e-1

df_tidy %>%
  filter(term != '(Intercept)') %>%
  ggplot(aes(x = estimate, y = term, color = continent)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.3) +
  facet_wrap(~continent, ncol = 1)

Resource: Tutorial on making error bars from sthda.

Voila! Now we have a data frame of continents with model parameters including standard errors, p values, confidence intervals, etc… This may not be useful for only 5 categories, but imagine if you’re making models for 100 categories! Also, purrr implements this type of stuff extremely efficiently, so running models with map is much faster than running 100 different models individually. For demonstration’s sake, I’ll just show the output of the augment and glance unnests

Utilizing Glance

df_glance <- nest_df_fortify %>%
  select(continent, glnc) %>%
  unnest(glnc)

print(df_glance)

## # A tibble: 5 x 13
## # Groups:   continent [5]
##   continent r.squared adj.r.squared sigma statistic   p.value    df  logLik
##   <fct>         <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>   <dbl>
## 1 Asia          0.603         0.600 7.50       199. 2.76e- 78     3 -1358. 
## 2 Europe        0.754         0.752 2.71       364. 4.87e-108     3  -867. 
## 3 Africa        0.489         0.487 6.56       198. 5.53e- 90     3 -2057. 
## 4 Americas      0.694         0.691 5.20       224. 9.57e- 76     3  -918. 
## 5 Oceania       0.977         0.974 0.613      288. 1.30e- 16     3   -20.1
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

Now we can easily compare fit statistics across countries. We can see that our $R^2$ value for Africa is much lower than the other continents, suggesting that our model did not fit that country well. Perhaps this requires some tuning in the future.

Utilizing `augment`

df_aug <- nest_df_fortify %>%
  select(continent, aug) %>%
  unnest(aug)

# illustration of why plotting a linear regression over a multivariate model
# doesn't work
df_aug %>%
  ggplot(aes(x = year)) +
  geom_point(aes(y = lifeExp), alpha = 0.3) +
  geom_point(aes(y = .fitted), color = 'red', alpha = 0.3) +
  facet_wrap(~continent) +
  labs(title = "Fitted values of LifeExpectancy in gapminder")

Stopping point for 2021-03-18

Other Basic Regression Types: glm and glm.nb

R, being a statistical language, has many other types of models that you can implement, most of which can be accessed using the glm function. Let’s load in some data that might allow us to use some different distributions

This is just data cleaning and retrieval, so we aren’t going to cover this in depth

# rm(list = ls())

cases_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
  dplyr::select(-Lat, -Long) %>%
  pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
               names_to = "date",
               values_to = "cases") %>%
  mutate(date = as.Date(date, format = '%m/%d/%y'))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.

deaths_global <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv") %>%
  dplyr::select(-Lat, -Long) %>%
  pivot_longer(cols = c(everything(), -`Province/State`, -`Country/Region`),
               names_to = "date",
               values_to = "deaths") %>%
  mutate(date = as.Date(date, format = '%m/%d/%y'))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Province/State` = col_character(),
##   `Country/Region` = col_character()
## )
## i Use `spec()` for the full column specifications.

COVID_master <- cases_global %>%
  left_join(deaths_global, by = c('Province/State', 'Country/Region', 'date')) %>%
  rename(country = 'Country/Region',
         state = 'Province/State') %>%
  group_by(country, date) %>%
  summarise(cases = sum(cases, na.rm = T),
            deaths = sum(deaths, na.rm = T)) %>%
  mutate(daily_cases = ave(cases, country, FUN = function(x) c(0, diff(x))),
         daily_deaths = ave(deaths, country, FUN = function(x) c(0, diff(x))),
         daily_cases_s = ave(daily_cases, country, FUN = function(x) kza::kz(x, m = 14, k = 2)),
         daily_deaths_s = ave(daily_deaths, country, FUN = function(x) kza::kz(x, m = 14, k = 2)))

## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

COVID_master %>%
  ggplot(aes(x = date, y = daily_cases_s, group = country, color = country)) +
  geom_line(alpha = 0.4) +
  theme(legend.position = 'none')

COVID_master %>%
  ggplot(aes(x = date, y = daily_deaths_s, group = country, color = country)) +
  geom_line(alpha = 0.4) +
  theme(legend.position = 'none')

Let’s look at the Netherlands!

COVID_Ned <- COVID_master %>%
  filter(country == "Netherlands")

COVID_Ned %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = daily_cases), color = 'black') +
  geom_line(aes(y = daily_deaths), color = 'black') +
  geom_line(aes(y = daily_cases_s), color = 'blue') +
  geom_line(aes(y = daily_deaths_s), color = 'firebrick3')

COVID_Ned %>%
  pivot_longer(cols = c('daily_cases', 'daily_deaths'),
               names_to = 'outcomes',
               values_to = 'values') %>%
  ggplot() +
  geom_histogram(aes(x = values)) +
  facet_wrap(~outcomes)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean(COVID_Ned$daily_cases)

## [1] 2825.907

var(COVID_Ned$daily_cases)

## [1] 10587626

mean(COVID_Ned$daily_deaths)

## [1] 38.71259

var(COVID_Ned$daily_deaths)

## [1] 1961.939

Here we see that the mean value of cases and deaths is much less than the variance. In addition, both data are highly skewed. We could either use a transformation to normalize these data, or use a glm distribution that accomodates these factors, such as poisson or negative binomial distributions. We can do this using the glm() function. Let’s model how COVID rates change by the weekday

COVID_Ned_wk <- COVID_Ned %>%
  mutate(DoW = weekdays(date))

pmod <- glm(daily_cases ~ date +DoW + daily_cases_s, family = 'poisson', data = COVID_Ned_wk)
summary(pmod)

## 
## Call:
## glm(formula = daily_cases ~ date + DoW + daily_cases_s, family = "poisson", 
##     data = COVID_Ned_wk)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -40.279  -19.440   -5.108    9.926   45.089  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -7.264e+01  2.470e-01 -294.08  < 2e-16 ***
## date           4.284e-03  1.333e-05  321.30  < 2e-16 ***
## DoWMonday     -1.859e-01  3.445e-03  -53.95  < 2e-16 ***
## DoWSaturday   -4.227e-02  3.325e-03  -12.71  < 2e-16 ***
## DoWSunday     -7.365e-02  3.348e-03  -22.00  < 2e-16 ***
## DoWThursday   -1.456e-02  3.310e-03   -4.40 1.08e-05 ***
## DoWTuesday    -2.372e-01  3.490e-03  -67.99  < 2e-16 ***
## DoWWednesday  -1.027e-01  3.359e-03  -30.57  < 2e-16 ***
## daily_cases_s  2.536e-04  3.536e-07  717.17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1582339  on 420  degrees of freedom
## Residual deviance:  144895  on 412  degrees of freedom
## AIC: 148309
## 
## Number of Fisher Scoring iterations: 5

We have model output, but it’s difficult to interpret poisson regression coefficients without exponentiating them. Let’s bring tidy back into the mix

tidy(pmod, conf.int = TRUE, exponentiate = T)

## # A tibble: 9 x 7
##   term          estimate   std.error statistic   p.value conf.low conf.high
##   <chr>            <dbl>       <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   2.85e-32 0.247         -294.   0.        1.76e-32  4.62e-32
## 2 date          1.00e+ 0 0.0000133      321.   0.        1.00e+ 0  1.00e+ 0
## 3 DoWMonday     8.30e- 1 0.00345        -54.0  0.        8.25e- 1  8.36e- 1
## 4 DoWSaturday   9.59e- 1 0.00333        -12.7  5.14e- 37 9.52e- 1  9.65e- 1
## 5 DoWSunday     9.29e- 1 0.00335        -22.0  3.08e-107 9.23e- 1  9.35e- 1
## 6 DoWThursday   9.86e- 1 0.00331         -4.40 1.08e-  5 9.79e- 1  9.92e- 1
## 7 DoWTuesday    7.89e- 1 0.00349        -68.0  0.        7.83e- 1  7.94e- 1
## 8 DoWWednesday  9.02e- 1 0.00336        -30.6  3.39e-205 8.97e- 1  9.08e- 1
## 9 daily_cases_s 1.00e+ 0 0.000000354    717.   0.        1.00e+ 0  1.00e+ 0

augment(pmod) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = exp(.fitted)), color = 'red') +
  geom_point(aes(y = daily_cases))

Note: This tutorial is very similar to the tutorial in R for data science by Hadley Wickham and Garret Grolemund in their “Many Models” chapter (Chapter 25). I do not take credit for the ideas presented here, as this tutorial is an adaptation of their tutorial, using similar data. For those looking for a more formal introduction to building many models, and the mechanics behind it, I’ll leave a link to their book chapter here.

Data Days Model Building, Visualization, and Management

Kees Schipper

3/15/2021

Getting our data

Taking a look at our data characteristics

visualizing our data

Basic Modelling

Model Assumptions

Improving our model

Checking model fit

The Broom package

`tidy`

`augment`

Panel Plots

`glance`

Comparing models side-by-side

`tab_model`

`plot_models`

Many Many Many Models with purrr

Nesting in purrr

Mapping

Mapping broom functions

`unnest`

Utilizing Glance

Utilizing `augment`

Stopping point for 2021-03-18

Other Basic Regression Types: glm and glm.nb

Data Days Model Building, Visualization, and Management

Kees Schipper

3/15/2021

Getting our data

Taking a look at our data characteristics

visualizing our data

Basic Modelling

Model Assumptions

Improving our model

Checking model fit

The Broom package

tidy

augment

Panel Plots

glance

Comparing models side-by-side

tab_model

plot_models

Many Many Many Models with purrr

Nesting in purrr

Mapping

Mapping broom functions

unnest

Utilizing Glance

Utilizing augment

Stopping point for 2021-03-18

Other Basic Regression Types: glm and glm.nb

`tidy`

`augment`

`glance`

`tab_model`

`plot_models`

`unnest`

Utilizing `augment`