group_by and summarize to aggregate hierarchical data to a new level of analysis (from counties to “states.”)ggplot.glance() results for a linear model.car package.car package.The YAML code used in this example is listed below…
title: "States, Counties and Population"
author: "Thomas E. Love"
date: "An Example for 431 Class 13: 2021-10-05"
output: 
    html_document:
        code_folding: show
        toc: TRUE
        toc_float: TRUE
        number_sections: TRUE
        df_print: pagedknitr::opts_chunk$set(comment=NA)
library(broom)
library(car) # note: new today
library(glue)
library(janitor)
library(tidyverse)
theme_set(theme_light())We’re looking at the 2021 County Health Rankings data you’re using for Project A.
v051_rawvalue describes the county’s population.county_ranked indicates if the county is ranked by County Health Rankings.data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv"
chr_2021_raw <- read_csv(data_url, skip = 1)
chr_2021_today <- chr_2021_raw %>% 
    select(fipscode, state, county, 
           county_ranked, v051_rawvalue) %>%
    filter(county_ranked == 1)For each of the “states” we are counting:
Here, we use group_by() and summarize() to help us change the unit of analysis from the county to the “state” (I use “state” since DC is included.)
table1 <- chr_2021_today %>% 
    group_by(state) %>%
    summarize(ranked_counties = n(), 
              state_popn = sum(v051_rawvalue))
table1table1 will be our data set for the remaining work in this Example.Model state population as a function of its number of ranked counties.
ggplot(table1, aes(x = ranked_counties, 
                   y = state_popn, label = state)) +
    geom_label(size = 3) +
    geom_smooth(method = "lm", col = "red",
                formula = y ~ x, se = TRUE) +
    geom_smooth(method = "loess", col = "blue",
                formula = y ~ x, se = FALSE) + 
    labs(title = "Approach A: No Transformation",
         subtitle = glue('Pearson correlation of state_popn and ranked_counties is {round_half_up(cor(table1$state_popn, table1$ranked_counties),2)}'),
         caption = "Note that, for example, 1e+07 = 10 million residents")Note the use of scales::comma here.
ggplot(table1, aes(x = ranked_counties, 
                   y = state_popn, label = state)) +
    geom_label(size = 3) +
    geom_smooth(method = "lm", col = "red",
                formula = y ~ x, se = TRUE) +
    geom_smooth(method = "loess", col = "blue",
                formula = y ~ x, se = FALSE) + 
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Approach A: No Transformation",
         subtitle = glue('Pearson correlation of state_popn and ranked_counties is {round_half_up(cor(table1$state_popn, table1$ranked_counties),2)}'))modelA <- lm(state_popn ~ ranked_counties, data = table1)
tidy(modelA, conf.int = TRUE, conf.level = 0.95)glance() function for a linear model?glance(modelA)\(R^2\) or r.squared is the proportion of variation in our outcome (here, state_popn) accounted for by the model using ranked_counties as a predictor. As you might expect, this is the square of the Pearson correlation coefficient.
Adjusted \(R^2\), or adj.r.squared is not a proportion of anything, but rather an index that penalizes the \(R^2\) value for the number of predictors in the model. This helps combat against a certain form of overfitting, where the greediness of \(R^2\) (it never decreases even if you add predictors with no association with the outcome) causes model-builders to overstate the quality of fit.
If \(n\) is the number of observations used to fit the model, and \(p\) is the number of predictors (not counting the intercept term) used in the model, then:
\[ R^2_{adj} = 1 - \left[ \frac{(1 - R^2)(n - 1)}{n - p - 1} \right] \]
.sigma or \(\hat{\sigma}\).statistic, which helps generate the …p.value.df = degrees of freedom from the numerator of the ANOVA F test statistic, which is essentially a count of the non-intercept coefficients estimated by the model. This was \(p\) in our discussion of adjusted \(R^2\).logLik = the log-likelihood of the model, which will help us estimate …AIC = the Akaike Information Criterion for the model (lower values indicate better fits)BIC = the Bayesian Information Criterion for the model (lower values indicate better fits)deviance = the model deviance, which we’ll skip for now…df.residual = degrees of freedom from the residual (denominator) of the ANOVA F test. This is \(n - p - 1\) from our adjusted \(R^2\) discussion.nobs = number of observations used to fit the model. This is \(n\).par(mfrow = c(1,2))
plot(modelA, which = c(1:2))Which states have the largest residuals? Looks like row numbers 5, 10 and 35
modA_aug <- augment(modelA, data = table1)
slice(modA_aug, c(5, 10, 35))Which states have more population than we’d expect, based on their number of counties?
Here are the six states with the largest positive residual from Model A.
modA_aug %>% arrange(desc(.resid)) %>% head()modA_aug %>% arrange(desc(state_popn)) %>% head()Which states have smaller populations than we’d expect, based on their county count?
Here are the six states with the largest negative residual from Model A.
modA_aug %>% arrange(.resid) %>% head()pnorm(q = 5.1, mean = 0, sd = 1, lower.tail = FALSE)[1] 1.698267e-07Technically, a standardized residual \(r_i\) (symbolized as .std.resid in augment() output from the broom package is calculated as:
\[ r_i = \frac{e_i}{MSE \sqrt(1-h_i)} \] where \(h_i\) is the leverage of the \(i\)th observation.
leverage is symbolized by .hat in our augment() output from the broom package. It measures the degree to which the observation is an outlier along the predictor.rstandard() and studentized residuals with rstudent() applied to your model.outlierTest() function uses a slightly different approaches to calculate the studentized results.rstandard(modelA)[5]; rstudent(modelA)[5]      5 
5.09585        5 
7.356453 outlierTest() function from carAnother approach is to use the outlierTest() function (available in the car package) to see if the most unusual value (in absolute value) is a surprise given the sample size. This uses the studentized residual rather than the standardized residual, which is just another way of measuring the outlier’s distance, but now not including that point in the calculation of the overall variation in the residuals.
outlierTest(modelA)  rstudent unadjusted p-value Bonferroni p
5 7.356453         2.0938e-09   1.0678e-07Our conclusion from the Bonferroni p value here (which is very, very small) is that there’s evidence to support the belief that we could have a serious problem with Normality, in that a studentized residual of 7.36 is out of the range we might reasonably expect to see given a Normal distribution of errors and 51 observations.
At any rate, California looks like a real outlier. Can we do better than our Model A?
| \(\lambda\) | Transformation | Easy to Interpret | 
|---|---|---|
| -2 | \(1/y^2\), the inverse square | Not typically | 
| -1 | \(1/y\), the inverse | OK | 
| -0.5 | \(1/\sqrt{y}\), the inverse square root | Not typically | 
| 0 | \(log(y)\), the logarithm | OK | 
| 0.5 | \(\sqrt{y}\), the square root | OK | 
| 1 | y, the original (raw) outcome | OK | 
| 2 | \(y^2\), the square | OK | 
| 3 | \(y^3\), the cube | Not typically | 
boxCox() function from the car packageWe’ll use the boxCox() function from the car package today, and also the powerTransform() function from the same package to help us identify a potential choice of power transformation.
boxCox(modelA)powerTransform(modelA)Estimated transformation parameter 
        Y1 
0.02489896 So what transformation is this plot suggesting we try?
Model the logarithm of state population as a function of its number of ranked counties.
ggplot(table1, aes(x = ranked_counties, y = log(state_popn), 
                   label = state)) +
    geom_label(size = 3) +
    geom_smooth(method = "lm", col = "red",
                formula = y ~ x, se = TRUE) +
    geom_smooth(method = "loess", col = "blue",
                formula = y ~ x, se = FALSE) + 
    labs(title = "Approach B",
         subtitle = glue('Pearson correlation of log(state_popn) and ranked_counties is {round_half_up(cor(log(table1$state_popn), table1$ranked_counties),2)}'))modelB <- lm(log(state_popn) ~ ranked_counties, 
             data = table1)
tidy(modelB, conf.int = TRUE, conf.level = 0.95)glance(modelB)par(mfrow = c(1,2))
plot(modelB, which = c(1:2))Now, which states have the largest residuals? Still looks like row numbers 5, 10 and 35
modB_aug <- augment(modelB, data = table1)
slice(modB_aug, c(5, 10, 35))The largest standardized residual is still California, but now it is 2.69. Is that a plausible value for the maximum of 51 values if the (standardized) residuals actually follow a Normal distribution?
pnorm(2.69, mean = 0, sd = 1, lower.tail = FALSE)[1] 0.003572601Or we could look at the studentized residual instead.
outlierTest(modelB)No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
  rstudent unadjusted p-value Bonferroni p
5 2.888511          0.0057913      0.29536Our conclusion from the Bonferroni p value here (which is larger than 0.05) is that there’s no evidence in the California value to suggest a serious problem with Normality. A maximum studentized residual of 2.89 is out of the range we might reasonably expect to see given a Normal distribution of errors and 51 observations.
This seems more reasonable. Could we do better still by taking the log of the predictor, as well?
Model the logarithm of state population as a function of the logarithm of its number of ranked counties.
ggplot(table1, aes(x = log(ranked_counties), 
                   y = log(state_popn), label = state)) +
    geom_label(size = 3) +
    geom_smooth(method = "lm", col = "red",
                formula = y ~ x, se = TRUE) +
    geom_smooth(method = "loess", col = "blue",
                formula = y ~ x, se = FALSE) + 
    labs(title = "Approach C",
         subtitle = glue('Pearson correlation of log(state_popn) and log(ranked_counties) is {round_half_up(cor(log(table1$state_popn), log(table1$ranked_counties)),2)}'))modelC <- lm(log(state_popn) ~ log(ranked_counties), 
             data = table1)
tidy(modelC, conf.int = TRUE, conf.level = 0.95)glance(modelC)par(mfrow = c(1,2))
plot(modelC, which = c(1:2))Now, which states have the largest residuals in absolute value? Row numbers 5, 29 and 42
modC_aug <- augment(modelC, data = table1)
slice(modC_aug, c(5, 29, 42))The largest standardized residual is still California, but now it is 2.13. Is that a plausible value for the maximum of 51 values if the (standardized) residuals actually follow a Normal distribution?
pnorm(2.46, mean = 0, sd = 1, lower.tail = FALSE)[1] 0.006946851Or we could look at the studentized residual instead.
outlierTest(modelC)No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
  rstudent unadjusted p-value Bonferroni p
5 2.599649           0.012363      0.63052Since Models B and C predict the same outcome (log(state_popn)) we can compare them directly on in-sample performance using glance.
bind_rows(glance(modelB), glance(modelC)) %>%
    mutate(model = c("B", "C"))Model C looks like the stronger performer on all five summaries we’ve discussed.
| Summary | Stronger Model | 
|---|---|
| \(R^2\) | C (0.313 vs. 0.303) | 
| Adjusted \(R^2\) | C (0.299 vs. 0.288) | 
| \(\sigma\) | C (0.875 vs. 0.882) | 
| AIC | C (135.1 vs. 135.8) | 
| BIC | C (140.9 vs. 141.6) | 
Using model C, which states have larger population than we’d expect, based on their number of counties?
Here are the six states with the largest positive residual from Model C.
modC_aug %>% mutate(logpop = log(state_popn)) %>%
    select(state, ranked_counties, state_popn, logpop, 
           .fitted, .resid, everything()) %>%
    arrange(desc(.resid)) %>% head()Using model C, which states have smaller population than we’d expect, based on their number of counties?
Here are the six states with the largest negative residual from Model C.
modC_aug %>% mutate(logpop = log(state_popn)) %>%
    select(state, ranked_counties, state_popn, logpop, 
           .fitted, .resid, everything()) %>%
    arrange(.resid) %>% head()Which states are the smallest by population?
modC_aug %>% arrange(state_popn) %>% head()Calculate population per county in each state and use a bar chart and then a Cleveland dot plot to visualize the result.
table1 <- table1 %>%
    mutate(county_size = state_popn / ranked_counties)Let’s use geom_col() to make a bar chart to show our continuous variable (county_size) broken down by a categorical variable (state).
ggplot(table1, aes(x = state, y = county_size)) +
    geom_col() How could we improve this plot?
ggplot(table1, aes(x = reorder(state, desc(county_size)), 
                   y = county_size)) +
    geom_col(fill = "dodgerblue", 
             col = "black") +
    scale_y_continuous(labels = scales::comma) +
    theme(axis.text.x = element_text(size = 5)) +
    labs(y = "Population per Ranked County",
         x = "U.S. State or District")Still tough to work with, I think.
Cleveland dot plots are an excellent alternative to a simple bar chart, especially when you are comparing more than a few items.
ggplot(table1, aes(x = county_size, 
                   y = reorder(state, county_size))) +
    geom_point() +
    theme_bw() +
    scale_x_continuous(labels = scales::comma, 
                       sec.axis = dup_axis()) +
    labs(y = "State", x = "Population per County")table1 %>% arrange(county_size)We’ll leave it to you to craft a reasonable set of conclusions on the issues of interest, in light of the output provided here.
group_by and summarize to aggregate hierarchical data to a new level of analysis (from counties to “states.”)ggplot.glance() results for a linear model.car package.car package.sessioninfo::session_info()- Session info ---------------------------------------------------------------
 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2021-10-05                  
- Packages -------------------------------------------------------------------
 package     * version date       lib source        
 abind         1.4-5   2016-07-21 [1] CRAN (R 4.1.0)
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
 broom       * 0.7.9   2021-07-27 [1] CRAN (R 4.1.0)
 bslib         0.3.0   2021-09-02 [1] CRAN (R 4.1.1)
 car         * 3.0-11  2021-06-27 [1] CRAN (R 4.1.0)
 carData     * 3.0-4   2020-05-22 [1] CRAN (R 4.1.0)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
 cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
 colorspace    2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
 curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
 data.table    1.14.0  2021-02-21 [1] CRAN (R 4.1.0)
 DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
 dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
 dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
 fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
 farver        2.1.0   2021-02-28 [1] CRAN (R 4.1.0)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
 forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
 foreign       0.8-81  2020-12-22 [1] CRAN (R 4.1.0)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
 generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
 ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
 glue        * 1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
 haven         2.4.3   2021-08-04 [1] CRAN (R 4.1.0)
 highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
 hms           1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
 htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
 httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
 janitor     * 2.1.0   2021-01-05 [1] CRAN (R 4.1.0)
 jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
 knitr         1.34    2021-09-09 [1] CRAN (R 4.1.1)
 labeling      0.4.2   2020-10-20 [1] CRAN (R 4.1.0)
 lattice       0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
 lubridate     1.7.10  2021-02-26 [1] CRAN (R 4.1.0)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
 Matrix        1.3-4   2021-06-01 [2] CRAN (R 4.1.1)
 mgcv          1.8-36  2021-06-01 [2] CRAN (R 4.1.1)
 modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
 nlme          3.1-152 2021-02-04 [2] CRAN (R 4.1.1)
 openxlsx      4.2.4   2021-06-16 [1] CRAN (R 4.1.0)
 pillar        1.6.2   2021-07-29 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
 Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
 readr       * 2.0.1   2021-08-10 [1] CRAN (R 4.1.1)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 4.1.0)
 reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
 rio           0.5.27  2021-06-21 [1] CRAN (R 4.1.0)
 rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
 rvest         1.0.1   2021-07-26 [1] CRAN (R 4.1.0)
 sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
 scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
 snakecase     0.11.0  2019-05-25 [1] CRAN (R 4.1.0)
 stringi       1.7.4   2021-08-25 [1] CRAN (R 4.1.1)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
 tibble      * 3.1.4   2021-08-25 [1] CRAN (R 4.1.1)
 tidyr       * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
 tzdb          0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 vroom         1.5.5   2021-09-14 [1] CRAN (R 4.1.1)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xfun          0.25    2021-08-06 [1] CRAN (R 4.1.0)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
 zip           2.2.0   2021-05-31 [1] CRAN (R 4.1.0)
[1] C:/Users/Thomas/Documents/R/win-library/4.1
[2] C:/Program Files/R/R-4.1.1/library