group_by
and summarize
to aggregate hierarchical data to a new level of analysis (from counties to “states.”)ggplot
.glance()
results for a linear model.car
package.car
package.The YAML code used in this example is listed below…
title: "States, Counties and Population"
author: "Thomas E. Love"
date: "An Example for 431 Class 13: 2021-10-05"
output:
html_document:
code_folding: show
toc: TRUE
toc_float: TRUE
number_sections: TRUE
df_print: paged
knitr::opts_chunk$set(comment=NA)
library(broom)
library(car) # note: new today
library(glue)
library(janitor)
library(tidyverse)
theme_set(theme_light())
We’re looking at the 2021 County Health Rankings data you’re using for Project A.
v051_rawvalue
describes the county’s population.county_ranked
indicates if the county is ranked by County Health Rankings.data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2021.csv"
chr_2021_raw <- read_csv(data_url, skip = 1)
chr_2021_today <- chr_2021_raw %>%
select(fipscode, state, county,
county_ranked, v051_rawvalue) %>%
filter(county_ranked == 1)
For each of the “states” we are counting:
Here, we use group_by()
and summarize()
to help us change the unit of analysis from the county to the “state” (I use “state” since DC is included.)
table1 <- chr_2021_today %>%
group_by(state) %>%
summarize(ranked_counties = n(),
state_popn = sum(v051_rawvalue))
table1
table1
will be our data set for the remaining work in this Example.Model state population as a function of its number of ranked counties.
ggplot(table1, aes(x = ranked_counties,
y = state_popn, label = state)) +
geom_label(size = 3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = TRUE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Approach A: No Transformation",
subtitle = glue('Pearson correlation of state_popn and ranked_counties is {round_half_up(cor(table1$state_popn, table1$ranked_counties),2)}'),
caption = "Note that, for example, 1e+07 = 10 million residents")
Note the use of scales::comma
here.
ggplot(table1, aes(x = ranked_counties,
y = state_popn, label = state)) +
geom_label(size = 3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = TRUE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Approach A: No Transformation",
subtitle = glue('Pearson correlation of state_popn and ranked_counties is {round_half_up(cor(table1$state_popn, table1$ranked_counties),2)}'))
modelA <- lm(state_popn ~ ranked_counties, data = table1)
tidy(modelA, conf.int = TRUE, conf.level = 0.95)
glance()
function for a linear model?glance(modelA)
\(R^2\) or r.squared
is the proportion of variation in our outcome (here, state_popn
) accounted for by the model using ranked_counties
as a predictor. As you might expect, this is the square of the Pearson correlation coefficient.
Adjusted \(R^2\), or adj.r.squared
is not a proportion of anything, but rather an index that penalizes the \(R^2\) value for the number of predictors in the model. This helps combat against a certain form of overfitting, where the greediness of \(R^2\) (it never decreases even if you add predictors with no association with the outcome) causes model-builders to overstate the quality of fit.
If \(n\) is the number of observations used to fit the model, and \(p\) is the number of predictors (not counting the intercept term) used in the model, then:
\[ R^2_{adj} = 1 - \left[ \frac{(1 - R^2)(n - 1)}{n - p - 1} \right] \]
.sigma
or \(\hat{\sigma}\).statistic
, which helps generate the …p.value
.df
= degrees of freedom from the numerator of the ANOVA F test statistic, which is essentially a count of the non-intercept coefficients estimated by the model. This was \(p\) in our discussion of adjusted \(R^2\).logLik
= the log-likelihood of the model, which will help us estimate …AIC
= the Akaike Information Criterion for the model (lower values indicate better fits)BIC
= the Bayesian Information Criterion for the model (lower values indicate better fits)deviance
= the model deviance, which we’ll skip for now…df.residual
= degrees of freedom from the residual (denominator) of the ANOVA F test. This is \(n - p - 1\) from our adjusted \(R^2\) discussion.nobs
= number of observations used to fit the model. This is \(n\).par(mfrow = c(1,2))
plot(modelA, which = c(1:2))
Which states have the largest residuals? Looks like row numbers 5, 10 and 35
modA_aug <- augment(modelA, data = table1)
slice(modA_aug, c(5, 10, 35))
Which states have more population than we’d expect, based on their number of counties?
Here are the six states with the largest positive residual from Model A.
modA_aug %>% arrange(desc(.resid)) %>% head()
modA_aug %>% arrange(desc(state_popn)) %>% head()
Which states have smaller populations than we’d expect, based on their county count?
Here are the six states with the largest negative residual from Model A.
modA_aug %>% arrange(.resid) %>% head()
pnorm(q = 5.1, mean = 0, sd = 1, lower.tail = FALSE)
[1] 1.698267e-07
Technically, a standardized residual \(r_i\) (symbolized as .std.resid
in augment()
output from the broom
package is calculated as:
\[ r_i = \frac{e_i}{MSE \sqrt(1-h_i)} \] where \(h_i\) is the leverage of the \(i\)th observation.
leverage
is symbolized by .hat
in our augment()
output from the broom
package. It measures the degree to which the observation is an outlier along the predictor.rstandard()
and studentized residuals with rstudent()
applied to your model.outlierTest()
function uses a slightly different approaches to calculate the studentized results.rstandard(modelA)[5]; rstudent(modelA)[5]
5
5.09585
5
7.356453
outlierTest()
function from car
Another approach is to use the outlierTest()
function (available in the car
package) to see if the most unusual value (in absolute value) is a surprise given the sample size. This uses the studentized residual rather than the standardized residual, which is just another way of measuring the outlier’s distance, but now not including that point in the calculation of the overall variation in the residuals.
outlierTest(modelA)
rstudent unadjusted p-value Bonferroni p
5 7.356453 2.0938e-09 1.0678e-07
Our conclusion from the Bonferroni p value here (which is very, very small) is that there’s evidence to support the belief that we could have a serious problem with Normality, in that a studentized residual of 7.36 is out of the range we might reasonably expect to see given a Normal distribution of errors and 51 observations.
At any rate, California looks like a real outlier. Can we do better than our Model A?
\(\lambda\) | Transformation | Easy to Interpret |
---|---|---|
-2 | \(1/y^2\), the inverse square | Not typically |
-1 | \(1/y\), the inverse | OK |
-0.5 | \(1/\sqrt{y}\), the inverse square root | Not typically |
0 | \(log(y)\), the logarithm | OK |
0.5 | \(\sqrt{y}\), the square root | OK |
1 | y, the original (raw) outcome | OK |
2 | \(y^2\), the square | OK |
3 | \(y^3\), the cube | Not typically |
boxCox()
function from the car
packageWe’ll use the boxCox()
function from the car
package today, and also the powerTransform()
function from the same package to help us identify a potential choice of power transformation.
boxCox(modelA)
powerTransform(modelA)
Estimated transformation parameter
Y1
0.02489896
So what transformation is this plot suggesting we try?
Model the logarithm of state population as a function of its number of ranked counties.
ggplot(table1, aes(x = ranked_counties, y = log(state_popn),
label = state)) +
geom_label(size = 3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = TRUE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Approach B",
subtitle = glue('Pearson correlation of log(state_popn) and ranked_counties is {round_half_up(cor(log(table1$state_popn), table1$ranked_counties),2)}'))
modelB <- lm(log(state_popn) ~ ranked_counties,
data = table1)
tidy(modelB, conf.int = TRUE, conf.level = 0.95)
glance(modelB)
par(mfrow = c(1,2))
plot(modelB, which = c(1:2))
Now, which states have the largest residuals? Still looks like row numbers 5, 10 and 35
modB_aug <- augment(modelB, data = table1)
slice(modB_aug, c(5, 10, 35))
The largest standardized residual is still California, but now it is 2.69. Is that a plausible value for the maximum of 51 values if the (standardized) residuals actually follow a Normal distribution?
pnorm(2.69, mean = 0, sd = 1, lower.tail = FALSE)
[1] 0.003572601
Or we could look at the studentized residual instead.
outlierTest(modelB)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
5 2.888511 0.0057913 0.29536
Our conclusion from the Bonferroni p value here (which is larger than 0.05) is that there’s no evidence in the California value to suggest a serious problem with Normality. A maximum studentized residual of 2.89 is out of the range we might reasonably expect to see given a Normal distribution of errors and 51 observations.
This seems more reasonable. Could we do better still by taking the log of the predictor, as well?
Model the logarithm of state population as a function of the logarithm of its number of ranked counties.
ggplot(table1, aes(x = log(ranked_counties),
y = log(state_popn), label = state)) +
geom_label(size = 3) +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = TRUE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Approach C",
subtitle = glue('Pearson correlation of log(state_popn) and log(ranked_counties) is {round_half_up(cor(log(table1$state_popn), log(table1$ranked_counties)),2)}'))
modelC <- lm(log(state_popn) ~ log(ranked_counties),
data = table1)
tidy(modelC, conf.int = TRUE, conf.level = 0.95)
glance(modelC)
par(mfrow = c(1,2))
plot(modelC, which = c(1:2))
Now, which states have the largest residuals in absolute value? Row numbers 5, 29 and 42
modC_aug <- augment(modelC, data = table1)
slice(modC_aug, c(5, 29, 42))
The largest standardized residual is still California, but now it is 2.13. Is that a plausible value for the maximum of 51 values if the (standardized) residuals actually follow a Normal distribution?
pnorm(2.46, mean = 0, sd = 1, lower.tail = FALSE)
[1] 0.006946851
Or we could look at the studentized residual instead.
outlierTest(modelC)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
5 2.599649 0.012363 0.63052
Since Models B and C predict the same outcome (log(state_popn)) we can compare them directly on in-sample performance using glance
.
bind_rows(glance(modelB), glance(modelC)) %>%
mutate(model = c("B", "C"))
Model C looks like the stronger performer on all five summaries we’ve discussed.
Summary | Stronger Model |
---|---|
\(R^2\) | C (0.313 vs. 0.303) |
Adjusted \(R^2\) | C (0.299 vs. 0.288) |
\(\sigma\) | C (0.875 vs. 0.882) |
AIC | C (135.1 vs. 135.8) |
BIC | C (140.9 vs. 141.6) |
Using model C, which states have larger population than we’d expect, based on their number of counties?
Here are the six states with the largest positive residual from Model C.
modC_aug %>% mutate(logpop = log(state_popn)) %>%
select(state, ranked_counties, state_popn, logpop,
.fitted, .resid, everything()) %>%
arrange(desc(.resid)) %>% head()
Using model C, which states have smaller population than we’d expect, based on their number of counties?
Here are the six states with the largest negative residual from Model C.
modC_aug %>% mutate(logpop = log(state_popn)) %>%
select(state, ranked_counties, state_popn, logpop,
.fitted, .resid, everything()) %>%
arrange(.resid) %>% head()
Which states are the smallest by population?
modC_aug %>% arrange(state_popn) %>% head()
Calculate population per county in each state and use a bar chart and then a Cleveland dot plot to visualize the result.
table1 <- table1 %>%
mutate(county_size = state_popn / ranked_counties)
Let’s use geom_col()
to make a bar chart to show our continuous variable (county_size
) broken down by a categorical variable (state
).
ggplot(table1, aes(x = state, y = county_size)) +
geom_col()
How could we improve this plot?
ggplot(table1, aes(x = reorder(state, desc(county_size)),
y = county_size)) +
geom_col(fill = "dodgerblue",
col = "black") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x = element_text(size = 5)) +
labs(y = "Population per Ranked County",
x = "U.S. State or District")
Still tough to work with, I think.
Cleveland dot plots are an excellent alternative to a simple bar chart, especially when you are comparing more than a few items.
ggplot(table1, aes(x = county_size,
y = reorder(state, county_size))) +
geom_point() +
theme_bw() +
scale_x_continuous(labels = scales::comma,
sec.axis = dup_axis()) +
labs(y = "State", x = "Population per County")
table1 %>% arrange(county_size)
We’ll leave it to you to craft a reasonable set of conclusions on the issues of interest, in light of the output provided here.
group_by
and summarize
to aggregate hierarchical data to a new level of analysis (from counties to “states.”)ggplot
.glance()
results for a linear model.car
package.car
package.sessioninfo::session_info()
- Session info ---------------------------------------------------------------
setting value
version R version 4.1.1 (2021-08-10)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/New_York
date 2021-10-05
- Packages -------------------------------------------------------------------
package * version date lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.1.0)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
broom * 0.7.9 2021-07-27 [1] CRAN (R 4.1.0)
bslib 0.3.0 2021-09-02 [1] CRAN (R 4.1.1)
car * 3.0-11 2021-06-27 [1] CRAN (R 4.1.0)
carData * 3.0-4 2020-05-22 [1] CRAN (R 4.1.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
data.table 1.14.0 2021-02-21 [1] CRAN (R 4.1.0)
DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
foreign 0.8-81 2020-12-22 [1] CRAN (R 4.1.0)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
glue * 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
haven 2.4.3 2021-08-04 [1] CRAN (R 4.1.0)
highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
janitor * 2.1.0 2021-01-05 [1] CRAN (R 4.1.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
knitr 1.34 2021-09-09 [1] CRAN (R 4.1.1)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
Matrix 1.3-4 2021-06-01 [2] CRAN (R 4.1.1)
mgcv 1.8-36 2021-06-01 [2] CRAN (R 4.1.1)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
nlme 3.1-152 2021-02-04 [2] CRAN (R 4.1.1)
openxlsx 4.2.4 2021-06-16 [1] CRAN (R 4.1.0)
pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.1)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
rio 0.5.27 2021-06-21 [1] CRAN (R 4.1.0)
rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.1.0)
stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.1)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
tibble * 3.1.4 2021-08-25 [1] CRAN (R 4.1.1)
tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vroom 1.5.5 2021-09-14 [1] CRAN (R 4.1.1)
withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
xfun 0.25 2021-08-06 [1] CRAN (R 4.1.0)
xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
zip 2.2.0 2021-05-31 [1] CRAN (R 4.1.0)
[1] C:/Users/Thomas/Documents/R/win-library/4.1
[2] C:/Program Files/R/R-4.1.1/library