These homework problem sets are designed to help you understand the material better. You should try doing these problems first and only then look at the model answers. You can use Generative AI to help, e.g., with a prompt like "Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain it". It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing the error. But it is pointless to just solve all the questions with ChatGPT, because you won't be learning anything.
Read the instructions and write your solutions to these questions in the space provided. Then check the model answers (the link is at the end of the notebook).
Let us look at the correlation between motor_theft (vehicle theft per 100,000) and fed_spend (federal spending per capita):
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    use = "pairwise.complete.obs")
## [1] 0.937136
The correlation seems to be very high. Is it statistically significant?
cor.test(state_stats$motor_theft,
         state_stats$fed_spend,
         use = "pairwise.complete.obs")
##
## Pearson's product-moment correlation
##
## data: state_stats$motor_theft and state_stats$fed_spend
## t = 18.214, df = 46, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8899784 0.9644609
## sample estimates:
## cor
## 0.937136
Yes! It is significant. That settles it: high federal spending leads to theft of motor vehicles.
Or does it?
ANSWER
This is actually a real-world example of the sensitivity of the Pearson correlation coefficient to outliers. A simple scatterplot shows that the correlation is driven by just one outlier, the District of Columbia (the capital of the USA):
state_stats %>%
  ggplot(aes(x = fed_spend, y = motor_theft, label = state)) +
  geom_point() + geom_text() + geom_smooth(method = "lm")
Without the District of Columbia, the effect is barely visible:
state_stats %>%
  filter(state != "District of Columbia") %>%
  ggplot(aes(x = fed_spend, y = motor_theft, label = state)) +
  geom_point() + geom_text() + geom_smooth(method = "lm")
Now the \(p\)-value is 0.03 and the correlation is relatively small:
state_stats %>%
  filter(state != "District of Columbia") %>%
  select(motor_theft, fed_spend) %>%
  corr.test(use = "pairwise")  # corr.test() comes from the psych package, see below
## Call:corr.test(x = ., use = "pairwise")
## Correlation matrix
## motor_theft fed_spend
## motor_theft 1.00 0.32
## fed_spend 0.32 1.00
## Sample Size
## motor_theft fed_spend
## motor_theft 47 47
## fed_spend 47 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## motor_theft fed_spend
## motor_theft 0.00 0.03
## fed_spend 0.03 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
In any case, if there is any causality here, it probably goes the other way: high levels of crime call for bigger spending from the federal budget.
Finally, let us check other correlation coefficients that are robust to outliers:
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    method = "kendall",
    use = "pairwise.complete.obs")
## [1] 0.2058563
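Spearman's rank correlation is another outlier-robust option; here is a minimal sketch (not run here):
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    method = "spearman",  # rank-based, hence robust to outliers
    use = "pairwise.complete.obs")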
Visualize Kendall correlation coefficients in state_stats with ggcorr(). By looking at the correlation diagram, identify correlation coefficients that are high in magnitude and that are hard to explain with just common sense.
ANSWER HERE
state_stats %>%
  ggcorr(method = c("pairwise", "kendall"))
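If you also want the coefficient values printed on the diagram, ggcorr() accepts a label argument; a quick sketch:
state_stats %>%
  ggcorr(method = c("pairwise", "kendall"),
         label = TRUE, label_round = 2)  # print rounded coefficients in each cell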
There are large positive correlations between traffic deaths per 100,000 where alcohol was not a factor and some crimes, such as larceny, aggravated assault, and theft of motor vehicles. This makes sense once we notice that these correlations disappear when alcohol is taken into account: most deaths in traffic accidents are probably attributable to alcohol, and the second common factor is dangerous driving associated with criminal activities. There might be a causal relationship here.
There is a large negative correlation between population levels and the quartet of variables from the previous paragraph, i.e., agg_assault, larceny, motor_theft, and tr_deaths_no_alc. That is hard to explain directly and needs special investigation. It is probably driven by small states and territories with high crime rates.
In R, there is a function corr.test() from the psych package. It is often better to just use it, as it gives more information than cor() or cor.test() from base R and formats the output better.
cor_test_result <- state_stats %>%
  select(fed_spend, motor_theft, pop2010) %>%
  corr.test()
cor_test_result
## Call:corr.test(x = .)
## Correlation matrix
## fed_spend motor_theft pop2010
## fed_spend 1.00 0.94 -0.16
## motor_theft 0.94 1.00 -0.29
## pop2010 -0.16 -0.29 1.00
## Sample Size
## fed_spend motor_theft pop2010
## fed_spend 51 48 51
## motor_theft 48 48 48
## pop2010 51 48 51
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## fed_spend motor_theft pop2010
## fed_spend 0.00 0.00 0.28
## motor_theft 0.00 0.00 0.10
## pop2010 0.28 0.05 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Note that now we can extract individual elements from cor_test_result. Here is the matrix of correlation coefficients:
cor_test_result$r
## fed_spend motor_theft pop2010
## fed_spend 1.0000000 0.9371360 -0.1555138
## motor_theft 0.9371360 1.0000000 -0.2865984
## pop2010 -0.1555138 -0.2865984 1.0000000
Matrix of \(p\)-values:
cor_test_result$p %>% round(5)
## fed_spend motor_theft pop2010
## fed_spend 0.00000 0.00000 0.27584
## motor_theft 0.00000 0.00000 0.09656
## pop2010 0.27584 0.04828 0.00000
Note that this matrix is not symmetric. According to the manual, \(p\)-values adjusted for multiple tests are reported above the diagonal and raw \(p\)-values below the diagonal.
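For example, to extract only the raw (unadjusted) \(p\)-values, you can index the matrix with lower.tri(); a minimal sketch:
cor_test_result$p[lower.tri(cor_test_result$p)]  # raw p-values sit below the diagonal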
Confidence intervals:
cor_test_result$ci
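As the printed output itself suggests, the confidence intervals are also shown when the whole object is printed with the short = FALSE option:
print(cor_test_result, short = FALSE)  # full printout, including confidence intervals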
Let us look at the correlations between the following seven variables: murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty:
state_stats %>%
  select(murder, agg_assault, robbery, larceny, motor_theft, smoke, poverty) %>%
  corr.test()
## Call:corr.test(x = .)
## Correlation matrix
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 1.00 0.69 0.92 0.25 0.79 0.17 0.46
## agg_assault 0.69 1.00 0.59 0.79 0.90 0.08 0.11
## robbery 0.92 0.59 1.00 0.11 0.71 0.04 0.27
## larceny 0.25 0.79 0.11 1.00 0.69 -0.04 -0.14
## motor_theft 0.79 0.90 0.71 0.69 1.00 -0.05 0.13
## smoke 0.17 0.08 0.04 -0.04 -0.05 1.00 0.49
## poverty 0.46 0.11 0.27 -0.14 0.13 0.49 1.00
## Sample Size
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 48 48 48 48 48 48 48
## agg_assault 48 48 48 48 48 48 48
## robbery 48 48 48 48 48 48 48
## larceny 48 48 48 48 48 48 48
## motor_theft 48 48 48 48 48 48 48
## smoke 48 48 48 48 48 51 51
## poverty 48 48 48 48 48 51 51
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 0.00 0.00 0.00 0.87 0.00 1 0.01
## agg_assault 0.00 0.00 0.00 0.00 0.00 1 1.00
## robbery 0.00 0.00 0.00 1.00 0.00 1 0.65
## larceny 0.09 0.00 0.46 0.00 0.00 1 1.00
## motor_theft 0.00 0.00 0.00 0.00 0.00 1 1.00
## smoke 0.26 0.59 0.78 0.81 0.75 0 0.00
## poverty 0.00 0.46 0.06 0.35 0.39 0 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
This is informative, but what if we want to combine \(p\)-values and correlation coefficients into just one matrix?
Write a custom R function whose first input is a data frame and whose output is a matrix of pairwise correlation coefficients in which the coefficients that are not significant at a given threshold (default value 0.01), i.e., whose \(p\)-value is above the threshold, are changed to NA.
### ANSWER HERE
custom_corr_matrix <- function(df, significance_level = 0.01) {
  # Pairwise correlation tests on the numeric columns only
  cor_test_result <- df %>%
    select(where(is.numeric)) %>%
    corr.test()
  cor_matrix <- cor_test_result$r
  # Replace correlations that are not significant at the given level with NA
  cor_matrix[cor_test_result$p > significance_level] <- NA
  cor_matrix
}
state_stats %>%
  select(murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty) %>%
  custom_corr_matrix(0.001) %>%
  round(2)
## murder robbery agg_assault larceny motor_theft smoke poverty
## murder 1.00 0.92 0.69 NA 0.79 NA NA
## robbery 0.92 1.00 0.59 NA 0.71 NA NA
## agg_assault 0.69 0.59 1.00 0.79 0.90 NA NA
## larceny NA NA 0.79 1.00 0.69 NA NA
## motor_theft 0.79 0.71 0.90 0.69 1.00 NA NA
## smoke NA NA NA NA NA 1.00 NA
## poverty NA NA NA NA NA 0.49 1
Again, we will look at the correlations between the following six variables: murder, agg_assault, larceny, motor_theft, smoke, poverty. Visualize them with ggpairs(). There is an outlier. Which variables have extreme values? Remove the outlier and plot again.
ANSWER HERE
### ANSWER HERE
state_stats %>%
  # filter(state != "District of Columbia") %>%  # kept commented out: the outlier stays in
  select(murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty) %>%
  ggpairs()
The District of Columbia has extreme values of murder, robbery, agg_assault, and motor_theft.
### ANSWER HERE
state_stats %>%
  filter(state != "District of Columbia") %>%
  select(murder, agg_assault, robbery, larceny, motor_theft, smoke, poverty) %>%
  ggpairs()
The R function for fitting a linear regression model is lm(lhs ~ rhs, data = your_data). It returns a linear model, an object of class lm. The default output is not very informative:
mod_poverty <- lm(poverty ~ med_income, data = state_stats)
mod_poverty
##
## Call:
## lm(formula = poverty ~ med_income, data = state_stats)
##
## Coefficients:
## (Intercept) med_income
## 28.2594103 -0.0002859
Usually, we want something more informative:
mod_poverty %>% stargazer(type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## poverty
## -----------------------------------------------
## med_income -0.0003***
## (0.00003)
##
## Constant 28.259***
## (1.661)
##
## -----------------------------------------------
## Observations 51
## R2 0.624
## Adjusted R2 0.617
## Residual Std. Error 1.872 (df = 49)
## F Statistic 81.405*** (df = 1; 49)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that stargazer() can render your regression table as plain text (type = "text") or as HTML (type = "html"). The HTML version is what you need to knit to HTML, and it requires results = 'asis' in the code chunk options. Check out how this table looks in HTML:
mod_poverty %>% stargazer(type = "html")
|                     | Dependent variable:         |
|---------------------|-----------------------------|
|                     | poverty                     |
| med_income          | -0.0003***                  |
|                     | (0.00003)                   |
| Constant            | 28.259***                   |
|                     | (1.661)                     |
| Observations        | 51                          |
| R2                  | 0.624                       |
| Adjusted R2         | 0.617                       |
| Residual Std. Error | 1.872 (df = 49)             |
| F Statistic         | 81.405*** (df = 1; 49)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |
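For reference, the R Markdown chunk producing the HTML table above would look roughly like this (a sketch of the chunk options):
```{r, results='asis'}
mod_poverty %>% stargazer(type = "html")
```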
Very often we don't need that much information in the regression table. For instance, the residual standard error and the F-statistic don't tell you much unless you are a professional statistician. Here is how we drop them:
mod_poverty %>% stargazer(type = "text", omit.stat = c("ser", "f"))
##
## ========================================
## Dependent variable:
## ---------------------------
## poverty
## ----------------------------------------
## med_income -0.0003***
## (0.00003)
##
## Constant 28.259***
## (1.661)
##
## ----------------------------------------
## Observations 51
## R2 0.624
## Adjusted R2 0.617
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
If we want a simple data frame, we can use the usual tidy() from broom:
mod_poverty %>% tidy()
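Since tidy() returns an ordinary data frame, it can be manipulated with the usual dplyr verbs; a minimal sketch extracting the slope and its \(p\)-value:
mod_poverty %>%
  tidy() %>%
  filter(term == "med_income") %>%  # keep only the slope row
  select(term, estimate, p.value)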
The base R function for an overview of a linear model is summary():
mod_poverty %>% summary()
##
## Call:
## lm(formula = poverty ~ med_income, data = state_stats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0798 -1.3004 0.1117 0.7486 6.9706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.826e+01 1.661e+00 17.011 < 2e-16 ***
## med_income -2.859e-04 3.168e-05 -9.022 5.46e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872 on 49 degrees of freedom
## Multiple R-squared: 0.6242, Adjusted R-squared: 0.6166
## F-statistic: 81.4 on 1 and 49 DF, p-value: 5.459e-12
There are more packages in R for reporting linear models (modelsummary, sjPlot, flextable, and others). We will just focus on stargazer and broom here.
This is not examinable material and you can skip this part safely. If you want to dig deeper, e.g., look at model residuals, test for normality, etc., you can use the augment() function from broom:
mod_poverty %>% augment() %>% head()
mod_poverty %>% augment() %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted values", y = "Residuals")
mod_poverty %>% augment() %>%
  ggplot(aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q", x = "Theoretical Quantiles",
       y = "Standardized Residuals")
mod_poverty %>% augment() %>%
  ggplot(aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Scale-Location",
       x = "Fitted values", y = "√|Standardized Residuals|")
mod_poverty %>% augment() %>%
  ggplot(aes(x = .hat, y = .std.resid)) +
  geom_point(aes(size = .cooksd)) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Leverage",
       x = "Leverage", y = "Standardized Residuals") +
  theme(legend.position = "none")
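For comparison, base R can draw four analogous diagnostic plots directly from the lm object:
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(mod_poverty)     # base R diagnostic plots for a linear model
par(mfrow = c(1, 1))  # restore the default layout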
Here we will just fit a few regression models, reporting the results in a table each time. Try to think about whether the results make sense and how to interpret them. Note that here it is hard to justify that \(Y\) is actually a linear function of \(X\), and we do not claim and do not expect any causality. This is more of an exercise in presenting your findings in a table with the stargazer package.
(a) Fit a regression model explaining murder via poverty and report the result as a table. Is the coefficient statistically significant?
(b) Fit regression models explaining murder via
- poverty,
- poverty and smoke,
- poverty, smoke, and soc_sec,
- poverty, smoke, soc_sec, and fed_spend.
Report all four of them in a single table and check how adjusted \(R^2\) (the fraction of explained variance) changes.
(c) Fit five regression models explaining each of murder, robbery, agg_assault, larceny, and motor_theft via poverty, smoke, soc_sec, and fed_spend. Report all five models in a single regression table and check how much variance is explained by the independent variables in each of the models and which coefficients are statistically significant.
ANSWER
Part (a)
mod_murder_1 <- lm(murder ~ poverty, data = state_stats)
mod_murder_1 %>% stargazer(type = "text", omit.stat = c("ser", "f"))
##
## ========================================
## Dependent variable:
## ---------------------------
## murder
## ----------------------------------------
## poverty 0.757***
## (0.215)
##
## Constant -4.684
## (2.953)
##
## ----------------------------------------
## Observations 48
## R2 0.212
## Adjusted R2 0.194
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that the coefficient is statistically significant, i.e., higher poverty is associated with more murders. However, the \(R^2\) is rather low: poverty alone explains only about 21% of the variance in murder rates.
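If you want an explicit confidence interval for the coefficient, base R's confint() gives it directly (a sketch, not run here):
confint(mod_murder_1)  # 95% confidence intervals for the intercept and the poverty slope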
Part (b)
mod_murder_2 <- lm(murder ~ poverty + smoke, data = state_stats)
mod_murder_3 <- lm(murder ~ poverty + smoke + soc_sec, data = state_stats)
mod_murder_4 <- lm(murder ~ poverty + smoke + soc_sec + fed_spend,
                   data = state_stats)
stargazer(mod_murder_1, mod_murder_2, mod_murder_3, mod_murder_4,
          type = "text", omit.stat = c("ser", "f"))
##
## ==================================================
## Dependent variable:
## -------------------------------------
## murder
## (1) (2) (3) (4)
## --------------------------------------------------
## poverty 0.757*** 0.803*** 0.831*** 0.443***
## (0.215) (0.246) (0.229) (0.102)
##
## smoke -0.101 0.159 0.255**
## (0.251) (0.252) (0.108)
##
## soc_sec -0.729*** -0.368***
## (0.259) (0.114)
##
## fed_spend 0.298***
## (0.021)
##
## Constant -4.684 -3.179 2.369 -3.796*
## (2.953) (4.782) (4.867) (2.137)
##
## --------------------------------------------------
## Observations 48 48 48 48
## R2 0.212 0.214 0.334 0.880
## Adjusted R2 0.194 0.179 0.289 0.869
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that adding smoke to the model made it slightly worse (adjusted \(R^2\) fell from 0.194 to 0.179), i.e., we probably shouldn't bother including smoking in our model, even though its coefficient is significant in the final model. At the same time, it is clear that the main explanatory variable here is fed_spend, which, as we know, is probably driven by the presence of an outlier.
Part (c)
mod_murder_4 <- lm(murder ~ poverty + smoke + soc_sec + fed_spend,
                   data = state_stats)
mod_robbery <- lm(robbery ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_assault <- lm(agg_assault ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_larceny <- lm(larceny ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_motor <- lm(motor_theft ~ poverty + smoke + soc_sec + fed_spend,
                data = state_stats)
stargazer(mod_murder_4, mod_robbery, mod_assault, mod_larceny, mod_motor,
          type = "text", omit.stat = c("ser", "f"))
##
## =================================================================
## Dependent variable:
## ----------------------------------------------------
## murder robbery agg_assault larceny motor_theft
## (1) (2) (3) (4) (5)
## -----------------------------------------------------------------
## poverty 0.443*** 3.901 -0.891 -11.474** -0.842
## (0.102) (3.255) (0.598) (5.074) (0.665)
##
## smoke 0.255** 3.228 1.687** 5.366 0.688
## (0.108) (3.450) (0.634) (5.377) (0.705)
##
## soc_sec -0.368*** -7.568** -1.444** -1.157 -1.240
## (0.114) (3.637) (0.668) (5.669) (0.743)
##
## fed_spend 0.298*** 5.596*** 1.239*** 4.785*** 2.423***
## (0.021) (0.679) (0.125) (1.058) (0.139)
##
## Constant -3.796* 45.799 -4.701 97.625 2.876
## (2.137) (68.073) (12.503) (106.113) (13.909)
##
## -----------------------------------------------------------------
## Observations 48 48 48 48 48
## R2 0.880 0.691 0.739 0.352 0.891
## Adjusted R2 0.869 0.662 0.714 0.292 0.881
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here we see that the high values of explained variance and the statistically significant coefficients on fed_spend have to do with the presence of an outlier.
In any case, if there is a causal relationship, it most likely works as follows: higher crime rates call for bigger police budgets.
Fedor is familiar with stargazer for model reporting, but, according to ChatGPT, modelsummary is now better. We will stick to stargazer for now, but here is a comparison table for future reference:
| Feature | modelsummary | stargazer |
|---|---|---|
| Actively maintained | Yes (very active) | No (last major update ~2018) |
| Output formats | HTML, LaTeX, Markdown, Word | LaTeX, HTML, text only |
| Tidyverse-compatible | Uses broom + tidy() | Custom parsing |
| Model support | Many models (lm, glm, lme4, brms, fixest, etc.) | Mostly lm/glm |
| Customization | Very flexible, neat syntax | Verbose and rigid |
| Visualization support | Plots (plot_models()) | None |
| Side-by-side comparison | Easy (msummary(list(...))) | Supported |
| Documentation | Excellent | Decent but dated |
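For future reference, here is a minimal modelsummary sketch (assuming the package is installed; we do not run it here):
library(modelsummary)
# Side-by-side comparison of the four murder models from part (b)
msummary(list(mod_murder_1, mod_murder_2, mod_murder_3, mod_murder_4))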