These homework problem sets are designed to help you understand the material better. You should try doing these problems first and only then look at the model answers. You can use Generative AI to help, e.g., with a prompt like "Which tidyverse function do I use to drop certain columns from a data frame? Give me an example and explain it". It is also a good idea to feed an error message together with your code to Generative AI and ask it to help with fixing the error. But it is pointless to just solve all the questions with ChatGPT, because you won't be learning anything.
Read the instructions and write your solutions to these questions in the space provided. Then check the model answers (the link is at the end of the notebook).
Let us look at the correlation between motor_theft (vehicle theft per 100,000) and fed_spend (federal spending per capita):
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    use = "pairwise.complete.obs")
## [1] 0.937136
The correlation seems to be very high. Is it statistically significant?
cor.test(state_stats$motor_theft,
         state_stats$fed_spend,
         use = "pairwise.complete.obs")
##
## Pearson's product-moment correlation
##
## data: state_stats$motor_theft and state_stats$fed_spend
## t = 18.214, df = 46, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8899784 0.9644609
## sample estimates:
## cor
## 0.937136
Yes! It is significant. That settles it: high federal spending leads to theft of motor vehicles.
Or does it?
ANSWER
This is actually a real-world example of the sensitivity of the Pearson correlation coefficient to outliers. A simple scatterplot shows that the correlation is driven by just one outlier, the District of Columbia (the capital of the USA):
state_stats %>%
  ggplot(aes(x = fed_spend, y = motor_theft, label = state)) +
  geom_point() + geom_text() + geom_smooth(method = "lm")
Without the District of Columbia, the effect is barely visible:
state_stats %>%
  filter(state != "District of Columbia") %>%
  ggplot(aes(x = fed_spend, y = motor_theft, label = state)) +
  geom_point() + geom_text() + geom_smooth(method = "lm")
Now the \(p\)-value is 0.03 and the correlation is relatively small:
state_stats %>%
  filter(state != "District of Columbia") %>%
  select(motor_theft, fed_spend) %>%
  corr.test(use = "pairwise")  # corr.test() comes from the psych package, see below
## Call:corr.test(x = ., use = "pairwise")
## Correlation matrix
## motor_theft fed_spend
## motor_theft 1.00 0.32
## fed_spend 0.32 1.00
## Sample Size
## motor_theft fed_spend
## motor_theft 47 47
## fed_spend 47 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## motor_theft fed_spend
## motor_theft 0.00 0.03
## fed_spend 0.03 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
In any case, if there is any causality here, it probably goes the other way: high levels of crime call for bigger spending from the federal budget.
Finally, let us check other correlation coefficients that are robust to outliers:
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    method = "kendall",
    use = "pairwise.complete.obs")
## [1] 0.2058563
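Spearman's rank correlation is another outlier-robust option; here is a minimal sketch (not run here):
cor(state_stats$motor_theft,
    state_stats$fed_spend,
    method = "spearman",  # rank-based, hence robust to outliers
    use = "pairwise.complete.obs")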
Visualize Kendall correlation coefficients in state_stats with ggcorr(). By looking at the correlation diagram, identify correlation coefficients that are high in magnitude and that are hard to explain with just common sense.
ANSWER HERE
state_stats %>%
  ggcorr(method = c("pairwise", "kendall"))
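If you also want the coefficient values printed on the diagram, ggcorr() accepts a label argument; a quick sketch:
state_stats %>%
  ggcorr(method = c("pairwise", "kendall"),
         label = TRUE, label_round = 2)  # print rounded coefficients in each cell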
There are large positive correlations between traffic deaths per 100,000 where alcohol was not a factor and some crimes, such as larceny, aggravated assault, and theft of motor vehicles. This makes sense once we notice that these correlations disappear when alcohol is taken into account: most deaths in traffic accidents are probably attributable to alcohol, and the second common factor is dangerous driving associated with criminal activities. There might be a causal relationship here.
There is a large negative correlation between population levels and the quartet of variables from the previous paragraph, i.e., agg_assault, larceny, motor_theft, and tr_deaths_no_alc. That is hard to explain directly and needs special investigation. It is probably driven by small states and territories with high crime rates.
In R, there is a function corr.test() from the psych package. It is often better to just use it, as it gives more information than cor() or cor.test() from base R and formats the output better.
cor_test_result <- state_stats %>%
  select(fed_spend, motor_theft, pop2010) %>%
  corr.test()
cor_test_result
## Call:corr.test(x = .)
## Correlation matrix
## fed_spend motor_theft pop2010
## fed_spend 1.00 0.94 -0.16
## motor_theft 0.94 1.00 -0.29
## pop2010 -0.16 -0.29 1.00
## Sample Size
## fed_spend motor_theft pop2010
## fed_spend 51 48 51
## motor_theft 48 48 48
## pop2010 51 48 51
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## fed_spend motor_theft pop2010
## fed_spend 0.00 0.00 0.28
## motor_theft 0.00 0.00 0.10
## pop2010 0.28 0.05 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Note that now we can extract individual elements from cor_test_result. Here is the matrix of correlation coefficients:
cor_test_result$r
## fed_spend motor_theft pop2010
## fed_spend 1.0000000 0.9371360 -0.1555138
## motor_theft 0.9371360 1.0000000 -0.2865984
## pop2010 -0.1555138 -0.2865984 1.0000000
Matrix of \(p\)-values:
cor_test_result$p %>% round(5)
## fed_spend motor_theft pop2010
## fed_spend 0.00000 0.00000 0.27584
## motor_theft 0.00000 0.00000 0.09656
## pop2010 0.27584 0.04828 0.00000
Note that this matrix is not symmetric. According to the manual, \(p\)-values adjusted for multiple tests are reported above the diagonal and raw \(p\)-values below the diagonal.
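For example, to extract only the raw (unadjusted) \(p\)-values, you can index the matrix with lower.tri(); a minimal sketch:
cor_test_result$p[lower.tri(cor_test_result$p)]  # raw p-values sit below the diagonal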
Confidence intervals:
cor_test_result$ci
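As the printed output itself suggests, the confidence intervals are also shown when the whole object is printed with the short = FALSE option:
print(cor_test_result, short = FALSE)  # full printout, including confidence intervals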
Let us look at the correlations between the following seven variables: murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty:
state_stats %>%
  select(murder, agg_assault, robbery, larceny, motor_theft, smoke, poverty) %>%
  corr.test()
## Call:corr.test(x = .)
## Correlation matrix
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 1.00 0.69 0.92 0.25 0.79 0.17 0.46
## agg_assault 0.69 1.00 0.59 0.79 0.90 0.08 0.11
## robbery 0.92 0.59 1.00 0.11 0.71 0.04 0.27
## larceny 0.25 0.79 0.11 1.00 0.69 -0.04 -0.14
## motor_theft 0.79 0.90 0.71 0.69 1.00 -0.05 0.13
## smoke 0.17 0.08 0.04 -0.04 -0.05 1.00 0.49
## poverty 0.46 0.11 0.27 -0.14 0.13 0.49 1.00
## Sample Size
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 48 48 48 48 48 48 48
## agg_assault 48 48 48 48 48 48 48
## robbery 48 48 48 48 48 48 48
## larceny 48 48 48 48 48 48 48
## motor_theft 48 48 48 48 48 48 48
## smoke 48 48 48 48 48 51 51
## poverty 48 48 48 48 48 51 51
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## murder agg_assault robbery larceny motor_theft smoke poverty
## murder 0.00 0.00 0.00 0.87 0.00 1 0.01
## agg_assault 0.00 0.00 0.00 0.00 0.00 1 1.00
## robbery 0.00 0.00 0.00 1.00 0.00 1 0.65
## larceny 0.09 0.00 0.46 0.00 0.00 1 1.00
## motor_theft 0.00 0.00 0.00 0.00 0.00 1 1.00
## smoke 0.26 0.59 0.78 0.81 0.75 0 0.00
## poverty 0.00 0.46 0.06 0.35 0.39 0 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
This is informative, but what if we want to combine \(p\)-values and correlation coefficients into just one matrix?
Write a custom R function whose first input is a data frame and whose output is a matrix of pairwise correlation coefficients in which the coefficients that are not significant at a given threshold (default value 0.01), i.e., whose \(p\)-value is above the threshold, are changed to NA.
### ANSWER HERE
custom_corr_matrix <- function(df, significance_level = 0.01) {
  # Pairwise correlation tests on the numeric columns only
  cor_test_result <- df %>%
    select(where(is.numeric)) %>%
    corr.test()
  cor_matrix <- cor_test_result$r
  # Replace correlations that are not significant at the given level with NA
  cor_matrix[cor_test_result$p > significance_level] <- NA
  cor_matrix
}
state_stats %>%
  select(murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty) %>%
  custom_corr_matrix(0.001) %>%
  round(2)
## murder robbery agg_assault larceny motor_theft smoke poverty
## murder 1.00 0.92 0.69 NA 0.79 NA NA
## robbery 0.92 1.00 0.59 NA 0.71 NA NA
## agg_assault 0.69 0.59 1.00 0.79 0.90 NA NA
## larceny NA NA 0.79 1.00 0.69 NA NA
## motor_theft 0.79 0.71 0.90 0.69 1.00 NA NA
## smoke NA NA NA NA NA 1.00 NA
## poverty NA NA NA NA NA 0.49 1
Again, we will look at the correlations between the following six variables: murder, agg_assault, larceny, motor_theft, smoke, poverty. Visualize them with ggpairs(). There is an outlier. Which variables have extreme values? Remove the outlier and plot again.
ANSWER HERE
### ANSWER HERE
state_stats %>%
  # filter(state != "District of Columbia") %>%  # kept commented out: the outlier stays in
  select(murder, robbery, agg_assault, larceny, motor_theft, smoke, poverty) %>%
  ggpairs()
The District of Columbia has extreme values of murder, robbery, agg_assault, and motor_theft.
### ANSWER HERE
state_stats %>%
  filter(state != "District of Columbia") %>%
  select(murder, agg_assault, robbery, larceny, motor_theft, smoke, poverty) %>%
  ggpairs()
The R function for fitting a linear regression model is lm(lhs ~ rhs, data = your_data). It returns a linear model, an object of class lm. The default output is not very informative:
mod_poverty <- lm(poverty ~ med_income, data = state_stats)
mod_poverty
##
## Call:
## lm(formula = poverty ~ med_income, data = state_stats)
##
## Coefficients:
## (Intercept) med_income
## 28.2594103 -0.0002859
Usually, we want something more informative:
mod_poverty %>% stargazer(type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## poverty
## -----------------------------------------------
## med_income -0.0003***
## (0.00003)
##
## Constant 28.259***
## (1.661)
##
## -----------------------------------------------
## Observations 51
## R2 0.624
## Adjusted R2 0.617
## Residual Std. Error 1.872 (df = 49)
## F Statistic 81.405*** (df = 1; 49)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that stargazer() can render your regression table as plain text (type = "text") or as HTML (type = "html"). The HTML version is what you need to knit to HTML, and it requires results = 'asis' in the code chunk options. Check out how this table looks in HTML:
mod_poverty %>% stargazer(type = "html")
|                     | Dependent variable:         |
|---------------------|-----------------------------|
|                     | poverty                     |
| med_income          | -0.0003***                  |
|                     | (0.00003)                   |
| Constant            | 28.259***                   |
|                     | (1.661)                     |
| Observations        | 51                          |
| R2                  | 0.624                       |
| Adjusted R2         | 0.617                       |
| Residual Std. Error | 1.872 (df = 49)             |
| F Statistic         | 81.405*** (df = 1; 49)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |
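For reference, the R Markdown chunk producing the HTML table above would look roughly like this (a sketch of the chunk options):
```{r, results='asis'}
mod_poverty %>% stargazer(type = "html")
```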
Very often we don't need that much information in the regression table. For instance, the residual standard error and the F-statistic don't tell you much unless you are a professional statistician. Here is how we drop them:
mod_poverty %>% stargazer(type = "text", omit.stat = c("ser", "f"))
##
## ========================================
## Dependent variable:
## ---------------------------
## poverty
## ----------------------------------------
## med_income -0.0003***
## (0.00003)
##
## Constant 28.259***
## (1.661)
##
## ----------------------------------------
## Observations 51
## R2 0.624
## Adjusted R2 0.617
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
If we want a simple data frame, we can use the usual tidy() from broom:
mod_poverty %>% tidy()
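Since tidy() returns an ordinary data frame, it can be manipulated with the usual dplyr verbs; a minimal sketch extracting the slope and its \(p\)-value:
mod_poverty %>%
  tidy() %>%
  filter(term == "med_income") %>%  # keep only the slope row
  select(term, estimate, p.value)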
The base R function for an overview of a linear model is summary():
mod_poverty %>% summary()
##
## Call:
## lm(formula = poverty ~ med_income, data = state_stats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0798 -1.3004 0.1117 0.7486 6.9706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.826e+01 1.661e+00 17.011 < 2e-16 ***
## med_income -2.859e-04 3.168e-05 -9.022 5.46e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872 on 49 degrees of freedom
## Multiple R-squared: 0.6242, Adjusted R-squared: 0.6166
## F-statistic: 81.4 on 1 and 49 DF, p-value: 5.459e-12
There are more packages in R for reporting linear models (modelsummary, sjPlot, flextable, and others). We will just focus on stargazer and broom here.
This is not examinable material and you can skip this part safely. If you want to dig deeper, e.g., look at model residuals, test for normality, etc., you can use the augment() function from broom:
mod_poverty %>% augment() %>% head()
mod_poverty %>% augment() %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted values", y = "Residuals")
mod_poverty %>% augment() %>%
  ggplot(aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q", x = "Theoretical Quantiles",
       y = "Standardized Residuals")
mod_poverty %>% augment() %>%
  ggplot(aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Scale-Location",
       x = "Fitted values", y = "√|Standardized Residuals|")
mod_poverty %>% augment() %>%
  ggplot(aes(x = .hat, y = .std.resid)) +
  geom_point(aes(size = .cooksd)) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Leverage",
       x = "Leverage", y = "Standardized Residuals") +
  theme(legend.position = "none")
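For comparison, base R can draw four analogous diagnostic plots directly from the lm object:
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(mod_poverty)     # base R diagnostic plots for a linear model
par(mfrow = c(1, 1))  # restore the default layout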
Here we will just fit a few regression models, reporting the results in a table each time. Try to think about whether the results make sense and how to interpret them. Note that here it is hard to justify that \(Y\) is actually a linear function of \(X\), and we do not claim and do not expect any causality. This is more of an exercise in presenting your findings in a table with the stargazer package.
(a) Fit a regression model explaining murder via poverty and report the result as a table. Is the coefficient statistically significant?
(b) Fit regression models explaining murder via
- poverty,
- poverty and smoke,
- poverty, smoke, and soc_sec,
- poverty, smoke, soc_sec, and fed_spend.
Report all four of them in a single table and check how adjusted \(R^2\) (the fraction of explained variance) changes.
(c) Fit five regression models explaining each of murder, robbery, agg_assault, larceny, and motor_theft via poverty, smoke, soc_sec, and fed_spend. Report all five models in a single regression table and check how much variance is explained by the independent variables in each of the models and which coefficients are statistically significant.
ANSWER
Part (a)
mod_murder_1 <- lm(murder ~ poverty, data = state_stats)
mod_murder_1 %>% stargazer(type = "text", omit.stat = c("ser", "f"))
##
## ========================================
## Dependent variable:
## ---------------------------
## murder
## ----------------------------------------
## poverty 0.757***
## (0.215)
##
## Constant -4.684
## (2.953)
##
## ----------------------------------------
## Observations 48
## R2 0.212
## Adjusted R2 0.194
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that the coefficient is statistically significant, i.e., higher poverty is associated with more murders. However, the \(R^2\) is rather low: poverty alone explains only about 21% of the variance in murder rates.
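If you want an explicit confidence interval for the coefficient, base R's confint() gives it directly (a sketch, not run here):
confint(mod_murder_1)  # 95% confidence intervals for the intercept and the poverty slope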
Part (b)
mod_murder_2 <- lm(murder ~ poverty + smoke, data = state_stats)
mod_murder_3 <- lm(murder ~ poverty + smoke + soc_sec, data = state_stats)
mod_murder_4 <- lm(murder ~ poverty + smoke + soc_sec + fed_spend,
                   data = state_stats)
stargazer(mod_murder_1, mod_murder_2, mod_murder_3, mod_murder_4,
          type = "text", omit.stat = c("ser", "f"))
##
## ==================================================
## Dependent variable:
## -------------------------------------
## murder
## (1) (2) (3) (4)
## --------------------------------------------------
## poverty 0.757*** 0.803*** 0.831*** 0.443***
## (0.215) (0.246) (0.229) (0.102)
##
## smoke -0.101 0.159 0.255**
## (0.251) (0.252) (0.108)
##
## soc_sec -0.729*** -0.368***
## (0.259) (0.114)
##
## fed_spend 0.298***
## (0.021)
##
## Constant -4.684 -3.179 2.369 -3.796*
## (2.953) (4.782) (4.867) (2.137)
##
## --------------------------------------------------
## Observations 48 48 48 48
## R2 0.212 0.214 0.334 0.880
## Adjusted R2 0.194 0.179 0.289 0.869
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that adding smoke to the model made it slightly worse (adjusted \(R^2\) fell from 0.194 to 0.179), i.e., we probably shouldn't bother including smoking in our model, even though its coefficient is significant in the final model. At the same time, it is clear that the main explanatory variable here is fed_spend, which, as we know, is probably driven by the presence of an outlier.
Part (c)
mod_murder_4 <- lm(murder ~ poverty + smoke + soc_sec + fed_spend,
                   data = state_stats)
mod_robbery <- lm(robbery ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_assault <- lm(agg_assault ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_larceny <- lm(larceny ~ poverty + smoke + soc_sec + fed_spend,
                  data = state_stats)
mod_motor <- lm(motor_theft ~ poverty + smoke + soc_sec + fed_spend,
                data = state_stats)
stargazer(mod_murder_4, mod_robbery, mod_assault, mod_larceny, mod_motor,
          type = "text", omit.stat = c("ser", "f"))
##
## =================================================================
## Dependent variable:
## ----------------------------------------------------
## murder robbery agg_assault larceny motor_theft
## (1) (2) (3) (4) (5)
## -----------------------------------------------------------------
## poverty 0.443*** 3.901 -0.891 -11.474** -0.842
## (0.102) (3.255) (0.598) (5.074) (0.665)
##
## smoke 0.255** 3.228 1.687** 5.366 0.688
## (0.108) (3.450) (0.634) (5.377) (0.705)
##
## soc_sec -0.368*** -7.568** -1.444** -1.157 -1.240
## (0.114) (3.637) (0.668) (5.669) (0.743)
##
## fed_spend 0.298*** 5.596*** 1.239*** 4.785*** 2.423***
## (0.021) (0.679) (0.125) (1.058) (0.139)
##
## Constant -3.796* 45.799 -4.701 97.625 2.876
## (2.137) (68.073) (12.503) (106.113) (13.909)
##
## -----------------------------------------------------------------
## Observations 48 48 48 48 48
## R2 0.880 0.691 0.739 0.352 0.891
## Adjusted R2 0.869 0.662 0.714 0.292 0.881
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here we see that the high values of explained variance and the statistically significant coefficients on fed_spend have to do with the presence of an outlier.
In any case, if there is a causal relationship, it most likely works as follows: higher crime rates call for bigger police budgets.
Fedor is familiar with stargazer for model reporting, but, according to ChatGPT, modelsummary is now better. We will stick to stargazer for now, but here is a comparison table for future reference:
| Feature | modelsummary | stargazer |
|---|---|---|
| Actively maintained | Yes (very active) | No (last major update ~2018) |
| Output formats | HTML, LaTeX, Markdown, Word | LaTeX, HTML, text only |
| Tidyverse-compatible | Uses broom + tidy() | Custom parsing |
| Model support | Many models (lm, glm, lme4, brms, fixest, etc.) | Mostly lm/glm |
| Customization | Very flexible, neat syntax | Verbose and rigid |
| Visualization support | Plots (plot_models()) | None |
| Side-by-side comparison | Easy (msummary(list(...))) | Supported |
| Documentation | Excellent | Decent but dated |
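For future reference, here is a minimal modelsummary sketch (assuming the package is installed; we do not run it here):
library(modelsummary)
# Side-by-side comparison of the four murder models from part (b)
msummary(list(mod_murder_1, mod_murder_2, mod_murder_3, mod_murder_4))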