Problem Set 3

packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "pandoc", "mosaic") 

new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## 
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## 
## Attaching package: 'mosaic'
## 
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## 
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## 
## 
## The following object is masked from 'package:modelsummary':
## 
##     msummary
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[7]]
##  [1] "car"          "carData"      "sjPlot"       "broom"        "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
##  [6] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[10]]
##  [1] "mosaic"       "mosaicData"   "ggformula"    "Matrix"       "lattice"     
##  [6] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
## [11] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [16] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [21] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"

# Load the data
sat_gpa <- read.csv("~/Downloads/sat_gpa.csv")

Task 1

Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.

# Calculate the correlation between SAT math scores and freshman GPA
correlation <- cor(sat_gpa$sat_math, sat_gpa$gpa_fy)
print(correlation)

## [1] 0.3871178

ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "SAT Math Score", y = "Freshman GPA")

## `geom_smooth()` using formula = 'y ~ x'

Task 2

Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.

m1 <- lm(gpa_fy ~ sat_total, data = sat_gpa)
m2 <- lm(gpa_fy ~ sat_verbal, data = sat_gpa)
m3 <- lm(gpa_fy ~ sat_total + sat_verbal, data = sat_gpa)

models <- list(m1, m2, m3)
modelsummary(models)

tinytable_7t335wnkesxfi08j6ltu

	(1)	(2)	(3)
(Intercept)	0.002	0.701	0.007
	(0.152)	(0.129)	(0.152)
sat_total	0.024		0.022
	(0.001)		(0.003)
sat_verbal		0.036	0.003
		(0.003)	(0.005)
Num.Obs.	1000	1000	1000
R2	0.212	0.161	0.212
R2 Adj.	0.211	0.160	0.211
AIC	2004.8	2067.2	2006.4
BIC	2019.5	2081.9	2026.0
Log.Lik.	-999.382	-1030.580	-999.189
RMSE	0.66	0.68	0.66

tab_model(m1, m2, m3)

	gpa_fy			gpa_fy			gpa_fy
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	0.00	-0.30 – 0.30	0.990	0.70	0.45 – 0.95	<0.001	0.01	-0.29 – 0.31	0.961
sat total	0.02	0.02 – 0.03	<0.001				0.02	0.02 – 0.03	<0.001
sat verbal				0.04	0.03 – 0.04	<0.001	0.00	-0.01 – 0.01	0.536
Observations	1000			1000			1000
R² / R² adjusted	0.212 / 0.211			0.161 / 0.160			0.212 / 0.211

Model 1: SAT Total Score and Freshman GPA In the first model, we explore the impact of SAT total score on freshman GPA.The intercept is 0.00. It’s p-value is 0.990. This means that when the SAT total score is zero, the baseline freshman GPA is 0, but this value is not statistically significant because it’s p-value is larger than 0.001. The SAT Total Score Coefficient is 0.02, p-value less than 0.001, indicates that for each unit increase in SAT total score, the freshman GPA increases by 0.02 units, and this relationship is significant. The R-squared value for this model is 0.212, and the adjusted R-squared value is 0.211, indicating that the SAT total score explains approximately 21.1% of the variance in freshman GPA.

Model 2: SAT Verbal Score and Freshman GPA In the second model, we explore the impact of SAT verbal score on freshman GPA. The intercept is 0.70 and p-value is less than 0.001. This means that when the SAT verbal score is zero, the baseline freshman GPA is 0.70, and this value is statistically significant. SAT Verbal Score Coefficient is 0.04, p-value less than 0.001, indicates that for each unit increase in SAT verbal score, the freshman GPA increases by 0.04 units, and this relationship is significant. The R-squared value for this model is 0.161, and the adjusted R-squared value is 0.160, indicating that the SAT verbal score explains approximately 16.0% of the variance in freshman GPA.

Model 3: SAT Total, SAT Verbal Scores, and Freshman GPA In the third model, we combine SAT total score and SAT verbal score to understand their combined impact on freshman GPA. The intercept is 0.01, p-value is 0.961, which means that this value is not statistically significant. SAT Total Score Coefficient is 0.02, p-value less than 0.001. This indicates that for each unit increase in SAT total score, the freshman GPA increases by 0.02 units, and this relationship is highly significant. SAT Verbal Score Coefficient is 0.00, p-value is 0.536. This indicates that the SAT verbal score has no significant impact on freshman GPA when combined with the SAT total score. The R-squared value for this model is 0.212, and the adjusted R-squared value is 0.211, indicating that the combination of SAT total and SAT verbal scores explains approximately 21.1% of the variance in freshman GPA.

In summary, SAT total score is a significant predictor of freshman GPA (Models 1 and 3), explaining about 21.1% of the variance in freshman GPA. SAT verbal score is a significant predictor when analyzed alone (Model 2), but its impact is not significant when combined with SAT total score (Model 3). In addition, SAT total score has a stronger predictive power for freshman GPA than SAT verbal score.

plot_model(m3, show.values = TRUE, show.p = TRUE)

Task 3

Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.

data("HappyPlanetIndex")
World <- HappyPlanetIndex

plot(World$GDPperCapita, World$HLY, xlab = "GDP per Capita", ylab = "Happy Life Years", pch = 16)

m1 <- lm(HLY ~ GDPperCapita, data = World)

modelsummary(m1)

tinytable_5qb3q0280lucd0v9gfz3

	(1)
(Intercept)	31.182
	(1.114)
GDPperCapita	0.001
	(0.000)
Num.Obs.	141
R2	0.566
R2 Adj.	0.563
AIC	1043.2
BIC	1052.0
Log.Lik.	-518.576
RMSE	9.57

tab_model(m1)

	HLY
Predictors	Estimates	CI	p
(Intercept)	31.18	28.98 – 33.38	<0.001
GDPperCapita	0.00	0.00 – 0.00	<0.001
Observations	141
R² / R² adjusted	0.566 / 0.563

Task 4

Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:

Model 1: Life Expectancy (LifeExpectancy)

Model 2: Ecological Footprint (Footprint)

Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)

m4 <- lm(Happiness ~ LifeExpectancy, data = World)
m5 <- lm(Happiness ~ Footprint, data = World)
m6 <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)

models_world <- list(m4, m5, m6)
modelsummary(models_world)

tinytable_zaf1iyuz0lgumx6b3dre

	(1)	(2)	(3)
(Intercept)	-1.104	4.713	1.528
	(0.399)	(0.150)	(0.357)
LifeExpectancy	0.104
	(0.006)
Footprint		0.419
		(0.042)
GDPperCapita			0.000
			(0.000)
HDI			5.778
			(0.577)
Population			0.001
			(0.000)
Num.Obs.	143	143	141
R2	0.693	0.414	0.707
R2 Adj.	0.691	0.410	0.700
AIC	332.7	425.0	327.6
BIC	341.6	433.9	342.3
Log.Lik.	-163.359	-209.524	-158.779
RMSE	0.76	1.05	0.75

tab_model(m4, m5, m6)

	Happiness			Happiness			Happiness
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	-1.10	-1.89 – -0.32	0.006	4.71	4.42 – 5.01	<0.001	1.53	0.82 – 2.23	<0.001
LifeExpectancy	0.10	0.09 – 0.11	<0.001
Footprint				0.42	0.34 – 0.50	<0.001
GDPperCapita							0.00	-0.00 – 0.00	0.100
HDI							5.78	4.64 – 6.92	<0.001
Population							0.00	-0.00 – 0.00	0.247
Observations	143			143			141
R² / R² adjusted	0.693 / 0.691			0.414 / 0.410			0.707 / 0.700

modelplot(models_world, coef_omit = "Intercept") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 4, 5, and 6') +
    theme_minimal()

modelplot(models_world, coef_omit = "Intercept") +
    geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
    geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
    scale_color_brewer(palette = "Dark2") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 4, 5, and 6') +
    theme_minimal() +
    theme(legend.position = "bottom")

modelplot(models_world, coef_omit = "Intercept") +
    geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
    geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
    scale_color_brewer(palette = "Dark2") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 4, 5, and 6') +
    theme_minimal() +
    theme(legend.position = "bottom") +
    facet_wrap(~ model, scales = "free_x")

plot_model(m6, 
           type = "std2", # Gelman's standardization
           show.values = TRUE, 
           show.p = FALSE, 
           ci.lvl = 0.95,
           title = "Standardized Coefficients using Gelman Approach",
           m.labels = c("Model 4: Happiness ~ LifeExpectancy",
                        "Model 5: Happiness ~ Footprint",
                        "Model 6: Happiness ~ GDPperCapita + HDI + Population")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Task 5

Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.

load("/Users/liuyichen/Downloads/Violence.RData")
nrow(Violence)  # Number of observational units

## [1] 8

ncol(Violence)  # Number of variables

## [1] 19

mean(Violence$MurderRate)

## [1] 22.55

plot(Violence$Internet, Violence$MurderRate, xlab = "Internet Penetration", ylab = "Murder Rate per 100,000", pch = 16)

plot(Violence$GDP, Violence$MurderRate, xlab = "GDP", ylab = "Murder Rate per 100,000", pch = 16)

model <- lm(MurderRate ~ Internet + GDP, data = Violence)
summary(model)

## 
## Call:
## lm(formula = MurderRate ~ Internet + GDP, data = Violence)
## 
## Residuals:
##            Austria            Belgium          Guatemala            Jamaica 
##             -2.507             -2.822             14.174             10.453 
## Dominican Republic       South Africa            Ukraine      United States 
##             -6.930             13.338            -29.231              3.526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 28.9842977 11.9299808   2.430   0.0594 .
## Internet     0.4629043  0.4385000   1.056   0.3394  
## GDP         -0.0013199  0.0006033  -2.188   0.0803 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.84 on 5 degrees of freedom
## Multiple R-squared:  0.602,  Adjusted R-squared:  0.4429 
## F-statistic: 3.782 on 2 and 5 DF,  p-value: 0.09991

Interpreting the output:

Intercept: The intercept in this regression model is estimated to be 28.9843. This value represents the expected murder rate when both Internet penetration and GDP are zero. While this scenario may not be practically meaningful (which is not practically feasible in today’s world), it serves as the baseline for this model.

Slopes: The coefficient for Internet is 0.4629. This indicates that for each one percent increase in the Internet penetration, the murder rate is expected to increase by approximately 0.4629 units. The associated p-value is 0.3394, indicating that Internet penetration is not statistically significant in predicting murder rates The coefficient for GDP is -0.0013199. This indicates that for each one percent increase in the GDP, the murder rate is expected to decrease by approximately 0.00132 units. The associated p-value is 0.0803, suggesting that GDP per capital might have a marginally negative effect on murder rates, but it is not statistical significance (p < 0.05).

Statistical Significance: The t-value for the Internet penetration coefficient is 1.056, and its associated p-value is 0.3394. Since this p-value is larger than 0.05, we conclude that there isn’t a statistically significant relationship between the percentage of the internet coefficient and the murder rate at the 5% significance level. The t-value for the GDP coefficient is -2.188, and its associated p-value is 0.0803. Since this p-value is larger than 0.05, we conclude that there isn’t a statistically significant relationship between the percentage of the internet coefficient and the murder rate at the 5% significance level.

The residuals, which represent the differences between observed and predicted murder rates, range from -29.231 to 14.174. The residual standard error, a measure of the typical size of the residuals, is 16.84, indicating some variability in the model’s predictions.

R-squared: The model’s R-squared value is 0.602, indicating that approximately 60.2% of the variability in murder rates can be explained by the variables Internet penetration and GDP per capital. The adjusted R-squared, which considers the number of predictors, is 0.4429, suggesting a moderately good fit of the model given its simplicity.

modelsummary(model, output = "regression_table.docx")

tab_model(model)

	MurderRate
Predictors	Estimates	CI	p
(Intercept)	28.98	-1.68 – 59.65	0.059
Internet	0.46	-0.66 – 1.59	0.339
GDP	-0.00	-0.00 – 0.00	0.080
Observations	8
R² / R² adjusted	0.602 / 0.443

plot_model(model, show.values = TRUE, show.p = TRUE)

predictions <- predict(model)
Violence$PredictedMurderRate <- predictions

Violence$PredictedMurderRate <- predict(model)
ggplot(Violence, aes(x = PredictedMurderRate, y = MurderRate)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Actual vs. Predicted Murder Rates",
       x = "Predicted Murder Rate",
       y = "Actual Murder Rate") +
  theme_minimal()