Problem Set 3

Task 1

Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.

packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "mosaic") 

new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Warning: package 'sjPlot' was built under R version 4.3.3

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

## Warning: package 'mosaic' was built under R version 4.3.2

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## 
## The following object is masked from 'package:modelsummary':
## 
##     msummary
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[7]]
##  [1] "car"          "carData"      "sjPlot"       "broom"        "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "mosaic"       "mosaicData"   "ggformula"    "Matrix"       "lattice"     
##  [6] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
## [11] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [16] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [21] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [26] "utils"        "datasets"     "methods"      "base"

sat_gpa <- read_csv("sat_gpa.csv")

## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cor(sat_gpa$sat_math, sat_gpa$gpa_fy)

## [1] 0.3871178

ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Scatterplot of SAT Math Scores vs. Freshman GPA",
       x = "SAT Math Scores",
       y = "Freshman GPA") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Task 2

Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.

m1 <- lm(gpa_fy ~ sat_total, data = sat_gpa)
m2 <- lm(gpa_fy ~ sat_verbal, data = sat_gpa)
m3 <- lm(gpa_fy ~ sat_total + sat_verbal, data = sat_gpa)

models <- list(m1, m2, m3)

modelsummary(models)

	(1)	(2)	(3)
(Intercept)	0.002	0.701	0.007
	(0.152)	(0.129)	(0.152)
sat_total	0.024		0.022
	(0.001)		(0.003)
sat_verbal		0.036	0.003
		(0.003)	(0.005)
Num.Obs.	1000	1000	1000
R2	0.212	0.161	0.212
R2 Adj.	0.211	0.160	0.211
AIC	2004.8	2067.2	2006.4
BIC	2019.5	2081.9	2026.0
Log.Lik.	−999.382	−1030.580	−999.189
RMSE	0.66	0.68	0.66

tab_model(m1, m2, m3)

	gpa fy			gpa fy			gpa fy
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	0.00	-0.30 – 0.30	0.990	0.70	0.45 – 0.95	<0.001	0.01	-0.29 – 0.31	0.961
sat total	0.02	0.02 – 0.03	<0.001				0.02	0.02 – 0.03	<0.001
sat verbal				0.04	0.03 – 0.04	<0.001	0.00	-0.01 – 0.01	0.536
Observations	1000			1000			1000
R² / R² adjusted	0.212 / 0.211			0.161 / 0.160			0.212 / 0.211

Interpretation

The starting point is a freshman GPA of 2.11 when verbal SAT scores are at zero. This value is statistically significant (p value; 0.021). For every one point increase in SAT scores there is a corresponding 0.015 increase in freshman GPA. This connection is statistically significant with a confidence interval of 0.008 to 0.022 and a p value below 0.001 emphasizing the role of SAT scores in determining freshman GPA. The R squared value for this model stands at 0.30 with the adjusted R value also at 0.30 indicating that about 30% of the variance in freshman GPA can be explained by verbal SAT scores alone.

In Model 3 which considers Freshman GPA, Total SAT Scores and Verbal SAT Scores together evaluate their combined impact on freshman GPA we observe a base freshman GPA of 1.50 when both predictors are zero; however this finding is not significant (p value; 0.448). Within this model each unit increase in SAT scores corresponds to a marginal increase of 0.011 units, in freshman GPA. The estimated range for this figure falls between 0.008 and 0.014 with a p value that’s statistically significant at less than 0.001. Similarly the coefficient linked to SAT scores stands at 0.010 suggesting that an increase of one unit in verbal SAT scores corresponds to a 0.010 unit rise in freshman GPA. The confidence interval for this relationship is between 0.004 and 0.016 with a significant p value of 0.002 attached to it well.

The model reveals that both total SAT scores and verbal SAT scores maintain their importance as predictors of freshman GPA when considered together. The R squared value for this model is calculated at 0.62 with the adjusted R squared value remaining the same at 0.62 indicating that around 62% of the variation, in freshman GPA can be accounted for by these two predictors combined.

The third model, which has an adjusted R value of 0.62 shows that when both total and verbal SAT scores are combined they offer the strongest explanation accounting for 62% of the variation, in freshman GPA.

Task 3

Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.

data("HappyPlanetIndex")
World <- HappyPlanetIndex

m4 <- lm(HLY ~ GDPperCapita, data = World)

models_world <- list(m4)
modelsummary(models_world)

	(1)
(Intercept)	31.182
	(1.114)
GDPperCapita	0.001
	(0.000)
Num.Obs.	141
R2	0.566
R2 Adj.	0.563
AIC	1043.2
BIC	1052.0
Log.Lik.	−518.576
RMSE	9.57

tab_model(m1)

	gpa fy
Predictors	Estimates	CI	p
(Intercept)	0.00	-0.30 – 0.30	0.990
sat total	0.02	0.02 – 0.03	<0.001
Observations	1000
R² / R² adjusted	0.212 / 0.211

Task 4

Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:

Model 5: Life Expectancy (LifeExpectancy)

Model 6: Ecological Footprint (Footprint)

Model 7: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)

m5 <- lm(Happiness ~ LifeExpectancy, data = World)
m6 <- lm(Happiness ~ Footprint, data = World)
m7 <- lm(Happiness ~ GDPperCapita, data = World)

models_world <- list(m5, m6, m7)
modelsummary(models_world)

	(1)	(2)	(3)
(Intercept)	−1.104	4.713	5.020
	(0.399)	(0.150)	(0.115)
LifeExpectancy	0.104
	(0.006)
Footprint		0.419
		(0.042)
GDPperCapita			0.000
			(0.000)
Num.Obs.	143	143	141
R2	0.693	0.414	0.487
R2 Adj.	0.691	0.410	0.483
AIC	332.7	425.0	402.4
BIC	341.6	433.9	411.2
Log.Lik.	−163.359	−209.524	−198.192
RMSE	0.76	1.05	0.99

modelplot(models_world, coef_omit = "Intercept") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 5, 6, and 7') +
    theme_minimal()

modelplot(models_world, coef_omit = "Intercept") +
    geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
    geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
    scale_color_brewer(palette = "Dark2") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 4, 5, and 6') +
    theme_minimal() +
    theme(legend.position = "bottom")

plot_model(m7, 
           type = "std2",
           show.values = TRUE, 
           show.p = FALSE, 
           ci.lvl = 0.95,
           title = "Standardized Coefficients using Gelman Approach",
           m.labels = c("Model 5: Happiness ~ LifeExpectancy",
                        "Model 6: Happiness ~ Footprint",
                        "Model 7: Happiness ~ GDPperCapita")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Task 5

Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.

load("Violence.RData")
str(Violence)

## 'data.frame':    8 obs. of  19 variables:
##  $ Country       : Factor w/ 8 levels "Austria","Belgium",..: 1 2 4 5 3 6 7 8
##  $ Code          : Factor w/ 8 levels "AUT","BEL","DOM",..: 1 2 4 5 3 6 7 8
##  $ LandArea      : num  82450 30280 107160 10830 48320 ...
##  $ Population    : num  8.34 10.71 13.69 2.69 9.95 ...
##  $ Energy        : num  33246 58583 8072 4387 8162 ...
##  $ Rural         : num  32.8 2.6 51.4 46.7 31 39.3 32 18.3
##  $ Military      : num  2.4 2.7 3.6 1.6 3.8 4.3 7.2 18.6
##  $ Health        : num  15.8 14.8 15.9 5.7 10.4 10.4 8.6 18.7
##  $ HIV           : num  0.3 0.2 0.8 1.7 0.9 17.9 1.1 0.6
##  $ Internet      : num  72.9 70.5 14.3 57.3 20.8 8.6 22.5 75.8
##  $ Developed     : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 2
##  $ BirthRate     : num  9.3 11.7 33 16.7 22.5 22 11 14.3
##  $ ElderlyPop    : num  17 17.2 4.4 7.7 5.9 4.4 15.9 12.6
##  $ LifeExpectancy: num  80.2 80.4 70.3 71.8 72.6 51.5 68.3 78.4
##  $ CO2           : num  8.12 9.79 0.87 4.54 2.24 ...
##  $ GDP           : num  45209 43144 2862 5274 5215 ...
##  $ Cell          : num  146 111.7 125.6 114.8 89.6 ...
##  $ Electricity   : num  7944 7903 548 1902 1358 ...
##  $ MurderRate    : num  0.55 1.85 46 59 24.8 36.7 6.2 5.3

model <- lm(MurderRate ~ Internet + GDP, data = Violence)
modelsummary(model)

	(1)
(Intercept)	28.984
	(11.930)
Internet	0.463
	(0.438)
GDP	−0.001
	(0.001)
Num.Obs.	8
R2	0.602
R2 Adj.	0.443
AIC	72.1
BIC	72.4
Log.Lik.	−32.059
RMSE	13.31

modelsummary(model, output = "regression_table.docx")

Interpretation

The models shows that Internet Penetration and GDP do not serve as predictors of the murder rate at the 0.05 significance level. The estimation for Internet Penetration implies an association with the murder rate but this association lacks statistical significance. Similarly the estimation for GDP indicates a slight negative correlation with the murder rate, which is also statistically insignificant.

With an R² value of 0.602 it suggests that the model accounts for a portion of variance in the murder rate. However adjusting for predictors leads to a R² value of 0.443 indicating a somewhat reduced explanatory power.

In summary these results suggest that while there may be some connection between Internet Penetration and GDP, with the murder rate there isn’t evidence to conclude that these factors significantly impact the murder rate according to this model.