Problem Set 3

Due: July 17, 2024 by lecture time

You must submit both the .Rmd (R markdown file) and R pubs published version (with link provided). Copy/paste all the tasks as text and provide the relevant code and text answer, when prompted (e.g., when asked to interpret the model output).

Task 1

Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.

#1. Correlation between sat_math and gpa_fy #2. visualize through scatterplot, regression line

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "pandoc", "mosaic") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## 
## #refugeeswelcome
## 
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## 
## Attaching package: 'mosaic'
## 
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## 
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## 
## 
## The following object is masked from 'package:modelsummary':
## 
##     msummary
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[7]]
##  [1] "car"          "carData"      "sjPlot"       "broom"        "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
##  [6] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[10]]
##  [1] "mosaic"       "mosaicData"   "ggformula"    "Matrix"       "lattice"     
##  [6] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
## [11] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [16] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [21] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"

#Load File

sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Correlation between sat_math and gpa_fy

cor(sat_gpa$gpa_fy, sat_gpa$sat_math)
## [1] 0.3871178

#Scatter plot/regression line

ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Math SAT Score", y = "Freshman GPA")
## `geom_smooth()` using formula = 'y ~ x'

Task 2

Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret. variables: sat_total / gpa_fy

sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(sat_total ~ gpa_fy, data = sat_gpa)
summary(model)
## 
## Call:
## lm(formula = sat_total ~ gpa_fy, data = sat_gpa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.862  -8.670  -0.205   9.249  38.097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   81.421      1.397   58.30   <2e-16 ***
## gpa_fy         8.877      0.542   16.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.69 on 998 degrees of freedom
## Multiple R-squared:  0.2119, Adjusted R-squared:  0.2111 
## F-statistic: 268.3 on 1 and 998 DF,  p-value: < 2.2e-16

#modelsummary

modelsummary(model)
tinytable_heqibu6yjjwurhg7c2y5
(1)
(Intercept) 81.421
(1.396)
gpa_fy 8.877
(0.542)
Num.Obs. 1000
R2 0.212
R2 Adj. 0.211
AIC 7923.6
BIC 7938.3
Log.Lik. -3958.775
RMSE 12.68

#sjPlot

tab_model(model)
  sat_total
Predictors Estimates CI p
(Intercept) 81.42 78.68 – 84.16 <0.001
gpa fy 8.88 7.81 – 9.94 <0.001
Observations 1000
R2 / R2 adjusted 0.212 / 0.211

#Interpretation The relationship between SAT scores and freshman GPA shows that the intercept is 81.42 (95% CI: 78.68 - 84.16, p < 0.001), indicating the expected GPA when SAT scores are zero. The slope estimate is 8.88 (95% CI: 7.81 - 9.94, p < 0.001), demonstrating a statistically significant positive relationship: higher SAT scores correlate with higher freshman GPAs. The R-squared value of 0.212 (adjusted R-squared: 0.211) suggests that SAT scores explain about 21.2% of the variance in freshman GPA, indicating a moderate relationship with other factors also influencing GPA.

Task 3

Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.

data("HappyPlanetIndex")
World <- HappyPlanetIndex

#Scatter Plot

plot(World$HLY, World$GDPperCapita, xlab = "HLY", ylab = "GDPperCapita", pch = 16)

model_2 <- lm(HLY ~ GDPperCapita, data = HappyPlanetIndex)
summary(model_2)
## 
## Call:
## lm(formula = HLY ~ GDPperCapita, data = HappyPlanetIndex)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8503  -4.9116   0.7084   5.3119  26.2608 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.118e+01  1.114e+00   28.00   <2e-16 ***
## GDPperCapita 9.093e-04  6.759e-05   13.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.641 on 139 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5656, Adjusted R-squared:  0.5625 
## F-statistic:   181 on 1 and 139 DF,  p-value: < 2.2e-16

#modelsummary

modelsummary(model_2)
tinytable_w53325f2wshtlx59irvs
(1)
(Intercept) 31.182
(1.114)
GDPperCapita 0.001
(0.000)
Num.Obs. 141
R2 0.566
R2 Adj. 0.563
AIC 1043.2
BIC 1052.0
Log.Lik. -518.576
RMSE 9.57

#sjPlot

tab_model(model_2)
  HLY
Predictors Estimates CI p
(Intercept) 31.18 28.98 – 33.38 <0.001
GDPperCapita 0.00 0.00 – 0.00 <0.001
Observations 141
R2 / R2 adjusted 0.566 / 0.563

Task 4

Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:

Model 1: Life Expectancy (LifeExpectancy)

Model 2: Ecological Footprint (Footprint)

Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)

data("HappyPlanetIndex")
World <- HappyPlanetIndex
m4 <- lm(Happiness ~ LifeExpectancy, data = World)
m5 <- lm(Happiness ~ Footprint, data = World)
m6 <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
models_world <- list(m4, m5, m6)
modelsummary(models_world)
tinytable_7orco9h8s4225akqjrqa
(1) (2) (3)
(Intercept) -1.104 4.713 1.528
(0.399) (0.150) (0.357)
LifeExpectancy 0.104
(0.006)
Footprint 0.419
(0.042)
GDPperCapita 0.000
(0.000)
HDI 5.778
(0.577)
Population 0.001
(0.000)
Num.Obs. 143 143 141
R2 0.693 0.414 0.707
R2 Adj. 0.691 0.410 0.700
AIC 332.7 425.0 327.6
BIC 341.6 433.9 342.3
Log.Lik. -163.359 -209.524 -158.779
RMSE 0.76 1.05 0.75
tab_model(m4, m5, m6)
  Happiness Happiness Happiness
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) -1.10 -1.89 – -0.32 0.006 4.71 4.42 – 5.01 <0.001 1.53 0.82 – 2.23 <0.001
LifeExpectancy 0.10 0.09 – 0.11 <0.001
Footprint 0.42 0.34 – 0.50 <0.001
GDPperCapita 0.00 -0.00 – 0.00 0.100
HDI 5.78 4.64 – 6.92 <0.001
Population 0.00 -0.00 – 0.00 0.247
Observations 143 143 141
R2 / R2 adjusted 0.693 / 0.691 0.414 / 0.410 0.707 / 0.700
# Facet the plot by model
modelplot(models_world, coef_omit = "Intercept") +
    geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
    geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
    scale_color_brewer(palette = "Dark2") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 4, 5, and 6') +
    theme_minimal() +
    theme(legend.position = "bottom") +
    facet_wrap(~ model, scales = "free_x")

Task 5

Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.

load("Violence.Rdata")
Violence$Internet
## [1] 72.9 70.5 14.3 57.3 20.8  8.6 22.5 75.8
Violence$GDP
## [1] 45209.396 43144.343  2862.367  5274.037  5214.537  7275.344  3006.915
## [8] 47198.504
Violence$MurderRate
## [1]  0.55  1.85 46.00 59.00 24.80 36.70  6.20  5.30
m1 <- lm(Internet ~ MurderRate, data = Violence)
m2 <- lm(GDP ~ MurderRate, data = Violence)

#modelsummary

models_world_2 <- list(m1, m2)
modelsummary(models_world_2)
tinytable_u3tpz5boirhmry7tjr06
(1) (2)
(Intercept) 56.427 34949.763
(14.262) (8193.361)
MurderRate -0.603 -667.476
(0.462) (265.319)
Num.Obs. 8 8
R2 0.221 0.513
R2 Adj. 0.091 0.432
AIC 79.5 181.1
BIC 79.7 181.4
Log.Lik. -36.732 -87.560
RMSE 23.87 13711.81

#sjPlot

tab_model(m1, m2)
  Internet GDP
Predictors Estimates CI p Estimates CI p
(Intercept) 56.43 21.53 – 91.32 0.007 34949.76 14901.33 – 54998.20 0.005
MurderRate -0.60 -1.73 – 0.53 0.240 -667.48 -1316.69 – -18.26 0.046
Observations 8 8
R2 / R2 adjusted 0.221 / 0.091 0.513 / 0.432

Interpretation

The estimate for internet penetration as a predictor of murder rate is -0.6. This negative estimate suggests that higher internet penetration is associated with a decrease in murder rates, although the effect is relatively small. Internet penetration as predictor of murder rate is not statistically significant as the p value is 0.24, higher than the baseline 0.05 for 95% confidence.

The estimate for GDP as a predictor of murder rate is -667.48. AThis negative estimate suggests that higher GDP is associated with a decrease in murder rates. GDP as a predictor of murder rates is statistically significant as the p value is 0.046, lower than the baseline of 0.05 for 95% confidence.

The number of observations is 8 for both predictors, which is a relatively small sample size which could limit the generalizability of the results.