Problem Set 3

Instructions

We have 5 tasks this problem set

Task 1

Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy).
Next, visualize this relationship with a scatterplot, including a regression line.

Task 2

Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.

Task 3

Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.

Task 4

Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:
Model 1: Life Expectancy (LifeExpectancy)
Model 2: Ecological Footprint (Footprint)
Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)

Task 5

Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.

Loading packages and problem sets

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "mosaic") 

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
##   options(modelsummary_factory_latex = 'kableExtra')
##   options(modelsummary_factory_html = 'kableExtra')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## 
## #refugeeswelcome
## 
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## 
## Attaching package: 'mosaic'
## 
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## 
## The following objects are masked from 'package:car':
## 
##     deltaMethod, logit
## 
## 
## The following object is masked from 'package:modelsummary':
## 
##     msummary
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[7]]
##  [1] "car"          "carData"      "sjPlot"       "broom"        "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "mosaic"       "mosaicData"   "ggformula"    "Matrix"       "lattice"     
##  [6] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
## [11] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [16] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [21] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [26] "utils"        "datasets"     "methods"      "base"

setwd("C:/Users/matej/OneDrive/Desktop/U of T/Summer 2024/SOC252/RMarkdowns")

# Load the data
sat_gpa <- read_csv("sat_gpa.csv")

## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Task 1

We need to simply calculate a correlation using the “cor” command then make a scatterplot with a regression line

cor(sat_gpa$gpa_fy, sat_gpa$sat_math)

## [1] 0.3871178

We get roughly a correlation 0.39

# Now for our scatterplot
ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "SAT Math Scores", y = "Freshman GPA")

## `geom_smooth()` using formula = 'y ~ x'

Task 2

We need to make a regression table for total and verbal SAT scores (independent variables) and freshman GPA (our dependent variable). We must make 2 tables , one with modelsummary and another with sjPlot.

model <- lm(gpa_fy ~ sat_verbal + sat_total, data = sat_gpa)

#Our first regression table with modelsummary
modelsummary(model)

tinytable_s0waor93uumf179b0ntm

	(1)
(Intercept)	0.007
	(0.152)
sat_verbal	0.003
	(0.005)
sat_total	0.022
	(0.003)
Num.Obs.	1000
R2	0.212
R2 Adj.	0.211
AIC	2006.4
BIC	2026.0
Log.Lik.	-999.189
RMSE	0.66

#And our second with sjPlot
tab_model(model)

	gpa_fy
Predictors	Estimates	CI	p
(Intercept)	0.01	-0.29 – 0.31	0.961
sat verbal	0.00	-0.01 – 0.01	0.536
sat total	0.02	0.02 – 0.03	<0.001
Observations	1000
R² / R² adjusted	0.212 / 0.211

How can we interpret these tables? Intercept Our intercept is 0.01. This means that we should expect a freshman’s GPA to be this value if their Total SAT and Verbal SAT scores were 0. While it is practically impossible to get a 0 on your SATs, 0.01 will serve as our model’s baseline. Slope This is where it gets interesting for our data. Our coefficients are 0.00 for verbal SAT score, and 0.02 for total SAT score. This means that for every percent increase in a freshman’s verbal SAT score, their overall GPA increases by 0.00, practically nothing. However, for every percent increase in a freshman’s total SAT score, their GPA increases by 0.02 units. One could say total SAT score would be a better predictor of a freshman’s SAT score, they would be right especially when analyzing p values P-Values The p-value for our verbal SAT variable is 0.536, 0.001 for our total SAT variable, and 0.961 when we combine the variables. This means that the verbal SAT score model and bivariate model both show a relationship that is not statistically significant at the 5% significance level since they both have p-values greater than 0.05. Our other model that explores the relationship with only total SAT scores, however, shows a relationship that is statistically significant at 5% significance level since it’s p-value is less than 0.05. That in mind, it seems like we should trust the single variable total SAT model more. R-Squared Our bivariate model has an R-squared value of 0.212. This means that 21.2% of the variation in freshmens’ GPAs can be explained by a combination of their verbal and total SAT scores. Sure our model captures and explains some variability in freshman GPA but it’s important to consider that 78.8% of the variation is not explained by our model. Whats interesting is when we compare this R-squared to our adjusted R-squared since adjusted R-squared is meant to give us an R that accounts for the extra variables added. Adding extra variables will always increase r-squared, no matter how significant they are. Since our R-squared is 0.211, which is very close to our original r-square value 0.212, it means that our model’s ability to explain variability in freshman GPA is hardly reduced when we account for the number of independent variables added in. Verbal SAT and total SAT scores have a meaningful contribution to our model and our model does not suffer from overfitting. Conclusion In conclusion, while verbal and total SAT scores (especially total) are useful predictors of freshmen’s GPAs, they only explain a small ammount of the variability. Adding these 2 variables together to predict freshman GPA does not really affect the fit of our model though, meaning it would likely preform well on new, unseen data. A future survey might do well to test other variables such as extracurricular activitie participation or hours spent studying, when researching factors that play significant roles in predicting freshmen’s GPAs. ## Task 3 We need to explore relationships between 2 variables (happy life years and gdp) in the HappyPlanetIndex dataset. We’ll make 2 visualizations like last time, one with modelsummary and one with sjPlot.

#First lets load up this dataset
data("HappyPlanetIndex")
World <- HappyPlanetIndex

model2 <- lm(GDPperCapita ~ HLY, data = HappyPlanetIndex)

#Visualizing with modelsummary requires the use of "modelplot"
modelplot(model2, coef_omit = "Intercept") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Coefficients for model2',
         caption = 'Comparison of Happy Life Years and GDP per Capita') +
    theme_minimal()

#Visualizing with sjPlot requires the use of "plot_model"
#Model A
plot_model(model2, show.values = TRUE, show.p = TRUE)

We can see in both visualizations that the coefficient is 622.03

Task 4

For task 4 we need 3 models that use happiness as a dependent variable with the HappyPlanetIndex dataset

Model A: Life expectancy and happiness
Model B: Ecological footprint and happiness
Model C: GDP per capita, human development index, population and happiness We need to Visualize the coefficients of the models using modelsummary and sjPlot.

data("HappyPlanetIndex")
World <- HappyPlanetIndex

modelA <- lm(Happiness ~ LifeExpectancy, data = World)
modelB <- lm(Happiness ~ Footprint, data = World)
modelC <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
modelsCombined <- list(modelA, modelB, modelC)

# Visualize with modelsummary

# Model A
modelplot(modelA, coef_omit = "Intercept") +
  labs(x = 'Coefficient Estimate',
       y = 'Term',
       title = 'Model A',
       caption = 'Comparison of happiness and life expectancy') +
  theme_minimal()

# Model B
modelplot(modelB, coef_omit = "Intercept") +
  labs(x = 'Coefficient Estimate',
       y = 'Term',
       title = 'Model B',
       caption = 'Comparison of happiness and ecological footprint') +
  theme_minimal()

# Model C
modelplot(modelC, coef_omit = "Intercept") +
  labs(x = 'Coefficient Estimate',
       y = 'Term',
       title = 'Model C',
       caption = 'Comparison of happiness and GDP + HDI + population') +
  theme_minimal()

# All 3 Models
modelplot(modelsCombined, coef_omit = "Intercept") +
  labs(x = 'Coefficient Estimate',
       y = 'Term',
       title = 'Comparison of Models A, B, and C',
       caption = 'Comparison of happiness models') +
  theme_minimal()

# Visualize with sjPlot. We're also gong to standardize all of the coefficients.

#Model A
plot_model(modelA, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
  labs(title = "Standardized Coefficients of Model A") +
  theme_minimal()

#Model B
plot_model(modelB, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
  labs(title = "Standardized Coefficients of Model B") +
  theme_minimal()

#Model C
plot_model(modelC, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
  labs(title = "Standardized Coefficients of Model C") +
  theme_minimal()

##Task 5 We need to use the “Violence” dataset for this task. Once loaded we need to predict murder rate using the independent variables for internet penetration and GDP from the dataset. We will make a regression table with modelsummary and sjPlot.

setwd("C:/Users/matej/OneDrive/Desktop/U of T/Summer 2024/SOC252/RMarkdowns")

load("Violence.Rdata")
data("Violence")

## Warning in data("Violence"): data set 'Violence' not found

# Before we make our model, lets check the structure of the dataset
str(Violence)

## 'data.frame':    8 obs. of  19 variables:
##  $ Country       : Factor w/ 8 levels "Austria","Belgium",..: 1 2 4 5 3 6 7 8
##  $ Code          : Factor w/ 8 levels "AUT","BEL","DOM",..: 1 2 4 5 3 6 7 8
##  $ LandArea      : num  82450 30280 107160 10830 48320 ...
##  $ Population    : num  8.34 10.71 13.69 2.69 9.95 ...
##  $ Energy        : num  33246 58583 8072 4387 8162 ...
##  $ Rural         : num  32.8 2.6 51.4 46.7 31 39.3 32 18.3
##  $ Military      : num  2.4 2.7 3.6 1.6 3.8 4.3 7.2 18.6
##  $ Health        : num  15.8 14.8 15.9 5.7 10.4 10.4 8.6 18.7
##  $ HIV           : num  0.3 0.2 0.8 1.7 0.9 17.9 1.1 0.6
##  $ Internet      : num  72.9 70.5 14.3 57.3 20.8 8.6 22.5 75.8
##  $ Developed     : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 2
##  $ BirthRate     : num  9.3 11.7 33 16.7 22.5 22 11 14.3
##  $ ElderlyPop    : num  17 17.2 4.4 7.7 5.9 4.4 15.9 12.6
##  $ LifeExpectancy: num  80.2 80.4 70.3 71.8 72.6 51.5 68.3 78.4
##  $ CO2           : num  8.12 9.79 0.87 4.54 2.24 ...
##  $ GDP           : num  45209 43144 2862 5274 5215 ...
##  $ Cell          : num  146 111.7 125.6 114.8 89.6 ...
##  $ Electricity   : num  7944 7903 548 1902 1358 ...
##  $ MurderRate    : num  0.55 1.85 46 59 24.8 36.7 6.2 5.3

# We need to view the names so we can see the syntax for our variables when we need to code them
names(Violence)

##  [1] "Country"        "Code"           "LandArea"       "Population"    
##  [5] "Energy"         "Rural"          "Military"       "Health"        
##  [9] "HIV"            "Internet"       "Developed"      "BirthRate"     
## [13] "ElderlyPop"     "LifeExpectancy" "CO2"            "GDP"           
## [17] "Cell"           "Electricity"    "MurderRate"

#Looks like we need the variables "MurderRate", "GDP", and "Internet

#Lets make the model first

modelv1 <- lm(MurderRate ~ Internet, data = Violence)
modelv2<- lm(MurderRate ~ GDP, data = Violence)
modelv3<- lm(MurderRate ~ Internet + GDP, data = Violence)
modelV <- list(modelv1, modelv2, modelv3)

#Now our regression table with modelsummary
modelsummary(modelV)

tinytable_yncnz2o2qajjmdxvfxj2

	(1)	(2)	(3)
(Intercept)	38.264	37.853	28.984
	(14.241)	(8.550)	(11.930)
Internet	-0.367		0.463
	(0.281)		(0.438)
GDP		-0.001	-0.001
		(0.000)	(0.001)
Num.Obs.	8	8	8
R2	0.221	0.513	0.602
R2 Adj.	0.091	0.432	0.443
AIC	75.5	71.7	72.1
BIC	75.7	72.0	72.4
Log.Lik.	-34.746	-32.864	-32.059
RMSE	18.62	14.72	13.31

#And another with sjPlot
tab_model(modelv1, modelv2, modelv3)

	MurderRate			MurderRate			MurderRate
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	38.26	3.42 – 73.11	0.036	37.85	16.93 – 58.78	0.004	28.98	-1.68 – 59.65	0.059
Internet	-0.37	-1.05 – 0.32	0.240				0.46	-0.66 – 1.59	0.339
GDP				-0.00	-0.00 – -0.00	0.046	-0.00	-0.00 – 0.00	0.080
Observations	8			8			8
R² / R² adjusted	0.221 / 0.091			0.513 / 0.432			0.602 / 0.443

How can we interpret these visualizations?

Model V1: Internet Penetration and Murder Rate

Intercept: 38.26

When 0% of the population uses the internet, the murder rate would approximately be 38.26.

Coefficient: -0.37

The slope for model V1 indicates a negative relationship such that, for every percent increase in internet penetration, murder rate falls by 0.37.

P-value: 0.240

Since the P is greater than 0.05, the common significance (alpha) the relationship between internet penetration and murder rate is not statistically significant.

R-squared: 0.221, Adjusted R-Squared: 0.091

22.1% of the variation in murder rate is explained by adding in the internet penetration variable. The adjusted R-Squared is quite lower than the R-square at 0.091, the independent variable’s ability to explain the dependent is therefore mild.

AIC: 75.5, BIC: 75.7

These values will come into play when comparing with the other 2 models.

RMSE: 18.62

Model V1’s predicted murder rates are on average 18.62 units away from the observed murder rates.

Model V2: GDP and Murder Rate

Intercept: 37.85

When our GDP is set to 0 (unlikely) the predicted murder rate would be 37.85

Coefficient: -0.001

For each unit increase in gross domestic product, the murder rate will fall by 0.001 units. Suggesting a small negative relationship between GDP and murder rate.

P: 0.046

The effect mentioned above is statistically significant at a standard alpha of 0.05 as the p-value is less and rejects the null.

R-Square: 0.513, Adjusted R-Square: 0.432

51.3% of the variance in murder rate can be explained by GDP. Since the adjusted r-square value is only slightly less, we can confirm that GDP explains murder rate well.

AIC: 71.7, BIC: 72.0

Model V2 seems to have a better fit than model V1 due to a lower AIC and BIC score. a Lower AIC means that it fits better than one in terms of balancing fit and complexity and the lower BIC means which makes it the best fitting model when also penalizing for complexity

RMSE: 14.72

The predicted murder rates are on average 14.72 units away from observed units. This indicates a better fit than Model V1 in terms of Root Mean Square Error.

Model V3: Internet Penetration + GDP and Murder Rate

Intercept: 28.98

If internet penetration and GDP were both to be set to 0, Murder rate is expected to be 28.98.

Coefficient(Internet): 0.46

For each unit increase in internet penetration, murder rate increases by 0.46 units.

P(internet): 0.339 However, this positive relationship is definitely not statistically significant since it is greater than the alpha 0.05.
Coefficient(GDP): -0.001

For every 1 unit increase in GDP, murder rate falls by 0.001 units.

P(GDP): 0.080

This relationship is just barely not statistically significant as the p-value is 0.03 greater than an alpha of 0.05 and thus the null is not rejected.

R-Square: 0.602, Adjusted R-Square: 0.443

60.2% of the variance in murder rate can be explained by a combination of GDP and internet penetration. This explanatory power drops quite a bit when adjusting for number of predictors though, leaving model v3 to have a pretty similar explanatory power to model v2.

AIC: 72.1, BIC: 72.4

This model definitely has better fit than model V1 in terms of balancing fit and complexity (AIC), it also fits better than model V1 when penalizing for complexity (lower BIC), however, model V2 fits better than V3 in both of these categories (lower AIC and BIC than V3)

RMSE: 13.31

The average magnitude of the residuals is 13.31. This model fits best in terms of Root Mean Square Error.

Final Comments

While internet penetration seemed to have a pretty interesting effect on murder rate (model V1) through a negative relationship, that effect was not statistically significant, had the lowest explanatory power through adjusted r-square (0.091), and had the worst fit in terms of AIC, BIC, and RMSE (highest in all 3). Model V1 is definitely the worst fitting model, with the worst explanation for variance in the dependent, so, internet penetration alone seems to be the worst predictor of murder rate.

Combining internet penetration with GDP to predict murder rate together (model V3) showed an interesting effect that, while it was not statistically significant, had the best explanatory power of the models (adjusted r-square 0.443), and way better fit in terms of AIC, BIC and RMSE.

However, I’d say the best model of the 3 is model V2 that predicts murder rate with just GDP. The negative effect GDP has on murder rate is statistically significant, and using just GDP as an independent variable explains way more of the variance in murder rate than internet penetration (adjusted r-square 0.432). What makes this model the best one is the fact that is the best fitting in terms of balancing fit and complexity (lowest AIC of the 3), the best fit in terms of penalizing for complexity (lowest BIC), and second best fit in terms of average magnitude of the residuals (second lowest RMSE)

Thanks for reading my code!

~ Mateja Dokic