We have 5 tasks this problem set
Task 1
Task 2
Task 3
Task 4
Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:
Model 1: Life Expectancy (LifeExpectancy)
Model 2: Ecological Footprint (Footprint)
Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)
Task 5
# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "mosaic")
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
##
## #refugeeswelcome
##
## Loading required package: carData
##
##
## Attaching package: 'car'
##
##
## The following object is masked from 'package:dplyr':
##
## recode
##
##
## The following object is masked from 'package:purrr':
##
## some
##
##
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
##
## Attaching package: 'mosaic'
##
##
## The following object is masked from 'package:Matrix':
##
## mean
##
##
## The following objects are masked from 'package:car':
##
## deltaMethod, logit
##
##
## The following object is masked from 'package:modelsummary':
##
## msummary
##
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
##
## The following object is masked from 'package:purrr':
##
## cross
##
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[6]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[7]]
## [1] "car" "carData" "sjPlot" "broom" "modelsummary"
## [6] "fst" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[8]]
## [1] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [6] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
##
## [[9]]
## [1] "mosaic" "mosaicData" "ggformula" "Matrix" "lattice"
## [6] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [11] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [16] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [21] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [26] "utils" "datasets" "methods" "base"
setwd("C:/Users/matej/OneDrive/Desktop/U of T/Summer 2024/SOC252/RMarkdowns")
# Load the data
sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We need to simply calculate a correlation using the “cor” command then make a scatterplot with a regression line
cor(sat_gpa$gpa_fy, sat_gpa$sat_math)
## [1] 0.3871178
We get roughly a correlation 0.39
# Now for our scatterplot
ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "SAT Math Scores", y = "Freshman GPA")
## `geom_smooth()` using formula = 'y ~ x'
We need to make a regression table for total and verbal SAT scores (independent variables) and freshman GPA (our dependent variable). We must make 2 tables , one with modelsummary and another with sjPlot.
model <- lm(gpa_fy ~ sat_verbal + sat_total, data = sat_gpa)
#Our first regression table with modelsummary
modelsummary(model)
| (1) | |
|---|---|
| (Intercept) | 0.007 |
| (0.152) | |
| sat_verbal | 0.003 |
| (0.005) | |
| sat_total | 0.022 |
| (0.003) | |
| Num.Obs. | 1000 |
| R2 | 0.212 |
| R2 Adj. | 0.211 |
| AIC | 2006.4 |
| BIC | 2026.0 |
| Log.Lik. | -999.189 |
| RMSE | 0.66 |
#And our second with sjPlot
tab_model(model)
| gpa_fy | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.01 | -0.29 – 0.31 | 0.961 |
| sat verbal | 0.00 | -0.01 – 0.01 | 0.536 |
| sat total | 0.02 | 0.02 – 0.03 | <0.001 |
| Observations | 1000 | ||
| R2 / R2 adjusted | 0.212 / 0.211 | ||
How can we interpret these tables?
Intercept Our intercept is 0.01. This means that we should expect a freshman’s GPA to be this value if their Total SAT and Verbal SAT scores were 0. While it is practically impossible to get a 0 on your SATs, 0.01 will serve as our model’s baseline.
Slope This is where it gets interesting for our data. Our coefficients are 0.00 for verbal SAT score, and 0.02 for total SAT score. This means that for every percent increase in a freshman’s verbal SAT score, their overall GPA increases by 0.00, practically nothing. However, for every percent increase in a freshman’s total SAT score, their GPA increases by 0.02 units. One could say total SAT score would be a better predictor of a freshman’s SAT score, they would be right especially when analyzing p values
P-Values The p-value for our verbal SAT variable is 0.536, 0.001 for our total SAT variable, and 0.961 when we combine the variables. This means that the verbal SAT score model and bivariate model both show a relationship that is not statistically significant at the 5% significance level since they both have p-values greater than 0.05. Our other model that explores the relationship with only total SAT scores, however, shows a relationship that is statistically significant at 5% significance level since it’s p-value is less than 0.05. That in mind, it seems like we should trust the single variable total SAT model more.
R-Squared Our bivariate model has an R-squared value of 0.212. This means that 21.2% of the variation in freshmens’ GPAs can be explained by a combination of their verbal and total SAT scores. Sure our model captures and explains some variability in freshman GPA but it’s important to consider that 78.8% of the variation is not explained by our model. Whats interesting is when we compare this R-squared to our adjusted R-squared since adjusted R-squared is meant to give us an R that accounts for the extra variables added. Adding extra variables will always increase r-squared, no matter how significant they are. Since our R-squared is 0.211, which is very close to our original r-square value 0.212, it means that our model’s ability to explain variability in freshman GPA is hardly reduced when we account for the number of independent variables added in. Verbal SAT and total SAT scores have a meaningful contribution to our model and our model does not suffer from overfitting.
Conclusion In conclusion, while verbal and total SAT scores (especially total) are useful predictors of freshmen’s GPAs, they only explain a small ammount of the variability. Adding these 2 variables together to predict freshman GPA does not really affect the fit of our model though, meaning it would likely preform well on new, unseen data. A future survey might do well to test other variables such as extracurricular activitie participation or hours spent studying, when researching factors that play significant roles in predicting freshmen’s GPAs. ## Task 3 We need to explore relationships between 2 variables (happy life years and gdp) in the HappyPlanetIndex dataset. We’ll make 2 visualizations like last time, one with modelsummary and one with sjPlot.
#First lets load up this dataset
data("HappyPlanetIndex")
World <- HappyPlanetIndex
model2 <- lm(GDPperCapita ~ HLY, data = HappyPlanetIndex)
#Visualizing with modelsummary requires the use of "modelplot"
modelplot(model2, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Coefficients for model2',
caption = 'Comparison of Happy Life Years and GDP per Capita') +
theme_minimal()
#Visualizing with sjPlot requires the use of "plot_model"
#Model A
plot_model(model2, show.values = TRUE, show.p = TRUE)
We can see in both visualizations that the coefficient is 622.03
For task 4 we need 3 models that use happiness as a dependent variable with the HappyPlanetIndex dataset
data("HappyPlanetIndex")
World <- HappyPlanetIndex
modelA <- lm(Happiness ~ LifeExpectancy, data = World)
modelB <- lm(Happiness ~ Footprint, data = World)
modelC <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
modelsCombined <- list(modelA, modelB, modelC)
# Visualize with modelsummary
# Model A
modelplot(modelA, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model A',
caption = 'Comparison of happiness and life expectancy') +
theme_minimal()
# Model B
modelplot(modelB, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model B',
caption = 'Comparison of happiness and ecological footprint') +
theme_minimal()
# Model C
modelplot(modelC, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model C',
caption = 'Comparison of happiness and GDP + HDI + population') +
theme_minimal()
# All 3 Models
modelplot(modelsCombined, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Comparison of Models A, B, and C',
caption = 'Comparison of happiness models') +
theme_minimal()
# Visualize with sjPlot. We're also gong to standardize all of the coefficients.
#Model A
plot_model(modelA, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
labs(title = "Standardized Coefficients of Model A") +
theme_minimal()
#Model B
plot_model(modelB, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
labs(title = "Standardized Coefficients of Model B") +
theme_minimal()
#Model C
plot_model(modelC, type = "std2", show.values = TRUE, show.p = FALSE, ci.lvl = 0.95) +
labs(title = "Standardized Coefficients of Model C") +
theme_minimal()
##Task 5 We need to use the “Violence” dataset for this task. Once loaded we need to predict murder rate using the independent variables for internet penetration and GDP from the dataset. We will make a regression table with modelsummary and sjPlot.
setwd("C:/Users/matej/OneDrive/Desktop/U of T/Summer 2024/SOC252/RMarkdowns")
load("Violence.Rdata")
data("Violence")
## Warning in data("Violence"): data set 'Violence' not found
# Before we make our model, lets check the structure of the dataset
str(Violence)
## 'data.frame': 8 obs. of 19 variables:
## $ Country : Factor w/ 8 levels "Austria","Belgium",..: 1 2 4 5 3 6 7 8
## $ Code : Factor w/ 8 levels "AUT","BEL","DOM",..: 1 2 4 5 3 6 7 8
## $ LandArea : num 82450 30280 107160 10830 48320 ...
## $ Population : num 8.34 10.71 13.69 2.69 9.95 ...
## $ Energy : num 33246 58583 8072 4387 8162 ...
## $ Rural : num 32.8 2.6 51.4 46.7 31 39.3 32 18.3
## $ Military : num 2.4 2.7 3.6 1.6 3.8 4.3 7.2 18.6
## $ Health : num 15.8 14.8 15.9 5.7 10.4 10.4 8.6 18.7
## $ HIV : num 0.3 0.2 0.8 1.7 0.9 17.9 1.1 0.6
## $ Internet : num 72.9 70.5 14.3 57.3 20.8 8.6 22.5 75.8
## $ Developed : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 2
## $ BirthRate : num 9.3 11.7 33 16.7 22.5 22 11 14.3
## $ ElderlyPop : num 17 17.2 4.4 7.7 5.9 4.4 15.9 12.6
## $ LifeExpectancy: num 80.2 80.4 70.3 71.8 72.6 51.5 68.3 78.4
## $ CO2 : num 8.12 9.79 0.87 4.54 2.24 ...
## $ GDP : num 45209 43144 2862 5274 5215 ...
## $ Cell : num 146 111.7 125.6 114.8 89.6 ...
## $ Electricity : num 7944 7903 548 1902 1358 ...
## $ MurderRate : num 0.55 1.85 46 59 24.8 36.7 6.2 5.3
# We need to view the names so we can see the syntax for our variables when we need to code them
names(Violence)
## [1] "Country" "Code" "LandArea" "Population"
## [5] "Energy" "Rural" "Military" "Health"
## [9] "HIV" "Internet" "Developed" "BirthRate"
## [13] "ElderlyPop" "LifeExpectancy" "CO2" "GDP"
## [17] "Cell" "Electricity" "MurderRate"
#Looks like we need the variables "MurderRate", "GDP", and "Internet
#Lets make the model first
modelv1 <- lm(MurderRate ~ Internet, data = Violence)
modelv2<- lm(MurderRate ~ GDP, data = Violence)
modelv3<- lm(MurderRate ~ Internet + GDP, data = Violence)
modelV <- list(modelv1, modelv2, modelv3)
#Now our regression table with modelsummary
modelsummary(modelV)
| (1) | (2) | (3) | |
|---|---|---|---|
| (Intercept) | 38.264 | 37.853 | 28.984 |
| (14.241) | (8.550) | (11.930) | |
| Internet | -0.367 | 0.463 | |
| (0.281) | (0.438) | ||
| GDP | -0.001 | -0.001 | |
| (0.000) | (0.001) | ||
| Num.Obs. | 8 | 8 | 8 |
| R2 | 0.221 | 0.513 | 0.602 |
| R2 Adj. | 0.091 | 0.432 | 0.443 |
| AIC | 75.5 | 71.7 | 72.1 |
| BIC | 75.7 | 72.0 | 72.4 |
| Log.Lik. | -34.746 | -32.864 | -32.059 |
| RMSE | 18.62 | 14.72 | 13.31 |
#And another with sjPlot
tab_model(modelv1, modelv2, modelv3)
| MurderRate | MurderRate | MurderRate | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 38.26 | 3.42 – 73.11 | 0.036 | 37.85 | 16.93 – 58.78 | 0.004 | 28.98 | -1.68 – 59.65 | 0.059 |
| Internet | -0.37 | -1.05 – 0.32 | 0.240 | 0.46 | -0.66 – 1.59 | 0.339 | |||
| GDP | -0.00 | -0.00 – -0.00 | 0.046 | -0.00 | -0.00 – 0.00 | 0.080 | |||
| Observations | 8 | 8 | 8 | ||||||
| R2 / R2 adjusted | 0.221 / 0.091 | 0.513 / 0.432 | 0.602 / 0.443 | ||||||
How can we interpret these visualizations?
Model V1: Internet Penetration and Murder Rate
When 0% of the population uses the internet, the murder rate would approximately be 38.26.
The slope for model V1 indicates a negative relationship such that, for every percent increase in internet penetration, murder rate falls by 0.37.
Since the P is greater than 0.05, the common significance (alpha) the relationship between internet penetration and murder rate is not statistically significant.
22.1% of the variation in murder rate is explained by adding in the internet penetration variable. The adjusted R-Squared is quite lower than the R-square at 0.091, the independent variable’s ability to explain the dependent is therefore mild.
These values will come into play when comparing with the other 2 models.
Model V1’s predicted murder rates are on average 18.62 units away from the observed murder rates.
Model V2: GDP and Murder Rate
When our GDP is set to 0 (unlikely) the predicted murder rate would be 37.85
For each unit increase in gross domestic product, the murder rate will fall by 0.001 units. Suggesting a small negative relationship between GDP and murder rate.
The effect mentioned above is statistically significant at a standard alpha of 0.05 as the p-value is less and rejects the null.
51.3% of the variance in murder rate can be explained by GDP. Since the adjusted r-square value is only slightly less, we can confirm that GDP explains murder rate well.
Model V2 seems to have a better fit than model V1 due to a lower AIC and BIC score. a Lower AIC means that it fits better than one in terms of balancing fit and complexity and the lower BIC means which makes it the best fitting model when also penalizing for complexity
The predicted murder rates are on average 14.72 units away from observed units. This indicates a better fit than Model V1 in terms of Root Mean Square Error.
Model V3: Internet Penetration + GDP and Murder Rate
If internet penetration and GDP were both to be set to 0, Murder rate is expected to be 28.98.
For each unit increase in internet penetration, murder rate increases by 0.46 units.
P(internet): 0.339 However, this positive relationship is definitely not statistically significant since it is greater than the alpha 0.05.
Coefficient(GDP): -0.001
For every 1 unit increase in GDP, murder rate falls by 0.001 units.
This relationship is just barely not statistically significant as the p-value is 0.03 greater than an alpha of 0.05 and thus the null is not rejected.
60.2% of the variance in murder rate can be explained by a combination of GDP and internet penetration. This explanatory power drops quite a bit when adjusting for number of predictors though, leaving model v3 to have a pretty similar explanatory power to model v2.
This model definitely has better fit than model V1 in terms of balancing fit and complexity (AIC), it also fits better than model V1 when penalizing for complexity (lower BIC), however, model V2 fits better than V3 in both of these categories (lower AIC and BIC than V3)
The average magnitude of the residuals is 13.31. This model fits best in terms of Root Mean Square Error.
Final Comments
While internet penetration seemed to have a pretty interesting effect on murder rate (model V1) through a negative relationship, that effect was not statistically significant, had the lowest explanatory power through adjusted r-square (0.091), and had the worst fit in terms of AIC, BIC, and RMSE (highest in all 3). Model V1 is definitely the worst fitting model, with the worst explanation for variance in the dependent, so, internet penetration alone seems to be the worst predictor of murder rate.
Combining internet penetration with GDP to predict murder rate together (model V3) showed an interesting effect that, while it was not statistically significant, had the best explanatory power of the models (adjusted r-square 0.443), and way better fit in terms of AIC, BIC and RMSE.
However, I’d say the best model of the 3 is model V2 that predicts murder rate with just GDP. The negative effect GDP has on murder rate is statistically significant, and using just GDP as an independent variable explains way more of the variance in murder rate than internet penetration (adjusted r-square 0.432). What makes this model the best one is the fact that is the best fitting in terms of balancing fit and complexity (lowest AIC of the 3), the best fit in terms of penalizing for complexity (lowest BIC), and second best fit in terms of average magnitude of the residuals (second lowest RMSE)
~ Mateja Dokic