Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "mosaic")
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'sjPlot' was built under R version 4.3.3
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
## Warning: package 'mosaic' was built under R version 4.3.2
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following objects are masked from 'package:car':
##
## deltaMethod, logit
##
## The following object is masked from 'package:modelsummary':
##
## msummary
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[6]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[7]]
## [1] "car" "carData" "sjPlot" "broom" "modelsummary"
## [6] "fst" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[8]]
## [1] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [6] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
##
## [[9]]
## [1] "mosaic" "mosaicData" "ggformula" "Matrix" "lattice"
## [6] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [11] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [16] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [21] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [26] "utils" "datasets" "methods" "base"
sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cor(sat_gpa$sat_math, sat_gpa$gpa_fy)
## [1] 0.3871178
ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", col = "red") +
labs(title = "Scatterplot of SAT Math Scores vs. Freshman GPA",
x = "SAT Math Scores",
y = "Freshman GPA") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.
m1 <- lm(gpa_fy ~ sat_total, data = sat_gpa)
m2 <- lm(gpa_fy ~ sat_verbal, data = sat_gpa)
m3 <- lm(gpa_fy ~ sat_total + sat_verbal, data = sat_gpa)
models <- list(m1, m2, m3)
modelsummary(models)
| (1) | (2) | (3) | |
|---|---|---|---|
| (Intercept) | 0.002 | 0.701 | 0.007 |
| (0.152) | (0.129) | (0.152) | |
| sat_total | 0.024 | 0.022 | |
| (0.001) | (0.003) | ||
| sat_verbal | 0.036 | 0.003 | |
| (0.003) | (0.005) | ||
| Num.Obs. | 1000 | 1000 | 1000 |
| R2 | 0.212 | 0.161 | 0.212 |
| R2 Adj. | 0.211 | 0.160 | 0.211 |
| AIC | 2004.8 | 2067.2 | 2006.4 |
| BIC | 2019.5 | 2081.9 | 2026.0 |
| Log.Lik. | −999.382 | −1030.580 | −999.189 |
| RMSE | 0.66 | 0.68 | 0.66 |
tab_model(m1, m2, m3)
| gpa fy | gpa fy | gpa fy | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 0.00 | -0.30 – 0.30 | 0.990 | 0.70 | 0.45 – 0.95 | <0.001 | 0.01 | -0.29 – 0.31 | 0.961 |
| sat total | 0.02 | 0.02 – 0.03 | <0.001 | 0.02 | 0.02 – 0.03 | <0.001 | |||
| sat verbal | 0.04 | 0.03 – 0.04 | <0.001 | 0.00 | -0.01 – 0.01 | 0.536 | |||
| Observations | 1000 | 1000 | 1000 | ||||||
| R2 / R2 adjusted | 0.212 / 0.211 | 0.161 / 0.160 | 0.212 / 0.211 | ||||||
The starting point is a freshman GPA of 2.11 when verbal SAT scores are at zero. This value is statistically significant (p value; 0.021). For every one point increase in SAT scores there is a corresponding 0.015 increase in freshman GPA. This connection is statistically significant with a confidence interval of 0.008 to 0.022 and a p value below 0.001 emphasizing the role of SAT scores in determining freshman GPA. The R squared value for this model stands at 0.30 with the adjusted R value also at 0.30 indicating that about 30% of the variance in freshman GPA can be explained by verbal SAT scores alone.
In Model 3 which considers Freshman GPA, Total SAT Scores and Verbal SAT Scores together evaluate their combined impact on freshman GPA we observe a base freshman GPA of 1.50 when both predictors are zero; however this finding is not significant (p value; 0.448). Within this model each unit increase in SAT scores corresponds to a marginal increase of 0.011 units, in freshman GPA. The estimated range for this figure falls between 0.008 and 0.014 with a p value that’s statistically significant at less than 0.001. Similarly the coefficient linked to SAT scores stands at 0.010 suggesting that an increase of one unit in verbal SAT scores corresponds to a 0.010 unit rise in freshman GPA. The confidence interval for this relationship is between 0.004 and 0.016 with a significant p value of 0.002 attached to it well.
The model reveals that both total SAT scores and verbal SAT scores maintain their importance as predictors of freshman GPA when considered together. The R squared value for this model is calculated at 0.62 with the adjusted R squared value remaining the same at 0.62 indicating that around 62% of the variation, in freshman GPA can be accounted for by these two predictors combined.
The third model, which has an adjusted R value of 0.62 shows that when both total and verbal SAT scores are combined they offer the strongest explanation accounting for 62% of the variation, in freshman GPA.
Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.
data("HappyPlanetIndex")
World <- HappyPlanetIndex
m4 <- lm(HLY ~ GDPperCapita, data = World)
models_world <- list(m4)
modelsummary(models_world)
| (1) | |
|---|---|
| (Intercept) | 31.182 |
| (1.114) | |
| GDPperCapita | 0.001 |
| (0.000) | |
| Num.Obs. | 141 |
| R2 | 0.566 |
| R2 Adj. | 0.563 |
| AIC | 1043.2 |
| BIC | 1052.0 |
| Log.Lik. | −518.576 |
| RMSE | 9.57 |
tab_model(m1)
| gpa fy | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.00 | -0.30 – 0.30 | 0.990 |
| sat total | 0.02 | 0.02 – 0.03 | <0.001 |
| Observations | 1000 | ||
| R2 / R2 adjusted | 0.212 / 0.211 | ||
Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:
Model 5: Life Expectancy (LifeExpectancy)
Model 6: Ecological Footprint (Footprint)
Model 7: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)
m5 <- lm(Happiness ~ LifeExpectancy, data = World)
m6 <- lm(Happiness ~ Footprint, data = World)
m7 <- lm(Happiness ~ GDPperCapita, data = World)
models_world <- list(m5, m6, m7)
modelsummary(models_world)
| (1) | (2) | (3) | |
|---|---|---|---|
| (Intercept) | −1.104 | 4.713 | 5.020 |
| (0.399) | (0.150) | (0.115) | |
| LifeExpectancy | 0.104 | ||
| (0.006) | |||
| Footprint | 0.419 | ||
| (0.042) | |||
| GDPperCapita | 0.000 | ||
| (0.000) | |||
| Num.Obs. | 143 | 143 | 141 |
| R2 | 0.693 | 0.414 | 0.487 |
| R2 Adj. | 0.691 | 0.410 | 0.483 |
| AIC | 332.7 | 425.0 | 402.4 |
| BIC | 341.6 | 433.9 | 411.2 |
| Log.Lik. | −163.359 | −209.524 | −198.192 |
| RMSE | 0.76 | 1.05 | 0.99 |
modelplot(models_world, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model Coefficients with Confidence Intervals',
caption = 'Comparison of Models 5, 6, and 7') +
theme_minimal()
modelplot(models_world, coef_omit = "Intercept") +
geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
scale_color_brewer(palette = "Dark2") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model Coefficients with Confidence Intervals',
caption = 'Comparison of Models 4, 5, and 6') +
theme_minimal() +
theme(legend.position = "bottom")
plot_model(m7,
type = "std2",
show.values = TRUE,
show.p = FALSE,
ci.lvl = 0.95,
title = "Standardized Coefficients using Gelman Approach",
m.labels = c("Model 5: Happiness ~ LifeExpectancy",
"Model 6: Happiness ~ Footprint",
"Model 7: Happiness ~ GDPperCapita")) +
theme_minimal() +
theme(legend.position = "bottom")
Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.
load("Violence.RData")
str(Violence)
## 'data.frame': 8 obs. of 19 variables:
## $ Country : Factor w/ 8 levels "Austria","Belgium",..: 1 2 4 5 3 6 7 8
## $ Code : Factor w/ 8 levels "AUT","BEL","DOM",..: 1 2 4 5 3 6 7 8
## $ LandArea : num 82450 30280 107160 10830 48320 ...
## $ Population : num 8.34 10.71 13.69 2.69 9.95 ...
## $ Energy : num 33246 58583 8072 4387 8162 ...
## $ Rural : num 32.8 2.6 51.4 46.7 31 39.3 32 18.3
## $ Military : num 2.4 2.7 3.6 1.6 3.8 4.3 7.2 18.6
## $ Health : num 15.8 14.8 15.9 5.7 10.4 10.4 8.6 18.7
## $ HIV : num 0.3 0.2 0.8 1.7 0.9 17.9 1.1 0.6
## $ Internet : num 72.9 70.5 14.3 57.3 20.8 8.6 22.5 75.8
## $ Developed : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 2
## $ BirthRate : num 9.3 11.7 33 16.7 22.5 22 11 14.3
## $ ElderlyPop : num 17 17.2 4.4 7.7 5.9 4.4 15.9 12.6
## $ LifeExpectancy: num 80.2 80.4 70.3 71.8 72.6 51.5 68.3 78.4
## $ CO2 : num 8.12 9.79 0.87 4.54 2.24 ...
## $ GDP : num 45209 43144 2862 5274 5215 ...
## $ Cell : num 146 111.7 125.6 114.8 89.6 ...
## $ Electricity : num 7944 7903 548 1902 1358 ...
## $ MurderRate : num 0.55 1.85 46 59 24.8 36.7 6.2 5.3
model <- lm(MurderRate ~ Internet + GDP, data = Violence)
modelsummary(model)
| (1) | |
|---|---|
| (Intercept) | 28.984 |
| (11.930) | |
| Internet | 0.463 |
| (0.438) | |
| GDP | −0.001 |
| (0.001) | |
| Num.Obs. | 8 |
| R2 | 0.602 |
| R2 Adj. | 0.443 |
| AIC | 72.1 |
| BIC | 72.4 |
| Log.Lik. | −32.059 |
| RMSE | 13.31 |
modelsummary(model, output = "regression_table.docx")
The models shows that Internet Penetration and GDP do not serve as predictors of the murder rate at the 0.05 significance level. The estimation for Internet Penetration implies an association with the murder rate but this association lacks statistical significance. Similarly the estimation for GDP indicates a slight negative correlation with the murder rate, which is also statistically insignificant.
With an R² value of 0.602 it suggests that the model accounts for a portion of variance in the murder rate. However adjusting for predictors leads to a R² value of 0.443 indicating a somewhat reduced explanatory power.
In summary these results suggest that while there may be some connection between Internet Penetration and GDP, with the murder rate there isn’t evidence to conclude that these factors significantly impact the murder rate according to this model.