Due: July 17, 2024 by lecture time
You must submit both the .Rmd (R markdown file) and R pubs published version (with link provided). Copy/paste all the tasks as text and provide the relevant code and text answer, when prompted (e.g., when asked to interpret the model output).
Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.
#1. Correlation between sat_math and gpa_fy #2. visualize through scatterplot, regression line
# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "pandoc", "mosaic") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
##
## #refugeeswelcome
##
## Loading required package: carData
##
##
## Attaching package: 'car'
##
##
## The following object is masked from 'package:dplyr':
##
## recode
##
##
## The following object is masked from 'package:purrr':
##
## some
##
##
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
##
## Attaching package: 'mosaic'
##
##
## The following object is masked from 'package:Matrix':
##
## mean
##
##
## The following objects are masked from 'package:car':
##
## deltaMethod, logit
##
##
## The following object is masked from 'package:modelsummary':
##
## msummary
##
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
##
## The following object is masked from 'package:purrr':
##
## cross
##
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[6]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[7]]
## [1] "car" "carData" "sjPlot" "broom" "modelsummary"
## [6] "fst" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[8]]
## [1] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [6] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
##
## [[9]]
## [1] "pandoc" "Lock5Data" "car" "carData" "sjPlot"
## [6] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [11] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [16] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [21] "grDevices" "utils" "datasets" "methods" "base"
##
## [[10]]
## [1] "mosaic" "mosaicData" "ggformula" "Matrix" "lattice"
## [6] "pandoc" "Lock5Data" "car" "carData" "sjPlot"
## [11] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [16] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [21] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [26] "grDevices" "utils" "datasets" "methods" "base"
#Load File
sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Correlation between sat_math and gpa_fy
cor(sat_gpa$gpa_fy, sat_gpa$sat_math)
## [1] 0.3871178
#Scatter plot/regression line
ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Math SAT Score", y = "Freshman GPA")
## `geom_smooth()` using formula = 'y ~ x'
Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret. variables: sat_total / gpa_fy
sat_gpa <- read_csv("sat_gpa.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sex
## dbl (5): sat_verbal, sat_math, sat_total, gpa_hs, gpa_fy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(sat_total ~ gpa_fy, data = sat_gpa)
summary(model)
##
## Call:
## lm(formula = sat_total ~ gpa_fy, data = sat_gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.862 -8.670 -0.205 9.249 38.097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.421 1.397 58.30 <2e-16 ***
## gpa_fy 8.877 0.542 16.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.69 on 998 degrees of freedom
## Multiple R-squared: 0.2119, Adjusted R-squared: 0.2111
## F-statistic: 268.3 on 1 and 998 DF, p-value: < 2.2e-16
#modelsummary
modelsummary(model)
| (1) | |
|---|---|
| (Intercept) | 81.421 |
| (1.396) | |
| gpa_fy | 8.877 |
| (0.542) | |
| Num.Obs. | 1000 |
| R2 | 0.212 |
| R2 Adj. | 0.211 |
| AIC | 7923.6 |
| BIC | 7938.3 |
| Log.Lik. | -3958.775 |
| RMSE | 12.68 |
#sjPlot
tab_model(model)
| sat_total | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 81.42 | 78.68 – 84.16 | <0.001 |
| gpa fy | 8.88 | 7.81 – 9.94 | <0.001 |
| Observations | 1000 | ||
| R2 / R2 adjusted | 0.212 / 0.211 | ||
#Interpretation The relationship between SAT scores and freshman GPA shows that the intercept is 81.42 (95% CI: 78.68 - 84.16, p < 0.001), indicating the expected GPA when SAT scores are zero. The slope estimate is 8.88 (95% CI: 7.81 - 9.94, p < 0.001), demonstrating a statistically significant positive relationship: higher SAT scores correlate with higher freshman GPAs. The R-squared value of 0.212 (adjusted R-squared: 0.211) suggests that SAT scores explain about 21.2% of the variance in freshman GPA, indicating a moderate relationship with other factors also influencing GPA.
Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.
data("HappyPlanetIndex")
World <- HappyPlanetIndex
#Scatter Plot
plot(World$HLY, World$GDPperCapita, xlab = "HLY", ylab = "GDPperCapita", pch = 16)
model_2 <- lm(HLY ~ GDPperCapita, data = HappyPlanetIndex)
summary(model_2)
##
## Call:
## lm(formula = HLY ~ GDPperCapita, data = HappyPlanetIndex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8503 -4.9116 0.7084 5.3119 26.2608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.118e+01 1.114e+00 28.00 <2e-16 ***
## GDPperCapita 9.093e-04 6.759e-05 13.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.641 on 139 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5656, Adjusted R-squared: 0.5625
## F-statistic: 181 on 1 and 139 DF, p-value: < 2.2e-16
#modelsummary
modelsummary(model_2)
| (1) | |
|---|---|
| (Intercept) | 31.182 |
| (1.114) | |
| GDPperCapita | 0.001 |
| (0.000) | |
| Num.Obs. | 141 |
| R2 | 0.566 |
| R2 Adj. | 0.563 |
| AIC | 1043.2 |
| BIC | 1052.0 |
| Log.Lik. | -518.576 |
| RMSE | 9.57 |
#sjPlot
tab_model(model_2)
| HLY | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 31.18 | 28.98 – 33.38 | <0.001 |
| GDPperCapita | 0.00 | 0.00 – 0.00 | <0.001 |
| Observations | 141 | ||
| R2 / R2 adjusted | 0.566 / 0.563 | ||
Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:
Model 1: Life Expectancy (LifeExpectancy)
Model 2: Ecological Footprint (Footprint)
Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)
data("HappyPlanetIndex")
World <- HappyPlanetIndex
m4 <- lm(Happiness ~ LifeExpectancy, data = World)
m5 <- lm(Happiness ~ Footprint, data = World)
m6 <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
models_world <- list(m4, m5, m6)
modelsummary(models_world)
| (1) | (2) | (3) | |
|---|---|---|---|
| (Intercept) | -1.104 | 4.713 | 1.528 |
| (0.399) | (0.150) | (0.357) | |
| LifeExpectancy | 0.104 | ||
| (0.006) | |||
| Footprint | 0.419 | ||
| (0.042) | |||
| GDPperCapita | 0.000 | ||
| (0.000) | |||
| HDI | 5.778 | ||
| (0.577) | |||
| Population | 0.001 | ||
| (0.000) | |||
| Num.Obs. | 143 | 143 | 141 |
| R2 | 0.693 | 0.414 | 0.707 |
| R2 Adj. | 0.691 | 0.410 | 0.700 |
| AIC | 332.7 | 425.0 | 327.6 |
| BIC | 341.6 | 433.9 | 342.3 |
| Log.Lik. | -163.359 | -209.524 | -158.779 |
| RMSE | 0.76 | 1.05 | 0.75 |
tab_model(m4, m5, m6)
| Happiness | Happiness | Happiness | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -1.10 | -1.89 – -0.32 | 0.006 | 4.71 | 4.42 – 5.01 | <0.001 | 1.53 | 0.82 – 2.23 | <0.001 |
| LifeExpectancy | 0.10 | 0.09 – 0.11 | <0.001 | ||||||
| Footprint | 0.42 | 0.34 – 0.50 | <0.001 | ||||||
| GDPperCapita | 0.00 | -0.00 – 0.00 | 0.100 | ||||||
| HDI | 5.78 | 4.64 – 6.92 | <0.001 | ||||||
| Population | 0.00 | -0.00 – 0.00 | 0.247 | ||||||
| Observations | 143 | 143 | 141 | ||||||
| R2 / R2 adjusted | 0.693 / 0.691 | 0.414 / 0.410 | 0.707 / 0.700 | ||||||
# Facet the plot by model
modelplot(models_world, coef_omit = "Intercept") +
geom_point(aes(x = estimate, y = term, color = model, shape = model), size = 3) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term, color = model), height = 0.2) +
scale_color_brewer(palette = "Dark2") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model Coefficients with Confidence Intervals',
caption = 'Comparison of Models 4, 5, and 6') +
theme_minimal() +
theme(legend.position = "bottom") +
facet_wrap(~ model, scales = "free_x")
Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.
load("Violence.Rdata")
Violence$Internet
## [1] 72.9 70.5 14.3 57.3 20.8 8.6 22.5 75.8
Violence$GDP
## [1] 45209.396 43144.343 2862.367 5274.037 5214.537 7275.344 3006.915
## [8] 47198.504
Violence$MurderRate
## [1] 0.55 1.85 46.00 59.00 24.80 36.70 6.20 5.30
m1 <- lm(Internet ~ MurderRate, data = Violence)
m2 <- lm(GDP ~ MurderRate, data = Violence)
#modelsummary
models_world_2 <- list(m1, m2)
modelsummary(models_world_2)
| (1) | (2) | |
|---|---|---|
| (Intercept) | 56.427 | 34949.763 |
| (14.262) | (8193.361) | |
| MurderRate | -0.603 | -667.476 |
| (0.462) | (265.319) | |
| Num.Obs. | 8 | 8 |
| R2 | 0.221 | 0.513 |
| R2 Adj. | 0.091 | 0.432 |
| AIC | 79.5 | 181.1 |
| BIC | 79.7 | 181.4 |
| Log.Lik. | -36.732 | -87.560 |
| RMSE | 23.87 | 13711.81 |
#sjPlot
tab_model(m1, m2)
| Internet | GDP | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 56.43 | 21.53 – 91.32 | 0.007 | 34949.76 | 14901.33 – 54998.20 | 0.005 |
| MurderRate | -0.60 | -1.73 – 0.53 | 0.240 | -667.48 | -1316.69 – -18.26 | 0.046 |
| Observations | 8 | 8 | ||||
| R2 / R2 adjusted | 0.221 / 0.091 | 0.513 / 0.432 | ||||
The estimate for internet penetration as a predictor of murder rate is -0.6. This negative estimate suggests that higher internet penetration is associated with a decrease in murder rates, although the effect is relatively small. Internet penetration as predictor of murder rate is not statistically significant as the p value is 0.24, higher than the baseline 0.05 for 95% confidence.
The estimate for GDP as a predictor of murder rate is -667.48. AThis negative estimate suggests that higher GDP is associated with a decrease in murder rates. GDP as a predictor of murder rates is statistically significant as the p value is 0.046, lower than the baseline of 0.05 for 95% confidence.
The number of observations is 8 for both predictors, which is a relatively small sample size which could limit the generalizability of the results.