Due: July 17, 2024 by lecture time
You must submit both the .Rmd (R markdown file) and R pubs published version (with link provided). Copy/paste all the tasks as text and provide the relevant code and text answer, when prompted (e.g., when asked to interpret the model output).
# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "pandoc", "mosaic") # add any you need here
# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "fst" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [6] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [11] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [16] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [6] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [11] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[6]]
## [1] "sjPlot" "broom" "modelsummary" "fst" "lubridate"
## [6] "forcats" "stringr" "dplyr" "purrr" "readr"
## [11] "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [16] "graphics" "grDevices" "utils" "datasets" "methods"
## [21] "base"
##
## [[7]]
## [1] "car" "carData" "sjPlot" "broom" "modelsummary"
## [6] "fst" "lubridate" "forcats" "stringr" "dplyr"
## [11] "purrr" "readr" "tidyr" "tibble" "ggplot2"
## [16] "tidyverse" "stats" "graphics" "grDevices" "utils"
## [21] "datasets" "methods" "base"
##
## [[8]]
## [1] "Lock5Data" "car" "carData" "sjPlot" "broom"
## [6] "modelsummary" "fst" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
##
## [[9]]
## [1] "pandoc" "Lock5Data" "car" "carData" "sjPlot"
## [6] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [11] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [16] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [21] "grDevices" "utils" "datasets" "methods" "base"
##
## [[10]]
## [1] "mosaic" "mosaicData" "ggformula" "Matrix" "lattice"
## [6] "pandoc" "Lock5Data" "car" "carData" "sjPlot"
## [11] "broom" "modelsummary" "fst" "lubridate" "forcats"
## [16] "stringr" "dplyr" "purrr" "readr" "tidyr"
## [21] "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [26] "grDevices" "utils" "datasets" "methods" "base"
Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.
# Load the data
sat_gpa <- read_csv("E:/hw/sat_gpa.csv")
# Calculate the relationship between SAT math scores (sat_math) and freshman GPA (gpa_fy)
cor(sat_gpa$sat_math, sat_gpa$gpa_fy)
## [1] 0.3871178
The output shows a value of 0.38, indicating a moderate positive relationship.
And now let’s visualize:
ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "SAT Math Scores", y = "Freshman GPA")
Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.
model <- lm(gpa_fy ~ sat_verbal+sat_total, data = sat_gpa)
summary(model)
##
## Call:
## lm(formula = gpa_fy ~ sat_verbal + sat_total, data = sat_gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19647 -0.44777 0.02895 0.45717 1.60940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007372 0.152292 0.048 0.961
## sat_verbal 0.002995 0.004835 0.620 0.536
## sat_total 0.022395 0.002786 8.037 2.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6582 on 997 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2106
## F-statistic: 134.2 on 2 and 997 DF, p-value: < 2.2e-16
tab_model(model)
| gpa_fy | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.01 | -0.29 – 0.31 | 0.961 |
| sat verbal | 0.00 | -0.01 – 0.01 | 0.536 |
| sat total | 0.02 | 0.02 – 0.03 | <0.001 |
| Observations | 1000 | ||
| R2 / R2 adjusted | 0.212 / 0.211 | ||
Interpreting the output: Intercept: The intercept in our regression model is estimated to be 0.0074. This value represents the predicted of gpa_fy when all the independnet variables are zero. While this might not have a practical real-world interpretation (since it’s unlikely to have a zero for sat_verval and sat_total), it serves as a baseline for our model. Slope: The coefficient on sat_verbal is 0.0030 which indicates that additional one unit increasing in sat_verbal is associated with increasing in gpa_fy by 0.0030 units on average holding other factors constant. T statistic is 0.620 and p value is 0.536 which is greater than 0.05, the effect of sat_verbal on gpa_fy is not statisitcally significant at 5% level. The coefficient on sat_total is 0.0224 which indicates that additional one unit increasing in sat_total is associated with increasing in gpa_fy by 0.0224 units on average holding other factors constant. T statistic is 8.037 and p value is 0.000 which is less than 0.05, the effect of sat_total on gpa_fy is statisitcally significant at 5% level. R-squared: The model’s R-squared value is 0.2122, meaning that approximately 21.22% of the variability in gpa_fy can be explained by the model. The adjusted R-squared value is slightly lower at 0.2106, which adjusts for the number of predictors in the model. This indicates a reasonably good fit given the simplicity of the model.
Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.
data("HappyPlanetIndex")
World <- HappyPlanetIndex
head(World)
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 1 Albania 7 5.5 76.2 2.2 41.7 47.91 54
## 2 Algeria 3 5.6 71.7 1.7 40.1 51.23 40
## 3 Angola 4 4.3 41.7 0.9 17.8 26.78 130
## 4 Argentina 1 7.1 74.8 2.5 53.4 58.95 15
## 5 Armenia 7 5.0 71.7 1.4 36.1 48.28 48
## 6 Australia 2 7.9 80.9 7.8 63.7 36.64 102
## GDPperCapita HDI Population
## 1 5316 0.801 3.15
## 2 7062 0.733 32.85
## 3 2335 0.446 16.10
## 4 14280 0.869 38.75
## 5 4945 0.775 3.02
## 6 31794 0.962 20.40
model1 <- lm(GDPperCapita ~ HLY, data = World)
summary(model1)
##
## Call:
## lm(formula = GDPperCapita ~ HLY, data = World)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16811 -5641 -869 5252 37343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14499.02 2029.97 -7.142 4.69e-11 ***
## HLY 622.03 46.23 13.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7974 on 139 degrees of freedom
## (因为不存在,2个观察量被删除了)
## Multiple R-squared: 0.5656, Adjusted R-squared: 0.5625
## F-statistic: 181 on 1 and 139 DF, p-value: < 2.2e-16
tab_model(model1)
| GDPperCapita | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -14499.02 | -18512.63 – -10485.42 | <0.001 |
| HLY | 622.03 | 530.62 – 713.44 | <0.001 |
| Observations | 141 | ||
| R2 / R2 adjusted | 0.566 / 0.563 | ||
The intercept is -14499.02 which indicates that GDP per capita is predicted to be -14499.02 when HLY is zero. It does not makes sense. The coefficient on HLY is 622.03 which indicates that addtional one unit increasing in HLY is associated with increasing in GDP per capita by 622.03 units holding other factors constant. The effect of HLY on GDPpercapita is statistically significant at 5% level since p value of 0.000 which is less than 0.05. ## Task 4 Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:
Model 1: Life Expectancy (LifeExpectancy)
Model 2: Ecological Footprint (Footprint)
Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)
Let’s run multiple MLRs and consider the output:
m1 <- lm(Happiness ~ LifeExpectancy, data = World)
m2 <- lm(Happiness ~ Footprint, data = World)
m3 <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
models_world <- list(m1, m2, m3)
Next, let’s generate regression tables for these models using modelsummary.
And using sjPlot:
tab_model(m1, m2, m3)
| Happiness | Happiness | Happiness | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -1.10 | -1.89 – -0.32 | 0.006 | 4.71 | 4.42 – 5.01 | <0.001 | 1.53 | 0.82 – 2.23 | <0.001 |
| LifeExpectancy | 0.10 | 0.09 – 0.11 | <0.001 | ||||||
| Footprint | 0.42 | 0.34 – 0.50 | <0.001 | ||||||
| GDPperCapita | 0.00 | -0.00 – 0.00 | 0.100 | ||||||
| HDI | 5.78 | 4.64 – 6.92 | <0.001 | ||||||
| Population | 0.00 | -0.00 – 0.00 | 0.247 | ||||||
| Observations | 143 | 143 | 141 | ||||||
| R2 / R2 adjusted | 0.693 / 0.691 | 0.414 / 0.410 | 0.707 / 0.700 | ||||||
# Plot model coefficients
modelplot(models_world, coef_omit = "Intercept") +
labs(x = 'Coefficient Estimate',
y = 'Term',
title = 'Model Coefficients with Confidence Intervals',
caption = 'Comparison of Models 1, 2, and 3') +
theme_minimal()
Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.
load("E:/hw/Violence.Rdata")
head(Violence)
## Country Code LandArea Population Energy Rural
## Austria Austria AUT 82450 8.337 33246 32.8
## Belgium Belgium BEL 30280 10.708 58583 2.6
## Guatemala Guatemala GUA 107160 13.686 8072 51.4
## Jamaica Jamaica JAM 10830 2.687 4387 46.7
## Dominican Republic Dominican Republic DOM 48320 9.953 8162 31.0
## South Africa South Africa RSA 1214470 48.793 134489 39.3
## Military Health HIV Internet Developed BirthRate ElderlyPop
## Austria 2.4 15.8 0.3 72.9 2 9.3 17.0
## Belgium 2.7 14.8 0.2 70.5 2 11.7 17.2
## Guatemala 3.6 15.9 0.8 14.3 1 33.0 4.4
## Jamaica 1.6 5.7 1.7 57.3 2 16.7 7.7
## Dominican Republic 3.8 10.4 0.9 20.8 1 22.5 5.9
## South Africa 4.3 10.4 17.9 8.6 2 22.0 4.4
## LifeExpectancy CO2 GDP Cell Electricity
## Austria 80.2 8.1235965 45209.396 145.99132 7944.3892
## Belgium 80.4 9.7927294 43144.343 111.71857 7903.0293
## Guatemala 70.3 0.8702226 2862.367 125.56855 548.1122
## Jamaica 71.8 4.5414469 5274.037 114.83936 1901.6175
## Dominican Republic 72.6 2.2366354 5214.537 89.57889 1358.1914
## South Africa 51.5 8.9332027 7275.344 100.76153 4532.0219
## MurderRate
## Austria 0.55
## Belgium 1.85
## Guatemala 46.00
## Jamaica 59.00
## Dominican Republic 24.80
## South Africa 36.70
m4 <- lm(MurderRate ~ Internet+GDP, data = Violence)
summary(m4)
##
## Call:
## lm(formula = MurderRate ~ Internet + GDP, data = Violence)
##
## Residuals:
## Austria Belgium Guatemala Jamaica
## -2.507 -2.822 14.174 10.453
## Dominican Republic South Africa Ukraine United States
## -6.930 13.338 -29.231 3.526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.9842977 11.9299808 2.430 0.0594 .
## Internet 0.4629043 0.4385000 1.056 0.3394
## GDP -0.0013199 0.0006033 -2.188 0.0803 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.84 on 5 degrees of freedom
## Multiple R-squared: 0.602, Adjusted R-squared: 0.4429
## F-statistic: 3.782 on 2 and 5 DF, p-value: 0.09991
tab_model(m4)
| MurderRate | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 28.98 | -1.68 – 59.65 | 0.059 |
| Internet | 0.46 | -0.66 – 1.59 | 0.339 |
| GDP | -0.00 | -0.00 – 0.00 | 0.080 |
| Observations | 8 | ||
| R2 / R2 adjusted | 0.602 / 0.443 | ||
Intercept: The intercept is 28.9843 which indicates that the murder rate is predicted to be 28.9843% when all the independent variables are zero. It does not makes sense, because the value of zeros for Internet and GDP are out of their range. Slope: The coefficient on Internet is 0.4629 which indicates that additional one unit increasing in Internet is associated with increasing in MurderRate by 0.4629 units on average holding other factors constant. The effect of Internet on MurderRate is not statistically significant at 5% level since p value is greater than 0.05. The coefficient on GDP is -0.0013 which indicates that additional one unit increasing in Internet is associated with decreasing in MurderRate by 0.0013 units on average holding other factors constant. The effect of GDP on MurderRate is not statistically significant at 5% level since p value is greater than 0.05. R-squared: The model’s R-squared value is 0.602, meaning that approximately 60.2% of the variability in MurderRate can be explained by the model. The adjusted R-squared value is slightly lower at 0.4429 , which adjusts for the number of predictors in the model.