Problem Set 3

Due: July 17, 2024 by lecture time

You must submit both the .Rmd (R markdown file) and R pubs published version (with link provided). Copy/paste all the tasks as text and provide the relevant code and text answer, when prompted (e.g., when asked to interpret the model output).

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "broom", "sjPlot", "ggplot2", "car", "Lock5Data", "pandoc", "mosaic") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[6]]
##  [1] "sjPlot"       "broom"        "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"        
## 
## [[7]]
##  [1] "car"          "carData"      "sjPlot"       "broom"        "modelsummary"
##  [6] "fst"          "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "Lock5Data"    "car"          "carData"      "sjPlot"       "broom"       
##  [6] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
##  [6] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[10]]
##  [1] "mosaic"       "mosaicData"   "ggformula"    "Matrix"       "lattice"     
##  [6] "pandoc"       "Lock5Data"    "car"          "carData"      "sjPlot"      
## [11] "broom"        "modelsummary" "fst"          "lubridate"    "forcats"     
## [16] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [21] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"

Task 1

Calculate and interpret the correlation between SAT math scores (sat_math) and freshman GPA (gpa_fy). Next, visualize this relationship with a scatterplot, including a regression line.

# Load the data
sat_gpa <- read_csv("E:/hw/sat_gpa.csv")
# Calculate the relationship between SAT math scores (sat_math) and freshman GPA (gpa_fy)
cor(sat_gpa$sat_math, sat_gpa$gpa_fy)
## [1] 0.3871178

The output shows a value of 0.38, indicating a moderate positive relationship.

And now let’s visualize:

ggplot(sat_gpa, aes(x = sat_math, y = gpa_fy)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "SAT Math Scores", y = "Freshman GPA")

Task 2

Create regression tables to analyze the relationship between SAT scores (total and verbal) and freshman GPA, using both the modelsummary and sjPlot packages. Interpret.

model <- lm(gpa_fy ~ sat_verbal+sat_total, data = sat_gpa)
summary(model)
## 
## Call:
## lm(formula = gpa_fy ~ sat_verbal + sat_total, data = sat_gpa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.19647 -0.44777  0.02895  0.45717  1.60940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007372   0.152292   0.048    0.961    
## sat_verbal  0.002995   0.004835   0.620    0.536    
## sat_total   0.022395   0.002786   8.037 2.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6582 on 997 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2106 
## F-statistic: 134.2 on 2 and 997 DF,  p-value: < 2.2e-16
tab_model(model)
  gpa_fy
Predictors Estimates CI p
(Intercept) 0.01 -0.29 – 0.31 0.961
sat verbal 0.00 -0.01 – 0.01 0.536
sat total 0.02 0.02 – 0.03 <0.001
Observations 1000
R2 / R2 adjusted 0.212 / 0.211

Interpreting the output: Intercept: The intercept in our regression model is estimated to be 0.0074. This value represents the predicted of gpa_fy when all the independnet variables are zero. While this might not have a practical real-world interpretation (since it’s unlikely to have a zero for sat_verval and sat_total), it serves as a baseline for our model. Slope: The coefficient on sat_verbal is 0.0030 which indicates that additional one unit increasing in sat_verbal is associated with increasing in gpa_fy by 0.0030 units on average holding other factors constant. T statistic is 0.620 and p value is 0.536 which is greater than 0.05, the effect of sat_verbal on gpa_fy is not statisitcally significant at 5% level. The coefficient on sat_total is 0.0224 which indicates that additional one unit increasing in sat_total is associated with increasing in gpa_fy by 0.0224 units on average holding other factors constant. T statistic is 8.037 and p value is 0.000 which is less than 0.05, the effect of sat_total on gpa_fy is statisitcally significant at 5% level. R-squared: The model’s R-squared value is 0.2122, meaning that approximately 21.22% of the variability in gpa_fy can be explained by the model. The adjusted R-squared value is slightly lower at 0.2106, which adjusts for the number of predictors in the model. This indicates a reasonably good fit given the simplicity of the model.

Task 3

Explore the relationship between Happy Life Years (HLY) and GDP per Capita using the HappyPlanetIndex dataset. Visualize the coefficients using both the modelsummary and Sjplot packages.

data("HappyPlanetIndex")
World <- HappyPlanetIndex
head(World)
##     Country Region Happiness LifeExpectancy Footprint  HLY   HPI HPIRank
## 1   Albania      7       5.5           76.2       2.2 41.7 47.91      54
## 2   Algeria      3       5.6           71.7       1.7 40.1 51.23      40
## 3    Angola      4       4.3           41.7       0.9 17.8 26.78     130
## 4 Argentina      1       7.1           74.8       2.5 53.4 58.95      15
## 5   Armenia      7       5.0           71.7       1.4 36.1 48.28      48
## 6 Australia      2       7.9           80.9       7.8 63.7 36.64     102
##   GDPperCapita   HDI Population
## 1         5316 0.801       3.15
## 2         7062 0.733      32.85
## 3         2335 0.446      16.10
## 4        14280 0.869      38.75
## 5         4945 0.775       3.02
## 6        31794 0.962      20.40
model1 <- lm(GDPperCapita ~ HLY, data = World)
summary(model1)
## 
## Call:
## lm(formula = GDPperCapita ~ HLY, data = World)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16811  -5641   -869   5252  37343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14499.02    2029.97  -7.142 4.69e-11 ***
## HLY            622.03      46.23  13.454  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7974 on 139 degrees of freedom
##   (因为不存在,2个观察量被删除了)
## Multiple R-squared:  0.5656, Adjusted R-squared:  0.5625 
## F-statistic:   181 on 1 and 139 DF,  p-value: < 2.2e-16
tab_model(model1)
  GDPperCapita
Predictors Estimates CI p
(Intercept) -14499.02 -18512.63 – -10485.42 <0.001
HLY 622.03 530.62 – 713.44 <0.001
Observations 141
R2 / R2 adjusted 0.566 / 0.563

The intercept is -14499.02 which indicates that GDP per capita is predicted to be -14499.02 when HLY is zero. It does not makes sense. The coefficient on HLY is 622.03 which indicates that addtional one unit increasing in HLY is associated with increasing in GDP per capita by 622.03 units holding other factors constant. The effect of HLY on GDPpercapita is statistically significant at 5% level since p value of 0.000 which is less than 0.05. ## Task 4 Visualize the coefficients of the models specified below using both modelplot from the modelsummary package and plot_model from the sjPlot package. Fit three models to predict happiness (Happiness) using different predictors:

Model 1: Life Expectancy (LifeExpectancy)

Model 2: Ecological Footprint (Footprint)

Model 3: GDP per Capita (GDPperCapita), Human Development Index (HDI), and Population (Population)

Let’s run multiple MLRs and consider the output:

m1 <- lm(Happiness ~ LifeExpectancy, data = World)
m2 <- lm(Happiness ~ Footprint, data = World)
m3 <- lm(Happiness ~ GDPperCapita + HDI + Population, data = World)
models_world <- list(m1, m2, m3)

Next, let’s generate regression tables for these models using modelsummary.

And using sjPlot:

tab_model(m1, m2, m3)
  Happiness Happiness Happiness
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) -1.10 -1.89 – -0.32 0.006 4.71 4.42 – 5.01 <0.001 1.53 0.82 – 2.23 <0.001
LifeExpectancy 0.10 0.09 – 0.11 <0.001
Footprint 0.42 0.34 – 0.50 <0.001
GDPperCapita 0.00 -0.00 – 0.00 0.100
HDI 5.78 4.64 – 6.92 <0.001
Population 0.00 -0.00 – 0.00 0.247
Observations 143 143 141
R2 / R2 adjusted 0.693 / 0.691 0.414 / 0.410 0.707 / 0.700
# Plot model coefficients

modelplot(models_world, coef_omit = "Intercept") +
    labs(x = 'Coefficient Estimate',
         y = 'Term',
         title = 'Model Coefficients with Confidence Intervals',
         caption = 'Comparison of Models 1, 2, and 3') +
    theme_minimal()

Task 5

Predict murder rates using both internet penetration and GDP, from the Violence dataset. Create regression tables for the model using both the modelsummary and sjPlot packages. Interpret.

load("E:/hw/Violence.Rdata")
head(Violence)
##                               Country Code LandArea Population Energy Rural
## Austria                       Austria  AUT    82450      8.337  33246  32.8
## Belgium                       Belgium  BEL    30280     10.708  58583   2.6
## Guatemala                   Guatemala  GUA   107160     13.686   8072  51.4
## Jamaica                       Jamaica  JAM    10830      2.687   4387  46.7
## Dominican Republic Dominican Republic  DOM    48320      9.953   8162  31.0
## South Africa             South Africa  RSA  1214470     48.793 134489  39.3
##                    Military Health  HIV Internet Developed BirthRate ElderlyPop
## Austria                 2.4   15.8  0.3     72.9         2       9.3       17.0
## Belgium                 2.7   14.8  0.2     70.5         2      11.7       17.2
## Guatemala               3.6   15.9  0.8     14.3         1      33.0        4.4
## Jamaica                 1.6    5.7  1.7     57.3         2      16.7        7.7
## Dominican Republic      3.8   10.4  0.9     20.8         1      22.5        5.9
## South Africa            4.3   10.4 17.9      8.6         2      22.0        4.4
##                    LifeExpectancy       CO2       GDP      Cell Electricity
## Austria                      80.2 8.1235965 45209.396 145.99132   7944.3892
## Belgium                      80.4 9.7927294 43144.343 111.71857   7903.0293
## Guatemala                    70.3 0.8702226  2862.367 125.56855    548.1122
## Jamaica                      71.8 4.5414469  5274.037 114.83936   1901.6175
## Dominican Republic           72.6 2.2366354  5214.537  89.57889   1358.1914
## South Africa                 51.5 8.9332027  7275.344 100.76153   4532.0219
##                    MurderRate
## Austria                  0.55
## Belgium                  1.85
## Guatemala               46.00
## Jamaica                 59.00
## Dominican Republic      24.80
## South Africa            36.70
m4 <- lm(MurderRate ~ Internet+GDP, data = Violence)
summary(m4)
## 
## Call:
## lm(formula = MurderRate ~ Internet + GDP, data = Violence)
## 
## Residuals:
##            Austria            Belgium          Guatemala            Jamaica 
##             -2.507             -2.822             14.174             10.453 
## Dominican Republic       South Africa            Ukraine      United States 
##             -6.930             13.338            -29.231              3.526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 28.9842977 11.9299808   2.430   0.0594 .
## Internet     0.4629043  0.4385000   1.056   0.3394  
## GDP         -0.0013199  0.0006033  -2.188   0.0803 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.84 on 5 degrees of freedom
## Multiple R-squared:  0.602,  Adjusted R-squared:  0.4429 
## F-statistic: 3.782 on 2 and 5 DF,  p-value: 0.09991
tab_model(m4)
  MurderRate
Predictors Estimates CI p
(Intercept) 28.98 -1.68 – 59.65 0.059
Internet 0.46 -0.66 – 1.59 0.339
GDP -0.00 -0.00 – 0.00 0.080
Observations 8
R2 / R2 adjusted 0.602 / 0.443

Intercept: The intercept is 28.9843 which indicates that the murder rate is predicted to be 28.9843% when all the independent variables are zero. It does not makes sense, because the value of zeros for Internet and GDP are out of their range. Slope: The coefficient on Internet is 0.4629 which indicates that additional one unit increasing in Internet is associated with increasing in MurderRate by 0.4629 units on average holding other factors constant. The effect of Internet on MurderRate is not statistically significant at 5% level since p value is greater than 0.05. The coefficient on GDP is -0.0013 which indicates that additional one unit increasing in Internet is associated with decreasing in MurderRate by 0.0013 units on average holding other factors constant. The effect of GDP on MurderRate is not statistically significant at 5% level since p value is greater than 0.05. R-squared: The model’s R-squared value is 0.602, meaning that approximately 60.2% of the variability in MurderRate can be explained by the model. The adjusted R-squared value is slightly lower at 0.4429 , which adjusts for the number of predictors in the model.