###Loading Data
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
head(satgpa)
## # A tibble: 6 × 6
##     sex sat_v sat_m sat_sum hs_gpa fy_gpa
##   <int> <int> <int>   <int>  <dbl>  <dbl>
## 1     1    65    62     127   3.4    3.18
## 2     2    58    64     122   4      3.33
## 3     2    56    60     116   3.75   3.25
## 4     1    42    53      95   3.75   2.42
## 5     1    55    52     107   4      2.63
## 6     2    55    56     111   4      2.91
satgpa$sex=as.factor(satgpa$sex)
###Exercise 1: Use hs_gpa to predict fy_gpa. Create a scatter plot and include the fitted linear regression line in the figure.
plot(satgpa$hs_gpa, satgpa$fy_gpa, main = "High School gpa vs First Year gpa", xlab = "hs_gpa", ylab = "fy_gpa", col="thistle")
m1=lm(fy_gpa~hs_gpa, data=satgpa)
abline(m1,col="royalblue")
cor(satgpa$hs_gpa, satgpa$fy_gpa)
## [1] 0.5433535
cor(satgpa$sat_sum, satgpa$fy_gpa)
## [1] 0.460281
m2=lm(satgpa$fy_gpa~satgpa$sat_sum)
summary(m2)
##
## Call:
## lm(formula = satgpa$fy_gpa ~ satgpa$sat_sum)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.1976 -0.4495  0.0315  0.4557  1.6115
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.001927   0.151991   0.013     0.99
## satgpa$sat_sum 0.023866   0.001457  16.379   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared: 0.2119, Adjusted R-squared: 0.2111
## F-statistic: 268.3 on 1 and 998 DF, p-value: < 2.2e-16
summary(m1)
##
## Call:
## lm(formula = fy_gpa ~ hs_gpa, data = satgpa)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30544 -0.37417  0.03936  0.41912  1.75240
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.09132    0.11789   0.775    0.439
## hs_gpa       0.74314    0.03635  20.447   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6222 on 998 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2945
## F-statistic: 418.1 on 1 and 998 DF, p-value: < 2.2e-16
###Exercise 2 Response: Based on the correlations and the two simple linear models, high school GPA (hs_gpa) is the better single predictor of first-year GPA (fy_gpa). First, the correlation between fy_gpa and hs_gpa (0.54) is stronger than the correlation between fy_gpa and sat_sum (0.46). Second, the residual standard error (RSE) of the hs_gpa model (0.62) is lower than that of the sat_sum model (0.66), so the observed fy_gpa values fall closer to the fitted line on average. Third, the adjusted R-squared of the hs_gpa model (0.29) is higher than that of the sat_sum model (0.21), so hs_gpa explains a larger share of the variation in fy_gpa. These three comparisons are not independent pieces of evidence: in a simple linear regression, R-squared is the square of the correlation (0.5434^2 ≈ 0.295 and 0.4603^2 ≈ 0.212), and for the same response and sample the RSE falls as R-squared rises, so they necessarily agree. Overall, hs_gpa is a better predictor of fy_gpa than sat_sum.
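As a quick check on this comparison, the sketch below refits both simple models and collects the statistics side by side. It also confirms that each model's R-squared equals the squared correlation between fy_gpa and its predictor. The names `fit_hs`, `fit_sat`, `fit_stats`, and `comparison` are introduced here for illustration only; the fits are meant to match m1 and m2.

```r
library(openintro)  # provides the satgpa data used above

# Refit the two simple models (same fits as m1 and m2; names are illustrative)
fit_hs  <- lm(fy_gpa ~ hs_gpa,  data = satgpa)
fit_sat <- lm(fy_gpa ~ sat_sum, data = satgpa)

# Hypothetical helper: collect the statistics compared in the response
fit_stats <- function(fit, x) {
  s <- summary(fit)
  c(r         = cor(x, satgpa$fy_gpa),
    r_squared = s$r.squared,
    adj_r_sq  = s$adj.r.squared,
    rse       = s$sigma)
}

comparison <- rbind(
  hs_gpa  = fit_stats(fit_hs,  satgpa$hs_gpa),
  sat_sum = fit_stats(fit_sat, satgpa$sat_sum)
)
print(round(comparison, 4))

# In simple linear regression, R-squared is the squared correlation
print(all.equal(unname(comparison[, "r"]^2), unname(comparison[, "r_squared"])))
```

Putting both models in one table makes it easier to see that the correlation, R-squared, and RSE all point the same way.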
###Exercise 3: Fit a multiple linear model using hs_gpa and sex to predict fy_gpa. Interpret the slope for the predictor variable related to sex and comment on its significance at the 0.05 level. Would you keep sex in the model or use hs_gpa only?
m3=lm(satgpa$fy_gpa~satgpa$hs_gpa+satgpa$sex)
m3$coefficients
##   (Intercept) satgpa$hs_gpa   satgpa$sex2
##    0.08890031    0.73848992    0.03571299
summary(m3)
##
## Call:
## lm(formula = satgpa$fy_gpa ~ satgpa$hs_gpa + satgpa$sex)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.28210 -0.37622  0.04143  0.40960  1.72841
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.08890    0.11793   0.754    0.451
## satgpa$hs_gpa  0.73849    0.03672  20.114   <2e-16 ***
## satgpa$sex2    0.03571    0.03977   0.898    0.369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6223 on 997 degrees of freedom
## Multiple R-squared: 0.2958, Adjusted R-squared: 0.2944
## F-statistic: 209.4 on 2 and 997 DF, p-value: < 2.2e-16
summary(m1)
##
## Call:
## lm(formula = fy_gpa ~ hs_gpa, data = satgpa)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.30544 -0.37417  0.03936  0.41912  1.75240
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.09132    0.11789   0.775    0.439
## hs_gpa       0.74314    0.03635  20.447   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6222 on 998 degrees of freedom
## Multiple R-squared: 0.2952, Adjusted R-squared: 0.2945
## F-statistic: 418.1 on 1 and 998 DF, p-value: < 2.2e-16
###Exercise 3 Response: Test the significance of sex. H0: the coefficient of sex is zero (sex is not a useful predictor of fy_gpa once hs_gpa is in the model). H1: the coefficient of sex is not zero. Because sex is a factor, its slope (0.036) is not a per-unit change: it is the estimated difference in mean first-year GPA between students coded sex = 2 and students coded sex = 1 (the reference level) who have the same hs_gpa. The p-value for sex is 0.369, which is above the 0.05 significance level, so we fail to reject the null hypothesis: there is no evidence that sex is a useful predictor of first-year GPA after accounting for hs_gpa. The estimated difference is also small in practical terms (about 0.04 grade points), and adding sex does not improve the fit: adjusted R-squared is 0.2944 with sex versus 0.2945 for hs_gpa alone. For these reasons I would not keep sex in the model and would use hs_gpa only.
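One way to double-check this decision is to compare the two nested models directly. The following is a minimal sketch, assuming the same satgpa data; `gpa`, `reduced`, `full`, `f_test`, and `sex_ci` are illustrative names, with `reduced` and `full` intended to match m1 and m3. For a single added coefficient, the partial F-test gives the same p-value as the t-test in the summary table.

```r
library(openintro)  # provides the satgpa data used above

# Work on a local copy with sex stored as a factor, as in the analysis above
gpa <- as.data.frame(satgpa)
gpa$sex <- as.factor(gpa$sex)

reduced <- lm(fy_gpa ~ hs_gpa,       data = gpa)  # hs_gpa only
full    <- lm(fy_gpa ~ hs_gpa + sex, data = gpa)  # hs_gpa and sex

# Partial F-test for the one added coefficient (sex2)
f_test <- anova(reduced, full)
print(f_test)

# 95% confidence interval for the sex2 coefficient
sex_ci <- confint(full, "sex2", level = 0.95)
print(sex_ci)
```

Because the p-value for sex is above 0.05, the 95% interval should contain zero, which is consistent with dropping sex from the model.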
###Exercise 4: Use appropriate graphs to perform model diagnostics for 1) constant variability and 2) nearly normal residuals for the MLR with hs_gpa and sex as predictors for predicting fy_gpa.
m4=lm(satgpa$fy_gpa~satgpa$sex+satgpa$hs_gpa)
plot(m4,col="plum1")
library(ggplot2)
ggplot(data=satgpa, aes(x=m4$residuals))+geom_histogram(fill="khaki",col="black")+labs(title="Histogram of Residuals", x="residuals", y="frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
###Exercise 4 Response: The residuals vs. fitted plot supports constant
variability: the residuals are scattered around zero with a roughly even
vertical spread across the range of fitted values, which indicates
roughly constant variance. In the Q-Q plot the points fall roughly along
the straight line, which supports nearly normal residuals, though the
histogram of the residuals is slightly skewed left.
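
The two diagnostics discussed here can also be requested individually. The sketch below is one possible way to do it, assuming the same satgpa data; `gpa`, `fit`, `diag_df`, and `p` are illustrative names, with `fit` intended to match m4. It draws only the residuals vs. fitted and Q-Q plots, and builds the histogram from a dedicated data frame with an explicit bin count, which avoids the `stat_bin()` message.

```r
library(openintro)  # provides the satgpa data used above
library(ggplot2)

# Local copy with sex stored as a factor, as in the analysis above
gpa <- as.data.frame(satgpa)
gpa$sex <- as.factor(gpa$sex)
fit <- lm(fy_gpa ~ sex + hs_gpa, data = gpa)  # same model as m4; name is illustrative

# Residuals vs. fitted (which = 1) and normal Q-Q plot (which = 2)
plot(fit, which = 1:2)

# Histogram of residuals from its own data frame, with the bin count set explicitly
diag_df <- data.frame(fitted = fitted(fit), residual = resid(fit))
p <- ggplot(diag_df, aes(x = residual)) +
  geom_histogram(bins = 30, fill = "khaki", colour = "black") +
  labs(title = "Histogram of Residuals", x = "residuals", y = "frequency")
print(p)
```

The choice of 30 bins simply mirrors the default used above; a different `bins` or `binwidth` value may show the left skew more or less clearly.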