###Loading Data

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
head(satgpa)
## # A tibble: 6 × 6
##     sex sat_v sat_m sat_sum hs_gpa fy_gpa
##   <int> <int> <int>   <int>  <dbl>  <dbl>
## 1     1    65    62     127   3.4    3.18
## 2     2    58    64     122   4      3.33
## 3     2    56    60     116   3.75   3.25
## 4     1    42    53      95   3.75   2.42
## 5     1    55    52     107   4      2.63
## 6     2    55    56     111   4      2.91
satgpa$sex=as.factor(satgpa$sex)

Section 1:

###Exercise 1: Use hs_gpa to predict fy_gpa. Create a scatter plot and include the fitted linear regression line in the figure.

plot(satgpa$hs_gpa, satgpa$fy_gpa, main = "High School gpa vs First Year gpa", xlab = "hs_gpa", ylab = "fy_gpa", col="thistle")
m1=lm(fy_gpa~hs_gpa, data=satgpa)
abline(m1,col="royalblue")

###Exercise 2: Compare hs_gpa and sat_sum to determine which is the better predictor of fy_gpa.

cor(satgpa$hs_gpa, satgpa$fy_gpa)
## [1] 0.5433535
cor(satgpa$sat_sum, satgpa$fy_gpa)
## [1] 0.460281
m2=lm(satgpa$fy_gpa~satgpa$sat_sum)
summary(m2)
## 
## Call:
## lm(formula = satgpa$fy_gpa ~ satgpa$sat_sum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1976 -0.4495  0.0315  0.4557  1.6115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.001927   0.151991   0.013     0.99    
## satgpa$sat_sum 0.023866   0.001457  16.379   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared:  0.2119, Adjusted R-squared:  0.2111 
## F-statistic: 268.3 on 1 and 998 DF,  p-value: < 2.2e-16
summary(m1)
## 
## Call:
## lm(formula = fy_gpa ~ hs_gpa, data = satgpa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30544 -0.37417  0.03936  0.41912  1.75240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09132    0.11789   0.775    0.439    
## hs_gpa       0.74314    0.03635  20.447   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6222 on 998 degrees of freedom
## Multiple R-squared:  0.2952, Adjusted R-squared:  0.2945 
## F-statistic: 418.1 on 1 and 998 DF,  p-value: < 2.2e-16

###Exercise 2 Response: Based on the correlations and the two simple linear models, high school gpa is the better predictor of first year gpa. First, the correlation between fy_gpa and hs_gpa (0.54) is higher than the correlation between fy_gpa and sat_sum (0.46). Next, the residual standard error (RSE) of the hs_gpa model (0.62) is lower than that of the sat_sum model (0.66), which means the observed fy_gpa values fall closer to the regression line on average. Finally, the adjusted R-squared of the hs_gpa model (0.29) is higher than that of the sat_sum model (0.21), so the hs_gpa model explains a larger share of the variation in fy_gpa. These three comparisons are not independent pieces of evidence: in a regression with a single predictor, R-squared equals the squared correlation, so they necessarily point the same way. Overall, hs_gpa is the better single predictor of fy_gpa, although it still explains only about 30% of the variation.
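The link between the correlations and the R-squared values can be checked directly from the printed output. The sketch below is only an arithmetic check, not part of the original analysis: it squares the two correlations reported by cor() and rounds them the way summary() prints R-squared. The variable names (r_hs, r_sat, r2_hs, r2_sat) are my own.

```r
# Correlations copied from the cor() output above
r_hs  <- 0.5433535   # cor(satgpa$hs_gpa,  satgpa$fy_gpa)
r_sat <- 0.460281    # cor(satgpa$sat_sum, satgpa$fy_gpa)

# In a one-predictor regression, R-squared is the squared correlation.
# Round to 4 decimals to match how summary() prints it.
r2_hs  <- round(r_hs^2, 4)
r2_sat <- round(r_sat^2, 4)

print(c(hs_gpa = r2_hs, sat_sum = r2_sat))
```

The two results agree with the Multiple R-squared lines in summary(m1) (0.2952) and summary(m2) (0.2119), which is why the correlation and R-squared comparisons always rank single predictors in the same order.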

###Exercise 3: Fit a multiple linear model using hs_gpa and sex to predict fy_gpa. Interpret the coefficient for sex and comment on its significance at the 0.05 level. Would you keep sex in the model or use hs_gpa only?

m3=lm(satgpa$fy_gpa~satgpa$hs_gpa+satgpa$sex)
m3$coefficients
##   (Intercept) satgpa$hs_gpa   satgpa$sex2 
##    0.08890031    0.73848992    0.03571299
summary(m3)
## 
## Call:
## lm(formula = satgpa$fy_gpa ~ satgpa$hs_gpa + satgpa$sex)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.28210 -0.37622  0.04143  0.40960  1.72841 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.08890    0.11793   0.754    0.451    
## satgpa$hs_gpa  0.73849    0.03672  20.114   <2e-16 ***
## satgpa$sex2    0.03571    0.03977   0.898    0.369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6223 on 997 degrees of freedom
## Multiple R-squared:  0.2958, Adjusted R-squared:  0.2944 
## F-statistic: 209.4 on 2 and 997 DF,  p-value: < 2.2e-16
summary(m1)
## 
## Call:
## lm(formula = fy_gpa ~ hs_gpa, data = satgpa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.30544 -0.37417  0.03936  0.41912  1.75240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09132    0.11789   0.775    0.439    
## hs_gpa       0.74314    0.03635  20.447   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6222 on 998 degrees of freedom
## Multiple R-squared:  0.2952, Adjusted R-squared:  0.2945 
## F-statistic: 418.1 on 1 and 998 DF,  p-value: < 2.2e-16

###Exercise 3 Response: Test the significance of sex. H0: the coefficient for sex is zero (sex is not a useful predictor of fy_gpa once hs_gpa is in the model). H1: the coefficient for sex is not zero (sex is a useful predictor). Because sex was converted to a factor with levels 1 and 2, its coefficient is not a per-unit slope. The estimate for sex2 (0.036) is the estimated difference in mean first year gpa between students coded sex = 2 and students coded sex = 1 (the reference level) who have the same hs_gpa. The p-value for sex is 0.369, which is above the significance level of 0.05, so we fail to reject the null hypothesis: the data give no evidence that sex is a useful predictor of first year gpa after accounting for hs_gpa. The size of the sex coefficient cannot be compared directly with the hs_gpa slope (0.74), because the two predictors are on different scales, but the model fit tells the same story: adding sex leaves the adjusted R-squared essentially unchanged (0.2944 with sex vs 0.2945 without) and does not lower the residual standard error (0.6223 vs 0.6222). Because sex is not a significant predictor and does not improve the fit, I would not keep it in the model and would use hs_gpa only.
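To show where the t value and p-value for sex come from, here is a minimal sketch that recomputes them from the printed estimate, standard error, and residual degrees of freedom in summary(m3). It also builds a 95% confidence interval, which is an addition of mine and not something the exercise asked for. The variable names are my own, and the inputs are the rounded values from the output, so the results should match only up to rounding.

```r
# Values copied from the satgpa$sex2 row of summary(m3)
est <- 0.03571   # estimated difference (sex = 2 minus sex = 1)
se  <- 0.03977   # standard error of that estimate
df  <- 997       # residual degrees of freedom

# t statistic and two-sided p-value for H0: coefficient = 0
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the coefficient
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, p = p_value), 3))
print(round(ci, 3))
```

The interval contains zero, which is the same conclusion as the p-value being above 0.05.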

###Exercise 4: Use appropriate graphs to perform model diagnostics for 1) constant variability and 2) nearly normal residuals for the MLR with hs_gpa and sex as predictors for predicting fy_gpa.

m4=lm(satgpa$fy_gpa~satgpa$sex+satgpa$hs_gpa)
plot(m4,col="plum1")

library(ggplot2)
ggplot(data=satgpa, aes(x=m4$residuals))+geom_histogram(fill="khaki",col="black")+labs(title="Histogram of Residuals", x="residuals", y="frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

###Exercise 4 Response: The residuals vs fitted plot supports constant variability, because the residuals are spread relatively evenly around zero across the range of fitted values. In the Q-Q plot the points fall roughly along the straight line, which supports nearly normal residuals, though the histogram of the residuals is slightly skewed left.
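plot() on an lm object cycles through several diagnostic plots, but only two of them are needed for this exercise. The sketch below draws just those two side by side with base graphics. diagnostic_plots() is a hypothetical helper I wrote for illustration; it is not part of openintro or base R.

```r
# Hypothetical helper: draws the two diagnostics discussed in Exercise 4
# for any fitted lm object.
diagnostic_plots <- function(model) {
  res <- resid(model)    # residuals
  fit <- fitted(model)   # fitted values

  old_par <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(old_par))             # restore the layout afterwards

  # 1) Constant variability: look for an even band around zero
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, lty = 2)

  # 2) Nearly normal residuals: points should follow the line
  qqnorm(res, main = "Normal Q-Q plot of residuals")
  qqline(res)

  # Return the plotted values invisibly in case they are needed later
  invisible(data.frame(fitted = fit, residual = res))
}
```

With the model fitted above, the call would be diagnostic_plots(m4).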