\(~\)

Regression

\(~\)

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

# Load Libraries
library(openintro)
library(ggplot2)
library(tidyverse)
# To check what additional datasets are available from the openintro
# package, run the following code:

#data(package = "openintro")

\(~\)

For this assignment we will be using the gpa_study_hours dataset. The columns represent the variables gpa and study_hours for a sample of 193 undergraduate students who took an introductory statistics course in 2012 at a private US university. GPA ranges from 0 to 4 points; however, one student reported a GPA > 4. This is a data error, but the observation has been left in the dataset because it is useful for illustrating issues with real survey data. Both variables are self-reported and hence may not be accurate.

\(~\)

# Loading dataset
data('gpa_study_hours', package = 'openintro') 

\(~\)

Description of Data Set

\(~\)

A data frame with 193 rows and 2 columns:

gpa Grade point average (GPA) of student.
study_hours Number of hours students study per week.

\(~\)

Let's check out the data set.

# glimpse of data
glimpse(gpa_study_hours)
## Rows: 193
## Columns: 2
## $ gpa         <dbl> 4.000, 3.800, 3.930, 3.400, 3.200, 3.520, 3.680, 3.400, 3.…
## $ study_hours <dbl> 10, 25, 45, 10, 4, 10, 24, 40, 10, 10, 30, 7, 15, 60, 10, …
# summary of data
summary(gpa_study_hours)
##       gpa         study_hours   
##  Min.   :2.600   Min.   : 2.00  
##  1st Qu.:3.400   1st Qu.:10.00  
##  Median :3.620   Median :15.00  
##  Mean   :3.586   Mean   :17.48  
##  3rd Qu.:3.800   3rd Qu.:20.00  
##  Max.   :4.300   Max.   :69.00
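
The summary confirms the maximum GPA of 4.3 mentioned in the dataset description. As a quick sketch (using dplyr's filter, loaded with the tidyverse), we can locate that erroneous record directly:

# Locate the data-entry error (GPA reported above the 4.0 maximum)
gpa_study_hours %>%
    filter(gpa > 4)
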
gpa_study_hours %>%
    ggplot(aes(x = gpa, y = study_hours)) +
    geom_point() +
    xlab("GPA") +
    ylab("Hours Studying") +
    ggtitle("Intro to Stats Class")
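
Before fitting a model, it can help to quantify the strength of the association seen in the scatterplot. A minimal sketch using base R's cor():

# Pearson correlation between GPA and weekly study hours
cor(gpa_study_hours$gpa, gpa_study_hours$study_hours)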

# Creating a box plot to visualize potential outliers in the data
boxplot(gpa~study_hours, data = gpa_study_hours, main = "Intro to Stats Class",
   xlab = "Study Hours", ylab = "GPA")

Note the 3 outliers in the box plot
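
As a rough companion check, we can apply the usual 1.5 × IQR rule to study_hours and list the observations that would be flagged as outlying (a sketch; other thresholds are possible):

# Flag study_hours values beyond 1.5 * IQR above the third quartile
upper_fence <- quantile(gpa_study_hours$study_hours, 0.75) + 1.5 * IQR(gpa_study_hours$study_hours)
gpa_study_hours %>%
    filter(study_hours > upper_fence)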

\(~\)

# Obtaining our Linear Model
my_lm <- lm(gpa_study_hours$gpa ~ gpa_study_hours$study_hours)
summary(my_lm)
## 
## Call:
## lm(formula = gpa_study_hours$gpa ~ gpa_study_hours$study_hours)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95130 -0.19456  0.03879  0.21708  0.73872 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 3.527997   0.037424  94.272   <2e-16 ***
## gpa_study_hours$study_hours 0.003328   0.001794   1.855   0.0652 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2837 on 191 degrees of freedom
## Multiple R-squared:  0.01769,    Adjusted R-squared:  0.01255 
## F-statistic:  3.44 on 1 and 191 DF,  p-value: 0.06517
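
To make the fitted model easier to read, here is a short sketch extracting the coefficients and their confidence intervals. The slope of roughly 0.0033 means each additional weekly study hour is associated with about a 0.003-point higher GPA, and with p ≈ 0.065 the relationship is not significant at the 5% level.

# Fitted equation: gpa_hat = 3.528 + 0.0033 * study_hours
coef(my_lm)
confint(my_lm, level = 0.95)
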
# Plotting study_hours vs. gpa with the fitted regression line and residual segments
gpa_study_hours %>%
  mutate(fitted = fitted(my_lm)) %>%
  ggplot(aes(study_hours, gpa)) +
  geom_point() +
  stat_smooth(method = lm, se = TRUE) +
  geom_segment(aes(xend = study_hours, yend = fitted), color = "blue", size = 0.3)
## `geom_smooth()` using formula 'y ~ x'

hist(resid(my_lm), xlab = "", main = "Histogram of Residuals")
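
The histogram gives an informal view of the residuals' distribution; a complementary sketch is a Shapiro-Wilk normality test (the same test applied in Professor Fulton's code further down):

# Formal normality check of the residuals
shapiro.test(resid(my_lm))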

Regression Diagnostic Plots

# Arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(my_lm)

\(~\)

The diagnostic plots show the residuals in four different ways:

Residuals vs Fitted. Used to check the linearity assumption. A roughly horizontal line without distinct patterns indicates a linear relationship, which is what we want.
Normal Q-Q. Used to examine whether the residuals are normally distributed. Ideally, the residual points follow the straight dashed line.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot is discussed further in the conclusion below; a short sketch computing leverage and Cook's distance directly follows this list.
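
The quantities behind the Residuals vs Leverage plot can also be computed directly. A minimal sketch, using base R's hatvalues() and cooks.distance():

# Leverage (hat values) and Cook's distance for each observation
lev <- hatvalues(my_lm)
cd  <- cooks.distance(my_lm)
# Observations with the highest leverage and the largest Cook's distance
which.max(lev)
which.max(cd)
# A common rule of thumb flags observations with Cook's distance above 4/n
which(cd > 4 / nrow(gpa_study_hours))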

\(~\)

Conclusion:

\(~\)

There are four assumptions associated with a linear regression model:

Linearity Assumption:
Based on the Residuals vs. Fitted plot, the red line stays approximately at zero, suggesting that the linearity assumption holds and that there is a roughly linear relationship between the two variables.
Homogeneity of Variance Assumption:
Viewing the Scale-Location plot, the residual points are not spread evenly; instead, the line appears to decrease across the fitted values, suggesting heteroscedasticity (a formal check is sketched after this list).
Normality of Residuals:
Based on the Q-Q plot, a few points deviate from the line in the tails, but overall the residuals appear approximately normally distributed.
Outliers and High Leverage Points Assumption:
The presence of outliers can affect the interpretation of the model, and high leverage points are observations with extreme predictor (x) values.
Before examining the Residuals vs. Leverage plot, recall that one student reported a GPA > 4 and that this error was left in the data. We can see the Cook's distance lines, and observation 109 is close to the boundary, but no points fall beyond it, so there are no influential points in our model.
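
As noted in the homogeneity-of-variance item above, the visual impression of heteroscedasticity can be checked formally. A sketch using the Breusch-Pagan test from lmtest and the non-constant variance score test from car (the same tests used in Professor Fulton's code below), assuming both packages are installed:

# Formal tests for non-constant variance of the residuals
library(lmtest)
library(car)
bptest(my_lm)   # Breusch-Pagan test
ncvTest(my_lm)  # non-constant variance score test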

\(~\)

Was the linear model appropriate? Why or why not?

  • Based on the diagnostic plots I would say yes, the linear model was appropriate: the residuals are approximately normal, the relationship appears roughly linear, and there are no influential points. The main caveats are the mild heteroscedasticity and the very small R-squared (about 0.018, p ≈ 0.065), which indicate that study hours explain little of the variation in GPA.

\(~\)

Professor Fulton’s Code:

require(car)
require(lmtest)
require(Hmisc)
require(ResourceSelection)

#sample data
enroll = c(14.55,22.18,61.69,38.02,89.29,67.57,49.16,43.74,26.37,24.74,11.57,19.61)
expend = c(49.53,64.05,138.08,341.35,234.38,164.16,244.08,117.93,110.89,57.74,72.88,89.08)
rwp = c(8.02,9.19,23.32,108.18,45.11,15.35,55.24,17.33,9.59,12.79,7.84,13.26)

#specify model, data, and level
model = expend ~ enroll + rwp
data = cbind(enroll, expend, rwp)
level = .95

#function
regressit = function(model, data, level) # model is the formula relating the variables, data is the data matrix, level is the confidence level
    {
    kdepairs(data) # scatterplot matrix
    myreg = lm(model, y = TRUE)    # run regression
    confint(myreg, level = level) # confidence intervals
    r1 = summary(myreg) # regression summary
    r2 = anova(myreg) # ANOVA table for regression
    r3 = coefficients(myreg) # coefficients
    r4 = myreg$residuals # residuals
    r5 = shapiro.test(r4) # test for normality of residuals
    r6 = bptest(myreg)  # Breusch-Pagan-Godfrey test for homoscedasticity, lmtest package
    r7a = ncvTest(myreg) # non-constant variance test for homoscedasticity, car package
    r7 = dwtest(myreg)  # Durbin-Watson test for independence of residuals, lmtest package
    r8 = vif(myreg)  # check collinearity, car package
    me = mean(myreg$residuals)
    mad = mean(abs(myreg$residuals))
    mse = mean(myreg$residuals^2)
    rmse = sqrt(mse)
    mpe = 100*mean(myreg$residuals / myreg$y)
    mape = 100*mean(abs(myreg$residuals) / myreg$y)
    aic = AIC(myreg)
    bic = BIC(myreg)
    all = c(me, mad, rmse, mpe, mape, aic, bic)
    names(all) = c("ME","MAD","RMSE","MPE","MAPE", "AIC","BIC")
    mybarplot = barplot(all)
    mylist = list(r1,r2,r3,r4,r5,r6,r7,r7a,r8,all)
    return(mylist)
    }

regressit(model,data,level)

## [[1]]
## 
## Call:
## lm(formula = model, y = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.796 -12.088  -3.396  14.956  29.608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5815    11.6548   2.281  0.04850 *  
## enroll        1.1505     0.2673   4.304  0.00198 ** 
## rwp           2.5404     0.2164  11.737  9.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 9 degrees of freedom
## Multiple R-squared:  0.9597, Adjusted R-squared:  0.9507 
## F-statistic: 107.1 on 2 and 9 DF,  p-value: 5.3e-07
## 
## 
## [[2]]
## Analysis of Variance Table
## 
## Response: expend
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## enroll     1  30854   30854  76.529 1.076e-05 ***
## rwp        1  55537   55537 137.750 9.300e-07 ***
## Residuals  9   3629     403                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [[3]]
## (Intercept)      enroll         rwp 
##   26.581450    1.150469    2.540402 
## 
## [[4]]
##          1          2          3          4          5          6          7 
## -14.164793 -11.395139 -18.716035  -3.792960  -9.524329  20.846214  20.609704 
##          8          9         10         11         12 
##  -2.998115  29.608237 -29.795786  13.070876   6.252129 
## 
## [[5]]
## 
##  Shapiro-Wilk normality test
## 
## data:  r4
## W = 0.96592, p-value = 0.8637
## 
## 
## [[6]]
## 
##  studentized Breusch-Pagan test
## 
## data:  myreg
## BP = 1.0515, df = 2, p-value = 0.5911
## 
## 
## [[7]]
## 
##  Durbin-Watson test
## 
## data:  myreg
## DW = 2.2799, p-value = 0.6346
## alternative hypothesis: true autocorrelation is greater than 0
## 
## 
## [[8]]
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.4461112, Df = 1, p = 0.50419
## 
## [[9]]
##  enroll     rwp 
## 1.12679 1.12679 
## 
## [[10]]
##            ME           MAD          RMSE           MPE          MAPE 
## -7.033219e-16  1.506453e+01  1.738909e+01 -3.872326e+00  1.600507e+01 
##           AIC           BIC 
##  1.105948e+02  1.125344e+02
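
For completeness, parts of the same battery of checks can be applied to the GPA model fitted earlier; a brief sketch, relying on lmtest having been loaded above (vif() is omitted because that model has only one predictor):

# Durbin-Watson test for independence of the GPA model's residuals
dwtest(my_lm)
# Information criteria for the GPA model
AIC(my_lm)
BIC(my_lm)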