# Set the working directory
setwd("C:/Users/mws16/OneDrive/Desktop/OPIM-5603-Statistics in Business Analytics/Homework 4")

Part I

Q1: Categorical Predictor Variables in Regression

Use the BirthSmokers.csv

The BirthSmokers dataset is a random sample of 32 births from a study conducted on pregnant women. The dataset has 3 variables, Wgt, Gest and Smoke, giving the birth weight of the baby in grams, the length of gestation in weeks, and the smoking status of the mother (yes/no).

  1. Use smoke and gest variables to predict the birth weight of the baby (Run a linear regression)
# read in the file
birthsmokers <- read.csv("BirthSmokers.csv")
# Build a linear regression
fit1a <- lm(Wgt ~ .,
            data = birthsmokers)
summary(fit1a)
## 
## Call:
## lm(formula = Wgt ~ ., data = birthsmokers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2389.573    349.206  -6.843 1.63e-07 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smokeyes     -244.544     41.982  -5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15
# Save predicted values to data table
birthsmokers$Preds <- fit1a$fitted.values
birthsmokers
  2. Plot the estimated regression function (Residuals vs Fitted)
# Plot the regression function
plot(y = fit1a$residuals,
     x = fit1a$fitted.values,
     main = "Residuals vs Fitted values",
     xlab = "Fitted values",
     ylab = "Residuals")
abline(h = 0)  # horizontal reference line at zero residual
# add the regression line
abline(lm(fit1a$residuals ~ fit1a$fitted.values),
       col = "red")

  3. Determine the regression equation for smoking mothers and non-smoking mothers
summary(fit1a)
## 
## Call:
## lm(formula = Wgt ~ ., data = birthsmokers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2389.573    349.206  -6.843 1.63e-07 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smokeyes     -244.544     41.982  -5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15

Because Smoke is a categorical predictor with two levels, the regression produces a coefficient estimate for only one level; the other level serves as the baseline. Here the estimate for Smokeyes is -244.544.

Regression equation for smoking mothers (Smoke = yes, so Smokeyes = 1): Wgt = -2389.573 + 143.100 * Gest - 244.544 = -2634.117 + 143.100 * Gest

Regression equation for non-smoking mothers (Smoke = no, so Smokeyes = 0): Wgt = -2389.573 + 143.100 * Gest
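
As a quick visual check (not part of the assignment output), the two parallel fitted lines can be drawn directly from the coefficients above. This is a minimal sketch and assumes the predictor columns are named Gest and Smoke with values "yes"/"no", as the coefficient name Smokeyes suggests.

# sketch: plot birth weight against gestation, coloured by smoking status
plot(Wgt ~ Gest, data = birthsmokers,
     col = ifelse(birthsmokers$Smoke == "yes", "red", "blue"),
     pch = 19,
     main = "Birth weight vs gestation by smoking status")
b <- coef(fit1a)
abline(a = b["(Intercept)"], b = b["Gest"], col = "blue")                 # non-smokers
abline(a = b["(Intercept)"] + b["Smokeyes"], b = b["Gest"], col = "red")  # smokers
legend("topleft", legend = c("smoke = no", "smoke = yes"),
       col = c("blue", "red"), pch = 19)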

Q2: Interaction Terms

Use the depression.csv

The response variable is “effectiveness” - a higher number means the treatment was more effective, a lower number means less effective treatment. You also have info on the age of the patient and the treatment that they received.

We want to study the effectiveness of three treatments A, B, and C for severe depression. The researchers collected depression data on a random sample of 36 severely depressed individuals.

  1. Use “age” (numeric) and “treatment” (factor) variables to predict “effectiveness” (numeric) for depression using a linear regression.
# read in the data
depression <- read.csv("depression.csv")
# Build a linear regression
fit2a <- lm(effect ~ .,
            data = depression)
summary(fit2a)
## 
## Call:
## lm(formula = effect ~ ., data = depression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5732  -3.3922   0.9829   3.9613   9.5062 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.54335    3.58105   9.088 2.23e-10 ***
## age           0.66446    0.06978   9.522 7.42e-11 ***
## TRTB         -9.80758    2.46471  -3.979 0.000371 ***
## TRTC        -10.25276    2.46542  -4.159 0.000224 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.035 on 32 degrees of freedom
## Multiple R-squared:  0.784,  Adjusted R-squared:  0.7637 
## F-statistic: 38.71 on 3 and 32 DF,  p-value: 9.287e-11
# Save the predictions to data table
depression$preds <- fit2a$fitted.values
depression
  2. How do you interpret the coefficients in the equation? Hint: How does R handle factors with a linear regression?
summary(fit2a)
## 
## Call:
## lm(formula = effect ~ ., data = depression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5732  -3.3922   0.9829   3.9613   9.5062 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.54335    3.58105   9.088 2.23e-10 ***
## age           0.66446    0.06978   9.522 7.42e-11 ***
## TRTB         -9.80758    2.46471  -3.979 0.000371 ***
## TRTC        -10.25276    2.46542  -4.159 0.000224 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.035 on 32 degrees of freedom
## Multiple R-squared:  0.784,  Adjusted R-squared:  0.7637 
## F-statistic: 38.71 on 3 and 32 DF,  p-value: 9.287e-11

Regression equation: effect = 32.54335 + 0.66446 * age - 9.80758 * TRTB - 10.25276 * TRTC, where TRTB and TRTC are 0/1 indicators for treatments B and C (treatment A is the baseline).

When a predictor is a factor (categorical) with n levels, R creates (n-1) dummy variables and estimates a coefficient for each of them; the omitted level becomes the baseline and is absorbed into the intercept. So when the categorical variable takes the omitted value (treatment A here), both dummies are 0 and the prediction comes from the intercept and age alone, while for the other levels the corresponding dummy coefficient shifts the prediction relative to that baseline.
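
A minimal sketch of what R does behind the scenes: it expands the TRT column into treatment-contrast dummy variables, with the first level (A) as the omitted baseline.

# inspect the dummy (treatment-contrast) coding R builds for TRT
head(model.matrix(~ TRT, data = depression))
contrasts(factor(depression$TRT))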

  3. Use a single interaction term TRT * age to predict the effectiveness. What does this variable mean?
# Build regression using interaction term
fit2c <- lm(effect ~ age * TRT,
            data = depression)
summary(fit2c)
## 
## Call:
## lm(formula = effect ~ age * TRT, data = depression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4366 -2.7637  0.1887  2.9075  6.5634 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.51559    3.82523  12.422 2.34e-13 ***
## age           0.33051    0.08149   4.056 0.000328 ***
## TRTB        -18.59739    5.41573  -3.434 0.001759 ** 
## TRTC        -41.30421    5.08453  -8.124 4.56e-09 ***
## age:TRTB      0.19318    0.11660   1.657 0.108001    
## age:TRTC      0.70288    0.10896   6.451 3.98e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9001 
## F-statistic: 64.04 on 5 and 30 DF,  p-value: 4.264e-15

The interaction term age * TRT lets the slope of age differ by treatment: each age:TRT coefficient is the deviation of that treatment's age slope from the baseline slope, just as the main TRT coefficients are deviations from the baseline intercept.

In other words, when a categorical and a continuous variable interact, the effect of the continuous variable depends on the level of the categorical variable. In the model above, the age slope is 0.33051 for treatment A (the baseline), 0.33051 + 0.19318 = 0.52369 for treatment B, and 0.33051 + 0.70288 = 1.03339 for treatment C, so a one-year increase in age raises the predicted effectiveness by a different amount under each treatment.
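
A minimal sketch recovering the per-treatment age slopes from the fitted coefficients:

# age slope for each treatment = baseline slope + interaction deviation
b <- coef(fit2c)
c(A = unname(b["age"]),
  B = unname(b["age"] + b["age:TRTB"]),
  C = unname(b["age"] + b["age:TRTC"]))   # about 0.33, 0.52, 1.03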

  4. Compare the results of the model with and without the interaction term. Does the interaction term do a better job of explaining the data?
# evaluate the AIC for both models
AIC(fit2a)
## [1] 237.3518
AIC(fit2c)
## [1] 208.0487
# evaluate regression metrics for both models
mymetrics2a <- hydroGOF::gof(sim = fit2a$fitted.values,
                           obs = depression$effect)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
# evaluate regression metrics for both models
mymetrics2c <- hydroGOF::gof(sim = fit2c$fitted.values,
                           obs = depression$effect)

mymetrics2a
##          [,1]
## ME       0.00
## MAE      4.72
## MSE     32.38
## RMSE     5.69
## NRMSE % 45.80
## PBIAS %  0.00
## RSR      0.46
## rSD      0.89
## NSE      0.78
## mNSE     0.53
## rNSE     0.65
## d        0.94
## md       0.75
## rd       0.90
## cp       0.89
## r        0.89
## R2       0.78
## bR2      0.78
## KGE      0.84
## VE       0.91
mymetrics2c
##          [,1]
## ME       0.00
## MAE      2.98
## MSE     12.84
## RMSE     3.58
## NRMSE % 28.90
## PBIAS %  0.00
## RSR      0.29
## rSD      0.96
## NSE      0.91
## mNSE     0.70
## rNSE     0.90
## d        0.98
## md       0.85
## rd       0.97
## cp       0.95
## r        0.96
## R2       0.91
## bR2      0.91
## KGE      0.94
## VE       0.95

The model with the interaction term (fit2c) has a higher R-squared (0.9143 vs 0.784), a lower AIC (208.05 vs 237.35), and lower RMSE, MAE and MSE. The interaction model therefore performs better.

Thus the interaction term does a better job of explaining the data, most likely because it allows the effect of age on effectiveness to differ across the three treatments instead of forcing a single common slope.
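
Because fit2a is nested inside fit2c, a partial F-test can also be used to judge whether the interaction terms jointly add explanatory power; a minimal sketch:

# partial F-test: do the age:TRT terms jointly improve the fit?
anova(fit2a, fit2c)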

Q3: Detecting Multicollinearity

Use the Cement.csv

The heat evolved in calories during the hardening of cement on a per gram basis is given in the cement dataset. The four predictors are the percentages of four ingredients: tricalcium aluminate, tricalcium silicate, tetracalcium alumino ferrite, and dicalcium silicate

  1. Run the correlations on all the variables in the dataset. Describe what you see.
cement <- read.csv("Cement.csv")
# scatterplot matrix
library(psych)
# pearson correlation
pairs.panels(cement, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

In the Pearson correlation matrix, all the predictor variables show a noticeable correlation with the target variable Heat. However, Trical_sil and Dical_Sil have a strong negative correlation with each other, which means that including both would introduce multicollinearity into the regression model; they largely repeat the same information. Preferably, one of these variables should be dropped when building the regression.

Similarly, the variables Trical_Alum and Tetracal_AlumFer appear correlated, although not as strongly as the other two variables. It is worth checking whether dropping one of them improves the performance of the regression.
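
For reference, the numeric correlation matrix behind the panel plot can be printed directly; a minimal sketch:

# pairwise Pearson correlations, rounded for readability
round(cor(cement), 2)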

  2. Run the variance inflation factor analysis.
# Build a linear regression
fit3b <- lm(Heat ~ .,
            data = cement)
summary(fit3b)
## 
## Call:
## lm(formula = Heat ~ ., data = cement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       62.4054    70.0710   0.891   0.3991  
## Trical_Alum        1.5511     0.7448   2.083   0.0708 .
## Trical_sil         0.5102     0.7238   0.705   0.5009  
## Tetracal_AlumFer   0.1019     0.7547   0.135   0.8959  
## Dical_Sil         -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07
# import library car
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
# Find the Variance Inflation Factor 
vif(fit3b)
##      Trical_Alum       Trical_sil Tetracal_AlumFer        Dical_Sil 
##         38.49621        254.42317         46.86839        282.51286
  3. Build a regression model for heat evolved. Which variables will you consider?
# Get the VIF values
vif(fit3b)
##      Trical_Alum       Trical_sil Tetracal_AlumFer        Dical_Sil 
##         38.49621        254.42317         46.86839        282.51286

All the variables have very high VIF values. For a variable to be retained, its VIF should be as low as possible; a common rule of thumb is to keep VIF below about 5, with values above 10 signalling serious multicollinearity. Try building a model without Dical_Sil, the variable with the highest VIF.

# Fit another regression dropping the variable Dical_Sil
fit3c <- lm(Heat ~ . - Dical_Sil,
            data = cement)
summary(fit3c)
## 
## Call:
## lm(formula = Heat ~ . - Dical_Sil, data = cement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2543 -1.4726  0.1755  1.5409  3.9711 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      48.19363    3.91330  12.315 6.17e-07 ***
## Trical_Alum       1.69589    0.20458   8.290 1.66e-05 ***
## Trical_sil        0.65691    0.04423  14.851 1.23e-07 ***
## Tetracal_AlumFer  0.25002    0.18471   1.354    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.3 on 3 and 9 DF,  p-value: 3.367e-08
# Check the VIF values
vif(fit3c)
##      Trical_Alum       Trical_sil Tetracal_AlumFer 
##         3.251068         1.063575         3.142125

Now the VIF values are low and, according to the p-values, the variables are also significant in the regression. The exception is Tetracal_AlumFer, which is not significant but has a reasonably low VIF. Try building a model without this variable and compare the performance of the two models.

# Fit another regression dropping both Tetracal_AlumFer and Dical_Sil
fit3c_new <- lm(Heat ~ . - Dical_Sil - Tetracal_AlumFer,
                data = cement)
summary(fit3c_new)
## 
## Call:
## lm(formula = Heat ~ . - Dical_Sil - Tetracal_AlumFer, data = cement)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.893 -1.574 -1.302  1.363  4.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## Trical_Alum  1.46831    0.12130   12.11 2.69e-07 ***
## Trical_sil   0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744 
## F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09
# Find the AIC for models with and without the Tetracal_AlumFer
AIC(fit3c)
## [1] 63.9036
AIC(fit3c_new)
## [1] 64.31239

After removing the variable Tetracal_AlumFer, the R-squared decreased and the residual standard error increased slightly; the smaller model also has a higher AIC. Hence only the variable Dical_Sil should be removed when building the regression model for Heat.

  4. Describe your intuition behind the choice of variables based on the VIF analysis.

The VIF of a predictor measures how much the variance of its coefficient is inflated because that predictor can be (almost) reproduced as a linear combination of the other predictors; it equals 1 / (1 - R-squared) from regressing that predictor on the others. The correlation matrix already showed that Trical_sil and Dical_Sil carry largely the same information, which is why the full model had huge VIFs and no coefficient significant at the 5% level despite an R-squared of 0.98. Dropping Dical_Sil, the variable with the highest VIF, removes that redundancy: the remaining VIFs fall below 5, the coefficients become interpretable and mostly significant, and the model loses essentially no explanatory power.
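
A minimal sketch of that intuition in code: regressing Dical_Sil on the other three predictors and computing 1 / (1 - R-squared) should reproduce the value reported by car::vif (about 282.5).

# VIF "by hand" for Dical_Sil: how well do the other predictors explain it?
aux <- lm(Dical_Sil ~ . - Heat, data = cement)
1 / (1 - summary(aux)$r.squared)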

Part II

Q4: Logistic Regression

Use Smarket: S&P Stock Market Data. https://rdrr.io/cran/ISLR/man/Smarket.html

These are the daily percentage returns for the S&P 500 stock index between 2001 and 2005. You can download this from within R in the ISLR package.

  1. Build a logistic regression model using the glm (generalized linear model) function. We are trying to predict if the market will go up or down the next day.
# install.packages("ISLR")
# import library ISLR
library(ISLR)
# read in the Smarket data
smarket <- Smarket
smarket
str(smarket$Direction)
##  Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
# Recode the target variable
smarket$Direction <- as.character(smarket$Direction)
smarket$Direction[smarket$Direction == "Up"] <- 1
smarket$Direction[smarket$Direction == "Down"] <- 0
str(smarket$Direction)
##  chr [1:1250] "1" "1" "0" "1" "1" "1" "0" "1" "1" "1" "0" "0" "1" "1" ...
smarket$Direction <- as.factor(smarket$Direction)

fit4a <- glm(Direction ~ . - Year - Today,
              data = smarket,
              family = binomial(link = "logit"))
summary(fit4a)
## 
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial(link = "logit"), 
##     data = smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
# predict the probability of the market going up for each observation
mypreds4a <- predict(fit4a,  # my model
                    newdata = smarket, # dataset
                    type = "response") # to get probabilities
summary(mypreds4a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4084  0.5020  0.5180  0.5184  0.5338  0.6486
smarket$preds <- round(mypreds4a)  # classify using a 0.5 probability cutoff
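
As a quick sanity check (not asked for in the question), the 0.5-cutoff predictions can be cross-tabulated against the observed direction; a minimal sketch:

# confusion matrix of predicted vs actual direction (training data)
table(Predicted = smarket$preds, Actual = smarket$Direction)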
  2. Describe the interpretation of every variable in the model with respect to Direction. (Link: https://stats.idre.ucla.edu/r/dae/logit-regression/)
# get the coefficients
coef(fit4a)
##  (Intercept)         Lag1         Lag2         Lag3         Lag4 
## -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938 
##         Lag5       Volume 
##  0.010313068  0.135440659

On the logit scale, each coefficient is the change in the log odds of the market going up (Direction = 1) for a one-unit increase in that predictor, holding the other predictors constant. The negative coefficients (Lag1, Lag2) lower those log odds, while the positive ones (Lag3, Lag4, Lag5, Volume) raise them; note, however, that none of the predictors is statistically significant here.

To interpret the coefficients more directly, let's exponentiate them to obtain odds ratios.

# exponentiate the coefficients to convert log odds into odds ratios (the model uses a logit link)
exp(coef(fit4a))
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##   0.8816146   0.9295323   0.9585809   1.0111468   1.0094029   1.0103664 
##      Volume 
##   1.1450412

A coefficient with a positive sign (equivalently, an exponentiated coefficient above 1) indicates a direct relationship between the predictor and the odds of the market going up. For example, the exponentiated Volume coefficient of 1.145 says that a one-unit increase in Volume multiplies the odds of an up day by about 1.145, while the value of 0.93 for Lag1 says that a one-unit increase in the previous day's return multiplies those odds by about 0.93, holding the other predictors constant.

  3. Explain why we use the odds ratio in logistic regression. Is it the same thing as probability?

We use the odds ratio because the logistic model is linear on the log-odds scale, so exponentiating a coefficient gives a single multiplicative effect on the odds that applies regardless of the values of the other predictors. It is not the same thing as probability: the odds are p / (1 - p), so odds of 1 correspond to a probability of 0.5, and the change in probability produced by a one-unit change in a predictor depends on the starting probability, whereas the odds ratio stays constant.
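
A minimal sketch illustrating the difference, using the fitted Volume coefficient: the odds ratio is constant, but the implied change in probability depends on the starting probability.

# odds ratio for a one-unit increase in Volume
or_volume <- exp(coef(fit4a)["Volume"])   # about 1.145

# the same odds ratio implies different probability changes
p_start <- c(0.2, 0.5, 0.8)
odds_new <- (p_start / (1 - p_start)) * or_volume
p_new <- odds_new / (1 + odds_new)
round(cbind(p_start, p_new, change = p_new - p_start), 3)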

Q5: Poisson Regression

Use the AmphibianRoadKills.csv

You are given amphibian road kill data, which gives the count of the number of animals killed at different points along a 25000 mile road.

  1. Build a Poisson regression model to predict the number of animals killed on the roadway
options(scipen=999)    # to avoid scientific notation

# read in the data
arkill <- read.csv("AmphibianRoadKills.csv")
arkill
# build the regression model with family = poisson
fit5a <- glm(death ~ distance,
             data = arkill,
             family=poisson())
# check the model built
summary(fit5a)
## 
## Call:
## glm(formula = death ~ distance, family = poisson(), data = arkill)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.1100  -1.6950  -0.4708   1.4206   7.3337  
## 
## Coefficients:
##                 Estimate   Std. Error z value            Pr(>|z|)    
## (Intercept)  4.316484980  0.043219983   99.87 <0.0000000000000002 ***
## distance    -0.000105851  0.000004387  -24.13 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  390.9  on 50  degrees of freedom
## AIC: 634.29
## 
## Number of Fisher Scoring iterations: 4
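
Because the Poisson model uses a log link, coefficients act multiplicatively on the expected count. A minimal sketch of interpreting the distance coefficient (the 1000-unit rescaling is only for illustration):

# exp(coefficient) is a rate ratio for the expected number of deaths
exp(coef(fit5a)["distance"])          # per 1 extra unit of distance
exp(coef(fit5a)["distance"] * 1000)   # per 1000 extra units: roughly a 10% drop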
  2. Should I use the quasipoisson model? Why or why not? If so, fit that model and compare the model fit/coefficients etc.
#install.packages("qcc")
# import library qcc (note: quasipoisson() is available in base R; qcc only provides extra overdispersion diagnostics)
library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
# build regression using family = quasipoisson
fit5b <- glm(death ~ distance,
             data = arkill,
             family = quasipoisson())
# check the model built
summary(fit5b)
## 
## Call:
## glm(formula = death ~ distance, family = quasipoisson(), data = arkill)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.1100  -1.6950  -0.4708   1.4206   7.3337  
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  4.31648498  0.11938536  36.156 < 0.0000000000000002 ***
## distance    -0.00010585  0.00001212  -8.735      0.0000000000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.630148)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  390.9  on 50  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
metrics5a <- hydroGOF::gof(sim = fit5a$fitted.values,
                           obs = fit5a$y)
metrics5b <- hydroGOF::gof(sim = fit5b$fitted.values,
                           obs = fit5b$y)
metrics5a
##           [,1]
## ME        0.00
## MAE      10.40
## MSE     282.67
## RMSE     16.81
## NRMSE %  69.30
## PBIAS %   0.00
## RSR       0.69
## rSD       0.79
## NSE       0.51
## mNSE      0.46
## rNSE     -1.50
## d         0.83
## md        0.70
## rd        0.13
## cp        0.16
## r         0.72
## R2        0.52
## bR2       0.41
## KGE       0.65
## VE        0.60
metrics5b
##           [,1]
## ME        0.00
## MAE      10.40
## MSE     282.67
## RMSE     16.81
## NRMSE %  69.30
## PBIAS %   0.00
## RSR       0.69
## rSD       0.79
## NSE       0.51
## mNSE      0.46
## rNSE     -1.50
## d         0.83
## md        0.70
## rd        0.13
## cp        0.16
## r         0.72
## R2        0.52
## bR2       0.41
## KGE       0.65
## VE        0.60

Yes, a quasi-Poisson model should be used here. The estimated dispersion parameter is 7.63, far above the value of 1 that the Poisson model assumes, so the counts are overdispersed. The coefficient estimates are identical in the two fits, but the quasi-Poisson standard errors are inflated by sqrt(7.63), about 2.76 (e.g., 0.0432 vs 0.1194 for the intercept), giving more honest p-values. The fitted values, and hence the goodness-of-fit metrics above, are the same for both models, and AIC is not reported for the quasi-likelihood fit.
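
A minimal sketch of checking overdispersion by hand: the Pearson chi-square statistic divided by the residual degrees of freedom gives the same dispersion estimate that the quasi-Poisson fit reports.

# Pearson-based dispersion estimate: values far above 1 indicate overdispersion
sum(residuals(fit5a, type = "pearson")^2) / fit5a$df.residual   # about 7.6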
  3. Residual diagnostics: Make three plots and describe what you see
# plot the residuals vs fitted values
plot(y = fit5a$residuals,
     x = fit5a$fitted.values,
     main = "Residuals vs Fitted values",
     xlab = "Fitted values",
     ylab = "Residuals")
abline(0, 0)  # horizontal reference line at y = 0
# add the regression line
abline(lm(fit5a$residuals ~ fit5a$fitted.values),
       col = "red")

Looking at the Residuals vs Fitted Values scatter plot, it can be seen that:

# plot residuals vs the actual distances
plot(y = fit5a$residuals,
     x = arkill$distance,
     main = "Residuals vs Distance",
     xlab = "Distance",
     ylab = "Residuals")
abline(0, 0)  # horizontal reference line at y = 0

In the Residuals vs. Distance scatterplot above, it can be seen that:

hist(fit5a$residuals,
     main = "Residuals frequency",
     xlab = "Residual values",
     ylab = "Frequency")

The histogram above supports the observation that the model over-predicts more often than it under-predicts. It can be seen that: