Load libraries

Background

The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weigh of a product. See below for the description of these measurments.

Data Description

The data consists of the following variables:

  1. Weight: weight of fish in g (numerical)
  2. Species: species name of fish (categorical)
  3. Body.Height: height of body of fish in cm (numerical)
  4. Total.Length: length of fish from mouth to tail in cm (numerical)
  5. Diagonal.Length: length of diagonal of main body of fish in cm (numerical)
  6. Height: height of head of fish in cm (numerical)
  7. Width: width of head of fish in cm (numerical)

Read the data

# Import library you may need
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Read the data set
fishfull = read.csv("Fish.csv",header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]

head(fish,2)

Please use fish as your data set for the following questions unless otherwise stated.

Question 1: Exploratory Data Analysis [8 points]

(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?

#making Species as factors
fish$Species <- as.factor(fish$Species)
#class(fishfull$Species)

# Visualizations via Box plots 
boxplot(Weight ~ Species, data = fish,
        xlab = "Species Types",
        ylab = "Weight",
        main = "Box plots of Fish Weights across different species",
        notch = FALSE,
        varwidth = TRUE,
        col = c("green", "red", "blue","yellow", "violet", "lightblue","orange")
        )

#Answer:

A few characteristics of fish weights with respect to species types were discernible from the box plots:

To summarize, the box plots enable end-users to quickly tell which species has the most variances, distribution types of weights. Lastly, it shows which species weighs the least and which ones has extreme outliers but beyond that, there’s really no good linear relationship between weights and species types mainly because of the high amount variability.

(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body.Height, Total.Length, Diagonal.Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?

#Answer:

(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.

library(corrplot)
## corrplot 0.92 loaded
headers = c("Body.Height","Total.Length","Diagonal.Length","Height","Width","Weight")
#scale(df2[,1:9])
newdata <- data.frame(scale(fish[,-2]))[headers]
correlation = cor(newdata )
corrplot(correlation, method = 'number',bg="lightblue")

#Answer:

With respect to (w.r.t) response variable to predictors:

The fact that predictors have high correlation relationships with its response is a good thing as it will translate to higher predictive power of the model. But, those high correlation levels among predictors spells trouble and will manifest itself in multicollinearity issues (VIF issues). Some of the well-documented areas that cause problems to the model are:

• Standard errors are inflated

• Larger the confidence interval

• The less likely it is that a coefficient will be evaluated as statistically

While all these issues are problematic, its not fatal. The model can still be utilized but with caution

(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?

#Answer:

Remember, this is just general guidelines to keep the model more interpretable and parsimonious, while in a true a modeling process, a more robust sub-setting selection methodology like a step-wise regression may still be needed to be performed to truly trim down the model and only utilize the most impactful predictors relative to the response

Question 2: Fitting the Multiple Linear Regression Model [8 points]

Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest

(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.

model1 <- lm(Weight ~ ., data=fish)
summary(model1)
## 
## Call:
## lm(formula = Weight ~ ., data = fish)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.37  -70.59  -23.50   42.42 1335.87 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -813.90     218.34  -3.728 0.000282 ***
## SpeciesParkki       79.34     132.71   0.598 0.550918    
## SpeciesPerch        10.41     206.26   0.050 0.959837    
## SpeciesPike         16.76     233.06   0.072 0.942775    
## SpeciesRoach       194.03     156.84   1.237 0.218173    
## SpeciesSmelt       455.78     204.92   2.224 0.027775 *  
## SpeciesWhitefish    28.31     164.91   0.172 0.863967    
## Body.Height       -176.87      61.36  -2.882 0.004583 ** 
## Total.Length       266.70      77.75   3.430 0.000797 ***
## Diagonal.Length    -72.49      49.48  -1.465 0.145267    
## Height              38.27      22.09   1.732 0.085448 .  
## Width               29.63      40.54   0.731 0.466080    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156.1 on 137 degrees of freedom
## Multiple R-squared:  0.8419, Adjusted R-squared:  0.8292 
## F-statistic:  66.3 on 11 and 137 DF,  p-value: < 2.2e-16

(b) Is the overall regression significant at an \(\alpha\) level of 0.01? Explain.

#Answer:

hypothesis that all slope coefficients are equal to zero, and state that the model is statistically significant at an α level of 0.01

(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.

#Answer:

of the fish DECREASES by 176.87 units holding all other variables constant. This behavior is peculiar and counter-intuitive at best.

This interpretation is evidence that multicollinearity is distorting the coefficient estimates of the model

(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.

#Answer:

Question 3: Checking for Outliers and Multicollinearity [6 points]

(a) Create a plot for the Cook’s Distances. Using a threshold Cook’s Distance of 1, identify the row numbers of any outliers.

# Cook’s Distance
cook = cooks.distance(model1)

plot(cook,type="h", lwd=4, col="red", ylab = "Cook's Distance", main="Cook's Distance")
abline(h=1,col="blue")

#Identify outliers
cat("Observation", which(cook>1), "has a cook's distance that is greater than 1")
## Observation 30 has a cook's distance that is greater than 1

#Answer:

(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.

fish2<- fish[-c(30),]
model2 <- lm(Weight~., data=fish2)
summary(model2)
## 
## Call:
## lm(formula = Weight ~ ., data = fish2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.10  -50.18  -14.44   34.04  433.68 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -969.766    131.601  -7.369 1.51e-11 ***
## SpeciesParkki     195.500     80.105   2.441 0.015951 *  
## SpeciesPerch      174.241    124.404   1.401 0.163608    
## SpeciesPike      -175.936    140.605  -1.251 0.212983    
## SpeciesRoach      141.867     94.319   1.504 0.134871    
## SpeciesSmelt      489.714    123.174   3.976 0.000113 ***
## SpeciesWhitefish  122.277     99.293   1.231 0.220270    
## Body.Height       -76.321     37.437  -2.039 0.043422 *  
## Total.Length       74.822     48.319   1.549 0.123825    
## Diagonal.Length    34.349     30.518   1.126 0.262350    
## Height             10.000     13.398   0.746 0.456692    
## Width              -8.339     24.483  -0.341 0.733924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.84 on 136 degrees of freedom
## Multiple R-squared:  0.9385, Adjusted R-squared:  0.9335 
## F-statistic: 188.6 on 11 and 136 DF,  p-value: < 2.2e-16

(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-\(R^2\)) what conclusions can you draw?

vif(model2)
##                       GVIF Df GVIF^(1/(2*Df))
## Species         1545.55017  6        1.843983
## Body.Height     2371.15420  1       48.694499
## Total.Length    4540.47698  1       67.383062
## Diagonal.Length 2126.64985  1       46.115614
## Height            56.21375  1        7.497583
## Width             29.01683  1        5.386727
VIF_threshold <- max(10, 1/(1-summary(model2)$r.squared))
cat("VIF threshold:", VIF_threshold)
## VIF threshold: 16.25583

#Answer:

Question 4: Checking Model Assumptions [6 points]

Please use the cleaned data set, which have the outlier(s) removed, and model2 for answering the following questions.

(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?

fish_1 <- fish2[,-2]
## Standardized Residuals versus individual predicting variables 
resids = stdres(model2)
par(mfrow =c(3,2))
xlabc=c("Body.Height","Total.Length","Diagonal.Length","Height","Width")
for (i in 2:6) {
  plot(fish_1[,i],resids,xlab=xlabc[i-1],ylab="Residuals")
  abline(0,0,col="red")
  lines(lowess(fish_1[,i], resids), col='blue')}

#Answer:

For example:

Overall, the linearity assumption does not seem to hold

(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?

fits = model2$fitted
plot(fits, resids, xlab="Fitted Values",ylab="Residuals", main="Model2: Residuals vs. Fitted Values")
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits, resids), col='blue')

#Answer:

(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?

qqPlot(resids, ylab="Residuals", main = "Model2: Q-Q Plot of Residuals")

##  37 124 
##  36 123
hist(resids, xlab="Residuals", main = "Mdel2: Histogram of Residuals",nclass=10,col="gold")

#Answer:

Question 5: Partial F Test [6 points]

(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only Species and Total.Length as predicting variables and Weight as the response. Display the summary table of the model3.

fish_3 <- fish2[,c(1,2,4)]
model3 <- lm(Weight ~ ., data=fish_3)
summary(model3)
## 
## Call:
## lm(formula = Weight ~ ., data = fish_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -233.83  -56.59  -10.13   34.58  418.30 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -730.977     42.449 -17.220  < 2e-16 ***
## SpeciesParkki      63.129     38.889   1.623    0.107    
## SpeciesPerch      -23.941     21.745  -1.101    0.273    
## SpeciesPike      -400.964     33.350 -12.023  < 2e-16 ***
## SpeciesRoach      -19.876     30.111  -0.660    0.510    
## SpeciesSmelt      256.408     39.858   6.433 1.85e-09 ***
## SpeciesWhitefish  -14.971     42.063  -0.356    0.722    
## Total.Length       40.775      1.181  34.527  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.86 on 140 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9321 
## F-statistic: 289.1 on 7 and 140 DF,  p-value: < 2.2e-16

(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an \(\alpha\) level of 0.01?

anova(model3,model2)

#Answer:

Question 6: Reduced Model Residual Analysis and Multicollinearity Test [7 points]

(a) Conduct a multicollinearity test on model3. Comment on the multicollinearity in model3.

vif(model3)
##                  GVIF Df GVIF^(1/(2*Df))
## Species      2.654472  6        1.084755
## Total.Length 2.654472  1        1.629255
VIF_threshold3 <- max(10, 1/(1-summary(model3)$r.squared))
cat("VIF threshold of Model3:", VIF_threshold3)
## VIF threshold of Model3: 15.45466

#Answer:

(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether they hold.

par(mfrow=c(2,2))

#Residuals of predictor vs Residuals for linearity check
resids3 = stdres(model3)
xlabc=c("Total.Length")
plot(fish_3[,3],resids3,xlab=xlabc,ylab="Residuals", pch = 19,main="Model3: Lineartiy/Mean Zero")
  abline(0,0,col="red")
  lines(lowess(fish_3[,3], resids3), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits3 = model3$fitted
plot(fits3, resids3, xlab="Fitted Values",ylab="Residuals",main="Model3: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits3, resids3), col='blue')

#QQ Plot & Histogram of residuals
qqPlot(resids3, ylab="Residuals", main = "Model3: Q-Q Plot of Residuals")
##  37 124 
##  36 123
hist(resids3, xlab="Residuals", main = "Model3: Histogram of Residuals",nclass=10,col="yellow")

#Answer:

Question 7: Transformation [9 pts]

(a) Use model3 to find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3. What transformation, if any, should be applied according to the lambda value? Please ensure you use model3

library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
## 
##     mutate
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
options(digits=3)
bc <- boxcox((model3))

lambda <- bc$x[which.max(bc$y)]
cat("Optimal Lambda = ", lambda, end="\n")
## Optimal Lambda =  0.343

#Answer:

(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.

#transformed response variable
transformed_y <- sqrt(fish_3$Weight)
fishfull_transformed <- fish_3 %>% mutate(sqrt_weight=transformed_y) %>% select(sqrt_weight,Species, Total.Length)
head(fishfull_transformed ,2)
model4 <- lm(sqrt_weight~., data=fishfull_transformed )
summary(model4)
## 
## Call:
## lm(formula = sqrt_weight ~ ., data = fishfull_transformed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.011 -0.769 -0.058  0.680  4.638 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.9665     0.5728  -12.16  < 2e-16 ***
## SpeciesParkki     -0.3640     0.5248   -0.69    0.489    
## SpeciesPerch      -1.9573     0.2934   -6.67  5.5e-10 ***
## SpeciesPike      -10.9049     0.4500  -24.23  < 2e-16 ***
## SpeciesRoach      -2.0934     0.4063   -5.15  8.6e-07 ***
## SpeciesSmelt      -1.0499     0.5378   -1.95    0.053 .  
## SpeciesWhitefish  -0.5505     0.5676   -0.97    0.334    
## Total.Length       0.9505     0.0159   59.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.28 on 140 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.981 
## F-statistic: 1.07e+03 on 7 and 140 DF,  p-value: <2e-16

(c) Perform Residual Analysis on model4. Comment on each assumption. Was the transformation successful/unsuccessful?

par(mfrow=c(2,2))

#Residuals of predictor vs Residuals for linearity check
resids4 = stdres(model4)
xlabc=c("Total.Length")
plot(fishfull_transformed[,3],resids4,xlab=xlabc,ylab="Residuals", pch = 19,main="Model4: Lineartiy/Mean Zero")
  abline(0,0,col="red")
  lines(lowess(fishfull_transformed[,3], resids4), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits4 = model4$fitted
plot(fits4, resids4, xlab="Fitted Values",ylab="Residuals",main="Model4: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits4, resids4), col='blue')

#QQ Plot & Histogram of residuals
qqPlot(resids4, ylab="Residuals", main = "Model4: Q-Q Plot of Residuals")
## 131 141 
## 130 140
hist(resids4, xlab="Residuals", main = "Model4: Histogram of Residuals",nclass=10,col="yellow")

#Answer:

Overall, based on these plots, the transformation has improved the linearity assumption. And the transformation was successful in fixing the non-constant variance problem. The only exception, We still have are heavy tailed residuals as indicated by the QQ plot. These outliers still need to be further studied for a more comprehensive analysis

Question 8: Model Comparison [2 pts]

(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
adj.R.Square <- c(summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared)
R.Square <- c(summary(model2)$r.squared,summary(model3)$r.squared,summary(model4)$r.squared)
Models <- c("Model2","Model3","Model4")

df <- data.frame(Models,R.Square ,adj.R.Square)
df%>%
  kable() %>%
  kable_styling(font_size=20,full_width = F) 
Models R.Square adj.R.Square
Model2 0.938 0.934
Model3 0.935 0.932
Model4 0.982 0.981

#Answer:

Question 9: Prediction [8 points]

(a) Predict Weight for the last 10 rows of data (fishtest) using both model3 and model4. Compare and discuss the mean squared prediction error (MSPE) of both models.

# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]
# Mean Squared Prediction Error (MSPE)
pred3 = predict(model3, fishtest)
mse.model3 <- mean((pred3-fishtest$Weight)^2)
# Mean Squared Prediction Error (MSPE)
pred4 = predict(model4, fishtest)^2# to bring it to it's original scale.  Remember it was transformed by sqrt of response
mse.model4 <- mean((pred4-fishtest$Weight)^2)
cat("The MSPE of model4 is", mse.model4,"\n")
## The MSPE of model4 is 2443
cat("The MSPE of model3 is", mse.model3,"\n")
## The MSPE of model3 is 9392

#Answer:

(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight on this fish with a 90% prediction interval. Provide an interpretation of the prediction interval.

#new data point for Perch
new.point<-data.frame(Species="Perch", Total.Length=32) # model4 has only one predictor; Total.length
# Calculate prediction interval
predict(model4, new.point, interval="prediction", level=0.9)^2
##   fit lwr upr
## 1 462 374 559
#As refernce only
fishfull %>% filter(Species=="Perch") %>% filter(Total.Length>30 & Total.Length<33)

#Answer: