The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weigh of a product. See below for the description of these measurments.
The data consists of the following variables:
# Import library you may need
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Read the data set
fishfull = read.csv("Fish.csv",header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]
head(fish,2)
Please use fish as your data set for the following questions unless otherwise stated.
(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?
#making Species as factors
fish$Species <- as.factor(fish$Species)
#class(fishfull$Species)
# Visualizations via Box plots
boxplot(Weight ~ Species, data = fish,
xlab = "Species Types",
ylab = "Weight",
main = "Box plots of Fish Weights across different species",
notch = FALSE,
varwidth = TRUE,
col = c("green", "red", "blue","yellow", "violet", "lightblue","orange")
)
#Answer:
A few characteristics of fish weights with respect to species types were discernible from the box plots:
There is quite a bit of variability among the median weights across different species of fish
It does show that certain species like Roach has a few extreme outliers but most of it’s data are in tight range.
whereas, Pikes, has the highest amount of variability with a longer box and whiskers
Smelt is the lowest in weight while all the other species exhibited right skewed except for Bream. Bream seem to have the most “normal” distribution
To summarize, the box plots enable end-users to quickly tell which species has the most variances, distribution types of weights. Lastly, it shows which species weighs the least and which ones has extreme outliers but beyond that, there’s really no good linear relationship between weights and species types mainly because of the high amount variability.
(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body.Height, Total.Length, Diagonal.Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?
#Answer:
Graphically, it appears that all 5 predictors have a positive relationship with the Weight of the fish, but there seem to be some slight non-linear curvature relationship between the response variable and its quantitative predictors
As the predictor variables increases, so does its corresponding weights. This can be visually seen with an upward movement in the scatter plots.
In addition, there seem to be an outlier in 3 of the predictors; Diagonal length, Body Height, Total Length around data point 20 while the predictors of Height and Width predictors have more outliers in the lower range, somewhere between 4 thru 12
(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.
library(corrplot)
## corrplot 0.92 loaded
headers = c("Body.Height","Total.Length","Diagonal.Length","Height","Width","Weight")
#scale(df2[,1:9])
newdata <- data.frame(scale(fish[,-2]))[headers]
correlation = cor(newdata )
corrplot(correlation, method = 'number',bg="lightblue")
#Answer:
With respect to (w.r.t) response variable to predictors:
Most of the predictors have relatively high correlation w.r.t. response variable in the mid 0.80 range. The lowest was Height at 0.69 between Weight and Width
The same can be observed for predictors w.r.t. to each other as well. For instance, Total length and Body Height have perfect correlation while Diagonal Length and Body Height have correlation of 0.99. Width vs. Total length, Body Height and Diagonal Length also experienced high correlation at the mid .87 range. Only Height saw a lower correlation level with it’s corresponding predictors at the mid 0.6 range.
The fact that predictors have high correlation relationships with its response is a good thing as it will translate to higher predictive power of the model. But, those high correlation levels among predictors spells trouble and will manifest itself in multicollinearity issues (VIF issues). Some of the well-documented areas that cause problems to the model are:
• Standard errors are inflated
• Larger the confidence interval
• The less likely it is that a coefficient will be evaluated as statistically
While all these issues are problematic, its not fatal. The model can still be utilized but with caution
(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?
#Answer:
I would still use a multiple linear regression model (MLR) as a baseline model first, but given the potential problems with multicollinearity, I would be cautious about interpreting the statistical significance of the regression coefficients, confidence interval or it’s predictive power.
We can attempt a multiple linear regression model first, but in order to improve the model, we may need to perform some Box-Cox transformations later to reduce the heteroskedasticity and/or incorporate non-linear associations in the linear model by including transformed versions of the predictors in the model.
Remember, this is just general guidelines to keep the model more interpretable and parsimonious, while in a true a modeling process, a more robust sub-setting selection methodology like a step-wise regression may still be needed to be performed to truly trim down the model and only utilize the most impactful predictors relative to the response
Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest
(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.
model1 <- lm(Weight ~ ., data=fish)
summary(model1)
##
## Call:
## lm(formula = Weight ~ ., data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.37 -70.59 -23.50 42.42 1335.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -813.90 218.34 -3.728 0.000282 ***
## SpeciesParkki 79.34 132.71 0.598 0.550918
## SpeciesPerch 10.41 206.26 0.050 0.959837
## SpeciesPike 16.76 233.06 0.072 0.942775
## SpeciesRoach 194.03 156.84 1.237 0.218173
## SpeciesSmelt 455.78 204.92 2.224 0.027775 *
## SpeciesWhitefish 28.31 164.91 0.172 0.863967
## Body.Height -176.87 61.36 -2.882 0.004583 **
## Total.Length 266.70 77.75 3.430 0.000797 ***
## Diagonal.Length -72.49 49.48 -1.465 0.145267
## Height 38.27 22.09 1.732 0.085448 .
## Width 29.63 40.54 0.731 0.466080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 156.1 on 137 degrees of freedom
## Multiple R-squared: 0.8419, Adjusted R-squared: 0.8292
## F-statistic: 66.3 on 11 and 137 DF, p-value: < 2.2e-16
(b) Is the overall regression significant at an \(\alpha\) level of 0.01? Explain.
#Answer:
hypothesis that all slope coefficients are equal to zero, and state that the model is statistically significant at an α level of 0.01
(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.
#Answer:
of the fish DECREASES by 176.87 units holding all other variables constant. This behavior is peculiar and counter-intuitive at best.
This interpretation is evidence that multicollinearity is distorting the coefficient estimates of the model
(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.
#Answer:
(a) Create a plot for the Cook’s Distances. Using a threshold Cook’s Distance of 1, identify the row numbers of any outliers.
# Cook’s Distance
cook = cooks.distance(model1)
plot(cook,type="h", lwd=4, col="red", ylab = "Cook's Distance", main="Cook's Distance")
abline(h=1,col="blue")
#Identify outliers
cat("Observation", which(cook>1), "has a cook's distance that is greater than 1")
## Observation 30 has a cook's distance that is greater than 1
#Answer:
(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.
fish2<- fish[-c(30),]
model2 <- lm(Weight~., data=fish2)
summary(model2)
##
## Call:
## lm(formula = Weight ~ ., data = fish2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.10 -50.18 -14.44 34.04 433.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -969.766 131.601 -7.369 1.51e-11 ***
## SpeciesParkki 195.500 80.105 2.441 0.015951 *
## SpeciesPerch 174.241 124.404 1.401 0.163608
## SpeciesPike -175.936 140.605 -1.251 0.212983
## SpeciesRoach 141.867 94.319 1.504 0.134871
## SpeciesSmelt 489.714 123.174 3.976 0.000113 ***
## SpeciesWhitefish 122.277 99.293 1.231 0.220270
## Body.Height -76.321 37.437 -2.039 0.043422 *
## Total.Length 74.822 48.319 1.549 0.123825
## Diagonal.Length 34.349 30.518 1.126 0.262350
## Height 10.000 13.398 0.746 0.456692
## Width -8.339 24.483 -0.341 0.733924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.84 on 136 degrees of freedom
## Multiple R-squared: 0.9385, Adjusted R-squared: 0.9335
## F-statistic: 188.6 on 11 and 136 DF, p-value: < 2.2e-16
(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-\(R^2\)) what conclusions can you draw?
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## Species 1545.55017 6 1.843983
## Body.Height 2371.15420 1 48.694499
## Total.Length 4540.47698 1 67.383062
## Diagonal.Length 2126.64985 1 46.115614
## Height 56.21375 1 7.497583
## Width 29.01683 1 5.386727
VIF_threshold <- max(10, 1/(1-summary(model2)$r.squared))
cat("VIF threshold:", VIF_threshold)
## VIF threshold: 16.25583
#Answer:
The VIF threshold in model2 happens to be 16.25583
Since all the predictors have VIF greater than 10 (first column:GVIF), the test is positive for multicollinearity
Please use the cleaned data set, which have the outlier(s) removed, and model2 for answering the following questions.
(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?
fish_1 <- fish2[,-2]
## Standardized Residuals versus individual predicting variables
resids = stdres(model2)
par(mfrow =c(3,2))
xlabc=c("Body.Height","Total.Length","Diagonal.Length","Height","Width")
for (i in 2:6) {
plot(fish_1[,i],resids,xlab=xlabc[i-1],ylab="Residuals")
abline(0,0,col="red")
lines(lowess(fish_1[,i], resids), col='blue')}
#Answer:
For example:
Body.Height: A possible concern between 20-30cm, where the majority of the residuals are below the zero line.
Total.Length: A possible concern between 20-30cm, where the majority of the residuals are below the zero line.
Diagonal.Length: A possible concern between 27-35cm, where the majority of the residuals are below the zero line.
Width: A possible concern between 3.9-5cm, where the majority of the residuals are below the zero line
Overall, the linearity assumption does not seem to hold
(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?
fits = model2$fitted
plot(fits, resids, xlab="Fitted Values",ylab="Residuals", main="Model2: Residuals vs. Fitted Values")
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits, resids), col='blue')
#Answer:
(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?
qqPlot(resids, ylab="Residuals", main = "Model2: Q-Q Plot of Residuals")
## 37 124
## 36 123
hist(resids, xlab="Residuals", main = "Mdel2: Histogram of Residuals",nclass=10,col="gold")
#Answer:
(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only Species and Total.Length as predicting variables and Weight as the response. Display the summary table of the model3.
fish_3 <- fish2[,c(1,2,4)]
model3 <- lm(Weight ~ ., data=fish_3)
summary(model3)
##
## Call:
## lm(formula = Weight ~ ., data = fish_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -233.83 -56.59 -10.13 34.58 418.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -730.977 42.449 -17.220 < 2e-16 ***
## SpeciesParkki 63.129 38.889 1.623 0.107
## SpeciesPerch -23.941 21.745 -1.101 0.273
## SpeciesPike -400.964 33.350 -12.023 < 2e-16 ***
## SpeciesRoach -19.876 30.111 -0.660 0.510
## SpeciesSmelt 256.408 39.858 6.433 1.85e-09 ***
## SpeciesWhitefish -14.971 42.063 -0.356 0.722
## Total.Length 40.775 1.181 34.527 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.86 on 140 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9321
## F-statistic: 289.1 on 7 and 140 DF, p-value: < 2.2e-16
(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an \(\alpha\) level of 0.01?
anova(model3,model2)
#Answer:
(a) Conduct a multicollinearity test on model3. Comment on the multicollinearity in model3.
vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## Species 2.654472 6 1.084755
## Total.Length 2.654472 1 1.629255
VIF_threshold3 <- max(10, 1/(1-summary(model3)$r.squared))
cat("VIF threshold of Model3:", VIF_threshold3)
## VIF threshold of Model3: 15.45466
#Answer:
The VIF threshold in model3 happens to be 15.45466
Since all the predictors have VIF less than 10 (first column:GVIF), the test is negative for multicollinearity
(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether they hold.
par(mfrow=c(2,2))
#Residuals of predictor vs Residuals for linearity check
resids3 = stdres(model3)
xlabc=c("Total.Length")
plot(fish_3[,3],resids3,xlab=xlabc,ylab="Residuals", pch = 19,main="Model3: Lineartiy/Mean Zero")
abline(0,0,col="red")
lines(lowess(fish_3[,3], resids3), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits3 = model3$fitted
plot(fits3, resids3, xlab="Fitted Values",ylab="Residuals",main="Model3: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits3, resids3), col='blue')
#QQ Plot & Histogram of residuals
qqPlot(resids3, ylab="Residuals", main = "Model3: Q-Q Plot of Residuals")
## 37 124
## 36 123
hist(resids3, xlab="Residuals", main = "Model3: Histogram of Residuals",nclass=10,col="yellow")
#Answer:
We still have similar linearity/mean zero and constant variance violations like model2 with a u-shape pattern in the residuals and increasing variance as fitted values increase (non-constant variance problem). Specifically, residuals vs. predictor and residuals vs. fitted values did not have a random scatter of points hovering around line zero. It’s actually these lack of randomness in the scatter of residuals that indicated the data has both a non-linearity and heteordasticity issues.
Similarly, the histogram and QQ plot both exhibited deviations from normality with a heavy right tail. Thus Normality does not hold. A transformation of the response may be needed here
(a) Use model3 to find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3. What transformation, if any, should be applied according to the lambda value? Please ensure you use model3
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
##
## mutate
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
options(digits=3)
bc <- boxcox((model3))
lambda <- bc$x[which.max(bc$y)]
cat("Optimal Lambda = ", lambda, end="\n")
## Optimal Lambda = 0.343
#Answer:
(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.
#transformed response variable
transformed_y <- sqrt(fish_3$Weight)
fishfull_transformed <- fish_3 %>% mutate(sqrt_weight=transformed_y) %>% select(sqrt_weight,Species, Total.Length)
head(fishfull_transformed ,2)
model4 <- lm(sqrt_weight~., data=fishfull_transformed )
summary(model4)
##
## Call:
## lm(formula = sqrt_weight ~ ., data = fishfull_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.011 -0.769 -0.058 0.680 4.638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.9665 0.5728 -12.16 < 2e-16 ***
## SpeciesParkki -0.3640 0.5248 -0.69 0.489
## SpeciesPerch -1.9573 0.2934 -6.67 5.5e-10 ***
## SpeciesPike -10.9049 0.4500 -24.23 < 2e-16 ***
## SpeciesRoach -2.0934 0.4063 -5.15 8.6e-07 ***
## SpeciesSmelt -1.0499 0.5378 -1.95 0.053 .
## SpeciesWhitefish -0.5505 0.5676 -0.97 0.334
## Total.Length 0.9505 0.0159 59.65 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.28 on 140 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.981
## F-statistic: 1.07e+03 on 7 and 140 DF, p-value: <2e-16
(c) Perform Residual Analysis on model4. Comment on each assumption. Was the transformation successful/unsuccessful?
par(mfrow=c(2,2))
#Residuals of predictor vs Residuals for linearity check
resids4 = stdres(model4)
xlabc=c("Total.Length")
plot(fishfull_transformed[,3],resids4,xlab=xlabc,ylab="Residuals", pch = 19,main="Model4: Lineartiy/Mean Zero")
abline(0,0,col="red")
lines(lowess(fishfull_transformed[,3], resids4), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits4 = model4$fitted
plot(fits4, resids4, xlab="Fitted Values",ylab="Residuals",main="Model4: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits4, resids4), col='blue')
#QQ Plot & Histogram of residuals
qqPlot(resids4, ylab="Residuals", main = "Model4: Q-Q Plot of Residuals")
## 131 141
## 130 140
hist(resids4, xlab="Residuals", main = "Model4: Histogram of Residuals",nclass=10,col="yellow")
#Answer:
In the Linearity/Mean Zero plot, the residuals are now scattered evenly across the zero line; fixed non-linearity problem
In the Constant Variance/Correlation plot, the spread of the residuals showed no pattern of increasing nor decreasing trend; fixed heteordasticity problem. Also, like previously, there weren’t any clustering in the residuals. This means errors were not correlated
QQ plot shows the majority of the residuals lines up quite well with the normal-line except for a few possible outliers
The histogram also shows similar result with an almost normal profile
Overall, based on these plots, the transformation has improved the linearity assumption. And the transformation was successful in fixing the non-constant variance problem. The only exception, We still have are heavy tailed residuals as indicated by the QQ plot. These outliers still need to be further studied for a more comprehensive analysis
(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
adj.R.Square <- c(summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared)
R.Square <- c(summary(model2)$r.squared,summary(model3)$r.squared,summary(model4)$r.squared)
Models <- c("Model2","Model3","Model4")
df <- data.frame(Models,R.Square ,adj.R.Square)
df%>%
kable() %>%
kable_styling(font_size=20,full_width = F)
Models | R.Square | adj.R.Square |
---|---|---|
Model2 | 0.938 | 0.934 |
Model3 | 0.935 | 0.932 |
Model4 | 0.982 | 0.981 |
#Answer:
(a) Predict Weight for the last 10 rows of data (fishtest) using both model3 and model4. Compare and discuss the mean squared prediction error (MSPE) of both models.
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]
# Mean Squared Prediction Error (MSPE)
pred3 = predict(model3, fishtest)
mse.model3 <- mean((pred3-fishtest$Weight)^2)
# Mean Squared Prediction Error (MSPE)
pred4 = predict(model4, fishtest)^2# to bring it to it's original scale. Remember it was transformed by sqrt of response
mse.model4 <- mean((pred4-fishtest$Weight)^2)
cat("The MSPE of model4 is", mse.model4,"\n")
## The MSPE of model4 is 2443
cat("The MSPE of model3 is", mse.model3,"\n")
## The MSPE of model3 is 9392
#Answer:
(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight on this fish with a 90% prediction interval. Provide an interpretation of the prediction interval.
#new data point for Perch
new.point<-data.frame(Species="Perch", Total.Length=32) # model4 has only one predictor; Total.length
# Calculate prediction interval
predict(model4, new.point, interval="prediction", level=0.9)^2
## fit lwr upr
## 1 462 374 559
#As refernce only
fishfull %>% filter(Species=="Perch") %>% filter(Total.Length>30 & Total.Length<33)
#Answer:
The prediction of fish weight (given Total.Length = 32 cm), is 462 grams with lower bound at 374 grams and the upper bound at 559 grams
Model4 predicts the fish to weigh 462 grams with a 90% prediction interval of 374 to 559. This tells us that we can be 90% confident that the weight of a fish with these specific characteristics will be between 374 to 559 grams
For reference, the actual weight of a Perch data point at Total.Length of 32.8 cm is 514 grams (about 10% higher than what we predicted 462 grams), which is Not too far from an actual weight at given Total.Length of 32 cm.