Load libraries

Background

The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weigh of a product. See below for the description of these measurments.

Data Description

The data consists of the following variables:

Weight: weight of fish in g (numerical)
Species: species name of fish (categorical)
Body.Height: height of body of fish in cm (numerical)
Total.Length: length of fish from mouth to tail in cm (numerical)
Diagonal.Length: length of diagonal of main body of fish in cm (numerical)
Height: height of head of fish in cm (numerical)
Width: width of head of fish in cm (numerical)

Read the data

# Import library you may need
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Read the data set
fishfull = read.csv("Fish.csv",header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]

head(fish,2)

Please use fish as your data set for the following questions unless otherwise stated.

Question 1: Exploratory Data Analysis [8 points]

(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?

#making Species as factors
fish$Species <- as.factor(fish$Species)
#class(fishfull$Species)

# Visualizations via Box plots 
boxplot(Weight ~ Species, data = fish,
        xlab = "Species Types",
        ylab = "Weight",
        main = "Box plots of Fish Weights across different species",
        notch = FALSE,
        varwidth = TRUE,
        col = c("green", "red", "blue","yellow", "violet", "lightblue","orange")
        )

#Answer:

A few characteristics of fish weights with respect to species types were discernible from the box plots:

There is quite a bit of variability among the median weights across different species of fish
It does show that certain species like Roach has a few extreme outliers but most of it’s data are in tight range.
whereas, Pikes, has the highest amount of variability with a longer box and whiskers
Smelt is the lowest in weight while all the other species exhibited right skewed except for Bream. Bream seem to have the most “normal” distribution

To summarize, the box plots enable end-users to quickly tell which species has the most variances, distribution types of weights. Lastly, it shows which species weighs the least and which ones has extreme outliers but beyond that, there’s really no good linear relationship between weights and species types mainly because of the high amount variability.

(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body.Height, Total.Length, Diagonal.Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?

#Answer:

Graphically, it appears that all 5 predictors have a positive relationship with the Weight of the fish, but there seem to be some slight non-linear curvature relationship between the response variable and its quantitative predictors
As the predictor variables increases, so does its corresponding weights. This can be visually seen with an upward movement in the scatter plots.
In addition, there seem to be an outlier in 3 of the predictors; Diagonal length, Body Height, Total Length around data point 20 while the predictors of Height and Width predictors have more outliers in the lower range, somewhere between 4 thru 12

(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.

library(corrplot)

## corrplot 0.92 loaded

headers = c("Body.Height","Total.Length","Diagonal.Length","Height","Width","Weight")
#scale(df2[,1:9])
newdata <- data.frame(scale(fish[,-2]))[headers]
correlation = cor(newdata )
corrplot(correlation, method = 'number',bg="lightblue")

#Answer:

With respect to (w.r.t) response variable to predictors:

Most of the predictors have relatively high correlation w.r.t. response variable in the mid 0.80 range. The lowest was Height at 0.69 between Weight and Width
The same can be observed for predictors w.r.t. to each other as well. For instance, Total length and Body Height have perfect correlation while Diagonal Length and Body Height have correlation of 0.99. Width vs. Total length, Body Height and Diagonal Length also experienced high correlation at the mid .87 range. Only Height saw a lower correlation level with it’s corresponding predictors at the mid 0.6 range.

The fact that predictors have high correlation relationships with its response is a good thing as it will translate to higher predictive power of the model. But, those high correlation levels among predictors spells trouble and will manifest itself in multicollinearity issues (VIF issues). Some of the well-documented areas that cause problems to the model are:

• Standard errors are inflated

• Larger the confidence interval

• The less likely it is that a coefficient will be evaluated as statistically

While all these issues are problematic, its not fatal. The model can still be utilized but with caution

(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?

#Answer:

I would still use a multiple linear regression model (MLR) as a baseline model first, but given the potential problems with multicollinearity, I would be cautious about interpreting the statistical significance of the regression coefficients, confidence interval or it’s predictive power.
We can attempt a multiple linear regression model first, but in order to improve the model, we may need to perform some Box-Cox transformations later to reduce the heteroskedasticity and/or incorporate non-linear associations in the linear model by including transformed versions of the predictors in the model.

Remember, this is just general guidelines to keep the model more interpretable and parsimonious, while in a true a modeling process, a more robust sub-setting selection methodology like a step-wise regression may still be needed to be performed to truly trim down the model and only utilize the most impactful predictors relative to the response

Question 2: Fitting the Multiple Linear Regression Model [8 points]

Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest

(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.

model1 <- lm(Weight ~ ., data=fish)
summary(model1)

## 
## Call:
## lm(formula = Weight ~ ., data = fish)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.37  -70.59  -23.50   42.42 1335.87 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -813.90     218.34  -3.728 0.000282 ***
## SpeciesParkki       79.34     132.71   0.598 0.550918    
## SpeciesPerch        10.41     206.26   0.050 0.959837    
## SpeciesPike         16.76     233.06   0.072 0.942775    
## SpeciesRoach       194.03     156.84   1.237 0.218173    
## SpeciesSmelt       455.78     204.92   2.224 0.027775 *  
## SpeciesWhitefish    28.31     164.91   0.172 0.863967    
## Body.Height       -176.87      61.36  -2.882 0.004583 ** 
## Total.Length       266.70      77.75   3.430 0.000797 ***
## Diagonal.Length    -72.49      49.48  -1.465 0.145267    
## Height              38.27      22.09   1.732 0.085448 .  
## Width               29.63      40.54   0.731 0.466080    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156.1 on 137 degrees of freedom
## Multiple R-squared:  0.8419, Adjusted R-squared:  0.8292 
## F-statistic:  66.3 on 11 and 137 DF,  p-value: < 2.2e-16

(b) Is the overall regression significant at an \(\alpha\) level of 0.01? Explain.

#Answer:

Because the p value associated to the F-statistic (<2.2e-16) is less than the α level, we reject the null

hypothesis that all slope coefficients are equal to zero, and state that the model is statistically significant at an α level of 0.01

(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.

#Answer:

βˆBody.Height = -176.87, which means for every unit increase in Body.Height, the expected weight

of the fish DECREASES by 176.87 units holding all other variables constant. This behavior is peculiar and counter-intuitive at best.

This interpretation is evidence that multicollinearity is distorting the coefficient estimates of the model

(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.

#Answer:

Species Parkki = 79.34, which means that if the fish is categorized as Parkki, then the expected weight of the fish would be 79.34 unit greater than the baseline species Bream holding all other variables constant. Given that the associated p-value equals 0.550918, this coefficient is not statistically significant.

Question 3: Checking for Outliers and Multicollinearity [6 points]

(a) Create a plot for the Cook’s Distances. Using a threshold Cook’s Distance of 1, identify the row numbers of any outliers.

# Cook’s Distance
cook = cooks.distance(model1)

plot(cook,type="h", lwd=4, col="red", ylab = "Cook's Distance", main="Cook's Distance")
abline(h=1,col="blue")

#Identify outliers
cat("Observation", which(cook>1), "has a cook's distance that is greater than 1")

## Observation 30 has a cook's distance that is greater than 1

#Answer:

Observation 30 is greater than 1

(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.

fish2<- fish[-c(30),]
model2 <- lm(Weight~., data=fish2)
summary(model2)

## 
## Call:
## lm(formula = Weight ~ ., data = fish2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.10  -50.18  -14.44   34.04  433.68 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -969.766    131.601  -7.369 1.51e-11 ***
## SpeciesParkki     195.500     80.105   2.441 0.015951 *  
## SpeciesPerch      174.241    124.404   1.401 0.163608    
## SpeciesPike      -175.936    140.605  -1.251 0.212983    
## SpeciesRoach      141.867     94.319   1.504 0.134871    
## SpeciesSmelt      489.714    123.174   3.976 0.000113 ***
## SpeciesWhitefish  122.277     99.293   1.231 0.220270    
## Body.Height       -76.321     37.437  -2.039 0.043422 *  
## Total.Length       74.822     48.319   1.549 0.123825    
## Diagonal.Length    34.349     30.518   1.126 0.262350    
## Height             10.000     13.398   0.746 0.456692    
## Width              -8.339     24.483  -0.341 0.733924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.84 on 136 degrees of freedom
## Multiple R-squared:  0.9385, Adjusted R-squared:  0.9335 
## F-statistic: 188.6 on 11 and 136 DF,  p-value: < 2.2e-16

(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-\(R^2\)) what conclusions can you draw?

vif(model2)

##                       GVIF Df GVIF^(1/(2*Df))
## Species         1545.55017  6        1.843983
## Body.Height     2371.15420  1       48.694499
## Total.Length    4540.47698  1       67.383062
## Diagonal.Length 2126.64985  1       46.115614
## Height            56.21375  1        7.497583
## Width             29.01683  1        5.386727

VIF_threshold <- max(10, 1/(1-summary(model2)$r.squared))
cat("VIF threshold:", VIF_threshold)

## VIF threshold: 16.25583

#Answer:

The VIF threshold in model2 happens to be 16.25583
Since all the predictors have VIF greater than 10 (first column:GVIF), the test is positive for multicollinearity

Question 4: Checking Model Assumptions [6 points]

Please use the cleaned data set, which have the outlier(s) removed, and model2 for answering the following questions.

(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?

fish_1 <- fish2[,-2]
## Standardized Residuals versus individual predicting variables 
resids = stdres(model2)
par(mfrow =c(3,2))
xlabc=c("Body.Height","Total.Length","Diagonal.Length","Height","Width")
for (i in 2:6) {
  plot(fish_1[,i],resids,xlab=xlabc[i-1],ylab="Residuals")
  abline(0,0,col="red")
  lines(lowess(fish_1[,i], resids), col='blue')}

#Answer:

From the scatter plots for Body.Height, Total.Length, Diagonal.Lenght, and Width we can see that the residuals exhibit a slight U-shape pattern, which provides an indication of non-linearity.

For example:

Body.Height: A possible concern between 20-30cm, where the majority of the residuals are below the zero line.
Total.Length: A possible concern between 20-30cm, where the majority of the residuals are below the zero line.
Diagonal.Length: A possible concern between 27-35cm, where the majority of the residuals are below the zero line.
Width: A possible concern between 3.9-5cm, where the majority of the residuals are below the zero line

Overall, the linearity assumption does not seem to hold

(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?

fits = model2$fitted
plot(fits, resids, xlab="Fitted Values",ylab="Residuals", main="Model2: Residuals vs. Fitted Values")
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits, resids), col='blue')

#Answer:

The residuals do not have a random pattern associated with it. In fact, it again exhibit that slight U-shape around the zero line (in red). This is an indication that heteordasticity (non-constant variance) is a problem. However, there does not appear to have any “clustering effect” in the residuals. This suggests that the errors might be uncorrelated

(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?

qqPlot(resids, ylab="Residuals", main = "Model2: Q-Q Plot of Residuals")

##  37 124 
##  36 123

hist(resids, xlab="Residuals", main = "Mdel2: Histogram of Residuals",nclass=10,col="gold")

#Answer:

The QQ plot indicates that the residuals in our regression model have heavy tails. The QQ plot indicates there are outliers and non-normality in the data set. The histogram displays a right-skewed long tail which confirms there are outliers in the data. The logical conclusion is the data needs to be improved upon via some transformation methodology

Question 5: Partial F Test [6 points]

(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only Species and Total.Length as predicting variables and Weight as the response. Display the summary table of the model3.

fish_3 <- fish2[,c(1,2,4)]
model3 <- lm(Weight ~ ., data=fish_3)
summary(model3)

## 
## Call:
## lm(formula = Weight ~ ., data = fish_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -233.83  -56.59  -10.13   34.58  418.30 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -730.977     42.449 -17.220  < 2e-16 ***
## SpeciesParkki      63.129     38.889   1.623    0.107    
## SpeciesPerch      -23.941     21.745  -1.101    0.273    
## SpeciesPike      -400.964     33.350 -12.023  < 2e-16 ***
## SpeciesRoach      -19.876     30.111  -0.660    0.510    
## SpeciesSmelt      256.408     39.858   6.433 1.85e-09 ***
## SpeciesWhitefish  -14.971     42.063  -0.356    0.722    
## Total.Length       40.775      1.181  34.527  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.86 on 140 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9321 
## F-statistic: 289.1 on 7 and 140 DF,  p-value: < 2.2e-16

(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an \(\alpha\) level of 0.01?

anova(model3,model2)

#Answer:

The Partial F-stat = 1.7626 is as shown in the ANOVA Table above. The P-value is greater than 0.01 which means that we fail to reject the null hypothesis and conclude the additional predictors (Body.Height, Dia.Length, Height & Width) will NOT be significantly associated to the response variable: Weight of fish

Question 6: Reduced Model Residual Analysis and Multicollinearity Test [7 points]

(a) Conduct a multicollinearity test on model3. Comment on the multicollinearity in model3.

vif(model3)

##                  GVIF Df GVIF^(1/(2*Df))
## Species      2.654472  6        1.084755
## Total.Length 2.654472  1        1.629255

VIF_threshold3 <- max(10, 1/(1-summary(model3)$r.squared))
cat("VIF threshold of Model3:", VIF_threshold3)

## VIF threshold of Model3: 15.45466

#Answer:

The VIF threshold in model3 happens to be 15.45466
Since all the predictors have VIF less than 10 (first column:GVIF), the test is negative for multicollinearity

(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether they hold.

par(mfrow=c(2,2))

#Residuals of predictor vs Residuals for linearity check
resids3 = stdres(model3)
xlabc=c("Total.Length")
plot(fish_3[,3],resids3,xlab=xlabc,ylab="Residuals", pch = 19,main="Model3: Lineartiy/Mean Zero")
  abline(0,0,col="red")
  lines(lowess(fish_3[,3], resids3), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits3 = model3$fitted
plot(fits3, resids3, xlab="Fitted Values",ylab="Residuals",main="Model3: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits3, resids3), col='blue')

#QQ Plot & Histogram of residuals
qqPlot(resids3, ylab="Residuals", main = "Model3: Q-Q Plot of Residuals")

##  37 124 
##  36 123

hist(resids3, xlab="Residuals", main = "Model3: Histogram of Residuals",nclass=10,col="yellow")

#Answer:

We still have similar linearity/mean zero and constant variance violations like model2 with a u-shape pattern in the residuals and increasing variance as fitted values increase (non-constant variance problem). Specifically, residuals vs. predictor and residuals vs. fitted values did not have a random scatter of points hovering around line zero. It’s actually these lack of randomness in the scatter of residuals that indicated the data has both a non-linearity and heteordasticity issues.
Similarly, the histogram and QQ plot both exhibited deviations from normality with a heavy right tail. Thus Normality does not hold. A transformation of the response may be needed here

Question 7: Transformation [9 pts]

(a) Use model3 to find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3. What transformation, if any, should be applied according to the lambda value? Please ensure you use model3

library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:ggpubr':
## 
##     mutate

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

options(digits=3)
bc <- boxcox((model3))

lambda <- bc$x[which.max(bc$y)]
cat("Optimal Lambda = ", lambda, end="\n")

## Optimal Lambda =  0.343

#Answer:

Rounding up to nearest 0.5 integer resulted in the optimal lambda being 0.5. According to the literature, the recommendation is to take the square root of response variable; Weight

(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.

#transformed response variable
transformed_y <- sqrt(fish_3$Weight)
fishfull_transformed <- fish_3 %>% mutate(sqrt_weight=transformed_y) %>% select(sqrt_weight,Species, Total.Length)
head(fishfull_transformed ,2)

model4 <- lm(sqrt_weight~., data=fishfull_transformed )
summary(model4)

## 
## Call:
## lm(formula = sqrt_weight ~ ., data = fishfull_transformed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.011 -0.769 -0.058  0.680  4.638 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.9665     0.5728  -12.16  < 2e-16 ***
## SpeciesParkki     -0.3640     0.5248   -0.69    0.489    
## SpeciesPerch      -1.9573     0.2934   -6.67  5.5e-10 ***
## SpeciesPike      -10.9049     0.4500  -24.23  < 2e-16 ***
## SpeciesRoach      -2.0934     0.4063   -5.15  8.6e-07 ***
## SpeciesSmelt      -1.0499     0.5378   -1.95    0.053 .  
## SpeciesWhitefish  -0.5505     0.5676   -0.97    0.334    
## Total.Length       0.9505     0.0159   59.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.28 on 140 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.981 
## F-statistic: 1.07e+03 on 7 and 140 DF,  p-value: <2e-16

(c) Perform Residual Analysis on model4. Comment on each assumption. Was the transformation successful/unsuccessful?

par(mfrow=c(2,2))

#Residuals of predictor vs Residuals for linearity check
resids4 = stdres(model4)
xlabc=c("Total.Length")
plot(fishfull_transformed[,3],resids4,xlab=xlabc,ylab="Residuals", pch = 19,main="Model4: Lineartiy/Mean Zero")
  abline(0,0,col="red")
  lines(lowess(fishfull_transformed[,3], resids4), col='blue')
#Residuals of fitted vs Residuals for non-constant variance check
fits4 = model4$fitted
plot(fits4, resids4, xlab="Fitted Values",ylab="Residuals",main="Model4: Heterodasticity Check",pch = 19)
abline(0,0,col="red")
#add lowess smoothing curve to plot
lines(lowess(fits4, resids4), col='blue')

#QQ Plot & Histogram of residuals
qqPlot(resids4, ylab="Residuals", main = "Model4: Q-Q Plot of Residuals")

## 131 141 
## 130 140

hist(resids4, xlab="Residuals", main = "Model4: Histogram of Residuals",nclass=10,col="yellow")

#Answer:

In the Linearity/Mean Zero plot, the residuals are now scattered evenly across the zero line; fixed non-linearity problem
In the Constant Variance/Correlation plot, the spread of the residuals showed no pattern of increasing nor decreasing trend; fixed heteordasticity problem. Also, like previously, there weren’t any clustering in the residuals. This means errors were not correlated
QQ plot shows the majority of the residuals lines up quite well with the normal-line except for a few possible outliers
The histogram also shows similar result with an almost normal profile

Overall, based on these plots, the transformation has improved the linearity assumption. And the transformation was successful in fixing the non-constant variance problem. The only exception, We still have are heavy tailed residuals as indicated by the QQ plot. These outliers still need to be further studied for a more comprehensive analysis

Question 8: Model Comparison [2 pts]

(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

adj.R.Square <- c(summary(model2)$adj.r.squared,summary(model3)$adj.r.squared,summary(model4)$adj.r.squared)
R.Square <- c(summary(model2)$r.squared,summary(model3)$r.squared,summary(model4)$r.squared)
Models <- c("Model2","Model3","Model4")

df <- data.frame(Models,R.Square ,adj.R.Square)
df%>%
  kable() %>%
  kable_styling(font_size=20,full_width = F)

Models	R.Square	adj.R.Square
Model2	0.938	0.934
Model3	0.935	0.932
Model4	0.982	0.981

#Answer:

Model4 has the highest R-square and Adj. R-Square. The transformation and the exclusion of high outlier at data point 30 actually improved the fit of the model. It’s Not surprisingly that model2 has both a slightly higher R-square and thus adj. R-square than model3 because it has more predictors. But such increase might not indicate increased explanatory power in model2 as most of its predictors has multicollinearity/VIF issues.

Question 9: Prediction [8 points]

(a) Predict Weight for the last 10 rows of data (fishtest) using both model3 and model4. Compare and discuss the mean squared prediction error (MSPE) of both models.

# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]
# Mean Squared Prediction Error (MSPE)
pred3 = predict(model3, fishtest)
mse.model3 <- mean((pred3-fishtest$Weight)^2)
# Mean Squared Prediction Error (MSPE)
pred4 = predict(model4, fishtest)^2# to bring it to it's original scale.  Remember it was transformed by sqrt of response
mse.model4 <- mean((pred4-fishtest$Weight)^2)
cat("The MSPE of model4 is", mse.model4,"\n")

## The MSPE of model4 is 2443

cat("The MSPE of model3 is", mse.model3,"\n")

## The MSPE of model3 is 9392

#Answer:

Based on this metric, MSPE only, model4 is preferred as it has a lower MSPE value

(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight on this fish with a 90% prediction interval. Provide an interpretation of the prediction interval.

#new data point for Perch
new.point<-data.frame(Species="Perch", Total.Length=32) # model4 has only one predictor; Total.length
# Calculate prediction interval
predict(model4, new.point, interval="prediction", level=0.9)^2

##   fit lwr upr
## 1 462 374 559

#As refernce only
fishfull %>% filter(Species=="Perch") %>% filter(Total.Length>30 & Total.Length<33)

#Answer:

The prediction of fish weight (given Total.Length = 32 cm), is 462 grams with lower bound at 374 grams and the upper bound at 559 grams
Model4 predicts the fish to weigh 462 grams with a 90% prediction interval of 374 to 559. This tells us that we can be 90% confident that the weight of a fish with these specific characteristics will be between 374 to 559 grams
For reference, the actual weight of a Perch data point at Total.Length of 32.8 cm is 514 grams (about 10% higher than what we predicted 462 grams), which is Not too far from an actual weight at given Total.Length of 32 cm.

HW2 Peer Assessment

Load libraries

Background

Data Description

Read the data

Question 1: Exploratory Data Analysis [8 points]

Question 2: Fitting the Multiple Linear Regression Model [8 points]

Question 3: Checking for Outliers and Multicollinearity [6 points]

Question 4: Checking Model Assumptions [6 points]

Question 5: Partial F Test [6 points]

Question 6: Reduced Model Residual Analysis and Multicollinearity Test [7 points]

Question 7: Transformation [9 pts]

Question 8: Model Comparison [2 pts]

Question 9: Prediction [8 points]