Background

The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weigh of a product. See below for the description of these measurments.

Data Description

The data consists of the following variables:

  1. Weight: weight of fish in g (numerical)
  2. Species: species name of fish (categorical)
  3. Body.Height: height of body of fish in cm (numerical)
  4. Total.Length: length of fish from mouth to tail in cm (numerical)
  5. Diagonal.Length: length of diagonal of main body of fish in cm (numerical)
  6. Height: height of head of fish in cm (numerical)
  7. Width: width of head of fish in cm (numerical)

Read the data

# Check if installed otherwise install Packages
if (!require("corrplot")) install.packages("corrplot")
## Loading required package: corrplot
## corrplot 0.92 loaded
if (!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Import library you may need
library(car)
library(dplyr)
# Read the data set
fishfull = read.csv("C:/Users/AWadagol/Downloads/Fish.csv",header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]

Please use fish as your data set for the following questions unless otherwise stated.

Question 1: Exploratory Data Analysis [8 points]

(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?

boxplot(fish$Weight ~ fish$Species, data=fish, main = "BoxPlot of Fish Weight for each type of species",xlab = "Fish Species", ylab = "Fish Weight", col = "darkgray")

In this case, we can see that there is a very high variability in the overall weight for each of the species available.Box Plots are generally used to find the range of the particular entity subset, i.e.here we can identify that the range of Weight for Bream ranges from ~250 to ~1,000. It can also help us identify if some of the points are outliers in the overall entity subset, for example the extreme case of Roach weight gives us that there is either a mistake in the measurement or the # of Roach’s that were measured could be from the “smaller” subgroup thereby asking us to question our measurements.

However, boxplots can generally not provide any relationship between Response and Predictor variables

(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body.Height, Total.Length, Diagonal.Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?

library(cowplot)
attach(fish)
bhw <- scatterplot(Body.Height,Weight,regLine = TRUE, legend = TRUE,xlab = "Body Height", ylab = "Weight")

tlw <- scatterplot(Total.Length,Weight, regLine = TRUE, legend = TRUE, xlab="Total Length")

dlw <- scatterplot(Diagonal.Length,Weight, regLine = TRUE, legend = TRUE, xlab = "Diagonal Length")

hw <- scatterplot(Height, Weight, regLine = TRUE, legend = TRUE)

ww <- scatterplot(Width, Weight, regLine = TRUE, legend = TRUE)

We can see from the 5 plots above that there is a positive relation between Weight and each of the Predictor variables. Here we can also see with the individual boxplots that Diagonal Length, Total Length and Body Height have outliers (but so does Weight, the response variable itself). In each of these cases, we can see that the regression line and the overall trend have slight difference with the higher values showing a funnel effect in all the cases (thus higher variability and less likely to be a linear-only regression).

(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.

library(corrplot)
fish_Correlation <- cor(fish[,-2])

fish_Correlation
##                    Weight Body.Height Total.Length Diagonal.Length    Height
## Weight          1.0000000   0.8616894    0.8654773       0.8688250 0.6879801
## Body.Height     0.8616894   1.0000000    0.9995134       0.9919502 0.6268604
## Total.Length    0.8654773   0.9995134    1.0000000       0.9940896 0.6422261
## Diagonal.Length 0.8688250   0.9919502    0.9940896       1.0000000 0.7052116
## Height          0.6879801   0.6268604    0.6422261       0.7052116 1.0000000
## Width           0.8456717   0.8661882    0.8728030       0.8770361 0.7908491
##                     Width
## Weight          0.8456717
## Body.Height     0.8661882
## Total.Length    0.8728030
## Diagonal.Length 0.8770361
## Height          0.7908491
## Width           1.0000000
corrplot(fish_Correlation,method = 'number')

Here we can see that Weight has a really good correlation with each of the predictor variables. This is a good sign, this means that we will be getting a good predictor for each of the predictor variables.

However, the inter-correlation between the different predictor variables is also high. This could lead to many issues, as the overall multicollinearity issues will be higher. Here are some of the issues that we can see if multicollinearity exists -

  1. The coefficients become extremely sensitive to any changes that are made to the regression, as we cannot hold all the other predictor variables constant to identify individual predictor variable effect.
  2. Multicollinearity reduces the accuracy of the B0 and B1 etc. coefficients, which means that the p-value will be distorted and we cannot trust the end prediction as much.

(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?

It is reasonable to assume that a multiple linear regression model can be built between the response and predictor variables. This would be a good starting point to develop a rough estimate. But as we have seen that there is multicollinearity between the different predictor variables, we have to ensure that we run a transformation such as Principal Component Analysis etc. or get more data to reduce the overall effect that multicollinearity will have on the p-value. We could also run a Lasso Regression to reduce the effect.

Question 2: Fitting the Multiple Linear Regression Model [8 points]

Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest

(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.

model1 <- lm(Weight ~ ., data=fish)

summary(model1)
## 
## Call:
## lm(formula = Weight ~ ., data = fish)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.37  -70.59  -23.50   42.42 1335.87 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -813.90     218.34  -3.728 0.000282 ***
## SpeciesParkki       79.34     132.71   0.598 0.550918    
## SpeciesPerch        10.41     206.26   0.050 0.959837    
## SpeciesPike         16.76     233.06   0.072 0.942775    
## SpeciesRoach       194.03     156.84   1.237 0.218173    
## SpeciesSmelt       455.78     204.92   2.224 0.027775 *  
## SpeciesWhitefish    28.31     164.91   0.172 0.863967    
## Body.Height       -176.87      61.36  -2.882 0.004583 ** 
## Total.Length       266.70      77.75   3.430 0.000797 ***
## Diagonal.Length    -72.49      49.48  -1.465 0.145267    
## Height              38.27      22.09   1.732 0.085448 .  
## Width               29.63      40.54   0.731 0.466080    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156.1 on 137 degrees of freedom
## Multiple R-squared:  0.8419, Adjusted R-squared:  0.8292 
## F-statistic:  66.3 on 11 and 137 DF,  p-value: < 2.2e-16

(b) Is the overall regression significant at an \(\alpha\) level of 0.01? Explain.

The p-value for model1 is found to be < 2.2e-16, which is significantly less than 0.01. This means that we will be able to say that the model1 is significant at \(\alpha\) level = 0.01

(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.

\(/beta\) for Body Height is found to be -176.87. This means that holding all the other predictor variables constant, and increase of 1 unit of Body.Height parameter will REDUCE the Weight of the specimen by -176.87 units.

(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.

\(/beta\) for the Species Parkki is 79.34.

Here the base Species auto-coded by R is Bream. So in our case, for every 1 unit increase in Parkki if all the other predictor variables are held constant, then

Question 3: Checking for Outliers and Multicollinearity [6 points]

(a) Create a plot for the Cook’s Distances. Using a threshold Cook’s Distance of 1, identify the row numbers of any outliers.

cooks_distance <- cooks.distance(model1)

plot(cooks_distance)

which(cooks_distance>1)
## 30 
## 30

As of above, the Cooks Distance which is above the threshold distance of 1 is at row number 30.

(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.

fish2 <- fish[-30,]

model2 <- lm(Weight~., data=fish2)

summary(model2)
## 
## Call:
## lm(formula = Weight ~ ., data = fish2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.10  -50.18  -14.44   34.04  433.68 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -969.766    131.601  -7.369 1.51e-11 ***
## SpeciesParkki     195.500     80.105   2.441 0.015951 *  
## SpeciesPerch      174.241    124.404   1.401 0.163608    
## SpeciesPike      -175.936    140.605  -1.251 0.212983    
## SpeciesRoach      141.867     94.319   1.504 0.134871    
## SpeciesSmelt      489.714    123.174   3.976 0.000113 ***
## SpeciesWhitefish  122.277     99.293   1.231 0.220270    
## Body.Height       -76.321     37.437  -2.039 0.043422 *  
## Total.Length       74.822     48.319   1.549 0.123825    
## Diagonal.Length    34.349     30.518   1.126 0.262350    
## Height             10.000     13.398   0.746 0.456692    
## Width              -8.339     24.483  -0.341 0.733924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.84 on 136 degrees of freedom
## Multiple R-squared:  0.9385, Adjusted R-squared:  0.9335 
## F-statistic: 188.6 on 11 and 136 DF,  p-value: < 2.2e-16

(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-\(R^2\)) what conclusions can you draw?

vif(model2)
##                       GVIF Df GVIF^(1/(2*Df))
## Species         1545.55017  6        1.843983
## Body.Height     2371.15420  1       48.694499
## Total.Length    4540.47698  1       67.383062
## Diagonal.Length 2126.64985  1       46.115614
## Height            56.21375  1        7.497583
## Width             29.01683  1        5.386727
r_squared_model2 <- summary(model2)$r.squared

r_squared_model2
## [1] 0.9384836
Threshold_VIF <- max(10,1/(1-r_squared_model2))

Threshold_VIF
## [1] 16.25583

The VIF Threshold is 16.25583.

All of the predictor variables have GVIF greater that VIF Threshold. This means that there is definitely multi-collinearity between the predictor variables.

Question 4: Checking Model Assumptions [6 points]

Please use the cleaned data set, which have the outlier(s) removed, and model2 for answering the following questions.

(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?

plot(model2)

scatterplot(fish2$Body.Height,model2$residuals, xlab = "Body Height", ylab = "Residuals")

scatterplot(fish2$Total.Length,model2$residuals, xlab = "Total Length", ylab = "Residuals")

scatterplot(fish2$Diagonal.Length,model2$residuals, xlab = "Diagonal Length", ylab = "Residuals")

scatterplot(fish2$Height,model2$residuals, xlab = "Height", ylab = "Residuals")

scatterplot(fish2$Width,model2$residuals, xlab = "Width", ylab = "Residuals")

Each of the Scatter plot above shows a curved relationship between the residuals and the individual predictor variables. The non-linearity or curved relationship points that there is non-linearity between the predictor variables and response variables overall.

(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?

scatterplot(model2$fitted.values,model2$residuals,xlab = "Fitted Values", ylab = "Residuals")

Here there is again a clear curved relationship between the fitted values and residuals. This means that the assumption that there is constant variance is FALSE and does NOT HOLD. This shows that there is Heteroskedasticity. However, the errors do not appear to be correlated that is there does not seem to be a standard pattern here.

(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?

hist(model2$residuals, xlab = "Residuals", main = "Histogram of Residuals")

qqPlot(model2$residuals, ylab = "Residuals", main="QQ Plot of Residuals")

##  37 124 
##  36 123

The Histogram above shows Right-Skewed data which means that there are outliers in this dataset

Similarly, the QQ Plot shows that the tails are extremely heavy. This means that there are outliers in the dataset (as pointed by 37 and 124).This means also that there is multi-collinearity in the predictor variables.

Question 5: Partial F Test [6 points]

(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only Species and Total.Length as predicting variables and Weight as the response. Display the summary table of the model3.

sort(abs(model2$residuals), decreasing = TRUE)
##          37         124          22          11          18         131 
## 433.6821337 287.2939225 237.2939225 232.9446157 218.4451574 211.0959242 
##          73         142         141          87          67          17 
## 193.3023944 173.4889673 166.1126201 164.9503727 160.3487177 154.8803646 
##         143          42         111          52          54          16 
## 143.8901435 142.1201450 139.2438437 138.0946718 129.1885390 126.8509021 
##          88          32         103          63          31         130 
## 125.7677052 125.0010723 123.5667941 121.7763911 113.8828459 112.5162641 
##         144          43          60          72         116          25 
## 109.2067171 105.0640215 101.1128105 100.9480852 100.7200619  97.4081500 
##         105          74          86          97          84          26 
##  96.3508999  94.3835378  92.3413297  92.2777682  91.7200009  90.5815278 
##         114         137           1          40         122          76 
##  87.7350196  87.0445433  85.6180791  85.5639469  83.6669971  83.5549277 
##          85           5          61         108          35         112 
##  83.3953971  81.4702259  78.3948639  77.0886242  76.1384667  75.6465707 
##          69           9         140         120          46         146 
##  75.1445007  75.1389246  73.0560302  72.9903396  71.9430989  70.6514673 
##         147         100          36         133          41          93 
##  69.9180221  68.2231900  67.7081180  65.3125904  64.7673010  62.3150282 
##         113         123          15         128          27         127 
##  61.7187199  60.8929665  60.1135979  58.6321110  56.6428113  55.9684790 
##         119          48           7          28          71          29 
##  52.9160352  52.0855889  50.6811820  50.0183212  50.0143630  49.2543684 
##          66         104          21          65          79          53 
##  47.1819223  46.7262002  46.0925140  45.6818713  44.7037081  44.0769294 
##         107         118          33          95         121          96 
##  44.0133576  43.8737179  43.2556654  43.2389387  42.2897978  41.8660972 
##          50         106          10          38         149         117 
##  40.1099972  39.2408142  38.8505934  38.5843904  38.5087794  38.3333091 
##          70          68          47          75          82         136 
##  37.8153560  37.8019896  36.9191401  36.1252002  35.4529513  35.1085463 
##          92           4          44          81          34          78 
##  34.8107937  33.6846266  33.0335287  32.8929112  32.3787095  32.0269935 
##          45         139         110           2           6          59 
##  31.9436460  31.4204892  30.2364345  30.1992465  29.3019671  25.6834189 
##          62          56         126         148          90         125 
##  25.6756876  24.7827257  24.1984666  24.1048315  23.7534593  22.5863568 
##         132         101          98         129           3          55 
##  22.0357219  21.2787810  19.9735124  18.8063948  17.6399825  17.0530752 
##          49          80         138          83          12         109 
##  16.8578948  16.6897796  16.0132303  15.7588406  15.0728406  14.9744008 
##          20         102          94          57           8          24 
##  14.8713532  13.9116133  13.6192175  12.7040793  12.5223374  12.4361861 
##          58          89          19          13          14          91 
##  10.1525251   9.6605766   9.5306368   9.3397333   9.2129921   8.1822409 
##          39         134          77          23         135         115 
##   7.8254481   7.0990765   6.2858617   5.8832477   4.5601310   3.5520471 
##          51         145          99          64 
##   2.9566297   1.8023752   0.3545769   0.1792934
fish2_outliers <- cbind(fish2)[c(37,124,22,11,18,131),]


fish3 <- fish2[c(-38,-125,-22,-11,-18,-132),]

model3 <- lm(Weight ~ Species + Total.Length, data = fish2)

summary(model3)
## 
## Call:
## lm(formula = Weight ~ Species + Total.Length, data = fish2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -233.83  -56.59  -10.13   34.58  418.30 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -730.977     42.449 -17.220  < 2e-16 ***
## SpeciesParkki      63.129     38.889   1.623    0.107    
## SpeciesPerch      -23.941     21.745  -1.101    0.273    
## SpeciesPike      -400.964     33.350 -12.023  < 2e-16 ***
## SpeciesRoach      -19.876     30.111  -0.660    0.510    
## SpeciesSmelt      256.408     39.858   6.433 1.85e-09 ***
## SpeciesWhitefish  -14.971     42.063  -0.356    0.722    
## Total.Length       40.775      1.181  34.527  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.86 on 140 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9321 
## F-statistic: 289.1 on 7 and 140 DF,  p-value: < 2.2e-16

(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an \(\alpha\) level of 0.01?

anova(model3, model2)

The p-value here is 0.14 which is very much greater than \(\alpha\) level of 0.01 – Thus, we can say that we cannot reject the null hypothesis and the additional predictors other than Species and Total.Length are not required for better prediction values.

Question 6: Reduced Model Residual Analysis and Multicollinearity Test [7 points]

(a) Conduct a multicollinearity test on model3. Comment on the multicollinearity in model3.

vif(model3)
##                  GVIF Df GVIF^(1/(2*Df))
## Species      2.654472  6        1.084755
## Total.Length 2.654472  1        1.629255

Earlier we were given a threshold to identify the multicollinearity

r_squared_model3 <- summary(model3)$r.squared

r_squared_model3
## [1] 0.9352946
Threshold_VIF_model3 <- max(10,1/(1-r_squared_model3))

Threshold_VIF_model3
## [1] 15.45466

Now our threshold is way higher than the VIF value of Species and Total.Length (2.654472 each) – so there is no multicollinearity between Species and Total.Length

(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether they hold.

scatterplot(fish2$Total.Length,model3$residuals, xlab = "Total Length", ylab = "Residuals")

scatterplot(model3$fitted.values,model3$residuals,xlab = "Fitted Values", ylab = "Residuals")

hist(model3$residuals, xlab = "Residuals", main = "Histogram of Residuals")

qqPlot(model3$residuals, ylab = "Residuals", main="QQ Plot of Residuals")

##  37 124 
##  36 123

As of Question 4, we can now plot the different plots, we definitely see a U-Shaped Pattern on the fitted values vs residuals. Thus, we can see that there is still non-linearity between the response and predictor variables.

Same way the QQ plot and Histogram shows that there is deviation from the norm. Therefore there is non-linear relationship between the variables

Question 7: Transformation [9 pts]

(a) Use model3 to find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3. What transformation, if any, should be applied according to the lambda value? Please ensure you use model3

boxCoxTransformation <- boxCox(model3)

Maximum_value_BC <- max(boxCoxTransformation$y)

Maximum_Value_row <- which.max(boxCoxTransformation$y)

boxCoxTransformation$x[Maximum_Value_row]
## [1] 0.3434343

Rounding the Lambda value to 0.5, the lambda value will be 0.5

We can apply log-linear transformation or basic square root transformation on the response variable.

(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.

model4 <- lm(Weight^(1/2)~Species + Total.Length,data=fish2)

summary(model4)
## 
## Call:
## lm(formula = Weight^(1/2) ~ Species + Total.Length, data = fish2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0111 -0.7687 -0.0579  0.6797  4.6383 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.96654    0.57278 -12.163  < 2e-16 ***
## SpeciesParkki     -0.36404    0.52476  -0.694   0.4890    
## SpeciesPerch      -1.95734    0.29342  -6.671 5.46e-10 ***
## SpeciesPike      -10.90490    0.45001 -24.233  < 2e-16 ***
## SpeciesRoach      -2.09340    0.40630  -5.152 8.58e-07 ***
## SpeciesSmelt      -1.04994    0.53782  -1.952   0.0529 .  
## SpeciesWhitefish  -0.55048    0.56758  -0.970   0.3338    
## Total.Length       0.95052    0.01594  59.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.28 on 140 degrees of freedom
## Multiple R-squared:  0.9817, Adjusted R-squared:  0.9808 
## F-statistic:  1074 on 7 and 140 DF,  p-value: < 2.2e-16

(c) Perform Residual Analysis on model4. Comment on each assumption. Was the transformation successful/unsuccessful?

scatterplot(fish2$Total.Length,model4$residuals, xlab = "Total Length", ylab = "Residuals")

scatterplot(model4$fitted.values,model4$residuals,xlab = "Fitted Values", ylab = "Residuals")

hist(model4$residuals, xlab = "Residuals", main = "Histogram of Residuals")

qqPlot(model4$residuals, ylab = "Residuals", main="QQ Plot of Residuals")

## 131 141 
## 130 140

As compared to the above two residual analysis, we now start seeing a significant difference in the way the data is being shown in the Fitted Value vs Residual plot. We are seeing a significant reduction in the U-Shaped pattern and are seeing more of a “straight” line showing linearity, which means that the overall transformation is helping reduce the non-linearity

Also, the histogram is not right-skewed, as well as QQ Plot is not as tail heavy as it was in the earlier examples. This shows again that the transformation has helped in reducing overall non-linearity.

Question 8: Model Comparison [2 pts]

(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.

model2_R_Sq <- summary(model2)$r.squared
model2_adj_R_Sq <- summary(model2)$adj.r.squared

model3_R_Sq <- summary(model3)$r.squared
model3_adj_R_Sq <- summary(model3)$adj.r.squared

model4_R_Sq <- summary(model4)$r.squared
model4_adj_R_Sq <- summary(model4)$adj.r.squared

r_Squared_values <- c(model2_R_Sq,model3_R_Sq,model4_R_Sq)
Adjusted_r_Squared_values <- c(model2_adj_R_Sq,model3_adj_R_Sq,model4_adj_R_Sq)

r_Squared_values
## [1] 0.9384836 0.9352946 0.9817138
Adjusted_r_Squared_values
## [1] 0.9335080 0.9320593 0.9807995

Here we can see that from Model 2 to Model 3 both the R-Squared Values and Adjusted R-Squared Values dropped, though Model 4 has the maximum of both the values overall.

This shows that as the overall outlier points and unnecessary variables got removed, as well as an additional transformation was performed on the response variable (to reduce the overall non-linearity, we can see an increase in R-Sq and Adj R-Sq values. The difference in model2 and model3 is also expected as we have more overall predictors and datapoints.

Question 9: Prediction [8 points]

(a) Predict Weight for the last 10 rows of data (fishtest) using both model3 and model4. Compare and discuss the mean squared prediction error (MSPE) of both models.

nrow(fishtest)
## [1] 10
model3_fishtest_predictions <- predict(model3, fishtest)
model3_fishtest_predictions
##         150         151         152         153         154         155 
##  835.320989  492.283575  351.535704    7.581312  223.690686 -143.287496 
##         156         157         158         159 
##  185.102637  621.398920  147.658936   14.734838
model3_MPSE <- mean((fishtest$Weight - model3_fishtest_predictions)^2)
model3_MPSE
## [1] 9392.25
model4_fishtest_predictions <- predict(model4, fishtest)
model4_fishtest_predictions
##       150       151       152       153       154       155       156       157 
## 28.146519 21.549151 16.432491  8.850901 13.888673  5.333966 12.830454 23.001051 
##       158       159 
## 11.679876  3.389796
model4_fishtest_predictions <- model4_fishtest_predictions^2
model4_fishtest_predictions
##       150       151       152       153       154       155       156       157 
## 792.22652 464.36590 270.02675  78.33845 192.89524  28.45119 164.62056 529.04835 
##       158       159 
## 136.41949  11.49071
model4_MPSE <- mean((fishtest$Weight - model4_fishtest_predictions)^2)
model4_MPSE
## [1] 2442.998

MPSE of Model 4 is 2442.998 and that of Model 3 is 9392.25

Thus, Model 4 is definitely the better option overall.

(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight on this fish with a 90% prediction interval. Provide an interpretation of the prediction interval.

fish %>% filter(Species == "Perch") %>% filter(Total.Length>=32)  %>% filter(Body.Height>= 28) %>% filter(Total.Length <34) %>% filter(Body.Height <31)
new_Perch_fish <- data.frame(Species = "Perch", Body.Height = 28, Total.Length = 32 )

new_Perch_fish_weight <- predict(model4, new_Perch_fish, interval = "prediction", level = 0.9)

new_Perch_fish_weight <- new_Perch_fish_weight^2

new_Perch_fish_weight
##        fit      lwr      upr
## 1 461.9429 374.4536 558.6091

Here we find that the fit value of the new Perch fish is 461.9429 gms for a confidence interval of 90%. This means that for 90% CI, we can be confident that the minimum value of weight will be 374.4536 gms and maximum value of weight will be 558.6091.

The value of the filtered Fish table shows us that for similar Body.Height and Total.Length the weight for a perch will be 514 gms. which is between the 90% confidence interval range.