The fishing industry uses numerous measurements to describe a specific fish. Our goal is to predict the weight of a fish based on a number of these measurements and determine if any of these measurements are insignificant in determining the weigh of a product. See below for the description of these measurments.
The data consists of the following variables:
# Check if installed otherwise install Packages
if (!require("corrplot")) install.packages("corrplot")
## Loading required package: corrplot
## corrplot 0.92 loaded
if (!require("dplyr")) install.packages("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require("car")) install.packages("car")
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Import library you may need
library(car)
library(dplyr)
# Read the data set
fishfull = read.csv("C:/Users/AWadagol/Downloads/Fish.csv",header=T, fileEncoding = 'UTF-8-BOM')
row.cnt = nrow(fishfull)
# Split the data into training and testing sets
fishtest = fishfull[(row.cnt-9):row.cnt,]
fish = fishfull[1:(row.cnt-10),]
Please use fish as your data set for the following questions unless otherwise stated.
(a) Create a box plot comparing the response variable, Weight, across the multiple species. Based on this box plot, does there appear to be a relationship between the predictor and the response?
boxplot(fish$Weight ~ fish$Species, data=fish, main = "BoxPlot of Fish Weight for each type of species",xlab = "Fish Species", ylab = "Fish Weight", col = "darkgray")
In this case, we can see that there is a very high variability in the overall weight for each of the species available.Box Plots are generally used to find the range of the particular entity subset, i.e.here we can identify that the range of Weight for Bream ranges from ~250 to ~1,000. It can also help us identify if some of the points are outliers in the overall entity subset, for example the extreme case of Roach weight gives us that there is either a mistake in the measurement or the # of Roach’s that were measured could be from the “smaller” subgroup thereby asking us to question our measurements.
However, boxplots can generally not provide any relationship between Response and Predictor variables
(b) Create scatterplots of the response, Weight, against each quantitative predictor, namely Body.Height, Total.Length, Diagonal.Length, Height, and Width. Describe the general trend of each plot. Are there any potential outliers?
library(cowplot)
attach(fish)
bhw <- scatterplot(Body.Height,Weight,regLine = TRUE, legend = TRUE,xlab = "Body Height", ylab = "Weight")
tlw <- scatterplot(Total.Length,Weight, regLine = TRUE, legend = TRUE, xlab="Total Length")
dlw <- scatterplot(Diagonal.Length,Weight, regLine = TRUE, legend = TRUE, xlab = "Diagonal Length")
hw <- scatterplot(Height, Weight, regLine = TRUE, legend = TRUE)
ww <- scatterplot(Width, Weight, regLine = TRUE, legend = TRUE)
We can see from the 5 plots above that there is a positive relation between Weight and each of the Predictor variables. Here we can also see with the individual boxplots that Diagonal Length, Total Length and Body Height have outliers (but so does Weight, the response variable itself). In each of these cases, we can see that the regression line and the overall trend have slight difference with the higher values showing a funnel effect in all the cases (thus higher variability and less likely to be a linear-only regression).
(c) Display the correlations between each of the quantitative variables. Interpret the correlations in the context of the relationships of the predictors to the response and in the context of multicollinearity.
library(corrplot)
fish_Correlation <- cor(fish[,-2])
fish_Correlation
## Weight Body.Height Total.Length Diagonal.Length Height
## Weight 1.0000000 0.8616894 0.8654773 0.8688250 0.6879801
## Body.Height 0.8616894 1.0000000 0.9995134 0.9919502 0.6268604
## Total.Length 0.8654773 0.9995134 1.0000000 0.9940896 0.6422261
## Diagonal.Length 0.8688250 0.9919502 0.9940896 1.0000000 0.7052116
## Height 0.6879801 0.6268604 0.6422261 0.7052116 1.0000000
## Width 0.8456717 0.8661882 0.8728030 0.8770361 0.7908491
## Width
## Weight 0.8456717
## Body.Height 0.8661882
## Total.Length 0.8728030
## Diagonal.Length 0.8770361
## Height 0.7908491
## Width 1.0000000
corrplot(fish_Correlation,method = 'number')
Here we can see that Weight has a really good correlation with each of the predictor variables. This is a good sign, this means that we will be getting a good predictor for each of the predictor variables.
However, the inter-correlation between the different predictor variables is also high. This could lead to many issues, as the overall multicollinearity issues will be higher. Here are some of the issues that we can see if multicollinearity exists -
(d) Based on this exploratory analysis, is it reasonable to assume a multiple linear regression model for the relationship between Weight and the predictor variables?
It is reasonable to assume that a multiple linear regression model can be built between the response and predictor variables. This would be a good starting point to develop a rough estimate. But as we have seen that there is multicollinearity between the different predictor variables, we have to ensure that we run a transformation such as Principal Component Analysis etc. or get more data to reduce the overall effect that multicollinearity will have on the p-value. We could also run a Lasso Regression to reduce the effect.
Create the full model without transforming the response variable or predicting variables using the fish data set. Do not use fishtest
(a) Build a multiple linear regression model, called model1, using the response and all predictors. Display the summary table of the model.
model1 <- lm(Weight ~ ., data=fish)
summary(model1)
##
## Call:
## lm(formula = Weight ~ ., data = fish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.37 -70.59 -23.50 42.42 1335.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -813.90 218.34 -3.728 0.000282 ***
## SpeciesParkki 79.34 132.71 0.598 0.550918
## SpeciesPerch 10.41 206.26 0.050 0.959837
## SpeciesPike 16.76 233.06 0.072 0.942775
## SpeciesRoach 194.03 156.84 1.237 0.218173
## SpeciesSmelt 455.78 204.92 2.224 0.027775 *
## SpeciesWhitefish 28.31 164.91 0.172 0.863967
## Body.Height -176.87 61.36 -2.882 0.004583 **
## Total.Length 266.70 77.75 3.430 0.000797 ***
## Diagonal.Length -72.49 49.48 -1.465 0.145267
## Height 38.27 22.09 1.732 0.085448 .
## Width 29.63 40.54 0.731 0.466080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 156.1 on 137 degrees of freedom
## Multiple R-squared: 0.8419, Adjusted R-squared: 0.8292
## F-statistic: 66.3 on 11 and 137 DF, p-value: < 2.2e-16
(b) Is the overall regression significant at an \(\alpha\) level of 0.01? Explain.
The p-value for model1 is found to be < 2.2e-16, which is significantly less than 0.01. This means that we will be able to say that the model1 is significant at \(\alpha\) level = 0.01
(c) What is the coefficient estimate for Body.Height? Interpret this coefficient.
\(/beta\) for Body Height is found to be -176.87. This means that holding all the other predictor variables constant, and increase of 1 unit of Body.Height parameter will REDUCE the Weight of the specimen by -176.87 units.
(d) What is the coefficient estimate for the Species category Parkki? Interpret this coefficient.
\(/beta\) for the Species Parkki is 79.34.
Here the base Species auto-coded by R is Bream. So in our case, for every 1 unit increase in Parkki if all the other predictor variables are held constant, then
(a) Create a plot for the Cook’s Distances. Using a threshold Cook’s Distance of 1, identify the row numbers of any outliers.
cooks_distance <- cooks.distance(model1)
plot(cooks_distance)
which(cooks_distance>1)
## 30
## 30
As of above, the Cooks Distance which is above the threshold distance of 1 is at row number 30.
(b) Remove the outlier(s) from the data set and create a new model, called model2, using all predictors with Weight as the response. Display the summary of this model.
fish2 <- fish[-30,]
model2 <- lm(Weight~., data=fish2)
summary(model2)
##
## Call:
## lm(formula = Weight ~ ., data = fish2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.10 -50.18 -14.44 34.04 433.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -969.766 131.601 -7.369 1.51e-11 ***
## SpeciesParkki 195.500 80.105 2.441 0.015951 *
## SpeciesPerch 174.241 124.404 1.401 0.163608
## SpeciesPike -175.936 140.605 -1.251 0.212983
## SpeciesRoach 141.867 94.319 1.504 0.134871
## SpeciesSmelt 489.714 123.174 3.976 0.000113 ***
## SpeciesWhitefish 122.277 99.293 1.231 0.220270
## Body.Height -76.321 37.437 -2.039 0.043422 *
## Total.Length 74.822 48.319 1.549 0.123825
## Diagonal.Length 34.349 30.518 1.126 0.262350
## Height 10.000 13.398 0.746 0.456692
## Width -8.339 24.483 -0.341 0.733924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.84 on 136 degrees of freedom
## Multiple R-squared: 0.9385, Adjusted R-squared: 0.9335
## F-statistic: 188.6 on 11 and 136 DF, p-value: < 2.2e-16
(c) Display the VIF of each predictor for model2. Using a VIF threshold of max(10, 1/(1-\(R^2\)) what conclusions can you draw?
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## Species 1545.55017 6 1.843983
## Body.Height 2371.15420 1 48.694499
## Total.Length 4540.47698 1 67.383062
## Diagonal.Length 2126.64985 1 46.115614
## Height 56.21375 1 7.497583
## Width 29.01683 1 5.386727
r_squared_model2 <- summary(model2)$r.squared
r_squared_model2
## [1] 0.9384836
Threshold_VIF <- max(10,1/(1-r_squared_model2))
Threshold_VIF
## [1] 16.25583
The VIF Threshold is 16.25583.
All of the predictor variables have GVIF greater that VIF Threshold. This means that there is definitely multi-collinearity between the predictor variables.
Please use the cleaned data set, which have the outlier(s) removed, and model2 for answering the following questions.
(a) Create scatterplots of the standardized residuals of model2 versus each quantitative predictor. Does the linearity assumption appear to hold for all predictors?
plot(model2)
scatterplot(fish2$Body.Height,model2$residuals, xlab = "Body Height", ylab = "Residuals")
scatterplot(fish2$Total.Length,model2$residuals, xlab = "Total Length", ylab = "Residuals")
scatterplot(fish2$Diagonal.Length,model2$residuals, xlab = "Diagonal Length", ylab = "Residuals")
scatterplot(fish2$Height,model2$residuals, xlab = "Height", ylab = "Residuals")
scatterplot(fish2$Width,model2$residuals, xlab = "Width", ylab = "Residuals")
Each of the Scatter plot above shows a curved relationship between the residuals and the individual predictor variables. The non-linearity or curved relationship points that there is non-linearity between the predictor variables and response variables overall.
(b) Create a scatter plot of the standardized residuals of model2 versus the fitted values of model2. Does the constant variance assumption appear to hold? Do the errors appear uncorrelated?
scatterplot(model2$fitted.values,model2$residuals,xlab = "Fitted Values", ylab = "Residuals")
Here there is again a clear curved relationship between the fitted values and residuals. This means that the assumption that there is constant variance is FALSE and does NOT HOLD. This shows that there is Heteroskedasticity. However, the errors do not appear to be correlated that is there does not seem to be a standard pattern here.
(c) Create a histogram and normal QQ plot for the standardized residuals. What conclusions can you draw from these plots?
hist(model2$residuals, xlab = "Residuals", main = "Histogram of Residuals")
qqPlot(model2$residuals, ylab = "Residuals", main="QQ Plot of Residuals")
## 37 124
## 36 123
The Histogram above shows Right-Skewed data which means that there are outliers in this dataset
Similarly, the QQ Plot shows that the tails are extremely heavy. This means that there are outliers in the dataset (as pointed by 37 and 124).This means also that there is multi-collinearity in the predictor variables.
(a) Build a third multiple linear regression model using the cleaned data set without the outlier(s), called model3, using only Species and Total.Length as predicting variables and Weight as the response. Display the summary table of the model3.
sort(abs(model2$residuals), decreasing = TRUE)
## 37 124 22 11 18 131
## 433.6821337 287.2939225 237.2939225 232.9446157 218.4451574 211.0959242
## 73 142 141 87 67 17
## 193.3023944 173.4889673 166.1126201 164.9503727 160.3487177 154.8803646
## 143 42 111 52 54 16
## 143.8901435 142.1201450 139.2438437 138.0946718 129.1885390 126.8509021
## 88 32 103 63 31 130
## 125.7677052 125.0010723 123.5667941 121.7763911 113.8828459 112.5162641
## 144 43 60 72 116 25
## 109.2067171 105.0640215 101.1128105 100.9480852 100.7200619 97.4081500
## 105 74 86 97 84 26
## 96.3508999 94.3835378 92.3413297 92.2777682 91.7200009 90.5815278
## 114 137 1 40 122 76
## 87.7350196 87.0445433 85.6180791 85.5639469 83.6669971 83.5549277
## 85 5 61 108 35 112
## 83.3953971 81.4702259 78.3948639 77.0886242 76.1384667 75.6465707
## 69 9 140 120 46 146
## 75.1445007 75.1389246 73.0560302 72.9903396 71.9430989 70.6514673
## 147 100 36 133 41 93
## 69.9180221 68.2231900 67.7081180 65.3125904 64.7673010 62.3150282
## 113 123 15 128 27 127
## 61.7187199 60.8929665 60.1135979 58.6321110 56.6428113 55.9684790
## 119 48 7 28 71 29
## 52.9160352 52.0855889 50.6811820 50.0183212 50.0143630 49.2543684
## 66 104 21 65 79 53
## 47.1819223 46.7262002 46.0925140 45.6818713 44.7037081 44.0769294
## 107 118 33 95 121 96
## 44.0133576 43.8737179 43.2556654 43.2389387 42.2897978 41.8660972
## 50 106 10 38 149 117
## 40.1099972 39.2408142 38.8505934 38.5843904 38.5087794 38.3333091
## 70 68 47 75 82 136
## 37.8153560 37.8019896 36.9191401 36.1252002 35.4529513 35.1085463
## 92 4 44 81 34 78
## 34.8107937 33.6846266 33.0335287 32.8929112 32.3787095 32.0269935
## 45 139 110 2 6 59
## 31.9436460 31.4204892 30.2364345 30.1992465 29.3019671 25.6834189
## 62 56 126 148 90 125
## 25.6756876 24.7827257 24.1984666 24.1048315 23.7534593 22.5863568
## 132 101 98 129 3 55
## 22.0357219 21.2787810 19.9735124 18.8063948 17.6399825 17.0530752
## 49 80 138 83 12 109
## 16.8578948 16.6897796 16.0132303 15.7588406 15.0728406 14.9744008
## 20 102 94 57 8 24
## 14.8713532 13.9116133 13.6192175 12.7040793 12.5223374 12.4361861
## 58 89 19 13 14 91
## 10.1525251 9.6605766 9.5306368 9.3397333 9.2129921 8.1822409
## 39 134 77 23 135 115
## 7.8254481 7.0990765 6.2858617 5.8832477 4.5601310 3.5520471
## 51 145 99 64
## 2.9566297 1.8023752 0.3545769 0.1792934
fish2_outliers <- cbind(fish2)[c(37,124,22,11,18,131),]
fish3 <- fish2[c(-38,-125,-22,-11,-18,-132),]
model3 <- lm(Weight ~ Species + Total.Length, data = fish2)
summary(model3)
##
## Call:
## lm(formula = Weight ~ Species + Total.Length, data = fish2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -233.83 -56.59 -10.13 34.58 418.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -730.977 42.449 -17.220 < 2e-16 ***
## SpeciesParkki 63.129 38.889 1.623 0.107
## SpeciesPerch -23.941 21.745 -1.101 0.273
## SpeciesPike -400.964 33.350 -12.023 < 2e-16 ***
## SpeciesRoach -19.876 30.111 -0.660 0.510
## SpeciesSmelt 256.408 39.858 6.433 1.85e-09 ***
## SpeciesWhitefish -14.971 42.063 -0.356 0.722
## Total.Length 40.775 1.181 34.527 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.86 on 140 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9321
## F-statistic: 289.1 on 7 and 140 DF, p-value: < 2.2e-16
(b) Conduct a partial F-test comparing model3 with model2. What can you conclude using an \(\alpha\) level of 0.01?
anova(model3, model2)
The p-value here is 0.14 which is very much greater than \(\alpha\) level of 0.01 – Thus, we can say that we cannot reject the null hypothesis and the additional predictors other than Species and Total.Length are not required for better prediction values.
(a) Conduct a multicollinearity test on model3. Comment on the multicollinearity in model3.
vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## Species 2.654472 6 1.084755
## Total.Length 2.654472 1 1.629255
Earlier we were given a threshold to identify the multicollinearity
r_squared_model3 <- summary(model3)$r.squared
r_squared_model3
## [1] 0.9352946
Threshold_VIF_model3 <- max(10,1/(1-r_squared_model3))
Threshold_VIF_model3
## [1] 15.45466
Now our threshold is way higher than the VIF value of Species and Total.Length (2.654472 each) – so there is no multicollinearity between Species and Total.Length
(b) Conduct residual analysis for model3 (similar to Q4). Comment on each assumption and whether they hold.
scatterplot(fish2$Total.Length,model3$residuals, xlab = "Total Length", ylab = "Residuals")
scatterplot(model3$fitted.values,model3$residuals,xlab = "Fitted Values", ylab = "Residuals")
hist(model3$residuals, xlab = "Residuals", main = "Histogram of Residuals")
qqPlot(model3$residuals, ylab = "Residuals", main="QQ Plot of Residuals")
## 37 124
## 36 123
As of Question 4, we can now plot the different plots, we definitely see a U-Shaped Pattern on the fitted values vs residuals. Thus, we can see that there is still non-linearity between the response and predictor variables.
Same way the QQ plot and Histogram shows that there is deviation from the norm. Therefore there is non-linear relationship between the variables
(a) Use model3 to find the optimal lambda, rounded to the nearest 0.5, for a Box-Cox transformation on model3. What transformation, if any, should be applied according to the lambda value? Please ensure you use model3
boxCoxTransformation <- boxCox(model3)
Maximum_value_BC <- max(boxCoxTransformation$y)
Maximum_Value_row <- which.max(boxCoxTransformation$y)
boxCoxTransformation$x[Maximum_Value_row]
## [1] 0.3434343
Rounding the Lambda value to 0.5, the lambda value will be 0.5
We can apply log-linear transformation or basic square root transformation on the response variable.
(b) Based on the results in (a), create model4 with the appropriate transformation. Display the summary.
model4 <- lm(Weight^(1/2)~Species + Total.Length,data=fish2)
summary(model4)
##
## Call:
## lm(formula = Weight^(1/2) ~ Species + Total.Length, data = fish2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0111 -0.7687 -0.0579 0.6797 4.6383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.96654 0.57278 -12.163 < 2e-16 ***
## SpeciesParkki -0.36404 0.52476 -0.694 0.4890
## SpeciesPerch -1.95734 0.29342 -6.671 5.46e-10 ***
## SpeciesPike -10.90490 0.45001 -24.233 < 2e-16 ***
## SpeciesRoach -2.09340 0.40630 -5.152 8.58e-07 ***
## SpeciesSmelt -1.04994 0.53782 -1.952 0.0529 .
## SpeciesWhitefish -0.55048 0.56758 -0.970 0.3338
## Total.Length 0.95052 0.01594 59.649 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.28 on 140 degrees of freedom
## Multiple R-squared: 0.9817, Adjusted R-squared: 0.9808
## F-statistic: 1074 on 7 and 140 DF, p-value: < 2.2e-16
(c) Perform Residual Analysis on model4. Comment on each assumption. Was the transformation successful/unsuccessful?
scatterplot(fish2$Total.Length,model4$residuals, xlab = "Total Length", ylab = "Residuals")
scatterplot(model4$fitted.values,model4$residuals,xlab = "Fitted Values", ylab = "Residuals")
hist(model4$residuals, xlab = "Residuals", main = "Histogram of Residuals")
qqPlot(model4$residuals, ylab = "Residuals", main="QQ Plot of Residuals")
## 131 141
## 130 140
As compared to the above two residual analysis, we now start seeing a significant difference in the way the data is being shown in the Fitted Value vs Residual plot. We are seeing a significant reduction in the U-Shaped pattern and are seeing more of a “straight” line showing linearity, which means that the overall transformation is helping reduce the non-linearity
Also, the histogram is not right-skewed, as well as QQ Plot is not as tail heavy as it was in the earlier examples. This shows again that the transformation has helped in reducing overall non-linearity.
(a) Using each model summary, compare and discuss the R-squared and Adjusted R-squared of model2, model3, and model4.
model2_R_Sq <- summary(model2)$r.squared
model2_adj_R_Sq <- summary(model2)$adj.r.squared
model3_R_Sq <- summary(model3)$r.squared
model3_adj_R_Sq <- summary(model3)$adj.r.squared
model4_R_Sq <- summary(model4)$r.squared
model4_adj_R_Sq <- summary(model4)$adj.r.squared
r_Squared_values <- c(model2_R_Sq,model3_R_Sq,model4_R_Sq)
Adjusted_r_Squared_values <- c(model2_adj_R_Sq,model3_adj_R_Sq,model4_adj_R_Sq)
r_Squared_values
## [1] 0.9384836 0.9352946 0.9817138
Adjusted_r_Squared_values
## [1] 0.9335080 0.9320593 0.9807995
Here we can see that from Model 2 to Model 3 both the R-Squared Values and Adjusted R-Squared Values dropped, though Model 4 has the maximum of both the values overall.
This shows that as the overall outlier points and unnecessary variables got removed, as well as an additional transformation was performed on the response variable (to reduce the overall non-linearity, we can see an increase in R-Sq and Adj R-Sq values. The difference in model2 and model3 is also expected as we have more overall predictors and datapoints.
(a) Predict Weight for the last 10 rows of data (fishtest) using both model3 and model4. Compare and discuss the mean squared prediction error (MSPE) of both models.
nrow(fishtest)
## [1] 10
model3_fishtest_predictions <- predict(model3, fishtest)
model3_fishtest_predictions
## 150 151 152 153 154 155
## 835.320989 492.283575 351.535704 7.581312 223.690686 -143.287496
## 156 157 158 159
## 185.102637 621.398920 147.658936 14.734838
model3_MPSE <- mean((fishtest$Weight - model3_fishtest_predictions)^2)
model3_MPSE
## [1] 9392.25
model4_fishtest_predictions <- predict(model4, fishtest)
model4_fishtest_predictions
## 150 151 152 153 154 155 156 157
## 28.146519 21.549151 16.432491 8.850901 13.888673 5.333966 12.830454 23.001051
## 158 159
## 11.679876 3.389796
model4_fishtest_predictions <- model4_fishtest_predictions^2
model4_fishtest_predictions
## 150 151 152 153 154 155 156 157
## 792.22652 464.36590 270.02675 78.33845 192.89524 28.45119 164.62056 529.04835
## 158 159
## 136.41949 11.49071
model4_MPSE <- mean((fishtest$Weight - model4_fishtest_predictions)^2)
model4_MPSE
## [1] 2442.998
MPSE of Model 4 is 2442.998 and that of Model 3 is 9392.25
Thus, Model 4 is definitely the better option overall.
(b) Suppose you have found a Perch fish with a Body.Height of 28 cm, and a Total.Length of 32 cm. Using model4, predict the weight on this fish with a 90% prediction interval. Provide an interpretation of the prediction interval.
fish %>% filter(Species == "Perch") %>% filter(Total.Length>=32) %>% filter(Body.Height>= 28) %>% filter(Total.Length <34) %>% filter(Body.Height <31)
new_Perch_fish <- data.frame(Species = "Perch", Body.Height = 28, Total.Length = 32 )
new_Perch_fish_weight <- predict(model4, new_Perch_fish, interval = "prediction", level = 0.9)
new_Perch_fish_weight <- new_Perch_fish_weight^2
new_Perch_fish_weight
## fit lwr upr
## 1 461.9429 374.4536 558.6091
Here we find that the fit value of the new Perch fish is 461.9429 gms for a confidence interval of 90%. This means that for 90% CI, we can be confident that the minimum value of weight will be 374.4536 gms and maximum value of weight will be 558.6091.
The value of the filtered Fish table shows us that for similar Body.Height and Total.Length the weight for a perch will be 514 gms. which is between the 90% confidence interval range.