knitr::include_graphics("weight2.jpg")
Fish are abundant in most bodies of water. They can be found in nearly all aquatic environments, from high mountain streams to the abyssal and even hadal depths of the deepest oceans, although no species has yet been documented in the deepest 25% of the ocean. With 34,300 described species, fish exhibit greater species diversity than any other group of vertebrates.
Fish are an important resource for humans worldwide, especially as food. Commercial and subsistence fishers hunt fish in wild fisheries or farm them in ponds or in cages in the ocean. They are also caught by recreational fishers, kept as pets, raised by fishkeepers, and exhibited in public aquaria.
Many species of fish are caught by humans and consumed as food in virtually all regions around the world. Fish has been an important dietary source of protein and other nutrients throughout human history.
With this dataset, we will predict the weight of a fish from its species, vertical length, diagonal length, cross length, height, and width. To achieve this, we use linear regression.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(performance)
library(forcats)
This dataset consists of 7 variables and 159 observations. The variables are:
- Species : Type of fish
- Weight : Weight of the fish, in grams
- Length1 : Vertical length of the fish, in centimeters. Renamed to Vertical
- Length2 : Diagonal length of the fish, in centimeters. Renamed to Diagonal
- Length3 : Cross length of the fish, in centimeters. Renamed to Cross
- Height : Height of the fish, in centimeters
- Width : Diagonal width of the fish, in centimeters
fish <- read.csv("Fish.csv")
Convert Species to a factor, and rename the Length1, Length2, and Length3 columns to more meaningful names.
fish_clean <- fish %>%
mutate(Species = as.factor(Species)) %>%
# select(-Species) %>%
rename(Vertical = Length1,
Diagonal = Length2,
Cross = Length3)
str(fish_clean)
## 'data.frame': 159 obs. of 7 variables:
## $ Species : Factor w/ 7 levels "Bream","Parkki",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : num 242 290 340 363 430 450 500 390 450 500 ...
## $ Vertical: num 23.2 24 23.9 26.3 26.5 26.8 26.8 27.6 27.6 28.5 ...
## $ Diagonal: num 25.4 26.3 26.5 29 29 29.7 29.7 30 30 30.7 ...
## $ Cross : num 30 31.2 31.1 33.5 34 34.7 34.5 35 35.1 36.2 ...
## $ Height : num 11.5 12.5 12.4 12.7 12.4 ...
## $ Width : num 4.02 4.31 4.7 4.46 5.13 ...
# Check missing values
colSums(is.na(fish_clean))
## Species Weight Vertical Diagonal Cross Height Width
## 0 0 0 0 0 0 0
summary(fish_clean)
## Species Weight Vertical Diagonal
## Bream :35 Min. : 0.0 Min. : 7.50 Min. : 8.40
## Parkki :11 1st Qu.: 120.0 1st Qu.:19.05 1st Qu.:21.00
## Perch :56 Median : 273.0 Median :25.20 Median :27.30
## Pike :17 Mean : 398.3 Mean :26.25 Mean :28.42
## Roach :20 3rd Qu.: 650.0 3rd Qu.:32.70 3rd Qu.:35.50
## Smelt :14 Max. :1650.0 Max. :59.00 Max. :63.40
## Whitefish: 6
## Cross Height Width
## Min. : 8.80 Min. : 1.728 Min. :1.048
## 1st Qu.:23.15 1st Qu.: 5.945 1st Qu.:3.386
## Median :29.40 Median : 7.786 Median :4.248
## Mean :31.23 Mean : 8.971 Mean :4.417
## 3rd Qu.:39.65 3rd Qu.:12.366 3rd Qu.:5.585
## Max. :68.00 Max. :18.957 Max. :8.142
##
Exclude observations with a weight of 0 grams, since a fish cannot weigh nothing and the value is most likely a recording error.
fish_clean <- fish_clean %>%
filter(Weight != 0)
Split the data into train and test datasets. 65% of the data will be used to train the model, and the remaining 35% will be used to test it.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(310)
fish_index <- sample(x = nrow(fish_clean), size = nrow(fish_clean)*0.65)
fish_train <- fish_clean[fish_index,]
fish_test <- fish_clean[-fish_index,]
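The index-based split above can be illustrated on a small made-up data frame. This is only a sketch: the data frame df and its columns are hypothetical, and floor() is used so that the train size is an explicit whole number (sample() would otherwise truncate a fractional size silently).

```r
set.seed(310)

# Hypothetical data frame standing in for the fish data
df <- data.frame(id = 1:20, value = rnorm(20))

# 65% of the rows, rounded down to a whole number
n_train <- floor(0.65 * nrow(df))

# Draw row indices without replacement, then split on them
idx   <- sample(nrow(df), size = n_train)
train <- df[idx, ]
test  <- df[-idx, ]   # every row that was not drawn
```

Because the test set is built with negative indexing, no row can appear in both parts.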
Plot the weight distribution of each species, ordered by median weight.
ggplot(fish_train, aes(x=fct_reorder(Species, Weight, .fun="median"), y = Weight)) +
#ggplot(fish_train, aes(x=Species, y = Weight)) +
geom_boxplot() +
labs(x = "Species",
title = "Weight Distribution and Median",
subtitle = "group by Species")
A glance at the size measurements of the fish.
fish_long <- pivot_longer(data = fish_train, cols = names(fish_train)[2:7])
ggplot(data = fish_long, mapping = aes(x=value)) +
# geom_histogram() +
geom_boxplot() +
theme_bw() +
facet_wrap(~name, scales = "free") +
labs(x = "Measure")
There are outliers in Cross, Diagonal, Vertical, and Weight. Linear regression is sensitive to outliers, so we will remove them from the train dataset.
fCross_out <- boxplot(fish_train$Cross, plot=FALSE)$out
fDiagonal_out <- boxplot(fish_train$Diagonal, plot=FALSE)$out
fVertical_out <- boxplot(fish_train$Vertical, plot=FALSE)$out
fWeight_out <- boxplot(fish_train$Weight, plot=FALSE)$out
fish_train <- fish_train %>%
filter(!Cross %in% fCross_out) %>%
filter(!Diagonal %in% fDiagonal_out) %>%
filter(!Vertical %in% fVertical_out) %>%
filter(!Weight %in% fWeight_out)
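The values returned in $out by boxplot() are the points that fall beyond the whiskers, which by default reach 1.5 times the interquartile range past the box. A minimal sketch of that rule on a made-up vector (x is hypothetical, not taken from the fish data) is shown here. The hand-written version uses quantile(), whose quartiles can differ slightly from the hinges that boxplot.stats() uses, so the two approaches may disagree on borderline points.

```r
# Hypothetical measurements with one obviously extreme value
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

# boxplot.stats() is the helper boxplot() relies on to flag outliers
flagged <- boxplot.stats(x)$out

# Approximate the same rule by hand with quartiles and 1.5 * IQR
q          <- quantile(x, c(0.25, 0.75))
fence_low  <- q[1] - 1.5 * IQR(x)
fence_high <- q[2] + 1.5 * IQR(x)
manual     <- x[x < fence_low | x > fence_high]
```

Note that filtering with %in% drops every row whose value matches a flagged value, which is the intended behaviour here.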
We can see that there are strong positive correlations between the predictors.
pairs(fish_train[,-1])
ggcorr(fish_train, label = T, label_size = 3, hjust = 1)
## Warning in ggcorr(fish_train, label = T, label_size = 3, hjust = 1): data in
## column(s) 'Species' are not numeric and were ignored
As mentioned above, we will be using linear regression to predict the weight of the fish.
The model with all predictors has an Adjusted R-squared of 0.96, meaning the model explains about 96% of the variance of the target variable (Weight).
fish_model_all <- lm(formula = Weight ~ . , data = fish_train)
summary(fish_model_all)
##
## Call:
## lm(formula = Weight ~ ., data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.824 -36.158 -5.791 28.525 176.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -727.275 112.261 -6.478 5.16e-09 ***
## SpeciesParkki 48.177 64.348 0.749 0.456043
## SpeciesPerch 110.085 96.640 1.139 0.257742
## SpeciesPike 162.666 115.221 1.412 0.161544
## SpeciesRoach 126.596 74.476 1.700 0.092694 .
## SpeciesSmelt 442.258 98.175 4.505 2.03e-05 ***
## SpeciesWhitefish 70.460 82.398 0.855 0.394811
## Vertical -7.183 33.742 -0.213 0.831904
## Diagonal 45.620 36.183 1.261 0.210709
## Cross -27.868 24.922 -1.118 0.266522
## Height 47.290 12.167 3.887 0.000197 ***
## Width 74.369 20.687 3.595 0.000535 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.36 on 88 degrees of freedom
## Multiple R-squared: 0.9677, Adjusted R-squared: 0.9636
## F-statistic: 239.6 on 11 and 88 DF, p-value: < 2.2e-16
Use feature selection to tune the base model.
The model from backward feature selection has an Adjusted R-squared of 0.96, the same as the base model.
fish_model_back <- step(object = fish_model_all, direction = "backward", trace = FALSE)
summary(fish_model_back)
##
## Call:
## lm(formula = Weight ~ Species + Diagonal + Height + Width, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.343 -31.704 -7.047 28.623 181.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -811.047 69.053 -11.745 < 2e-16 ***
## SpeciesParkki 98.430 39.465 2.494 0.014454 *
## SpeciesPerch 187.347 69.231 2.706 0.008143 **
## SpeciesPike 190.075 112.669 1.687 0.095062 .
## SpeciesRoach 168.155 62.670 2.683 0.008678 **
## SpeciesSmelt 503.059 73.258 6.867 8.19e-10 ***
## SpeciesWhitefish 124.295 68.296 1.820 0.072094 .
## Diagonal 11.424 3.936 2.902 0.004655 **
## Height 45.030 11.337 3.972 0.000143 ***
## Width 67.785 19.851 3.415 0.000960 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared: 0.9671, Adjusted R-squared: 0.9639
## F-statistic: 294.3 on 9 and 90 DF, p-value: < 2.2e-16
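By default, step() compares candidate models by AIC and, in the backward direction, removes a term only when doing so lowers the AIC. A small sketch on the built-in mtcars data (the choice of dataset and predictors is purely illustrative, not part of the fish analysis):

```r
# Start from a model with several predictors
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination: drop terms while the AIC keeps improving
reduced <- step(full, direction = "backward", trace = FALSE)

# Compare the two models on the criterion that step() optimises
aic_full    <- AIC(full)
aic_reduced <- AIC(reduced)

# Terms that survived the selection
kept_terms <- attr(terms(reduced), "term.labels")
```

A lower AIC does not guarantee better test-set predictions, which is why the selected model is still evaluated on held-out data.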
The model from forward feature selection has an Adjusted R-squared of 0.96, the same as the base model.
fish_model_non <- lm(formula = Weight ~ 1, data = fish_train)
fish_model_for <- step (object = fish_model_non, direction = "forward",
scope = list(lower=fish_model_non, upper = fish_model_all),
trace = FALSE)
summary(fish_model_for)
##
## Call:
## lm(formula = Weight ~ Width + Species + Height + Diagonal, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.343 -31.704 -7.047 28.623 181.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -811.047 69.053 -11.745 < 2e-16 ***
## Width 67.785 19.851 3.415 0.000960 ***
## SpeciesParkki 98.430 39.465 2.494 0.014454 *
## SpeciesPerch 187.347 69.231 2.706 0.008143 **
## SpeciesPike 190.075 112.669 1.687 0.095062 .
## SpeciesRoach 168.155 62.670 2.683 0.008678 **
## SpeciesSmelt 503.059 73.258 6.867 8.19e-10 ***
## SpeciesWhitefish 124.295 68.296 1.820 0.072094 .
## Height 45.030 11.337 3.972 0.000143 ***
## Diagonal 11.424 3.936 2.902 0.004655 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared: 0.9671, Adjusted R-squared: 0.9639
## F-statistic: 294.3 on 9 and 90 DF, p-value: < 2.2e-16
The model from feature selection in both directions (forward and backward) has an Adjusted R-squared of 0.96, the same as the base model.
fish_model_both <- step (object = fish_model_non, direction = "both",
scope = list(lower=fish_model_non, upper = fish_model_all),
trace = FALSE)
summary(fish_model_both)
##
## Call:
## lm(formula = Weight ~ Width + Species + Height + Diagonal, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.343 -31.704 -7.047 28.623 181.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -811.047 69.053 -11.745 < 2e-16 ***
## Width 67.785 19.851 3.415 0.000960 ***
## SpeciesParkki 98.430 39.465 2.494 0.014454 *
## SpeciesPerch 187.347 69.231 2.706 0.008143 **
## SpeciesPike 190.075 112.669 1.687 0.095062 .
## SpeciesRoach 168.155 62.670 2.683 0.008678 **
## SpeciesSmelt 503.059 73.258 6.867 8.19e-10 ***
## SpeciesWhitefish 124.295 68.296 1.820 0.072094 .
## Height 45.030 11.337 3.972 0.000143 ***
## Diagonal 11.424 3.936 2.902 0.004655 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared: 0.9671, Adjusted R-squared: 0.9639
## F-statistic: 294.3 on 9 and 90 DF, p-value: < 2.2e-16
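All four models (base, backward, forward, both) have the same Adjusted R-squared, 0.96, and the overall F-test of each model has a p-value < 0.05. The backward, forward, and both procedures also arrive at the same set of predictors (Species, Diagonal, Height, and Width). We will continue to predict with the backward model.
Predict the test dataset using the backward model.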
fish_pred <- predict(object = fish_model_back, newdata = fish_test)
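After outlier removal, Weight in the train dataset ranges from 6.7 grams to 1100 grams. With a Mean Absolute Error (MAE) of 80.24 grams on the test dataset, we can say the model is doing a reasonably good job.
Mean Absolute Percentage Error (MAPE) is the absolute error expressed relative to the true value. MAPE() from MLmetrics returns a proportion, so the value of 1.58 corresponds to an error of about 158%, not 1.6%. A likely reason is that relative errors become very large for the lightest fish, where an error of a few dozen grams can be several times the true weight.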
Root Mean Squared Error (RMSE) is a measure of how spread out the residuals are.
summary(fish_train)
## Species Weight Vertical Diagonal
## Bream :22 Min. : 6.7 Min. : 9.30 Min. : 9.80
## Parkki : 6 1st Qu.: 120.0 1st Qu.:18.90 1st Qu.:20.52
## Perch :32 Median : 281.5 Median :25.20 Median :27.25
## Pike : 9 Mean : 376.6 Mean :25.58 Mean :27.67
## Roach :15 3rd Qu.: 650.0 3rd Qu.:32.70 3rd Qu.:35.25
## Smelt :13 Max. :1100.0 Max. :48.30 Max. :51.70
## Whitefish: 3
## Cross Height Width
## Min. :10.80 Min. : 1.728 Min. :1.048
## 1st Qu.:22.48 1st Qu.: 5.819 1st Qu.:3.294
## Median :30.30 Median : 7.701 Median :4.329
## Mean :30.47 Mean : 8.829 Mean :4.377
## 3rd Qu.:39.35 3rd Qu.:12.379 3rd Qu.:5.617
## Max. :55.10 Max. :18.957 Max. :8.142
##
MAE(y_pred = fish_pred, y_true = fish_test$Weight)
## [1] 80.24292
MAPE(y_pred = fish_pred, y_true = fish_test$Weight)
## [1] 1.584826
RMSE(y_pred = fish_pred, y_true = fish_test$Weight)
## [1] 131.8713
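The three metrics above follow the usual textbook definitions, which can be written out in a few lines of base R. The vectors y_true and y_pred are made-up numbers for illustration only. The MAPE line mirrors the proportion form that MLmetrics reports, so it has to be multiplied by 100 to read as a percentage.

```r
# Hypothetical true and predicted weights, in grams
y_true <- c(100, 200, 400)
y_pred <- c(110, 180, 440)

err <- y_true - y_pred

mae  <- mean(abs(err))           # mean absolute error, in grams
mape <- mean(abs(err / y_true))  # mean absolute percentage error, as a proportion
rmse <- sqrt(mean(err^2))        # root mean squared error, in grams

mape_percent <- 100 * mape       # the same quantity expressed in percent
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, because squaring gives large errors more weight.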
In linear regression, there are assumptions that need to be met.
Normality: the errors are normally distributed, meaning they are centered around 0 and follow a bell curve.
The normality assumption can be inspected with a histogram. In our model, the errors mostly gather around 0 and show a roughly bell-shaped curve, which is a good sign.
hist(fish_model_back$residuals)
plot(density(fish_model_back$residuals))
A formal test of the normality assumption is the Shapiro-Wilk test, whose null hypothesis is that the residuals are normally distributed. A model that meets the assumption should show a p-value > 0.05. Our model shows a p-value of 0.004, so the normality assumption is not met.
shapiro.test(fish_model_back$residuals)
##
## Shapiro-Wilk normality test
##
## data: fish_model_back$residuals
## W = 0.96024, p-value = 0.004182
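To see how the Shapiro-Wilk test reacts to different shapes, it can be run on simulated values. This is a sketch on made-up data: normal_res and skewed_res are hypothetical stand-ins for residuals, not output from the fish model.

```r
set.seed(42)

normal_res <- rnorm(100)  # drawn from a normal distribution
skewed_res <- rexp(100)   # strongly right-skewed, clearly not normal

# A small p-value is evidence against normality
p_normal <- shapiro.test(normal_res)$p.value
p_skewed <- shapiro.test(skewed_res)$p.value
```

For the skewed sample the p-value falls far below 0.05, while for a truly normal sample it is usually, though not always, above it. With larger samples the test also becomes sensitive to small departures from normality, so a Q-Q plot (qqnorm() followed by qqline()) is a useful companion check.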
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
Homoscedasticity can be checked with a scatter plot of the residuals against the fitted values. A good plot should show the points scattered randomly around 0. Our model does not show any clear pattern.
plot(x=fish_model_back$fitted.values, y = fish_model_back$residuals)
abline(h = 0, col = "red", lty = 2)
A formal test of homoscedasticity is the Breusch-Pagan test, whose null hypothesis is that the error variance is constant. A model that meets the assumption should show a p-value > 0.05. Our model shows a p-value of 0.317, so the homoscedasticity assumption is met.
bptest(fish_model_back)
##
## studentized Breusch-Pagan test
##
## data: fish_model_back
## BP = 10.427, df = 9, p-value = 0.317
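The idea behind the studentized Breusch-Pagan statistic can be sketched in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the sample size. The code below runs on simulated data in which the noise grows with x, and it assumes the Koenker (studentized) form of the test; it is an illustration of the idea rather than a replacement for bptest().

```r
set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)  # noise grows with x: heteroscedastic by design

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized statistic: n * R^2, compared with a chi-squared distribution
# whose degrees of freedom equal the number of predictors (1 here)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
```

Here the p-value is small because the variance was deliberately made to depend on x. In the fish model the degrees of freedom are 9, one per estimated coefficient other than the intercept.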
A good model should show no multicollinearity between the predictors. A common rule of thumb is that a VIF above 10 indicates high multicollinearity, while a stricter threshold treats a VIF above 5 as a warning sign. By this measure, our model shows multicollinearity.
vif(fish_model_back)
## GVIF Df GVIF^(1/(2*Df))
## Species 371.21234 6 1.637326
## Diagonal 42.56248 1 6.523992
## Height 68.97981 1 8.305408
## Width 35.26324 1 5.938286
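For a numeric predictor, the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data; x1, x2, x3 and the helper vif_manual() are hypothetical names introduced for illustration. It does not cover the generalized VIF that vif() reports when a factor such as Species is in the model.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two

# VIF of one predictor given a data frame holding the remaining predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))  # large: x1 is explained by x2
vif_x3 <- vif_manual(x3, data.frame(x1, x2))  # close to 1: x3 is independent
```

The fish measurements behave like x1 and x2 here: the lengths, height, and width all grow together, which is why their VIF values are high.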
Having passed only one assumption (homoscedasticity), our model does not meet the requirements of linear regression well.
Model tuning with PCA. Select only the numeric predictors.
fish_num <- fish_clean %>%
select_if(is.numeric) %>%
select(-Weight)
summary(fish_num)
## Vertical Diagonal Cross Height
## Min. : 7.50 Min. : 8.40 Min. : 8.80 Min. : 1.728
## 1st Qu.:19.15 1st Qu.:21.00 1st Qu.:23.20 1st Qu.: 5.941
## Median :25.30 Median :27.40 Median :29.70 Median : 7.789
## Mean :26.29 Mean :28.47 Mean :31.28 Mean : 8.987
## 3rd Qu.:32.70 3rd Qu.:35.75 3rd Qu.:39.67 3rd Qu.:12.372
## Max. :59.00 Max. :63.40 Max. :68.00 Max. :18.957
## Width
## Min. :1.048
## 1st Qu.:3.399
## Median :4.277
## Mean :4.424
## 3rd Qu.:5.587
## Max. :8.142
PC1 alone captures about 97% of the variance of the numeric predictors.
fish_pca <- prcomp(fish_num)
summary(fish_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 18.9721 3.33642 0.82270 0.34948 0.17337
## Proportion of Variance 0.9678 0.02993 0.00182 0.00033 0.00008
## Cumulative Proportion 0.9678 0.99777 0.99959 0.99992 1.00000
fish_keep <- as.data.frame(fish_pca$x[,1])
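The "Proportion of Variance" row printed by summary() is each component's squared standard deviation divided by the total. The sketch below reproduces that calculation on a few columns of the built-in mtcars data, chosen only for illustration. It also shows the scale. argument of prcomp(): without scaling, as in the call above, columns measured on larger numeric ranges contribute more to the first component.

```r
# Built-in dataset used purely for illustration
num <- mtcars[, c("mpg", "disp", "hp", "wt")]

pca_raw    <- prcomp(num)                 # centered, but not scaled
pca_scaled <- prcomp(num, scale. = TRUE)  # each column scaled to unit variance

# Proportion of variance per component = squared sdev / total
prop_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)

# Scores on the first component, one value per row of the input
pc1_scores <- pca_raw$x[, 1]
```

Whether to scale is a judgment call. In the fish data all numeric predictors are in centimeters, so leaving them unscaled is defensible, although the three length columns then weigh more heavily than Height and Width.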
Merge PC1 with the non-numeric predictor (Species).
fish2 <- fish_clean %>%
select_if(is.factor) %>%
cbind(fish_keep)
fish2$Weight <- fish_clean$Weight
names(fish2)[2] <- "PC1"
Split the data into train and test datasets again, using the same seed and proportions.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(310)
fish_index2 <- sample(x = nrow(fish2), size = nrow(fish2)*0.65)
fish_train2 <- fish2[fish_index2,]
fish_test2 <- fish2[-fish_index2,]
EDA and outlier removal.
fish_long2 <- pivot_longer(data = fish_train2, cols = names(fish_train2)[2:3])
ggplot(data = fish_long2, mapping = aes(x=value)) +
# geom_histogram() +
geom_boxplot() +
theme_bw() +
facet_wrap(~name, scales = "free")
fPC1_out <- boxplot(fish_train2$PC1, plot=FALSE)$out
fWeight_out2 <- boxplot(fish_train2$Weight, plot=FALSE)$out
fish_train2 <- fish_train2 %>%
filter(!PC1 %in% fPC1_out) %>%
filter(!Weight %in% fWeight_out2)
fish_model_all2 <- lm(formula = Weight ~ . , data = fish_train2)
summary(fish_model_all2)
##
## Call:
## lm(formula = Weight ~ ., data = fish_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.83 -44.91 -4.78 41.67 217.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 399.0677 17.2036 23.197 < 2e-16 ***
## SpeciesParkki 49.1646 37.7215 1.303 0.196
## SpeciesPerch 12.2652 21.1597 0.580 0.564
## SpeciesPike -407.7797 30.7606 -13.257 < 2e-16 ***
## SpeciesRoach 2.1116 28.8630 0.073 0.942
## SpeciesSmelt 280.7950 38.7665 7.243 1.31e-10 ***
## SpeciesWhitefish 17.4661 45.5129 0.384 0.702
## PC1 22.2366 0.7298 30.471 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.83 on 92 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9456
## F-statistic: 246.9 on 7 and 92 DF, p-value: < 2.2e-16
fish_model_back2 <- step(object = fish_model_all2, direction = "backward", trace = FALSE)
summary(fish_model_back2)
##
## Call:
## lm(formula = Weight ~ Species + PC1, data = fish_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.83 -44.91 -4.78 41.67 217.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 399.0677 17.2036 23.197 < 2e-16 ***
## SpeciesParkki 49.1646 37.7215 1.303 0.196
## SpeciesPerch 12.2652 21.1597 0.580 0.564
## SpeciesPike -407.7797 30.7606 -13.257 < 2e-16 ***
## SpeciesRoach 2.1116 28.8630 0.073 0.942
## SpeciesSmelt 280.7950 38.7665 7.243 1.31e-10 ***
## SpeciesWhitefish 17.4661 45.5129 0.384 0.702
## PC1 22.2366 0.7298 30.471 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.83 on 92 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9456
## F-statistic: 246.9 on 7 and 92 DF, p-value: < 2.2e-16
fish_pred2 <- predict(object = fish_model_back2, newdata = fish_test2)
MAE is 79.02, MAPE is 1.81 (about 181%), and RMSE is 113.86. Compared with the base model (MAE 80.24, MAPE 1.58, RMSE 131.87), MAE and RMSE are lower, while MAPE is higher.
MAE(y_pred = fish_pred2, y_true = fish_test2$Weight)
## [1] 79.02441
MAPE(y_pred = fish_pred2, y_true = fish_test2$Weight)
## [1] 1.8096
RMSE(y_pred = fish_pred2, y_true = fish_test2$Weight)
## [1] 113.8594
Normality with a histogram: the errors are distributed roughly normally around 0.
hist(fish_model_back2$residuals)
plot(density(fish_model_back2$residuals))
Normality with the Shapiro-Wilk test: p-value = 0.079 (> 0.05). Our tuned model passes the Shapiro-Wilk test.
shapiro.test(fish_model_back2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fish_model_back2$residuals
## W = 0.97709, p-value = 0.07871
Homoscedasticity with a scatter plot: the residuals show no clear pattern.
plot(x=fish_model_back2$fitted.values, y = fish_model_back2$residuals)
abline(h = 0, col = "red", lty = 2)
Homoscedasticity with the Breusch-Pagan test: p-value = 0.46 (> 0.05). Our tuned model passes the Breusch-Pagan test.
bptest(fish_model_back2)
##
## studentized Breusch-Pagan test
##
## data: fish_model_back2
## BP = 6.7248, df = 7, p-value = 0.4581
No multicollinearity is indicated by the VIF values.
vif(fish_model_back2)
## GVIF Df GVIF^(1/(2*Df))
## Species 3.076256 6 1.098167
## PC1 3.076256 1 1.753926
compare_performance(fish_model_all,fish_model_back,fish_model_for,fish_model_both, fish_model_back2)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ----------------------------------------------------------------------------------------------------------------
## fish_model_all | lm | 1117.069 | 0.095 | 1150.936 | 0.008 | 0.968 | 0.964 | 56.622 | 60.360
## fish_model_back | lm | 1114.749 | 0.302 | 1143.406 | 0.331 | 0.967 | 0.964 | 57.100 | 60.189
## fish_model_for | lm | 1114.749 | 0.302 | 1143.406 | 0.331 | 0.967 | 0.964 | 57.100 | 60.189
## fish_model_both | lm | 1114.749 | 0.302 | 1143.406 | 0.331 | 0.967 | 0.964 | 57.100 | 60.189
## fish_model_back2 | lm | 1153.798 | < 0.001 | 1177.245 | < 0.001 | 0.949 | 0.946 | 70.814 | 73.828
Stepwise selection kept Species, Diagonal, Height, and Width as predictors of our target variable, Weight, and dropped Vertical and Cross. The base model meets only one (homoscedasticity) of the three assumptions checked. The tuned PCA model has a slightly lower Adjusted R-squared, 0.9456 (compared to 0.9639 for the base model), but meets all three assumptions (normality, homoscedasticity, and no multicollinearity). On the test dataset, its MAE (79.02) and RMSE (113.86) are also lower than those of the base model (80.24 and 131.87), although its MAPE is higher. Overall, the tuned PCA model is the better choice for predicting fish weight.