Fish Weight Prediction

knitr::include_graphics("weight2.jpg")

Fish are abundant in most bodies of water. They can be found in nearly all aquatic environments, from high mountain streams to the abyssal and even hadal depths of the deepest oceans, although no species has yet been documented in the deepest 25% of the ocean. With 34,300 described species, fish exhibit greater species diversity than any other group of vertebrates.

Fish are an important resource for humans worldwide, especially as food. Commercial and subsistence fishers hunt fish in wild fisheries or farm them in ponds or in cages in the ocean. They are also caught by recreational fishers, kept as pets, raised by fishkeepers, and exhibited in public aquaria.

Many species of fish are caught by humans and consumed as food in virtually all regions around the world. Fish has been an important dietary source of protein and other nutrients throughout human history.

With this dataset, we are going to predict the weight of a fish from its species, vertical length, diagonal length, cross length, height, and width. To achieve this, we use linear regression.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(MLmetrics)

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(performance)
library(forcats)

Dataset

This dataset consists of 7 variables and 159 observations. The variables are:

Species : Type of fish
Weight : Weight of the fish, in grams
Length1 : Vertical length of the fish, in centimeters. Renamed to Vertical
Length2 : Diagonal length of the fish, in centimeters. Renamed to Diagonal
Length3 : Cross length of the fish, in centimeters. Renamed to Cross
Height : Height of the fish, in centimeters
Width : Diagonal width of the fish, in centimeters

fish <- read.csv("Fish.csv")

Data Wrangling

Convert Species to a factor, and rename the Length1, Length2, and Length3 columns to more meaningful names.

fish_clean <- fish %>% 
  mutate(Species = as.factor(Species)) %>% 
#  select(-Species) %>% 
  rename(Vertical = Length1,
         Diagonal = Length2,
         Cross = Length3)


str(fish_clean)

## 'data.frame':    159 obs. of  7 variables:
##  $ Species : Factor w/ 7 levels "Bream","Parkki",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight  : num  242 290 340 363 430 450 500 390 450 500 ...
##  $ Vertical: num  23.2 24 23.9 26.3 26.5 26.8 26.8 27.6 27.6 28.5 ...
##  $ Diagonal: num  25.4 26.3 26.5 29 29 29.7 29.7 30 30 30.7 ...
##  $ Cross   : num  30 31.2 31.1 33.5 34 34.7 34.5 35 35.1 36.2 ...
##  $ Height  : num  11.5 12.5 12.4 12.7 12.4 ...
##  $ Width   : num  4.02 4.31 4.7 4.46 5.13 ...

# Check missing values
colSums(is.na(fish_clean))

##  Species   Weight Vertical Diagonal    Cross   Height    Width 
##        0        0        0        0        0        0        0

summary(fish_clean)

##       Species       Weight          Vertical        Diagonal    
##  Bream    :35   Min.   :   0.0   Min.   : 7.50   Min.   : 8.40  
##  Parkki   :11   1st Qu.: 120.0   1st Qu.:19.05   1st Qu.:21.00  
##  Perch    :56   Median : 273.0   Median :25.20   Median :27.30  
##  Pike     :17   Mean   : 398.3   Mean   :26.25   Mean   :28.42  
##  Roach    :20   3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.50  
##  Smelt    :14   Max.   :1650.0   Max.   :59.00   Max.   :63.40  
##  Whitefish: 6                                                   
##      Cross           Height           Width      
##  Min.   : 8.80   Min.   : 1.728   Min.   :1.048  
##  1st Qu.:23.15   1st Qu.: 5.945   1st Qu.:3.386  
##  Median :29.40   Median : 7.786   Median :4.248  
##  Mean   :31.23   Mean   : 8.971   Mean   :4.417  
##  3rd Qu.:39.65   3rd Qu.:12.366   3rd Qu.:5.585  
##  Max.   :68.00   Max.   :18.957   Max.   :8.142  
##

Exclude observations with a recorded weight of 0 grams, since that is not a plausible measurement.

fish_clean <- fish_clean %>% 
  filter(Weight != 0)

Cross Validation

Split the data into train and test datasets. 65% of the data will be used to train the model, and the remaining 35% will be used to test it. This is a single hold-out split rather than k-fold cross-validation.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(310)

fish_index <- sample(x = nrow(fish_clean), size = nrow(fish_clean)*0.65)

fish_train <- fish_clean[fish_index,]
fish_test <- fish_clean[-fish_index,]
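As a rough illustration of the split above, the sketch below applies the same index-sampling pattern to a small made-up data frame. The `toy` data frame and its columns are assumptions for illustration only, and the plain `set.seed()` call here relies on R's default sampler rather than the "Rounding" sampler set above.

```r
# Hypothetical toy data, standing in for fish_clean
toy <- data.frame(id = 1:20, weight = seq(10, 200, by = 10))

set.seed(310)

# floor() makes the truncation of a fractional sample size explicit
n_train <- floor(nrow(toy) * 0.65)
idx <- sample(x = nrow(toy), size = n_train)

toy_train <- toy[idx, ]   # rows drawn for training
toy_test  <- toy[-idx, ]  # all remaining rows

nrow(toy_train)  # 13
nrow(toy_test)   # 7
```

Negative indexing with `-idx` guarantees that no row appears in both subsets.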

Exploratory Data Analysis

Plot the weight distribution of each species, ordered by median weight.

ggplot(fish_train, aes(x=fct_reorder(Species, Weight, .fun="median"), y = Weight)) +
#ggplot(fish_train, aes(x=Species, y = Weight)) +
  geom_boxplot() +
  labs(x = "Species",
       title = "Weight Distribution and Median",
       subtitle = "group by Species")

A glance at the distribution of each numeric measure.

fish_long <- pivot_longer(data = fish_train, cols = names(fish_train)[2:7])

ggplot(data = fish_long, mapping = aes(x=value)) +
#  geom_histogram() +
  geom_boxplot() +
  theme_bw() +
  facet_wrap(~name, scales = "free") +
  labs(x = "Measure")

There are outliers in Cross, Diagonal, Vertical, and Weight. Linear regression is sensitive to outliers, so we remove them from the train dataset.

fCross_out <- boxplot(fish_train$Cross, plot=FALSE)$out
fDiagonal_out <- boxplot(fish_train$Diagonal, plot=FALSE)$out
fVertical_out <- boxplot(fish_train$Vertical, plot=FALSE)$out
fWeight_out <- boxplot(fish_train$Weight, plot=FALSE)$out


fish_train <- fish_train %>%
  filter(!Cross %in% fCross_out) %>%
  filter(!Diagonal %in% fDiagonal_out) %>%
  filter(!Vertical %in% fVertical_out) %>%
  filter(!Weight %in% fWeight_out)
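The outlier rule behind `boxplot(...)$out` can be sketched on a small made-up vector. The values in `x` are assumptions for illustration; `boxplot.stats()` is the base R helper that `boxplot()` uses to compute the same statistics, flagging points that lie more than 1.5 times the hinge spread beyond the hinges.

```r
# Hypothetical measurements with one extreme value
x <- c(1, 2, 3, 4, 5, 100)

stats  <- boxplot.stats(x, coef = 1.5)
hinges <- stats$stats[c(2, 4)]   # lower and upper hinge: 2 and 5
fence  <- 1.5 * diff(hinges)     # 1.5 times the hinge spread: 4.5

outliers <- stats$out            # points outside [2 - 4.5, 5 + 4.5]
x_kept   <- x[!x %in% outliers]  # same %in% filter as used above

outliers  # 100
x_kept    # 1 2 3 4 5
```

Note that filtering with `%in%` drops every row whose value equals a flagged value, not only the single row that was flagged.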

Correlations between the predictors

The plots below show strong positive correlations among the predictors.

pairs(fish_train[,-1])

ggcorr(fish_train, label = T, label_size = 3, hjust = 1)

## Warning in ggcorr(fish_train, label = T, label_size = 3, hjust = 1): data in
## column(s) 'Species' are not numeric and were ignored

Model

As mentioned above, we will use linear regression to predict the weight of the fish.

Model With All Predictors

The model with all predictors has an adjusted R-squared of 0.96, meaning it explains about 96% of the variance of the target variable (Weight) in the train dataset.

fish_model_all <- lm(formula = Weight ~ . , data = fish_train)
summary(fish_model_all)

## 
## Call:
## lm(formula = Weight ~ ., data = fish_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -118.824  -36.158   -5.791   28.525  176.776 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -727.275    112.261  -6.478 5.16e-09 ***
## SpeciesParkki      48.177     64.348   0.749 0.456043    
## SpeciesPerch      110.085     96.640   1.139 0.257742    
## SpeciesPike       162.666    115.221   1.412 0.161544    
## SpeciesRoach      126.596     74.476   1.700 0.092694 .  
## SpeciesSmelt      442.258     98.175   4.505 2.03e-05 ***
## SpeciesWhitefish   70.460     82.398   0.855 0.394811    
## Vertical           -7.183     33.742  -0.213 0.831904    
## Diagonal           45.620     36.183   1.261 0.210709    
## Cross             -27.868     24.922  -1.118 0.266522    
## Height             47.290     12.167   3.887 0.000197 ***
## Width              74.369     20.687   3.595 0.000535 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.36 on 88 degrees of freedom
## Multiple R-squared:  0.9677, Adjusted R-squared:  0.9636 
## F-statistic: 239.6 on 11 and 88 DF,  p-value: < 2.2e-16
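To make the adjusted R-squared in the summary above more concrete, here is a minimal sketch that recomputes it by hand. It uses the built-in `mtcars` data purely as a stand-in for `fish_train`; the formula and variable choice are assumptions for illustration.

```r
# Built-in mtcars data used as a stand-in for fish_train
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s   <- summary(fit)

n  <- nrow(mtcars)            # number of observations
p  <- length(coef(fit)) - 1   # number of coefficients besides the intercept
r2 <- s$r.squared

# Adjusted R-squared penalises R-squared for the number of predictors
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)  # TRUE
```

Plugging in the figures reported above (R-squared 0.9677, 100 observations, 11 coefficients besides the intercept) gives 1 - 0.0323 * 99 / 88, or roughly 0.9637, which agrees with the reported 0.9636 up to rounding of the inputs.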

Feature Selection

Use stepwise feature selection to tune the base model. The step() function adds or drops terms based on AIC.

Step Backward

The model from backward selection has an adjusted R-squared of 0.96, the same as the base model. It drops Vertical and Cross.

fish_model_back <- step(object = fish_model_all, direction = "backward", trace = FALSE)
summary(fish_model_back)

## 
## Call:
## lm(formula = Weight ~ Species + Diagonal + Height + Width, data = fish_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.343  -31.704   -7.047   28.623  181.330 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -811.047     69.053 -11.745  < 2e-16 ***
## SpeciesParkki      98.430     39.465   2.494 0.014454 *  
## SpeciesPerch      187.347     69.231   2.706 0.008143 ** 
## SpeciesPike       190.075    112.669   1.687 0.095062 .  
## SpeciesRoach      168.155     62.670   2.683 0.008678 ** 
## SpeciesSmelt      503.059     73.258   6.867 8.19e-10 ***
## SpeciesWhitefish  124.295     68.296   1.820 0.072094 .  
## Diagonal           11.424      3.936   2.902 0.004655 ** 
## Height             45.030     11.337   3.972 0.000143 ***
## Width              67.785     19.851   3.415 0.000960 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared:  0.9671, Adjusted R-squared:  0.9639 
## F-statistic: 294.3 on 9 and 90 DF,  p-value: < 2.2e-16

Step Forward

The model from forward selection has an adjusted R-squared of 0.96, the same as the base model, and it keeps the same predictors as backward selection.

fish_model_non <- lm(formula = Weight ~ 1, data = fish_train)

fish_model_for <- step (object = fish_model_non, direction = "forward",
                        scope = list(lower=fish_model_non, upper = fish_model_all),
                        trace = FALSE)
summary(fish_model_for)

## 
## Call:
## lm(formula = Weight ~ Width + Species + Height + Diagonal, data = fish_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.343  -31.704   -7.047   28.623  181.330 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -811.047     69.053 -11.745  < 2e-16 ***
## Width              67.785     19.851   3.415 0.000960 ***
## SpeciesParkki      98.430     39.465   2.494 0.014454 *  
## SpeciesPerch      187.347     69.231   2.706 0.008143 ** 
## SpeciesPike       190.075    112.669   1.687 0.095062 .  
## SpeciesRoach      168.155     62.670   2.683 0.008678 ** 
## SpeciesSmelt      503.059     73.258   6.867 8.19e-10 ***
## SpeciesWhitefish  124.295     68.296   1.820 0.072094 .  
## Height             45.030     11.337   3.972 0.000143 ***
## Diagonal           11.424      3.936   2.902 0.004655 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared:  0.9671, Adjusted R-squared:  0.9639 
## F-statistic: 294.3 on 9 and 90 DF,  p-value: < 2.2e-16

Step Both

The model from selection in both directions (forward and backward) has an adjusted R-squared of 0.96, the same as the base model, and it again keeps the same predictors.

fish_model_both <- step (object = fish_model_non, direction = "both",
                        scope = list(lower=fish_model_non, upper = fish_model_all),
                        trace = FALSE)
summary(fish_model_both)

## 
## Call:
## lm(formula = Weight ~ Width + Species + Height + Diagonal, data = fish_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.343  -31.704   -7.047   28.623  181.330 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -811.047     69.053 -11.745  < 2e-16 ***
## Width              67.785     19.851   3.415 0.000960 ***
## SpeciesParkki      98.430     39.465   2.494 0.014454 *  
## SpeciesPerch      187.347     69.231   2.706 0.008143 ** 
## SpeciesPike       190.075    112.669   1.687 0.095062 .  
## SpeciesRoach      168.155     62.670   2.683 0.008678 ** 
## SpeciesSmelt      503.059     73.258   6.867 8.19e-10 ***
## SpeciesWhitefish  124.295     68.296   1.820 0.072094 .  
## Height             45.030     11.337   3.972 0.000143 ***
## Diagonal           11.424      3.936   2.902 0.004655 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.19 on 90 degrees of freedom
## Multiple R-squared:  0.9671, Adjusted R-squared:  0.9639 
## F-statistic: 294.3 on 9 and 90 DF,  p-value: < 2.2e-16

All four models (base, backward, forward, both) have the same adjusted R-squared, 0.96, and an overall F-test p-value < 0.05. The three stepwise procedures arrive at the same model, so we will continue to predict with the backward model.
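The stepwise procedures compared above can be sketched on the built-in `mtcars` data, used here only as a stand-in for `fish_train`. The formula is an assumption for illustration, and the scope is passed as formulas rather than fitted models.

```r
# Full and intercept-only models on stand-in data
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward: start from the full model and drop terms while AIC improves
back <- step(full, direction = "backward", trace = FALSE)

# Forward: start from the intercept and add terms while AIC improves
fwd <- step(null, direction = "forward",
            scope = list(lower = ~ 1, upper = formula(full)),
            trace = FALSE)

formula(back)
formula(fwd)

# Neither search can end on a model with a higher AIC than its start
c(full = AIC(full), back = AIC(back))
c(null = AIC(null), fwd = AIC(fwd))
```

Backward and forward searches are greedy, so they are not guaranteed to reach the same model in general; in this report they happen to agree.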

Predict

Predict the test dataset using the backward model.

fish_pred <- predict(object = fish_model_back, newdata = fish_test)

Model Evaluation

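The Weight values in the train dataset range from 6.7 grams to 1,100 grams. Against that range, a Mean Absolute Error (MAE) of 80.24 grams on the test dataset suggests the model does a reasonable job on average.

Mean Absolute Percentage Error (MAPE) is the mean absolute error relative to the true value. MLmetrics reports it as a proportion, so the value of 1.58 corresponds to an average relative error of roughly 158%, not 1.6%. Relative errors are largest for very light fish, where a miss of a few tens of grams can exceed the true weight, so this figure is likely dominated by the smallest fish.

Root Mean Squared Error (RMSE) is a measure of how spread out the residuals are, and it penalizes large errors more heavily than MAE. Here it is 131.87 grams.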

summary(fish_train)

##       Species       Weight          Vertical        Diagonal    
##  Bream    :22   Min.   :   6.7   Min.   : 9.30   Min.   : 9.80  
##  Parkki   : 6   1st Qu.: 120.0   1st Qu.:18.90   1st Qu.:20.52  
##  Perch    :32   Median : 281.5   Median :25.20   Median :27.25  
##  Pike     : 9   Mean   : 376.6   Mean   :25.58   Mean   :27.67  
##  Roach    :15   3rd Qu.: 650.0   3rd Qu.:32.70   3rd Qu.:35.25  
##  Smelt    :13   Max.   :1100.0   Max.   :48.30   Max.   :51.70  
##  Whitefish: 3                                                   
##      Cross           Height           Width      
##  Min.   :10.80   Min.   : 1.728   Min.   :1.048  
##  1st Qu.:22.48   1st Qu.: 5.819   1st Qu.:3.294  
##  Median :30.30   Median : 7.701   Median :4.329  
##  Mean   :30.47   Mean   : 8.829   Mean   :4.377  
##  3rd Qu.:39.35   3rd Qu.:12.379   3rd Qu.:5.617  
##  Max.   :55.10   Max.   :18.957   Max.   :8.142  
##

MAE(y_pred = fish_pred, y_true = fish_test$Weight)

## [1] 80.24292

MAPE(y_pred = fish_pred, y_true = fish_test$Weight)

## [1] 1.584826

RMSE(y_pred = fish_pred, y_true = fish_test$Weight)

## [1] 131.8713
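The three metrics above can be written out in a few lines of base R. The helper names below are hypothetical and the data are made up; the definitions are a sketch of what the MLmetrics functions are assumed to compute, in particular that MAPE is returned as a proportion rather than a percentage.

```r
# Hypothetical helpers; argument order mirrors the MLmetrics calls above
mae  <- function(y_pred, y_true) mean(abs(y_true - y_pred))
mape <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))
rmse <- function(y_pred, y_true) sqrt(mean((y_true - y_pred)^2))

# Made-up values: every prediction misses by exactly 10 grams
y_true <- c(100, 200, 10)
y_pred <- c(110, 190, 20)

mae(y_pred, y_true)   # 10
rmse(y_pred, y_true)  # 10
mape(y_pred, y_true)  # about 0.383, i.e. 38.3%
```

The 10-gram fish contributes a relative error of 100% on its own, which is why a few light fish can push MAPE well above what MAE and RMSE suggest.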

Assumptions

In Linear Regression, there are assumptions that need to be met.

Normality

The errors should be normally distributed: centered around 0 and forming a bell curve.

Histogram

The normality assumption can be inspected with a histogram. In our model, the errors are mostly concentrated around 0 and roughly follow a bell curve, which is consistent with the assumption at a glance.

hist(fish_model_back$residuals)

plot(density(fish_model_back$residuals))

Shapiro test

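The Shapiro-Wilk test is a formal test of the normality assumption. Its null hypothesis is that the residuals are normally distributed, so a p-value > 0.05 means the assumption is not rejected. Our model gives a p-value of 0.004, so the residuals deviate significantly from normality and the assumption is not met.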

shapiro.test(fish_model_back$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fish_model_back$residuals
## W = 0.96024, p-value = 0.004182
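As a small sketch of how the test result can be read programmatically, the example below runs `shapiro.test()` on a simulated, strongly right-skewed sample. The data are made up for illustration and have nothing to do with the fish residuals.

```r
set.seed(1)

# Simulated right-skewed sample (exponential), clearly non-normal
skewed <- rexp(200)

res <- shapiro.test(skewed)

res$statistic        # W statistic, close to 1 for normal-looking data
res$p.value          # p-value of the test
res$p.value < 0.05   # TRUE here: normality is rejected
```

Extracting `res$p.value` this way avoids copying numbers from the printed output by hand.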

Homoscedasticity

Homoscedasticity describes a situation in which the variance of the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.

Scatter Plot

Homoscedasticity can be checked with a scatter plot of residuals against fitted values. A homoscedastic model should show a random pattern with no funnel shape. Our model does not show any clear pattern.

plot(x=fish_model_back$fitted.values, y = fish_model_back$residuals)
abline(h = 0, col = "red", lty = 2)

Breusch-Pagan Test

The Breusch-Pagan test is a formal test of homoscedasticity. Its null hypothesis is constant error variance, so a p-value > 0.05 means the assumption is not rejected. Our model gives a p-value of 0.317, so the assumption is met.

bptest(fish_model_back)

## 
##  studentized Breusch-Pagan test
## 
## data:  fish_model_back
## BP = 10.427, df = 9, p-value = 0.317
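For intuition about what the test measures, here is a base R sketch of the studentized Breusch-Pagan statistic on simulated data. It assumes the statistic equals n times the R-squared of an auxiliary regression of the squared residuals on the predictors, which is our understanding of the studentized variant; the data and variable names are made up.

```r
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # noise spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression: do the squared residuals depend on x?
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

bp      <- n * summary(aux)$r.squared   # test statistic
df      <- length(coef(aux)) - 1        # one per predictor
p_value <- pchisq(bp, df = df, lower.tail = FALSE)

c(BP = bp, df = df, p.value = p_value)
```

Because the simulated noise grows with `x`, a small p-value is the expected outcome here, in contrast to the fish model above.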

Multicollinearity

A good model should show no multicollinearity between the predictors. Under a lenient rule of thumb, a VIF > 10 indicates high multicollinearity; under a stricter one, a VIF > 5 already does. Diagonal, Height, and Width all have a GVIF well above 10, so our model shows multicollinearity.

vif(fish_model_back)

##               GVIF Df GVIF^(1/(2*Df))
## Species  371.21234  6        1.637326
## Diagonal  42.56248  1        6.523992
## Height    68.97981  1        8.305408
## Width     35.26324  1        5.938286
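The idea behind the variance inflation factor can be sketched in base R: regress each predictor on the others and compute 1 / (1 - R-squared). The helper `vif_manual` is hypothetical, and `mtcars` is used only as stand-in data with numeric predictors.

```r
# Hypothetical helper: VIF of one column given the other columns
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  aux    <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# Stand-in numeric predictors
preds <- mtcars[, c("wt", "hp", "disp")]

vifs <- sapply(names(preds), function(v) vif_manual(preds, v))
vifs
```

For a term with one degree of freedom, the GVIF reported by car is the ordinary VIF, and the GVIF^(1/(2*Df)) column is its square root; in the output above, 6.52 squared is approximately 42.56 for Diagonal.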

Having passed only one assumption (homoscedasticity), our model does not satisfy the linear regression assumptions well.

Model Tuning

Model tuning with PCA. Select only the numeric predictors.

fish_num <- fish_clean %>% 
  select_if(is.numeric) %>% 
  select(-Weight)

summary(fish_num)

##     Vertical        Diagonal         Cross           Height      
##  Min.   : 7.50   Min.   : 8.40   Min.   : 8.80   Min.   : 1.728  
##  1st Qu.:19.15   1st Qu.:21.00   1st Qu.:23.20   1st Qu.: 5.941  
##  Median :25.30   Median :27.40   Median :29.70   Median : 7.789  
##  Mean   :26.29   Mean   :28.47   Mean   :31.28   Mean   : 8.987  
##  3rd Qu.:32.70   3rd Qu.:35.75   3rd Qu.:39.67   3rd Qu.:12.372  
##  Max.   :59.00   Max.   :63.40   Max.   :68.00   Max.   :18.957  
##      Width      
##  Min.   :1.048  
##  1st Qu.:3.399  
##  Median :4.277  
##  Mean   :4.424  
##  3rd Qu.:5.587  
##  Max.   :8.142

PC1 alone captures about 97% of the variance in the numeric predictors.

fish_pca <- prcomp(fish_num)
summary(fish_pca)

## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5
## Standard deviation     18.9721 3.33642 0.82270 0.34948 0.17337
## Proportion of Variance  0.9678 0.02993 0.00182 0.00033 0.00008
## Cumulative Proportion   0.9678 0.99777 0.99959 0.99992 1.00000

fish_keep <- as.data.frame(fish_pca$x[,1])
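The "Proportion of Variance" row can be reproduced from the component standard deviations. The sketch below uses three `mtcars` columns as stand-in data (an assumption for illustration) and also shows the scaled variant, since unscaled PCA is driven by the variables with the largest raw variance.

```r
# Stand-in numeric predictors on very different scales
num <- mtcars[, c("disp", "hp", "wt")]

pca_raw <- prcomp(num)                 # centered, not scaled (as above)
pca_std <- prcomp(num, scale. = TRUE)  # centered and scaled to unit variance

# Proportion of variance = squared sdev over the total
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)

round(prop_raw, 4)
round(prop_std, 4)

# Scores on the first component, one value per row
pc1 <- pca_raw$x[, 1]
length(pc1) == nrow(num)  # TRUE
```

Whether to scale is a design choice: the fish measurements are all in centimeters, so leaving them unscaled, as done above, is defensible.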

Merge PC1 with the non-numeric predictor (Species).

fish2 <- fish_clean %>% 
  select_if(is.factor) %>% 
  cbind(fish_keep)

fish2$Weight <- fish_clean$Weight
names(fish2)[2] <- "PC1"

Cross validation again, with the same seed and the same 65/35 split.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(310)

fish_index2 <- sample(x = nrow(fish2), size = nrow(fish2)*0.65)

fish_train2 <- fish2[fish_index2,]
fish_test2 <- fish2[-fish_index2,]

EDA and remove outliers.

fish_long2 <- pivot_longer(data = fish_train2, cols = names(fish_train2)[2:3])

ggplot(data = fish_long2, mapping = aes(x=value)) +
#  geom_histogram() +
  geom_boxplot() +
  theme_bw() +
  facet_wrap(~name, scales = "free")

fPC1_out <- boxplot(fish_train2$PC1, plot=FALSE)$out
fWeight_out2 <- boxplot(fish_train2$Weight, plot=FALSE)$out


fish_train2 <- fish_train2 %>%
  filter(!PC1 %in% fPC1_out) %>%
  filter(!Weight %in% fWeight_out2)

Model with PC1

fish_model_all2 <- lm(formula = Weight ~ . , data = fish_train2)
summary(fish_model_all2)

## 
## Call:
## lm(formula = Weight ~ ., data = fish_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -160.83  -44.91   -4.78   41.67  217.42 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       399.0677    17.2036  23.197  < 2e-16 ***
## SpeciesParkki      49.1646    37.7215   1.303    0.196    
## SpeciesPerch       12.2652    21.1597   0.580    0.564    
## SpeciesPike      -407.7797    30.7606 -13.257  < 2e-16 ***
## SpeciesRoach        2.1116    28.8630   0.073    0.942    
## SpeciesSmelt      280.7950    38.7665   7.243 1.31e-10 ***
## SpeciesWhitefish   17.4661    45.5129   0.384    0.702    
## PC1                22.2366     0.7298  30.471  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.83 on 92 degrees of freedom
## Multiple R-squared:  0.9495, Adjusted R-squared:  0.9456 
## F-statistic: 246.9 on 7 and 92 DF,  p-value: < 2.2e-16

Feature Selection with Backward.

fish_model_back2 <- step(object = fish_model_all2, direction = "backward", trace = FALSE)
summary(fish_model_back2)

## 
## Call:
## lm(formula = Weight ~ Species + PC1, data = fish_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -160.83  -44.91   -4.78   41.67  217.42 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       399.0677    17.2036  23.197  < 2e-16 ***
## SpeciesParkki      49.1646    37.7215   1.303    0.196    
## SpeciesPerch       12.2652    21.1597   0.580    0.564    
## SpeciesPike      -407.7797    30.7606 -13.257  < 2e-16 ***
## SpeciesRoach        2.1116    28.8630   0.073    0.942    
## SpeciesSmelt      280.7950    38.7665   7.243 1.31e-10 ***
## SpeciesWhitefish   17.4661    45.5129   0.384    0.702    
## PC1                22.2366     0.7298  30.471  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.83 on 92 degrees of freedom
## Multiple R-squared:  0.9495, Adjusted R-squared:  0.9456 
## F-statistic: 246.9 on 7 and 92 DF,  p-value: < 2.2e-16

Predict with Model PC1

fish_pred2 <- predict(object = fish_model_back2, newdata = fish_test2)

Model Evaluation

MAE is 79.02, MAPE is 1.81 (about 181%), and RMSE is 113.86. Compared with the base model (MAE 80.24, MAPE 1.58, RMSE 131.87), MAE and RMSE are lower, while MAPE is higher.

MAE(y_pred = fish_pred2, y_true = fish_test2$Weight)

## [1] 79.02441

MAPE(y_pred = fish_pred2, y_true = fish_test2$Weight)

## [1] 1.8096

RMSE(y_pred = fish_pred2, y_true = fish_test2$Weight)

## [1] 113.8594

Assumptions

Normality: the histogram shows errors distributed roughly normally around 0.

hist(fish_model_back2$residuals)

plot(density(fish_model_back2$residuals))

Normality with the Shapiro-Wilk test: p-value = 0.079 (> 0.05). Our tuned model passes the Shapiro-Wilk test.

shapiro.test(fish_model_back2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fish_model_back2$residuals
## W = 0.97709, p-value = 0.07871

Homoscedasticity with scatter plot showing no pattern.

plot(x=fish_model_back2$fitted.values, y = fish_model_back2$residuals)
abline(h = 0, col = "red", lty = 2)

Homoscedasticity with the Breusch-Pagan test: p-value = 0.46 (> 0.05). Our tuned model passes the Breusch-Pagan test.

bptest(fish_model_back2)

## 
##  studentized Breusch-Pagan test
## 
## data:  fish_model_back2
## BP = 6.7248, df = 7, p-value = 0.4581

No multicollinearity is shown by vif.

vif(fish_model_back2)

##             GVIF Df GVIF^(1/(2*Df))
## Species 3.076256  6        1.098167
## PC1     3.076256  1        1.753926

Model comparison

compare_performance(fish_model_all,fish_model_back,fish_model_for,fish_model_both, fish_model_back2)

## # Comparison of Model Performance Indices
## 
## Name             | Model |      AIC | AIC weights |      BIC | BIC weights |    R2 | R2 (adj.) |   RMSE |  Sigma
## ----------------------------------------------------------------------------------------------------------------
## fish_model_all   |    lm | 1117.069 |       0.095 | 1150.936 |       0.008 | 0.968 |     0.964 | 56.622 | 60.360
## fish_model_back  |    lm | 1114.749 |       0.302 | 1143.406 |       0.331 | 0.967 |     0.964 | 57.100 | 60.189
## fish_model_for   |    lm | 1114.749 |       0.302 | 1143.406 |       0.331 | 0.967 |     0.964 | 57.100 | 60.189
## fish_model_both  |    lm | 1114.749 |       0.302 | 1143.406 |       0.331 | 0.967 |     0.964 | 57.100 | 60.189
## fish_model_back2 |    lm | 1153.798 |     < 0.001 | 1177.245 |     < 0.001 | 0.949 |     0.946 | 70.814 | 73.828

Conclusion

Stepwise selection keeps Species, Diagonal, Height, and Width as predictors of our target variable, Weight, and drops Vertical and Cross. The base model, however, meets only one (homoscedasticity) of the three assumptions. The tuned PCA model has a slightly lower adjusted R-squared, 0.9456 (compared to 0.9639 for the base model), but it meets all three assumptions (normality, homoscedasticity, no multicollinearity). This makes the tuned PCA model the more reliable choice for predicting fish weight.

Predict Fish Weight With OLS Linear Regression

Sunarto Rusli

3/4/2022
