Predicting Fish Weight: Linear Regression Method
Guess the Fish Weight: Predicting the Weight of a Fish with the Linear Regression Method
Plenty of fish! Fish is one of the most common foods loved by humans. Still, overfishing tends to happen, which pushes regulators to rule that some fish cannot be caught if they have not yet reached a minimum weight.
We have a dataset of fish species with several measurements per fish: three length measurements (Length1, Length2, Length3), height, and width. We will try to predict the weight of a fish based on these characteristics, using the Linear Regression Method to see whether the weight of a fish is related to them.
Introduction
With the Linear Regression Method, we will use the fish dataset to identify the relationship between Weight and the other numerical variables. We will also see whether the weight of a fish can be predicted from historical data.
Attaching Necessary Library
Load the packages required for the analysis.
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(performance)
library(MLmetrics)
library(rmdformats)
## Warning: package 'rmdformats' was built under R version 4.0.3
Data Import
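The import code is not echoed in the original output, so below is a minimal sketch of how the data could be read with readr; the file name fish.csv and the object name fish are assumptions.
# Import the fish dataset (the file name/path is an assumption)
fish <- read_csv("fish.csv")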
## Parsed with column specification:
## cols(
## Species = col_character(),
## Weight = col_double(),
## Length1 = col_double(),
## Length2 = col_double(),
## Length3 = col_double(),
## Height = col_double(),
## Width = col_double()
## )
Data Wrangling
We will exclude the Species variable, since for now we only need the numerical variables.
After removing the unnecessary variable, we will check the data types of the remaining variables and look for missing values.
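A minimal sketch of this wrangling step, assuming the imported data frame is named fish:
# Keep only the numerical variables by dropping Species
fish <- fish %>% select(-Species)
# Inspect data types and check for missing values
glimpse(fish)
anyNA(fish)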
## Rows: 159
## Columns: 6
## $ Weight <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
## $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
## $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
## $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
## $ Height <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
## $ Width <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
## [1] FALSE
After confirming that the data types are correct and there are zero missing values, we continue to the Exploratory Data Analysis.
Exploratory Data Analysis
In this section, we will check whether there is any correlation between the variables.
We will use the Pearson correlation with the ggcorr function from the GGally library.
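A minimal sketch of that call, assuming the cleaned data frame is named fish:
# Pearson correlation heatmap of the numerical variables
ggcorr(fish, label = TRUE, label_round = 2)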
The graphic above shows that each of the numerical variables has a strong correlation with Weight, our target variable.
Cross Validation
Before creating the model, we need to split our dataset into Train and Test datasets so we can evaluate the model fairly. We will use the Train dataset to train our model, while the Test dataset will be used to check whether our model can predict new data it has not seen before.
We will split the Train and Test datasets with a 70:30 ratio.
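The split code is not shown here, so below is a sketch assuming the same pattern (and seed) used later for the revised data:
# Reproducible 70:30 train-test split (the seed value 100 is an assumption)
set.seed(100)
index <- sample(nrow(fish), nrow(fish) * 0.7)
fish_train <- fish[index, ]
fish_test <- fish[-index, ]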
Modelling
In this phase, we will create a model with the Linear Regression Method, using Weight as our target variable and Width as our predictor variable.
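A sketch of that model; the object name model_init is an assumption, since the original name is not shown:
# Simple linear regression of Weight on Width
model_init <- lm(formula = Weight ~ Width, data = fish_train)
summary(model_init)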
##
## Call:
## lm(formula = Weight ~ Width, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250.02 -110.64 -37.03 62.77 887.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -423.838 46.690 -9.078 5.41e-15 ***
## Width 184.982 9.954 18.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 171.7 on 109 degrees of freedom
## Multiple R-squared: 0.7601, Adjusted R-squared: 0.7579
## F-statistic: 345.4 on 1 and 109 DF, p-value: < 2.2e-16
From the summary, we can conclude that Width, our predictor variable, has a p-value below 0.05; this means Width has a significant effect on Weight, our target variable.
For a simple interpretation of the coefficient: every 1 unit increase in Width corresponds to a 184.982 unit increase in Weight.
We also need to check the Multiple R-squared of the model: it is around 0.7601, or 76.01%. This means the model we created can explain 76.01% of the variance of our target variable.
Now we will try to add more variables to our model. To do this we will use the Step-Wise Regression Method, which selects the formula for our model based on the lowest AIC.
The lower the AIC, the less information the model loses when describing the data.
First we will create two different models: model_null, a model without any predictor variables, and model_all, a model with all variables included.
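A sketch of those two models, with the summary of the intercept-only model printed first:
# Intercept-only model and full model with all predictors
model_null <- lm(formula = Weight ~ 1, data = fish_train)
model_all <- lm(formula = Weight ~ ., data = fish_train)
summary(model_null)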
##
## Call:
## lm(formula = Weight ~ 1, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389.2 -269.2 -119.2 260.8 1210.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.24 33.13 11.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349 on 110 degrees of freedom
Based on model_null, we can see that without any predictors the model simply predicts the mean value of the target variable, 389.24.
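The full model is then summarized with the assumed call below:
summary(model_all)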
##
## Call:
## lm(formula = Weight ~ ., data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251.54 -62.93 -20.95 44.49 437.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -487.479 35.933 -13.566 < 2e-16 ***
## Length1 49.752 47.601 1.045 0.298327
## Length2 16.237 50.058 0.324 0.746314
## Length3 -38.222 20.118 -1.900 0.060196 .
## Height 34.218 9.926 3.447 0.000815 ***
## Width 1.121 23.284 0.048 0.961689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.3 on 105 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8753
## F-statistic: 155.4 on 5 and 105 DF, p-value: < 2.2e-16
For model_all, we can conclude that when all variables are put into the model, the only variable with a p-value < 0.05 is Height.
Now we will use the Step-Wise Regression Method to choose the best model to analyze.
# Create formula using Step-Wise Regression
step(object = model_all, direction = "backward", trace = F)
##
## Call:
## lm(formula = Weight ~ Length1 + Length3 + Height, data = fish_train)
##
## Coefficients:
## (Intercept) Length1 Length3 Height
## -484.70 65.63 -36.94 34.97
step(object = model_null, direction = "forward", scope = list(lower = model_null, upper = model_all), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Width Height Length1
## -486.255 -35.401 3.284 33.875 63.677
step(object = model_null, direction = "both", scope = list(lower = model_null, upper = model_all), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Height Length1
## -484.70 -36.94 34.97 65.63
# Create the model
model_step_back <- lm(formula = Weight ~ Length1 + Length3 + Height, data = fish_train)
model_step_forw <- lm(formula = Weight ~ Length3 + Width + Height + Length1, data = fish_train)
model_step_both <- lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
# Compare these models with `model_all`
compare_performance(model_all, model_step_back, model_step_forw, model_step_both)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ---------------------------------------------------------------------------------
## model_all | lm | 1391.59 | 1410.56 | 0.88 | 0.88 | 119.87 |
## model_step_back | lm | 1387.73 | 1401.28 | 0.88 | 0.88 | 119.95 | 103.80
## model_step_forw | lm | 1389.70 | 1405.96 | 0.88 | 0.88 | 119.93 | 9.97
## model_step_both | lm | 1387.73 | 1401.28 | 0.88 | 0.88 | 119.95 | 103.80
From the comparison of these 4 models, the ones with the lowest AIC are model_step_back and model_step_both, both with an AIC of 1387.73. Both models also have an Adjusted R-squared of around 0.88, or 88%, which means each of them can explain 88% of the variance of our target variable.
It does not matter which of these 2 models we choose, as they produce the same results; we will use model_step_both for further analysis.
Evaluation
In this step, we will check our model performance.
fish_pred <- predict(object = model_step_both, newdata = fish_test, level = 0.95)
# RMSE of train dataset
RMSE(y_pred = model_step_both$fitted.values, y_true = fish_train$Weight)
## [1] 119.9461
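The test-set error reported next is assumed to come from the equivalent call on the test predictions:
# RMSE of test dataset
RMSE(y_pred = fish_pred, y_true = fish_test$Weight)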
## [1] 124.8925
We can see from the RMSE that the model trained on the train dataset is good enough to predict the test dataset: the error on the train dataset is only slightly lower than the error on the test dataset.
Assumption Checking
In the Linear Regression Method, there are a few classical assumptions that need to be fulfilled. If the model fails to satisfy these assumptions, its results may be inaccurate or misleading.
In this section we will check several assumptions:
- Normality
- Homoscedasticity
- Multicollinearity
Normality Test
The normality test is conducted to see whether our residuals fulfill the assumption of a normal distribution. To do this, we will run the Shapiro-Wilk Normality Test using the shapiro.test() function.
Hypothesis:
H0: The residuals follow a normal distribution
H1: The residuals do not follow a normal distribution
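A minimal sketch of the test call on the residuals of our chosen model:
# Shapiro-Wilk normality test on the residuals
shapiro.test(model_step_both$residuals)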
##
## Shapiro-Wilk normality test
##
## data: model_step_both$residuals
## W = 0.95116, p-value = 0.0004733
From the result above, our p-value is < 0.05, thus we reject the null hypothesis: our residuals do not follow a normal distribution.
We can also inspect the residuals' distribution by creating a histogram.
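A minimal sketch of such a histogram:
# Histogram of the residuals to inspect their distribution visually
hist(model_step_both$residuals, breaks = 20)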
Homoscedasticity Test
We conduct the Homoscedasticity test to see whether the variance of our residuals is distributed evenly and does not form any pattern.
To do this, we will conduct the Breusch-Pagan test using the bptest() function from the lmtest package.
Hypothesis:
H0: The error variance is distributed evenly (Homoscedasticity)
H1: The error variance forms a pattern (Heteroscedasticity)
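A minimal sketch of the test call:
# Breusch-Pagan test for heteroscedasticity
bptest(model_step_both)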
##
## studentized Breusch-Pagan test
##
## data: model_step_both
## BP = 40.342, df = 3, p-value = 9.017e-09
We can see from the result above that our p-value is < 0.05, thus we reject the null hypothesis: our residuals form a pattern (heteroscedasticity), so we fail to satisfy the homoscedasticity assumption.
We can also visualize the result with a scatterplot of the prediction values (fitted values) against the residuals.
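A minimal sketch of that plot:
# Fitted values vs residuals; a visible pattern indicates heteroscedasticity
plot(model_step_both$fitted.values, model_step_both$residuals)
abline(h = 0, col = "red")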
Multicollinearity Test
We conduct the Multicollinearity test to see whether there is any strong relationship between the predictor variables. The existence of multicollinearity can be seen from the VIF (Variance Inflation Factor) value, which shows how much the coefficient variance is inflated due to multicollinearity.
When a VIF value is greater than 10, it indicates multicollinearity in our data.
We conduct the Multicollinearity test using the vif() function from the car package.
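A minimal sketch of the call:
# Variance Inflation Factor of each predictor in the chosen model
vif(model_step_both)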
## Length3 Height Length1
## 207.26264 5.65118 171.31491
From the result above, we can see that the Length3 and Length1 variables have VIF values > 10, which means we have multicollinearity in our data.
From the tests we have conducted, none of these assumptions are fulfilled by our data. In the next step, we will tune the data so that the required assumptions are satisfied and the model does not mislead us when predicting the test data.
Data & Model Tuning
As we can see, our model does not fulfill all the required assumptions and may therefore give misleading results. To fulfill the assumptions, we will tune our model first.
With ggcorr we will look at the correlation between the predictor variables.
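A minimal sketch of that call, assuming the data frame is named fish:
# Correlation between predictor variables only (target excluded)
ggcorr(fish %>% select(-Weight), label = TRUE, label_round = 2)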
##
## Call:
## lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Height Length1
## -484.70 -36.94 34.97 65.63
From the graphic above, we can see that the Length1 and Length2 variables are highly correlated with the Length3 variable. We will remove these variables from our dataset and recreate our model from the new dataset.
Next, we will check whether there are any outliers in our fish data.
From the boxplot chart, we can see that we have a few outliers in our data. We will remove the outliers from our fish data and store the result in an object called fish_revised, as sketched below.
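The exact filtering rule is not shown, so the sketch below assumes a standard 1.5 * IQR rule on Weight, applied after dropping the redundant Length1 and Length2 variables mentioned above:
# Drop Length1 and Length2, then remove Weight outliers (the 1.5 * IQR rule is an assumption)
q <- quantile(fish$Weight, probs = c(0.25, 0.75))
iqr_w <- q[2] - q[1]
fish_revised <- fish %>%
  select(-Length1, -Length2) %>%
  filter(Weight >= q[1] - 1.5 * iqr_w, Weight <= q[2] + 1.5 * iqr_w)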
We will divide our new data into train and test datasets based on the fish_revised data.
set.seed(100)
index_rev <- sample(nrow(fish_revised), nrow(fish_revised)*0.7)
fish_train_rev <- fish_revised[index_rev,]
fish_test_rev <- fish_revised[-index_rev,]
Next, we will create models based on the fish_train_rev data. We will create a model with Width as the predictor variable, stored in model_init_rev. The other models are a model without predictor variables, stored in model_null_rev, and a model with all predictor variables included, stored in model_all_rev.
Again, these 3 models will be based on fish_train_rev and will be different from our previous models.
model_init_rev <- lm(formula = Weight ~ Width, data = fish_train_rev)
model_null_rev <- lm(formula = Weight ~ 1, data = fish_train_rev)
model_all_rev <- lm(formula = Weight ~ ., data = fish_train_rev)
compare_performance(model_init_rev, model_null_rev, model_all_rev)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ----------------------------------------------------------------------------------
## model_init_rev | lm | 1380.90 | 1388.98 | 0.83 | 0.83 | 132.68 |
## model_null_rev | lm | 1572.10 | 1577.48 | 0.00 | 0.00 | 321.87 | 0.00
## model_all_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1.77e+06
Here is the comparison between our newly created models. We can see that, based on AIC and Adjusted R-squared, model_all_rev is better than the other two models.
Next, we will create more models using the Step-Wise Regression Method and compare them with model_all_rev.
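The backward-elimination call producing the first result below is assumed to mirror the earlier section:
step(object = model_all_rev, direction = "backward", trace = F)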
##
## Call:
## lm(formula = Weight ~ Length3 + Height + Width, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Height Width
## -438.79 13.96 10.54 68.41
step(object = model_null_rev, direction = "forward", scope = list(lower = model_null_rev, upper = model_all_rev), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Width Height
## -438.79 13.96 68.41 10.54
step(object = model_null_rev, direction = "both", scope = list(lower = model_null_rev, upper = model_all_rev), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Width Height
## -438.79 13.96 68.41 10.54
model_step_back_rev <- lm(formula = Weight ~ Length3 + Height + Width, data = fish_train_rev)
model_step_forw_rev <- lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
model_step_both_rev <- lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
compare_performance(model_step_back_rev, model_step_both_rev, model_step_forw_rev, model_all_rev)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ---------------------------------------------------------------------------------
## model_step_back_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 |
## model_step_both_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
## model_step_forw_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
## model_all_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
All the Adjusted R-squared values are around 0.88, or 88%. This means that each of these models can explain 88% of the variance of our target variable.
Based on the comparison above, all of these models have the same AIC and Adjusted R-squared values, so it is not an issue to choose any of them; we will use model_step_both_rev for further analysis.
fish_pred_rev <- predict(object = model_step_both_rev, newdata = fish_test_rev, level = 0.95)
# RMSE of train_rev dataset
RMSE(y_pred = model_step_both_rev$fitted.values, y_true = fish_train_rev$Weight)
## [1] 111.381
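The test-set error reported next is assumed to come from the equivalent call on the revised test predictions:
# RMSE of test_rev dataset
RMSE(y_pred = fish_pred_rev, y_true = fish_test_rev$Weight)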
## [1] 94.73077
Although the error on the train data is higher than on the test data, the difference is not that large.
We can conclude that our model is good enough to predict the test dataset.
Assumption Checking: Based on Tuned Data
Normality Test
Hypothesis:
H0: The residuals follow a normal distribution
H1: The residuals do not follow a normal distribution
##
## Shapiro-Wilk normality test
##
## data: model_step_both_rev$residuals
## W = 0.93915, p-value = 8.548e-05
From the result above, we can see that our p-value is < 0.05, thus we reject the null hypothesis: our residuals still do not follow a normal distribution.
We can also inspect the residuals' distribution by creating a histogram.
Homoscedasticity Test
Hypothesis:
H0: The error variance is distributed evenly (Homoscedasticity)
H1: The error variance forms a pattern (Heteroscedasticity)
##
## studentized Breusch-Pagan test
##
## data: model_step_both_rev
## BP = 6.7147, df = 3, p-value = 0.08157
We can see from the result above that our p-value is > 0.05, thus we fail to reject the null hypothesis: our residuals do not form a pattern, so we fulfill the homoscedasticity assumption.
Conclusions
Based on the analysis, we can conclude that the predictor variables Length3, Width, and Height have a significant effect on `Weight`, our target variable. Our model is also able to fulfill two out of three classical assumptions. The Adjusted R-squared of the model is 0.876, meaning the predictors can explain 87.6% of the variance of fish weight. The accuracy of our model in predicting the weight of a fish can be measured with the RMSE: the training dataset has an RMSE of around 111.381, while the test dataset has an RMSE of around 94.731. Although the RMSE of the train data is higher than that of the test data, the difference is not that large, and we can suggest that our model is good enough to predict the test dataset.
Future Improvement
While our model is able to fulfill two out of three classical assumptions, it still fails the normality assumption. To fulfill this assumption, we could tune the model further, for example by log-transforming the target variable in the training data.
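A minimal sketch of that idea, refitting the chosen formula on a log-transformed target; note that the log requires strictly positive weights and that predictions must be back-transformed with exp():
# Log-transform the target (requires Weight > 0)
fish_train_log <- fish_train_rev %>% filter(Weight > 0)
model_log <- lm(formula = log(Weight) ~ Length3 + Width + Height, data = fish_train_log)
# Re-check the normality assumption on the residuals of the log model
shapiro.test(model_log$residuals)
# Back-transform predictions to the original scale
fish_pred_log <- exp(predict(model_log, newdata = fish_test_rev))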