# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Intro

This article is made for completing the assignment for Algoritma : Machine Learning Course in Linear Regression Material Theia Batch 2022.

We are going to use fish measurement data in one fish market to learn about linear regression techniques to predict weight of fish based on variables available in the dataset. In real life application, the weight of the fish is correlated with the price of the fish, so the target variable on this article would be weight.

In this dataset there are few assumptions :

  • Data is taken in one Marketplace.
  • There are only 7 kind of fishes sold in the Marketplace.
  • Data is taken in one period of time for all fish.

Explaratory Data Analysis

Load Library

options(scipen = 99) #non active science annotation (example : e-1)
library(tidyverse) #load several important libraries 
library(dplyr) #grammar of data manipulation
library(ggplot2) #plotting data visualization
library(scales) #edit scales
library(GGally) #plotting data visualization

Load Dataset

#load dataset
fish <- read.csv("data-input/fish.csv")
glimpse(fish)
#> Rows: 159
#> Columns: 7
#> $ Species <chr> "Bream", "Bream", "Bream", "Bream", "Bream", "Bream", "Bream",…
#> $ Weight  <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500, 50…
#> $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28…
#> $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31…
#> $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2, 36…
#> $ Height  <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.1795,…
#> $ Width   <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6900…

Based on the Dataset Information, Length 1 means vertical length in centimeters (cm), Length 2 means diagonal length in cm, Length 3 means cross length in cm. We will change the column name based on this information to make easier data wrangling.

fish_clean <-
  fish %>%
  mutate(Species = as.factor(Species), 
         Vertical.Length = Length1, 
         Diagonal.Length = Length2, 
         Cross.Length = Length3) %>%
  select(-c(Length1, Length2, Length3))
glimpse(fish_clean)
#> Rows: 159
#> Columns: 7
#> $ Species         <fct> Bream, Bream, Bream, Bream, Bream, Bream, Bream, Bream…
#> $ Weight          <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475,…
#> $ Height          <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, …
#> $ Width           <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785…
#> $ Vertical.Length <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, …
#> $ Diagonal.Length <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, …
#> $ Cross.Length    <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, …
  • Species = Species of Fish sold in the Market
  • Weight = Weight of Fish sold in gram
  • Height = Height of Fish in centimeter
  • Width = Width of Fish in centimeter
  • Vertical.Length = Vertical Length of Fish in centimeter
  • Diagonal.Length = Diagonal Length of Fish in centimeter
  • Cross.Length = Cross Length of Fish in centimeter

Insight from Dataset

When looking at fish, there are two things most of us look at before deciding to buy. First we take a look at the species, and then we take a look at the weight. After that, we look at the size and later decide if the size and the weight make sense and reasonable. Some fish could have meat density different from others because of many factors and it also has an effect on the nutrition contained in the meat. This is why Tuna will cost much higher compared to common Bass.

Bream

Bream

Parkki

Parkki

Perch

Perch

Pike

Pike

Roach

Roach

Smelt

Smelt

Whitefish

Whitefish

After simple research for the fish species, I took several images that display the average size of each species. Assuming the hand size is the same, we can see that each fish has very different characteristics, size, and proportion on its body. This could affect data prediction so we may have to separate each species modeling, we will decide later.

Modelling Data

Data Inspection and Validation

Before we continue, let’s take more look at the the data.

summary(fish_clean$Weight)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     0.0   120.0   273.0   398.3   650.0  1650.0

There is strange data here which the minimum weight is 0

ggplot(data = fish_clean, (aes(y = Weight))) + 
  geom_boxplot(aes(fill = Weight), width = 1)

We see that there are three data which is classified as outliers based on Boxplot Interquartile Method

We also take a look at weight for each species

ggplot(data = fish_clean, aes(x = Species, y = Weight)) + 
  geom_boxplot(aes(fill = Weight), width = 0.8, outlier.shape = NA) +
  geom_jitter()

Apparently the three outliers from before is the very huge Pike fish if we look at the boxplot of Pike

If we take a look at Smelt Fish, we can see that the fishes is consisted on very light (few grams) population. Which means the fish can’t grow big (maximum size is small)

Data Cleansing

Check Missing Value

colSums(is.na(fish_clean))
#>         Species          Weight          Height           Width Vertical.Length 
#>               0               0               0               0               0 
#> Diagonal.Length    Cross.Length 
#>               0               0

We wanted to see which row contains Weight==0

fish_clean %>% filter(Weight == 0)

Based on data above, it is impossible that kind of fish has 0 gram of weight. It must be failed input of the data.

fish_clean <- subset(fish_clean, Weight != 0)

Since Smelt Fish is so small and consist of very small weight, it will not going to produce anything significant in our models. Therefore we can remove it from our dataset And also, it is very uncommon when people buy very small fish individually and check the size from each of the fish.

fish_set <- fish_clean %>% filter(Species != "Smelt") %>% droplevels()

Correlation Data

To see correlation between predictor, we use ggcorr() function to see the easy graph for it.

ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

Based on diagram above. Vertical.Lenth, Diagonal.Length and Cross.Length is correlated very high and it is sign of linearity. Therefore we exclude two of those column from the dataset. It is possible because fish are usually in the shape of a long rectangular shape, therefore between the cross length and vertical length and diagonal length is very highly correlated.

fish_set <- fish_set %>% select(-c(Cross.Length,Vertical.Length))
ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

The correlation is still very high but we will go on using all predictors for one model and only just one predictor for one model.

Linear Modelling

All kind Fish included with Every Variables

We are using lm method with (Weight ~ .) means that we are using all predictors from data = fish_set

model_fish_all <- lm(Weight~., data = fish_set)
summary(model_fish_all)
#> 
#> Call:
#> lm(formula = Weight ~ ., data = fish_set)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -224.38  -58.86  -11.00   29.49  407.16 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)      -764.1713    87.5509  -8.728  0.00000000000000861 ***
#> SpeciesParkki      73.2008    49.5211   1.478               0.1417    
#> SpeciesPerch       28.7326    84.7023   0.339               0.7350    
#> SpeciesPike      -291.7249   128.4700  -2.271               0.0247 *  
#> SpeciesRoach       24.0990    79.4064   0.303               0.7620    
#> SpeciesWhitefish   28.9762    81.1718   0.357               0.7217    
#> Height              9.4507    13.5908   0.695               0.4880    
#> Width              -0.1356    25.0075  -0.005               0.9957    
#> Diagonal.Length    37.4297     3.8903   9.621 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 98.3 on 135 degrees of freedom
#> Multiple R-squared:  0.9265, Adjusted R-squared:  0.9222 
#> F-statistic: 212.8 on 8 and 135 DF,  p-value: < 0.00000000000000022

Based on model above, we can see that the Species Pike has significantly more higher P Value than other species. The most significant variables above is Intercept and Diagonal.Length.

We will see initial performance of this model with Data Visualization

ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) + 
  geom_point() +
  stat_summary(fun.data= model_fish_all) + 
  geom_smooth(method='lm') +
  scale_y_continuous(limits = c(5, 60))

Based on the plot above, we can see that our model is still underfit (don’t bother to measure the performance, it is very bad based on the plot alone)

Model Visualization with Linear Regression for Each Species, but only measuring With Diagonal.Length

ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) + 
  geom_point(aes(color=Species), size = 1) +
  geom_smooth(method=lm,se=FALSE,fullrange=TRUE,
                  aes(color=Species), size = 0.5) +
  scale_y_continuous(limits = c(5, 60))

We can get some insight about the data from plot above :

  • There are two group of fish. One : Roach and Parkki. Two : Pike, Whitefish, Bream, and Perch.
  • The Linear Model we created is forced to accomodate the data of Roach and Parkki which the slope is significantly different with other fishes.

Let see the distribution for each species

ggpairs(data = fish_set, 
        columns = c(1,5),
        diag = list(discrete="barDiag",
                    continuous = wrap("densityDiag",
                                      alpha=0.5 )),
        mapping = aes(color=Species))

To make the models more accurate, we can treat these two group as different population. Based on the data, it looks like Roach and Parkki belongs to small-fish group that is not sold individually.

Therefore, we make the dataset into two group.

small_set <- fish_set %>% filter(Species %in% c("Roach","Parkki"))
big_set <- fish_set %>% filter(Species %in% c("Pike", "Whitefish", "Bream", "Perch"))

Based on the Correlation Value, it seems that every variables are highly correlated with each other. We will make two models, with all predictor and with one predictor, we choose Diagonal.Length for model with one predictor.

#all predictor
model_small_fish <- lm(Weight~., data = small_set)
model_big_fish<- lm(Weight~., data = big_set)

#one predictor
model_small_fish_1 <- lm(Weight~Diagonal.Length, data = small_set)
model_big_fish_1 <- lm(Weight~Diagonal.Length, data = big_set)

Model Evaluation

Create Prediction Value

We are using predict() to set value based on the linear model that we created. Therefore it can be compared between the values model predicted and the real values from the dataset.

fish_set$Weight.Predict <- predict(object = model_fish_all, newdata = fish_set)

small_set$Weight.Predict <- predict(object = model_small_fish, newdata = small_set)
small_set$Weight.Predict.1 <- predict(object = model_small_fish_1, newdata = small_set)

big_set$Weight.Predict <- predict(object = model_big_fish, newdata = big_set)
big_set$Weight.Predict.1 <- predict(object = model_big_fish_1, newdata = big_set)

Metric Measurement

We test the model using two kind of Metric Measurement to see the performance of the model.

Adjusted R-squared

Goodnes of Fit based on Adjusted R Squared Value

#all species
summary(model_fish_all)$adj.r.squared #all predictor
#> [1] 0.9221694
#small fish
summary(model_small_fish)$adj.r.squared #all predictor
#> [1] 0.9504212
summary(model_small_fish_1)$adj.r.squared #all predictor
#> [1] 0.8656531
#big fish
summary(model_big_fish)$adj.r.squared #all predictor
#> [1] 0.9178336
summary(model_big_fish_1)$adj.r.squared #all predictor
#> [1] 0.8099166

Based on the value above :

  • The performance of the model will be better if it’s using all predictors instead of one.

  • Model for small fish can predict better, because the range of value for target variable Weight is also significantly smaller.

  • Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)

RMSE

Root Mean Squared Error is a value obtained by square root the error value, it can be directly interpreted with the range of data (in this example, weight).

library(MLmetrics)

#all species
RMSE(y_pred = fish_set$Weight.Predict, y_true = fish_set$Weight)
#> [1] 95.18191
range(fish_set$Weight)
#> [1]    5.9 1650.0
#small fish
RMSE(y_pred = small_set$Weight.Predict, y_true = small_set$Weight) #all predictor
#> [1] 16.35706
RMSE(y_pred = small_set$Weight.Predict.1, y_true = small_set$Weight) #1 predictor
#> [1] 28.49577
range(small_set$Weight)
#> [1]  40 390
#big fish
RMSE(y_pred = big_set$Weight.Predict, y_true = big_set$Weight) #all predictor
#> [1] 99.76553
RMSE(y_pred = big_set$Weight.Predict.1, y_true = big_set$Weight) #1 predictor
#> [1] 155.2467
range(big_set$Weight)
#> [1]    5.9 1650.0

Example for Interpretation :

  • On Small Fish Model with All Predictor, model can predict the value with 16.35 range of error with 95% confidence level. The range of value on weight is 40.390.

Based on the value above :

  • The performance of the model will be better if it’s using all predictors instead of one.

  • Model for small fish can predict better, because the range of value for target variable Weight is also significantly smaller.

  • Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)

  • It is align with the result from Adjusted R Square

Statistics Assumption Test

We apply several test to take a look at the behavior of the data based on the assumptions that are set when creating linear model.

Variable Linearity

cor.test(fish_set$Width, fish_set$Diagonal.Length)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Width and fish_set$Diagonal.Length
#> t = 17.828, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7729129 0.8758446
#> sample estimates:
#>       cor 
#> 0.8313777
cor.test(fish_set$Width, fish_set$Height)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Width and fish_set$Height
#> t = 12.195, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6247335 0.7867517
#> sample estimates:
#>       cor 
#> 0.7152201
cor.test(fish_set$Height, fish_set$Diagonal.Length)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Height and fish_set$Diagonal.Length
#> t = 7.3805, df = 142, p-value = 0.00000000001214
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.3971759 0.6353929
#> sample estimates:
#>       cor 
#> 0.5265432

Based on the value above, there’s linearity because p-value < alpha Therefore, only use one predictor is better for making this model.

Normality of Residuals

We only take three of the best models for this analysis. Model 1 : All kind of Fish, and All Predictors Model 2 : Small Fish, and All Predictors Model 3 : Small Fish, One Predictor

# histogram residual
hist(model_fish_all$residuals)

# histogram residual
hist(model_small_fish$residuals)

# histogram residual
hist(model_small_fish_1$residuals)

Shapiro Test

# shapiro test dari residual
shapiro.test(model_fish_all$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_fish_all$residuals
#> W = 0.93472, p-value = 0.000003278
# shapiro test dari residual
shapiro.test(model_small_fish$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_small_fish$residuals
#> W = 0.89833, p-value = 0.007645
# shapiro test dari residual
shapiro.test(model_small_fish_1$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_small_fish_1$residuals
#> W = 0.95649, p-value = 0.2513

alpha = 0.05

Based on the value above, only model_small_fish_1 that accept assumption Normality of Residuals with 95% Confidence Level.

The data is not distributed normally on the rest of the model.

Homoscedasticity of Residuals

# bptest dari model
library(lmtest)
bptest(model_small_fish_1)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_small_fish_1
#> BP = 2.3854, df = 1, p-value = 0.1225

alpha = 0.05

Based on the value above, we can take that the data is spread on pattern (homoscedasticity) with 95% Confidence Level. So the model performance would not be very good.

No Multicollinearity

For this test, we use model_fish_all because it’s using all predictors.

# vif dari model
library(car)
vif(model_fish_all)
#>                      GVIF Df GVIF^(1/(2*Df))
#> Species         148.43366  5        1.648744
#> Height           41.84501  1        6.468772
#> Width            19.45973  1        4.411319
#> Diagonal.Length  21.69180  1        4.657446
  • VIF > 10: multicollinearity
  • VIF < 10: no multicollinearity

Based on values above, there are multicollinearity on the predictors used in the model_fish_all. Means that the value of predictors are very correlated.

summary(model_small_fish)
#> 
#> Call:
#> lm(formula = Weight ~ ., data = small_set)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -20.261 -12.578  -3.534   6.305  43.497 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value       Pr(>|t|)    
#> (Intercept)     -263.558     26.408  -9.980 0.000000000334 ***
#> SpeciesRoach      -5.790     27.477  -0.211       0.834812    
#> Height            13.047      8.639   1.510       0.143524    
#> Width             78.852     20.998   3.755       0.000927 ***
#> Diagonal.Length    2.334      4.277   0.546       0.590135    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 17.92 on 25 degrees of freedom
#> Multiple R-squared:  0.9573, Adjusted R-squared:  0.9504 
#> F-statistic:   140 on 4 and 25 DF,  p-value: < 0.00000000000000022

Conclusion

In Conclusion :

  • The best model we can make in this Article is model_small_fish_1, which is the model that predict small fish only with one Predictor. This model also passed several evaluation of model.

  • The Dataset is not distributed normally, which cause big error in some of model predictions. Therefore it would be better to log transform the data before doing modelling.

  • The Model applied is not passed several assumptions for linear modelling, for example multicollinearity. therefore it could be better if it is processed by Principal Component Analysis (Advanced Technique).

  • Number of Data is not sufficient to make better model, especially for the bigger fishes.

  • The Linear Model would be very good to predict one type of fish, and not the combination of them all.

  • This is why in real life application, seller usually not measure the fish by its size, rather its weight because it is simply only one metrics available and easy to use.