# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Intro

This article is made for completing the assignment for Algoritma : Machine Learning Course in Linear Regression Material Theia Batch 2022.

We are going to use fish measurement data in one fish market to learn about linear regression techniques to predict weight of fish based on variables available in the dataset. In real life application, the weight of the fish is correlated with the price of the fish, so the target variable on this article would be weight.

In this dataset there are few assumptions :

Data is taken in one Marketplace.
There are only 7 kind of fishes sold in the Marketplace.
Data is taken in one period of time for all fish.

Explaratory Data Analysis

Load Library

options(scipen = 99) #non active science annotation (example : e-1)
library(tidyverse) #load several important libraries 
library(dplyr) #grammar of data manipulation
library(ggplot2) #plotting data visualization
library(scales) #edit scales
library(GGally) #plotting data visualization

Load Dataset

#load dataset
fish <- read.csv("data-input/fish.csv")
glimpse(fish)

#> Rows: 159
#> Columns: 7
#> $ Species <chr> "Bream", "Bream", "Bream", "Bream", "Bream", "Bream", "Bream",…
#> $ Weight  <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500, 50…
#> $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28…
#> $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31…
#> $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2, 36…
#> $ Height  <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.1795,…
#> $ Width   <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6900…

Based on the Dataset Information, Length 1 means vertical length in centimeters (cm), Length 2 means diagonal length in cm, Length 3 means cross length in cm. We will change the column name based on this information to make easier data wrangling.

fish_clean <-
  fish %>%
  mutate(Species = as.factor(Species), 
         Vertical.Length = Length1, 
         Diagonal.Length = Length2, 
         Cross.Length = Length3) %>%
  select(-c(Length1, Length2, Length3))
glimpse(fish_clean)

#> Rows: 159
#> Columns: 7
#> $ Species         <fct> Bream, Bream, Bream, Bream, Bream, Bream, Bream, Bream…
#> $ Weight          <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475,…
#> $ Height          <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, …
#> $ Width           <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785…
#> $ Vertical.Length <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, …
#> $ Diagonal.Length <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, …
#> $ Cross.Length    <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, …

Species = Species of Fish sold in the Market
Weight = Weight of Fish sold in gram
Height = Height of Fish in centimeter
Width = Width of Fish in centimeter
Vertical.Length = Vertical Length of Fish in centimeter
Diagonal.Length = Diagonal Length of Fish in centimeter
Cross.Length = Cross Length of Fish in centimeter

Insight from Dataset

When looking at fish, there are two things most of us look at before deciding to buy. First we take a look at the species, and then we take a look at the weight. After that, we look at the size and later decide if the size and the weight make sense and reasonable. Some fish could have meat density different from others because of many factors and it also has an effect on the nutrition contained in the meat. This is why Tuna will cost much higher compared to common Bass.

Bream

Parkki

Perch

Pike

Roach

Smelt

Whitefish

After simple research for the fish species, I took several images that display the average size of each species. Assuming the hand size is the same, we can see that each fish has very different characteristics, size, and proportion on its body. This could affect data prediction so we may have to separate each species modeling, we will decide later.

Modelling Data

Data Inspection and Validation

Before we continue, let’s take more look at the the data.

summary(fish_clean$Weight)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     0.0   120.0   273.0   398.3   650.0  1650.0

There is strange data here which the minimum weight is 0

ggplot(data = fish_clean, (aes(y = Weight))) + 
  geom_boxplot(aes(fill = Weight), width = 1)

We see that there are three data which is classified as outliers based on Boxplot Interquartile Method

We also take a look at weight for each species

ggplot(data = fish_clean, aes(x = Species, y = Weight)) + 
  geom_boxplot(aes(fill = Weight), width = 0.8, outlier.shape = NA) +
  geom_jitter()

Apparently the three outliers from before is the very huge Pike fish if we look at the boxplot of Pike

If we take a look at Smelt Fish, we can see that the fishes is consisted on very light (few grams) population. Which means the fish can’t grow big (maximum size is small)

Data Cleansing

Check Missing Value

colSums(is.na(fish_clean))

#>         Species          Weight          Height           Width Vertical.Length 
#>               0               0               0               0               0 
#> Diagonal.Length    Cross.Length 
#>               0               0

We wanted to see which row contains Weight==0

fish_clean %>% filter(Weight == 0)

Based on data above, it is impossible that kind of fish has 0 gram of weight. It must be failed input of the data.

fish_clean <- subset(fish_clean, Weight != 0)

Since Smelt Fish is so small and consist of very small weight, it will not going to produce anything significant in our models. Therefore we can remove it from our dataset And also, it is very uncommon when people buy very small fish individually and check the size from each of the fish.

fish_set <- fish_clean %>% filter(Species != "Smelt") %>% droplevels()

Correlation Data

To see correlation between predictor, we use ggcorr() function to see the easy graph for it.

ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

Based on diagram above. Vertical.Lenth, Diagonal.Length and Cross.Length is correlated very high and it is sign of linearity. Therefore we exclude two of those column from the dataset. It is possible because fish are usually in the shape of a long rectangular shape, therefore between the cross length and vertical length and diagonal length is very highly correlated.

fish_set <- fish_set %>% select(-c(Cross.Length,Vertical.Length))

ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

The correlation is still very high but we will go on using all predictors for one model and only just one predictor for one model.

Linear Modelling

All kind Fish included with Every Variables

We are using lm method with (Weight ~ .) means that we are using all predictors from data = fish_set

model_fish_all <- lm(Weight~., data = fish_set)
summary(model_fish_all)

#> 
#> Call:
#> lm(formula = Weight ~ ., data = fish_set)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -224.38  -58.86  -11.00   29.49  407.16 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)      -764.1713    87.5509  -8.728  0.00000000000000861 ***
#> SpeciesParkki      73.2008    49.5211   1.478               0.1417    
#> SpeciesPerch       28.7326    84.7023   0.339               0.7350    
#> SpeciesPike      -291.7249   128.4700  -2.271               0.0247 *  
#> SpeciesRoach       24.0990    79.4064   0.303               0.7620    
#> SpeciesWhitefish   28.9762    81.1718   0.357               0.7217    
#> Height              9.4507    13.5908   0.695               0.4880    
#> Width              -0.1356    25.0075  -0.005               0.9957    
#> Diagonal.Length    37.4297     3.8903   9.621 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 98.3 on 135 degrees of freedom
#> Multiple R-squared:  0.9265, Adjusted R-squared:  0.9222 
#> F-statistic: 212.8 on 8 and 135 DF,  p-value: < 0.00000000000000022

Based on model above, we can see that the Species Pike has significantly more higher P Value than other species. The most significant variables above is Intercept and Diagonal.Length.

We will see initial performance of this model with Data Visualization

ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) + 
  geom_point() +
  stat_summary(fun.data= model_fish_all) + 
  geom_smooth(method='lm') +
  scale_y_continuous(limits = c(5, 60))

Based on the plot above, we can see that our model is still underfit (don’t bother to measure the performance, it is very bad based on the plot alone)

Model Visualization with Linear Regression for Each Species, but only measuring With Diagonal.Length

ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) + 
  geom_point(aes(color=Species), size = 1) +
  geom_smooth(method=lm,se=FALSE,fullrange=TRUE,
                  aes(color=Species), size = 0.5) +
  scale_y_continuous(limits = c(5, 60))

We can get some insight about the data from plot above :

There are two group of fish. One : Roach and Parkki. Two : Pike, Whitefish, Bream, and Perch.
The Linear Model we created is forced to accomodate the data of Roach and Parkki which the slope is significantly different with other fishes.

Let see the distribution for each species

ggpairs(data = fish_set, 
        columns = c(1,5),
        diag = list(discrete="barDiag",
                    continuous = wrap("densityDiag",
                                      alpha=0.5 )),
        mapping = aes(color=Species))

To make the models more accurate, we can treat these two group as different population. Based on the data, it looks like Roach and Parkki belongs to small-fish group that is not sold individually.

Therefore, we make the dataset into two group.

small_set <- fish_set %>% filter(Species %in% c("Roach","Parkki"))
big_set <- fish_set %>% filter(Species %in% c("Pike", "Whitefish", "Bream", "Perch"))

Based on the Correlation Value, it seems that every variables are highly correlated with each other. We will make two models, with all predictor and with one predictor, we choose Diagonal.Length for model with one predictor.

#all predictor
model_small_fish <- lm(Weight~., data = small_set)
model_big_fish<- lm(Weight~., data = big_set)

#one predictor
model_small_fish_1 <- lm(Weight~Diagonal.Length, data = small_set)
model_big_fish_1 <- lm(Weight~Diagonal.Length, data = big_set)

Model Evaluation

Create Prediction Value

We are using predict() to set value based on the linear model that we created. Therefore it can be compared between the values model predicted and the real values from the dataset.

fish_set$Weight.Predict <- predict(object = model_fish_all, newdata = fish_set)

small_set$Weight.Predict <- predict(object = model_small_fish, newdata = small_set)
small_set$Weight.Predict.1 <- predict(object = model_small_fish_1, newdata = small_set)

big_set$Weight.Predict <- predict(object = model_big_fish, newdata = big_set)
big_set$Weight.Predict.1 <- predict(object = model_big_fish_1, newdata = big_set)

Metric Measurement

We test the model using two kind of Metric Measurement to see the performance of the model.

Adjusted R-squared

Goodnes of Fit based on Adjusted R Squared Value

#all species
summary(model_fish_all)$adj.r.squared #all predictor

#> [1] 0.9221694

#small fish
summary(model_small_fish)$adj.r.squared #all predictor

#> [1] 0.9504212

summary(model_small_fish_1)$adj.r.squared #all predictor

#> [1] 0.8656531

#big fish
summary(model_big_fish)$adj.r.squared #all predictor

#> [1] 0.9178336

summary(model_big_fish_1)$adj.r.squared #all predictor

#> [1] 0.8099166

Based on the value above :

The performance of the model will be better if it’s using all predictors instead of one.
Model for small fish can predict better, because the range of value for target variable Weight is also significantly smaller.
Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)

RMSE

Root Mean Squared Error is a value obtained by square root the error value, it can be directly interpreted with the range of data (in this example, weight).

library(MLmetrics)

#all species
RMSE(y_pred = fish_set$Weight.Predict, y_true = fish_set$Weight)

#> [1] 95.18191

range(fish_set$Weight)

#> [1]    5.9 1650.0

#small fish
RMSE(y_pred = small_set$Weight.Predict, y_true = small_set$Weight) #all predictor

#> [1] 16.35706

RMSE(y_pred = small_set$Weight.Predict.1, y_true = small_set$Weight) #1 predictor

#> [1] 28.49577

range(small_set$Weight)

#> [1]  40 390

#big fish
RMSE(y_pred = big_set$Weight.Predict, y_true = big_set$Weight) #all predictor

#> [1] 99.76553

RMSE(y_pred = big_set$Weight.Predict.1, y_true = big_set$Weight) #1 predictor

#> [1] 155.2467

range(big_set$Weight)

#> [1]    5.9 1650.0

Example for Interpretation :

On Small Fish Model with All Predictor, model can predict the value with 16.35 range of error with 95% confidence level. The range of value on weight is 40.390.

Based on the value above :

The performance of the model will be better if it’s using all predictors instead of one.
Model for small fish can predict better, because the range of value for target variable Weight is also significantly smaller.
Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)
It is align with the result from Adjusted R Square

Statistics Assumption Test

We apply several test to take a look at the behavior of the data based on the assumptions that are set when creating linear model.

Variable Linearity

cor.test(fish_set$Width, fish_set$Diagonal.Length)

#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Width and fish_set$Diagonal.Length
#> t = 17.828, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7729129 0.8758446
#> sample estimates:
#>       cor 
#> 0.8313777

cor.test(fish_set$Width, fish_set$Height)

#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Width and fish_set$Height
#> t = 12.195, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6247335 0.7867517
#> sample estimates:
#>       cor 
#> 0.7152201

cor.test(fish_set$Height, fish_set$Diagonal.Length)

#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fish_set$Height and fish_set$Diagonal.Length
#> t = 7.3805, df = 142, p-value = 0.00000000001214
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.3971759 0.6353929
#> sample estimates:
#>       cor 
#> 0.5265432

Based on the value above, there’s linearity because p-value < alpha Therefore, only use one predictor is better for making this model.

Normality of Residuals

We only take three of the best models for this analysis. Model 1 : All kind of Fish, and All Predictors Model 2 : Small Fish, and All Predictors Model 3 : Small Fish, One Predictor

# histogram residual
hist(model_fish_all$residuals)

# histogram residual
hist(model_small_fish$residuals)

# histogram residual
hist(model_small_fish_1$residuals)

Shapiro Test

# shapiro test dari residual
shapiro.test(model_fish_all$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_fish_all$residuals
#> W = 0.93472, p-value = 0.000003278

# shapiro test dari residual
shapiro.test(model_small_fish$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_small_fish$residuals
#> W = 0.89833, p-value = 0.007645

# shapiro test dari residual
shapiro.test(model_small_fish_1$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_small_fish_1$residuals
#> W = 0.95649, p-value = 0.2513

alpha = 0.05

Based on the value above, only model_small_fish_1 that accept assumption Normality of Residuals with 95% Confidence Level.

The data is not distributed normally on the rest of the model.

Homoscedasticity of Residuals

# bptest dari model
library(lmtest)
bptest(model_small_fish_1)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_small_fish_1
#> BP = 2.3854, df = 1, p-value = 0.1225