# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)This article is made for completing the assignment for Algoritma : Machine Learning Course in Linear Regression Material Theia Batch 2022.
We are going to use fish measurement data in one fish market to learn about linear regression techniques to predict weight of fish based on variables available in the dataset. In real life application, the weight of the fish is correlated with the price of the fish, so the target variable on this article would be weight.
In this dataset there are few assumptions :
options(scipen = 99) #non active science annotation (example : e-1)
library(tidyverse) #load several important libraries
library(dplyr) #grammar of data manipulation
library(ggplot2) #plotting data visualization
library(scales) #edit scales
library(GGally) #plotting data visualization#load dataset
fish <- read.csv("data-input/fish.csv")
glimpse(fish)#> Rows: 159
#> Columns: 7
#> $ Species <chr> "Bream", "Bream", "Bream", "Bream", "Bream", "Bream", "Bream",…
#> $ Weight <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500, 50…
#> $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28…
#> $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31…
#> $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2, 36…
#> $ Height <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.1795,…
#> $ Width <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6900…
Based on the Dataset Information, Length 1 means vertical length in centimeters (cm), Length 2 means diagonal length in cm, Length 3 means cross length in cm. We will change the column name based on this information to make easier data wrangling.
fish_clean <-
fish %>%
mutate(Species = as.factor(Species),
Vertical.Length = Length1,
Diagonal.Length = Length2,
Cross.Length = Length3) %>%
select(-c(Length1, Length2, Length3))
glimpse(fish_clean)#> Rows: 159
#> Columns: 7
#> $ Species <fct> Bream, Bream, Bream, Bream, Bream, Bream, Bream, Bream…
#> $ Weight <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475,…
#> $ Height <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, …
#> $ Width <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785…
#> $ Vertical.Length <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, …
#> $ Diagonal.Length <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, …
#> $ Cross.Length <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, …
Species = Species of Fish sold in the MarketWeight = Weight of Fish sold in gramHeight = Height of Fish in centimeterWidth = Width of Fish in centimeterVertical.Length = Vertical Length of Fish in
centimeterDiagonal.Length = Diagonal Length of Fish in
centimeterCross.Length = Cross Length of Fish in centimeterWhen looking at fish, there are two things most of us look at before deciding to buy. First we take a look at the species, and then we take a look at the weight. After that, we look at the size and later decide if the size and the weight make sense and reasonable. Some fish could have meat density different from others because of many factors and it also has an effect on the nutrition contained in the meat. This is why Tuna will cost much higher compared to common Bass.
Bream
Parkki
Perch
Pike
Roach
Smelt
Whitefish
After simple research for the fish species, I took several images that display the average size of each species. Assuming the hand size is the same, we can see that each fish has very different characteristics, size, and proportion on its body. This could affect data prediction so we may have to separate each species modeling, we will decide later.
Before we continue, let’s take more look at the the data.
summary(fish_clean$Weight)#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 120.0 273.0 398.3 650.0 1650.0
There is strange data here which the minimum weight is 0
ggplot(data = fish_clean, (aes(y = Weight))) +
geom_boxplot(aes(fill = Weight), width = 1)
We see that there are three data which is classified as
outliers based on Boxplot Interquartile Method
We also take a look at weight for each species
ggplot(data = fish_clean, aes(x = Species, y = Weight)) +
geom_boxplot(aes(fill = Weight), width = 0.8, outlier.shape = NA) +
geom_jitter()
Apparently the three outliers from before is the very huge Pike fish if
we look at the boxplot of Pike
If we take a look at Smelt Fish, we can see that the fishes is consisted on very light (few grams) population. Which means the fish can’t grow big (maximum size is small)
Check Missing Value
colSums(is.na(fish_clean))#> Species Weight Height Width Vertical.Length
#> 0 0 0 0 0
#> Diagonal.Length Cross.Length
#> 0 0
We wanted to see which row contains Weight==0
fish_clean %>% filter(Weight == 0)Based on data above, it is impossible that kind of fish has 0 gram of weight. It must be failed input of the data.
fish_clean <- subset(fish_clean, Weight != 0)Since Smelt Fish is so small and consist of very small weight, it will not going to produce anything significant in our models. Therefore we can remove it from our dataset And also, it is very uncommon when people buy very small fish individually and check the size from each of the fish.
fish_set <- fish_clean %>% filter(Species != "Smelt") %>% droplevels()To see correlation between predictor, we use ggcorr()
function to see the easy graph for it.
ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)
Based on diagram above. Vertical.Lenth, Diagonal.Length and Cross.Length
is correlated very high and it is sign of linearity. Therefore we
exclude two of those column from the dataset. It is possible because
fish are usually in the shape of a long rectangular shape, therefore
between the cross length and vertical length and diagonal length is very
highly correlated.
fish_set <- fish_set %>% select(-c(Cross.Length,Vertical.Length))ggcorr(fish_set, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)The correlation is still very high but we will go on using all predictors for one model and only just one predictor for one model.
All kind Fish included with Every Variables
We are using lm method with (Weight ~ .) means that we
are using all predictors from data = fish_set
model_fish_all <- lm(Weight~., data = fish_set)
summary(model_fish_all)#>
#> Call:
#> lm(formula = Weight ~ ., data = fish_set)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -224.38 -58.86 -11.00 29.49 407.16
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -764.1713 87.5509 -8.728 0.00000000000000861 ***
#> SpeciesParkki 73.2008 49.5211 1.478 0.1417
#> SpeciesPerch 28.7326 84.7023 0.339 0.7350
#> SpeciesPike -291.7249 128.4700 -2.271 0.0247 *
#> SpeciesRoach 24.0990 79.4064 0.303 0.7620
#> SpeciesWhitefish 28.9762 81.1718 0.357 0.7217
#> Height 9.4507 13.5908 0.695 0.4880
#> Width -0.1356 25.0075 -0.005 0.9957
#> Diagonal.Length 37.4297 3.8903 9.621 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 98.3 on 135 degrees of freedom
#> Multiple R-squared: 0.9265, Adjusted R-squared: 0.9222
#> F-statistic: 212.8 on 8 and 135 DF, p-value: < 0.00000000000000022
Based on model above, we can see that the Species Pike has
significantly more higher P Value than other species. The most
significant variables above is Intercept and
Diagonal.Length.
We will see initial performance of this model with Data Visualization
ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) +
geom_point() +
stat_summary(fun.data= model_fish_all) +
geom_smooth(method='lm') +
scale_y_continuous(limits = c(5, 60))Based on the plot above, we can see that our model is still underfit (don’t bother to measure the performance, it is very bad based on the plot alone)
Model Visualization with Linear Regression for Each
Species, but only measuring With
Diagonal.Length
ggplot(data = fish_set, mapping = aes(x = Weight, y = Diagonal.Length)) +
geom_point(aes(color=Species), size = 1) +
geom_smooth(method=lm,se=FALSE,fullrange=TRUE,
aes(color=Species), size = 0.5) +
scale_y_continuous(limits = c(5, 60))We can get some insight about the data from plot above :
Let see the distribution for each species
ggpairs(data = fish_set,
columns = c(1,5),
diag = list(discrete="barDiag",
continuous = wrap("densityDiag",
alpha=0.5 )),
mapping = aes(color=Species))
To make the models more accurate, we can treat these two group as
different population. Based on the data, it looks like Roach and Parkki
belongs to small-fish group that is not sold individually.
Therefore, we make the dataset into two group.
small_set <- fish_set %>% filter(Species %in% c("Roach","Parkki"))
big_set <- fish_set %>% filter(Species %in% c("Pike", "Whitefish", "Bream", "Perch"))Based on the Correlation Value, it seems that every variables are
highly correlated with each other. We will make two models, with all
predictor and with one predictor, we choose Diagonal.Length
for model with one predictor.
#all predictor
model_small_fish <- lm(Weight~., data = small_set)
model_big_fish<- lm(Weight~., data = big_set)
#one predictor
model_small_fish_1 <- lm(Weight~Diagonal.Length, data = small_set)
model_big_fish_1 <- lm(Weight~Diagonal.Length, data = big_set)We are using predict() to set value based on the linear
model that we created. Therefore it can be compared between the values
model predicted and the real values from the dataset.
fish_set$Weight.Predict <- predict(object = model_fish_all, newdata = fish_set)
small_set$Weight.Predict <- predict(object = model_small_fish, newdata = small_set)
small_set$Weight.Predict.1 <- predict(object = model_small_fish_1, newdata = small_set)
big_set$Weight.Predict <- predict(object = model_big_fish, newdata = big_set)
big_set$Weight.Predict.1 <- predict(object = model_big_fish_1, newdata = big_set)We test the model using two kind of Metric Measurement to see the performance of the model.
Goodnes of Fit based on Adjusted R Squared Value
#all species
summary(model_fish_all)$adj.r.squared #all predictor#> [1] 0.9221694
#small fish
summary(model_small_fish)$adj.r.squared #all predictor#> [1] 0.9504212
summary(model_small_fish_1)$adj.r.squared #all predictor#> [1] 0.8656531
#big fish
summary(model_big_fish)$adj.r.squared #all predictor#> [1] 0.9178336
summary(model_big_fish_1)$adj.r.squared #all predictor#> [1] 0.8099166
Based on the value above :
The performance of the model will be better if it’s using all predictors instead of one.
Model for small fish can predict better, because the range of
value for target variable Weight is also significantly
smaller.
Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)
Root Mean Squared Error is a value obtained by square root the error value, it can be directly interpreted with the range of data (in this example, weight).
library(MLmetrics)
#all species
RMSE(y_pred = fish_set$Weight.Predict, y_true = fish_set$Weight)#> [1] 95.18191
range(fish_set$Weight)#> [1] 5.9 1650.0
#small fish
RMSE(y_pred = small_set$Weight.Predict, y_true = small_set$Weight) #all predictor#> [1] 16.35706
RMSE(y_pred = small_set$Weight.Predict.1, y_true = small_set$Weight) #1 predictor#> [1] 28.49577
range(small_set$Weight)#> [1] 40 390
#big fish
RMSE(y_pred = big_set$Weight.Predict, y_true = big_set$Weight) #all predictor#> [1] 99.76553
RMSE(y_pred = big_set$Weight.Predict.1, y_true = big_set$Weight) #1 predictor#> [1] 155.2467
range(big_set$Weight)#> [1] 5.9 1650.0
Example for Interpretation :
Based on the value above :
The performance of the model will be better if it’s using all predictors instead of one.
Model for small fish can predict better, because the range of
value for target variable Weight is also significantly
smaller.
Model for big fish performance is lower than Model for all fish, because it seems the growth of fish correlated with the weight is not in linear function (could be exponential)
It is align with the result from Adjusted R Square
We apply several test to take a look at the behavior of the data based on the assumptions that are set when creating linear model.
cor.test(fish_set$Width, fish_set$Diagonal.Length)#>
#> Pearson's product-moment correlation
#>
#> data: fish_set$Width and fish_set$Diagonal.Length
#> t = 17.828, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.7729129 0.8758446
#> sample estimates:
#> cor
#> 0.8313777
cor.test(fish_set$Width, fish_set$Height)#>
#> Pearson's product-moment correlation
#>
#> data: fish_set$Width and fish_set$Height
#> t = 12.195, df = 142, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.6247335 0.7867517
#> sample estimates:
#> cor
#> 0.7152201
cor.test(fish_set$Height, fish_set$Diagonal.Length)#>
#> Pearson's product-moment correlation
#>
#> data: fish_set$Height and fish_set$Diagonal.Length
#> t = 7.3805, df = 142, p-value = 0.00000000001214
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3971759 0.6353929
#> sample estimates:
#> cor
#> 0.5265432
Based on the value above, there’s linearity because p-value < alpha Therefore, only use one predictor is better for making this model.
We only take three of the best models for this analysis. Model 1 : All kind of Fish, and All Predictors Model 2 : Small Fish, and All Predictors Model 3 : Small Fish, One Predictor
# histogram residual
hist(model_fish_all$residuals)# histogram residual
hist(model_small_fish$residuals)# histogram residual
hist(model_small_fish_1$residuals)# shapiro test dari residual
shapiro.test(model_fish_all$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_fish_all$residuals
#> W = 0.93472, p-value = 0.000003278
# shapiro test dari residual
shapiro.test(model_small_fish$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_small_fish$residuals
#> W = 0.89833, p-value = 0.007645
# shapiro test dari residual
shapiro.test(model_small_fish_1$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_small_fish_1$residuals
#> W = 0.95649, p-value = 0.2513
alpha = 0.05
Based on the value above, only model_small_fish_1 that
accept assumption Normality of Residuals with 95% Confidence Level.
The data is not distributed normally on the rest of the model.
# bptest dari model
library(lmtest)
bptest(model_small_fish_1)#>
#> studentized Breusch-Pagan test
#>
#> data: model_small_fish_1
#> BP = 2.3854, df = 1, p-value = 0.1225
alpha = 0.05
Based on the value above, we can take that the data is spread on pattern (homoscedasticity) with 95% Confidence Level. So the model performance would not be very good.
For this test, we use model_fish_all because it’s using
all predictors.
# vif dari model
library(car)
vif(model_fish_all)#> GVIF Df GVIF^(1/(2*Df))
#> Species 148.43366 5 1.648744
#> Height 41.84501 1 6.468772
#> Width 19.45973 1 4.411319
#> Diagonal.Length 21.69180 1 4.657446
Based on values above, there are multicollinearity on the predictors
used in the model_fish_all. Means that the value of
predictors are very correlated.
summary(model_small_fish)#>
#> Call:
#> lm(formula = Weight ~ ., data = small_set)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -20.261 -12.578 -3.534 6.305 43.497
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -263.558 26.408 -9.980 0.000000000334 ***
#> SpeciesRoach -5.790 27.477 -0.211 0.834812
#> Height 13.047 8.639 1.510 0.143524
#> Width 78.852 20.998 3.755 0.000927 ***
#> Diagonal.Length 2.334 4.277 0.546 0.590135
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 17.92 on 25 degrees of freedom
#> Multiple R-squared: 0.9573, Adjusted R-squared: 0.9504
#> F-statistic: 140 on 4 and 25 DF, p-value: < 0.00000000000000022
In Conclusion :
The best model we can make in this Article is
model_small_fish_1, which is the model that predict small
fish only with one Predictor. This model also passed several evaluation
of model.
The Dataset is not distributed normally, which cause big error in some of model predictions. Therefore it would be better to log transform the data before doing modelling.
The Model applied is not passed several assumptions for linear modelling, for example multicollinearity. therefore it could be better if it is processed by Principal Component Analysis (Advanced Technique).
Number of Data is not sufficient to make better model, especially for the bigger fishes.
The Linear Model would be very good to predict one type of fish, and not the combination of them all.
This is why in real life application, seller usually not measure the fish by its size, rather its weight because it is simply only one metrics available and easy to use.