Predicting Fish Weight: Linear Regression Method
Guess the Fish Weight: Predicting the Weight of a Fish with the Linear Regression Method
Plenty of fish! Fish is one of the most common foods loved by humans. Still, overfishing tends to happen, which pushes regulators to rule that some fish cannot be caught if they have not yet reached a minimum weight.
We have a dataset of fish species with several measurements per fish: three length measurements (Length1, Length2, Length3), height, and width. We will try to predict the weight of a fish based on these characteristics, using the Linear Regression Method to see whether the weight of a fish is related to them.
Introduction
With the Linear Regression Method, we will use the fish dataset to identify the relationship between Weight and the other numerical variables. We will also see whether the weight of a fish can be predicted from historical data.
Attaching Necessary Library
Load the packages required for the analysis.
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(ggplot2)
library(performance)
library(MLmetrics)
library(rmdformats)
## Warning: package 'rmdformats' was built under R version 4.0.3
Data Import
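The import code is not echoed in the original output, so below is a minimal sketch of how the data could be read with readr; the file name fish.csv and the object name fish are assumptions.
# Import the fish dataset (the file name/path is an assumption)
fish <- read_csv("fish.csv")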
## Parsed with column specification:
## cols(
## Species = col_character(),
## Weight = col_double(),
## Length1 = col_double(),
## Length2 = col_double(),
## Length3 = col_double(),
## Height = col_double(),
## Width = col_double()
## )
Data Wrangling
We will exclude the Species variable, since for now we only need the numerical variables.
After removing the unnecessary variable, we will check the data types of the remaining variables and look for missing values.
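A minimal sketch of this wrangling step, assuming the imported data frame is named fish:
# Keep only the numerical variables by dropping Species
fish <- fish %>% select(-Species)
# Inspect data types and check for missing values
glimpse(fish)
anyNA(fish)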
## Rows: 159
## Columns: 6
## $ Weight <dbl> 242, 290, 340, 363, 430, 450, 500, 390, 450, 500, 475, 500,...
## $ Length1 <dbl> 23.2, 24.0, 23.9, 26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5,...
## $ Length2 <dbl> 25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7,...
## $ Length3 <dbl> 30.0, 31.2, 31.1, 33.5, 34.0, 34.7, 34.5, 35.0, 35.1, 36.2,...
## $ Height <dbl> 11.5200, 12.4800, 12.3778, 12.7300, 12.4440, 13.6024, 14.17...
## $ Width <dbl> 4.0200, 4.3056, 4.6961, 4.4555, 5.1340, 4.9274, 5.2785, 4.6...
## [1] FALSE
After confirming that the data types are correct and there are zero missing values, we continue to the Exploratory Data Analysis.
Exploratory Data Analysis
In this section, we will check whether there is any correlation between the variables.
We will use the Pearson correlation with the ggcorr function from the GGally library.
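A minimal sketch of that call, assuming the cleaned data frame is named fish:
# Pearson correlation heatmap of the numerical variables
ggcorr(fish, label = TRUE, label_round = 2)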
The graphic above shows that each of the numerical variables has a strong correlation with Weight, our target variable.
Cross Validation
Before creating the model, we need to split our dataset into Train and Test datasets so we can evaluate the model fairly. We will use the Train dataset to train our model, while the Test dataset will be used to check whether our model can predict new data it has not seen before.
We will split the Train and Test datasets with a 70:30 ratio.
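The split code is not shown here, so below is a sketch assuming the same pattern (and seed) used later for the revised data:
# Reproducible 70:30 train-test split (the seed value 100 is an assumption)
set.seed(100)
index <- sample(nrow(fish), nrow(fish) * 0.7)
fish_train <- fish[index, ]
fish_test <- fish[-index, ]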
Modelling
In this phase, we will create a model with the Linear Regression Method, using Weight as our target variable and Width as our predictor variable.
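A sketch of that model; the object name model_init is an assumption, since the original name is not shown:
# Simple linear regression of Weight on Width
model_init <- lm(formula = Weight ~ Width, data = fish_train)
summary(model_init)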
##
## Call:
## lm(formula = Weight ~ Width, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250.02 -110.64 -37.03 62.77 887.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -423.838 46.690 -9.078 5.41e-15 ***
## Width 184.982 9.954 18.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 171.7 on 109 degrees of freedom
## Multiple R-squared: 0.7601, Adjusted R-squared: 0.7579
## F-statistic: 345.4 on 1 and 109 DF, p-value: < 2.2e-16
From the summary, we can conclude that Width, our predictor variable, has a p-value below 0.05; this means Width has a significant effect on Weight, our target variable.
For a simple interpretation of the coefficient: every 1 unit increase in Width corresponds to a 184.982 unit increase in Weight.
We also need to check the Multiple R-squared of the model: it is around 0.7601, or 76.01%. This means the model we created can explain 76.01% of the variance of our target variable.
Now we will try to add more variables to our model. To do this we will use the Step-Wise Regression Method, which selects the formula for our model based on the lowest AIC.
The lower the AIC, the less information the model loses when describing the data.
First we will create two different models: model_null, a model without any predictor variables, and model_all, a model with all variables included.
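A sketch of those two models, with the summary of the intercept-only model printed first:
# Intercept-only model and full model with all predictors
model_null <- lm(formula = Weight ~ 1, data = fish_train)
model_all <- lm(formula = Weight ~ ., data = fish_train)
summary(model_null)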
##
## Call:
## lm(formula = Weight ~ 1, data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389.2 -269.2 -119.2 260.8 1210.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.24 33.13 11.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349 on 110 degrees of freedom
Based on model_null, we can see that without any predictors the model simply predicts the mean value of the target variable, 389.24.
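The full model is then summarized with the assumed call below:
summary(model_all)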
##
## Call:
## lm(formula = Weight ~ ., data = fish_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251.54 -62.93 -20.95 44.49 437.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -487.479 35.933 -13.566 < 2e-16 ***
## Length1 49.752 47.601 1.045 0.298327
## Length2 16.237 50.058 0.324 0.746314
## Length3 -38.222 20.118 -1.900 0.060196 .
## Height 34.218 9.926 3.447 0.000815 ***
## Width 1.121 23.284 0.048 0.961689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.3 on 105 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8753
## F-statistic: 155.4 on 5 and 105 DF, p-value: < 2.2e-16
For model_all, we can conclude that when all variables are put into the model, the only variable with a p-value < 0.05 is Height.
Now we will use the Step-Wise Regression Method to choose the best model to analyze.
# Create formula using Step-Wise Regression
step(object = model_all, direction = "backward", trace = F)
##
## Call:
## lm(formula = Weight ~ Length1 + Length3 + Height, data = fish_train)
##
## Coefficients:
## (Intercept) Length1 Length3 Height
## -484.70 65.63 -36.94 34.97
step(object = model_null, direction = "forward", scope = list(lower = model_null, upper = model_all), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Width Height Length1
## -486.255 -35.401 3.284 33.875 63.677
step(object = model_null, direction = "both", scope = list(lower = model_null, upper = model_all), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Height Length1
## -484.70 -36.94 34.97 65.63
# Create the model
model_step_back <- lm(formula = Weight ~ Length1 + Length3 + Height, data = fish_train)
model_step_forw <- lm(formula = Weight ~ Length3 + Width + Height + Length1, data = fish_train)
model_step_both <- lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
# Compare these models with `model_all`
compare_performance(model_all, model_step_back, model_step_forw, model_step_both)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ---------------------------------------------------------------------------------
## model_all | lm | 1391.59 | 1410.56 | 0.88 | 0.88 | 119.87 |
## model_step_back | lm | 1387.73 | 1401.28 | 0.88 | 0.88 | 119.95 | 103.80
## model_step_forw | lm | 1389.70 | 1405.96 | 0.88 | 0.88 | 119.93 | 9.97
## model_step_both | lm | 1387.73 | 1401.28 | 0.88 | 0.88 | 119.95 | 103.80
From the comparison of these 4 models, the ones with the lowest AIC are model_step_back and model_step_both, both with an AIC of 1387.73. Both models also have an Adjusted R-squared of around 0.88, or 88%, which means each of them can explain 88% of the variance of our target variable.
It does not matter which of these 2 models we choose, as they produce the same results; we will use model_step_both for further analysis.
Evaluation
In this step, we will check our model performance.
fish_pred <- predict(object = model_step_both, newdata = fish_test, level = 0.95)
# RMSE of train dataset
RMSE(y_pred = model_step_both$fitted.values, y_true = fish_train$Weight)
## [1] 119.9461
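The test-set error reported next is assumed to come from the equivalent call on the test predictions:
# RMSE of test dataset
RMSE(y_pred = fish_pred, y_true = fish_test$Weight)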
## [1] 124.8925
We can see from the RMSE that the model trained on the train dataset is good enough to predict the test dataset: the error on the train dataset is only slightly lower than the error on the test dataset.
Assumption Checking
In the Linear Regression Method, there are a few classical assumptions that need to be fulfilled. If the model fails to satisfy these assumptions, its results may be inaccurate or misleading.
In this section we will check several assumptions:
- Normality
- Homoscedasticity
- Multicollinearity
Normality Test
The normality test is conducted to see whether our residuals fulfill the assumption of a normal distribution. To do this, we will run the Shapiro-Wilk Normality Test using the shapiro.test() function.
Hypothesis:
H0: The residuals follow a normal distribution
H1: The residuals do not follow a normal distribution
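A minimal sketch of the test call on the residuals of our chosen model:
# Shapiro-Wilk normality test on the residuals
shapiro.test(model_step_both$residuals)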
##
## Shapiro-Wilk normality test
##
## data: model_step_both$residuals
## W = 0.95116, p-value = 0.0004733
From the result above, our p-value is < 0.05, thus we reject the null hypothesis: our residuals do not follow a normal distribution.
We can also inspect the residuals' distribution by creating a histogram.
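A minimal sketch of such a histogram:
# Histogram of the residuals to inspect their distribution visually
hist(model_step_both$residuals, breaks = 20)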
Homoscedasticity Test
We conduct the Homoscedasticity test to see whether the variance of our residuals is distributed evenly and does not form any pattern.
To do this, we will conduct the Breusch-Pagan test using the bptest() function from the lmtest package.
Hypothesis:
H0: The error variance is distributed evenly (Homoscedasticity)
H1: The error variance forms a pattern (Heteroscedasticity)
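A minimal sketch of the test call:
# Breusch-Pagan test for heteroscedasticity
bptest(model_step_both)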
##
## studentized Breusch-Pagan test
##
## data: model_step_both
## BP = 40.342, df = 3, p-value = 9.017e-09
We can see from the result above that our p-value is < 0.05, thus we reject the null hypothesis: our residuals form a pattern (heteroscedasticity), so we fail to satisfy the homoscedasticity assumption.
We can also visualize the result with a scatterplot of the prediction values (fitted values) against the residuals.
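A minimal sketch of that plot:
# Fitted values vs residuals; a visible pattern indicates heteroscedasticity
plot(model_step_both$fitted.values, model_step_both$residuals)
abline(h = 0, col = "red")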
Multicollinearity Test
We conduct the Multicollinearity test to see whether there is any strong relationship between the predictor variables. The existence of multicollinearity can be seen from the VIF (Variance Inflation Factor) value, which shows how much the coefficient variance is inflated due to multicollinearity.
When a VIF value is greater than 10, it indicates multicollinearity in our data.
We conduct the Multicollinearity test using the vif() function from the car package.
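A minimal sketch of the call:
# Variance Inflation Factor of each predictor in the chosen model
vif(model_step_both)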
## Length3 Height Length1
## 207.26264 5.65118 171.31491
From the result above, we can see that the Length3 and Length1 variables have VIF values > 10, which means we have multicollinearity in our data.
From the tests we have conducted, none of these assumptions are fulfilled by our data. In the next step, we will tune the data so that the required assumptions are satisfied and the model does not mislead us when predicting the test data.
Data & Model Tuning
As we can see, our model does not fulfill all the required assumptions and may therefore give misleading results. To fulfill the assumptions, we will tune our model first.
With ggcorr we will look at the correlation between the predictor variables.
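A minimal sketch of that call, assuming the data frame is named fish:
# Correlation between predictor variables only (target excluded)
ggcorr(fish %>% select(-Weight), label = TRUE, label_round = 2)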
##
## Call:
## lm(formula = Weight ~ Length3 + Height + Length1, data = fish_train)
##
## Coefficients:
## (Intercept) Length3 Height Length1
## -484.70 -36.94 34.97 65.63
From the graphic above, we can see that the Length1 and Length2 variables are highly correlated with the Length3 variable. We will remove these variables from our dataset and recreate our model from the new dataset.
Next, we will check whether there are any outliers in our fish data.
From the boxplot chart, we can see that we have a few outliers in our data. We will remove the outliers from our fish data and store the result in an object called fish_revised, as sketched below.
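The exact filtering rule is not shown, so the sketch below assumes a standard 1.5 * IQR rule on Weight, applied after dropping the redundant Length1 and Length2 variables mentioned above:
# Drop Length1 and Length2, then remove Weight outliers (the 1.5 * IQR rule is an assumption)
q <- quantile(fish$Weight, probs = c(0.25, 0.75))
iqr_w <- q[2] - q[1]
fish_revised <- fish %>%
  select(-Length1, -Length2) %>%
  filter(Weight >= q[1] - 1.5 * iqr_w, Weight <= q[2] + 1.5 * iqr_w)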
We will divide our new data into train and test datasets based on the fish_revised data.
set.seed(100)
index_rev <- sample(nrow(fish_revised), nrow(fish_revised)*0.7)
fish_train_rev <- fish_revised[index_rev,]
fish_test_rev <- fish_revised[-index_rev,]
Next, we will create models based on the fish_train_rev data. We will create a model with Width as the predictor variable, stored in model_init_rev. The other models are a model without predictor variables, stored in model_null_rev, and a model with all predictor variables included, stored in model_all_rev.
Again, these 3 models will be based on fish_train_rev and will be different from our previous models.
model_init_rev <- lm(formula = Weight ~ Width, data = fish_train_rev)
model_null_rev <- lm(formula = Weight ~ 1, data = fish_train_rev)
model_all_rev <- lm(formula = Weight ~ ., data = fish_train_rev)
compare_performance(model_init_rev, model_null_rev, model_all_rev)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ----------------------------------------------------------------------------------
## model_init_rev | lm | 1380.90 | 1388.98 | 0.83 | 0.83 | 132.68 |
## model_null_rev | lm | 1572.10 | 1577.48 | 0.00 | 0.00 | 321.87 | 0.00
## model_all_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1.77e+06
Here is the comparison between our newly created models. We can see that, based on AIC and Adjusted R-squared, model_all_rev is better than the other two models.
Next, we will create more models using the Step-Wise Regression Method and compare them with model_all_rev.
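The backward-elimination call producing the first result below is assumed to mirror the earlier section:
step(object = model_all_rev, direction = "backward", trace = F)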
##
## Call:
## lm(formula = Weight ~ Length3 + Height + Width, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Height Width
## -438.79 13.96 10.54 68.41
step(object = model_null_rev, direction = "forward", scope = list(lower = model_null_rev, upper = model_all_rev), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Width Height
## -438.79 13.96 68.41 10.54
step(object = model_null_rev, direction = "both", scope = list(lower = model_null_rev, upper = model_all_rev), trace = F)##
## Call:
## lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
##
## Coefficients:
## (Intercept) Length3 Width Height
## -438.79 13.96 68.41 10.54
model_step_back_rev <- lm(formula = Weight ~ Length3 + Height + Width, data = fish_train_rev)
model_step_forw_rev <- lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
model_step_both_rev <- lm(formula = Weight ~ Length3 + Width + Height, data = fish_train_rev)
compare_performance(model_step_back_rev, model_step_both_rev, model_step_forw_rev, model_all_rev)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## ---------------------------------------------------------------------------------
## model_step_back_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 |
## model_step_both_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
## model_step_forw_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
## model_all_rev | lm | 1346.75 | 1360.21 | 0.88 | 0.88 | 111.38 | 1
All the Adjusted R-squared values are around 0.88, or 88%. This means that each of these models can explain 88% of the variance of our target variable.
Based on the comparison above, all of these models have the same AIC and Adjusted R-squared values, so it is not an issue to choose any of them; we will use model_step_both_rev for further analysis.
fish_pred_rev <- predict(object = model_step_both_rev, newdata = fish_test_rev, level = 0.95)
# RMSE of train_rev dataset
RMSE(y_pred = model_step_both_rev$fitted.values, y_true = fish_train_rev$Weight)
## [1] 111.381
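The test-set error reported next is assumed to come from the equivalent call on the revised test predictions:
# RMSE of test_rev dataset
RMSE(y_pred = fish_pred_rev, y_true = fish_test_rev$Weight)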
## [1] 94.73077
Although the error on the train data is higher than on the test data, the difference is not that large.
We can conclude that our model is good enough to predict the test dataset.
Assumption Checking: Based on Tuned Data
Normality Test
Hypothesis:
H0: The residuals follow a normal distribution
H1: The residuals do not follow a normal distribution
##
## Shapiro-Wilk normality test
##
## data: model_step_both_rev$residuals
## W = 0.93915, p-value = 8.548e-05
From the result above, we can see that our p-value is < 0.05, thus we reject the null hypothesis: our residuals still do not follow a normal distribution.
We can also inspect the residuals' distribution by creating a histogram.
Homoscedasticity Test
Hypothesis:
H0: The error variance is distributed evenly (Homoscedasticity)
H1: The error variance forms a pattern (Heteroscedasticity)
##
## studentized Breusch-Pagan test
##
## data: model_step_both_rev
## BP = 6.7147, df = 3, p-value = 0.08157
We can see from the result above that our p-value is > 0.05, thus we fail to reject the null hypothesis: our residuals do not form a pattern, so we fulfill the homoscedasticity assumption.
Conclusions
Based on the analysis, we can conclude that the predictor variables Length3, Width, and Height have a significant effect on `Weight`, our target variable. Our model is also able to fulfill two out of three classical assumptions. The Adjusted R-squared of the model is 0.876, meaning the predictors can explain 87.6% of the variance of fish weight. The accuracy of our model in predicting the weight of a fish can be measured with the RMSE: the training dataset has an RMSE of around 111.381, while the test dataset has an RMSE of around 94.731. Although the RMSE of the train data is higher than that of the test data, the difference is not that large, and we can suggest that our model is good enough to predict the test dataset.
Future Improvement
While our model is able to fulfill two out of three classical assumptions, it still fails the normality assumption. To fulfill this assumption, we could tune the model further, for example by log-transforming the target variable in the training data.
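A minimal sketch of that idea, refitting the chosen formula on a log-transformed target; note that the log requires strictly positive weights and that predictions must be back-transformed with exp():
# Log-transform the target (requires Weight > 0)
fish_train_log <- fish_train_rev %>% filter(Weight > 0)
model_log <- lm(formula = log(Weight) ~ Length3 + Width + Height, data = fish_train_log)
# Re-check the normality assumption on the residuals of the log model
shapiro.test(model_log$residuals)
# Back-transform predictions to the original scale
fish_pred_log <- exp(predict(model_log, newdata = fish_test_rev))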