---
title: "Fish Market Analysis using Linear Regression"
author: "khusnul_2012"
output: html_notebook
---

## Intro

Hello, my name is Nunu. Today we'll practice multiple linear regression using R. The dataset used in this RPubs document was uploaded to Kaggle by *Aung Pyae*. Linear regression is used to explain the linear relationship between one dependent variable and one or more independent variables.

## What are we working with?

We want to understand the relationship between the weight of a fish and the other variables in the dataset. Here are the variables:

Species     : The name of the fish species

Weight      : Weight of the fish (g)

Height      : Height (cm)

Width       : Diagonal width (cm)

Length1     : Vertical length (cm)

Length2     : Diagonal length (cm)

Length3     : Cross length (cm)

Our dependent variable is Weight and the others are our independent variables. An independent variable is something that is not affected by the experiment itself but that can be manipulated to affect the dependent variable (Kaliyadan and Kulkarni, 2019). The final analysis should help fish market management set business strategies for improving fish quality, especially for increasing fish weight.

## Data Preparation 

Load the required package.
```{r}
library(dplyr)
library(GGally)
library(ggplot2)
library(reshape2) 
library(ggeasy) 
library(caret)
library(car)
library(lmtest)
library(MLmetrics)
```
Load the dataset and store it as *fish_market*.
```{r}
fish_market <- read.csv("C:/algoritma_FIX/W4_Regression/RM/LBB_4/Fish.csv")
rmarkdown::paged_table(fish_market)
```
```{r}
str(fish_market)
colSums(is.na(fish_market))
```
The dataset has 159 rows and 7 columns and contains no NA values. The species column was read in as `ï..Species` (most likely a byte-order-mark artifact from `read.csv`), so we rename it to `species` and convert it to a factor.
```{r}
fish_market <- fish_market %>% 
  rename(species = ï..Species) %>% 
  mutate(species = as.factor(species))
```
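As a quick sanity check (just a sketch; it assumes the chunk above ran successfully), we can confirm that the column was renamed and converted:
```{r}
# species should now appear as a factor column
glimpse(fish_market)
# list the distinct species levels
levels(fish_market$species)
```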

## Data Exploration

### Check The Distribution
```{r}
fish_hor <- melt(fish_market %>% select(-Weight), id = "species")
ggplot(fish_hor, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(x = " ", y = "Value") +
  ggtitle("Independent variables") +
  scale_fill_brewer(palette = "Set2") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
```
```{r}
ggplot(fish_market, aes(y = Weight)) +
  geom_boxplot(fill = 4,
               alpha = 0.5,
               color = 1,
               outlier.colour = 1) +
  ggtitle("Dependent variable") +
  labs(x = "Weight", y = "Value") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
```
From the boxplots above, we can see that only one independent variable (Width) looks approximately normally distributed, while the others tend to be right-skewed. Each of the Length variables (Length1, Length2, Length3) has outliers, and so does the dependent variable. Should we remove outliers before regression? Sometimes it's best to keep outliers in the data; if they turn out to be too influential on the regression line, we will have to take further action.
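To make the outlier claim concrete, here is a minimal sketch (not part of the original workflow; it uses the standard 1.5 × IQR boxplot rule and assumes dplyr >= 1.0 for `across()`) that counts how many observations fall outside the whiskers for each numeric column:
```{r}
# count boxplot-style outliers (beyond 1.5 * IQR from the quartiles) per numeric column
fish_market %>%
  summarise(across(where(is.numeric),
                   ~ sum(.x < quantile(.x, 0.25) - 1.5 * IQR(.x) |
                         .x > quantile(.x, 0.75) + 1.5 * IQR(.x))))
```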


### Check The Correlation
```{r}
ggcorr(fish_market %>% select(-c(species)) ,
       label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
```
Looking at the correlation plot, we can see that *Length1*, *Length2*, and *Length3* are all highly positively correlated with each other. Correlation this high can lead to multicollinearity in regression modeling, which is usually avoided by eliminating one of the highly correlated variables. What impact does this strong correlation have on the regression analysis? We'll find out later by keeping all of the variables for now.
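If you want the exact numbers behind the plot, a plain correlation matrix gives the same information (a minimal sketch):
```{r}
# numeric correlation matrix behind the ggcorr plot, rounded for readability
round(cor(fish_market %>% select(-species)), 2)
```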


## Modeling 

### Train-Test Split
```{r}
fish_reg <- fish_market %>% 
  select(-species)
set.seed(123)
samplesize <- round(0.8 * nrow(fish_reg), 0)
index <- sample(seq_len(nrow(fish_reg)), size = samplesize)
data_train <- fish_reg[index, ]
data_test <- fish_reg[-index, ]
```
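Since caret is already loaded, an equivalent split could also be produced with `createDataPartition()`. The sketch below is only for illustration (the objects `idx_caret`, `train_caret`, and `test_caret` are hypothetical and not used later); the base-R split above is what the models actually use:
```{r}
# alternative 80/20 split with caret (numeric outcomes are grouped into quantiles for balanced sampling)
set.seed(123)
idx_caret   <- createDataPartition(fish_reg$Weight, p = 0.8, list = FALSE)
train_caret <- fish_reg[idx_caret, ]
test_caret  <- fish_reg[-idx_caret, ]
c(train = nrow(train_caret), test = nrow(test_caret))
```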

### 1. Linear Regression
```{r}
fish_lm <- lm(Weight ~., data = data_train)
summary(fish_lm)
```
Let's take a look at the *Pr(>|t|)* column. We will use a significance level of 0.05: if *Pr(>|t|)* is below 0.05, we can assume that the variable has a significant effect in the model. The summary of fish_lm shows that only one variable (*Height*) has a significant effect on our model, with every one-cm increase in *Height* contributing an estimated 24.06 g increase in fish Weight.

> Weight = -503.86 + 24.06 (Height)
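The intercept and slope quoted above can be read directly from the fitted model object (a minimal sketch):
```{r}
# all estimated coefficients of fish_lm
coef(fish_lm)
# the Height effect quoted in the equation above (grams per cm)
coef(fish_lm)["Height"]
```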

### 2. Step-Wise Regression

```{r}
fish_lm_step <- step(fish_lm, direction = "backward")
summary(fish_lm_step)
```
This step-wise regression method will produce an optimum formula based on the lowest AIC value. The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. 

We can see that the backward stepwise regression eliminates the *Length2* and *Width* variables to produce the smallest AIC. The selected variables are Length1, Length3, and Height; of these, Length1 and Height have a significant effect in our model *fish_lm_step* (Pr(>|t|) below 0.05). We can also compare the adjusted R-squared values of *fish_lm* and *fish_lm_step*. The first model with the complete set of variables has an adjusted R-squared of 0.8783, i.e. fish_lm explains 87.83% of the variance in fish Weight (the dependent variable), while the stepwise model has an adjusted R-squared of 0.8799. There is no big difference between *fish_lm* and *fish_lm_step*.

> Weight = -500.765 + 49.868 (Length1) - 21.282 (Length3) + 27.898 (Height)
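As a side-by-side check (a minimal sketch), the two models can also be compared on AIC and adjusted R-squared directly; note that `AIC()` and the `extractAIC()` values printed by `step()` differ by an additive constant for models fitted to the same data, but rank models the same way:
```{r}
# AIC comparison of the full and stepwise models
AIC(fish_lm, fish_lm_step)
# adjusted R-squared of both models
c(full = summary(fish_lm)$adj.r.squared,
  stepwise = summary(fish_lm_step)$adj.r.squared)
```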

## Evaluation
Mean absolute percentage error (MAPE) is commonly used for regression and model evaluation. It is the average of the absolute values of the percentage errors, where $A_t$ is the actual value and $F_t$ the forecast; it has the advantage of being dimensionless (Kennedy 2003).
$$ MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| $$

### 1. Linear Regression (fish_lm)
```{r}
lm_pred <- predict(fish_lm, newdata = data_test %>% select(-Weight))

# MAPE of test dataset
MAPE(lm_pred, data_test$Weight)
```
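To connect this number back to the formula above, here is a minimal sketch that computes the same MAPE by hand (it assumes `lm_pred` and `data_test` from the chunk above):
```{r}
# manual MAPE, written directly from the formula; undefined if any actual weight is 0
mean(abs((data_test$Weight - lm_pred) / data_test$Weight))
```
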
### 2. Step-Wise Regression (fish_lm_step)
```{r}
lm_pred_step <- predict(fish_lm_step, newdata = data_test %>% select(-Weight))

# MAPE of test dataset
MAPE(lm_pred_step, data_test$Weight)
```
The stepwise regression model (fish_lm_step) has a slightly smaller MAPE than fish_lm, which indicates that fish_lm_step is marginally better at predicting the test set. Keep in mind that MAPE is expressed as a proportion, so values around 2.2 correspond to an average error of roughly 220% of the actual weight. Next, we have to check the linear regression assumptions for **fish_lm_step**.

## Assumptions
There are four assumptions associated with a linear regression model:

### 1. Linearity
The relationship between the predictors and the response should be linear. In the residuals-vs-fitted plot, the presence of a pattern may indicate a problem with some aspect of the linear model.
```{r}
plot(fish_lm_step,1)
```
The residuals-vs-fitted plot shows a clear curved (roughly quadratic) pattern. This indicates that the linearity assumption is violated.

### 2. Normality of residuals
The normal probability plot of residuals should  follow a straight line.
```{r}
plot(fish_lm_step,2)
```
A QQ plot of the residuals can be used to check the normality assumption. In the QQ plot, not all of the points follow the straight line, so we can suspect that the residuals do not satisfy the normality assumption. To make sure, we can run a formal check using the Shapiro-Wilk test.
```{r}
shapiro.test(fish_lm_step$residuals)
```
The test rejects the hypothesis of normality when the p-value is less than or equal to 0.05. The null hypothesis is that the residuals follow a normal distribution. Here the p-value is below 0.05, so the null hypothesis is rejected: our residuals do not follow a normal distribution.
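As a complementary visual check (just a sketch), a histogram of the residuals makes the departure from normality easy to see:
```{r}
# histogram of the stepwise model's residuals
hist(fish_lm_step$residuals, breaks = 20,
     main = "Residuals of fish_lm_step", xlab = "Residual")
```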

### 3. Heteroscedasticity (Non-constant variance of error terms)
The residuals are assumed to have a constant variance. 
```{r}
plot(fish_lm_step, 3)
```
This scale-location plot shows that the residuals are not spread equally along the range of fitted values, which means our residuals do not have constant variance. To make sure, we can check with the Breusch-Pagan test.
```{r}
bptest(fish_lm_step)
```
The null hypothesis is that there is no heteroscedasticity. Our test shows a p-value < 0.05, so we can conclude that heteroscedasticity is present in our model.

### 4. Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. To check for multicollinearity, we can measure the variance inflation factor (VIF); a common rule of thumb is that there is no serious multicollinearity when VIF < 10.
```{r}
vif(fish_lm_step)
```
When we checked the correlations, we saw that three variables (Length1, Length2, Length3) are highly correlated with each other, and correlation that high can lead to multicollinearity in regression modeling. The stepwise regression eliminated the Length2 variable, but the model still suffers from multicollinearity: the VIFs of Length1 and Length3 are far above 10. So we have to eliminate another Length variable (Length1 or Length3).
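As a quick sketch of that idea (the next section does this properly, together with a sqrt transform), we can refit without Length3 and look at the VIFs again:
```{r}
# VIFs after dropping Length3 (illustration only; this model is not reused later)
vif(lm(Weight ~ Length1 + Height, data = data_train))
```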

## Model Improvement
Our final model doesn't fulfill all of the linear regression assumptions. We can address this by applying a **sqrt** transformation to the data and removing one of the highly correlated variables to get rid of the multicollinearity. I choose Length1 and Height as the final independent predictors.

### 1. Train-Test Split New Model
```{r}
fish_imprv <- fish_market  %>% 
  mutate_if(~is.numeric(.), ~sqrt(.)) %>% 
  select(Weight,Length1, Height)
set.seed(123)
data_train2 <- fish_imprv[index, ]
data_test2 <- fish_imprv[-index, ]

```
### 2. Regression New Model
```{r}
fish_lm_imprv <- lm(Weight ~ Length1+Height, data = data_train2)
summary(fish_lm_imprv)
```
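On the square-root scale, the fitted equation from the summary above is approximately

> sqrt(Weight) = -27.67 + 6.81 sqrt(Length1) + 3.78 sqrt(Height)

and the exact coefficients can be pulled from the model object (a minimal sketch):
```{r}
# coefficients of the improved model, on the sqrt scale
coef(fish_lm_imprv)
```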

### 3. Evaluation New Model
```{r}
lm_pred2 <- predict(fish_lm_imprv, newdata = data_test2 %>% select(-Weight))

# MAPE of test dataset
MAPE(lm_pred2^2,(data_test2$Weight)^2)
```
Our new model has an adjusted R-squared of 93.96%, and the MAPE drops sharply to about 0.22 (roughly 22% average error). Note that both the predictions and the actual values are squared before computing MAPE, because the model was fitted on the square-root-transformed Weight.

### 4. Assumptions New Model
```{r}
#Linearity
plot(fish_lm_imprv,1)
```
There is still a slight curved pattern in the residual plot, so the linearity assumption is not fully satisfied.
```{r}
#Normality of residuals
shapiro.test(fish_lm_imprv$residuals)
```
The p-value is < 0.05, which means the residuals are still not normally distributed.
```{r}
#Heterocedasticity (Non-constant variance of error terms)
bptest(fish_lm_imprv)
```
The p-value is > 0.05, which means heteroscedasticity is not present.
```{r}
#Multicollinearity
vif(fish_lm_imprv)
```
All VIF values are below 10, so there is no multicollinearity.

## Conclusion
The variables that affect the weight of a fish in the final model are Length1 and Height. We needed to eliminate Length2 and Length3 because the three length variables are highly correlated with each other, and to satisfy the assumptions of linear regression we needed to apply a **sqrt** transformation to the data. The final linear regression model has a MAPE of about 0.22 with an adjusted R-squared of 0.9396, which makes it better than the two previous models. For business purposes, to increase the Weight of the fish sold, fishermen need to focus on increasing the Length1 and Height of each fish.

