1. CASE DESCRIPTION

1.1 Brief Introduction

Hi, my name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my third RPubs document, written as part of a Learn By Building assignment at Algoritma Data Science School.

1.2 Case Background

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

The original dataset can be accessed at UCI Abalone Dataset Link.

1.3 Data Explanation

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal     -       M, F, and I (infant)
Length          continuous  mm      longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer     -       +1.5 gives the age in years

1.4 Objectives

  • Choose the target variable according to the perspective of the case you want to take
  • Analyze the data and select predictor variables (feature selection)
  • Test the validity of the model
  • Interpret the models and give recommendations regarding the initial case

2. Data Preprocessing

2.1 Setup Libraries

The first step in this regression modeling exercise is to import the required libraries.

library(tidyverse)  # data wrangling (dplyr) and plotting (ggplot2)
library(GGally)     # ggpairs() and ggcorr() for exploratory plots
library(MLmetrics)  # RMSE() for model evaluation
library(lmtest)     # bptest() for the Breusch-Pagan test
library(car)        # vif() for multicollinearity checks

2.2 Data Input & Structure

After all of the required libraries are loaded, let's read the abalone dataset (abalone.csv) in our folder using the read.csv() function, save it into an object named abalone_df, and display the first 6 rows of the data.

abalone_df <- read.csv('abalone.csv')
head(abalone_df)

We can also check the last 6 rows of the dataset using the tail() function.

tail(abalone_df)
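
To get the exact dimensions, we can also call base R's dim() function, which returns the row and column counts:

dim(abalone_df)
#> [1] 4177    9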

Based on the output above, we know that our data has 4,177 rows and 9 columns. Next, we will check the data structure and data types using the str() function. We need to make sure our columns have the correct data types to get a better regression model.

str(abalone_df)
#> 'data.frame':    4177 obs. of  9 variables:
#>  $ Sex           : chr  "M" "M" "F" "M" ...
#>  $ Length        : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
#>  $ Diameter      : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
#>  $ Height        : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
#>  $ Whole.weight  : num  0.514 0.226 0.677 0.516 0.205 ...
#>  $ Shucked.weight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
#>  $ Viscera.weight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
#>  $ Shell.weight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
#>  $ Rings         : int  15 7 9 10 7 8 20 16 9 19 ...

The abalone dataset has 1 chr, 7 num, and 1 int column, but we know that the Sex column has the wrong data type, so we need to change it from chr to factor. We also need to add a new column called Age by adding 1.5 to the Rings column (following the data description) and then remove the Rings column from our dataset.

# convert Sex from character to factor
abalone_df$Sex <- as.factor(abalone_df$Sex)
# Age = Rings + 1.5 (per the data description), then drop Rings
abalone_df$Age <- abalone_df$Rings + 1.5
abalone_df <- abalone_df %>% select(-Rings)
str(abalone_df)
#> 'data.frame':    4177 obs. of  9 variables:
#>  $ Sex           : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
#>  $ Length        : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
#>  $ Diameter      : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
#>  $ Height        : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
#>  $ Whole.weight  : num  0.514 0.226 0.677 0.516 0.205 ...
#>  $ Shucked.weight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
#>  $ Viscera.weight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
#>  $ Shell.weight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
#>  $ Age           : num  16.5 8.5 10.5 11.5 8.5 9.5 21.5 17.5 10.5 20.5 ...

Now our data has the correct data types, and we can continue to the next step: checking for missing data in our dataset.

2.3 Missing Data

Even though the dataset already has the correct data types, we still need to check for missing data and apply any necessary treatment to it.

anyNA(abalone_df)
#> [1] FALSE
colSums(is.na(abalone_df))
#>            Sex         Length       Diameter         Height   Whole.weight 
#>              0              0              0              0              0 
#> Shucked.weight Viscera.weight   Shell.weight            Age 
#>              0              0              0              0

Based on the output above, the dataset doesn't have any missing values, so we don't need to apply any missing-value treatment.

2.4 Exploratory Data Analysis

Let’s check the statistical summary of our dataset.

summary(abalone_df)
#>  Sex          Length         Diameter          Height        Whole.weight   
#>  F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000   Min.   :0.0020  
#>  I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415  
#>  M:1528   Median :0.545   Median :0.4250   Median :0.1400   Median :0.7995  
#>           Mean   :0.524   Mean   :0.4079   Mean   :0.1395   Mean   :0.8287  
#>           3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530  
#>           Max.   :0.815   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255  
#>  Shucked.weight   Viscera.weight    Shell.weight         Age       
#>  Min.   :0.0010   Min.   :0.0005   Min.   :0.0015   Min.   : 2.50  
#>  1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 9.50  
#>  Median :0.3360   Median :0.1710   Median :0.2340   Median :10.50  
#>  Mean   :0.3594   Mean   :0.1806   Mean   :0.2388   Mean   :11.43  
#>  3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:12.50  
#>  Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :30.50
stacked_df <- stack(abalone_df %>% select(-Sex,-Age))
ggplot(stacked_df, aes(ind,values))+
  geom_boxplot(aes(fill = ind))+
  labs(title = 'Box Plot for Each Input Column', x='Column Names', y='Value', 
       fill = 'Column Names')+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplot(abalone_df, aes(Sex,Age))+
  geom_violin(aes(fill = Sex))+
  labs(title = 'Sex Category Violin Plot', x='Sex', y='Age', 
       fill = 'Sex')+
  theme(plot.title = element_text(hjust = 0.5))

ggpairs(abalone_df,
        columns = 2:9,
        aes(color = Sex,
            alpha = 0.5),
        upper = list(continuous = "points"),
        title = 'Abalone Dataset Pairplot',
        legend = 1)+
  theme(plot.title = element_text(hjust = 0.5))

ggcorr(abalone_df, label = T, label_size = 2.9, hjust = 1, layout.exp = 3)+
  labs(title = 'Abalone Dataset Heatmap Correlation')+
  theme(plot.title = element_text(hjust = 0.5))

Key insights:
- Although the features are not normally distributed, they are close to normality.
- None of the features has a minimum of 0 except Height (this requires a re-check).
- Each feature has a different scale range.
- Each input column has outliers; we will ignore this finding in the model development process.
- Male: the majority of ages lie between 7.5 and 19 years.
- Female: the majority of ages lie between 8 and 19 years.
- Infant: the majority of ages lie between 6 and under 10 years.
- Length is linearly correlated with Diameter, while its relationships with Height, Whole weight, Shucked weight, Viscera weight, and Shell weight are non-linear.
- Whole weight varies almost linearly with all other features except Age.
- Height has the least linearity with the remaining features.
- Age is most linearly related to Shell weight, followed by Diameter and Length (see the quick check below).
- Age is least correlated with Shucked weight.
Such high correlation coefficients among features can result in multicollinearity. We need to check for that later.
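
As a quick check of the correlation claims above, we can rank the numeric predictors by the absolute value of their Pearson correlation with Age (a small sketch; output omitted):

# rank numeric predictors by |Pearson correlation| with Age
cor_with_age <- cor(abalone_df %>% select(where(is.numeric)))[, "Age"]
sort(abs(cor_with_age[names(cor_with_age) != "Age"]), decreasing = TRUE)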

3. Linear Regression Model Development

3.1 Regression Model Based on Highest Pearson Correlation Value

Our first regression model will be built using the input parameters with the highest Pearson correlation with the Age column (Shell.weight, Height, Diameter, Length).

# build the model
model_abalone_1 <- lm(Age ~ Shell.weight+Height+Diameter+Length, abalone_df)

# model summary
summary(model_abalone_1)
#> 
#> Call:
#> lm(formula = Age ~ Shell.weight + Height + Diameter + Length, 
#>     data = abalone_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -12.3126  -1.5620  -0.5489   0.9620  15.3197 
#> 
#> Coefficients:
#>              Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)    7.9599     0.2611  30.489 < 0.0000000000000002 ***
#> Shell.weight  12.8490     0.6755  19.021 < 0.0000000000000002 ***
#> Height        11.6702     1.7281   6.753     0.00000000001645 ***
#> Diameter      14.7604     2.5039   5.895     0.00000000404324 ***
#> Length       -13.8239     1.9828  -6.972     0.00000000000362 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.484 on 4172 degrees of freedom
#> Multiple R-squared:  0.407,  Adjusted R-squared:  0.4064 
#> F-statistic: 715.8 on 4 and 4172 DF,  p-value: < 0.00000000000000022

3.2 Regression Model with All Input Parameters

Our second regression model will be built with all of the input parameters included.

# build the model
model_abalone_all <- lm(Age ~ . , abalone_df)

# model summary
summary(model_abalone_all)
#> 
#> Call:
#> lm(formula = Age ~ ., data = abalone_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10.4800  -1.3053  -0.3428   0.8600  13.9426 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)      5.39464    0.29157  18.502 < 0.0000000000000002 ***
#> SexI            -0.82488    0.10240  -8.056 0.000000000000001022 ***
#> SexM             0.05772    0.08335   0.692                0.489    
#> Length          -0.45834    1.80912  -0.253                0.800    
#> Diameter        11.07510    2.22728   4.972 0.000000687620027268 ***
#> Height          10.76154    1.53620   7.005 0.000000000002861366 ***
#> Whole.weight     8.97544    0.72540  12.373 < 0.0000000000000002 ***
#> Shucked.weight -19.78687    0.81735 -24.209 < 0.0000000000000002 ***
#> Viscera.weight -10.58183    1.29375  -8.179 0.000000000000000376 ***
#> Shell.weight     8.74181    1.12473   7.772 0.000000000000009639 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.194 on 4167 degrees of freedom
#> Multiple R-squared:  0.5379, Adjusted R-squared:  0.5369 
#> F-statistic: 538.9 on 9 and 4167 DF,  p-value: < 0.00000000000000022

3.3 Step-wise Regression Model

The last model we build uses the backward step-wise method to find the best set of input parameters.

# stepwise regression - backward
model_backward <- step(object = model_abalone_all, direction = "backward", trace = F)
# check the model summary
summary(model_backward)
#> 
#> Call:
#> lm(formula = Age ~ Sex + Diameter + Height + Whole.weight + Shucked.weight + 
#>     Viscera.weight + Shell.weight, data = abalone_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -10.4644  -1.3041  -0.3427   0.8633  13.9571 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)      5.37038    0.27536  19.503 < 0.0000000000000002 ***
#> SexI            -0.82644    0.10220  -8.087 0.000000000000000796 ***
#> SexM             0.05755    0.08333   0.691                 0.49    
#> Diameter        10.56951    0.98887  10.688 < 0.0000000000000002 ***
#> Height          10.74911    1.53525   7.002 0.000000000002937673 ***
#> Whole.weight     8.97751    0.72528  12.378 < 0.0000000000000002 ***
#> Shucked.weight -19.80258    0.81490 -24.301 < 0.0000000000000002 ***
#> Viscera.weight -10.61279    1.28782  -8.241 0.000000000000000227 ***
#> Shell.weight     8.75078    1.12405   7.785 0.000000000000008730 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.194 on 4168 degrees of freedom
#> Multiple R-squared:  0.5379, Adjusted R-squared:  0.537 
#> F-statistic: 606.4 on 8 and 4168 DF,  p-value: < 0.00000000000000022
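
Since step() selects predictors by minimizing AIC, we can also compare the three candidate models on that criterion directly (a quick check; lower AIC is better, output not shown here):

# compare the three models by AIC
AIC(model_abalone_1, model_abalone_all, model_backward)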

4. Model Evaluation

4.1 Metrics Evaluation

Now we have 3 models that need to be evaluated. We will evaluate model performance using several metrics: R-squared, adjusted R-squared, and RMSE.

# Create Metrics Table
Model_Name <- c('model_abalone_1','model_abalone_all','model_backward')
R2_Metric <- c(summary(model_abalone_1)$r.squared,
               summary(model_abalone_all)$r.squared,
               summary(model_backward)$r.squared)

Adj_R2_Metric <- c(summary(model_abalone_1)$adj.r.squared,
                   summary(model_abalone_all)$adj.r.squared,
                   summary(model_backward)$adj.r.squared)

pred_model_1 <- predict(model_abalone_1, newdata = abalone_df)
pred_model_all <- predict(model_abalone_all, newdata = abalone_df)
pred_model_backward <- predict(model_backward, newdata = abalone_df)
RMSE_Metric <- c(RMSE(y_pred = pred_model_1, y_true = abalone_df$Age),
                 RMSE(y_pred = pred_model_all, y_true = abalone_df$Age),
                 RMSE(y_pred = pred_model_backward, y_true = abalone_df$Age))
metric_df <- data.frame(Model_Name,R2_Metric,Adj_R2_Metric,RMSE_Metric)
metric_df
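
For reference, MLmetrics::RMSE() is simply the square root of the mean squared error, so we can reproduce the model_backward value by hand:

# RMSE computed manually: sqrt(mean((prediction - actual)^2))
sqrt(mean((pred_model_backward - abalone_df$Age)^2))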

Key Insights:
- Based on the metric values, model_abalone_all has the best performance on every metric, with model_backward only slightly lower.
- model_backward is more efficient: it estimates 9 coefficients, while model_abalone_all estimates 10 (Length is dropped).
- All models are still lacking in terms of R-squared and adjusted R-squared. Based on experience, we would want a model with R-squared and adjusted R-squared above 0.65.
- For now, we will select model_backward for further evaluation, even though we need to look at regression algorithms other than linear regression.

4.2 Normality of Residuals

When developing a regression model, we hope that the errors/residuals are normally distributed, which means the errors are concentrated around 0.

hist(model_backward$residuals)

shapiro.test(model_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.92554, p-value < 0.00000000000000022

Based on the residual histogram and the Shapiro-Wilk test (p-value < 0.05), the residuals of model_backward are not normally distributed, although most errors fall between -2.5 and 2.5.
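
A normal Q-Q plot of the residuals makes the same point visually; points bending away from the reference line indicate a departure from normality:

# Q-Q plot of residuals against the theoretical normal quantiles
qqnorm(model_backward$residuals)
qqline(model_backward$residuals, col = "red")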

4.3 Homoscedasticity

Homoscedasticity describes a situation in which the error term (that is, the noise or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.

plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")

bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 459.78, df = 8, p-value < 0.00000000000000022

According to the plot and the Breusch-Pagan test, the error variance does not spread constantly: the p-value < 0.05, so we conclude that heteroscedasticity is present.
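
One common remedy worth sketching (our own suggestion, not pursued further here, and with no guarantee it fixes the problem for this dataset) is to model a transformed target such as log(Age) and re-run the Breusch-Pagan test:

# hedged sketch: refit the same predictors on log(Age), then re-test
model_log <- update(model_backward, log(Age) ~ .)
bptest(model_log)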

4.4 Multicollinearity

Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. Multicollinearity can lead to skewed or misleading results when a researcher or analyst attempts to determine how well each independent variable predicts or explains the dependent variable in a statistical model. Multicollinearity among independent variables also makes statistical inferences less reliable.

vif(model_backward)
#>                      GVIF Df GVIF^(1/(2*Df))
#> Sex              1.536892  2        1.113425
#> Diameter         8.355832  1        2.890646
#> Height           3.577715  1        1.891485
#> Whole.weight   109.754856  1       10.476395
#> Shucked.weight  28.386046  1        5.327856
#> Viscera.weight  17.289367  1        4.158048
#> Shell.weight    21.242202  1        4.608926

The Variance Inflation Factor (VIF) test shows that there is multicollinearity among the parameters with GVIF values > 10 (Whole.weight, Shucked.weight, Viscera.weight, and Shell.weight). This means we might need to drop those parameters as inputs to our model.
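
As an illustration of that idea (a sketch, not a final model), we could drop Whole.weight, the predictor with the highest GVIF, and re-check the remaining values:

# hedged sketch: remove Whole.weight and re-inspect multicollinearity
model_reduced <- update(model_backward, . ~ . - Whole.weight)
vif(model_reduced)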

5. Conclusion

In this exercise, we could not obtain a model that satisfies our needs using the linear regression algorithm. Nevertheless, we selected model_backward as our most efficient linear regression model and conducted further evaluation of normality of residuals, homoscedasticity, and multicollinearity. Based on the evaluation results, the residuals of model_backward are not normally distributed, which means there is a chance of large errors appearing in the model. The model also shows heteroscedasticity, meaning large errors are more likely in certain ranges of values. In addition, model_backward is still not efficient because multicollinearity appears among the input parameters used in the model. Finally, linear regression may not be the best option for our dataset because many parameters have non-linear relationships with each other and with the output. It would therefore be worthwhile to try other regression algorithms on this dataset.