1. CASE DESCRIPTION
1.1 Brief Introduction
Hi, my name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my third RPubs document, written as part of the Learn By Building assignment at Algoritma Data Science School.
1.2 Case Background
Predicting the age of abalone from physical measurements. The age of
abalone is determined by cutting the shell through the cone, staining
it, and counting the number of rings through a microscope – a boring and
time-consuming task. Other measurements, which are easier to obtain, are
used to predict the age. Further information, such as weather patterns
and location (hence food availability) may be required to solve the
problem.
The original dataset can be accessed via the UCI Abalone Dataset link.
1.3 Data Explanation
Given below are the attribute name, attribute type, measurement unit, and a brief description. The number of rings is the value to predict, either as a continuous value or as a classification problem.
| Name | Data Type | Unit | Description |
|---|---|---|---|
| Sex | nominal | — | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | Perpendicular to length |
| Height | continuous | mm | With meat in shell |
| Whole weight | continuous | grams | Whole abalone |
| Shucked weight | continuous | grams | Weight of meat |
| Viscera weight | continuous | grams | Gut weight (after bleeding) |
| Shell weight | continuous | grams | After being dried |
| Rings | integer | — | +1.5 gives the age in years |
1.4 Objectives
- Choose a target variable depending on the perspective of the case you want to take
- Analyze the data and select predictor variables (feature selection)
- Test the validity of the model
- Interpret the models and give recommendations regarding the initial case
2. Data Preprocessing
2.1 Setup Libraries
The first step of this regression modeling exercise is to import the required libraries.
library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
2.2 Data Input & Structure
After all required libraries are loaded, let's read the Abalone dataset
(abalone.csv) from our folder using the read.csv() function, save
it into an object named abalone_df, and display the first 6
rows of the data.
abalone_df <- read.csv('abalone.csv')
head(abalone_df)
We can also check the last 6 rows of the dataset using the tail() function.
tail(abalone_df)
Based on the output above, we know that our data has dimensions of
4,177 rows and 9 columns. Next, we will check the data structure and the
data types using the str() function. We need to make sure our
columns have the correct data types for a better regression model
result.
str(abalone_df)
#> 'data.frame': 4177 obs. of 9 variables:
#> $ Sex : chr "M" "M" "F" "M" ...
#> $ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
#> $ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
#> $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
#> $ Whole.weight : num 0.514 0.226 0.677 0.516 0.205 ...
#> $ Shucked.weight: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
#> $ Viscera.weight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
#> $ Shell.weight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
#> $ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
The Abalone dataset has 1 chr, 7 num, and 1
int column, but we know that the Sex column
has the wrong data type. That's why we need to adjust it
from chr to Factor. We also need to add a new
column called Age by adding 1.5 to the Rings
column, and then remove the Rings column from our dataset.
abalone_df$Sex <- as.factor(abalone_df$Sex)
abalone_df$Age <- abalone_df$Rings + 1.5
abalone_df <- abalone_df %>% select(-Rings)
str(abalone_df)
#> 'data.frame': 4177 obs. of 9 variables:
#> $ Sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
#> $ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
#> $ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
#> $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
#> $ Whole.weight : num 0.514 0.226 0.677 0.516 0.205 ...
#> $ Shucked.weight: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
#> $ Viscera.weight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
#> $ Shell.weight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
#> $ Age : num 16.5 8.5 10.5 11.5 8.5 9.5 21.5 17.5 10.5 20.5 ...
Now our data has the correct data types and we can continue with the next step: checking for missing data in our dataset.
2.3 Missing Data
Even though the dataset already has the correct data types, we still need to check for missing data and apply the necessary treatment if any is found.
anyNA(abalone_df)
#> [1] FALSE
colSums(is.na(abalone_df))
#> Sex Length Diameter Height Whole.weight
#> 0 0 0 0 0
#> Shucked.weight Viscera.weight Shell.weight Age
#> 0 0 0 0
Based on the output above, the dataset doesn’t have any missing values. That means we don’t need to do any missing value data treatment on the dataset.
2.4 Exploratory Data Analysis
Let’s check the statistical summary of our dataset.
summary(abalone_df)
#> Sex Length Diameter Height Whole.weight
#> F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000 Min. :0.0020
#> I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4415
#> M:1528 Median :0.545 Median :0.4250 Median :0.1400 Median :0.7995
#> Mean :0.524 Mean :0.4079 Mean :0.1395 Mean :0.8287
#> 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1530
#> Max. :0.815 Max. :0.6500 Max. :1.1300 Max. :2.8255
#> Shucked.weight Viscera.weight Shell.weight Age
#> Min. :0.0010 Min. :0.0005 Min. :0.0015 Min. : 2.50
#> 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 9.50
#> Median :0.3360 Median :0.1710 Median :0.2340 Median :10.50
#> Mean :0.3594 Mean :0.1806 Mean :0.2388 Mean :11.43
#> 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290 3rd Qu.:12.50
#> Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :30.50
stacked_df <- stack(abalone_df %>% select(-Sex, -Age))
ggplot(stacked_df, aes(ind, values)) +
  geom_boxplot(aes(fill = ind)) +
  labs(title = 'Box Plot for Each Input Column', x = 'Column Names', y = 'Value',
       fill = 'Column Names') +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

ggplot(abalone_df, aes(Sex, Age)) +
  geom_violin(aes(fill = Sex)) +
  labs(title = 'Sex Category Violin Plot', x = 'Sex', y = 'Age',
       fill = 'Sex') +
  theme(plot.title = element_text(hjust = 0.5))

ggpairs(abalone_df,
        columns = 2:9,
        aes(color = Sex,
            alpha = 0.5),
        upper = list(continuous = "points"),
        title = 'Abalone Dataset Pairplot',
        legend = 1) +
  theme(plot.title = element_text(hjust = 0.5))

ggcorr(abalone_df, label = T, label_size = 2.9, hjust = 1, layout.exp = 3) +
  labs(title = 'Abalone Dataset Heatmap Correlation') +
  theme(plot.title = element_text(hjust = 0.5))
Key insights:
- Though the features are not perfectly normally distributed, they are close to normality.
- None of the features has a minimum of 0 except Height, which requires a re-check (see the inspection after this list).
- Each feature has a different scale range.
- Every input column has outliers; we will ignore this finding in the model development process.
- Male: the majority of ages lie between 7.5 and 19 years.
- Female: the majority of ages lie between 8 and 19 years.
- Infant: the majority of ages lie between 6 and < 10 years.
- Length is linearly correlated with Diameter, while its relation with Height, Whole weight, Shucked weight, Viscera weight, and Shell weight is non-linear.
- Whole weight varies almost linearly with all other features except Age.
- Height has the least linearity with the remaining features.
- Age is most linearly proportional to Shell weight, followed by Diameter and Length.
- Age is least correlated with Shucked weight.
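Before moving on, it is worth inspecting the Height minimum of 0 flagged above; a minimal check on the already-loaded abalone_df would be:
# inspect rows where Height is recorded as 0 -- physically implausible,
# so these are likely data-entry errors
abalone_df %>% filter(Height == 0)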
Such high correlation coefficients among the features can result in multicollinearity; we will check for that later.
3. Linear Regression Model Development
3.1 Regression Model Based on Highest Pearson Correlation Value
Our first regression model will be built from the input parameters
with the highest Pearson correlation to the Age column
(Shell.weight, Height, Diameter, Length).
# build the model
model_abalone_1 <- lm(Age ~ Shell.weight + Height + Diameter + Length, abalone_df)
# model summary
summary(model_abalone_1)
#>
#> Call:
#> lm(formula = Age ~ Shell.weight + Height + Diameter + Length,
#> data = abalone_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.3126 -1.5620 -0.5489 0.9620 15.3197
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7.9599 0.2611 30.489 < 0.0000000000000002 ***
#> Shell.weight 12.8490 0.6755 19.021 < 0.0000000000000002 ***
#> Height 11.6702 1.7281 6.753 0.00000000001645 ***
#> Diameter 14.7604 2.5039 5.895 0.00000000404324 ***
#> Length -13.8239 1.9828 -6.972 0.00000000000362 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.484 on 4172 degrees of freedom
#> Multiple R-squared: 0.407, Adjusted R-squared: 0.4064
#> F-statistic: 715.8 on 4 and 4172 DF, p-value: < 0.00000000000000022
3.2 Regression Model with All Input Parameters
Our second regression model will be built with all input parameters included.
# build the model
model_abalone_all <- lm(Age ~ . , abalone_df)
# model summary
summary(model_abalone_all)
#>
#> Call:
#> lm(formula = Age ~ ., data = abalone_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10.4800 -1.3053 -0.3428 0.8600 13.9426
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.39464 0.29157 18.502 < 0.0000000000000002 ***
#> SexI -0.82488 0.10240 -8.056 0.000000000000001022 ***
#> SexM 0.05772 0.08335 0.692 0.489
#> Length -0.45834 1.80912 -0.253 0.800
#> Diameter 11.07510 2.22728 4.972 0.000000687620027268 ***
#> Height 10.76154 1.53620 7.005 0.000000000002861366 ***
#> Whole.weight 8.97544 0.72540 12.373 < 0.0000000000000002 ***
#> Shucked.weight -19.78687 0.81735 -24.209 < 0.0000000000000002 ***
#> Viscera.weight -10.58183 1.29375 -8.179 0.000000000000000376 ***
#> Shell.weight 8.74181 1.12473 7.772 0.000000000000009639 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.194 on 4167 degrees of freedom
#> Multiple R-squared: 0.5379, Adjusted R-squared: 0.5369
#> F-statistic: 538.9 on 9 and 4167 DF, p-value: < 0.00000000000000022
3.3 Step-wise Regression Model
The last model we build uses the backward step-wise method to find the best set of input parameters.
# stepwise regression - backward
model_backward <- step(object = model_abalone_all, direction = "backward", trace = F)
# check the model summary
summary(model_backward)
#>
#> Call:
#> lm(formula = Age ~ Sex + Diameter + Height + Whole.weight + Shucked.weight +
#> Viscera.weight + Shell.weight, data = abalone_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -10.4644 -1.3041 -0.3427 0.8633 13.9571
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.37038 0.27536 19.503 < 0.0000000000000002 ***
#> SexI -0.82644 0.10220 -8.087 0.000000000000000796 ***
#> SexM 0.05755 0.08333 0.691 0.49
#> Diameter 10.56951 0.98887 10.688 < 0.0000000000000002 ***
#> Height 10.74911 1.53525 7.002 0.000000000002937673 ***
#> Whole.weight 8.97751 0.72528 12.378 < 0.0000000000000002 ***
#> Shucked.weight -19.80258 0.81490 -24.301 < 0.0000000000000002 ***
#> Viscera.weight -10.61279 1.28782 -8.241 0.000000000000000227 ***
#> Shell.weight 8.75078 1.12405 7.785 0.000000000000008730 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.194 on 4168 degrees of freedom
#> Multiple R-squared: 0.5379, Adjusted R-squared: 0.537
#> F-statistic: 606.4 on 8 and 4168 DF, p-value: < 0.00000000000000022
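As a side note, step() also supports direction = "both", which lets previously dropped terms re-enter the model if they improve the AIC; a quick variant of the call above would be:
# stepwise regression - both directions (dropped terms may re-enter)
model_both <- step(object = model_abalone_all, direction = "both", trace = F)
summary(model_both)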
4. Model Evaluation
4.1 Metrics Evaluation
Now we have 3 models that need to be evaluated. We will evaluate model performance using several metrics: R-squared, adjusted R-squared, and RMSE (root mean squared error, i.e. the square root of the mean squared difference between predicted and actual values).
# Create Metrics Table
Model_Name <- c('model_abalone_1','model_abalone_all','model_backward')
R2_Metric <- c(summary(model_abalone_1)$r.squared,
summary(model_abalone_all)$r.squared,
summary(model_backward)$r.squared)
Adj_R2_Metric <- c(summary(model_abalone_1)$adj.r.squared,
summary(model_abalone_all)$adj.r.squared,
summary(model_backward)$adj.r.squared)
pred_model_1 <- predict(model_abalone_1, newdata = abalone_df)
pred_model_all <- predict(model_abalone_all, newdata = abalone_df)
pred_model_backward <- predict(model_backward, newdata = abalone_df)
RMSE_Metric <- c(RMSE(y_pred = pred_model_1, y_true = abalone_df$Age),
RMSE(y_pred = pred_model_all, y_true = abalone_df$Age),
RMSE(y_pred = pred_model_backward, y_true = abalone_df$Age))
metric_df <- data.frame(Model_Name,R2_Metric,Adj_R2_Metric,RMSE_Metric)
metric_df
Key Insights:
- Based on the metric values, model_abalone_all has the best performance on every metric, with model_backward only slightly lower.
- model_backward is more efficient: the backward step dropped Length, so it estimates 9 coefficients (including the intercept) while model_abalone_all estimates 10.
- All of the models are still lacking in terms of R-squared and adjusted R-squared. Based on experience, we want to develop a model with R-squared and adjusted R-squared values > 0.65.
- For now, we will select model_backward for further evaluation (see the holdout sketch after this list), even though we should also look for regression algorithms other than Linear Regression.
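Note that all three metrics above are computed on the same data the models were fitted on, which flatters the results. A more honest validity check holds out a test set; below is a minimal sketch, where the 75/25 split ratio and the seed are arbitrary choices for illustration, not part of the original analysis:
# hold out 25% of rows as a test set (seed and split ratio are arbitrary)
set.seed(100)
idx <- sample(nrow(abalone_df), size = floor(0.75 * nrow(abalone_df)))
abalone_train <- abalone_df[idx, ]
abalone_test <- abalone_df[-idx, ]
# refit the selected formula on training rows only, then score unseen rows
model_holdout <- lm(formula(model_backward), data = abalone_train)
RMSE(y_pred = predict(model_holdout, newdata = abalone_test),
     y_true = abalone_test$Age)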
4.2 Normality of Residuals
When developing a regression model, we hope our model's errors/residuals are normally distributed, which means the errors are concentrated near 0.
hist(model_backward$residuals)
shapiro.test(model_backward$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.92554, p-value < 0.00000000000000022
Based on the residual histogram and the Shapiro-Wilk test (p-value < 0.05, so normality is rejected), the residuals of model_backward are not normally distributed, although most errors fall between -2.5 and 2.5.
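A normal Q-Q plot is another quick visual check that complements the histogram and the Shapiro-Wilk test:
# Q-Q plot: normally distributed residuals should hug the reference line
qqnorm(model_backward$residuals)
qqline(model_backward$residuals, col = "red")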
4.3 Homoscedasticity
Homoscedasticity describes a situation in which the error term (that is, the noise or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")
bptest(model_backward)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 459.78, df = 8, p-value < 0.00000000000000022
According to the plot and the Breusch-Pagan test, the error variance does not spread constantly: the p-value < 0.05, so heteroscedasticity is present.
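A common remedy when the error spread grows with the fitted values is to model the target on a log scale. The refit below is our illustration, not part of the original analysis:
# refit with log(Age) as the target; a log scale often stabilizes variance
model_log <- lm(update(formula(model_backward), log(Age) ~ .), data = abalone_df)
# re-run the Breusch-Pagan test to see whether heteroscedasticity eases
bptest(model_log)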
4.4 Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. It can lead to skewed or misleading results when a researcher or analyst attempts to determine how well each independent variable can be used to predict or understand the dependent variable in a statistical model. Multicollinearity among independent variables results in less reliable statistical inferences.
vif(model_backward)
#> GVIF Df GVIF^(1/(2*Df))
#> Sex 1.536892 2 1.113425
#> Diameter 8.355832 1 2.890646
#> Height 3.577715 1 1.891485
#> Whole.weight 109.754856 1 10.476395
#> Shucked.weight 28.386046 1 5.327856
#> Viscera.weight 17.289367 1 4.158048
#> Shell.weight 21.242202 1 4.608926
The Variance Inflation Factor (VIF) test shows multicollinearity in the parameters with GVIF values > 10 (Whole.weight, Shucked.weight, Viscera.weight, and Shell.weight). This means we might need to drop those parameters as inputs for our model.
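One way to act on this finding is to drop the collinear weight columns and re-check. Keeping only Shell.weight among the weight features, as below, is just one of several defensible choices:
# drop Whole.weight, Shucked.weight, and Viscera.weight; keep Shell.weight
model_reduced <- lm(Age ~ Sex + Diameter + Height + Shell.weight, data = abalone_df)
vif(model_reduced)  # re-check: the GVIF values should drop substantially
summary(model_reduced)$adj.r.squared  # expect some loss of fit in exchange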
5. Conclusion
In this exercise, we could not obtain a model that satisfies our needs using
the Linear Regression algorithm. Nevertheless, we still selected
model_backward as our most efficient linear regression
model and conducted further evaluation of the normality of residuals,
homoscedasticity, and multicollinearity. Based on the evaluation results,
model_backward does not have normally distributed residuals,
which means there is a chance of large errors appearing in the model. The
model also shows heteroscedasticity, which means large errors are more
likely within certain ranges of values. model_backward
is also still not efficient, because multicollinearity appears among the
input parameters used in the model. Finally, linear regression
might not be the best option for our dataset, because many parameters
have non-linear relationships with each other and with the output.
That is why it would be better to try other regression algorithms on
our dataset.
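As a concrete next step, a tree-based model can capture the non-linear relationships noted above without the assumptions checked in Section 4. A minimal sketch, assuming the randomForest package is installed (untuned defaults, not part of this exercise):
# a non-linear baseline to compare against model_backward
library(randomForest)
set.seed(100)
model_rf <- randomForest(Age ~ ., data = abalone_df)
model_rf  # printed output reports out-of-bag MSE and % variance explained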