Title of the project

Predicting the Age of Abalone using Classification and Regression Models

Introduction

Predicting the age of abalone is possible through physical measurements. The abalone add one ring to their shell every year, except in their first year. Theoretically, the age of abalone can be accurately estimated by adding 1.5 to the number of rings. In real-life, the age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Therefore, other measurements (e.g., weight and length of the abalone), which are easier to obtain, are used to predict the age. In this project, we aim to use the available dataset to develop Classification and Linear Regression models, which will help in accurately predict the age of abalone.

Dataset source

https://www.kaggle.com/miksaas/abalone-eda-regression-pca-classification

Dataset details

There are seven predictor variables included in the dataset. The first three constitute the shell measurements:

The other four variables are related to the weight of abalone:

The response variable, that is the dependent variable in this analysis is:

Analysis details

Initial Questions

  1. What is the age of the abalone, Adult or Infant?
  2. What would be the age of an abalone predicted by the number of rings?

Objectives

To determine the age of abalone from ‘Sex’ feature and physical measurement. In this project, supervised learning algorithms are used to conclude the output data.

Import dataset

Import the csv file from Google drive.

id <- "18dk-yeiB7jFTRv1KR3bTzgiya3dwnKCc" # sharable google drive link
df<-read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))

Data Cleaning/Preprocessing

Load the required libraries.

defaultW <- getOption("warn") 
options(warn = -1) 
library(plyr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(corrplot)
library(caret)
library(randomForest)
library(Metrics)
library(DataExplorer)
library(ggcorrplot)
library(ggExtra)
options(warn = defaultW)

Inspect the dataset.

class(df)
## [1] "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 4187    9
ncol(df)
## [1] 9
nrow((df))
## [1] 4187
colnames(df)
## [1] "Sex"            "Length"         "Diameter"       "Height"        
## [5] "Whole.weight"   "Shucked.weight" "Viscera.weight" "Shell.weight"  
## [9] "Rings"
str(df)
## 'data.frame':    4187 obs. of  9 variables:
##  $ Sex           : chr  "M" "M" "F" "M" ...
##  $ Length        : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter      : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height        : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ Whole.weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ Shucked.weight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ Viscera.weight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ Shell.weight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings         : int  15 7 9 10 7 8 20 16 9 19 ...
glimpse(df)
## Rows: 4,187
## Columns: 9
## $ Sex            <chr> "M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F", ~
## $ Length         <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545,~
## $ Diameter       <dbl> 0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425,~
## $ Height         <dbl> 0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125,~
## $ Whole.weight   <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775,~
## $ Shucked.weight <dbl> 0.2245, 0.0995, 0.2565, 0.2155, 0.0895, 0.1410, 0.2370,~
## $ Viscera.weight <dbl> 0.1010, 0.0485, 0.1415, 0.1140, 0.0395, 0.0775, 0.1415,~
## $ Shell.weight   <dbl> 0.150, 0.070, 0.210, 0.155, 0.055, 0.120, 0.330, 0.260,~
## $ Rings          <int> 15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, ~
summary(df)
##      Sex                Length         Diameter          Height      
##  Length:4187        Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  Class :character   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  Mode  :character   Median :0.545   Median :0.4250   Median :0.1400  
##                     Mean   :0.524   Mean   :0.4078   Mean   :0.1395  
##                     3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##                     Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##                     NA's   :1       NA's   :1        NA's   :1       
##   Whole.weight    Shucked.weight   Viscera.weight    Shell.weight   
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4421   1st Qu.:0.1862   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3355   Median :0.1710   Median :0.2335  
##  Mean   :0.8286   Mean   :0.3593   Mean   :0.1805   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2527   3rd Qu.:0.3289  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050  
##  NA's   :1                                          NA's   :1       
##      Rings       
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median : 9.000  
##  Mean   : 9.932  
##  3rd Qu.:11.000  
##  Max.   :29.000  
## 
head(df, 5)
##   Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
##   Shell.weight Rings
## 1        0.150    15
## 2        0.070     7
## 3        0.210     9
## 4        0.155    10
## 5        0.055     7
tail(df, 5)
##      Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 4183   F  0.565    0.450  0.165       0.8870         0.3700         0.2390
## 4184   M  0.590    0.440  0.135       0.9660         0.4390         0.2145
## 4185   M  0.600    0.475  0.205       1.1760         0.5255         0.2875
## 4186   F  0.625    0.485  0.150       1.0945         0.5310         0.2610
## 4187   M  0.710    0.555  0.195       1.9485         0.9455         0.3765
##      Shell.weight Rings
## 4183       0.2490    11
## 4184       0.2605    10
## 4185       0.3080     9
## 4186       0.2960    10
## 4187       0.4950    12

Visualize the data description

plot_intro(df)

Check for duplicate rows and remove them; then check the number of duplicate rows again.

df[duplicated(df),]
##      Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 4025   M  0.660    0.485  0.155       1.2275         0.6100         0.2740
## 4105   M  0.635    0.500  0.180       1.2915         0.5940         0.2695
## 4123   F  0.570    0.450  0.150       0.9645         0.5310         0.1890
## 4135   I  0.540    0.415  0.135       0.7090         0.3195         0.1740
## 4168   M  0.475    0.360  0.140       0.5135         0.2410         0.1045
##      Shell.weight Rings
## 4025        0.300     8
## 4105        0.370     9
## 4123        0.209     9
## 4135        0.185     9
## 4168        0.155     8
df<-df[!duplicated(df),]
sum(duplicated(df))
## [1] 0

Check for rows containing NA or missing values.

df[!complete.cases(df),]
##      Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 3893   F  0.500     0.40  0.150       0.8085         0.2730         0.1120
## 3923   M  0.375     0.28     NA       0.2225         0.0875         0.0430
## 3951   F  0.530       NA  0.165       0.7720         0.2855         0.1975
## 3980   I     NA     0.35  0.135       0.4940         0.1925         0.0945
## 4001   I  0.315     0.23  0.000           NA         0.0575         0.0285
##      Shell.weight Rings
## 3893           NA    13
## 3923       0.0800    10
## 3951       0.2300    12
## 3980       0.1405     7
## 4001       0.3640     6
which(is.na(df), arr.ind=TRUE)
##       row col
## 3980 3980   2
## 3951 3951   3
## 3923 3923   4
## 4001 4001   5
## 3893 3893   8

Visualize the missing value distribution

plot_missing(df)

The 2nd, 3rd, 4th, 5th and 8th columns (all with “numeric” type data) contain missing values; therefore, we impute them by mean.

df[2][is.na(df[2])] <- mean(df[,2], na.rm = TRUE)
df[3][is.na(df[3])] <- mean(df[,3], na.rm = TRUE)
df[4][is.na(df[4])] <- mean(df[,4], na.rm = TRUE)
df[5][is.na(df[5])] <- mean(df[,5], na.rm = TRUE)
df[8][is.na(df[8])] <- mean(df[,8], na.rm = TRUE)

Check if there is any remaining missing values.

sum(!complete.cases(df))
## [1] 0

Visualize missing values if there is any

plot_intro(df)

The minimum value of ’Height” column is 0.0; we inspect and get the total number of rows with “Height” = 0.

df[df$Height==0,]
##      Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1258   I  0.430     0.34      0    0.4280000         0.2065         0.0860
## 4001   I  0.315     0.23      0    0.8284987         0.0575         0.0285
## 4002   I  0.315     0.23      0    0.1340000         0.0575         0.0285
##      Shell.weight Rings
## 1258       0.1150     8
## 4001       0.3640     6
## 4002       0.3505     6
nrow(df[df$Height==0,])
## [1] 3

The next cleaning steps are:

df<- df %>%
  mutate_if(is.character, str_trim) %>%
  mutate(Weight.diff=Whole.weight-(Viscera.weight + Shucked.weight + Shell.weight)) %>%
  subset(Height>0) %>%
  subset(Weight.diff>0) %>%
  select(-Weight.diff)

Rename the columns.

names(df)<-c("Sex", "Length", "Diameter", "Height", "Whole", "Shucked", "Viscera", "Shell", "Rings")  

Convert “Sex” column into ‘factor’ type.

df$Sex <- ordered(df$Sex, 
                  levels = c("I", "M", "F"), 
                  labels = c("Infant", "Male", "Female"))

Create two way contingency table.

table('Sex'=df$Sex,'Rings'=df$Rings)
##         Rings
## Sex        2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##   Infant   1  10  42  87 200 252 255 163  91  62  21  24  14   8   6   7   5
##   Male     0   2   6  11  22  76 170 269 289 218 115  88  56  51  30  25  18
##   Female   0   0   0   4  16  41 118 230 241 198 125  88  56  41  30  26  19
##         Rings
## Sex       19  20  21  22  23  24  25  26  27  29
##   Infant   2   2   1   0   0   0   0   0   0   0
##   Male    14  12   6   3   3   1   0   1   1   0
##   Female  15  12   7   3   6   1   1   0   1   1

Show the preprocessed dataset.

head(df,10)
##       Sex Length Diameter Height  Whole Shucked Viscera Shell Rings
## 1    Male  0.455    0.365  0.095 0.5140  0.2245  0.1010 0.150    15
## 2    Male  0.350    0.265  0.090 0.2255  0.0995  0.0485 0.070     7
## 3  Female  0.530    0.420  0.135 0.6770  0.2565  0.1415 0.210     9
## 4    Male  0.440    0.365  0.125 0.5160  0.2155  0.1140 0.155    10
## 5  Infant  0.330    0.255  0.080 0.2050  0.0895  0.0395 0.055     7
## 6  Infant  0.425    0.300  0.095 0.3515  0.1410  0.0775 0.120     8
## 7  Female  0.530    0.415  0.150 0.7775  0.2370  0.1415 0.330    20
## 8  Female  0.545    0.425  0.125 0.7680  0.2940  0.1495 0.260    16
## 9    Male  0.475    0.370  0.125 0.5095  0.2165  0.1125 0.165     9
## 10 Female  0.550    0.440  0.150 0.8945  0.3145  0.1510 0.320    19

Exploratory Data Analysis

Visualize each variable and sort it by ‘Sex’ feature

Most of the variables are positively skewed.

plot_qq(df, by = "Sex")

Visualize the distribution of “Sex”.

ggplot(data=df,aes(x=Sex,fill=Sex))+geom_bar()

Create boxplots and violin plots of “Sex” versus “Whole”.

qplot(Sex, Whole, data = df, geom = "boxplot", fill = Sex)

qplot(Sex, Whole, data = df, geom = "violin", fill = Sex)

Create a scatterplot of “Whole” versus “Shucked”, “Whole” versus “Viscera” and “Whole” versus “Shell” to examine how they relate to each other.

qplot(Whole, Shucked, data = df, color = Sex)

qplot(Whole, Viscera, data = df, color = Sex)

qplot(Whole, Shell, data = df, color = Sex)

It appears that “Whole” versus “Shucked” has the strongest positive correlation.

Create a correlation matrix.

corr<-cor(df[c(-1,-10)], method = "pearson", use = "complete.obs")
round(corr, 2)
##          Length Diameter Height Whole Shucked Viscera Shell Rings
## Length     1.00     0.99   0.82  0.93    0.90    0.90  0.90  0.54
## Diameter   0.99     1.00   0.83  0.93    0.90    0.90  0.91  0.56
## Height     0.82     0.83   1.00  0.81    0.77    0.79  0.81  0.54
## Whole      0.93     0.93   0.81  1.00    0.97    0.97  0.96  0.53
## Shucked    0.90     0.90   0.77  0.97    1.00    0.93  0.88  0.41
## Viscera    0.90     0.90   0.79  0.97    0.93    1.00  0.91  0.49
## Shell      0.90     0.91   0.81  0.96    0.88    0.91  1.00  0.62
## Rings      0.54     0.56   0.54  0.53    0.41    0.49  0.62  1.00
corrplot(corr, type = 'lower', order = 'hclust', tl.col = 'black',
         tl.srt = 45, tl.cex=0.8,  addCoef.col = 'black', number.cex=0.8, col = COL2('RdYlBu'), cl.pos='n')

The correlation matrix shows that “Shell” and “Shucked” have the strongest and lowest correlation with “Rings”,respectively.

To examine the relationship between some of the predictor variables and the dependent variable, we plot “Shell” versus “Rings” and “Shucked” versus “Rings”.

ggplot(data=df,aes(x=Shell,y=Rings,color=Sex))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data=df,aes(x=Shucked,y=Rings,color=Sex))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

The relationship between “Shell” & “Rings” and “Shucked” & “Rings” appears similar for males and females but steeper for infants.

Draw histograms to show the distribution of each variable.

par(mfrow=c(3,3))
Length<-df$Length; hist(Length, col="blue")
Diameter<-df$Diameter; hist(Diameter, col="blue")
Height<-df$Height; hist(Height, col="blue")
Whole<-df$Whole; hist(Whole, col="blue")
Shucked<-df$Shucked; hist(Shucked, col="blue")
Viscera<-df$Viscera; hist(Viscera, col="blue")
Shell<-df$Shell; hist(Shell, col="blue")
Rings<-df$Rings; hist(Rings, col="blue")

Visualize the correlation varies with number of rings

Following the previous analysis, we decided to dig deeper into the variation of the correlation with the number of rings. We tested for a variety of values and discovered that the region limited by Rings 10 has a higher correlation.

corr.df <- df
ring.corr.df <- corr.df %>%  
  filter(Rings <10) %>%
  select(-Sex)

r <- cor(ring.corr.df, use="complete.obs")

ggcorrplot(r, hc.order = TRUE, type = "lower",lab = TRUE)

Correlation between Size attribute and Rings

height.rd <- corr.df %>%  
  filter(Height <0.4 & Rings <10) 

p <- ggplot(ring.corr.df, aes(x=Rings, y=Length)) +
      geom_point() +
      theme(legend.position="none")

q <- ggplot(ring.corr.df, aes(x=Rings, y=Diameter)) +
      geom_point() +
      theme(legend.position="none") 

s <- ggplot(height.rd, aes(x=Rings, y=Height)) +
      geom_point() +
      theme(legend.position="none") 

p1 <- ggMarginal(p, type="histogram")
p1
q1 <- ggMarginal(q, type="histogram")
q1
s1 <- ggMarginal(s, type="histogram")
s1

Visualize number of rings > 10

Observe that the correlation decays drastically near 0 for the number of rings >10. These findings imply that abalones grow in size and weight until they reach a particular age, which is around 10 years. After this age, their size and age stay steady in relation to age.

nb.ring.m <- corr.df %>%  
  filter(Rings >10) %>%
  select(-Sex)

m <- cor(nb.ring.m, use="complete.obs")

ggcorrplot(m, hc.order = TRUE, type = "lower",lab = TRUE)

Classification Model

In here, create a classification model to predict the sex of abalone. To classify Adult or infant

Transform F and M to NonInfant

levels(df$Sex)
## [1] "Infant" "Male"   "Female"
levels(df$Sex) <- c(levels(df$Sex), "NonInfant")
new_df = df 
new_df$Sex[new_df$Sex=="Male"]<-"NonInfant"
new_df$Sex[new_df$Sex=="Female"]<-"NonInfant"
new_df$Sex <- ordered(new_df$Sex, 
                  levels = c("Infant", "NonInfant"), 
                  labels = c("Infant", "NonInfant"))
str(df)
## 'data.frame':    4020 obs. of  9 variables:
##  $ Sex     : Ord.factor w/ 4 levels "Infant"<"Male"<..: 2 2 3 2 1 1 3 3 2 3 ...
##  $ Length  : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height  : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ Whole   : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ Shucked : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ Viscera : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ Shell   : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings   : int  15 7 9 10 7 8 20 16 9 19 ...
head(df,10)
##       Sex Length Diameter Height  Whole Shucked Viscera Shell Rings
## 1    Male  0.455    0.365  0.095 0.5140  0.2245  0.1010 0.150    15
## 2    Male  0.350    0.265  0.090 0.2255  0.0995  0.0485 0.070     7
## 3  Female  0.530    0.420  0.135 0.6770  0.2565  0.1415 0.210     9
## 4    Male  0.440    0.365  0.125 0.5160  0.2155  0.1140 0.155    10
## 5  Infant  0.330    0.255  0.080 0.2050  0.0895  0.0395 0.055     7
## 6  Infant  0.425    0.300  0.095 0.3515  0.1410  0.0775 0.120     8
## 7  Female  0.530    0.415  0.150 0.7775  0.2370  0.1415 0.330    20
## 8  Female  0.545    0.425  0.125 0.7680  0.2940  0.1495 0.260    16
## 9    Male  0.475    0.370  0.125 0.5095  0.2165  0.1125 0.165     9
## 10 Female  0.550    0.440  0.150 0.8945  0.3145  0.1510 0.320    19

Performs stable random split of the data set

TrainingIndex <- createDataPartition(new_df$Sex, p=0.8, list = FALSE) #to split the data 80:20
TrainingSet <- new_df[TrainingIndex,] # Training Set
TestingSet <- new_df[-TrainingIndex,] # Test Set

If any error involves e1071 occur

#install.packages('e1071', dependencies=TRUE)

Build Training model

Model <- train(Sex~ ., data = TrainingSet,
               method = "svmPoly", # SVM model (polynomial kernel)
               na.action = na.omit,
               preProcess=c("scale","center"),
               trControl= trainControl(method="none"),
               tuneGrid = data.frame(degree=1,scale=1,C=1)
)

Build 10 fold Cross-validation to validate the performance of model

Model.cv <- train(Sex ~ ., data = TrainingSet,
                  method = "svmPoly",
                  na.action = na.omit,
                  preProcess=c("scale","center"),
                  trControl= trainControl(method="cv", number=10), #10-fold cv
                  tuneGrid = data.frame(degree=1,scale=1,C=1)
)

Apply model for prediction

Model.training <-predict(Model, TrainingSet) # Apply model to make prediction on Training set
Model.testing <-predict(Model, TestingSet) # Apply model to make prediction on Testing set
Model.cv <-predict(Model.cv, TrainingSet) # Perform cross-validation

Model performance Evaluation (Displays confusion matrix and statistics)

Model.training.confusion <-confusionMatrix(Model.training,as.factor(TrainingSet$Sex))
Model.testing.confusion <-confusionMatrix(Model.testing, as.factor(TestingSet$Sex))
Model.cv.confusion <-confusionMatrix(Model.cv, as.factor(TrainingSet$Sex))

Evaluate performance of the model

SVM model was used. The performance is based on the accuracy, confusion matrix and p-value. The accuracy from the train, test and validation set are around 83%, which mean the model are able to predict correctly for 83% over the dataset. In addition, the confusion matrix tells us that, the precision is around 72% that explain how precise that the model able to predict result of infant abalone. Lastly, the p-value does tells us that how significant that the relationship between the infant and non-infant abalone. Other than that, based on the correlation table, we can observe that the length and whole features are highly correlate to the target variable.

print(Model.training.confusion)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Infant NonInfant
##   Infant       729       285
##   NonInfant    274      1929
##                                           
##                Accuracy : 0.8262          
##                  95% CI : (0.8127, 0.8392)
##     No Information Rate : 0.6882          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5963          
##                                           
##  Mcnemar's Test P-Value : 0.6723          
##                                           
##             Sensitivity : 0.7268          
##             Specificity : 0.8713          
##          Pos Pred Value : 0.7189          
##          Neg Pred Value : 0.8756          
##              Prevalence : 0.3118          
##          Detection Rate : 0.2266          
##    Detection Prevalence : 0.3152          
##       Balanced Accuracy : 0.7990          
##                                           
##        'Positive' Class : Infant          
## 
print(Model.testing.confusion)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Infant NonInfant
##   Infant       189        70
##   NonInfant     61       483
##                                           
##                Accuracy : 0.8369          
##                  95% CI : (0.8095, 0.8618)
##     No Information Rate : 0.6887          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6233          
##                                           
##  Mcnemar's Test P-Value : 0.4846          
##                                           
##             Sensitivity : 0.7560          
##             Specificity : 0.8734          
##          Pos Pred Value : 0.7297          
##          Neg Pred Value : 0.8879          
##              Prevalence : 0.3113          
##          Detection Rate : 0.2354          
##    Detection Prevalence : 0.3225          
##       Balanced Accuracy : 0.8147          
##                                           
##        'Positive' Class : Infant          
## 
print(Model.cv.confusion)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Infant NonInfant
##   Infant       729       285
##   NonInfant    274      1929
##                                           
##                Accuracy : 0.8262          
##                  95% CI : (0.8127, 0.8392)
##     No Information Rate : 0.6882          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5963          
##                                           
##  Mcnemar's Test P-Value : 0.6723          
##                                           
##             Sensitivity : 0.7268          
##             Specificity : 0.8713          
##          Pos Pred Value : 0.7189          
##          Neg Pred Value : 0.8756          
##              Prevalence : 0.3118          
##          Detection Rate : 0.2266          
##    Detection Prevalence : 0.3152          
##       Balanced Accuracy : 0.7990          
##                                           
##        'Positive' Class : Infant          
## 

To find the correlation of each variables.

SexCor <- select(TrainingSet, -Sex) 
cor(SexCor) #see correlation
##             Length  Diameter    Height     Whole   Shucked   Viscera     Shell
## Length   1.0000000 0.9865918 0.8024605 0.9269127 0.8996606 0.9046409 0.8987671
## Diameter 0.9865918 1.0000000 0.8082910 0.9264385 0.8940024 0.9004564 0.9050096
## Height   0.8024605 0.8082910 1.0000000 0.7980004 0.7543802 0.7787534 0.7957477
## Whole    0.9269127 0.9264385 0.7980004 1.0000000 0.9708241 0.9659322 0.9558026
## Shucked  0.8996606 0.8940024 0.7543802 0.9708241 1.0000000 0.9310090 0.8813069
## Viscera  0.9046409 0.9004564 0.7787534 0.9659322 0.9310090 1.0000000 0.9057781
## Shell    0.8987671 0.9050096 0.7957477 0.9558026 0.8813069 0.9057781 1.0000000
## Rings    0.5401054 0.5578540 0.5284526 0.5283171 0.4074314 0.4910791 0.6223938
##              Rings
## Length   0.5401054
## Diameter 0.5578540
## Height   0.5284526
## Whole    0.5283171
## Shucked  0.4074314
## Viscera  0.4910791
## Shell    0.6223938
## Rings    1.0000000

To give a simple visualization of different Sex of abalone which is Female and Male using scatterplot. Length and Whole are chosen.

b <- ggplot(TrainingSet, aes(x = `Length`, y = `Whole`))
b + geom_point(aes(shape = Sex, color = Sex)) +
  scale_color_manual(values = c("#FC4E07","#00AFBB"))
## Warning: Using shapes for an ordinal variable is not advised

Regression Model

This model is used to predict the age of the abalone with the variables given after data cleaning.

Split into training data and test data

Reg_TrainingIndex <- createDataPartition(df$Ring, p=0.8, list = FALSE) 
Reg_TrainingSet <- df[Reg_TrainingIndex,] # Training Set
Reg_TestingSet <- df[-Reg_TrainingIndex,] # Test Set

Build regression model: random forest

Reg_model <- train(Rings~ ., data = Reg_TrainingSet,
                   method = "rf", 
                   trControl= trainControl(method="repeatedcv", number=10, repeats=3),
                   tuneGrid = expand.grid(.mtry=round(sqrt(ncol(Reg_TrainingSet[,1:5])))),
                   metric =  "RMSE")

Evaluate performance of the model

The RMSE, R-Squared, and MAE metrics are used to evaluate the prediction error rates and model performance in regression analysis. The model shows RMSE value of 2.19 which tells us that this model able to fit the dataset well as it is better to have RMSE as low as possible. Meanwhile, R-Squared indicate a high value and able to explain 54.2% of the variations. As for MAE, the values predicted by the model and the actual values has the mean absolute difference of 1.544. The lower value of MAE indicate better fit of the model.

Reg_model$results
##   mtry     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
## 1    2 2.179215 0.5380188 1.540265 0.1219902 0.03550781 0.0661215
Reg_model.training <- predict(Reg_model, Reg_TrainingSet)
Reg_model.testing <- predict(Reg_model, Reg_TestingSet)
sprintf("Train rmse: %f", rmse(Reg_TrainingSet$Rings, Reg_model.training))
## [1] "Train rmse: 1.190251"
sprintf("Test rmse: %f", rmse(Reg_TestingSet$Rings, Reg_model.testing))
## [1] "Test rmse: 2.161742"
actual_pred <-  data.frame(cbind(Reg_TestingSet$Rings, predicteds=Reg_model.testing)) 
correlation_accuracy <- cor(actual_pred)
sprintf("Correlation accuracy of the predictions: %f ", correlation_accuracy[2])
## [1] "Correlation accuracy of the predictions: 0.755809 "
min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max)) 
sprintf("Min-Max accuracy: %f ", min_max_accuracy)
## [1] "Min-Max accuracy: 0.869210 "