Predicting the Age of Abalone using Classification and Regression Models
Predicting the age of abalone is possible through physical measurements. The abalone add one ring to their shell every year, except in their first year. Theoretically, the age of abalone can be accurately estimated by adding 1.5 to the number of rings. In real-life, the age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Therefore, other measurements (e.g., weight and length of the abalone), which are easier to obtain, are used to predict the age. In this project, we aim to use the available dataset to develop Classification and Linear Regression models, which will help in accurately predict the age of abalone.
There are seven predictor variables included in the dataset. The first three constitute the shell measurements:
The other four variables are related to the weight of abalone:
The response variable, that is the dependent variable in this analysis is:
Initial Questions
Objectives
To determine the age of abalone from ‘Sex’ feature and physical measurement. In this project, supervised learning algorithms are used to conclude the output data.
Import the csv file from Google drive.
id <- "18dk-yeiB7jFTRv1KR3bTzgiya3dwnKCc" # sharable google drive link
df<-read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
Load the required libraries.
defaultW <- getOption("warn")
options(warn = -1)
library(plyr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(corrplot)
library(caret)
library(randomForest)
library(Metrics)
library(DataExplorer)
library(ggcorrplot)
library(ggExtra)
options(warn = defaultW)
Inspect the dataset.
class(df)
## [1] "data.frame"
typeof(df)
## [1] "list"
dim(df)
## [1] 4187 9
ncol(df)
## [1] 9
nrow((df))
## [1] 4187
colnames(df)
## [1] "Sex" "Length" "Diameter" "Height"
## [5] "Whole.weight" "Shucked.weight" "Viscera.weight" "Shell.weight"
## [9] "Rings"
str(df)
## 'data.frame': 4187 obs. of 9 variables:
## $ Sex : chr "M" "M" "F" "M" ...
## $ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ Whole.weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ Shucked.weight: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ Viscera.weight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ Shell.weight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
glimpse(df)
## Rows: 4,187
## Columns: 9
## $ Sex <chr> "M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F", ~
## $ Length <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545,~
## $ Diameter <dbl> 0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425,~
## $ Height <dbl> 0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125,~
## $ Whole.weight <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775,~
## $ Shucked.weight <dbl> 0.2245, 0.0995, 0.2565, 0.2155, 0.0895, 0.1410, 0.2370,~
## $ Viscera.weight <dbl> 0.1010, 0.0485, 0.1415, 0.1140, 0.0395, 0.0775, 0.1415,~
## $ Shell.weight <dbl> 0.150, 0.070, 0.210, 0.155, 0.055, 0.120, 0.330, 0.260,~
## $ Rings <int> 15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, ~
summary(df)
## Sex Length Diameter Height
## Length:4187 Min. :0.075 Min. :0.0550 Min. :0.0000
## Class :character 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150
## Mode :character Median :0.545 Median :0.4250 Median :0.1400
## Mean :0.524 Mean :0.4078 Mean :0.1395
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650
## Max. :0.815 Max. :0.6500 Max. :1.1300
## NA's :1 NA's :1 NA's :1
## Whole.weight Shucked.weight Viscera.weight Shell.weight
## Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015
## 1st Qu.:0.4421 1st Qu.:0.1862 1st Qu.:0.0935 1st Qu.:0.1300
## Median :0.7995 Median :0.3355 Median :0.1710 Median :0.2335
## Mean :0.8286 Mean :0.3593 Mean :0.1805 Mean :0.2388
## 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2527 3rd Qu.:0.3289
## Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050
## NA's :1 NA's :1
## Rings
## Min. : 1.000
## 1st Qu.: 8.000
## Median : 9.000
## Mean : 9.932
## 3rd Qu.:11.000
## Max. :29.000
##
head(df, 5)
## Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## Shell.weight Rings
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
tail(df, 5)
## Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 4183 F 0.565 0.450 0.165 0.8870 0.3700 0.2390
## 4184 M 0.590 0.440 0.135 0.9660 0.4390 0.2145
## 4185 M 0.600 0.475 0.205 1.1760 0.5255 0.2875
## 4186 F 0.625 0.485 0.150 1.0945 0.5310 0.2610
## 4187 M 0.710 0.555 0.195 1.9485 0.9455 0.3765
## Shell.weight Rings
## 4183 0.2490 11
## 4184 0.2605 10
## 4185 0.3080 9
## 4186 0.2960 10
## 4187 0.4950 12
Visualize the data description
plot_intro(df)
Check for duplicate rows and remove them; then check the number of duplicate rows again.
df[duplicated(df),]
## Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 4025 M 0.660 0.485 0.155 1.2275 0.6100 0.2740
## 4105 M 0.635 0.500 0.180 1.2915 0.5940 0.2695
## 4123 F 0.570 0.450 0.150 0.9645 0.5310 0.1890
## 4135 I 0.540 0.415 0.135 0.7090 0.3195 0.1740
## 4168 M 0.475 0.360 0.140 0.5135 0.2410 0.1045
## Shell.weight Rings
## 4025 0.300 8
## 4105 0.370 9
## 4123 0.209 9
## 4135 0.185 9
## 4168 0.155 8
df<-df[!duplicated(df),]
sum(duplicated(df))
## [1] 0
Check for rows containing NA or missing values.
df[!complete.cases(df),]
## Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 3893 F 0.500 0.40 0.150 0.8085 0.2730 0.1120
## 3923 M 0.375 0.28 NA 0.2225 0.0875 0.0430
## 3951 F 0.530 NA 0.165 0.7720 0.2855 0.1975
## 3980 I NA 0.35 0.135 0.4940 0.1925 0.0945
## 4001 I 0.315 0.23 0.000 NA 0.0575 0.0285
## Shell.weight Rings
## 3893 NA 13
## 3923 0.0800 10
## 3951 0.2300 12
## 3980 0.1405 7
## 4001 0.3640 6
which(is.na(df), arr.ind=TRUE)
## row col
## 3980 3980 2
## 3951 3951 3
## 3923 3923 4
## 4001 4001 5
## 3893 3893 8
Visualize the missing value distribution
plot_missing(df)
The 2nd, 3rd, 4th, 5th and 8th columns (all with “numeric” type data) contain missing values; therefore, we impute them by mean.
df[2][is.na(df[2])] <- mean(df[,2], na.rm = TRUE)
df[3][is.na(df[3])] <- mean(df[,3], na.rm = TRUE)
df[4][is.na(df[4])] <- mean(df[,4], na.rm = TRUE)
df[5][is.na(df[5])] <- mean(df[,5], na.rm = TRUE)
df[8][is.na(df[8])] <- mean(df[,8], na.rm = TRUE)
Check if there is any remaining missing values.
sum(!complete.cases(df))
## [1] 0
Visualize missing values if there is any
plot_intro(df)
The minimum value of ’Height” column is 0.0; we inspect and get the total number of rows with “Height” = 0.
df[df$Height==0,]
## Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1258 I 0.430 0.34 0 0.4280000 0.2065 0.0860
## 4001 I 0.315 0.23 0 0.8284987 0.0575 0.0285
## 4002 I 0.315 0.23 0 0.1340000 0.0575 0.0285
## Shell.weight Rings
## 1258 0.1150 8
## 4001 0.3640 6
## 4002 0.3505 6
nrow(df[df$Height==0,])
## [1] 3
The next cleaning steps are:
df<- df %>%
mutate_if(is.character, str_trim) %>%
mutate(Weight.diff=Whole.weight-(Viscera.weight + Shucked.weight + Shell.weight)) %>%
subset(Height>0) %>%
subset(Weight.diff>0) %>%
select(-Weight.diff)
Rename the columns.
names(df)<-c("Sex", "Length", "Diameter", "Height", "Whole", "Shucked", "Viscera", "Shell", "Rings")
Convert “Sex” column into ‘factor’ type.
df$Sex <- ordered(df$Sex,
levels = c("I", "M", "F"),
labels = c("Infant", "Male", "Female"))
Create two way contingency table.
table('Sex'=df$Sex,'Rings'=df$Rings)
## Rings
## Sex 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## Infant 1 10 42 87 200 252 255 163 91 62 21 24 14 8 6 7 5
## Male 0 2 6 11 22 76 170 269 289 218 115 88 56 51 30 25 18
## Female 0 0 0 4 16 41 118 230 241 198 125 88 56 41 30 26 19
## Rings
## Sex 19 20 21 22 23 24 25 26 27 29
## Infant 2 2 1 0 0 0 0 0 0 0
## Male 14 12 6 3 3 1 0 1 1 0
## Female 15 12 7 3 6 1 1 0 1 1
Show the preprocessed dataset.
head(df,10)
## Sex Length Diameter Height Whole Shucked Viscera Shell Rings
## 1 Male 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
## 2 Male 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
## 3 Female 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
## 4 Male 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
## 5 Infant 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
## 6 Infant 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
## 7 Female 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
## 8 Female 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
## 9 Male 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
## 10 Female 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
Visualize each variable and sort it by ‘Sex’ feature
Most of the variables are positively skewed.
plot_qq(df, by = "Sex")
Visualize the distribution of “Sex”.
ggplot(data=df,aes(x=Sex,fill=Sex))+geom_bar()
Create boxplots and violin plots of “Sex” versus “Whole”.
qplot(Sex, Whole, data = df, geom = "boxplot", fill = Sex)
qplot(Sex, Whole, data = df, geom = "violin", fill = Sex)
Create a scatterplot of “Whole” versus “Shucked”, “Whole” versus “Viscera” and “Whole” versus “Shell” to examine how they relate to each other.
qplot(Whole, Shucked, data = df, color = Sex)
qplot(Whole, Viscera, data = df, color = Sex)
qplot(Whole, Shell, data = df, color = Sex)
It appears that “Whole” versus “Shucked” has the strongest positive correlation.
Create a correlation matrix.
corr<-cor(df[c(-1,-10)], method = "pearson", use = "complete.obs")
round(corr, 2)
## Length Diameter Height Whole Shucked Viscera Shell Rings
## Length 1.00 0.99 0.82 0.93 0.90 0.90 0.90 0.54
## Diameter 0.99 1.00 0.83 0.93 0.90 0.90 0.91 0.56
## Height 0.82 0.83 1.00 0.81 0.77 0.79 0.81 0.54
## Whole 0.93 0.93 0.81 1.00 0.97 0.97 0.96 0.53
## Shucked 0.90 0.90 0.77 0.97 1.00 0.93 0.88 0.41
## Viscera 0.90 0.90 0.79 0.97 0.93 1.00 0.91 0.49
## Shell 0.90 0.91 0.81 0.96 0.88 0.91 1.00 0.62
## Rings 0.54 0.56 0.54 0.53 0.41 0.49 0.62 1.00
corrplot(corr, type = 'lower', order = 'hclust', tl.col = 'black',
tl.srt = 45, tl.cex=0.8, addCoef.col = 'black', number.cex=0.8, col = COL2('RdYlBu'), cl.pos='n')
The correlation matrix shows that “Shell” and “Shucked” have the strongest and lowest correlation with “Rings”,respectively.
To examine the relationship between some of the predictor variables and the dependent variable, we plot “Shell” versus “Rings” and “Shucked” versus “Rings”.
ggplot(data=df,aes(x=Shell,y=Rings,color=Sex))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(data=df,aes(x=Shucked,y=Rings,color=Sex))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
The relationship between “Shell” & “Rings” and “Shucked” & “Rings” appears similar for males and females but steeper for infants.
Draw histograms to show the distribution of each variable.
par(mfrow=c(3,3))
Length<-df$Length; hist(Length, col="blue")
Diameter<-df$Diameter; hist(Diameter, col="blue")
Height<-df$Height; hist(Height, col="blue")
Whole<-df$Whole; hist(Whole, col="blue")
Shucked<-df$Shucked; hist(Shucked, col="blue")
Viscera<-df$Viscera; hist(Viscera, col="blue")
Shell<-df$Shell; hist(Shell, col="blue")
Rings<-df$Rings; hist(Rings, col="blue")
Visualize the correlation varies with number of rings
Following the previous analysis, we decided to dig deeper into the variation of the correlation with the number of rings. We tested for a variety of values and discovered that the region limited by Rings 10 has a higher correlation.
corr.df <- df
ring.corr.df <- corr.df %>%
filter(Rings <10) %>%
select(-Sex)
r <- cor(ring.corr.df, use="complete.obs")
ggcorrplot(r, hc.order = TRUE, type = "lower",lab = TRUE)
Correlation between Size attribute and Rings
height.rd <- corr.df %>%
filter(Height <0.4 & Rings <10)
p <- ggplot(ring.corr.df, aes(x=Rings, y=Length)) +
geom_point() +
theme(legend.position="none")
q <- ggplot(ring.corr.df, aes(x=Rings, y=Diameter)) +
geom_point() +
theme(legend.position="none")
s <- ggplot(height.rd, aes(x=Rings, y=Height)) +
geom_point() +
theme(legend.position="none")
p1 <- ggMarginal(p, type="histogram")
p1
q1 <- ggMarginal(q, type="histogram")
q1
s1 <- ggMarginal(s, type="histogram")
s1
Visualize number of rings > 10
Observe that the correlation decays drastically near 0 for the number of rings >10. These findings imply that abalones grow in size and weight until they reach a particular age, which is around 10 years. After this age, their size and age stay steady in relation to age.
nb.ring.m <- corr.df %>%
filter(Rings >10) %>%
select(-Sex)
m <- cor(nb.ring.m, use="complete.obs")
ggcorrplot(m, hc.order = TRUE, type = "lower",lab = TRUE)
In here, create a classification model to predict the sex of abalone. To classify Adult or infant
Transform F and M to NonInfant
levels(df$Sex)
## [1] "Infant" "Male" "Female"
levels(df$Sex) <- c(levels(df$Sex), "NonInfant")
new_df = df
new_df$Sex[new_df$Sex=="Male"]<-"NonInfant"
new_df$Sex[new_df$Sex=="Female"]<-"NonInfant"
new_df$Sex <- ordered(new_df$Sex,
levels = c("Infant", "NonInfant"),
labels = c("Infant", "NonInfant"))
str(df)
## 'data.frame': 4020 obs. of 9 variables:
## $ Sex : Ord.factor w/ 4 levels "Infant"<"Male"<..: 2 2 3 2 1 1 3 3 2 3 ...
## $ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ Diameter: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ Whole : num 0.514 0.226 0.677 0.516 0.205 ...
## $ Shucked : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ Viscera : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ Shell : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
head(df,10)
## Sex Length Diameter Height Whole Shucked Viscera Shell Rings
## 1 Male 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
## 2 Male 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
## 3 Female 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
## 4 Male 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
## 5 Infant 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
## 6 Infant 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
## 7 Female 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
## 8 Female 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.260 16
## 9 Male 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.165 9
## 10 Female 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.320 19
Performs stable random split of the data set
TrainingIndex <- createDataPartition(new_df$Sex, p=0.8, list = FALSE) #to split the data 80:20
TrainingSet <- new_df[TrainingIndex,] # Training Set
TestingSet <- new_df[-TrainingIndex,] # Test Set
If any error involves e1071 occur
#install.packages('e1071', dependencies=TRUE)
Build Training model
Model <- train(Sex~ ., data = TrainingSet,
method = "svmPoly", # SVM model (polynomial kernel)
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
Build 10 fold Cross-validation to validate the performance of model
Model.cv <- train(Sex ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="cv", number=10), #10-fold cv
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
Apply model for prediction
Model.training <-predict(Model, TrainingSet) # Apply model to make prediction on Training set
Model.testing <-predict(Model, TestingSet) # Apply model to make prediction on Testing set
Model.cv <-predict(Model.cv, TrainingSet) # Perform cross-validation
Model performance Evaluation (Displays confusion matrix and statistics)
Model.training.confusion <-confusionMatrix(Model.training,as.factor(TrainingSet$Sex))
Model.testing.confusion <-confusionMatrix(Model.testing, as.factor(TestingSet$Sex))
Model.cv.confusion <-confusionMatrix(Model.cv, as.factor(TrainingSet$Sex))
Evaluate performance of the model
SVM model was used. The performance is based on the accuracy, confusion matrix and p-value. The accuracy from the train, test and validation set are around 83%, which mean the model are able to predict correctly for 83% over the dataset. In addition, the confusion matrix tells us that, the precision is around 72% that explain how precise that the model able to predict result of infant abalone. Lastly, the p-value does tells us that how significant that the relationship between the infant and non-infant abalone. Other than that, based on the correlation table, we can observe that the length and whole features are highly correlate to the target variable.
print(Model.training.confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Infant NonInfant
## Infant 729 285
## NonInfant 274 1929
##
## Accuracy : 0.8262
## 95% CI : (0.8127, 0.8392)
## No Information Rate : 0.6882
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5963
##
## Mcnemar's Test P-Value : 0.6723
##
## Sensitivity : 0.7268
## Specificity : 0.8713
## Pos Pred Value : 0.7189
## Neg Pred Value : 0.8756
## Prevalence : 0.3118
## Detection Rate : 0.2266
## Detection Prevalence : 0.3152
## Balanced Accuracy : 0.7990
##
## 'Positive' Class : Infant
##
print(Model.testing.confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Infant NonInfant
## Infant 189 70
## NonInfant 61 483
##
## Accuracy : 0.8369
## 95% CI : (0.8095, 0.8618)
## No Information Rate : 0.6887
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6233
##
## Mcnemar's Test P-Value : 0.4846
##
## Sensitivity : 0.7560
## Specificity : 0.8734
## Pos Pred Value : 0.7297
## Neg Pred Value : 0.8879
## Prevalence : 0.3113
## Detection Rate : 0.2354
## Detection Prevalence : 0.3225
## Balanced Accuracy : 0.8147
##
## 'Positive' Class : Infant
##
print(Model.cv.confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Infant NonInfant
## Infant 729 285
## NonInfant 274 1929
##
## Accuracy : 0.8262
## 95% CI : (0.8127, 0.8392)
## No Information Rate : 0.6882
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5963
##
## Mcnemar's Test P-Value : 0.6723
##
## Sensitivity : 0.7268
## Specificity : 0.8713
## Pos Pred Value : 0.7189
## Neg Pred Value : 0.8756
## Prevalence : 0.3118
## Detection Rate : 0.2266
## Detection Prevalence : 0.3152
## Balanced Accuracy : 0.7990
##
## 'Positive' Class : Infant
##
To find the correlation of each variables.
SexCor <- select(TrainingSet, -Sex)
cor(SexCor) #see correlation
## Length Diameter Height Whole Shucked Viscera Shell
## Length 1.0000000 0.9865918 0.8024605 0.9269127 0.8996606 0.9046409 0.8987671
## Diameter 0.9865918 1.0000000 0.8082910 0.9264385 0.8940024 0.9004564 0.9050096
## Height 0.8024605 0.8082910 1.0000000 0.7980004 0.7543802 0.7787534 0.7957477
## Whole 0.9269127 0.9264385 0.7980004 1.0000000 0.9708241 0.9659322 0.9558026
## Shucked 0.8996606 0.8940024 0.7543802 0.9708241 1.0000000 0.9310090 0.8813069
## Viscera 0.9046409 0.9004564 0.7787534 0.9659322 0.9310090 1.0000000 0.9057781
## Shell 0.8987671 0.9050096 0.7957477 0.9558026 0.8813069 0.9057781 1.0000000
## Rings 0.5401054 0.5578540 0.5284526 0.5283171 0.4074314 0.4910791 0.6223938
## Rings
## Length 0.5401054
## Diameter 0.5578540
## Height 0.5284526
## Whole 0.5283171
## Shucked 0.4074314
## Viscera 0.4910791
## Shell 0.6223938
## Rings 1.0000000
To give a simple visualization of different Sex of abalone which is Female and Male using scatterplot. Length and Whole are chosen.
b <- ggplot(TrainingSet, aes(x = `Length`, y = `Whole`))
b + geom_point(aes(shape = Sex, color = Sex)) +
scale_color_manual(values = c("#FC4E07","#00AFBB"))
## Warning: Using shapes for an ordinal variable is not advised
This model is used to predict the age of the abalone with the variables given after data cleaning.
Split into training data and test data
Reg_TrainingIndex <- createDataPartition(df$Ring, p=0.8, list = FALSE)
Reg_TrainingSet <- df[Reg_TrainingIndex,] # Training Set
Reg_TestingSet <- df[-Reg_TrainingIndex,] # Test Set
Build regression model: random forest
Reg_model <- train(Rings~ ., data = Reg_TrainingSet,
method = "rf",
trControl= trainControl(method="repeatedcv", number=10, repeats=3),
tuneGrid = expand.grid(.mtry=round(sqrt(ncol(Reg_TrainingSet[,1:5])))),
metric = "RMSE")
Evaluate performance of the model
The RMSE, R-Squared, and MAE metrics are used to evaluate the prediction error rates and model performance in regression analysis. The model shows RMSE value of 2.19 which tells us that this model able to fit the dataset well as it is better to have RMSE as low as possible. Meanwhile, R-Squared indicate a high value and able to explain 54.2% of the variations. As for MAE, the values predicted by the model and the actual values has the mean absolute difference of 1.544. The lower value of MAE indicate better fit of the model.
Reg_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 2.179215 0.5380188 1.540265 0.1219902 0.03550781 0.0661215
Reg_model.training <- predict(Reg_model, Reg_TrainingSet)
Reg_model.testing <- predict(Reg_model, Reg_TestingSet)
sprintf("Train rmse: %f", rmse(Reg_TrainingSet$Rings, Reg_model.training))
## [1] "Train rmse: 1.190251"
sprintf("Test rmse: %f", rmse(Reg_TestingSet$Rings, Reg_model.testing))
## [1] "Test rmse: 2.161742"
actual_pred <- data.frame(cbind(Reg_TestingSet$Rings, predicteds=Reg_model.testing))
correlation_accuracy <- cor(actual_pred)
sprintf("Correlation accuracy of the predictions: %f ", correlation_accuracy[2])
## [1] "Correlation accuracy of the predictions: 0.755809 "
min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))
sprintf("Min-Max accuracy: %f ", min_max_accuracy)
## [1] "Min-Max accuracy: 0.869210 "