Introduction with Dataset

https://archive.ics.uci.edu/ml/datasets/Abalone

Objective

The goal is to predict the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age instead.

Name / Data Type / Measurement Unit / Description

Sex / nominal / – / M, F, and I (infant)

Length / continuous / mm / Longest shell measurement

Diameter / continuous / mm / perpendicular to length

Height / continuous / mm / with meat in shell

Whole weight / continuous / grams / whole abalone

Shucked weight / continuous / grams / weight of meat

Viscera weight / continuous / grams / gut weight (after bleeding)

Shell weight / continuous / grams / after being dried

Rings / integer / – / +1.5 gives the age in years
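
The last row of the table amounts to a one-line conversion. As a minimal sketch (the helper name rings_to_age is hypothetical and not part of the original analysis), the rule can be written in R as:

```r
# Hypothetical helper: convert a ring count to an approximate age in years,
# using the "+1.5" rule from the dataset description.
rings_to_age <- function(rings) {
  rings + 1.5
}

# Ring counts of the first three abalones in the data (15, 7, 9)
ages <- rings_to_age(c(15, 7, 9))
print(ages)
```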

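# stringsAsFactors = TRUE keeps sex as a factor on R >= 4.0.0, matching the output below
train <- read.csv("abalone.csv", header = TRUE, na.strings = c("", "NA"), stringsAsFactors = TRUE)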

str(train)
## 'data.frame':    4177 obs. of  9 variables:
##  $ sex     : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ length  : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ diameter: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ height  : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ shucked : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ viscera : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ shell   : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ rings   : int  15 7 9 10 7 8 20 16 9 19 ...

Using Decision Tree

We want to find out whether decision tree rules can classify abalones into the correct age group.

Dependent variable = rings

In this step, we’ll encode the dependent variable rings into 3 levels.

This gives the algorithm a small set of classes to predict. Abalones are grouped into those with fewer than 6 rings (<7.5 years old), 6 to 13 rings (7.5 to 14.5 years old), and more than 13 rings (>14.5 years old), labelled young, adult, and old respectively.
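
The bucket boundaries can be checked on a handful of values before touching the real data. The sketch below is an illustration only: it uses base R's cut() and a few hypothetical ring counts chosen to sit on the boundaries.

```r
# Hypothetical ring counts chosen to sit on the bucket boundaries
rings <- c(1, 5, 6, 13, 14, 29)

# Right-closed intervals (0,5], (5,13], (13,Inf] match the three age groups
age <- cut(rings,
           breaks = c(0, 5, 13, Inf),
           labels = c("young", "adult", "old"))

print(data.frame(rings, age))
```

Note that cut() orders the factor levels as young, adult, old, whereas as.factor() applied to a character column sorts them alphabetically (adult, old, young).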

suppressWarnings(suppressMessages(library(dplyr)))
train1 <- train %>%
  mutate(age=case_when(
    rings %in% 1:5 ~ "young",
    rings %in% 6:13 ~ "adult",
    rings %in% 14:30 ~ "old"
  ))

#convert AGE into factor
train1$age <- as.factor(train1$age)
str(train1)
## 'data.frame':    4177 obs. of  10 variables:
##  $ sex     : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ length  : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ diameter: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ height  : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ shucked : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ viscera : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ shell   : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ rings   : int  15 7 9 10 7 8 20 16 9 19 ...
##  $ age     : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
# Now develop the regression tree
suppressWarnings(suppressMessages(library(rpart)))
suppressWarnings(suppressMessages(library(partykit)))

# remove the RINGS
myvars <- names(train1) %in% c("rings") 
train1 <- train1[!myvars]
str(train1)
## 'data.frame':    4177 obs. of  9 variables:
##  $ sex     : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ length  : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ diameter: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ height  : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ shucked : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ viscera : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ shell   : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ age     : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
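# NOTE: age is a factor, so method = "anova" fits a regression tree on its
# integer level codes (adult = 1, old = 2, young = 3)
abalone_model1 <- rpart(age~., data=train1, method = "anova", control=rpart.control(minsplit=60, minbucket=30, maxdepth=10))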

plot(as.party(abalone_model1))

#print(abalone_model1)
#summary(abalone_model1)
rsq.rpart(abalone_model1)
## 
## Regression tree:
## rpart(formula = age ~ ., data = train1, method = "anova", control = rpart.control(minsplit = 60, 
##     minbucket = 30, maxdepth = 10))
## 
## Variables actually used in tree construction:
## [1] diameter shell    shucked  viscera 
## 
## Root node error: 1065.6/4177 = 0.25512
## 
## n= 4177 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.286572      0   1.00000 1.00022 0.040673
## 2 0.024325      1   0.71343 0.74737 0.030723
## 3 0.017725      2   0.68910 0.72389 0.030643
## 4 0.014584      4   0.65365 0.70421 0.029932
## 5 0.010000      5   0.63907 0.66818 0.028987

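R-squared is one minus the relative error of the final tree, which is only 1 - 0.639 = 0.361. The predictors therefore explain little of the variation in the encoded age variable, so this is clearly not a good model.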
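
The arithmetic can be reproduced from the last row of the table printed by rsq.rpart. The sketch below copies the two error values by hand rather than reading them from the model object, and also computes the cross-validated counterpart from the xerror column:

```r
# Last row of the CP table printed by rsq.rpart(abalone_model1)
rel_error <- 0.63907  # error on the training data, relative to the root node
xerror    <- 0.66818  # cross-validated relative error

r2_apparent <- 1 - rel_error
r2_cv       <- 1 - xerror

print(round(c(apparent = r2_apparent, cross_validated = r2_cv), 3))
```

The cross-validated value is slightly lower than the apparent one, which is expected because it is not measured on the data used to grow the tree.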

Let’s try adding volume

# volume is approximated as length x diameter x height
# add it as a new column

train2 <- train1 %>%  mutate(volume = length * diameter * height)
str(train2)
## 'data.frame':    4177 obs. of  10 variables:
##  $ sex     : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ length  : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ diameter: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ height  : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ shucked : num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ viscera : num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ shell   : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ age     : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
##  $ volume  : num  0.01578 0.00835 0.03005 0.02007 0.00673 ...
head(train2)
##   sex length diameter height weight shucked viscera shell   age     volume
## 1   M  0.455    0.365  0.095 0.5140  0.2245  0.1010 0.150   old 0.01577712
## 2   M  0.350    0.265  0.090 0.2255  0.0995  0.0485 0.070 adult 0.00834750
## 3   F  0.530    0.420  0.135 0.6770  0.2565  0.1415 0.210 adult 0.03005100
## 4   M  0.440    0.365  0.125 0.5160  0.2155  0.1140 0.155 adult 0.02007500
## 5   I  0.330    0.255  0.080 0.2050  0.0895  0.0395 0.055 adult 0.00673200
## 6   I  0.425    0.300  0.095 0.3515  0.1410  0.0775 0.120 adult 0.01211250

Let’s try model 2 with only a subset of the variables, for example length and diameter. Note that the new volume column is not used in this model.

abalone_model2 <- rpart(age ~ length + diameter, data=train2, method = "anova", control=rpart.control(minsplit=60, minbucket=30, maxdepth=10))

plot(as.party(abalone_model2))

#print(abalone_model2)
#summary(abalone_model2)
rsq.rpart(abalone_model2)
## 
## Regression tree:
## rpart(formula = age ~ length + diameter, data = train2, method = "anova", 
##     control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 10))
## 
## Variables actually used in tree construction:
## [1] diameter
## 
## Root node error: 1065.6/4177 = 0.25512
## 
## n= 4177 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.283742      0   1.00000 1.00046 0.040684
## 2 0.034724      1   0.71626 0.73977 0.029163
## 3 0.010000      2   0.68153 0.70809 0.028890
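
Both models above are regression trees, although age is a three-level factor. A classification tree (method = "class") treats the levels as categories instead of numbers. The sketch below shows that variant on synthetic data, not the real abalone measurements: the data frame toy, its single predictor, and the way rings are simulated are all assumptions made for illustration, so the numbers it prints say nothing about the abalone data.

```r
library(rpart)

set.seed(42)

# Synthetic stand-in for the abalone data (NOT the real measurements):
# one size-like predictor and a three-level age factor loosely tied to it.
n <- 300
diameter <- runif(n, min = 0.05, max = 0.65)
rings <- pmax(1, round(3 + 20 * diameter + rnorm(n, sd = 2)))
toy <- data.frame(
  diameter = diameter,
  age = cut(rings, breaks = c(0, 5, 13, Inf),
            labels = c("young", "adult", "old"))
)

# method = "class" treats age as categories rather than as numbers
toy_model <- rpart(age ~ diameter, data = toy, method = "class",
                   control = rpart.control(minsplit = 60, minbucket = 30,
                                           maxdepth = 10))

# Confusion matrix and accuracy on the training data
pred <- predict(toy_model, newdata = toy, type = "class")
confusion <- table(actual = toy$age, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

On the real data the equivalent call would be rpart(age ~ ., data = train1, method = "class", ...); its results are not reported here.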

Conclusion of Decision Tree

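With these trees, the age group of abalone cannot be reliably predicted from the physical measurements: the best model explains only about 36% of the variation in the encoded age variable. Counting the number of rings through a microscope, a boring and time-consuming task, therefore remains the reliable way to determine the age of abalone.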