Predicting the AGE of abalone from physical measurements. The AGE of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.
Name / Data type / Unit / Description
Sex / nominal / – / M, F, and I (infant)
Length / continuous / mm / longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / – / +1.5 gives the age in years
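# stringsAsFactors = TRUE keeps sex as a Factor (as in the output below) on R >= 4.0
train <- read.csv("abalone.csv", header = TRUE, na.strings = c("", "NA"), stringsAsFactors = TRUE)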
str(train)
## 'data.frame': 4177 obs. of 9 variables:
## $ sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ diameter: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ shucked : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ viscera : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ shell : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ rings : int 15 7 9 10 7 8 20 16 9 19 ...
We want to see whether decision-tree rules can classify abalone into the correct age group.
Dependent variable = rings
In this step, we’ll encode the dependent variable rings into 3 levels, so that the tree predicts an age group rather than an exact ring count. The encoding is:
young: fewer than 6 rings (<7.5 years old),
adult: 6 to 13 rings (7.5 to 14.5 years old),
old: more than 13 rings (>14.5 years old).
suppressPackageStartupMessages(library(dplyr))
train1 <- train %>%
  mutate(age = case_when(
    rings %in% 1:5   ~ "young",
    rings %in% 6:13  ~ "adult",
    rings %in% 14:30 ~ "old"
  ))
#convert AGE into factor
train1$age <- as.factor(train1$age)
str(train1)
## 'data.frame': 4177 obs. of 10 variables:
## $ sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ diameter: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ shucked : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ viscera : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ shell : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ rings : int 15 7 9 10 7 8 20 16 9 19 ...
## $ age : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
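The same three-way encoding can also be written with base R's cut(), which lets the levels be ordered young < adult < old instead of alphabetically. The following is only a minimal sketch: the vector reuses the first ten ring counts from the str() output, plus one hypothetical value (4) added here so that the young bucket is exercised.

```r
# First ten ring counts from the str() output above, plus one
# hypothetical value (4) so that every bucket is exercised.
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 4)

# Intervals are right-closed by default: (0,5], (5,13], (13,Inf]
age <- cut(rings,
           breaks = c(0, 5, 13, Inf),
           labels = c("young", "adult", "old"),
           ordered_result = TRUE)

# Age in years, using the "rings + 1.5" rule from the variable table
age_years <- rings + 1.5

print(data.frame(rings, age_years, age))
print(table(age))
```

An ordered factor keeps the natural age order in tables and plots; the case_when() version above yields the same groups but with alphabetical levels (adult, old, young).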
# Now develop the regression tree
suppressWarnings(suppressMessages(library(rpart)))
suppressWarnings(suppressMessages(library(partykit)))
# remove rings: age was derived from it, so keeping it as a predictor would leak the answer
myvars <- names(train1) %in% c("rings")
train1 <- train1[!myvars]
str(train1)
## 'data.frame': 4177 obs. of 9 variables:
## $ sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ diameter: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ shucked : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ viscera : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ shell : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ age : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
abalone_model1 <- rpart(age~., data=train1, method = "anova", control=rpart.control(minsplit=60, minbucket=30, maxdepth=10))
plot(as.party(abalone_model1))
#print(abalone_model1)
#summary(abalone_model1)
rsq.rpart(abalone_model1)
##
## Regression tree:
## rpart(formula = age ~ ., data = train1, method = "anova", control = rpart.control(minsplit = 60,
## minbucket = 30, maxdepth = 10))
##
## Variables actually used in tree construction:
## [1] diameter shell shucked viscera
##
## Root node error: 1065.6/4177 = 0.25512
##
## n= 4177
##
## CP nsplit rel error xerror xstd
## 1 0.286572 0 1.00000 1.00022 0.040673
## 2 0.024325 1 0.71343 0.74737 0.030723
## 3 0.017725 2 0.68910 0.72389 0.030643
## 4 0.014584 4 0.65365 0.70421 0.029932
## 5 0.010000 5 0.63907 0.66818 0.028987
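The relative error after the last split is 0.639, so R-squared is only 1 - 0.639 = 0.361, which is a weak fit for these predictors. Note also that age is a factor: with method = "anova", rpart fits a regression tree to its integer level codes (adult = 1, old = 2, young = 3, in alphabetical order), so this R-squared measures fit to an arbitrary numeric coding rather than classification accuracy. For a categorical response, method = "class" is the appropriate setting.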
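Since the stated goal is to put each abalone into an age bucket, a classification tree is the more natural fit. The following is a minimal sketch under stated assumptions: the sim data frame is simulated stand-in data invented for illustration (it is not the abalone measurements), and fit_class, pred, and conf_mat are hypothetical names. Only method = "class" and the confusion-matrix step differ from the rpart() call above.

```r
library(rpart)

set.seed(42)
n <- 600

# Simulated stand-in data (NOT the real abalone measurements):
# two noisy predictors that grow with the ring count.
rings <- sample(1:25, n, replace = TRUE)
sim <- data.frame(
  shell    = 0.02 * rings + rnorm(n, sd = 0.05),
  diameter = 0.03 * rings + rnorm(n, sd = 0.10),
  age      = cut(rings, breaks = c(0, 5, 13, Inf),
                 labels = c("young", "adult", "old"))
)

# method = "class" treats the factor response as categories
fit_class <- rpart(age ~ ., data = sim, method = "class",
                   control = rpart.control(minsplit = 60, minbucket = 30,
                                           maxdepth = 10))

# Confusion matrix and accuracy on the training data
pred     <- predict(fit_class, newdata = sim, type = "class")
conf_mat <- table(actual = sim$age, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Accuracy measured on the training data is optimistic; a held-out test set or the xerror column of printcp() gives a fairer picture.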
Let’s try adding volume as a new feature.
# volume is approximated as length x diameter x height
train2 <- train1 %>% mutate(volume = length * diameter * height)
str(train2)
## 'data.frame': 4177 obs. of 10 variables:
## $ sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ diameter: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ shucked : num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ viscera : num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ shell : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ age : Factor w/ 3 levels "adult","old",..: 2 1 1 1 1 1 2 2 1 2 ...
## $ volume : num 0.01578 0.00835 0.03005 0.02007 0.00673 ...
head(train2)
## sex length diameter height weight shucked viscera shell age volume
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 old 0.01577712
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 adult 0.00834750
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 adult 0.03005100
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 adult 0.02007500
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 adult 0.00673200
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 adult 0.01211250
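As a quick check of the arithmetic, the volume column can be reproduced in base R from the first three rows of the head() output. This is only an illustrative sketch; the dims data frame is a hypothetical name introduced here. The product is the volume of a bounding box around the shell, so it is a rough proxy rather than a true volume.

```r
# First three rows of length, diameter and height from the head() output
dims <- data.frame(
  length   = c(0.455, 0.350, 0.530),
  diameter = c(0.365, 0.265, 0.420),
  height   = c(0.095, 0.090, 0.135)
)

# Base-R equivalent of the mutate() call: product of the three dimensions
dims <- transform(dims, volume = length * diameter * height)

print(dims)
```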
Let’s try a second model that uses only some of the variables, for example length and diameter (the new volume column is not used here):
abalone_model2 <- rpart(age ~ length + diameter, data=train2, method = "anova", control=rpart.control(minsplit=60, minbucket=30, maxdepth=10))
plot(as.party(abalone_model2))
#print(abalone_model2)
#summary(abalone_model2)
rsq.rpart(abalone_model2)
##
## Regression tree:
## rpart(formula = age ~ length + diameter, data = train2, method = "anova",
## control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 10))
##
## Variables actually used in tree construction:
## [1] diameter
##
## Root node error: 1065.6/4177 = 0.25512
##
## n= 4177
##
## CP nsplit rel error xerror xstd
## 1 0.283742 0 1.00000 1.00046 0.040684
## 2 0.034724 1 0.71626 0.73977 0.029163
## 3 0.010000 2 0.68153 0.70809 0.028890
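The same 1 - rel error arithmetic applied to this table gives 1 - 0.68153 = 0.318 for the second model. The R-squared can also be read from the fitted object's CP table rather than by eye. The sketch below assumes a genuinely numeric response and uses the built-in mtcars data purely as a stand-in (it is not the abalone data); fit, cp_tab, and the r2_* names are hypothetical.

```r
library(rpart)

set.seed(1)  # the cross-validated columns (xerror, xstd) depend on random folds

# mtcars ships with R; mpg is numeric, so method = "anova" is appropriate here
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

cp_tab <- fit$cptable
last   <- nrow(cp_tab)  # last row describes the full fitted tree

r2_train <- 1 - cp_tab[last, "rel error"]  # apparent (training) R-squared
r2_cv    <- 1 - cp_tab[last, "xerror"]     # cross-validated counterpart

# Cross-check against the definition 1 - SSE/SST
pred      <- predict(fit, newdata = mtcars)
r2_manual <- 1 - sum((mtcars$mpg - pred)^2) /
                 sum((mtcars$mpg - mean(mtcars$mpg))^2)

print(c(train = r2_train, cv = r2_cv, manual = r2_manual))
```

The cross-validated value is usually the lower of the two and is the safer one to quote when comparing trees.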