Abalone Rings Prediction

Tadros Salama

2021-06-10

How well can we predict the number of rings on a abalone shell?

Data

library(AppliedPredictiveModeling)
data(abalone)

Summary STATS

Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000 Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015 Min. : 1.000
I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4415 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 8.000
M:1528 Median :0.545 Median :0.4250 Median :0.1400 Median :0.7995 Median :0.3360 Median :0.1710 Median :0.2340 Median : 9.000
NA Mean :0.524 Mean :0.4079 Mean :0.1395 Mean :0.8287 Mean :0.3594 Mean :0.1806 Mean :0.2388 Mean : 9.934
NA 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290 3rd Qu.:11.000
NA Max. :0.815 Max. :0.6500 Max. :1.1300 Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29.000

  • Variables LongestShell and Diameter appears to have a strong relationship with the number of rings on an abalone shell. The different weight measurements also showed a positive relationship with number of rings. Gender does not appear to have an effect on the number of rings, but age does.

Spliting data into 70% train & 30% test

set.seed(123)
n <- nrow(abalone)
train_index <- sample(1:n, round(0.7 * n))
train <- abalone[train_index, ]
test <- abalone[-train_index, ]

Multiple Linear Regression

lm1 <- lm(Rings ~ ., train)
summary(lm1)
## 
## Call:
## lm(formula = Rings ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.444 -1.313 -0.336  0.880 14.136 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.0932     0.3492   8.858  < 2e-16 ***
## TypeI          -0.7255     0.1223  -5.931 3.37e-09 ***
## TypeM           0.1159     0.0993   1.167 0.243116    
## LongestShell   -0.7745     2.1287  -0.364 0.716018    
## Diameter        9.8634     2.6414   3.734 0.000192 ***
## Height         25.1272     2.7536   9.125  < 2e-16 ***
## WholeWeight     9.0528     0.8657  10.457  < 2e-16 ***
## ShuckedWeight -19.9145     0.9915 -20.086  < 2e-16 ***
## VisceraWeight -11.8409     1.5667  -7.558 5.46e-14 ***
## ShellWeight     7.1323     1.3232   5.390 7.60e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.186 on 2914 degrees of freedom
## Multiple R-squared:  0.5442, Adjusted R-squared:  0.5428 
## F-statistic: 386.6 on 9 and 2914 DF,  p-value: < 2.2e-16

Regression Tree

t1 <- rpart(Rings ~ ., data=train, method = 'anova')
par(cex=0.7, xpd=NA)
plot(t1, uniform = TRUE)
text(t1, use.n = T)

Random Forest

rf1 <- randomForest(Rings ~.,train, importance= TRUE)
vip(rf1, num_features = 20, geom = "point", include_type = TRUE)

Results

RMSE <- function(y, y_hat) { sqrt(mean((y - y_hat)^2))
}
pred_lm <- predict(lm1, newdata = test)
pred_t <- predict(t1, newdata = test)
pred_rf <- predict(rf1, newdata = test)

lm_score <- RMSE(test$Rings, pred_lm)
t_score <- RMSE(test$Rings, pred_t)
rf_score <- RMSE(test$Rings, pred_rf)
  • The Random Forests model, rf1 had the best predictive performance. When applied to the test data, it’s predictions were on average 2.1 rings off from the correct number of rings on a abalone shell.
  • The Regression Tree model, t1 had the worst predictive performance. On average, it’s predictions were 2.4 rings off from the correct number of shells.
pred_df <- data.frame(
  Actual = test$Rings,
  mlm = pred_lm,
  regtree = pred_t,
  rf = pred_rf
)

plot_lm <- ggplot(pred_df, aes(Actual, mlm)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  xlab("Actual Rings") + ylab("Predicted Rings") +
  ggtitle('Multiple Linear Regression')

plot_rt <- ggplot(pred_df, aes(Actual, regtree)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  ylab("Predicted Rings") +
  ggtitle('Regression Tree')

plot_rf <- ggplot(pred_df, aes(Actual, rf)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  ylab("Predicted Rings") +
  ggtitle('Random Forests')

grid.arrange(plot_lm, plot_rt, plot_rf, ncol=3, nrow =1)