How well can we predict the number of rings on a abalone shell?
Data
library(AppliedPredictiveModeling)
data(abalone)Summary STATS
| Type | LongestShell | Diameter | Height | WholeWeight | ShuckedWeight | VisceraWeight | ShellWeight | Rings | |
|---|---|---|---|---|---|---|---|---|---|
| F:1307 | Min. :0.075 | Min. :0.0550 | Min. :0.0000 | Min. :0.0020 | Min. :0.0010 | Min. :0.0005 | Min. :0.0015 | Min. : 1.000 | |
| I:1342 | 1st Qu.:0.450 | 1st Qu.:0.3500 | 1st Qu.:0.1150 | 1st Qu.:0.4415 | 1st Qu.:0.1860 | 1st Qu.:0.0935 | 1st Qu.:0.1300 | 1st Qu.: 8.000 | |
| M:1528 | Median :0.545 | Median :0.4250 | Median :0.1400 | Median :0.7995 | Median :0.3360 | Median :0.1710 | Median :0.2340 | Median : 9.000 | |
| NA | Mean :0.524 | Mean :0.4079 | Mean :0.1395 | Mean :0.8287 | Mean :0.3594 | Mean :0.1806 | Mean :0.2388 | Mean : 9.934 | |
| NA | 3rd Qu.:0.615 | 3rd Qu.:0.4800 | 3rd Qu.:0.1650 | 3rd Qu.:1.1530 | 3rd Qu.:0.5020 | 3rd Qu.:0.2530 | 3rd Qu.:0.3290 | 3rd Qu.:11.000 | |
| NA | Max. :0.815 | Max. :0.6500 | Max. :1.1300 | Max. :2.8255 | Max. :1.4880 | Max. :0.7600 | Max. :1.0050 | Max. :29.000 |
- Variables
LongestShellandDiameterappears to have a strong relationship with the number of rings on an abalone shell. The different weight measurements also showed a positive relationship with number of rings. Gender does not appear to have an effect on the number of rings, but age does.
Spliting data into 70% train & 30% test
set.seed(123)
n <- nrow(abalone)
train_index <- sample(1:n, round(0.7 * n))
train <- abalone[train_index, ]
test <- abalone[-train_index, ]Multiple Linear Regression
lm1 <- lm(Rings ~ ., train)
summary(lm1)##
## Call:
## lm(formula = Rings ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.444 -1.313 -0.336 0.880 14.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0932 0.3492 8.858 < 2e-16 ***
## TypeI -0.7255 0.1223 -5.931 3.37e-09 ***
## TypeM 0.1159 0.0993 1.167 0.243116
## LongestShell -0.7745 2.1287 -0.364 0.716018
## Diameter 9.8634 2.6414 3.734 0.000192 ***
## Height 25.1272 2.7536 9.125 < 2e-16 ***
## WholeWeight 9.0528 0.8657 10.457 < 2e-16 ***
## ShuckedWeight -19.9145 0.9915 -20.086 < 2e-16 ***
## VisceraWeight -11.8409 1.5667 -7.558 5.46e-14 ***
## ShellWeight 7.1323 1.3232 5.390 7.60e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.186 on 2914 degrees of freedom
## Multiple R-squared: 0.5442, Adjusted R-squared: 0.5428
## F-statistic: 386.6 on 9 and 2914 DF, p-value: < 2.2e-16
Regression Tree
t1 <- rpart(Rings ~ ., data=train, method = 'anova')
par(cex=0.7, xpd=NA)
plot(t1, uniform = TRUE)
text(t1, use.n = T)Random Forest
rf1 <- randomForest(Rings ~.,train, importance= TRUE)
vip(rf1, num_features = 20, geom = "point", include_type = TRUE)Results
RMSE <- function(y, y_hat) { sqrt(mean((y - y_hat)^2))
}pred_lm <- predict(lm1, newdata = test)
pred_t <- predict(t1, newdata = test)
pred_rf <- predict(rf1, newdata = test)
lm_score <- RMSE(test$Rings, pred_lm)
t_score <- RMSE(test$Rings, pred_t)
rf_score <- RMSE(test$Rings, pred_rf)- The Random Forests model,
rf1had the best predictive performance. When applied to the test data, it’s predictions were on average 2.1 rings off from the correct number of rings on a abalone shell. - The Regression Tree model,
t1had the worst predictive performance. On average, it’s predictions were 2.4 rings off from the correct number of shells.
pred_df <- data.frame(
Actual = test$Rings,
mlm = pred_lm,
regtree = pred_t,
rf = pred_rf
)
plot_lm <- ggplot(pred_df, aes(Actual, mlm)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
xlab("Actual Rings") + ylab("Predicted Rings") +
ggtitle('Multiple Linear Regression')
plot_rt <- ggplot(pred_df, aes(Actual, regtree)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ylab("Predicted Rings") +
ggtitle('Regression Tree')
plot_rf <- ggplot(pred_df, aes(Actual, rf)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ylab("Predicted Rings") +
ggtitle('Random Forests')
grid.arrange(plot_lm, plot_rt, plot_rf, ncol=3, nrow =1)