The purpose of this assignment is to predict the number of rings on an abalone. Rings are a way of determining the maturity of abalones, but the instructor altered the data set to set some number of rings values equal to zero (rude). This post is broken into two main sections: first we’ll account for missing data, outliers, and our instructor’s tinkering then we’ll develop a Poisson regression model to predict the number of rings in our test set of abalone.
Data Exploration
The abalone
data set contains information on about 3,300 abalones. Gender and height do not appear to correlate with number of rings, so those are dropped prior to modeling. Length, diameter, and measures of weight do correlate with number of rings, but the weight measurements may pose multicollinearity issues.
ggpairs(abalone%>%select(-index))+yaztheme::theme_yaz()

Digging further into multicollinearity around weight measures, there is strong correlation between all four weight measures. We’ll go ahead and pick wholeweight
for use in models since the measures shuckedweight
, visceraweight
, and shellweight
involve human interaction with the abalone and therefore potentially risk measurement error.
The resulting data set has four columns and 3,310 observations. Although there are no null values, there may be outliers and we know the data is inflated with zero-values.
abalone_train <- abalone%>%
select(target_rings, length, diameter, wholeweight)
head(abalone_train)
Rather than train a zeroo-inflated poisson regression model, we can delete all outliers and reimpute the missing values using regression trees.

Missing values can be reimputed using the mice
package.
library(mice)
temp_df <- mice(abalone_sans_outliers, method = 'cart', maxit = 1)
abalone_train_clean <- complete(temp_df)
GGally::ggpairs(abalone_train_clean)

Finally, we can train a poisson regression model on the remaining data. Interaction terms were tested but eventally discarded in the backwards step-wise variable selection process, so the final poisson regression model is \(rings ~ length + diameter + wholeweight\).
library(MASS)
step <- stepAIC(glm(target_rings~ .,data = abalone_train_clean, family = poisson),
scope = . ~ .^2, direction = 'backward')
summary(step)
We can now build the perscribed model and bootstrap confidence intervals for its estimates by iteratively resampling the data and calculating the root mean squared error (RMSE). The average bootstrapped prediction is 2.23 rings! That can’t possibly be right. Re-examining the exploratory plots
for(i in seq(1,250)){
df <- abalone_train_clean_index%>%sample_frac(.9, replace = TRUE)
pois.fit <- glm(formula = target_rings ~ length + diameter + wholeweight,
family = poisson, data = df)
oos <- setdiff(abalone_train_clean_index, df)
pred_df <- data.frame(predicted_value = predict.glm(pois.fit, oos),
actual = oos$target_rings,
index = oos$index)%>%
mutate(error = actual - predicted_value,
iter = i)
full_dfs[[i]] <- pred_df
}
for(i in seq(1,250)){
df <- abalone_train_clean_index%>%sample_frac(.9, replace = TRUE)
pois.fit <- glm(formula = target_rings ~ length + diameter + wholeweight,
family = poisson, data = df)
oos <- setdiff(abalone_train_clean_index, df)
pred_df <- data.frame(predicted_value = predict.glm(pois.fit, oos),
actual = oos$target_rings,
index = oos$index)%>%
mutate(error = actual - predicted_value,
iter = i)
full_dfs[[i]] <- pred_df
}
cv_preds <- bind_rows(full_dfs)
cv_preds%>%
group_by(index)%>%
summarise(mean_error = mean(error, na.rm = T),
mean_pred = mean(predicted_value, na.rm = T))%>%
.$mean_pred%>%
summary()
The first model attempted clearly underestimates ring count. But before giving up on this line of inquiry, I want to try one more model with the data as we have it. After removing outliers (including our inflated number of zero values!) the data looked almost normalized, so maybe an OLS regression model will do just fine here.
summary(lm.fit)
This is much better! But we’re still not getting much beyond just taking the average value for target_rings
and applying it to the whole data set.
Lastly, we can try a zero-inflated poisson regression model without removing and reimputing outliers. The zero-inflated model has much larger coefficients than the traditional poisson model tested previously. Perhaps that will remedy the problem of under-predicting ring counts.
As should have been intuitive all along, the zero-inflated model outperforms the rest. The result isn’t perfect, but the pattern of the fit appears to follow the pattern of actual values better than OLS or simple poisson models.
---
title: "PRED 411 - Abalone Extra Credit"
author: 'Josh Yazman'
output: html_notebook
---

The purpose of this assignment is to predict the number of rings on an abalone. Rings are a way of determining the maturity of abalones, but the instructor altered the data set to set some number of rings values equal to zero (rude). This post is broken into two main sections: first we'll account for missing data, outliers, and our instructor's tinkering then we'll develop a Poisson regression model to predict the number of rings in our test set of abalone[^1]. 

## Data Exploration
The `abalone` data set contains information on about 3,300 abalones. Gender and height do not appear to correlate with number of rings, so those are dropped prior to modeling. Length, diameter, and measures of weight do correlate with number of rings, but the weight measurements may pose multicollinearity issues. 

```{r, echo=TRUE, results='hide', warning = FALSE, message = FALSE, fig.height=10}
library(readr)
library(dplyr)
library(yaztheme)
library(ggplot2)
library(GGally)
# setwd('Desktop/abalone/')
abalone <- read_csv('zip_abalone-1.csv')
colnames(abalone) <- tolower(colnames(abalone))
ggpairs(abalone%>%select(-index))+yaztheme::theme_yaz()
```

Digging further into multicollinearity around weight measures, there is strong correlation between all four weight measures. We'll go ahead and pick `wholeweight` for use in models since the measures `shuckedweight`, `visceraweight`, and `shellweight` involve human interaction with the abalone and therefore potentially risk measurement error. 

```{r, echo=TRUE,results='hide'}
ggpairs(abalone%>%select(contains('weight')))+theme_yaz()
```

The resulting data set has four columns and 3,310 observations. Although there are no null values, there may be outliers and we know the data is inflated with zero-values. 
```{r, echo = TRUE}
abalone_train <- abalone%>%
  select(target_rings, length, diameter, wholeweight)
head(abalone_train)
```

Rather than train a zeroo-inflated poisson regression model, we can delete all outliers and reimpute the missing values using regression trees. 
```{r, echo=TRUE,results='hide', fig.width=6, fig.height=3}
complete_cols <- list()
for(i in colnames(abalone_train)){
  colvec <- abalone_train%>%select_(col = i)%>%.$col
  col_q <- quantile(colvec)
  low <- col_q[2] - (col_q[4] - col_q[2])
  high <- col_q[4] + (col_q[4] - col_q[2])
  for(j in seq(1,length(colvec))){
    if(between(colvec[j], low, high)){
      colvec[j] <- colvec[j]
    } else {
      colvec[j] <- NA
    }
  }
  complete_cols[[i]] <- colvec
}

abalone_sans_outliers <- data.frame(complete_cols)

deleted <- mice::fico(abalone_sans_outliers)

bind_cols(data.frame(variable = names(deleted)),
          data.frame(percent = as.vector(deleted)))%>%
  ggplot(aes(variable, percent*100))+
  geom_col(fill = yaz_cols[1])+
  coord_flip()+
  labs(title = 'Percentage of Column Deleted as Outliers',
       y = 'Percent', x = element_blank())+
  theme_yaz()
```

Missing values can be reimputed using the `mice` package. 
```{r, echo=TRUE,results='hide'}
library(mice)
temp_df <- mice(abalone_sans_outliers, method = 'cart', maxit = 1)
abalone_train_clean <- complete(temp_df)
GGally::ggpairs(abalone_train_clean)
```

Finally, we can train a poisson regression model on the remaining data. Interaction terms were tested but eventally discarded in the backwards step-wise variable selection process, so the final poisson regression model is $rings ~ length + diameter + wholeweight$. 
```{r, echo=TRUE,results='hide'}
library(MASS)
step <- stepAIC(glm(target_rings~ .,data = abalone_train_clean, family = poisson),
                scope = . ~ .^2, direction = 'backward')
summary(step)
```

We can now build the perscribed model and bootstrap confidence intervals for its estimates by iteratively resampling the data and calculating the root mean squared error (RMSE). The average bootstrapped prediction is 2.23 rings! That can't possibly be right. Re-examining the exploratory plots

```{r, echo=TRUE,results='hide'}
abalone_train_clean_index <- abalone_train_clean%>%
  bind_cols(data.frame(index = seq(1, nrow(abalone_train_clean))))
full_dfs <- list()
for(i in seq(1,250)){
  df <- abalone_train_clean_index%>%sample_frac(.9, replace = TRUE)
  pois.fit <- glm(formula = target_rings ~ length + diameter + wholeweight, 
                  family = poisson, data = df)
  oos <- setdiff(abalone_train_clean_index, df)
  pred_df <- data.frame(predicted_value = predict.glm(pois.fit, oos),
                        actual = oos$target_rings,
                        index = oos$index)%>%
    mutate(error = actual - predicted_value,
           iter = i)
  full_dfs[[i]] <- pred_df
}

cv_preds <- bind_rows(full_dfs)
cv_preds%>%
  group_by(index)%>%
  summarise(mean_error = mean(error, na.rm = T),
            mean_pred = mean(predicted_value, na.rm = T))%>%
  .$mean_pred%>%
  summary()
```

The first model attempted clearly underestimates ring count. But before giving up on this line of inquiry, I want to try one more model with the data as we have it. After removing outliers (including our inflated number of zero values!) the data looked almost normalized, so maybe an OLS regression model will do just fine here.

```{r, echo=TRUE,results='hide'}
lm.fit <- lm(target_rings ~ length + diameter + wholeweight, data = abalone_train_clean)
summary(lm.fit)
```

This is much better! But we're still not getting much beyond just taking the average value for `target_rings` and applying it to the whole data set.  
```{r, echo=TRUE,results='hide'}
ols_dfs <- list()
for(i in seq(1,250)){
  df <- abalone_train_clean_index%>%sample_frac(.9, replace = TRUE)
  lm.fit <- lm(formula = target_rings ~ length + diameter + wholeweight, data = df)
  oos <- setdiff(abalone_train_clean_index, df)
  pred_df <- data.frame(predicted_value = predict(lm.fit, oos),
                        actual = oos$target_rings,
                        index = oos$index)%>%
    mutate(error = actual - predicted_value,
           iter = i)
  ols_dfs[[i]] <- pred_df
}

cv_preds <- bind_rows(ols_dfs)
cv_preds%>%
  group_by(actual)%>%
  summarise(mean_error = mean(error, na.rm = T),
            mean_pred = mean(predicted_value, na.rm = T),
            high = quantile(predicted_value, .95),
            low = quantile(predicted_value, .05))%>%
  ggplot(aes(x = actual, y = mean_pred))+
  geom_errorbar(aes(ymin = low, ymax = high), width = .25)+
  geom_point(show.legend = F, color = yaz_cols[1])+
  labs(title = 'OLS Model Predictions',
       y = 'Predicted Value', x = 'Actual Value')+
  geom_hline(yintercept = 0, linetyp = 'dashed')+
  theme_yaz()
```

Lastly, we can try a zero-inflated poisson regression model without removing and reimputing outliers. The zero-inflated model has much larger coefficients than the traditional poisson model tested previously. Perhaps that will remedy the problem of under-predicting ring counts. 
```{r, echo=TRUE,results='hide'}
library(pscl)
zinf.fit <- zeroinfl(target_rings ~ length + diameter + wholeweight, data = abalone_train)
summary(zinf.fit)
```

As should have been intuitive all along, the zero-inflated model outperforms the rest. The result isn't perfect, but the pattern of the fit appears to follow the pattern of actual values better than OLS or simple poisson models.  
```{r, echo=TRUE,results='hide'}
abalone_index_zeroes <- abalone_train%>%
  bind_cols(data.frame(index = seq(1, nrow(abalone_train))))
zinf_dfs <- list()
for(i in seq(1,250)){
  df <- abalone_index_zeroes%>%sample_frac(.9, replace = TRUE)
  zinf.fit <- zinf.fit <- zeroinfl(target_rings ~ length + diameter + wholeweight, data = df)
  oos <- setdiff(abalone_index_zeroes, df)
  pred_df <- data.frame(predicted_value = predict(zinf.fit, oos),
                        actual = oos$target_rings,
                        index = oos$index)%>%
    mutate(error = actual - predicted_value,
           iter = i)
  zinf_dfs[[i]] <- pred_df
}

cv_preds <- bind_rows(zinf_dfs)
cv_preds%>%
  group_by(actual)%>%
  summarise(mean_error = mean(error, na.rm = T),
            mean_pred = mean(predicted_value, na.rm = T),
            high = quantile(predicted_value, .95),
            low = quantile(predicted_value, .05))%>%
  ggplot(aes(x = actual, y = mean_pred))+
  geom_errorbar(aes(ymin = low, ymax = high), width = .25)+
  geom_point(show.legend = F, color = yaz_cols[1])+
  labs(title = 'Zero-Inflated Model Predictions',
       y = 'Predicted Value', x = 'Actual Value')+
  geom_hline(yintercept = 0, linetyp = 'dashed')+
  theme_yaz()
```

[^1]: All measures are bootstrapped to present a range of model outcomes within a reasonable confidence interval. 