An Evaluation of Mollusk Metrics

Background

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. In this homework problem, you will be using other, easier to obtain measurements to predict the number of rings, and calculate the age of the abalone instead.

Data Description

The dataset abalone.csv (taken from http://archive.ics.uci.edu/ml/datasets/Abalone) contains measurements from 4,177 abalone. Variable descriptions are as follows:

Sex: one of three possible categories: infant (I), male (M), or female (F)

Length: longest shell measurement (in mm)

Diameter: measurement taken perpendicular to length (in mm)

Height: height with meat in shell (in mm)

Whole: weight of whole abalone (in grams)

Shucked: weight of abalone meat (in grams)

Viscera: gut weight after bleeding (in grams)

Shell: weight after being dried (in grams)

Rings: number of rings (age of abalone is 1.5 times the number of rings)

Instructions on reading the data

#setwd("C:/Users/emily/OneDrive/Desktop/R")
dataFull=read.csv("abalone.csv",head=T)
dataTest=dataFull[(nrow(dataFull)-10):nrow(dataFull),]
data=dataFull[1:(nrow(dataFull)-11),]

Use the dataframe data to build your models. The dataframe dataTest will be used for making predictions in Question 4.
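
As an optional sanity check before modeling, the split can be verified (a minimal sketch, assuming the code above has been run):

dim(data)     # expect 4166 rows, 9 columns
dim(dataTest) # expect 11 rows, 9 columns
stopifnot(nrow(data) + nrow(dataTest) == nrow(dataFull)) # no rows lost or duplicated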

Question Sets

Q1: Exploratory Data Analysis

  1. Create a boxplot of the qualitative predictor and the response. Does there appear to be a relationship between the predictor and the response? Interpret the boxplot.
library(ggplot2)
df <- data
ggplot(data=df, aes(x=as.factor(Sex),y=Rings, fill=Sex))+
  geom_boxplot(alpha=0.3)+
  xlab("Sex of Abalone")+
  ylab("Number of Rings")+
  ggtitle("Sex of Abalone vs Number of Rings")

A: The boxplot suggests a relationship between Sex and Rings, but it is driven almost entirely by the infant category. The distributions for males and females are nearly identical, so the M/F distinction tells us little about ring count. Infants (I), however, clearly tend to have fewer rings, and this is plausibly a function of age rather than sex: younger abalone have fewer rings and cannot yet be sexed.
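
To back up this reading of the boxplot numerically, the median ring count per group can be computed directly (a minimal sketch; the medians themselves are not shown in the original output):

aggregate(Rings ~ Sex, data = df, FUN = median) # the I group should sit noticeably lower than M and F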

  2. Create a scatterplot matrix containing the quantitative predictors and the response, and display the correlations between each pair of variables. Interpret the correlations.
library(corrplot)
x <- cor(df[2:9])
corrplot.mixed(x)

library(GGally)
ggpairs(df[,2:9],ggplot2::aes(color="teal"))

A: There are strong linear relationships among the predictor variables, with every pairwise correlation at \(0.77\) or higher; Length and Diameter are correlated at \(0.99\). These strong relationships between predictors strongly suggest multicollinearity and warrant further review. The correlations between the predictors and the response, Rings, are moderate, ranging from \(0.42\) to \(0.63\).
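
To make the strongest pairs explicit, the correlation matrix can be flattened and sorted (a small helper sketch, not part of the original analysis):

# Keep each unordered pair once, then rank by absolute correlation
cor_pairs <- as.data.frame(as.table(x))
cor_pairs <- subset(cor_pairs, as.integer(Var1) < as.integer(Var2))
cor_pairs[order(-abs(cor_pairs$Freq)), ] # Length-Diameter (~0.99) should top the list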

  3. Is it reasonable to use a multiple linear regression model to predict Rings using all the predictor variables?

A: No, it is likely not reasonable to use all of the predictor variables. The plots show that many of the predictors are strongly linearly related to one another, so including all of them would add little predictive power while inflating the variance of the coefficient estimates. A model with only a subset of these variables, or with some variables combined into new features, may be preferable; further analysis is needed.

Q2: Fitting the Multiple Linear Regression Model

Build a linear regression model, model1, to predict Rings using all the predictors. Display the model1 summary.

library(broom)
library(jtools)

model1 <- lm(Rings ~ ., data = df)
summ(model1)
Observations: 4166
Dependent variable: Rings
Type: OLS linear regression
F(9, 4156) = 537.99, R² = 0.54, Adj. R² = 0.54

              Est.   S.E.  t val.     p
(Intercept)   3.89   0.29   13.33  0.00
SexI         -0.83   0.10   -8.07  0.00
SexM          0.06   0.08    0.71  0.48
Length       -0.46   1.81   -0.25  0.80
Diameter     11.09   2.23    4.97  0.00
Height       10.79   1.54    7.01  0.00
Whole         8.98   0.73   12.36  0.00
Shucked     -19.80   0.82  -24.19  0.00
Viscera     -10.57   1.30   -8.16  0.00
Shell         8.72   1.13    7.75  0.00

Standard errors: OLS
tidy(model1, conf.int = TRUE, conf.level = 0.90)

  1. What is the coefficient estimate for Height? Interpret this coefficient.

A: \(10.787\). Holding all other predictors constant, each additional millimeter of height is associated with an increase of about 10.787 in the predicted number of rings.
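
One way to verify this interpretation (a sketch using hypothetical measurement values chosen only for illustration) is to predict two otherwise-identical abalone whose Heights differ by one unit; the difference in predictions equals the coefficient:

# Hypothetical rows, identical except for Height
base <- data.frame(Sex = "F", Length = 0.5, Diameter = 0.4, Height = 0.14,
                   Whole = 0.8, Shucked = 0.3, Viscera = 0.18, Shell = 0.24)
taller <- transform(base, Height = Height + 1)
predict(model1, taller) - predict(model1, base) # equals the Height coefficient, ~10.787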

  2. What is the coefficient estimate for the Sex category Infant? Interpret this coefficient.

A: \(-0.828\). Holding all other predictors constant, an infant abalone is predicted to have 0.828 fewer rings than a female (the reference category).

  3. What is the coefficient estimate for the Sex category Male? Interpret this coefficient.

A: \(0.05946\). Holding all other predictors constant, a male abalone is predicted to have about 0.059 more rings than a female; note that this coefficient is not statistically significant (p = 0.48).

  4. Give a 90% confidence interval for Length. Interpret this interval.

A: The 90% confidence interval for Length is \((-3.44, 2.52)\). With 90% confidence, a one-millimeter increase in Length, holding the other predictors constant, is associated with a change in the expected number of rings somewhere between \(-3.44\) and \(2.52\). Because this interval contains zero, Length is not a statistically significant predictor at the 0.10 level once the other variables are in the model.
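
The same interval is available directly from base R (a sketch; the result should match the tidy() output above):

confint(model1, "Length", level = 0.90) # 90% CI for the Length coefficient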

Q3: Checking Model Assumptions

library(ggfortify)
autoplot(model1, label.size = 3, which = 1) # residuals vs fitted

autoplot(model1, label.size = 3, which = 2) # normal QQ plot

library(lindia)
gg_reshist(model1)

autoplot(model1, label.size = 3, which = 4) # Cook's distance

library(car)
influencePlot(model1, id = TRUE, main = "Influence Plot",
              sub = "Circle size is proportional to Cook's Distance")
summ(model1, confint = TRUE, vifs = TRUE)
Observations: 4166
Dependent variable: Rings
Type: OLS linear regression
F(9, 4156) = 537.99, R² = 0.54, Adj. R² = 0.54

               Est.    2.5%   97.5%  t val.     p     VIF
(Intercept)    3.89    3.32    4.46   13.33  0.00      NA
SexI          -0.83   -1.03   -0.63   -8.07  0.00    1.54
SexM           0.06   -0.10    0.22    0.71  0.48    1.54
Length        -0.46   -4.01    3.09   -0.25  0.80   40.95
Diameter      11.09    6.72   15.47    4.97  0.00   42.38
Height        10.79    7.77   13.80    7.01  0.00    3.58
Whole          8.98    7.55   10.40   12.36  0.00  109.68
Shucked      -19.80  -21.40  -18.20  -24.19  0.00   28.52
Viscera      -10.57  -13.11   -8.03   -8.16  0.00   17.45
Shell          8.72    6.52   10.93    7.75  0.00   21.25

Standard errors: OLS
  1. Create a scatterplot of the standardized residuals versus fitted values. What conclusions can you draw from this plot?

The scatterplot shows a violation of the constant variance assumption: the residuals fan out into a funnel shape as the fitted values increase. As a consequence, prediction intervals based on this model will be wider than they should be at low fitted values and narrower than they should be at high fitted values. The plot also shows mild curvature, suggesting the linearity assumption may be violated as well.
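
A formal complement to the visual check is the Breusch-Pagan test, whose null hypothesis is constant error variance (a sketch using the lmtest package, which the original analysis does not load):

library(lmtest)
bptest(model1) # a small p-value supports rejecting constant variance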

  2. Create a histogram and normal QQ plot for the residuals. What conclusions can you draw from these plots?

The normal QQ plot shows a significantly heavy upper tail: the points bend well above the reference line on the right, indicating that the residuals are not normally distributed. The histogram confirms this deviation, showing clear right skew.
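
Because the training set has fewer than 5,000 observations, a Shapiro-Wilk test can supplement the plots (a sketch; at this sample size the test flags even small departures from normality):

shapiro.test(residuals(model1)) # a small p-value supports rejecting normality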

  3. Create a plot for the Cook's Distances. What conclusions can you draw from this plot?

Observation 2052 stands out as a highly influential point: its Cook's distance is greater than 1, the usual rule-of-thumb cutoff, so it should be investigated and likely removed before refitting.
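
The flagged observation can be located programmatically rather than read off the plot (a minimal sketch):

cd <- cooks.distance(model1)
which(cd > 1)                    # indices of observations above the rule-of-thumb cutoff
sort(cd, decreasing = TRUE)[1:5] # the five most influential observations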

  4. Display the VIF of each predictor. What conclusions can you draw?

The Variance Inflation Factors indicate multicollinearity among all of the predictors except Sex (\(1.54\)) and Height (\(3.58\)), which have low values. The high VIFs on the remaining predictors (a common rule of thumb flags VIF > 10) are a sign that some of them are redundant: because they are collinear with one another, including all of them adds little predictive power. Such predictors could be removed, combined into new features, or both, to build a more effective model.
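
The VIFs can also be computed directly with the car package already loaded above (a sketch; note that for a model containing a factor such as Sex, car reports generalized VIFs, so the output is laid out slightly differently than in summ()):

vif(model1) # values above ~10 are commonly taken to indicate problematic collinearity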

Q4: Model Comparison and Prediction

Create a second model, model2, that uses all the predictors except for Whole weight, Viscera weight, and Length. Display the model summary.

#subset data for model2, excluding the three predictors
dataTest2 <- subset(dataTest, select = -c(Viscera, Whole, Length))
df2 <- subset(data, select = -c(Viscera, Whole, Length))

#create model2
model2 <- lm(Rings~Sex+Diameter+Height+Shucked+Shell, data=df2)

#display model summaries
library(jtools)
summ(model1, confint = TRUE, vifs = TRUE) #Rsq = 0.54, Adj Rsq = 0.54
Observations: 4166
Dependent variable: Rings
Type: OLS linear regression
F(9, 4156) = 537.99, R² = 0.54, Adj. R² = 0.54

               Est.    2.5%   97.5%  t val.     p     VIF
(Intercept)    3.89    3.32    4.46   13.33  0.00      NA
SexI          -0.83   -1.03   -0.63   -8.07  0.00    1.54
SexM           0.06   -0.10    0.22    0.71  0.48    1.54
Length        -0.46   -4.01    3.09   -0.25  0.80   40.95
Diameter      11.09    6.72   15.47    4.97  0.00   42.38
Height        10.79    7.77   13.80    7.01  0.00    3.58
Whole          8.98    7.55   10.40   12.36  0.00  109.68
Shucked      -19.80  -21.40  -18.20  -24.19  0.00   28.52
Viscera      -10.57  -13.11   -8.03   -8.16  0.00   17.45
Shell          8.72    6.52   10.93    7.75  0.00   21.25

Standard errors: OLS
summ(model2, confint = TRUE, vifs = TRUE) #Rsq = 0.52, Adj Rsq = 0.52
Observations: 4166
Dependent variable: Rings
Type: OLS linear regression
F(6, 4159) = 754.11, R² = 0.52, Adj. R² = 0.52

               Est.    2.5%   97.5%  t val.     p   VIF
(Intercept)    3.76    3.22    4.31   13.49  0.00    NA
SexI          -0.87   -1.07   -0.67   -8.40  0.00  1.51
SexM           0.06   -0.11    0.22    0.65  0.52  1.51
Diameter      10.77    8.81   12.72   10.78  0.00  8.20
Height        11.01    7.95   14.07    7.05  0.00  3.56
Shucked      -11.65  -12.39  -10.91  -30.84  0.00  5.86
Shell         19.63   18.35   20.92   29.92  0.00  6.96

Standard errors: OLS
#ANOVA for partial F-test (fit once, reuse the result)
aov12 <- anova(model1, model2)
cat("Model1 vs Model2 \n F-statistic: ", aov12$F[2], ", p = ", aov12$'Pr(>F)'[2], ".\n", sep="") #displays F-stat = 51.165

#mean squared prediction error
pred1 <- predict(model1, dataTest, interval = "prediction")
pred2 <- predict(model2, dataTest2, interval = "prediction")
mspe1 <- mean((dataTest$Rings - pred1[, 1])^2)  #gives MSPE = 1.520872
mspe2 <- mean((dataTest2$Rings - pred2[, 1])^2) #gives MSPE = 1.559619

#precision measure (prediction SSE relative to test-set variability)
pm1 <- sum((dataTest$Rings - pred1[, 1])^2) / sum((dataTest$Rings - mean(dataTest$Rings))^2)
pm2 <- sum((dataTest2$Rings - pred2[, 1])^2) / sum((dataTest2$Rings - mean(dataTest2$Rings))^2)

#summaries
cat("\n -*--*--*--*--*--*--*- \n")
cat("\n Model 1 Quality Metrics: \n R-squared: ", summary(model1)$r.squared,
    " \n Adjusted R-squared: ", summary(model1)$adj.r.squared,
    "\n Mean Square Prediction Error: ", mspe1,
    "\n Precision Measure: ", pm1)
cat("\n \n Model 2 Quality Metrics: \n R-squared: ", summary(model2)$r.squared,
    " \n Adjusted R-squared: ", summary(model2)$adj.r.squared,
    "\n Mean Square Prediction Error: ", mspe2,
    "\n Precision Measure: ", pm2)
Model1 vs Model2 
 F-statistic: 51.16507, p = 1.818439e-32.

 -*--*--*--*--*--*--*- 

 Model 1 Quality Metrics: 
 R-squared:  0.5381134  
 Adjusted R-squared:  0.5371132 
 Mean Square Prediction Error:  1.520872 
 Precision Measure:  1.179651
 
 Model 2 Quality Metrics: 
 R-squared:  0.5210544  
 Adjusted R-squared:  0.5203634 
 Mean Square Prediction Error:  1.559619 
 Precision Measure:  1.209705
  1. Compare and discuss the R-squared and Adjusted R-squared of model2 with model1.

The summaries above show R-squared values of 0.5381 and 0.5211 for models 1 and 2, respectively, and adjusted R-squared values of 0.5371 and 0.5204. Model 1's values are higher by only about 0.017, indicating that model 2 accounts for slightly less of the variance in the response, but that dropping the three predictors costs very little explanatory power.

  2. Conduct a partial F-test comparing model2 with model1. What can you conclude?

The anova() function was used to conduct the partial F-test, giving an F-statistic of 51.165 and a highly significant p-value (\(p < 0.05\)). We therefore reject \(H_0\): the two models differ significantly, meaning at least one of the three excluded variables (Whole, Viscera, Length) explains a significant amount of variation in Rings beyond the predictors retained in model2. Additional tests would be needed to determine which of the three is responsible.
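
The statistic can also be reproduced by hand from the residual sums of squares, which makes the logic of the partial F-test explicit (a sketch; all quantities come from the two fitted models):

# Partial F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full),
# with q = 3 excluded predictors and df_full = 4156
rss_full    <- sum(residuals(model1)^2)
rss_reduced <- sum(residuals(model2)^2)
f_stat <- ((rss_reduced - rss_full) / 3) / (rss_full / model1$df.residual)
f_stat                                                # should match anova(), ~51.165
pf(f_stat, 3, model1$df.residual, lower.tail = FALSE) # corresponding p-value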

  3. Predict Rings for the last 11 rows of data (i.e. dataTest) using both model2 and model1. Compare and discuss the point prediction and intervals of both models. Compare and discuss the mean squared prediction error and the precision measure.

The prediction intervals from model 1 have a slightly smaller range than those from model 2, meaning that at the same confidence level model 1 pins down the true number of rings slightly more precisely.

Mean Squared Prediction Error (MSPE) is the average of the squared differences between predicted and observed values, \(\text{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\). While useful for measuring prediction accuracy in a least squares linear model, it depends on the scale of the response and is sensitive to outliers. Model 1 has a slightly lower MSPE of \(1.52\) versus \(1.56\) for model 2. The difference is likely not meaningful, but a lower value indicates more accurate predictions.

The Precision Measure (PM) is analogous to the regression R-squared: it is the ratio of the prediction error variability to the variability in the new data, \(\text{PM} = \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}\). Because it does not depend on the scale of the response, it is arguably the strongest of these metrics for comparing the prediction accuracy of least squares models; the model with the lower PM is preferred. Model 1 has a slightly lower PM of \(1.18\) versus \(1.21\) for model 2, again indicating slightly better predictive accuracy.
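
Both metrics can be wrapped in a small helper so the comparison is reproducible for any fitted model (a sketch; the function name and arguments are my own, not part of the assignment):

# Hypothetical helper returning MSPE and PM for held-out data
pred_metrics <- function(model, newdata, y) {
  yhat <- predict(model, newdata)
  c(MSPE = mean((y - yhat)^2),
    PM   = sum((y - yhat)^2) / sum((y - mean(y))^2))
}
pred_metrics(model1, dataTest,  dataTest$Rings)  # ~c(1.52, 1.18)
pred_metrics(model2, dataTest2, dataTest2$Rings) # ~c(1.56, 1.21)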

clam <- data.frame(Diameter = 0.4, Height = 0.2, Shucked = 0.3, Shell = 0.3, Sex = 'F')
clam.rings <- predict(model2, clam, interval = "prediction", level = 0.90) # 'level', not 'confidence', sets the 90% interval
clam.rings # displays the fit and the 90% lower/upper prediction bounds

cat("This adult female abalone is predicted to be approximately", round(clam.rings[1]*1.5, digits=2), "years old.")
## This adult female abalone is predicted to be approximately 19 years old.
  4. Suppose you’ve found an adult female abalone with a diameter of 0.4mm, a height of 0.2mm, a shucked weight of 0.3 grams, and a shell weight of 0.3 grams. Using model2, predict the number of rings on this abalone with a 90% prediction interval. What is the age of this abalone?

Using these parameters, model2 predicts a ring count of about 12.667 which, multiplied by 1.5, gives an estimated age of approximately 19 years; the lwr and upr columns of clam.rings above give the corresponding 90% prediction interval for the ring count.