The objective: to predict the Age of Abalone Species for harvesting

Statistical Inference & Model Creation

(Regression). Supervised Learning. What we want to do is Predict Values, i.e Age. Making forecasts, not estimating the relationship between values.

Abalone

Raw Data

Abalone Raw Data
Variable Classes Glimpse
Sex integer Infant, Infant, Infant, Infant, Infant, Infant
Length double 5.565, 3.675, 10.08, 4.095, 6.93, 7.875
Diam double 4.095, 2.625, 7.35, 3.15, 4.83, 6.09
Height double 1.26, 0.84, 2.205, 0.945, 1.785, 2.1
Whole double 11.5, 3.5, 79.375, 4.6875, 21.1875, 27.375
Shuck double 4.3125, 1.1875, 44, 2.25, 9.875, 11.5625
Rings integer 6, 4, 6, 3, 6, 6
Class character A1, A1, A1, A1, A1, A1
Volume double 28.7137305, 8.103375, 163.36404, 12.18979125, 59.7473415, 100.713375
Ratio double 0.150189471, 0.146543878, 0.26933712, 0.184580683, 0.16527932, 0.114806003

1.) Ratio

a.) Kurtosis

  • Forming a histogram and QQ plot using RATIO.
  • Calculating skewness and kurtosis using ‘rockchalk.’

Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

Kurtosis Skewness
1.667 0.7147

b.) Log Scale

  • Transforming RATIO using log10() to create L_RATIO.
  • Forming a histogram and QQ plot using L_RATIO.
  • Calculating the skewness and kurtosis.
  • Creating a boxplot of L_RATIO differentiated by CLASS.

Kurtosis Skewness
0.5354 -0.09392

c.) Bartlett’s Test

  • Testing the homogeneity of variance across Age Classes using bartlett.test()

(Variance). of a distribution is a measure of the variability of data. It measures how far a set of (random) numbers are spread out from their average value. It can be formulated as the average of the squared difference from the mean.

[1] 350 926

    Bartlett test of homogeneity of variances

data:  L_Ratio by Class
Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
P_Value
Cannot Reject Null

[1] 350 746

    Bartlett test of homogeneity of variances

data:  Ratio by Class
Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
P_Value
Reject Null

Question:

  • Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across Age Classes?

Answer:

L_Ratio exhibits better conformance to normality across Age Classes due to the QQ-Plots falling closer in-line with the expected quantiles, lower Kurtosis, as well as having a less skewed distribution as observed by both the skewness measure and the histograms of the respective variables. Finally, at 0.05 a significance level, we cannot reject the null hypothesis that L_Ratio comes from a normal distribution according to Bartlett’s test for Equality of Variances, which we must do for ratio analysis as it falls well short of significance.


2.) Analysis of Variance

a.) ANOVA

  • Performing an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables.
  • Assuming equal variances.
  • Performing 2 analyses.
  • First, fitting a model with the interaction term CLASS:SEX.
  • Then, fitting a model without CLASS:SEX.
  • Using summary() to obtain the Analysis of Variance tables.
L_Ratio ~ Class * Sex
Variable Df Sum Sq Mean Sq F value Pr(>F)
Class 4 1.0554 0.2638 38.3699 0.0000
Sex 2 0.0914 0.0457 6.6444 0.0014
Class:Sex 8 0.0267 0.0033 0.4855 0.8671
Residuals 1021 7.0206 0.0069 NA NA
L_Ratio ~ Class + Sex
Variable Df Sum Sq Mean Sq F value Pr(>F)
Class 4 1.0554 0.2638 38.524 0.0000
Sex 2 0.0914 0.0457 6.671 0.0013
Residuals 1029 7.0473 0.0068 NA NA

Question:

  • Comparing the 2 analyses.
  • What does the non-significant interaction term suggest about the relationship between L_RATIO and the attributes CLASS and SEX?

Answer:

It means the joint effect of Class and Sex is not statistically higher than the sum of both effect individually.

b.) Tukey

  • For the model without CLASS:SEX (i.e. an interaction term), obtaining multiple comparisons with the TukeyHSD() function.
  • Interpreting the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).
Tukey Class
diff lwr upr p adj
A2-A1 -0.012 -0.039 0.014 0.692
A3-A1 -0.034 -0.059 -0.009 0.002
A4-A1 -0.059 -0.086 -0.031 0.000
A5-A1 -0.100 -0.128 -0.072 0.000
A3-A2 -0.022 -0.041 -0.002 0.018
A4-A2 -0.046 -0.068 -0.024 0.000
A5-A2 -0.087 -0.110 -0.065 0.000
A4-A3 -0.024 -0.045 -0.004 0.011
A5-A3 -0.066 -0.087 -0.045 0.000
A5-A4 -0.041 -0.065 -0.018 0.000
Tukey Sex
diff lwr upr p adj
Female-Infant 0.016 0.001 0.031 0.038
Male-Infant 0.018 0.003 0.033 0.011
Male-Female 0.002 -0.013 0.017 0.941

Question:

  • First, interpret the trend in coefficients across Age Classes.
  • What is this indicating about L_RATIO?
  • Second, do these results suggest Male and Female abalone can be combined into a single category labeled as ‘adult?’
  • If not, why not?

Answer:

The trends in Age Class is interesting in that for the Age Classes A3, A4 and A5 the L_Ratio values are not significantly significant (at a 95% level) from each other meaning that there is no relationship to L_Ratio and Age Classes A3, A4 and A5.

However, for the A1 and A2 classes there is a clear relationship between the Age Class and the value of L_Ratio.

Secondly, Male and Female abalone have a 0.941 similarity in means, suggesting they’re well beyond statistically significant, almost to the point of being identical. It would be logical to just combine them and split the abalone into Infant and Adult for ratio analysis.


3.) Infant/Adult

a.) Type

  • Using combineLevels() from the ‘rockchalk’ package to combine “M” and “F” into a new level, “ADULT”. This will necessitate defining a new variable, TYPE, in mydata which will have 2 levels: “INFANT” and “ADULT”.
  • Presenting side-by-side histograms of VOLUME.
  • Displaying infant Volumes and, the other, Adult Volumes.

Question:

  • Compare the histograms.
  • How do the distributions differ?
  • Are there going to be any difficulties separating Infant from Adult based on VOLUME?

Answer:

The 2 distributions have distinct shapes, however, there is a significant amount of overlap in the 0 - 500 range. Abalone having a VOLUME > 500, I could assume with relative safety would be Adult, however, there are significant amounts of both Infant and Adult abalone in the lower range.

Around 90% of Infant abalone have a Volume measure that’s less than the lower 50% of Adult abalone, giving us reason to be concerned about overlap here.

Adult.50% Adult.75%
388.7 530.4
Adult.50% Adult.90%
148.5 331.1

b.) Shuck ~ Volume

  • Creating a scatterplot of SHUCK vs. VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME.
  • Be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2).
  • Using color to differentiate CLASS in the plots.
  • Repeat using color to differentiate by TYPE.

Question:

Comparing the 2 scatterplots.

  • What effect(s) does log-transformation appear to have on the variability present in the plot?
  • What are the implications for linear regression analysis?
  • Where do the various CLASS levels appear in the plots?
  • Where do the levels of TYPE appear in the plots?

Answer:

There is a reduction of variability of the logarithmic transformed Volume variable which gives clearer separation between the Sex and Age in the distribution. The logarithmic version is considerably more suited for linear regression analysis. The Sex variable is particularly well distributed with a visible cutoff of the majority of Infants that lie below 2 on the logarithmic scale.

Logarithmic version of Class variable also shows better conformity to a linear model, however, the top 3 Age classes, A3, A4 and A5 still exhibit a considerable amount of clustering at the right end of the scale.

Ratio L_Ratio
0.0008599 0.007917

4.) Age Class

a.) Linear Regression

  • Since abalone growth slows after Age Class A3, Infants in Classes A4 and A5 are considered mature and candidates for harvest.
  • Reclassifying the Infants in Classes A4 and A5 as ADULT.
  • This reclassification can be achieved using combineLevels(), but only on the abalone in classes A4 and A5.
  • I will use this recoded TYPE variable, in which the Infants in Age Classes A4 and A5 are reclassified as ADULT, for the remainder of this data analysis.
  • Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE.
  • Using the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE.
  • Applying summary() to the model object to produce results.


  • Model: L_Shuck ~ L_Volume + Class + Type
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.2706 -0.05429 0.000159 0 0.05599 0.3097
Coefficients
Variable Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81751 0.01904 -42.93573 0.00000
L_Volume 0.99930 0.01026 97.37662 0.00000
ClassA2 -0.01801 0.01100 -1.63610 0.10212
ClassA3 -0.04731 0.01247 -3.79282 0.00016
ClassA4 -0.07578 0.01406 -5.39142 0.00000
ClassA5 -0.11712 0.01413 -8.28816 0.00000
TypeAdult 0.02109 0.00769 2.74373 0.00618
Residual_Std_Error dof Multiple_RSq Adjusted_Rsq
0.08297 1029 0.9504 0.9501
F_Stat.value F_Stat.numdf F_Stat.dendf p-value
3287 6 1029 0

Question:

  • Interpret the trend in CLASS level coefficient estimates?

(Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays)

Answer:

The Age Class level coefficients show a clear increasing trend in the Class variable with respect to L_Shuck. Given the intercept is negative (meaning the expected value of Shuck would be negative given a value of 0 for our predictors), this means that we would expect to add an increasingly large negative number as the Classes go from A3 -> A5 (A2 is arguably not statically significant from A1, at a 10% level, while A3, A4 and A5 are), meaning that as the Class level variable increases, I would also expect L_Shuck to increase. This is the pattern I also see in the displays.

Question:

  • Is TYPE an important predictor in this regression model?

(Hint: this question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions)

Answer:

Type is also an important predictor in determining WHEN to harvest abalone. The TypeAdult variable tells us that we can expect to add 0.02 to the Shuck variable when the Type is Adult, meaning that in general Adult abalone are larger in Shuck and Volume than Infant abalone, which is consistent with our display in 3b.


The next 2 analysis step into an analysis of the residuals resulting from the regression model in (4)(a).


5.) Residual Analysis

a.) Normality

  • As “model” is the regression object, using model$residuals constructing a histogram and QQ plot.
  • Computing the skewness and kurtosis.
  • Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

Kurtosis Skewness
0.3433 -0.05945

b.) Variance

  • Plotting the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE.
  • Keeping in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals.
  • Presenting boxplots of the residuals differentiated by CLASS and TYPE (These 4 plots can be conveniently presented on one page using par(mfrow..) or grid.arrange().
  • Testing the homogeneity of variance of the residuals across Age Classes using bartlett.test().


    Bartlett test of homogeneity of variances

data:  Residuals by Class
Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Question:

  • What is revealed by the displays and calculations in (5)(a) and (5)(b)?
  • Does the model ‘fit’?
  • Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions?
  • Discuss.

Answer:

The plots in (5)(a) show a reasonably symmetric histogram and QQ plot, although there are a few outliers in the residuals, prominently distinguishable in the QQ-Plot.

The histogram shows a great deal of clustering in the middle, however, also looking at the Kurtosis and Skewness of the residuals, I see they are approximately zero indicating a relatively normal behavior indicating our model is relatively well fit for the data and that it may ultimately prove useful in determining when to harvest abalone.

The Bartlett test performed on Age Class indicates that the variance amongst the Age Classes don’t differ significantly (p = 0.45), although the presence of several outliers is an area of concern.


There is a trade-off faced in managing abalone harvest. The Infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This task will use VOLUME to form binary decision rules to guide harvesting.

If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.

The next steps in the analysis will require consideration of the proportions of Infants and Adults harvested at different cutoffs.

  • For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).”
  • Otherwise, the resulting Infant and Adult proportions cannot be directly compared and plotted as requested.

6.) Optimal Harvesting

a.) Split Values

A series of volumes covers the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes.

Infant Split Adult Split
133.8 384.5

b.) Split Visual

Presenting a plot showing the Infant proportions and the Adult proportions vs. volume.value. Computing the 50% “split” volume.value for each and showing it in a plot.

Question:

The two 50% “split” values serve a descriptive purpose illustrating the difference between the populations.

  • What do these values suggest regarding possible cutoffs for harvesting?

Answer:

From the information displayed, it would suggest that Infants will be harvested at a disproportional rate than Adult abalone based on the determined volume cutoff.

The slope of the Infant line is much steeper and reaches the cutoff point relatively quickly, meaning we have a high likelihood of harvesting a disproportion amount of Infants leading to an overall less than optimal harvest yield.


This part will address the determination of a volume.value corresponding to the observed maximum difference in harvest percentages of Adults and Infants.

To calculate this result, the vectors of proportions from item (6) must be used.

These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for Infants, and (1 - prop.adults) for Adults.

The reason the proportion for Infants drops sooner than Adults is that Infants are maturing and becoming Adults with larger volumes.


7.) Harvest Proportions

a.) Difference

  • Evaluating a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) vs. volume.value.
  • Comparing to the 50% “split” points determined in (6)(a).
  • There is considerable variability present in the peak area for this plot.
  • The observed “peak” difference may not be the best representation of the data.
  • One solution is to smooth the data to determine a more representative estimate of the maximum difference.

b.) Smoothed

  • Executing the following to create a smoothed curve to append to the plot in (a).
  • The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

c.) Combined

  • Presenting a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) vs. volume.value with the variable smooth.difference superimposed.
  • Determining the volume.value corresponding to the maximum smoothed difference.
  • Showing the estimated peak location corresponding to the cutoff determined.

d.) Cutoff

  • What separate harvest proportions for Infants and Adults would result if this cutoff is used?
  • Showing the separate harvest proportions (NOTE: the Adult harvest proportion is the “true positive rate” and the Infant harvest proportion is the “false positive rate”).

Confusion Matrix - a matrix that compares actual categorical levels/events to the predicted categorical levels. When we predict the right level, we refer to this as a True Positive. However, if we predict a level/event that did not happen this is called a False Positive.

Max Difference
Type Volume TPR FPR Yield
max.difference 262.143 0.742 0.176 0.584

There are alternative ways to determine cutoffs. 2 such cutoffs are described below.


8.) Alternative Cutoffs

a.) Zero A1 Infants Cutoff

Harvesting of Infants in Age CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of Infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting Infants from CLASS “A1.”

Computing this cutoff, and the proportions of Infants and Adults with VOLUME exceeding this cutoff.

Zero A1 Infants
Type Volume TPR FPR Yield
zero.A1.infants 206.786 0.826 0.287 0.676

b.) Inverse Cutoff

Another cutoff is one for which the proportion of Adults not harvested equals the proportion of Infants harvested.

  • This cutoff would equate these rates; effectively, our 2 errors: ‘missed’ Adults and ‘wrongly-harvested’ Infants.
  • This leaves for discussion which is the greater loss: a larger proportion of Adults not harvested or Infants harvested?
  • This cutoff is 237.6391.
  • Calculating the separate harvest proportions for Infants and Adults using this cutoff. Shows these proportions.
Equal Error
Type Volume TPR FPR Yield
equal.error 237.639 0.782 0.218 0.625

9.) ROC Curve

a.) Visual Area

  • Constructing an ROC curve by plotting (1 - prop.adults) vs. (1 - prop.infants).
  • Each point which appears corresponds to a particular volume.value.
  • Showing the location of the cutoffs determined in (7) and (8) on this plot and labeling each.

(ROC curve). A ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs. False Positive Rate at different classification thresholds. It tells us how good the model is at classification. Therefore, curves of different models can be compared directly in general or for different thresholds.

b.) Numerical Area

  • Numerically integrating the area under the ROC curve and reporting our result. This is most easily done with the auc() function from the “flux” package.
  • Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

(AUC-ROC curve). When we need to evaluate/visualize the performance of the multi-class classification problem, we use AUC (Area Under The Curve) and ROC (Receiver Operating Characteristics) curve. ROC is a probability curve and AUC represents the degree of separability. It tells how much the model is capable of distinguishing between classes such as harvest/not-harvest. The higher the AUC, the better the model is at predicting Adults as harvest and Infants as not-harvest. A highly accurate model has AUC close to 1 which reflects it’s good measure of separability. A poor model has AUC near 0 which means it has worst measure of separability.

Area Under Curve
0.8667

10.) Results

a.) Consolidated

Preparing a table showing each cutoff along with the following:

  • 1.) True Positive rate (1-prop.adults,
  • 2.) False Positive rate (1-prop.infants),
  • 3.) Harvest proportion of the total population
Consolidated Harvest Proportions
Type Volume TPR FPR Yield
max.difference 262.143 0.742 0.176 0.584
zero.A1.infants 206.786 0.826 0.287 0.676
equal.error 237.639 0.782 0.218 0.625

Question:

Based on the ROC curve, it is evident that a wide range of possible “cutoffs” exist. Compare and discuss the 3 cutoffs determined in this task.

Answer:

The 3 various cutoffs represent distinct characteristics of a given harvest of abalone. The max difference method which is essentially the brute force solution here, I believe represents arguably the worst-case scenario. The true-positive rate is the lowest of the 3, as well as the lowest overall harvest yield ratio. Although, this method does have the lowest false-positive rate as well.

The Zero A1 Infants harvest has the benefit of having the highest true-positive harvest rate and overall harvest yield. The drawback here is that this method also leads to the highest rate of false-positives, meaning more Infants will ultimately get harvested which is not ideal for future harvest and sustainability.

The equal error methodology is, as the name implies, right down the middle in terms of true-positive rate, false-positive rate, and overall harvest yield, essentially splitting max on the low end and Zero A1 on the high end (in terms of error).


Question:

Assuming we are expected to make a presentation of our analysis to the researchers, how would we do so?
Considering the following in our response:

  1. Would we make a specific recommendation or outline various choices and trade-offs?
  2. What qualifications/limitations would we present regarding our analysis?
  3. If it is necessary to proceed based on the current analysis, what suggestions would we have for implementation of a cutoff?
  4. What suggestions would we have for planning future abalone studies of this type?

Answer:

Given that there is no clear harvesting strategy that has ‘best fit’ for all use cases involved (harvest yield, protection of Infants, and minimization of false-positives, coupled with the fact that I are not an expert on abalone harvesting, I would not recommend any single strategy but rather simply present the findings above and present the various trade-offs, each technique present and let the experts decide.

The qualifications here are that this is an observational study on which I had no control over the sample gathered, therefore causality cannot be firmly established if I cannot control the variables under study. This must be understood before applying techniques discovered through this analysis.

As previously noted, I are no expert in the field of abalone harvesting, however, if forced to pick a method from the data observed and the analysis performed, I would recommend the Zero A1 Infants, as it has the highest proportion of harvest yield by volume and the highest true-positive rate, while still protecting the most important segment of the population which are A1 Infant abalone.

This will ensure that further harvest will be sustainable by leaving the Infants to mature into Adults. The high false-positive rate is indeed an area of concern, however, I believe this proportion could be misleading given the general difficulty in determining abalone Sex, and that the Infants that are harvested will be of justifiable size.

I would recommend that the executors of supplemental research in this area be consulted in advance of sample collection so that they may properly plan to have a control group in the study. I would also suggest that the researchers be advised by experts in this area in that planning phase on potential downfalls in sample collection and control group selection, so that we can minimize confounding factors in our research and conduct it with greater confidence than a simple observational study alone.