(Regression). Supervised Learning. What we want to do is Predict Values, i.e Age. Making forecasts, not estimating the relationship between values.
Raw Data
| Variable | Classes | Glimpse |
|---|---|---|
| Sex | integer | Infant, Infant, Infant, Infant, Infant, Infant |
| Length | double | 5.565, 3.675, 10.08, 4.095, 6.93, 7.875 |
| Diam | double | 4.095, 2.625, 7.35, 3.15, 4.83, 6.09 |
| Height | double | 1.26, 0.84, 2.205, 0.945, 1.785, 2.1 |
| Whole | double | 11.5, 3.5, 79.375, 4.6875, 21.1875, 27.375 |
| Shuck | double | 4.3125, 1.1875, 44, 2.25, 9.875, 11.5625 |
| Rings | integer | 6, 4, 6, 3, 6, 6 |
| Class | character | A1, A1, A1, A1, A1, A1 |
| Volume | double | 28.7137305, 8.103375, 163.36404, 12.18979125, 59.7473415, 100.713375 |
| Ratio | double | 0.150189471, 0.146543878, 0.26933712, 0.184580683, 0.16527932, 0.114806003 |
Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.
| Kurtosis | Skewness |
|---|---|
| 1.667 | 0.7147 |
| Kurtosis | Skewness |
|---|---|
| 0.5354 | -0.09392 |
(Variance). of a distribution is a measure of the variability of data. It measures how far a set of (random) numbers are spread out from their average value. It can be formulated as the average of the squared difference from the mean.
[1] 350 926
Bartlett test of homogeneity of variances
data: L_Ratio by Class
Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
| P_Value |
|---|
| Cannot Reject Null |
[1] 350 746
Bartlett test of homogeneity of variances
data: Ratio by Class
Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
| P_Value |
|---|
| Reject Null |
Question:
Answer:
L_Ratio exhibits better conformance to normality across Age Classes due to the QQ-Plots falling closer in-line with the expected quantiles, lower Kurtosis, as well as having a less skewed distribution as observed by both the skewness measure and the histograms of the respective variables. Finally, at 0.05 a significance level, we cannot reject the null hypothesis that L_Ratio comes from a normal distribution according to Bartlett’s test for Equality of Variances, which we must do for ratio analysis as it falls well short of significance.
| Variable | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Class | 4 | 1.0554 | 0.2638 | 38.3699 | 0.0000 |
| Sex | 2 | 0.0914 | 0.0457 | 6.6444 | 0.0014 |
| Class:Sex | 8 | 0.0267 | 0.0033 | 0.4855 | 0.8671 |
| Residuals | 1021 | 7.0206 | 0.0069 | NA | NA |
| Variable | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Class | 4 | 1.0554 | 0.2638 | 38.524 | 0.0000 |
| Sex | 2 | 0.0914 | 0.0457 | 6.671 | 0.0013 |
| Residuals | 1029 | 7.0473 | 0.0068 | NA | NA |
Question:
Answer:
It means the joint effect of Class and Sex is not statistically higher than the sum of both effect individually.
| diff | lwr | upr | p adj | |
|---|---|---|---|---|
| A2-A1 | -0.012 | -0.039 | 0.014 | 0.692 |
| A3-A1 | -0.034 | -0.059 | -0.009 | 0.002 |
| A4-A1 | -0.059 | -0.086 | -0.031 | 0.000 |
| A5-A1 | -0.100 | -0.128 | -0.072 | 0.000 |
| A3-A2 | -0.022 | -0.041 | -0.002 | 0.018 |
| A4-A2 | -0.046 | -0.068 | -0.024 | 0.000 |
| A5-A2 | -0.087 | -0.110 | -0.065 | 0.000 |
| A4-A3 | -0.024 | -0.045 | -0.004 | 0.011 |
| A5-A3 | -0.066 | -0.087 | -0.045 | 0.000 |
| A5-A4 | -0.041 | -0.065 | -0.018 | 0.000 |
| diff | lwr | upr | p adj | |
|---|---|---|---|---|
| Female-Infant | 0.016 | 0.001 | 0.031 | 0.038 |
| Male-Infant | 0.018 | 0.003 | 0.033 | 0.011 |
| Male-Female | 0.002 | -0.013 | 0.017 | 0.941 |
Question:
Answer:
The trends in Age Class is interesting in that for the Age Classes A3, A4 and A5 the L_Ratio values are not significantly significant (at a 95% level) from each other meaning that there is no relationship to L_Ratio and Age Classes A3, A4 and A5.
However, for the A1 and A2 classes there is a clear relationship between the Age Class and the value of L_Ratio.
Secondly, Male and Female abalone have a 0.941 similarity in means, suggesting they’re well beyond statistically significant, almost to the point of being identical. It would be logical to just combine them and split the abalone into Infant and Adult for ratio analysis.
Question:
Answer:
The 2 distributions have distinct shapes, however, there is a significant amount of overlap in the 0 - 500 range. Abalone having a VOLUME > 500, I could assume with relative safety would be Adult, however, there are significant amounts of both Infant and Adult abalone in the lower range.
Around 90% of Infant abalone have a Volume measure that’s less than the lower 50% of Adult abalone, giving us reason to be concerned about overlap here.
| Adult.50% | Adult.75% |
|---|---|
| 388.7 | 530.4 |
| Adult.50% | Adult.90% |
|---|---|
| 148.5 | 331.1 |
Question:
Comparing the 2 scatterplots.
Answer:
There is a reduction of variability of the logarithmic transformed Volume variable which gives clearer separation between the Sex and Age in the distribution. The logarithmic version is considerably more suited for linear regression analysis. The Sex variable is particularly well distributed with a visible cutoff of the majority of Infants that lie below 2 on the logarithmic scale.
Logarithmic version of Class variable also shows better conformity to a linear model, however, the top 3 Age classes, A3, A4 and A5 still exhibit a considerable amount of clustering at the right end of the scale.
| Ratio | L_Ratio |
|---|---|
| 0.0008599 | 0.007917 |
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -0.2706 | -0.05429 | 0.000159 | 0 | 0.05599 | 0.3097 |
| Variable | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -0.81751 | 0.01904 | -42.93573 | 0.00000 |
| L_Volume | 0.99930 | 0.01026 | 97.37662 | 0.00000 |
| ClassA2 | -0.01801 | 0.01100 | -1.63610 | 0.10212 |
| ClassA3 | -0.04731 | 0.01247 | -3.79282 | 0.00016 |
| ClassA4 | -0.07578 | 0.01406 | -5.39142 | 0.00000 |
| ClassA5 | -0.11712 | 0.01413 | -8.28816 | 0.00000 |
| TypeAdult | 0.02109 | 0.00769 | 2.74373 | 0.00618 |
| Residual_Std_Error | dof | Multiple_RSq | Adjusted_Rsq |
|---|---|---|---|
| 0.08297 | 1029 | 0.9504 | 0.9501 |
| F_Stat.value | F_Stat.numdf | F_Stat.dendf | p-value |
|---|---|---|---|
| 3287 | 6 | 1029 | 0 |
Question:
(Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays)
Answer:
The Age Class level coefficients show a clear increasing trend in the Class variable with respect to L_Shuck. Given the intercept is negative (meaning the expected value of Shuck would be negative given a value of 0 for our predictors), this means that we would expect to add an increasingly large negative number as the Classes go from A3 -> A5 (A2 is arguably not statically significant from A1, at a 10% level, while A3, A4 and A5 are), meaning that as the Class level variable increases, I would also expect L_Shuck to increase. This is the pattern I also see in the displays.
Question:
(Hint: this question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions)
Answer:
Type is also an important predictor in determining WHEN to harvest abalone. The TypeAdult variable tells us that we can expect to add 0.02 to the Shuck variable when the Type is Adult, meaning that in general Adult abalone are larger in Shuck and Volume than Infant abalone, which is consistent with our display in 3b.
The next 2 analysis step into an analysis of the residuals resulting from the regression model in (4)(a).
| Kurtosis | Skewness |
|---|---|
| 0.3433 | -0.05945 |
Bartlett test of homogeneity of variances
data: Residuals by Class
Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498
Question:
Answer:
The plots in (5)(a) show a reasonably symmetric histogram and QQ plot, although there are a few outliers in the residuals, prominently distinguishable in the QQ-Plot.
The histogram shows a great deal of clustering in the middle, however, also looking at the Kurtosis and Skewness of the residuals, I see they are approximately zero indicating a relatively normal behavior indicating our model is relatively well fit for the data and that it may ultimately prove useful in determining when to harvest abalone.
The Bartlett test performed on Age Class indicates that the variance amongst the Age Classes don’t differ significantly (p = 0.45), although the presence of several outliers is an area of concern.
There is a trade-off faced in managing abalone harvest. The Infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This task will use VOLUME to form binary decision rules to guide harvesting.
If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.
The next steps in the analysis will require consideration of the proportions of Infants and Adults harvested at different cutoffs.
A series of volumes covers the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes.
| Infant Split | Adult Split |
|---|---|
| 133.8 | 384.5 |
Presenting a plot showing the Infant proportions and the Adult proportions vs. volume.value. Computing the 50% “split” volume.value for each and showing it in a plot.
Question:
The two 50% “split” values serve a descriptive purpose illustrating the difference between the populations.
Answer:
From the information displayed, it would suggest that Infants will be harvested at a disproportional rate than Adult abalone based on the determined volume cutoff.
The slope of the Infant line is much steeper and reaches the cutoff point relatively quickly, meaning we have a high likelihood of harvesting a disproportion amount of Infants leading to an overall less than optimal harvest yield.
This part will address the determination of a volume.value corresponding to the observed maximum difference in harvest percentages of Adults and Infants.
To calculate this result, the vectors of proportions from item (6) must be used.
These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for Infants, and (1 - prop.adults) for Adults.
The reason the proportion for Infants drops sooner than Adults is that Infants are maturing and becoming Adults with larger volumes.
Confusion Matrix - a matrix that compares actual categorical levels/events to the predicted categorical levels. When we predict the right level, we refer to this as a True Positive. However, if we predict a level/event that did not happen this is called a False Positive.
| Type | Volume | TPR | FPR | Yield |
|---|---|---|---|---|
| max.difference | 262.143 | 0.742 | 0.176 | 0.584 |
There are alternative ways to determine cutoffs. 2 such cutoffs are described below.
Harvesting of Infants in Age CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of Infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting Infants from CLASS “A1.”
Computing this cutoff, and the proportions of Infants and Adults with VOLUME exceeding this cutoff.
| Type | Volume | TPR | FPR | Yield |
|---|---|---|---|---|
| zero.A1.infants | 206.786 | 0.826 | 0.287 | 0.676 |
Another cutoff is one for which the proportion of Adults not harvested equals the proportion of Infants harvested.
| Type | Volume | TPR | FPR | Yield |
|---|---|---|---|---|
| equal.error | 237.639 | 0.782 | 0.218 | 0.625 |
(ROC curve). A ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs. False Positive Rate at different classification thresholds. It tells us how good the model is at classification. Therefore, curves of different models can be compared directly in general or for different thresholds.
(AUC-ROC curve). When we need to evaluate/visualize the performance of the multi-class classification problem, we use AUC (Area Under The Curve) and ROC (Receiver Operating Characteristics) curve. ROC is a probability curve and AUC represents the degree of separability. It tells how much the model is capable of distinguishing between classes such as harvest/not-harvest. The higher the AUC, the better the model is at predicting Adults as harvest and Infants as not-harvest. A highly accurate model has AUC close to 1 which reflects it’s good measure of separability. A poor model has AUC near 0 which means it has worst measure of separability.
| Area Under Curve |
|---|
| 0.8667 |
Preparing a table showing each cutoff along with the following:
| Type | Volume | TPR | FPR | Yield |
|---|---|---|---|---|
| max.difference | 262.143 | 0.742 | 0.176 | 0.584 |
| zero.A1.infants | 206.786 | 0.826 | 0.287 | 0.676 |
| equal.error | 237.639 | 0.782 | 0.218 | 0.625 |
Question:
Based on the ROC curve, it is evident that a wide range of possible “cutoffs” exist. Compare and discuss the 3 cutoffs determined in this task.
Answer:
The 3 various cutoffs represent distinct characteristics of a given harvest of abalone. The max difference method which is essentially the brute force solution here, I believe represents arguably the worst-case scenario. The true-positive rate is the lowest of the 3, as well as the lowest overall harvest yield ratio. Although, this method does have the lowest false-positive rate as well.
The Zero A1 Infants harvest has the benefit of having the highest true-positive harvest rate and overall harvest yield. The drawback here is that this method also leads to the highest rate of false-positives, meaning more Infants will ultimately get harvested which is not ideal for future harvest and sustainability.
The equal error methodology is, as the name implies, right down the middle in terms of true-positive rate, false-positive rate, and overall harvest yield, essentially splitting max on the low end and Zero A1 on the high end (in terms of error).
Question:
Assuming we are expected to make a presentation of our analysis to the researchers, how would we do so?
Considering the following in our response:
Answer:
Given that there is no clear harvesting strategy that has ‘best fit’ for all use cases involved (harvest yield, protection of Infants, and minimization of false-positives, coupled with the fact that I are not an expert on abalone harvesting, I would not recommend any single strategy but rather simply present the findings above and present the various trade-offs, each technique present and let the experts decide.
The qualifications here are that this is an observational study on which I had no control over the sample gathered, therefore causality cannot be firmly established if I cannot control the variables under study. This must be understood before applying techniques discovered through this analysis.
As previously noted, I are no expert in the field of abalone harvesting, however, if forced to pick a method from the data observed and the analysis performed, I would recommend the Zero A1 Infants, as it has the highest proportion of harvest yield by volume and the highest true-positive rate, while still protecting the most important segment of the population which are A1 Infant abalone.
This will ensure that further harvest will be sustainable by leaving the Infants to mature into Adults. The high false-positive rate is indeed an area of concern, however, I believe this proportion could be misleading given the general difficulty in determining abalone Sex, and that the Infants that are harvested will be of justifiable size.
I would recommend that the executors of supplemental research in this area be consulted in advance of sample collection so that they may properly plan to have a control group in the study. I would also suggest that the researchers be advised by experts in this area in that planning phase on potential downfalls in sample collection and control group selection, so that we can minimize confounding factors in our research and conduct it with greater confidence than a simple observational study alone.