##Data Analysis #2

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

Test items start here. There are 10 sections, for a total of 75 points.

Section 1: (5 points)

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] 0.7147056
## [1] 1.667298
## [1] 0.7157417
## [1] 4.676321
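The two pairs of values above reflect the two reporting conventions. The following is a minimal base-R sketch, assuming plain moment estimators; `skew_moment` and `kurt_moment` are hypothetical helpers (not the ‘rockchalk’ or ‘moments’ functions) and the data are simulated, not RATIO.

```r
# Hypothetical helpers using plain moment estimators (divide by n throughout).
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}
kurt_moment <- function(x, excess = FALSE) {
  d <- x - mean(x)
  k <- mean(d^4) / mean(d^2)^2
  if (excess) k - 3 else k
}

set.seed(1)
x <- rlnorm(1000, meanlog = -2, sdlog = 0.3)  # simulated right-skewed stand-in for RATIO

skew_moment(x)                 # positive for right-skewed data
kurt_moment(x)                 # about 3 for a normal distribution
kurt_moment(x, excess = TRUE)  # the same value with 3.0 subtracted
```

In the output above, 4.676321 - 3 is 1.676321 rather than 1.667298. The small remaining gap is consistent with the two packages using different variance denominators (n - 1 versus n), although that is an inference from the printed numbers rather than something stated in the assignment.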

(1)(b) Transform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## [1] -0.09391548
## [1] 0.5354309
## [1] -0.09405162
## [1] 3.542266

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).

## [[1]]
## 
##  Bartlett test of homogeneity of variances
## 
## data:  RATIO by CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
## 
## 
## [[2]]
## 
##  Bartlett test of homogeneity of variances
## 
## data:  L_RATIO by CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
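A list of two tests like the one printed above could be produced along these lines. This is only a sketch on simulated data: the data frame `sim` and its values are assumptions, and only the `bartlett.test()` formula interface is taken from base R.

```r
set.seed(2)
# Simulated stand-in for mydata. The spread is constant on the log10 scale by
# construction; back-transforming makes the spread of RATIO scale with its level.
CLASS <- factor(rep(c("A1", "A2", "A3", "A4", "A5"), each = 200))
L_RATIO <- rnorm(1000, mean = -0.8 - 0.02 * as.integer(CLASS), sd = 0.08)
sim <- data.frame(CLASS = CLASS, RATIO = 10^L_RATIO, L_RATIO = L_RATIO)

tests <- list(
  bartlett.test(RATIO ~ CLASS, data = sim),
  bartlett.test(L_RATIO ~ CLASS, data = sim)
)
tests
sapply(tests, function(t) t$p.value)  # a small p-value argues against equal variances
```

With five classes the test has 4 degrees of freedom, matching the `df = 4` in the output above.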

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

Answer: In the histograms, RATIO is slightly right-skewed, with most of the data on the left side, while L_RATIO is roughly symmetric. In the QQ plots, the RATIO points depart from the QQ line beyond a theoretical quantile of about 2, whereas the L_RATIO points follow the line for most values. The calculations agree: the skewness of RATIO is 0.71, well away from the normal-distribution value of 0, while the skewness of L_RATIO is -0.09, much closer to 0. The excess kurtosis is 1.67 for RATIO and 0.54 for L_RATIO; the lower value is closer to the normal-distribution value of 0, which again favors L_RATIO. In the Bartlett test, RATIO has a large K-squared value (21.49) and a small p-value (0.0002531), so we reject the null hypothesis of homogeneous variances across classes. L_RATIO has a small K-squared value (3.1891) and a large p-value (0.5267), so we fail to reject that null hypothesis. L_RATIO therefore shows better conformance to a normal distribution with homogeneous variances across age classes.

Section 2: (10 points)

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
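The two fits summarized above follow a standard pattern, sketched here on simulated data. The data frame `sim` and its effect sizes are assumptions chosen for illustration; no interaction is built into the simulation.

```r
set.seed(3)
n <- 600
sim <- data.frame(
  CLASS = factor(sample(c("A1", "A2", "A3", "A4", "A5"), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Additive effects only, so no true CLASS:SEX interaction is simulated.
sim$L_RATIO <- -0.8 - 0.02 * as.integer(sim$CLASS) -
  0.015 * (sim$SEX == "I") + rnorm(n, sd = 0.08)

fit_int <- aov(L_RATIO ~ CLASS + SEX + CLASS:SEX, data = sim)
fit_add <- aov(L_RATIO ~ CLASS + SEX, data = sim)

summary(fit_int)
summary(fit_add)
anova(fit_add, fit_int)  # F test of the interaction term on its own
```

Because aov() reports sequential sums of squares, the CLASS and SEX rows keep the same sums of squares in both tables. Dropping the interaction only moves its 8 degrees of freedom and its sum of squares into the residual row, which is why the F values shift slightly.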

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

Answer: In the model with the interaction term, the sum of squares and mean square for CLASS are 1.055 and 0.26384, and for SEX they are 0.091 and 0.04569. Removing CLASS:SEX leaves these values unchanged. The F statistic and p-value are 38.370 and < 2e-16 for CLASS, and 6.644 and 0.00136 for SEX; after CLASS:SEX is removed they change only slightly (to 38.524 for CLASS, and to 6.671 with p = 0.00132 for SEX). The small p-values indicate that both CLASS and SEX are statistically significant in explaining L_RATIO. The interaction term itself is not significant (F = 0.485, p = 0.867), and dropping it barely changes the other results. This suggests that CLASS and SEX act independently on L_RATIO: the effect of CLASS does not depend on SEX, and vice versa.

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236690 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881
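A table like the one above can be read mechanically: a pair differs at the 95% family-wise level when its interval excludes zero. The sketch below illustrates this on simulated data; `sim` and the added `differs` column are assumptions made for the example.

```r
set.seed(4)
n <- 600
sim <- data.frame(
  CLASS = factor(sample(c("A1", "A2", "A3", "A4", "A5"), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
sim$L_RATIO <- -0.8 - 0.02 * as.integer(sim$CLASS) + rnorm(n, sd = 0.08)

fit <- aov(L_RATIO ~ CLASS + SEX, data = sim)
tk  <- TukeyHSD(fit, conf.level = 0.95)

# Flag the pairs whose confidence interval lies entirely on one side of zero.
cls <- as.data.frame(tk$CLASS)
cls$differs <- cls$lwr > 0 | cls$upr < 0
cls
```

Five classes give 10 pairwise comparisons and three sexes give 3, matching the row counts in the output above.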

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

Answer: For CLASS, every pairwise comparison except A2-A1 has a low adjusted p-value (0.0178413 or below), so we reject the null hypothesis of equal means for those pairs; these classes cannot be combined. The differences are all negative and grow in magnitude as the classes move further apart, which indicates that L_RATIO decreases with age class. The A2-A1 comparison has a much higher p-value (0.6919456), so we cannot reject the null hypothesis that A1 and A2 abalones come from the same population; there is no significant difference between them. For SEX, the infant-female (0.0376673) and male-infant (0.0111881) comparisons have low p-values, so infants cannot be combined with the adults. The male-female comparison, however, has a p-value of 0.9412689, much higher than the other two. We therefore fail to reject the null hypothesis that males and females have the same mean L_RATIO, which suggests that sex makes no significant difference among adult abalones and that males and females can be combined into a single ‘adults’ category.

Section 3: (10 points)

(3)(a1) We combine “M” and “F” into a new level, “ADULT”. (While this could be accomplished using combineLevels() from the ‘rockchalk’ package, we use base R code because many students do not have access to the rockchalk package.) This necessitated defining a new variable, TYPE, in mydata which had two levels: “I” and “ADULT”.

## 
## Check on definition of TYPE object (should be an integer):  integer
## 
## mydata$TYPE is treated as a factor:  TRUE
##    
##     ADULT   I
##   F   326   0
##   I     0 329
##   M   381   0
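One way to do this in base R is sketched below on a toy vector. The `SEX` values here are made up; only the level names come from the data description.

```r
# Toy stand-in for mydata$SEX; the real column has levels "F", "I", "M".
SEX <- factor(c("F", "I", "M", "M", "I", "F", "M"))

# Map "F" and "M" to "ADULT" and keep "I"; factor() sorts levels alphabetically.
TYPE <- factor(ifelse(SEX == "I", "I", "ADULT"))

typeof(TYPE)      # "integer": factors are stored as integer codes
is.factor(TYPE)   # TRUE
table(SEX, TYPE)  # every F and M falls under ADULT, every I under I
```

This explains the check printed above: the TYPE object is stored as an integer while still being treated as a factor.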

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

Answer: The histogram of infant volume is strongly right-skewed, with almost all of the data on the left side of the graph, between 0 and 300 cm^3. The histogram of adult volume is more evenly distributed, with most of the data spread over 200-600 cm^3. Because most infants are small in volume, most of them could be separated from the adults easily. However, the two distributions overlap: some infants exceed 400 cm^3 and could be miscategorized as adults. So if volume were the only factor used to categorize abalones as infants or adults, abalones over 400 cm^3 would be difficult to categorize.

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

Answer: There is so much overlap that it is hard to discern distinct lines and trends for individual classes and types, but some observations can still be made. In the ‘Shuck vs. Volume’ plot colored by CLASS, the younger age classes (A1, A2) sit at the lower end of the graph (low shuck weight and low volume). The older classes (A3, A4, A5) appear throughout the graph but mostly in the middle and upper end (medium to high shuck weight and medium to high volume). The oldest class (A5) also tends to have larger volume relative to shuck weight: the ‘seagreen’ data points lie to the right of where most of the data is concentrated. In the ‘Shuck vs. Volume’ plot colored by TYPE, almost all infants are at the lower end of the graph (low shuck weight and low volume), while the adults appear all over the graph but mostly in the middle and upper end. In the ‘Log10 Shuck vs. Log10 Volume’ plots for CLASS and TYPE, the data are immediately seen to be more concentrated than in the previous plots. As in the ‘Shuck vs. Volume’ plots, the younger classes sit at the lower end and the older classes in the middle and upper end, and most of the class A5 data again lies on the lower right side of the main cluster. Both plots show that adult and older abalones have higher weight and volume. Overall, the log10 plots show that the logarithmic transformation reduces the variability of the data, so that the points are more concentrated and fit more closely around a straight line. With the data less scattered, trends are easier to interpret, which makes the log-transformed variables better suited to linear regression.

Section 4: (5 points)

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. Reclassify the infants in classes A4 and A5 as ADULTS. This reclassification could have been achieved using combineLevels(), but only on the abalones in classes A4 and A5. We will do this recoding of the TYPE variable using base R functions. We will use this recoded TYPE variable, in which the infants in A4 and A5 are reclassified as ADULTS, for the remainder of this data analysis assignment.

## 
## Check on redefinition of TYPE object (should be an integer):  integer
## 
## mydata$TYPE is treated as a factor:  TRUE
## 
## Three-way contingency table for SEX, CLASS, and TYPE:
## , ,  = ADULT
## 
##    
##      A1  A2  A3  A4  A5
##   F   5  41 121  82  77
##   I   0   0   0  21  19
##   M  12  62 143  85  79
## 
## , ,  = I
## 
##    
##      A1  A2  A3  A4  A5
##   F   0   0   0   0   0
##   I  91 133  65   0   0
##   M   0   0   0   0   0
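The recoding described in (4)(a1) can be sketched in base R as follows. The toy data frame and the name `mature_infant` are assumptions made for illustration.

```r
# Toy stand-in for the relevant columns of mydata.
toy <- data.frame(
  SEX   = factor(c("I", "I", "I", "F", "M", "I")),
  CLASS = factor(c("A1", "A3", "A4", "A2", "A5", "A5"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)
toy$TYPE <- factor(ifelse(toy$SEX == "I", "I", "ADULT"))

# Infants in the two oldest classes are treated as harvestable adults.
mature_infant <- toy$SEX == "I" & toy$CLASS %in% c("A4", "A5")
toy$TYPE[mature_infant] <- "ADULT"

table(toy$SEX, toy$CLASS, toy$TYPE)
```

Assigning "ADULT" works here because it is already a level of TYPE. After the recode, the I row of the ADULT slice is non-zero only in A4 and A5, as in the three-way table above.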

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16
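Because the response is on the log10 scale, each CLASS and TYPE coefficient can be back-transformed into a multiplier on SHUCK. The short sketch below does this with the estimates copied from the output above; the name `multiplier` is only for illustration.

```r
# Coefficient estimates copied from the summary() output above (log10 scale).
coefs <- c(CLASSA2 = -0.018005, CLASSA3 = -0.047310, CLASSA4 = -0.075782,
           CLASSA5 = -0.117119, TYPEI = -0.021093)

# On the original scale each coefficient acts as a multiplier on SHUCK,
# relative to the baseline (CLASS A1, TYPE ADULT) at the same VOLUME.
multiplier <- 10^coefs
round(multiplier, 3)
```

The CLASS multipliers fall steadily, to roughly 0.76 for A5, so at a fixed volume an A5 abalone is predicted to have about three quarters of the shuck weight of an A1 abalone. The TYPE multiplier is about 0.95.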

Essay Question: Interpret the trend in CLASS level coefficient estimates. (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).

Answer: Each successive class has a more negative coefficient estimate than the last (from -0.018 for A2 down to -0.117 for A5, relative to A1). This means that, at a given L_VOLUME, L_SHUCK is lower in older classes: as an abalone ages, its shuck weight increases more slowly relative to its volume than in the class before it. Perhaps, as an abalone reaches a very old age, shuck weight doesn’t increase at all or even decreases. The younger age classes, by contrast, show a sharp increase in shuck weight. This pattern was present in the Data Analysis #1 assignment and matches the earlier scatterplots, where class A5 tended to have larger volume relative to shuck weight, but it can’t always be seen clearly in graphs. This regression makes it explicit.

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

Answer: TYPE is statistically significant but not an important predictor in this regression. From the analysis performed in (4)(a1) and (4)(a2), TYPE has a low p-value (0.00618) but a small coefficient estimate (-0.021). In magnitude, that estimate is far smaller than the L_VOLUME coefficient and smaller than every CLASS coefficient except A2 (-0.018). So TYPE is less impactful than the other variables and adds little to predictions of L_SHUCK for harvesting decisions; its effect is comparable in size only to that of class A2.


The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).


Section 5: (5 points)

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] -0.05945234
## [1] 0.3433082
## [1] -0.05953853
## [1] 3.349772

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (these four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange()). Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  model$residuals by mydata$CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

Answer: The scatterplots of the residuals show no trend across L_VOLUME for either class or type. The data end at an L_VOLUME of about 3.0 (roughly 1000 cm^3), and most adults and most of the older classes (A4, A5) sit at the higher end of the volume range. The boxplots of the residuals look fairly symmetric and evenly distributed. All five classes have similar quartiles and ranges. Classes A1, A2, and A3 have some outliers on the high side; classes A1, A2, A4, and A5 have outliers on the low side. The adult and infant types also have similar quartiles and ranges, although adults have more outliers than infants, mostly on the low side. The skewness (-0.06) and excess kurtosis (0.34) of the residuals are both close to 0, consistent with a normal distribution. The Bartlett test gives a small K-squared value (3.6882) and a large p-value (0.4498), so we fail to reject the null hypothesis of homogeneous variances across age classes. Overall, the residuals conform reasonably well to a normal distribution with homogeneous variances, so the model fits, and L_VOLUME, and ultimately VOLUME, is useful for harvesting decisions.


There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.


Section 6: (5 points)

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.

## [1] 133.8199
## [1] 384.5138
## [1] 3.710996 3.810202 3.909408 4.008615 4.107821 4.207027
## [1] 0.003460208 0.003460208 0.003460208 0.006920415 0.013840830 0.013840830
## [1] 0 0 0 0 0 0
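The supplied loop is not reproduced in this report, so the following is only a sketch of how such a loop might look, run on simulated volumes. The names `min.v`, `delta`, `volume.value`, `prop.infants` and `prop.adults` follow the assignment text; `sim`, `idxi`, `idxa`, `max.v` and the split variables are assumptions.

```r
set.seed(6)
# Simulated stand-in for mydata; infants are given smaller volumes on average.
sim <- data.frame(TYPE = factor(rep(c("I", "ADULT"), times = c(300, 700))))
sim$VOLUME <- ifelse(sim$TYPE == "I", rgamma(1000, shape = 2, scale = 80),
                     rgamma(1000, shape = 6, scale = 70))

idxi <- sim$TYPE == "I"
idxa <- sim$TYPE == "ADULT"

max.v <- max(sim$VOLUME)
min.v <- min(sim$VOLUME)
delta <- (max.v - min.v) / 10000

prop.infants <- numeric(10000)
prop.adults  <- numeric(10000)
volume.value <- numeric(10000)

for (k in 1:10000) {
  value <- min.v + k * delta
  volume.value[k] <- value
  # Proportion of each group at or below the candidate cutoff (not harvested).
  prop.infants[k] <- sum(sim$VOLUME[idxi] <= value) / sum(idxi)
  prop.adults[k]  <- sum(sim$VOLUME[idxa] <= value) / sum(idxa)
}

# 50% "split": the cutoff at which half of the group falls at or below it.
split.infants <- volume.value[which.min(abs(prop.infants - 0.5))]
split.adults  <- volume.value[which.min(abs(prop.adults - 0.5))]
c(split.infants, split.adults)
```

Using the same `min.v`, `delta` and loop range for both groups keeps the two proportion vectors aligned on one `volume.value` grid, which is what allows them to be compared and plotted together.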

(6)(b) Present a plot showing the infant proportions and the adult proportions versus volume.value. Compute the 50% “split” volume.value for each and show on the plot.

Essay Question: The two 50% “split” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?

Answer: The two ‘split’ values bracket a sensible range of cutoffs for harvesting. The plot shows that infant abalones generally have lower volume than adults, which is to be expected, and the gap between the two split values shows how far apart the two populations are. A cutoff at the infant ‘split’ value (133.82 cm^3) would harvest half of the infants. A cutoff at the adult ‘split’ value (384.51 cm^3) would protect most infants but would harvest only half of the adults. Therefore, to save most infants from being harvested, a cutoff closer to 384.51 cm^3 is preferable, with lower cutoffs trading infant protection for a larger adult yield.


This part will address the determination of a volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.


Section 7: (10 points)

(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(a). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.

(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.
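The smoothing code supplied with the assignment is not reproduced in this report. The sketch below shows one way steps (7)(a) to (7)(c) could be carried out, assuming `loess()` with `span = 0.25` as the smoother and a 2000-point grid on simulated volumes; none of these choices is confirmed by the source.

```r
set.seed(7)
# Simulated volumes (stand-ins for the infant and adult VOLUME values).
vol.i <- rgamma(300, shape = 2, scale = 80)
vol.a <- rgamma(700, shape = 6, scale = 70)

volume.value <- seq(min(c(vol.i, vol.a)), max(c(vol.i, vol.a)), length.out = 2000)
prop.infants <- ecdf(vol.i)(volume.value)  # share of infants at or below each cutoff
prop.adults  <- ecdf(vol.a)(volume.value)

harvest.a <- 1 - prop.adults   # adults harvested at each cutoff
harvest.i <- 1 - prop.infants  # infants harvested at each cutoff
difference <- harvest.a - harvest.i

# Smooth each harvested-proportion curve separately, then take the difference.
# loess() and span = 0.25 are assumptions, not necessarily the supplied code.
fit.a <- loess(harvest.a ~ volume.value, span = 0.25)
fit.i <- loess(harvest.i ~ volume.value, span = 0.25)
smooth.difference <- predict(fit.a) - predict(fit.i)

cutoff <- volume.value[which.max(smooth.difference)]

plot(volume.value, difference, type = "l", xlab = "Volume", ylab = "Difference")
lines(volume.value, smooth.difference, col = "red", lwd = 2)
abline(v = cutoff, lty = 2)  # estimated peak location
cutoff
```

Taking `which.max()` on the smoothed curve rather than the raw difference avoids picking a cutoff that reflects a single jagged step near the peak.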

(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions (NOTE: the adult harvest proportion is the “true positive rate” and the infant harvest proportion is the “false positive rate”).

Code for calculating the adult harvest proportion is provided.

## [1] 0.7416332
## [1] 0.1764706
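The two proportions above are the shares of adults and infants whose VOLUME exceeds the cutoff. A toy illustration follows; the helper `harvest_prop` and the volumes are hypothetical and are not the abalone data.

```r
# Hypothetical helper: share of a group whose volume exceeds the cutoff.
harvest_prop <- function(volume, cutoff) sum(volume > cutoff) / length(volume)

# Toy volumes, not the abalone data.
adult.vol  <- c(150, 220, 300, 410, 520, 610)
infant.vol <- c(40, 90, 130, 180, 270)
cutoff <- 262

tpr <- harvest_prop(adult.vol, cutoff)   # true positive rate: adults harvested
fpr <- harvest_prop(infant.vol, cutoff)  # false positive rate: infants harvested
c(tpr = tpr, fpr = fpr, difference = tpr - fpr)
```

In the toy data, four of the six adults and one of the five infants exceed the cutoff, giving rates of 4/6 and 1/5.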

There are alternative ways to determine cutoffs. Two such cutoffs are described below.


Section 8: (10 points)

(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”

Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions.

## [1] 0.2871972
## [1] 0.8259705
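The supplied code for this cutoff is not shown in this report. One plausible way to compute it, sketched on a toy data frame with hypothetical values, is to take the first candidate cutoff above the largest A1 infant volume.

```r
# Toy stand-in for mydata (hypothetical values).
toy <- data.frame(
  CLASS  = c("A1", "A1", "A2", "A3", "A4", "A5"),
  TYPE   = c("I", "I", "I", "ADULT", "ADULT", "ADULT"),
  VOLUME = c(60, 205, 250, 180, 400, 520)
)
volume.value <- seq(50, 550, by = 0.5)  # candidate cutoffs, playing the role of the grid in (6)(a)

# Largest A1 infant volume; any cutoff above it harvests no A1 infants.
max.a1 <- max(toy$VOLUME[toy$CLASS == "A1" & toy$TYPE == "I"])
cutoff.a1 <- volume.value[volume.value > max.a1][1]

prop.infants.harvested <- mean(toy$VOLUME[toy$TYPE == "I"] > cutoff.a1)
prop.adults.harvested  <- mean(toy$VOLUME[toy$TYPE == "ADULT"] > cutoff.a1)
c(cutoff.a1, prop.infants.harvested, prop.adults.harvested)
```

In the toy data the cutoff is 205.5, which still harvests the one A2 infant above it. This mirrors the non-zero infant proportion in the output above: the rule protects A1 infants only.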

(8)(b) Another cutoff is one for which the proportion of adults not harvested equals the proportion of infants harvested. This cutoff would equate these rates; effectively, our two errors: ‘missed’ adults and wrongly-harvested infants. This leaves for discussion which is the greater loss: a larger proportion of adults not harvested or infants harvested? This cutoff is 237.7383. Calculate the separate harvest proportions for infants and adults using this cutoff. Show these proportions. Code for determining this cutoff is provided.

## [1] 0.2179931
## [1] 0.7817938
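The equal-error rule in (8)(b) can be sketched as a search for the grid point where the two error rates are closest. The code below runs on simulated volumes; `ecdf()` is used in place of the assignment's loop and the variable names other than `volume.value`, `prop.infants` and `prop.adults` are assumptions.

```r
set.seed(8)
vol.i <- rgamma(300, shape = 2, scale = 80)  # simulated infant volumes
vol.a <- rgamma(700, shape = 6, scale = 70)  # simulated adult volumes

volume.value <- seq(min(c(vol.i, vol.a)), max(c(vol.i, vol.a)), length.out = 10000)
prop.infants <- ecdf(vol.i)(volume.value)
prop.adults  <- ecdf(vol.a)(volume.value)

# Adults not harvested (prop.adults) should match infants harvested (1 - prop.infants).
k <- which.min(abs(prop.adults - (1 - prop.infants)))
cutoff.equal <- volume.value[k]

c(cutoff = cutoff.equal,
  infants.harvested = 1 - prop.infants[k],
  adults.harvested  = 1 - prop.adults[k])
```

A quick consistency check on the output above: at this cutoff the two harvest proportions should sum to nearly 1, and 0.2179931 + 0.7817938 is 0.9997869.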

Section 9: (5 points)

(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (7) and (8) on this plot and label each.

(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

## [1] 0.8666894
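The assignment points to auc() from the “flux” package. As a package-free alternative, the area can be approximated with the trapezoidal rule in base R, sketched here on simulated volumes; the variable names are assumptions.

```r
set.seed(9)
vol.i <- rgamma(300, shape = 2, scale = 80)  # simulated infant volumes
vol.a <- rgamma(700, shape = 6, scale = 70)  # simulated adult volumes

volume.value <- seq(min(c(vol.i, vol.a)), max(c(vol.i, vol.a)), length.out = 10000)
fpr <- 1 - ecdf(vol.i)(volume.value)  # infants harvested
tpr <- 1 - ecdf(vol.a)(volume.value)  # adults harvested

# Trapezoidal rule; sort by the x-axis first because fpr falls as the cutoff rises.
ord <- order(fpr, tpr)
x <- c(0, fpr[ord], 1)  # pad so the curve runs from (0, 0) to (1, 1)
y <- c(0, tpr[ord], 1)
auc.trapezoid <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc.trapezoid
```

The result can be cross-checked against the share of adult-infant pairs in which the adult has the larger volume, which is another way of expressing the same area.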

Section 10: (10 points)

(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1-prop.adults), 2) false positive rate (1-prop.infants), 3) harvest proportion of the total population

##                            Volume True Pos. Rate False P. Rate
## Max Difference Cutoff 262.1430097      0.7416332     0.1764706
## A1 Infant Cutoff      206.7859796      0.8259705     0.2871972
## Equal Harvest Cutoff  237.6390914      0.7817938     0.2179931
##                       Harvest Proportion
## Max Difference Cutoff          0.5839768
## A1 Infant Cutoff               0.6756757
## Equal Harvest Cutoff           0.6245174
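The harvest proportion column can be reproduced from the two rates and the group sizes. The sketch below uses the rates printed above and the adult and infant counts from the three-way table in (4)(a1); the data frame `rates` is only for illustration.

```r
# Group sizes from the three-way table in (4)(a1): 747 adults and 289 infants.
n.adults  <- 326 + 40 + 381
n.infants <- 91 + 133 + 65

rates <- data.frame(
  cutoff = c("Max Difference", "A1 Infant", "Equal Harvest"),
  tpr    = c(0.7416332, 0.8259705, 0.7817938),
  fpr    = c(0.1764706, 0.2871972, 0.2179931)
)

# Overall harvest proportion: harvested adults plus harvested infants over everyone.
rates$harvest <- (rates$tpr * n.adults + rates$fpr * n.infants) /
  (n.adults + n.infants)
rates
```

The computed values agree with the table (0.5839768, 0.6756757 and 0.6245174), which shows that the harvest proportion is a weighted average of the two rates, weighted by group size.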

Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the three cutoffs determined in this assignment.

Answer: Each of the three cutoffs has its own advantages. The Max Difference Cutoff has the largest volume and the lowest false positive rate. The A1 Infant Cutoff has the highest true positive rate and the largest harvest proportion. The Equal Harvest Cutoff has values that fall between those of the other two. Each has its own disadvantages as well. The Max Difference Cutoff has the lowest true positive rate and the lowest harvest proportion. The A1 Infant Cutoff has the smallest volume and the highest false positive rate. The Equal Harvest Cutoff again falls in between. The cutoffs therefore represent different objectives. If you want the greatest harvest proportion with the most correctly identified adults (true positives), go with the A1 Infant Cutoff. If you want the largest volume threshold with the fewest wrongly harvested infants (false positives), go with the Max Difference Cutoff. If you want a compromise between the two, go with the Equal Harvest Cutoff.

Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators. How would you do so? Consider the following in your answer:

  1. Would you make a specific recommendation or outline various choices and tradeoffs?
  2. What qualifications or limitations would you present regarding your analysis?
  3. If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff?
  4. What suggestions would you have for planning future abalone studies of this type?

Answer:

  1. I don’t think that I would make an outright recommendation. Given the tradeoffs among the cutoffs, my recommendation would depend on the objective set by the executives or organizing body. If no objective were given, I would only present the data and analysis for their consideration.
  2. In any data analysis, there are certain cautions. The analysis is only as good as the data put into it, so I would state that the analysis is limited by the given data. I would also show the various graphs within this analysis and point out the number of outliers. For this subject matter in particular, infants are something to be cautious about. Without a healthy number of infants, populations of adult abalones may not flourish. Infants are also sometimes hard to identify, so they must be handled with care. I would present that as the primary limitation.
  3. The Max Difference Cutoff is promising because it has the lowest likelihood of overharvesting infants. However, if this analysis were presented to an abalone ‘business’ of some sort, I’d expect executives to want the greatest profit for the least cost or loss. Therefore, I would go with the Equal Harvest Cutoff, so that moderate profits can be made with moderate losses.
  4. As in any data analysis, the more data, the better, so in future studies of this kind I would recommend acquiring more data. I would also recommend trying other ways to correctly identify adults and infants. It is sometimes hard to determine whether an abalone is an infant or an adult based on volume, shuck weight, and whole weight, and drilling into the abalone shell to see its growth rings hurts or kills the abalone. If we had a better way of determining the age classification of the abalones, we could lower false positive rates and increase true positive rates.