##Data Analysis #2

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

Test Items starts from here - There are 10 sections - total of 75 points

#### Section 1: (5 points) ####

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] 0.7147056
## [1] 1.667298

(1)(b) Tranform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## [1] -0.09391548
## [1] 0.5354309

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$L_RATIO and mydata$CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$RATIO and mydata$CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

Based on the histogram and qqplot it looks like L_RATIO exhibits better conformance to the normal distribution. L_Ratio also has lower Kurtosis and a less skewed distribution. We ran the Barlett test for homogenity across classes and for L_ratio we fail to reject the null hypothese and assume homogenity of variances across classes.

#### Section 2 (10 points) ####

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

The F-value of the CLASS:SEX relationship is very low suggesting the group means are close together or low variability comparing the two variables. The non-significant interaction means that Class and Sex will be very poor in trying to predict the L_Ratio. Class and Sex are better predictors considered alone.

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236690 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

the Tukey test suggests there is a significant diference between classes expect for classes A1 and A2 when comparing L_RATIO. The difference between Male and Female is small and the p-value is greater than .05 therefore we fail to reject the null and the comparison is not statistically significant. Male and Female and be combined into one category calle adults. The p-value comparing infants to adults less than .05 so we reject the null hypothesis and there is a significant difference between infants and adults so we should keep infants a separate category.

#### Section 3: (10 points) ####

(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.

## 
## ADULT     I 
##   707   329

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

The Adult histogram is more normal, the Infant historgram is right skewed. Although infants tend to be smaller hence the right skewed distribution there is still a fair amount of overlap of the right tail of the infant distribution with the normal adult distribution. Since the right tail overlaps there is going to be some difficulty separating those values.

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

The scatterplot on the left as the volumne increasing the variability in shuck weight increases. The log-transformation appears to decrease the variability. This means that performing linear regresssion on the log-transforming will yield better results. The lower classes appear towards the bottom left of each graph and the higher classes group together at the upper end of the graph.

#### Section 4: (5 points) ####

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.

## 
## ADULT     I 
##   747   289

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## NULL
## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

Essay Question: Interpret the trend in CLASS levelcoefficient estimates? (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).

As the class increases the correlation coefficients decrease. This would indicate that as the class increases the L_shuck ratio decreases. THis is a similar observation from data analysis assignment 2(b).

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

Type does not have a strong coefficient in predicting L-SHUCK, so TYPE is not a strong predictor in this regression model.


The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).


#### Section 5: (5 points) ####

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] 0.3433082
## [1] -0.05945234
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange(). Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  linear_model$residuals and mydata$CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

The residuals are clustered around 0 as indicated by the boxplots and scatterplots. So we can assume the residuals approximate the normal distribution suggesting the linear model using L_volume would be appropriate. The scatterplot indicates a large clustering as the class gets higher. Ultimately volume would be a useful for harvesting decisions but we will run into issues at higher classes.


Harvest Strategy:

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.The Management needs to make a decision to implement 1 rule that meets the business goal.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.


#### Section 6: (5 points) ####

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.

(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.

## [1] 526.6383
## [1] 0.2476573
## [1] 0

(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.

## [1] 133.8214
## [1] 0.4982699
## [1] 0.9330656
## [1] 0.02422145
## [1] 0.4993307

(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).

Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?

There is a clear trade off harvesting adults that runs the risk of harvesting infants. If we protect all infants then we will harvest very little adults. If we chose the median adult we will harvest more adults but also more infants. If we choose median infants we wil protect more infants but harvest less adults. So there should be a better metric between theses two values if we want to maximize adult harvesting while minimizing infant harvesting.


More harvest strategies:

This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.


#### Section 7: (10 points) ####

(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(a). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.

## [1] 0.5752144

(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.

Include, side-by-side, the plot from (6)(d) but with a fourth vertical A-B line added. That line should intercept the x-axis at the “max difference” volume determined from the smoothed curve here.

(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions. We will actually calculate these proportions in two ways: first, by ‘indexing’ and returning the appropriate element of the (1 - prop.adults) and (1 - prop.infants) vectors, and second, by simply counting the number of adults and infants with VOLUME greater than the vlume threshold of interest.

Code for calculating the adult harvest proportion using both approaches is provided.

## [1] 0.7416332
## [1] 0.7416332

There are alternative ways to determine cutoffs. Two such cutoffs are described below.


#### Section 8: (10 points) ####

(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”

Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions. You may use either the ‘indexing’ or ‘count’ approach, or both.

## [1] 206.786

(8)(b) Next, append one (1) more vertical A-B line to our (6)(d) graph. This time, showing the “zero A1 infants” cutoff from (8)(a). This graph should now have five (5) A-B lines: “protect all infants,” “median infant,” “median adult,” “max difference” and “zero A1 infants.”

#### Section 9: (5 points) ####

(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (6), (7) and (8) on this plot and label each.

## [1] 0.2871972
## [1] 0.8259705

(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

## [1] 0.8666894

#### Section 10: (10 points) ####

(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1-prop.adults, 2) false positive rate (1-prop.infants), 3) harvest proportion of the total population

To calculate the total harvest proportions, you can use the ‘count’ approach, but ignoring TYPE; simply count the number of individuals (i.e. rows) with VOLUME greater than a given threshold and divide by the total number of individuals in our dataset.

##                      Volume   TPR   FPR totalHARVEST
## Protect all Infants 526.638 0.248 0.000        0.179
## Median Infants      133.821 0.933 0.498        0.812
## Median Adults       384.558 0.499 0.024        0.367
## Max Difference      262.143 0.742 0.176        0.584
## Zero A1 Infants     206.786 0.826 0.287        0.676

Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the five cutoffs determined in this assignment.

Looking at the ROC curve the slope of the line is very steep until around the y value of .50 then the curve bends until it flattens out at the y value of 1. If we protect all infants the we will have a true positive rate of 25% and a false positive rate of 0%. So we we want to harvest absolutely no infants this would be a good cutoff. If we take the median infant then the true positive rate will be 93% and false positive rate of 50% Median adult cut off would yield 50% true positive and a 2% false positive. The max differnce has a true positive rate of 74% and a false positive rate of 17% and the zero A1 infants has a true positive rate of 82% and false positive rate of 29% of infants. Based on the ROC curve if want to maximize harvest of adults while minimizing harvest of infants we would choose either max differnce or zero A1 infants.

Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators How would you do so? Consider the following in your answer:

  1. Would you make a specific recommendation or outline various choices and tradeoffs?
  2. What qualifications or limitations would you present regarding your analysis?
  3. If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff?
  4. What suggestions would you have for planning future abalone studies of this type?

***1. I would not make a specific recommendation. I would present each choice with its various tradeoffs. I would not want to bias the end user into a specific choice but make certain they understand the ramifications of each choice.

  1. I would present that this is limited to the sample of abalones in this particular study. I would present a confidence interval of p value of our sample to see if it is a representative sample of all abalones.

  2. It depends on who the recommendation is for and their specific goals. If the end user wants to study adult abalones and only needs a smaller sample and does not want to harvest any infants then I would recommend the protect all infants cutoff. If the end user is harvesting to feed their village and wants to maximize harvest while mimizing as much as possible infants then I would suggest the zero A1 infants cutoff.

  3. For future abalone studies, we need more data on the enviroment of the sample of abalones. The access to food, temperature, PH level, pollutants, etc. In other words we need to minimize the effects of confounding varialbes and much as possible. Perhaps consult the advice of biologists and experts to assist in collecting the samples and try to find the best representative sample. ***