Data Analysis #2 Version 2 (75 points total)

Data Analysis #2

## [1] FALSE

## Classes 'tbl_df', 'tbl' and 'data.frame':    1036 obs. of  14 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : num  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ...9  : logi  NA NA NA NA NA NA ...
##  $ ...10 : logi  NA NA NA NA NA NA ...
##  $ ...11 : chr  NA NA NA NA ...
##  $ ...12 : chr  NA NA NA NA ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ Ratio : num  0.15 0.147 0.269 0.185 0.165 ...

Test Items starts from here - There are 10 sections - total of 75 points

Section 1: (5 points)

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## [1] 0.7147056

## [1] 1.667298

(1)(b) Tranform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## [1] -0.09391548

## [1] 0.5354309

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  L_RATIO by abalones$CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267

## 
##  Bartlett test of homogeneity of variances
## 
## data:  RATIO by abalones$CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

RATIO is far more normalyl distributed than L_RATIO. The skewness is much larger in abalones$Ratio, 0.715 v. -0.094, the kurtosis is much smaller and closer to the desired value, and one fails to reject the null hypothesis with a very strong p-value of 0.0002531 v. a p-value of 0.5267.

Section 2 (10 points)

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

## Call:
##    aov(formula = L_RATIO ~ CLASS + SEX + CLASS * SEX)
## 
## Terms:
##                    CLASS      SEX CLASS:SEX Residuals
## Sum of Squares  1.055357 0.091376  0.026706  7.020595
## Deg. of Freedom        4        2         8      1021
## 
## Residual standard error: 0.08292282
## Estimated effects may be unbalanced

## Call:
##    aov(formula = L_RATIO ~ CLASS + SEX, data = abalones)
## 
## Terms:
##                    CLASS      SEX Residuals
## Sum of Squares  1.055357 0.091376  7.047301
## Deg. of Freedom        4        2      1029
## 
## Residual standard error: 0.08275681
## Estimated effects may be unbalanced

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

The output suggests that any interaction between Class and Sex is not releveant to the analyses, however, there does appear to be a strong predictive value in the Classes and Sexes when considered on their own.

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = abalones)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236691 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

Answer: (Enter your answer here.) -According to the Tukey test it would appear as though there is a significant difference between all classes except for classes A2 and A1 when one uses the L_RATIO.

-Accordign to the Tukey Test for abalone sexes it would appear as though there is a possibility to combine males and females into one group. However, Infants must remain separate due to their high P-value.

Section 3: (10 points)

(3)(a1) We combine “M” and “F” into a new level, “ADULT”. (While this could be accomplished using combineLevels() from the ‘rockchalk’ package, we use base R code because many students do not have access to the rockchalk package.) This necessitated defining a new variable, TYPE, in mydata which had two levels: “I” and “ADULT”.

## 
## Check on definition of TYPE object (should be an integer):  integer

## 
## abalones$TYPE is treated as a factor:  TRUE

##    
##     ADULT   I
##   F   326   0
##   I     0 329
##   M   381   0

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

The adult histogram is much more normally distributed than the infant ratio. There will most likely not be a huge difficulty separating infants from adults based on volume. There appears to be some overlap around the 200-400 range, but past that there should be few difficulties.

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

The log transformation gives a far more discernable pattern to the data. When it is not log transformed the data is far more scattered. But with a log transformation the data has a far more linear trend where you can easily differentiations between the classes and types as volume increases. Once the log transformation is performed one can easily see that as class increases volume increases, and that the adults generally ahve higher volumes than the infants. This would make it appear as though a linear regression analysis should yield good results.

Section 4: (5 points)

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. Reclassify the infants in classes A4 and A5 as ADULTS. This reclassification could have been achieved using combineLevels(), but only on the abalones in classes A4 and A5. We will do this recoding of the TYPE variable using base R functions. We will use this recoded TYPE variable, in which the infants in A4 and A5 are reclassified as ADULTS, for the remainder of this data analysis assignment.

## 
## Check on redefinition of TYPE object (should be an integer):  integer

## 
## abalones$TYPE is treated as a factor:  TRUE

## 
## Three-way contingency table for SEX, CLASS, and TYPE:

## , ,  = ADULT
## 
##    
##      A1  A2  A3  A4  A5
##   F   5  41 121  82  77
##   I   0   0   0  21  19
##   M  12  62 143  85  79
## 
## , ,  = I
## 
##    
##      A1  A2  A3  A4  A5
##   F   0   0   0   0   0
##   I  91 133  65   0   0
##   M   0   0   0   0   0

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = abalones)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

Essay Question: Interpret the trend in CLASS levelcoefficient estimates? (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).

The trend in Class Level would appear to be that as the class level increases the coefficients seems to be a decrease. Since we are predicting Shuck it would appear as though there is a larger increase in shuck as the abalones are of a younger age/class. This pattern is consistent with earlier displays from data analysis 1 where we determined this same pattern.

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

The type does not have as strong a contribution to the predictions of L_SHUCK for harvesting decisions. The value is far less significant on the table genereated.

The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).

Section 5: (5 points)

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange(). Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  model$residuals by abalones$CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

These scatter plots show that one of the assumptions of a lienar regression are met since there is not clear pattern or line in the residuals and their absolute value appear to be very small. However there does seem to be a larger clustering to the right side of the graphs on both. The bartlett test has a larger p-value which makes one believe that the data is normally distributed. This pattern in the scatter plot does seem odd however. The histogram and QQnorm plot from 5a also help suggest that the data is normally distributed.

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.

Section 6: (5 points)

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.

## [1] 133.8199

## [1] 384.5138

## [1] 133.8199

## [1] 384.5138

## [1] 3.710996 3.810202 3.909408 4.008615 4.107821 4.207027

## [1] 0.003460208 0.003460208 0.003460208 0.006920415 0.013840830 0.013840830

## [1] 0 0 0 0 0 0

(6)(b) Present a plot showing the infant proportions and the adult proportions versus volume.value. Compute the 50% “split” volume.value for each and show on the plot.

Essay Question: The two 50% “split” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?

These values suggest good cutoff points for the harvesting of abalones. More adults than infants should be harvested which is the goal.

This part will address the determination of a volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.

Section 7: (10 points)

(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(a). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02634 0.19679 0.23931 0.45983 0.57521

(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02634 0.19679 0.23931 0.45983 0.57521

(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.

(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions (NOTE: the adult harvest proportion is the “true positive rate” and the infant harvest proportion is the “false positive rate”).

Code for calculating the adult harvest proportion is provided.

## [1] 0.7416332

## [1] 0.1764706

There are alternative ways to determine cutoffs. Two such cutoffs are described below.

Section 8: (10 points)

(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”

Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions.

## [1] 206.786

## [1] 0.8259705

## [1] 0.2871972

(8)(b) Another cutoff is one for which the proportion of adults not harvested equals the proportion of infants harvested. This cutoff would equate these rates; effectively, our two errors: ‘missed’ adults and wrongly-harvested infants. This leaves for discussion which is the greater loss: a larger proportion of adults not harvested or infants harvested? This cutoff is 237.7383. Calculate the separate harvest proportions for infants and adults using this cutoff. Show these proportions. Code for determining this cutoff is provided.

## [1] 0.7817938

## [1] 0.2179931

Section 9: (5 points)

(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (7) and (8) on this plot and label each.

(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.

## [1] 0.8666894

Section 10: (10 points)

(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1-prop.adults, 2) false positive rate (1-prop.infants), 3) harvest proportion of the total population

## New names:
## * `` -> ...1

## # A tibble: 3 x 5
##   ...1            Volume   TPR   FPR PropYield
##   <chr>            <dbl> <dbl> <dbl>     <dbl>
## 1 max.difference    262. 0.742 0.176     0.584
## 2 zero.A1.infants   207. 0.826 0.287     0.676
## 3 equal.error       238. 0.782 0.218     0.625

Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the three cutoffs determined in this assignment.

Starting with the first row on the table, max.difference, as a cutoff it would appear as though this one has the lowest PropYield. However, this also has the lowest false positive rate as well which is encouraging

The second row on the table, zero.A1.infants has the highest PropYield but unfortunately it also has a much larger False Positive Rate than the one above it. This could be troubling

The third row on the table, equal.error, has a safe medium when it comes to False positive rate and PropYield numbers. This would be the number I would be most comfortable with personally.

Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators How would you do so? Consider the following in your answer:

Would you make a specific recommendation or outline various choices and tradeoffs?
What qualifications or limitations would you present regarding your analysis?
If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff?

What suggestions would you have for planning future abalone studies of this type?

ANSWERS:

I would not make a specfic recommendation. I would instead outline the potential risks and rewards of each approach. I would not want to sway their opinion any which way so that I can hear their honest opinions on the subject without me biasing them towards a certain approach.
I would show the investigators the histograms, box plots, skewness, kurtosis, and outliers within the data sample that was collected. I would also be sure to show them the various checks to ensure that a regression equation can be used if desired, such as normality of the residuals. I would make sure to highlight that abalones are very difficult to sex and that I am unsure as to how the sampling was done or whether or not it was biased.
I would recommend exploring each cutoff method in a small sample area and see which seems to give the best results. I would form a control group using the old method, and then 3 more test groups using the 3 highlighted in this study. Not all the possible variables are contained in the study so we can be more certain as to which approach is best after the experiments have been performed.
I would recommend that future abalone studies collect more data on the environemnts in which the abalones live to ensure that we are comparing abalones that are as similar as possible. Just to offer a few examples, I would like info on the pH of the water, the amount of pollution, the clarity of the water, and dissolved oxygen concentrations. If we can get better data then we can create better models.