Data Analysis Assignment #2 (75 points total)

Test items start here. There are 10 sections, for a total of 75 points.

#### Section 1: (5 points) ####

(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## Skewness: 0.6573096

## Adjusted Kurtosis: 0.3296656
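The relationship between the two kurtosis conventions mentioned in (1)(a) can be sketched in base R. This is a minimal illustration using plain moment formulas on simulated data, not the abalone RATIO values; the helper name `moment_stats` is hypothetical, and packaged functions may apply small-sample corrections, so their results can differ slightly from these.

```r
# Moment-based skewness and kurtosis using population (uncorrected) moments.
# Assumption: package functions may use corrected estimators, so values
# from 'rockchalk' or 'moments' need not match these exactly.
moment_stats <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m4 <- mean((x - mean(x))^4)
  kurt <- m4 / m2^2
  c(skewness = m3 / m2^1.5,
    kurtosis = kurt,              # equals 3 for a normal distribution
    excess_kurtosis = kurt - 3)   # equals 0 for a normal distribution
}

set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)  # simulated right-skewed data
stats_x <- moment_stats(x)
print(round(stats_x, 3))
```

The "adjusted" kurtosis reported in this document corresponds to the `excess_kurtosis` convention, so values near zero indicate tails similar to a normal distribution.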

(1)(b) Transform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.

## Skewness of L_RATIO: 0.3528897

## Adjusted Kurtosis of L_RATIO: -0.1956505

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  L_RATIO by CLASS
## Bartlett's K-squared = 33.434, df = 4, p-value = 9.734e-07
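A call of the kind that produces this output might look like the sketch below. The data frame `sim` is simulated with equal variances in every class and only stands in for the real data, so its test result will not match the output above.

```r
set.seed(2)
# Simulated stand-in for the abalone data: five classes, equal variance.
sim <- data.frame(
  CLASS = factor(rep(paste0("A", 1:5), each = 200)),
  L_RATIO = rnorm(1000, mean = -0.85, sd = 0.05)
)

bt <- bartlett.test(L_RATIO ~ CLASS, data = sim)
print(bt)

# A small p-value (for example below 0.05) is evidence against equal variances.
equal_var_plausible <- bt$p.value >= 0.05
print(equal_var_plausible)
```

With five classes the test has 4 degrees of freedom, which matches the `df = 4` reported above.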

Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?

Answer: (Based on steps 1.a, 1.b, and 1.c, L_RATIO (the log10-transformed RATIO) conforms better to a normal distribution than RATIO. Its skewness falls from 0.657 to 0.353, and its adjusted kurtosis moves from 0.330 to -0.196, which is closer to zero in magnitude. The histogram and QQ plot for L_RATIO are correspondingly more symmetric. Homogeneity of variance is less clear-cut: Bartlett’s test on L_RATIO by CLASS gives a significant p-value (9.734e-07), so the variances across classes may not be fully homogeneous. Even so, L_RATIO appears to give a more normalized and consistent spread across classes than RATIO, which makes it the better choice for analyses that assume normality and homogeneity of variance, provided the equal-variance assumption is treated with caution.)

#### Section 2: (10 points) ####

(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).

##               Df  Sum Sq  Mean Sq F value   Pr(>F)    
## CLASS          4 0.04946 0.012365  43.070  < 2e-16 ***
## SEX            2 0.01303 0.006514  22.689 2.29e-10 ***
## CLASS:SEX      8 0.00301 0.000377   1.311    0.234    
## Residuals   1021 0.29312 0.000287                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##               Df  Sum Sq  Mean Sq F value  Pr(>F)    
## CLASS          4 0.04946 0.012365   42.97 < 2e-16 ***
## SEX            2 0.01303 0.006514   22.63 2.4e-10 ***
## Residuals   1029 0.29613 0.000288                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?

Answer: (The interaction term CLASS:SEX is not significant (F = 1.311, p = 0.234), so there is no evidence that the effect of CLASS on L_RATIO differs across levels of SEX, or vice versa. The main effects of CLASS and SEX are highly significant in both models, indicating that each factor is independently associated with L_RATIO. Dropping the interaction moves its 8 degrees of freedom and 0.00301 sum of squares into the residuals (1021 to 1029 df, 0.29312 to 0.29613), and the main-effect F values barely change (43.07 to 42.97 for CLASS, 22.69 to 22.63 for SEX). The effects of CLASS and SEX on L_RATIO can therefore be treated as additive, and the simpler model without the interaction term describes the data with essentially no loss of explanatory power.)
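The two fits compared in this section can be sketched as follows. The data are simulated with purely additive effects (an assumption made for illustration), so the numbers will differ from the tables above; the point is the model formulas and the way the interaction's degrees of freedom return to the residuals.

```r
set.seed(3)
n <- 900
sim <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Additive simulated effects only: no CLASS:SEX interaction is built in.
sim$L_RATIO <- -0.8 - 0.005 * as.integer(sim$CLASS) +
  0.006 * (sim$SEX == "I") + rnorm(n, sd = 0.017)

fit_full <- aov(L_RATIO ~ CLASS + SEX + CLASS:SEX, data = sim)
fit_add <- aov(L_RATIO ~ CLASS + SEX, data = sim)
print(summary(fit_full))
print(summary(fit_add))

# Dropping the interaction returns its degrees of freedom to the residuals.
df_moved <- df.residual(fit_add) - df.residual(fit_full)
print(df_moved)

# F test comparing the nested models directly.
print(anova(fit_add, fit_full))
```

The nested-model comparison from `anova()` tests the same hypothesis as the CLASS:SEX row of the full table.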

(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##               diff          lwr           upr     p adj
## A2-A1 -0.012483896 -0.017869374 -7.098418e-03 0.0000000
## A3-A1 -0.018096590 -0.023237545 -1.295564e-02 0.0000000
## A4-A1 -0.022313749 -0.027910912 -1.671659e-02 0.0000000
## A5-A1 -0.023702740 -0.029375250 -1.803023e-02 0.0000000
## A3-A2 -0.005612694 -0.009567118 -1.658271e-03 0.0010565
## A4-A2 -0.009829853 -0.014361551 -5.298155e-03 0.0000000
## A5-A2 -0.011218844 -0.015843281 -6.594408e-03 0.0000000
## A4-A3 -0.004217159 -0.008455357  2.104015e-05 0.0518810
## A5-A3 -0.005606150 -0.009943368 -1.268932e-03 0.0039322
## A5-A4 -0.001388991 -0.006258311  3.480329e-03 0.9365474
## 
## $SEX
##              diff          lwr          upr     p adj
## I-F  0.0067663915  0.003654820  0.009877963 0.0000012
## M-F  0.0006926505 -0.002311380  0.003696681 0.8509975
## M-I -0.0060737409 -0.009070381 -0.003077100 0.0000067

Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?

Answer: (The CLASS differences are all negative, so mean L_RATIO decreases steadily from A1 to A5. Every comparison involving A1 or A2 is significant, but the steps shrink with age: A4-A3 is borderline (p = 0.052) and A5-A4 is not significant (p = 0.937), which suggests that L_RATIO levels off in the oldest classes. For SEX, infants (I) differ significantly from both females (I-F, p = 0.0000012) and males (M-I, p = 0.0000067), while males and females do not differ (M-F, p = 0.851, with a confidence interval that spans zero). Males and females can therefore reasonably be combined into a single category labeled ‘adults.’ Infants should remain a separate category, because their L_RATIO profile differs from that of both adult groups.)
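Reading a TukeyHSD table comes down to checking whether each interval excludes zero. The sketch below shows this on simulated data with no SEX effect built in; the column `excludes_zero` is a hypothetical helper added for illustration.

```r
set.seed(4)
sim <- data.frame(
  CLASS = factor(rep(paste0("A", 1:5), times = 120)),
  SEX = factor(sample(c("F", "I", "M"), 600, replace = TRUE))
)
# Simulated response: a CLASS trend and no SEX effect (illustrative assumption).
sim$L_RATIO <- -0.8 - 0.005 * as.integer(sim$CLASS) + rnorm(600, sd = 0.017)

fit <- aov(L_RATIO ~ CLASS + SEX, data = sim)
tk <- TukeyHSD(fit, conf.level = 0.95)

sex_tbl <- as.data.frame(tk$SEX)
# A pair differs at the 95% family-wise level when its interval excludes zero.
sex_tbl$excludes_zero <- sex_tbl$lwr > 0 | sex_tbl$upr < 0
print(sex_tbl)
```

Applied to the output above, the M-F interval (-0.0023 to 0.0037) includes zero, while the I-F and M-I intervals do not.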

#### Section 3: (10 points) ####

(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.

## 
## ADULT     I 
##   707   329
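The supplied code is not reproduced in this document. One way such a recoding could be written is sketched below on a toy vector; the variable names follow the assignment text, but the statement itself is an assumption.

```r
# Hypothetical stand-in for the supplied code, run on a toy SEX vector.
SEX <- factor(c("M", "F", "I", "F", "I", "M", "M"))

# Anything that is not an infant becomes ADULT.
TYPE <- factor(ifelse(SEX == "I", "I", "ADULT"))
print(table(TYPE))
```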

(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.

Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?

Answer: (The histograms show clearly different volume distributions for infants and adults. The adult distribution is more spread out, with a peak around 400-500 and values extending up to about 1000. The infant distribution is more concentrated, with a peak around 200 and very few values above 500. Infants therefore tend to have lower volumes than adults. However, the upper part of the infant distribution overlaps the lower part of the adult distribution, so a single VOLUME cutoff cannot separate the two groups cleanly. VOLUME is informative, but it is not fully reliable on its own for distinguishing infants from adults.)

(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.

Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?

Answer: (The log-transformation compresses the spread of the data and makes the relationship between the two variables more nearly linear. On the original scale the variability grows with VOLUME, so the points fan out at larger volumes. After the transformation the points lie in a band of roughly constant width around a linear trend. This helps linear regression because it stabilizes the variance and brings the data closer to the assumption of homoscedasticity (constant variance). For CLASS, the lower age classes (e.g., A1) sit at the lower end of both VOLUME and SHUCK, while the higher classes (e.g., A5) have larger values and are more spread out on the original scale. For TYPE, infants (I) cluster toward the lower end of both variables, whereas adults (ADULT) cover a broader range, although the log-transformation reduces this spread.)
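The variance-stabilizing effect described here can be checked numerically. The sketch below assumes a data-generating process with multiplicative noise (an assumption, not a fact about the abalone data) and compares the residual spread in the upper and lower halves of the predictor on each scale; `spread_ratio` is a hypothetical helper.

```r
set.seed(6)
VOLUME <- runif(1000, min = 10, max = 1000)
# Assumed process: SHUCK proportional to VOLUME with multiplicative noise.
SHUCK <- 0.15 * VOLUME * rlnorm(1000, sdlog = 0.2)
L_VOLUME <- log10(VOLUME)
L_SHUCK <- log10(SHUCK)

# Ratio of residual standard deviations: upper half of x over lower half.
spread_ratio <- function(x, y) {
  res <- residuals(lm(y ~ x))
  upper <- x > median(x)
  sd(res[upper]) / sd(res[!upper])  # near 1 means roughly constant spread
}

raw_ratio <- spread_ratio(VOLUME, SHUCK)
log_ratio <- spread_ratio(L_VOLUME, L_SHUCK)
print(c(raw = raw_ratio, log10 = log_ratio))
```

A ratio well above 1 on the raw scale and close to 1 on the log10 scale is the numerical counterpart of the fanning pattern disappearing in the second scatterplot.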

#### Week 8 - Section 4: (5 points) ####

(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.

## 
## ADULT     I 
##   747   289
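Comparing this table with the one in (3)(a1) shows that 40 individuals moved from I to ADULT (329 to 289 infants, 707 to 747 adults). The supplied code is not reproduced here; a reclassification of this kind could be sketched on a toy data frame as follows, with all object names being illustrative assumptions.

```r
# Hypothetical stand-in for the supplied code, run on a toy data frame.
toy <- data.frame(
  CLASS = factor(c("A1", "A2", "A4", "A5", "A3", "A4")),
  TYPE = factor(c("I", "I", "I", "I", "ADULT", "ADULT"))
)

# Infants in classes A4 and A5 are treated as mature.
mature_infant <- toy$TYPE == "I" & toy$CLASS %in% c("A4", "A5")
toy$TYPE[mature_infant] <- "ADULT"
print(table(toy$TYPE))
```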

(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.

## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

Essay Question: Interpret the trend in the CLASS level coefficient estimates. (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays.)

Answer: (The CLASS coefficients become progressively more negative from CLASSA2 (-0.018) to CLASSA5 (-0.117). Each coefficient is measured relative to the baseline class A1, so, holding L_VOLUME and TYPE constant, predicted L_SHUCK falls as the age class rises. In other words, at a given volume older abalones tend to have less shuck weight than younger ones. This is consistent with the earlier results, where the TukeyHSD comparisons showed L_RATIO decreasing from A1 to A5.)
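Because the response is on a log10 scale, each CLASS coefficient can be read as a multiplicative factor on SHUCK. The short calculation below uses the coefficient estimates from the summary() output in (4)(a2); the vector names are illustrative.

```r
# Coefficients copied from the summary() output in (4)(a2).
class_coef <- c(A2 = -0.018005, A3 = -0.047310, A4 = -0.075782, A5 = -0.117119)

# On the log10 scale an additive shift b is a multiplicative factor 10^b
# on SHUCK, relative to class A1 at the same VOLUME and TYPE.
shuck_factor <- 10^class_coef
print(round(shuck_factor, 3))
```

For example, the A5 factor is about 0.76, meaning predicted SHUCK for an A5 abalone is roughly 24% lower than for an A1 abalone of the same volume and type.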

Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.

Answer: (TYPE is a less important predictor in this regression than L_VOLUME and CLASS. Its coefficient is statistically significant, but the effect is small (-0.021093 for TYPEI), comparable to the smallest CLASS coefficient and well below the A4 and A5 coefficients (-0.076 and -0.117). L_VOLUME, with a coefficient of about 1.0 and by far the largest t value (97.4), drives most of the prediction, which indicates that volume is the primary factor influencing shuck weight. The CLASS coefficients also show a clear trend with age, which is useful for differentiating younger from older abalones. For harvesting decisions, the contributions of L_VOLUME and CLASS are therefore more valuable than that of TYPE, which plays a secondary role.)

The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).

#### Section 5: (5 points) ####

(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.

## Skewness of Residuals: -0.05945234

## Adjusted Kurtosis of Residuals: -2.656692

(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate, which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE. (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange().) Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  residuals by CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.

Answer: (The residual analysis in (5)(a) and (5)(b) suggests that the model fits reasonably well. The histogram and QQ plot of the residuals are close to normal, and the skewness (-0.059) is near zero. The reported adjusted kurtosis of -2.657 should be read with care: excess kurtosis cannot fall below -2, so this figure appears to have had 3.0 subtracted a second time, and adding it back gives about 0.343, a small departure from normality. The plots of residuals against L_VOLUME, colored by CLASS and by TYPE, show no clear pattern, so there is no sign of heteroscedasticity. Bartlett’s test (p-value = 0.4498) does not reject equal residual variances across CLASS. Together these results indicate that L_VOLUME, and by extension VOLUME, is a stable predictor of L_SHUCK and could be useful for harvesting decisions. The fit is not perfect, however, and additional variables or non-linear adjustments might still improve predictive accuracy in real-world applications.)

Harvest Strategy:

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible. Management needs to decide on one rule to implement that meets the business goal.

The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.

#### Section 6: (5 points) ####

(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.
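The provided loop is not reproduced in this document. A loop of the kind described, using the names min.v, delta, volume.value, prop.infants and prop.adults from the assignment text, might look like the sketch below. The volumes are simulated (gamma draws chosen only for illustration), and `vol_infant` and `vol_adult` are hypothetical names.

```r
set.seed(10)
# Simulated volumes standing in for the real infant and adult VOLUME values.
vol_infant <- rgamma(289, shape = 2.5, scale = 60)
vol_adult <- rgamma(747, shape = 6, scale = 75)
all_vol <- c(vol_infant, vol_adult)

min.v <- min(all_vol)
delta <- (max(all_vol) - min.v) / 10000
volume.value <- numeric(10000)
prop.infants <- numeric(10000)
prop.adults <- numeric(10000)

for (k in 1:10000) {
  value <- min.v + k * delta
  volume.value[k] <- value
  # Proportion at or below the cutoff, i.e. conserved (not harvested).
  prop.infants[k] <- mean(vol_infant <= value)
  prop.adults[k] <- mean(vol_adult <= value)
}
```

Because both proportion vectors are evaluated on the same grid of cutoffs, they can be compared and plotted against volume.value directly.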

(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.

## [1] 526.6383

## [1] 0.2476573

## [1] 0
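The three values above are, in order, the cutoff, the proportion of adults harvested, and the proportion of infants harvested. The same calculation can be sketched on simulated volumes as follows; the vectors and their names are illustrative assumptions, so the numbers will not match the output above.

```r
set.seed(11)
vol_infant <- rgamma(289, shape = 2.5, scale = 60)  # simulated, not the real data
vol_adult <- rgamma(747, shape = 6, scale = 75)     # simulated, not the real data

cutoff <- max(vol_infant)                       # volume of the largest infant
adults_harvested <- mean(vol_adult > cutoff)    # strictly above the cutoff
infants_harvested <- mean(vol_infant > cutoff)  # zero by construction
print(c(cutoff = cutoff, adults = adults_harvested, infants = infants_harvested))
```

No infant can exceed the largest infant volume, which is why the infant harvest proportion is exactly zero under this rule.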

(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.

## [1] 0.4982699

## [1] 0.9330656

## [1] 0.02422145

## [1] 0.4993307
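The four values above appear to be, in order, the infant and adult harvest proportions at the median infant volume, followed by the infant and adult harvest proportions at the median adult volume. A sketch of the calculation on simulated volumes is shown below; `harvest_props` is a hypothetical helper. With an odd group size n and no ties, the proportion strictly above the group's own median is (n - 1)/(2n), which gives 144/289 and 373/747 for groups of 289 and 747.

```r
# Proportion of each group with volume strictly above a cutoff.
harvest_props <- function(cutoff, vol_infant, vol_adult) {
  c(infants = mean(vol_infant > cutoff), adults = mean(vol_adult > cutoff))
}

set.seed(12)
vol_infant <- rgamma(289, shape = 2.5, scale = 60)  # simulated, not the real data
vol_adult <- rgamma(747, shape = 6, scale = 75)     # simulated, not the real data

at_infant_median <- harvest_props(median(vol_infant), vol_infant, vol_adult)
at_adult_median <- harvest_props(median(vol_adult), vol_infant, vol_adult)
print(rbind(at_infant_median, at_adult_median))
```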

(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).
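A plot of this shape can be sketched in base graphics as shown below. The volumes are simulated, and ecdf() is used here as a shortcut in place of the assignment's for-loop to obtain the conserved proportions; colors, label positions and object names are illustrative choices.

```r
set.seed(13)
vol_infant <- rgamma(289, shape = 2.5, scale = 60)  # simulated, not the real data
vol_adult <- rgamma(747, shape = 6, scale = 75)     # simulated, not the real data
volume.value <- seq(min(c(vol_infant, vol_adult)),
                    max(c(vol_infant, vol_adult)), length.out = 1000)

# ecdf() gives the proportion at or below each volume, i.e. conserved.
prop.infants <- ecdf(vol_infant)(volume.value)
prop.adults <- ecdf(vol_adult)(volume.value)

cutoffs <- c("protect all infants" = max(vol_infant),
             "median infant" = median(vol_infant),
             "median adult" = median(vol_adult))

plot(volume.value, prop.infants, type = "l", col = "red",
     xlab = "Volume", ylab = "Proportion conserved",
     main = "Proportion of infants and adults conserved")
lines(volume.value, prop.adults, col = "blue")
abline(v = cutoffs, lty = 2)  # one vertical line per rule
text(cutoffs, c(0.3, 0.5, 0.7), labels = names(cutoffs), pos = 4, cex = 0.8)
legend("bottomright", legend = c("Infants", "Adults"),
       col = c("red", "blue"), lty = 1)
```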

Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?
