Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.
The following code chunk will:
Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download the abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.
## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
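The str() output lists 8 variables, while the later summary also shows VOLUME and RATIO, so the setup chunk must derive those two columns. The sketch below is an assumption about how that could look: the formulas VOLUME = LENGTH × DIAM × HEIGHT and RATIO = SHUCK / VOLUME are inferred from the values printed in the outlier tables, not stated in the text, and the three rows are copied from those tables so the block runs without the CSV.

```r
# Three rows copied from the outlier tables in (3)(b); the real file has 1036 rows.
# With the CSV available, this would instead be something like
#   mydata <- read.csv("abalones.csv", stringsAsFactors = TRUE)  # path is an assumption
mydata <- data.frame(
  SEX    = factor(c("I", "F", "F"), levels = c("F", "I", "M")),
  LENGTH = c(10.08, 7.98, 12.18),
  DIAM   = c(7.35, 6.72, 9.45),
  HEIGHT = c(2.205, 2.415, 4.935),
  WHOLE  = c(79.375, 80.9375, 133.875),
  SHUCK  = c(44, 40.375, 38.25),
  RINGS  = c(6L, 7L, 14L),
  CLASS  = factor(c("A1", "A2", "A5"), levels = paste0("A", 1:5))
)

# Derived columns (formulas inferred from the printed values, see lead-in).
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO  <- mydata$SHUCK / mydata$VOLUME

str(mydata)
```

The computed VOLUME and RATIO for these rows can be compared against the values shown for observations 3, 350 and 586.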
### Section 1: (6 points) Summarizing the data.
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## SEX LENGTH DIAM HEIGHT WHOLE
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525 Min. : 1.625
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415 1st Qu.: 56.484
## M:381 Median :11.45 Median : 8.925 Median :2.940 Median :101.344
## Mean :11.08 Mean : 8.622 Mean :2.947 Mean :105.832
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570 3rd Qu.:150.319
## Max. :16.80 Max. :13.230 Max. :4.935 Max. :315.750
## SHUCK RINGS CLASS VOLUME
## Min. : 0.5625 Min. : 3.000 A1:108 Min. : 3.612
## 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236 1st Qu.:163.545
## Median : 42.5700 Median : 9.000 A3:329 Median :307.363
## Mean : 45.4396 Mean : 9.993 A4:188 Mean :326.804
## 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175 3rd Qu.:463.264
## Max. :157.0800 Max. :25.000 Max. :995.673
## RATIO
## Min. :0.06734
## 1st Qu.:0.12241
## Median :0.13914
## Mean :0.14205
## 3rd Qu.:0.15911
## Max. :0.31176
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8 6
##
## 21 22 23 24 25
## A1 0 0 0 0 0
## A2 0 0 0 0 0
## A3 0 0 0 0 0
## A4 0 0 0 0 0
## A5 4 1 7 2 1
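A minimal sketch of the summary() and table() calls, using a toy data frame rather than the real "mydata". The CLASS cut points (6, 8, 10, 12) are an assumption read off the frequency table above, and the toy data contains every ring count from 3 to 25 so that the table has the expected 5 × 23 = 115 cells.

```r
# Toy stand-in for mydata: each ring count 3..25 appears twice.
rings <- rep(3:25, times = 2)
toy <- data.frame(
  RINGS = rings,
  # Class boundaries inferred from the table above (assumption).
  CLASS = cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
              labels = paste0("A", 1:5))
)

summary(toy)                      # descriptive statistics per column

tab <- table(toy$CLASS, toy$RINGS)
tab                               # CLASS in rows, RINGS in columns
length(tab)                       # number of cells in the table
```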
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
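Answer: The variables fall into three types: categorical factors (SEX, CLASS), an integer count (RINGS), and continuous measurements (LENGTH, DIAM, HEIGHT, WHOLE, SHUCK, VOLUME, RATIO). Comparing means with medians suggests mixed skewness: LENGTH and DIAM are left-skewed (mean below median), while WHOLE, SHUCK, VOLUME, and RATIO are right-skewed (mean above median); HEIGHT is only marginally right-skewed (mean 2.947 vs median 2.940). RINGS is also right-skewed overall, indicating a majority of younger abalones with a long tail of older individuals. This is confirmed by the frequency table, which shows that classes A1-A4 contain only abalones with 12 or fewer rings, while the older class A5 contains all individuals with 13+ rings. The maxima of VOLUME (995.7) and WHOLE (315.8) lie beyond the upper 1.5×IQR fences computed from the quartiles above, so both variables have high-end outliers.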
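The skewness reading above can be sketched numerically by comparing each mean with its median. The values are copied from the summary() output; the mean-versus-median comparison is only a rough heuristic, not a formal skewness test, and the column names `rel_gap` and `hint` are made up for this illustration.

```r
# Means and medians copied from the summary() output above.
stats <- data.frame(
  variable = c("LENGTH", "DIAM", "HEIGHT", "WHOLE", "SHUCK",
               "RINGS", "VOLUME", "RATIO"),
  mean     = c(11.08, 8.622, 2.947, 105.832, 45.4396, 9.993, 326.804, 0.14205),
  median   = c(11.45, 8.925, 2.940, 101.344, 42.5700, 9.000, 307.363, 0.13914)
)

# Relative gap between mean and median; its sign hints at the skew direction.
stats$rel_gap <- (stats$mean - stats$median) / stats$median
stats$hint    <- ifelse(stats$rel_gap > 0, "right", "left")

print(stats, digits = 3)
```

A very small gap, as for HEIGHT, is better read as "roughly symmetric" than as evidence of skew.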
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data, ignoring the marginal totals.
##
## A1 A2 A3 A4 A5 Sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
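A sketch of the table, margins and barplot, rebuilt from the counts printed above so that it runs without the data file. With the real data the first step would presumably be table(mydata$SEX, mydata$CLASS); the row-percentage line is an extra added here for illustration.

```r
# Counts copied from the printed SEX by CLASS table.
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"), CLASS = paste0("A", 1:5))
)
sex_class <- as.table(counts)

with_margins <- addmargins(sex_class)   # adds a "Sum" row and column
with_margins

# Row percentages: share of each sex falling in each class.
round(100 * prop.table(sex_class, margin = 1), 1)

# Grouped barplot of the 15 cells only (no marginal totals).
barplot(sex_class, beside = TRUE, legend.text = rownames(sex_class),
        xlab = "CLASS", ylab = "Count", main = "Abalone counts by SEX and CLASS")
```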
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Answer: The three sex categories are fairly balanced, with males slightly more prevalent (381, 36.8%), followed by infants (329, 31.8%; assuming I stands for infant) and females (326, 31.5%). This near-equal split is consistent with sampling that covered all sex categories, although the table alone cannot confirm that the sample is representative.
Regarding the distribution by CLASS, several patterns stand out.
Age progression: A3 is the most frequent class (329, 31.8%), followed by A2 (236, 22.8%), A4 (188, 18.1%), A5 (175, 16.9%), and A1 (108, 10.4%). Assuming A stands for age, this is a population concentrated in the middle age classes, with fewer very young (A1) and very old (A5) individuals.
Sex differences across classes: Infants dominate the A1 class (91 out of 108, 84.3%), which is expected since A1 represents the youngest age group. Females and males show similar patterns in the older classes (A3-A5), with nearly equal counts in A5 (77 females vs 79 males). The transition from infant to adult classes occurs around A2-A3: infant numbers drop sharply after A2 (from 133 to 65).
Population structure: The distribution suggests a healthy population with good representation across all age classes and both adult sexes, though there appears to be higher mortality or sampling underrepresentation in the youngest (A1) and oldest (A5) age groups.
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
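The row-sampling step can be sketched as follows. `mydata_toy` is a hypothetical stand-in with made-up values and the same number of rows as "mydata"; only the set.seed(123), sample() and row-indexing pattern is the point. The seed 42 used to build the toy data is arbitrary.

```r
# Hypothetical stand-in for mydata (values are random, not abalone data).
set.seed(42)
n <- 1036
mydata_toy <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = runif(n, 3, 17),
  DIAM   = runif(n, 2, 13),
  HEIGHT = runif(n, 0.5, 5),
  WHOLE  = runif(n, 2, 300),
  SHUCK  = runif(n, 1, 150)
)

# Sample row numbers, then select those rows from the data frame.
set.seed(123)
rows <- sample(1:nrow(mydata_toy), 200)   # without replacement by default
work <- mydata_toy[rows, ]

# Scatterplot matrix of columns 2 to 6.
plot(work[, 2:6])
```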
### Section 2: (5 points) Summarizing the data using graphics.
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
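A base R sketch of the plot described in (2)(b), using simulated weights in place of "mydata" (the toy SHUCK is simply a random fraction of WHOLE, which is an assumption made only so the block runs). It shows colour coding by CLASS and the zero-intercept line whose slope is the maximum SHUCK/WHOLE ratio.

```r
# Simulated stand-in data (not abalone measurements).
set.seed(1)
n     <- 150
WHOLE <- runif(n, 5, 300)
SHUCK <- WHOLE * runif(n, 0.25, 0.55)
CLASS <- factor(sample(paste0("A", 1:5), n, replace = TRUE),
                levels = paste0("A", 1:5))

# Slope of the reference line: the largest observed SHUCK/WHOLE ratio.
max_ratio <- max(SHUCK / WHOLE)

plot(WHOLE, SHUCK, col = as.integer(CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK vs WHOLE by CLASS")
abline(a = 0, b = max_ratio, lty = 2)     # a = intercept, b = slope
legend("topleft", legend = levels(CLASS),
       col = seq_along(levels(CLASS)), pch = 16, title = "CLASS")
```

Because the slope is the maximum ratio, every point lies on or below the dashed line by construction.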
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
Answer: The SHUCK vs WHOLE plot shows a tight, nearly linear relationship with comparatively consistent variability (homoscedasticity). This is expected because SHUCK (meat weight) is a direct component of WHOLE (total weight), creating a strong part-whole relationship. The reference line, whose slope is the maximum observed SHUCK/WHOLE ratio, marks the upper bound of the data: no point lies above it.
In contrast, the WHOLE vs VOLUME plot exhibits significant heteroscedasticity: the variability increases dramatically as both variables increase. This fanning pattern occurs because volume measures physical shell capacity while whole weight includes both shell and internal components, introducing more complex biological variability. The heteroscedasticity suggests that the relationship between volume and weight becomes less predictable for larger, older abalones.
Regarding age classes, both plots show younger abalones (A1-A2) clustering in the lower left with smaller values and less variability, while older classes (A4-A5) extend to the upper right with much greater dispersion. The heteroscedasticity in WHOLE vs VOLUME particularly affects the older age groups, where biological factors like individual growth rates, environmental conditions, and health status create wider size variations among abalones of similar ages. The progression across age classes is visible in both plots but appears more systematic and predictable in the SHUCK vs WHOLE relationship.
### Section 3: (8 points) Getting insights about the data using graphs.
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
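One way to lay out the 3 × 3 grid in base R, sketched with simulated right-skewed RATIO values (a lognormal draw, chosen only as a stand-in). par(mfrow = ...) fills the grid row by row, so three loops give a row of histograms, a row of boxplots and a row of Q-Q plots.

```r
# Simulated stand-in for RATIO by SEX (not abalone data).
set.seed(1)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)

old_par <- par(mfrow = c(3, 3))           # 3 rows x 3 columns, filled by row

for (s in levels(toy$SEX)) {              # row 1: histograms
  hist(toy$RATIO[toy$SEX == s], main = paste("RATIO:", s), xlab = "RATIO")
}
for (s in levels(toy$SEX)) {              # row 2: boxplots
  boxplot(toy$RATIO[toy$SEX == s], main = paste("RATIO:", s), ylab = "RATIO")
}
for (s in levels(toy$SEX)) {              # row 3: normal Q-Q plots
  qqnorm(toy$RATIO[toy$SEX == s], main = paste("Q-Q:", s))
  qqline(toy$RATIO[toy$SEX == s])
}

par(old_par)                              # restore the previous layout
```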
Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
Answer: All three RATIO distributions depart from normality on every criterion. The histograms are right-skewed, with most values clustered between 0.10 and 0.20 and long tails extending to higher values. The boxplots confirm this pattern: every sex category has outliers on the upper end, and the boxes sit toward the bottom of the range. In the Q-Q plots the points bow upward, away from the reference line, particularly in the upper tails, which indicates heavier right tails than a normal distribution would have. None of the sex-specific RATIO distributions approximates normality; all show the characteristic signs of right skew with outliers. Judging from the boxplots and Q-Q plots, the infant group has the most outliers, while the most extreme outlier is found in the female group.
(3)(b) (2 points) The boxplots in (3)(a) indicate that there are outlying RATIOs for each sex. boxplot.stats() can be used to identify outlying values of a vector. Present the abalones with these outlying RATIO values along with their associated variables in “mydata”. Display the observations by passing a data frame to the kable() function. Basically, we want to output those rows of “mydata” with an outlying RATIO, but we want to determine outliers looking separately at infants, females and males.
## Outliers using 1.5×IQR (standard):
## Infants: 8
## Females: 6
## Males: 5
##
## Extreme outliers using 3×IQR:
## Infants: 1
## Females: 1
## Males: 0
##
## Extreme Outlier Details:
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | I | 10.08 | 7.35 | 2.205 | 79.3750 | 44.000 | 6 | A1 | 163.3640 | 0.2693371 |
| 350 | F | 7.98 | 6.72 | 2.415 | 80.9375 | 40.375 | 7 | A2 | 129.5058 | 0.3117620 |
##
## All Outliers (1.5×IQR method):
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| 37 | I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| 42 | I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| 58 | I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| 67 | I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| 89 | I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| 105 | I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| 200 | I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| 350 | F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| 379 | F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| 420 | F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| 421 | F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| 458 | F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| 586 | F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
| 746 | M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| 754 | M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| 803 | M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| 810 | M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| 852 | M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
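A sketch of per-sex outlier detection with boxplot.stats(), run on a small constructed data set rather than "mydata". The helper name `flag_outliers` is made up for this example. The `coef` argument of boxplot.stats() sets the IQR multiplier, so coef = 1.5 gives the standard rule and coef = 3 the extreme-outlier rule used in the output above.

```r
# Constructed stand-in: evenly spaced RATIO values plus two planted outliers.
base <- seq(0.10, 0.18, length.out = 20)
toy <- data.frame(
  SEX   = factor(c(rep("F", 21), rep("I", 21), rep("M", 20))),
  RATIO = c(base, 0.40,      # F: one very high value
            base, 0.25,      # I: one moderately high value
            base)            # M: no planted outlier
)

# Return the rows whose RATIO is an outlier within their own SEX group.
flag_outliers <- function(df, coef = 1.5) {
  keep <- lapply(split(seq_len(nrow(df)), df$SEX), function(idx) {
    out_vals <- boxplot.stats(df$RATIO[idx], coef = coef)$out
    idx[df$RATIO[idx] %in% out_vals]
  })
  df[sort(unname(unlist(keep))), ]
}

standard <- flag_outliers(toy, coef = 1.5)
extreme  <- flag_outliers(toy, coef = 3)

# kable() is used when knitr is installed; otherwise fall back to print().
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(standard))
} else {
  print(standard)
}
```

Splitting row indices by SEX, rather than splitting the data frame, keeps the original row numbers so the flagged observations can be traced back to the full data.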
Essay Question (2 points): What are your observations regarding the results in (3)(b)?
Answer: The outlier analysis separates ordinary statistical outliers from genuinely extreme values. The conventional 1.5×IQR rule flags 19 observations across the three sex categories (8 infants, 6 females, 5 males). Under the more stringent 3×IQR rule for extreme outliers, which is less sensitive to the right skew of RATIO, only 2 remain (1 infant, 1 female, 0 males). This drop from 19 to 2 indicates that most flagged observations are not extreme and sit only slightly beyond the upper fence. The two extreme outliers are an infant (observation 3, RATIO 0.269) and a female (observation 350, RATIO 0.312, the maximum in the data); both have unusually high shuck weight for their volume. Among the remaining 1.5×IQR outliers, one female (observation 586) stands out as the only low value, with a RATIO of 0.067, possibly indicating a health issue or measurement error. Most of the other flagged values cluster in a narrow band (0.21-0.24) just above the upper fence, which suggests they reflect the skewed distribution rather than biologically anomalous specimens. This analysis highlights the importance of considering distribution shape when interpreting outliers in biological data.
### Section 4: (8 points) Getting insights about possible predictors.
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
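A base R sketch of the 2 × 2 layout, with simulated data standing in for "mydata". The linear-plus-noise toy model and the CLASS cut points are assumptions made only so that the block runs on its own.

```r
# Simulated stand-in (not abalone data).
set.seed(1)
n     <- 200
RINGS <- sample(3:20, n, replace = TRUE)
toy <- data.frame(
  RINGS  = RINGS,
  CLASS  = cut(RINGS, breaks = c(-Inf, 6, 8, 10, 12, Inf),
               labels = paste0("A", 1:5)),
  VOLUME = pmax(1, 35 * RINGS + rnorm(n, sd = 80)),
  WHOLE  = pmax(1, 11 * RINGS + rnorm(n, sd = 30))
)

old_par <- par(mfrow = c(2, 2))           # row 1: boxplots, row 2: scatterplots
boxplot(VOLUME ~ CLASS, data = toy, main = "VOLUME by CLASS")
boxplot(WHOLE ~ CLASS,  data = toy, main = "WHOLE by CLASS")
plot(toy$RINGS, toy$VOLUME, xlab = "RINGS", ylab = "VOLUME",
     main = "VOLUME vs RINGS")
plot(toy$RINGS, toy$WHOLE, xlab = "RINGS", ylab = "WHOLE",
     main = "WHOLE vs RINGS")
par(old_par)
```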
Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Answer: Both VOLUME and WHOLE show promise as predictors of age (RINGS), but with limitations. The boxplots reveal a clear progression: both variables generally increase with CLASS (A1 through A5), indicating age-related growth. The scatterplots show positive relationships with RINGS, but with substantial variability, particularly in the middle age range (RINGS 8-12). VOLUME appears slightly more linearly related to RINGS than WHOLE. However, the high variability within each age class suggests these are imperfect predictors: abalones of the same age can differ considerably in size, perhaps because of individual growth rates, environmental factors, or genetic differences. The relationships also appear to plateau at older ages, suggesting diminishing predictive power for mature abalones. Overall, they would be moderate predictors that work better for distinguishing very young abalones from adult or old ones than for precise age estimation.
### Section 5: (12 points) Getting insights regarding different groups in the data.
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
## Mean VOLUME by SEX and CLASS:
| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 255.29938 | 276.8573 | 412.6079 | 498.0489 | 486.1525 |
| I | 66.51618 | 160.3200 | 270.7406 | 316.4129 | 318.6930 |
| M | 103.72320 | 245.3857 | 358.1181 | 442.6155 | 440.2074 |
##
## Mean SHUCK by SEX and CLASS:
| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 38.90000 | 42.50305 | 59.69121 | 69.05161 | 59.17076 |
| I | 10.11332 | 23.41024 | 37.17969 | 39.85369 | 36.47047 |
| M | 16.39583 | 38.33855 | 52.96933 | 61.42726 | 55.02762 |
##
## Mean RATIO by SEX and CLASS:
| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 0.1546644 | 0.1554605 | 0.1450304 | 0.1379609 | 0.1233605 |
| I | 0.1569554 | 0.1475600 | 0.1372256 | 0.1244413 | 0.1167649 |
| M | 0.1512698 | 0.1564017 | 0.1462123 | 0.1364881 | 0.1262089 |
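The aggregate() and matrix() steps can be sketched as below for VOLUME, on a small constructed data set in which every SEX by CLASS combination occurs twice. The sketch relies on aggregate() returning rows with the first grouping variable (SEX) varying fastest, so filling the matrix column by column puts SEX in rows and CLASS in columns; it also assumes no combination is empty, since aggregate() drops empty groups.

```r
# Constructed stand-in: two observations per SEX x CLASS cell.
toy <- expand.grid(SEX = c("F", "I", "M"),
                   CLASS = paste0("A", 1:5),
                   rep = 1:2)
toy$VOLUME <- seq_len(nrow(toy)) * 10

# Mean VOLUME for each combination of SEX and CLASS.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)

# Reshape the 15 means into a 3 x 5 matrix labelled by SEX and CLASS.
vol_mat <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
                  dimnames = list(levels(toy$SEX), levels(toy$CLASS)))
vol_mat

# Cross-check against tapply(), which builds the same table directly.
check <- tapply(toy$VOLUME, list(toy$SEX, toy$CLASS), mean)
all.equal(vol_mat, check, check.attributes = FALSE)
```

The same pattern would be repeated for SHUCK and RATIO.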
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
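A sketch of one of the three graphs using interaction.plot(), built from the mean RATIO values printed in (5)(a) so that it runs without the data file. With "mydata" the raw columns could presumably be passed directly, with fun = mean doing the averaging. The last line, which reports the class with the highest mean RATIO for each sex, is an extra added for illustration.

```r
# Mean RATIO by SEX and CLASS, copied from the table in (5)(a).
ratio_means <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"), CLASS = paste0("A", 1:5))
)

# Long format: one row per SEX x CLASS cell.
long <- as.data.frame(as.table(ratio_means), responseName = "RATIO")

# One line per sex, CLASS on the horizontal axis.
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean, type = "b",
                 trace.label = "SEX", xlab = "CLASS", ylab = "Mean RATIO")

# Class with the highest mean RATIO for each sex.
peak <- apply(ratio_means, 1, function(r) names(which.max(r)))
peak
```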
Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.
Answer: Why do infants have the highest mean RATIO in Class A1 but the lowest in every later class? Why do females start larger while males grow rapidly to nearly match them, suggesting divergent growth strategies? What drives the peak in mean RATIO at Class A2 for females and males before it declines, pointing to a shift in resource allocation? Are these sex-based patterns tied to reproductive strategies? These interactions between age, sex, and growth bear directly on harvesting: peak profitability may come at Class A2 for meat yield per unit volume or at Class A4 for bulk, depending on market premiums for yield versus absolute size.
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
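A base R sketch of the subsetting and the four boxplots, using simulated data in place of "mydata" (the growth model is an arbitrary assumption). It follows the suggested route: subset() to RINGS < 16, then split infants from adults, with the stated ylim values.

```r
# Simulated stand-in (not abalone data).
set.seed(1)
n <- 300
toy <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:25, n, replace = TRUE)
)
toy$VOLUME <- pmax(1, 40 * toy$RINGS + rnorm(n, sd = 60))
toy$WHOLE  <- pmax(1, 14 * toy$RINGS + rnorm(n, sd = 25))

# Keep ring counts below 16, then separate infants from adults (M and F).
sub     <- subset(toy, RINGS < 16)
infants <- subset(sub, SEX == "I")
adults  <- subset(sub, SEX != "I")

old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME | RINGS")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME | RINGS")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE | RINGS")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE | RINGS")
par(old_par)
```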
Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
Answer: The displays suggest a shift in growth pattern after maturity. Infants show steady, roughly linear growth with tight size distributions at each ring count, consistent with a predictable early development phase. Adults show far more variability: at the same ring count sizes vary widely, with more outliers and non-monotonic medians, especially in mid-life (RINGS 8-12). This suggests that after maturity, factors such as nutrition, environment, and genetics outweigh age as determinants of size. Growth also slows considerably in older adults, so each additional ring adds less size. The sex differences seen in the earlier plots add to this variability: females start out larger, but males grow rapidly and nearly catch up, possibly reflecting divergent reproductive strategies, which further complicates any simple model for predicting age from physical characteristics.
### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
Answer: The original study failed because the EDA reveals fundamental biological and statistical limitations that make precise age prediction from physical measurements impossible. Key reasons include:
1. High Biological Variability: The substantial overlap in physical measurements (VOLUME, WHOLE) between adjacent age classes demonstrates that abalones of the same age can have dramatically different sizes likely due to individual growth rates, genetics, and environmental factors. This inherent variability creates too much “noise” for accurate prediction.
2. Non-Linear Growth Patterns: The relationships between physical measurements and age are not linear. Growth accelerates in youth but plateaus in maturity, meaning the same physical size increase represents different age increments at different life stages. This violates assumptions of simple linear models.
3. Heteroscedasticity: The increasing variability in size with age (visible in the pattern of WHOLE vs VOLUME) means prediction uncertainty grows for older abalones, making the model increasingly unreliable for the very populations most critical for conservation management.
4. Complex Group Differences: Distinct growth patterns between infants and adults, and growth rate differences between males and females, require separate predictive models rather than a single unified approach. The original study likely attempted a “one-size-fits-all” model that couldn’t accommodate these biological realities.
5. Measurement Limitations: While physical measurements can distinguish broad age categories (young vs old), they lack the precision needed for exact ring count prediction. The investigators were correct that additional contextual data (weather patterns, location, food availability) would be necessary to account for the environmental influences on growth that create the observed variability.
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
Answer: Before accepting summary statistics as representative, I would ask: 1) What was the sampling methodology? (Random vs convenience, or other?) 2) What was the sample size and response rate? 3) Are there potential selection biases in how participants were recruited or data were collected? 4) What is the temporal and geographical context of the data collection? 5) How were missing data handled in the analysis? 6) What are the operational definitions and measurement protocols for the variables? 7) What evidence exists that the sample adequately represents the target population? 8) Are there important subgroups that might show different patterns? 9) What is known about the reliability and validity of the measurements? 10) Are there potential confounding variables that might affect the interpretation? I’d seek evidence of representativeness, explore potential subgroup variations, and identify any unmeasured confounding variables that could distort the findings.
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer: Observational studies present several analytical challenges: 1) Confounding - the inability to control for unmeasured variables that may explain observed relationships. 2) Selection bias - participants may differ systematically from the target population. 3) Lack of randomization - groups may differ in ways beyond the variables of interest. 4) Temporal ambiguity - difficulty establishing whether the presumed cause preceded the effect. 5) Measurement error - variables may be measured with less precision than in controlled experiments. Regarding causality, observational studies alone cannot definitively establish causal relationships due to these limitations. The gold standard for causal inference remains randomized controlled experiments. However, observational studies are invaluable for: generating hypotheses for future experimental testing, identifying associations and patterns in real-world settings, studying phenomena where experiments are unethical or impractical, providing ecological validity by examining relationships in natural contexts, and establishing correlational evidence that can guide policy and further research. They excel at describing “what is” rather than explaining “why it is.”