Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.
The following code chunk will:
Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.
## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
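The summary in (1)(a) lists VOLUME and RATIO, which are not among the eight variables shown above. The rows printed in the (3)(b) table are consistent with VOLUME being the product of LENGTH, DIAM and HEIGHT, and RATIO being SHUCK divided by VOLUME. The sketch below assumes those definitions and checks them on three rows copied from that table; the real assignment would read abalones.csv instead.

```r
# Three rows copied from the (3)(b) table; a stand-in for the full abalones.csv
mydata <- data.frame(
  SEX    = factor(c("I", "M", "F"), levels = c("F", "I", "M")),
  LENGTH = c(10.080, 13.440, 7.980),
  DIAM   = c(7.350, 10.815, 6.720),
  HEIGHT = c(2.205, 1.680, 2.415),
  SHUCK  = c(44.00000, 63.73125, 40.37500)
)

# Assumed definitions, inferred from the printed table rather than from the setup chunk
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO  <- mydata$SHUCK / mydata$VOLUME

print(mydata)
```

The printed VOLUME and RATIO values can be compared against the matching rows of the (3)(b) table.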
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## [1] Abalones Summary Statistics:
## SEX LENGTH DIAM HEIGHT WHOLE
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525 Min. : 1.625
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415 1st Qu.: 56.484
## M:381 Median :11.45 Median : 8.925 Median :2.940 Median :101.344
## Mean :11.08 Mean : 8.622 Mean :2.947 Mean :105.832
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570 3rd Qu.:150.319
## Max. :16.80 Max. :13.230 Max. :4.935 Max. :315.750
## SHUCK RINGS CLASS VOLUME
## Min. : 0.5625 Min. : 3.000 A1:108 Min. : 3.612
## 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236 1st Qu.:163.545
## Median : 42.5700 Median : 9.000 A3:329 Median :307.363
## Mean : 45.4396 Mean : 9.993 A4:188 Mean :326.804
## 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175 3rd Qu.:463.264
## Max. :157.0800 Max. :25.000 Max. :995.673
## RATIO
## Min. :0.06734
## 1st Qu.:0.12241
## Median :0.13914
## Mean :0.14205
## 3rd Qu.:0.15911
## Max. :0.31176
## [1] Abalones Class & Rings Statistics:
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8 6
##
## 21 22 23 24 25
## A1 0 0 0 0 0
## A2 0 0 0 0 0
## A3 0 0 0 0 0
## A4 0 0 0 0 0
## A5 4 1 7 2 1
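A minimal sketch of the two calls behind this output, run on a small stand-in data frame named toy (a hypothetical name; the RINGS and CLASS values are taken from a few rows of the (3)(b) table). The number of cells in a two-way table is its row count times its column count, which length() returns directly.

```r
# Small stand-in: RINGS and CLASS for nine abalones
toy <- data.frame(
  RINGS = c(6L, 3L, 4L, 5L, 7L, 8L, 10L, 9L, 14L),
  CLASS = factor(c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "A5"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Descriptive statistics for every column
print(summary(toy))

# Two-way frequency table: CLASS in rows, RINGS in columns
tbl <- table(toy$CLASS, toy$RINGS)
print(tbl)

# Cells = rows x columns; unused factor levels (A4 here) still get a row
print(dim(tbl))
print(length(tbl))
```

On the full data the same calculation gives 5 classes times 23 distinct ring counts (3 through 25), which is the 115 cells the prompt mentions.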
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
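Answer: SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous. SHUCK, VOLUME and WHOLE stand out: for each, the mean exceeds the median and the maximum lies far above the third quartile, which suggests right skew and possible high outliers.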
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
##
## A1 A2 A3 A4 A5 Sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
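The table above can be rebuilt from its printed counts to show the addmargins() and barplot() steps. This is a sketch only: in the assignment the table would come from table(mydata$SEX, mydata$CLASS), whereas here the counts are typed in as a matrix.

```r
# Counts copied from the SEX x CLASS table above
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
tbl <- as.table(counts)

# Add row and column totals for presentation
with_margins <- addmargins(tbl)
print(with_margins)

# Plot the table WITHOUT the margins, one group of bars per class
barplot(tbl, beside = TRUE, legend.text = rownames(tbl),
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```

Passing tbl rather than with_margins to barplot() is what keeps the marginal totals out of the chart.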
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Answer: Male and female abalone follow a roughly bell-shaped distribution across CLASS, peaking in A3, but are negatively skewed because of the large counts in A4 and A5. There are more male than female abalone in each class. Male and female counts are closely matched in A4 and A5, but there is a larger gap in A3. The infant abalone are positively skewed, concentrated in classes A1 and A2.
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
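The sampling step can be sketched as follows. The data frame here is a made-up stand-in with 1036 rows and only three numeric columns, so the plot call uses columns 2 to 4 rather than 2 to 6; the column proportions are invented for illustration.

```r
# Stand-in for "mydata": 1036 rows of made-up measurements (not the abalone data)
n <- 1036
mydata <- data.frame(
  ROW    = seq_len(n),
  LENGTH = seq(2.73, 16.80, length.out = n)
)
mydata$DIAM   <- 0.78 * mydata$LENGTH   # invented proportion
mydata$HEIGHT <- 0.27 * mydata$LENGTH   # invented proportion

set.seed(123)                                      # fix the random number stream
rows <- sample(seq_len(nrow(mydata)), size = 200)  # row indices, without replacement
work <- mydata[rows, ]                             # keep only the sampled rows

print(dim(work))

# A data frame with several numeric columns is drawn as a scatterplot matrix
plot(work[, 2:4])
```

Sampling the row indices and then subsetting is what keeps each observation's variables together.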
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
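A minimal base R sketch of this plot, using eight rows copied from the (3)(b) table as a stand-in named toy (a hypothetical name). Those rows are RATIO outliers, so the picture is not representative of the full data; the point is only the mechanics of the colour coding and the zero-intercept line.

```r
# WHOLE, SHUCK and CLASS for eight abalones copied from the (3)(b) table
toy <- data.frame(
  WHOLE = c(79.37500, 6.18750, 23.37500, 74.56250,
            130.25000, 176.12500, 80.93750, 133.87500),
  SHUCK = c(44.00000, 2.93750, 11.81250, 32.37500,
            63.73125, 99.00000, 40.37500, 38.25000),
  CLASS = factor(c("A1", "A1", "A2", "A2", "A3", "A3", "A2", "A5"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Largest observed SHUCK-to-WHOLE ratio becomes the slope of the reference line
slope <- max(toy$SHUCK / toy$WHOLE)

plot(toy$WHOLE, toy$SHUCK, col = as.integer(toy$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE")
abline(a = 0, b = slope, lty = 2)   # a = intercept, b = slope
legend("topleft", legend = levels(toy$CLASS),
       col = seq_along(levels(toy$CLASS)), pch = 16, title = "CLASS")
```

Because the slope is the maximum ratio, every point lies on or below the dashed line.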
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
Answer: The overall takeaway of the second graph is that the correlation between age and weight tends to become less positive over time. Both plots follow a positive, linear relationship. When volume increases, whole weight increases as well. When an abalone's shuck weight (the meat portion) increases, its whole weight increases as well. Whole weight and volume seem to be more closely correlated than shuck weight and whole weight. The infant abalone, normally found in A1 and A2, are smaller, so it makes sense that they follow a close to 1:1 linear growth pattern. Shuck weight decreases as the abalone ages and nears death, because the shell comes to account for more of the weight than the meat. This can be seen in the A5 abalone, which are less positively correlated as whole weight increases.
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
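One way to lay out the nine panels in base R is sketched below. The RATIO values are simulated purely to make the example run and are not the abalone data; toy and by_sex are hypothetical names.

```r
set.seed(1)  # simulated stand-in values, NOT the abalone data
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 50)),
  RATIO = rlnorm(150, meanlog = log(0.14), sdlog = 0.15)
)
by_sex <- split(toy$RATIO, toy$SEX)   # one numeric vector per sex

old_par <- par(mfrow = c(3, 3))       # panels fill row by row
for (s in names(by_sex)) {
  hist(by_sex[[s]], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in names(by_sex)) {
  boxplot(by_sex[[s]], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in names(by_sex)) {
  qqnorm(by_sex[[s]], main = paste("Q-Q plot:", s))
  qqline(by_sex[[s]])                 # reference line for normality
}
par(old_par)                          # restore the previous layout
```

Looping over the sexes once per plot type is what puts histograms in the first row, boxplots in the second and Q-Q plots in the third.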
Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
Answer: (Enter your answer here.)
(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).
| SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|
| I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
| F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
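A hedged sketch of how such rows might be pulled out. boxplot.stats() returns the points a boxplot would draw as outliers; coef = 1.5 is the usual mild-outlier fence and coef = 3 is a common convention for extreme outliers. The RATIO values below are made up so that exactly one is flagged, and the result is printed directly, although it could equally be passed to kable().

```r
# Made-up RATIO values for two sexes (not the abalone data)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I"), each = 8)),
  RATIO = c(0.12, 0.13, 0.14, 0.14, 0.15, 0.15, 0.16, 0.31,   # F: one high value
            0.11, 0.12, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16)   # I: nothing unusual
)

# Keep, within each sex, the rows whose RATIO falls outside the boxplot fences
find_outliers <- function(d, coef) {
  d[d$RATIO %in% boxplot.stats(d$RATIO, coef = coef)$out, ]
}
mild_or_worse <- do.call(rbind, lapply(split(toy, toy$SEX), find_outliers, coef = 1.5))
extreme       <- do.call(rbind, lapply(split(toy, toy$SEX), find_outliers, coef = 3))

print(mild_or_worse)
print(extreme)
```

Splitting by sex first matters because the fences are computed from each group's own quartiles.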
Essay Question (2 points): What are your observations regarding the results in (3)(b)?
Answer: Male, female and infant abalone are all skewed to the left. Judging by the outliers on the boxplot, females seem to have a higher volume and length than males. The abalone with the highest number of rings had the smallest ratio. Outside of this outlier, most male and female abalone are similar in whole, height, diameter, weight, rings, class, and shuck. As expected, infant abalone have smaller measurements in almost every category compared to adult abalone, with some exceptions in length, diameter, and height.
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
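A base R sketch of the two-by-two layout. The measurements are simulated for illustration only; the CLASS cut-points (A1 for 3 to 6 rings, A2 for 7 to 8, A3 for 9 to 10, A4 for 11 to 12, A5 for 13 and above) follow the CLASS by RINGS table in (1)(a).

```r
set.seed(2)  # simulated stand-in values, NOT the abalone data
n <- 100
toy <- data.frame(RINGS = sample(3:15, n, replace = TRUE))
toy$CLASS  <- cut(toy$RINGS, breaks = c(2, 6, 8, 10, 12, Inf),
                  labels = c("A1", "A2", "A3", "A4", "A5"))
toy$VOLUME <- 40 * toy$RINGS + rnorm(n, sd = 80)   # invented relationship
toy$WHOLE  <- 12 * toy$RINGS + rnorm(n, sd = 30)   # invented relationship

old_par <- par(mfrow = c(2, 2))   # row 1: boxplots, row 2: scatterplots
boxplot(VOLUME ~ CLASS, data = toy, main = "VOLUME by CLASS")
boxplot(WHOLE ~ CLASS, data = toy, main = "WHOLE by CLASS")
plot(toy$RINGS, toy$VOLUME, xlab = "RINGS", ylab = "VOLUME",
     main = "VOLUME versus RINGS")
plot(toy$RINGS, toy$WHOLE, xlab = "RINGS", ylab = "WHOLE",
     main = "WHOLE versus RINGS")
par(old_par)                      # restore the previous layout
```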
Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Answer: The graphs for volume and whole weight look nearly identical, and you would assume that the older an abalone gets, the greater its volume and weight would be. Despite that, these variables are poor predictors of an abalone’s age. The scatterplots show a clear relationship with rings for younger abalone, but for older abalone the relationship is more tenuous, making it more difficult to determine age.
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
## $Volume
## A1 A2 A3 A4 A5
## M 255.29938 276.8573 412.6079 498.0489 486.1525
## I 66.51618 160.3200 270.7406 316.4129 318.6930
## F 103.72320 245.3857 358.1181 442.6155 440.2074
##
## $Shuck
## A1 A2 A3 A4 A5
## M 38.90000 42.50305 59.69121 69.05161 59.17076
## I 10.11332 23.41024 37.17969 39.85369 36.47047
## F 16.39583 38.33855 52.96933 61.42726 55.02762
##
## $Ratio
## A1 A2 A3 A4 A5
## M 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## I 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## F 0.1512698 0.1564017 0.1462123 0.1364881 0.1262089
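The aggregate() and matrix() steps can be sketched on a tiny made-up data set with two sexes and two classes. With the formula VOLUME ~ SEX + CLASS, aggregate() returns rows with SEX varying fastest, so filling a matrix column by column puts SEX in rows and CLASS in columns. The names toy, agg and vol_mat are hypothetical.

```r
# Made-up data: two observations per SEX x CLASS combination
toy <- data.frame(
  SEX    = factor(c("F", "F", "I", "I", "F", "F", "I", "I")),
  CLASS  = factor(c("A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2")),
  VOLUME = c(100, 110, 60, 70, 240, 250, 150, 170)
)

# One mean per combination of SEX and CLASS
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)
print(agg)

# Reshape the means into a SEX-by-CLASS matrix (filled column by column)
vol_mat <- matrix(agg$VOLUME, nrow = nlevels(toy$SEX),
                  dimnames = list(levels(toy$SEX), levels(toy$CLASS)))
print(vol_mat)
```

The same pattern would be repeated for SHUCK and RATIO, and each matrix could be passed to kable() for display.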
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
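As an illustration, the mean RATIO matrix printed in (5)(a) can be typed in and passed to interaction.plot(). This is a sketch: in the assignment the function would more likely be called on the raw columns of “mydata”, and with one value per cell here, fun = mean simply returns that value.

```r
# Mean RATIO by SEX and CLASS, copied from the $Ratio output in (5)(a)
ratio <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("M", "I", "F"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)

# Long format: one row per SEX x CLASS cell
long <- as.data.frame(as.table(ratio), responseName = "RATIO")

# One line per sex, CLASS on the horizontal axis
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean, type = "b",
                 trace.label = "SEX", xlab = "CLASS", ylab = "Mean RATIO",
                 main = "Mean RATIO versus CLASS")
```

The VOLUME and SHUCK graphs would follow the same pattern with their own matrices.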
Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.
Answer: These plots raise several questions about the relationship between an abalone’s age, gender and weight. Why are female abalone larger in volume and shuck than the males? Why do infant abalone have a higher ratio? When do growth spurts for abalone occur and what causes them? Is it in A2 or A3? How can an infant abalone be considered A5 in class? Can they stay an infant throughout their entire lifespan? How are class and age distinct? Why is volume constant from A4 to A5?
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults; factor levels “M” and “F,” combined. The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
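A base R sketch of the subsetting and the four panels, on simulated stand-in data (the values and the names toy, sub16, infants and adults are invented for illustration).

```r
set.seed(3)  # simulated stand-in values, NOT the abalone data
n <- 300
toy <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:20, n, replace = TRUE)
)
toy$VOLUME <- pmax(0, 35 * toy$RINGS + rnorm(n, sd = 60))   # invented relationship
toy$WHOLE  <- pmax(0, 12 * toy$RINGS + rnorm(n, sd = 25))   # invented relationship

sub16   <- subset(toy, RINGS < 16)       # drop the sparse high-ring observations
infants <- subset(sub16, SEX == "I")
adults  <- subset(sub16, SEX != "I")     # "M" and "F" combined

old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME by RINGS")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME by RINGS")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE by RINGS")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE by RINGS")
par(old_par)
```

Using the same ylim for the infant and adult panels in each row keeps the two groups directly comparable.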
Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
Answer: These displays suggest that abalone growth follows a negatively skewed distribution curve, where the majority of the abalone’s development takes place in the middle to late stage of their life. They seem to peak and reach largest weight around 10-12 rings and then decrease after 12 rings. There is less consistency in weight as the abalone ages, and more so for adults.
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
Answer: Based on this data, we have more clarity around why the original study failed. We can see how hard it is to distinguish infant vs adult abalone. The study failed to identify more unique and meaningful relationships in gender, aside from females weighing more. Additionally, correlating factors become less clear as the abalone ages.
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
Answer: Questions I would have for whoever collected this data would include: How was this sample population determined? How were the abalones harvested? What challenges existed in measuring the abalones’ characteristics (such as class, rings and weight)? Do we need to consider the location of this sample population? Is it consistent with the average abalone across the world?
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer: A challenge in analyzing observational data is its subjectivity. This study seems to lack any experimental controls. We are left wondering so much about this sample that we are unsure whether we can extrapolate these trends as representative of an average abalone population. Because of this, causality becomes very difficult to determine. We can treat this study as an independent data collection to be collated with future, more developed abalone research.