## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
### Section 1: (6 points) Summarizing the data.
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## SEX LENGTH DIAM HEIGHT WHOLE
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525 Min. : 1.625
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415 1st Qu.: 56.484
## M:381 Median :11.45 Median : 8.925 Median :2.940 Median :101.344
## Mean :11.08 Mean : 8.622 Mean :2.947 Mean :105.832
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570 3rd Qu.:150.319
## Max. :16.80 Max. :13.230 Max. :4.935 Max. :315.750
## SHUCK RINGS CLASS VOLUME
## Min. : 0.5625 Min. : 3.000 A1:108 Min. : 3.612
## 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236 1st Qu.:163.545
## Median : 42.5700 Median : 9.000 A3:329 Median :307.363
## Mean : 45.4396 Mean : 9.993 A4:188 Mean :326.804
## 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175 3rd Qu.:463.264
## Max. :157.0800 Max. :25.000 Max. :995.673
## RATIO
## Min. :0.06734
## 1st Qu.:0.12241
## Median :0.13914
## Mean :0.14205
## 3rd Qu.:0.15911
## Max. :0.31176
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8 6
##
## 21 22 23 24 25
## A1 0 0 0 0 0
## A2 0 0 0 0 0
## A3 0 0 0 0 0
## A4 0 0 0 0 0
## A5 4 1 7 2 1
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
Answer: SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous numeric measurements. All of the numeric variables span a wide range of values. For length, diameter, and height, the mean and median are relatively close, with height showing the smallest difference, so height is the closest to symmetric. Every variable except length and diameter has a mean greater than its median, which indicates right skew and suggests that a few high values are pulling the average up. Length and diameter have means below their medians, which points to mild left skew; a skewness statistic (rather than kurtosis, which describes tail weight) would be needed to judge how close to symmetric they are. In the CLASS by RINGS table, each class covers its own non-overlapping range of rings, so class rises with ring count and ring count can serve as a proxy for age. The largest class is A3 (9 and 10 rings), so most abalone in this dataset are middle-aged. The youngest abalone (class A1, fewer than 7 rings) are relatively scarce, and frequencies decline steadily beyond 13 rings, so the oldest abalone may show up as outliers in RINGS.
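The mean-versus-median reasoning above can be checked numerically for RINGS, because the CLASS by RINGS table gives its full distribution. The sketch below rebuilds the RINGS vector from those counts; the class boundaries passed to `cut()` are an assumption read off the table, and `skewness()` is a small helper written for this illustration, not a base R function.

```r
# RINGS counts copied from the CLASS-by-RINGS table above (rings 3 to 25)
ring_counts <- c(9, 8, 24, 67, 91, 145, 182, 147, 125, 63, 48, 35,
                 27, 15, 13, 8, 8, 6, 4, 1, 7, 2, 1)
rings <- rep(3:25, times = ring_counts)

# Assumed class boundaries, read off the table: A1 = 3-6, A2 = 7-8, ...
ring_class <- cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
                  labels = c("A1", "A2", "A3", "A4", "A5"))
tab <- table(ring_class, rings)   # 5 classes x 23 ring values

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

print(c(mean = mean(rings), median = median(rings),
        skewness = skewness(rings)))
```

A positive skewness together with a mean above the median is what the answer describes as right skew.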
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
## Margins computed over dimensions
## in the following order:
## 1: Sex
## 2: Class
## Class
## Sex A1 A2 A3 A4 A5 sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## sum 108 236 329 188 175 1036
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Answer: The first thing that stands out is that infant abalone are present in every class, even the classes that are believed to be older. The majority of infants are in A1 and A2, which makes sense because these are the youngest classes by ring count, but infants still appear in A3 through A5. Classes A3, A4, and A5 hold the majority of male and female abalone, so most adults have a middling or higher ring count. Males outnumber females in every class, and males outnumber infants in classes A3 to A5. I wonder whether sex affects other variables, such as number of rings, whole weight, or volume. Because there are more males than females, males could be pulling the means up and contributing to the right skew.
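For reference, the table and barplot requested in (1)(b) can be sketched from the counts printed above. This version types the 15 counts in by hand instead of calling `table()` on the data frame, which is an assumption made so that the example runs without the data file.

```r
# Counts copied from the SEX-by-CLASS table above
counts <- matrix(c(  5,  41, 121, 82, 77,
                    91, 133,  65, 21, 19,
                    12,  62, 143, 85, 79),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Sex = c("F", "I", "M"),
                                 Class = c("A1", "A2", "A3", "A4", "A5")))
tab <- as.table(counts)

# addmargins() appends row and column totals (labelled "Sum" by default)
with_margins <- addmargins(tab)
print(with_margins)

# Barplot of the 15 cells only, without the marginal totals.
# pdf(NULL) discards the figure so the sketch runs non-interactively;
# drop that line and dev.off() to see the plot.
pdf(NULL)
barplot(tab, beside = TRUE, legend.text = TRUE,
        xlab = "Class", ylab = "Frequency",
        main = "Abalone counts by sex and class")
invisible(dev.off())
```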
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
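The row-index sampling described in (1)(c) might look like the sketch below. `mydata_demo` is a hypothetical stand-in with simulated values and the same number of rows as “mydata”, so the rows selected here illustrate the mechanics only.

```r
# Hypothetical stand-in for mydata: 1036 rows, simulated measurements
set.seed(1)
n <- 1036
mydata_demo <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = runif(n, 2.73, 16.80)
)
mydata_demo$DIAM   <- 0.80 * mydata_demo$LENGTH + rnorm(n, sd = 0.3)
mydata_demo$HEIGHT <- 0.27 * mydata_demo$LENGTH + rnorm(n, sd = 0.2)
mydata_demo$WHOLE  <- 0.08 * mydata_demo$LENGTH^3 + rnorm(n, sd = 5)
mydata_demo$SHUCK  <- 0.43 * abs(mydata_demo$WHOLE)

# Sample row numbers, not the data frame itself, then select those rows
set.seed(123)
idx  <- sample(1:nrow(mydata_demo), 200)   # without replacement by default
work <- mydata_demo[idx, ]

# Scatterplot matrix of columns 2 to 6 (figure discarded by pdf(NULL))
pdf(NULL)
plot(work[, 2:6])
invisible(dev.off())
```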
### Section 2: (5 points) Summarizing the data using graphics.
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
Answer: The plot in (2)(b), SHUCK versus WHOLE, shows less variability than the plot in (2)(a), WHOLE versus VOLUME. This is probably because shuck weight is a component of whole weight, so the two are more closely correlated. However, for some classes the shuck weight seems to drop relative to whole weight as whole weight increases. Since the plot is color-coded by CLASS, this suggests that the shuck share of whole weight may decrease as the abalone matures. We would want to check whether age class plays a role in the proportion of shuck weight to whole weight.
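The maximum-ratio reference line from (2)(b) can be sketched in base R as follows. The data are simulated and the CLASS assignment is hypothetical; only the `abline()` usage (intercept `a`, slope `b`) carries over to the real plot.

```r
# Simulated stand-in data (assumption: not the abalone measurements)
set.seed(1)
WHOLE <- runif(300, 2, 315)
SHUCK <- WHOLE * runif(300, 0.25, 0.55)
CLASS <- cut(WHOLE, breaks = 5, labels = c("A1", "A2", "A3", "A4", "A5"))

# The largest SHUCK/WHOLE ratio becomes the slope of a line through the origin
max_slope <- max(SHUCK / WHOLE)

pdf(NULL)   # discard the figure; remove to view it
plot(WHOLE, SHUCK, col = as.integer(CLASS), pch = 16,
     xlab = "Whole weight", ylab = "Shuck weight",
     main = "Shuck weight versus whole weight")
abline(a = 0, b = max_slope, lty = 2)
legend("topleft", legend = levels(CLASS),
       col = seq_along(levels(CLASS)), pch = 16)
invisible(dev.off())
```

By construction every point lies on or below the dashed line, so the vertical gap to the line shows how far each animal falls short of the maximum shuck share.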
### Section 3: (8 points) Getting insights about the data using graphs.
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
Answer: The boxplots show outliers in all three distributions, which is consistent with the histograms: each is positively skewed with a long right tail. This is most noticeable in the boxplot of RATIO for females, which shows an extreme upper outlier. The Q-Q plot for females likewise departs above the reference line in the right tail, which agrees with both the boxplot and the histogram.
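A minimal base R sketch of the 3 by 3 layout from (3)(a) is shown below, using simulated right-skewed RATIO values as a stand-in for the real data. Because `par(mfrow = ...)` fills the grid row by row, three loops in sequence give one row per plot type.

```r
# Simulated stand-in: log-normal RATIO values, 100 per sex (assumption)
set.seed(1)
demo <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)
sexes <- levels(demo$SEX)

pdf(NULL)   # discard the figure; remove to view it
par(mfrow = c(3, 3))
for (s in sexes) {            # row 1: histograms
  hist(demo$RATIO[demo$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in sexes) {            # row 2: boxplots
  boxplot(demo$RATIO[demo$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in sexes) {            # row 3: normal Q-Q plots with reference line
  r <- demo$RATIO[demo$SEX == s]
  qqnorm(r, main = paste("Q-Q plot:", s))
  qqline(r)
}
invisible(dev.off())
```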
(3)(b) (2 points) The boxplots in (3)(a) indicate that there are outlying RATIOs for each sex. boxplot.stats() can be used to identify outlying values of a vector. Present the abalones with these outlying RATIO values along with their associated variables in “mydata”. Display the observations by passing a data frame to the kable() function. Basically, we want to output those rows of “mydata” with an outlying RATIO, but we want to determine outliers looking separately at infants, females and males.
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| 37 | I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| 42 | I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| 58 | I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| 67 | I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| 89 | I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| 105 | I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| 200 | I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| 350 | F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| 379 | F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| 420 | F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| 421 | F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| 458 | F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| 586 | F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
| 746 | M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| 754 | M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| 803 | M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| 810 | M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| 852 | M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
Essay Question (2 points): What are your observations regarding the results in (3)(b)?
Answer: In this table, the infant outliers are clearly smaller than the male and female outliers. The largest share of the outliers (8 of 19) comes from infant abalone, and these are small animals, all in classes A1 and A2. All but one of the outliers lie on the high side of RATIO; the exception is female observation #586, whose RATIO of 0.067 is the minimum in the dataset. While the male outliers tend to be heavy and large in volume, some female outliers are heavier and larger than every male outlier in the table, for example female observation #379.
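The per-sex outlier selection in (3)(b) can be sketched with `boxplot.stats()`, whose `$out` element holds the values beyond the whiskers. The data below are simulated, with one extreme value planted in each group as an assumption so that the result is predictable.

```r
# Simulated stand-in: 100 RATIO values per sex (assumption)
set.seed(1)
demo <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rnorm(300, mean = 0.14, sd = 0.02)
)
demo$RATIO[c(1, 101, 201)] <- 0.40   # planted outliers, one per sex

# For each sex separately, keep the row numbers whose RATIO is flagged
rows_by_sex <- split(seq_len(nrow(demo)), demo$SEX)
out_rows <- unlist(lapply(rows_by_sex, function(i) {
  i[demo$RATIO[i] %in% boxplot.stats(demo$RATIO[i])$out]
}))

outliers <- demo[sort(unname(out_rows)), ]
print(outliers)   # in an R Markdown report, knitr::kable(outliers) would format this
```

Splitting first matters: a value can be unremarkable for the pooled data and still be an outlier within its own sex.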
### Section 4: (8 points) Getting insights about possible predictors.
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Answer: I think volume and whole weight may be useful predictors of age for abalone, but only up to a point. Both appear to increase roughly linearly with age class. The class that is hardest to predict is A5, the oldest class, and the same variability shows up in the scatterplots. Although the classes seem to fall within their own brackets of volume, whole weight, and ring count, the variability among A5 abalone makes measurements other than ring count unreliable for the oldest animals.
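The four-panel graphic from (4)(a) can be laid out in base R as sketched below. The data are simulated, and the ring boundaries used to define CLASS are an assumption taken from the CLASS by RINGS table in Section 1.

```r
# Simulated stand-in data (assumption: not the abalone measurements)
set.seed(1)
n <- 300
RINGS  <- sample(3:20, n, replace = TRUE)
CLASS  <- cut(RINGS, breaks = c(-Inf, 6, 8, 10, 12, Inf),
              labels = c("A1", "A2", "A3", "A4", "A5"))
VOLUME <- 30 * RINGS + rnorm(n, sd = 80)
WHOLE  <- 10 * RINGS + rnorm(n, sd = 30)

pdf(NULL)   # discard the figure; remove to view it
par(mfrow = c(2, 2))
boxplot(VOLUME ~ CLASS, main = "Volume by class", ylab = "Volume")   # row 1
boxplot(WHOLE ~ CLASS, main = "Whole weight by class", ylab = "Whole weight")
plot(RINGS, VOLUME, main = "Volume versus rings")                    # row 2
plot(RINGS, WHOLE, main = "Whole weight versus rings")
invisible(dev.off())
```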
### Section 5: (12 points) Getting insights regarding different groups in the data.
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
Mean VOLUME by SEX and CLASS:

| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 255.299 | 276.857 | 412.608 | 498.049 | 486.153 |
| I | 66.516 | 160.320 | 270.741 | 316.413 | 318.693 |
| M | 103.723 | 245.386 | 358.118 | 442.616 | 440.207 |

Mean SHUCK by SEX and CLASS:

| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 38.900 | 42.503 | 59.691 | 69.052 | 59.171 |
| I | 10.113 | 23.410 | 37.180 | 39.854 | 36.470 |
| M | 16.396 | 38.339 | 52.969 | 61.427 | 55.028 |

Mean RATIO by SEX and CLASS:

| | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| F | 0.155 | 0.155 | 0.145 | 0.138 | 0.123 |
| I | 0.157 | 0.148 | 0.137 | 0.124 | 0.117 |
| M | 0.151 | 0.156 | 0.146 | 0.136 | 0.126 |
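Row and column labels are easy to misplace when reshaping `aggregate()` output, so a sketch of the labelling step in (5)(a) may help. It relies on `aggregate()` returning groups with the first grouping variable (SEX) varying fastest, which matches the column-wise fill order of `matrix()`. The data are simulated, and the `tapply()` cross-check is an addition for illustration.

```r
# Simulated stand-in: 20 observations per SEX x CLASS cell (assumption)
set.seed(1)
n <- 300
demo <- data.frame(
  SEX    = factor(rep(c("F", "I", "M"), length.out = n)),
  CLASS  = factor(rep(c("A1", "A2", "A3", "A4", "A5"), each = n / 5)),
  VOLUME = runif(n, 5, 900)
)

# One row per SEX/CLASS combination, SEX varying fastest
agg <- aggregate(VOLUME ~ SEX + CLASS, data = demo, FUN = mean)

# matrix() fills column by column, so each column is one CLASS
vol_mat <- matrix(agg$VOLUME, nrow = 3,
                  dimnames = list(levels(demo$SEX), levels(demo$CLASS)))
print(round(vol_mat, 3))

# Independent cross-check of the labelling
check <- tapply(demo$VOLUME, list(demo$SEX, demo$CLASS), mean)
```

If the two matrices disagree, the rows or columns have been labelled in the wrong order.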
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.
Answer: Considering aging and sex differences, female abalone appear to have higher mean volume and shuck weight than males and infants. Mean volume and mean shuck weight also increase for all three sexes as age class rises. However, it is curious that mean ratio declines for all three sexes, and that infants begin declining in ratio sooner. This means that shuck weight decreases relative to volume for every sex as age class increases.
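One of the three line graphs in (5)(b) can be sketched with `interaction.plot()`, which computes the cell means itself. The data here are simulated with a built-in downward trend in RATIO, which is an assumption for illustration only.

```r
# Simulated stand-in: 20 observations per SEX x CLASS cell (assumption)
set.seed(1)
demo <- expand.grid(rep = 1:20,
                    SEX = c("F", "I", "M"),
                    CLASS = c("A1", "A2", "A3", "A4", "A5"))
demo$RATIO <- 0.16 - 0.008 * as.integer(demo$CLASS) +
  rnorm(nrow(demo), sd = 0.01)

pdf(NULL)   # discard the figure; remove to view it
interaction.plot(x.factor = demo$CLASS, trace.factor = demo$SEX,
                 response = demo$RATIO, fun = mean, type = "b",
                 xlab = "CLASS", ylab = "Mean RATIO", trace.label = "SEX")
invisible(dev.off())

# The plotted values are the same cell means that tapply() returns
cell_means <- tapply(demo$RATIO, list(demo$SEX, demo$CLASS), mean)
print(round(cell_means, 3))
```

Swapping `demo$RATIO` for VOLUME or SHUCK gives the other two graphs.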
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
Answer: The boxplots suggest that abalone growth is variable in infancy. For example, the boxplots for infant abalone have more outliers, especially among those with 10 or fewer rings. Adult abalone show more stable growth in relation to number of rings, although they still have positive outliers, and the adult trend is more linear. Volume and weight peak at about 11 rings for infants and about 12 rings for adults, after which both decline slightly.
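The subsetting and layout described in (5)(c) might be sketched as follows. The data frame is a simulated stand-in, so only the `subset()` calls, the infant/adult split, and the `ylim` settings reflect the assignment.

```r
# Simulated stand-in data (assumption: not the abalone measurements)
set.seed(1)
n <- 400
demo <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:25, n, replace = TRUE)
)
demo$VOLUME <- pmin(pmax(40 * demo$RINGS + rnorm(n, sd = 100), 5), 1000)
demo$WHOLE  <- pmin(pmax(13 * demo$RINGS + rnorm(n, sd = 35), 2), 380)

young   <- subset(demo, RINGS < 16)     # drop the sparse upper ring counts
infants <- subset(young, SEX == "I")
adults  <- subset(young, SEX != "I")    # "M" and "F" combined

pdf(NULL)   # discard the figure; remove to view it
par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant volume by rings")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult volume by rings")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant whole weight by rings")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult whole weight by rings")
invisible(dev.off())
```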
### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
Answer: Possible reasons include presumptions about the data being collected as well as possible errors in data collection. There are infant abalone in every class, so how does the sex of an abalone relate to its age and number of rings? Also, variability in the abalone population is not accounted for, meaning that the population is not controlled. The study mentioned that abalone take a long time to grow, so the environment of an abalone can change over its lifetime as well.
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
Answer: I’d want to know where the data came from and how they were collected. I’d also like to know whether the sample is large enough and representative of the whole population. I’d check for potential outliers. I would also want to know the context of the data being presented to me: what message are they trying to get across?
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer: A difficulty in analyzing these data is the variability and the lack of control in the sample of the abalone population. Causality can’t be determined: number of rings is correlated with age classification but not with maturity, since infant abalone exist in the older age classes as well. We need to understand the data better in order to analyze them meaningfully.