Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.
## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata.
## SEX LENGTH DIAM HEIGHT WHOLE
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525 Min. : 1.625
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415 1st Qu.: 56.484
## M:381 Median :11.45 Median : 8.925 Median :2.940 Median :101.344
## Mean :11.08 Mean : 8.622 Mean :2.947 Mean :105.832
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570 3rd Qu.:150.319
## Max. :16.80 Max. :13.230 Max. :4.935 Max. :315.750
## SHUCK RINGS CLASS VOLUME
## Min. : 0.5625 Min. : 3.000 A1:108 Min. : 3.612
## 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236 1st Qu.:163.545
## Median : 42.5700 Median : 9.000 A3:329 Median :307.363
## Mean : 45.4396 Mean : 9.993 A4:188 Mean :326.804
## 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175 3rd Qu.:463.264
## Max. :157.0800 Max. :25.000 Max. :995.673
## RATIO
## Min. :0.06734
## 1st Qu.:0.12241
## Median :0.13914
## Mean :0.14205
## 3rd Qu.:0.15911
## Max. :0.31176
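The summary above includes two columns, VOLUME and RATIO, that do not appear in the str() output. The values printed elsewhere in this document are consistent with VOLUME = LENGTH * DIAM * HEIGHT and RATIO = SHUCK / VOLUME, so a minimal sketch of how they could be added looks like this. The two-row data frame is a stand-in built from rows 350 and 3 of the outlier tables, not the real file, and the definitions are inferred from the printed numbers.

```r
# Stand-in for mydata: two rows copied from the tables in this document
# (rows 350 and 3); the real data frame comes from abalones.csv.
mydata <- data.frame(
  LENGTH = c(7.98, 10.08),
  DIAM   = c(6.72, 7.35),
  HEIGHT = c(2.415, 2.205),
  SHUCK  = c(40.375, 44)
)

# Assumed definitions, inferred from the printed values
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO  <- mydata$SHUCK / mydata$VOLUME

print(mydata)
```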
Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8 6
##
## 21 22 23 24 25
## A1 0 0 0 0 0
## A2 0 0 0 0 0
## A3 0 0 0 0 0
## A4 0 0 0 0 0
## A5 4 1 7 2 1
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
Answer: abalones.csv contains 8 variables; with the two we just added (VOLUME and RATIO), mydata now has 10. Two are factors (SEX and CLASS); the rest are numeric, with RINGS stored as an integer. From the summary we can roughly tell that LENGTH, DIAM and HEIGHT are left skewed, while WHOLE, SHUCK, RINGS, VOLUME and RATIO are right skewed. Almost all of the numeric variables appear to have outliers, and RINGS clearly has extreme outliers.
Skewness can also be checked with the skewness() function: a negative value indicates that the data are left skewed, and a positive value indicates that they are right skewed.
## Loading required package: e1071
## [1] "Skewness of HEIGHT: -0.2253, Skewness of WHOLE: 0.4705"
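To make the sign convention concrete, here is a base-R sketch of a moment-based skewness coefficient. It is not the e1071 implementation used above (that function offers several variants with different small-sample corrections), and the data are simulated purely for illustration.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
# A base-R sketch, not the e1071 implementation.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
right_tailed <- rexp(500)        # hypothetical data with a long right tail
left_tailed  <- -right_tailed    # mirror image, long left tail

skew(right_tailed)   # positive
skew(left_tailed)    # negative
```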
Histograms show the skewness as well
Check for outliers and extreme outliers
## [1] "LENGTH has 12 outliers and 0 extreme outliers"
## [1] "DIAM has 13 outliers and 0 extreme outliers"
## [1] "HEIGHT has 6 outliers and 0 extreme outliers"
## [1] "WHOLE has 4 outliers and 0 extreme outliers"
## [1] "SHUCK has 10 outliers and 0 extreme outliers"
## [1] "RINGS has 74 outliers and 15 extreme outliers"
## [1] "VOLUME has 1 outliers and 0 extreme outliers"
## [1] "RATIO has 18 outliers and 2 extreme outliers"
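Counts like the ones above can be produced with the usual boxplot fences. The sketch below assumes that "outliers" means values beyond 1.5 * IQR from the quartiles and "extreme outliers" means the subset beyond 3 * IQR; the helper name `count_outliers` is hypothetical, and quantile() may give slightly different fences than the hinges used by boxplot().

```r
# Hypothetical helper: counts values beyond the 1.5 * IQR fences ("outliers")
# and, among those, values beyond the 3 * IQR fences ("extreme outliers").
count_outliers <- function(x) {
  x   <- x[!is.na(x)]
  q   <- quantile(x, c(0.25, 0.75))
  iqr <- q[[2]] - q[[1]]
  beyond <- function(k) sum(x < q[[1]] - k * iqr | x > q[[2]] + k * iqr)
  c(outliers = beyond(1.5), extreme = beyond(3))
}

x <- c(1:11, 20, 40)   # made-up values: Q1 = 4, Q3 = 10, IQR = 6
count_outliers(x)      # 20 and 40 lie beyond 19; only 40 lies beyond 28
```

Applied to the real data, the same helper could be run over each numeric column, for example with sapply().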
check outliers of RINGS
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
## CLASS
## SEX A1 A2 A3 A4 A5 Sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
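As a sketch of the steps in (1)(b), the block below rebuilds the 3 x 5 table from the counts printed above, adds the margins, and draws the barplot from the table without margins. With the real data the first step would be a table() call on SEX and CLASS; the plot labels are illustrative choices.

```r
# Counts copied from the printed table (rows: SEX, columns: CLASS)
counts <- as.table(matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
))

with_margins <- addmargins(counts)   # adds a "Sum" row and column
print(with_margins)

# Barplot of the 15 cells only (margins excluded), grouped by CLASS
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "CLASS", ylab = "Frequency")
```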
plot
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Answer: According to the background document (Data Analysis Assignments Background New.pdf), CLASS is an age classification based on RINGS (A1 = youngest, ..., A6 = oldest). On that basis I expected A4 and A5 to contain only older abalones, yet a good number of infants still appear there (21 in A4 and 19 in A5). Is that because infant status is not determined by age, or because RINGS is not a very accurate measure of age? In addition, females are outnumbered by males in every class, which also surprised me; I am not sure whether this is climate related. The gap is largest in the younger classes (A1 to A3), so should we be worried about future generations of abalones? For now the younger classes (A1 to A3) outnumber the older ones (A4 and A5), which by itself (if the other information is correct) suggests a healthy population.
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
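The sampling step in (1)(c) can be sketched as follows. The data frame here is a placeholder with the right number of rows, since only the row indices matter for the draw; sample() draws without replacement by default.

```r
# Placeholder for mydata with 1036 rows; only the row count matters here.
mydata <- data.frame(id = 1:1036, x = seq(0, 1, length.out = 1036))

set.seed(123)                                        # fixed seed, as required
idx  <- sample(seq_len(nrow(mydata)), size = 200)    # sample row numbers
work <- mydata[idx, ]                                # select those rows

dim(work)
```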
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
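The construction in (2)(b) can be sketched in base R as follows. The data are simulated stand-ins, and the colors and legend placement are illustrative choices. Because the slope is the maximum of SHUCK / WHOLE, every point lies on or below the line.

```r
# Simulated stand-in data; the real plot uses mydata$WHOLE, mydata$SHUCK
# and mydata$CLASS.
set.seed(1)
WHOLE <- runif(50, 10, 300)
SHUCK <- WHOLE * runif(50, 0.30, 0.55)
CLASS <- factor(sample(c("A1", "A2", "A3", "A4", "A5"), 50, replace = TRUE),
                levels = c("A1", "A2", "A3", "A4", "A5"))

slope <- max(SHUCK / WHOLE)          # largest observed SHUCK-to-WHOLE ratio

plot(WHOLE, SHUCK, col = as.integer(CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK")
abline(a = 0, b = slope, lty = 2)    # zero intercept, slope = max ratio
legend("topleft", legend = levels(CLASS),
       col = seq_along(levels(CLASS)), pch = 16)
```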
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
Answer: The two plots look similar: WHOLE increases as VOLUME increases, and SHUCK increases as WHOLE increases. Looking at the age classes, all three variables tend to increase with age class. In plot (a) the ratio of WHOLE to VOLUME does not differ noticeably between age classes, but in plot (b) the oldest class (A5) appears to have a lower SHUCK-to-WHOLE ratio, which makes sense biologically.
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
Answer: The plots show that RATIO is slightly right skewed for all three sexes. The histograms show the right tail directly, the boxplots have more outliers on the right, and the Q-Q plots indicate a right tail as well. I also ran the Shapiro-Wilk test on each of the three groups; every p-value is far smaller than the conventional alpha = 0.05, so the null hypothesis of normality is rejected and there is evidence that the data are not normally distributed.
##
## Shapiro-Wilk normality test
##
## data: ratio.F
## W = 0.96028, p-value = 9.595e-08
##
## Shapiro-Wilk normality test
##
## data: ratio.I
## W = 0.96962, p-value = 2.154e-06
##
## Shapiro-Wilk normality test
##
## data: ratio.M
## W = 0.98031, p-value = 4.622e-05
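A compact way to run the same test for every sex is to split RATIO by SEX and apply shapiro.test() to each piece. The sketch below uses simulated right-skewed data in place of mydata. With groups of several hundred observations, as here, the test can flag even mild departures from normality, so the plots remain the better guide to how non-normal the data are.

```r
# Simulated stand-in: right-skewed RATIO values for three sexes
set.seed(1)
d <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.3)
)

# One Shapiro-Wilk p-value per sex
pvals <- sapply(split(d$RATIO, d$SEX),
                function(x) shapiro.test(x)$p.value)
print(pvals)
```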
(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).
Female mild and extreme outliers
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 350 | F | 7.980 | 6.720 | 2.415 | 80.9375 | 40.37500 | 7 | A2 | 129.5058 | 0.3117620 |
| 379 | F | 15.330 | 11.970 | 3.465 | 252.0625 | 134.89812 | 10 | A3 | 635.8278 | 0.2121614 |
| 420 | F | 11.550 | 7.980 | 3.465 | 150.6250 | 68.55375 | 10 | A3 | 319.3656 | 0.2146560 |
| 421 | F | 13.125 | 10.290 | 2.310 | 142.0000 | 66.47062 | 9 | A3 | 311.9799 | 0.2130606 |
| 458 | F | 11.445 | 8.085 | 3.150 | 139.8125 | 68.49062 | 9 | A3 | 291.4784 | 0.2349767 |
| 586 | F | 12.180 | 9.450 | 4.935 | 133.8750 | 38.25000 | 14 | A5 | 568.0234 | 0.0673388 |
Infant mild and extreme outliers
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.0000 | 6 | A1 | 163.364040 | 0.2693371 |
| 37 | I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.9375 | 3 | A1 | 13.242072 | 0.2218308 |
| 42 | I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.5625 | 4 | A1 | 6.501222 | 0.2403394 |
| 58 | I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.0000 | 5 | A1 | 48.601728 | 0.2263294 |
| 67 | I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.9375 | 5 | A1 | 17.503290 | 0.2249577 |
| 89 | I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.9375 | 4 | A1 | 4.074840 | 0.2300704 |
| 105 | I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.8125 | 7 | A2 | 51.572194 | 0.2290478 |
| 200 | I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.3750 | 8 | A2 | 145.027260 | 0.2232339 |
Male mild and extreme outliers
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 746 | M | 13.440 | 10.815 | 1.680 | 130.2500 | 63.73125 | 10 | A3 | 244.1940 | 0.2609861 |
| 754 | M | 10.500 | 7.770 | 3.150 | 132.6875 | 61.13250 | 9 | A3 | 256.9928 | 0.2378764 |
| 803 | M | 10.710 | 8.610 | 3.255 | 160.3125 | 70.41375 | 9 | A3 | 300.1536 | 0.2345924 |
| 810 | M | 12.285 | 9.870 | 3.465 | 176.1250 | 99.00000 | 10 | A3 | 420.1415 | 0.2356349 |
| 852 | M | 11.550 | 8.820 | 3.360 | 167.5625 | 78.27187 | 10 | A3 | 342.2866 | 0.2286735 |
Female extreme outliers
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 350 | F | 7.98 | 6.72 | 2.415 | 80.9375 | 40.375 | 7 | A2 | 129.5058 | 0.311762 |
Infant extreme outliers
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | I | 10.08 | 7.35 | 2.205 | 79.375 | 44 | 6 | A1 | 163.364 | 0.2693371 |
No Male extreme outliers
| SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|
Essay Question (2 points): What are your observations regarding the results in (3)(b)?
Answer: The largest RATIO belongs to a female (row 350), and a very small one (VOLUME 129.5). Infants have the most RATIO outliers (8), followed by females (6) and males (5), and all but one of the 19 outliers lie on the high side, i.e. SHUCK weight is unusually high relative to VOLUME. Most outliers (15 of 19) come from smaller abalones with VOLUME below the mean (326.8), and 13 of 19 are below the median (307.4). This is consistent with the earlier observation that smaller abalones tend to have more outliers.
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Answer: VOLUME and WHOLE would not perform very well as predictors of age. In the scatterplots the points are widely spread across the figure; both variables show a positive association with RINGS, but it is not strong enough. I also fit a simple linear model, shown below, using both WHOLE and VOLUME to predict RINGS. As expected, the adjusted R-squared is only 0.3105, which is not strong enough for prediction.
##
## Call:
## lm(formula = RINGS ~ WHOLE + VOLUME, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8782 -1.7440 -0.6368 0.9165 13.6803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.901213 0.169737 40.658 < 2e-16 ***
## WHOLE -0.003082 0.005486 -0.562 0.574
## VOLUME 0.010459 0.001745 5.995 2.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.755 on 1033 degrees of freedom
## Multiple R-squared: 0.3118, Adjusted R-squared: 0.3105
## F-statistic: 234 on 2 and 1033 DF, p-value: < 2.2e-16
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
## $Volume
## A1 A2 A3 A4 A5
## Female 255.29938 276.8573 412.6079 498.0489 486.1525
## Infant 66.51618 160.3200 270.7406 316.4129 318.6930
## Male 103.72320 245.3857 358.1181 442.6155 440.2074
##
## $Shuck
## A1 A2 A3 A4 A5
## Female 38.90000 42.50305 59.69121 69.05161 59.17076
## Infant 10.11332 23.41024 37.17969 39.85369 36.47047
## Male 16.39583 38.33855 52.96933 61.42726 55.02762
##
## $Ratio
## A1 A2 A3 A4 A5
## Female 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## Infant 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## Male 0.1512698 0.1564017 0.1462123 0.1364881 0.1262089
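The aggregate-then-matrix step in (5)(a) can be sketched as below for one variable. The data are simulated, and the sketch relies on aggregate() returning its rows with the first grouping variable (SEX) varying fastest, so filling the matrix column by column puts SEX in rows and CLASS in columns.

```r
# Simulated stand-in: 4 abalones for each SEX x CLASS combination
set.seed(1)
d <- expand.grid(SEX = c("F", "I", "M"),
                 CLASS = c("A1", "A2", "A3", "A4", "A5"),
                 rep = 1:4)
d$VOLUME <- runif(nrow(d), 50, 500)

# Mean VOLUME per SEX and CLASS (15 rows, SEX varying fastest)
agg <- aggregate(VOLUME ~ SEX + CLASS, data = d, FUN = mean)

# Fill column-wise: 3 rows (SEX) by 5 columns (CLASS)
vol <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
              dimnames = list(c("Female", "Infant", "Male"),
                              levels(d$CLASS)))
print(vol)
```

The same pattern would be repeated for SHUCK and RATIO, and each matrix can be passed to kable() for display.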
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.
Answer: Mean RATIO clearly decreases as age class increases, for all three sexes. Mean VOLUME generally increases with age class, which makes perfect sense, and this holds for all three sexes. Mean SHUCK weight also generally increases with age class for all three sexes, although it dips at A5. For all three measures, females show little difference between classes A1 and A2; do females develop more slowly than the other groups? For mean VOLUME and SHUCK, females are always greater than males, and males greater than infants. For mean RATIO, adult males and females are about the same, and greater than infants in every class except A1.
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F,” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
Answer: Overall, VOLUME and WHOLE increase with age, for both adults and infants. Adults are larger in volume and weight than infants (which is entirely expected) and also have a larger standard deviation; infant WHOLE and VOLUME stay closer to the mean. Both infants and adults tend to reach peak VOLUME and WHOLE at around 11 to 12 rings and then shrink a little. Adults grow at a faster pace than infants. Based solely on VOLUME and/or WHOLE it is hard to tell infants from adults, because the distributions overlap heavily.
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
Answer: I think the biggest reason is that too few variables were collected (more may have been too hard to measure, but measurement difficulty is not the issue here). In addition, there are many outliers, and the right skewness means the data are not normal, so methods that assume a normal distribution cannot be relied on. Physical measurements can help, but only to a limited extent, and they are not informative enough for this purpose: weather, predators, pollution, food and other environmental factors could play a bigger role. Some of these variables are mentioned in the background document, but no related data were collected. Lastly, infants can have more than 10 rings, which leaves me unsure how infants are classified; more explanation is required, or more variables are needed to better distinguish infants from adults at different ages.
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
Answer: Questions I would ask include: What is the sample size, and is it large enough? Are outliers present? What are the mean and standard deviation of the sample? What are the population mean and standard deviation, and can we estimate them? Are there prior studies? How skewed are the data? How well do the measured variables relate logically to the quantity being predicted, and can a cause-and-effect relationship be drawn between them? How and where was the sample obtained, and was it sufficiently random?
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer: There are many difficulties in analyzing data derived from observational studies. There could be human or measurement errors if the values are recorded manually. The sample may not be well randomized, which leads to bias. Causality is very hard, if not impossible, to determine: hidden factors may lie behind two highly correlated variables, and purely observational data can be explained in almost any way people want, so such explanations carry little credibility. Most importantly, without a controlled experiment, whatever is observed remains correlation, not causation, as we have discussed at length over the past few weeks. The biggest thing we can learn from observational studies, I think, is promising ideas at lower cost that can serve as a starting point for further research.