Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.

## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

Test items start here. There are 6 sections.

Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176
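The summary above includes two derived columns, VOLUME and RATIO, that are not in the raw file. As a minimal sketch, the code below assumes VOLUME = LENGTH * DIAM * HEIGHT and RATIO = SHUCK / VOLUME. These definitions are not stated in this section, but they reproduce the values tabulated for rows 350 and 379 in (3)(b). The data frame name `demo` is hypothetical.

```r
# Two rows copied from the female outlier table in (3)(b).
# 'demo' is a hypothetical stand-in for mydata.
demo <- data.frame(
  LENGTH = c(7.980, 15.330),
  DIAM   = c(6.720, 11.970),
  HEIGHT = c(2.415, 3.465),
  SHUCK  = c(40.37500, 134.89812)
)

# Assumed definitions of the two derived variables.
demo$VOLUME <- demo$LENGTH * demo$DIAM * demo$HEIGHT
demo$RATIO  <- demo$SHUCK / demo$VOLUME

print(demo)
```

With the full data, the same two assignments would be applied to mydata before calling summary().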

Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

Answer: abalones.csv contains 8 variables; with the two we just added (VOLUME and RATIO), mydata now has 10. Two of them are factors (SEX and CLASS); the rest are numeric, with RINGS stored as an integer. From the summary we can roughly tell that LENGTH, DIAM and HEIGHT are left-skewed, and that WHOLE, SHUCK, RINGS, VOLUME and RATIO are right-skewed. Almost all of the numeric variables appear to have outliers, and RINGS clearly has extreme outliers.

Skewness can also be checked with the skewness() function from the e1071 package: a negative value indicates that the data are left-skewed, and a positive value indicates that they are right-skewed.

## Loading required package: e1071
## [1] "Skewness of HEIGHT: -0.2253, Skewness of WHOLE: 0.4705"
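To make the sign convention concrete, here is a small base-R sketch of a moment-based skewness statistic (third central moment divided by the cubed sample standard deviation). I believe this matches the default of e1071's skewness(), but treat that as an assumption; the function name `skew_b1` and the toy vectors are hypothetical.

```r
# Moment-based skewness: mean cubed deviation over sd^3.
# Assumed to correspond to the default type used by e1071::skewness().
skew_b1 <- function(x) {
  x  <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

right_tail <- c(1, 2, 2, 3, 3, 3, 10)  # one large value pulls the tail right
left_tail  <- -right_tail              # mirror image
symmetric  <- 1:5

print(c(right = skew_b1(right_tail),
        left  = skew_b1(left_tail),
        sym   = skew_b1(symmetric)))
```

A positive result corresponds to a long right tail and a negative result to a long left tail, which is how the HEIGHT and WHOLE values above are read.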

Histograms show the skewness as well.

Check for outliers and extreme outliers

## [1] "LENGTH has 12 outliers and  0 extreme outliers"
## [1] "DIAM   has 13 outliers and  0 extreme outliers"
## [1] "HEIGHT has  6 outliers and  0 extreme outliers"
## [1] "WHOLE  has  4 outliers and  0 extreme outliers"
## [1] "SHUCK  has 10 outliers and  0 extreme outliers"
## [1] "RINGS  has 74 outliers and 15 extreme outliers"
## [1] "VOLUME has  1 outliers and  0 extreme outliers"
## [1] "RATIO  has 18 outliers and  2 extreme outliers"
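The code that produced these counts is not shown, so the following is only a sketch of one common rule: a value is an outlier if it lies more than 1.5 IQR beyond the quartiles, and an extreme outlier if it lies more than 3 IQR beyond them. The function name `count_outliers` and the toy vector are hypothetical, and counts can differ slightly depending on how the quartiles are computed.

```r
# Count values beyond the 1.5 * IQR fences (outliers, which include the
# extreme ones) and beyond the 3 * IQR fences (extreme outliers).
count_outliers <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  beyond <- function(k) x < q[1] - k * iqr | x > q[2] + k * iqr
  c(outliers = sum(beyond(1.5), na.rm = TRUE),
    extreme  = sum(beyond(3.0), na.rm = TRUE))
}

toy <- c(1:20, 40, 100)   # 40 is a mild outlier, 100 an extreme one
res <- count_outliers(toy)
print(res)
```

Applied per column, for example with sapply() over the numeric columns of mydata, this gives one pair of counts per variable.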

Check outliers of RINGS

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.

##      CLASS
## SEX     A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036

plot
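As a sketch of the steps in (1)(b), the block below rebuilds the SEX by CLASS counts from the table above, adds the margins with addmargins(), and draws a grouped barplot from the table without margins. With the real data the table would come from table(mydata$SEX, mydata$CLASS); the object names here are hypothetical.

```r
# Counts copied from the SEX x CLASS table above.
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX   = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
tab <- as.table(counts)

# Table with row and column totals.
with_margins <- addmargins(tab)
print(with_margins)

# Barplot of the 15 cells only (marginal totals are not passed in).
barplot(tab, beside = TRUE, legend.text = TRUE,
        args.legend = list(x = "topright", title = "SEX"),
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```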

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: Item 8 of the Data Analysis Assignments Background New.pdf document defines CLASS as an age classification based on RINGS (A1 = youngest, ..., A6 = oldest). Based on that, I expected A4 and A5 to contain only senior abalones, but we still see a fair number of infants there (21 in A4 and 19 in A5). Is it because infant status is not determined by age? Or is RINGS itself not a very accurate representation of age? Besides that, females are outnumbered by males in every class, which is another surprise to me; I am not sure whether it is climate related. The gap is widest in the younger classes, A1 to A3, so should we be worried about the future generation of abalones? As of now the younger population (A1-A3, 673 abalones) outnumbers the older one (A4-A5, 363 abalones), which in itself (if the other information is correct) indicates a healthy population.

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
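The sampling step in (1)(c) can be sketched as follows. Because abalones.csv is not available here, the block simulates a stand-in data frame with 1036 rows; `mydata_demo` and its values are hypothetical, so the resulting plot will not match the real scatterplot matrix.

```r
# Hypothetical stand-in for mydata (simulated, not the abalone data).
set.seed(1)
n_rows <- 1036
mydata_demo <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n_rows, replace = TRUE)),
  LENGTH = runif(n_rows, 3, 17),
  DIAM   = runif(n_rows, 2, 13),
  HEIGHT = runif(n_rows, 0.5, 5),
  WHOLE  = runif(n_rows, 2, 315),
  SHUCK  = runif(n_rows, 1, 157)
)

# Draw 200 row indices without replacement, then select those rows.
set.seed(123)
idx  <- sample(seq_len(nrow(mydata_demo)), size = 200)
work <- mydata_demo[idx, ]

# Scatterplot matrix of columns 2 to 6 (the continuous variables).
plot(work[, 2:6])
```

sample() draws without replacement by default, so each row can appear in work at most once.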


Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
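A minimal base-R sketch of the (2)(b) plot follows: points colored by CLASS and a line through the origin whose slope is the maximum SHUCK to WHOLE ratio. The data are simulated and the name `demo` is hypothetical; only the plotting pattern is the point.

```r
# Simulated stand-in data (hypothetical values).
set.seed(42)
n <- 150
class_levels <- paste0("A", 1:5)
demo <- data.frame(
  WHOLE = runif(n, 5, 300),
  CLASS = factor(sample(class_levels, n, replace = TRUE),
                 levels = class_levels)
)
demo$SHUCK <- demo$WHOLE * runif(n, 0.30, 0.55)

# Slope of the bounding line: the largest observed SHUCK / WHOLE ratio.
max_slope <- max(demo$SHUCK / demo$WHOLE)

plot(demo$WHOLE, demo$SHUCK,
     col = as.integer(demo$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK",
     main = "SHUCK versus WHOLE by CLASS")
abline(a = 0, b = max_slope, lty = 2)   # a = intercept, b = slope
legend("topleft", legend = levels(demo$CLASS),
       col = seq_along(levels(demo$CLASS)), pch = 16, title = "CLASS")
```

By construction no point can lie above this line, so it acts as an upper envelope for reading how close each class sits to the maximum ratio.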

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: Both plots look similar: whole weight increases with volume, and shuck weight increases with whole weight. Looking at the age classes, all three variables tend to increase with age class. In plot (a) we do not see significant differences in the ratio of whole weight to volume across age classes, but in plot (b) there appears to be a trend for the oldest class (A5) to have a lower ratio of shuck weight to whole weight, which makes sense biologically.


Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
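One way to lay out the nine panels with base R is sketched below: split RATIO by sex, then fill a 3 by 3 grid row by row with histograms, boxplots and Q-Q plots. The data are simulated and `demo` is a hypothetical stand-in for mydata.

```r
# Simulated stand-in data (hypothetical values).
set.seed(7)
demo <- data.frame(
  SEX   = rep(c("F", "I", "M"), each = 100),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)
groups <- split(demo$RATIO, demo$SEX)

# par(mfrow = ...) fills the grid row by row.
op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for (s in names(groups)) {
  hist(groups[[s]], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in names(groups)) {
  boxplot(groups[[s]], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in names(groups)) {
  qqnorm(groups[[s]], main = paste("Q-Q plot:", s))
  qqline(groups[[s]])
}
par(op)  # restore the previous graphics settings
```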

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: From the plots, we can see that RATIO is slightly right-skewed for all three sexes. The histograms show the right tail directly. The boxplots also have right tails, meaning there are more outliers on the right, and the Q-Q plots indicate a right tail as well. I also ran the Shapiro-Wilk test for all three groups; each p-value is far smaller than the conventional alpha = .05, so the null hypothesis of normality is rejected and there is evidence that the data tested are not normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.F
## W = 0.96028, p-value = 9.595e-08
## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.I
## W = 0.96962, p-value = 2.154e-06
## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.M
## W = 0.98031, p-value = 4.622e-05
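The three tests above can be run in one pass by splitting RATIO by sex and applying shapiro.test() to each piece. This is a sketch on simulated data; `demo` is hypothetical and its p-values will not match the output above.

```r
# Simulated stand-in data (hypothetical values).
set.seed(21)
demo <- data.frame(
  SEX   = rep(c("F", "I", "M"), each = 120),
  RATIO = rlnorm(360, meanlog = log(0.14), sdlog = 0.25)
)

# One Shapiro-Wilk test per sex; keep only the p-values.
results  <- lapply(split(demo$RATIO, demo$SEX), shapiro.test)
p_values <- sapply(results, function(r) r$p.value)
print(signif(p_values, 3))
```

A p-value below the chosen alpha leads to rejecting the null hypothesis that the group was drawn from a normal distribution.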

(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).

Female mild and extreme outliers

    SEX LENGTH   DIAM HEIGHT    WHOLE     SHUCK RINGS CLASS   VOLUME     RATIO
350 F    7.980  6.720  2.415  80.9375  40.37500     7 A2    129.5058 0.3117620
379 F   15.330 11.970  3.465 252.0625 134.89812    10 A3    635.8278 0.2121614
420 F   11.550  7.980  3.465 150.6250  68.55375    10 A3    319.3656 0.2146560
421 F   13.125 10.290  2.310 142.0000  66.47062     9 A3    311.9799 0.2130606
458 F   11.445  8.085  3.150 139.8125  68.49062     9 A3    291.4784 0.2349767
586 F   12.180  9.450  4.935 133.8750  38.25000    14 A5    568.0234 0.0673388

Infant mild and extreme outliers

    SEX LENGTH  DIAM HEIGHT    WHOLE   SHUCK RINGS CLASS     VOLUME     RATIO
3   I   10.080 7.350  2.205 79.37500 44.0000     6 A1    163.364040 0.2693371
37  I    4.305 3.255  0.945  6.18750  2.9375     3 A1     13.242072 0.2218308
42  I    2.835 2.730  0.840  3.62500  1.5625     4 A1      6.501222 0.2403394
58  I    6.720 4.305  1.680 22.62500 11.0000     5 A1     48.601728 0.2263294
67  I    5.040 3.675  0.945  9.65625  3.9375     5 A1     17.503290 0.2249577
89  I    3.360 2.310  0.525  2.43750  0.9375     4 A1      4.074840 0.2300704
105 I    6.930 4.725  1.575 23.37500 11.8125     7 A2     51.572194 0.2290478
200 I    9.135 6.300  2.520 74.56250 32.3750     8 A2    145.027260 0.2232339

Male mild and extreme outliers

    SEX LENGTH   DIAM HEIGHT    WHOLE    SHUCK RINGS CLASS   VOLUME     RATIO
746 M   13.440 10.815  1.680 130.2500 63.73125    10 A3    244.1940 0.2609861
754 M   10.500  7.770  3.150 132.6875 61.13250     9 A3    256.9928 0.2378764
803 M   10.710  8.610  3.255 160.3125 70.41375     9 A3    300.1536 0.2345924
810 M   12.285  9.870  3.465 176.1250 99.00000    10 A3    420.1415 0.2356349
852 M   11.550  8.820  3.360 167.5625 78.27187    10 A3    342.2866 0.2286735

Female extreme outliers

    SEX LENGTH DIAM HEIGHT   WHOLE  SHUCK RINGS CLASS   VOLUME    RATIO
350 F     7.98 6.72  2.415 80.9375 40.375     7 A2    129.5058 0.311762

Infant extreme outliers

  SEX LENGTH DIAM HEIGHT  WHOLE SHUCK RINGS CLASS  VOLUME     RATIO
3 I    10.08 7.35  2.205 79.375    44     6 A1    163.364 0.2693371

No Male extreme outliers

SEX LENGTH DIAM HEIGHT WHOLE SHUCK RINGS CLASS VOLUME RATIO
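Tables like the ones above can be produced with boxplot.stats(), whose `coef` argument sets the whisker length in IQR units: 1.5 flags mild and extreme outliers together, and 3 flags only the extreme ones. The sketch below uses simulated data with two planted extreme values; `demo` and `ratio_outliers` are hypothetical names, and print() stands in for kable() to avoid a package dependency.

```r
# Simulated stand-in data with two planted extreme RATIO values.
set.seed(11)
demo <- data.frame(
  SEX   = rep(c("F", "I", "M"), each = 60),
  RATIO = rnorm(180, mean = 0.14, sd = 0.02)
)
demo$RATIO[c(1, 61)] <- c(0.31, 0.27)

# Rows of one sex whose RATIO falls outside the boxplot whiskers.
ratio_outliers <- function(df, sex, coef = 1.5) {
  x   <- df$RATIO[df$SEX == sex]
  out <- boxplot.stats(x, coef = coef)$out
  df[df$SEX == sex & df$RATIO %in% out, ]
}

female_all     <- ratio_outliers(demo, "F", coef = 1.5)  # mild and extreme
female_extreme <- ratio_outliers(demo, "F", coef = 3)    # extreme only
print(female_all)
print(female_extreme)
```

An empty result, as for the males above, simply means no value lies beyond the 3 IQR whiskers.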

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: The largest ratio comes from a female, and a very small one (VOLUME of about 130). Overall, females and infants have more RATIO outliers than males (6 and 8 versus 5); in most cases their shuck weight is high relative to volume. Most outliers (15 of 19) come from smaller abalones with VOLUME below the mean (about 327), and 13 of 19 are below the median (about 307). Infants have the most outliers, which is consistent with the observation that smaller abalones tend to have more outliers.


Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: VOLUME and/or WHOLE will not perform very well as predictors of age. As the scatterplots show, the points are widely spread across the figure. There is a visible positive trend between VOLUME and RINGS and between WHOLE and RINGS, but that is just not enough. I also fit a simple linear model, shown below, using both WHOLE and VOLUME to predict RINGS; as expected, the adjusted R-squared is only .3105, which is not powerful enough.


## 
## Call:
## lm(formula = RINGS ~ WHOLE + VOLUME, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8782 -1.7440 -0.6368  0.9165 13.6803 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.901213   0.169737  40.658  < 2e-16 ***
## WHOLE       -0.003082   0.005486  -0.562    0.574    
## VOLUME       0.010459   0.001745   5.995 2.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.755 on 1033 degrees of freedom
## Multiple R-squared:  0.3118, Adjusted R-squared:  0.3105 
## F-statistic:   234 on 2 and 1033 DF,  p-value: < 2.2e-16


Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

## $Volume
##               A1       A2       A3       A4       A5
## Female 255.29938 276.8573 412.6079 498.0489 486.1525
## Infant  66.51618 160.3200 270.7406 316.4129 318.6930
## Male   103.72320 245.3857 358.1181 442.6155 440.2074
## 
## $Shuck
##              A1       A2       A3       A4       A5
## Female 38.90000 42.50305 59.69121 69.05161 59.17076
## Infant 10.11332 23.41024 37.17969 39.85369 36.47047
## Male   16.39583 38.33855 52.96933 61.42726 55.02762
## 
## $Ratio
##               A1        A2        A3        A4        A5
## Female 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## Infant 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## Male   0.1512698 0.1564017 0.1462123 0.1364881 0.1262089
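The aggregate() and matrix() steps can be sketched as below for one variable. The data are simulated and the names `demo`, `agg` and `vol_means` are hypothetical. The explicit ordering step makes sure SEX varies fastest, so the means fill the matrix column by column with one column per CLASS.

```r
# Simulated stand-in data: 4 observations per SEX x CLASS cell.
set.seed(3)
demo <- expand.grid(SEX = c("F", "I", "M"),
                    CLASS = paste0("A", 1:5),
                    replicate_id = 1:4)
demo$VOLUME <- runif(nrow(demo), 50, 500)

# Mean VOLUME for each SEX x CLASS combination.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = demo, FUN = mean)
agg <- agg[order(agg$CLASS, agg$SEX), ]   # SEX fastest within each CLASS

# 3 x 5 matrix labeled by SEX (rows) and CLASS (columns).
vol_means <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
                    dimnames = list(c("Female", "Infant", "Male"),
                                    levels(demo$CLASS)))
print(vol_means)
```

The same pattern, repeated for SHUCK and RATIO, yields the other two matrices.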

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
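As an illustration of interaction.plot(), the sketch below redraws the mean RATIO graph from the $Ratio matrix printed in (5)(a) rather than from the raw data. The object names are hypothetical; with mydata available, the CLASS, SEX and RATIO columns would be passed directly and the function would compute the means itself.

```r
# Mean RATIO values copied from the $Ratio matrix in (5)(a).
ratio_means <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Female", "Infant", "Male"), paste0("A", 1:5))
)

# Long format: one row per SEX x CLASS cell.
long <- as.data.frame(as.table(ratio_means))
names(long) <- c("SEX", "CLASS", "RATIO")

# One line per sex, CLASS on the horizontal axis.
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean, type = "b",
                 col = 1:3, pch = 16, trace.label = "SEX",
                 xlab = "CLASS", ylab = "Mean RATIO",
                 main = "Mean RATIO versus CLASS")
```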

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: Mean RATIO generally decreases as age class increases, for all sexes (females and males show a slight rise from A1 to A2 first). Mean VOLUME increases with age class through A4 for all sexes, which makes perfect sense, and then levels off in A5. Mean SHUCK weight also increases with age class through A4 for all sexes, then dips in A5. On all three measures, females show little difference between classes A1 and A2; maybe females develop more slowly than the other groups. For mean VOLUME and SHUCK weight, females are always greater than males, who are greater than infants. For mean RATIO, male and female adults are nearly the same and are greater than infants in classes A2 to A5; in A1 infants have the highest mean ratio.

5(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
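A base-R sketch of the four boxplots follows, using subset() to restrict to RINGS < 16 and to separate infants from adults. The data are simulated and all object names are hypothetical, so the shapes will not match the real displays.

```r
# Simulated stand-in data (hypothetical values).
set.seed(5)
n <- 400
demo <- data.frame(
  SEX   = sample(c("F", "I", "M"), n, replace = TRUE),
  RINGS = sample(3:25, n, replace = TRUE)
)
demo$VOLUME <- pmax(0, 40 * demo$RINGS + rnorm(n, 0, 60))
demo$WHOLE  <- pmax(0, 15 * demo$RINGS + rnorm(n, 0, 25))

# Keep fewer than 16 rings, then split infants from adults (M and F combined).
young  <- subset(demo, RINGS < 16)
infant <- subset(young, SEX == "I")
adult  <- subset(young, SEX %in% c("F", "M"))

op <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infant, ylim = c(0, 1100),
        main = "Infant volume", xlab = "RINGS", ylab = "VOLUME")
boxplot(VOLUME ~ RINGS, data = adult, ylim = c(0, 1100),
        main = "Adult volume", xlab = "RINGS", ylab = "VOLUME")
boxplot(WHOLE ~ RINGS, data = infant, ylim = c(0, 400),
        main = "Infant whole weight", xlab = "RINGS", ylab = "WHOLE")
boxplot(WHOLE ~ RINGS, data = adult, ylim = c(0, 400),
        main = "Adult whole weight", xlab = "RINGS", ylab = "WHOLE")
par(op)
```

Using the same ylim within each row keeps the infant and adult panels directly comparable.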

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: Overall, volume and whole weight increase with age; this is true for both adult and infant abalones. Adults are bigger in volume and weight than infants (which is fully expected), and adults also have a larger standard deviation than infants: infant whole weight and volume sit closer to the mean. Both infants and adults tend to reach peak volume and whole weight at around 11 to 12 rings and then shrink a little. Adults grow at a faster pace than infants. Based solely on volume and/or whole weight, it is hard to differentiate infants from adults, as there is a lot of overlap.


Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: I think the biggest reason is that too few variables were collected (more may have been too hard to measure, but the difficulty of measurement is not the point here). In addition, there are many outliers and the data are right-skewed, so they fail to achieve normality and methods that assume a normal distribution cannot be relied on. Physical measurements can be helpful, but only to a limited extent; they are not useful enough for predicting age. Weather, predators, pollution, food and other environmental factors could play a bigger role; some of these are mentioned in the background document, but no related data were collected. Last, infants can have more than 10 rings, which makes me unsure how infants are classified; more explanation is required here, or more variables are needed to better differentiate infants from adults at different ages.

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: Questions I would ask include: What is the sample size, and is it big enough? Are outliers present? What are the mean and standard deviation of the sample? What are the population mean and standard deviation, and can we estimate them? Are there prior studies of them? How skewed are the data? How sound are the measured variables logically, and can we draw a cause-and-effect relationship between them and the variable being predicted? How and where was the sample obtained, and was it randomized enough?

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: There are many difficulties in analyzing data derived from observational studies. There could be human or measurement errors if the values are recorded manually. It is also possible that the sample is not well randomized, which leads to bias. Causality is very hard, if not impossible, to determine: there could be hidden factors behind two highly correlated variables, and purely observational data can be explained in almost any way people want, so such explanations do not carry much credibility. Most importantly, without a controlled experiment, whatever is observed can only establish correlation, not causation, as we have discussed a lot in the past few weeks. The biggest thing we can learn from observational studies, I think, is that they surface promising ideas at lower cost, which can be used as a starting point for further research.