Data Analysis Assignment #1 (50 points total)

Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.

Test items start here. There are 6 sections.

Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176

Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1
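A minimal sketch of how the two displays above could be produced. The six rows are copied from the outlier tables in Section 3 and stand in for abalones.csv. The definitions VOLUME = LENGTH * DIAM * HEIGHT and RATIO = SHUCK / VOLUME are an assumption inferred from the tabulated values, not taken from the assignment text.

```r
# Stand-in for abalones.csv: six rows copied from the Section 3 tables.
mydata <- data.frame(
  SEX    = factor(c("F", "I", "M", "I", "F", "M")),
  LENGTH = c(7.98, 4.305, 13.44, 2.835, 12.18, 10.5),
  DIAM   = c(6.72, 3.255, 10.815, 2.73, 9.45, 7.77),
  HEIGHT = c(2.415, 0.945, 1.68, 0.84, 4.935, 3.15),
  WHOLE  = c(80.9375, 6.1875, 130.25, 3.625, 133.875, 132.6875),
  SHUCK  = c(40.375, 2.9375, 63.73125, 1.5625, 38.25, 61.1325),
  RINGS  = c(7L, 3L, 10L, 4L, 14L, 9L),
  CLASS  = factor(c("A2", "A1", "A3", "A1", "A5", "A3"),
                  levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Derived variables (assumed definitions, see the note above).
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO  <- mydata$SHUCK / mydata$VOLUME

# Descriptive statistics for every column.
summary(mydata)

# Frequency table: one row per CLASS level, one column per observed RINGS value.
class_by_rings <- table(mydata$CLASS, mydata$RINGS)
class_by_rings
```

With the full data set the same table() call yields 5 classes by 23 distinct ring counts, which is where the 115 cells come from.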

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

Answer: abalones.csv contains 8 variables; with the two we just added (VOLUME and RATIO), mydata now has 10. Two of them are factors (SEX and CLASS); the rest are numeric, with RINGS stored as an integer. From the summary we can roughly tell that LENGTH, DIAM and HEIGHT are left skewed, while WHOLE, SHUCK, RINGS, VOLUME and RATIO are right skewed. Almost all of the numeric variables appear to have outliers, and RINGS clearly has extreme outliers.

Skewness can also be checked with the skewness() function (here from the e1071 package): a negative value indicates left skew and a positive value indicates right skew.

## Loading required package: e1071

## [1] "Skewness of HEIGHT: -0.2253, Skewness of WHOLE: 0.4705"
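To illustrate the sign convention, here is a small base-R sketch of a moment-based skewness coefficient. The helper name skew() and the sample vectors are made up for illustration; e1071::skewness() offers several estimator variants, so its values need not match this helper exactly.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 10)  # made-up values, long right tail
left_tailed  <- -right_tailed               # mirror image, long left tail

skew(right_tailed)  # positive: right skewed
skew(left_tailed)   # negative: left skewed
```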

Histograms show the skewness as well.

Check for outliers and extreme outliers

## [1] "LENGTH has 12 outliers and  0 extreme outliers"
## [1] "DIAM   has 13 outliers and  0 extreme outliers"
## [1] "HEIGHT has  6 outliers and  0 extreme outliers"
## [1] "WHOLE  has  4 outliers and  0 extreme outliers"
## [1] "SHUCK  has 10 outliers and  0 extreme outliers"
## [1] "RINGS  has 74 outliers and 15 extreme outliers"
## [1] "VOLUME has  1 outliers and  0 extreme outliers"
## [1] "RATIO  has 18 outliers and  2 extreme outliers"
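Counts like these can be obtained with the usual boxplot fences. The sketch below assumes the conventional multipliers (1.5 x IQR for outliers, 3 x IQR for extreme outliers) and counts extreme values among the outliers as well; whether the figures above were computed exactly this way is not stated. The helper name count_outliers() and the sample vector are hypothetical.

```r
# Count values beyond the 1.5 * IQR and 3 * IQR fences.
count_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  beyond <- function(k) sum(x < q[1] - k * iqr | x > q[2] + k * iqr)
  c(outliers = beyond(1.5), extreme = beyond(3))
}

rings_demo <- c(8, 9, 9, 10, 10, 10, 11, 11, 12, 15, 25)  # made-up ring counts
res <- count_outliers(rings_demo)
res
```

Note that quantile() and boxplot() compute quartiles slightly differently (boxplot() uses hinges), so the two can disagree for values sitting close to a fence.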

Check the outliers of RINGS:

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.

##      CLASS
## SEX     A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036
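A sketch of the table-then-margins workflow. To stay self-contained it rebuilds the table from the counts printed above; with the real data frame the first step would be table(mydata$SEX, mydata$CLASS).

```r
# Counts copied from the SEX by CLASS table above.
counts <- matrix(c( 5,  41, 121, 82, 77,
                   91, 133,  65, 21, 19,
                   12,  62, 143, 85, 79),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(SEX   = c("F", "I", "M"),
                                 CLASS = c("A1", "A2", "A3", "A4", "A5")))
sex_class <- as.table(counts)

# Add row and column totals for presentation.
with_margins <- addmargins(sex_class)
with_margins

# Plot the table WITHOUT margins so the totals do not appear as bars.
barplot(sex_class, beside = TRUE, legend.text = TRUE,
        xlab = "CLASS", ylab = "Count")
```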

plot

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: According to the Data Analysis Assignments Background New.pdf document (item 8), CLASS is an age classification based on RINGS (A1 = youngest, ..., A6 = oldest). On that basis I expected A4 and A5 to contain only senior abalones, but we still see a fair number of infants there (21 in A4 and 19 in A5). Is that because infant status is not determined by age, or because RINGS is not a very accurate measure of age? In addition, females are outnumbered by males in every class, which also surprised me; I am not sure whether this is climate related. The gap is most noticeable in the younger classes A1 to A3, so should we worry about future generations of abalones? For now the younger population (A1 to A3) outnumbers the older one (A4 and A5), which, if the other information is correct, suggests a healthy population.

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
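The row-index sampling described above can be sketched as follows. The data frame here is a hypothetical stand-in with 1036 rows of made-up values; only the sampling pattern is the point.

```r
# Hypothetical stand-in for mydata: 1036 rows of made-up values.
set.seed(1)
mydata <- data.frame(id = 1:1036, x = rnorm(1036))

# Sample row numbers (not rows), then use them to index the data frame.
set.seed(123)
idx  <- sample(seq_len(nrow(mydata)), 200)
work <- mydata[idx, ]

nrow(work)  # 200 rows, drawn without replacement

# With the real data, the scatterplot matrix would then be:
# plot(work[, 2:6])
```

sample() draws without replacement by default, so no row appears twice in work.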

Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
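A base-R sketch of the maximum-ratio line. The five observations are copied from the female outlier table in Section 3 purely as sample values; colors are taken from the integer codes of CLASS, which is one simple choice among many.

```r
# Sample values copied from the Section 3 tables.
whole <- c(80.9375, 252.0625, 150.625, 142, 133.875)
shuck <- c(40.375, 134.89812, 68.55375, 66.47062, 38.25)
cls   <- factor(c("A2", "A3", "A3", "A3", "A5"),
                levels = c("A1", "A2", "A3", "A4", "A5"))

# Largest observed SHUCK-to-WHOLE ratio: the slope of the bounding line.
max_slope <- max(shuck / whole)

plot(whole, shuck, col = as.integer(cls), pch = 19,
     xlab = "WHOLE", ylab = "SHUCK")
# a = intercept, b = slope: a line through the origin.
abline(a = 0, b = max_slope, lty = 2)
```

By construction every point lies on or below this line, which makes it a convenient visual upper bound on shuck weight for a given whole weight.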

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: The two plots look similar: whole weight increases as volume increases, and shuck weight increases as whole weight increases. Looking at the age classes, all three variables tend to increase with age class. In plot (a) we do not see a significant difference in the ratio of whole weight to volume across age groups, but in plot (b) there appears to be a trend for the oldest class (A5) to have a lower ratio of shuck weight to whole weight, which makes sense biologically.

Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: The plots show that RATIO is slightly right skewed for all three sexes. The histograms show the right tail directly, the boxplots have more outliers on the upper side, and the Q-Q plots indicate a right tail as well. I also ran the Shapiro-Wilk test on each of the three groups. Every p-value is far below the usual alpha = .05, so the null hypothesis of normality is rejected and there is evidence that the data are not normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.F
## W = 0.96028, p-value = 9.595e-08

## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.I
## W = 0.96962, p-value = 2.154e-06

## 
##  Shapiro-Wilk normality test
## 
## data:  ratio.M
## W = 0.98031, p-value = 4.622e-05
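The three tests above can be run in one pass by splitting RATIO by SEX. The data below are simulated and purely illustrative (the infant group is drawn from a log-normal distribution on purpose), so the resulting p-values say nothing about the abalone data.

```r
# Simulated stand-in: 100 RATIO values per sex.
set.seed(42)
demo <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = c(rnorm(100, mean = 0.14, sd = 0.02),
            rlnorm(100, meanlog = log(0.14), sdlog = 0.4),
            rnorm(100, mean = 0.14, sd = 0.02))
)

# One Shapiro-Wilk test per sex; keep only the p-values.
p_values <- sapply(split(demo$RATIO, demo$SEX),
                   function(x) shapiro.test(x)$p.value)
p_values
```

A small p-value is evidence against normality; with samples of several hundred observations the test can flag even mild departures, so it is best read alongside the Q-Q plots.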

(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).

Female mild and extreme outliers

	SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
350	F	7.980	6.720	2.415	80.9375	40.37500	7	A2	129.5058	0.3117620
379	F	15.330	11.970	3.465	252.0625	134.89812	10	A3	635.8278	0.2121614
420	F	11.550	7.980	3.465	150.6250	68.55375	10	A3	319.3656	0.2146560
421	F	13.125	10.290	2.310	142.0000	66.47062	9	A3	311.9799	0.2130606
458	F	11.445	8.085	3.150	139.8125	68.49062	9	A3	291.4784	0.2349767
586	F	12.180	9.450	4.935	133.8750	38.25000	14	A5	568.0234	0.0673388

Infant mild and extreme outliers

	SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
3	I	10.080	7.350	2.205	79.37500	44.0000	6	A1	163.364040	0.2693371
37	I	4.305	3.255	0.945	6.18750	2.9375	3	A1	13.242072	0.2218308
42	I	2.835	2.730	0.840	3.62500	1.5625	4	A1	6.501222	0.2403394
58	I	6.720	4.305	1.680	22.62500	11.0000	5	A1	48.601728	0.2263294
67	I	5.040	3.675	0.945	9.65625	3.9375	5	A1	17.503290	0.2249577
89	I	3.360	2.310	0.525	2.43750	0.9375	4	A1	4.074840	0.2300704
105	I	6.930	4.725	1.575	23.37500	11.8125	7	A2	51.572194	0.2290478
200	I	9.135	6.300	2.520	74.56250	32.3750	8	A2	145.027260	0.2232339

Male mild and extreme outliers

	SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
746	M	13.440	10.815	1.680	130.2500	63.73125	10	A3	244.1940	0.2609861
754	M	10.500	7.770	3.150	132.6875	61.13250	9	A3	256.9928	0.2378764
803	M	10.710	8.610	3.255	160.3125	70.41375	9	A3	300.1536	0.2345924
810	M	12.285	9.870	3.465	176.1250	99.00000	10	A3	420.1415	0.2356349
852	M	11.550	8.820	3.360	167.5625	78.27187	10	A3	342.2866	0.2286735

Female extreme outliers

	SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
350	F	7.98	6.72	2.415	80.9375	40.375	7	A2	129.5058	0.311762

Infant extreme outliers

	SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
3	I	10.08	7.35	2.205	79.375	44	6	A1	163.364	0.2693371

No Male extreme outliers

SEX	LENGTH	DIAM	HEIGHT	WHOLE	SHUCK	RINGS	CLASS	VOLUME	RATIO
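One way to pull out the rows behind such tables is boxplot.stats(), which applies the same fences the boxplot draws. The helper name outlier_rows() and the demo values are hypothetical; coef = 1.5 returns mild and extreme outliers together, and coef = 3 returns the extreme ones only.

```r
# Return the rows of d whose RATIO lies beyond the boxplot fences.
outlier_rows <- function(d, coef = 1.5) {
  out <- boxplot.stats(d$RATIO, coef = coef)$out
  d[d$RATIO %in% out, ]
}

# Made-up RATIO values for a single sex.
demo <- data.frame(
  id    = 1:11,
  RATIO = c(0.11, 0.12, 0.13, 0.13, 0.14, 0.14, 0.14, 0.15, 0.15, 0.20, 0.31)
)

mild_and_extreme <- outlier_rows(demo)            # rows 10 and 11
extreme_only     <- outlier_rows(demo, coef = 3)  # row 11
mild_and_extreme
extreme_only

# Per sex with the real data (not run here):
# lapply(split(mydata, mydata$SEX), outlier_rows)
```

Each resulting data frame can be passed to kable() for display.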

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: The largest ratio comes from a very small female (row 350, volume about 130). Females and infants have more RATIO outliers than males, and nearly all of them are high values, i.e. shuck weight is high relative to volume. Most outliers (15 of 19) come from smaller abalones, with volume below the mean (about 327); 13 of 19 are also below the median (about 307). Infants have the most outliers (8), which is consistent with the observation that smaller abalones tend to have more outliers.

Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: Volume and whole weight will not perform very well as predictors of age. In the scatterplots the points are widely spread across the figure. There is a visible trend, with both volume and whole weight positively related to rings, but it is not strong enough. I also fit a simple linear model, shown below, using both WHOLE and VOLUME to predict RINGS; as expected, the adjusted R-squared is only .3105, which is not powerful enough.

## 
## Call:
## lm(formula = RINGS ~ WHOLE + VOLUME, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8782 -1.7440 -0.6368  0.9165 13.6803 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.901213   0.169737  40.658  < 2e-16 ***
## WHOLE       -0.003082   0.005486  -0.562    0.574    
## VOLUME       0.010459   0.001745   5.995 2.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.755 on 1033 degrees of freedom
## Multiple R-squared:  0.3118, Adjusted R-squared:  0.3105 
## F-statistic:   234 on 2 and 1033 DF,  p-value: < 2.2e-16
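A sketch of how the fit above is produced and how the adjusted R-squared can be pulled out programmatically. The data are simulated under made-up coefficients, so the numbers will not match the output above. The sketch also checks the correlation between the two predictors; strongly correlated predictors are one possible reason (an assumption, not something tested here) for a coefficient such as WHOLE to appear non-significant.

```r
# Simulated stand-in: WHOLE is built to track VOLUME closely.
set.seed(7)
n <- 200
demo <- data.frame(VOLUME = runif(n, min = 50, max = 900))
demo$WHOLE <- 0.32 * demo$VOLUME + rnorm(n, sd = 10)
demo$RINGS <- 7 + 0.008 * demo$VOLUME + rnorm(n, sd = 2.5)

fit <- lm(RINGS ~ WHOLE + VOLUME, data = demo)

# Adjusted R-squared: share of variance explained, penalized for model size.
adj_r2 <- summary(fit)$adj.r.squared
adj_r2

# Correlation between the two predictors.
predictor_cor <- cor(demo$WHOLE, demo$VOLUME)
predictor_cor
```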

Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

## $Volume
##               A1       A2       A3       A4       A5
## Female 255.29938 276.8573 412.6079 498.0489 486.1525
## Infant  66.51618 160.3200 270.7406 316.4129 318.6930
## Male   103.72320 245.3857 358.1181 442.6155 440.2074
## 
## $Shuck
##              A1       A2       A3       A4       A5
## Female 38.90000 42.50305 59.69121 69.05161 59.17076
## Infant 10.11332 23.41024 37.17969 39.85369 36.47047
## Male   16.39583 38.33855 52.96933 61.42726 55.02762
## 
## $Ratio
##               A1        A2        A3        A4        A5
## Female 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## Infant 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## Male   0.1512698 0.1564017 0.1462123 0.1364881 0.1262089
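A compact sketch of the aggregate-then-matrix step, using a tiny made-up data set with two classes. It relies on two assumptions worth checking with real data: every SEX by CLASS combination is present, and aggregate() returns rows with SEX varying fastest within CLASS, so that filling the matrix column by column lines up with the labels.

```r
# Made-up data: two observations for each SEX x CLASS combination.
demo <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2"), rep = 1:2)
demo$VOLUME <- c(10, 20, 30, 40, 50, 60,
                 12, 22, 32, 42, 52, 62)

# Mean VOLUME per SEX x CLASS cell.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = demo, FUN = mean)

# Fill column-wise: rows are sexes, columns are classes.
volume_means <- matrix(agg$VOLUME, nrow = 3,
                       dimnames = list(c("Female", "Infant", "Male"),
                                       c("A1", "A2")))
volume_means
```

The same pattern applies to SHUCK and RATIO by swapping the left-hand side of the formula.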

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: Mean RATIO generally decreases as age class increases, for all three sexes. Mean VOLUME increases with age class through A4 and then levels off, which makes sense; this also holds for all sexes. Mean SHUCK weight likewise increases with class through A4, then dips in A5 for all sexes. On all three measures, females show little difference between classes A1 and A2; maybe females develop more slowly than the other groups? In mean VOLUME and SHUCK, females are always greater than males, who are always greater than infants. For mean RATIO, adult males and females are nearly the same, and both exceed infants in every class except A1.

5(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
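A sketch of the subsetting and grouping step, shown for VOLUME only on a made-up data set. The GROUP column name is a hypothetical helper; the WHOLE panels follow the same pattern with ylim = c(0, 400).

```r
# Made-up rows; two of them have 16 or more rings and should be dropped.
demo <- data.frame(
  SEX    = factor(c("F", "I", "M", "I", "F", "M", "I", "M")),
  RINGS  = c(7L, 3L, 10L, 16L, 14L, 20L, 8L, 9L),
  VOLUME = c(129.5, 13.2, 244.2, 310, 568, 450, 145, 257)
)

# Keep fewer than 16 rings, then pool M and F into a single adult group.
young <- subset(demo, RINGS < 16)
young$GROUP <- ifelse(young$SEX == "I", "Infant", "Adult")

# Two of the four panels: VOLUME by RINGS for infants and for adults.
op <- par(mfrow = c(1, 2))
boxplot(VOLUME ~ RINGS, data = young[young$GROUP == "Infant", ],
        ylim = c(0, 1100), main = "Infant Volume | Rings")
boxplot(VOLUME ~ RINGS, data = young[young$GROUP == "Adult", ],
        ylim = c(0, 1100), main = "Adult Volume | Rings")
par(op)  # restore the previous layout
```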

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: Overall, volume and whole weight increase as age increases, for both adult and infant abalones. Adults are larger in volume and weight than infants (entirely expected) and also have a larger standard deviation; infant whole weight and volume sit closer to the mean. Both infants and adults tend to reach peak volume and whole weight at around 11 to 12 rings and then shrink a little. Adults grow at a faster pace than infants. Based solely on volume or whole weight, it is hard to differentiate infants from adults, because the distributions overlap considerably.

Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: I think the biggest reason is that too few variables were collected (perhaps more variables are too hard to measure, but measurement difficulty is not the issue under discussion here). In addition, there are many outliers, and the right skewness means the data are not normal, so methods that rely on a normal approximation are hard to apply. Physical measurements can be helpful, but only to a limited extent; they are not useful enough for this purpose. Weather, predators, pollution, food and other environmental factors could play a bigger role; some of these are mentioned in the background document, but no related data were collected. Lastly, infants can have more than 10 rings, which leaves me confused about how infants are classified. More explanation is required here, or more variables are needed to better differentiate infants from adults at different ages.

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: Questions I would ask include: What is the sample size, and is it big enough? Are outliers present? What are the mean and standard deviation of the sample? What are the population mean and standard deviation, and can we estimate them? Are there prior studies of them? How skewed are the data? How logically sound are the measured variables, and can we draw a cause-and-effect relationship between them and the variable being predicted? How and where was the sample obtained, and was it sufficiently randomized?

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: There are many difficulties in analyzing data derived from observational studies. There could be human or measurement errors, since the values are recorded manually. The sample may not be well randomized, which leads to bias. Causality is very hard, if not impossible, to determine: there could be hidden factors behind two highly correlated variables, and data from a purely observational study can be explained in almost any way people want, so such explanations do not carry much credibility. Most importantly, without a controlled experiment, whatever is observed can only be treated as correlation, not causation, as we have discussed at length over the past few weeks. The biggest thing we can learn from observational studies, I think, is that they surface promising ideas at lower cost, which can serve as a starting point for further research.