Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.


The following code chunk will:

  1. load the “ggplot2”, “gridExtra” and “knitr” packages, assuming each has been installed on your machine,
  2. read in the abalones dataset, defining a new data frame, “mydata,”
  3. return the structure of that data frame, and
  4. calculate new variables, VOLUME and RATIO.

Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.

## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
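
The text does not define the derived variables from step 4. The values printed in the outlier table under (3)(b), however, are consistent with VOLUME = LENGTH × DIAM × HEIGHT and RATIO = SHUCK / VOLUME. The following is a minimal sketch under that assumption. It uses three rows copied from that table instead of reading abalones.csv, and the data frame name `toy` is hypothetical.

```r
# Three rows copied from the (3)(b) outlier table; `toy` stands in for mydata.
toy <- data.frame(
  LENGTH = c(10.080, 4.305, 2.835),
  DIAM   = c(7.350, 3.255, 2.730),
  HEIGHT = c(2.205, 0.945, 0.840),
  SHUCK  = c(44.0000, 2.9375, 1.5625)
)

# Assumed definitions, inferred from the printed values (not stated in the text):
toy$VOLUME <- toy$LENGTH * toy$DIAM * toy$HEIGHT
toy$RATIO  <- toy$SHUCK / toy$VOLUME

print(toy)
```

With the real data, the same two assignments would be applied to “mydata” after it has been read in.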

Test items start here. There are 6 sections, worth 50 points in total.

### Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176
##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

Answer: SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous. WHOLE, VOLUME and SHUCK stand out: their maximum values lie far above their third quartiles, which suggests right skew and possible high outliers. RINGS likely has outliers as well, given the gap between its third quartile (11) and its maximum (25). The counts of male, female and infant observations are roughly equal.

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data, ignoring the marginal totals.

##      
##         A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036
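
As a rough sketch of the table() then addmargins() workflow, the block below rebuilds the 15 counts shown above as a table object, adds the margins, and plots only the unmargined cells. The object names (`counts`, `tab`, `with_margins`) are hypothetical. With the real data, `tab` would presumably come from `table(mydata$SEX, mydata$CLASS)`.

```r
# Counts copied from the SEX-by-CLASS table above.
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX   = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
tab <- as.table(counts)

# Add row and column totals (labelled "Sum").
with_margins <- addmargins(tab)
print(with_margins)

# Plot the 15 cells only: pass `tab`, not `with_margins`.
barplot(tab, beside = TRUE, legend.text = rownames(tab),
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```

Plotting `tab` and not `with_margins` keeps the "Sum" row and column out of the bars.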

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: Classes A1 and A2 contain a large number of infant abalones, which makes sense since CLASS represents age here. Additionally, the total count increases through A3 and then tapers off, with male and female counts remaining roughly equal in the older classes.

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
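
The row-sampling step can be sketched as follows. Because abalones.csv is not available here, `mydata_demo` is a hypothetical stand-in with 1036 rows and made-up numeric columns; only the sampling pattern carries over to the real data.

```r
# Hypothetical stand-in for mydata: an id column plus five made-up numeric
# columns, so that columns 2:6 exist.
set.seed(1)
mydata_demo <- data.frame(id = 1:1036, matrix(runif(1036 * 5), ncol = 5))

set.seed(123)                                    # seed required by the assignment
rows <- sample(seq_len(nrow(mydata_demo)), 200)  # sample row indices, not the data frame
work <- mydata_demo[rows, ]                      # keep only the sampled rows

plot(work[, 2:6])  # scatterplot matrix of columns 2 through 6
```

sample() draws without replacement by default, so no row appears twice in “work.”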


### Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.

## [1] 0.5621008
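
A minimal base R sketch of the reference line follows. It uses five rows copied from the outlier table under (3)(b), not the full data set. That subset happens to include the row whose SHUCK/WHOLE ratio matches the maximum printed above. The names `toy` and `max_ratio` are hypothetical.

```r
# Five rows copied from the (3)(b) outlier table; `toy` stands in for mydata.
toy <- data.frame(
  WHOLE = c(79.375, 80.9375, 133.875, 176.125, 167.5625),
  SHUCK = c(44.000, 40.375, 38.250, 99.000, 78.27187),
  CLASS = factor(c("A1", "A2", "A5", "A3", "A3"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Largest observed ratio of SHUCK to WHOLE; used as the slope of the line.
max_ratio <- max(toy$SHUCK / toy$WHOLE)
print(max_ratio)

plot(toy$WHOLE, toy$SHUCK, col = as.integer(toy$CLASS), pch = 19,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE")
abline(a = 0, b = max_ratio, lty = 2)  # a = intercept, b = slope
legend("topleft", legend = levels(toy$CLASS),
       col = seq_along(levels(toy$CLASS)), pch = 19, title = "CLASS")
```

By construction, every point lies on or below this line, because no observation can exceed the maximum ratio.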

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: The plot in (a) shows more variability than the plot in (b). Additionally, all abalones fall on or below the reference line in plot (b), and there is little evidence that older abalones produce proportionally more shuck weight.


### Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
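
One possible base R layout for this grid is sketched below. The data frame `demo` holds made-up, normally distributed RATIO values and is only a stand-in for “mydata”; the structure of the loops is the point.

```r
# Made-up data: 100 RATIO values per sex (hypothetical stand-in for mydata).
set.seed(42)
demo <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rnorm(300, mean = 0.14, sd = 0.02)
)
sexes <- levels(demo$SEX)

# 3 x 3 grid, filled row by row; keep the old settings to restore later.
op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))

# Row 1: histograms
for (s in sexes) {
  hist(demo$RATIO[demo$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
# Row 2: boxplots
for (s in sexes) {
  boxplot(demo$RATIO[demo$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
# Row 3: normal Q-Q plots with reference lines
for (s in sexes) {
  x <- demo$RATIO[demo$SEX == s]
  qqnorm(x, main = paste("Q-Q plot:", s))
  qqline(x)
}

par(op)  # restore the previous graphics settings
```

Reducing the margins with `mar` is one way to keep nine panels legible; the exact values are a judgment call.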

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: For each sex, RATIO appears approximately normal across all three displays. The boxplots show a few outliers, which also appear as departures in the tails of the Q-Q plots.

(3)(b) (2 points) The boxplots in (3)(a) indicate that there are outlying RATIOs for each sex. boxplot.stats() can be used to identify outlying values of a vector. Present the abalones with these outlying RATIO values along with their associated variables in “mydata”. Display the observations by passing a data frame to the kable() function. Basically, we want to output those rows of “mydata” with an outlying RATIO, but we want to determine outliers looking separately at infants, females and males.

|     | SEX | LENGTH |   DIAM | HEIGHT |     WHOLE |     SHUCK | RINGS | CLASS |     VOLUME |     RATIO |    Ratio2 |
|:----|:----|-------:|-------:|-------:|----------:|----------:|------:|:------|-----------:|----------:|----------:|
| 3   | I   | 10.080 |  7.350 |  2.205 |  79.37500 |  44.00000 |     6 | A1    | 163.364040 | 0.2693371 | 0.5543307 |
| 37  | I   |  4.305 |  3.255 |  0.945 |   6.18750 |   2.93750 |     3 | A1    |  13.242072 | 0.2218308 | 0.4747475 |
| 42  | I   |  2.835 |  2.730 |  0.840 |   3.62500 |   1.56250 |     4 | A1    |   6.501222 | 0.2403394 | 0.4310345 |
| 58  | I   |  6.720 |  4.305 |  1.680 |  22.62500 |  11.00000 |     5 | A1    |  48.601728 | 0.2263294 | 0.4861878 |
| 67  | I   |  5.040 |  3.675 |  0.945 |   9.65625 |   3.93750 |     5 | A1    |  17.503290 | 0.2249577 | 0.4077670 |
| 89  | I   |  3.360 |  2.310 |  0.525 |   2.43750 |   0.93750 |     4 | A1    |   4.074840 | 0.2300704 | 0.3846154 |
| 105 | I   |  6.930 |  4.725 |  1.575 |  23.37500 |  11.81250 |     7 | A2    |  51.572194 | 0.2290478 | 0.5053476 |
| 200 | I   |  9.135 |  6.300 |  2.520 |  74.56250 |  32.37500 |     8 | A2    | 145.027260 | 0.2232339 | 0.4341995 |
| 350 | F   |  7.980 |  6.720 |  2.415 |  80.93750 |  40.37500 |     7 | A2    | 129.505824 | 0.3117620 | 0.4988417 |
| 379 | F   | 15.330 | 11.970 |  3.465 | 252.06250 | 134.89812 |    10 | A3    | 635.827846 | 0.2121614 | 0.5351773 |
| 420 | F   | 11.550 |  7.980 |  3.465 | 150.62500 |  68.55375 |    10 | A3    | 319.365585 | 0.2146560 | 0.4551286 |
| 421 | F   | 13.125 | 10.290 |  2.310 | 142.00000 |  66.47062 |     9 | A3    | 311.979938 | 0.2130606 | 0.4681030 |
| 458 | F   | 11.445 |  8.085 |  3.150 | 139.81250 |  68.49062 |     9 | A3    | 291.478399 | 0.2349767 | 0.4898748 |
| 586 | F   | 12.180 |  9.450 |  4.935 | 133.87500 |  38.25000 |    14 | A5    | 568.023435 | 0.0673388 | 0.2857143 |
| 746 | M   | 13.440 | 10.815 |  1.680 | 130.25000 |  63.73125 |    10 | A3    | 244.194048 | 0.2609861 | 0.4892994 |
| 754 | M   | 10.500 |  7.770 |  3.150 | 132.68750 |  61.13250 |     9 | A3    | 256.992750 | 0.2378764 | 0.4607254 |
| 803 | M   | 10.710 |  8.610 |  3.255 | 160.31250 |  70.41375 |     9 | A3    | 300.153640 | 0.2345924 | 0.4392281 |
| 810 | M   | 12.285 |  9.870 |  3.465 | 176.12500 |  99.00000 |    10 | A3    | 420.141472 | 0.2356349 | 0.5621008 |
| 852 | M   | 11.550 |  8.820 |  3.360 | 167.56250 |  78.27187 |    10 | A3    | 342.286560 | 0.2286735 | 0.4671205 |
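
One way to find outliers separately for each sex is to split the data by SEX and apply boxplot.stats() within each group. The sketch below is base R only and uses a small, deterministic made-up data set. The names `toy` and `outlier_rows` are hypothetical, and this is not necessarily how the table above was produced.

```r
# Made-up data: 20 evenly spaced RATIO values per sex plus one extreme value each.
base_vals <- seq(0.10, 0.18, length.out = 20)
toy <- data.frame(
  SEX   = rep(c("F", "I", "M"), each = 21),
  RATIO = c(base_vals, 0.31, base_vals, 0.27, base_vals, 0.26)
)

# Return the rows whose RATIO is flagged as an outlier within its own sex.
outlier_rows <- function(df) {
  pieces <- lapply(split(df, df$SEX), function(g) {
    flagged <- boxplot.stats(g$RATIO)$out   # values beyond the whiskers
    g[g$RATIO %in% flagged, , drop = FALSE]
  })
  do.call(rbind, pieces)
}

out <- outlier_rows(toy)
print(out)
```

In the assignment, the resulting data frame would then be passed to kable() from the knitr package for display.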

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: One infant outlier has a higher ring count than the others. Larger weights and volumes are associated with older abalones.


### Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: If you were differentiating between classes A1-A3, volume and weight appear to be good predictors of age. However, the boxplots for A4 and A5 are very similar for both volume and weight, so it would be hard to differentiate between those classes using these variables. Once abalones have reached maturity, these variables will not be good predictors of age. Rings would not be a good indicator either, since they have the same relationship with volume and weight as class does.


### Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

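Volume

|        |        A1 |       A2 |       A3 |       A4 |       A5 |
|:-------|----------:|---------:|---------:|---------:|---------:|
| Female | 255.29938 | 276.8573 | 412.6079 | 498.0489 | 486.1525 |
| Infant |  66.51618 | 160.3200 | 270.7406 | 316.4129 | 318.6930 |
| Male   | 103.72320 | 245.3857 | 358.1181 | 442.6155 | 440.2074 |

Shuck

|        |       A1 |       A2 |       A3 |       A4 |       A5 |
|:-------|---------:|---------:|---------:|---------:|---------:|
| Female | 38.90000 | 42.50305 | 59.69121 | 69.05161 | 59.17076 |
| Infant | 10.11332 | 23.41024 | 37.17969 | 39.85369 | 36.47047 |
| Male   | 16.39583 | 38.33855 | 52.96933 | 61.42726 | 55.02762 |

Ratio

|        |        A1 |        A2 |        A3 |        A4 |        A5 |
|:-------|----------:|----------:|----------:|----------:|----------:|
| Female | 0.1546644 | 0.1554605 | 0.1450304 | 0.1379609 | 0.1233605 |
| Infant | 0.1569554 | 0.1475600 | 0.1372256 | 0.1244413 | 0.1167649 |
| Male   | 0.1512698 | 0.1564017 | 0.1462123 | 0.1364881 | 0.1262089 |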
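
The aggregate() and matrix() steps might look like the sketch below, shown for VOLUME only on a tiny made-up data set with two classes. The names `demo`, `agg` and `vol_means` are hypothetical. The reshaping relies on aggregate() returning rows with the first grouping variable (SEX) varying fastest, and on every SEX and CLASS combination being present.

```r
# Made-up data: two animals per SEX within each of two classes.
demo <- data.frame(
  SEX    = factor(rep(c("F", "I", "M"), times = 4)),
  CLASS  = factor(rep(c("A1", "A2"), each = 6)),
  VOLUME = c(250,  60, 100, 260,  70, 110,   # class A1
             270, 150, 240, 280, 170, 250)   # class A2
)

# Mean VOLUME for each SEX/CLASS combination; SEX varies fastest in the output.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = demo, FUN = mean)

# Fill column by column so that rows are SEX and columns are CLASS.
vol_means <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
                    dimnames = list(c("Female", "Infant", "Male"),
                                    levels(demo$CLASS)))
print(vol_means)
```

The same pattern would be repeated for SHUCK and RATIO, and each matrix could be passed to kable() for presentation.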

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
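
As an illustration of interaction.plot(), the sketch below plots the mean RATIO values reported in (5)(a). The matrix is converted to long form with one value per cell, so taking the mean within each cell simply returns that value. The names `ratio_means` and `long` are hypothetical. With the raw data, the call would presumably take `mydata$CLASS`, `mydata$SEX` and `mydata$RATIO` directly.

```r
# Mean RATIO by SEX and CLASS, copied from the (5)(a) output.
ratio_means <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX   = c("Female", "Infant", "Male"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)

# Long form: one row per SEX/CLASS cell with columns SEX, CLASS, RATIO.
long <- as.data.frame(as.table(ratio_means), responseName = "RATIO")

# One line per sex, CLASS on the horizontal axis.
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean,
                 type = "b", pch = 1:3, col = 1:3,
                 trace.label = "SEX", xlab = "CLASS", ylab = "Mean RATIO")
```

The analogous plots for VOLUME and SHUCK follow by swapping in the corresponding matrix.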

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: It is interesting to note that the females weigh more than the males. Additionally, the ratio of shuck weight to volume decreases with age. The highest mean shuck weight is for those in class A4, so it would be interesting to learn what happens when they reach A5.

(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first row should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second row should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to fewer than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2, go ahead.
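
A possible base R sketch of the subsetting and the four boxplots is given below. The data frame `demo` is made up (its VOLUME and WHOLE values are arbitrary functions of RINGS), and the names `under16`, `infants` and `adults` are hypothetical.

```r
# Made-up stand-in for mydata.
set.seed(7)
n <- 300
demo <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:25, n, replace = TRUE)
)
demo$VOLUME <- 40 * demo$RINGS + runif(n, 0, 100)  # arbitrary made-up relation
demo$WHOLE  <- demo$VOLUME / 3

# Keep fewer than 16 rings, then separate infants from adults (M and F combined).
under16 <- subset(demo, RINGS < 16)
infants <- subset(under16, SEX == "I")
adults  <- subset(under16, SEX != "I")

op <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME by RINGS", xlab = "RINGS", ylab = "VOLUME")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME by RINGS", xlab = "RINGS", ylab = "VOLUME")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE by RINGS", xlab = "RINGS", ylab = "WHOLE")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE by RINGS", xlab = "RINGS", ylab = "WHOLE")
par(op)
```

Using the same ylim within each row makes the infant and adult panels directly comparable.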

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: (Enter your answer here.)


### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: There does seem to be some skewness, which would need to be addressed in order to fully understand the data. Since it is best to work with normally distributed data, the research team may need to look into a different sampling procedure. Since the physical measurements showed differences between infants and adults, they can certainly be used to predict different life stages. True age cannot be predicted reliably, however, since some measurements, such as weight, decrease toward the end of life.

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: The first question I would ask is how many individuals were observed. I would use that information to judge whether the sample could be representative of the entire population in question. I would also be interested in the number of locations where the sampled individuals were found, and I would try to understand whether the sample could have been further randomized. I would also want to know the measures of central tendency associated with the population, as well as any outliers or skewness present.

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: Observational studies tend to leave room for error. When measuring animals in the wild, the measurements themselves, the selection of locations and the sexing of individuals could all produce errors. The data will also have to be cleaned by an analyst or data scientist before the information can be analyzed. Causality cannot be determined from the data without a known control data set. Scientists are able to identify trends or correlations but generally cannot point to causation with certainty. These kinds of studies can help us understand where relationships between different data points may exist and can lead to further studies that could potentially show causality.