The following code chunk will (a) load the “ggplot2”, “gridExtra” and “knitr” packages, assuming each has been installed on your machine, (b) read in the abalones dataset, defining a new data frame, “mydata,” (c) return the structure of that data frame, and (d) calculate new variables, VOLUME and RATIO. If any of these packages has not been installed, you must install it first via install.packages(); e.g. install.packages(“ggplot2”). Do not include installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine.
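A minimal sketch of steps (b) through (d) may help here. The chunk itself is not shown in this document, so the file path and the read.csv() arguments below are assumptions. The definitions VOLUME = LENGTH × DIAM × HEIGHT and RATIO = SHUCK / VOLUME are inferred from the rows printed under (3)(b), not quoted from the assignment. The toy data frame holds two of those rows so the sketch runs without the CSV file.

```r
# Assumed loading step (path and arguments are hypothetical):
# mydata <- read.csv("abalones.csv", sep = ",", stringsAsFactors = TRUE)
# str(mydata)

# Toy stand-in: rows 3 and 586 as printed in the outlier table under (3)(b)
toy <- data.frame(
  LENGTH = c(10.080, 12.180),
  DIAM   = c(7.350, 9.450),
  HEIGHT = c(2.205, 4.935),
  SHUCK  = c(44.00, 38.25)
)

# Inferred definitions (they reproduce the printed VOLUME and RATIO values)
toy$VOLUME <- toy$LENGTH * toy$DIAM * toy$HEIGHT
toy$RATIO  <- toy$SHUCK / toy$VOLUME

print(toy)
```

With the real data, the same two assignments would be applied to “mydata” instead of the toy data frame.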

## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

(1)(a) (2 points) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT     
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415  
##  M:381   Median :11.45   Median : 8.925   Median :2.940  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935  
##      WHOLE             SHUCK              RINGS        CLASS   
##  Min.   :  1.625   Min.   :  0.5625   Min.   : 3.000   A1:108  
##  1st Qu.: 56.484   1st Qu.: 23.3006   1st Qu.: 8.000   A2:236  
##  Median :101.344   Median : 42.5700   Median : 9.000   A3:329  
##  Mean   :105.832   Mean   : 45.4396   Mean   : 9.993   A4:188  
##  3rd Qu.:150.319   3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175  
##  Max.   :315.750   Max.   :157.0800   Max.   :25.000           
##      VOLUME            RATIO        
##  Min.   :  3.612   Min.   :0.06734  
##  1st Qu.:163.545   1st Qu.:0.12241  
##  Median :307.363   Median :0.13914  
##  Mean   :326.804   Mean   :0.14205  
##  3rd Qu.:463.264   3rd Qu.:0.15911  
##  Max.   :995.673   Max.   :0.31176
##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8
##     
##       20  21  22  23  24  25
##   A1   0   0   0   0   0   0
##   A2   0   0   0   0   0   0
##   A3   0   0   0   0   0   0
##   A4   0   0   0   0   0   0
##   A5   6   4   1   7   2   1
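As a hedged illustration of where the cell count comes from: table() produces one row per level of CLASS (whether or not the level occurs) and one column per distinct RINGS value, so 5 classes by 23 ring counts (3 through 25) gives the 115 cells shown above. The toy data frame below is made up purely to show the mechanics.

```r
# Made-up stand-in for mydata (values are illustrative only)
toy <- data.frame(
  CLASS = factor(c("A1", "A1", "A2", "A3", "A5"),
                 levels = c("A1", "A2", "A3", "A4", "A5")),
  RINGS = c(3L, 6L, 8L, 9L, 13L)
)

print(summary(toy))                  # level counts for the factor, quantiles for RINGS

tab <- table(toy$CLASS, toy$RINGS)   # rows = CLASS levels, columns = RINGS values
print(tab)
print(dim(tab))                      # 5 levels by 5 distinct ring counts
print(length(tab))                   # number of cells = product of the dimensions
```

Note that the unused level A4 still gets a row of zeros, which is why the full table has five rows regardless of how the ring counts are distributed.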

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

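Answer: (SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous. The sample is split roughly evenly among females (326), infants (329) and males (381). Outliers are likely among the larger abalones: the maxima of WHOLE, VOLUME and SHUCK lie far above their third quartiles.)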

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table. (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins(); see Kabacoff Section 7.2, pages 144-147.) Lastly, present a barplot of these data, ignoring the marginal totals.

##    
##      A1  A2  A3  A4  A5
##   F   5  41 121  82  77
##   I  91 133  65  21  19
##   M  12  62 143  85  79
##      
##         A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036
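The two tables above can be reproduced from the printed counts alone. The sketch below rebuilds the SEX by CLASS table as a matrix, so it runs without the CSV; with the real data the starting point would presumably be table(mydata$SEX, mydata$CLASS). The plot labels are illustrative choices, not taken from the original.

```r
# Counts copied from the SEX by CLASS table printed above
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
tab <- as.table(counts)

with_margins <- addmargins(tab)   # appends a "Sum" row and a "Sum" column
print(with_margins)

# Grouped bars: one cluster per CLASS, one bar per SEX; the margins are not plotted
barplot(tab, beside = TRUE, legend.text = TRUE,
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```

Passing the table without margins to barplot() is what keeps the marginal totals out of the chart.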

Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: (As might be expected, the youngest abalones (A1) are mostly infants (“I”). The total count increases until about mid-life (A3), after which it steadily decreases. It is interesting to note that as abalones age (A5 being the oldest), the sexes level out to an almost equal number of males and females.)

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
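The row-index approach described above can be sketched as follows. The stand-in data frame has 1036 rows like “mydata” but only a hypothetical id column, so the scatterplot matrix call is left as a comment.

```r
# Stand-in with the same number of rows as mydata (1036); the id column is hypothetical
toy <- data.frame(id = 1:1036)

set.seed(123)                              # fix the random number stream
idx  <- sample(seq_len(nrow(toy)), 200)    # 200 distinct row numbers (no replacement)
work <- toy[idx, , drop = FALSE]           # select those rows; drop = FALSE keeps a data frame

# With the real data: work <- mydata[idx, ]; plot(work[, 2:6])
```

Calling set.seed() immediately before sample() matters: any other random draw in between would change which rows are selected.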


(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments of the function. If you are using ggplot2 for visualizations, geom_abline() should be used.

## [1] 0.5621008
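A base R sketch of this step, under stated assumptions: the four rows below are copied from the outlier table under (3)(b) and are not the full data set, and the plotting choices (point style, legend position) are illustrative. On these rows the maximum ratio comes from row 810 and agrees with the value printed above to the digits shown.

```r
# Four rows copied from the outlier table under (3)(b); not the full data set
toy <- data.frame(
  WHOLE = c(79.375, 6.1875, 133.875, 176.125),
  SHUCK = c(44.000, 2.9375, 38.250, 99.000),
  CLASS = factor(c("A1", "A1", "A5", "A3"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)

max_ratio <- max(toy$SHUCK / toy$WHOLE)   # largest SHUCK-to-WHOLE ratio
print(max_ratio)

plot(toy$WHOLE, toy$SHUCK, col = as.integer(toy$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK")
abline(a = 0, b = max_ratio, lty = 2)     # a = intercept, b = slope
legend("topleft", legend = levels(toy$CLASS), col = 1:5, pch = 16,
       title = "CLASS")
```

Because the slope is the maximum ratio, every point lies on or below the dashed line by construction.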

Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: (There is more variability in plot (a) than in plot (b). Plot (a) shows two high outliers associated with A1/Infants that may be skewing the results. Plot (b) shows the maximum ratio of SHUCK to WHOLE as a straight line; by construction no abalone lies above this line, and most fall below it, which could indicate that waiting for older abalones (A3 - A5) may not yield more shuck weight.)


(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
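One possible base R layout is sketched below. The RATIO values are simulated stand-ins, not the abalone data, so the shapes of the plots carry no meaning; only the 3 by 3 arrangement is the point.

```r
# Simulated stand-in for mydata: RATIO values are random, not the abalone data
set.seed(1)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 50)),
  RATIO = c(rnorm(50, 0.14, 0.02), rnorm(50, 0.14, 0.02), rnorm(50, 0.15, 0.02))
)

old_par <- par(mfrow = c(3, 3))   # 3 rows by 3 columns, filled row by row

# Row 1: histograms; row 2: boxplots; row 3: Q-Q plots (one panel per sex)
for (s in levels(toy$SEX)) {
  hist(toy$RATIO[toy$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in levels(toy$SEX)) {
  boxplot(toy$RATIO[toy$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in levels(toy$SEX)) {
  qqnorm(toy$RATIO[toy$SEX == s], main = paste("Q-Q plot:", s))
  qqline(toy$RATIO[toy$SEX == s])   # reference line for judging normality
}

par(old_par)                      # restore the previous layout
```

Running three separate loops, rather than one loop that draws all three plot types per sex, is what puts each plot type on its own row.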

Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: (The histograms indicate that the distributions are non-normal and skewed, likely because of outliers within each sex. The Q-Q plots for infants and females also drift away from the reference line, which likewise indicates a non-normal distribution.)

(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).

|     | SEX | LENGTH | DIAM   | HEIGHT | WHOLE     | SHUCK    | RINGS | CLASS | VOLUME     | RATIO     |
|-----|-----|--------|--------|--------|-----------|----------|-------|-------|------------|-----------|
| 3   | I   | 10.080 | 7.350  | 2.205  | 79.37500  | 44.00000 | 6     | A1    | 163.364040 | 0.2693371 |
| 37  | I   | 4.305  | 3.255  | 0.945  | 6.18750   | 2.93750  | 3     | A1    | 13.242072  | 0.2218308 |
| 42  | I   | 2.835  | 2.730  | 0.840  | 3.62500   | 1.56250  | 4     | A1    | 6.501222   | 0.2403394 |
| 58  | I   | 6.720  | 4.305  | 1.680  | 22.62500  | 11.00000 | 5     | A1    | 48.601728  | 0.2263294 |
| 67  | I   | 5.040  | 3.675  | 0.945  | 9.65625   | 3.93750  | 5     | A1    | 17.503290  | 0.2249577 |
| 89  | I   | 3.360  | 2.310  | 0.525  | 2.43750   | 0.93750  | 4     | A1    | 4.074840   | 0.2300704 |
| 105 | I   | 6.930  | 4.725  | 1.575  | 23.37500  | 11.81250 | 7     | A2    | 51.572194  | 0.2290478 |
| 200 | I   | 9.135  | 6.300  | 2.520  | 74.56250  | 32.37500 | 8     | A2    | 145.027260 | 0.2232339 |
| 350 | F   | 7.980  | 6.720  | 2.415  | 80.93750  | 40.37500 | 7     | A2    | 129.505824 | 0.3117620 |
| 420 | F   | 11.550 | 7.980  | 3.465  | 150.62500 | 68.55375 | 10    | A3    | 319.365585 | 0.2146560 |
| 458 | F   | 11.445 | 8.085  | 3.150  | 139.81250 | 68.49062 | 9     | A3    | 291.478399 | 0.2349767 |
| 586 | F   | 12.180 | 9.450  | 4.935  | 133.87500 | 38.25000 | 14    | A5    | 568.023435 | 0.0673388 |
| 746 | M   | 13.440 | 10.815 | 1.680  | 130.25000 | 63.73125 | 10    | A3    | 244.194048 | 0.2609861 |
| 754 | M   | 10.500 | 7.770  | 3.150  | 132.68750 | 61.13250 | 9     | A3    | 256.992750 | 0.2378764 |
| 803 | M   | 10.710 | 8.610  | 3.255  | 160.31250 | 70.41375 | 9     | A3    | 300.153640 | 0.2345924 |
| 810 | M   | 12.285 | 9.870  | 3.465  | 176.12500 | 99.00000 | 10    | A3    | 420.141472 | 0.2356349 |
| 852 | M   | 11.550 | 8.820  | 3.360  | 167.56250 | 78.27187 | 10    | A3    | 342.286560 | 0.2286735 |
| 870 | M   | 11.445 | 8.610  | 2.520  | 99.12500  | 53.70750 | 9     | A3    | 248.324454 | 0.2162795 |
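One way to pull boxplot outliers programmatically is boxplot.stats(), whose coef argument sets the fence multiplier (1.5 by default; a multiplier of 3 is the usual convention for extreme outliers). The sketch below uses made-up RATIO values for a single group; with the real data the same call would presumably be applied to the RATIO values of each sex in turn.

```r
# Made-up RATIO values for one sex (illustrative only)
toy <- data.frame(
  id    = 1:9,
  RATIO = c(0.10, 0.12, 0.13, 0.14, 0.14, 0.15, 0.16, 0.17, 0.31)
)

# Values beyond 1.5 x the hinge spread (mild and extreme together)
mild_or_extreme <- boxplot.stats(toy$RATIO, coef = 1.5)$out
# Values beyond 3 x the hinge spread (extreme only)
extreme_only    <- boxplot.stats(toy$RATIO, coef = 3.0)$out

# Keep the full rows so the associated variables are shown with each outlier
outlier_rows <- toy[toy$RATIO %in% mild_or_extreme, ]
print(outlier_rows)

# In the knitted document the rows could then be shown with knitr::kable(outlier_rows)
```

Matching on the values returned in $out, rather than re-deriving the fences by hand, keeps the selection consistent with what the boxplot itself draws.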

Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: (One A1/Infant outlier has a relatively high number of rings (6) together with a whole weight and volume more typical of older abalones. In addition, volume and whole weight seem to drop as the abalones get older.)


(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS (Davies Section 14.3.2). There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: (The boxplots indicate that for the older classes of abalones the median does not differ much between A4 and A5; therefore, whole weight would not be a good predictor of age. Whole weight and volume also become more variable as the number of rings increases, so the scatterplots would not be a good tool for determining age either.)


(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

##               A1       A2       A3       A4       A5
## Female 255.29938 276.8573 412.6079 498.0489 486.1525
## Infant  66.51618 160.3200 270.7406 316.4129 318.6930
## Male   103.72320 245.3857 358.1181 442.6155 440.2074
##              A1       A2       A3       A4       A5
## Female 38.90000 42.50305 59.69121 69.05161 59.17076
## Infant 10.11332 23.41024 37.17969 39.85369 36.47047
## Male   16.39583 38.33855 52.96933 61.42726 55.02762
##               A1        A2        A3        A4        A5
## Female 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## Infant 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## Male   0.1512698 0.1564017 0.1462123 0.1364881 0.1262089
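The aggregate-then-reshape step can be sketched as below. The toy data are made up (three sexes, two classes, two replicates each), and only VOLUME is shown; SHUCK and RATIO would follow the same pattern. The key assumption to check on real data is that every SEX and CLASS combination is present, since aggregate() omits empty groups and the matrix fill would then be misaligned.

```r
# Made-up stand-in for mydata: 3 sexes x 2 classes x 2 replicates (illustrative values)
toy <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2"), rep = 1:2)
toy$VOLUME <- c(100,  50,  80, 200, 150, 180,
                120,  70, 100, 220, 170, 200)

# One mean per SEX/CLASS combination; SEX varies fastest in the result
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)

# Column-wise fill matches that ordering: each column holds one CLASS
vol_means <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
                    dimnames = list(c("Female", "Infant", "Male"),
                                    c("A1", "A2")))
print(vol_means)
```

Filling by column (byrow = FALSE) is deliberate here: it relies on the first grouping variable changing fastest in the aggregate() output.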

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
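For the base R route, a hedged sketch of the first graph follows. It uses the mean RATIO values copied from the third matrix in (5)(a), converts them to long format, and passes them to interaction.plot(); the axis labels are illustrative. The VOLUME and SHUCK graphs would follow the same pattern with their own matrices.

```r
# Mean RATIO by SEX and CLASS, copied from the third matrix in (5)(a)
ratio_means <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("Female", "Infant", "Male"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)

# Long format: one row per SEX/CLASS cell, with columns SEX, CLASS, RATIO
long <- as.data.frame(as.table(ratio_means), responseName = "RATIO")

# One line per sex, CLASS on the horizontal axis
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, trace.label = "SEX",
                 xlab = "CLASS", ylab = "Mean RATIO")
```

With the raw data, interaction.plot() could also be given mydata$CLASS, mydata$SEX and mydata$RATIO directly, since its default summary function is the mean.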

Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: (After Class A4, there is no clear suggestion of volume or shuck weight being able to predict whether an abalone is male or female)

(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2, go ahead.
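The subsetting and layout described above can be sketched in base R as follows. The toy data frame is made up and far too small to say anything about growth; it only shows how subset() and the formula interface of boxplot() fit together. The panel titles are illustrative.

```r
# Made-up stand-in for mydata (values are illustrative only)
toy <- data.frame(
  SEX    = factor(c("I", "I", "I", "F", "M", "M", "F", "M")),
  RINGS  = c(5L, 5L, 7L, 9L, 9L, 12L, 16L, 20L),
  VOLUME = c(40, 60, 150, 350, 400, 500, 600, 650),
  WHOLE  = c(10, 15, 45, 110, 130, 160, 190, 210)
)

young   <- subset(toy, RINGS < 16)      # drop the sparse high-ring observations
infants <- subset(young, SEX == "I")
adults  <- subset(young, SEX != "I")    # "M" and "F" combined

old_par <- par(mfrow = c(2, 2))         # first row VOLUME, second row WHOLE
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100), main = "Infant VOLUME")
boxplot(VOLUME ~ RINGS, data = adults,  ylim = c(0, 1100), main = "Adult VOLUME")
boxplot(WHOLE ~ RINGS,  data = infants, ylim = c(0, 400),  main = "Infant WHOLE")
boxplot(WHOLE ~ RINGS,  data = adults,  ylim = c(0, 400),  main = "Adult WHOLE")
par(old_par)                            # restore the previous layout
```

Using the same ylim within each row is what makes the infant and adult panels directly comparable.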

Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: (The displays suggest that abalones reach most of their volume as adults rather than as infants. Even with the number of rings limited to fewer than 16, the graphs show little difference in whole weight or volume between adults and infants.)


Conclusions

Please respond to each of the following questions (10 points total):

Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: (The presence of outliers, and the failure to correct for them, may explain the failure of the original study. Physical measurements can serve as a starting point for further analysis and testing as indicators of age, but conclusions should not be based solely on observational measurements.)

Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: (Questions I would ask include: How long has the study been going on? How was the sample obtained? Where was the sample obtained? Is the sample size big enough to be representative of the population?)

Question 3) (2 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: Observational studies contain outliers that make it more difficult to draw firm conclusions. Observational studies may also fail to consider other environmental factors (weather, location). According to the article “https://www.healthnewsreview.org/toolkit/tips-for-understanding-studies/does-the-language-fit-the-evidence-association-versus-causation/”, causality may be determined if we know that the sampling was random. However, since observational studies are generally not random, in that other factors or exposures cannot be controlled, we cannot assume that causality can be determined from this abalones study. Observational studies teach us that we need a control sample before developing conclusions further.