R Markdown is a plain-text file format for integrating text and R code and for creating transparent, reproducible and interactive reports. An R Markdown file (.Rmd) contains metadata, markdown and R code “chunks,” and can be “knit” into numerous output types. Answer the test questions by adding R code to the fenced code areas below each item. Some questions also require a written answer. Enter your written answers in the space provided, as shown below:

Answer: (Enter your answer here.)

Once completed, you will “knit” and submit the resulting .html document and the .Rmd file. The .html will present the output of your R code and your written answers, but your R code will not appear. Your R code will appear in the .Rmd file. The resulting .html document will be graded. Points assigned to each item appear in this template.

Before proceeding, look to the top of the .Rmd for the (YAML) metadata block, where the title, author and output are given. Please change author to include your name, with the format ‘lastName, firstName.’

If you encounter issues with knitting the .html, please send an email via Canvas to your TA.

Each code chunk is delineated by six (6) backticks: three (3) at the start and three (3) at the end. After the opening ticks, arguments are passed to the code chunk in curly brackets. Please do not add or remove backticks, or modify the arguments or values inside the curly brackets. An example code chunk is included here:

# Comments are included in each code chunk, simply as prompts

#...R code placed here

#...R code placed here

R code only needs to be added inside the code chunks for each assignment item. However, there are questions that follow many assignment items. Enter your answers in the space provided. An example showing how to use the template and respond to a question follows.


Example Problem with Solution:

Use rbinom() to generate two random samples of size 10,000 from the binomial distribution. For the first sample, use p = 0.45 and n = 10. For the second sample, use p = 0.55 and n = 10. Convert the sample frequencies to sample proportions and compute the mean number of successes for each sample. Present these statistics.

set.seed(123)
sample.one <- table(rbinom(10000, 10, 0.45)) / 10000
sample.two <- table(rbinom(10000, 10, 0.55)) / 10000

successes <- seq(0, 10)

round(sum(sample.one*successes), digits = 1) # [1] 4.5
## [1] 4.5
round(sum(sample.two*successes), digits = 1) # [1] 5.5
## [1] 5.5

Question: How do the simulated expectations compare to calculated binomial expectations?

Answer: The calculated binomial expectations are 10(0.45) = 4.5 and 10(0.55) = 5.5. After rounding the simulated results, the same values are obtained.


Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, “setup” code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.


The following code chunk will:

  1. load the “ggplot2”, “gridExtra” and “knitr” packages, assuming each has been installed on your machine,
  2. read-in the abalones dataset, defining a new data frame, “mydata,”
  3. return the structure of that data frame, and
  4. calculate new variables, VOLUME and RATIO.

Do not include package installation code in this document. Packages should be installed via the Console or ‘Packages’ tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a file.path() is specified, R will look in the directory where this .Rmd is stored when knitting.
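The setup chunk itself does not appear in the knitted output. As a minimal sketch of steps 2 to 4, assuming VOLUME is LENGTH × DIAM × HEIGHT and RATIO is SHUCK / VOLUME (definitions that agree with the values printed later in this document), the calculation could look like the following. The three rows are copied from the outlier table in (3)(b) so that the block runs without abalones.csv; the read.csv() call in the comment is an assumption about how the file is loaded.

```r
# Assumed loading call in the real template (not run here):
#   mydata <- read.csv("abalones.csv", sep = ",", stringsAsFactors = TRUE)
# Stand-in: three rows copied from the outlier table in (3)(b)
mydata <- data.frame(
  SEX    = factor(c("I", "M", "F"), levels = c("F", "I", "M")),
  LENGTH = c(10.080, 13.440, 7.980),
  DIAM   = c(7.350, 10.815, 6.720),
  HEIGHT = c(2.205, 1.680, 2.415),
  WHOLE  = c(79.37500, 130.25000, 80.93750),
  SHUCK  = c(44.00000, 63.73125, 40.37500),
  RINGS  = c(6L, 10L, 7L),
  CLASS  = factor(c("A1", "A3", "A2"),
                  levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Step 3: return the structure of the data frame
str(mydata)

# Step 4: derived variables (assumed definitions)
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO  <- mydata$SHUCK / mydata$VOLUME

round(mydata[, c("VOLUME", "RATIO")], 4)
```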

## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

Test items start here. There are 6 sections.

Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata.
Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176
##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

Answer: (Most variables in this data set are continuous numeric measurements; the exceptions are RINGS (integer counts) and SEX and CLASS (factors). Potential skewness should be noted in the variables with larger differences between mean and median, such as VOLUME and SHUCK. Some outliers may occur in variables such as SHUCK, since the minimum and maximum values are quite distant from the median.)

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)).
Lastly, present a barplot of these data, ignoring the marginal totals.
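One way to approach this item is sketched here. To keep the block runnable without the data file, the counts are typed in from the SEX-by-CLASS table presented in this document; with the full data frame the table would presumably come from table(mydata$SEX, mydata$CLASS). The plot labels are illustrative choices.

```r
# Counts copied from the SEX-by-CLASS table presented in this document
counts <- matrix(
  c( 5,  41, 121, 82, 77,
    91, 133,  65, 21, 19,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX   = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
tab <- as.table(counts)

# 15 cells plus a "Sum" row and a "Sum" column
addmargins(tab)

# Barplot of the table without margins:
# one group of bars per CLASS, one bar per SEX
barplot(tab, beside = TRUE, legend.text = rownames(tab),
        xlab = "CLASS", ylab = "Frequency",
        main = "Abalone counts by SEX and CLASS")
```

Passing the original table (not the result of addmargins()) to barplot() is what keeps the marginal totals out of the chart.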

##      
##         A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: (It stands out that there are more infants in A2 than there are in A1. There seem to be infant abalones in classes A3, A4, and A5; this doesn’t really make sense, so I would think that these infants could not be classified as female or male. Also, there are more male abalones than females in every class, although the difference between male and female counts is smaller in A4 and A5. This suggests that many female abalones die after the A3 age classification.)

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.”

Use set.seed(123) prior to drawing this sample. Do not change the number 123.

Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
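The row-index approach described above can be sketched as follows. The data frame here is a hypothetical stand-in with 1036 simulated rows and the same column layout (SEX first, then five continuous variables), not the abalone data; only the sampling and plotting steps are the point.

```r
# Hypothetical stand-in for mydata (simulated, NOT the abalone data)
set.seed(1)                       # seed for the stand-in data only
n    <- 1036
size <- runif(n, 0.2, 1)          # latent "size" driving all measurements
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = 17  * size   + rnorm(n, sd = 0.3),
  DIAM   = 13  * size   + rnorm(n, sd = 0.3),
  HEIGHT = 5   * size   + rnorm(n, sd = 0.2),
  WHOLE  = 300 * size^3 + rnorm(n, sd = 5),
  SHUCK  = 150 * size^3 + rnorm(n, sd = 3)
)

# Sample row numbers, then select those rows
set.seed(123)
rows <- sample(seq_len(nrow(mydata)), size = 200)  # without replacement
work <- mydata[rows, ]

# Scatterplot matrix of columns 2 to 6
plot(work[, 2:6])
```

Because sample() draws without replacement by default, no row of the data frame can appear twice in “work.”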



Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS.

As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE.

Add to the chart a straight line with zero intercept using this maximum value as the slope of the line.

If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot.

Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
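A minimal base R sketch of the reference line follows. It uses six rows copied from the outlier table in (3)(b) as stand-in data, so the slope it prints is the maximum for those six rows only, not for the full data set; the object name few_rows is hypothetical.

```r
# Six rows copied from the outlier table in (3)(b); a stand-in for mydata
few_rows <- data.frame(
  WHOLE = c(79.37500, 6.18750, 130.25000, 176.12500, 80.93750, 133.87500),
  SHUCK = c(44.00000, 2.93750,  63.73125,  99.00000, 40.37500,  38.25000),
  CLASS = factor(c("A1", "A1", "A3", "A3", "A2", "A5"),
                 levels = c("A1", "A2", "A3", "A4", "A5"))
)

# Largest observed SHUCK-to-WHOLE ratio, used as the slope
max_ratio <- max(few_rows$SHUCK / few_rows$WHOLE)
max_ratio

# Scatterplot colored by CLASS, WHOLE on the horizontal axis
plot(few_rows$WHOLE, few_rows$SHUCK,
     col = as.integer(few_rows$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE")

# abline(a, b): a is the intercept, b is the slope
abline(a = 0, b = max_ratio, lty = 2)
legend("topleft", legend = levels(few_rows$CLASS),
       col = seq_along(levels(few_rows$CLASS)), pch = 16, title = "CLASS")
```

By construction no point can lie above this line, which makes it a convenient upper bound when comparing age classes.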

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: (There is less variability in the plot in (b) compared to the plot in (a). This is probably because plot (b) is comparing weight to weight, where plot (a) is comparing weight to volume (dimensions). Also, this suggests that abalone volume is less correlated to whole weight compared to shuck weight. As abalones mature, it appears the shell becomes heavier in proportion to the total weight. Additionally, variability increases as the abalones mature.)


Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
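The 3 × 3 layout can be sketched in base R as shown here. The RATIO values are simulated from a lognormal distribution purely as a hypothetical stand-in, so the shapes in the plots say nothing about the abalone data; the margin settings are one possible choice for legibility.

```r
# Hypothetical stand-in data (simulated, NOT the abalone data)
set.seed(42)
sim <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)
sex_levels <- levels(sim$SEX)

# mfrow fills the grid row by row: histograms, then boxplots, then Q-Q plots
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))

for (s in sex_levels) {
  hist(sim$RATIO[sim$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in sex_levels) {
  boxplot(sim$RATIO[sim$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in sex_levels) {
  x <- sim$RATIO[sim$SEX == s]
  qqnorm(x, main = paste("Q-Q plot:", s))
  qqline(x)   # reference line through the first and third quartiles
}

par(mfrow = c(1, 1))   # restore the single-panel layout
```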

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: (The distributions appear to correspond accurately with the normal QQ plot. The Female quantities appear to have the most extreme outliers among the three groups. The outliers can be seen on the upper ends in both the boxplot and qqnorm visualizations. While the Female group has the most extreme outlier, it appears that the Infant group has the largest number of outliers.)

(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).
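One possible way to flag the outliers is sketched here with boxplot.stats(), which applies the same fence rule as boxplot(). It assumes the common convention that values beyond 1.5 × IQR from the hinges are outliers and values beyond 3 × IQR are extreme; check this against the course definition. The data are a simulated, hypothetical stand-in.

```r
# Hypothetical stand-in data (simulated, NOT the abalone data)
set.seed(42)
sim <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)

# Flag outliers within each SEX group; ave() returns 0/1, so convert back
is_outlier <- as.logical(ave(sim$RATIO, sim$SEX, FUN = function(x) {
  x %in% boxplot.stats(x, coef = 1.5)$out   # mild or extreme
}))
is_extreme <- as.logical(ave(sim$RATIO, sim$SEX, FUN = function(x) {
  x %in% boxplot.stats(x, coef = 3)$out     # extreme only
}))

# Keep the full rows so the associated variables are shown too
outliers <- sim[is_outlier, ]
outliers$EXTREME <- is_extreme[is_outlier]

print(outliers)   # in the .Rmd, knitr::kable(outliers) gives a formatted table
```

Flagging by group matters because each sex has its own quartiles; a single call on the pooled RATIO column would use one set of fences for all three groups.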

## [1] 0.2693371 0.2218308 0.2403394 0.2263294 0.2249577 0.2300704 0.2290478
## [8] 0.2232339
## [1] 0.2609861 0.2378764 0.2345924 0.2356349 0.2286735
## [1] 0.31176204 0.21216140 0.21465603 0.21306058 0.23497668 0.06733877
     SEX  LENGTH    DIAM  HEIGHT      WHOLE      SHUCK  RINGS  CLASS      VOLUME      RATIO
3    I    10.080   7.350   2.205   79.37500   44.00000      6  A1     163.364040  0.2693371
37   I     4.305   3.255   0.945    6.18750    2.93750      3  A1      13.242072  0.2218308
42   I     2.835   2.730   0.840    3.62500    1.56250      4  A1       6.501222  0.2403394
58   I     6.720   4.305   1.680   22.62500   11.00000      5  A1      48.601728  0.2263294
67   I     5.040   3.675   0.945    9.65625    3.93750      5  A1      17.503290  0.2249577
89   I     3.360   2.310   0.525    2.43750    0.93750      4  A1       4.074840  0.2300704
105  I     6.930   4.725   1.575   23.37500   11.81250      7  A2      51.572194  0.2290478
200  I     9.135   6.300   2.520   74.56250   32.37500      8  A2     145.027260  0.2232339
746  M    13.440  10.815   1.680  130.25000   63.73125     10  A3     244.194048  0.2609861
754  M    10.500   7.770   3.150  132.68750   61.13250      9  A3     256.992750  0.2378764
803  M    10.710   8.610   3.255  160.31250   70.41375      9  A3     300.153640  0.2345924
810  M    12.285   9.870   3.465  176.12500   99.00000     10  A3     420.141472  0.2356349
852  M    11.550   8.820   3.360  167.56250   78.27187     10  A3     342.286560  0.2286735
350  F     7.980   6.720   2.415   80.93750   40.37500      7  A2     129.505824  0.3117620
379  F    15.330  11.970   3.465  252.06250  134.89812     10  A3     635.827846  0.2121614
420  F    11.550   7.980   3.465  150.62500   68.55375     10  A3     319.365585  0.2146560
421  F    13.125  10.290   2.310  142.00000   66.47062      9  A3     311.979938  0.2130606
458  F    11.445   8.085   3.150  139.81250   68.49062      9  A3     291.478399  0.2349767
586  F    12.180   9.450   4.935  133.87500   38.25000     14  A5     568.023435  0.0673388

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: (There are nineteen outliers. Female abalone seem to have the most extreme outliers. The outliers seem to be pretty evenly distributed amongst the three groups.)


Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: (The variables VOLUME, WHOLE, RINGS, and CLASS appear to be useful predictors of age. From both the boxplots and the scatterplots, it can be seen that increases in VOLUME and WHOLE are positively correlated with increases in CLASS and RINGS, respectively. The scatterplots appear to have fewer values from 15 to 25 rings, so this may or may not be enough information to accurately predict age.)


Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
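The aggregate-then-reshape step can be sketched as follows for one variable. The data are a hypothetical simulated stand-in in which every SEX and CLASS combination occurs; the reshape relies on aggregate() returning rows with the first grouping variable (SEX) varying fastest, so filling a three-row matrix column by column puts SEX in rows and CLASS in columns.

```r
# Hypothetical stand-in data with every SEX x CLASS combination present
set.seed(7)
grid <- expand.grid(SEX   = c("F", "I", "M"),
                    CLASS = c("A1", "A2", "A3", "A4", "A5"))
sim <- grid[rep(seq_len(nrow(grid)), each = 4), ]   # 4 rows per combination
sim$VOLUME <- runif(nrow(sim), min = 50, max = 500)

# Mean VOLUME for each combination of SEX and CLASS
agg <- aggregate(VOLUME ~ SEX + CLASS, data = sim, FUN = mean)

# SEX varies fastest in agg, so fill the matrix column by column
vol_mat <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
                  dimnames = list(c("Female", "Infant", "Male"),
                                  levels(sim$CLASS)))
print("Volume")
vol_mat
```

The same pattern would apply to SHUCK and RATIO. If any combination were missing from the data, agg would have fewer than 15 rows and the matrix fill would no longer line up, so it is worth checking nrow(agg) first.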

## [1] "Volume"
##               A1       A2       A3       A4       A5
## Female 255.29938 276.8573 412.6079 498.0489 486.1525
## Infant  66.51618 160.3200 270.7406 316.4129 318.6930
## Male   103.72320 245.3857 358.1181 442.6155 440.2074
## [1] "Shuck"
##              A1       A2       A3       A4       A5
## Female 38.90000 42.50305 59.69121 69.05161 59.17076
## Infant 10.11332 23.41024 37.17969 39.85369 36.47047
## Male   16.39583 38.33855 52.96933 61.42726 55.02762
## [1] "Ratio"
##               A1        A2        A3        A4        A5
## Female 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
## Infant 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
## Male   0.1512698 0.1564017 0.1462123 0.1364881 0.1262089

(5)(b) (3 points) Present three graphs.

Each graph should include three lines, one for each SEX (the trace factor).

The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
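As an illustration of the base R route, the sketch here draws the first graph (mean RATIO versus CLASS) from the “Ratio” matrix of means printed in (5)(a). The means are typed in from that output so the block runs on its own; with the full data, the raw RATIO, CLASS and SEX columns could be passed to interaction.plot() directly, since it computes the group means itself.

```r
# Mean RATIO by SEX and CLASS, copied from the "Ratio" matrix in (5)(a)
ratio_means <- matrix(
  c(0.1546644, 0.1554605, 0.1450304, 0.1379609, 0.1233605,
    0.1569554, 0.1475600, 0.1372256, 0.1244413, 0.1167649,
    0.1512698, 0.1564017, 0.1462123, 0.1364881, 0.1262089),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX   = c("Female", "Infant", "Male"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)

# Long format: one row per SEX x CLASS cell
long <- as.data.frame(as.table(ratio_means), responseName = "RATIO")

# x.factor on the horizontal axis, one line per level of trace.factor
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean, type = "b",
                 col = c("red", "darkgreen", "blue"),
                 xlab = "CLASS", ylab = "Mean RATIO", trace.label = "SEX",
                 main = "Mean RATIO per CLASS")
```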

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: (I noticed that female VOLUME and female SHUCK weight both increase significantly between A2 and A3. Additionally, it can be seen that female RATIO drops rapidly after A4. I would like to learn more about why A2 to A3 seems to be important for females. If this age range is related to reproductive life cycles, then it may be useful to establish whether abalone should not be farmed prior to A3, in order to maintain a higher species population. As abalones grow older, their RATIO (shuck weight/volume) decreases; this can either imply that the abalone meat decreases in size, or it can imply that the abalone’s shells continue to grow until the end of their life span. The truth is probably somewhere in between these two assumptions, as SHUCK values decrease from A4 to A5 in the third graph.)

(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings.

One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
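The subset() approach can be sketched in base R as shown here. The data are a hypothetical simulated stand-in (VOLUME and WHOLE simply scale with RINGS), so the plots illustrate the layout and axis limits only, not abalone growth.

```r
# Hypothetical stand-in data (simulated, NOT the abalone data)
set.seed(11)
n <- 400
sim <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:25, n, replace = TRUE)
)
sim$VOLUME <- sim$RINGS * runif(n, 20, 45)
sim$WHOLE  <- sim$RINGS * runif(n, 5, 15)

# Restrict to fewer than 16 rings, then split infants from adults
young   <- subset(sim, RINGS < 16)
infants <- subset(young, SEX == "I")
adults  <- subset(young, SEX != "I")    # "M" and "F" combined

par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME by RINGS", xlab = "RINGS", ylab = "VOLUME")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME by RINGS", xlab = "RINGS", ylab = "VOLUME")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE by RINGS", xlab = "RINGS", ylab = "WHOLE")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE by RINGS", xlab = "RINGS", ylab = "WHOLE")
par(mfrow = c(1, 1))
```

Using the same ylim for the infant and adult panels in each row is what makes the two groups directly comparable.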

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: (These displays show a consistent correlation between Volume and Whole Weight growth in relation to Rings up to about 10 rings. After ring 10, it appears that growth slows down. It can be noted that the infant graphs have shorter whiskers on most boxplots. This means we would expect to see less variation in volume and weight at each value for rings. This makes sense, as infants should have less time than adults to grow (or not grow), so there is less possibility of variation in both volume and whole weight.)


Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: (Based on the data from this analysis, it is not easy to identify major differences between male, female, and infant abalones. For example, it is unclear why the infant relationships in question 5 display trends similar to those of adults. The number of infants with twelve rings appears to be very high in this sample (which could explain the right skew in the data), but this is unexpected since the original study assumes that there is a strong correlation between rings and age. A possible failure in this study is that the SEX variable is not explained in enough detail; the analyses conducted here show how gender affects variables such as weight depending on the age classification of abalones. However, I do not believe that enough analysis was done on gender identification, as there seemed to be too many infant abalones in the later age classifications (A4 and A5).)

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: (If I were given a histogram and summary statistics from a sample of some population, I would need to understand the origin of the data. How was the data sourced? Is the data outdated? Were there any known biases necessary to obtain a complete dataset? What confidence interval was used for this study, and how close is it to the true population proportion?)

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: (Observational studies may be limited to the extent of both human and environmental biases. The study is an outcome of the quality of data in addition to the conditions and level of understanding the researcher may have. Throughout exploratory data analysis, researchers may rely on identifying correlations between different variables, and this may lead to results that incorrectly assume causation. For an observational study to be successful, it is important to have deep understandings of the study’s variables in addition to learning whether there were any limitations or constraints within the data. Trends and patterns over time could be obtained from observational studies, which could lead to value added towards predictive analytics. )