## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

Test items start here. There are 6 sections, for a total of 50 points.

### Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176
##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

Answer: SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous measurements. All of the numeric variables span a wide range of values. For LENGTH, DIAM, and HEIGHT the mean and median are close, with HEIGHT showing the smallest difference (2.947 versus 2.940), which is consistent with a roughly symmetric distribution. For every numeric variable except LENGTH and DIAM the mean exceeds the median, which indicates right skew: a few high values pull the average up. LENGTH and DIAM have means below their medians, which points to mild left skew; a skewness statistic (rather than kurtosis, which describes tail weight) would be the appropriate check on symmetry. In the CLASS by RINGS table, each class covers a contiguous, non-overlapping range of ring counts, so CLASS is effectively an ordered grouping of RINGS, and ring count serves as the measure of age. The largest class is A3 (9 and 10 rings, 329 abalone), so most abalone in this dataset are middle-aged. The youngest class, A1 (fewer than 7 rings), is the smallest with 108 abalone, and counts decline beyond 13 rings. The oldest abalone therefore sit in a long right tail of RINGS and are candidates for outliers.
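
The direction of skew suggested by comparing means and medians can be checked numerically. The following is a minimal sketch, not part of the original analysis: `skewness()` is a hypothetical helper implementing the moment-based sample skewness, and the vectors are synthetic rather than columns of mydata.

```r
# Hypothetical helper: moment-based sample skewness (third central moment
# divided by the 1.5 power of the second central moment)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rlnorm(1000)   # synthetic data with a long right tail
left_skewed  <- -right_skewed  # mirror image: long left tail
symmetric    <- rnorm(1000)    # synthetic symmetric data

# Positive values indicate right skew, negative values left skew,
# and values near zero indicate approximate symmetry
print(c(right = skewness(right_skewed),
        left  = skewness(left_skewed),
        sym   = skewness(symmetric)))
```

Applied to a column such as `mydata$LENGTH`, a clearly negative value would support left skew, while a value near zero would support approximate symmetry.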

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.

## Margins computed over dimensions
## in the following order:
## 1: Sex
## 2: Class
##      Class
## Sex     A1   A2   A3   A4   A5  sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   sum  108  236  329  188  175 1036
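
A table like the one above can be produced along the following lines. This is a sketch under stated assumptions: the data frame is a synthetic stand-in for mydata, and the CLASS cut points are inferred from the CLASS by RINGS table in (1)(a).

```r
# Synthetic stand-in for mydata (hypothetical values, not the assignment data)
set.seed(1)
n <- 300
rings <- sample(3:20, n, replace = TRUE)
mydata <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  CLASS = cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
              labels = c("A1", "A2", "A3", "A4", "A5"))
)

counts <- table(Sex = mydata$SEX, Class = mydata$CLASS)  # 3 x 5 = 15 cells
with_margins <- addmargins(counts)  # default adds a "Sum" row and column
print(with_margins)

# Plot the counts only; the marginal totals are deliberately left out
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "Class", ylab = "Frequency",
        main = "Abalone counts by sex and class")
```

Passing `counts` rather than `with_margins` to `barplot()` is what keeps the marginal totals out of the chart.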

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Answer: The first thing that stands out is that infant abalone are present in every class, including the classes believed to be older. Most infants fall in A1 and A2 (224 of 329), which makes sense because these are the youngest classes by ring count, but 105 infants still appear in classes A3 to A5. Classes A3, A4, and A5 hold most of the male and female abalone, which indicates that abalone with a median or higher ring count are mostly adults. Males outnumber females in every class, and they outnumber infants in classes A3 to A5. I wonder whether the sex of the abalone affects other variables, such as ring count, whole weight, or volume. Because there are more males than females, males could be pulling the mean up and contributing to the right skew of the distributions.

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
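
The sampling step can be sketched as follows. The data frame here is a synthetic stand-in with the same row count and column order as mydata (the values and the relationships between columns are assumptions made for illustration); only `set.seed(123)`, the sample size of 200, and `plot(work[, 2:6])` come from the instructions.

```r
# Synthetic stand-in for mydata: 1036 rows, continuous variables in columns 2-6
set.seed(1)
n <- 1036
len <- runif(n, 2.7, 16.8)
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = len,
  DIAM   = 0.8 * len + rnorm(n, sd = 0.3),
  HEIGHT = 0.27 * len + rnorm(n, sd = 0.2),
  WHOLE  = 0.08 * len^3 * exp(rnorm(n, sd = 0.15)),
  SHUCK  = 0.035 * len^3 * exp(rnorm(n, sd = 0.2))
)

set.seed(123)                                      # seed fixed by the assignment
idx  <- sample(seq_len(nrow(mydata)), size = 200)  # row indices, without replacement
work <- mydata[idx, ]                              # select the sampled rows

plot(work[, 2:6])                                  # scatterplot matrix of columns 2-6
```

Sampling row indices and then subsetting keeps each observation's variables together, which sampling a single column would not.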


### Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
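
A base R sketch of this plot is given below. The data are synthetic (the range of SHUCK as a fraction of WHOLE is an assumption), and the CLASS cut points are inferred from the table in (1)(a); the point of the example is the `abline()` call with zero intercept and the maximum ratio as slope.

```r
# Synthetic stand-in for mydata (hypothetical values)
set.seed(1)
n <- 300
rings <- sample(3:20, n, replace = TRUE)
whole <- runif(n, 2, 300)
mydata <- data.frame(
  WHOLE = whole,
  SHUCK = whole * runif(n, 0.30, 0.55),  # SHUCK is a part of WHOLE
  CLASS = cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
              labels = c("A1", "A2", "A3", "A4", "A5"))
)

# Largest observed SHUCK-to-WHOLE ratio, used as the slope of the line
max_slope <- max(mydata$SHUCK / mydata$WHOLE)

plot(mydata$WHOLE, mydata$SHUCK, col = as.integer(mydata$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE")
abline(a = 0, b = max_slope, lty = 2)    # a = intercept, b = slope
legend("topleft", legend = levels(mydata$CLASS),
       col = seq_along(levels(mydata$CLASS)), pch = 16, title = "Class")
```

Because the slope is the maximum ratio, every point lies on or below the line, so the vertical gap to the line shows how far each abalone falls short of the highest observed shuck share.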

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Answer: The plot in (2)(b), SHUCK versus WHOLE, shows less variability than the plot in (2)(a), WHOLE versus VOLUME. This is probably because shuck weight is a component of whole weight, so the two are more closely correlated. However, for some classes shuck weight seems to drop relative to whole weight as whole weight increases. Since the plot is color-coded by CLASS, this suggests that the shuck share of whole weight may decrease as the abalone matures. We would want to check whether age class plays a role in the proportion of shuck weight to whole weight.


### Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
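
One way to lay out the 3 by 3 grid in base R is sketched below. The RATIO values are synthetic (drawn from a log-normal distribution purely for illustration), and the margin settings are an assumption chosen to keep the panels legible.

```r
# Synthetic stand-in for mydata (hypothetical RATIO values)
set.seed(1)
mydata <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)

sexes <- levels(mydata$SEX)
old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))  # panels fill row by row

# Row 1: histograms
for (s in sexes) {
  hist(mydata$RATIO[mydata$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
# Row 2: boxplots
for (s in sexes) {
  boxplot(mydata$RATIO[mydata$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
# Row 3: normal Q-Q plots with reference lines
for (s in sexes) {
  r <- mydata$RATIO[mydata$SEX == s]
  qqnorm(r, main = paste("Q-Q plot:", s))
  qqline(r)
}

par(old_par)  # restore the previous graphics settings
```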

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

Answer: The boxplots show outliers in all three distributions, which is consistent with the histograms: each is positively skewed with a long right tail. This is most noticeable in the boxplot for female RATIO, which shows an upper outlier. The female Q-Q plot also shows points above the reference line near the right tail, which agrees with both the boxplot and the histogram.

(3)(b) (2 points) The boxplots in (3)(a) indicate that there are outlying RATIOs for each sex. boxplot.stats() can be used to identify outlying values of a vector. Present the abalones with these outlying RATIO values along with their associated variables in “mydata”. Display the observations by passing a data frame to the kable() function. Basically, we want to output those rows of “mydata” with an outlying RATIO, but we want to determine outliers looking separately at infants, females and males.

|     | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|:----|:----|-------:|-----:|-------:|------:|------:|------:|:------|-------:|------:|
| 3   | I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| 37  | I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| 42  | I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| 58  | I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| 67  | I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| 89  | I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| 105 | I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| 200 | I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| 350 | F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| 379 | F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| 420 | F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| 421 | F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| 458 | F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| 586 | F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
| 746 | M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| 754 | M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| 803 | M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| 810 | M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| 852 | M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
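
The per-sex outlier selection can be sketched as follows. This is one possible approach on synthetic data, not necessarily the code used to produce the table: `boxplot.stats()` is applied to RATIO within each sex, and the matching row indices are collected.

```r
# Synthetic stand-in for mydata (hypothetical RATIO values)
set.seed(1)
mydata <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = rlnorm(300, meanlog = log(0.14), sdlog = 0.2)
)

# Split the row indices by sex, then keep the rows whose RATIO is among
# the values boxplot.stats() reports as outliers for that sex
rows_by_sex <- split(seq_len(nrow(mydata)), mydata$SEX)
outlier_rows <- unlist(lapply(rows_by_sex, function(i) {
  r <- mydata$RATIO[i]
  i[r %in% boxplot.stats(r)$out]
}), use.names = FALSE)

outliers <- mydata[outlier_rows, ]
print(outliers)  # in a knitted report this data frame could be passed to kable()
```

Computing the outliers within each sex matters because a value that is extreme for one sex may be unremarkable in the pooled data, and the reverse.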

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Answer: In this table, the infant outliers have smaller measurements than the male and female outliers. Infants account for the largest share of the outliers (8 of 19, against 6 female and 5 male), and those infants sit at the low end of the size range. All but one of the outliers have a high RATIO; the exception is female observation #586, whose RATIO of 0.067 is the only low outlier. Although male abalone tend to be heavier and larger in volume than female abalone, some female outliers are heavier and larger than every male outlier in the table, for example female observation #379.


### Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

Answer: I think VOLUME and WHOLE may be useful, though imperfect, predictors of age. Both appear to increase roughly linearly across the age classes. The class that is hardest to predict is A5, the oldest class, and the same variability appears in the scatterplots. Although each class seems to fall within its own range of volume, whole weight, and ring count, the variability among A5 abalone makes measurements other than ring count unreliable.


### Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

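Mean VOLUME by SEX and CLASS

|     | A1 | A2 | A3 | A4 | A5 |
|:----|--------:|--------:|--------:|--------:|--------:|
| F | 255.299 | 276.857 | 412.608 | 498.049 | 486.153 |
| I | 66.516 | 160.320 | 270.741 | 316.413 | 318.693 |
| M | 103.723 | 245.386 | 358.118 | 442.616 | 440.207 |

Mean SHUCK by SEX and CLASS

|     | A1 | A2 | A3 | A4 | A5 |
|:----|-------:|-------:|-------:|-------:|-------:|
| F | 38.900 | 42.503 | 59.691 | 69.052 | 59.171 |
| I | 10.113 | 23.410 | 37.180 | 39.854 | 36.470 |
| M | 16.396 | 38.339 | 52.969 | 61.427 | 55.028 |

Mean RATIO by SEX and CLASS

|     | A1 | A2 | A3 | A4 | A5 |
|:----|------:|------:|------:|------:|------:|
| F | 0.155 | 0.155 | 0.145 | 0.138 | 0.123 |
| I | 0.157 | 0.148 | 0.137 | 0.124 | 0.117 |
| M | 0.151 | 0.156 | 0.146 | 0.136 | 0.126 |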
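
Row and column labels are easy to misalign when reshaping `aggregate()` output by hand. The sketch below, on synthetic data with VOLUME only, takes the labels from the factor levels so that they follow the order in which `aggregate()` returns the means. The guard on the number of rows is an added assumption: it stops the code if any SEX and CLASS combination is empty, since the matrix would otherwise be filled incorrectly.

```r
# Synthetic stand-in for mydata (hypothetical values)
set.seed(1)
n <- 300
rings <- sample(3:20, n, replace = TRUE)
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  CLASS  = cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
               labels = c("A1", "A2", "A3", "A4", "A5")),
  VOLUME = rings * 30 * exp(rnorm(n, sd = 0.3))
)

agg <- aggregate(VOLUME ~ SEX + CLASS, data = mydata, FUN = mean)
stopifnot(nrow(agg) == 15)  # all 3 x 5 combinations must be present

# aggregate() varies SEX fastest, so filling a 3-row matrix column by column
# gives rows = SEX levels and columns = CLASS levels, each in factor order
vol_means <- matrix(agg$VOLUME, nrow = 3,
                    dimnames = list(levels(mydata$SEX), levels(mydata$CLASS)))
print(round(vol_means, 3))
```

The same pattern applies to SHUCK and RATIO by swapping the response variable in the formula.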

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
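
A base R sketch using `interaction.plot()` is shown below. The data are synthetic, with SHUCK constructed as RATIO times VOLUME to mirror the definition of RATIO; the numeric ranges and the one-row layout are assumptions made for illustration.

```r
# Synthetic stand-in for mydata (hypothetical values)
set.seed(1)
n <- 300
rings  <- sample(3:20, n, replace = TRUE)
volume <- pmin(rings, 12) * 35 * exp(rnorm(n, sd = 0.3))
ratio  <- runif(n, 0.10, 0.18)
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  CLASS  = cut(rings, breaks = c(-Inf, 6, 8, 10, 12, Inf),
               labels = c("A1", "A2", "A3", "A4", "A5")),
  VOLUME = volume,
  RATIO  = ratio,
  SHUCK  = ratio * volume
)

old_par <- par(mfrow = c(1, 3))
for (v in c("RATIO", "VOLUME", "SHUCK")) {
  # One line per sex; each point is the mean of v within a SEX and CLASS cell
  interaction.plot(x.factor = mydata$CLASS, trace.factor = mydata$SEX,
                   response = mydata[[v]], fun = mean, type = "b",
                   col = 1:3, pch = 1:3, trace.label = "SEX",
                   xlab = "CLASS", ylab = paste("Mean", v))
}
par(old_par)
```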

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Answer: Considering aging and sex differences, female abalone have higher mean volume and shuck weight than males and infants. Mean volume and mean shuck weight generally increase for all three sexes as age class rises. However, it is curious that mean ratio declines for all three sexes, with infants beginning to decline sooner. This means that shuck weight decreases relative to volume for all sexes as age class increases.

(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
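
The subsetting and layout described here can be sketched in base R as follows. The data are synthetic, and the names `young`, `infants`, and `adults` are hypothetical; the `subset()` filter and the `ylim` values come from the instructions.

```r
# Synthetic stand-in for mydata (hypothetical values)
set.seed(1)
n <- 400
rings <- sample(3:20, n, replace = TRUE)
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS  = rings,
  VOLUME = pmin(rings, 12) * 40 * exp(rnorm(n, sd = 0.3)),
  WHOLE  = pmin(rings, 12) * 13 * exp(rnorm(n, sd = 0.3))
)

young   <- subset(mydata, RINGS < 16)            # drop the sparse tail
infants <- subset(young, SEX == "I")
adults  <- subset(young, SEX %in% c("F", "M"))   # "M" and "F" combined

old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME by RINGS")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME by RINGS")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE by RINGS")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE by RINGS")
par(old_par)
```

Using the same `ylim` within each row makes the infant and adult panels directly comparable.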

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

Answer: The boxplots suggest that abalone growth is variable in infancy. For example, the infant boxplots show more outliers, especially at 10 or fewer rings. Adult abalone show steadier growth with ring count, although they still have positive outliers, and the adult trend is closer to linear. After peaking at about 11 rings for infants and 12 rings for adults, both groups show a slight decline in volume and weight.


### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

Answer: Possible reasons include assumptions made about the data as well as possible errors in data collection. Infant abalone appear in every class, which raises the question of how the sex of an abalone relates to age and ring count. The variability within the abalone population is also not accounted for, meaning that the population is not controlled. The study mentioned that abalone take a long time to grow, so the abalone's environment may change over that period as well.

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

Answer: I’d want to know where the data came from and how they were collected. I’d also want to know whether the sample is large enough and representative of the whole population. I’d check for potential outliers. Finally, I would want to know the context of the data being presented to me: what message are the presenters trying to get across?

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer: A difficulty in analyzing these data is the variability and the lack of control over the sample of the abalone population. Causality can’t be determined: ring count is correlated with age class but not with maturity, since infant abalone appear in the older age classes as well. We need to understand the data better in order to analyze them meaningfully.