## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

Test items start here. There are 6 sections, for a total of 50 points.

### Section 1: (6 points) Summarizing the data.

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176
##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1
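
The two calls named in (1)(a) can be sketched as follows. This is a minimal illustration that assumes a small hypothetical data frame, `toy`, standing in for `mydata`; its values are invented for the example. It also shows why the real table has 115 cells: 5 CLASS levels times the 23 distinct RINGS values (3 through 25).

```r
# Toy stand-in for mydata: hypothetical values, not the abalone measurements.
toy <- data.frame(
  RINGS = c(3L, 4L, 6L, 7L, 8L, 9L, 10L, 11L, 13L, 20L),
  CLASS = factor(c("A1", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A5", "A5"),
                 levels = paste0("A", 1:5))
)

# Descriptive statistics for every column
# (quartiles for numeric columns, level counts for factors).
print(summary(toy))

# Two-way frequency table: rows are CLASS levels, columns are distinct RINGS values.
ring_by_class <- table(CLASS = toy$CLASS, RINGS = toy$RINGS)
print(ring_by_class)

# Cell count = number of CLASS levels x number of distinct RINGS values.
n_cells <- length(ring_by_class)
```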

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

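The abalone data set records the following variables for each abalone:

SEX, LENGTH, DIAMETER (DIAM), HEIGHT, WHOLE weight, SHUCK weight (without shell), RINGS, CLASS, VOLUME, and RATIO. SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous.

Looking at the summary, the mean exceeds the median for WHOLE, SHUCK, RINGS, VOLUME, and RATIO, which indicates right-skewed distributions. This makes sense: measurements of a living creature cannot be negative, so the values are bounded below at zero and large outliers drag the distribution to the right. HEIGHT is nearly symmetric, while LENGTH and DIAM are the exceptions, with means slightly below their medians.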

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.

##         CLASS
## SEX        A1   A2   A3   A4   A5  Sum
##   Female    5   41  121   82   77  326
##   Infant   91  133   65   21   19  329
##   Male     12   62  143   85   79  381
##   Sum     108  236  329  188  175 1036
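
A minimal sketch of the steps in (1)(b), assuming a hypothetical data frame `toy` in place of `mydata` (the rows are invented for illustration). The margins are added to a copy of the table so that the barplot can use the counts without the totals.

```r
# Toy stand-in for mydata: hypothetical rows, not the abalone data.
toy <- data.frame(
  SEX = factor(c("F", "I", "M", "I", "I", "M", "F", "M", "I", "F"),
               levels = c("F", "I", "M")),
  CLASS = factor(c("A3", "A1", "A3", "A2", "A1", "A4", "A5", "A2", "A2", "A4"),
                 levels = paste0("A", 1:5))
)

# 3 x 5 table of counts (15 cells).
counts <- table(SEX = toy$SEX, CLASS = toy$CLASS)

# Same table with row and column totals appended.
with_margins <- addmargins(counts)
print(with_margins)

# Grouped barplot of the counts only; the marginal totals are not plotted.
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "CLASS", ylab = "Frequency",
        main = "Abalone counts by SEX and CLASS (toy data)")
```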

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

The table shows that classes are defined by ring count: the fewer the rings, the lower the class, and vice versa. Males and females begin to appear in earnest at A2. What stands out from the barplot is that there are infants in the higher classes and adults in the lower classes. Logically, an adult cannot be an infant and an infant cannot be an adult, so this would need to be investigated.

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
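
The row-index sampling described in (1)(c) can be sketched as below. The data frame `toy` is a hypothetical 1036-row stand-in for `mydata` with deterministic, made-up columns, so the sampled rows here are only illustrative.

```r
# Hypothetical stand-in for mydata with 1036 rows and six columns.
n <- 1036
base <- seq(2, 17, length.out = n)
toy <- data.frame(
  SEX    = factor(rep(c("F", "I", "M"), length.out = n)),
  LENGTH = base,
  DIAM   = 0.8 * base,
  HEIGHT = 0.25 * base,
  WHOLE  = 0.07 * base^3,
  SHUCK  = 0.03 * base^3
)

# Fix the seed, then sample row numbers (without replacement by default).
set.seed(123)
rows <- sample(seq_len(nrow(toy)), size = 200)

# Select the sampled rows from the data frame.
work <- toy[rows, ]

# Scatterplot matrix of columns 2 through 6.
plot(work[, 2:6])
```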


### Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
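
One way to build the plot in (2)(b) with base R is sketched here, assuming a hypothetical data frame `toy` with invented WHOLE, SHUCK, and CLASS values. The line passes through the origin with slope equal to the largest SHUCK-to-WHOLE ratio, so every point lies on or below it.

```r
# Toy stand-in for mydata: hypothetical weights, not the abalone data.
toy <- data.frame(
  WHOLE = c(10, 25, 60, 110, 150, 220),
  SHUCK = c(5, 11, 27, 45, 60, 80),
  CLASS = factor(c("A1", "A1", "A2", "A3", "A4", "A5"))
)

# Largest observed ratio of SHUCK to WHOLE; used as the slope of the guide line.
max_ratio <- max(toy$SHUCK / toy$WHOLE)

# Scatterplot with WHOLE on the horizontal axis, colored by CLASS.
plot(toy$WHOLE, toy$SHUCK, col = as.integer(toy$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE (toy data)")

# abline(a, b): a is the intercept, b is the slope.
abline(a = 0, b = max_ratio, lty = 2)

legend("topleft", legend = levels(toy$CLASS),
       col = seq_along(levels(toy$CLASS)), pch = 16)
```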

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

Shuck weight and whole weight are positively correlated, as are whole weight and volume. This makes sense: the bigger the volume of the abalone, the more we would expect it to weigh, and since shuck weight is a component of whole weight, shuck weight should increase as whole weight does. The variability in both plots is very similar. We see tighter clusters toward the left side of the graphs, suggesting higher correlation at lower whole weights, and more variability as the abalones get larger. The A1 age class is consistently toward the smaller end, and overall size increases with class. The A5 class is clustered farther away from the maximum-ratio line, suggesting that the shuck ratio decreases as the abalone ages.

### Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
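
A base R sketch of the 3 x 3 layout in (3)(a) follows. It assumes simulated RATIO values in a hypothetical data frame `toy` (drawn from a log-normal distribution purely for illustration), with one column of panels per sex.

```r
# Simulated stand-in for mydata$RATIO by sex; values are not the abalone data.
set.seed(1)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 50)),
  RATIO = rlnorm(150, meanlog = log(0.14), sdlog = 0.15)
)
by_sex <- split(toy$RATIO, toy$SEX)

# 3 x 3 grid, filled row by row: histograms, then boxplots, then Q-Q plots.
old_par <- par(mfrow = c(3, 3))
for (s in names(by_sex)) {
  hist(by_sex[[s]], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in names(by_sex)) {
  boxplot(by_sex[[s]], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in names(by_sex)) {
  qqnorm(by_sex[[s]], main = paste("Q-Q plot:", s))
  qqline(by_sex[[s]])  # reference line through the first and third quartiles
}
par(old_par)  # restore the previous layout
```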

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

The distributions are not normal: all three are skewed right, though they approach normality. Each is unimodal, and the Q-Q plot for females shows that the female distribution is the closest to normal.

(3)(b) (2 points) The boxplots in (3)(a) indicate that there are outlying RATIOs for each sex. boxplot.stats() can be used to identify outlying values of a vector. Present the abalones with these outlying RATIO values along with their associated variables in “mydata”. Display the observations by passing a data frame to the kable() function. Basically, we want to output those rows of “mydata” with an outlying RATIO, but we want to determine outliers looking separately at infants, females and males.

| SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|:---|---:|---:|---:|---:|---:|---:|:---|---:|---:|
| I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
| M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
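
The per-sex outlier extraction asked for in (3)(b) can be sketched as follows. The data frame `toy` and its RATIO values are hypothetical; the point is that `boxplot.stats()` is applied within each sex separately, so a value is judged against its own group's quartiles rather than the pooled ones. The `kable()` call is left as a comment because it assumes the knitr package is installed.

```r
# Toy stand-in for mydata: hypothetical RATIO values, six per sex.
toy <- data.frame(
  SEX = factor(rep(c("F", "I", "M"), each = 6)),
  RATIO = c(0.13, 0.14, 0.15, 0.14, 0.13, 0.31,   # F: one high value
            0.12, 0.13, 0.14, 0.13, 0.12, 0.27,   # I: one high value
            0.14, 0.15, 0.14, 0.15, 0.14, 0.15)   # M: no extreme values
)

# For each sex, keep the rows whose RATIO is flagged by boxplot.stats()
# (beyond 1.5 times the hinge spread from the hinges, the default coef).
outliers <- do.call(rbind, lapply(split(toy, toy$SEX), function(d) {
  d[d$RATIO %in% boxplot.stats(d$RATIO)$out, ]
}))
print(outliers)

# In an R Markdown document the rows could be rendered with:
# knitr::kable(outliers)
```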

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

Infants have the highest number of RATIO outliers (8), followed by females (6) and then males (5). The higher number of infant outliers could be due to misclassification, as discussed in (1)(b). The female outliers are the most extreme, both large and small.


### Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
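
A base R sketch of the four-panel graphic in (4)(a), assuming simulated data in a hypothetical data frame `toy`. The CLASS cut points used here (A1: up to 6 rings, A2: 7 to 8, A3: 9 to 10, A4: 11 to 12, A5: 13 or more) are read off the CLASS by RINGS table in (1)(a); the VOLUME and WHOLE values are invented.

```r
# Simulated stand-in for mydata; measurements are not the abalone data.
set.seed(2)
toy <- data.frame(RINGS = sample(3:15, 120, replace = TRUE))
toy$CLASS <- cut(toy$RINGS, breaks = c(0, 6, 8, 10, 12, Inf),
                 labels = paste0("A", 1:5))
toy$VOLUME <- 30 * toy$RINGS + runif(120, 0, 200)
toy$WHOLE <- toy$VOLUME / 3

# 2 x 2 grid: boxplots in the first row, scatterplots in the second.
old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ CLASS, data = toy, main = "VOLUME by CLASS")
boxplot(WHOLE ~ CLASS, data = toy, main = "WHOLE by CLASS")
plot(toy$RINGS, toy$VOLUME, xlab = "RINGS", ylab = "VOLUME",
     main = "VOLUME versus RINGS")
plot(toy$RINGS, toy$WHOLE, xlab = "RINGS", ylab = "WHOLE",
     main = "WHOLE versus RINGS")
par(old_par)  # restore the previous layout
```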

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

These variables would be very poor predictors of age. The boxplots and scatterplots show that they may have some utility for predicting younger abalones, but in classes A3-A5 there is a fair amount of overlap and the predictions would be very poor. This makes sense logically: for most organisms there is very little variation in volume or weight from the time they reach maturity until death.


### Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

VOLUME MATRIX

| | A1 | A2 | A3 | A4 | A5 |
|:---|---:|---:|---:|---:|---:|
| Female | 255.29938 | 276.8573 | 412.6079 | 498.0489 | 486.1525 |
| Infant | 66.51618 | 160.3200 | 270.7406 | 316.4129 | 318.6930 |
| Male | 103.72320 | 245.3857 | 358.1181 | 442.6155 | 440.2074 |

SHUCK MATRIX

| | A1 | A2 | A3 | A4 | A5 |
|:---|---:|---:|---:|---:|---:|
| Female | 38.90000 | 42.50305 | 59.69121 | 69.05161 | 59.17076 |
| Infant | 10.11332 | 23.41024 | 37.17969 | 39.85369 | 36.47047 |
| Male | 16.39583 | 38.33855 | 52.96933 | 61.42726 | 55.02762 |

RATIO MATRIX

| | A1 | A2 | A3 | A4 | A5 |
|:---|---:|---:|---:|---:|---:|
| Female | 0.1546644 | 0.1554605 | 0.1450304 | 0.1379609 | 0.1233605 |
| Infant | 0.1569554 | 0.1475600 | 0.1372256 | 0.1244413 | 0.1167649 |
| Male | 0.1512698 | 0.1564017 | 0.1462123 | 0.1364881 | 0.1262089 |
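
The aggregate-then-reshape step in (5)(a) can be sketched as below for one variable. The data frame `toy` is hypothetical, with two classes and two observations per SEX and CLASS combination. The reshaping relies on `aggregate()` returning rows with SEX varying fastest within CLASS, and on every combination being present (empty combinations are dropped by `aggregate()`, which would misalign the matrix).

```r
# Toy stand-in for mydata: two observations per SEX x CLASS cell, made-up volumes.
toy <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2"), rep = 1:2)
toy$VOLUME <- seq(10, 120, by = 10)

# Mean VOLUME for each combination of SEX and CLASS.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)
print(agg)

# Fill column by column: each column is one CLASS, each row one SEX.
vol_mat <- matrix(agg$VOLUME, nrow = nlevels(toy$SEX), byrow = FALSE,
                  dimnames = list(levels(toy$SEX), levels(toy$CLASS)))
print(vol_mat)
```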

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
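
As a sketch of the first graph in (5)(b), the code below feeds the mean RATIO values from the RATIO matrix in (5)(a), rounded to four decimals, to `interaction.plot()`. Because each SEX and CLASS cell holds a single value here, `fun = mean` leaves it unchanged; with the full data set the raw columns could presumably be passed instead and the function would compute the cell means itself.

```r
# Mean RATIO by SEX and CLASS, rounded from the RATIO matrix in (5)(a).
ratio_means <- matrix(
  c(0.1547, 0.1555, 0.1450, 0.1380, 0.1234,
    0.1570, 0.1476, 0.1372, 0.1244, 0.1168,
    0.1513, 0.1564, 0.1462, 0.1365, 0.1262),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("Female", "Infant", "Male"), CLASS = paste0("A", 1:5))
)

# Long format: one row per SEX x CLASS cell.
long <- as.data.frame(as.table(ratio_means), responseName = "RATIO")

# One line per sex, CLASS on the horizontal axis.
interaction.plot(x.factor = long$CLASS, trace.factor = long$SEX,
                 response = long$RATIO, fun = mean, type = "b",
                 xlab = "CLASS", ylab = "Mean RATIO", trace.label = "SEX",
                 main = "Mean RATIO versus CLASS")
```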

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

Again we see the oddity that some infants are in higher classes and some adults are in lower classes. These graphs make those misclassifications very obvious. An interesting observation from the volume and shuck graphs is that females in class A1 have a much higher starting volume and shuck weight than males. The only time the mean ratio increases is between A1 and A2, for males and females; for every other step up in class the ratio drops.

(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2, go ahead.
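
A base R sketch of (5)(c), assuming simulated values in a hypothetical data frame `toy`. It filters to RINGS < 16 with `subset()`, splits infants from adults (levels "F" and "M" combined), and draws the four boxplots with the axis limits given above.

```r
# Simulated stand-in for mydata; values are not the abalone measurements.
set.seed(3)
toy <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), 150, replace = TRUE)),
  RINGS = sample(3:20, 150, replace = TRUE)
)
toy$VOLUME <- 40 * toy$RINGS + runif(150, 0, 300)
toy$WHOLE <- toy$VOLUME / 3

# Keep only abalones with fewer than 16 rings, then split by life stage.
young   <- subset(toy, RINGS < 16)
infants <- subset(young, SEX == "I")
adults  <- subset(young, SEX %in% c("F", "M"))

# First row: VOLUME by RINGS; second row: WHOLE by RINGS.
old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100),
        main = "Infant VOLUME by RINGS")
boxplot(VOLUME ~ RINGS, data = adults, ylim = c(0, 1100),
        main = "Adult VOLUME by RINGS")
boxplot(WHOLE ~ RINGS, data = infants, ylim = c(0, 400),
        main = "Infant WHOLE by RINGS")
boxplot(WHOLE ~ RINGS, data = adults, ylim = c(0, 400),
        main = "Adult WHOLE by RINGS")
par(old_par)  # restore the previous layout
```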

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

The graphs suggest that, as in most organisms, growth occurs in the early stages of life and then tapers off, and size can even shrink in later years. In adults, the variability in volume and whole weight remains similar as the ring count increases. In infants, the variability in weight and volume starts to change drastically after ring 8. Ring 12 in the infant chart stands out because the box spans almost the entire length of the whiskers, suggesting that the lower and upper quartiles are almost equal to the lowest and highest data points.

### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

The biggest confounding variables missing from the data set are access to food and living conditions. Even if the abalone sample comes from a single region, subregions could vary drastically in both. Abalones experiencing harsh conditions or limited access to food could differ in size from abalones of the same age in better conditions. Because abalones from different regions and subregions will vary in size, predicting age from physical measurements is difficult. If we were able to account and adjust for these confounding variables, perhaps physical measurements could be used to predict age. There also seems to be a fair number of outliers, as noted in the boxplots. The outliers are a further indication of the variability in abalone size, and they tend to skew the distributions to the right. If we excluded the outliers and the distributions became more normal, perhaps we could get better results. I would also like to know how the infant and adult categories are determined. As noted previously, there are infants with high ring counts and adults with low ring counts, which indicates to me that there is a certain level of misclassification.

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

*** I would ask how the sample was obtained (random or not), what the measures of central tendency are, whether there are outliers, and what the sample size is. If there are no sampling biases, the mean and median are close together (indicating a roughly symmetric distribution with little skew), and the sample size is large enough, I would consider this a representative sample of the population. ***

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

*** Observational studies are prone to bias. They are typically not randomized the way a controlled experiment is, and they are susceptible to the bias of the person performing the observation. Does the observer already have a conclusion in mind, and are they looking for observations that confirm it? An observer could easily exclude observations that do not confirm this bias. Observational studies also do not control for other potential confounding variables. For these reasons I believe that correlation can be determined from observational studies but not causation. Knowing correlations is useful because we can then conduct randomized controlled experiments to determine causation. Experiments can be expensive, so knowing the correlations beforehand can narrow down which experiments would be worthwhile. ***