## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##### Section 1: (6 points) Summarizing the data.
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## SEX LENGTH DIAM HEIGHT WHOLE
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525 Min. : 1.625
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415 1st Qu.: 56.484
## M:381 Median :11.45 Median : 8.925 Median :2.940 Median :101.344
## Mean :11.08 Mean : 8.622 Mean :2.947 Mean :105.832
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570 3rd Qu.:150.319
## Max. :16.80 Max. :13.230 Max. :4.935 Max. :315.750
## SHUCK RINGS CLASS VOLUME
## Min. : 0.5625 Min. : 3.000 A1:108 Min. : 3.612
## 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236 1st Qu.:163.545
## Median : 42.5700 Median : 9.000 A3:329 Median :307.363
## Mean : 45.4396 Mean : 9.993 A4:188 Mean :326.804
## 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175 3rd Qu.:463.264
## Max. :157.0800 Max. :25.000 Max. :995.673
## RATIO
## Min. :0.06734
## 1st Qu.:0.12241
## Median :0.13914
## Mean :0.14205
## 3rd Qu.:0.15911
## Max. :0.31176
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8 6
##
## 21 22 23 24 25
## A1 0 0 0 0 0
## A2 0 0 0 0 0
## A3 0 0 0 0 0
## A4 0 0 0 0 0
## A5 4 1 7 2 1
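The code behind this output is not shown in the report. A minimal sketch of the two calls follows, using a small simulated data frame `toy` (a hypothetical stand-in, not the real `mydata`). The class boundaries are read off the frequency table above and are an assumption about how CLASS was built.

```r
# Hypothetical stand-in for mydata; only RINGS and CLASS are needed here.
set.seed(1)
rings <- c(3:25, sample(3:25, 77, replace = TRUE))  # every ring count 3..25 appears at least once
toy <- data.frame(RINGS = rings)

# Class boundaries inferred from the table above (assumption):
# A1 = 3-6, A2 = 7-8, A3 = 9-10, A4 = 11-12, A5 = 13 and up.
toy$CLASS <- cut(toy$RINGS,
                 breaks = c(-Inf, 6, 8, 10, 12, Inf),
                 labels = c("A1", "A2", "A3", "A4", "A5"))

summary(toy)                         # descriptive statistics for each column
tab <- table(toy$CLASS, toy$RINGS)   # rows = CLASS, columns = RINGS
tab
length(tab)                          # 5 classes x 23 ring counts = 115 cells
```

With the real data the same two calls would be `summary(mydata)` and `table(mydata$CLASS, mydata$RINGS)`.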
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
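The abalone data set has the following variables, or observations, about our abalones:
SEX, LENGTH, DIAMETER (DIAM), HEIGHT, WHOLE WEIGHT (WHOLE), SHUCK WEIGHT without shell (SHUCK), RINGS, CLASS, VOLUME, RATIO
SEX and CLASS are factors, RINGS is an integer count, and the remaining variables are continuous numeric measurements. Looking at the summary, the mean is higher than the median for WHOLE, SHUCK, RINGS, VOLUME, and RATIO, which indicates right-skewed distributions. LENGTH and DIAM show the opposite pattern, with the mean slightly below the median. The right skew makes sense: observations about a living creature can’t be negative, so outliers can only occur at large values and drag the distribution to the right.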
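The mean-versus-median comparison can be made explicit. The sketch below copies the means and medians from the summary() output above into a small data frame (the names `stats`, `rel_gap`, and `hint` are illustrative) and reports the sign of the gap for each variable. It is only a rough indicator of skew, not a formal test.

```r
# Means and medians copied from the summary() output above.
stats <- data.frame(
  variable = c("LENGTH", "DIAM", "HEIGHT", "WHOLE", "SHUCK", "RINGS", "VOLUME", "RATIO"),
  mean     = c(11.08, 8.622, 2.947, 105.832, 45.4396, 9.993, 326.804, 0.14205),
  median   = c(11.45, 8.925, 2.940, 101.344, 42.5700, 9.000, 307.363, 0.13914)
)

# Relative gap between mean and median:
# positive hints at right skew, negative hints at left skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median
stats$hint    <- ifelse(stats$rel_gap > 0, "right", "left")

stats[order(-stats$rel_gap), ]   # largest right-skew hint first
```

The gap for HEIGHT is positive but very small, so its mean and median are effectively equal.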
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
## CLASS
## SEX A1 A2 A3 A4 A5 Sum
## Female 5 41 121 82 77 326
## Infant 91 133 65 21 19 329
## Male 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
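The steps named in the prompt (table, then addmargins(), then a barplot without the margins) can be sketched as follows. Because `mydata` is not available here, the counts are copied from the table above into a matrix; the object names and colors are illustrative choices.

```r
# Counts copied from the table above (rows = SEX, columns = CLASS).
counts <- matrix(c( 5,  41, 121, 82, 77,
                   91, 133,  65, 21, 19,
                   12,  62, 143, 85, 79),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(SEX   = c("Female", "Infant", "Male"),
                                 CLASS = c("A1", "A2", "A3", "A4", "A5")))
tab <- as.table(counts)

with_margins <- addmargins(tab)   # appends the "Sum" row and column
with_margins

# Grouped bar chart of the 15 cells only; the marginal totals are not plotted.
barplot(tab, beside = TRUE, legend.text = TRUE,
        col = c("red", "green", "blue"),
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```

With the real data, `tab` would come from `table(mydata$SEX, mydata$CLASS)` instead of a hand-typed matrix.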
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Observing the table, classes are assigned based on the number of rings: the lower the number of rings, the lower the class, and vice versa. Males and females begin to appear in earnest at A2. What stands out from the barplot is that there are infants in the higher classes and adults in the lower classes. Logically, an adult can’t be an infant and an infant can’t be an adult. This would need to be investigated.
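To put numbers on this observation, the sketch below recomputes the within-class shares from the counts in the table above. The variable names are illustrative.

```r
# Counts copied from the SEX by CLASS table above.
counts <- matrix(c( 5,  41, 121, 82, 77,
                   91, 133,  65, 21, 19,
                   12,  62, 143, 85, 79),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(SEX   = c("Female", "Infant", "Male"),
                                 CLASS = c("A1", "A2", "A3", "A4", "A5")))

# Share of each sex within each class (every column sums to 1).
round(prop.table(counts, margin = 2), 3)

infants_in_A4_A5 <- sum(counts["Infant", c("A4", "A5")])     # 21 + 19 = 40
adults_in_A1     <- sum(counts[c("Female", "Male"), "A1"])   # 5 + 12 = 17
c(infants_in_A4_A5 = infants_in_A4_A5, adults_in_A1 = adults_in_A1)
```

So 40 abalones labeled as infants fall in the two oldest classes, and 17 adults fall in the youngest class.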
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
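A minimal sketch of the sampling step follows. The data frame `toy` is a simulated, hypothetical stand-in with the same row count and column order as `mydata`, and its values are not real abalone measurements. With the real data, replace `toy` with `mydata`.

```r
# Hypothetical stand-in for mydata: 1036 rows, columns in the same order.
set.seed(42)   # seed for the simulated data only
n <- 1036
toy <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = runif(n, 3, 17)
)
toy$DIAM   <- toy$LENGTH * runif(n, 0.75, 0.85)
toy$HEIGHT <- toy$LENGTH * runif(n, 0.22, 0.32)
toy$WHOLE  <- 0.08 * toy$LENGTH^3 * runif(n, 0.8, 1.2)
toy$SHUCK  <- toy$WHOLE * runif(n, 0.30, 0.55)

set.seed(123)                                    # seed required by the assignment
rows <- sample(seq_len(nrow(toy)), size = 200)   # sample row indices without replacement
work <- toy[rows, ]                              # keep only the sampled rows

plot(work[, 2:6])   # scatterplot matrix of columns 2 to 6 (LENGTH through SHUCK)
```

Sampling the integers `1:nrow(...)` and then indexing keeps whole rows together, which is why `sample()` is not applied to the data frame directly.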
##### Section 2: (5 points) Summarizing the data using graphics.
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
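One way to color-code points by CLASS in base R is to index a color vector by the integer codes of the factor. The sketch below uses simulated data (`toy` is hypothetical, and the relationship between WHOLE and VOLUME is invented for illustration), with VOLUME on the horizontal axis.

```r
# Hypothetical stand-in data; values are simulated, not real measurements.
set.seed(1)
n <- 150
toy <- data.frame(
  VOLUME = runif(n, 5, 900),
  CLASS  = factor(sample(paste0("A", 1:5), n, replace = TRUE),
                  levels = paste0("A", 1:5))
)
toy$WHOLE <- toy$VOLUME * runif(n, 0.25, 0.40)

cols <- c("red", "orange", "darkgreen", "blue", "purple")   # one color per CLASS level

plot(toy$VOLUME, toy$WHOLE,
     col = cols[as.integer(toy$CLASS)], pch = 16,
     xlab = "VOLUME", ylab = "WHOLE",
     main = "WHOLE versus VOLUME by CLASS")
legend("topleft", legend = levels(toy$CLASS), col = cols, pch = 16, title = "CLASS")
```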
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
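The reference line described here can be sketched with base R as follows. The data are simulated (`toy` is hypothetical and the SHUCK fraction range is arbitrary); the point is the use of `abline(a = , b = )`, where `a` is the intercept and `b` is the slope.

```r
# Hypothetical stand-in data; SHUCK is generated as a fraction of WHOLE.
set.seed(2)
n <- 150
toy <- data.frame(
  WHOLE = runif(n, 2, 300),
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE),
                 levels = paste0("A", 1:5))
)
toy$SHUCK <- toy$WHOLE * runif(n, 0.30, 0.55)

max_ratio <- max(toy$SHUCK / toy$WHOLE)   # largest SHUCK-to-WHOLE ratio in the data

cols <- c("red", "orange", "darkgreen", "blue", "purple")   # one color per CLASS level

plot(toy$WHOLE, toy$SHUCK,
     col = cols[as.integer(toy$CLASS)], pch = 16,
     xlab = "WHOLE", ylab = "SHUCK",
     main = "SHUCK versus WHOLE by CLASS")
abline(a = 0, b = max_ratio, lty = 2)     # zero intercept, slope = maximum ratio
legend("topleft", legend = levels(toy$CLASS), col = cols, pch = 16, title = "CLASS")
```

Because the slope is the maximum ratio, every point lies on or below the line, and the point that attains the maximum sits exactly on it.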
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
##### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
The biggest confounding variables missing from the dataset are access to food and living conditions. Even if the abalone sample comes from a single region, the subregions could vary drastically in both. Abalones experiencing harsh conditions or low access to food could differ in size from abalones of the same age in better conditions with better access to food. This makes predicting age based on physical measurements difficult, because abalones from different regions and subregions are going to vary drastically in size. If we were able to account and adjust for these confounding variables, perhaps we could use physical measurements to predict age. There also seem to be a fair number of outliers, as noted in the boxplots. The outliers are a further indication of the variability in abalone size, and they tend to skew the data to the right. If we excluded the outliers and the distributions became more normal, perhaps we could get better results. I would also like to know how the infant and adult categories are determined. As noted previously, there are infants with high ring counts and adults with low ring counts. This indicates to me that there is a certain level of misclassification.
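The presence of outliers can be checked roughly from the summary statistics alone. The sketch below applies the common 1.5 × IQR fence rule to quartiles copied (rounded) from the summary() output in Section 1. This is an approximation: boxplot() computes its hinges slightly differently, and only the minimum and maximum are tested, so the result says whether at least one outlier exists on each side, not how many.

```r
# Quartiles and extremes copied (rounded) from the summary() output in Section 1.
q <- data.frame(
  variable = c("LENGTH", "WHOLE", "SHUCK", "RINGS"),
  min = c( 2.73,   1.625,   0.5625,  3),
  q1  = c( 9.45,  56.484,  23.3006,  8),
  q3  = c(13.02, 150.319,  64.2897, 11),
  max = c(16.80, 315.750, 157.0800, 25)
)

iqr <- q$q3 - q$q1
q$lower_fence  <- q$q1 - 1.5 * iqr
q$upper_fence  <- q$q3 + 1.5 * iqr
q$low_outlier  <- q$min < q$lower_fence   # at least one value below the lower fence
q$high_outlier <- q$max > q$upper_fence   # at least one value above the upper fence
q
```

Under this rule WHOLE, SHUCK, and RINGS each have at least one value above the upper fence, while LENGTH has none above it and at least one below the lower fence.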
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
***I would ask how the sample was obtained (random or not), and about the measures of central tendency, outliers, and sample size. If there is no sampling bias, the mean and median are close together (consistent with a roughly normal distribution with no skew), and the sample size is large enough, I would consider this a representative sample of the population.***
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
***Observational studies are prone to bias. They are typically not randomized like a controlled experiment. They are susceptible to the bias of the person performing the observation: does the observer already have a conclusion in mind and look for observations that confirm it? An observer could easily exclude observations that don’t confirm this bias. Observational studies also do not control for potential confounding variables. For these reasons, I believe that correlation can be determined from observational studies but not causation. Knowing correlations is useful because we can then conduct randomized controlled experiments to determine causation. Experiments can be expensive, so knowing the correlations beforehand can narrow down which experiments would be worthwhile.***