# Comments are included in each code chunk, simply as prompts
#...R code placed here
## 'data.frame': 1036 obs. of 8 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
(1)(a) (2 points) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## SEX LENGTH DIAM HEIGHT
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415
## M:381 Median :11.45 Median : 8.925 Median :2.940
## Mean :11.08 Mean : 8.622 Mean :2.947
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570
## Max. :16.80 Max. :13.230 Max. :4.935
## WHOLE SHUCK RINGS CLASS
## Min. : 1.625 Min. : 0.5625 Min. : 3.000 A1:108
## 1st Qu.: 56.484 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236
## Median :101.344 Median : 42.5700 Median : 9.000 A3:329
## Mean :105.832 Mean : 45.4396 Mean : 9.993 A4:188
## 3rd Qu.:150.319 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175
## Max. :315.750 Max. :157.0800 Max. :25.000
## VOLUME RATIO
## Min. : 3.612 Min. :0.06734
## 1st Qu.:163.545 1st Qu.:0.12241
## Median :307.363 Median :0.13914
## Mean :326.804 Mean :0.14205
## 3rd Qu.:463.264 3rd Qu.:0.15911
## Max. :995.673 Max. :0.31176
## RINGS
## CLASS 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8
## RINGS
## CLASS 20 21 22 23 24 25
## A1 0 0 0 0 0 0
## A2 0 0 0 0 0 0
## A3 0 0 0 0 0 0
## A4 0 0 0 0 0 0
## A5 6 4 1 7 2 1
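The 115 cells follow from 5 classes times 23 distinct ring counts (3 through 25). A minimal sketch of the same `summary()` and `table()` calls is shown here on a made-up data frame; `toy` and its values are hypothetical, and the `cut()` breaks are only inferred from the class boundaries visible in the table above.

```r
# Hypothetical stand-in for mydata: 12 made-up rows, not the abalone measurements.
toy <- data.frame(
  RINGS = c(3L, 4L, 6L, 7L, 8L, 8L, 9L, 10L, 11L, 12L, 13L, 15L)
)
# Bin RINGS into age classes; the breaks mirror the CLASS/RINGS table above.
toy$CLASS <- cut(toy$RINGS, breaks = c(0, 6, 8, 10, 12, Inf),
                 labels = c("A1", "A2", "A3", "A4", "A5"))

summary(toy)
freq <- table(toy$CLASS, toy$RINGS, dnn = c("CLASS", "RINGS"))
freq
# Cell count is (number of classes) x (number of distinct ring values).
length(freq)
```

Because each ring count belongs to exactly one class, every column of such a table has a single non-zero entry.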
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
Answer: (The data contain numeric, integer and factor variables. The summary gives a first idea of each distribution. At first glance, WHOLE, SHUCK, RINGS and VOLUME have maximum values far above their means and medians, and each mean exceeds its median. This suggests right skew and possible high outliers, which is worth investigating.)
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
## CLASS
## SEX A1 A2 A3 A4 A5 Sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
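One way to produce a table like this is sketched below on a small made-up data frame (`toy` and its values are hypothetical). The barplot is drawn from the table before margins are added, so the totals are not plotted.

```r
# Hypothetical toy data, not the abalone measurements.
toy <- data.frame(
  SEX   = factor(c("F", "I", "I", "M", "F", "M", "I", "M", "F", "I")),
  CLASS = factor(c("A1", "A1", "A1", "A2", "A2", "A2", "A2", "A3", "A3", "A3"))
)
counts <- table(toy$SEX, toy$CLASS, dnn = c("SEX", "CLASS"))
with_margins <- addmargins(counts)   # adds a "Sum" row and a "Sum" column
with_margins

# Plot the table without margins; beside = TRUE groups the bars by CLASS.
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "CLASS", ylab = "Frequency")
```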
Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Answer: (The graph above shows the distribution of abalones by sex and by class. CLASS groups the abalones by number of rings, with A1 the youngest. Younger abalones belong to A1 and A2, and the graph shows that infants are concentrated there. Most females and males fall in the middle of the distribution (A3), and the two sexes have roughly the same shape, although males slightly outnumber females. One notable feature is that some abalones recorded as infants fall in the older classes A4 and A5 (21 and 19 cases, respectively). This point may be worth investigating.)
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
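The two-step approach described above (sample row indices, then subset) might look like the following sketch. The data frame `toy` and its columns are hypothetical placeholders with the same row count as “mydata”; the rows drawn here are not claimed to match the ones in the report.

```r
# Hypothetical stand-in for mydata with the same number of rows (1036).
n <- 1036
toy <- data.frame(id = seq_len(n),
                  x = seq_len(n) / 10,
                  y = sqrt(seq_len(n)),
                  z = log(seq_len(n)))

set.seed(123)                      # fixes the random draw so it is reproducible
rows <- sample(seq_len(n), 200)    # sample row indices without replacement
work <- toy[rows, ]                # then select those rows from the data frame

nrow(work)
plot(work[, 2:4])                  # three or more columns give a scatterplot matrix
```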
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
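A minimal base R sketch of this construction follows, using made-up numbers (`toy` is hypothetical). Because the slope is the largest observed SHUCK/WHOLE ratio, every point lies on or below the line.

```r
# Hypothetical toy data, not the abalone measurements.
toy <- data.frame(
  WHOLE = c(10, 20, 40, 80, 100),
  SHUCK = c(4, 9, 16, 30, 45),
  CLASS = factor(c("A1", "A1", "A2", "A3", "A3"))
)
max_ratio <- max(toy$SHUCK / toy$WHOLE)   # largest SHUCK-to-WHOLE ratio

plot(toy$WHOLE, toy$SHUCK, col = as.integer(toy$CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK")
# a = intercept, b = slope: a line through the origin that bounds every point.
abline(a = 0, b = max_ratio, lty = 2)
legend("topleft", legend = levels(toy$CLASS),
       col = seq_along(levels(toy$CLASS)), pch = 16)
```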
Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
Answer: (There appears to be less variability (dispersion) among the data points than in plot (a). Class A5 now lies mostly below the other points, unlike in (a). This may suggest that the shell makes up a larger share of whole weight in the older classes; in other words, as an abalone ages, the shell grows faster than the shuck. A3, for example, shows a higher proportion of shuck to whole. Moving from younger to older classes, the older ones (A4, A5) settle lower, while the younger ones stay closer to the maximum-ratio line.)
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
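The base R layout could be sketched as follows. The ratios are simulated from a log-normal distribution purely for illustration (`toy` is hypothetical), which happens to give a mild right skew.

```r
# Hypothetical simulated ratios, not the abalone measurements.
set.seed(1)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 50)),
  RATIO = rlnorm(150, meanlog = log(0.14), sdlog = 0.15)
)
groups <- split(toy$RATIO, toy$SEX)   # one numeric vector per sex

old_par <- par(mfrow = c(3, 3))       # panels are filled row by row
for (s in names(groups)) hist(groups[[s]], main = paste("Histogram:", s), xlab = "RATIO")
for (s in names(groups)) boxplot(groups[[s]], main = paste("Boxplot:", s), ylab = "RATIO")
for (s in names(groups)) {
  qqnorm(groups[[s]], main = paste("Q-Q plot:", s))
  qqline(groups[[s]])                 # reference line through the quartiles
}
par(old_par)                          # restore the previous layout
```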
Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
Answer: (Judged against the theoretical quantiles, the distributions follow a normal distribution reasonably well. However, for all three sexes the values deviate from normality in the upper right of the Q-Q plots. The histograms show the same thing: each exhibits right skew driven by outliers, and the boxplots confirm that outliers are present. Taken together, the boxplots and Q-Q plots suggest that the Infant group has the most outliers.)
(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).
## [1] 0.31176204 0.21216140 0.21465603 0.21306058 0.23497668 0.06733877
## [1] 0.2693371 0.2218308 0.2403394 0.2263294 0.2249577 0.2300704 0.2290478
## [8] 0.2232339
## [1] 0.2609861 0.2378764 0.2345924 0.2356349 0.2286735
## [1] 0.311762
## [1] 0.2693371
## numeric(0)
| | SEX | LENGTH | DIAM | HEIGHT | WHOLE | SHUCK | RINGS | CLASS | VOLUME | RATIO |
|---|---|---|---|---|---|---|---|---|---|---|
| 1:1 | F | 7.980 | 6.720 | 2.415 | 80.93750 | 40.37500 | 7 | A2 | 129.505824 | 0.3117620 |
| 1:2 | F | 15.330 | 11.970 | 3.465 | 252.06250 | 134.89812 | 10 | A3 | 635.827846 | 0.2121614 |
| 1:3 | F | 11.550 | 7.980 | 3.465 | 150.62500 | 68.55375 | 10 | A3 | 319.365585 | 0.2146560 |
| 1:4 | F | 13.125 | 10.290 | 2.310 | 142.00000 | 66.47062 | 9 | A3 | 311.979938 | 0.2130606 |
| 1:5 | F | 11.445 | 8.085 | 3.150 | 139.81250 | 68.49062 | 9 | A3 | 291.478399 | 0.2349767 |
| 2 | F | 12.180 | 9.450 | 4.935 | 133.87500 | 38.25000 | 14 | A5 | 568.023435 | 0.0673388 |
| 3:1 | I | 10.080 | 7.350 | 2.205 | 79.37500 | 44.00000 | 6 | A1 | 163.364040 | 0.2693371 |
| 3:2 | I | 4.305 | 3.255 | 0.945 | 6.18750 | 2.93750 | 3 | A1 | 13.242072 | 0.2218308 |
| 3:3 | I | 2.835 | 2.730 | 0.840 | 3.62500 | 1.56250 | 4 | A1 | 6.501222 | 0.2403394 |
| 3:4 | I | 6.720 | 4.305 | 1.680 | 22.62500 | 11.00000 | 5 | A1 | 48.601728 | 0.2263294 |
| 3:5 | I | 5.040 | 3.675 | 0.945 | 9.65625 | 3.93750 | 5 | A1 | 17.503290 | 0.2249577 |
| 3:6 | I | 3.360 | 2.310 | 0.525 | 2.43750 | 0.93750 | 4 | A1 | 4.074840 | 0.2300704 |
| 3:7 | I | 6.930 | 4.725 | 1.575 | 23.37500 | 11.81250 | 7 | A2 | 51.572194 | 0.2290478 |
| 3:8 | I | 9.135 | 6.300 | 2.520 | 74.56250 | 32.37500 | 8 | A2 | 145.027260 | 0.2232339 |
| 4:1 | M | 13.440 | 10.815 | 1.680 | 130.25000 | 63.73125 | 10 | A3 | 244.194048 | 0.2609861 |
| 4:2 | M | 10.500 | 7.770 | 3.150 | 132.68750 | 61.13250 | 9 | A3 | 256.992750 | 0.2378764 |
| 4:3 | M | 10.710 | 8.610 | 3.255 | 160.31250 | 70.41375 | 9 | A3 | 300.153640 | 0.2345924 |
| 4:4 | M | 12.285 | 9.870 | 3.465 | 176.12500 | 99.00000 | 10 | A3 | 420.141472 | 0.2356349 |
| 4:5 | M | 11.550 | 8.820 | 3.360 | 167.56250 | 78.27187 | 10 | A3 | 342.286560 | 0.2286735 |
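One possible way to extract such rows is `boxplot.stats()`, whose `coef` argument sets how many hinge spreads beyond the hinges a point must lie to be flagged: 1.5 for mild and 3.0 for extreme outliers. The sketch below uses made-up values, and the helper `flag_outliers()` is hypothetical, not the report's code. In the report, the resulting data frames would be passed to kable().

```r
# Hypothetical toy data, not the abalone measurements.
toy <- data.frame(
  SEX   = factor(rep(c("F", "M"), each = 8)),
  RATIO = c(0.10, 0.11, 0.12, 0.12, 0.13, 0.13, 0.14, 0.30,
            0.11, 0.12, 0.12, 0.13, 0.13, 0.14, 0.15, 0.19)
)
# Return the rows of one sex whose RATIO lies more than coef hinge spreads
# beyond the lower or upper hinge of that sex's boxplot.
flag_outliers <- function(df, sex, coef) {
  grp <- df[df$SEX == sex, ]
  grp[grp$RATIO %in% boxplot.stats(grp$RATIO, coef = coef)$out, ]
}
mild    <- do.call(rbind, lapply(levels(toy$SEX), flag_outliers, df = toy, coef = 1.5))
extreme <- do.call(rbind, lapply(levels(toy$SEX), flag_outliers, df = toy, coef = 3.0))
mild      # includes the extreme outliers as well
extreme
```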
Question (2 points): What are your observations regarding the results in (3)(b)?
Answer: (There are mild outliers for all three sexes, but extreme outliers occur only among females and infants. All but one of the outliers are high values; the single low outlier is an A5 female with a RATIO of 0.0673. The plots above had already suggested outliers, and this step confirms them and identifies the specific observations.)
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Answer: (VOLUME and WHOLE appear to be good predictors of age as measured by RINGS. The more rings an abalone has, the greater its age, and both VOLUME and WHOLE show an upward trend as the number of rings increases. However, the explanatory power of CLASS is likely smaller than that of the others, as we noticed earlier that a number of infants were classified as A4 and A5.)
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
## A1 A2 A3 A4 A5
## Female 255.30 276.86 412.61 498.05 486.15
## Infant 66.52 160.32 270.74 316.41 318.69
## Male 103.72 245.39 358.12 442.62 440.21
## A1 A2 A3 A4 A5
## Female 38.90 42.50 59.69 69.05 59.17
## Infant 10.11 23.41 37.18 39.85 36.47
## Male 16.40 38.34 52.97 61.43 55.03
## A1 A2 A3 A4 A5
## Female 0.15 0.16 0.15 0.14 0.12
## Infant 0.15 0.15 0.14 0.12 0.12
## Male 0.15 0.16 0.15 0.14 0.13
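When reshaping aggregate() output with matrix(), the order of the grouping variables matters: the first variable in the formula varies fastest in the result, and matrix() fills column by column unless `byrow = TRUE` is set. The sketch below, on made-up numbers (`toy` is hypothetical), shows a combination that keeps each mean under the correct SEX and CLASS labels, with tapply() as an independent cross-check.

```r
# Hypothetical toy data, not the abalone measurements.
toy <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2"), rep = 1:2)
toy$VOLUME <- c(10, 1, 5, 20, 2, 8, 12, 3, 7, 22, 4, 10)

# With SEX first in the formula, SEX varies fastest in the result:
# F/A1, I/A1, M/A1, F/A2, I/A2, M/A2.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)

# matrix() fills column by column by default, so nrow = 3 places the three
# sexes of each class in one column, matching the dimnames.
vol <- matrix(agg$VOLUME, nrow = 3,
              dimnames = list(c("Female", "Infant", "Male"), levels(toy$CLASS)))
vol

# tapply() builds the same layout directly and serves as a cross-check.
check <- tapply(toy$VOLUME, list(toy$SEX, toy$CLASS), mean)
all.equal(unname(vol), unname(check))
```

If the formula were written as `VOLUME ~ CLASS + SEX` instead, the same matrix() call would scatter the values across the wrong cells, so a cross-check of this kind is cheap insurance.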
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
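A minimal sketch of one such graph with interaction.plot() is given below. The values are simulated (`toy` is hypothetical); the same call would be repeated with VOLUME and SHUCK as the response.

```r
# Hypothetical simulated data, not the abalone measurements.
toy <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2", "A3"), rep = 1:2)
set.seed(2)
toy$RATIO <- 0.16 - 0.01 * as.integer(toy$CLASS) + rnorm(nrow(toy), sd = 0.005)

# The plotted points are the cell means of RATIO for each SEX/CLASS pair.
cell_means <- tapply(toy$RATIO, list(toy$SEX, toy$CLASS), mean)

# x.factor goes on the horizontal axis, trace.factor defines the lines,
# and fun = mean is applied to the response within each cell.
interaction.plot(x.factor = toy$CLASS, trace.factor = toy$SEX,
                 response = toy$RATIO, fun = mean, type = "b",
                 xlab = "CLASS", ylab = "Mean RATIO", trace.label = "SEX")
```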
Question (2 points): What questions do these plots raise? Consider aging and sex differences.
Answer: (As the abalones age, the proportion of meat, as given by the mean ratio, drops considerably; there is less meat relative to the size of the abalone. Mean volume is larger for females than for males as the abalones age. Female and male volume both peak in class A4, whereas infant volume still increases slightly into class A5. Mean shuck peaks at class A4 for all three sexes. This raises the question of whether abalones are best eaten at class A4, since the shuck does not grow past this point. Furthermore, the mean ratio shows that the proportion of meat is largest at class A2. Would it be more profitable for producers to sell abalones for consumption once they reach this stage, since the proportion of meat to size does not grow from there and they could start growing new ones?)
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
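The subsetting and layout could be sketched in base R as follows. The data are simulated (`toy` and the names `trimmed`, `infants` and `adults` are hypothetical); only the `RINGS < 16` filter and the `ylim` values come from the prompt.

```r
# Hypothetical simulated data, not the abalone measurements.
set.seed(3)
toy <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), 120, replace = TRUE)),
  RINGS = sample(3:20, 120, replace = TRUE)
)
toy$VOLUME <- 40 * toy$RINGS + rnorm(120, sd = 30)
toy$WHOLE  <- 15 * toy$RINGS + rnorm(120, sd = 10)

trimmed <- subset(toy, RINGS < 16)              # drop the sparse upper tail
infants <- subset(trimmed, SEX == "I")
adults  <- subset(trimmed, SEX %in% c("M", "F"))

old_par <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100), main = "Infant volume")
boxplot(VOLUME ~ RINGS, data = adults,  ylim = c(0, 1100), main = "Adult volume")
boxplot(WHOLE ~ RINGS,  data = infants, ylim = c(0, 400),  main = "Infant whole weight")
boxplot(WHOLE ~ RINGS,  data = adults,  ylim = c(0, 400),  main = "Adult whole weight")
par(old_par)
```

Using the same `ylim` within each row keeps the infant and adult panels directly comparable.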
Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
Answer: (These graphs suggest that growth varies more among adult abalones than among infants. The adult boxplots have longer whiskers than the infant boxplots, and both adult VOLUME and adult WHOLE become more dispersed as the number of rings grows. The graphs also show that VOLUME and WHOLE tend to increase with the number of rings.)
We have performed exploratory data analysis on the abalone data set. The data appear useful and potentially powerful for predicting the age of an abalone. However, some issues must be dealt with before proceeding to modeling. First, the outliers in the data must be addressed so that samples from this data set better approximate normality. There may also be collinearity between some variables. Another issue to examine is the sex category, which is prone to human error because sexing abalones is difficult. We also found some interesting patterns in the data. The proportion of shuck to whole weight is lower for older abalones (classes A4 and A5), which suggests that this proportion may be a good variable for age prediction. Mean shuck, volume and whole weight may also be good measurements for age prediction, and the RATIO variable is another candidate. Deeper study of the data set is needed. To improve the study, variables pertaining to the environment should be added.
Please respond to each of the following questions (10 points total):
Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
Answer: (The outliers may be affecting the data and contributing to the right skew observed. This should be addressed so that the data better approximate a normal distribution. A further issue is the Infant category and how it is determined. Sexing abalones is known to be difficult, so this category is prone to measurement error. We saw that infants can have upwards of 10 rings, which raises the question of how infants are classified, given that the number of rings is a sign of age. Another point to consider is correlation between the variables, as some of them are functions of others; a model for age prediction might run into multicollinearity. Physical measurements may be used for age prediction to a certain extent. Weight, sex and volume are good measures. However, we would need to add measures that control for environmental effects on growth, since abalone growth depends on environmental factors such as pollution and food supply. Therefore, to determine which physical variables are good predictors of age, measures of the environment should be added.)
Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
Answer: (Are there outliers? How many data points are in the sample? What are the mean and standard deviation of the sample? Are the data symmetric about the mean? I would also examine the measures of central tendency.)
Question 3) (2 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer: (Data from observational studies are subject to human and measurement error, as the measured values have to be recorded manually. Observational studies may also be biased. Correlation in the data may be detected and interpreted as causality without investigating the underlying factors that drive the relationship. A purely observational study may point toward causality in some cases, but it is prone to bias and to erroneous causal conclusions. From observational studies we can learn about the relationships between the independent and dependent variables. From there, we can gain insight into underlying relationships or other variables that might be affecting some of our variables. We might also discover other variables that have a strong correlation and possibly a causal effect.)