## [1] FALSE
## Classes 'tbl_df', 'tbl' and 'data.frame': 1036 obs. of 14 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : num 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ...9 : logi NA NA NA NA NA NA ...
## $ ...10 : logi NA NA NA NA NA NA ...
## $ ...11 : chr NA NA NA NA ...
## $ ...12 : chr NA NA NA NA ...
## $ VOLUME: num 28.7 8.1 163.4 12.2 59.7 ...
## $ Ratio : num 0.15 0.147 0.269 0.185 0.165 ...
(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
## SEX LENGTH DIAM HEIGHT
## F:326 Min. : 2.73 Min. : 1.995 Min. :0.525
## I:329 1st Qu.: 9.45 1st Qu.: 7.350 1st Qu.:2.415
## M:381 Median :11.45 Median : 8.925 Median :2.940
## Mean :11.08 Mean : 8.622 Mean :2.947
## 3rd Qu.:13.02 3rd Qu.:10.185 3rd Qu.:3.570
## Max. :16.80 Max. :13.230 Max. :4.935
## WHOLE SHUCK RINGS CLASS
## Min. : 1.625 Min. : 0.5625 Min. : 3.000 A1:108
## 1st Qu.: 56.484 1st Qu.: 23.3006 1st Qu.: 8.000 A2:236
## Median :101.344 Median : 42.5700 Median : 9.000 A3:329
## Mean :105.832 Mean : 45.4396 Mean : 9.993 A4:188
## 3rd Qu.:150.319 3rd Qu.: 64.2897 3rd Qu.:11.000 A5:175
## Max. :315.750 Max. :157.0800 Max. :25.000
## ...9 ...10 ...11 ...12
## Mode:logical Mode:logical Length:1036 Length:1036
## NA's:1036 NA's:1036 Class :character Class :character
## Mode :character Mode :character
##
##
##
## VOLUME Ratio
## Min. : 3.612 Min. :0.06734
## 1st Qu.:163.545 1st Qu.:0.12241
## Median :307.363 Median :0.13914
## Mean :326.804 Mean :0.14205
## 3rd Qu.:463.264 3rd Qu.:0.15911
## Max. :995.673 Max. :0.31176
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## A1 9 8 24 67 0 0 0 0 0 0 0 0 0 0 0 0 0
## A2 0 0 0 0 91 145 0 0 0 0 0 0 0 0 0 0 0
## A3 0 0 0 0 0 0 182 147 0 0 0 0 0 0 0 0 0
## A4 0 0 0 0 0 0 0 0 125 63 0 0 0 0 0 0 0
## A5 0 0 0 0 0 0 0 0 0 0 48 35 27 15 13 8 8
##
## 20 21 22 23 24 25
## A1 0 0 0 0 0 0
## A2 0 0 0 0 0 0
## A3 0 0 0 0 0 0
## A4 0 0 0 0 0 0
## A5 6 4 1 7 2 1
## [1] 1.238773
## [1] 0.0536942
Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.
SEX and CLASS are factors; the remaining measurements are numeric. RINGS has a skewness of 1.238773, which indicates a pronounced right skew rather than a normal distribution about the mean: most abalones have between 7 and 11 rings, while a long upper tail extends to 25 rings. The mean (9.993) sits above the median (9), and the few high-ring observations may be outliers.
CLASS has a skewness of 0.0536942. This indicates only a very slight right skew, so outliers are most likely not an issue for this variable. Because CLASS is a factor, this value is only a rough guide.
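The two skewness values can be checked from the frequency table alone. The sketch below assumes the moment-based definition (third central moment divided by the second central moment raised to the power 1.5) and treats CLASS as its integer level codes 1 to 5; both are assumptions about how the reported values were obtained, and `moment_skewness` is a hypothetical helper name rather than a function from the original analysis.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
# 'moment_skewness' is a hypothetical helper, not taken from the original report.
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Counts copied from the CLASS by RINGS table (column totals and row totals).
ring_values <- 3:25
ring_counts <- c(9, 8, 24, 67, 91, 145, 182, 147, 125, 63, 48, 35,
                 27, 15, 13, 8, 8, 6, 4, 1, 7, 2, 1)
class_counts <- c(A1 = 108, A2 = 236, A3 = 329, A4 = 188, A5 = 175)

# Expand the counts back into one value per abalone.
rings <- rep(ring_values, times = ring_counts)
class_codes <- rep(seq_along(class_counts), times = class_counts)  # assumed coding 1..5

skew_rings <- moment_skewness(rings)
skew_class <- moment_skewness(class_codes)
print(c(RINGS = skew_rings, CLASS = skew_class))
```

Under these assumptions the results should land close to the reported 1.238773 and 0.0536942, which suggests the second value was computed on the class codes rather than on a continuous measurement.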
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1036
##
##
## | abalones$CLASS
## abalones$SEX | A1 | A2 | A3 | A4 | A5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## F | 5 | 41 | 121 | 82 | 77 | 326 |
## | 24.720 | 14.898 | 2.949 | 8.819 | 8.735 | |
## | 0.015 | 0.126 | 0.371 | 0.252 | 0.236 | 0.315 |
## | 0.046 | 0.174 | 0.368 | 0.436 | 0.440 | |
## | 0.005 | 0.040 | 0.117 | 0.079 | 0.074 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## I | 91 | 133 | 65 | 21 | 19 | 329 |
## | 93.745 | 44.969 | 14.918 | 25.089 | 24.070 | |
## | 0.277 | 0.404 | 0.198 | 0.064 | 0.058 | 0.318 |
## | 0.843 | 0.564 | 0.198 | 0.112 | 0.109 | |
## | 0.088 | 0.128 | 0.063 | 0.020 | 0.018 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## M | 12 | 62 | 143 | 85 | 79 | 381 |
## | 19.344 | 7.082 | 4.003 | 3.639 | 3.331 | |
## | 0.031 | 0.163 | 0.375 | 0.223 | 0.207 | 0.368 |
## | 0.111 | 0.263 | 0.435 | 0.452 | 0.451 | |
## | 0.012 | 0.060 | 0.138 | 0.082 | 0.076 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 108 | 236 | 329 | 188 | 175 | 1036 |
## | 0.104 | 0.228 | 0.318 | 0.181 | 0.169 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
##
## A1 A2 A3 A4 A5
## F 5 41 121 82 77
## I 91 133 65 21 19
## M 12 62 143 85 79
Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?
Infants are concentrated in the youngest classes: 224 of the 329 infants fall in A1 or A2. Adults are concentrated in A3 through A5, with A3 the largest single class for both females (121) and males (143). Males and females follow a similar pattern across the classes, but males outnumber females in every class. Whether that difference is statistically significant cannot be judged from the table at a glance.
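The table, its margins, and the barplot can be rebuilt directly from the counts printed above. The sketch below also runs a chi-square test of independence, which is not part of the original answer; it is included only because the per-cell chi-square contributions in the CrossTable output sum to roughly 300, and a formal test is one way to follow up on the question of significance.

```r
# Counts copied from the SEX by CLASS table reported above.
sex_class <- matrix(
  c(5, 41, 121, 82, 77,
    91, 133, 65, 21, 19,
    12, 62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(SEX = c("F", "I", "M"),
                  CLASS = c("A1", "A2", "A3", "A4", "A5"))
)
sex_class <- as.table(sex_class)

# Marginal totals, as requested in the prompt.
with_margins <- addmargins(sex_class)
print(with_margins)

# Share of each sex falling in each class (each row sums to 1).
row_share <- prop.table(sex_class, margin = 1)
print(round(row_share, 3))

# Chi-square test of independence between SEX and CLASS (an added follow-up).
independence_test <- chisq.test(sex_class)
print(independence_test)

# Grouped barplot of the counts, ignoring the marginal totals.
barplot(sex_class, beside = TRUE, legend.text = TRUE,
        xlab = "CLASS", ylab = "Count",
        main = "Abalone counts by SEX and CLASS")
```

Passing the table without margins to `barplot()` keeps the totals out of the chart, which is what the prompt asks for.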
(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
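The sampling step described in (1)(c) can be sketched as follows. Because the abalone file is not reproduced here, the code uses `mydata_demo`, a hypothetical stand-in data frame with 1036 rows and made-up values; only the `set.seed()`, `sample()`, and row-indexing pattern is the point.

```r
# 'mydata_demo' is a hypothetical stand-in with 1036 rows; it is NOT the abalone data.
n_rows <- 1036
mydata_demo <- data.frame(
  SEX    = factor(rep(c("F", "I", "M"), length.out = n_rows)),
  LENGTH = seq(3, 17, length.out = n_rows),
  DIAM   = seq(2, 13, length.out = n_rows),
  HEIGHT = seq(0.5, 5, length.out = n_rows),
  WHOLE  = seq(2, 315, length.out = n_rows),
  SHUCK  = seq(0.5, 157, length.out = n_rows)
)

# Fix the seed, then sample row numbers (not the data frame itself).
set.seed(123)
row_ids <- sample(seq_len(nrow(mydata_demo)), size = 200)  # without replacement by default

# Keep only the sampled rows.
work <- mydata_demo[row_ids, ]

# Scatterplot matrix of columns 2 to 6.
plot(work[, 2:6])
```

Sampling the integers 1 to 1036 and then indexing rows keeps each observation intact, which is why `sample()` is not applied to the data frame directly.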
(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.
(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
## [1] 0.5621008
Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.
In the plot for (2)(b) the classes separate more clearly: A5 appears clustered toward the bottom, A4 just above it, A3 above that, and so on. In the plot for (2)(a) there is no clear grouping of the classes along the line; the classes appear scattered at random as the values increase. Both plots show a positive trend.
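The reference line in (2)(b) is a line through the origin whose slope is the largest observed SHUCK-to-WHOLE ratio, so every point lies on or below it. A minimal base R sketch, using a few hypothetical toy values in place of the abalone measurements and omitting the colour coding by CLASS:

```r
# Hypothetical toy values standing in for WHOLE and SHUCK; not the abalone data.
whole <- c(10, 25, 60, 110, 150, 220, 300)
shuck <- c(4, 11, 24, 50, 61, 95, 120)

# Largest SHUCK-to-WHOLE ratio in the sample.
max_ratio <- max(shuck / whole)

plot(whole, shuck, xlab = "WHOLE", ylab = "SHUCK",
     main = "SHUCK versus WHOLE with max-ratio line")

# In abline(), 'a' is the intercept and 'b' is the slope.
abline(a = 0, b = max_ratio, lty = 2)
```

The vertical gap between a point and this line shows how far that abalone's shucked weight falls short of the best ratio seen in the data.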
(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.
For the most part the RATIO values appear close to normal in the histograms, boxplots, and Q-Q plots. The histograms for infants, females, and males are roughly bell shaped, with perhaps a slight right skew. The boxplots show a few outliers, and the Q-Q plots show points departing from the reference line at both ends, which is consistent with a few outliers in the tails.
(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).
## $stats
## [1] 0.07808601 0.12314437 0.14159986 0.15948490 0.21142642
##
## $n
## [1] 329
##
## $conf
## [1] 0.1384343 0.1447654
##
## $out
## [1] 0.2693371 0.2218308 0.2403394 0.2263294 0.2249577 0.2300704 0.2290478
## [8] 0.2232339
## $stats
## [1] 0.07174086 0.12123374 0.13808946 0.15542149 0.20522327
##
## $n
## [1] 326
##
## $conf
## [1] 0.1350978 0.1410812
##
## $out
## [1] 0.31176204 0.21216140 0.21465603 0.21306058 0.23497668 0.06733877
## $stats
## [1] 0.07171905 0.12228101 0.13776396 0.16085876 0.21627955
##
## $n
## [1] 381
##
## $conf
## [1] 0.1346412 0.1408867
##
## $out
## [1] 0.2609861 0.2378764 0.2345924 0.2356349 0.2286735
## $stats
## [1] 0.07808601 0.12314437 0.14159986 0.15948490 0.24033943
##
## $n
## [1] 329
##
## $conf
## [1] 0.1384343 0.1447654
##
## $out
## [1] 0.2693371
## $stats
## [1] 0.06733877 0.12123374 0.13808946 0.15542149 0.23497668
##
## $n
## [1] 326
##
## $conf
## [1] 0.1350978 0.1410812
##
## $out
## [1] 0.311762
## $stats
## [1] 0.07171905 0.12228101 0.13776396 0.16085876 0.26098609
##
## $n
## [1] 381
##
## $conf
## [1] 0.1346412 0.1408867
##
## $out
## numeric(0)
Essay Question (2 points): What are your observations regarding the results in (3)(b)?
The infant RATIO has 8 outliers: 0.2693371, 0.2218308, 0.2403394, 0.2263294, 0.2249577, 0.2300704, 0.2290478, and 0.2232339. These outliers could be skewing the infant data and may have to be addressed later on.
The male RATIO has 5 outliers: 0.2609861, 0.2378764, 0.2345924, 0.2356349, and 0.2286735. These outliers could be skewing the male data and may have to be addressed later on.
The female RATIO has 6 outliers: 0.31176204, 0.21216140, 0.21465603, 0.21306058, 0.23497668, and 0.06733877. The last of these is the only outlier on the low side. These outliers may be skewing the female data and may have to be addressed later on.
The female and infant categories are the only ones with outliers that we would deem “extreme”: 0.311762 for females and 0.2693371 for infants. The males have no extreme outliers.
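The split between mild and extreme outliers follows from the hinges printed in the `$stats` output. The sketch below assumes the usual boxplot convention: a value is mild if it lies more than 1.5 times the hinge spread beyond a hinge, and extreme if it lies more than 3 times the spread beyond. `classify_outliers` is a hypothetical helper written for this illustration.

```r
# Classify values against fences built from the lower and upper hinges.
# 'classify_outliers' is a hypothetical helper, not part of the original report.
classify_outliers <- function(values, lower_hinge, upper_hinge) {
  spread <- upper_hinge - lower_hinge
  mild_lo <- lower_hinge - 1.5 * spread
  mild_hi <- upper_hinge + 1.5 * spread
  extreme_lo <- lower_hinge - 3 * spread
  extreme_hi <- upper_hinge + 3 * spread
  label <- ifelse(values < extreme_lo | values > extreme_hi, "extreme",
           ifelse(values < mild_lo | values > mild_hi, "mild", "inside"))
  data.frame(RATIO = values, type = label)
}

# Hinges and flagged values copied from the boxplot.stats() output above.
infant <- classify_outliers(
  c(0.2693371, 0.2218308, 0.2403394, 0.2263294,
    0.2249577, 0.2300704, 0.2290478, 0.2232339),
  lower_hinge = 0.12314437, upper_hinge = 0.15948490)
female <- classify_outliers(
  c(0.31176204, 0.21216140, 0.21465603, 0.21306058, 0.23497668, 0.06733877),
  lower_hinge = 0.12123374, upper_hinge = 0.15542149)
male <- classify_outliers(
  c(0.2609861, 0.2378764, 0.2345924, 0.2356349, 0.2286735),
  lower_hinge = 0.12228101, upper_hinge = 0.16085876)

print(infant)
print(female)
print(male)
```

With the full data frame, the same fences could be used to pull the matching rows of “mydata” before passing them to `kable()`.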
(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.
Volume and whole weight appear to be good predictors of age when differentiating between abalones in classes A1, A2, and A3. However, the means and medians for classes A4 and A5 are extremely similar, and A5 also overlaps substantially with A3. These physical parameters therefore do not appear to differentiate well between the ages of abalones once they have reached maturity. Abalone rings also do not appear to be a good indicator of age, since they increase with volume and whole weight in the same way that class does.
(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.
##
## A1 A2 A3 A4 A5 Sum
## F 5 41 121 82 77 326
## I 91 133 65 21 19 329
## M 12 62 143 85 79 381
## Sum 108 236 329 188 175 1036
## Sex Class Average Volume Average Shuck Average Ratio Count
## 1 F A1 255.29938 38.90000 0.1546644 5
## 2 I A1 66.51618 10.11332 0.1569554 91
## 3 M A1 103.72320 16.39583 0.1512698 12
## 4 F A2 276.85731 42.50305 0.1554605 41
## 5 I A2 160.31999 23.41024 0.1475600 133
## 6 M A2 245.38571 38.33855 0.1564017 62
## 7 F A3 412.60794 59.69121 0.1450304 121
## 8 I A3 270.74063 37.17969 0.1372256 65
## 9 M A3 358.11811 52.96933 0.1462123 143
## 10 F A4 498.04889 69.05161 0.1379609 82
## 11 I A4 316.41292 39.85369 0.1244413 21
## 12 M A4 442.61552 61.42726 0.1364881 85
## 13 F A5 486.15253 59.17076 0.1233605 77
## 14 I A5 318.69299 36.47047 0.1167649 19
## 15 M A5 440.20736 55.02762 0.1262089 79
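The aggregated means above can be arranged into a SEX by CLASS matrix with `matrix()` and its `dimnames` argument. The sketch below does this for mean VOLUME using the values as printed, and adds a consistency check that is not in the original: the count-weighted average of the cell means should recover the overall mean VOLUME from `summary()`.

```r
sexes <- c("F", "I", "M")
classes <- c("A1", "A2", "A3", "A4", "A5")

# Values copied from the aggregate() output above, in its printed order
# (SEX varies fastest within each CLASS), so column-major filling lines up.
mean_volume <- c(255.29938, 66.51618, 103.72320,
                 276.85731, 160.31999, 245.38571,
                 412.60794, 270.74063, 358.11811,
                 498.04889, 316.41292, 442.61552,
                 486.15253, 318.69299, 440.20736)
cell_count <- c(5, 91, 12, 41, 133, 62, 121, 65, 143, 82, 21, 85, 77, 19, 79)

# Rows labelled by SEX, columns labelled by CLASS.
volume_matrix <- matrix(mean_volume, nrow = 3, byrow = FALSE,
                        dimnames = list(sexes, classes))
count_matrix <- matrix(cell_count, nrow = 3, byrow = FALSE,
                       dimnames = list(sexes, classes))
print(round(volume_matrix, 2))

# Consistency check: weight each cell mean by its count.
overall_volume <- sum(volume_matrix * count_matrix) / sum(count_matrix)
print(overall_volume)
```

The same pattern applies to the SHUCK and RATIO columns; only the vector of means changes.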
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.
It is interesting that female abalones have a higher mean volume and shuck weight than their male counterparts in every class. Mean volume and shuck also rise sharply through classes A1, A2, and A3, but the changes are far less drastic once the abalones reach classes A4 and A5. Mean shuck is greatest in class A4 and then declines, so it would be interesting to learn what causes this mass to go down once the abalones reach class A5.
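The mean SHUCK plot can be reproduced from the aggregated cell means alone. The sketch below feeds the printed values to base R `interaction.plot()` and then looks up, for each sex, the class where mean SHUCK peaks; the lookup is an added illustration and not part of the original answer.

```r
# Cell means copied from the aggregate() output above (SEX varies fastest).
cell_means <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), times = 5)),
  CLASS = factor(rep(c("A1", "A2", "A3", "A4", "A5"), each = 3)),
  SHUCK = c(38.90000, 10.11332, 16.39583,
            42.50305, 23.41024, 38.33855,
            59.69121, 37.17969, 52.96933,
            69.05161, 39.85369, 61.42726,
            59.17076, 36.47047, 55.02762)
)

# One line per sex. Each cell holds a single value, so fun = mean returns it unchanged.
interaction.plot(x.factor = cell_means$CLASS, trace.factor = cell_means$SEX,
                 response = cell_means$SHUCK, fun = mean, type = "b",
                 xlab = "CLASS", ylab = "Mean SHUCK", trace.label = "SEX")

# Class with the largest mean SHUCK for each sex.
peak_class <- tapply(seq_len(nrow(cell_means)), cell_means$SEX, function(i) {
  as.character(cell_means$CLASS[i][which.max(cell_means$SHUCK[i])])
})
print(peak_class)
```

With the raw data, `interaction.plot()` would be given the individual observations and would compute the cell means itself.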
(5)(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?
These graphs show that volume increases as the number of rings increases. However, the increase becomes far less prominent once the abalones reach about 10 rings. The graphs also show that the size of infant abalones varies far less than that of adults: the infant boxplots have very short whiskers compared with the adult boxplots.
Conclusions
Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.
There appears to be some skewness in the data that would need to be addressed in order to get the best possible results. Many standard methods work best with data that are approximately normally distributed. The researchers may have to re-examine their sampling procedures. They may also need to revisit how they sex the abalones, which can be a very difficult task.
Many of the physical parameters measured also seem to be very similar in nature: diameter and height, length and diameter, and whole weight and volume, to name a few. These parameters should be examined for multicollinearity by computing their VIF statistics and removing any with a value higher than 5 (or whatever threshold the data analyst deems appropriate). Multicollinearity can be a major issue when running any regression analysis.
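The VIF check mentioned above can be computed in base R without extra packages: regress each predictor on the others and take 1 / (1 - R^2). The sketch below uses simulated predictors (hypothetical, not the abalone measurements) in which `x2` is nearly a copy of `x1` and `x3` is unrelated to both.

```r
# Simulated predictors (hypothetical, NOT the abalone measurements).
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # almost collinear with x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the rest.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_by_hand, 2))
```

In this setup the two near-duplicate predictors should come out well above the threshold of 5 while the unrelated one stays near 1, which is the pattern one would look for among length, diameter, height, and the weight measurements.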
Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?
I would be very curious about everything that went into the sampling for the abalones. For example, how many abalones were used? I would want to ensure that the sample size given to me can feasibly be representative of the population as a whole. The sample size is crucial. I would also want to know how the sites were picked for abalone collection and what means of collection was used. Could the selection be randomized any better? Does there appear to be some sort of bias in the locations picked? Then I would want to know whether the histogram presented to me shows any skewness, whether there are any outliers that may be causing this skewness, and what the various measures of central tendency are.
Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
One difficulty with observational studies is that they leave room for error. This error can be present in things such as measuring abalones, sexing abalones, recording data, and selecting locations to procure abalones. All of these lend themselves to human bias and human error. In addition, most observational studies will not have perfect data, so the data will most likely have to be cleaned by the data scientist, who may have to make judgement calls on whether certain outliers should be included.
Causality cannot be determined with the data. There is no control to compare the data to, and scientists are always careful not to make complete and overarching claims. Scientists generally just report that there appears to be a trend, pattern, or correlation between certain factors. One can never safely state something with certainty.
These studies can be very effective in identifying where correlations may exist between abalone weight and sex, abalone class and weight, etc. The best information that can be gleaned from a study such as this is where relationships MAY exist.