Data Analysis Assignment #1 (50 points total)

## [1] FALSE

## Classes 'tbl_df', 'tbl' and 'data.frame':    1036 obs. of  14 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : num  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ...9  : logi  NA NA NA NA NA NA ...
##  $ ...10 : logi  NA NA NA NA NA NA ...
##  $ ...11 : chr  NA NA NA NA ...
##  $ ...12 : chr  NA NA NA NA ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ Ratio : num  0.15 0.147 0.269 0.185 0.165 ...

(1)(a) (1 point) Use summary() to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.

##  SEX         LENGTH           DIAM            HEIGHT     
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415  
##  M:381   Median :11.45   Median : 8.925   Median :2.940  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935  
##      WHOLE             SHUCK              RINGS        CLASS   
##  Min.   :  1.625   Min.   :  0.5625   Min.   : 3.000   A1:108  
##  1st Qu.: 56.484   1st Qu.: 23.3006   1st Qu.: 8.000   A2:236  
##  Median :101.344   Median : 42.5700   Median : 9.000   A3:329  
##  Mean   :105.832   Mean   : 45.4396   Mean   : 9.993   A4:188  
##  3rd Qu.:150.319   3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175  
##  Max.   :315.750   Max.   :157.0800   Max.   :25.000           
##    ...9          ...10            ...11              ...12          
##  Mode:logical   Mode:logical   Length:1036        Length:1036       
##  NA's:1036      NA's:1036      Class :character   Class :character  
##                                Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##      VOLUME            Ratio        
##  Min.   :  3.612   Min.   :0.06734  
##  1st Qu.:163.545   1st Qu.:0.12241  
##  Median :307.363   Median :0.13914  
##  Mean   :326.804   Mean   :0.14205  
##  3rd Qu.:463.264   3rd Qu.:0.15911  
##  Max.   :995.673   Max.   :0.31176

##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8
##     
##       20  21  22  23  24  25
##   A1   0   0   0   0   0   0
##   A2   0   0   0   0   0   0
##   A3   0   0   0   0   0   0
##   A4   0   0   0   0   0   0
##   A5   6   4   1   7   2   1

## [1] 1.238773

## [1] 0.0536942

Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.

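The data contain two categorical variables, SEX (three levels) and CLASS (five age classes), plus numeric measurements. RINGS is a count, while LENGTH, DIAM, HEIGHT, WHOLE, SHUCK, VOLUME and Ratio are continuous. RINGS has a skewness of 1.238773, which indicates a pronounced right skew rather than a normal distribution: the mean (9.993) sits above the median (9), and the maximum of 25 lies far out in the upper tail. The few abalones with very high ring counts are the likely outliers behind this skew.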

The second skewness value, 0.0536942, was computed for CLASS. It is very close to zero, so any right skew is negligible and the classes are spread almost symmetrically. Because CLASS is categorical, this figure is only a rough guide, but outliers are unlikely to be an issue for this variable.
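The RINGS skewness can be checked from the frequency table alone. The sketch below rebuilds the RINGS vector from the column totals of the CLASS by RINGS table and applies a moment-based skewness formula. The helper name `skewness_moment` is hypothetical, and the report does not say which skewness function was used, so the result should only be expected to land close to the 1.238773 shown above.

```r
# Ring counts 3..25 and their frequencies, read off the CLASS x RINGS table above
ring_values <- 3:25
ring_counts <- c(9, 8, 24, 67, 91, 145, 182, 147, 125, 63, 48, 35,
                 27, 15, 13, 8, 8, 6, 4, 1, 7, 2, 1)

# Rebuild the 1036 individual RINGS observations
rings <- rep(ring_values, times = ring_counts)

# Hypothetical helper: third central moment divided by variance^(3/2)
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

rings_skew <- skewness_moment(rings)
print(length(rings))
print(mean(rings))
print(rings_skew)
```

A positive value well above zero confirms the long upper tail; other skewness definitions apply a small-sample correction and would differ slightly in the later decimals.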

(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply table() first, then pass the table object to addmargins() (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1036 
## 
##  
##              | abalones$CLASS 
## abalones$SEX |        A1 |        A2 |        A3 |        A4 |        A5 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            F |         5 |        41 |       121 |        82 |        77 |       326 | 
##              |    24.720 |    14.898 |     2.949 |     8.819 |     8.735 |           | 
##              |     0.015 |     0.126 |     0.371 |     0.252 |     0.236 |     0.315 | 
##              |     0.046 |     0.174 |     0.368 |     0.436 |     0.440 |           | 
##              |     0.005 |     0.040 |     0.117 |     0.079 |     0.074 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            I |        91 |       133 |        65 |        21 |        19 |       329 | 
##              |    93.745 |    44.969 |    14.918 |    25.089 |    24.070 |           | 
##              |     0.277 |     0.404 |     0.198 |     0.064 |     0.058 |     0.318 | 
##              |     0.843 |     0.564 |     0.198 |     0.112 |     0.109 |           | 
##              |     0.088 |     0.128 |     0.063 |     0.020 |     0.018 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            M |        12 |        62 |       143 |        85 |        79 |       381 | 
##              |    19.344 |     7.082 |     4.003 |     3.639 |     3.331 |           | 
##              |     0.031 |     0.163 |     0.375 |     0.223 |     0.207 |     0.368 | 
##              |     0.111 |     0.263 |     0.435 |     0.452 |     0.451 |           | 
##              |     0.012 |     0.060 |     0.138 |     0.082 |     0.076 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       108 |       236 |       329 |       188 |       175 |      1036 | 
##              |     0.104 |     0.228 |     0.318 |     0.181 |     0.169 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##

##    
##      A1  A2  A3  A4  A5
##   F   5  41 121  82  77
##   I  91 133  65  21  19
##   M  12  62 143  85  79

Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?

Infants fall primarily into classes A1 and A2 (224 of 329). Most of the adult population falls into classes A3 through A5, with A3 the largest single class. Males and females are distributed in roughly the same way across the classes. However, within every class there are more males than females, although the gap is small in A4 and A5. It is hard to tell at a glance whether this difference is statistically significant.
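The counts printed above are enough to reproduce the table, its margins and the barplot. This is a minimal sketch that types the 15 cell counts in by hand instead of calling table() on the original data frame, which is not available here; the object names are illustrative.

```r
# Cell counts copied from the SEX x CLASS table above
counts <- matrix(c(  5,  41, 121,  82,  77,
                    91, 133,  65,  21,  19,
                    12,  62, 143,  85,  79),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(SEX = c("F", "I", "M"),
                                 CLASS = c("A1", "A2", "A3", "A4", "A5")))
tab <- as.table(counts)

# Marginal totals (row and column sums)
with_margins <- addmargins(tab)
print(with_margins)

# Share of each sex falling in each class (rows sum to 1)
print(round(prop.table(tab, margin = 1), 3))

# Adults only: females plus males per class
adults <- colSums(tab[c("F", "M"), ])
print(adults)

# Grouped barplot of the counts, ignoring the margins
barplot(tab, beside = TRUE, legend.text = rownames(tab),
        xlab = "CLASS", ylab = "Count", main = "Abalone counts by SEX and CLASS")
```

The row proportions make the infant concentration in A1 and A2 easy to read, and the adult column sums show which class holds the most adults.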

(1)(c) (1 point) Select a simple random sample of 200 observations from “mydata” and identify this sample as “work.” Use set.seed(123) prior to drawing this sample. Do not change the number 123. Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).

Using “work”, construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample “work” will not be used in the remainder of the assignment.
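The sampling step can be sketched as follows. Since the original file is not part of this report, the block builds a hypothetical stand-in data frame named `mydata` with 1036 rows and the same column order for variables 1 to 6; the simulated values are assumptions and only the sampling logic is the point.

```r
# Hypothetical stand-in for mydata (simulated values, not the real measurements)
set.seed(1)
n <- 1036
LENGTH <- runif(n, 3, 17)
mydata <- data.frame(
  SEX    = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  LENGTH = LENGTH,
  DIAM   = 0.8 * LENGTH + rnorm(n, sd = 0.3),
  HEIGHT = 0.27 * LENGTH + rnorm(n, sd = 0.2),
  WHOLE  = 0.08 * LENGTH^3 + rnorm(n, sd = 5),
  SHUCK  = 0.035 * LENGTH^3 + rnorm(n, sd = 3)
)

# Sample 200 row indices without replacement, then select those rows
set.seed(123)
idx  <- sample(seq_len(nrow(mydata)), size = 200)
work <- mydata[idx, ]

# Scatterplot matrix of variables 2 to 6
plot(work[, 2:6])
```

Sampling row indices and then subsetting keeps every column of a sampled abalone together, which sampling a single column would not.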

Section 2: (5 points) Summarizing the data using graphics.

(2)(a) (1 point) Use “mydata” to plot WHOLE versus VOLUME. Color code data points by CLASS.

(2)(b) (2 points) Use “mydata” to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.

## [1] 0.5621008
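A base R sketch of the plot described in (2)(b) is shown below. The data are simulated stand-ins (the `WHOLE`, `SHUCK` and `CLASS` vectors here are assumptions, not the real measurements), so the maximum ratio will not equal the 0.5621008 printed above.

```r
# Simulated stand-in data; SHUCK is always a fraction of WHOLE
set.seed(1)
n <- 300
WHOLE <- runif(n, 2, 300)
SHUCK <- WHOLE * runif(n, 0.25, 0.55)
CLASS <- cut(WHOLE, breaks = c(0, 50, 100, 150, 220, 300),
             labels = c("A1", "A2", "A3", "A4", "A5"))

# Largest observed SHUCK / WHOLE ratio becomes the slope of the reference line
max_ratio <- max(SHUCK / WHOLE)

plot(WHOLE, SHUCK, col = as.integer(CLASS), pch = 16,
     xlab = "WHOLE", ylab = "SHUCK", main = "SHUCK versus WHOLE by CLASS")
abline(a = 0, b = max_ratio, lty = 2)   # a = intercept, b = slope
legend("topleft", legend = levels(CLASS),
       col = seq_along(levels(CLASS)), pch = 16)
```

Because the slope is the maximum ratio, every point lies on or below the dashed line, which gives a visual upper bound on how much of the whole weight can be meat.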

Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.

In graph (2)(b) the classes separate clearly. Class A5 clusters towards the bottom, A4 clusters just above it, A3 above that, and so on. In graph (2)(a) there is no clear grouping or distinction between the classes along the line; the classes appear scattered at random as the values increase. Both graphs show a positive trend.

Section 3: (8 points) Getting insights about the data using graphs.

(3)(a) (2 points) Use “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.

Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.

The RATIO values appear approximately normal for all three sexes when the histograms, boxplots and Q-Q plots are examined together. The histograms are roughly bell shaped, with perhaps a slight right skew for infants, females and males. The boxplots show a few outliers, and the Q-Q plots show a few points departing from the reference line at its lower and upper ends.
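One way to lay out the 3 by 3 grid in base R is sketched below. The `toy` data frame is a simulated assumption standing in for the RATIO column, so the shapes will not match the figures discussed above.

```r
# Simulated stand-in for RATIO by SEX
set.seed(1)
toy <- data.frame(
  SEX   = factor(rep(c("F", "I", "M"), each = 100)),
  RATIO = c(rnorm(100, 0.140, 0.025),
            rnorm(100, 0.142, 0.027),
            rnorm(100, 0.138, 0.028))
)
sexes <- levels(toy$SEX)

# mfrow fills the grid row by row: histograms, then boxplots, then Q-Q plots
op <- par(mfrow = c(3, 3))
for (s in sexes) {
  hist(toy$RATIO[toy$SEX == s], main = paste("Histogram:", s), xlab = "RATIO")
}
for (s in sexes) {
  boxplot(toy$RATIO[toy$SEX == s], main = paste("Boxplot:", s), ylab = "RATIO")
}
for (s in sexes) {
  x <- toy$RATIO[toy$SEX == s]
  qqnorm(x, main = paste("Q-Q plot:", s))
  qqline(x)
}
par(op)  # restore the previous graphics settings
```

Drawing all three plot types per sex in one figure makes it easier to see whether a tail that looks heavy in the histogram also shows up as boxplot outliers and as curvature in the Q-Q plot.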

(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in “mydata” (Hint: display the observations by passing a data frame to the kable() function).

## $stats
## [1] 0.07808601 0.12314437 0.14159986 0.15948490 0.21142642
## 
## $n
## [1] 329
## 
## $conf
## [1] 0.1384343 0.1447654
## 
## $out
## [1] 0.2693371 0.2218308 0.2403394 0.2263294 0.2249577 0.2300704 0.2290478
## [8] 0.2232339

## $stats
## [1] 0.07174086 0.12123374 0.13808946 0.15542149 0.20522327
## 
## $n
## [1] 326
## 
## $conf
## [1] 0.1350978 0.1410812
## 
## $out
## [1] 0.31176204 0.21216140 0.21465603 0.21306058 0.23497668 0.06733877

## $stats
## [1] 0.07171905 0.12228101 0.13776396 0.16085876 0.21627955
## 
## $n
## [1] 381
## 
## $conf
## [1] 0.1346412 0.1408867
## 
## $out
## [1] 0.2609861 0.2378764 0.2345924 0.2356349 0.2286735

## $stats
## [1] 0.07808601 0.12314437 0.14159986 0.15948490 0.24033943
## 
## $n
## [1] 329
## 
## $conf
## [1] 0.1384343 0.1447654
## 
## $out
## [1] 0.2693371

## $stats
## [1] 0.06733877 0.12123374 0.13808946 0.15542149 0.23497668
## 
## $n
## [1] 326
## 
## $conf
## [1] 0.1350978 0.1410812
## 
## $out
## [1] 0.311762

## $stats
## [1] 0.07171905 0.12228101 0.13776396 0.16085876 0.26098609
## 
## $n
## [1] 381
## 
## $conf
## [1] 0.1346412 0.1408867
## 
## $out
## numeric(0)

Essay Question (2 points): What are your observations regarding the results in (3)(b)?

The infant RATIO has 8 outliers: 0.2693371, 0.2218308, 0.2403394, 0.2263294, 0.2249577, 0.2300704, 0.2290478 and 0.2232339. These outliers could be skewing the infant data and may have to be addressed later on.

The male RATIO has 5 outliers: 0.2609861, 0.2378764, 0.2345924, 0.2356349 and 0.2286735. These outliers could be skewing the male data and may have to be addressed later on.

The female RATIO has 6 outliers: 0.31176204, 0.21216140, 0.21465603, 0.21306058, 0.23497668 and 0.06733877. These outliers could be skewing the female data and may have to be addressed later on. The last value is the only outlier on the low side for any sex.

The female and infant categories are the only ones with outliers that we would deem “extreme.” The female extreme outlier is 0.311762 and the infant extreme outlier is 0.2693371.
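The mild versus extreme split can be reproduced from the printed output alone. This sketch assumes the usual boxplot convention, with mild outliers beyond 1.5 times the hinge spread and extreme outliers beyond 3 times it, and uses the hinges from `$stats` and the values from the first three `$out` listings above. The function name `classify_outliers` is illustrative.

```r
# Lower and upper hinges copied from the $stats output (2nd and 4th values)
hinges <- list(
  I = c(0.12314437, 0.15948490),
  F = c(0.12123374, 0.15542149),
  M = c(0.12228101, 0.16085876)
)

# Outlying RATIO values copied from the first three $out listings
outliers <- list(
  I = c(0.2693371, 0.2218308, 0.2403394, 0.2263294,
        0.2249577, 0.2300704, 0.2290478, 0.2232339),
  F = c(0.31176204, 0.21216140, 0.21465603, 0.21306058,
        0.23497668, 0.06733877),
  M = c(0.2609861, 0.2378764, 0.2345924, 0.2356349, 0.2286735)
)

# Label each outlier "extreme" if it lies beyond 3 hinge spreads, else "mild"
classify_outliers <- function(x, h) {
  spread <- h[2] - h[1]
  extreme <- x < h[1] - 3 * spread | x > h[2] + 3 * spread
  ifelse(extreme, "extreme", "mild")
}

labels <- Map(classify_outliers, outliers, hinges)
extreme_counts <- sapply(labels, function(l) sum(l == "extreme"))
print(labels)
print(extreme_counts)
```

With the original data frame, the same split would normally come from calling boxplot.stats() twice, once with `coef = 1.5` and once with `coef = 3`, which appears to be how the two sets of listings above were produced.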

Section 4: (8 points) Getting insights about possible predictors.

(4)(a) (3 points) With “mydata,” display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.

Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.

VOLUME and WHOLE appear to be reasonable predictors of age when differentiating between abalones in classes A1, A2 and A3. However, the medians for classes A4 and A5 are extremely similar, and A5 also overlaps with A3 to a large extent. These physical measurements therefore do not appear to differentiate well between ages once abalones have reached maturity. The scatterplots against RINGS behave in the same way as the CLASS boxplots, so VOLUME and WHOLE do not appear to separate older abalones by ring count either.
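A possible base R layout for the four figures in (4)(a) is sketched below on simulated data. The `toy` data frame and its growth curve are assumptions chosen only to mimic size levelling off at higher ring counts; the CLASS breaks follow the ring ranges visible in the CLASS by RINGS table earlier in the report.

```r
# Simulated stand-in: size grows with RINGS, then flattens after about 11 rings
set.seed(1)
n <- 500
RINGS <- sample(3:20, n, replace = TRUE)
toy <- data.frame(
  RINGS  = RINGS,
  VOLUME = pmax(0, pmin(RINGS, 11) * 40 + rnorm(n, sd = 60)),
  WHOLE  = pmax(0, pmin(RINGS, 11) * 13 + rnorm(n, sd = 20)),
  CLASS  = cut(RINGS, breaks = c(0, 6, 8, 10, 12, Inf),
               labels = c("A1", "A2", "A3", "A4", "A5"))
)

# Row 1: boxplots by CLASS; row 2: scatterplots against RINGS
op <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ CLASS, data = toy, main = "VOLUME by CLASS")
boxplot(WHOLE ~ CLASS, data = toy, main = "WHOLE by CLASS")
plot(toy$RINGS, toy$VOLUME, xlab = "RINGS", ylab = "VOLUME", main = "VOLUME versus RINGS")
plot(toy$RINGS, toy$WHOLE, xlab = "RINGS", ylab = "WHOLE", main = "WHOLE versus RINGS")
par(op)
```

When the boxes for neighbouring classes overlap heavily, as they do for the older classes in this simulation, the measurement carries little information for telling those classes apart.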

Section 5: (12 points) Getting insights regarding different groups in the data.

(5)(a) (2 points) Use aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using matrix(), create matrices of the mean values. Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The kable() function is useful for this purpose. You do not need to be concerned with the number of digits presented.

##      
##         A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036

##    Sex Class Average Volume Average Shuck Average Ratio Count
## 1    F    A1      255.29938      38.90000     0.1546644     5
## 2    I    A1       66.51618      10.11332     0.1569554    91
## 3    M    A1      103.72320      16.39583     0.1512698    12
## 4    F    A2      276.85731      42.50305     0.1554605    41
## 5    I    A2      160.31999      23.41024     0.1475600   133
## 6    M    A2      245.38571      38.33855     0.1564017    62
## 7    F    A3      412.60794      59.69121     0.1450304   121
## 8    I    A3      270.74063      37.17969     0.1372256    65
## 9    M    A3      358.11811      52.96933     0.1462123   143
## 10   F    A4      498.04889      69.05161     0.1379609    82
## 11   I    A4      316.41292      39.85369     0.1244413    21
## 12   M    A4      442.61552      61.42726     0.1364881    85
## 13   F    A5      486.15253      59.17076     0.1233605    77
## 14   I    A5      318.69299      36.47047     0.1167649    19
## 15   M    A5      440.20736      55.02762     0.1262089    79
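The step from the long table to a labelled matrix can be sketched with the VOLUME means printed above. The values are typed in by hand because the original data frame is not available here; with the data, a call along the lines of aggregate(VOLUME ~ SEX + CLASS, data = mydata, FUN = mean) is one assumed way to obtain rows in this order.

```r
# Mean VOLUME values in the printed row order (SEX varies fastest within CLASS)
mean_volume <- c(255.29938,  66.51618, 103.72320,
                 276.85731, 160.31999, 245.38571,
                 412.60794, 270.74063, 358.11811,
                 498.04889, 316.41292, 442.61552,
                 486.15253, 318.69299, 440.20736)

# Fill column by column so each column is one CLASS and each row one SEX
volume_mat <- matrix(mean_volume, nrow = 3, byrow = FALSE,
                     dimnames = list(c("F", "I", "M"),
                                     c("A1", "A2", "A3", "A4", "A5")))
print(round(volume_mat, 2))
```

Filling by column matters here: with `byrow = TRUE` the same 15 numbers would be assigned to the wrong SEX and CLASS labels without any error being raised.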

(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().

Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.

It is interesting that female abalones have a higher mean shuck weight than males in every class. Mean volume and mean shuck weight also rise sharply through classes A1, A2 and A3, but the changes are far smaller once abalones reach classes A4 and A5. Mean shuck weight peaks in class A4 for all three sexes, so it would be interesting to learn what causes this mass to go down in class A5.
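The observations about mean shuck weight can be checked against the table in (5)(a). The sketch below retypes the mean SHUCK values into a long data frame (the name `long` is illustrative), draws one line per sex with interaction.plot(), and then tests the two claims directly.

```r
# Long table in the printed order: Sex varies fastest within Class
long <- expand.grid(Sex = c("F", "I", "M"),
                    Class = c("A1", "A2", "A3", "A4", "A5"))
long$Shuck <- c(38.90000, 10.11332, 16.39583,
                42.50305, 23.41024, 38.33855,
                59.69121, 37.17969, 52.96933,
                69.05161, 39.85369, 61.42726,
                59.17076, 36.47047, 55.02762)

# One line per sex, mean SHUCK against CLASS
interaction.plot(x.factor = long$Class, trace.factor = long$Sex,
                 response = long$Shuck, fun = mean,
                 xlab = "CLASS", ylab = "Mean SHUCK", trace.label = "SEX")

# Is the female mean above the male mean in every class?
female_above_male <- all(long$Shuck[long$Sex == "F"] > long$Shuck[long$Sex == "M"])

# In which class does each sex reach its highest mean SHUCK?
peak_class <- sapply(split(long, long$Sex),
                     function(d) as.character(d$Class[which.max(d$Shuck)]))
print(female_above_male)
print(peak_class)
```

These are comparisons of cell means only; they say nothing about whether the differences are statistically significant.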

5(c) (3 points) Present four boxplots using par(mfrow = c(2, 2)) or grid.arrange(). The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels “M” and “F” combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.

Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?

These graphs show that volume and whole weight increase as the number of rings increases. However, the increase becomes far less pronounced once abalones reach about 10 rings. The graphs also show that the size of infant abalones varies far less than that of adults: the infant boxplots have much shorter whiskers than the adult boxplots.
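The subsetting and layout described in 5(c) might look like the following in base R. The `toy` data frame is simulated and its growth pattern is an assumption, so only the filtering and plotting steps carry over to the real data.

```r
# Simulated stand-in data with SEX, RINGS, VOLUME and WHOLE
set.seed(1)
n <- 600
toy <- data.frame(
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE)),
  RINGS = sample(3:25, n, replace = TRUE)
)
toy$VOLUME <- pmax(0, pmin(toy$RINGS, 11) * 40 + rnorm(n, sd = 60))
toy$WHOLE  <- pmax(0, pmin(toy$RINGS, 11) * 13 + rnorm(n, sd = 20))

# Keep fewer than 16 rings, then split infants from adults (M and F combined)
under16 <- subset(toy, RINGS < 16)
infants <- subset(under16, SEX == "I")
adults  <- subset(under16, SEX != "I")

op <- par(mfrow = c(2, 2))
boxplot(VOLUME ~ RINGS, data = infants, ylim = c(0, 1100), main = "Infant VOLUME by RINGS")
boxplot(VOLUME ~ RINGS, data = adults,  ylim = c(0, 1100), main = "Adult VOLUME by RINGS")
boxplot(WHOLE ~ RINGS,  data = infants, ylim = c(0, 400),  main = "Infant WHOLE by RINGS")
boxplot(WHOLE ~ RINGS,  data = adults,  ylim = c(0, 400),  main = "Adult WHOLE by RINGS")
par(op)
```

Fixing `ylim` across the infant and adult panels is what makes the whisker lengths directly comparable between the two groups.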

Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).

Conclusions

Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.

There is some skewness in the data that would need to be addressed to get the best possible results. Many standard methods work best when the data are approximately normally distributed. The investigators may have to re-examine their sampling procedures. They may also need to revisit how they sex the abalones, which can be a very difficult task.

Many of the physical measurements also appear to be closely related to one another, for example diameter and height, length and diameter, and whole weight and volume. These variables should be examined for multicollinearity by computing their variance inflation factor (VIF) and removing those with a value higher than 5 (or whatever threshold the data analyst deems appropriate). Multicollinearity can be a major issue in any regression analysis.
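The VIF check can be sketched in base R without extra packages: regress each predictor on the others and compute 1 / (1 - R^2). The data below are simulated, with a `DIAM` column built to track `LENGTH` closely and an `UNRELATED` column added purely for contrast; all three are assumptions for illustration, not the abalone measurements.

```r
# Simulated predictors: DIAM is nearly a linear function of LENGTH
set.seed(1)
n <- 200
LENGTH <- rnorm(n, mean = 11, sd = 2.5)
toy <- data.frame(
  LENGTH    = LENGTH,
  DIAM      = 0.8 * LENGTH + rnorm(n, sd = 0.2),
  UNRELATED = rnorm(n, mean = 3, sd = 0.8)
)

# VIF for each predictor: regress it on all the others, then 1 / (1 - R^2)
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))

# Predictors exceeding the threshold of 5 mentioned in the text
print(names(vif)[vif > 5])
```

In this simulation the two collinear columns are flagged together, which illustrates why only one of a highly correlated pair usually needs to be dropped before refitting.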

Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?

I would want to know everything that went into the sampling. How many observations were collected? The sample size is crucial, and I would want to confirm that it can plausibly represent the population as a whole. I would also ask how the sampling locations or units were chosen and what collection method was used. Could the selection have been randomized any better? Is there any sign of bias in where or how the sample was taken? Finally, I would check the histogram for skewness, ask whether any outliers might be causing that skewness, and compare the various measures of central tendency.

Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

One difficulty with observational studies is that they leave room for error. Error can enter when taking measurements, classifying subjects, recording data, or selecting where to collect observations, and each of these steps is open to human bias and human error. Most observational data are also imperfect, so the analyst usually has to clean them and make judgment calls, such as whether certain outliers should be included.

Causality cannot be determined from observational data alone. There is no control group to compare against, so scientists are careful not to make sweeping claims. They generally report only that there appears to be a trend, pattern, or correlation between certain factors. One can never safely state something as a certainty.