Data analysis is the process of evaluating data using analytical and statistical tools to discover useful insights.
Raw Data
(Preprocessing). Preprocessing is done before applying machine learning algorithms.
Preprocessing is the process of giving structure to our data for better understanding and decision-making related to the data. The following steps summarize the data preprocessing pipeline:
Observations: 1,036
Variables: 8
$ Sex <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I…
$ Length <dbl> 5.565, 3.675, 10.080, 4.095, 6.930, 7.875, 6.300, 6.615, 5.250…
$ Diam <dbl> 4.095, 2.625, 7.350, 3.150, 4.830, 6.090, 4.620, 4.935, 3.885,…
$ Height <dbl> 1.260, 0.840, 2.205, 0.945, 1.785, 2.100, 1.680, 1.575, 1.365,…
$ Whole <dbl> 11.500000, 3.500000, 79.375000, 4.687500, 21.187500, 27.375000…
$ Shuck <dbl> 4.3125, 1.1875, 44.0000, 2.2500, 9.8750, 11.5625, 5.9375, 7.31…
$ Rings <int> 6, 4, 6, 3, 6, 6, 5, 6, 5, 6, 6, 5, 5, 5, 6, 5, 6, 4, 6, 5, 6,…
$ Class <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A…
Our base dataset contains a total of 8 variables: 5 continuous variables (Length, Diameter, Height, Whole and Shuck), 1 discrete variable (Rings), and 2 categorical variables (Sex and Class).
Volume and Ratio are variables appended to our dataframe “mydata.”
Volume is computed as Length * Diameter * Height
Ratio is computed as Shuck / Volume
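As a quick check of the two formulas above, here is a minimal Python sketch (the analysis itself is done in R). It uses the third observation shown in the data preview; the variable names are my own and purely illustrative.

```python
# Third observation from the data preview: Length, Diam, Height, Shuck.
length, diam, height, shuck = 10.080, 7.350, 2.205, 44.0

volume = length * diam * height  # Volume = Length * Diameter * Height
ratio = shuck / volume           # Ratio  = Shuck / Volume

print(round(volume, 2), round(ratio, 2))  # 163.36 0.27
```

The same abalone reappears later in the outlier tables with Volume 163.36 and Ratio 0.27, which agrees with this calculation.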
| Variable | Min. | 1st Qu. | Mean | Median | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Diam | 1.995 | 7.350 | 8.622 | 8.925 | 10.185 | 13.230 |
| Height | 0.525 | 2.415 | 2.947 | 2.940 | 3.570 | 4.935 |
| Length | 2.73 | 9.45 | 11.08 | 11.45 | 13.02 | 16.80 |
| Ratio | 0.06734 | 0.12241 | 0.14205 | 0.13914 | 0.15911 | 0.31176 |
| Rings | 3.000 | 8.000 | 9.993 | 9.000 | 11.000 | 25.000 |
| Shuck | 0.5625 | 23.3006 | 45.4396 | 42.5700 | 64.2897 | 157.0800 |
| Volume | 3.612 | 163.545 | 326.804 | 307.363 | 463.264 | 995.673 |
| Whole | 1.625 | 56.484 | 105.832 | 101.344 | 150.319 | 315.750 |
| Class | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 9 | 8 | 24 | 67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 108 |
| A2 | 0 | 0 | 0 | 0 | 91 | 145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 236 |
| A3 | 0 | 0 | 0 | 0 | 0 | 0 | 182 | 147 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 329 |
| A4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 125 | 63 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 188 |
| A5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 35 | 27 | 15 | 13 | 8 | 8 | 6 | 4 | 1 | 7 | 2 | 1 | 175 |
| Sum | 9 | 8 | 24 | 67 | 91 | 145 | 182 | 147 | 125 | 63 | 48 | 35 | 27 | 15 | 13 | 8 | 8 | 6 | 4 | 1 | 7 | 2 | 1 | 1036 |
Question:
Briefly discuss variable types and distribution implications such as potential skewness and outliers.
Answer:
Our base dataset contains a total of 8 variables: 5 continuous variables (Length, Diameter, Height, Whole and Shuck), 1 discrete variable (Rings), and 2 categorical variables (Sex and Class).
SEX is a regular nominal variable; ordering does not apply to it. CLASS is a bit more interesting, in that it is an ordered (ordinal) feature that is assigned to an abalone based on the RINGS feature.
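To make the link between RINGS and CLASS concrete, here is a minimal Python sketch of the mapping. The cut points (6, 8, 10, 12) are read off the RINGS by CLASS table above and are therefore an inference from the observed counts, not a documented rule; the function name is hypothetical.

```python
import bisect

# Upper ring count of classes A1..A4, inferred from the RINGS by CLASS table.
# Anything above the last cut point falls into A5.
CUT_POINTS = [6, 8, 10, 12]
LABELS = ["A1", "A2", "A3", "A4", "A5"]

def ring_class(rings: int) -> str:
    """Return the Age CLASS label for a ring count (assumed binning)."""
    return LABELS[bisect.bisect_left(CUT_POINTS, rings)]

for r in (3, 6, 7, 10, 12, 13, 25):
    print(r, ring_class(r))
```

Because CLASS is fully determined by RINGS under this binning, it carries ordering information but no information beyond the ring count itself.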
There are 2 additional variables I explicitly added to the data for analysis, VOLUME and RATIO. Both variables are continuous, as they are derived from underlying continuous variables.
VOLUME is calculated from Length, Diameter and Height for use in later analysis.
RATIO is a continuous measure of the SHUCK weight (the meat without the shell, in grams) relative to the VOLUME of the shell.
All continuous variables have well-behaved distributions. HEIGHT is the closest to normal, with a shape that is approximately symmetric about the mean.
WHOLE and SHUCK are significantly skewed to the right, to the point that they could almost fit an exponential distribution given their sharply descending shape.
SHUCK displays this characteristic most prominently. DIAMETER and LENGTH appear to be approximately normal overall; however, their distributions do have some considerable outliers, LENGTH in particular.
SEX is approximately uniformly distributed, with each SEX representing roughly a third of the dataset.
CLASS is relatively normal, with A3 being the dominant Age CLASS.
RINGS is the only discrete measure, with a relatively normal shape that is right-skewed.
The SHUCK and LENGTH variables have a high probability of outliers, given that their max values are over 3 times their respective IQRs of 40.99 and 3.57.
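The IQR figures quoted above can be reproduced from the summary table. The Python sketch below does that and, for comparison only, also evaluates the conventional boxplot fences (1.5 times the IQR beyond the quartiles); that second rule is my addition and is not the check used in the text.

```python
# Quartiles and extremes copied from the summary table (Shuck and Length rows).
summary = {
    "Shuck":  {"min": 0.5625, "q1": 23.3006, "q3": 64.2897, "max": 157.0800},
    "Length": {"min": 2.73,   "q1": 9.45,    "q3": 13.02,   "max": 16.80},
}

checks = {}
for name, s in summary.items():
    iqr = s["q3"] - s["q1"]
    checks[name] = {
        "iqr": round(iqr, 2),
        # Rule of thumb used in the text: max value above 3 * IQR.
        "max_over_3_iqr": s["max"] > 3 * iqr,
        # Conventional boxplot fences, shown only for comparison.
        "below_lower_fence": s["min"] < s["q1"] - 1.5 * iqr,
        "above_upper_fence": s["max"] > s["q3"] + 1.5 * iqr,
    }
    print(name, checks[name])
```

On these summary values the conventional fences flag the SHUCK maximum and the LENGTH minimum, so the two rules need not point at the same tail.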
The distribution of RINGS by Age CLASS is curious, as the majority of abalone, 64%, have between 8 and 12 rings irrespective of their Age CLASS. A disproportionate share of abalone also falls in the A3 class, with 32% where we might expect ~25%. This could be due to harvesting limiting older populations, a side-effect of this sample, or simply a characteristic of the true population.
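The two percentages above follow directly from the Sum row of the RINGS by CLASS table. A short Python sketch of the arithmetic (the dictionary simply transcribes that row; treating A3 as 9 and 10 rings is read off the same table):

```python
# Number of abalone per ring count, copied from the Sum row of the table.
ring_counts = {
    3: 9, 4: 8, 5: 24, 6: 67, 7: 91, 8: 145, 9: 182, 10: 147, 11: 125,
    12: 63, 13: 48, 14: 35, 15: 27, 16: 15, 17: 13, 18: 8, 19: 8, 20: 6,
    21: 4, 22: 1, 23: 7, 24: 2, 25: 1,
}
total = sum(ring_counts.values())

# Share of abalone with 8 to 12 rings (inclusive).
share_8_12 = sum(n for r, n in ring_counts.items() if 8 <= r <= 12) / total
# Share of abalone in A3, which the table shows as 9 and 10 rings.
share_a3 = (ring_counts[9] + ring_counts[10]) / total

print(total, round(100 * share_8_12, 1), round(100 * share_a3, 1))  # 1036 63.9 31.8
```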
| Sex | A1 | A2 | A3 | A4 | A5 | Sum |
|---|---|---|---|---|---|---|
| Infant | 91 | 133 | 65 | 21 | 19 | 329 |
| Female | 5 | 41 | 121 | 82 | 77 | 326 |
| Male | 12 | 62 | 143 | 85 | 79 | 381 |
| Sum | 108 | 236 | 329 | 188 | 175 | 1036 |
Question:
Discuss the SEX distribution of abalone. What stands out about the distribution of abalone by CLASS?
Answer:
In the breakdown of Age CLASS by SEX I noticed an expected rise in the count of Male and Female with a corresponding drop in Infants up until the A3 CLASS, after which there is an across-the-board drop in abalone in both A4 and A5 (although the relative change between A4 and A5 is minuscule).
The curious part of the above data is the persistence of Infants throughout all 5 Age Classes. Intuitively, one would think they would drop off after A3, turning either Male or Female. This could also be a function of how SEX is classified in abalone. As previously mentioned, the universal drop in A4 and A5 CLASS could be due to harvesting or the sampling technique.
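To quantify the persistence of Infants, the Python sketch below computes the Infant share within each Age CLASS from the SEX by CLASS counts above. It only restates the table as proportions.

```python
# Counts copied from the SEX by CLASS table.
counts = {
    "Infant": {"A1": 91, "A2": 133, "A3": 65,  "A4": 21, "A5": 19},
    "Female": {"A1": 5,  "A2": 41,  "A3": 121, "A4": 82, "A5": 77},
    "Male":   {"A1": 12, "A2": 62,  "A3": 143, "A4": 85, "A5": 79},
}
classes = ["A1", "A2", "A3", "A4", "A5"]

infant_share = {}
for cls in classes:
    class_total = sum(row[cls] for row in counts.values())
    # Percentage of abalone in this class that are labeled Infant.
    infant_share[cls] = round(100 * counts["Infant"][cls] / class_total, 1)

print(infant_share)
# {'A1': 84.3, 'A2': 56.4, 'A3': 19.8, 'A4': 11.2, 'A5': 10.9}
```

Roughly one in nine abalone in A4 and A5 is still labeled Infant, which is the persistence noted above.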
Using “work”, I construct a scatterplot matrix of variables 2-6 with plot(work[, 2:6]) (these are the continuous variables excluding VOLUME and RATIO). The sample object “work” will not be used in the remainder of the analysis.
(To avoid selection bias, I’m using random sampling. The choices of sampling are (a) Simple Random Sampling or (b) Stratified Random Sampling.)
(Selection bias occurs during sampling of the population, when a selected sample does not represent the characteristics of the population.) Class imbalance, i.e., imbalanced data, can have a significant impact on model predictions and performance.
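The two sampling choices can be sketched as follows in Python. This is a minimal illustration, not the procedure used to build “work”: the population is a stand-in list of (CLASS, index) labels sized from the CLASS totals, and the 10% fraction and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

# Class sizes copied from the CLASS totals; the records are placeholders,
# not the real abalone rows.
class_sizes = {"A1": 108, "A2": 236, "A3": 329, "A4": 188, "A5": 175}
population = [(cls, i) for cls, n in class_sizes.items() for i in range(n)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
fraction = 0.10          # assumed sampling fraction

# (a) Simple random sampling: every record has the same chance of selection.
simple = rng.sample(population, round(len(population) * fraction))

# (b) Stratified random sampling: draw from each CLASS separately, in
# proportion to its size, so that no class can be missed by chance.
stratified = []
for cls in class_sizes:
    stratum = [rec for rec in population if rec[0] == cls]
    k = max(1, round(len(stratum) * fraction))
    stratified.extend(rng.sample(stratum, k))

print(Counter(cls for cls, _ in simple))
print(Counter(cls for cls, _ in stratified))
```

With stratification the per-class counts are fixed by design, whereas under simple random sampling they vary from draw to draw, which matters most for small classes such as A1.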
Using “mydata” to plot WHOLE vs. VOLUME. Color coding data points by CLASS.
Question:
How does the variability in this plot differ from the plot in (2a)?
Compare the 2 displays, keeping in mind that SHUCK is a part of WHOLE. Consider the location of the different Age Classes.
Answer:
Whole ~ Shuck Correlation: 0.973
There is a strong positive correlation between WHOLE and SHUCK, 0.973, a behavior we also saw in the Sample Statistics section (1c). This is intuitive, as WHOLE Weight includes SHUCK Weight. The overall relationship in both charts is strongly linear.
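For reference, the correlation quoted here is a Pearson coefficient. The Python sketch below shows the calculation on the first six WHOLE and SHUCK values from the data preview only, so its result will not equal the 0.973 obtained on all 1,036 observations; the helper name is my own.

```python
import math

# First six Whole and Shuck values from the data preview (a tiny subsample).
whole = [11.5, 3.5, 79.375, 4.6875, 21.1875, 27.375]
shuck = [4.3125, 1.1875, 44.0, 2.25, 9.875, 11.5625]

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson(whole, shuck)
print(round(r, 3))
```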
The variability is noticeably larger between WHOLE Weight and VOLUME than between SHUCK Weight and WHOLE Weight. The data points in SHUCK | WHOLE tend to be more normally distributed above and below the mean and approximately symmetrical, whereas WHOLE Weight to VOLUME has a strong positive skew, meaning the distribution would have a long right tail where SHUCK to WHOLE would have a more symmetric, normal shape, aside from some obvious outliers.
Additionally, I colored the points by Age CLASS, and the dominant outliers in both sets are concentrated in the A4 and A5 Age CLASS. This could be a naturally occurring phenomenon: as abalone Age, they develop unique size characteristics based on their genetics.
It’s interesting to note that the maximum weights in A1, WHOLE (109.25) and SHUCK (51.25), confine this Age CLASS to basically the lower quarter of both displays. The rest of the abalone, in classes A2-A5, are distributed pretty randomly above these thresholds.
| Class | Whole | Shuck |
|---|---|---|
| A1 | 109.25 | 51.25 |
| A2 | 212.56 | 102.16 |
| A3 | 297.62 | 146.01 |
| A4 | 296.25 | 140.81 |
| A5 | 315.75 | 157.08 |
Question:
Comparing the displays, how do our distributions compare to normality?
Answer:
The RATIO variable is well-behaved across the SEX of abalone. All 3 Sexes have a slight right skew in their distributions, and even though the majority of our data is clustered about the mean in a normal fashion, there is a considerable number of outliers in these data.
The boxplots and QQ-plots do a good job of pointing out the outliers in the sample. The QQ-plot for Female is interesting, as it follows basically a straight line until outliers start to show at around 1.75, after which they occur frequently.
Surprisingly, Infants have the most deviated distribution, having outliers in both positive and negative directions. This could be attributed to the older, A4 and A5 Age CLASS Infants, as one would expect a lower deviation in younger specimens.
(Anomaly Detection Methods)
Anomaly detection is a technique used to identify unusual patterns, called outliers, that do not conform to expected behavior. The simplest approach to identify irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles.
Density-Based Anomaly Detection is a technique which works on the assumption that normal data points occur in a dense neighborhood and abnormalities are far away.
The nearest set of data points is evaluated using a score, which could be Euclidean distance or something else. Another technique to detect anomalies is the Z-score, which is a parametric outlier detection method. This technique assumes a Gaussian distribution of the data. The outliers are the data points that are in the tails of the distribution and therefore far from the mean.
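A minimal Python sketch of the Z-score approach follows. The values are made up for illustration (they are not abalone measurements), and the cutoff of 3 standard deviations is the usual convention rather than something fixed by the method.

```python
import statistics

# Hypothetical, made-up measurements with one unusually large value.
values = [0.14, 0.13, 0.15, 0.12, 0.16, 0.14, 0.13, 0.15, 0.14, 0.13,
          0.15, 0.14, 0.12, 0.16, 0.14, 0.13, 0.15, 0.14, 0.14, 0.31]

def zscore_outliers(data, threshold=3.0):
    """Return the points lying more than `threshold` sample standard
    deviations from the mean (assumes roughly Gaussian data)."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

flagged = zscore_outliers(values)
print(flagged)  # [0.31]
```

One caveat of this method is that the outlier itself inflates the mean and standard deviation, so in very small samples even an extreme point may stay under the threshold.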
| Sex | w | p | Result |
|---|---|---|---|
| Infant | 0.96962 | 0e+00 | Reject Null |
| Female | 0.96028 | 0e+00 | Reject Null |
| Male | 0.98031 | 5e-05 | Reject Null |
The Shapiro-Wilk test on the RATIO variable by SEX shows that, for every SEX, we should reject the Null Hypothesis that these data came from a normal distribution; it is therefore unlikely that these data come from a truly normal distribution.
(To detect whether a new observation is an outlier, the following techniques are available.)
Boxplot/Whiskers plot to visualize outliers: Any value that is greater than the upper limit or less than the lower limit of the plot is an outlier. Only the data that lies within the Lower and Upper limit is statistically considered normal and thus can be used for further analysis.
Std deviation: Finding the points which lie more than 3 times the std deviation from the mean of the data. In the empirical sciences, the so-called “three-sigma rule of thumb” expresses the conventional heuristic that nearly all values lie within 3 std deviations of the mean.
Clustering: Using K-means or DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to detect outliers.
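The boxplot limits above can be sketched in Python as follows, using the common convention of 1.5 times the IQR for mild outliers and 3 times the IQR for extreme ones. As an assumption for illustration, the quartiles are the overall RATIO quartiles from the summary table; the boxplots in this analysis are drawn by SEX, so their per-group limits will differ.

```python
# Overall RATIO quartiles from the summary table (an assumption here:
# the analysis itself computes limits separately for each SEX).
Q1, Q3 = 0.12241, 0.15911

def classify(value, q1=Q1, q3=Q3):
    """Label a value as 'extreme', 'mild' or 'none' using IQR fences."""
    iqr = q3 - q1
    if value < q1 - 3.0 * iqr or value > q3 + 3.0 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "none"

# Max, min and mean of RATIO, taken from the same summary table.
for v in (0.31176, 0.06734, 0.14205):
    print(v, classify(v))
```

With these overall quartiles the maximum RATIO is classified as extreme, the minimum as mild, and the mean as no outlier.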
| Sex | Length | Diam | Height | Whole | Shuck | Rings | Class | Volume | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Infant | 10.08 | 7.35 | 2.21 | 79.38 | 44.00 | 6 | A1 | 163.36 | 0.27 |
| Infant | 4.30 | 3.25 | 0.94 | 6.19 | 2.94 | 3 | A1 | 13.24 | 0.22 |
| Infant | 2.84 | 2.73 | 0.84 | 3.62 | 1.56 | 4 | A1 | 6.50 | 0.24 |
| Infant | 6.72 | 4.30 | 1.68 | 22.62 | 11.00 | 5 | A1 | 48.60 | 0.23 |
| Infant | 5.04 | 3.67 | 0.94 | 9.66 | 3.94 | 5 | A1 | 17.50 | 0.22 |
| Infant | 3.36 | 2.31 | 0.52 | 2.44 | 0.94 | 4 | A1 | 4.07 | 0.23 |
| Infant | 6.93 | 4.72 | 1.57 | 23.38 | 11.81 | 7 | A2 | 51.57 | 0.23 |
| Infant | 9.13 | 6.30 | 2.52 | 74.56 | 32.38 | 8 | A2 | 145.03 | 0.22 |
| Female | 7.98 | 6.72 | 2.42 | 80.94 | 40.38 | 7 | A2 | 129.51 | 0.31 |
| Female | 11.55 | 7.98 | 3.46 | 150.62 | 68.55 | 10 | A3 | 319.37 | 0.21 |
| Female | 11.45 | 8.09 | 3.15 | 139.81 | 68.49 | 9 | A3 | 291.48 | 0.23 |
| Female | 12.18 | 9.45 | 4.93 | 133.88 | 38.25 | 14 | A5 | 568.02 | 0.07 |
| Male | 13.44 | 10.81 | 1.68 | 130.25 | 63.73 | 10 | A3 | 244.19 | 0.26 |
| Male | 10.50 | 7.77 | 3.15 | 132.69 | 61.13 | 9 | A3 | 256.99 | 0.24 |
| Male | 10.71 | 8.61 | 3.25 | 160.31 | 70.41 | 9 | A3 | 300.15 | 0.23 |
| Male | 12.29 | 9.87 | 3.46 | 176.12 | 99.00 | 10 | A3 | 420.14 | 0.24 |
| Male | 11.55 | 8.82 | 3.36 | 167.56 | 78.27 | 10 | A3 | 342.29 | 0.23 |
| Male | 11.45 | 8.61 | 2.52 | 99.12 | 53.71 | 9 | A3 | 248.32 | 0.22 |
| Sex | Length | Diam | Height | Whole | Shuck | Rings | Class | Volume | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Infant | 10.08 | 7.35 | 2.21 | 79.38 | 44.00 | 6 | A1 | 163.36 | 0.27 |
| Female | 7.98 | 6.72 | 2.42 | 80.94 | 40.38 | 7 | A2 | 129.51 | 0.31 |
Question:
What are our observations regarding the results in (3)(b)?
Answer:
The RATIO boxplot by SEX highlights the dominance of outliers in Infant abalone. Almost half of the total mild outliers belong to the Infants, almost as many as Male and Female combined. Interestingly, the detailed display here is helpful in that only one of the outliers (a Female in A5) is in the A4 or A5 Age CLASS, which is surprising.
The Infant outliers are 75% A1, which could be derived from the variability among different species of abalone Infants. Notably, both of the extreme outliers are in A1/A2 as well; highlighting the variability among younger abalone.
In the Male and Female samples I notice that the mild outliers are concentrated in the A3 Age CLASS.
With “mydata,”
Question:
How well do we think these variables would perform as predictors of Age?
Answer:
If I break down the distribution of VOLUME and WHOLE weight by Age CLASS, I see a lot of clustering around the mean; however, the data has an abundance of outliers in every regard. The A1 Age CLASS in particular is wildly distributed, having a narrow IQR and several noticeable outliers. The overall variability seems to stabilize somewhat in A3, continuing linearly into A4 where the mean peaks, followed by a downward trend into A5, although the existence of so many outliers makes it difficult to see any clear patterns.
Color coding the scatterplot in 4b by Age CLASS helps us to see the clustering of VOLUME/WHOLE to RINGS by Age CLASS. The variability in both VOLUME and WHOLE is evident, as the only clear pattern one can discern is the relationship between Age CLASS and RINGS, which is not particularly helpful given that Age CLASS is a function of RINGS.
The overall relationship of both VOLUME and WHOLE weight to Age CLASS is not strong, and I think they would both perform poorly as predictors of AGE given the information as provided, due to the abundance of outliers and the non-linear relationship to AGE.
Note, I do not need to be concerned with the number of digits presented.
| Sex | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| Infant | 66.516 | 160.320 | 270.741 | 316.413 | 318.693 |
| Female | 255.299 | 276.857 | 412.608 | 498.049 | 486.153 |
| Male | 103.723 | 245.386 | 358.118 | 442.616 | 440.207 |
| Sex | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| Infant | 10.113 | 23.410 | 37.180 | 39.854 | 36.470 |
| Female | 38.900 | 42.503 | 59.691 | 69.052 | 59.171 |
| Male | 16.396 | 38.339 | 52.969 | 61.427 | 55.028 |
| Sex | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| Infant | 0.157 | 0.148 | 0.137 | 0.124 | 0.117 |
| Female | 0.155 | 0.155 | 0.145 | 0.138 | 0.123 |
| Male | 0.151 | 0.156 | 0.146 | 0.136 | 0.126 |
Presenting 3 graphs. Each graph includes 3 lines, one for each SEX.
This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
Question:
What questions do these plots raise? Considering Aging and SEX differences.
Answer:
From the first plot, of RATIO to Age CLASS, I find it interesting that between the A1 and A2 Age Classes there is an increase in both Male and Female, but a sharp decline in Infants. Perhaps this has something to do with reproduction in young abalone?
The strong universal descent in RATIO after A2 is otherwise basically identical between all Sexes, if I account for the offset jump with Males and Females in A1-A2.
Looking at the plot of VOLUME to Age CLASS, it is interesting that Males have the lowest increase over the interval from A1-A2, especially relative to Females and Infants, yet have the overall largest volume irrespective of SEX.
Looking at the plot of SHUCK to Age CLASS, it is interesting to note the strong linear relationship between SHUCK and VOLUME by SEX, at least until A5, when there is an especially steep decline in SHUCK for both Males and Females. The two charts would essentially overlap if I laid them on top of each other for the range A1 to A4. These could possibly be candidates for predictors of each other; I would need to explore further.
Additionally, the fact that the SHUCK weight drops from A4 to A5 while the VOLUME stays relatively the same suggests that perhaps the actual abalone begin to shrink in old Age, given that they stay inside the same shell throughout their life.
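The A4 to A5 comparison can be put in numbers with a short Python sketch. It assumes that the first SEX by CLASS table above holds mean VOLUME and the second holds mean SHUCK; the tables are not labeled, so that reading is inferred from the magnitude of the values.

```python
# (A4, A5) pairs copied from the first two SEX by CLASS tables.
# Assumption: first table = mean VOLUME, second table = mean SHUCK.
volume = {"Infant": (316.413, 318.693),
          "Female": (498.049, 486.153),
          "Male":   (442.616, 440.207)}
shuck = {"Infant": (39.854, 36.470),
         "Female": (69.052, 59.171),
         "Male":   (61.427, 55.028)}

def pct_change(a4, a5):
    """Percentage change going from A4 to A5."""
    return 100 * (a5 - a4) / a4

changes = {
    sex: (round(pct_change(*volume[sex]), 1), round(pct_change(*shuck[sex]), 1))
    for sex in volume
}
for sex, (dv, ds) in changes.items():
    print(f"{sex:6s} volume {dv:+.1f}%  shuck {ds:+.1f}%")
```

Under that reading, VOLUME moves by less than 3% in every SEX while SHUCK falls by more than 8%, which is the pattern described above.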
Presenting 4 boxplots using par(mfrow = c(2, 2)) or grid.arrange().
If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
Question:
What do these displays suggest about abalone growth? Also, comparing the Infant and Adult displays. What differences stand out?
Answer:
These displays suggest that abalone have a similar growth pattern throughout their life. Both classifications of abalone demonstrate a clear linear growth trend until they reach an inflection point, at around 10-12 rings, after which they start to decline in size by both VOLUME and WHOLE weight. This further suggests that as abalone Age they either shrink, or that this is a side-effect of harvesting, where larger abalone are taken from the population, leaving only smaller samples to populate the upper end of the ring range.
The Infants have an interesting characteristic in that at ~12 rings, there is an extremely large cluster of abalone relative to every other ring point. Notably, there are almost no Adults with fewer than 6 rings, which suggests that at around 6 rings (Years of AGE) many abalone become either Male or Female. That does not explain the clear persistence of Infant abalone with 8-12 rings, which is about 66% of all Infant abalone.
The clearest difference between Infant and Adult abalone is the variance in size between the 2 groups per ring count. Infant abalone with 7-9 rings have a large variance, while the rest of the ring categories are tightly clustered around the mean. Adult abalone, by contrast, have wild variance at every ring count where there is any significant portion, i.e., 6 rings or more.
Reference
Infant abalone with between 6 and 12 rings : 65.84
Responding to each of the following questions:
Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Considering to what extent physical measurements may be used for AGE prediction.
It is not surprising to me that the original study failed to produce an accurate predictor of AGE from physical measurements given the characteristics of the data explored in this analysis.
Initial inspection of the overall summary of the sample provided at the beginning of this study suggests I have a relatively good sample of data. There is not an abundance of outliers in any metric, and the sample is evenly distributed across AGE CLASS and SEX. The continuous variables all have well-behaved distribution characteristics, and the random sample from our initial population conformed approximately to the overall dataset.
Abalone growth patterns do not have an obvious relationship to Age CLASS. As I saw in the Growth Patterns section of this study, abalone that weigh over ~109.25 WHOLE or ~51.25 SHUCK are essentially randomly distributed above this area with no apparent commonality.
Taking a deeper look at abalone by their physical Age classifier, RINGS, I noted similar behavior. Abalone physical metrics like VOLUME and WHOLE weight by ring count seem to be stochastically distributed. Abalone with fewer than 6 RINGS tend to weigh less than ~100 grams, and beyond that, abalone weighing over ~100 grams can have anywhere between 7 and 25 rings with no apparent relationship. This behavior is similar for the total volume of a given abalone.
SEX is another interesting characteristic: while Infant and Adult abalone are distinguishable to some degree, particularly at younger Age classifications, the overall variability in their size continues to increase as abalone Age. I can see a relatively clear separation of Infants and Adults when they have under 6 RINGS; however, after that point it becomes difficult to distinguish them based on any physical measurement.
(Representative Sample). Some key features need to be kept in mind while selecting a representative sample.
Diversity: A sample must be as diverse as the search queries. It should be sensitive to all the local differences between the search query and should keep those features in mind.
Consistency: We need to make sure that any change we see in our sample data is also reflected in the true population which is the queries.
Transparency: It is extremely important to decide the appropriate sample size and structure so that it is a true representative. These properties of a sample should be discussed to ensure that the results are accurate.
Question:
If we were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what Questions might we ask before accepting them as representative of the sampled population or phenomenon?
Answer:
I would ask:
Question:
What do we see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?
Answer:
From observational studies it is difficult to know how well the sample truly represents the larger population. This kind of study can raise many questions that the data at hand cannot adequately answer. I can run several sets of summary statistics on the sample; however, I’m unable to know whether the sample was collected in a way that biases the data, thereby invalidating our observational study. (Statistical bias is the systematic difference between a model’s hypothesis and the true distribution. Bias measures how much the predictions deviate from the true value we are trying to predict.)
For example, the entire dataset could have been collected under the influence of a confounding factor, thereby tilting my conclusions. There is also a factor of random error in every study, and it is difficult, if not impossible, to account for such occurrences; I can only execute the study to the mathematical rigors of our discipline and hope they will catch such errors through validation and peer review.
(The importance of bias-variance trade-off while modeling). Bias and Variance are part of model prediction errors. A model with high bias pays very little attention to the training data and oversimplifies the model leading to underfitting. A model with high variance pays a lot of attention to training data and does not generalize well on the unseen data which leads to overfitting. Gaining proper insights and understandings into these errors would help us not only in building accurate models, but also in avoiding the mistake of overfitting and underfitting.
Underfitting/Bias: Bias error is the difference between the expected/average prediction of the model and the true value. The model building/prediction process is repeated more than once with new variations of the data. Hence, due to the randomness in the underlying dataset, we will have a set of predictions for each point. Bias measures how much the predictions deviate from the true value we are trying to predict.
Overfitting/Variance: Variance error is defined as the variability of model prediction for a given data point. The model prediction is repeated for various datasets. It’s an indicator of a model’s sensitivity to small variations that can exist when it is fed a new subset of the training data. For instance, if a model has high variance, then small changes in the training data can result in large prediction changes.
There is no analytical way to find the point at which the best bias-variance trade-off is achieved. To figure it out, it’s essential to explore the complexity of the model and measure the prediction error in order to minimize the overall error.
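The definitions of bias and variance above can be illustrated with a small simulation. The Python sketch below is built entirely on assumptions of my own: a synthetic target function, Gaussian noise, and two deliberately extreme models (a constant that ignores the input, and a 1-nearest-neighbor lookup). It repeats the fit on many freshly drawn datasets and measures squared bias and variance of the prediction at a single point.

```python
import random
import statistics

def true_f(x):
    # Synthetic "true" relationship, assumed purely for illustration.
    return x * x

def make_dataset(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_constant(xs, ys, x0):
    # Ignores x0 entirely: a very simple model (tends to underfit).
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    # Copies the response of the closest training point: a very flexible
    # model that also copies that point's noise (tends to overfit).
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def bias_variance(predict, x0, n_repeats=2000, seed=0):
    """Estimate squared bias and variance of a model's prediction at x0
    by refitting on many independently drawn datasets."""
    rng = random.Random(seed)
    preds = [predict(*make_dataset(rng), x0) for _ in range(n_repeats)]
    bias = statistics.fmean(preds) - true_f(x0)
    return bias ** 2, statistics.pvariance(preds)

x0 = 0.9
results = {}
for name, model in [("constant", predict_constant), ("1-NN", predict_1nn)]:
    results[name] = bias_variance(model, x0)
    b2, var = results[name]
    print(f"{name:8s} bias^2={b2:.4f} variance={var:.4f}")
```

In this setup the constant model shows the larger squared bias and the 1-NN model the larger variance, which is the trade-off described above; a model of intermediate complexity would sit between the two.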