The objective: to predict the Age of Abalone Species for harvesting

Data Analysis

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful insights.

Abalone

Raw Data

(Preprocessing) This step comes before applying machine learning algorithms.

Preprocessing is the process of giving structure to our data for better understanding and decision-making related to the data. The following steps summarize the data preprocessing pipeline:

  1. Discovering/Data Acquisition: Gather the data from the source and try to understand and make sense of the data.
  2. Structuring/Data Transformation: Since the data may come in different formats and sizes, it needs to have a consistent size and shape when merged together.
  3. Cleaning: This step consists of imputing null values and treating outliers/anomalies in the data to make the data ready for further analysis.
  4. Exploratory Data Analysis: Try to find patterns in the dataset and extract new features from the given data in order to optimize the performance of the applied machine learning model.
  5. Validating: This stage verifies data consistency and quality.
  6. Publishing/Modeling: The wrangled data is ready for processing further by the machine learning model.
Observations: 1,036
Variables: 8
$ Sex    <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I…
$ Length <dbl> 5.565, 3.675, 10.080, 4.095, 6.930, 7.875, 6.300, 6.615, 5.250…
$ Diam   <dbl> 4.095, 2.625, 7.350, 3.150, 4.830, 6.090, 4.620, 4.935, 3.885,…
$ Height <dbl> 1.260, 0.840, 2.205, 0.945, 1.785, 2.100, 1.680, 1.575, 1.365,…
$ Whole  <dbl> 11.500000, 3.500000, 79.375000, 4.687500, 21.187500, 27.375000…
$ Shuck  <dbl> 4.3125, 1.1875, 44.0000, 2.2500, 9.8750, 11.5625, 5.9375, 7.31…
$ Rings  <int> 6, 4, 6, 3, 6, 6, 5, 6, 5, 6, 6, 5, 5, 5, 6, 5, 6, 4, 6, 5, 6,…
$ Class  <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A…

1.) Dataset

Our base dataset contains a total of 8 variables: 5 continuous variables (Length, Diameter, Height, Whole and Shuck), 1 discrete variable (Rings), and 2 categorical variables (Sex and Class).

Volume and Ratio are variables appended to our dataframe “mydata.”

Volume is computed as Length * Diameter * Height

Ratio is computed as Shuck / Volume
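As a minimal sketch of how these two columns could be appended, the snippet below uses the first three rows from the data listing above as a stand-in for “mydata.” The column names Volume and Ratio follow the later tables; the exact code used in the original analysis is not shown, so this is an assumption.

```r
# Stand-in for mydata: the first three rows from the data listing above
mydata <- data.frame(
  Length = c(5.565, 3.675, 10.080),
  Diam   = c(4.095, 2.625, 7.350),
  Height = c(1.260, 0.840, 2.205),
  Shuck  = c(4.3125, 1.1875, 44.0000)
)

# Volume is the product of the three linear dimensions
mydata$Volume <- mydata$Length * mydata$Diam * mydata$Height

# Ratio is shucked weight per unit of volume
mydata$Ratio <- mydata$Shuck / mydata$Volume

print(round(mydata[, c("Volume", "Ratio")], 3))
```

Because both columns are element-wise combinations of existing columns, no loop is needed; R vectorizes the arithmetic over all rows.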

  1. SEX = M (male), F (female), I (infant)
  2. LENGTH = Longest shell length in cm
  3. DIAM = Diameter perpendicular to length in cm
  4. HEIGHT = Height perpendicular to length and diameter in cm
  5. WHOLE = Whole weight of abalone in grams
  6. SHUCK = Shucked weight of meat in grams
  7. RINGS = Age (+1.5 gives the age in years)
  8. CLASS = Age classification based on RINGS (A1 = youngest, ..., A5 = oldest)

a.) Summary

  • Using summary() to obtain descriptive statistics of mydata.
Summary
Variable       Min.   1st Qu.      Mean    Median   3rd Qu.      Max.
Diam          1.995     7.350     8.622     8.925    10.185    13.230
Height        0.525     2.415     2.947     2.940     3.570     4.935
Length         2.73      9.45     11.08     11.45     13.02     16.80
Ratio       0.06734   0.12241   0.14205   0.13914   0.15911   0.31176
Rings         3.000     8.000     9.993     9.000    11.000    25.000
Shuck        0.5625   23.3006   45.4396   42.5700   64.2897  157.0800
Volume        3.612   163.545   326.804   307.363   463.264   995.673
Whole         1.625    56.484   105.832   101.344   150.319   315.750
  • Using table() to present a frequency table using variables CLASS and RINGS.
Frequency Table: Class ~ Rings
Class   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25   Sum
A1      9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   108
A2      0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   236
A3      0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   329
A4      0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0   0   0   0   0   0   188
A5      0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6   4   1   7   2   1   175
Sum     9   8  24  67  91 145 182 147 125  63  48  35  27  15  13   8   8   6   4   1   7   2   1  1036

Question:

Briefly discuss variable types and distribution implications such as potential skewness and outliers.

Answer:

Our base dataset contains a total of 8 variables: 5 continuous variables (Length, Diameter, Height, Whole and Shuck), 1 discrete variable (Rings), and 2 categorical variables (Sex and Class).

SEX is a regular nominal variable; ordering does not apply to it in general. CLASS is a bit more interesting, in that it is an ordered (ordinal) feature assigned to an abalone based on the RINGS feature.

There are 2 additional variables I explicitly added to the data for analysis, VOLUME and RATIO. Both variables are continuous as they are derived from multiple underlying continuous variables.

VOLUME is calculated from Length, Diameter and Height for use in later analysis.

RATIO is a continuous measure of the SHUCK weight (the meat without the shell, in grams) relative to the VOLUME of the shell.

All continuous variables have well-behaved distributions. HEIGHT is the closest to normal, with a shape that is approximately symmetric about the mean.

WHOLE and SHUCK are significantly skewed to the right, to the point that they could almost fit an exponential distribution given their sharply descending shape.

SHUCK displays this characteristic most prominently. DIAMETER and LENGTH appear to be approximately normal overall; however, their distributions do have some considerable outliers, LENGTH in particular.

SEX is approximately uniformly distributed, with each SEX representing roughly a third of the dataset.

CLASS is relatively normal, with A3 being the dominant Age CLASS.

RINGS is the only discrete measure, with a relatively normal shape that is right-skewed.

The SHUCK and LENGTH variables have a high probability of outliers, given that their maximum values are more than 3 times their respective IQRs (40.99 and 3.57).
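The IQR figures quoted here can be checked directly against the summary table. The sketch below types the quartiles and maxima in by hand, so it is only as accurate as the rounded values reported there.

```r
# Quartiles and maxima copied from the summary table above
q1 <- c(Shuck = 23.3006, Length = 9.45)
q3 <- c(Shuck = 64.2897, Length = 13.02)
mx <- c(Shuck = 157.0800, Length = 16.80)

# Interquartile range per variable
iqr <- q3 - q1
print(round(iqr, 2))

# How many IQRs each maximum spans; values above 3 support the claim
print(round(mx / iqr, 2))
```

With the raw data available, IQR(mydata$Shuck) and IQR(mydata$Length) would give the same quantities without retyping.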

The distribution of RINGS by Age CLASS is curious, as the majority (64%) of abalone have between 8 and 12 rings irrespective of their Age CLASS. A disproportionate share of abalone also falls in the A3 class, with 32% where we might expect around 25%. This could be due to harvesting limiting older populations, a side effect of this sample, or simply a characteristic of the true population.

b.) Sex ~ Class

  • Generating a table of counts using variables SEX and CLASS.
  • Adding margins to our table (there should be 15 cells in this table plus the marginal totals).
  • Applying table() first, then passing the table object to addmargins()
  • Lastly, presenting a bar plot of these data; ignoring the marginal totals.
Sex ~ Class
Sex      A1   A2   A3   A4   A5   Sum
Infant   91  133   65   21   19   329
Female    5   41  121   82   77   326
Male     12   62  143   85   79   381
Sum     108  236  329  188  175  1036
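The table-then-margins workflow can be sketched as follows. Since the raw data is not reproduced here, the counts are typed in from the table above; with “mydata” loaded, the table object would instead come from table(mydata$Sex, mydata$Class) (an assumption about the column names).

```r
# Counts typed in from the Sex ~ Class table above
counts <- matrix(
  c(91, 133,  65, 21, 19,
     5,  41, 121, 82, 77,
    12,  62, 143, 85, 79),
  nrow = 3, byrow = TRUE,
  dimnames = list(Sex = c("Infant", "Female", "Male"),
                  Class = c("A1", "A2", "A3", "A4", "A5"))
)
tab <- as.table(counts)

# addmargins() appends the row and column totals
with_margins <- addmargins(tab)
print(with_margins)

# Grouped bar plot of the 15 cells only (no marginal totals)
barplot(tab, beside = TRUE, legend.text = TRUE,
        xlab = "Class", ylab = "Count", main = "Sex ~ Class")
```

Plotting the original table rather than the margin-augmented one keeps the totals out of the bar plot.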

Question:

Discuss the SEX distribution of abalone. What stands out about the distribution of abalone by CLASS?

Answer:

In the breakdown of Age CLASS by SEX, I noticed an expected rise in the counts of Males and Females, with a corresponding drop in Infants, up through the A3 CLASS. After A3 there is an across-the-board drop in abalone in both A4 and A5 (although the change between A4 and A5 is minuscule).

The curious part of the above data is the persistence of Infants throughout all 5 Age Classes. Intuitively, one would think they would drop off after A3, turning either Male or Female. This could also be a function of how SEX is classified in abalone. As previously mentioned, the universal drop in A4 and A5 CLASS could be due to harvesting or the sampling technique.

c.) Sample Statistics

  • Selecting a simple random sample of 200 observations from “mydata” and identifying this sample as “work.”
  • Using set.seed(123) prior to drawing this sample. Do not change the number 123. The seed makes the sample reproducible, but the number itself has no special meaning.
  • Note that sample() “takes a sample of the specified size from the elements of x.” We cannot sample directly from “mydata.” Instead, we need to sample from the integers, 1 to 1036, representing the rows of “mydata.” Then, selecting those rows from our dataframe.
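The sampling steps above can be sketched as below. The data frame here is simulated (hypothetical values, chosen only so the example runs on its own); only the column order mirrors “mydata.”

```r
# Stand-in for mydata: 1036 rows of simulated values (hypothetical data)
set.seed(1)
mydata <- data.frame(
  Sex    = sample(c("I", "F", "M"), 1036, replace = TRUE),
  Length = runif(1036, 2.7, 16.8),
  Diam   = runif(1036, 2.0, 13.2),
  Height = runif(1036, 0.5, 4.9),
  Whole  = runif(1036, 1.6, 315.8),
  Shuck  = runif(1036, 0.6, 157.1)
)

# Fix the seed, draw 200 row indices without replacement, then subset
set.seed(123)
rows <- sample(seq_len(nrow(mydata)), size = 200)
work <- mydata[rows, ]

# Scatterplot matrix of the five continuous variables (columns 2-6)
plot(work[, 2:6])
```

Sampling row indices rather than the data frame itself is what makes sample() usable here, since it operates on a vector.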

Using “work”, constructing a scatterplot matrix of variables 2-6 with plot(work[, 2:6]). These are the continuous variables excluding VOLUME and RATIO. The sample object “work” will not be used in the remainder of the analysis.

To avoid selection bias, I’m using random sampling. The choices of sampling are (a) Simple Random Sampling or (b) Stratified Random Sampling.

Selection bias occurs during sampling of the population, when a selected sample does not represent the characteristics of the population. Class imbalance (i.e., imbalanced data) can also have a significant impact on model predictions and performance.


2.) Growth Patterns

a.) Whole ~ Volume

Using “mydata” to plot WHOLE vs. VOLUME. Color coding data points by CLASS.

b.) Shuck ~ Whole

  • Using “mydata” to plot SHUCK vs. WHOLE with WHOLE on the x-axis.
  • As an aid to interpretation, determining the maximum value of the ratio of SHUCK to WHOLE.
  • Adding to the chart a straight line with zero intercept using this maximum value as the slope of the line.
  • If you are using the ‘base R’ plot() function, you may use abline() to add this line to the plot. Use help(abline) in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, geom_abline() should be used.
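A minimal base R sketch of the bounding line described above, using simulated weights (hypothetical data) in place of “mydata”:

```r
# Hypothetical stand-in data: shucked weight is a noisy fraction of whole weight
set.seed(42)
whole <- runif(200, 2, 300)
shuck <- whole * runif(200, 0.30, 0.55)

# Largest observed SHUCK / WHOLE ratio; no point can lie above a line
# through the origin with this slope
max_ratio <- max(shuck / whole)

plot(whole, shuck, xlab = "Whole weight (g)", ylab = "Shucked weight (g)")
abline(a = 0, b = max_ratio, lty = 2)  # a = intercept, b = slope
```

In abline(), the argument a is the intercept and b is the slope, so a = 0 forces the line through the origin.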

Question:

How does the variability in this plot differ from the plot in (2a)?
Compare the 2 displays, keeping in mind that SHUCK is a part of WHOLE. Consider the location of the different Age Classes.

Answer:

There is a strong positive correlation between WHOLE and SHUCK, 0.973, consistent with the behavior seen in the Sample Statistics section (1c). This is intuitive, as WHOLE weight includes SHUCK weight. The overall relationship in both charts is strongly linear.

The variability is noticeably larger between WHOLE weight and VOLUME than between SHUCK weight and WHOLE weight. The data points in SHUCK | WHOLE tend to be more normally distributed above and below the mean and approximately symmetrical, whereas WHOLE weight to VOLUME has a strong positive skew, meaning the distribution would have a long right tail, where SHUCK to WHOLE would have a more symmetric, normal shape, aside from some obvious outliers.

Additionally, I colored the points by Age CLASS, and the most prominent outliers in both plots are concentrated in the A4 and A5 Age Classes. This could be a naturally occurring phenomenon: as abalone age, they develop unique size characteristics based on their genetics.

It’s interesting to note that the maximum weights in A1, WHOLE (109.25) and SHUCK (51.25), confine this Age CLASS to basically the lower quarter of both displays. The rest of the abalone in classes A2-A5 are distributed fairly randomly above these thresholds.

Whole ~ Shuck Correlation: 0.973
Max Weight by Age Class
Class   Whole   Shuck
A1     109.25   51.25
A2     212.56  102.16
A3     297.62  146.01
A4     296.25  140.81
A5     315.75  157.08

3.) Sex Characteristics

a.) Normality

  • Using “mydata” to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by SEX.
  • This can be done using par(mfrow = c(3,3)) and base R or grid.arrange() and ggplot2.
  • The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots.

Question:

Comparing the displays, how do our distributions compare to normality?

Answer:

The RATIO variable has a well-behaved distribution within each SEX. All 3 Sexes have a slight right skew to their distributions, and even though the majority of our data is clustered about the mean in a normal fashion, there are a considerable number of outliers in these data.

The boxplots and Q-Q plots do a good job of pointing out the outliers in the sample. The Q-Q plot for Female is interesting, as it follows basically a straight line until around 1.75, where outliers begin to appear frequently.

Surprisingly, Infants have the most deviated distribution, with outliers in both the positive and negative directions. This could be attributed to the older, A4 and A5 Age CLASS Infants, as one would expect less deviation in younger specimens.

(Anomaly Detection Methods)

  • Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. The simplest approach to identify irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles.

  • Density-based anomaly detection works on the assumption that normal data points occur in a dense neighborhood and abnormalities are far away. The nearest set of data points is evaluated using a score, which could be Euclidean distance or something else.

  • Another technique to detect anomalies is the Z-score, which is a parametric outlier detection method. This technique assumes a Gaussian distribution of the data. The outliers are the data points that are in the tails of the distribution and therefore far from the mean.
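As an illustration of the Z-score approach, the sketch below flags values more than 3 standard deviations from the mean. The data and the cutoff of 3 are assumptions chosen for the example, not values taken from the abalone analysis.

```r
# Toy data: 30 typical values plus one value far from the rest
x <- c(rep(c(9, 10, 11), times = 10), 50)

# Z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# Flag points more than 3 standard deviations from the mean
flagged <- which(abs(z) > 3)
print(x[flagged])
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the Z-score can miss outliers in small samples.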

Shapiro-Wilk Test for Normality, 0.05 level
Sex           W      p  Result
Infant  0.96962  0e+00  Reject Null
Female  0.96028  0e+00  Reject Null
Male    0.98031  5e-05  Reject Null

The Shapiro-Wilk test on the RATIO variable by SEX shows that, for every SEX, we should reject the null hypothesis that these data came from a normal distribution; it is unlikely that these data come from a truly normal distribution.
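A per-group test table of this kind could be produced roughly as follows. The data here is simulated (hypothetical, right-skewed ratios), so the W and p values will not match the table above; only the workflow is the point.

```r
# Hypothetical stand-in data: right-skewed ratios for three groups
set.seed(7)
toy <- data.frame(
  Sex   = rep(c("Infant", "Female", "Male"), each = 100),
  Ratio = rlnorm(300, meanlog = log(0.14), sdlog = 0.25)
)

# Run the test within each Sex and collect W and the p-value
results <- do.call(rbind, lapply(split(toy$Ratio, toy$Sex), function(r) {
  test <- shapiro.test(r)
  data.frame(W = unname(test$statistic), p = test$p.value)
}))

# Decision at the 0.05 level
results$Result <- ifelse(results$p < 0.05, "Reject Null", "Fail to Reject")
print(results)
```

split() followed by lapply() keeps the grouping explicit; the row names of the result are the group labels.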

b.) Outliers

  • Using the boxplots to visually identify RATIO outliers (mild and extreme) for each SEX.
  • Presenting the data with these outlying RATIO values along with their associated variables in “mydata.”

(To decide whether an observation is an outlier, the following approaches are available.)

  • Boxplot/whiskers plot: Any value above the upper limit or below the lower limit of the plot is flagged as an outlier. Only data lying within the lower and upper limits are treated as typical and carried into further analysis.

  • Std deviation: Finding the points which lie more than 3 times the std deviation from the mean of the data. The so-called “three-sigma rule of thumb” is a conventional heuristic that nearly all values lie within 3 std deviations of the mean.

  • Clustering: Using K-means or DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to detect outliers.
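For the boxplot approach, a small sketch of separating mild from extreme outliers is shown below. It assumes the common convention that mild outliers lie beyond 1.5 × IQR from the hinges and extreme outliers beyond 3 × IQR; the text does not state which cutoffs were used, and the data is a toy vector.

```r
# Toy data: twenty ordinary values plus one mild and one extreme outlier
x <- c(1:20, 40, 80)

# boxplot.stats() returns the points beyond coef * IQR from the hinges
flagged <- boxplot.stats(x, coef = 1.5)$out  # mild or extreme
extreme <- boxplot.stats(x, coef = 3.0)$out  # extreme only
mild    <- setdiff(flagged, extreme)         # mild only

print(mild)
print(extreme)
```

Applied per SEX, the same call could be wrapped in lapply(split(mydata$Ratio, mydata$Sex), ...), assuming those column names.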

Abalones with Mild Outlier Ratio Values
Sex Length Diam Height Whole Shuck Rings Class Volume Ratio
Infant 10.08 7.35 2.21 79.38 44.00 6 A1 163.36 0.27
Infant 4.30 3.25 0.94 6.19 2.94 3 A1 13.24 0.22
Infant 2.84 2.73 0.84 3.62 1.56 4 A1 6.50 0.24
Infant 6.72 4.30 1.68 22.62 11.00 5 A1 48.60 0.23
Infant 5.04 3.67 0.94 9.66 3.94 5 A1 17.50 0.22
Infant 3.36 2.31 0.52 2.44 0.94 4 A1 4.07 0.23
Infant 6.93 4.72 1.57 23.38 11.81 7 A2 51.57 0.23
Infant 9.13 6.30 2.52 74.56 32.38 8 A2 145.03 0.22
Female 7.98 6.72 2.42 80.94 40.38 7 A2 129.51 0.31
Female 11.55 7.98 3.46 150.62 68.55 10 A3 319.37 0.21
Female 11.45 8.09 3.15 139.81 68.49 9 A3 291.48 0.23
Female 12.18 9.45 4.93 133.88 38.25 14 A5 568.02 0.07
Male 13.44 10.81 1.68 130.25 63.73 10 A3 244.19 0.26
Male 10.50 7.77 3.15 132.69 61.13 9 A3 256.99 0.24
Male 10.71 8.61 3.25 160.31 70.41 9 A3 300.15 0.23
Male 12.29 9.87 3.46 176.12 99.00 10 A3 420.14 0.24
Male 11.55 8.82 3.36 167.56 78.27 10 A3 342.29 0.23
Male 11.45 8.61 2.52 99.12 53.71 9 A3 248.32 0.22
Abalones with Extreme Outlier Ratio Values
Sex Length Diam Height Whole Shuck Rings Class Volume Ratio
Infant 10.08 7.35 2.21 79.38 44.00 6 A1 163.36 0.27
Female 7.98 6.72 2.42 80.94 40.38 7 A2 129.51 0.31

Question:

What are our observations regarding the results in (3)(b)?

Answer:

The RATIO boxplot by SEX highlights the dominance of outliers in Infant abalone. Almost half of the total mild outliers (8 of 18) belong to the Infants, almost as many as Male and Female combined. Interestingly, the detailed display here is helpful in that only one of the outliers (a Female in A5) is in the A4 or A5 Age CLASS, which is surprising.

The Infant outliers are 75% A1, which could stem from the variability among different species of abalone Infants. Notably, both of the extreme outliers are in A1/A2 as well, highlighting the variability among younger abalone.

In the Male and Female samples I notice that the mild outliers are concentrated in the A3 Age CLASS.


4.) Growth Prediction

a.) Size ~ Rings

With “mydata,”

  • Displaying side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There are 5 boxes for VOLUME and 5 for WHOLE.
  • Also, displaying side-by-side scatterplots: VOLUME and WHOLE vs. RINGS.

Question:

How well do we think these variables would perform as predictors of Age?

Answer:

If I break down the distribution of VOLUME and WHOLE weight by Age CLASS, I see a lot of clustering around the mean; however, the data has an abundance of outliers in every regard. The A1 Age CLASS in particular is wildly distributed, having a narrow IQR and several noticeable outliers. The overall variability seems to stabilize somewhat in A3, continuing linearly into A4 where the mean peaks, followed by a downward trend into A5, although the existence of so many outliers makes it difficult to see any clear patterns.

Color coding the scatterplots in 4a by Age CLASS helps us to see the clustering of VOLUME/WHOLE to RINGS by Age CLASS. The variability in both VOLUME and WHOLE is evident, as the only clear pattern one can discern is the relationship between Age CLASS and RINGS, which is not particularly helpful given that Age CLASS is a function of RINGS.

The overall relationship of both VOLUME and WHOLE weight to Age CLASS is not strong, and I think they would both perform poorly as predictors of AGE given the information provided, due to the abundance of outliers and the non-linear relationship to AGE.


5.) Age Characteristics

a.) Tabular

  • Using aggregate() with “mydata” to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS.
  • Then, using matrix() to create matrices of the mean values.
  • Using the “dimnames” argument within matrix() or the rownames() and colnames() functions on the matrices, labeling the rows by SEX and columns by CLASS.
  • Presenting the 3 matrices. The kable() function is useful for this purpose.
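The aggregate-then-matrix steps above can be sketched as below for one variable. The data is a hypothetical grid with placeholder values, and the explicit sort is a defensive assumption so that the column-wise fill lines up with the labels.

```r
# Hypothetical stand-in data: two observations per Sex x Class combination
toy <- expand.grid(
  Sex   = factor(c("I", "F", "M"), levels = c("I", "F", "M")),
  Class = factor(paste0("A", 1:5)),
  obs   = 1:2
)
toy$Volume <- seq_len(nrow(toy))  # placeholder measurements 1..30

# Mean Volume for each Sex/Class pair, sorted so Sex varies fastest
agg <- aggregate(Volume ~ Sex + Class, data = toy, FUN = mean)
agg <- agg[order(agg$Class, agg$Sex), ]

# Column-wise fill: rows are Sex, columns are Class
vol_mat <- matrix(agg$Volume, nrow = 3,
                  dimnames = list(levels(toy$Sex), levels(toy$Class)))
print(vol_mat)
```

The same pattern would be repeated for SHUCK and RATIO; kable() can then be applied to each matrix for presentation.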

Note, I do not need to be concerned with the number of digits presented.

Volume
Sex          A1       A2       A3       A4       A5
Infant   66.516  160.320  270.741  316.413  318.693
Female  255.299  276.857  412.608  498.049  486.153
Male    103.723  245.386  358.118  442.616  440.207
Shuck
Sex          A1       A2       A3       A4       A5
Infant   10.113   23.410   37.180   39.854   36.470
Female   38.900   42.503   59.691   69.052   59.171
Male     16.396   38.339   52.969   61.427   55.028
Ratio
Sex          A1       A2       A3       A4       A5
Infant    0.157    0.148    0.137    0.124    0.117
Female    0.155    0.155    0.145    0.138    0.123
Male      0.151    0.156    0.146    0.136    0.126

b.) Graphical

Presenting 3 graphs. Each graph includes 3 lines, one for each SEX.

  • The first, shows mean RATIO vs. CLASS
  • The second, shows mean VOLUME vs. CLASS
  • The third, shows mean SHUCK vs. CLASS

This may be done with the ‘base R’ interaction.plot() function or with ggplot2 using grid.arrange().
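A sketch of the first graph using base R is shown below. The means are typed in from the Ratio table in (5a) rather than recomputed, so the plot is only as precise as those rounded values.

```r
# Mean RATIO by Sex and Class, typed in from the Ratio table in (5a)
means <- data.frame(
  Sex   = factor(rep(c("Infant", "Female", "Male"), each = 5)),
  Class = factor(rep(paste0("A", 1:5), times = 3)),
  Ratio = c(0.157, 0.148, 0.137, 0.124, 0.117,
            0.155, 0.155, 0.145, 0.138, 0.123,
            0.151, 0.156, 0.146, 0.136, 0.126)
)

# One line per Sex; with a single value per cell, fun = mean returns it as-is
interaction.plot(x.factor = means$Class, trace.factor = means$Sex,
                 response = means$Ratio, fun = mean,
                 xlab = "Class", ylab = "Mean Ratio", trace.label = "Sex")

# Change in mean RATIO from A1 to A2 for each Sex
delta <- with(means, tapply(Ratio, Sex, function(r) r[2] - r[1]))
print(round(delta, 3))
```

The delta vector gives a quick numeric check on how each line moves between the first two Age Classes.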

Question:

What questions do these plots raise, considering aging and SEX differences?

Answer:

From the first plot, RATIO by Age CLASS, I find it interesting that between the A1 and A2 Age Classes there is an increase for Males and essentially no change for Females, but a sharp decline for Infants. Perhaps this has something to do with reproduction in young abalone?

The strong universal descent in RATIO after A2 is otherwise basically identical across all Sexes, once I account for the offset from Males and Females in A1-A2.

Looking at the plot of VOLUME by Age CLASS, it is interesting that Females have the smallest increase over the interval from A1 to A2, especially relative to Males and Infants, yet have the largest mean volume in every Age CLASS.

Looking at the plot of SHUCK by Age CLASS, it is interesting to note the strong linear relationship between SHUCK and VOLUME by SEX, at least until A5, when there is an especially steep decline in SHUCK for both Males and Females. The two charts would essentially overlap if I laid them on top of each other for the range A1 to A4. These could possibly be candidates for predictors of each other; this would need further exploration.

Additionally, the fact that SHUCK weight drops from A4 to A5 while VOLUME stays relatively the same suggests that perhaps the abalone themselves begin to shrink in old age, given that they stay inside the same shell throughout their life.

c.) Infant ~ Adult

Presenting 4 boxplots using par(mfrow = c(2, 2)) or grid.arrange().

  • The first line shows VOLUME by RINGS for the Infants and, separately, for the Adults; factor levels “M” and “F,” combined.
  • The second line shows WHOLE by RINGS for the Infants and, separately, for the Adults.
  • Since the data are sparse beyond 15 rings, limiting the displays to less than 16 rings.
  • One way to accomplish this is to generate a new dataset using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE.

If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
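A base R sketch of the four-panel display is given below. The data frame is simulated (hypothetical values with the same column names as “mydata”), so the shapes of the boxes are not meaningful; only the subsetting and layout are illustrated.

```r
# Hypothetical stand-in data with the same column names as mydata
set.seed(11)
n <- 400
mydata <- data.frame(
  Sex   = sample(c("I", "F", "M"), n, replace = TRUE),
  Rings = sample(3:25, n, replace = TRUE)
)
mydata$Volume <- pmax(1, pmin(1100, 40 * mydata$Rings + rnorm(n, sd = 60)))
mydata$Whole  <- mydata$Volume * 0.33

# Keep only abalone with fewer than 16 rings, then split Infant vs. Adult
small   <- subset(mydata, Rings < 16)
infants <- subset(small, Sex == "I")
adults  <- subset(small, Sex %in% c("M", "F"))

# 2 x 2 layout: VOLUME on the first row, WHOLE on the second
par(mfrow = c(2, 2))
boxplot(Volume ~ Rings, data = infants, ylim = c(0, 1100), main = "Infant Volume | Rings")
boxplot(Volume ~ Rings, data = adults,  ylim = c(0, 1100), main = "Adult Volume | Rings")
boxplot(Whole ~ Rings,  data = infants, ylim = c(0, 400),  main = "Infant Whole | Rings")
boxplot(Whole ~ Rings,  data = adults,  ylim = c(0, 400),  main = "Adult Whole | Rings")
par(mfrow = c(1, 1))
```

Using the same ylim within each row makes the Infant and Adult panels directly comparable.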

Question:

What do these displays suggest about abalone growth? Also, comparing the Infant and Adult displays, what differences stand out?

Answer:

These displays suggest that abalone have a similar growth pattern throughout their life. Both classifications of abalone demonstrate a clear linear growth trend until they reach an inflection point, at around 10-12 rings, after which they start to decline in size by both VOLUME and WHOLE weight. This further suggests that as abalone age they either shrink, or that this is a side effect of harvesting, where larger abalone are taken from the population, leaving only smaller specimens to make up the upper end of the ring range.

The Infants have an interesting characteristic in that at ~12 rings, there is an extremely large cluster of abalone relative to every other ring point. Notably, there are almost no Adults with fewer than 6 rings, which suggests that at around 6 rings (about 7.5 years of age, since RINGS + 1.5 gives the age in years) many abalone become either Male or Female. That does not explain the clear persistence of Infant abalone with 8-12 rings, which is about 66% of all Infant abalone.

The clearest difference between Infant and Adult abalone is the variance in size between the 2 groups per ring count. Infant abalone with 7-9 rings have a large variance, while the rest of the ring categories are tightly clustered around the mean. Adult abalone, by contrast, have wild variance at every ring count where there is any significant portion of the sample, i.e., 6 rings or more.

Reference

Infant abalone with between 6 and 12 rings: 65.84%


Conclusions

Responding to each of the following questions:

1.) Study Observations

Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for AGE prediction.

It is not surprising to me that the original study failed to produce an accurate predictor of AGE from physical measurements given the characteristics of the data explored in this analysis.

Initial inspection of the overall summary of the sample provided at the beginning of this study suggests I have a relatively good sample of data. There is not an abundance of outliers in any metric, and the sample is evenly distributed between Age CLASS and SEX. The continuous variables all have well-behaved distribution characteristics, and the random sample from our initial population conformed approximately to the overall dataset.

Abalone growth patterns do not have an obvious relationship to Age CLASS. As I saw in the Growth Patterns section of this study, abalone that weigh over ~109.25 WHOLE or ~51.25 SHUCK are essentially randomly distributed above this area with no apparent commonality.

Taking a deeper look at abalone by their physical Age classifier, RINGS, I noted similar behavior. Abalone physical metrics like VOLUME and WHOLE weight by ring count seem to be stochastically distributed. Abalone with fewer than 6 RINGS tend to weigh less than ~100 grams, while abalone weighing over ~100 grams can have anywhere between 7 and 25 rings with no apparent relationship. This behavior is similar for the total volume of a given abalone.

SEX is another interesting characteristic: while Infant and Adult abalone are distinguishable to some degree, particularly at younger Age classifications, the overall variability in their size continues to increase as abalone age. I can see a relatively clear separation of abalone Infants and Adults when they have under 6 RINGS; however, after that point it becomes difficult to distinguish them based on any physical measurement.

2.) Sample Statistics

(Representative Sample) Some key features need to be kept in mind when selecting a representative sample.

Diversity: A sample must be as diverse as the population it is drawn from. It should be sensitive to all the local differences within that population and should keep those features in mind.

Consistency: We need to make sure that any change we see in our sample data is also reflected in the true population.

Transparency: It is extremely important to decide the appropriate sample size and structure so that the sample is truly representative. These properties of a sample should be discussed to ensure that the results are accurate.

Question:

If we were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might we ask before accepting them as representative of the sampled population or phenomenon?

Answer:

I would ask:

  • What evidence do we have, and what associated tests have been performed, to show that the sample is representative of the overall population?
  • How was this sample derived?
  • What methodologies did the study implement as part of sample collection?
  • What measures were taken to ensure that there is no inherent bias in the collected sample?
  • Has this sample been reproduced successfully outside the initial conditions?
  • What are some potential confounding variables that contribute to these statistics?

3.) Observational Studies

Question:

What do we see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?

Answer:

From observational studies it is difficult to know how well the sample truly represents the larger population. This kind of study can raise many questions that the data at hand cannot adequately answer. I can run several sets of summary statistics on the sample; however, I’m unable to know if the sample was collected in a way that biases the data, thereby invalidating our observational study. (Statistical bias is the systematic difference between a model’s hypothesis and the true distribution. Bias measures how much the predictions deviate from the true value we are trying to predict.)

For example, the entire dataset could have been collected under the influence of a confounding factor, thereby tilting my conclusions. There is also a factor of random error in every study, and it is difficult, if not impossible, to account for such occurrences; I can only execute the study to the mathematical rigors of our discipline and hope that validation and peer review will catch such errors.

(The importance of the bias-variance trade-off while modeling) Bias and variance are both components of model prediction error. A model with high bias pays very little attention to the training data and oversimplifies the problem, leading to underfitting. A model with high variance pays a lot of attention to the training data and does not generalize well to unseen data, which leads to overfitting. Gaining proper insight into these errors helps us not only build accurate models, but also avoid the mistakes of overfitting and underfitting.

Underfitting/Bias: Bias error is the difference between the expected/average prediction of the model and the true value. The model building/prediction process is repeated more than once with new variations of the data. Hence, due to the randomness in the underlying dataset, we will have a set of predictions for each point. Bias measures how much the predictions deviate from the true value we are trying to predict.

Overfitting/Variance: Variance error is defined as the variability of model prediction for a given data point. The model prediction is repeated for various datasets. It’s an indicator of a model’s sensitivity to small variations that can exist when feeding it a new subset of the training data. For instance, if a model has high variance, then small changes in the training data can result in large prediction changes.
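The two error components described above are commonly tied together by the standard decomposition of expected squared error. The notation below is an assumption introduced for illustration (none of these symbols appear in the analysis): the data are taken to follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ denotes the fitted model, with the expectation taken over training sets.

```latex
% Assumed setup (not from the original text): y = f(x) + \varepsilon,
% with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2; \hat{f} is the fitted model.
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Reducing one of the first two terms typically increases the other, which is why the two are described as a trade-off.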

There is no analytical way to locate the point at which the bias-variance trade-off is optimal. To find it, it’s essential to explore the complexity of the model and measure the prediction error in order to minimize the overall error.