2022-11-11

Introduction

In this presentation, we will make use of the abalone data set generously provided by the UCI Machine Learning Repository. This data set contains 9 attributes and 4176 observations regarding abalones (a type of marine snail) that we will use to explore statistical principles in RMarkdown, such as data visualization and hypothesis testing.

Citation: Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., and Ford, W. B. (1994). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets/Abalone]. Irvine, CA: University of California, School of Information and Computer Science.

Loading the Data Set

By saving the abalone data set in the same folder as this source code, we can use the read.csv function to load it with a relative file path.

abaloneData = read.csv(file = 'abalone.data')

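Unfortunately, the data frame we just loaded has unhelpful attribute names: abalone.data has no header row, so read.csv, which expects one by default, turned the first record into column names. Luckily, it is easy enough to change them.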

names(abaloneData) = c('Sex', 'Length', 'Diameter', 'Height', 'Weight', 'mWeight', 'vWeight', 'sWeight', 'Rings')
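As an aside, the raw file has no header row, so the default read.csv call treats its first record as column names rather than data. A minimal sketch of one alternative, assuming the same comma-separated layout, passes header = FALSE and supplies the names through col.names. The temporary file and its two records (taken from the data table shown later in this presentation) are stand-ins for abalone.data, and the object names are hypothetical. The output shown in this presentation comes from the default call, so results under this variant would differ slightly.

```r
# Stand-in for abalone.data: two records in the same comma-separated layout,
# with no header row (an assumption about the raw file's format).
tmp = tempfile(fileext = '.data')
writeLines(c('M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7',
             'F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9'), tmp)

cols = c('Sex', 'Length', 'Diameter', 'Height', 'Weight',
         'mWeight', 'vWeight', 'sWeight', 'Rings')

# Default call: header = TRUE, so the first record becomes the column names.
withDefault = read.csv(file = tmp)

# Explicit call: every line is kept as data and named in one step.
withNames = read.csv(file = tmp, header = FALSE, col.names = cols)

nrow(withDefault)  # one record was consumed as the header
nrow(withNames)    # both records are kept
```

With this variant the separate names() assignment is no longer needed.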

We can estimate the age of an abalone (in years) by adding 1.5 to the number of rings on its shell.

library(dplyr)  # provides mutate() and filter()

abalone = mutate(abaloneData, Age = Rings + 1.5)

Data Table

##    Sex Length Diameter Height Weight mWeight vWeight sWeight Rings  Age
## 1    M  0.350    0.265  0.090 0.2255  0.0995  0.0485   0.070     7  8.5
## 2    F  0.530    0.420  0.135 0.6770  0.2565  0.1415   0.210     9 10.5
## 3    M  0.440    0.365  0.125 0.5160  0.2155  0.1140   0.155    10 11.5
## 4    I  0.330    0.255  0.080 0.2050  0.0895  0.0395   0.055     7  8.5
## 5    I  0.425    0.300  0.095 0.3515  0.1410  0.0775   0.120     8  9.5
## 6    F  0.530    0.415  0.150 0.7775  0.2370  0.1415   0.330    20 21.5
## 7    F  0.545    0.425  0.125 0.7680  0.2940  0.1495   0.260    16 17.5
## 8    M  0.475    0.370  0.125 0.5095  0.2165  0.1125   0.165     9 10.5
## 9    F  0.550    0.440  0.150 0.8945  0.3145  0.1510   0.320    19 20.5
## 10   F  0.525    0.380  0.140 0.6065  0.1940  0.1475   0.210    14 15.5
## 11   M  0.430    0.350  0.110 0.4060  0.1675  0.0810   0.135    10 11.5
## 12   M  0.490    0.380  0.135 0.5415  0.2175  0.0950   0.190    11 12.5
## 13   F  0.535    0.405  0.145 0.6845  0.2725  0.1710   0.205    10 11.5
## 14   F  0.470    0.355  0.100 0.4755  0.1675  0.0805   0.185    10 11.5
## 15   M  0.500    0.400  0.130 0.6645  0.2580  0.1330   0.240    12 13.5

ggplot2 Histogram and Bar Chart

The ggplot2 package provides many ways to visualize data. We will begin with two simpler plots: a histogram and a bar chart. The code below creates a histogram to visualize the age distribution and a bar chart to visualize the sex distribution (infants, coded I, are treated as a separate category).

library(ggplot2)

ggplot(data = abalone, aes(x = Age)) + geom_histogram(color = 'purple', fill = 'white')

ggplot(data = abalone, aes(x = Sex, fill = Sex)) + geom_bar()
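Because Age is the ring count plus 1.5, every age falls on a half year, and the default of 30 bins does not line up with those values. A minimal sketch of one alternative sets a one-year bin width. The small data frame is a hypothetical stand-in for abalone, built from the Age column of the data table above.

```r
library(ggplot2)

# Hypothetical stand-in for `abalone`: ages from the first rows of the data table.
toy = data.frame(Age = c(8.5, 10.5, 11.5, 8.5, 9.5, 21.5, 17.5, 10.5))

# binwidth = 1 gives one bar per year; boundary = 0 places bin edges on whole
# numbers, so each half-year age sits in the middle of its bar.
p = ggplot(data = toy, aes(x = Age)) +
  geom_histogram(binwidth = 1, boundary = 0, color = 'purple', fill = 'white')
p
```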

ggplot2 Scatter Plot

Typically, a 2D scatter plot depicts two different variables, one along the x-axis and another along the y-axis. However, by tying the size, shape, and color of dots to other variables, ggplot2 allows us to depict as many as five variables on a single scatter plot, complete with a trend line.

Because the full data set is large enough to overcrowd the plot, we will reduce it somewhat by omitting infant abalones.

noInfants = filter(abalone, Sex != 'I')

g = ggplot(data = noInfants, aes(x = Age, y = Diameter)) + geom_point(aes(color = Weight, shape = Sex, size = Height))

l = g + geom_smooth(method = 'lm', se = FALSE) + theme_bw()

Scatter Plot of Five Variables

Plotly Scatter Plot

Another package for data visualization is Plotly. We will use it here to create a 3D scatter plot whose three axes show meat weight, viscera weight, and shell weight, respectively.

The size of a point indicates the overall weight of the abalone, and the color of a point indicates the abalone’s age.

library(plotly)

plot_ly(x = abalone$mWeight, y = abalone$vWeight, z = abalone$sWeight, size = abalone$Weight, color = abalone$Age, type = 'scatter3d', mode = 'markers')
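The same plot can also be written with plotly's formula interface, which makes it easy to add readable axis titles. The sketch below is one possible variant, not the code behind the original figure; the small data frame is a hypothetical stand-in for abalone, built from rows of the data table above, and the axis titles are chosen here for illustration.

```r
library(plotly)

# Hypothetical stand-in for `abalone`: a few rows from the data table.
toy = data.frame(
  mWeight = c(0.0995, 0.2565, 0.2155, 0.0895),
  vWeight = c(0.0485, 0.1415, 0.1140, 0.0395),
  sWeight = c(0.070, 0.210, 0.155, 0.055),
  Weight  = c(0.2255, 0.6770, 0.5160, 0.2050),
  Age     = c(8.5, 10.5, 11.5, 8.5)
)

# The ~ formulas look columns up in `data`, so `toy$` need not be repeated.
p = plot_ly(data = toy, x = ~mWeight, y = ~vWeight, z = ~sWeight,
            size = ~Weight, color = ~Age,
            type = 'scatter3d', mode = 'markers')

# Axis titles for a 3D plot are set under layout(scene = ...).
p = layout(p, scene = list(
  xaxis = list(title = 'Meat weight'),
  yaxis = list(title = 'Viscera weight'),
  zaxis = list(title = 'Shell weight')
))
p
```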

3D Plot

This graphic can be freely rotated and zoomed - try it out!

The Hypothesis

Now that we have sampled some of what we can do with visualizations in RMarkdown, it is time to move on to another crucial aspect of statistics: hypothesis testing.

It is reasonable to assume that older abalones, which have had more time to grow and mature, tend to be heavier. But is that true? We can informally test this by creating a graphic to compare age and weight.
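The code for that graphic is not shown in this presentation. A minimal sketch of how such a plot could be produced with ggplot2 follows. The choice of geoms and the small stand-in data frame (rows from the data table above) are assumptions, not the original code.

```r
library(ggplot2)

# Hypothetical stand-in for `abalone`: Age and Weight from the data table.
toy = data.frame(
  Age    = c(8.5, 10.5, 11.5, 8.5, 9.5, 21.5, 17.5, 10.5),
  Weight = c(0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775, 0.7680, 0.5095)
)

# Points show individual abalones; geom_smooth adds a least-squares trend line.
p = ggplot(data = toy, aes(x = Age, y = Weight)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_bw()
p
```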

The Correlation Coefficient

The plot and trend line on the previous slide suggest that older abalones are heavier. However, we want to test this hypothesis more rigorously. To do so, we will calculate the Pearson correlation coefficient of age and weight.
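For reference, the standard definition of the Pearson correlation coefficient for paired observations is given below. The symbols are introduced here for illustration: \(x_i\) and \(y_i\) stand for the age and weight of the \(i\)-th abalone, \(\bar{x}\) and \(\bar{y}\) for their sample means, and \(n\) for the number of abalones.

```latex
\[r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}\]
```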

The most straightforward way to find the correlation coefficient is to use the cor() function:

cor(abalone$Age, abalone$Weight)

## [1] 0.5408179

Pearson correlation coefficients range from -1 to 1. The greater the absolute value of a coefficient, the stronger the linear correlation. The coefficient of about 0.54 between abalone age and weight indicates a moderate positive linear relationship between the two variables.
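To see what cor() computes, a minimal sketch can rebuild the coefficient by hand as the covariance of the two variables divided by the product of their standard deviations. The vectors below are a hypothetical stand-in for abalone$Age and abalone$Weight, taken from the first rows of the data table, so the resulting value is not the 0.54 reported above.

```r
# Hypothetical stand-in for abalone$Age and abalone$Weight.
age    = c(8.5, 10.5, 11.5, 8.5, 9.5, 21.5, 17.5, 10.5)
weight = c(0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775, 0.7680, 0.5095)

# Pearson's r: sample covariance scaled by both sample standard deviations.
rByHand = cov(age, weight) / (sd(age) * sd(weight))

# The built-in function should agree up to floating-point error.
all.equal(rByHand, cor(age, weight))
```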

The P-Value

A p-value tells us how likely it would be to observe a relationship at least as strong as the one in our sample if there were truly no relationship between the two variables (the null hypothesis). By convention, a relationship is called statistically significant if the p-value is 0.05 or less. The smaller the p-value, the stronger the evidence against the null hypothesis.

We can calculate the p-value, along with several other values, using the cor.test function.

cor.test(abalone$Age, abalone$Weight)

See the results on the next slide.

P-Value

## 
##  Pearson's product-moment correlation
## 
## data:  abalone$Age and abalone$Weight
## t = 41.539, df = 4174, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5190001 0.5619314
## sample estimates:
##       cor 
## 0.5408179

The p-value is smaller than 2.2e-16, which is scientific notation for 0.00000000000000022. Suffice it to say, there is very strong evidence of a link between the ages and weights of abalones, supporting our initial hypothesis.
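The figures in that output can be rebuilt from just the correlation and the degrees of freedom, which helps show where they come from. The sketch below is a minimal reconstruction using the standard formulas for a Pearson correlation test (a t statistic for the p-value and a Fisher z-transformation for the confidence interval); the variable names are chosen here for illustration. It also suggests why R prints the p-value as an upper bound: values below the machine epsilon, .Machine$double.eps, are reported as < 2.2e-16.

```r
r  = 0.5408179   # sample correlation from the output above
df = 4174        # degrees of freedom from the output above
n  = df + 2      # number of abalones used in the test

# Test statistic: r scaled by its standard error under the null hypothesis.
tStat = r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
pValue = 2 * pt(-abs(tStat), df)

# 95% confidence interval via the Fisher z-transformation.
z  = atanh(r)
se = 1 / sqrt(n - 3)
ci = tanh(z + c(-1, 1) * qnorm(0.975) * se)

tStat                           # compare with t = 41.539
pValue < .Machine$double.eps    # TRUE, hence "p-value < 2.2e-16"
ci                              # compare with the interval reported above
```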

LaTeX Math Text

LaTeX syntax is excellent for depicting mathematical formulas in RMarkdown, since it allows us to create a wide array of neat-looking symbols and formats beyond the reach of a standard keyboard.

For instance, we can create a vertical fraction in two different ways:

\[{9 \over 10},\frac{9}{10}\]

We can easily create superscripts, subscripts, and both combined:

\[x^{2},{y}_3,z_{4}^5\]

LaTeX Math Text Continued

Below is a selection of common mathematical symbols that can easily be created in LaTeX:

\[\in, \subset, \subseteq, \cup, \cap, \int, \sum, \infty\]

We have observed how LaTeX behaves when given its own line. However, it can also be used within a line containing non-LaTeX text:

The power \(12^{2}\) is equal to the fraction \({288 \over 2}\). They are both equal to \(144\).