Phil Gonzalez D590 - Applied Data Science Homework 1

The “Abalone” dataset was created by Warwick Nash, Tracy Sellers, Simon Talbot, Andrew Cawthorn, and Wes Ford. It was released on November 30, 1995. This dataset is located on the UC Irvine Machine Learning Repository website and can be found at https://archive.ics.uci.edu/dataset/1/abalone.

Import data and display the first 4-5 rows

# Import data and display the first 4-5 rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
unzip("/Users/gonzalezp/Downloads/abalone.zip", files = "abalone.data", exdir = "/Users/gonzalezp/Downloads")
abalone <- read.table("/Users/gonzalezp/Downloads/abalone.data", header = FALSE, sep = ",")
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
sample_n(abalone,5)
##   Sex Length Diameter Height Whole_weight Shucked_weight Viscera_weight
## 1   M  0.615    0.470  0.160       1.0175         0.4730         0.2395
## 2   M  0.330    0.245  0.085       0.1710         0.0655         0.0365
## 3   I  0.475    0.380  0.120       0.4410         0.1785         0.0885
## 4   F  0.665    0.505  0.160       1.2915         0.6310         0.2925
## 5   F  0.515    0.400  0.170       0.7960         0.2580         0.1755
##   Shell_weight Rings
## 1       0.2800    10
## 2       0.0550    11
## 3       0.1505     8
## 4       0.3200    11
## 5       0.2800    16
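
The heading asks for the first rows, whereas sample_n() returns a random draw, so its output changes from run to run. As a minimal sketch of the difference, the block below uses a small stand-in data frame (the name abalone_head is hypothetical) built from the first five values that glimpse() reports for three of the columns further down in this report.

```r
# Stand-in for the full dataset: first five values of three columns,
# copied from the glimpse() output in this report.
abalone_head <- data.frame(
  Sex    = c("M", "M", "F", "M", "I"),
  Length = c(0.455, 0.350, 0.530, 0.440, 0.330),
  Rings  = c(15L, 7L, 9L, 10L, 7L)
)

# head() always returns the leading rows in file order.
first_rows <- head(abalone_head, 3)
print(first_rows)

# A random draw of rows is only reproducible when a seed is set.
set.seed(1)
random_rows <- abalone_head[sample(nrow(abalone_head), 3), ]
print(random_rows)
```

With the real data, head(abalone, 5) would show the leading rows, while the random sample is useful for spotting patterns that the top of the file might hide.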

Data Quality: Check for data size

# Check for data size
dim(abalone)
## [1] 4177    9

Data Quality: Check for data types

# Check for data types
glimpse(abalone)
## Rows: 4,177
## Columns: 9
## $ Sex            <chr> "M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F", …
## $ Length         <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545,…
## $ Diameter       <dbl> 0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425,…
## $ Height         <dbl> 0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125,…
## $ Whole_weight   <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775,…
## $ Shucked_weight <dbl> 0.2245, 0.0995, 0.2565, 0.2155, 0.0895, 0.1410, 0.2370,…
## $ Viscera_weight <dbl> 0.1010, 0.0485, 0.1415, 0.1140, 0.0395, 0.0775, 0.1415,…
## $ Shell_weight   <dbl> 0.150, 0.070, 0.210, 0.155, 0.055, 0.120, 0.330, 0.260,…
## $ Rings          <int> 15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, …

Data Quality: Provide overall descriptive statistics

# Provide overall descriptive statistics
summary(abalone)
##      Sex                Length         Diameter          Height      
##  Length:4177        Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  Class :character   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  Mode  :character   Median :0.545   Median :0.4250   Median :0.1400  
##                     Mean   :0.524   Mean   :0.4079   Mean   :0.1395  
##                     3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##                     Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##   Whole_weight    Shucked_weight   Viscera_weight    Shell_weight   
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340  
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050  
##      Rings       
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median : 9.000  
##  Mean   : 9.934  
##  3rd Qu.:11.000  
##  Max.   :29.000

Data Quality: Assess missing data

# Assess missing data
sum(is.na(abalone))
## [1] 0
table(complete.cases(abalone))
## 
## TRUE 
## 4177
# Count missing values per variable
sort(sapply(abalone, function(x) sum(is.na(x))))
##            Sex         Length       Diameter         Height   Whole_weight 
##              0              0              0              0              0 
## Shucked_weight Viscera_weight   Shell_weight          Rings 
##              0              0              0              0
library(visdat)
vis_dat(abalone)

Three checks are run to assess missing data. First, the total number of missing values and the number of missing values per variable are counted; both counts are zero. Second, the number of complete cases is tabulated; all 4,177 rows are complete, which matches the total number of rows in the dataset. Finally, the vis_dat function provides a visual check that shows the data type of each variable and confirms that no values are missing. Note that these checks only detect values coded as NA. The minimum Height of 0 in the descriptive statistics suggests that some values may be missing in disguise.
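
To show what these checks report when values are actually missing, here is a minimal sketch on a toy data frame. The data frame toy and its values are hypothetical and are not taken from the abalone data; two NA values are injected on purpose.

```r
# Toy data frame with two deliberately injected missing values
# (hypothetical values, not taken from the abalone data).
toy <- data.frame(
  Sex    = c("M", "F", NA, "I"),
  Height = c(0.095, NA, 0.135, 0.080),
  Rings  = c(15L, 7L, 9L, 10L)
)

total_missing  <- sum(is.na(toy))         # total number of NA cells
missing_by_col <- colSums(is.na(toy))     # NA count per variable
n_complete     <- sum(complete.cases(toy)) # rows with no NA at all

print(total_missing)
print(missing_by_col)
print(n_complete)
```

Here colSums(is.na(toy)) gives the same per-variable counts as the sapply() call used above, in a single vectorized step.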

Data Visualization

summary(abalone$Rings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   8.000   9.000   9.934  11.000  29.000

Data Visualization: One-Dimensional Visualization

hist(abalone$Rings, col = "green")
abline(v = mean(abalone$Rings), lwd = 2)
abline(v = median(abalone$Rings), col = "magenta", lwd = 4)

library(moments)
skewness(abalone$Rings)
## [1] 1.113702
kurtosis(abalone$Rings)
## [1] 5.326462

In the data visualization section I have opted to analyze the dependent variable of this dataset: rings. As explained in the dataset notes, the age of an abalone is calculated by adding 1.5 to the number of rings an abalone has. In this way, rings serves as a proxy measure of age. In this section I will analyze the distribution of abalone rings and visualize the relationship between rings and other, independent variables.
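
The conversion from rings to age described above can be sketched in one line. The vector below holds the first five Rings values shown in the glimpse() output; the name age_years is an illustrative choice.

```r
# First five Rings values from the glimpse() output in this report.
rings <- c(15L, 7L, 9L, 10L, 7L)

# Per the dataset notes, age in years is the ring count plus 1.5.
age_years <- rings + 1.5
print(age_years)
```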

The “Rings” variable is positively skewed. This is supported by the summary statistics, which show a mean of 9.934 rings compared with a median of 9 rings. The mean and median lines have been added to the histogram to visualize the positive skew. Additionally, the Fisher-Pearson coefficient of skewness is 1.113702; a value greater than 1 indicates that the distribution of rings is substantially positively skewed. Finally, the kurtosis of the Rings distribution is 5.326462. The kurtosis function in the moments package reports Pearson's kurtosis, for which a normal distribution scores 3, so this result corresponds to an excess kurtosis of about 2.33. A kurtosis greater than 3 indicates a leptokurtic distribution, that is, one with a sharper peak and heavier tails than the normal distribution.
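
As a rough illustration of what the skewness and kurtosis statistics measure, the sketch below computes both from central moments in base R. It assumes the moment-based definitions (dividing by n rather than n - 1), which is my understanding of what the moments package computes; the function names moment_skewness and moment_kurtosis are hypothetical. The sample vector holds the first fifteen Rings values from the glimpse() output, so its results will differ from the full-data values reported above.

```r
# Moment-based skewness: third central moment over the 1.5 power
# of the second central moment (all moments divide by n).
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Pearson's kurtosis: fourth central moment over the squared second
# central moment. A normal distribution scores 3 on this scale.
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# First fifteen Rings values from the glimpse() output in this report.
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10)

print(moment_skewness(rings))      # positive: the right tail is longer
print(moment_kurtosis(rings))
print(moment_kurtosis(rings) - 3)  # excess kurtosis relative to the normal
```

Subtracting 3 from Pearson's kurtosis gives the excess kurtosis, which is the quantity for which "positive" means leptokurtic.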

Data Visualization: Comparison Visualization

boxplot(Rings ~ Sex, data = abalone, col = "blue")

This boxplot compares the distribution of abalone rings by the sex of the abalone. In this comparison “F” indicates “female,” “I” indicates “infant,” and “M” indicates “male.” Because the number of rings is an indicator of age, it stands to reason that the median number of rings is lower for infant abalone than for male and female abalone. The three distributions show similar dispersion in terms of both the overall range and the interquartile range. One difference lies within the interquartile range: the infant group appears to show negative skewness, while the male and female groups appear to show positive skewness.

The male and female distributions are very similar in nearly all respects. The only notable difference between the two groups is in the outliers. The female distribution has outliers farther above the upper whisker, while the male distribution has outliers below the lower whisker.
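
The quantities that a boxplot draws can also be computed directly, which makes group comparisons easier to state precisely. The sketch below is only an illustration on the first eleven Sex and Rings values from the glimpse() output, so the numbers are not representative of the full dataset.

```r
# First eleven Sex and Rings values from the glimpse() output in this report.
sex   <- c("M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F")
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14)

# Median and interquartile range of Rings within each sex group.
group_medians <- tapply(rings, sex, median)
group_iqr     <- tapply(rings, sex, IQR)

# Tukey five-number summary (min, lower hinge, median, upper hinge, max),
# which is the basis of the box and whiskers.
group_fivenum <- tapply(rings, sex, fivenum)

print(group_medians)
print(group_iqr)
print(group_fivenum)
```

On the full data, the same calls with abalone$Rings and abalone$Sex would give the numbers behind each box.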

Data Visualization: Two-Dimensional Visualization (Categorical Data)

library(dplyr)
table(abalone$Sex) %>% barplot(col = "lightgreen")

This barplot is a two-dimensional visualization comparing the number of female, infant, and male abalone in the dataset. It shows that this sample contains more male abalone than infant or female abalone. The previous visualization showed that the male and female distributions of rings are very similar in terms of dispersion and positive skewness, so this barplot helps explain the distribution of rings among all abalone in the sample. Because the combined number of male and female abalone is nearly double the number of infant abalone, their effect on the overall distribution of rings is greater as well. For this reason the overall distribution of rings closely mirrors the distribution of rings in the male and female groups.
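
The weighting argument above is easier to check with proportions than with bar heights. As a minimal sketch, the block below tabulates counts and shares for the first eleven Sex values from the glimpse() output; the shares in this tiny sample will not match those of the full dataset.

```r
# First eleven Sex values from the glimpse() output in this report.
sex <- c("M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F")

counts <- table(sex)          # number of abalone in each group
shares <- prop.table(counts)  # each group's share of the sample

print(counts)
print(round(shares, 3))
```

Applied to abalone$Sex, prop.table() would report the share of the sample that each group contributes to the overall Rings distribution.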

Data Visualization: Two-Dimensional Visualization (Pairwise Relationship)

library(ggplot2)
ggplot(abalone, aes(Length, Rings, alpha = Sex)) +
  geom_point(color = "darkred")
## Warning: Using alpha for a discrete variable is not advised.

cor(abalone$Length, abalone$Rings, method = "pearson")
## [1] 0.5567196

In the abalone dataset, “Length” is the longest shell measurement of each abalone. This scatterplot visualizes the relationship between the Length and Rings variables, and in doing so investigates the strength of the relationship between the length of an abalone and its age (calculated by adding 1.5 to the number of rings). The scatterplot shows a positive, roughly linear relationship between length and rings. However, the relationship does not appear to be particularly strong. This is confirmed by the Pearson correlation coefficient of 0.5567, which indicates a moderate positive correlation.
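
For readers who want to see what the correlation coefficient computes, the sketch below reproduces Pearson's r by hand and compares it with cor(). It uses only the first eight Length and Rings values from the glimpse() output, so the result is illustrative and will differ from the full-data value of 0.5567.

```r
# First eight Length and Rings values from the glimpse() output in this report.
len   <- c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545)
rings <- c(15, 7, 9, 10, 7, 8, 20, 16)

# Deviations of each variable from its own mean.
dx <- len - mean(len)
dy <- rings - mean(rings)

# Pearson's r: sum of cross-products of deviations, scaled by the
# square root of the product of the summed squared deviations.
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(len, rings, method = "pearson")

print(r_manual)
print(r_builtin)
```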

An additional feature of this scatterplot is that the variable “Sex” has been mapped to the alpha aesthetic, so that the transparency of each point indicates the sex of the abalone. This mapping was added to determine whether the strength of the relationship between length and rings differs by sex. Although the transparency mapping does reveal slight variation in distribution between the groups, the scatterplot does not appear to indicate that the relationship between length and rings varies by sex.
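
Since ggplot2 warns against using alpha for a discrete variable, one alternative is to map Sex to color and split the plot into one panel per group. The sketch below is an assumption about how that might look, not the method used in this report; it runs on a stand-in data frame (toy, a hypothetical name) built from the first eight rows visible in the glimpse() output.

```r
library(ggplot2)

# Stand-in data: first eight Sex, Length, and Rings values
# from the glimpse() output in this report.
toy <- data.frame(
  Sex    = c("M", "M", "F", "M", "I", "I", "F", "F"),
  Length = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545),
  Rings  = c(15, 7, 9, 10, 7, 8, 20, 16)
)

# Color distinguishes the groups; a fixed alpha only reduces overplotting.
# facet_wrap() draws one panel per sex so the trends can be compared.
p <- ggplot(toy, aes(Length, Rings, color = Sex)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ Sex)

print(p)
```

Replacing toy with abalone would apply the same layout to the full dataset.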

Sources: In completing this homework assignment I referenced the following sources.

1. Peng, R. D. (2020). 6 Exploratory Graphs | Exploratory Data Analysis with R. In bookdown.org. Retrieved from https://bookdown.org/rdpeng/exdata/exploratory-graphs.html

2. Tierney, N. (2024, March 5). Getting Started with naniar. Retrieved from cran.r-project.org website: https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html