Phil Gonzalez D590 - Applied Data Science Homework 1
The “Abalone” dataset was created by Warwick Nash, Tracy Sellers, Simon Talbot, Andrew Cawthorn, and Wes Ford. It was released on November 30, 1995. This dataset is located on the UC Irvine Machine Learning Repository website and can be found at https://archive.ics.uci.edu/dataset/1/abalone.
Import data and display the first 4-5 rows
# Import data and display the first 4-5 rows
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
unzip("/Users/gonzalezp/Downloads/abalone.zip", files = "abalone.data", exdir = "/Users/gonzalezp/Downloads")
abalone <- read.table("/Users/gonzalezp/Downloads/abalone.data", header = FALSE, sep = ",")
colnames(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole_weight", "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings")
sample_n(abalone,5)
##   Sex Length Diameter Height Whole_weight Shucked_weight Viscera_weight
## 1   M  0.615    0.470  0.160       1.0175         0.4730         0.2395
## 2   M  0.330    0.245  0.085       0.1710         0.0655         0.0365
## 3   I  0.475    0.380  0.120       0.4410         0.1785         0.0885
## 4   F  0.665    0.505  0.160       1.2915         0.6310         0.2925
## 5   F  0.515    0.400  0.170       0.7960         0.2580         0.1755
##   Shell_weight Rings
## 1       0.2800    10
## 2       0.0550    11
## 3       0.1505     8
## 4       0.3200    11
## 5       0.2800    16
Data Quality: Check for data size
# Check for data size
dim(abalone)
## [1] 4177 9
Data Quality: Check for data types
# Check for data types
glimpse(abalone)
## Rows: 4,177
## Columns: 9
## $ Sex            <chr> "M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F", …
## $ Length         <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545,…
## $ Diameter       <dbl> 0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425,…
## $ Height         <dbl> 0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125,…
## $ Whole_weight   <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775,…
## $ Shucked_weight <dbl> 0.2245, 0.0995, 0.2565, 0.2155, 0.0895, 0.1410, 0.2370,…
## $ Viscera_weight <dbl> 0.1010, 0.0485, 0.1415, 0.1140, 0.0395, 0.0775, 0.1415,…
## $ Shell_weight   <dbl> 0.150, 0.070, 0.210, 0.155, 0.055, 0.120, 0.330, 0.260,…
## $ Rings          <int> 15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, …
Data Quality: Provide overall descriptive statistics
# Provide overall descriptive statistics
summary(abalone)
##      Sex                Length         Diameter          Height
##  Length:4177        Min.   :0.075   Min.   :0.0550   Min.   :0.0000
##  Class :character   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150
##  Mode  :character   Median :0.545   Median :0.4250   Median :0.1400
##                     Mean   :0.524   Mean   :0.4079   Mean   :0.1395
##                     3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650
##                     Max.   :0.815   Max.   :0.6500   Max.   :1.1300
##   Whole_weight    Shucked_weight   Viscera_weight    Shell_weight
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050
##      Rings
##  Min.   : 1.000
##  1st Qu.: 8.000
##  Median : 9.000
##  Mean   : 9.934
##  3rd Qu.:11.000
##  Max.   :29.000
Data Quality: Assess missing data
# Assess missing data
sum(is.na(abalone))
## [1] 0
table(complete.cases(abalone))
##
## TRUE
## 4177
# Count missing values by variable
sort(sapply(abalone, function(x) sum(is.na(x))))
##            Sex         Length       Diameter         Height   Whole_weight
##              0              0              0              0              0
## Shucked_weight Viscera_weight   Shell_weight          Rings
##              0              0              0              0
library(visdat)
vis_dat(abalone)
Three checks are run for missing data. Summing is.na() over the whole
data frame gives the total number of missing values, and applying the
same count to each column gives the number of missing values by
variable. Both return zero. The complete.cases function reports 4,177
complete cases, which equals the total number of cases in the dataset.
Finally, a visual check is performed with the vis_dat function, which
displays the data type of each variable and confirms that no values are
missing. Together, these checks show that there is no missing data in
this dataset.
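Because every check above returns zero, it can help to see what the same calls report when gaps do exist. The following is only a sketch on a small made-up data frame; the name `toy` and its values are hypothetical and are not taken from the abalone data.

```r
# Hypothetical data frame with two missing values (not abalone data)
toy <- data.frame(
  Sex   = c("M", "F", NA, "I"),
  Rings = c(15L, NA, 9L, 10L)
)

# Total number of missing values across the whole data frame
total_missing <- sum(is.na(toy))

# Number of missing values in each column
missing_by_column <- colSums(is.na(toy))

# Number of rows that contain no missing values
complete_rows <- sum(complete.cases(toy))

print(total_missing)
print(missing_by_column)
print(complete_rows)
```

Here `colSums(is.na(toy))` is used as a compact alternative to the `sapply` call in the report; both count missing values per column.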
Data Visualization
summary(abalone$Rings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   8.000   9.000   9.934  11.000  29.000
Data Visualization: One-Dimensional Visualization
hist(abalone$Rings, col = "green")
abline(v = mean(abalone$Rings), lwd = 2)
abline(v = median(abalone$Rings), col = "magenta", lwd = 4)
library(moments)
skewness(abalone$Rings)
## [1] 1.113702
kurtosis(abalone$Rings)
## [1] 5.326462
In the data visualization section I have opted to analyze the dependent variable of this dataset: rings. As explained in the dataset notes, the age of an abalone is calculated by adding 1.5 to the number of rings an abalone has. In this way, rings serves as a proxy measure of age. In this section I will analyze the distribution of abalone rings and visualize the relationship between rings and other, independent variables.
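The conversion from rings to age described above can be sketched in a couple of lines. The ring counts below are the first five values shown in the glimpse output; the variable names `rings` and `age_years` are chosen here for illustration only.

```r
# First five Rings values as shown in the glimpse() output
rings <- c(15, 7, 9, 10, 7)

# Per the dataset notes, age in years is the ring count plus 1.5
age_years <- rings + 1.5

print(age_years)
```

Applied to the full data, the same idea would be `abalone$Rings + 1.5`, assuming the data frame is loaded as in the import step.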
The “Rings” variable is positively skewed. This positive skew is supported by the summary statistics, which show a mean of 9.934 rings compared to a median of 9 rings. The mean and median lines have been added to the histogram to visualize the positive skewness. Additionally, the Fisher-Pearson coefficient of skewness is 1.113702. A skewness value greater than 1 indicates that the distribution of rings is substantially positively skewed. Finally, the kurtosis of the Rings distribution is 5.326462. The kurtosis function in the moments package reports Pearson's kurtosis, for which a normal distribution has a value of 3, so a value above 3 (here an excess kurtosis of about 2.33) indicates a leptokurtic, or peaked, distribution.
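To make the skewness and kurtosis figures less of a black box, here is a minimal base R sketch of the moment-based definitions. To my understanding these match what the moments package computes, but the helper names `moment_skewness` and `moment_kurtosis` and the sample vector `x` are hypothetical and used only for illustration.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson's kurtosis: fourth central moment divided by the
# squared second central moment (a normal distribution gives 3)
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Hypothetical right-skewed sample (not abalone data)
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

skew <- moment_skewness(x)
kurt <- moment_kurtosis(x)
excess_kurt <- kurt - 3  # excess kurtosis is measured relative to the normal value of 3

print(c(skewness = skew, kurtosis = kurt, excess = excess_kurt))
```

Subtracting 3 is what turns Pearson's kurtosis into excess kurtosis, which is the scale on which "positive" means more peaked than a normal distribution.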
Data Visualization: Comparison Visualization
boxplot(Rings ~ Sex, data = abalone, col = "blue")
This boxplot serves as a comparison of the distribution of abalone rings
based on the sex of the abalone. In this comparison “F” indicates
“female,” “I” indicates “infant,” and “M” indicates “male.” As the
number of rings an abalone has is an indicator of its age, the
difference in median rings between infant abalone and male and female
abalone stands to reason. A comparison of these three distributions
reveals similar dispersions in terms of the length of the overall and
interquartile ranges. One difference is in the interquartile range of
the infant group. The infant group appears to show negative skewness
while the male and female groups appear to show positive skewness.
The male and female distributions are very similar in nearly every respect. The only notable difference between the two groups is in the outliers: the female distribution has outliers that lie farther above the upper whisker, while the male distribution also has outliers below the lower whisker.
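The visual reading of the boxplot can be backed up numerically. The sketch below computes the five-number summary and the points beyond the whiskers for each group, using base R's `fivenum` and `boxplot.stats`. The data frame `toy` and its values are made up for illustration and are not abalone data.

```r
# Hypothetical ring counts by sex (not abalone data)
toy <- data.frame(
  Sex   = rep(c("F", "I", "M"), each = 7),
  Rings = c(8, 9, 10, 11, 12, 13, 25,   # F: one unusually high value
            4, 5, 6, 7, 8, 8, 9,        # I: no extreme values
            2, 9, 10, 10, 11, 12, 13)   # M: one unusually low value
)

# Split the ring counts into one vector per sex
rings_by_sex <- split(toy$Rings, toy$Sex)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
five_num <- lapply(rings_by_sex, fivenum)

# Points beyond the whiskers per group (1.5 times the hinge spread by default)
outliers <- lapply(rings_by_sex, function(x) boxplot.stats(x)$out)

print(five_num)
print(outliers)
```

With the real data, `split(abalone$Rings, abalone$Sex)` would take the place of the toy split, assuming the data frame is loaded as in the import step.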
Data Visualization: Two-Dimensional Visualization (Categorical Data)
library(dplyr)
table(abalone$Sex) %>% barplot(col = "lightgreen")
This barplot is a two-dimensional visualization comparing the number of
female, infant, and male abalone in the dataset. This barplot shows that
in this sample there are more male abalone than either infants or females.
Knowing from our previous visualization that the male and female
distributions of rings were very similar in terms of dispersion and
positive skewness, this barplot helps us understand the distribution of
rings among all abalone in the sample group. Because the combined number
of male and female abalone is roughly double the number of infant
abalone, their effect on the distribution of rings is greater as well.
For this reason the overall distribution of rings closely mirrors the
distribution of rings in the male and female abalone groups.
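The comparison of group sizes can also be expressed as counts, proportions, and a ratio rather than read off the bars. This is a sketch on a short made-up vector; `sex` and its values are hypothetical and are not the abalone data.

```r
# Hypothetical sex labels (not abalone data)
sex <- c("M", "M", "M", "F", "F", "I", "I", "F", "M", "I")

# Count per category and share of the total per category
counts <- table(sex)
shares <- prop.table(counts)

# Combined number of females and males relative to infants
adult_to_infant <- as.numeric(counts["F"] + counts["M"]) / as.numeric(counts["I"])

print(counts)
print(round(shares, 2))
print(adult_to_infant)
```

Running the same three steps on `abalone$Sex` would give the exact ratio behind the barplot, assuming the data frame is loaded as in the import step.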
Data Visualization: Two-Dimensional Visualization (Pairwise Relationship)
library(ggplot2)
ggplot(abalone, aes(Length, Rings, alpha = Sex)) +
  geom_point(color = "darkred")
## Warning: Using alpha for a discrete variable is not advised.
cor(abalone$Length, abalone$Rings, method = "pearson")
## [1] 0.5567196
In the abalone dataset, “Length” is the longest shell measurement of each abalone. This scatterplot visualizes the relationship between the length and rings variables, ultimately investigating the strength of the relationship between the length of an abalone and its age (calculated by adding 1.5 to the number of rings). The scatterplot shows a positive, linear relationship between length and rings. However, the relationship between the two variables does not appear to be particularly strong. This is confirmed by the Pearson correlation coefficient of 0.5567, which indicates a moderate positive correlation.
An additional feature of this scatterplot is that the variable “Sex” has been mapped to the alpha aesthetic so that the transparency of each point indicates the sex of the abalone. This additional mapping was performed to determine whether the strength of the relationship between the length and rings variables differs by the sex of the abalone. Though this transparency mapping does reveal a slight variation in distribution between the groups, the scatterplot does not appear to indicate that the relationship between length and rings varies by sex.
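Whether the relationship varies by sex can also be checked numerically by computing the correlation within each group. The sketch below shows one way to do that in base R; the data frame `toy` and its values are made up for illustration and are not abalone data.

```r
# Hypothetical measurements (not abalone data)
toy <- data.frame(
  Sex    = rep(c("F", "I", "M"), each = 4),
  Length = c(0.50, 0.55, 0.60, 0.65,
             0.30, 0.35, 0.40, 0.45,
             0.45, 0.55, 0.60, 0.70),
  Rings  = c(9, 11, 10, 13,
             5, 6, 8, 7,
             8, 10, 12, 11)
)

# Pearson correlation between Length and Rings within each sex
cor_by_sex <- sapply(
  split(toy, toy$Sex),
  function(d) cor(d$Length, d$Rings, method = "pearson")
)

print(round(cor_by_sex, 3))
```

Replacing `toy` with `abalone` would give one coefficient per sex to compare against the overall value, assuming the data frame is loaded as in the import step.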
Sources: In completing this homework assignment I referenced the following sources.
1. Peng, R. D. (2020). 6 Exploratory Graphs | Exploratory Data Analysis with R. In bookdown.org. Retrieved from https://bookdown.org/rdpeng/exdata/exploratory-graphs.html
2. Tierney, N. (2024, March 5). Getting Started with naniar. Retrieved from cran.r-project.org website: https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html