Executive Summary

The Abalone Data Set was obtained from the publicly available UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Abalone). The data consist of 4177 observations of 9 attributes and were imported directly into the RStudio IDE from a URL. Interestingly, the data are not offered in a conventional file structure; they are served as plain text at the specified URL (see Figure 1). A second URL, containing descriptive information about the data, was also read into the IDE.


Figure 1: Screen shot of the native data used for the analysis.


Summary of Abalone Data Set

| Data Set Characteristics | Number of Instances | Number of Attributes | Attribute Characteristics |
|---|---|---|---|
| Multivariate | 4177 | 9 | Categorical, Integer, Real |

Setup

The following packages were employed during this task.

library(readr) # Useful for importing data
library(knitr) # Useful for creating nice tables
library(data.table) # Provides fread() for fast import
library(tm) # Text mining
## Loading required package: NLP
library(stringr) # String functions

Data Description

These data come from the study performed by Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994):

The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait, Sea Fisheries Division, Technical Report No. 48

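The data set can be summarised as follows. The units are those of the original measurements; in the distributed file the continuous values have been scaled by dividing by 200, which is why every length in the imported data is below 1.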

| Attribute | Data Type | Units | Description |
|---|---|---|---|
| Sex | nominal | N/A | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | N/A | +1.5 gives the age in years |
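
The last row of the table implies a simple conversion from ring count to age, and the scaling of the distributed values (division by 200) can be undone in the same way. The following is a minimal sketch, assuming the values of the first three records of the data set; the variable names are illustrative and not part of the original analysis.

```r
# Ring counts and scaled lengths of the first three records
rings <- c(15L, 7L, 9L)
lengthScaled <- c(0.455, 0.350, 0.530)

# Age in years is the ring count plus 1.5
ageYears <- rings + 1.5
print(ageYears)

# Undo the division by 200 to recover the length in millimetres
lengthMm <- lengthScaled * 200
print(lengthMm)
```

For these three records the ages come out as 16.5, 8.5 and 10.5 years, and the lengths as roughly 91, 70 and 106 mm.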

Read/Import Data

These data were directly imported from the relevant URL into the RStudio IDE.

# Creating a variable for the URL string
abaloneDataURL <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
# Use data.table package
abaloneDS <- fread(abaloneDataURL)
head(abaloneDS, n = 3)
##    V1    V2    V3    V4     V5     V6     V7   V8 V9
## 1:  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.15 15
## 2:  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.07  7
## 3:  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.21  9
tail(abaloneDS, n = 3)
##    V1    V2    V3    V4     V5     V6     V7    V8 V9
## 1:  M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.308  9
## 2:  F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.296 10
## 3:  M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.495 12
# Checking the size
dim(abaloneDS)
## [1] 4177    9
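
The names V1 to V9 in the output above arise because the raw text has no header row. The sketch below illustrates this on three inline records (only three of the nine columns, to keep it short) and shows that `fread()` can also be given names at read time through its `col.names` argument. The inline data and object names are assumptions for illustration; the original analysis assigns the header in a later step instead.

```r
library(data.table)

# Three records in the same comma-separated layout as the raw file, no header row
rawLines <- c("M,0.455,15", "M,0.350,7", "F,0.530,9")

# Without a header row, fread() assigns the default names V1, V2, ...
defaultNames <- fread(text = rawLines, header = FALSE)
print(names(defaultNames))

# When the column meanings are already known, col.names sets them on import
named <- fread(text = rawLines, header = FALSE,
               col.names = c("Sex", "Length", "Rings"))
print(names(named))
```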

Note that the columns carry only the default names V1 to V9, which say nothing about what each attribute measures. To obtain descriptive names, we read the accompanying description text from the URL below.

# Creating a variable for the URL string
abaloneDataDescURL <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names'
# Create a new string variable
abaloneDSDesc <- readLines(abaloneDataDescURL)
# Check the String
head(abaloneDSDesc, n=8)
## [1] "1. Title of Database: Abalone data"                      
## [2] ""                                                        
## [3] "2. Sources:"                                             
## [4] ""                                                        
## [5] "   (a) Original owners of database:"                     
## [6] "\tMarine Resources Division"                             
## [7] "\tMarine Research Laboratories - Taroona"                
## [8] "\tDepartment of Primary Industry and Fisheries, Tasmania"
# Remove every tab character (str_replace() would only remove the first one in each line)
abaloneDSDesc <- str_replace_all(abaloneDSDesc, "\t", "")
# Check the String
head(abaloneDSDesc, n=8)
## [1] "1. Title of Database: Abalone data"                    
## [2] ""                                                      
## [3] "2. Sources:"                                           
## [4] ""                                                      
## [5] "   (a) Original owners of database:"                   
## [6] "Marine Resources Division"                             
## [7] "Marine Research Laboratories - Taroona"                
## [8] "Department of Primary Industry and Fisheries, Tasmania"
# Save a plain-text copy in the working directory for reference
writeLines(abaloneDSDesc, "AbaloneDescription.txt")
# Find the lines that mention "Attribute"
attrLines <- str_detect(abaloneDSDesc, "Attribute")
which(attrLines)
## [1]  78  81 109
# View lines 78 to 109, highlighting the matches
str_view(abaloneDSDesc[78:109], "Attribute")
# Lines 86 to 98 contain the desired attribute information
str_view(abaloneDSDesc[86:98], "Attribute")
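
The search-then-slice pattern used above can also be written in base R with `grep()`, which returns the matching line numbers directly. This is a minimal sketch on a hypothetical five-line vector; the strings are stand-ins and are not copied from the real description file.

```r
# Hypothetical stand-in for a few lines of a description file
descLines <- c("1. Title",
               "",
               "7. Attribute information:",
               "   Name / Data Type / Units",
               "8. Missing Attribute Values: None")

# Indices of the lines containing the literal text "Attribute"
hits <- grep("Attribute", descLines, fixed = TRUE)
print(hits)

# Slice out everything between the first and the last match
block <- descLines[seq(hits[1], hits[length(hits)])]
print(block)
```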


We can now construct a header for the dataset.

# dim() showed 9 attributes, so the header needs 9 names
AbaloneHeader = c("Sex", "Length", "Diameter",
                  "Height", "WholeWeight", "ShuckedWeight",
                  "VisceraWeight", "ShellWeight", "Rings")
colnames(abaloneDS) <- AbaloneHeader
# Checking the first 3 instances (abaloneDS header also displayed)
head(abaloneDS, n = 3)
##    Sex Length Diameter Height WholeWeight ShuckedWeight VisceraWeight
## 1:   M  0.455    0.365  0.095      0.5140        0.2245        0.1010
## 2:   M  0.350    0.265  0.090      0.2255        0.0995        0.0485
## 3:   F  0.530    0.420  0.135      0.6770        0.2565        0.1415
##    ShellWeight Rings
## 1:        0.15    15
## 2:        0.07     7
## 3:        0.21     9
# Checking the Data Types
str(abaloneDS)
## Classes 'data.table' and 'data.frame':   4177 obs. of  9 variables:
##  $ Sex          : chr  "M" "M" "F" "M" ...
##  $ Length       : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter     : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height       : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ WholeWeight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ ShuckedWeight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ VisceraWeight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ ShellWeight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings        : int  15 7 9 10 7 8 20 16 9 19 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Turn Sex into a factor rather than Char
abaloneDS$Sex <- factor(abaloneDS$Sex)
# Summary
summary(abaloneDS)
##  Sex          Length         Diameter          Height      
##  F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  M:1528   Median :0.545   Median :0.4250   Median :0.1400  
##           Mean   :0.524   Mean   :0.4079   Mean   :0.1395  
##           3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##           Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##   WholeWeight     ShuckedWeight    VisceraWeight     ShellWeight    
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340  
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050  
##      Rings       
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median : 9.000  
##  Mean   : 9.934  
##  3rd Qu.:11.000  
##  Max.   :29.000
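
The summary lists Sex in the order F, I, M because `factor()` sorts its levels alphabetically by default, regardless of the order in which the values appear. A minimal sketch, assuming the Sex values of the first five records; the `levels` argument shown at the end is one way to impose a different order if that is preferred.

```r
# Sex values of the first five records
sex <- c("M", "M", "F", "M", "I")

# Default behaviour: levels are the sorted unique values
sexFactor <- factor(sex)
print(levels(sexFactor))

# The underlying integer codes index into those levels
print(as.integer(sexFactor))

# An explicit level order can be supplied instead
sexOrdered <- factor(sex, levels = c("I", "F", "M"))
print(levels(sexOrdered))
```

The integer codes here (3 3 1 3 2) correspond to the codes that `str()` reports for a factor column built from these values.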

Subsetting I

Subset the data frame using the first 10 observations (include all variables). Then convert it to a matrix. Provide the R code with outputs and explain everything that you do in this step.

# Subset the first 10 observations, including all variables
abaloneSS <- abaloneDS[1:10, ]
# Checking the SS pt. 1 (as_tibble() replaces the deprecated tbl_df())
dplyr::as_tibble(abaloneSS)
## # A tibble: 10 x 9
##    Sex   Length Diameter Height WholeWeight ShuckedWeight VisceraWeight
##    <fct>  <dbl>    <dbl>  <dbl>       <dbl>         <dbl>         <dbl>
##  1 M      0.455    0.365 0.0950       0.514        0.224         0.101 
##  2 M      0.350    0.265 0.0900       0.226        0.0995        0.0485
##  3 F      0.530    0.420 0.135        0.677        0.256         0.142 
##  4 M      0.440    0.365 0.125        0.516        0.216         0.114 
##  5 I      0.330    0.255 0.0800       0.205        0.0895        0.0395
##  6 I      0.425    0.300 0.0950       0.352        0.141         0.0775
##  7 F      0.530    0.415 0.150        0.778        0.237         0.142 
##  8 F      0.545    0.425 0.125        0.768        0.294         0.150 
##  9 M      0.475    0.370 0.125        0.509        0.216         0.112 
## 10 F      0.550    0.440 0.150        0.894        0.314         0.151 
## # ... with 2 more variables: ShellWeight <dbl>, Rings <int>
# Checking the SS pt. 2
str(abaloneSS)
## Classes 'data.table' and 'data.frame':   10 obs. of  9 variables:
##  $ Sex          : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1
##  $ Length       : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55
##  $ Diameter     : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44
##  $ Height       : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15
##  $ WholeWeight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ ShuckedWeight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ VisceraWeight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ ShellWeight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32
##  $ Rings        : int  15 7 9 10 7 8 20 16 9 19
##  - attr(*, ".internal.selfref")=<externalptr>
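# Convert SS to a matrix; a matrix holds a single type, so the non-numeric
# Sex column causes every value to be coerced to character
abaloneSSmatrix <- as.matrix(abaloneSS)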
# Checking the class
class(abaloneSSmatrix)
## [1] "matrix"
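
Because a matrix can hold only one data type, converting a subset that mixes a factor with numeric columns yields a character matrix. The sketch below shows this on a three-row base-R data frame and contrasts it with converting only the numeric columns. The object names and the choice to drop Sex are assumptions for illustration, not part of the original analysis.

```r
# Small mixed-type data frame (a factor plus two numeric columns)
ss <- data.frame(Sex = factor(c("M", "M", "F")),
                 Length = c(0.455, 0.350, 0.530),
                 Rings = c(15L, 7L, 9L))

# Mixed types: every cell becomes a character string
mixedMatrix <- as.matrix(ss)
print(typeof(mixedMatrix))

# Numeric columns only: the matrix stays numeric
numericMatrix <- as.matrix(ss[, c("Length", "Rings")])
print(typeof(numericMatrix))
```

If arithmetic on the matrix is needed later, dropping or separately encoding the categorical column before conversion avoids having to convert the strings back to numbers.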

Subsetting II

Subset the data frame including only the first and the last variable in the data set, and save it as an R object file (.RData). Provide the R code with outputs and explain everything that you do in this step.

# Subset the first and last observations (rows), keeping all variables
abaloneSStwo <- abaloneDS[c(1, nrow(abaloneDS)), ]
# Checking the SS pt. 1
dplyr::as_tibble(abaloneSStwo)
## # A tibble: 2 x 9
##   Sex   Length Diameter Height WholeWeight ShuckedWeight VisceraWeight
##   <fct>  <dbl>    <dbl>  <dbl>       <dbl>         <dbl>         <dbl>
## 1 M      0.455    0.365 0.0950       0.514         0.224         0.101
## 2 M      0.710    0.555 0.195        1.95          0.946         0.376
## # ... with 2 more variables: ShellWeight <dbl>, Rings <int>
# Checking the SS
str(abaloneSStwo)
## Classes 'data.table' and 'data.frame':   2 obs. of  9 variables:
##  $ Sex          : Factor w/ 3 levels "F","I","M": 3 3
##  $ Length       : num  0.455 0.71
##  $ Diameter     : num  0.365 0.555
##  $ Height       : num  0.095 0.195
##  $ WholeWeight  : num  0.514 1.948
##  $ ShuckedWeight: num  0.225 0.946
##  $ VisceraWeight: num  0.101 0.377
##  $ ShellWeight  : num  0.15 0.495
##  $ Rings        : int  15 12
##  - attr(*, ".internal.selfref")=<externalptr>
# Save as a single-object RDS file in the working directory (readRDS() restores it)
saveRDS(abaloneSStwo, "abalonePT2.rds")
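
The task statement for this step asks for the first and last variables and an .RData file, whereas the code above keeps the first and last observations and writes an .rds file. A minimal sketch of the column-wise variant follows. It uses a three-row base-R data frame as a stand-in for the full data set, and the object and file names are hypothetical.

```r
# Stand-in for the full data set (three records, three of the nine columns)
sampleDS <- data.frame(Sex = factor(c("M", "M", "F")),
                       Length = c(0.455, 0.350, 0.530),
                       Rings = c(15L, 7L, 9L))

# Keep only the first and last columns, selected by position
firstLast <- sampleDS[, c(1, ncol(sampleDS))]
print(names(firstLast))

# save() stores named objects in an .RData file; a temporary path is used here
rdataPath <- file.path(tempdir(), "abaloneFirstLast.RData")
save(firstLast, file = rdataPath)

# load() restores the object under its original name, here into a fresh environment
restored <- new.env()
load(rdataPath, envir = restored)
print(names(restored$firstLast))
```

Unlike `saveRDS()`, which stores one unnamed object, `save()` records the object's name, so `load()` recreates `firstLast` without an assignment.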