For my final project presentation, I worked with two datasets: the Abalone dataset and the Ionosphere dataset. My work is below, and I hope you enjoy it!

Introduction of the Abalone Dataset

The Abalone dataset contains physical measurements of abalones, which are large, edible sea snails. The data come from a 1994 study, "The Population Biology of Abalone (Haliotis species) in Tasmania." The dataset was donated to the UCI Machine Learning Repository in 1995 by Sam Waugh of the Department of Computer Science at the University of Tasmania (Australia). During my research on this dataset, I learned that the original data contained missing values, though those were removed before the dataset was donated.

The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope. Because that process is time-consuming, the physical measurements, which are easier to obtain, can be used instead to predict the number of rings, and therefore the age, with some accuracy. It should be noted, however, that information not present in the dataset, such as weather patterns, location, and food availability, could improve the accuracy of the predictions.

Abalone Dataset Contents

There are 4177 rows and 9 columns. The columns include 1 categorical predictor (sex), 7 continuous predictors (Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight), and an integer response variable (number of rings).

Descriptions:

  • Sex: M (male), F (female), or I (infant)
  • Length: Longest shell measurement
  • Diameter: Shell measurement perpendicular to the length
  • Height: Measured with the meat in the shell
  • Whole weight: Weight of the whole abalone
  • Shucked weight: Weight of the meat
  • Viscera weight: Gut weight (after bleeding)
  • Shell weight: Weight of the shell after being dried
  • Rings: Number of rings; adding 1.5 gives the age in years
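# Read the dataset into a data frame (the file has no header row).
# stringsAsFactors = TRUE keeps Sex as a factor, which R >= 4.0 no longer does by default.
abalone <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE, stringsAsFactors = TRUE)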

# Names of the Abalone Dataset
names(abalone) <- c("Sex", "Length", "Diameter", "Height", "Weight.whole",
    "Weight.shucked", "Weight.viscera", "Weight.shell", "Rings")

Here is a quick summary of the data:

summary(abalone)
##  Sex          Length         Diameter          Height      
##  F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  M:1528   Median :0.545   Median :0.4250   Median :0.1400  
##           Mean   :0.524   Mean   :0.4079   Mean   :0.1395  
##           3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##           Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##   Weight.whole    Weight.shucked   Weight.viscera    Weight.shell   
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340  
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050  
##      Rings       
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median : 9.000  
##  Mean   : 9.934  
##  3rd Qu.:11.000  
##  Max.   :29.000
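
The description of the Rings column says that adding 1.5 to the ring count gives the age in years. As a small sketch of that conversion, the helper below (`rings_to_age` is a name made up for this example, not part of the original analysis) is applied to the minimum, median, and maximum ring counts reported in the summary above:

```r
# Hypothetical helper: convert a ring count to an approximate age in years,
# using the "rings + 1.5" rule from the column descriptions.
rings_to_age <- function(rings) {
  rings + 1.5
}

# Minimum, median, and maximum of Rings, copied from the summary output
ring_counts <- c(min = 1, median = 9, max = 29)

ages <- rings_to_age(ring_counts)
print(ages)
```

With the data frame loaded above, the same rule could be stored as a new column with `abalone$Age <- rings_to_age(abalone$Rings)`.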

Methods

I explored several graphics to look for correlations and relationships within the dataset.

I first explored the ring counts with a histogram, then switched to a density plot (a smoother version that is easier to read) to see how the distribution of rings differs by sex.

library(ggplot2)  # needed for all of the plots below

ggplot(abalone) + aes(Rings, color = Sex) + geom_histogram(bins = 30) # without bins, ggplot2 prints a message asking you to pick a bin count

ggplot(abalone) + aes(Rings, color = Sex) + geom_density()

In the density plot, the female and male curves are nearly identical. The histogram hides this similarity; to see it in a histogram, you need the facet function to give each sex its own panel, as in this example with whole weight.

ggplot(abalone, aes(x = Weight.whole, color = Sex)) + 
  geom_histogram(bins = 30) +
  facet_grid(Sex~.)
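
The visual impression that two groups look alike can also be checked with numbers. The sketch below is one possible way to do that in base R; `rings_by_sex` is a hypothetical helper, and the `toy` data frame holds made-up values for illustration only, not rows from the Abalone dataset.

```r
# Hypothetical helper: summarise ring counts within each sex (base R only).
rings_by_sex <- function(df) {
  # split() groups Rings by Sex; sapply() builds one summary per group,
  # and t() turns the result so that each sex is a row.
  t(sapply(split(df$Rings, df$Sex), function(r) {
    c(n = length(r), mean = mean(r), median = median(r), sd = sd(r))
  }))
}

# Made-up toy rows for illustration only (NOT real abalone measurements)
toy <- data.frame(
  Sex   = factor(c("F", "F", "I", "I", "M", "M")),
  Rings = c(10, 12, 6, 8, 10, 12)
)

res <- rings_by_sex(toy)
print(res)
```

With the real data, `rings_by_sex(abalone)` would give one row per sex, so the female and male rows can be compared directly.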

Next, I wanted to see how the length of an abalone relates to its number of rings, and whether that relationship differs by sex. A scatterplot seemed like the best fit for this (though there are many ways to do it).

ggplot(abalone) +
  aes(Length, Rings, color = Sex) +
  geom_point() +
  labs(x = "Shell Length", y = "Rings",
       title = "Relationship between Rings and Length",
       color = "Sex of Abalone")

For a closer look at each sex individually, I used the facet function again.

ggplot(abalone) +
  aes(Length, Rings, color = Sex) +
  geom_point() +
  labs(x = "Shell Length", y = "Rings",
       title = "Relationship Between Length and Rings using Facet") +
  facet_grid(. ~ Sex)
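
A scatterplot shows the shape of a relationship, and a correlation coefficient can put a rough number on it. The sketch below computes the Pearson correlation between Length and Rings separately for each sex. `cor_length_rings_by_sex` is a hypothetical helper, and the `toy` values are made up for illustration only; they are not taken from the Abalone dataset.

```r
# Hypothetical helper: Pearson correlation of Length and Rings within each sex.
cor_length_rings_by_sex <- function(df) {
  # split() gives one data frame per sex; cor() is computed inside each one
  sapply(split(df, df$Sex), function(d) cor(d$Length, d$Rings))
}

# Made-up toy rows for illustration only (NOT real abalone measurements)
toy <- data.frame(
  Sex    = factor(rep(c("F", "I", "M"), each = 3)),
  Length = c(0.50, 0.55, 0.60, 0.30, 0.35, 0.40, 0.45, 0.55, 0.65),
  Rings  = c(9, 11, 10, 5, 6, 8, 8, 10, 12)
)

cors <- cor_length_rings_by_sex(toy)
print(round(cors, 3))
```

Calling `cor_length_rings_by_sex(abalone)` on the real data would return one coefficient per facet panel.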

In conclusion, there are many other graphics I could have made for this dataset (using the weights, for example); I was simply curious about the ones above.

Introduction of the Ionosphere Dataset

The Ionosphere dataset contains radar returns from the ionosphere, the layer of the earth’s atmosphere that contains a high concentration of ions and free electrons and is able to reflect radio waves. The data were collected by a system consisting of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The Johns Hopkins University Ionosphere database was donated to the UCI Repository of Machine Learning Databases by Vince Sigillito in 1989. This dataset has been used in the past by Sigillito to classify radar returns from the ionosphere using neural networks.

Furthermore, the free electrons in the ionosphere were the targets. “Good” radar returns are those showing evidence of some type of structure in the ionosphere, while “Bad” radar returns are those that do not; their signals pass through the ionosphere. The received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number (17 × 2 = 34 attributes in total), corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.
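
To make the "2 attributes per pulse number" idea concrete, here is a minimal sketch of how pairs of real numbers can be recombined into complex values in R. It assumes that consecutive attributes form (real, imaginary) pairs, which is my assumption and is not stated in the dataset description. `pair_to_complex` is a hypothetical helper, and `toy_row` holds made-up values, not an actual radar return.

```r
# Hypothetical helper: turn a vector of (real, imaginary) pairs into complex numbers.
# ASSUMPTION: odd positions hold real parts and even positions hold imaginary parts.
pair_to_complex <- function(x) {
  stopifnot(length(x) %% 2 == 0)          # need complete pairs
  re <- x[seq(1, length(x), by = 2)]      # 1st, 3rd, 5th, ... values
  im <- x[seq(2, length(x), by = 2)]      # 2nd, 4th, 6th, ... values
  complex(real = re, imaginary = im)
}

# Made-up values for 3 pulse numbers (6 attributes), for illustration only
toy_row <- c(1, 0, 0.5, -0.5, -1, 0.25)

z <- pair_to_complex(toy_row)
print(z)        # one complex value per pulse number
print(Mod(z))   # magnitude of each complex value

# 34 attributes would give 17 complex values, one per pulse number
length(pair_to_complex(numeric(34)))
```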

Ionosphere Dataset Contents

The data frame has 351 observations on 35 variables: 34 predictors (2 nominal, V1 and V2, and 32 numerical) and a final class attribute. The first 34 are used for prediction, and the last one is the class attribute.

The class attribute takes one of two values:

  • Good: Shows some structure
  • Bad: Does not show any structure
# Library for constructing a decision tree
library(rpart)
# Ionosphere dataset is available in 'mlbench' library
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.6.3
# Used for splitting the dataset into train and test data
library(caTools)
## Warning: package 'caTools' was built under R version 3.6.3
library(datasets)

Here is a quick summary of the data:

data('Ionosphere')
summary(Ionosphere)
##  V1      V2            V3                V4                 V5         
##  0: 38   0:351   Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000  
##  1:313           1st Qu.: 0.4721   1st Qu.:-0.06474   1st Qu.: 0.4127  
##                  Median : 0.8711   Median : 0.01631   Median : 0.8092  
##                  Mean   : 0.6413   Mean   : 0.04437   Mean   : 0.6011  
##                  3rd Qu.: 1.0000   3rd Qu.: 0.19418   3rd Qu.: 1.0000  
##                  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000  
##        V6                V7                V8                 V9          
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.00000  
##  1st Qu.:-0.0248   1st Qu.: 0.2113   1st Qu.:-0.05484   1st Qu.: 0.08711  
##  Median : 0.0228   Median : 0.7287   Median : 0.01471   Median : 0.68421  
##  Mean   : 0.1159   Mean   : 0.5501   Mean   : 0.11936   Mean   : 0.51185  
##  3rd Qu.: 0.3347   3rd Qu.: 0.9692   3rd Qu.: 0.44567   3rd Qu.: 0.95324  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.00000  
##       V10                V11                V12          
##  Min.   :-1.00000   Min.   :-1.00000   Min.   :-1.00000  
##  1st Qu.:-0.04807   1st Qu.: 0.02112   1st Qu.:-0.06527  
##  Median : 0.01829   Median : 0.66798   Median : 0.02825  
##  Mean   : 0.18135   Mean   : 0.47618   Mean   : 0.15504  
##  3rd Qu.: 0.53419   3rd Qu.: 0.95790   3rd Qu.: 0.48237  
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000  
##       V13               V14                V15               V16          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.0000   1st Qu.:-0.07372   1st Qu.: 0.0000   1st Qu.:-0.08170  
##  Median : 0.6441   Median : 0.03027   Median : 0.6019   Median : 0.00000  
##  Mean   : 0.4008   Mean   : 0.09341   Mean   : 0.3442   Mean   : 0.07113  
##  3rd Qu.: 0.9555   3rd Qu.: 0.37486   3rd Qu.: 0.9193   3rd Qu.: 0.30897  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V17               V18                 V19         
##  Min.   :-1.0000   Min.   :-1.000000   Min.   :-1.0000  
##  1st Qu.: 0.0000   1st Qu.:-0.225690   1st Qu.: 0.0000  
##  Median : 0.5909   Median : 0.000000   Median : 0.5762  
##  Mean   : 0.3819   Mean   :-0.003617   Mean   : 0.3594  
##  3rd Qu.: 0.9357   3rd Qu.: 0.195285   3rd Qu.: 0.8993  
##  Max.   : 1.0000   Max.   : 1.000000   Max.   : 1.0000  
##       V20                V21               V22           
##  Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.000000  
##  1st Qu.:-0.23467   1st Qu.: 0.0000   1st Qu.:-0.243870  
##  Median : 0.00000   Median : 0.4991   Median : 0.000000  
##  Mean   :-0.02402   Mean   : 0.3367   Mean   : 0.008296  
##  3rd Qu.: 0.13437   3rd Qu.: 0.8949   3rd Qu.: 0.188760  
##  Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.000000  
##       V23               V24                V25               V26          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.0000   1st Qu.:-0.36689   1st Qu.: 0.0000   1st Qu.:-0.33239  
##  Median : 0.5318   Median : 0.00000   Median : 0.5539   Median :-0.01505  
##  Mean   : 0.3625   Mean   :-0.05741   Mean   : 0.3961   Mean   :-0.07119  
##  3rd Qu.: 0.9112   3rd Qu.: 0.16463   3rd Qu.: 0.9052   3rd Qu.: 0.15676  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V27               V28                V29               V30          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.2864   1st Qu.:-0.44316   1st Qu.: 0.0000   1st Qu.:-0.23689  
##  Median : 0.7082   Median :-0.01769   Median : 0.4966   Median : 0.00000  
##  Mean   : 0.5416   Mean   :-0.06954   Mean   : 0.3784   Mean   :-0.02791  
##  3rd Qu.: 0.9999   3rd Qu.: 0.15354   3rd Qu.: 0.8835   3rd Qu.: 0.15407  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V31               V32                 V33         
##  Min.   :-1.0000   Min.   :-1.000000   Min.   :-1.0000  
##  1st Qu.: 0.0000   1st Qu.:-0.242595   1st Qu.: 0.0000  
##  Median : 0.4428   Median : 0.000000   Median : 0.4096  
##  Mean   : 0.3525   Mean   :-0.003794   Mean   : 0.3494  
##  3rd Qu.: 0.8576   3rd Qu.: 0.200120   3rd Qu.: 0.8138  
##  Max.   : 1.0000   Max.   : 1.000000   Max.   : 1.0000  
##       V34            Class    
##  Min.   :-1.00000   bad :126  
##  1st Qu.:-0.16535   good:225  
##  Median : 0.00000             
##  Mean   : 0.01448             
##  3rd Qu.: 0.17166             
##  Max.   : 1.00000

This was absolutely the hardest part. I struggled to put this dataset into graphics (histogram, boxplot, etc.). Through my research I learned how to fit a decision tree to the Ionosphere data. A decision tree is a graph that represents choices and their results in the form of a tree (similar to a flowchart). Generally, the model is built from observed data, also called training data (which you will see below). A held-out set of validation data is then used to verify and improve the model.

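set.seed(123)  # arbitrary seed so the random split is reproducible
actdata <- Ionosphere
# Split 80/20 while preserving the ratio of good to bad returns
samples <- sample.split(actdata$Class, SplitRatio = 0.8)
# Train data
train_set <- subset(actdata, samples == TRUE)
# Test data
test_set <- subset(actdata, samples == FALSE)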

# rpart constructs the decision tree; method = 'class' requests a classification tree
modeling <- rpart(Class ~ ., data = train_set, method = 'class')

plot(modeling);text(modeling)
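
The paragraph above mentions that held-out data is used to verify the model. As a hedged sketch of what that check could look like, the block below repeats the split and fit with its own variable names, then predicts the class of each test row and compares it with the true label. The seed value and the names `split_flag`, `fit`, `pred`, and `conf_mat` are my own choices for this example, and the accuracy will vary with the seed.

```r
library(rpart)    # decision trees
library(mlbench)  # provides the Ionosphere data
library(caTools)  # provides sample.split()

data("Ionosphere")
set.seed(42)  # arbitrary seed, only so the example is repeatable

# 80/20 split that keeps the good/bad ratio similar in both parts
split_flag <- sample.split(Ionosphere$Class, SplitRatio = 0.8)
train_set  <- subset(Ionosphere, split_flag == TRUE)
test_set   <- subset(Ionosphere, split_flag == FALSE)

# Fit a classification tree on the training rows only
fit <- rpart(Class ~ ., data = train_set, method = "class")

# Predict the class label for each held-out row
pred <- predict(fit, newdata = test_set, type = "class")

# Confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(predicted = pred, actual = test_set$Class)
print(conf_mat)

# Accuracy = share of test rows on the diagonal (predicted == actual)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Keeping the test rows out of the fitting step is what makes this accuracy an honest estimate of how the tree would do on new radar returns.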

In conclusion, the decision tree not only helped me get a visual of the Ionosphere dataset, it also helped me understand a new way to display data.

I hope you enjoyed my presentation, and THANK YOU!