I will be doing my Final Project Presentation on the Abalone Dataset and the Ionosphere Dataset. Below is my work, and I hope you enjoy!
The Abalone dataset contains the physical measurements of abalones, which are large shellfish (edible sea snails). The dataset comes from a 1994 study, "The Population Biology of Abalone (Haliotis species) in Tasmania." The dataset was donated to the UCI Machine Learning Repository in 1995 by Sam Waugh from the Department of Computer Science at the University of Tasmania (Australia). During my research on this dataset, I learned that the original data contained missing values, though those were removed before the dataset was donated.
The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. The physical measurements can be used to predict the number of rings, and therefore the age, with some accuracy. However, it should be noted that information not present in the dataset, like weather patterns, location, and food availability, could be used to improve the accuracy of predictions.
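The UCI description of this dataset states that adding 1.5 to the ring count gives the age in years. A tiny sketch of that conversion is below; the function name `rings_to_age` is hypothetical and not part of the dataset or any package.

```r
# Assumption taken from the UCI dataset description: age in years = rings + 1.5
rings_to_age <- function(rings) {
  rings + 1.5
}

rings_to_age(c(1, 9, 29))  # 2.5 10.5 30.5
```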
There are 4177 rows and 9 columns. The columns include 1 categorical predictor (sex), 7 continuous predictors (Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight), and an integer response variable (number of rings).
# Read the dataset into a data frame; the file has no header row.
# stringsAsFactors = TRUE keeps Sex as a factor (the default changed to FALSE in R 4.0).
abalone <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
                    header = FALSE, stringsAsFactors = TRUE)
# Column names for the Abalone dataset
names(abalone) <- c("Sex", "Length", "Diameter", "Height", "Weight.whole",
                    "Weight.shucked", "Weight.viscera", "Weight.shell", "Rings")
summary(abalone)
## Sex Length Diameter Height
## F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000
## I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150
## M:1528 Median :0.545 Median :0.4250 Median :0.1400
## Mean :0.524 Mean :0.4079 Mean :0.1395
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650
## Max. :0.815 Max. :0.6500 Max. :1.1300
## Weight.whole Weight.shucked Weight.viscera Weight.shell
## Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015
## 1st Qu.:0.4415 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300
## Median :0.7995 Median :0.3360 Median :0.1710 Median :0.2340
## Mean :0.8287 Mean :0.3594 Mean :0.1806 Mean :0.2388
## 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290
## Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050
## Rings
## Min. : 1.000
## 1st Qu.: 8.000
## Median : 9.000
## Mean : 9.934
## 3rd Qu.:11.000
## Max. :29.000
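One detail in the summary stands out: the minimum Height is 0, which cannot be a real measurement. A minimal sketch for pulling out such rows is below. The helper name `find_nonpositive_height` is my own, and it assumes a data frame with a `Height` column like the one built above.

```r
# Hypothetical helper: return the rows whose Height is zero or negative
find_nonpositive_height <- function(df) {
  # which() drops any NA comparisons instead of returning NA rows
  df[which(df$Height <= 0), , drop = FALSE]
}

# Only runs when the abalone data frame from the code above is in the session
if (exists("abalone")) {
  print(find_nonpositive_height(abalone))
}
```

Whether to drop or correct these rows is a judgment call that depends on what the analysis is for.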
I explored various graphics to see correlations and relationships within the dataset.
library(ggplot2)
# bins is set explicitly; without it ggplot2 prints a message about defaulting to 30 bins
ggplot(abalone) + aes(Rings, color = Sex) + geom_histogram(bins = 30)
ggplot(abalone) + aes(Rings, color = Sex) + geom_density()
ggplot(abalone, aes(x = Weight.whole, color = Sex)) +
  geom_histogram(bins = 30) +
  facet_grid(Sex ~ .)
ggplot(abalone) + aes(Length, Rings, color = Sex) + geom_point() +
  labs(x = "Shell Length", y = "Rings", title = "Relationship between Rings and Length", color = "Sex of Abalone")
ggplot(abalone) + aes(Length, Rings, color = Sex) + geom_point() +
  labs(x = "Shell Length", y = "Rings", title = "Relationship Between Length and Rings using Facet") +
  facet_grid(. ~ Sex)
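As a rough check on the earlier statement that the physical measurements can predict the number of rings, a plain linear regression could be fitted. This is only a sketch: the helper names `fit_rings_lm` and `rmse` are hypothetical, and linear regression is my own choice of technique here, not something taken from the original study.

```r
# Hypothetical helper: linear model of Rings on all other columns of df
fit_rings_lm <- function(df) {
  lm(Rings ~ ., data = df)
}

# Root-mean-squared error of the model's predictions on df, in rings
rmse <- function(model, df) {
  sqrt(mean((df$Rings - predict(model, newdata = df))^2))
}

# Only runs when the abalone data frame from the code above is in the session
if (exists("abalone")) {
  fit <- fit_rings_lm(abalone)
  print(summary(fit)$r.squared)
  print(rmse(fit, abalone))
}
```

The error here is measured on the same rows the model was fitted to, so it is likely optimistic; a held-out test set would give a fairer number.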
The Ionosphere dataset contains radar returns from the ionosphere, the layer of the earth’s atmosphere that contains a high concentration of ions and free electrons and is able to reflect radio waves. The returns were collected by a system in Goose Bay, Labrador, which consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The Johns Hopkins University Ionosphere database was donated to the UCI Repository of Machine Learning Databases by Vince Sigillito in 1989. Sigillito has previously used this dataset to classify radar returns from the ionosphere with neural networks.
The free electrons in the ionosphere were the targets. “Good” radar returns are those showing evidence of some type of structure in the ionosphere, while “Bad” radar returns are those that do not; their signals pass through the ionosphere. The received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.
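With 17 pulse numbers and 2 attributes each, there are 17 × 2 = 34 predictor columns. The sketch below lays out that bookkeeping as a lookup table. It rests on assumptions of mine that the text does not confirm: that the columns come in consecutive pairs, that the real part comes first in each pair, and that pulses are numbered 1 to 17.

```r
n_pulses <- 17

# Assumed layout: consecutive (real, imaginary) pairs, one pair per pulse number
layout <- data.frame(
  column = paste0("V", seq_len(2 * n_pulses)),
  pulse  = rep(seq_len(n_pulses), each = 2),
  part   = rep(c("real", "imaginary"), times = n_pulses)
)

nrow(layout)     # 34
head(layout, 4)
```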
The data frame has 351 observations on 35 variables. The first 34 (32 numerical and 2 nominal, V1 and V2) are used for the prediction, and the last one is the class attribute.
# Library for constructing a decision tree
library(rpart)
# Ionosphere dataset is available in 'mlbench' library
library(mlbench)
# Used for splitting the dataset into train and test data
library(caTools)
library(datasets)
data('Ionosphere')
summary(Ionosphere)
## V1 V2 V3 V4 V5
## 0: 38 0:351 Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1:313 1st Qu.: 0.4721 1st Qu.:-0.06474 1st Qu.: 0.4127
## Median : 0.8711 Median : 0.01631 Median : 0.8092
## Mean : 0.6413 Mean : 0.04437 Mean : 0.6011
## 3rd Qu.: 1.0000 3rd Qu.: 0.19418 3rd Qu.: 1.0000
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
## V6 V7 V8 V9
## Min. :-1.0000 Min. :-1.0000 Min. :-1.00000 Min. :-1.00000
## 1st Qu.:-0.0248 1st Qu.: 0.2113 1st Qu.:-0.05484 1st Qu.: 0.08711
## Median : 0.0228 Median : 0.7287 Median : 0.01471 Median : 0.68421
## Mean : 0.1159 Mean : 0.5501 Mean : 0.11936 Mean : 0.51185
## 3rd Qu.: 0.3347 3rd Qu.: 0.9692 3rd Qu.: 0.44567 3rd Qu.: 0.95324
## Max. : 1.0000 Max. : 1.0000 Max. : 1.00000 Max. : 1.00000
## V10 V11 V12
## Min. :-1.00000 Min. :-1.00000 Min. :-1.00000
## 1st Qu.:-0.04807 1st Qu.: 0.02112 1st Qu.:-0.06527
## Median : 0.01829 Median : 0.66798 Median : 0.02825
## Mean : 0.18135 Mean : 0.47618 Mean : 0.15504
## 3rd Qu.: 0.53419 3rd Qu.: 0.95790 3rd Qu.: 0.48237
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## V13 V14 V15 V16
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.07372 1st Qu.: 0.0000 1st Qu.:-0.08170
## Median : 0.6441 Median : 0.03027 Median : 0.6019 Median : 0.00000
## Mean : 0.4008 Mean : 0.09341 Mean : 0.3442 Mean : 0.07113
## 3rd Qu.: 0.9555 3rd Qu.: 0.37486 3rd Qu.: 0.9193 3rd Qu.: 0.30897
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.00000
## V17 V18 V19
## Min. :-1.0000 Min. :-1.000000 Min. :-1.0000
## 1st Qu.: 0.0000 1st Qu.:-0.225690 1st Qu.: 0.0000
## Median : 0.5909 Median : 0.000000 Median : 0.5762
## Mean : 0.3819 Mean :-0.003617 Mean : 0.3594
## 3rd Qu.: 0.9357 3rd Qu.: 0.195285 3rd Qu.: 0.8993
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000
## V20 V21 V22
## Min. :-1.00000 Min. :-1.0000 Min. :-1.000000
## 1st Qu.:-0.23467 1st Qu.: 0.0000 1st Qu.:-0.243870
## Median : 0.00000 Median : 0.4991 Median : 0.000000
## Mean :-0.02402 Mean : 0.3367 Mean : 0.008296
## 3rd Qu.: 0.13437 3rd Qu.: 0.8949 3rd Qu.: 0.188760
## Max. : 1.00000 Max. : 1.0000 Max. : 1.000000
## V23 V24 V25 V26
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.36689 1st Qu.: 0.0000 1st Qu.:-0.33239
## Median : 0.5318 Median : 0.00000 Median : 0.5539 Median :-0.01505
## Mean : 0.3625 Mean :-0.05741 Mean : 0.3961 Mean :-0.07119
## 3rd Qu.: 0.9112 3rd Qu.: 0.16463 3rd Qu.: 0.9052 3rd Qu.: 0.15676
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.00000
## V27 V28 V29 V30
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.2864 1st Qu.:-0.44316 1st Qu.: 0.0000 1st Qu.:-0.23689
## Median : 0.7082 Median :-0.01769 Median : 0.4966 Median : 0.00000
## Mean : 0.5416 Mean :-0.06954 Mean : 0.3784 Mean :-0.02791
## 3rd Qu.: 0.9999 3rd Qu.: 0.15354 3rd Qu.: 0.8835 3rd Qu.: 0.15407
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.00000
## V31 V32 V33
## Min. :-1.0000 Min. :-1.000000 Min. :-1.0000
## 1st Qu.: 0.0000 1st Qu.:-0.242595 1st Qu.: 0.0000
## Median : 0.4428 Median : 0.000000 Median : 0.4096
## Mean : 0.3525 Mean :-0.003794 Mean : 0.3494
## 3rd Qu.: 0.8576 3rd Qu.: 0.200120 3rd Qu.: 0.8138
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000
## V34 Class
## Min. :-1.00000 bad :126
## 1st Qu.:-0.16535 good:225
## Median : 0.00000
## Mean : 0.01448
## 3rd Qu.: 0.17166
## Max. : 1.00000
actdata <- Ionosphere
# Fix the random seed so the split is reproducible (123 is an arbitrary choice)
set.seed(123)
# sample.split keeps the class ratio the same in both subsets
samples <- sample.split(actdata$Class, SplitRatio = 0.8)
# Train data
train_set <- subset(actdata, samples == TRUE)
# Test data
test_set <- subset(actdata, samples == FALSE)
# rpart constructs the decision tree; method = 'class' requests a classification tree
modeling <- rpart(Class ~ ., data = train_set, method = 'class')
plot(modeling); text(modeling)
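The test set created above is not used yet. A natural next step is to predict classes for it and compare them with the true labels. The sketch below is one way to do that; `evaluate_tree` is a hypothetical helper name, and it assumes the `modeling` and `test_set` objects from the code above exist in the session.

```r
library(rpart)

# Hypothetical helper: confusion matrix and accuracy for a classification tree
evaluate_tree <- function(model, test_df, class_col = "Class") {
  # type = "class" returns the predicted label rather than class probabilities
  predicted <- predict(model, newdata = test_df, type = "class")
  actual <- test_df[[class_col]]
  confusion <- table(Actual = actual, Predicted = predicted)
  # Share of rows where the predicted label matches the true label
  accuracy <- mean(as.character(predicted) == as.character(actual))
  list(confusion = confusion, accuracy = accuracy)
}

# Only runs when the tree and test set from the code above are in the session
if (exists("modeling") && exists("test_set")) {
  result <- evaluate_tree(modeling, test_set)
  print(result$confusion)
  print(result$accuracy)
}
```

Because the split is random, the exact numbers will change from run to run unless a seed is set before splitting.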