(A)Loading and Viewing Iris Dataset

Iris dataset

data(iris)

The iris data set contains the following:

Summarising and describing data

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

library(psych)  # provides describe()
describe(iris)

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The data set thus covers three species of iris: setosa, versicolor, and virginica, with 50 flowers each. Every flower is described by four measurements: sepal length, sepal width, petal length, and petal width.
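As a quick check on the structure described above, the per-species counts and mean measurements can be computed with base R. This is a minimal sketch; the name species_means is introduced here for illustration only.

```r
# Load the built-in data set so the example is self-contained
data(iris)

# Number of flowers recorded for each species
table(iris$Species)

# Mean of each of the four measurements, computed separately per species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

The result has one row per species and one column per measurement, which makes differences between species easy to compare at a glance.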

Next, we look at the first rows of the data set. The four measurement columns are also copied into short-named vectors for later use.

sep.l <- iris$Sepal.Length
sep.w <- iris$Sepal.Width
pet.l <- iris$Petal.Length
pet.w <- iris$Petal.Width
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

(B)Creating Validation, Training and Test Datasets

In machine learning, a data set is usually divided into three parts:

1) Training set, 2) Test set, 3) Validation set.

Here we hold out 20% of the data for validation and keep the remaining 80% for training and testing.

PARTITIONING THE IRIS DATASET

Using the createDataPartition function from the caret package, we split the data in these proportions.

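library(caret)  # provides createDataPartition, train, confusionMatrix
# Row indices of the 80% used for training and testing, balanced by species
vldndataindex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
vldndata <- iris[-vldndataindex,]  # remaining 20% held out for validation
ds <- iris[vldndataindex,]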
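For a factor outcome, createDataPartition samples within each class so that the split keeps the class proportions. The same idea can be sketched in base R as follows. This is an illustration rather than caret's actual implementation; the seed value and the names idx, train, and valid are assumptions made for the example.

```r
data(iris)
set.seed(42)  # hypothetical seed, used only to make this sketch repeatable

# Group row numbers by species, then draw 80% of the rows from each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train <- iris[idx, ]   # 80% for training and testing
valid <- iris[-idx, ]  # 20% held out for validation

table(train$Species)
table(valid$Species)
```

Because the draw is done per species, each species contributes 40 rows to the larger part and 10 rows to the held-out part.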

Levels

levels(ds$Species)

## [1] "setosa"     "versicolor" "virginica"

Summarizing new data set

We now summarize the new data set, ds.

summary(ds)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width 
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.1  
##  1st Qu.:5.100   1st Qu.:2.775   1st Qu.:1.575   1st Qu.:0.3  
##  Median :5.750   Median :3.000   Median :4.250   Median :1.3  
##  Mean   :5.814   Mean   :3.052   Mean   :3.729   Mean   :1.2  
##  3rd Qu.:6.400   3rd Qu.:3.325   3rd Qu.:5.100   3rd Qu.:1.8  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.5  
##        Species  
##  setosa    :40  
##  versicolor:40  
##  virginica :40  
##                 
##                 
##

str(ds)

## 'data.frame':    120 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 5 4.4 5.4 4.8 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 2.9 3.7 3.4 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.5 1.4 1.5 1.6 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Table describing new data set

percentage <- prop.table(table(ds$Species)) * 100
cbind(freq=table(ds$Species), percentage=percentage)

##            freq percentage
## setosa       40   33.33333
## versicolor   40   33.33333
## virginica    40   33.33333

(C)Data visualisation

To understand the data better, we visualize it and examine how the measurements behave.

Box Plot of the new dataset

# Box plot of each of the four measurements
a <- ds[,1:4]
b <- ds[,5]
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(a[,i], main=names(a)[i])
}

Here we see the distribution and spread of the observations for each variable: sepal length, sepal width, petal length, and petal width.
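The box plots above pool all three species. A per-species view can show how much of the spread comes from differences between species. The sketch below uses the full iris data rather than ds so that it is self-contained, and it reads the box plot statistics as numbers instead of drawing them; the names bp and stats are introduced here for illustration.

```r
data(iris)

# Compute box plot statistics of petal length for each species without plotting
bp <- boxplot(Petal.Length ~ Species, data = iris, plot = FALSE)

# One column per species; rows are the five values each box is drawn from
stats <- bp$stats
colnames(stats) <- bp$names
rownames(stats) <- c("lower whisker", "lower hinge", "median",
                     "upper hinge", "upper whisker")
stats
```

Dropping plot = FALSE draws the grouped box plot itself.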

Scatter plot of sepal length against petal length in the iris data set

library(lattice)  # provides xyplot()
xyplot(Sepal.Length ~ Petal.Length | Species, groups = Species, data = iris, type = c("p", "smooth"), scales = "free")

Scatter plot of sepal width against petal width in the iris data set

xyplot(Sepal.Width ~ Petal.Width | Species, groups = Species, data = iris, type = c("p", "smooth"), scales = "free")

Scatter plot using ggplot2

library(ggplot2)
scatterPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + geom_point() + scale_color_manual(values = c('#999999','#E69F00','#56B4E9')) + theme(legend.position=c(0,1), legend.justification=c(0,1)) + geom_density2d()

scatterPlot

Three-dimensional scatter plot of the species

library(scatterplot3d)
# 3D scatter plot of the first three measurements, colored by species
colors <- c("#999999", "#E69F00", "#56B4E9")
colors <- colors[as.numeric(iris$Species)]
s3d <- scatterplot3d(iris[,1:3], pch = 16, color=colors)
legend(s3d$xyz.convert(7.5, 3, 4.5), legend = levels(iris$Species),
      col =  c("#999999", "#E69F00", "#56B4E9"), pch = 16)

Three-dimensional plot of sepal length, petal length, and sepal width with a smooth regression surface

library(car)  # provides scatter3d(); drawing requires the rgl package
scatter3d(x = sep.l, y = pet.l, z = sep.w, groups = iris$Species, grid = FALSE, fit = "smooth")

(D)Data Modelling

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
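The trainControl call above asks caret to evaluate each model with 10-fold cross-validation and to compare models by accuracy. What that resampling does can be sketched by hand for one model. The example below is an illustration, not caret's internal code: it uses MASS::lda directly on the full iris data, assigns folds at random without balancing species, and the seed and the names k, folds, and acc are assumptions.

```r
library(MASS)  # lda(); MASS ships with standard R distributions
data(iris)
set.seed(1)    # hypothetical seed, used only to make this sketch repeatable

k <- 10
# Assign every row to one of k folds of equal size, in random order
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold
acc <- sapply(seq_len(k), function(i) {
  fit  <- lda(Species ~ ., data = iris[folds != i, ])
  pred <- predict(fit, iris[folds == i, ])$class
  mean(pred == iris$Species[folds == i])
})

acc        # accuracy on each of the 10 held-out folds
mean(acc)  # cross-validated accuracy estimate
```

The mean over folds corresponds to the Accuracy value that caret reports for a model, and the spread across folds is what the resampling summary describes with its minimum, quartiles, and maximum.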

Linear Discriminant Analysis (LDA) model

fit.lda <- train(Species~., data=ds, method="lda", metric=metric, trControl=control)

Classification and Regression Trees (CART) model

fit.cart <- train(Species~., data=ds, method="rpart", metric=metric, trControl=control)

kNN model

fit.knn <- train(Species~., data=ds, method="knn", metric=metric, trControl=control)

SVM

fit.svm <- train(Species~., data=ds, method="svmRadial", metric=metric, trControl=control)

Random Forest

fit.rf <- train(Species~., data=ds, method="rf", metric=metric, trControl=control)

Summarising Results

results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu. Median      Mean 3rd Qu. Max. NA's
## lda  0.9166667 0.9375000      1 0.9750000       1    1    0
## cart 0.8333333 0.9375000      1 0.9666667       1    1    0
## knn  0.9166667 0.9375000      1 0.9750000       1    1    0
## svm  0.9166667 0.9166667      1 0.9666667       1    1    0
## rf   0.8333333 0.9375000      1 0.9666667       1    1    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.875 0.90625      1 0.9625       1    1    0
## cart 0.750 0.90625      1 0.9500       1    1    0
## knn  0.875 0.90625      1 0.9625       1    1    0
## svm  0.875 0.87500      1 0.9500       1    1    0
## rf   0.750 0.90625      1 0.9500       1    1    0

Visualising Results

dotplot(results)

Printing the results with their accuracy

print(fit.lda)

## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.975     0.9625

print(fit.cart)

## CART 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa
##   0.000  0.9666667  0.950
##   0.475  0.7166667  0.575
##   0.500  0.3333333  0.000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

print(fit.knn)

## k-Nearest Neighbors 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa 
##   5  0.9666667  0.9500
##   7  0.9750000  0.9625
##   9  0.9666667  0.9500
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

print(fit.rf)

## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa
##   2     0.9666667  0.95 
##   3     0.9666667  0.95 
##   4     0.9666667  0.95 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

print(fit.svm)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa 
##   0.25  0.9500000  0.9250
##   0.50  0.9583333  0.9375
##   1.00  0.9666667  0.9500
## 
## Tuning parameter 'sigma' was held constant at a value of 0.7450586
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.7450586 and C = 1.

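Hence we observe that the Linear Discriminant Analysis (LDA) model and the kNN model (k = 7) share the highest mean cross-validated accuracy, about 97.5%, with the other models close behind at about 96.7%. We use the LDA model for prediction.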

Predicting on Validation dataset

Predictions on the validation data set are made as follows:

predictions <- predict(fit.lda, vldndata)
confusionMatrix(predictions, vldndata$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
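The overall accuracy, Kappa, and per-class sensitivity reported above can be reproduced by hand from the confusion matrix. The sketch below re-enters the matrix shown in the output and applies the standard definitions; the variable names are introduced here for illustration.

```r
# Confusion matrix from the output above: rows are predictions, columns are references
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10, 0,  0,
                0, 9,  0,
                0, 1, 10),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n <- sum(cm)

# Accuracy: share of flowers on the diagonal (29 of 30)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: accuracy corrected for chance agreement
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions divided by the true class count
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
sensitivity
```

With balanced classes the chance agreement is 1/3, so the single misclassified versicolor flower gives an accuracy of 29/30 and a Kappa of 0.95, matching the reported values.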

Iris flower

Pratik Kumar