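library(psych)          # describe()
library(caret)          # createDataPartition(), train(), confusionMatrix()
library(lattice)        # xyplot(), dotplot()
library(ggplot2)        # ggplot()
library(scatterplot3d)  # scatterplot3d()
library(car)            # scatter3d()
data(iris)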
The iris data set is summarized as follows:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Thus we find three species of iris flowers: setosa, versicolor and virginica. The analysis considers four variables of the flower: sepal length, sepal width, petal length and petal width.
Next we look at the head of the data set.
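To see how the four measurements differ between the three species, a per-species mean can be computed with base R. This is a small sketch that is not part of the original analysis; `species_means` is a name chosen for the illustration.

```r
data(iris)

# Mean of each of the four measurements within each species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

Each row of the result describes one species, which makes it easy to spot that the petal measurements separate the species more strongly than the sepal measurements.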
sep.l <- iris$Sepal.Length
sep.w <- iris$Sepal.Width
pet.l <- iris$Petal.Length
pet.w <- iris$Petal.Width
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
When applying machine learning to a data set, the data are usually divided into three parts:
1) training, 2) test, 3) validation.
Here we hold out 20% of the data for validation and keep the remaining 80% for training and testing.
The createDataPartition function from the caret package splits the data accordingly.
vldndataindex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
vldndata <- iris[-vldndataindex,]
ds <- iris[vldndataindex,]
levels(ds$Species)
## [1] "setosa" "versicolor" "virginica"
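createDataPartition draws its sample within each level of the outcome, which is why every species keeps the same share of rows in both pieces. The base-R sketch below imitates that idea by sampling 80% of the row indices inside each species. It illustrates the principle rather than caret's exact algorithm; the seed and the names `train_idx`, `train_part` and `holdout_part` are assumptions made for the example.

```r
data(iris)
set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Split the row indices by species, then keep 80% of the indices in each group.
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_species, function(idx) {
  sample(idx, size = round(0.80 * length(idx)))
}))

train_part   <- iris[train_idx, ]   # 80% of each species
holdout_part <- iris[-train_idx, ]  # remaining 20%, held back for validation

table(train_part$Species)
table(holdout_part$Species)
```

Because the sampling happens per species, the 40/10 split per class holds for any seed, not only for the one used here.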
We now summarize the resulting 120-row data set.
summary(ds)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.1
## 1st Qu.:5.100 1st Qu.:2.775 1st Qu.:1.575 1st Qu.:0.3
## Median :5.750 Median :3.000 Median :4.250 Median :1.3
## Mean :5.814 Mean :3.052 Mean :3.729 Mean :1.2
## 3rd Qu.:6.400 3rd Qu.:3.325 3rd Qu.:5.100 3rd Qu.:1.8
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.5
## Species
## setosa :40
## versicolor:40
## virginica :40
##
##
##
str(ds)
## 'data.frame': 120 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 5 4.4 5.4 4.8 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 2.9 3.7 3.4 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.5 1.4 1.5 1.6 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
percentage <- prop.table(table(ds$Species)) * 100
cbind(freq=table(ds$Species), percentage=percentage)
## freq percentage
## setosa 40 33.33333
## versicolor 40 33.33333
## virginica 40 33.33333
To understand the data better, we visualize it and examine how each variable behaves.
# Box plots of the four measurements
a <- ds[,1:4]
b <- ds[,5]
par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(a[,i], main=names(iris)[i])
}
The box plots show the distribution and spread of the observations for each of the four variables: sepal length, sepal width, petal length and petal width.
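The numbers behind each box plot can also be inspected directly. The sketch below uses `boxplot.stats` from base R, which returns the whisker ends, the hinges and the median that `boxplot` draws. It runs on the full `iris` data rather than the subset `ds` so that it works on its own, and `box_stats` is a name chosen for the illustration.

```r
data(iris)

# For each measurement: lower whisker, lower hinge, median, upper hinge, upper whisker.
box_stats <- sapply(iris[, 1:4], function(x) boxplot.stats(x)$stats)
rownames(box_stats) <- c("lower whisker", "lower hinge", "median",
                         "upper hinge", "upper whisker")
box_stats

# Points beyond the whiskers, which the plot draws as individual dots.
lapply(iris[, 1:4], function(x) boxplot.stats(x)$out)
```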
xyplot(sep.l ~ pet.l | Species, groups = Species, data = iris, type = c("p", "smooth"), scales = "free")
### Scatter plot of sepal width against petal width in the iris data set
xyplot(sep.w ~ pet.w | Species, groups = Species, data = iris, type = c("p", "smooth"), scales = "free")
### Scatter plot using ggplot
scatterPlot <- ggplot(iris , aes(x = sep.l , y =pet.l, color= Species)) + geom_point() + scale_color_manual(values = c('#999999','#E69F00' ,'#56B4E9')) + theme(legend.position=c(0,1), legend.justification=c(0,1)) + geom_density2d()
scatterPlot
# 3D scatter plot
colors <- c("#999999", "#E69F00", "#56B4E9")
colors <- colors[as.numeric(iris$Species)]
s3d <- scatterplot3d(iris[,1:3], pch = 16, color=colors)
legend(s3d$xyz.convert(7.5, 3, 4.5), legend = levels(iris$Species),
col = c("#999999", "#E69F00", "#56B4E9"), pch = 16)
scatter3d(x = sep.l, y = pet.l, z = sep.w, groups = iris$Species, grid = FALSE, fit = "smooth")
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
fit.lda <- train(Species~., data=ds, method="lda", metric=metric, trControl=control)
fit.cart <- train(Species~., data=ds, method="rpart", metric=metric, trControl=control)
fit.knn <- train(Species~., data=ds, method="knn", metric=metric, trControl=control)
fit.svm <- train(Species~., data=ds, method="svmRadial", metric=metric, trControl=control)
fit.rf <- train(Species~., data=ds, method="rf", metric=metric, trControl=control)
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.9166667 0.9375000 1 0.9750000 1 1 0
## cart 0.8333333 0.9375000 1 0.9666667 1 1 0
## knn 0.9166667 0.9375000 1 0.9750000 1 1 0
## svm 0.9166667 0.9166667 1 0.9666667 1 1 0
## rf 0.8333333 0.9375000 1 0.9666667 1 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.875 0.90625 1 0.9625 1 1 0
## cart 0.750 0.90625 1 0.9500 1 1 0
## knn 0.875 0.90625 1 0.9625 1 1 0
## svm 0.875 0.87500 1 0.9500 1 1 0
## rf 0.750 0.90625 1 0.9500 1 1 0
dotplot(results)
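The Kappa column rescales accuracy against the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the chance agreement. Assuming the three species are equally frequent in every fold, p_e is 1/3, so each Kappa value in the table follows from the matching accuracy. The sketch below checks this for the mean accuracies reported above; `kappa_from_accuracy` is a helper written for this illustration, not a caret function.

```r
# Cohen's kappa from observed accuracy and chance agreement.
kappa_from_accuracy <- function(accuracy, chance = 1 / 3) {
  (accuracy - chance) / (1 - chance)
}

# Mean cross-validated accuracies copied from the resampling summary.
mean_accuracy <- c(lda = 0.9750000, cart = 0.9666667, knn = 0.9750000,
                   svm = 0.9666667, rf = 0.9666667)

round(kappa_from_accuracy(mean_accuracy), 4)
```

The result reproduces the mean Kappa values of 0.9625 for lda and knn and 0.95 for cart, svm and rf.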
print(fit.lda)
## Linear Discriminant Analysis
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results:
##
## Accuracy Kappa
## 0.975 0.9625
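The sample sizes of 108 in the summary come from 10-fold cross-validation: each fold holds out 12 of the 120 rows and the model is fitted on the other 108. The loop below is a minimal sketch of that resampling, written with `lda` from the MASS package. It uses a fixed 120-row stand-in for `ds`, plain random folds rather than caret's stratified folds, and an arbitrary seed, so its accuracy need not match the figures reported above.

```r
library(MASS)  # lda()

data(iris)
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Stand-in for the 120-row data set `ds`: the first 40 rows of each species.
train_set <- iris[c(1:40, 51:90, 101:140), ]

# Assign every row to one of 10 folds of 12 rows each.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(train_set)))

fold_accuracy <- numeric(k)
fold_train_n  <- integer(k)
for (f in seq_len(k)) {
  held_out <- train_set[fold_id == f, ]   # 12 rows used only for scoring
  fit_rows <- train_set[fold_id != f, ]   # 108 rows used for fitting
  fold_train_n[f] <- nrow(fit_rows)
  fit  <- lda(Species ~ ., data = fit_rows)
  pred <- predict(fit, held_out)$class
  fold_accuracy[f] <- mean(pred == held_out$Species)
}

fold_train_n         # 108 in every fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```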
print(fit.cart)
## CART
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.000 0.9666667 0.950
## 0.475 0.7166667 0.575
## 0.500 0.3333333 0.000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
print(fit.knn)
## k-Nearest Neighbors
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9666667 0.9500
## 7 0.9750000 0.9625
## 9 0.9666667 0.9500
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
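The idea behind the kNN classifier can be shown with `knn` from the `class` package: each held-out flower takes the majority species among its k nearest training flowers. The sketch below uses k = 7, the value selected above, on a fixed stand-in split rather than the random partition that produced `ds` and `vldndata`, so its accuracy need not match the reported figures. The variable names are assumptions made for the example.

```r
library(class)  # knn()

data(iris)

# Stand-in split: first 40 rows of each species for training, last 10 for checking.
train_rows <- c(1:40, 51:90, 101:140)
train_x <- iris[train_rows, 1:4]
check_x <- iris[-train_rows, 1:4]
train_y <- iris$Species[train_rows]
check_y <- iris$Species[-train_rows]

# Classify each held-out flower by majority vote among its 7 nearest neighbours.
knn_pred <- knn(train = train_x, test = check_x, cl = train_y, k = 7)

table(Prediction = knn_pred, Reference = check_y)
mean(knn_pred == check_y)  # share of held-out flowers classified correctly
```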
print(fit.rf)
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9666667 0.95
## 3 0.9666667 0.95
## 4 0.9666667 0.95
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9500000 0.9250
## 0.50 0.9583333 0.9375
## 1.00 0.9666667 0.9500
##
## Tuning parameter 'sigma' was held constant at a value of 0.7450586
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.7450586 and C = 1.
Hence linear discriminant analysis (LDA) achieves the highest mean cross-validated accuracy, about 97.5% (kNN with k = 7 matches it), and we use LDA as the final model.
We now predict the species in the validation data set with this model:
predictions <- predict(fit.lda, vldndata)
confusionMatrix(predictions, vldndata$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 0
## virginica 0 1 10
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 1.0000
## Specificity 1.0000 1.0000 0.9500
## Pos Pred Value 1.0000 1.0000 0.9091
## Neg Pred Value 1.0000 0.9524 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3333
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.9500 0.9750
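The statistics above can be recomputed by hand from the 3 x 3 confusion matrix, which helps when reading the by-class table. The sketch below copies the counts from the output and derives accuracy, sensitivity, specificity and positive predictive value with base R only. The variable names are chosen for the illustration, and the exact binomial interval from `binom.test` is included because it agrees with the reported 95% CI.

```r
# Confusion matrix copied from the output above (rows: prediction, columns: reference).
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10, 0,  0,
                0, 9,  0,
                0, 1, 10),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n          # 29 of 30 flowers classified correctly

true_pos  <- diag(cm)
false_neg <- colSums(cm) - true_pos    # members of the class predicted as another class
false_pos <- rowSums(cm) - true_pos    # members of other classes predicted as this class
true_neg  <- n - true_pos - false_neg - false_pos

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)
pos_pred    <- true_pos / (true_pos + false_pos)

round(rbind(sensitivity, specificity, pos_pred), 4)

# Exact binomial 95% interval for 29 successes in 30 trials.
accuracy_ci <- binom.test(sum(diag(cm)), n)$conf.int
accuracy_ci
```

The single error is a versicolor flower predicted as virginica, which is why only the versicolor sensitivity and the virginica specificity and positive predictive value fall below 1.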