The Iris flower data set is a multivariate data set introduced by the British statistician. The data was collected to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The dataset contains a set of 150 records under 5 attributes - Petal Length, Petal Width, Sepal Length, Sepal width and Class(Species).
data("iris")
head(iris)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
any(is.na(iris))
[1] FALSE
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
partation data
set.seed(123)
ir<- sample(2,nrow(iris),replace = TRUE, prob = c(0.80,0.20)) # 80% for traning and 20% for testing the data
train <- iris[ir == 1,]
test <- iris[ir == 2,]
scatter plot and correlation coefficents.
library(psych)
pairs.panels(train[,-5],gap = 0,
bg = c("blue", "green", "red")[train$Species],
pch=21)
high corelations among independent veriables lead to “multicollinearity” problems.
Let’s perform PCA : Principle Component Analysis
pca<- prcomp(train[,-5], center = TRUE, scale. = TRUE)
attributes(pca)
$names
[1] "sdev" "rotation" "center" "scale" "x"
$class
[1] "prcomp"
pca$center
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.826446 3.021488 3.759504 1.201653
mean(train$Sepal.Length)
[1] 5.826446
pca$scale
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.8105316 0.4270454 1.7377945 0.7557539
sd(train$Sepal.Length)
[1] 0.8105316
print(pca)
Standard deviations (1, .., p=4):
[1] 1.7237041 0.9309494 0.3749824 0.1468522
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.5201832 -0.37206837 0.7292637 0.2432058
Sepal.Width -0.2937566 -0.92188195 -0.2184802 -0.1269139
Petal.Length 0.5747475 -0.03477237 -0.1603396 -0.8017153
Petal.Width 0.5592690 -0.10241915 -0.6282771 0.5310334
each principle components are normilized linear combination of original variables which are listed here.
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7237 0.9309 0.37498 0.14685
Proportion of Variance 0.7428 0.2167 0.03515 0.00539
Cumulative Proportion 0.7428 0.9595 0.99461 1.00000
orthogonality of pca
pairs.panels(pca$x, gap = 0, bg=c("blue","green", "Red")[train$Species],
pch=21)
Bi-plot
library(devtools)
library(factoextra)
res.pca <- prcomp(train[, -5], scale = TRUE)
fviz_pca_biplot(): Biplot of individuals of variables
fviz_pca_biplot(res.pca)
fviz_pca_biplot(res.pca, label ="var")
fviz_pca_biplot(res.pca, label ="ind")
fviz_pca_biplot(res.pca, label ="var", col.ind="cos2") +
theme_minimal()
fviz_pca_biplot(res.pca, label="var",
select.ind = list(contrib = 20))
prediction with PCA
traning <- predict(pca,train)
traning <- data.frame(traning,train[5])
testing <- predict(pca,test)
testing <- data.frame(testing,test$Species)
multinomial Logstic Regression with Frist Two PCAs
model1<- multinom(Species~ PC1+PC2,data = traning)
# weights: 12 (6 variable)
initial value 132.932087
iter 10 value 19.674294
iter 20 value 17.413814
iter 30 value 17.299839
iter 40 value 17.297242
iter 50 value 17.296292
final value 17.294539
converged
summary(model1)
Call:
multinom(formula = Species ~ PC1 + PC2, data = traning)
Coefficients:
(Intercept) PC1 PC2
versicolor 10.584590 15.26832 4.019053
virginica 3.007363 21.95663 4.228617
Std. Errors:
(Intercept) PC1 PC2
versicolor 273.3561 177.9735 223.6350
virginica 273.3636 177.9818 223.6359
Residual Deviance: 34.58908
AIC: 46.58908
confusion matrix and missclassification error - train
prd <- predict(model1, traning)
table <- table(prd, traning$Species)
table
prd setosa versicolor virginica
setosa 39 0 0
versicolor 0 38 4
virginica 0 5 35
calculate misclassification error
1- sum(diag(table))/sum(table)
[1] 0.07438017
misclassification error about 7.7%
confusion matrix and missclassification error - test
testing <- data.frame(testing,test[5])
prd1 <- predict(model1, testing)
table1 <- table(prd1, testing$Species)
table1
prd1 setosa versicolor virginica
setosa 11 0 0
versicolor 0 7 2
virginica 0 0 9
1- sum(diag(table1))/sum(table1)
[1] 0.06896552
misclassification error for testing data is about 6.8%