ISB CBA 7 Data Mining 2 Homework Problems

Problem 1 : IRIS - HIERARCHICAL FISHER

1a: Two classes in IRIS are more “similar” to each other. Find which ones using scatter plots. Let's say class 1 and class 2.

levels(unclass(iris$Species))

## [1] "setosa"     "versicolor" "virginica"

pairs(iris[1:4], main = "Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)

# Load the packages used below: MASS provides lda(), ggplot2 provides ggplot()
library(MASS)
library(ggplot2)
# Now load the given training and test data
training_set <- read.csv("D:\\Google Drive\\Term3-2-DMG2\\Assignment 1\\datasets\\iris\\train.csv")
validation_set <- read.csv("D:\\Google Drive\\Term3-2-DMG2\\Assignment 1\\datasets\\iris/test.csv")
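# Standardize both data sets with the training means and standard deviations,
# so that the test data is not scaled with its own statistics
train_scaled <- scale(training_set[-5])
training_set[-5] <- train_scaled
validation_set[-5] <- scale(validation_set[-5], center = attr(train_scaled, "scaled:center"), scale = attr(train_scaled, "scaled:scale"))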

From the scatter plot above, it is evident that versicolor and virginica are more similar to each other than either is to setosa.
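The visual impression can be cross-checked numerically. The following is a minimal sketch, not part of the original solution: it assumes the built-in `iris` data set instead of the assignment's train.csv, and the names `X`, `centroids` and `centroid_dist` are illustrative. It standardizes the four features, computes one centroid per species, and prints the Euclidean distances between the centroids.

```r
# Sketch on the built-in iris data (assumption: stands in for train.csv)
data(iris)

# Standardize the four numeric features
X <- scale(iris[, 1:4])

# One centroid per species: rows are species, columns are features
centroids <- apply(X, 2, function(col) tapply(col, iris$Species, mean))

# Pairwise Euclidean distances between the class centroids
centroid_dist <- as.matrix(dist(centroids))
print(round(centroid_dist, 2))
```

The smallest off-diagonal entry identifies the pair of classes that are closest in standardized feature space.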

1b: Let's create a “meta-class” combining class 1 and class 2 (or whichever are the two most similar classes). Let's call it class 4.

training_set$SpeciesMetaClass <- 'virginica_versicolor'
training_set$SpeciesMetaClass[training_set$Species == "setosa"] <- "setosa"
#setDT(training_set)[Species=="virginica" | Species=="versicolor", SpeciesMetaClass:='virginica_versicolor']
#setDT(training_set)[Species=="setosa", SpeciesMetaClass:='setosa']

1c: Create the first Fisher projection by trying to discriminate class 3 (the different class) from class 4 (the meta-class). Do this on training data only.

lda1 = lda(formula = SpeciesMetaClass ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width , data = training_set)
projecteddata1 <- as.matrix(training_set[,1:4])%*%lda1$scaling
projecteddata1df <- data.frame(projecteddata1,training_set$SpeciesMetaClass)
ggplot(projecteddata1df, aes(x=(1:length(LD1)), y=LD1, color=training_set.SpeciesMetaClass, shape=training_set.SpeciesMetaClass)) + 
  geom_point()+
  labs(title="FDA : 2 Classes Setosa and Metaclass Projection",
       x="Observations", y = "Fisher Discriminant")
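For two classes, the Fisher direction has a closed form: it is proportional to the inverse of the pooled within-class scatter matrix times the difference of the class means. The sketch below illustrates this under stated assumptions: it uses the built-in `iris` data rather than the assignment's training file, and `meta`, `S_w`, `w` and `agreement` are illustrative names. It computes the direction by hand and checks that the resulting scores are perfectly correlated (up to sign) with those from `lda()`.

```r
library(MASS)

# Sketch on the built-in iris data (assumption: stands in for train.csv)
data(iris)
X <- as.matrix(iris[, 1:4])
meta <- factor(ifelse(iris$Species == "setosa", "setosa", "virginica_versicolor"))

is_setosa <- meta == "setosa"

# Class means
m_a <- colMeans(X[is_setosa, ])
m_b <- colMeans(X[!is_setosa, ])

# Pooled within-class scatter: sum of the per-class scatter matrices
S_w <- cov(X[is_setosa, ]) * (sum(is_setosa) - 1) +
  cov(X[!is_setosa, ]) * (sum(!is_setosa) - 1)

# Fisher direction: w is proportional to S_w^{-1} (m_b - m_a)
w <- solve(S_w, m_b - m_a)

# Compare the hand-computed scores with the lda() projection
fit <- lda(X, grouping = meta)
agreement <- abs(as.numeric(cor(X %*% w, X %*% fit$scaling)))
print(agreement)
```

The two score vectors differ only in scale and possibly sign, which is why an absolute correlation is used for the comparison.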

1d: Create the second Fisher projection by trying to discriminate class 1 from class 2 (the original two similar classes). Do this on training data only.

# Now fit the second discriminant on the two similar classes only (training data)
training_set_subset <- subset(training_set, SpeciesMetaClass =='virginica_versicolor')
lda2 <- lda(formula = Species ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data = training_set_subset)

## Warning in lda.default(x, grouping, ...): group setosa is empty

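# The warning above is harmless: the subset still carries setosa as an unused factor level
# (droplevels(training_set_subset) would avoid it).
# Project the subset onto the discriminant and wrap the result in a data frame for plotting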
projecteddata2 <- as.matrix(training_set_subset[,1:4])%*%lda2$scaling
projecteddata2df <- data.frame(projecteddata2,training_set_subset$Species)
ggplot(projecteddata2df, aes(x=(1:length(LD1)), y=LD1, color=training_set_subset.Species, shape=training_set_subset.Species)) + 
  geom_point()+
  labs(title="FDA : 2 Classes Virginica and Versicolor Projection",
       x="Observations", y = "Fisher Discriminant")

1e: Now project the entire data in these two projections and color code the class points. Do this on test data only.

fp1 <- predict(lda1,newdata = validation_set) # first projection
fp2 <- predict(lda2,newdata = validation_set) # second projection

fdadf <- data.frame(fp1$x, fp2$x,validation_set$Species)

ggplot(fdadf, aes(x=LD1, y=LD1.1, color=validation_set.Species, shape=validation_set.Species)) + 
  geom_point()+
  labs(title="FDA 3 Classes Setosa, Virginica and Versicolor Projection",
       x="FP1", y = "FP2")

1f: Comment on what you observed and did.

We first plotted each feature against every other feature. Even these 2D scatter plots carry useful information: one class (setosa) is clearly dissimilar from the other two.
Based on this, we combined the two similar classes (versicolor and virginica) into a meta-class and computed the first Fisher discriminant projection, which separates setosa from the meta-class.
We then took the two remaining classes and computed the second Fisher discriminant projection, which separates versicolor from virginica.
We applied both projections to the test set to obtain two Fisher discriminant scores per observation.
Plotting these scores maps 4 features and 3 classes onto a single 2D plane in which the classes can be clearly distinguished.

Observation:
A simple technique such as the Fisher discriminant let us both reduce the dimensionality and separate the classes. The two projections give a clearer picture of the classes than feature-versus-feature scatter plots alone.
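The classification claim can be illustrated by chaining the two discriminants into a two-stage classifier. This is a rough sketch under stated assumptions, not the method used in the solution: it draws a random 100/50 split from the built-in `iris` data in place of the provided train.csv and test.csv, and the names `stage1`, `stage2`, `final_pred` and `accuracy` are illustrative. Stage 1 decides between setosa and the meta-class, and stage 2 separates versicolor from virginica for observations assigned to the meta-class.

```r
library(MASS)

# Sketch on the built-in iris data (assumption: random split replaces train.csv/test.csv)
data(iris)
set.seed(1)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

# Stage 1: setosa versus the meta-class
train$Meta <- factor(ifelse(train$Species == "setosa", "setosa", "virginica_versicolor"))
stage1 <- lda(Meta ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train)

# Stage 2: versicolor versus virginica, with the unused setosa level dropped
train_similar <- droplevels(train[train$Species != "setosa", ])
stage2 <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train_similar)

# Chain the two stages on the held-out data
meta_pred <- as.character(predict(stage1, newdata = test)$class)
species_pred <- as.character(predict(stage2, newdata = test)$class)
final_pred <- ifelse(meta_pred == "setosa", "setosa", species_pred)

# Share of held-out observations labelled correctly
accuracy <- mean(final_pred == as.character(test$Species))
print(accuracy)
```

The exact accuracy depends on the random split, so it should be read as an illustration rather than a result for the assignment's test set.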