The aim of the project is to first identify whether there is a significant difference in features for different species and to come up with a model that can predict the species based on their features.

Testing difference in features for different species

Sepal Length

data(iris)
model1 = lm(Sepal.Length ~ Species, data = iris)
summary(model1)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

Petal Length

model2 = lm(Petal.Length ~ Species, data = iris)
summary(model2)
## 
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.260 -0.258  0.038  0.240  1.348 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.46200    0.06086   24.02   <2e-16 ***
## Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
## Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared:  0.9414, Adjusted R-squared:  0.9406 
## F-statistic:  1180 on 2 and 147 DF,  p-value: < 2.2e-16

Petal Width

model3 = lm(Petal.Width ~ Species, data = iris)
summary(model3)
## 
## Call:
## lm(formula = Petal.Width ~ Species, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.626 -0.126 -0.026  0.154  0.474 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.24600    0.02894    8.50 1.96e-14 ***
## Speciesversicolor  1.08000    0.04093   26.39  < 2e-16 ***
## Speciesvirginica   1.78000    0.04093   43.49  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2047 on 147 degrees of freedom
## Multiple R-squared:  0.9289, Adjusted R-squared:  0.9279 
## F-statistic:   960 on 2 and 147 DF,  p-value: < 2.2e-16

Pincipal Component Analysis

As noted from the EDA, the first two principal components capture 95% of the total variability but have significant overlap between versicolor and virginica. We test whether we can come up with a summary statistic of the original data that statistically differ between species.

Using principal component analysis (PCA), we end up with a linear combination of the original variables that capture the most variances among them. The first principal component is approximately proportional to \[ Sepal Length + Petal Length + Petal Width - \frac{1}{2}Sepal Width \]

A similar ANOVA test shows that one can distinguish each species based on this first principal component significantly.

pca = princomp(iris[,1:4], cor = T)
pca$loadings[,1]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.5210659   -0.2693474    0.5804131    0.5648565
pca1 = pca$scores[,1]
ggplot(iris, aes(x = Species, y= pca1, col = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  theme(legend.position = "none")

model_pca = lm(pca1 ~ Species, data = iris)
summary(model_pca)
## 
## Call:
## lm(formula = pca1 ~ Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36129 -0.25971  0.00195  0.23267  1.58239 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.22475    0.06240  -35.65   <2e-16 ***
## Speciesversicolor  2.72120    0.08825   30.83   <2e-16 ***
## Speciesvirginica   3.95306    0.08825   44.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4413 on 147 degrees of freedom
## Multiple R-squared:  0.9346, Adjusted R-squared:  0.9337 
## F-statistic:  1051 on 2 and 147 DF,  p-value: < 2.2e-16
data(iris)
logistic_model <- vglm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                       family = multinomial, data = iris)
predict_probs <- predict(logistic_model, type = 'response')
predict_class <- colnames(predict_probs)[apply(predict_probs, 1, which.max)]
table(iris$Species, predict_class)
##             predict_class
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          1        49

Training and Test set

test_index <- seq(1, 150, by = 3)
train <- iris[-test_index, ]
test <- iris[test_index, ]

logistic_train <- vglm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                       family = multinomial, data = train)

predict_probs <- predict(logistic_train, newdata = test, type = 'response')
predict_class <- colnames(predict_probs)[apply(predict_probs, 1, which.max)]

table(test$Species, predict_class)
##             predict_class
##              setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         17         0
##   virginica       0          1        15