The aim of the project is to first identify whether there is a significant difference in features for different species and to come up with a model that can predict the species based on their features.
data(iris)
model1 = lm(Sepal.Length ~ Species, data = iris)
summary(model1)
##
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6880 -0.3285 -0.0060 0.3120 1.3120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
## Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
## Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
model2 = lm(Petal.Length ~ Species, data = iris)
summary(model2)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.260 -0.258 0.038 0.240 1.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.46200 0.06086 24.02 <2e-16 ***
## Speciesversicolor 2.79800 0.08607 32.51 <2e-16 ***
## Speciesvirginica 4.09000 0.08607 47.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
model3 = lm(Petal.Width ~ Species, data = iris)
summary(model3)
##
## Call:
## lm(formula = Petal.Width ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.626 -0.126 -0.026 0.154 0.474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24600 0.02894 8.50 1.96e-14 ***
## Speciesversicolor 1.08000 0.04093 26.39 < 2e-16 ***
## Speciesvirginica 1.78000 0.04093 43.49 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2047 on 147 degrees of freedom
## Multiple R-squared: 0.9289, Adjusted R-squared: 0.9279
## F-statistic: 960 on 2 and 147 DF, p-value: < 2.2e-16
As noted from the EDA, the first two principal components capture 95% of the total variability but have significant overlap between versicolor and virginica. We test whether we can come up with a summary statistic of the original data that statistically differ between species.
Using principal component analysis (PCA), we end up with a linear combination of the original variables that capture the most variances among them. The first principal component is approximately proportional to \[ Sepal Length + Petal Length + Petal Width - \frac{1}{2}Sepal Width \]
A similar ANOVA test shows that one can distinguish each species based on this first principal component significantly.
pca = princomp(iris[,1:4], cor = T)
pca$loadings[,1]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.5210659 -0.2693474 0.5804131 0.5648565
pca1 = pca$scores[,1]
ggplot(iris, aes(x = Species, y= pca1, col = Species)) +
geom_boxplot() +
geom_jitter(width = 0.1) +
theme(legend.position = "none")
model_pca = lm(pca1 ~ Species, data = iris)
summary(model_pca)
##
## Call:
## lm(formula = pca1 ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36129 -0.25971 0.00195 0.23267 1.58239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.22475 0.06240 -35.65 <2e-16 ***
## Speciesversicolor 2.72120 0.08825 30.83 <2e-16 ***
## Speciesvirginica 3.95306 0.08825 44.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4413 on 147 degrees of freedom
## Multiple R-squared: 0.9346, Adjusted R-squared: 0.9337
## F-statistic: 1051 on 2 and 147 DF, p-value: < 2.2e-16
data(iris)
logistic_model <- vglm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
family = multinomial, data = iris)
predict_probs <- predict(logistic_model, type = 'response')
predict_class <- colnames(predict_probs)[apply(predict_probs, 1, which.max)]
table(iris$Species, predict_class)
## predict_class
## setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 49 1
## virginica 0 1 49
test_index <- seq(1, 150, by = 3)
train <- iris[-test_index, ]
test <- iris[test_index, ]
logistic_train <- vglm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
family = multinomial, data = train)
predict_probs <- predict(logistic_train, newdata = test, type = 'response')
predict_class <- colnames(predict_probs)[apply(predict_probs, 1, which.max)]
table(test$Species, predict_class)
## predict_class
## setosa versicolor virginica
## setosa 17 0 0
## versicolor 0 17 0
## virginica 0 1 15