This project applies multinomial logistic regression to a classification problem with more than two classes. The built-in Iris data set is used as the example.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The last column, Species, records which of three species each flower belongs to, and the first four columns hold its measurements. We will build a multinomial logistic regression model on those four variables to predict the species.
Multinomial logistic regression expresses the probability of each class relative to a baseline (reference) class, so we have to choose one of the three species as the baseline.
#Choose Setosa as the baseline
iris$base<-relevel(iris$Species,ref="setosa")
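As a small illustration of what `relevel` does, the sketch below works on a separate copy of the data (the name `iris_df` is introduced here only for the example). `multinom` treats the first factor level as the reference, and setosa is already the first level of `Species`, so the call above mainly makes that choice explicit. Choosing a different baseline moves that level to the front.

```r
# Work on a copy so the iris object used in the rest of the analysis is untouched
iris_df <- datasets::iris

# Default level order: the first level acts as the reference class
default_levels <- levels(iris_df$Species)
print(default_levels)   # "setosa" "versicolor" "virginica"

# Picking another baseline moves it to the first position
alt <- relevel(iris_df$Species, ref = "versicolor")
alt_levels <- levels(alt)
print(alt_levels)       # "versicolor" "setosa" "virginica"
```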
After choosing the baseline, we will create the model.
library(nnet)
model<-multinom(iris$base~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
## # weights: 18 (10 variable)
## initial value 164.791843
## iter 10 value 16.177348
## iter 20 value 7.111438
## iter 30 value 6.182999
## iter 40 value 5.984028
## iter 50 value 5.961278
## iter 60 value 5.954900
## iter 70 value 5.951851
## iter 80 value 5.950343
## iter 90 value 5.949904
## iter 100 value 5.949867
## final value 5.949867
## stopped after 100 iterations
model
## Call:
## multinom(formula = iris$base ~ Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## versicolor 18.69037 -5.458424 -8.707401 14.24477 -3.097684
## virginica -23.83628 -7.923634 -15.370769 23.65978 15.135301
##
## Residual Deviance: 11.89973
## AIC: 31.89973
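The trace above ends with "stopped after 100 iterations", which means the optimizer hit its default iteration limit rather than meeting its convergence criterion. A minimal sketch of how one might check this and refit with a larger limit is shown below. The names `iris_df`, `fit_default`, and `fit_long` and the value `maxit = 1000` are choices made for this example only. Setosa is linearly separable from the other two species in this data, which is a plausible reason why some coefficients keep growing instead of settling.

```r
library(nnet)

# Separate copy of the data so the main analysis objects are not modified
iris_df <- datasets::iris

# Same model as above, fitted quietly with the default iteration limit
fit_default <- multinom(Species ~ Sepal.Length + Sepal.Width +
                          Petal.Length + Petal.Width,
                        data = iris_df, trace = FALSE)

# convergence is 1 when the iteration limit was reached, 0 otherwise
print(fit_default$convergence)

# Refit with a larger iteration limit (1000 is an arbitrary choice)
fit_long <- multinom(Species ~ Sepal.Length + Sepal.Width +
                       Petal.Length + Petal.Width,
                     data = iris_df, maxit = 1000, trace = FALSE)
print(fit_long$convergence)

# Compare residual deviances of the two fits
print(c(default = fit_default$deviance, long = fit_long$deviance))
```

If the deviance changes only slightly while the coefficients change a lot, the individual coefficient values should be interpreted with caution, even though the predicted classes may be unaffected.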
Using the fitted coefficients (rounded to three decimals), we can write the log-odds of each non-baseline species relative to setosa.
\(y_1=\ln\big(\frac{P(versicolor)}{P(setosa)}\big)=18.690-5.458\cdot Sepal.Length-8.707\cdot Sepal.Width+14.245\cdot Petal.Length-3.098\cdot Petal.Width\)
and
\(y_2=\ln\big(\frac{P(virginica)}{P(setosa)}\big)=-23.836-7.924\cdot Sepal.Length-15.371\cdot Sepal.Width+23.660\cdot Petal.Length+15.135\cdot Petal.Width\)
then
\(\frac{P(versicolor)}{P(setosa)}=e^{y_1}\)
and
\(\frac{P(virginica)}{P(setosa)}=e^{y_2}\)
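Since \(P(versicolor)+P(setosa)+P(virginica)=1\), substituting the two ratios gives \(P(setosa)\big(1+e^{y_1}+e^{y_2}\big)=1\), so we obtain
\(P(setosa)=\frac{1}{1+e^{y_1}+e^{y_2}}\)
and therefore
\(P(versicolor)=\frac{e^{y_1}}{1+e^{y_1}+e^{y_2}}\)
and
\(P(virginica)=\frac{e^{y_2}}{1+e^{y_1}+e^{y_2}}\)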
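The formulas above can be checked numerically. The sketch below refits the same model on a copy of the data, computes \(y_1\) and \(y_2\) from the coefficient matrix, turns them into class probabilities by hand, and compares the result with what `predict(..., type = "probs")` returns. The object names (`iris_df`, `fit`, `B`, `X`, `eta`, `probs_manual`, `probs_model`) are introduced only for this example.

```r
library(nnet)

# Separate copy of the data so the main analysis objects are not modified
iris_df <- datasets::iris

fit <- multinom(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = iris_df, trace = FALSE)

# Coefficient matrix: one row per non-baseline class (versicolor, virginica)
B <- coef(fit)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, as.matrix(iris_df[, 1:4]))

# Linear predictors: column 1 is y_1 (versicolor), column 2 is y_2 (virginica)
eta <- X %*% t(B)

# Shared denominator 1 + e^{y_1} + e^{y_2}
denom <- 1 + rowSums(exp(eta))

# Class probabilities from the formulas in the text
probs_manual <- cbind(setosa = 1 / denom, exp(eta) / denom)

# Probabilities reported by the fitted model
probs_model <- predict(fit, newdata = iris_df, type = "probs")

# Compare the two, ignoring row and column names
print(all.equal(unname(probs_manual), unname(probs_model)))
```

The predicted class for each flower is simply the column with the largest probability in its row.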
We now use the fitted model to predict the species of every observation and compare the predictions with the actual species in a confusion table.
predicted<-predict(model,iris)
tab<-table(Predicted=predicted,Actual=iris$base)
print(tab)
## Actual
## Predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 49 1
## virginica 0 1 49
#Accuracy of the model
sum(diag(tab))/sum(tab)
## [1] 0.9866667
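The model classifies 148 of the 150 observations correctly, an accuracy of about 98.7%. Because the predictions are made on the same data used to fit the model, this is training accuracy and may overstate how well the model would perform on new flowers.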
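One common way to get a less optimistic estimate is to hold out part of the data. The sketch below is not part of the original analysis: the 70/30 split, the seed value 42, and the names `iris_df`, `train`, `test`, `fit_train`, `holdout_tab`, and `holdout_acc` are all assumptions made for illustration, and the resulting accuracy will vary with the split.

```r
library(nnet)

# Separate copy of the data so the main analysis objects are not modified
iris_df <- datasets::iris

set.seed(42)  # arbitrary seed, chosen only to make the split reproducible

# Randomly assign 70% of the rows to training and the rest to testing
train_idx <- sample(nrow(iris_df), size = round(0.7 * nrow(iris_df)))
train <- iris_df[train_idx, ]
test  <- iris_df[-train_idx, ]

# Fit on the training rows only
fit_train <- multinom(Species ~ Sepal.Length + Sepal.Width +
                        Petal.Length + Petal.Width,
                      data = train, trace = FALSE)

# Predict the held-out rows and build a confusion table
pred_test <- predict(fit_train, newdata = test)
holdout_tab <- table(Predicted = pred_test, Actual = test$Species)
print(holdout_tab)

# Hold-out accuracy: correct predictions over all test rows
holdout_acc <- sum(diag(holdout_tab)) / sum(holdout_tab)
print(holdout_acc)
```

With only 45 test flowers a single split is noisy, so repeating the split or using cross-validation would give a steadier estimate.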