In this blog, I would like to focus on algorithms that are very good at classification. Some of them, like KNN and SVM, can also be used for regression, but they are best at predicting categorical values.
If our dependent variable is categorical in nature, i.e., it takes a discrete set of values or just binary values, then we are dealing with a classification problem. For example, predicting whether it will rain as Yes or No, the result of a coin flip, or categorizing a transaction as fraudulent or genuine.
Within classification, we can subdivide the types into two at a broad level:
1. Binary classification: where we have only 2 expected result values, e.g., Yes or No, Heads or Tails.
2. Multi-class classification: here our dependent variable can contain more than 2 discrete values, e.g., flavours of ice cream (Chocolate, Vanilla, Pistachio, etc.).
Logistic regression is one of the simplest models available. It models the log-odds of the positive class as a linear function of the features; if the resulting probability is beyond 50%, the observation is categorized as a positive case, else a negative one. The formula used for logistic regression is below:
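P(y=1) = 1 / (1 + e^(-(b0 + b1*x1 + ... + bn*xn)))
Here b0 is the intercept and b1..bn are the coefficients learned for the features x1..xn; equivalently, the log-odds log(P/(1-P)) equals the linear part b0 + b1*x1 + ... + bn*xn.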
Sample R code (iris has three species, so we fit a multinomial logistic regression with nnet::multinom):
library(datasets)
library(nnet)  # provides multinom() for multinomial logistic regression
str(iris)      # inspect the structure of the iris dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)  # preview the first six rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
levels(iris$Species)  # the three species classes
## [1] "setosa" "versicolor" "virginica"
# make "setosa" the reference level for the outcome
iris$speciesRelevel <- relevel(iris$Species, ref = "setosa")
# fit a multinomial logistic regression on all four features
multinom(speciesRelevel ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
## # weights: 18 (10 variable)
## initial value 164.791843
## iter 10 value 16.177348
## iter 20 value 7.111438
## iter 30 value 6.182999
## iter 40 value 5.984028
## iter 50 value 5.961278
## iter 60 value 5.954900
## iter 70 value 5.951851
## iter 80 value 5.950343
## iter 90 value 5.949904
## iter 100 value 5.949867
## final value 5.949867
## stopped after 100 iterations
## Call:
## multinom(formula = speciesRelevel ~ Sepal.Length + Sepal.Width +
## Petal.Length + Petal.Width, data = iris)
##
## Coefficients:
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## versicolor 18.69037 -5.458424 -8.707401 14.24477 -3.097684
## virginica -23.83628 -7.923634 -15.370769 23.65978 15.135301
##
## Residual Deviance: 11.89973
## AIC: 31.89973
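To actually use the fitted model, we can store it in a variable and call predict(). A quick sketch (refitting the same model as above) is below:

fit <- multinom(speciesRelevel ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
predict(fit, newdata = head(iris), type = "class")  # predicted class labels
predict(fit, newdata = head(iris), type = "probs")  # class membership probabilities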
The Naive Bayes algorithm works by calculating the probability of each class given the observed features and assigning the most probable class to the dependent variable. It is also called a generative algorithm because, rather than learning the decision boundary directly, it models how the data is generated: it learns P(features|class) and P(class) from the training data and combines them via Bayes' theorem. Naive Bayes is very good at NLP, as it works really well with high-dimensional text data such as word counts.
The formula used for Naive Bayes is Bayes' theorem, where B is the class and A is the observed data:
P(B|A) = ( P(A|B) * P(B) ) / P(A)
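A minimal sketch of Naive Bayes in R on the same iris data, assuming the e1071 package is installed (that package and its naiveBayes() function are the only additions beyond what we used above):

library(e1071)  # provides naiveBayes()
nb_model <- naiveBayes(Species ~ ., data = iris)  # learns P(features|class) and P(class)
predict(nb_model, newdata = head(iris))           # assigns the most probable class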
Support Vector Machines (SVM) is considered one of the best algorithms for classification: it is fast at prediction time and deals with outliers well, since only the support vectors (the points closest to the boundary) determine the decision surface. It works on the principle of using the data to construct a hyperplane that separates the data points with the maximum possible margin. Any new observation is then checked to see on which side of the hyperplane it lands, and it is classified accordingly.
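A minimal SVM sketch on the iris data, again assuming the e1071 package is available:

library(e1071)  # provides svm()
svm_model <- svm(Species ~ ., data = iris)  # default radial kernel constructs the separating surface
predict(svm_model, newdata = head(iris))    # classify by which side of the boundary each point lands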
Decision Trees work by organising the data into a tree-based structure. Using concepts such as Entropy, Gini impurity and Information Gain, the Decision Tree algorithm evaluates the features and constructs a tree whose root node is the split that best separates the classes, with child nodes splitting on further features, until the leaf nodes represent the dependent variable classes.
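A minimal decision tree sketch, assuming the rpart package is available (for classification it chooses splits using the Gini index by default):

library(rpart)  # provides rpart()
tree_model <- rpart(Species ~ ., data = iris, method = "class")  # grow a classification tree
print(tree_model)  # shows the root split and child nodes down to the leaf classes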
KNN works on the nearest-neighbours model. It treats each observation as a point in feature space; when a new observation is received, it calculates the distance to every training point, typically using the Euclidean distance formula, and identifies the k nearest neighbours. It then takes the majority class among those neighbours and assigns that value to the target variable of the new observation.
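A minimal KNN sketch using the class package (assumed installed); the 100-row training split and k = 3 are arbitrary choices for illustration:

library(class)  # provides knn()
set.seed(42)                           # reproducible train/test split
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, 1:4]          # feature columns only
test  <- iris[-train_idx, 1:4]
knn(train, test, cl = iris$Species[train_idx], k = 3)  # majority vote of the 3 nearest neighbours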