Jessica González Doña
August '17
Index
CLUSTERING
ASSOCIATION RULE LEARNING
DIMENSIONALITY REDUCTION
Multiple Linear Regression
Mathematical method whose main objective is to model the dependency relationship between two or more variables. This implies that we have one or more IVs (explanatory variables) and one DV (predicted variable).
The procedure consists of fitting a straight line to the dataset, given by:
\( Y = b_0 + b_1 X \)
\( Y \) = dependent variable
\( b_0 \) = value of the dependent variable when the IV is 0
\( b_1 \) = effect of the IV, i.e. the slope of the line
\( X \) = independent (explanatory) variable
The line is fitted by the Ordinary Least Squares method (OLS): the fitted line is the one that minimizes the vertical distance between each point and the line.
\( OLS: \min \sum (y - \hat{y})^2 \), where \( \hat{y} \) is the value predicted by the line.
(e.g., the green line in the figure is the distance between one point and the regression line)
When we have 2 or more IVs, we have to check the standard assumptions of multiple regression (linearity, homoscedasticity, independence of the errors, and absence of multicollinearity).
One advantage of Linear Regression is that it gives us information about the relevance of each IV.
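A minimal sketch of fitting a multiple linear regression with scikit-learn; the data points here are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two IVs (explanatory variables) and one DV (predicted variable)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

model = LinearRegression()          # fitted by ordinary least squares
model.fit(X, y)

print(model.intercept_)             # b0
print(model.coef_)                  # b1, b2 (effect of each IV)
print(model.predict([[3.0, 3.0]]))  # prediction for a new point
```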
Polynomial Regression
When the relationship between the DV and the IV is not linear, Polynomial Regression is used. In this type of regression algorithm, the IV is raised to different exponents. The equation is the following:
\( Y = b_0 + b_1 X + b_2 X^2 + \dots + b_n X^n \)
As in Linear Regression, the curve is fitted by minimizing the sum of squared residuals, similar to OLS.
Other types of non-linear regression exist, such as Logarithmic Regression and Exponential Regression.
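A short sketch, assuming a single IV and a degree-2 polynomial, using scikit-learn's PolynomialFeatures to raise X to the different exponents before fitting an ordinary linear regression:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])      # roughly quadratic toy data

poly = PolynomialFeatures(degree=2)             # builds the terms [1, X, X^2]
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)       # still minimizes squared residuals
print(model.predict(poly.transform([[6.0]])))   # prediction for a new X value
```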
Decision Tree for Regression
Decision Tree is a CART algorithm, which stands for Classification And Regression Trees. DT works on both linear and non-linear problems.
The training step consists of dividing the dataset into different subsets of values, so that the algorithm constructs a tree where the nodes are the different attributes (IVs) of the dataset. The final branches hold the mean of the DV values that result from all the divisions created in the previous steps.
The attributes at the nodes are chosen by information entropy.
Entropy is a measure of the uncertainty present in a set of information (in our case, the DV values). To choose one IV or another as the splitting attribute, the entropy is computed for each of them; in this way, the uncertainty is reduced at each split.
Then, if we want to know the DV value of a new point, it is evaluated from top to bottom through the tree.
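A minimal sketch with scikit-learn's DecisionTreeRegressor on toy data (note that this library's regression trees use squared-error reduction, rather than entropy, as the split criterion):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one IV, one continuous DV
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.5, 1.7, 3.9, 4.2, 8.0, 8.3])

tree = DecisionTreeRegressor(max_depth=2)   # shallow tree for readability
tree.fit(X, y)

# Each leaf predicts the mean DV value of the training points that fall into it
print(tree.predict([[2.5], [5.5]]))
```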
Random Forest for Regression
Random Forest is an ensemble algorithm: in RF, multiple decision trees are created and their information is combined. It is a very powerful algorithm that works well on large datasets.
During training, a collection of decision trees is created. Each tree takes n random IV attributes and n random samples of the dataset.
Those trees are used to predict the DV value of a new data point: each new data point is evaluated by all the trees, and the final prediction combines the individual tree predictions (for regression, typically their average).
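A short sketch with scikit-learn's RandomForestRegressor on toy data; here the final prediction is the average of the individual tree predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([1.1, 1.9, 3.2, 4.1, 4.8, 6.2, 6.9, 8.1])

# n_estimators = number of trees; each tree sees a random sample of the data
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[4.5]]))   # average of the 100 tree predictions
```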
Logistic Regression
Type of regression in which the predicted variable is binary categorical (0, 1). It follows the sigmoid function:
\( \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 X \)
The obtained predictions are values between 0 and 1, that is to say, the estimated probability of belonging to category 0 or category 1. The threshold for assigning one category or the other is usually 0.5 (50%).
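A minimal sketch with scikit-learn's LogisticRegression on toy data; predict_proba returns the estimated probabilities, and predict applies the default 0.5 threshold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])           # binary DV

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict_proba([[3.5]]))          # [P(category 0), P(category 1)]
print(clf.predict([[3.5]]))                # category using the 0.5 threshold
```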
K Nearest Neighbors
When the goal is to predict the class (A or B) of a new point, the KNN algorithm classifies that new point based on the information of the k nearest points around it.
To know which points are closest to the new point, the Euclidean distance is computed:
\( d_E(P_1, P_2) = \sqrt{(x_2 - x_1)^2 + (y_2 -y_1)^2} \)
\( P_1 \) = point 1
\( P_2 \) = point 2
\( x \) and \( y \) = space coordinates
If most of the nearest points of the new point belong to category A, the new point will be predicted as category A. The optimal number k of nearest neighbors depends on the accuracy¹ it provides.
(e.g., k = 4, predicted as the Green category)
¹ Accuracy: proportion of points correctly classified out of the total number of predictions.
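A short sketch with scikit-learn's KNeighborsClassifier (k = 4, toy 2-D points; the library uses the Euclidean distance by default):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two categories, A = 0 and B = 1
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6], [7, 7]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=4)   # k = 4, Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[5, 5]]))   # majority category among the 4 nearest points
```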
Support Vector Machine
We have a dataset with two categories (A and B), and the goal is to predict to which category a new point belongs.
In this 2-dimensional space, the categories are separated by a hyperplane placed at the maximum distance from both categories (optimal separation). The hyperplane is held by two support vectors (which is why the algorithm is called Support Vector Machine): the support vector of category A is the point of A that most resembles category B, and the support vector of category B is the point of B that most resembles category A.
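A minimal sketch of a linear SVM with scikit-learn's SVC on toy data; support_vectors_ exposes the points that hold the hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable categories, A = 0 and B = 1
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")
svm.fit(X, y)

print(svm.support_vectors_)   # the points that define the maximum-margin hyperplane
print(svm.predict([[4, 4]]))  # category of a new point
```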
Kernel Support Vector Machine
When the categories cannot be divided in a linear way, we have to use kernel functions, which implicitly map the data into a higher-dimensional space. There are 4 types of kernel functions:
\( K(x_i, x_j) = (x_i \cdot x_j)^n \) (polynomial kernel)
\( K(x_i, x_j) = \lVert x_i - x_j \rVert \)
\( K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}} \) (Gaussian kernel)
\( K(x_i, x_j) = \tanh(x_i \cdot x_j - \theta) \) (sigmoid kernel)
Gaussian function:
\( K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}} \)
\( x_i \) = any point
\( x_j \) = central point
\( \sigma \) = value that defines the radius of the circumference
The kernel function is computed for every point of the space, resulting in the conical (bell) shape shown in the previous slide. The algorithm learns which point is \( x_j \) and finds the optimal \( \sigma \) value to divide the two categories of the dataset.
All the points inside the radius take values close to 1, and the points outside the radius take values close to 0. The view in 3- and 2-dimensional space is the following:
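A short sketch using the Gaussian (RBF) kernel in scikit-learn on toy data that no straight line can separate; the gamma parameter plays the role of \( 1 / (2\sigma^2) \):

```python
import numpy as np
from sklearn.svm import SVC

# Category 1 sits near the centre, surrounded by category 0 points,
# so the two categories are not linearly separable
X = np.array([[0, 0], [0.5, 0.5], [-0.5, 0.5],     # inner points (category 1)
              [3, 0], [-3, 0], [0, 3], [0, -3]])   # outer points (category 0)
y = np.array([1, 1, 1, 0, 0, 0, 0])

svm = SVC(kernel="rbf", gamma=0.5)   # gamma corresponds to 1 / (2 * sigma^2)
svm.fit(X, y)

# A point near the centre should fall inside the "radius" (category 1),
# a point far away should fall outside it (category 0)
print(svm.predict([[0.2, 0.2], [3.0, 3.0]]))
```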
Naive Bayes Classifier
The NB algorithm computes the probability that a new point belongs to category A and to category B, based on its IV values, using Bayes' theorem:
\( P(A \lvert B) = \frac {P(B \lvert A) * P(A)} {P(B)} \)
These probabilities are compared, and the category with the highest value is assigned to the new point.
It is called “Naive” because independence is assumed between the IVs, so that all the IVs contribute independently to the probability that is calculated.
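A minimal sketch with scikit-learn's GaussianNB on toy data, which applies Bayes' theorem assuming the IVs are independent and normally distributed:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # category A = 0
              [6.0, 6.5], [6.5, 6.0], [7.0, 7.2]])  # category B = 1
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

print(nb.predict_proba([[2.0, 2.0]]))   # P(A|x) and P(B|x) for the new point
print(nb.predict([[2.0, 2.0]]))         # the category with the highest probability
```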
Decision Tree Classifier
The DT algorithm divides the dataset in an optimal way, taking into account the existing categories (those we intend to predict), and constructs a tree where the nodes are the different IV values. To choose the IV that best splits the data, the entropy¹ is used: depending on the entropy that each attribute contributes, it will or will not be chosen as a splitting attribute.
Each new point is evaluated from top to bottom through the tree; the final branch it reaches gives the predicted category to which it belongs.
¹ Seen in Decision Tree for Regression.
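A short sketch with scikit-learn's DecisionTreeClassifier on toy data, using entropy as the split criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])            # two categories to predict

tree = DecisionTreeClassifier(criterion="entropy")   # splits chosen by information entropy
tree.fit(X, y)

print(tree.predict([[2, 2], [7, 9]]))       # each point descends the tree to a leaf
```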
Random Forest Classifier
As with RF for regression, it is an ensemble algorithm that combines multiple decision trees: a collection (forest) of trees.
Each tree is constructed by choosing k random attributes (IVs) and k random samples of the dataset. The optimal number of trees is chosen to predict the category of a new point. Each new point is evaluated from top to bottom through all the trees, and the predicted category is the one obtained most often.
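A minimal sketch with scikit-learn's RandomForestClassifier on toy data; the predicted category is the majority vote of the trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)                  # each tree sees a random sample of rows and features

print(forest.predict([[2, 3]]))   # category obtained most often across the 50 trees
```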
K-Means
Tool for discovering categories or groups in a dataset. The number of clusters (categories) must be chosen, together with the initial centroids (random points in the space, which may or may not belong to the dataset).
The process is the following:
From the distance between the k centroids, an imaginary perpendicular line (red line) is drawn through the midpoint of the segment joining the centroids, assigning each data point to its nearest centroid.
In this way the dataset is divided into k parts, and the centroid of each part moves to its center of gravity.
This process is iterated until the centroids reach a fixed position, that is, until no point is reassigned to a different cluster.
The optimal number of clusters depends on the sum of squared distances between each centroid and its points (WCSS), which should be as small as possible.
Note: in geometry, the centroid is the geometric center (center of mass) of a figure.
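A short sketch with scikit-learn's KMeans on toy data; inertia_ is the WCSS value mentioned above, which can be compared across different numbers of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two visible groups
X = np.array([[1, 1], [1.5, 2], [2, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)   # final centroid positions (centers of gravity)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.inertia_)           # WCSS: sum of squared distances to the nearest centroid
```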
Hierarchical Clustering
Unlike K-Means, the Hierarchical Clustering algorithm works in a different way: each point of the dataset starts as its own cluster. The two nearest points (clusters) are merged into a single cluster; from there, the two nearest clusters are merged into a single cluster. This process is iterated until only one cluster remains, which is the entire dataset.
To choose the right number of clusters, we use the dendrogram. The dendrogram is a graphic that represents all the cluster merges created during the whole process.
The horizontal lines are the merges between clusters, and the vertical lines represent the distance between those clusters. A horizontal cut must be drawn across the longest vertical lines; the number of vertical lines it crosses is the number of clusters.
(In this example, we choose 2 clusters, since the cut crosses 2 vertical lines.)
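A minimal sketch using SciPy to build the linkage matrix, draw the dendrogram, and cut the tree into 2 clusters as in the example (toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1.5, 2], [2, 1.5],
              [8, 8], [8.5, 9], [9, 8.5]])

Z = linkage(X, method="ward")    # records every merge and its merge distance
dendrogram(Z)                    # horizontal lines = merges, vertical lines = distances
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```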
Apriori
Apriori is an algorithm that reveals the existing associations between items in a dataset. It is usually used in stores, where it helps to know that if a person buys item A, it is more likely that they will also buy item B. The following formulas are used:
\( support(A) = \frac{\text{number of purchases containing item } A}{\text{total number of purchases}} \)
\( confidence(A \rightarrow B) = \frac{\text{number of purchases containing } A \text{ and } B}{\text{number of purchases containing } A} \)
\( lift(A \rightarrow B) = \frac{confidence(A \rightarrow B)}{support(B)} \)
The algorithm computes Support and Confidence for all the possible pairs of items in the dataset. As this process has a high computational cost, a minimum support and a minimum confidence must be set. All pairs whose Support exceeds the minimum are kept, and then all pairs whose Confidence exceeds the minimum. For the remaining pairs the Lift is calculated, and they are sorted in decreasing order of Lift.
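A small illustrative sketch in plain Python (the transactions are hypothetical) that computes support, confidence, and lift for item pairs and keeps the rules above the minimum thresholds:

```python
from itertools import permutations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

def support(items):
    """Fraction of purchases that contain all the given items."""
    return sum(items <= t for t in transactions) / n

rules = []
all_items = set().union(*transactions)
for a, b in permutations(all_items, 2):
    conf = support({a, b}) / support({a})        # confidence(A -> B)
    lift = conf / support({b})                   # lift(A -> B)
    if support({a, b}) >= 0.3 and conf >= 0.6:   # minimum support and confidence
        rules.append((a, b, conf, lift))

# Sort the remaining rules in decreasing order of lift
for a, b, conf, lift in sorted(rules, key=lambda r: r[3], reverse=True):
    print(f"{a} -> {b}: confidence={conf:.2f}, lift={lift:.2f}")
```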
Eclat
It works very similarly to the Apriori algorithm, but Eclat only uses the Support index:
\( support(A) = \frac{\text{number of purchases containing item } A}{\text{total number of purchases}} \)
A minimum value of the Support index is set, the item sets above that minimum are kept, and finally they are sorted in decreasing order of Support.
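A tiny sketch of the Eclat idea on the same hypothetical transactions: each item is mapped to the set of transaction IDs that contain it, and the support of an item pair is the size of the intersection of those sets:

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

# Vertical format: item -> set of transaction ids containing it
tidsets = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

min_support = 0.3
pairs = []
for a, b in combinations(tidsets, 2):
    sup = len(tidsets[a] & tidsets[b]) / n   # support from the tidset intersection
    if sup >= min_support:
        pairs.append((a, b, sup))

# Keep the pairs above the minimum, sorted in decreasing order of support
for a, b, sup in sorted(pairs, key=lambda p: p[2], reverse=True):
    print(f"{{{a}, {b}}}: support={sup:.2f}")
```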
Principal Component Analysis
For a large dataset, the Principal Component Analysis algorithm extracts the n features that explain the most variance of the dataset (the DV is not taken into account). It produces n new variables called components (or dimensions), which are independent of each other. We choose the k components that explain the most variance.
In this way, every original attribute of the dataset has a certain weight in each of the components. The greater the weight, the more information the original attribute shares with the new component. Sometimes, for a better distribution of the weights, the new components must be rotated (two ways: orthogonal and non-orthogonal).
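A minimal sketch with scikit-learn's PCA on toy data; explained_variance_ratio_ shows how much variance each component explains, and components_ holds the weights of the original attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 4 original attributes, 6 observations (the DV is not used)
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.2, 2.1],
              [2.2, 2.9, 0.4, 0.9],
              [1.9, 2.2, 0.6, 1.1],
              [0.3, 0.5, 2.5, 2.4],
              [0.6, 0.9, 2.0, 2.2]])

pca = PCA(n_components=2)            # keep the k = 2 components with the most variance
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_) # variance explained by each component
print(pca.components_)               # weight of each original attribute in each component
```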
Linear Discriminant Analysis
Given a set of IVs as classification features and one categorical DV, the Linear Discriminant Analysis algorithm finds the linear combinations of the IVs that best discriminate between groups. In this way, the aim is to find potential differences between the groups using all the information contained in the IVs.
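A short sketch with scikit-learn's LinearDiscriminantAnalysis on toy data, which finds the linear combinations of the IVs that best separate the categories of the DV:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: two IVs and a categorical DV with two groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.1],
              [6.0, 5.5], [6.5, 6.0], [7.0, 6.8]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)   # at most (number of groups - 1) axes
X_proj = lda.fit_transform(X, y)                   # projection onto the discriminant axis

print(X_proj.ravel())            # the groups should be well separated along this axis
print(lda.predict([[4.0, 4.0]])) # predicted group for a new point
```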