Introduction to Programming - Machine Learning with R

Introduction
1. Data Preprocessing

Introduction

This markdown document is intended as an introduction to programming and serves as a template for machine learning work. The goal is to provide enough information for readers to use it as a practical reference in their daily work. It does not attempt a thorough discussion of the theory behind each method; that is left to the interested reader.

This document is divided into 11 sections covering the more popular machine learning algorithms:

Data Preprocessing
Regression
Classification
Clustering
Association Rule Learning
Reinforcement Learning
Natural Language Processing
Deep Learning
Dimensionality Reduction
Model Selection & Boosting
Time Series Forecasting

In Section 1: Data Preprocessing, we will quickly cover the basics. In particular, we will learn how to (i) import a dataset, (ii) do a quick exploration of the dataset, (iii) split the dataset into training and test sets, (iv) take care of missing data by imputing means, (v) encode categorical data and (vi) perform feature scaling.

In Section 2: Regression, we start to build regression models beginning with a (2.1) Simple Linear Regression model, before moving on to (2.2) Multiple Linear Regression, (2.3) Polynomial Regression, (2.4) Support Vector Regression, (2.5) Decision Tree Regression and (2.6) Random Forest Regression. We will learn how to use the model to predict new results and plot some visualisations to examine the results.

In Section 3: Classification, we will build classification models starting with the workhorse (3.1) Logistic Regression model. We will also look at (3.2) K-Nearest Neighbours, (3.3) Support Vector Machine, (3.4) Naive Bayes, (3.5) Decision Tree Classification, (3.6) Random Forest Classification and (3.7) Extreme Gradient Boosting (XGBoost).

Moving on to Section 4: Clustering, we turn our attention to unsupervised learning where we will build (4.1) K-Means Clustering and (4.2) Hierarchical Clustering.

In Section 5: Association Rule Learning, we look at the (5.1) Apriori and (5.2) Eclat methods, both of which are helpful in market basket optimisation.

Section 6: Reinforcement Learning (aka Online Learning) is where things start to get really interesting. We examine two methods, namely (6.1) Upper Confidence Bound and (6.2) Thompson Sampling.

The packages used in this document include:

caTools - For data processing
ggplot2 - For visualisation
Hmisc - For data exploration
e1071 - For Support Vector Regression, Support Vector Machine & Naive Bayes Classifier
rpart - For Decision Tree Regression & Decision Tree Classifier
randomForest - For Random Forest Regression & Random Forest Classifier
ElemStatLearn - For plotting of Logistic Regression Classifier
class - For K-Nearest Neighbour Classifier
caret - For Performance Statistics
xgboost - For Extreme Gradient Boosting
cluster - For visualising clusters
arules - For Apriori

To install the required packages, users can first install the package ‘pacman’ then run the code below:

# install.packages('pacman')
# pacman::p_load(caTools, ggplot2, Hmisc, e1071, rpart, randomForest, ElemStatLearn, class, caret, xgboost, cluster, arules)
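
Before installing anything, it can help to see which of the listed packages are already available. The following is a minimal base-R sketch rather than part of the original workflow: the names `pkgs`, `installed` and `missing_pkgs` are chosen here for illustration, and the K-Nearest Neighbour package is spelled `class` (lower case), since R package names are case sensitive.

```r
# Packages named in the list above
pkgs = c("caTools", "ggplot2", "Hmisc", "e1071", "rpart", "randomForest",
         "ElemStatLearn", "class", "caret", "xgboost", "cluster", "arules")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that are not yet installed on this machine
missing_pkgs = pkgs[!installed]
print(missing_pkgs)
```

Any names printed here can then be passed to install.packages() or loaded through pacman as described above.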

1. Data Preprocessing

Import dataset

Let us import our first dataset using read.csv() from base R.

dataset = read.csv('cars.csv')
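
One detail worth knowing: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so text columns arrive as character vectors unless factors are requested. The sketch below shows the explicit form on a tiny stand-in file. The temporary file and the object name `demo` are assumptions made for illustration; the three rows are copied from the first rows of the dataset, not the full cars.csv.

```r
# Write a small stand-in CSV to a temporary file (a subset of rows, not cars.csv itself)
tmp = tempfile(fileext = ".csv")
writeLines(c("car,country,mpg",
             "Mazda RX4,Japan,21.0",
             "Merc 240D,Germany,24.4",
             "Merc 230,Germany,22.8"), tmp)

# stringsAsFactors = TRUE converts text columns to factors on any R version
demo = read.csv(tmp, stringsAsFactors = TRUE)

str(demo)            # car and country are factors, mpg is numeric
levels(demo$country) # factor levels are sorted alphabetically

# Clean up the temporary file
unlink(tmp)
```

If the factor columns seen in this section do not appear on a recent R installation, passing stringsAsFactors = TRUE to the original read.csv('cars.csv') call is one likely remedy.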

Explore dataset

It is good practice to always examine the data first to see what we are working with.

head(dataset) # Examine first 6 rows

##             car country  mpg cyl  disp  hp drat    wt purchased
## 1     Mazda RX4   Japan 21.0   6 160.0 110 3.90 2.620         0
## 2 Mazda RX4 Wag   Japan 21.0   6 160.0 110 3.90 2.875         0
## 3    Datsun 710   Korea 22.8   4 108.0  93 3.85 2.320         0
## 4     Merc 240D Germany 24.4   4 146.7  62 3.69 3.190         1
## 5      Merc 230 Germany 22.8   4 140.8  95 3.92 3.150         1
## 6      Merc 280 Germany 19.2   6 167.6 123 3.92 3.440         0

dim(dataset) # Check size of dataset

## [1] 20  9

str(dataset) # Examine data types of variables

## 'data.frame':    20 obs. of  9 variables:
##  $ car      : Factor w/ 20 levels "Cadillac Fleetwood",..: 9 10 3 12 11 13 14 15 16 17 ...
##  $ country  : Factor w/ 6 levels "Germany","Italy",..: 3 3 4 1 1 1 1 1 1 1 ...
##  $ mpg      : num  21 21 22.8 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ...
##  $ cyl      : int  6 6 4 4 4 6 6 8 8 8 ...
##  $ disp     : num  160 160 108 147 141 ...
##  $ hp       : int  110 110 93 62 95 123 123 180 180 180 ...
##  $ drat     : num  3.9 3.9 3.85 3.69 3.92 3.92 3.92 3.07 3.07 3.07 ...
##  $ wt       : num  2.62 2.88 2.32 3.19 3.15 ...
##  $ purchased: int  0 0 0 1 1 0 0 0 0 0 ...

This dataset is a small one, with only 20 observations and 9 variables. It shows different car makes, the country of manufacture (a factor variable with 6 levels), and some characteristics of each car. The last column is a binary indicator that shows whether the car has been purchased. Let’s also take a look at the summary statistics.

summary(dataset) # Quick overview of summary statistics of data

##                  car        country       mpg             cyl     
##  Cadillac Fleetwood: 1   Germany:7   Min.   :10.40   Min.   :4.0  
##  Chrysler Imperial : 1   Italy  :2   1st Qu.:16.10   1st Qu.:4.0  
##  Datsun 710        : 1   Japan  :5   Median :20.35   Median :6.0  
##  Ferrari Dino      : 1   Korea  :1   Mean   :20.39   Mean   :5.9  
##  Fiat 128          : 1   Sweden :1   3rd Qu.:22.80   3rd Qu.:8.0  
##  Honda Civic       : 1   USA    :4   Max.   :33.90   Max.   :8.0  
##  (Other)           :14                                            
##       disp             hp             drat             wt       
##  Min.   : 71.1   Min.   : 52.0   Min.   :2.930   Min.   :1.615  
##  1st Qu.:120.1   1st Qu.: 95.5   1st Qu.:3.150   1st Qu.:2.581  
##  Median :146.7   Median :116.5   Median :3.700   Median :3.170  
##  Mean   :202.6   Mean   :139.4   Mean   :3.618   Mean   :3.293  
##  3rd Qu.:275.8   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.743  
##  Max.   :472.0   Max.   :335.0   Max.   :4.220   Max.   :5.424  
##  NA's   :3       NA's   :2       NA's   :1                      
##    purchased  
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.0  
##  Mean   :0.2  
##  3rd Qu.:0.0  
##  Max.   :1.0  
##
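
The NA's rows in the summary show that disp, hp and drat contain missing values. A compact way to count them per column is sketched below. It uses a small stand-in data frame with hypothetical values (the name `demo` and its contents are assumptions for illustration); in practice the same two calls would be applied to `dataset`.

```r
# Stand-in data frame with a few NAs (hypothetical values, column names echo the dataset)
demo = data.frame(
  disp = c(160.0, NA, 108.0, 146.7, NA),
  hp   = c(110, 110, NA, 62, 95),
  wt   = c(2.620, 2.875, 2.320, 3.190, 3.150)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts = colSums(is.na(demo))
print(na_counts)

# Number of rows with no missing value in any column
complete_rows = sum(complete.cases(demo))
print(complete_rows)
```

Counting complete rows as well is a quick way to judge how much data would be lost if incomplete rows were dropped instead of imputed.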

To get more detailed summary statistics, we can use the describe() function in the Hmisc package.

library(Hmisc)
describe(dataset) # More detailed summary statistics of data

## dataset 
## 
##  9  Variables      20  Observations
## ---------------------------------------------------------------------------
## car 
##        n  missing distinct 
##       20        0       20 
## 
## lowest : Cadillac Fleetwood Chrysler Imperial  Datsun 710         Ferrari Dino       Fiat 128          
## highest: Merc 450SL         Merc 450SLC        Toyota Corolla     Toyota Corona      Volvo 142E        
## ---------------------------------------------------------------------------
## country 
##        n  missing distinct 
##       20        0        6 
##                                                           
## Value      Germany   Italy   Japan   Korea  Sweden     USA
## Frequency        7       2       5       1       1       4
## Proportion    0.35    0.10    0.25    0.05    0.05    0.20
## ---------------------------------------------------------------------------
## mpg 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       17    0.998    20.38    7.229    10.40    14.27 
##      .25      .50      .75      .90      .95 
##    16.10    20.35    22.80    30.60    32.48 
##                                                                       
## Value      10.4 14.7 15.0 15.2 16.4 17.3 17.8 19.2 19.7 21.0 21.4 21.5
## Frequency     2    1    1    1    1    1    1    1    1    2    1    1
## Proportion 0.10 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.05
##                                    
## Value      22.8 24.4 30.4 32.4 33.9
## Frequency     2    1    1    1    1
## Proportion 0.10 0.05 0.05 0.05 0.05
## ---------------------------------------------------------------------------
## cyl 
##        n  missing distinct     Info     Mean      Gmd 
##       20        0        3     0.88      5.9    1.968 
##                          
## Value         4    6    8
## Frequency     8    5    7
## Proportion 0.40 0.25 0.35
## ---------------------------------------------------------------------------
## disp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       17        3       16    0.999    202.6    144.8    74.78    77.50 
##      .25      .50      .75      .90      .95 
##   120.10   146.70   275.80   448.00   462.40 
##                                                                       
## Value       71.1  75.7  78.7 108.0 120.1 121.0 140.8 145.0 146.7 160.0
## Frequency      1     1     1     1     1     1     1     1     1     2
## Proportion 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.118
##                                               
## Value      167.6 275.8 301.0 440.0 460.0 472.0
## Frequency      1     1     1     1     1     1
## Proportion 0.059 0.059 0.059 0.059 0.059 0.059
## ---------------------------------------------------------------------------
## hp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       18        2       14    0.994    139.4    75.89     60.5     64.1 
##      .25      .50      .75      .90      .95 
##     95.5    116.5    180.0    208.0    233.0 
##                                                                       
## Value         52    62    65    93    95    97   109   110   123   175
## Frequency      1     1     1     1     1     1     1     2     2     1
## Proportion 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.111 0.111 0.056
##                                   
## Value        180   205   215   335
## Frequency      3     1     1     1
## Proportion 0.167 0.056 0.056 0.056
## ---------------------------------------------------------------------------
## drat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       19        1       14    0.992    3.618    0.485    2.993    3.056 
##      .25      .50      .75      .90      .95 
##    3.150    3.700    3.920    4.086    4.121 
##                                                                       
## Value       2.93  3.00  3.07  3.23  3.54  3.62  3.69  3.70  3.85  3.90
## Frequency      1     1     3     1     1     1     1     1     1     2
## Proportion 0.053 0.053 0.158 0.053 0.053 0.053 0.053 0.053 0.053 0.105
##                                   
## Value       3.92  4.08  4.11  4.22
## Frequency      3     1     1     1
## Proportion 0.158 0.053 0.053 0.053
## ---------------------------------------------------------------------------
## wt 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       20        0       19    0.999    3.293     1.24    1.824    2.164 
##      .25      .50      .75      .90      .95 
##    2.581    3.170    3.742    5.260    5.349 
##                                                                       
## Value      1.615 1.835 2.200 2.320 2.465 2.620 2.770 2.780 2.875 3.150
## Frequency      1     1     1     1     1     1     1     1     1     1
## Proportion  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05
##                                                                 
## Value      3.190 3.440 3.570 3.730 3.780 4.070 5.250 5.345 5.424
## Frequency      1     2     1     1     1     1     1     1     1
## Proportion  0.05  0.10  0.05  0.05  0.05  0.05  0.05  0.05  0.05
## ---------------------------------------------------------------------------
## purchased 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##       20        0        2    0.481        4      0.2   0.3368 
## 
## ---------------------------------------------------------------------------

Split dataset into Training set and Test set

For machine learning purposes, it is common practice to split the dataset into a training set and a test set. The training set is used to train the model, and the model is then evaluated against the test set. We use the sample.split() function in the caTools package to randomly split the dataset 80/20 (a 70/30 split is also common). Passing the dependent variable to sample.split() lets it preserve the relative ratio of the labels in both sets.

library(caTools)
set.seed(123)
split = sample.split(dataset$purchased, SplitRatio = 0.8) # Note: select dependent variable to split
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
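
To make the idea of a label-preserving split concrete, here is a base-R sketch of stratified sampling. It is an illustration only and not caTools' actual implementation. The vector `y` is a stand-in with the same class balance as `purchased` (16 zeros and 4 ones), and the names `in_train`, `idx` and `n_train` are chosen here for the example.

```r
set.seed(123)

# Stand-in label vector with the same class balance as 'purchased'
y = c(rep(0, 16), rep(1, 4))

# Within each class, randomly flag about 80% of the positions for training
in_train = logical(length(y))
for (lvl in unique(y)) {
  idx = which(y == lvl)                    # positions belonging to this class
  n_train = round(0.8 * length(idx))       # how many of them go to training
  in_train[idx[sample.int(length(idx), n_train)]] = TRUE
}

# Cross-tabulate labels against the split, then check class shares in training
table(y, in_train)
prop.table(table(y[in_train]))
```

With these counts, 13 of the 16 zeros and 3 of the 4 ones land in the training set, so both sets keep at least one purchased car. The same check, table(training_set$purchased) and table(test_set$purchased), can be run on the split produced above.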