Introduction to Programming - Machine Learning with R
Introduction
This markdown document is intended as an introduction to programming and serves as a template for machine learning purposes. The goal is to provide enough information for readers to use it as a practical reference in their daily work. It does not attempt a thorough discussion of the theory behind each method; that is left to the interested reader.
This document is divided into 11 sections covering the more popular machine learning algorithms:
- Data Preprocessing
- Regression
- Classification
- Clustering
- Association Rule Learning
- Reinforcement Learning
- Natural Language Processing
- Deep Learning
- Dimensionality Reduction
- Model Selection & Boosting
- Time Series Forecasting
In Section 1: Data Preprocessing, we will quickly cover the basics. In particular, we will learn how to (i) import a dataset, (ii) do a quick exploration of the dataset, (iii) split the dataset into training and test sets, (iv) take care of missing data by imputing means, (v) encode categorical data and (vi) perform feature scaling.
In Section 2: Regression, we start to build regression models beginning with a (2.1) Simple Linear Regression model, before moving on to (2.2) Multiple Linear Regression, (2.3) Polynomial Regression, (2.4) Support Vector Regression, (2.5) Decision Tree Regression and (2.6) Random Forest Regression. We will learn how to use the model to predict new results and plot some visualisations to examine the results.
In Section 3: Classification, we will build classification models starting with the workhorse (3.1) Logistic Regression model. We will also look at (3.2) K-Nearest Neighbours, (3.3) Support Vector Machine, (3.4) Naive Bayes, (3.5) Decision Tree Classification, (3.6) Random Forest Classification and (3.7) Extreme Gradient Boosting (XGBoost).
Moving on to Section 4: Clustering, we turn our attention to unsupervised learning where we will build (4.1) K-Means Clustering and (4.2) Hierarchical Clustering.
In Section 5: Association Rule Learning, we look at the (5.1) Apriori and (5.2) Eclat methods, both of which are helpful in market basket optimisation.
Section 6: Reinforcement Learning (aka Online Learning) is where things start to get really interesting. We examine two methods, namely (6.1) Upper Confidence Bound and (6.2) Thompson Sampling.
The packages used in this document include:
- caTools - For data processing
- ggplot2 - For visualisation
- Hmisc - For data exploration
- e1071 - For Support Vector Regression, Support Vector Machine & Naive Bayes Classifier
- rpart - For Decision Tree Regression & Decision Tree Classifier
- randomForest - For Random Forest Regression & Random Forest Classifier
- ElemStatLearn - For plotting of Logistic Regression Classifier
- class - For K-Nearest Neighbour Classifier
- caret - For Performance Statistics
- xgboost - For Extreme Gradient Boosting
- cluster - For visualising clusters
- arules - For Apriori
To install the required packages, users can first install the package ‘pacman’ then run the code below:
# install.packages('pacman')
# pacman::p_load(caTools, ggplot2, Hmisc, e1071, rpart, randomForest, ElemStatLearn, class, caret, xgboost, cluster, arules)
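If pacman is not available, a base-R sketch along the following lines reports which of the listed packages are not yet installed. The vector names `required` and `missing_pkgs` are illustrative, the k-nearest-neighbour package is written as `class` because package names are case-sensitive, and the install line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages named in the list above (package names are case-sensitive)
required <- c("caTools", "ggplot2", "Hmisc", "e1071", "rpart", "randomForest",
              "ElemStatLearn", "class", "caret", "xgboost", "cluster", "arules")

# Keep only the names that are not found in any library path
missing_pkgs <- setdiff(required, rownames(installed.packages()))
print(missing_pkgs)

# install.packages(missing_pkgs)  # uncomment to install whatever is missing
```

An empty result (`character(0)`) means every package in the list is already installed.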
1. Data Preprocessing
Import dataset
Let us import our first dataset using read.csv() from base R.
dataset = read.csv('cars.csv')
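The str() output later in this section shows car and country as factors. Since R 4.0.0, read.csv() keeps text columns as character by default, so reproducing that output on a recent R version would need stringsAsFactors = TRUE. A minimal sketch, assuming a small temporary file in place of cars.csv (the two rows are copied from the head() output in this section, restricted to three columns):

```r
# Write a two-row stand-in for cars.csv to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("car,country,mpg",
             "Mazda RX4,Japan,21.0",
             "Merc 240D,Germany,24.4"), tmp)

# Default behaviour: in R >= 4.0.0 text columns are read as character
as_char <- read.csv(tmp)

# Explicitly request factors, matching the str() output in this document
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)

class(as_char$country)
class(as_factor$country)
levels(as_factor$country)
```

Whether factors or character vectors are preferable depends on the later steps; the models in this document that expect a categorical variable generally need a factor.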
Explore dataset
It is good practice to examine the data first to see what we are working with.
head(dataset) # Examine first 6 rows
## car country mpg cyl disp hp drat wt purchased
## 1 Mazda RX4 Japan 21.0 6 160.0 110 3.90 2.620 0
## 2 Mazda RX4 Wag Japan 21.0 6 160.0 110 3.90 2.875 0
## 3 Datsun 710 Korea 22.8 4 108.0 93 3.85 2.320 0
## 4 Merc 240D Germany 24.4 4 146.7 62 3.69 3.190 1
## 5 Merc 230 Germany 22.8 4 140.8 95 3.92 3.150 1
## 6 Merc 280 Germany 19.2 6 167.6 123 3.92 3.440 0
dim(dataset) # Check size of dataset
## [1] 20 9
str(dataset) # Examine data types of variables
## 'data.frame': 20 obs. of 9 variables:
## $ car : Factor w/ 20 levels "Cadillac Fleetwood",..: 9 10 3 12 11 13 14 15 16 17 ...
## $ country : Factor w/ 6 levels "Germany","Italy",..: 3 3 4 1 1 1 1 1 1 1 ...
## $ mpg : num 21 21 22.8 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ...
## $ cyl : int 6 6 4 4 4 6 6 8 8 8 ...
## $ disp : num 160 160 108 147 141 ...
## $ hp : int 110 110 93 62 95 123 123 180 180 180 ...
## $ drat : num 3.9 3.9 3.85 3.69 3.92 3.92 3.92 3.07 3.07 3.07 ...
## $ wt : num 2.62 2.88 2.32 3.19 3.15 ...
## $ purchased: int 0 0 0 1 1 0 0 0 0 0 ...
This dataset is a small one, with only 20 observations and 9 variables. It shows different car makes, the country of manufacture (a factor variable with 6 levels), and some characteristics of each car. The last column is a binary indicator that shows whether the car has been purchased. Let’s also take a look at the summary statistics.
summary(dataset) # Quick overview of summary statistics of data
## car country mpg cyl
## Cadillac Fleetwood: 1 Germany:7 Min. :10.40 Min. :4.0
## Chrysler Imperial : 1 Italy :2 1st Qu.:16.10 1st Qu.:4.0
## Datsun 710 : 1 Japan :5 Median :20.35 Median :6.0
## Ferrari Dino : 1 Korea :1 Mean :20.39 Mean :5.9
## Fiat 128 : 1 Sweden :1 3rd Qu.:22.80 3rd Qu.:8.0
## Honda Civic : 1 USA :4 Max. :33.90 Max. :8.0
## (Other) :14
## disp hp drat wt
## Min. : 71.1 Min. : 52.0 Min. :2.930 Min. :1.615
## 1st Qu.:120.1 1st Qu.: 95.5 1st Qu.:3.150 1st Qu.:2.581
## Median :146.7 Median :116.5 Median :3.700 Median :3.170
## Mean :202.6 Mean :139.4 Mean :3.618 Mean :3.293
## 3rd Qu.:275.8 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.743
## Max. :472.0 Max. :335.0 Max. :4.220 Max. :5.424
## NA's :3 NA's :2 NA's :1
## purchased
## Min. :0.0
## 1st Qu.:0.0
## Median :0.0
## Mean :0.2
## 3rd Qu.:0.0
## Max. :1.0
##
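summary() reports missing values only at the bottom of each column's block. A more direct count per column can be obtained with is.na(). The sketch below uses a small made-up data frame (`toy`, not the cars data) so that it runs on its own.

```r
# Toy data frame with a few missing values (illustrative, not from cars.csv)
toy <- data.frame(
  disp = c(160, NA, 108, NA),
  hp   = c(110, 110, NA, 62),
  wt   = c(2.62, 2.875, 2.32, 3.19)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Applied to the real data, colSums(is.na(dataset)) should agree with the NA's row in the summary above: 3 for disp, 2 for hp and 1 for drat.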
To get more detailed summary statistics, we can use the describe() function in the Hmisc package.
library(Hmisc)
describe(dataset) # More detailed summary statistics of data
## dataset
##
## 9 Variables 20 Observations
## ---------------------------------------------------------------------------
## car
## n missing distinct
## 20 0 20
##
## lowest : Cadillac Fleetwood Chrysler Imperial Datsun 710 Ferrari Dino Fiat 128
## highest: Merc 450SL Merc 450SLC Toyota Corolla Toyota Corona Volvo 142E
## ---------------------------------------------------------------------------
## country
## n missing distinct
## 20 0 6
##
## Value Germany Italy Japan Korea Sweden USA
## Frequency 7 2 5 1 1 4
## Proportion 0.35 0.10 0.25 0.05 0.05 0.20
## ---------------------------------------------------------------------------
## mpg
## n missing distinct Info Mean Gmd .05 .10
## 20 0 17 0.998 20.38 7.229 10.40 14.27
## .25 .50 .75 .90 .95
## 16.10 20.35 22.80 30.60 32.48
##
## Value 10.4 14.7 15.0 15.2 16.4 17.3 17.8 19.2 19.7 21.0 21.4 21.5
## Frequency 2 1 1 1 1 1 1 1 1 2 1 1
## Proportion 0.10 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.10 0.05 0.05
##
## Value 22.8 24.4 30.4 32.4 33.9
## Frequency 2 1 1 1 1
## Proportion 0.10 0.05 0.05 0.05 0.05
## ---------------------------------------------------------------------------
## cyl
## n missing distinct Info Mean Gmd
## 20 0 3 0.88 5.9 1.968
##
## Value 4 6 8
## Frequency 8 5 7
## Proportion 0.40 0.25 0.35
## ---------------------------------------------------------------------------
## disp
## n missing distinct Info Mean Gmd .05 .10
## 17 3 16 0.999 202.6 144.8 74.78 77.50
## .25 .50 .75 .90 .95
## 120.10 146.70 275.80 448.00 462.40
##
## Value 71.1 75.7 78.7 108.0 120.1 121.0 140.8 145.0 146.7 160.0
## Frequency 1 1 1 1 1 1 1 1 1 2
## Proportion 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.118
##
## Value 167.6 275.8 301.0 440.0 460.0 472.0
## Frequency 1 1 1 1 1 1
## Proportion 0.059 0.059 0.059 0.059 0.059 0.059
## ---------------------------------------------------------------------------
## hp
## n missing distinct Info Mean Gmd .05 .10
## 18 2 14 0.994 139.4 75.89 60.5 64.1
## .25 .50 .75 .90 .95
## 95.5 116.5 180.0 208.0 233.0
##
## Value 52 62 65 93 95 97 109 110 123 175
## Frequency 1 1 1 1 1 1 1 2 2 1
## Proportion 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.111 0.111 0.056
##
## Value 180 205 215 335
## Frequency 3 1 1 1
## Proportion 0.167 0.056 0.056 0.056
## ---------------------------------------------------------------------------
## drat
## n missing distinct Info Mean Gmd .05 .10
## 19 1 14 0.992 3.618 0.485 2.993 3.056
## .25 .50 .75 .90 .95
## 3.150 3.700 3.920 4.086 4.121
##
## Value 2.93 3.00 3.07 3.23 3.54 3.62 3.69 3.70 3.85 3.90
## Frequency 1 1 3 1 1 1 1 1 1 2
## Proportion 0.053 0.053 0.158 0.053 0.053 0.053 0.053 0.053 0.053 0.105
##
## Value 3.92 4.08 4.11 4.22
## Frequency 3 1 1 1
## Proportion 0.158 0.053 0.053 0.053
## ---------------------------------------------------------------------------
## wt
## n missing distinct Info Mean Gmd .05 .10
## 20 0 19 0.999 3.293 1.24 1.824 2.164
## .25 .50 .75 .90 .95
## 2.581 3.170 3.742 5.260 5.349
##
## Value 1.615 1.835 2.200 2.320 2.465 2.620 2.770 2.780 2.875 3.150
## Frequency 1 1 1 1 1 1 1 1 1 1
## Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
##
## Value 3.190 3.440 3.570 3.730 3.780 4.070 5.250 5.345 5.424
## Frequency 1 2 1 1 1 1 1 1 1
## Proportion 0.05 0.10 0.05 0.05 0.05 0.05 0.05 0.05 0.05
## ---------------------------------------------------------------------------
## purchased
## n missing distinct Info Sum Mean Gmd
## 20 0 2 0.481 4 0.2 0.3368
##
## ---------------------------------------------------------------------------
Split dataset into Training set and Test set
For machine learning purposes, it is common practice to split the dataset into a training set and a test set. The training set is used to train the model, and the model is then evaluated against the test set. We use the sample.split() function in the caTools package to randomly split the dataset 80/20 (a 70/30 split is also common).
library(caTools)
set.seed(123)
split = sample.split(dataset$purchased, SplitRatio = 0.8) # Note: select dependent variable to split
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
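sample.split() is given the dependent variable so that both subsets keep roughly the same ratio of 0s and 1s. The base-R sketch below illustrates that idea with a hypothetical helper, `stratified_flags()`. It is not the caTools implementation and will not pick the same rows. The vector `purchased` is rebuilt from the class counts reported above (16 zeros, 4 ones).

```r
# Hypothetical helper: returns TRUE for rows assigned to the training set,
# sampling the requested ratio separately within each class of y
stratified_flags <- function(y, ratio) {
  flags <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # Sample positions within idx, which stays safe when idx has length 1
    chosen <- idx[sample.int(length(idx), n_train)]
    flags[chosen] <- TRUE
  }
  flags
}

set.seed(123)
purchased <- c(rep(0, 16), rep(1, 4))  # same class counts as the cars data
in_train <- stratified_flags(purchased, ratio = 0.8)

# Cross-tabulate class against training-set membership
table(purchased, in_train)
```

With so few positive cases, the test set ends up with a single purchased car, which is worth keeping in mind when reading any accuracy figure computed on it.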