Do you want to do machine learning using R, but you’re having trouble getting started? In this post you will complete your first machine learning project using R, working step by step from loading a dataset to making predictions.
If you are a machine learning beginner and looking to finally get started using R, this tutorial was designed for you.
Let’s get started!
The best way to learn machine learning is by designing and completing small projects.
R provides a scripting language with an odd syntax. There are also hundreds of packages and thousands of functions to choose from, providing multiple ways to do each task. It can feel overwhelming.
The best way to get started using R for machine learning is to complete a project.
Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
The process of a machine learning project may not be linear, but there are a number of well-known steps.
For more information on the steps in a machine learning project see this checklist and more on the process.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing the data, evaluating algorithms and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, later, once you have more confidence.
The best small project to start with on a new tool is a small, well-understood classification problem.
This is a good project because it is so well understood.
Let’s get started with your hello world machine learning project in R.
We are going to use the HR Attrition dataset. It is a popular beginner dataset because the task is easy to understand: predict whether an employee left the company (the left attribute) from a handful of workplace measurements.
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
# Attrition Analysis
# Load the data from the working directory
setwd("~/Rattle-data")
data <- read.csv("HR_comma_sep1.csv", sep=";", header=TRUE)
# remove any rows with missing values
data <- na.omit(data)
str(data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
We need a way to know whether the models we create are any good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data$left, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data[-validation_index,]
# use the remaining 80% of the data to train and test the models
data <- data[validation_index,]
You now have training data in the data variable and a validation set we will use later in the validation variable.
Note that we replaced our data variable with the 80% sample of the dataset. This keeps the rest of the code simpler and more readable.
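createDataPartition samples within each class, so the class proportions should be preserved in both splits. If you want to check that yourself, a quick optional sanity check is:
# optional: confirm the class proportions are similar in both splits
prop.table(table(data$left))
prop.table(table(validation$left))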
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways: the dimensions of the dataset, the attribute types, a peek at the data itself, the levels of the class attribute, the class distribution, and a statistical summary of each attribute.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
# dimensions of dataset
dim(data)
## [1] 12000 10
You should see 12000 instances and 10 attributes (the full dataset has 14999 rows; we are working with the 80% training split).
It is a good idea to get an idea of the types of the attributes. They could be doubles, integers, strings, factors and other types.
Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.
# list types for each attribute
sapply(data, class)
## satisfaction_level last_evaluation number_project
## "numeric" "numeric" "integer"
## average_montly_hours time_spend_company Work_accident
## "integer" "integer" "integer"
## left promotion_last_5years sales
## "factor" "integer" "factor"
## salary
## "factor"
You should see that the inputs are a mixture of numeric and integer attributes and that the class value, left, is a factor. Two of the other attributes (sales and salary) are also factors.
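Work_accident and promotion_last_5years are stored as 0/1 integers. The models below handle them as numeric inputs, but if you prefer to treat them as categorical you could convert them to factors; an optional sketch, not required for the rest of the tutorial:
# optional: treat the binary flags as factors instead of integers
# data$Work_accident <- as.factor(data$Work_accident)
# data$promotion_last_5years <- as.factor(data$promotion_last_5years)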
It is also always a good idea to actually eyeball your data. Below are the first 6 rows of the data:
# take a peek at the first 6 rows of the data
head(data)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 YES 0 sales low
## 2 6 0 YES 0 sales medium
## 3 4 0 YES 0 sales medium
## 4 5 0 YES 0 sales low
## 5 3 0 YES 0 sales low
## 6 3 0 YES 0 sales low
The class variable is a factor. A factor is a data type that stores a categorical variable as a fixed set of labels, called levels. Let’s look at the levels:
# list the levels for the class
data$left<-as.factor(data$left)
levels(data$left)
## [1] "NO" "YES"
Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 2 different labels.
Because there are two levels, this is a binary (two-class) classification problem.
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.
# summarize the class distribution
percentage <- prop.table(table(data$left)) * 100
cbind(freq=table(data$left), percentage=percentage)
## freq percentage
## NO 9143 76.19167
## YES 2857 23.80833
Now finally, we can take a look at a summary of each attribute.
This includes the mean, the min and max values, as well as some percentiles (the 25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).
# summarize attribute distributions
summary(data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.090 Min. :0.360 Min. :2.000 Min. : 96.0
## 1st Qu.:0.440 1st Qu.:0.560 1st Qu.:3.000 1st Qu.:156.0
## Median :0.640 Median :0.720 Median :4.000 Median :200.0
## Mean :0.613 Mean :0.715 Mean :3.804 Mean :200.9
## 3rd Qu.:0.820 3rd Qu.:0.870 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.000 Max. :1.000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 NO :9143 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 YES:2857 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.00000
## Mean : 3.497 Mean :0.1436 Mean :0.02175
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.00000
##
## sales salary
## sales :3297 high : 997
## technical :2142 low :5839
## support :1788 medium:5164
## IT : 994
## product_mng: 727
## marketing : 687
## (Other) :2365
We start with some univariate plots, that is, plots of each individual variable.
It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.
The boxplots below give us a clearer idea of the distribution of each numeric input attribute:
# split input and output attributes
x <- data[,-7]   # every attribute except the class (left)
y <- data[,7]    # the class attribute (left)
# boxplot for each numeric attribute on one image
par(mfrow=c(1,7))
for(i in 1:7) {
  boxplot(x[,i], main=names(x)[i])
}
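Histograms give another quick view of the same numeric attributes. This is an optional extension of the loop above using base R’s hist function:
# histograms of each numeric input attribute
par(mfrow=c(2,4))
for(i in 1:7) {
  hist(x[,i], main=names(x)[i], xlab="")
}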
A barplot of the class attribute gives a visual confirmation of the class breakdown we computed earlier: the classes are imbalanced, with roughly three NO instances for every YES instance:
# barplot for class breakdown
par(mfrow=c(1,1))
plot(y)
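To also see how each numeric input relates to the class, caret’s featurePlot function can draw box-and-whisker plots of each attribute split by class value. This is an optional multivariate look at the data:
# box-and-whisker plots of each numeric input, split by class
featurePlot(x=x[,1:7], y=y, plot="box")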
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
Set up the test harness to use 10-fold cross-validation. Build six different models to predict whether an employee left (the left attribute) from the other attributes. Select the best model.
We will use 10-fold cross-validation to estimate accuracy.
This will split our training dataset into 10 parts, train on 9 parts and test on 1, and repeat for all combinations of train-test splits. Repeating the whole process a few times with different splits would give an even more stable estimate, but here we keep a single round of 10-fold cross-validation (see the optional sketch after the code below).
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the metric variable when we build and evaluate each model next. Because the classes are imbalanced, it is worth keeping an eye on Kappa as well; an ROC-based alternative is sketched below.
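Because roughly 76% of employees stayed, accuracy alone can look flattering on this problem. As an optional alternative, you could ask caret to compute ROC-based summaries and select models by ROC instead; a sketch, not used in the remainder of this tutorial:
# optional: ROC-based evaluation for the imbalanced two-class problem
# control <- trainControl(method="cv", number=10, classProbs=TRUE,
#                         summaryFunction=twoClassSummary)
# metric <- "ROC"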
We don’t know in advance which algorithms will do well on this problem or which configurations to use, so we will try a mixture of simple and more powerful methods and let the evaluation results guide us.
Let’s evaluate 6 different algorithms: Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (kNN), Support Vector Machines (SVM) with a radial kernel, Random Forest (RF) and Logistic Regression (LR).
This is a good mixture of simple linear methods (LDA, LR), nonlinear methods (CART, kNN) and more complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits, which makes the results directly comparable.
Let’s build our six models:
# test harness (same settings as defined above, repeated so this block can run on its own)
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
# Models
# a) linear algorithms
set.seed(7)
library(caret)
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.3
fit.lda <- train(left~., data=data, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(left~., data=data, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(left~., data=data, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(left~., data=data, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(left~., data=data, method="rf", metric=metric, trControl=control)
#Logistic
set.seed(7)
fit.log<- train(left~., data=data, method="glm", family="binomial",metric=metric, trControl=control)
Caret does support configuring and tuning each model, but we are not going to cover that in this tutorial; a small taste is sketched below.
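As a small taste of what tuning looks like, you could ask train to try more candidate values of the random forest’s mtry parameter with the tuneLength argument; an optional sketch, not relied on below:
# optional: let caret try 5 candidate values of mtry instead of the default 3
# fit.rf.tuned <- train(left~., data=data, method="rf", metric=metric,
#                       trControl=control, tuneLength=5)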
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first collecting the resampling results of the created models into a list with the resamples function and then calling the summary function.
We can see the accuracy of each classifier and also other metrics like Kappa:
#Summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf, log=fit.log))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf, log
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.7566667 0.7733805 0.7787500 0.7783342 0.7872467 0.7941667 0
## cart 0.8933333 0.9061005 0.9300000 0.9247529 0.9433210 0.9533333 0
## knn 0.9316097 0.9341667 0.9396090 0.9389161 0.9435417 0.9475000 0
## svm 0.9400000 0.9424635 0.9479378 0.9472494 0.9514583 0.9541667 0
## rf 0.9850000 0.9874917 0.9912500 0.9909994 0.9931292 0.9983333 0
## log 0.7725000 0.7850000 0.7933332 0.7917500 0.7981250 0.8050000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.2212641 0.2691945 0.2844733 0.2857422 0.3102573 0.3562679 0
## cart 0.6723549 0.7170290 0.7986994 0.7804789 0.8468939 0.8738952 0
## knn 0.8152966 0.8231301 0.8393116 0.8371848 0.8497635 0.8605876 0
## svm 0.8323216 0.8422577 0.8566114 0.8544491 0.8656128 0.8736068 0
## rf 0.9579780 0.9650241 0.9757647 0.9749560 0.9809362 0.9953983 0
## log 0.2709177 0.3133297 0.3354736 0.3319510 0.3451048 0.3909818 0
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
We can see that the most accurate model in this case was Random Forest (RF):
# compare accuracy of models
dotplot(results)
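If you prefer box-and-whisker plots of the resampling distributions, caret also provides a bwplot method for resamples objects:
# box-and-whisker comparison of the models
bwplot(results)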
The results for just the RF model can be summarized.
# summarize Best Model
print(fit.rf)
## Random Forest
##
## 12000 samples
## 9 predictor
## 2 classes: 'NO', 'YES'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10799, 10800, 10800, 10800, 10800, 10801, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9693331 0.9121286
## 10 0.9909994 0.9749560
## 18 0.9887492 0.9687990
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
This gives a nice summary of what was used to train the model and the accuracy achieved for each candidate value of mtry, specifically a mean accuracy of about 99.1% with mtry = 10.
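The per-candidate means and standard deviations are stored in the results element of the train object, if you want to see the spread behind these numbers:
# accuracy mean and standard deviation for each value of mtry
print(fit.rf$results)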
The RF was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both would result in an overly optimistic result.
We can run the RF model directly on the validation set and summarize the results in a confusion matrix.
# estimate skill of the RF model on the validation dataset
predictions <- predict(fit.rf, validation)
confusionMatrix(predictions, validation$left)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 2282 21
## YES 3 693
##
## Accuracy : 0.992
## 95% CI : (0.9881, 0.9949)
## No Information Rate : 0.7619
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9777
## Mcnemar's Test P-Value : 0.0005202
##
## Sensitivity : 0.9987
## Specificity : 0.9706
## Pos Pred Value : 0.9909
## Neg Pred Value : 0.9957
## Prevalence : 0.7619
## Detection Rate : 0.7609
## Detection Prevalence : 0.7679
## Balanced Accuracy : 0.9846
##
## 'Positive' Class : NO
##
We can see that the accuracy on the validation set is 99.2%. It was a small validation dataset (20% of the data), but this result is consistent with our cross-validated estimate of about 99.1%, suggesting we have an accurate and reliable model.
Receiver Operating Characteristic (ROC) curve result:
pred<-predict(fit.rf,validation,type="prob")
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.4.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred1 <- prediction(pred$YES, validation$left)
perf1 <- performance(pred1,"tpr","fpr")
plot(perf1, colorize=TRUE)
abline(a=0, b= 1)
Area under the ROC curve (AUC):
auc_ROCR <- performance(pred1, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
print(auc_ROCR)
## [1] 0.9936383
Work through the tutorial above. It will take you 5-to-10 minutes, max!
You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result; you can come back for the details on a second pass. Write down your questions as you go, and make heavy use of the ?FunctionName help syntax in R to learn about the functions you’re using.
You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.
You do not need to be an R programmer. The syntax of the R language can be confusing. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a <- “b”). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.
You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.
What about the other steps in a machine learning project? We did not cover all of them because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result-improvement tasks.
In this post you discovered step-by-step how to complete your first machine learning project in R.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.