Machine Learning Project

In this part we are going to load the dataset and attach it to the environment.

data("iris")
dataset <- iris
colnames(dataset) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

Now we have the iris data loaded in R and accessible via the dataset variable.

2.3 Create a Validation Dataset

We will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see, and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two: 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

# create a list of 80% of the rows in the original dataset we can use for training
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.3.3
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of data to train and test the models
dataset <- dataset[validation_index,]

Now we have training data in the dataset variable and a validation set that we will use later in the validation variable.

Note that we replaced our dataset variable with the 80% sample of the dataset. This was an attempt to keep the rest of the code simpler and more readable.
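
As a quick sanity check (a minimal sketch in base R), we can confirm the sizes of the two splits:

# confirm the 80/20 split: expect 120 training rows and 30 validation rows
nrow(dataset)
nrow(validation)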

3. Summarize Dataset

Now it is time to take a look at the data.

In this step we are going to look at the data in a few different ways:

  1. Dimensions of the dataset.
  2. Types of the attributes.
  3. Peek at the data itself.
  4. Levels of the class attribute.
  5. Breakdown of the instances in each class.
  6. Statistical summary of all attributes.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.

# dimensions of dataset

dim(dataset)
## [1] 120   5

We should see 120 instances and 5 attributes.

3.2 Types of Attributes

It is a good idea to get a feel for the types of the attributes. They could be doubles, integers, strings, factors or other types.

Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.

# list types for each attribute
sapply(dataset, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

We should see that all of the inputs are doubles and that the class value is a factor.
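
As an alternative, the str function combines the attribute types with a peek at the first few values in a single call; a minimal sketch:

# compact overview: dimensions, type and first few values of each attribute
str(dataset)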

3.3 Peek at the Data

It is also always a good idea to actually eyeball your data.

# take a peek at the first 6 rows of the data
head(dataset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 7          4.6         3.4          1.4         0.3  setosa

You should see the first six rows of the data. Note that the row names are not consecutive because some rows (e.g. row 5) were held back for the validation set.

3.4 Levels of the Class

The class variable is a factor. A factor is a data type that has multiple labels, or levels. Let’s look at the levels.

levels(dataset$Species)
## [1] "setosa"     "versicolor" "virginica"

Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 3 different labels.

This is a multi-class or multinomial classification problem. If there were two levels, it would be a binary classification problem.
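
A quick programmatic check (a small sketch) is the nlevels function:

# number of class levels: 2 would be binary, 3 or more is multi-class
nlevels(dataset$Species)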

3.5 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.

# summarize the class distribution
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
##            freq percentage
## setosa       40   33.33333
## versicolor   40   33.33333
## virginica    40   33.33333

We can see that each class has the same number of instances (40, or 33% of the dataset).

3.6 Statistical Summary

Now finally, we can take a look at a summary of each attribute.

This includes the mean, the min and max values, as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).

# summarize attribute distributions
summary(dataset)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.200   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.400   Median :1.400  
##  Mean   :5.865   Mean   :3.083   Mean   :3.794   Mean   :1.218  
##  3rd Qu.:6.425   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :40  
##  versicolor:40  
##  virginica :40  
##                 
##                 
## 

We can see that all of the numerical values have the same scale (centimeters) and similar ranges [0, 8] centimeters.
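
If we want these percentiles for a single attribute directly, the quantile function computes them; a minimal sketch:

# 25th, 50th (median) and 75th percentiles of Sepal.Length
quantile(dataset$Sepal.Length, probs = c(0.25, 0.50, 0.75))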

4. Visualize Dataset

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.

# split input and output
x <- dataset[,1:4]
y <- dataset[,5]

Given that the input variables are numeric, we can create box and whisker plots of each.

# boxplot for each attribute on one image
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(x)[i])
}

This gives us a much clearer idea of the distribution of the input attributes:

We can also create a barplot of the Species class variable to get a graphical representation of the class distribution (generally uninteresting in this case because they’re even).

# barplot for class breakdown
plot(y)

This confirms what we learned in the last section, that the instances are evenly distributed across the three classes:

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.

# scatterplot matrix
featurePlot(x=x, y=y, plot = "ellipse")

We can see some clear relationships between the input attributes (trends) and between attributes and the class values (ellipses).
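
Note that plot = "ellipse" relies on the ellipse package, which caret does not always install by default; if the plot does not render, installing it first may help (an assumption about your setup):

# featurePlot(plot = "ellipse") needs the ellipse package installed
if (!requireNamespace("ellipse", quietly = TRUE)) {
  install.packages("ellipse")
}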

We can also look at box and whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separation between the classes.

# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot = "box")

This is useful to see that there are clearly different distributions of the attributes for each class value.

Next we can get an idea of the distribution of each attribute, again like the box and whisker plots, broken down by class value. Sometimes histograms are good for this, but in this case we will use some probability density plots to give nice smooth lines for each distribution.

# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)

Like the boxplots, we can see the difference in distribution of each attribute by class value. We can also see the Gaussian-like distribution (bell curve) of each attribute.

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Set up the test harness to use 10-fold cross-validation.
  2. Build 5 different models to predict species from flower measurements.
  3. Select the best model.

5.1 Test Harness

We will use 10-fold cross-validation to estimate accuracy. This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. The process could also be repeated several times with different splits of the data into 10 groups to get an even more robust estimate, but here we run a single round of 10-fold cross-validation (a repeated variant is sketched below).

# run algorithms using 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

We are using the metric of “Accuracy” to evaluate models. This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). Note that caret expects the capitalized name “Accuracy”; the lowercase “accuracy” triggers a warning and falls back to the default. We will be using the metric variable when we build and evaluate each model next.
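
If we did want to repeat the cross-validation several times with different splits, caret supports this through the repeatedcv method. A minimal sketch (control_repeated is an illustrative name and is not used by the models below):

# optional: 3 repeats of 10-fold cross-validation for a more stable estimate
control_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)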

5.2 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 5 different algorithms:

  1. Linear Discriminant Analysis (LDA).
  2. Classification and Regression Trees (CART).
  3. k-Nearest Neighbors (kNN).
  4. Support Vector Machines (SVM) with a radial kernel.
  5. Random Forest (RF).

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.

Let’s build our five models.

# a) linear algorithms
set.seed(7)
fit.lda <- train(Species~., data = dataset, method="lda", metric=metric, trControl=control)
## Loading required package: MASS
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data = dataset, method="rpart",metric=metric, trControl=control)
## Loading required package: rpart
# kNN
set.seed(7)
fit.knn <- train(Species~., data = dataset, method="knn", metric=metric, trControl = control)
# c) advanced algorithms
# SVM 
set.seed(7)
fit.svm <- train(Species~., data = dataset, method="svmRadial", metric=metric, trControl=control)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data = dataset, method="rf", metric=metric, trControl=control)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

5.3 Select Best Model

We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report on the accuracy of each model by first creating a list of the fitted models and then using the summary function.

# summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##        Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.9167  0.9375 1.0000 0.9750       1    1    0
## cart 0.8333  0.9167 0.9583 0.9417       1    1    0
## knn  0.9167  0.9375 1.0000 0.9750       1    1    0
## svm  0.9167  0.9167 1.0000 0.9667       1    1    0
## rf   0.8333  0.9167 1.0000 0.9583       1    1    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.875  0.9062 1.0000 0.9625       1    1    0
## cart 0.750  0.8750 0.9375 0.9125       1    1    0
## knn  0.875  0.9062 1.0000 0.9625       1    1    0
## svm  0.875  0.8750 1.0000 0.9500       1    1    0
## rf   0.750  0.8750 1.0000 0.9375       1    1    0

We can see the accuracy of each classifier and also other metrics like Kappa:

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).

# compare accuracy of models
dotplot(results)

We can see that the most accurate models in this case were LDA and kNN (tied at 97.5% mean accuracy); we will continue with LDA.

The result for just the LDA model can be summarized.

# summarize Best Model
print(fit.lda)
## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.975     0.9625
## 
## 

This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.
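
The mean accuracy and its standard deviation can also be read directly from the fitted train object; a minimal sketch:

# resampling results for the final model, including AccuracySD and KappaSD
fit.lda$results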

6. Make Predictions

LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.

# estimate skill of LDA on the validation dataset.
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500

We can see that the accuracy is 96.67%. It was a small validation dataset (20%), but this result is within our expected margin of 97% +/- 4%, suggesting we may have an accurate and reliable model.
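
As a final cross-check (a one-line sketch), the same accuracy can be computed by comparing the predictions to the true labels directly:

# proportion of validation instances classified correctly (29 of 30 here)
mean(predictions == validation$Species)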