1 Introduction

This document demonstrates the use of R for machine learning. This is done by applying machine learning techniques to some actual data. The data used are all secondary and obtained from Kaggle and UCI machine learning repository websites.

The links to the dataset are given below:

Breast cancer: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Titanic: https://www.kaggle.com/c/titanic/data

Boston Housing data: https://www.kaggle.com/prasadperera/the-boston-housing-dataset

Note that the corrected Boston Housing data is available in R: data(BostonHousing2)

Ames Housing data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The main functions for machine learning are available in the caret package. This package contains functions that can be used to train supervised machine learning models. The main functions for training models in caret are trainControl() and train(). The function trainControl is used to split observations into a training and test set. The trainControl() function however is used to specify the machine learning model to use fro predictions. In this function, other parameters of the model can be specified.

Other packages that are useful pre-processing and cleaning the data prior to training are psych for getting correlation plots, performing principal component analysis and plotting multiple histograms and Hmisc for computing correlation matrices and p-values.

The package mice is useful for dealing with missing values and can be used for missing data imputation techniques such as regression imputation and multiple imputation. The package VIM is used to summary information relating to missing value and for generating relevant displays of missing data patterns.

In this document we will build machine learning models for both regression and classification problems and in each case pre-process the data prior to model fitting.

2 Install relevant packages

#Caret package for machine learning 
install.packages("caret")
library(caret)

#Psych package for Correlation plots and multiple histograms
install.packages("psych")
library(psych)

#mlbench package for Boston dataset
install.packages("mlbench")
library(mlbench)

#Hmisc for generating correlation matrix and p.values
install.packages("Hmisc")
library("Hmisc")

#cluster for estimating silhouette score and gap statistics
#[usually comes pre-installed]
install.packages('cluster')
library(cluster)

#VIM for exploring missing data
install.packages("VIM")
require(VIM)

#mice for missing data imputation
install.packages('mice')
require(mice)

3 Breast Cancer Prediction

In this example we have data on 569 patients and are interested in predicting whether the tumuor from digitised breast mass images are benign or malignant. A total of 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The outcome variable is binary: malignant vs benign.

3.1 Data Pre-processing

Th first step is to load the data using the read.csv() function.

BCData <- read.csv('data_breast_cancer.csv')

View the first few observations using head() function. This shows that all the input variables are numeric and the outcome has two values ‘M’ and ‘B’ and there is also an id variable.

head(BCData)

Next, we clean the data. This involves deleting the first column, which is the id of the patients and will not be used for any further analysis. The outcome variable which is defined as diagnosis is a categoric variable and should be defined as such using factor() function.

BCData <- BCData[,2:32] #Remove id variable

#Define diagnosis as categoric 
BCData$diagnosis <- factor(BCData$diagnosis)

3.2 Univariate Summary

Now that the data has been cleaned up and irrelevant variables removed, it is time to summarise the information provided in the dataset. The first is a univariate summary and the type of summary will depend on if the variable is numeric or categoric.

3.2.1 Numeric variables

The summary() function is applied to all the numeric variables by passing it as an argument in the sapply() function.

sapply(X=BCData[,2:31],FUN=summary)
##         radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## Min.        6.98100      9.71000       43.79000  143.5000      0.05263000
## 1st Qu.    11.70000     16.17000       75.17000  420.3000      0.08637000
## Median     13.37000     18.84000       86.24000  551.1000      0.09587000
## Mean       14.12729     19.28965       91.96903  654.8891      0.09636028
## 3rd Qu.    15.78000     21.80000      104.10000  782.7000      0.10530000
## Max.       28.11000     39.28000      188.50000 2501.0000      0.16340000

The multi.hist() function within the psych package can be used to plot multiple histograms for numeric variables. The argument global = F prevents the function from using the same axis values for all plots, which is important here since the variables are on different scales.

The histograms show that variables texture_mean, smoothness_mean and symmetry_mean are approximately normally distributed and the other variables show a positive skew.

The histograms and summary of variables show that the variables are on different scales. For example, the perimeter_mean variable takes values between 143.5 and 2501, while the variable smoothness_mean takes values between 0.05 and 0.16. This is important to note as the scale of the variables may have an impact of the performance of some machine learning models by affecting the weights they assign to certain variables.

#Visualisation
multi.hist(BCData[,2:31],global = F)

3.2.2 Categoric variables

To summarise the categorical outcome, compute frequencies and percentage of observations in each category using the table() and prop.table() functions. Of the 569 observations in this data, 62.7% (357) had a benign tumour while 37.3% (212) had a malignant tumour.

#Frequencies
table(BCData$diagnosis)
## 
##   B   M 
## 357 212
#M: Malignant (harmful)

#B: Benign (not harmful)

#Percentages
prop.table(summary(BCData$diagnosis))*100
##        B        M 
## 62.74165 37.25835

3.3 Bivariate Associations

3.3.1 Two numeric variables (Correlations)

The function rcorr() within the Hmisc package produces correlation coefficients and p-values for the correlation coefficient. The correlation between the first 10 predictor variables are produced

#rcorr function from Hmisc function gets correlation and p-values
res2 <- rcorr(as.matrix(BCData[,2:31]))
round(res2$r,digits = 2)
##                radius_mean texture_mean perimeter_mean area_mean
## radius_mean           1.00         0.32           1.00      0.99
## texture_mean          0.32         1.00           0.33      0.32
## perimeter_mean        1.00         0.33           1.00      0.99
## area_mean             0.99         0.32           0.99      1.00

To produce multiple scatterplots on a single graph, the pairs() function is used.

#Scatterplot matrix using pairs function
pairs(BCData[,2:11]) #Means

pairs(BCData[,12:21])#SE

pairs(BCData[,22:31])#Worst

To visualise the association between the variables, a correlation plot is produced using the corPlot() function in the psych package.

#Correlation plot (darker blue implies strong correlation)
r=round(cor(BCData[,2:31]),digits = 2)
#Correlation plot (darker blue implies strong correlation)
corPlot(r, 
        cex = 0.01,
        numbers = F,
        cex.axis=0.4)

Some variables with strong associations and potential collinearity are given below:

radius_mean vs perimeter mean, area_mean, radius_worst, perimeter worst, area_worst.

texture_mean vs texture worst.

perimeter_mean vs area_mean, radius_worst, perimeter_worst, area_worst.

Concavity_mean vs concavity.points_mean.

radius_se vs perimeter_se, area_se.

3.3.2 One numeric and one categoric

To check the association between the predictor variables and the outcome variable (diagnosis), we perform multiple t-tests using the t.test() function and the results are shown below.

#Mean tests for association between predictors and outcome

lapply(X=BCData[,2:31],
       function(X) t.test(X ~ BCData$diagnosis))

The predictors that are not significantly associated with diagnosis are given below with their p-values:

symmetry_se : p-value=0.89

smoothness_se : p-value=0.11

texture_se : p-value=0.84

fractal_dimension_mean: p-value=0.77

The two-sample t-test is a parametric test and should be used when the parametric assumptions are met. The assumptions of the t-test are normality of the variable of interest in each group and a large enough sample size (usually n>20). When the parametric assumptions are not met a non-parametric test should be used.

Using a non-parametric test (Wilcoxon signed-rank test) we can also check for associations between the predictors and diagnosis. The Wilcoxon test is performed using the wilcox.test() function. Results are shown below.

#Median(non-params) tests for association between predictors and outcome

lapply(X=BCData[,2:31],
       function(X) wilcox.test(X ~ BCData$diagnosis))

The predictor variables with non-significant association with the outcome are listed below with their p-values:

smoothness_se : p-value=0.21

texture_se : p-value=0.64

fractal_dimension_mean : p-value=0.54

Some of these variables are not normally distributed, hence the difference in significance results between t.test and Wilcoxon test. This can be seen from the box plot for fractal_dimension_se which is significant in the t-test but not in the wilcoxon test.

#Let's see box plot for fractional_dimension_se

#Fractal Dimension SE
plot(BCData$fractal_dimension_se~BCData$diagnosis)

3.4 Missing Data

Using the is.na() function, we check for missing values in the dataset. The output shows there are no missing values on any of the variables, which is good.

any(is.na(BCData))
## [1] FALSE
#No missing data

3.5 Standardisation

The predictor variables are on different scales which can often influence the weight assigned to the predictors when fitting machine learning models. To prevent this, it is common to standardise the variables so that they have a mean at 0 and standard deviation of 1. This kind of standardisation is commonly used when variables are normally distributed.

To apply standardisation on the current data, we start by creating a duplicate of the data named BCData_std and use the scale() function to standardise.

#Create duplicate of data frame
BCData_std<-BCData


#Standardise numeric variables only
BCData_std[,2:31]<-scale(BCData[,2:31],center = T,scale = T)

summary(BCData_std$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459

3.6 Principal Component Analysis (PCA)

PCA is an unsupervised learning technique that will be used to reduce the dimension of the data through linear transformation. The principal() function in the psych package will be used to perform PCA.

#Principal Component Analysis (PCA)
pc<-principal(BCData_std[,2:31],nfactors = 30)

#plot eigenvalues
plot(1:30,pc$values,xlab="Principal Component",
     ylab="Eigenvalue")

#Calculate percentage explained by principal component
pc_explained<-(pc$values/sum(pc$values))*100

plot(1:30,pc_explained,
     ylab = "percentage explained",
     xlab="principal component")

#PC1 explains 44.27%
#PC2 explains 18.97%
#PC3 explains 9.39%
#PC4 explains 6.6%
#PC5 explains 5.5%

sum(pc_explained[1:5])
## [1] 84.73427
#First five PCs explain 85% of total variation

sum(pc_explained[1:10])
## [1] 95.15688
#First 10 PCs explain 95% of total variation

From the PCA output, the first principal component explains 44.27% of the total variation, while the first 5 principal components explain 85% of the total variance. Adding the next 5 principal components increases the variance explained to 95%.

Pairwise plots of the PC scores show that the first two PC scores could likely discriminate between individuals with malignant and benign tumours.

#Plot PC scores

#PC1 vs PC2
plot(pc$scores[,1],pc$scores[,2],col=BCData_std$diagnosis)

#PC1 vs PC3
plot(pc$scores[,1],pc$scores[,3],col=BCData_std$diagnosis)

#PC1 vs PC3
plot(pc$scores[,2],pc$scores[,3],col=BCData_std$diagnosis)

3.7 Clustering

In this section, K-means clustering is used to check for any interesting patterns in the predictor variables. It is an unsupervised technique as it does not use the outcome variable in detecting the clusters. The function for k-means clustering is kmeans(). Setting centers=2 creates 2 partitions of the data and a summary of the number of observations in each cluster is given below.

#K-means clustering with k=2
km1<-kmeans(BCData_std[,2:31], centers = 2, nstart = 25)

#Summary of clusters
table(km1$cluster)
## 
##   1   2 
## 380 189

Next, the clusters are compared to the actual target (outcomes) values to check if observations were partitioned according to tumour type. To do this, the cluster labels are set to correspond to the actual outcome labels. The majority group in the clusters is given the label “B” while the minority group is defined as “M”. The results below show that out of 357 observations that actually have tumors 343 are correctly defined and overall 91% of observations are correctly assigned to their target labels.

#Actual labels
table(BCData_std$diagnosis)
## 
##   B   M 
## 357 212
#Define cluster with most values as "B" and other as "M"
c1<-which.min(table(km1$cluster))
c2<-which.max(table(km1$cluster))

clust1<-factor(x=km1$cluster,levels = c(c2,c1),
               labels = c("B","M"))

#Compare clusters with actual labels
table(BCData_std$diagnosis,clust1)
##    clust1
##       B   M
##   B 343  14
##   M  37 175
#Using a confusion matrix gives more info like sensitivity
confusionMatrix(clust1,BCData_std$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 343  37
##          M  14 175
##                                           
##                Accuracy : 0.9104          
##                  95% CI : (0.8838, 0.9325)
##     No Information Rate : 0.6274          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.804           
##                                           
##  Mcnemar's Test P-Value : 0.002066        
##                                           
##             Sensitivity : 0.9608          
##             Specificity : 0.8255          
##          Pos Pred Value : 0.9026          
##          Neg Pred Value : 0.9259          
##              Prevalence : 0.6274          
##          Detection Rate : 0.6028          
##    Detection Prevalence : 0.6678          
##       Balanced Accuracy : 0.8931          
##                                           
##        'Positive' Class : B               
## 

Next, let us cluster observations using the PC score rather than the actual values. The first 5 PC score are used for clustering below. This gives an accuracy of 87%, which is good here because only five variables (PC scores) are used to build clusters. The PC scores correctly identified 356 of the 357 observations with benign tumours but wrongly identified 73 observations with malignant tumours as benign.

#Clustering with PC
km1_pc<-kmeans(pc$scores[,1:5], 
               centers = 2, nstart = 25)

table(km1_pc$cluster)
## 
##   1   2 
## 429 140
table(BCData_std$diagnosis)
## 
##   B   M 
## 357 212
#Define cluster with most values as "B" and other as "M"
c1<-which.min(table(km1_pc$cluster))
c2<-which.max(table(km1_pc$cluster))

clust2<-factor(x=km1_pc$cluster,levels = c(c2,c1),
               labels = c("B","M"))



table(BCData_std$diagnosis,clust2)
##    clust2
##       B   M
##   B 356   1
##   M  73 139
#Using a confusion matrix gives more info like sensitivity
confusionMatrix(clust2,BCData_std$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 356  73
##          M   1 139
##                                           
##                Accuracy : 0.8699          
##                  95% CI : (0.8395, 0.8965)
##     No Information Rate : 0.6274          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7012          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9972          
##             Specificity : 0.6557          
##          Pos Pred Value : 0.8298          
##          Neg Pred Value : 0.9929          
##              Prevalence : 0.6274          
##          Detection Rate : 0.6257          
##    Detection Prevalence : 0.7540          
##       Balanced Accuracy : 0.8264          
##                                           
##        'Positive' Class : B               
## 

So far, k=2 has been used to define the number of clusters for k-means clustering. To determine the optimal number of clusters, we calculate the silhouette score using the silhouette function within the cluster package.

kvals=2:5 #set number of clusters

sil_score=numeric(length(kvals))#define empty vector to store silhouette scores

#Loop to cluster data and compute silhouette score for k=2:5
for(i in 1:length(kvals)){
  km<-kmeans(BCData_std[,2:31], centers = kvals[i], nstart = 25)
  SC<-silhouette(km$cluster, dist(BCData_std[,2:31]))
  sil_score[i]<-mean(SC[,3])
}

print(cbind(kvals, sil_score))  
##      kvals sil_score
## [1,]     2 0.3449740
## [2,]     3 0.3143840
## [3,]     4 0.2798136
## [4,]     5 0.1697861

3.8 Supervised Learning (Classification)

Using linear discriminant analysis, we will classify individuals as either having benign or malignant tumours using their predictor values. The linear discriminant analysis is used here and out-of-sample performance is assessed through cross-validation. The main functions for supervised learning are trainControl() and train(). In trainControl(), the type of cross-validation to use is defined and other arguments can specify the output that are generated from the model. The function train() is used to specify the machine learning algorithm to use. The train() and trainControl() functions are contained in the caret package.

The first step in supervised learning is to specify the input (predictor) and output (outcome) variables. Here, we define them as \(X\) and \(y\) respectively.

#Define outcome and predictors

X=BCData_std[, 2:31]
y=BCData_std$diagnosis

3.8.1 LDA with Hold-Out Cross-Validation

Next, split the data into a training a test set using hold-out cross-validation. The training set is made up of 80% of observations and the test set is the remaining 20%.

#Split data into train and test set using hold-out: 

#80% training, 20% testing
ctrl_holdout<-trainControl(method="LGOCV",p=0.8,
                           savePredictions = "all", 
                           returnResamp = 'all',
                           classProbs = TRUE,
                           number = 1)

To classify the observations using LDA, the MASS package is required before using the function train(). In the train() function we have specified method="lda" to get predictions based on linear discriminant analysis.

library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
model_lda_hld<-train(X, y, trControl = ctrl_holdout, 
                     method="lda")

print(model_lda_hld)
## Linear Discriminant Analysis 
## 
## 569 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (1 reps, 80%) 
## Summary of sample sizes: 456 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9646018  0.9234677
model_lda_hld$results

The results above shows that with LDA, we achieve an accuracy of 96.5%. More details on the performance of the LDA can be seen with the confusion matrix which shows a sensitivity of 98.6% and specificity of 92.9%. In this case the benign class is treated as the positive class. The positive class is set by default as the class that comes first. This can also be specified in the confusionMatrix() function.

#Get summary with confusion matrix

confusionMatrix(data=model_lda_hld$pred$pred, 
                reference = model_lda_hld$pred$obs)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  3
##          M  1 39
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9235          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9589          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6460          
##       Balanced Accuracy : 0.9572          
##                                           
##        'Positive' Class : B               
## 

3.8.2 LDA with 10-fold Cross-Validation

The performance of LDA will also be assessed with a 10-fold cross-validation. With k-fold cross we are able to explore more possible splits of the data into a training and test set. The result are shown below. The cross-validation classification accuracy is 95.6%. However, the sensitivity and specificity are not given.

#10-fold cv
ctrl_Kfold<-trainControl(method="cv",number = 10)

model_lda_kf<-train(X, y, trControl = ctrl_Kfold, method="lda")

print(model_lda_kf)
## Linear Discriminant Analysis 
## 
## 569 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9562257  0.9037376
model_lda_kf$results

The classification results gives a sensitivity of 99.4% and specificity of 89.2% as shown below. To get the sensitivity and specificity, an additional argument (summaryFunction = twoClassSummary) is required in trainControl().

ctrl_Kfold<-trainControl(method="cv",
        number = 10,
        classProbs = TRUE,
        savePredictions ='all',
        returnResamp = 'all',
        summaryFunction = twoClassSummary)

model_lda_kf<-train(X, y, trControl = ctrl_Kfold, method="lda")
## Warning in train.default(X, y, trControl = ctrl_Kfold, method = "lda"): The
## metric "Accuracy" was not in the result set. ROC will be used instead.
print(model_lda_kf)
## Linear Discriminant Analysis 
## 
## 569 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.9910106  0.9943651  0.8919913
model_lda_kf$results

It is also possible to calculate the classification accuracy across all the folds by using confusionMatrix(). However, the results should not be reported as it defeats the purpose of cross-validation. An exception can be made when one wants to see the performance of the classifier for a subset of observations.

#Confusion matrix across all folds
confusionMatrix(model_lda_kf$pred$obs,
                model_lda_kf$pred$pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 355   2
##          M  23 189
##                                           
##                Accuracy : 0.9561          
##                  95% CI : (0.9358, 0.9714)
##     No Information Rate : 0.6643          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9041          
##                                           
##  Mcnemar's Test P-Value : 6.334e-05       
##                                           
##             Sensitivity : 0.9392          
##             Specificity : 0.9895          
##          Pos Pred Value : 0.9944          
##          Neg Pred Value : 0.8915          
##              Prevalence : 0.6643          
##          Detection Rate : 0.6239          
##    Detection Prevalence : 0.6274          
##       Balanced Accuracy : 0.9643          
##                                           
##        'Positive' Class : B               
## 

3.9 Sensitivity Analysis

In this section, we will perform sensitivity analysis to see how the performance of the classifier will be affected by some changes to the variables.

3.9.1 Remove all “worst” variables

We start by removing all the “worst” variables as it is computed using just a single value unlike the “mean” and “se” variables.

#Identify and remove "worst" variables
vals<-grep("worst", names(X))

X_reduced<-X[,-vals]

When the “worst” variables are removed, the classification accuracy becomes 93.7%, which is about 2% less than when all variables were used. This is an acceptable drop in accuracy since 10 variables have been removed.

#Kfold with reduced variables
ctrl_Kfold<-trainControl(method="cv",number = 10)

model_lda_kf<-train(X_reduced, y, trControl = ctrl_Kfold, method="lda")

print(model_lda_kf)
## Linear Discriminant Analysis 
## 
## 569 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ... 
## Resampling results:
## 
##   Accuracy  Kappa   
##   0.936895  0.859126
model_lda_kf$results

3.9.2 Remove highly correlated values

Next we check how the classification accuracy is affected by removing variables with possible multi-collinearity problems. The function findCorrelation() in caret is used to detect highly correlated variables which can then be removed prior to model training. Any two Variables with a correlation of at least 0.9 are considered highly correlated and the variable with the largest mean absolute correlation with other variables is removed.

#findCorrelation() finds variables with largest mean absolute correlation for removal
corr_vals<-findCorrelation(r,cutoff = 0.9)

names(X)[corr_vals]#Variables removed
##  [1] "concavity_mean"      "concave.points_mean" "perimeter_worst"    
##  [4] "radius_worst"        "perimeter_mean"      "area_worst"         
##  [7] "radius_mean"         "perimeter_se"        "area_se"            
## [10] "texture_mean"
X_reduced <- X[,-corr_vals]

Ten variables are removed as a result of very high correlations. The classification accuracy with the possible multi-collinearity among predictors removed is 95.6%. This accuracy which is based on 20 predictors is similar to the accuracy our classifier achieved with 30 predictors.

#Kfold with reduced variables
ctrl_Kfold<-trainControl(method="cv",number = 10)

model_lda_kf<-train(X_reduced, y, trControl = ctrl_Kfold, method="lda")

print(model_lda_kf)
## Linear Discriminant Analysis 
## 
## 569 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9561933  0.9042792
model_lda_kf$results

3.9.3 Classification using PCA

Finally, the principal component scores are used as predictors to compute the classification accuracy. When the first 20 PC scores are used as predictors, the classifier achieves an accuracy of 95.5%.

ctrl_Kfold<-trainControl(method="cv",number = 10)

model_lda_kf<-train(pc$scores[,1:20], y, trControl = ctrl_Kfold, method="lda")

print(model_lda_kf)
## Linear Discriminant Analysis 
## 
## 569 samples
##  20 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9545621  0.9004795
model_lda_kf$results

4 Titanic Survival Prediction

In this example, we are interested in the predicting the chance of survival in the Titanic disaster. Information on passengers such as their socio-economic status, gender, age are used as predictors of survival. The Titanic was a ship which set sail from Southampton (England) was expected to end its journey in New York (USA). However, the ship reportedly struck an iceberg near Newfoundland (Canada) and led to death of some passengers aboard the ship.

4.1 Load data and pre-processing

#Load Titanic data
Titanic_data<- read.csv("trainTitanic.csv")

Next, we can view the first few observations of the dataset.

head(Titanic_data)
dim(Titanic_data)
## [1] 891  12

There are some redundant columns in the data which will not be used for any further analysis and they need to be removed. The columns are Name, PassengerId, Ticket, and Cabin.

The binary operator %in% is used to identify the columns with these names for removal. There are originally 12 columns and this is reduced to 8. The predictor variable is binary and indicates whether or not a passenger survived.

remove_vars<-
  which(names(Titanic_data)%in%c('PassengerId','Name',
                                 'Ticket','Cabin'))


Titanic_data<- Titanic_data[,-remove_vars]

dim(Titanic_data)
## [1] 891   8

With the data loaded and unwanted variables removed, it is now time to check that variables have been corretly defined as either numeric or categoric. For example, in the outcome variable Survived, the levels are defined as yes and no. Other variables like Pclass (which denotes a passenger’s ticket type and is a measure of socio-economic status) and Embarked (which specifies where the passenger boarded the ship or embarked on their journey) are defined as categoric. The ports of embarkation are:

C = Cherbourg, Q = Queenstown, S = Southampton.

#Check that variables are correctly defined as numeric and cat
sapply(X=Titanic_data,FUN=class)
##    Survived      Pclass         Sex         Age       SibSp       Parch 
##   "integer"   "integer" "character"   "numeric"   "integer"   "integer" 
##        Fare    Embarked 
##   "numeric" "character"
#Define categoric variables

Titanic_data$Survived<-factor(x=Titanic_data$Survived,
                              levels=c(0,1),
                              labels=c("No","Yes"))

Titanic_data$Pclass<-factor(x=Titanic_data$Pclass,
                              levels=c(1,2,3),
                              labels=c("1st","2nd","3rd"))


Titanic_data$Sex<-factor(Titanic_data$Sex)

Titanic_data$Embarked<-factor(Titanic_data$Embarked)

table(Titanic_data$Embarked)
## 
##       C   Q   S 
##   2 168  77 644

There are some blank cells in the variable Embarked and they need to be defined as missing values. There are 2 missing values for Embarked.

#Define the blank cell for Embarked as NA
Titanic_data$Embarked[Titanic_data$Embarked==""]<-NA

sum(is.na(Titanic_data$Embarked)) 
## [1] 2
#Drop unused levels of Embarked
Titanic_data$Embarked<-droplevels(Titanic_data$Embarked)

table(Titanic_data$Embarked)
## 
##   C   Q   S 
## 168  77 644

The variables SibSp and Parch which denote number of siblings/spouses and number of parents/children aboard the Titanic respectively are defined as numeric.

#Define numeric variables
Titanic_data$SibSp<-as.numeric(Titanic_data$SibSp)

Titanic_data$Parch<-as.numeric(Titanic_data$Parch)

4.2 Univariate Summary

To summarise the variables, we need to create a new indicator variable to define the variable type in each column.

#Univariate summary
var_type<-sapply(X=Titanic_data,FUN=class)

4.2.1 Numeric Variables

For numeric variables, we plot histograms to assess the distribution of the data and use the summary() function to compute summary statistics like the mean, median and quartiles. The variable Age appears approximately normally distributed while SibSp, Parch and Fare are positively skewed. There are 177 missing values for the variable Age.

#Numeric variables: Histogram
multi.hist(Titanic_data[,which(var_type=="numeric")],
           global=F)

sapply(X=Titanic_data[,which(var_type=="numeric")], 
       FUN=summary)

4.2.2 Categoric Variables

To summarise the categoric variables, compute the frequency and percentage in each category.

#Categoric variables

#Frequencies
lapply(X=Titanic_data[,which(var_type=="factor")], 
       FUN=table)

#Percentages
lapply(X=Titanic_data[,which(var_type=="factor")], 
       FUN=function(X)  prop.table(table(X))*100)

4.3 Bivariate associations

4.3.1 Two numeric variables (correlations)

There are no highly correlated variables from looking at both the scatter plots and the correlation plot. Since there some missing values in Age, the argument use="complete.obs" is used to compute correlation for only observations with complete data.

#Two numeric
pairs(Titanic_data[,which(var_type=="numeric")])

r=round(cor(Titanic_data[,which(var_type=="numeric")],
            use ="complete.obs"),
        digits = 2)

corPlot(r, cex = 0.3,numbers = F)


#No variables with high correlations

4.3.2 Two categoric variables (Contigency tables)

Compute the bivariate association between the categoric variables using the table() and prop.table() functions. The results show a higher survival rate among the passengers in first class compared to those in second and third class. There was always higher survival rates among the women compared to the men. Survival was significantly associated with passenger class, port of embarkation and gender.

#Survived and pclass
t1<-table(Titanic_data$Survived,Titanic_data$Pclass)

round(prop.table(t1,margin = 2),digits = 2)

#Higher survival rates in first class


#Survived and sex
t2<-table(Titanic_data$Survived,Titanic_data$Sex)

round(prop.table(t2,margin = 2),digits = 2)

#Survived and Embarked
t3<-table(Titanic_data$Survived,Titanic_data$Embarked)

round(prop.table(t3,margin = 2),digits = 2)

#Embarked and Pclass
t4<-table(Titanic_data$Embarked,Titanic_data$Pclass)

round(prop.table(t4,margin = 1),digits = 2)

#Sex and Pclass
t5<-table(Titanic_data$Sex,Titanic_data$Pclass)

round(prop.table(t5,margin = 2),digits = 2)

#Sex and Embarked
t6<-table(Titanic_data$Sex,Titanic_data$Embarked)
t6
round(prop.table(t6,margin = 1),digits = 2)

#Association between numeric variables and survived
lapply(Titanic_data[,which(var_type=="factor")],
       FUN=function(X)chisq.test(table(X,Titanic_data$Survived)))

4.3.3 One numeric and one categoric

Conducting multiple t-test, we can check for associations between the numeric predictors and the outcome variable Survived. The variables Age, Parch and Fare are significantly associated with survival. These were found by conducting t-tests. The output below shows that those who survived paid significantly higher fare on average than those that did not. Comparing age shows that those who survived are significantly younger on average than those who did not survive.

#Association between numeric variables and survived
lapply(Titanic_data[,which(var_type=="numeric")],
       FUN=function(X)t.test(X~Titanic_data$Survived))

4.4 Missing data summary and visualisation

The function md.pattern() in mice is used to visualise the missing values. There are 712 observations with no missing information, 177 observations have missing values on Age and 2 observations have missing values on Embarked.

md.pattern(Titanic_data)

##     Survived Pclass Sex SibSp Parch Fare Embarked Age    
## 712        1      1   1     1     1    1        1   1   0
## 177        1      1   1     1     1    1        1   0   1
## 2          1      1   1     1     1    1        0   1   1
##            0      0   0     0     0    0        2 177 179
#177 missing values for age

#2 missing values for embarked

To check the missingness mechanism, we create missing indicator variables for Age and Embarked and compute associations between missing values and other observed variables. A significant association could be an indication that missing values are MAR and not MCAR. Some interesting results found are the significant association between missingness on age and survival. For those missing age, 29.4% of them survived, while for those with age available 40.6% survived. Missingness on age was also significantly associated with socio-economic status, port of embarkation, fare and number of parents/children aboard the Titanic. The relationship between missingness on age and the other variables suggest that missingness may be MAR.

#Check if age missingness is associated with survived
#create dummy for missingness

missing_Age<-is.na(Titanic_data$Age)

#Missing Age vs Survived
t_ms1<-table(missing_Age,Titanic_data$Survived)

prop.table(t_ms1,margin=1)

chisq.test(t_ms1)
#Missingness on Age is associated with Survival

#Those whose Age are missing less likely to survive
#May be on lower income level

#Missing Age vs Sex
t_ms2<-table(missing_Age,Titanic_data$Sex)

prop.table(t_ms2,margin=1)

chisq.test(t_ms2)
#Missingnes on Age not associated with Sex

#Missing Age vs Pclass
t_ms3<-table(missing_Age,Titanic_data$Pclass)

prop.table(t_ms3,margin=1)

chisq.test(t_ms3)
#Missingness on Age associated with Pclass

#Missing Age vs Embarked
t_ms4<-table(missing_Age,Titanic_data$Embarked)

prop.table(t_ms4,margin=1)

chisq.test(t_ms4)
#Missingness on Age associated with Embarked

#Missing Age vs # of Siblings
t.test(Titanic_data$SibSp~missing_Age)

tapply(X=Titanic_data$SibSp,INDEX=missing_Age,FUN=mean)
#Missingness on Age not associated with no of siblings


#Missing Age vs # of parents
t.test(Titanic_data$Parch~missing_Age)

tapply(X=Titanic_data$Parch,INDEX=missing_Age,FUN=mean)
#Missingness on Age is associated with no of parents
#Those with parents aboard were less to be missing Age

#Missing Age vs Fare
t.test(Titanic_data$Fare~missing_Age)

tapply(X=Titanic_data$Fare,INDEX=missing_Age,FUN=mean)
#Missingness on Age is associated with Fare
#Those with missing Age paid less Fare on average

4.5 Impute missing values

Since missing values are associated with other observed information, we can estimate those missing values using regression imputation, where the variables with missing values are used as the outcome and other variables are used to predict the missing values. The type of regression model used to estimate missing values will depend on the type of variable that has missing values. Linear regression is used for numeric variables, while logistic regression is used for binary variables.

#Impute missing data using Regression Imputation

reg_Titanic <- mice(Titanic_data, m=1, defaultmethod = c('norm.predict','logreg'))

#Imputed data
Impute_Titanic <- complete(data=reg_Titanic)

A copy of the data with list-wise deletion is also created, for use later.

#Create data with only complete cases
cc<-complete.cases(Titanic_data)

Titanic_cc<-Titanic_data[cc,]#Complete cases data
#712 complete cases

4.6 Classification using random forest

To classify using random forest the package randomForest needs to be installed and the argument method='rf' will be used in the train() function. The data with missing values imputed using regression imputation will be used for classification. This gives an accuracy of 84.3%.

Xtitanic<-Impute_Titanic[,2:8]
ytitanic<-Impute_Titanic$Survived
install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
#mtry: Number of variables randomly sampled as candidates at each split.

#K-fold cross-validation

ctrl_Kfold<-trainControl(method="cv",number = 10)

#Random forest classification
model_rf<-train(Xtitanic, ytitanic, 
                trControl = ctrl_Kfold, 
                method="rf")

print(model_rf)
## Random Forest 
## 
## 891 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 803, 802, 801, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8428306  0.6573035
##   4     0.8338668  0.6426292
##   7     0.8170869  0.6085680
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
model_rf$results

4.6.1 Classification using random forest with listwise deletion

The classification process is repeated with using only cases with complete data. This will have a sample size of 712. The accuracy obtained in this case is 82.3%.

#Specify outcome and predictors for complete cases 
Xtitanic_cc<-Titanic_cc[,2:8]
ytitanic<-Titanic_cc$Survived


#K-fold cross-validation

ctrl_Kfold<-trainControl(method="cv",number = 10)

model_rf<-train(Xtitanic_cc, ytitanic, 
                trControl = ctrl_Kfold, 
                method="rf")

print(model_rf)
## Random Forest 
## 
## 712 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 641, 641, 641, 641, 640, 641, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8230634  0.6241823
##   4     0.8048122  0.5891226
##   7     0.8006064  0.5826426
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
model_rf$results

4.6.2 Model Tuning: ntree and mtry

The random forest classifier using two main hyperparameters mtry and ntree. The hyperparameter mtry defines the umber of variables randomly sampled as candidates at each split. The hyperparameter ntree defines the number of classification trees to be generated to get a random forest. The function train() uses ntree=500 and mtry=sqrt(p) by default. Where \(p\) is the number of predictors. We can specify different values of mtry from 1 to 7 and set ntree=50. The function expand.grid() is used to set several values of mtry to use in the random forest classifier. The argument ntree=50 is specified in train(). Only one value of ntree be specified at a time. The random forest classifier with hyperparameter tuning gives classification accuracy of 83.8%%.

# Tuning the parameter of RF (Imputed)

#specify: mtry and ntree

Xtitanic<-Impute_Titanic[,2:8]

ytitanic<-Impute_Titanic$Survived


ctrl_Kfold<-trainControl(method="cv",number = 10)

#Specify tuning parameters mtry and ntree

grid_search<-expand.grid(.mtry=1:7)

#Add tuneGrid argument
model_rf<-train(Xtitanic, ytitanic, 
                trControl = ctrl_Kfold, 
                method="rf",
                tuneGrid = grid_search,
                ntree=50)

print(model_rf)
## Random Forest 
## 
## 891 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 803, 802, 801, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.8225925  0.6010523
##   2     0.8338163  0.6401468
##   3     0.8383992  0.6504886
##   4     0.8360637  0.6474715
##   5     0.8237782  0.6221869
##   6     0.8159508  0.6075236
##   7     0.8148147  0.6039820
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
model_rf$results

5 Boston House Price Prediction

This example involves predicting the median house prices using census tract housing data in Boston. The outcome variable is the median value of owner occupied home and predictors such as per capita crime rate, property tax rate, town, longitude and latitude of census tract.

5.1 Load data and pre-processing

The data is available from the mlbench package, which needs to be installed.

install.packages("mlbench")
library(mlbench)
#Load data
data("BostonHousing2")

Once the data is loaded, we can explore the first few observations.

head(BostonHousing2)

The data contains 17 predictors and the outcome variables defined as medv (median house value) and cmedv (corrected median house value). We will use the corrected median house value here as the target variable for the supervised learning model.

The next step is to clean the data. Looking through the data, some categorical variables do not have their levels defined and some numeric variables are defined as integers. The variable medv will also be removed.

#Define categoric variables
BostonHousing2$chas<-factor(BostonHousing2$chas,
        levels = c(0,1),
        labels = c("Does not bound Charles River",
        "Bounds Charles River"))

BostonHousing2$town<-factor(BostonHousing2$town)

#Define numeric variables
BostonHousing2$tax<-as.numeric(BostonHousing2$tax)

#Remove medv (since there is cmedv) and tract (an id variable)
BostonHousing2<-BostonHousing2[,
-which(names(BostonHousing2)%in%c("medv","tract"))]

5.2 Univariate Summary

The univariate summary shows that the variables crim, zn, disand lstat are positively skewed while age, tax and b are negatively skewed. The other variables are approximately symmetric. Of the 506 tracts, 471 bounded the Charles rives while the others did not. The variable town has 92 levels with some levels having only 1 observation. This may cause some problems with cross-validation.

#Univariate Summary
boston_vars<-sapply(X=BostonHousing2,FUN=class)


#Numeric summary

multi.hist(BostonHousing2[,
        which(boston_vars=="numeric")],
        global=F)

hist(BostonHousing2$rad)

sapply(BostonHousing2[,which(boston_vars=="numeric")],
       FUN=summary)

#Positive skew: crim, zn, dis,lstat

#Negative skew: age, tax, b


#Categoric summary

sapply(BostonHousing2[,which(boston_vars=="factor")],
       FUN=table)

length(levels(BostonHousing2$town))

5.3 Bivariate Associations

There are no numeric variables that are highly correlated, using an absolute cutoff of 0.9 for the correlation coefficients. There is a strong negative correlation between lstat (percentage of population on a lower status) and the outcome cmedv from the scatterplot below. The variable rm (average number of rooms) has a strong positive association with cmedv and a negative association with lstat. There was a significant association between the categoric variables chas and town. The variable chas is significantly associated with cmedv, lon, indus and ptratio. The houses that bound the Charles had significantly higher median value than those that did not. The median house values was also found to be significantly different across the various towns.

5.3.1 Two numeric variables

#Bivariate associations

#Two numeric variable

pairs(BostonHousing2[,which(boston_vars=="numeric")])

r=cor(BostonHousing2[,which(boston_vars=="numeric")])

#Check for highly correlated variables
findCorrelation(r,cutoff =0.9)
## integer(0)
#None detected

5.3.2 Two categoric

#Two categoric

#Contigency table
tb1<-table(BostonHousing2$chas,BostonHousing2$town)

#Fisher's exact test
fisher.test(tb1,simulate.p.value = TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  tb1
## p-value = 0.0004998
## alternative hypothesis: two.sided
#Significant association between chas and town

5.3.3 One numeric and one categoric

#One Numeric and One Categoric

#cmedv and chas
t.test(BostonHousing2$cmedv~BostonHousing2$chas)
#Significant difference in price by chas


#cmedv and town
kruskal.test(BostonHousing2$cmedv~BostonHousing2$town)

#Significant association between town and median price


sapply(BostonHousing2[,which(boston_vars=="numeric")],
       FUN=function(X)wilcox.test(X~BostonHousing2$chas))

#chas associated with:
#lon
#indus
#ptratio

5.4 Missing data

There are no variables with missing values.

#Missing Data
any(is.na(BostonHousing2))
## [1] FALSE
#No variable with missing values

5.5 Linear Regression

A linear regression model is fitted to the data by setting method="glm" in train(). The outcome variable is cmedv and all the predictors are used in the model. In this instance 10-fold cross-validation with 10 repetitions is used in computing the out-sample performance measures. Repeated k-fold cross-validation helps improve precision (get lower variance). This gives an r-squared of 0.85, with RMSE=3.52.

Xboston<-BostonHousing2[,
      -which(names(BostonHousing2)=="cmedv")]

yboston<-BostonHousing2$cmedv
#10-fold cross-validation with 10 repeats
ctrl_Kfold<-trainControl(method="repeatedcv",
                         number = 10,
                         repeats = 10)


model_bsnet<-train(Xboston, yboston, 
                trControl = ctrl_Kfold, 
                method="glm")

print(model_bsnet)
## Generalized Linear Model 
## 
## 506 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   3.521034  0.8499445  2.20802
model_bsnet$results

5.6 Linear Regression with levels of town combined

The predictor town 92 levels with some levels having only one observation. This may cause some overfitting problems and also be problematic when splitting the data into a training and test set as the training set may not contain a level which is then in the test set. If a dummy variable was not created for a level in the training set, then predictions will be impossible for any fold containing such levels. Therefore, we combine levels of town with 3 observations or less and name this new level others. The r-squared value for this linear regression model is 0.82 and RMSE is 3.89. This does not show an obvious improvement in predictions. However, this is expected to give more stable predictions.

#install package to combine levels of town
install.packages("rockchalk")
library(rockchalk)
## Warning: package 'rockchalk' was built under R version 4.4.3
# Tune model (Remove some levels of town)--------------------------------------------------------------
Xboston2<-Xboston

#Find levels with 3 levels or less
rm_levels<-table(Xboston2$town)<=3

sum(table(Xboston2$town)<=3)

loc_levels<-which(rm_levels)

name_levels<-names(table(Xboston2$town))

name_rm_levels<-name_levels[loc_levels]



#Combine levels
Xboston2$town<-combineLevels(Xboston2$town,
                levs=name_rm_levels,
              newLabel = c("Others"))
#Number of levels after combination
length(levels(Xboston2$town))
## [1] 51
#Build Model
ctrl_Kfold<-trainControl(method="repeatedcv",
                         number = 10,
                         repeats = 10)


model_bsnet2<-train(Xboston2, yboston, 
                   trControl = ctrl_Kfold, 
                   method="glm")

print(model_bsnet2)
## Generalized Linear Model 
## 
## 506 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.889367  0.8243728  2.607308
model_bsnet2$results

5.7 Stepwise Linear Regression using AIC

The stepwise linear regression using AIC gives an r-squared of 0.83, RMSE=3.87. Using the stepwise method for variable selection did not lead to an improvement in prediction.

# Step wise regression using AIC (levels combined) -----------------------------------

ctrl_Kfold<-trainControl(method="repeatedcv",
                         number = 10,
                         repeats = 10)

model_bsnet3<-train(Xboston2, yboston, 
                    trControl = ctrl_Kfold, 
                    method="glmStepAIC")
print(model_bsnet3)
## Generalized Linear Model with Stepwise Feature Selection 
## 
## 506 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.869267  0.8258016  2.604477
model_bsnet3$results

5.8 Random Forest Prediction

With random forest the prediction improved with an r-squared of 0.90 and RMSE of 2.92. This is an improvement on the prediction from linear regression.

# Using Random Forest on Boston housing Data

ctrl_Kfold<-trainControl(method="repeatedcv",
                         number = 10,
                         repeats = 10)
model_bs_rf<-train(Xboston2, yboston, 
                    trControl = ctrl_Kfold, 
                    method="rf")

print(model_bs_rf)
## Random Forest 
## 
## 506 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 457, 455, 455, 455, 455, 455, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    3.413198  0.8800161  2.284545
##    9    2.920345  0.9039607  1.967742
##   16    2.987373  0.8961365  2.015499
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
model_bs_rf$results