This document demonstrates the use of R for machine learning. This is done by applying machine learning techniques to some actual data. The data used are all secondary and obtained from Kaggle and UCI machine learning repository websites.
The links to the dataset are given below:
Breast cancer: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Titanic: https://www.kaggle.com/c/titanic/data
Boston Housing data: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
Note that the corrected Boston Housing data is available in R: data(BostonHousing2)
Ames Housing data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The main functions for machine learning are available in the
caret package. This package contains
functions that can be used to train supervised machine learning models.
The main functions for training models in
caret are trainControl() and
train(). The function trainControl is used to
split observations into a training and test set. The
trainControl() function however is used to specify the
machine learning model to use fro predictions. In this function, other
parameters of the model can be specified.
Other packages that are useful pre-processing and cleaning the data
prior to training are psych for getting
correlation plots, performing principal component analysis and plotting
multiple histograms and Hmisc for
computing correlation matrices and p-values.
The package mice is useful for dealing
with missing values and can be used for missing data imputation
techniques such as regression imputation and multiple imputation. The
package VIM is used to summary information
relating to missing value and for generating relevant displays of
missing data patterns.
In this document we will build machine learning models for both regression and classification problems and in each case pre-process the data prior to model fitting.
#Caret package for machine learning
install.packages("caret")
library(caret)
#Psych package for Correlation plots and multiple histograms
install.packages("psych")
library(psych)
#mlbench package for Boston dataset
install.packages("mlbench")
library(mlbench)
#Hmisc for generating correlation matrix and p.values
install.packages("Hmisc")
library("Hmisc")
#cluster for estimating silhouette score and gap statistics
#[usually comes pre-installed]
install.packages('cluster')
library(cluster)
#VIM for exploring missing data
install.packages("VIM")
require(VIM)
#mice for missing data imputation
install.packages('mice')
require(mice)
Image of a malignant breast FNA. Source:https://www.researchgate.net/publication/2302195_Breast_Cancer_Diagnosis_and_Prognosis_Via_Linear_Programming
In this example we have data on 569 patients and are interested in predicting whether the tumuor from digitised breast mass images are benign or malignant. A total of 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The outcome variable is binary: malignant vs benign.
Th first step is to load the data using the read.csv()
function.
BCData <- read.csv('data_breast_cancer.csv')
View the first few observations using head() function.
This shows that all the input variables are numeric and the outcome has
two values ‘M’ and ‘B’ and there is also an id variable.
head(BCData)
Next, we clean the data. This involves deleting the first column,
which is the id of the patients and will not be used for any further
analysis. The outcome variable which is defined as diagnosis is
a categoric variable and should be defined as such using
factor() function.
BCData <- BCData[,2:32] #Remove id variable
#Define diagnosis as categoric
BCData$diagnosis <- factor(BCData$diagnosis)
Now that the data has been cleaned up and irrelevant variables removed, it is time to summarise the information provided in the dataset. The first is a univariate summary and the type of summary will depend on if the variable is numeric or categoric.
The summary() function is applied to all the numeric
variables by passing it as an argument in the sapply()
function.
sapply(X=BCData[,2:31],FUN=summary)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## Min. 6.98100 9.71000 43.79000 143.5000 0.05263000
## 1st Qu. 11.70000 16.17000 75.17000 420.3000 0.08637000
## Median 13.37000 18.84000 86.24000 551.1000 0.09587000
## Mean 14.12729 19.28965 91.96903 654.8891 0.09636028
## 3rd Qu. 15.78000 21.80000 104.10000 782.7000 0.10530000
## Max. 28.11000 39.28000 188.50000 2501.0000 0.16340000
The multi.hist() function within the
psych package can be used to plot multiple
histograms for numeric variables. The argument global = F
prevents the function from using the same axis values for all plots,
which is important here since the variables are on different scales.
The histograms show that variables texture_mean,
smoothness_mean and symmetry_mean are
approximately normally distributed and the other variables show a
positive skew.
The histograms and summary of variables show that the variables are
on different scales. For example, the perimeter_mean
variable takes values between 143.5 and 2501, while the variable
smoothness_mean takes values between 0.05 and 0.16. This is
important to note as the scale of the variables may have an impact of
the performance of some machine learning models by affecting the weights
they assign to certain variables.
#Visualisation
multi.hist(BCData[,2:31],global = F)
To summarise the categorical outcome, compute frequencies and
percentage of observations in each category using the
table() and prop.table() functions. Of the 569
observations in this data, 62.7% (357) had a benign tumour while 37.3%
(212) had a malignant tumour.
#Frequencies
table(BCData$diagnosis)
##
## B M
## 357 212
#M: Malignant (harmful)
#B: Benign (not harmful)
#Percentages
prop.table(summary(BCData$diagnosis))*100
## B M
## 62.74165 37.25835
The function rcorr() within the
Hmisc package produces correlation
coefficients and p-values for the correlation coefficient. The
correlation between the first 10 predictor variables are produced
#rcorr function from Hmisc function gets correlation and p-values
res2 <- rcorr(as.matrix(BCData[,2:31]))
round(res2$r,digits = 2)
## radius_mean texture_mean perimeter_mean area_mean
## radius_mean 1.00 0.32 1.00 0.99
## texture_mean 0.32 1.00 0.33 0.32
## perimeter_mean 1.00 0.33 1.00 0.99
## area_mean 0.99 0.32 0.99 1.00
To produce multiple scatterplots on a single graph, the
pairs() function is used.
#Scatterplot matrix using pairs function
pairs(BCData[,2:11]) #Means
pairs(BCData[,12:21])#SE
pairs(BCData[,22:31])#Worst
To visualise the association between the variables, a correlation
plot is produced using the corPlot() function in the
psych package.
#Correlation plot (darker blue implies strong correlation)
r=round(cor(BCData[,2:31]),digits = 2)
#Correlation plot (darker blue implies strong correlation)
corPlot(r,
cex = 0.01,
numbers = F,
cex.axis=0.4)
Some variables with strong associations and potential collinearity are given below:
radius_mean vs perimeter mean,
area_mean, radius_worst,
perimeter worst, area_worst.
texture_mean vs texture worst.
perimeter_mean vs area_mean,
radius_worst, perimeter_worst,
area_worst.
Concavity_mean vs
concavity.points_mean.
radius_se vs perimeter_se,
area_se.
To check the association between the predictor variables and the
outcome variable (diagnosis), we perform multiple t-tests
using the t.test() function and the results are shown
below.
#Mean tests for association between predictors and outcome
lapply(X=BCData[,2:31],
function(X) t.test(X ~ BCData$diagnosis))
The predictors that are not significantly associated with diagnosis are given below with their p-values:
symmetry_se : p-value=0.89
smoothness_se : p-value=0.11
texture_se : p-value=0.84
fractal_dimension_mean: p-value=0.77
The two-sample t-test is a parametric test and should be used when the parametric assumptions are met. The assumptions of the t-test are normality of the variable of interest in each group and a large enough sample size (usually n>20). When the parametric assumptions are not met a non-parametric test should be used.
Using a non-parametric test (Wilcoxon signed-rank test) we can also
check for associations between the predictors and diagnosis. The
Wilcoxon test is performed using the wilcox.test()
function. Results are shown below.
#Median(non-params) tests for association between predictors and outcome
lapply(X=BCData[,2:31],
function(X) wilcox.test(X ~ BCData$diagnosis))
The predictor variables with non-significant association with the outcome are listed below with their p-values:
smoothness_se : p-value=0.21
texture_se : p-value=0.64
fractal_dimension_mean : p-value=0.54
Some of these variables are not normally distributed, hence the
difference in significance results between t.test and Wilcoxon test.
This can be seen from the box plot for fractal_dimension_se
which is significant in the t-test but not in the wilcoxon test.
#Let's see box plot for fractional_dimension_se
#Fractal Dimension SE
plot(BCData$fractal_dimension_se~BCData$diagnosis)
Using the is.na() function, we check for missing values
in the dataset. The output shows there are no missing values on any of
the variables, which is good.
any(is.na(BCData))
## [1] FALSE
#No missing data
The predictor variables are on different scales which can often influence the weight assigned to the predictors when fitting machine learning models. To prevent this, it is common to standardise the variables so that they have a mean at 0 and standard deviation of 1. This kind of standardisation is commonly used when variables are normally distributed.
To apply standardisation on the current data, we start by creating a
duplicate of the data named BCData_std and use the
scale() function to standardise.
#Create duplicate of data frame
BCData_std<-BCData
#Standardise numeric variables only
BCData_std[,2:31]<-scale(BCData[,2:31],center = T,scale = T)
summary(BCData_std$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4532 -0.6666 -0.2949 0.0000 0.3632 5.2459
PCA is an unsupervised learning technique that will be used to reduce
the dimension of the data through linear transformation. The
principal() function in the
psych package will be used to perform
PCA.
#Principal Component Analysis (PCA)
pc<-principal(BCData_std[,2:31],nfactors = 30)
#plot eigenvalues
plot(1:30,pc$values,xlab="Principal Component",
ylab="Eigenvalue")
#Calculate percentage explained by principal component
pc_explained<-(pc$values/sum(pc$values))*100
plot(1:30,pc_explained,
ylab = "percentage explained",
xlab="principal component")
#PC1 explains 44.27%
#PC2 explains 18.97%
#PC3 explains 9.39%
#PC4 explains 6.6%
#PC5 explains 5.5%
sum(pc_explained[1:5])
## [1] 84.73427
#First five PCs explain 85% of total variation
sum(pc_explained[1:10])
## [1] 95.15688
#First 10 PCs explain 95% of total variation
From the PCA output, the first principal component explains 44.27% of the total variation, while the first 5 principal components explain 85% of the total variance. Adding the next 5 principal components increases the variance explained to 95%.
Pairwise plots of the PC scores show that the first two PC scores could likely discriminate between individuals with malignant and benign tumours.
#Plot PC scores
#PC1 vs PC2
plot(pc$scores[,1],pc$scores[,2],col=BCData_std$diagnosis)
#PC1 vs PC3
plot(pc$scores[,1],pc$scores[,3],col=BCData_std$diagnosis)
#PC1 vs PC3
plot(pc$scores[,2],pc$scores[,3],col=BCData_std$diagnosis)
In this section, K-means clustering is used to check for any
interesting patterns in the predictor variables. It is an unsupervised
technique as it does not use the outcome variable in detecting the
clusters. The function for k-means clustering is kmeans().
Setting centers=2 creates 2 partitions of the data and a
summary of the number of observations in each cluster is given
below.
#K-means clustering with k=2
km1<-kmeans(BCData_std[,2:31], centers = 2, nstart = 25)
#Summary of clusters
table(km1$cluster)
##
## 1 2
## 380 189
Next, the clusters are compared to the actual target (outcomes) values to check if observations were partitioned according to tumour type. To do this, the cluster labels are set to correspond to the actual outcome labels. The majority group in the clusters is given the label “B” while the minority group is defined as “M”. The results below show that out of 357 observations that actually have tumors 343 are correctly defined and overall 91% of observations are correctly assigned to their target labels.
#Actual labels
table(BCData_std$diagnosis)
##
## B M
## 357 212
#Define cluster with most values as "B" and other as "M"
c1<-which.min(table(km1$cluster))
c2<-which.max(table(km1$cluster))
clust1<-factor(x=km1$cluster,levels = c(c2,c1),
labels = c("B","M"))
#Compare clusters with actual labels
table(BCData_std$diagnosis,clust1)
## clust1
## B M
## B 343 14
## M 37 175
#Using a confusion matrix gives more info like sensitivity
confusionMatrix(clust1,BCData_std$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 343 37
## M 14 175
##
## Accuracy : 0.9104
## 95% CI : (0.8838, 0.9325)
## No Information Rate : 0.6274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.804
##
## Mcnemar's Test P-Value : 0.002066
##
## Sensitivity : 0.9608
## Specificity : 0.8255
## Pos Pred Value : 0.9026
## Neg Pred Value : 0.9259
## Prevalence : 0.6274
## Detection Rate : 0.6028
## Detection Prevalence : 0.6678
## Balanced Accuracy : 0.8931
##
## 'Positive' Class : B
##
Next, let us cluster observations using the PC score rather than the actual values. The first 5 PC score are used for clustering below. This gives an accuracy of 87%, which is good here because only five variables (PC scores) are used to build clusters. The PC scores correctly identified 356 of the 357 observations with benign tumours but wrongly identified 73 observations with malignant tumours as benign.
#Clustering with PC
km1_pc<-kmeans(pc$scores[,1:5],
centers = 2, nstart = 25)
table(km1_pc$cluster)
##
## 1 2
## 429 140
table(BCData_std$diagnosis)
##
## B M
## 357 212
#Define cluster with most values as "B" and other as "M"
c1<-which.min(table(km1_pc$cluster))
c2<-which.max(table(km1_pc$cluster))
clust2<-factor(x=km1_pc$cluster,levels = c(c2,c1),
labels = c("B","M"))
table(BCData_std$diagnosis,clust2)
## clust2
## B M
## B 356 1
## M 73 139
#Using a confusion matrix gives more info like sensitivity
confusionMatrix(clust2,BCData_std$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 356 73
## M 1 139
##
## Accuracy : 0.8699
## 95% CI : (0.8395, 0.8965)
## No Information Rate : 0.6274
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7012
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9972
## Specificity : 0.6557
## Pos Pred Value : 0.8298
## Neg Pred Value : 0.9929
## Prevalence : 0.6274
## Detection Rate : 0.6257
## Detection Prevalence : 0.7540
## Balanced Accuracy : 0.8264
##
## 'Positive' Class : B
##
So far, k=2 has been used to define the number of clusters for
k-means clustering. To determine the optimal number of clusters, we
calculate the silhouette score using the silhouette
function within the cluster package.
kvals=2:5 #set number of clusters
sil_score=numeric(length(kvals))#define empty vector to store silhouette scores
#Loop to cluster data and compute silhouette score for k=2:5
for(i in 1:length(kvals)){
km<-kmeans(BCData_std[,2:31], centers = kvals[i], nstart = 25)
SC<-silhouette(km$cluster, dist(BCData_std[,2:31]))
sil_score[i]<-mean(SC[,3])
}
print(cbind(kvals, sil_score))
## kvals sil_score
## [1,] 2 0.3449740
## [2,] 3 0.3143840
## [3,] 4 0.2798136
## [4,] 5 0.1697861
Using linear discriminant analysis, we will classify individuals as
either having benign or malignant tumours using their predictor values.
The linear discriminant analysis is used here and out-of-sample
performance is assessed through cross-validation. The main functions for
supervised learning are trainControl() and
train(). In trainControl(), the type of
cross-validation to use is defined and other arguments can specify the
output that are generated from the model. The function
train() is used to specify the machine learning algorithm
to use. The train() and trainControl()
functions are contained in the caret
package.
The first step in supervised learning is to specify the input (predictor) and output (outcome) variables. Here, we define them as \(X\) and \(y\) respectively.
#Define outcome and predictors
X=BCData_std[, 2:31]
y=BCData_std$diagnosis
Next, split the data into a training a test set using hold-out cross-validation. The training set is made up of 80% of observations and the test set is the remaining 20%.
#Split data into train and test set using hold-out:
#80% training, 20% testing
ctrl_holdout<-trainControl(method="LGOCV",p=0.8,
savePredictions = "all",
returnResamp = 'all',
classProbs = TRUE,
number = 1)
To classify the observations using LDA, the
MASS package is required before using the
function train(). In the train() function we
have specified method="lda" to get predictions based on
linear discriminant analysis.
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
model_lda_hld<-train(X, y, trControl = ctrl_holdout,
method="lda")
print(model_lda_hld)
## Linear Discriminant Analysis
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (1 reps, 80%)
## Summary of sample sizes: 456
## Resampling results:
##
## Accuracy Kappa
## 0.9646018 0.9234677
model_lda_hld$results
The results above shows that with LDA, we achieve an accuracy of
96.5%. More details on the performance of the LDA can be seen with the
confusion matrix which shows a sensitivity of 98.6% and specificity of
92.9%. In this case the benign class is treated as the positive class.
The positive class is set by default as the class that comes first. This
can also be specified in the confusionMatrix()
function.
#Get summary with confusion matrix
confusionMatrix(data=model_lda_hld$pred$pred,
reference = model_lda_hld$pred$obs)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 3
## M 1 39
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9235
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9859
## Specificity : 0.9286
## Pos Pred Value : 0.9589
## Neg Pred Value : 0.9750
## Prevalence : 0.6283
## Detection Rate : 0.6195
## Detection Prevalence : 0.6460
## Balanced Accuracy : 0.9572
##
## 'Positive' Class : B
##
The performance of LDA will also be assessed with a 10-fold cross-validation. With k-fold cross we are able to explore more possible splits of the data into a training and test set. The result are shown below. The cross-validation classification accuracy is 95.6%. However, the sensitivity and specificity are not given.
#10-fold cv
ctrl_Kfold<-trainControl(method="cv",number = 10)
model_lda_kf<-train(X, y, trControl = ctrl_Kfold, method="lda")
print(model_lda_kf)
## Linear Discriminant Analysis
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9562257 0.9037376
model_lda_kf$results
The classification results gives a sensitivity of 99.4% and
specificity of 89.2% as shown below. To get the sensitivity and
specificity, an additional argument
(summaryFunction = twoClassSummary) is required in
trainControl().
ctrl_Kfold<-trainControl(method="cv",
number = 10,
classProbs = TRUE,
savePredictions ='all',
returnResamp = 'all',
summaryFunction = twoClassSummary)
model_lda_kf<-train(X, y, trControl = ctrl_Kfold, method="lda")
## Warning in train.default(X, y, trControl = ctrl_Kfold, method = "lda"): The
## metric "Accuracy" was not in the result set. ROC will be used instead.
print(model_lda_kf)
## Linear Discriminant Analysis
##
## 569 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9910106 0.9943651 0.8919913
model_lda_kf$results
It is also possible to calculate the classification accuracy across
all the folds by using confusionMatrix(). However, the
results should not be reported as it defeats the purpose of
cross-validation. An exception can be made when one wants to see the
performance of the classifier for a subset of observations.
#Confusion matrix across all folds
confusionMatrix(model_lda_kf$pred$obs,
model_lda_kf$pred$pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 355 2
## M 23 189
##
## Accuracy : 0.9561
## 95% CI : (0.9358, 0.9714)
## No Information Rate : 0.6643
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9041
##
## Mcnemar's Test P-Value : 6.334e-05
##
## Sensitivity : 0.9392
## Specificity : 0.9895
## Pos Pred Value : 0.9944
## Neg Pred Value : 0.8915
## Prevalence : 0.6643
## Detection Rate : 0.6239
## Detection Prevalence : 0.6274
## Balanced Accuracy : 0.9643
##
## 'Positive' Class : B
##
In this section, we will perform sensitivity analysis to see how the performance of the classifier will be affected by some changes to the variables.
We start by removing all the “worst” variables as it is computed using just a single value unlike the “mean” and “se” variables.
#Identify and remove "worst" variables
vals<-grep("worst", names(X))
X_reduced<-X[,-vals]
When the “worst” variables are removed, the classification accuracy becomes 93.7%, which is about 2% less than when all variables were used. This is an acceptable drop in accuracy since 10 variables have been removed.
#Kfold with reduced variables
ctrl_Kfold<-trainControl(method="cv",number = 10)
model_lda_kf<-train(X_reduced, y, trControl = ctrl_Kfold, method="lda")
print(model_lda_kf)
## Linear Discriminant Analysis
##
## 569 samples
## 20 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ...
## Resampling results:
##
## Accuracy Kappa
## 0.936895 0.859126
model_lda_kf$results
Finally, the principal component scores are used as predictors to compute the classification accuracy. When the first 20 PC scores are used as predictors, the classifier achieves an accuracy of 95.5%.
ctrl_Kfold<-trainControl(method="cv",number = 10)
model_lda_kf<-train(pc$scores[,1:20], y, trControl = ctrl_Kfold, method="lda")
print(model_lda_kf)
## Linear Discriminant Analysis
##
## 569 samples
## 20 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 513, 512, 511, 512, 511, 513, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9545621 0.9004795
model_lda_kf$results
In this example, we are interested in the predicting the chance of survival in the Titanic disaster. Information on passengers such as their socio-economic status, gender, age are used as predictors of survival. The Titanic was a ship which set sail from Southampton (England) was expected to end its journey in New York (USA). However, the ship reportedly struck an iceberg near Newfoundland (Canada) and led to death of some passengers aboard the ship.
Image of the Titanic. Source:https://nypost.com/2017/01/03/is-this-the-real-reason-the-titanic-sank/
#Load Titanic data
Titanic_data<- read.csv("trainTitanic.csv")
Next, we can view the first few observations of the dataset.
head(Titanic_data)
dim(Titanic_data)
## [1] 891 12
There are some redundant columns in the data which will not be used
for any further analysis and they need to be removed. The columns are
Name, PassengerId, Ticket, and
Cabin.
The binary operator %in% is used to identify the columns
with these names for removal. There are originally 12 columns and this
is reduced to 8. The predictor variable is binary and indicates whether
or not a passenger survived.
remove_vars<-
which(names(Titanic_data)%in%c('PassengerId','Name',
'Ticket','Cabin'))
Titanic_data<- Titanic_data[,-remove_vars]
dim(Titanic_data)
## [1] 891 8
With the data loaded and unwanted variables removed, it is now time
to check that variables have been corretly defined as either numeric or
categoric. For example, in the outcome variable Survived,
the levels are defined as yes and no. Other variables like
Pclass (which denotes a passenger’s ticket type and is a
measure of socio-economic status) and Embarked (which
specifies where the passenger boarded the ship or embarked on their
journey) are defined as categoric. The ports of embarkation are:
C = Cherbourg, Q = Queenstown, S = Southampton.
#Check that variables are correctly defined as numeric and cat
sapply(X=Titanic_data,FUN=class)
## Survived Pclass Sex Age SibSp Parch
## "integer" "integer" "character" "numeric" "integer" "integer"
## Fare Embarked
## "numeric" "character"
#Define categoric variables
Titanic_data$Survived<-factor(x=Titanic_data$Survived,
levels=c(0,1),
labels=c("No","Yes"))
Titanic_data$Pclass<-factor(x=Titanic_data$Pclass,
levels=c(1,2,3),
labels=c("1st","2nd","3rd"))
Titanic_data$Sex<-factor(Titanic_data$Sex)
Titanic_data$Embarked<-factor(Titanic_data$Embarked)
table(Titanic_data$Embarked)
##
## C Q S
## 2 168 77 644
There are some blank cells in the variable Embarked and
they need to be defined as missing values. There are 2 missing values
for Embarked.
#Define the blank cell for Embarked as NA
Titanic_data$Embarked[Titanic_data$Embarked==""]<-NA
sum(is.na(Titanic_data$Embarked))
## [1] 2
#Drop unused levels of Embarked
Titanic_data$Embarked<-droplevels(Titanic_data$Embarked)
table(Titanic_data$Embarked)
##
## C Q S
## 168 77 644
The variables SibSp and Parch which denote
number of siblings/spouses and number of parents/children aboard the
Titanic respectively are defined as numeric.
#Define numeric variables
Titanic_data$SibSp<-as.numeric(Titanic_data$SibSp)
Titanic_data$Parch<-as.numeric(Titanic_data$Parch)
To summarise the variables, we need to create a new indicator variable to define the variable type in each column.
#Univariate summary
var_type<-sapply(X=Titanic_data,FUN=class)
For numeric variables, we plot histograms to assess the distribution
of the data and use the summary() function to compute
summary statistics like the mean, median and quartiles. The variable
Age appears approximately normally distributed while
SibSp, Parch and Fare are
positively skewed. There are 177 missing values for the variable
Age.
#Numeric variables: Histogram
multi.hist(Titanic_data[,which(var_type=="numeric")],
global=F)
sapply(X=Titanic_data[,which(var_type=="numeric")],
FUN=summary)
To summarise the categoric variables, compute the frequency and percentage in each category.
#Categoric variables
#Frequencies
lapply(X=Titanic_data[,which(var_type=="factor")],
FUN=table)
#Percentages
lapply(X=Titanic_data[,which(var_type=="factor")],
FUN=function(X) prop.table(table(X))*100)
There are no highly correlated variables from looking at both the
scatter plots and the correlation plot. Since there some missing values
in Age, the argument use="complete.obs" is
used to compute correlation for only observations with complete
data.
#Two numeric
pairs(Titanic_data[,which(var_type=="numeric")])
r=round(cor(Titanic_data[,which(var_type=="numeric")],
use ="complete.obs"),
digits = 2)
corPlot(r, cex = 0.3,numbers = F)
#No variables with high correlations
Compute the bivariate association between the categoric variables
using the table() and prop.table() functions.
The results show a higher survival rate among the passengers in first
class compared to those in second and third class. There was always
higher survival rates among the women compared to the men. Survival was
significantly associated with passenger class, port of embarkation and
gender.
#Survived and pclass
t1<-table(Titanic_data$Survived,Titanic_data$Pclass)
round(prop.table(t1,margin = 2),digits = 2)
#Higher survival rates in first class
#Survived and sex
t2<-table(Titanic_data$Survived,Titanic_data$Sex)
round(prop.table(t2,margin = 2),digits = 2)
#Survived and Embarked
t3<-table(Titanic_data$Survived,Titanic_data$Embarked)
round(prop.table(t3,margin = 2),digits = 2)
#Embarked and Pclass
t4<-table(Titanic_data$Embarked,Titanic_data$Pclass)
round(prop.table(t4,margin = 1),digits = 2)
#Sex and Pclass
t5<-table(Titanic_data$Sex,Titanic_data$Pclass)
round(prop.table(t5,margin = 2),digits = 2)
#Sex and Embarked
t6<-table(Titanic_data$Sex,Titanic_data$Embarked)
t6
round(prop.table(t6,margin = 1),digits = 2)
#Association between numeric variables and survived
lapply(Titanic_data[,which(var_type=="factor")],
FUN=function(X)chisq.test(table(X,Titanic_data$Survived)))
Conducting multiple t-test, we can check for associations between the
numeric predictors and the outcome variable Survived. The
variables Age, Parch and Fare are
significantly associated with survival. These were found by conducting
t-tests. The output below shows that those who survived paid
significantly higher fare on average than those that did not. Comparing
age shows that those who survived are significantly younger on average
than those who did not survive.
#Association between numeric variables and survived
lapply(Titanic_data[,which(var_type=="numeric")],
FUN=function(X)t.test(X~Titanic_data$Survived))
The function md.pattern() in
mice is used to visualise the missing
values. There are 712 observations with no missing information, 177
observations have missing values on Age and 2 observations
have missing values on Embarked.
md.pattern(Titanic_data)
## Survived Pclass Sex SibSp Parch Fare Embarked Age
## 712 1 1 1 1 1 1 1 1 0
## 177 1 1 1 1 1 1 1 0 1
## 2 1 1 1 1 1 1 0 1 1
## 0 0 0 0 0 0 2 177 179
#177 missing values for age
#2 missing values for embarked
To check the missingness mechanism, we create missing indicator
variables for Age and Embarked and compute
associations between missing values and other observed variables. A
significant association could be an indication that missing values are
MAR and not MCAR. Some interesting results found are the significant
association between missingness on age and survival. For those missing
age, 29.4% of them survived, while for those with age available 40.6%
survived. Missingness on age was also significantly associated with
socio-economic status, port of embarkation, fare and number of
parents/children aboard the Titanic. The relationship between
missingness on age and the other variables suggest that missingness may
be MAR.
#Check if age missingness is associated with survived
#create dummy for missingness
missing_Age<-is.na(Titanic_data$Age)
#Missing Age vs Survived
t_ms1<-table(missing_Age,Titanic_data$Survived)
prop.table(t_ms1,margin=1)
chisq.test(t_ms1)
#Missingness on Age is associated with Survival
#Those whose Age are missing less likely to survive
#May be on lower income level
#Missing Age vs Sex
t_ms2<-table(missing_Age,Titanic_data$Sex)
prop.table(t_ms2,margin=1)
chisq.test(t_ms2)
#Missingnes on Age not associated with Sex
#Missing Age vs Pclass
t_ms3<-table(missing_Age,Titanic_data$Pclass)
prop.table(t_ms3,margin=1)
chisq.test(t_ms3)
#Missingness on Age associated with Pclass
#Missing Age vs Embarked
t_ms4<-table(missing_Age,Titanic_data$Embarked)
prop.table(t_ms4,margin=1)
chisq.test(t_ms4)
#Missingness on Age associated with Embarked
#Missing Age vs # of Siblings
t.test(Titanic_data$SibSp~missing_Age)
tapply(X=Titanic_data$SibSp,INDEX=missing_Age,FUN=mean)
#Missingness on Age not associated with no of siblings
#Missing Age vs # of parents
t.test(Titanic_data$Parch~missing_Age)
tapply(X=Titanic_data$Parch,INDEX=missing_Age,FUN=mean)
#Missingness on Age is associated with no of parents
#Those with parents aboard were less to be missing Age
#Missing Age vs Fare
t.test(Titanic_data$Fare~missing_Age)
tapply(X=Titanic_data$Fare,INDEX=missing_Age,FUN=mean)
#Missingness on Age is associated with Fare
#Those with missing Age paid less Fare on average
Since missing values are associated with other observed information, we can estimate those missing values using regression imputation, where the variables with missing values are used as the outcome and other variables are used to predict the missing values. The type of regression model used to estimate missing values will depend on the type of variable that has missing values. Linear regression is used for numeric variables, while logistic regression is used for binary variables.
#Impute missing data using Regression Imputation
reg_Titanic <- mice(Titanic_data, m=1, defaultmethod = c('norm.predict','logreg'))
#Imputed data
Impute_Titanic <- complete(data=reg_Titanic)
A copy of the data with list-wise deletion is also created, for use later.
#Create data with only complete cases
cc<-complete.cases(Titanic_data)
Titanic_cc<-Titanic_data[cc,]#Complete cases data
#712 complete cases
To classify using random forest the package
randomForest needs to be installed and the
argument method='rf' will be used in the
train() function. The data with missing values imputed
using regression imputation will be used for classification. This gives
an accuracy of 84.3%.
Xtitanic<-Impute_Titanic[,2:8]
ytitanic<-Impute_Titanic$Survived
install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
#mtry: Number of variables randomly sampled as candidates at each split.
#K-fold cross-validation
ctrl_Kfold<-trainControl(method="cv",number = 10)
#Random forest classification
model_rf<-train(Xtitanic, ytitanic,
trControl = ctrl_Kfold,
method="rf")
print(model_rf)
## Random Forest
##
## 891 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 803, 802, 801, 802, 802, 802, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8428306 0.6573035
## 4 0.8338668 0.6426292
## 7 0.8170869 0.6085680
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
model_rf$results
The classification process is repeated with using only cases with complete data. This will have a sample size of 712. The accuracy obtained in this case is 82.3%.
#Specify outcome and predictors for complete cases
Xtitanic_cc<-Titanic_cc[,2:8]
ytitanic<-Titanic_cc$Survived
#K-fold cross-validation
ctrl_Kfold<-trainControl(method="cv",number = 10)
model_rf<-train(Xtitanic_cc, ytitanic,
trControl = ctrl_Kfold,
method="rf")
print(model_rf)
## Random Forest
##
## 712 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 641, 641, 641, 641, 640, 641, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8230634 0.6241823
## 4 0.8048122 0.5891226
## 7 0.8006064 0.5826426
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
model_rf$results
The random forest classifier using two main hyperparameters
mtry and ntree. The hyperparameter
mtry defines the umber of variables randomly sampled as
candidates at each split. The hyperparameter ntree defines
the number of classification trees to be generated to get a random
forest. The function train() uses ntree=500
and mtry=sqrt(p) by default. Where \(p\) is the number of predictors. We can
specify different values of mtry from 1 to 7 and set
ntree=50. The function expand.grid() is used
to set several values of mtry to use in the random forest
classifier. The argument ntree=50 is specified in
train(). Only one value of ntree be specified
at a time. The random forest classifier with hyperparameter tuning gives
classification accuracy of 83.8%%.
# Tuning the parameter of RF (Imputed)
#specify: mtry and ntree
Xtitanic<-Impute_Titanic[,2:8]
ytitanic<-Impute_Titanic$Survived
ctrl_Kfold<-trainControl(method="cv",number = 10)
#Specify tuning parameters mtry and ntree
grid_search<-expand.grid(.mtry=1:7)
#Add tuneGrid argument
model_rf<-train(Xtitanic, ytitanic,
trControl = ctrl_Kfold,
method="rf",
tuneGrid = grid_search,
ntree=50)
print(model_rf)
## Random Forest
##
## 891 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 803, 802, 801, 802, 802, 802, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.8225925 0.6010523
## 2 0.8338163 0.6401468
## 3 0.8383992 0.6504886
## 4 0.8360637 0.6474715
## 5 0.8237782 0.6221869
## 6 0.8159508 0.6075236
## 7 0.8148147 0.6039820
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
model_rf$results
This example involves predicting the median house prices using census tract housing data in Boston. The outcome variable is the median value of owner occupied home and predictors such as per capita crime rate, property tax rate, town, longitude and latitude of census tract.
The data is available from the mlbench
package, which needs to be installed.
install.packages("mlbench")
library(mlbench)
#Load data
data("BostonHousing2")
Once the data is loaded, we can explore the first few observations.
head(BostonHousing2)
The data contains 17 predictors and the outcome variables defined as
medv (median house value) and cmedv (corrected
median house value). We will use the corrected median house value here
as the target variable for the supervised learning model.
The next step is to clean the data. Looking through the data, some
categorical variables do not have their levels defined and some numeric
variables are defined as integers. The variable medv will
also be removed.
#Define categoric variables
BostonHousing2$chas<-factor(BostonHousing2$chas,
levels = c(0,1),
labels = c("Does not bound Charles River",
"Bounds Charles River"))
BostonHousing2$town<-factor(BostonHousing2$town)
#Define numeric variables
BostonHousing2$tax<-as.numeric(BostonHousing2$tax)
#Remove medv (since there is cmedv) and tract (an id variable)
BostonHousing2<-BostonHousing2[,
-which(names(BostonHousing2)%in%c("medv","tract"))]
The univariate summary shows that the variables crim,
zn, disand lstat are positively
skewed while age, tax and b are
negatively skewed. The other variables are approximately symmetric. Of
the 506 tracts, 471 bounded the Charles rives while the others did not.
The variable town has 92 levels with some levels having
only 1 observation. This may cause some problems with
cross-validation.
#Univariate Summary
boston_vars<-sapply(X=BostonHousing2,FUN=class)
#Numeric summary
multi.hist(BostonHousing2[,
which(boston_vars=="numeric")],
global=F)
hist(BostonHousing2$rad)
sapply(BostonHousing2[,which(boston_vars=="numeric")],
FUN=summary)
#Positive skew: crim, zn, dis,lstat
#Negative skew: age, tax, b
#Categoric summary
sapply(BostonHousing2[,which(boston_vars=="factor")],
FUN=table)
length(levels(BostonHousing2$town))
There are no numeric variables that are highly correlated, using an
absolute cutoff of 0.9 for the correlation coefficients. There is a
strong negative correlation between lstat (percentage of
population on a lower status) and the outcome cmedv from
the scatterplot below. The variable rm (average number of
rooms) has a strong positive association with cmedv and a
negative association with lstat. There was a significant
association between the categoric variables chas and
town. The variable chas is significantly
associated with cmedv, lon, indus
and ptratio. The houses that bound the Charles had
significantly higher median value than those that did not. The median
house values was also found to be significantly different across the
various towns.
#Bivariate associations
#Two numeric variable
pairs(BostonHousing2[,which(boston_vars=="numeric")])
r=cor(BostonHousing2[,which(boston_vars=="numeric")])
#Check for highly correlated variables
findCorrelation(r,cutoff =0.9)
## integer(0)
#None detected
#Two categoric
#Contigency table
tb1<-table(BostonHousing2$chas,BostonHousing2$town)
#Fisher's exact test
fisher.test(tb1,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: tb1
## p-value = 0.0004998
## alternative hypothesis: two.sided
#Significant association between chas and town
#One Numeric and One Categoric
#cmedv and chas
t.test(BostonHousing2$cmedv~BostonHousing2$chas)
#Significant difference in price by chas
#cmedv and town
kruskal.test(BostonHousing2$cmedv~BostonHousing2$town)
#Significant association between town and median price
sapply(BostonHousing2[,which(boston_vars=="numeric")],
FUN=function(X)wilcox.test(X~BostonHousing2$chas))
#chas associated with:
#lon
#indus
#ptratio
There are no variables with missing values.
#Missing Data
any(is.na(BostonHousing2))
## [1] FALSE
#No variable with missing values
A linear regression model is fitted to the data by setting
method="glm" in train(). The outcome variable
is cmedv and all the predictors are used in the model. In
this instance 10-fold cross-validation with 10 repetitions is used in
computing the out-sample performance measures. Repeated k-fold
cross-validation helps improve precision (get lower variance). This
gives an r-squared of 0.85, with RMSE=3.52.
Xboston<-BostonHousing2[,
-which(names(BostonHousing2)=="cmedv")]
yboston<-BostonHousing2$cmedv
#10-fold cross-validation with 10 repeats
ctrl_Kfold<-trainControl(method="repeatedcv",
number = 10,
repeats = 10)
model_bsnet<-train(Xboston, yboston,
trControl = ctrl_Kfold,
method="glm")
print(model_bsnet)
## Generalized Linear Model
##
## 506 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.521034 0.8499445 2.20802
model_bsnet$results
The predictor town 92 levels with some levels having
only one observation. This may cause some overfitting problems and also
be problematic when splitting the data into a training and test set as
the training set may not contain a level which is then in the test set.
If a dummy variable was not created for a level in the training set,
then predictions will be impossible for any fold containing such levels.
Therefore, we combine levels of town with 3 observations or
less and name this new level others. The r-squared value
for this linear regression model is 0.82 and RMSE is 3.89. This does not
show an obvious improvement in predictions. However, this is expected to
give more stable predictions.
#install package to combine levels of town
install.packages("rockchalk")
library(rockchalk)
## Warning: package 'rockchalk' was built under R version 4.4.3
# Tune model (Remove some levels of town)--------------------------------------------------------------
Xboston2<-Xboston
#Find levels with 3 levels or less
rm_levels<-table(Xboston2$town)<=3
sum(table(Xboston2$town)<=3)
loc_levels<-which(rm_levels)
name_levels<-names(table(Xboston2$town))
name_rm_levels<-name_levels[loc_levels]
#Combine levels
Xboston2$town<-combineLevels(Xboston2$town,
levs=name_rm_levels,
newLabel = c("Others"))
#Number of levels after combination
length(levels(Xboston2$town))
## [1] 51
#Build Model
ctrl_Kfold<-trainControl(method="repeatedcv",
number = 10,
repeats = 10)
model_bsnet2<-train(Xboston2, yboston,
trControl = ctrl_Kfold,
method="glm")
print(model_bsnet2)
## Generalized Linear Model
##
## 506 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.889367 0.8243728 2.607308
model_bsnet2$results
The stepwise linear regression using AIC gives an r-squared of 0.83, RMSE=3.87. Using the stepwise method for variable selection did not lead to an improvement in prediction.
# Step wise regression using AIC (levels combined) -----------------------------------
ctrl_Kfold<-trainControl(method="repeatedcv",
number = 10,
repeats = 10)
model_bsnet3<-train(Xboston2, yboston,
trControl = ctrl_Kfold,
method="glmStepAIC")
print(model_bsnet3)
## Generalized Linear Model with Stepwise Feature Selection
##
## 506 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 458, 455, 455, 455, 456, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.869267 0.8258016 2.604477
model_bsnet3$results
With random forest the prediction improved with an r-squared of 0.90 and RMSE of 2.92. This is an improvement on the prediction from linear regression.
# Using Random Forest on Boston housing Data
ctrl_Kfold<-trainControl(method="repeatedcv",
number = 10,
repeats = 10)
model_bs_rf<-train(Xboston2, yboston,
trControl = ctrl_Kfold,
method="rf")
print(model_bs_rf)
## Random Forest
##
## 506 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 457, 455, 455, 455, 455, 455, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 3.413198 0.8800161 2.284545
## 9 2.920345 0.9039607 1.967742
## 16 2.987373 0.8961365 2.015499
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
model_bs_rf$results