In the following project we’ll use the Boston Housing dataset. The data were gathered by the US Census Bureau on housing in the Boston, Massachusetts area, and the dataset has been used extensively in the academic literature as a benchmark for different algorithms.
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.
The attributes are defined as follows (taken from the UCI Machine Learning Repository):
CRIM: per capita crime rate by town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
MEDV: Median value of owner-occupied homes in $1000s
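The original write-up does not show how the `class` data frame was created. As a hedged sketch, assuming the data come from the `Boston` dataset shipped with the MASS package (whose column names match the output below), one could load it like this:
library(MASS)      # assumption: the data come from MASS's Boston dataset
class <- Boston    # 506 observations of 14 variables (crim, zn, ..., medv)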
head(class)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Let’s also take a look at a summary of the whole dataset.
psych::describe(class)
## vars n mean sd median trimmed mad min max range skew
## crim 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97 5.19
## zn 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00 2.21
## indus 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28 0.29
## chas 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00 3.39
## nox 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49 0.72
## rm 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22 0.40
## age 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10 -0.60
## dis 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00 1.01
## rad 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00 1.00
## tax 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00 0.67
## ptratio 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40 -0.80
## black 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58 -2.87
## lstat 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24 0.90
## medv 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00 1.10
## kurtosis se
## crim 36.60 0.38
## zn 3.95 1.04
## indus -1.24 0.30
## chas 9.48 0.01
## nox -0.09 0.01
## rm 1.84 0.03
## age -0.98 1.25
## dis 0.46 0.09
## rad -0.88 0.39
## tax -1.15 7.49
## ptratio -0.30 0.10
## black 7.10 4.06
## lstat 0.46 0.32
## medv 1.45 0.41
str(class)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Before starting, let’s first check for missing values.
library(Amelia)   # missmap() comes from the Amelia package
missmap(class, col = c('yellow', 'black'), y.at = 1, y.labels = '', legend = TRUE)
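As a quick numeric complement to the missingness map (my own addition), we can also count the NAs per column:
# count missing values in each column; all zeros would mean no missing data
colSums(is.na(class))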
Let’s also load some useful functions.
# removing outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Population standard deviation function
PopSD <- function(x){
  sqrt(sum((x - mean(x))^2) / length(x))
}
# Variable standardizer function
Standardize <- function(x){ (x - mean(x)) / PopSD(x) }
# Normalization
normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}
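A minimal sanity check of these helpers on a toy vector (not part of the original analysis), just to illustrate their behaviour:
# toy vector with one extreme value
x <- c(1, 2, 3, 4, 100)
remove_outliers(x)        # the value 100 falls outside 1.5*IQR and becomes NA
normalize(x)              # rescales everything into the [0, 1] range
round(Standardize(x), 2)  # centers on the mean and divides by the population SD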
intro
Let’s see whether different patterns emerge with respect to the following variables:
medv
nox
crim
rm
age
First of all, let’s choose our variables of interest, normalize them, and remove the outliers.
#View(class)
clustering <- class%>%
dplyr::select(medv, nox, crim, rm, age)
#normalize
clustering$medv <- normalize(clustering$medv)
clustering$nox <- normalize(clustering$nox)
clustering$crim <- normalize(clustering$crim)
clustering$rm <- normalize(clustering$rm)
clustering$age <- normalize(clustering$age)
#outliers
clustering$medv <- remove_outliers(clustering$medv)
clustering$nox <- remove_outliers(clustering$nox)
clustering$crim <- remove_outliers(clustering$crim)
clustering$rm <- remove_outliers(clustering$rm)
clustering$age <- remove_outliers(clustering$age)
clustering<- na.omit(clustering)
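As a small sketch (not in the original), we can check how many observations were dropped by the outlier filtering:
# rows removed as outliers and rows remaining for clustering
nrow(class) - nrow(clustering)
nrow(clustering)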
Let’s visualize our data.
head(clustering)
## medv nox crim rm age
## 1 0.4222222 0.3148148 0.0000000000 0.5775053 0.6416066
## 2 0.3688889 0.1728395 0.0002359225 0.5479977 0.7826982
## 3 0.6600000 0.1728395 0.0002356977 0.6943859 0.5993821
## 4 0.6311111 0.1502058 0.0002927957 0.6585553 0.4418126
## 5 0.6933333 0.1502058 0.0007050701 0.6871048 0.5283213
## 6 0.5266667 0.1502058 0.0002644715 0.5497222 0.5746653
library(clValid)
#clustering=as.data.frame(clustering)
intern<- clValid(clustering, 2:5,
clMethods = c("hierarchical", "kmeans", "diana", "pam"),
validation = "internal")
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans diana pam
##
## Cluster sizes:
## 2 3 4 5
##
## Validation Measures:
## 2 3 4 5
##
## hierarchical Connectivity 25.6786 28.6964 33.7147 42.3544
## Dunn 0.0660 0.0888 0.0888 0.0692
## Silhouette 0.4517 0.4111 0.3859 0.2937
## kmeans Connectivity 38.1905 30.5710 76.0135 92.0123
## Dunn 0.0467 0.0559 0.0354 0.0395
## Silhouette 0.4639 0.4193 0.3415 0.3356
## diana Connectivity 42.2306 44.6230 63.7643 89.1964
## Dunn 0.0458 0.0602 0.0617 0.0653
## Silhouette 0.4629 0.4141 0.3352 0.3012
## pam Connectivity 31.1817 82.2917 71.3421 105.0988
## Dunn 0.0415 0.0283 0.0441 0.0498
## Silhouette 0.4613 0.3263 0.3411 0.2626
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 25.6786 hierarchical 2
## Dunn 0.0888 hierarchical 3
## Silhouette 0.4639 kmeans 2
Let’s also display the results graphically.
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(intern, legend = FALSE)
# empty panel used only to hold the legend
plot(nClusters(intern), measures(intern, "Dunn")[, , 1], type = "n", axes = FALSE,
     xlab = "", ylab = "")
legend("center", clusterMethods(intern), col = 1:9, lty = 1:9, pch = paste(1:9))
par(op)
Connectivity should be minimized, the Dunn index should be maximized, and the silhouette lies between -1 and 1, with values closest to 1 indicating the best fit. Hierarchical clustering with 2 clusters seems to be the best procedure.
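The same conclusion can be read off programmatically with clValid’s optimalScores() helper (a small sketch, not shown in the original):
# best score, method and number of clusters for each validation measure
optimalScores(intern)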
Let’s compute the distance matrix, plot the dendrogram and, keeping the elbow method and the silhouette in mind, decide how many clusters to keep.
dd = dist(clustering, method = "euclidean")
clusters= hclust(dd, method = "complete")
plot(clusters)
cut=cutree(clusters,2)
rect.hclust(clusters, k=2, border = 2:6)
This plot shows the two clusters obtained using Euclidean distance and complete linkage.
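Since the text mentions the elbow method and the silhouette, here is a hedged sketch of how one could double-check the choice of two clusters, assuming the factoextra and cluster packages are available:
library(cluster)
# elbow plot: total within-cluster sum of squares for hierarchical cuts
factoextra::fviz_nbclust(clustering, factoextra::hcut, method = "wss")
# average silhouette width for the complete-linkage cut into 2 clusters
sil <- silhouette(cut, dd)
mean(sil[, "sil_width"])   # compare with the silhouette values reported by clValid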
library(GGally)   # ggpairs()
ggpairs(cbind(clustering, cut = as.factor(cut)),
        columns = 1:4, aes(colour = cut, alpha = 0.5),
        lower = list(continuous = "points"),
        upper = list(continuous = "blank"),
        axisLabels = "none", switch = "both") +
  theme_bw()
For a better visualization of the clusters:
library(circlize)
library(dendextend)   # color_branches() and circlize_dendrogram()
Clusters <- hclust(dd, method = "ave")
dend <- as.dendrogram(Clusters)
dend <- color_branches(dend, k = 2)
circlize_dendrogram(dend, labels = FALSE)
library(factoextra)   # fviz_cluster()
library(FactoMineR)   # PCA()
fviz_cluster(list(data = clustering, cluster = cut)) +
  theme_minimal()
pc <- PCA(clustering, ncp = 2, scale.unit = TRUE)
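To help interpret the two groups (my own addition), we can compare the average of each normalized variable within each cluster:
# per-cluster means of the normalized variables
aggregate(clustering, by = list(cluster = cut), FUN = mean)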
We’ll try to predict whether a house is in the expensive range or not, using the same predictors that we used for clustering; furthermore, we also add the dummy variable CHAS, since there are more expensive houses than cheap ones situated on the banks of the Charles River.
First of all, let’s select our variables.
c<- class %>%
dplyr::select(medv, nox, crim, rm, age, chas)
c$chas<- as.factor(c$chas)
c<- na.omit(c)
head(c)
## medv nox crim rm age chas
## 1 24.0 0.538 0.00632 6.575 65.2 0
## 2 21.6 0.469 0.02731 6.421 78.9 0
## 3 34.7 0.469 0.02729 7.185 61.1 0
## 4 33.4 0.458 0.03237 6.998 45.8 0
## 5 36.2 0.458 0.06905 7.147 54.2 0
## 6 28.7 0.458 0.02985 6.430 58.7 0
#View(classif)
dim(c)
## [1] 506 6
The main problem with our data is that the classes of chas are unbalanced.
Because of this, the KNN and random forest classifiers perform very poorly on the minority class, and measures such as the AUC are correspondingly low.
table(c$chas)
##
## 0 1
## 471 35
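The same imbalance expressed as proportions (a quick sketch): roughly 93% of the tracts do not bound the river.
# share of each class of chas
round(prop.table(table(c$chas)), 3)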
Let’s try to address this problem with ROSE (Random Over-Sampling Examples), which generates synthetic data to balance the classes.
library(ROSE)
data <- c %>%
  dplyr::select(medv, nox, crim, rm, age, chas)
data$chas <- as.factor(data$chas)
# 70/30 train-test split
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test <- data[ind == 2, ]
# generate a balanced training set of 500 observations
rose <- ROSE(chas ~ ., data = train, N = 500, seed = 111)$data
table(rose$chas)
##
## 0 1
## 234 266
As we can see, our data are now almost balanced; let’s try them with a random forest.
random forest
Let’s use a random forest to compare the original dataset with this rebalanced one.
With the original set the model almost never detected the minority class; with the ROSE-balanced data the results are noticeably better.
library(randomForest)
library(caret)   # confusionMatrix()
rfrose <- randomForest(chas ~ ., data = rose)
confusionMatrix(predict(rfrose, test), test$chas, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 4
## 1 32 3
##
## Accuracy : 0.7483
## 95% CI : (0.6689, 0.817)
## No Information Rate : 0.951
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0667
##
## Mcnemar's Test P-Value : 6.795e-06
##
## Sensitivity : 0.42857
## Specificity : 0.76471
## Pos Pred Value : 0.08571
## Neg Pred Value : 0.96296
## Prevalence : 0.04895
## Detection Rate : 0.02098
## Detection Prevalence : 0.24476
## Balanced Accuracy : 0.59664
##
## 'Positive' Class : 1
##
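Since the AUC was mentioned earlier, a hedged extra check (not part of the original write-up) would be to look at the ROC curve of this ROSE-trained forest on the untouched test set, using ROSE’s roc.curve():
# predicted probability of class "1" on the test set, then ROC curve and AUC
prob <- predict(rfrose, test, type = "prob")[, "1"]
roc.curve(test$chas, prob)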
The accuracy is not that bad, but in this case we face a problem of data reliability, since the model is trained largely on synthetic observations.
Because I prefer not to rely on data generated in this way, let’s create a new dummy variable and use KNN instead.
We’ll divide the houses according to the mean value of medv.
c$chas <- ifelse(c$medv < 22.5, 1, 0)   # 1 = medv below (roughly) its mean of 22.5
Now the classes are much more balanced.
table(c$chas)
##
## 0 1
## 236 270
First, we have to shuffle our data randomly.
set.seed(123)
random <- sample(1:506)   # random permutation of the row indices
Classif <- c[random, ]
classif2 <- as.data.frame(lapply(Classif[, 1:6], normalize))
head(classif2)
## medv nox crim rm age chas
## 1 0.04444444 0.63374486 5.141041e-01 0.1835601 1.0000000 0
## 2 0.32222222 0.67489712 7.472866e-02 0.5280705 0.8249228 1
## 3 0.55333333 0.25720165 6.755095e-04 0.6321134 0.7363543 0
## 4 0.34222222 0.31481481 7.007315e-03 0.4575589 0.6065911 1
## 5 0.53555556 0.03292181 9.070485e-05 0.5830619 0.1637487 0
## 6 0.07333333 0.60493827 1.781949e-01 0.4474037 0.9526262 0
Now let’s build our training and test sets.
boston.train <- classif2[106:506, 1:5]
boston.train.label <- Classif[106:506, 6]
boston.test <- classif2[1:106, 1:5]
boston.test.label <- Classif[1:106, 6]
classif2 <- na.omit(classif2)
library(class)   # knn()
predict <- knn(train = boston.train, test = boston.test, cl = boston.train.label, k = 2)
Tab<- table(boston.test.label, predict)
Tab
## predict
## boston.test.label 0 1
## 0 42 8
## 1 6 50
SPECIFICITY, SENSITIVITY AND AUC
psych::AUC(Tab, plot = "a")
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## predict
## boston.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.396 0.075
## True.Neg 0.057 0.472
##
## Conditional probabilities of
## predict
## boston.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.84 0.16
## True.Neg 0.11 0.89
##
## Accuracy = 0.87 Sensitivity = 0.84 Specificity = 0.89
## with Area Under the Curve = 0.94
## d.prime = 2.24 Criterion = 1.24 Beta = 0.84
## Observed Phi correlation = 0.73
## Inferred latent (tetrachoric) correlation = 0.92
From the output we can see that both sensitivity and specificity are quite high, and the accuracy is also good.
The AUC of about 0.94 indicates that our classifier performs quite well.
PLOT ACCURACY AND MISCLASSIFICATION RATE
# misclassification error rate for k = 1, ..., 20
res <- rep(NA, 20)
for (i in 1:20) {
  mod <- knn(train = boston.train, test = boston.test, cl = boston.train.label, k = i)
  acc <- sum(boston.test.label == mod)
  res[i] <- 1 - acc / length(mod)
}
par(mfrow = c(1, 2))
plot(1:20, res, type = "l", col = "red", xlab = "Neighbours", ylab = "Misclassification error rate")
plot(1:20, 1 - res, type = "l", col = "blue", xlab = "Neighbours", ylab = "Accuracy")
The best number of neighbours is K = 5; beyond that the accuracy falls dramatically.
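The best K can also be read off programmatically from the error vector (a small sketch):
# k with the lowest misclassification error and the corresponding accuracy
which.min(res)
1 - min(res)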
This next approach is instead based on resampling methods (cross-validation); it is very useful because it lets us use the whole dataset to build and evaluate the classifier, rather than losing information to a fixed hold-out split.
TrainData <- Classif[, 1:5]
TrainClasses <- Classif[, 6]
# 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
CrossValid <- train(TrainData, TrainClasses,
                    method = "knn",
                    preProcess = c("center", "scale"),
                    tuneLength = 12,
                    trControl = control)
plot(CrossValid, col = thematic::thematic_get_option("accent"))
CrossValid$finalModel
## 5-nearest neighbor regression model
CrossValid$results
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 0.2815012 0.6870619 0.1467059 0.03156493 0.06780671 0.02020376
## 2 7 0.2872150 0.6718190 0.1609636 0.03540508 0.08170840 0.02781818
## 3 9 0.2903372 0.6654101 0.1679068 0.03992556 0.09111475 0.02971096
## 4 11 0.2942705 0.6560726 0.1757825 0.03807501 0.08681636 0.02750793
## 5 13 0.2930117 0.6609281 0.1771610 0.03767991 0.08483267 0.02762821
## 6 15 0.2945096 0.6586036 0.1805490 0.03312762 0.07320200 0.02368154
## 7 17 0.2979507 0.6507523 0.1844083 0.03348400 0.07336711 0.02442628
## 8 19 0.2999605 0.6468468 0.1887503 0.03413859 0.07618057 0.02409907
## 9 21 0.3039704 0.6372801 0.1931887 0.03409637 0.07533997 0.02317155
## 10 23 0.3082294 0.6268530 0.1994140 0.03528681 0.07819190 0.02396937
## 11 25 0.3124397 0.6164402 0.2045086 0.03502049 0.07876307 0.02338226
## 12 27 0.3131420 0.6155239 0.2068286 0.03480779 0.07891584 0.02332958
Also in this case we can see that at K = 5 the RMSE is at its lowest value and the R-squared is maximized.
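Note that caret treated the 0/1 target as numeric and therefore fitted a k-NN regression (hence the RMSE and R-squared columns above). If classification was the intent, a hedged sketch of the same cross-validation with the target coded as a factor, so that caret reports Accuracy and Kappa instead (the names TrainClassesF and CrossValidClass are hypothetical), would be:
# same 10-fold CV, but with the dummy treated as a class label
TrainClassesF <- factor(Classif[, 6])
CrossValidClass <- train(TrainData, TrainClassesF,
                         method = "knn",
                         preProcess = c("center", "scale"),
                         tuneLength = 12,
                         trControl = control)
CrossValidClass$results[, c("k", "Accuracy", "Kappa")]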