In the following project we’ll use the Boston Housing dataset. The data were gathered by the US Census Bureau on housing in the Boston, Massachusetts area, and the dataset has been used extensively in the academic literature as a benchmark for different algorithms.
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.
The attributes are defined as follows (taken from the UCI Machine Learning Repository):
CRIM: per capita crime rate by town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
MEDV: Median value of owner-occupied homes in $1000s
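The original write-up does not show how the `class` data frame was created. As a hedged sketch, assuming the data come from the `Boston` dataset shipped with the MASS package (whose column names match the output below), one could load it like this:
library(MASS)      # assumption: the data come from MASS's Boston dataset
class <- Boston    # 506 observations of 14 variables (crim, zn, ..., medv)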
head(class)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Let’s also take a look at a summary of the whole dataset.
psych::describe(class)
## vars n mean sd median trimmed mad min max range skew
## crim 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97 5.19
## zn 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00 2.21
## indus 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28 0.29
## chas 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00 3.39
## nox 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49 0.72
## rm 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22 0.40
## age 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10 -0.60
## dis 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00 1.01
## rad 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00 1.00
## tax 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00 0.67
## ptratio 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40 -0.80
## black 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58 -2.87
## lstat 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24 0.90
## medv 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00 1.10
## kurtosis se
## crim 36.60 0.38
## zn 3.95 1.04
## indus -1.24 0.30
## chas 9.48 0.01
## nox -0.09 0.01
## rm 1.84 0.03
## age -0.98 1.25
## dis 0.46 0.09
## rad -0.88 0.39
## tax -1.15 7.49
## ptratio -0.30 0.10
## black 7.10 4.06
## lstat 0.46 0.32
## medv 1.45 0.41
str(class)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Before starting, let’s first check for missing values.
library(Amelia)   # missmap() comes from the Amelia package
missmap(class, col = c('yellow', 'black'), y.at = 1, y.labels = '', legend = TRUE)
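As a quick numeric complement to the missingness map (my own addition), we can also count the NAs per column:
# count missing values in each column; all zeros would mean no missing data
colSums(is.na(class))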
Let’s also load some useful functions.
# removing outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Population standard deviation function
PopSD <- function(x){
  sqrt(sum((x - mean(x))^2) / length(x))
}
# Variable standardizer function
Standardize <- function(x){ (x - mean(x)) / PopSD(x) }
# Normalization
normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}
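A minimal sanity check of these helpers on a toy vector (not part of the original analysis), just to illustrate their behaviour:
# toy vector with one extreme value
x <- c(1, 2, 3, 4, 100)
remove_outliers(x)        # the value 100 falls outside 1.5*IQR and becomes NA
normalize(x)              # rescales everything into the [0, 1] range
round(Standardize(x), 2)  # centers on the mean and divides by the population SD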
intro
Let’s see whether different patterns emerge with respect to the following variables:
medv
nox
crim
rm
age
First of all, let’s choose our variables of interest, normalize them, and remove the outliers.
#View(class)
clustering <- class%>%
dplyr::select(medv, nox, crim, rm, age)
#normalize
clustering$medv <- normalize(clustering$medv)
clustering$nox <- normalize(clustering$nox)
clustering$crim <- normalize(clustering$crim)
clustering$rm <- normalize(clustering$rm)
clustering$age <- normalize(clustering$age)
#outliers
clustering$medv <- remove_outliers(clustering$medv)
clustering$nox <- remove_outliers(clustering$nox)
clustering$crim <- remove_outliers(clustering$crim)
clustering$rm <- remove_outliers(clustering$rm)
clustering$age <- remove_outliers(clustering$age)
clustering<- na.omit(clustering)
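As a small sketch (not in the original), we can check how many observations were dropped by the outlier filtering:
# rows removed as outliers and rows remaining for clustering
nrow(class) - nrow(clustering)
nrow(clustering)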
Let’s visualize our data.
head(clustering)
## medv nox crim rm age
## 1 0.4222222 0.3148148 0.0000000000 0.5775053 0.6416066
## 2 0.3688889 0.1728395 0.0002359225 0.5479977 0.7826982
## 3 0.6600000 0.1728395 0.0002356977 0.6943859 0.5993821
## 4 0.6311111 0.1502058 0.0002927957 0.6585553 0.4418126
## 5 0.6933333 0.1502058 0.0007050701 0.6871048 0.5283213
## 6 0.5266667 0.1502058 0.0002644715 0.5497222 0.5746653
library(clValid)
#clustering=as.data.frame(clustering)
intern<- clValid(clustering, 2:5,
clMethods = c("hierarchical", "kmeans", "diana", "pam"),
validation = "internal")
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans diana pam
##
## Cluster sizes:
## 2 3 4 5
##
## Validation Measures:
## 2 3 4 5
##
## hierarchical Connectivity 25.6786 28.6964 33.7147 42.3544
## Dunn 0.0660 0.0888 0.0888 0.0692
## Silhouette 0.4517 0.4111 0.3859 0.2937
## kmeans Connectivity 38.1905 30.5710 76.0135 92.0123
## Dunn 0.0467 0.0559 0.0354 0.0395
## Silhouette 0.4639 0.4193 0.3415 0.3356
## diana Connectivity 42.2306 44.6230 63.7643 89.1964
## Dunn 0.0458 0.0602 0.0617 0.0653
## Silhouette 0.4629 0.4141 0.3352 0.3012
## pam Connectivity 31.1817 82.2917 71.3421 105.0988
## Dunn 0.0415 0.0283 0.0441 0.0498
## Silhouette 0.4613 0.3263 0.3411 0.2626
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 25.6786 hierarchical 2
## Dunn 0.0888 hierarchical 3
## Silhouette 0.4639 kmeans 2
Let’s also display the results graphically.
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(intern, legend = FALSE)
# empty panel used only to hold the legend
plot(nClusters(intern), measures(intern, "Dunn")[, , 1], type = "n", axes = FALSE,
     xlab = "", ylab = "")
legend("center", clusterMethods(intern), col = 1:9, lty = 1:9, pch = paste(1:9))
par(op)
Connectivity should be minimized, the Dunn index should be maximized, and the silhouette lies between -1 and 1, with values closest to 1 indicating the best fit. Hierarchical clustering with 2 clusters seems to be the best procedure.
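The same conclusion can be read off programmatically with clValid’s optimalScores() helper (a small sketch, not shown in the original):
# best score, method and number of clusters for each validation measure
optimalScores(intern)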
Let’s compute the distance matrix, plot the dendrogram and, keeping the elbow method and the silhouette in mind, decide how many clusters to keep.
dd = dist(clustering, method = "euclidean")
clusters= hclust(dd, method = "complete")
plot(clusters)
cut=cutree(clusters,2)
rect.hclust(clusters, k=2, border = 2:6)
This plot shows the two clusters obtained using Euclidean distance and complete linkage.
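Since the text mentions the elbow method and the silhouette, here is a hedged sketch of how one could double-check the choice of two clusters, assuming the factoextra and cluster packages are available:
library(cluster)
# elbow plot: total within-cluster sum of squares for hierarchical cuts
factoextra::fviz_nbclust(clustering, factoextra::hcut, method = "wss")
# average silhouette width for the complete-linkage cut into 2 clusters
sil <- silhouette(cut, dd)
mean(sil[, "sil_width"])   # compare with the silhouette values reported by clValid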
library(GGally)   # ggpairs()
ggpairs(cbind(clustering, cut = as.factor(cut)),
        columns = 1:4, aes(colour = cut, alpha = 0.5),
        lower = list(continuous = "points"),
        upper = list(continuous = "blank"),
        axisLabels = "none", switch = "both") +
  theme_bw()
For a better visualization of the clusters:
library(circlize)
library(dendextend)   # color_branches() and circlize_dendrogram()
Clusters <- hclust(dd, method = "ave")
dend <- as.dendrogram(Clusters)
dend <- color_branches(dend, k = 2)
circlize_dendrogram(dend, labels = FALSE)
library(factoextra)   # fviz_cluster()
library(FactoMineR)   # PCA()
fviz_cluster(list(data = clustering, cluster = cut)) +
  theme_minimal()
pc <- PCA(clustering, ncp = 2, scale.unit = TRUE)
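To help interpret the two groups (my own addition), we can compare the average of each normalized variable within each cluster:
# per-cluster means of the normalized variables
aggregate(clustering, by = list(cluster = cut), FUN = mean)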
We’ll try to predict whether a house is in the expensive range or not, using the same predictors that we used for clustering; furthermore, we also add the dummy variable CHAS, since there are more expensive houses than cheap ones situated on the banks of the Charles River.
First of all, let’s select our variables.
c<- class %>%
dplyr::select(medv, nox, crim, rm, age, chas)
c$chas<- as.factor(c$chas)
c<- na.omit(c)
head(c)
## medv nox crim rm age chas
## 1 24.0 0.538 0.00632 6.575 65.2 0
## 2 21.6 0.469 0.02731 6.421 78.9 0
## 3 34.7 0.469 0.02729 7.185 61.1 0
## 4 33.4 0.458 0.03237 6.998 45.8 0
## 5 36.2 0.458 0.06905 7.147 54.2 0
## 6 28.7 0.458 0.02985 6.430 58.7 0
#View(classif)
dim(c)
## [1] 506 6
The main problem with our data is that the classes of chas are unbalanced.
Because of this, the KNN and random forest classifiers perform very poorly on the minority class, and measures such as the AUC are correspondingly low.
table(c$chas)
##
## 0 1
## 471 35
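The same imbalance expressed as proportions (a quick sketch): roughly 93% of the tracts do not bound the river.
# share of each class of chas
round(prop.table(table(c$chas)), 3)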
Let’s try to address this problem with ROSE (Random Over-Sampling Examples), which generates synthetic data to balance the classes.
library(ROSE)
data <- c %>%
  dplyr::select(medv, nox, crim, rm, age, chas)
data$chas <- as.factor(data$chas)
# 70/30 train-test split
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test <- data[ind == 2, ]
# generate a balanced training set of 500 observations
rose <- ROSE(chas ~ ., data = train, N = 500, seed = 111)$data
table(rose$chas)
##
## 0 1
## 234 266
As we can see, our data are now almost balanced; let’s try them with a random forest.
random forest
Let’s use a random forest to compare the original dataset with this rebalanced one.
With the original set the model almost never detected the minority class; with the ROSE-balanced data the results are noticeably better.
library(randomForest)
library(caret)   # confusionMatrix()
rfrose <- randomForest(chas ~ ., data = rose)
confusionMatrix(predict(rfrose, test), test$chas, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 4
## 1 32 3
##
## Accuracy : 0.7483
## 95% CI : (0.6689, 0.817)
## No Information Rate : 0.951
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0667
##
## Mcnemar's Test P-Value : 6.795e-06
##
## Sensitivity : 0.42857
## Specificity : 0.76471
## Pos Pred Value : 0.08571
## Neg Pred Value : 0.96296
## Prevalence : 0.04895
## Detection Rate : 0.02098
## Detection Prevalence : 0.24476
## Balanced Accuracy : 0.59664
##
## 'Positive' Class : 1
##
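Since the AUC was mentioned earlier, a hedged extra check (not part of the original write-up) would be to look at the ROC curve of this ROSE-trained forest on the untouched test set, using ROSE’s roc.curve():
# predicted probability of class "1" on the test set, then ROC curve and AUC
prob <- predict(rfrose, test, type = "prob")[, "1"]
roc.curve(test$chas, prob)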
The accuracy is not that bad, but in this case we face a problem of data reliability, since the model is trained largely on synthetic observations.
Because I prefer not to rely on data generated in this way, let’s create a new dummy variable and use KNN instead.
We’ll divide the houses according to the mean value of medv.
c$chas <- ifelse(c$medv < 22.5, 1, 0)   # 1 = medv below (roughly) its mean of 22.5
Now the classes are much more balanced.
table(c$chas)
##
## 0 1
## 236 270
First, we have to shuffle our data randomly.
set.seed(123)
random <- sample(1:506)   # random permutation of the row indices
Classif <- c[random, ]
classif2 <- as.data.frame(lapply(Classif[, 1:6], normalize))
head(classif2)
## medv nox crim rm age chas
## 1 0.04444444 0.63374486 5.141041e-01 0.1835601 1.0000000 0
## 2 0.32222222 0.67489712 7.472866e-02 0.5280705 0.8249228 1
## 3 0.55333333 0.25720165 6.755095e-04 0.6321134 0.7363543 0
## 4 0.34222222 0.31481481 7.007315e-03 0.4575589 0.6065911 1
## 5 0.53555556 0.03292181 9.070485e-05 0.5830619 0.1637487 0
## 6 0.07333333 0.60493827 1.781949e-01 0.4474037 0.9526262 0
Now let’s build our training and test sets.
boston.train <- classif2[106:506, 1:5]
boston.train.label <- Classif[106:506, 6]
boston.test <- classif2[1:106, 1:5]
boston.test.label <- Classif[1:106, 6]
classif2 <- na.omit(classif2)
library(class)   # knn()
predict <- knn(train = boston.train, test = boston.test, cl = boston.train.label, k = 2)
Tab<- table(boston.test.label, predict)
Tab
## predict
## boston.test.label 0 1
## 0 42 8
## 1 6 50
SPECIFICITY, SENSITIVITY AND AUC
psych::AUC(Tab, plot = "a")
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## predict
## boston.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.396 0.075
## True.Neg 0.057 0.472
##
## Conditional probabilities of
## predict
## boston.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.84 0.16
## True.Neg 0.11 0.89
##
## Accuracy = 0.87 Sensitivity = 0.84 Specificity = 0.89
## with Area Under the Curve = 0.94
## d.prime = 2.24 Criterion = 1.24 Beta = 0.84
## Observed Phi correlation = 0.73
## Inferred latent (tetrachoric) correlation = 0.92
From the output we can see that both sensitivity and specificity are quite high, and the accuracy is also good.
The AUC of about 0.94 indicates that our classifier performs quite well.
PLOT ACCURACY AND MISCLASSIFICATION RATE
# misclassification error rate for k = 1, ..., 20
res <- rep(NA, 20)
for (i in 1:20) {
  mod <- knn(train = boston.train, test = boston.test, cl = boston.train.label, k = i)
  acc <- sum(boston.test.label == mod)
  res[i] <- 1 - acc / length(mod)
}
par(mfrow = c(1, 2))
plot(1:20, res, type = "l", col = "red", xlab = "Neighbours", ylab = "Misclassification error rate")
plot(1:20, 1 - res, type = "l", col = "blue", xlab = "Neighbours", ylab = "Accuracy")
The best number of neighbours is K = 5; beyond that the accuracy falls dramatically.
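The best K can also be read off programmatically from the error vector (a small sketch):
# k with the lowest misclassification error and the corresponding accuracy
which.min(res)
1 - min(res)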
This next approach is instead based on resampling methods (cross-validation); it is very useful because it lets us use the whole dataset to build and evaluate the classifier, rather than losing information to a fixed hold-out split.
TrainData <- Classif[, 1:5]
TrainClasses <- Classif[, 6]
# 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
CrossValid <- train(TrainData, TrainClasses,
                    method = "knn",
                    preProcess = c("center", "scale"),
                    tuneLength = 12,
                    trControl = control)
plot(CrossValid, col = thematic::thematic_get_option("accent"))
CrossValid$finalModel
## 5-nearest neighbor regression model
CrossValid$results
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 0.2815012 0.6870619 0.1467059 0.03156493 0.06780671 0.02020376
## 2 7 0.2872150 0.6718190 0.1609636 0.03540508 0.08170840 0.02781818
## 3 9 0.2903372 0.6654101 0.1679068 0.03992556 0.09111475 0.02971096
## 4 11 0.2942705 0.6560726 0.1757825 0.03807501 0.08681636 0.02750793
## 5 13 0.2930117 0.6609281 0.1771610 0.03767991 0.08483267 0.02762821
## 6 15 0.2945096 0.6586036 0.1805490 0.03312762 0.07320200 0.02368154
## 7 17 0.2979507 0.6507523 0.1844083 0.03348400 0.07336711 0.02442628
## 8 19 0.2999605 0.6468468 0.1887503 0.03413859 0.07618057 0.02409907
## 9 21 0.3039704 0.6372801 0.1931887 0.03409637 0.07533997 0.02317155
## 10 23 0.3082294 0.6268530 0.1994140 0.03528681 0.07819190 0.02396937
## 11 25 0.3124397 0.6164402 0.2045086 0.03502049 0.07876307 0.02338226
## 12 27 0.3131420 0.6155239 0.2068286 0.03480779 0.07891584 0.02332958
Also in this case we can see that at K = 5 the RMSE is at its lowest value and the R-squared is maximized.
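Note that caret treated the 0/1 target as numeric and therefore fitted a k-NN regression (hence the RMSE and R-squared columns above). If classification was the intent, a hedged sketch of the same cross-validation with the target coded as a factor, so that caret reports Accuracy and Kappa instead (the names TrainClassesF and CrossValidClass are hypothetical), would be:
# same 10-fold CV, but with the dummy treated as a class label
TrainClassesF <- factor(Classif[, 6])
CrossValidClass <- train(TrainData, TrainClassesF,
                         method = "knn",
                         preProcess = c("center", "scale"),
                         tuneLength = 12,
                         trControl = control)
CrossValidClass$results[, c("k", "Accuracy", "Kappa")]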