Data Set 1: HousePrices.csv
This data set includes prices and characteristics of n=128 houses. The following analysis attempts to predict which neighborhood a house is in based on its characteristics, using a K nearest neighbor approach.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(textir)
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
library(class)
library(car)
hpr<- read.csv("C:/DataMining/Data/HousePrices.csv")
head(hpr)
## HomeID Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 1 114300 1790 2 2 2 No East
## 2 2 114200 2030 4 2 3 No East
## 3 3 114800 1740 3 2 1 No East
## 4 4 94700 1980 3 2 3 No East
## 5 5 119800 2130 3 3 3 No East
## 6 6 114600 1780 3 2 2 No North
table(hpr$Neighborhood)
##
## East North West
## 45 44 39
dim(hpr)
## [1] 128 8
hpr$BrickTrue = ifelse(hpr$Brick == "Yes", 1, 0) ## dummy variable: 1 if the house is brick, 0 otherwise
house <- hpr[,c("Neighborhood","Price","SqFt","Bedrooms","Bathrooms","Offers","BrickTrue")]
Here we can see the six housing characteristic variables separated by neighborhood. The plots suggest that a house can plausibly be classified into a neighborhood based on its characteristics.
par(mfrow=c(3,3), mai=c(.3,.6,.1,.1))
plot(Price ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(SqFt ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Bedrooms ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Bathrooms ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Offers ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(BrickTrue ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))

The data set is split into a training set and a test set, and the predictor variables are rescaled (standardized) so that they contribute equally to the distance calculations.
n=length(house$Neighborhood)
n
## [1] 128
nt=100
set.seed(1) ### To make the calculations reproducible in repeated runs
train <- sample(1:n,nt)
x<-scale(house[,c(2:7)])
x[1:3,]
## Price SqFt Bedrooms Bathrooms Offers BrickTrue
## [1,] -0.6002263 -0.9969990 -1.40978793 -0.8655378 -0.5406451 -0.6961011
## [2,] -0.6039481 0.1373643 1.34521749 -0.8655378 0.3945248 -0.6961011
## [3,] -0.5816174 -1.2333247 -0.03228522 -0.8655378 -1.4758150 -0.6961011
mean(x)
## [1] 6.38084e-18
sd(x)
## [1] 0.9967352
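Each column individually has mean 0 and standard deviation exactly 1; the pooled standard deviation above is slightly below 1 only because it is computed over all six columns at once. A quick per-column check (a sketch, not part of the original output):
round(colMeans(x),10) ## column means, all effectively zero
apply(x,2,sd) ## column standard deviations, all exactly 1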
The nearest neighbor method is used to classify the 28 houses in our test set with k=1 and k=5. The results are then plotted (the test data are the solid points and the training data are the open points).
nearest1 <- knn(train=x[train,],test=x[-train,],cl=house$Neighborhood[train],k=1)#nearest neighbor
nearest5 <- knn(train=x[train,],test=x[-train,],cl=house$Neighborhood[train],k=5)#5 nearest neighbors
data.frame(house$Neighborhood[-train],nearest1,nearest5)
## house.Neighborhood..train. nearest1 nearest5
## 1 East North North
## 2 East East North
## 3 West West West
## 4 East North North
## 5 East East East
## 6 West West West
## 7 North East East
## 8 East East East
## 9 North North North
## 10 North East East
## 11 East North North
## 12 East North North
## 13 East East North
## 14 East North North
## 15 North North North
## 16 East East East
## 17 North North East
## 18 East North North
## 19 West West North
## 20 North North North
## 21 West West West
## 22 West West West
## 23 West West West
## 24 West West West
## 25 East West West
## 26 West West West
## 27 East North North
## 28 East West West
par(mfrow=c(1,2))
## plot for k=1 (single) nearest neighbor
## (plot() on a matrix uses its first two columns, here scaled Price and SqFt)
plot(x[train,],col=house$Neighborhood[train],cex=.8,main="1-nearest neighbor")
points(x[-train,],bg=nearest1,pch=21,col=grey(.9),cex=1.25)
## plot for k=5 nearest neighbors
plot(x[train,],col=house$Neighborhood[train],cex=.8,main="5-nearest neighbors")
points(x[-train,],bg=nearest5,pch=21,col=grey(.9),cex=1.25)
legend("topright",legend=levels(house$Neighborhood),fill=1:6,bty="n",cex=.75)

Press's Q is evaluated against the critical chi-square value, which is 5.99 (alpha=.05, 2 degrees of freedom). The Press's Q statistics for both the single nearest neighbor (43.2) and 5 nearest neighbors (9.9) exceed this critical value, suggesting that both models classify better than chance. The single nearest neighbor performs especially well by this measure.
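The code behind these Press's Q values is not shown above; a minimal sketch, following the same formula as the data set 2 computation near the end of this report (which, note, plugs the full sample size n into the formula rather than the 28-house test-set size), reproduces the quoted 43.2 and 9.9:
pcorr1=100*sum(house$Neighborhood[-train]==nearest1)/(n-nt) ## percent correct, k=1
pcorr5=100*sum(house$Neighborhood[-train]==nearest5)/(n-nt) ## percent correct, k=5
qchisq(.95,2) ## critical chi-square value, alpha=.05, 2 d.f.
PressQ1=((n-((pcorr1/100)*n*3))^2)/(n*2) ## Press's Q = [N-(nc*k)]^2/(N(k-1)), k=3 groups
PressQ5=((n-((pcorr5/100)*n*3))^2)/(n*2)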
To see whether a different number of neighbors classifies more accurately, a leave-one-out cross-validation is run and the most accurate value of k is selected.
## cross-validation (leave one out)
pcorr=numeric(10) ## percent correctly classified for each value of k
neighbors=numeric(10) ## records how many neighbors were used
for (k in 1:10) {
pred=knn.cv(x,cl=house$Neighborhood,k)
pcorr[k]=100*sum(house$Neighborhood==pred)/n
neighbors[k]=k ## Populates that count variable with k each time through
}
pcorr
## [1] 60.93750 57.81250 52.34375 54.68750 57.03125 57.81250 60.15625
## [8] 58.59375 58.59375 60.93750
neighbors
## [1] 1 2 3 4 5 6 7 8 9 10
maxAcc<-max(pcorr) ## Maximum accuracy (percent correct)
accTable<-rbind(pcorr,neighbors)
accTable
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## pcorr 60.9375 57.8125 52.34375 54.6875 57.03125 57.8125 60.15625
## neighbors 1.0000 2.0000 3.00000 4.0000 5.00000 6.0000 7.00000
## [,8] [,9] [,10]
## pcorr 58.59375 58.59375 60.9375
## neighbors 8.00000 9.00000 10.0000
kBest=neighbors[pcorr==maxAcc]
kBest
## [1] 1 10
Here k=1 ties with k=10 for the highest leave-one-out accuracy (60.9%); because knn.cv breaks distance ties at random, the selected value can vary across repeated runs, though k=1 is typically among the best.
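One way to stabilize the selection is to repeat the cross-validation and average the accuracies before choosing k; a minimal sketch (not part of the original run):
## repeat the leave-one-out CV several times and average the accuracy for each k,
## since knn.cv breaks distance ties at random
set.seed(1)
pcorrAvg=rowMeans(sapply(1:25, function(r)
sapply(1:10, function(k) 100*mean(house$Neighborhood==knn.cv(x,cl=house$Neighborhood,k)))))
which.max(pcorrAvg) ## k with the highest average leave-one-out accuracy
To do a final evaluation of the single-nearest-neighbor method, a confusion matrix is run.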
Near1<-data.frame(truetype=house$Neighborhood[-train],predtype=nearest1)
confusionMatrix(data=nearest1, reference=house$Neighborhood[-train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction East North West
## East 5 2 0
## North 7 4 0
## West 2 0 8
##
## Overall Statistics
##
## Accuracy : 0.6071
## 95% CI : (0.4058, 0.785)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1725
##
## Kappa : 0.4296
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: East Class: North Class: West
## Sensitivity 0.3571 0.6667 1.0000
## Specificity 0.8571 0.6818 0.9000
## Pos Pred Value 0.7143 0.3636 0.8000
## Neg Pred Value 0.5714 0.8824 1.0000
## Prevalence 0.5000 0.2143 0.2857
## Detection Rate 0.1786 0.1429 0.2857
## Detection Prevalence 0.2500 0.3929 0.3571
## Balanced Accuracy 0.6071 0.6742 0.9500
The confusion matrix output shows that the model does best on West houses (balanced accuracy 0.95), and that its East and West predictions are more trustworthy than its North predictions (positive predictive values of 0.71 and 0.80 versus 0.36). The overall accuracy of 61% exceeds the no-information rate of 50%, although not significantly so (p = 0.17), and the kappa of 0.43 indicates only moderate agreement, so other methods should be investigated to produce a more accurate predictive classification.
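The kappa reported above can be recomputed by hand from the confusion matrix; a sketch (not part of the original output):
## Cohen's kappa from the k=1 confusion matrix: (po - pe)/(1 - pe)
cm=table(Prediction=nearest1, Reference=house$Neighborhood[-train])
po=sum(diag(cm))/sum(cm) ## observed agreement (the accuracy)
pe=sum(rowSums(cm)*colSums(cm))/sum(cm)^2 ## agreement expected by chance
(po-pe)/(1-pe) ## about 0.43, matching the output above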
Data Set 2: ReducedFoodInsec.csv
This data set includes answers from an NHIS survey of 1168 individuals or families. The following analysis attempts to predict whether a respondent answered Often True, Sometimes True, or Never True when asked whether they worried that their food would run out before they got money to buy more. Classification is based on their answers to several other survey questions, using a K nearest neighbor approach.
famx<- read.csv("C:/DataMining/Data/ReducedFoodInsec.csv")
## keep the response (FSRUNOUT) and the eight predictor questions
fam <- famx[,c("FSRUNOUT","FM_EDUC1","INCGRP5","FSNAP","FHICOST","FM_TYPE","FM_SIZE","FDGLWCT1","FWICYN")]
## give the coded survey variables descriptive names
fam$WICBenefits<-fam$FWICYN
fam$FamilySize<-fam$FM_SIZE
fam$NumMembersWorking<-fam$FDGLWCT1
fam$EducationLevel<-fam$FM_EDUC1
fam$MedicalDentalCost<-fam$FHICOST
fam$SNAPBenefits<-fam$FSNAP
fam$FamilyType<-fam$FM_TYPE
fam$FamilyIncome<-fam$INCGRP5
## convert the 1/2/3 response codes to labeled factor levels
fam$FSRUNOUT = factor(fam$FSRUNOUT, levels=c("1","2","3"))
levels(fam$FSRUNOUT) = c("OftenTrue","SometimesTrue","NeverTrue")
fam$FSRUNOUT <- factor(fam$FSRUNOUT) ## re-factor (drops any unused levels)
## drop the original coded columns, keeping only the renamed ones
family=fam[,c(-2:-9)]
head(family)
## FSRUNOUT WICBenefits FamilySize NumMembersWorking EducationLevel
## 1 NeverTrue 2 3 1 8
## 2 SometimesTrue 1 12 2 5
## 3 SometimesTrue 2 3 1 6
## 4 SometimesTrue 1 6 3 8
## 5 NeverTrue 1 4 2 8
## 6 NeverTrue 2 4 2 6
## MedicalDentalCost SNAPBenefits FamilyType FamilyIncome
## 1 2 2 4 2
## 2 3 1 4 2
## 3 4 2 4 3
## 4 2 1 4 2
## 5 1 1 4 4
## 6 1 2 4 4
summary(family)
## FSRUNOUT WICBenefits FamilySize NumMembersWorking
## OftenTrue : 30 Min. :1.000 Min. : 2.000 Min. :0.000
## SometimesTrue: 87 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:1.000
## NeverTrue :1051 Median :2.000 Median : 4.000 Median :2.000
## Mean :1.942 Mean : 4.537 Mean :1.783
## 3rd Qu.:2.000 3rd Qu.: 5.000 3rd Qu.:2.000
## Max. :9.000 Max. :12.000 Max. :6.000
## EducationLevel MedicalDentalCost SNAPBenefits FamilyType
## Min. : 1.000 Min. :0.000 Min. :1.000 Min. :3.000
## 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:4.000
## Median : 8.000 Median :2.000 Median :2.000 Median :4.000
## Mean : 7.256 Mean :2.241 Mean :1.988 Mean :3.967
## 3rd Qu.: 9.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :99.000 Max. :9.000 Max. :9.000 Max. :4.000
## FamilyIncome
## Min. : 1.00
## 1st Qu.: 2.00
## Median : 4.00
## Mean :13.81
## 3rd Qu.: 4.00
## Max. :99.00
Here we can see the eight survey question variables separated by FSRUNOUT response. The plots suggest that an individual or family could plausibly be classified into a response group based on their answers to the other questions.
par(mfrow=c(3,3), mai=c(.3,.6,.1,.1))
plot(FamilySize ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(FamilyType ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(FamilyIncome ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(NumMembersWorking ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(MedicalDentalCost ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(EducationLevel ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(WICBenefits ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(SNAPBenefits ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))

The data set is split into a training set and a test set, and the predictor variables are rescaled (standardized) so that they contribute equally to the distance calculations.
n=length(family$FSRUNOUT)
n
## [1] 1168
nt=1100
set.seed(1) ### To make the calculations reproducible in repeated runs
train <- sample(1:n,nt)
x<- scale(family[,c(2:9)])
x[1:3,]
## WICBenefits FamilySize NumMembersWorking EducationLevel
## [1,] 0.06215024 -1.08243 -0.9861999 0.1752954
## [2,] -1.00537152 5.25657 0.2740643 -0.5315345
## [3,] 0.06215024 -1.08243 -0.9861999 -0.2959245
## MedicalDentalCost SNAPBenefits FamilyType FamilyIncome
## [1,] -0.1467714 0.0139804 0.1833017 -0.3922275
## [2,] 0.4611329 -1.1523840 0.1833017 -0.3922275
## [3,] 1.0690373 0.0139804 0.1833017 -0.3590254
The nearest neighbor method is used to classify the 68 individuals/families in our test set with k=1 and k=5. The results are then plotted.
nearest1 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=1)#nearest neighbor
nearest5 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=5)#5 nearest neighbors
data.frame(family$FSRUNOUT[-train],nearest1,nearest5)
## family.FSRUNOUT..train. nearest1 nearest5
## 1 NeverTrue NeverTrue NeverTrue
## 2 NeverTrue NeverTrue NeverTrue
## 3 NeverTrue NeverTrue NeverTrue
## 4 NeverTrue NeverTrue NeverTrue
## 5 NeverTrue NeverTrue NeverTrue
## 6 NeverTrue NeverTrue NeverTrue
## 7 NeverTrue NeverTrue NeverTrue
## 8 SometimesTrue NeverTrue NeverTrue
## 9 NeverTrue NeverTrue NeverTrue
## 10 NeverTrue SometimesTrue NeverTrue
## 11 NeverTrue NeverTrue NeverTrue
## 12 NeverTrue NeverTrue NeverTrue
## 13 NeverTrue NeverTrue NeverTrue
## 14 SometimesTrue OftenTrue OftenTrue
## 15 NeverTrue NeverTrue NeverTrue
## 16 NeverTrue NeverTrue NeverTrue
## 17 NeverTrue NeverTrue NeverTrue
## 18 NeverTrue NeverTrue NeverTrue
## 19 NeverTrue NeverTrue NeverTrue
## 20 NeverTrue NeverTrue NeverTrue
## 21 NeverTrue NeverTrue NeverTrue
## 22 NeverTrue NeverTrue NeverTrue
## 23 NeverTrue NeverTrue NeverTrue
## 24 NeverTrue NeverTrue NeverTrue
## 25 NeverTrue NeverTrue NeverTrue
## 26 NeverTrue NeverTrue NeverTrue
## 27 OftenTrue SometimesTrue NeverTrue
## 28 NeverTrue OftenTrue NeverTrue
## 29 NeverTrue NeverTrue NeverTrue
## 30 NeverTrue NeverTrue NeverTrue
## 31 NeverTrue NeverTrue NeverTrue
## 32 NeverTrue NeverTrue NeverTrue
## 33 NeverTrue NeverTrue NeverTrue
## 34 NeverTrue NeverTrue NeverTrue
## 35 NeverTrue NeverTrue NeverTrue
## 36 NeverTrue NeverTrue NeverTrue
## 37 NeverTrue NeverTrue NeverTrue
## 38 NeverTrue NeverTrue NeverTrue
## 39 NeverTrue NeverTrue NeverTrue
## 40 NeverTrue NeverTrue NeverTrue
## 41 SometimesTrue NeverTrue NeverTrue
## 42 NeverTrue NeverTrue NeverTrue
## 43 NeverTrue NeverTrue NeverTrue
## 44 NeverTrue NeverTrue NeverTrue
## 45 NeverTrue NeverTrue NeverTrue
## 46 NeverTrue NeverTrue NeverTrue
## 47 NeverTrue NeverTrue NeverTrue
## 48 NeverTrue NeverTrue NeverTrue
## 49 NeverTrue NeverTrue NeverTrue
## 50 NeverTrue NeverTrue NeverTrue
## 51 NeverTrue NeverTrue NeverTrue
## 52 NeverTrue OftenTrue NeverTrue
## 53 NeverTrue NeverTrue NeverTrue
## 54 NeverTrue NeverTrue NeverTrue
## 55 NeverTrue NeverTrue NeverTrue
## 56 NeverTrue NeverTrue NeverTrue
## 57 NeverTrue NeverTrue NeverTrue
## 58 NeverTrue NeverTrue NeverTrue
## 59 NeverTrue NeverTrue NeverTrue
## 60 NeverTrue NeverTrue NeverTrue
## 61 NeverTrue NeverTrue NeverTrue
## 62 NeverTrue NeverTrue NeverTrue
## 63 NeverTrue NeverTrue NeverTrue
## 64 NeverTrue NeverTrue NeverTrue
## 65 NeverTrue NeverTrue NeverTrue
## 66 NeverTrue NeverTrue NeverTrue
## 67 SometimesTrue NeverTrue NeverTrue
## 68 NeverTrue NeverTrue NeverTrue
par(mfrow=c(1,2))
## plot for k=1 (single) nearest neighbor
plot(x[train,],col=family$FSRUNOUT[train],cex=.8,main="1-nearest neighbor")
points(x[-train,],bg=nearest1,pch=21,col=grey(.9),cex=1.25)
## plot for k=5 nearest neighbors
plot(x[train,],col=family$FSRUNOUT[train],cex=.8,main="5-nearest neighbors")
points(x[-train,],bg=nearest5,pch=21,col=grey(.9),cex=1.25)
legend("topright",legend=levels(family$FSRUNOUT),fill=1:6,bty="n",cex=.75)

Press's Q is again evaluated against the critical chi-square value of 5.99. The Press's Q statistics for both the single nearest neighbor and 5 nearest neighbors are far greater than the critical value, suggesting both models classify much better than chance. Here, unlike with the house data, the 5-nearest-neighbor model classifies more of the test set correctly than the single nearest neighbor.
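As with the first data set, the k=1 and k=5 Press's Q computations are not shown; a sketch following the same formula (and the same full-sample-size convention) as the k=10 code below:
pcorr1=100*sum(family$FSRUNOUT[-train]==nearest1)/(n-nt) ## percent correct, k=1
pcorr5=100*sum(family$FSRUNOUT[-train]==nearest5)/(n-nt) ## percent correct, k=5
PressQ1=((n-((pcorr1/100)*n*3))^2)/(n*2) ## Press's Q, same convention as the k=10 code below
PressQ5=((n-((pcorr5/100)*n*3))^2)/(n*2)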
To see whether a different number of neighbors classifies more accurately, a leave-one-out cross-validation is run and the most accurate value of k is selected.
## cross-validation (leave one out)
pcorr=numeric(10) ## percent correctly classified for each value of k
neighbors=numeric(10) ## records how many neighbors were used
for (k in 1:10) {
pred=knn.cv(x,cl=family$FSRUNOUT,k)
pcorr[k]=100*sum(family$FSRUNOUT==pred)/n
neighbors[k]=k ## Populates that count variable with k each time through
}
pcorr
## [1] 84.07534 84.41781 88.09932 88.35616 89.04110 89.64041 89.72603
## [8] 89.81164 89.98288 90.06849
neighbors
## [1] 1 2 3 4 5 6 7 8 9 10
maxAcc<-max(pcorr) ## Maximum accuracy (percent correct)
accTable<-rbind(pcorr,neighbors)
accTable
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## pcorr 84.07534 84.41781 88.09932 88.35616 89.0411 89.64041 89.72603
## neighbors 1.00000 2.00000 3.00000 4.00000 5.0000 6.00000 7.00000
## [,8] [,9] [,10]
## pcorr 89.81164 89.98288 90.06849
## neighbors 8.00000 9.00000 10.00000
kBest=neighbors[pcorr==maxAcc]
kBest
## [1] 10
This results in k=10 being the optimal number of neighbors for classifying responses. To evaluate the 10-nearest-neighbor method, a Press's Q measure of classification accuracy is computed along with a confusion matrix.
nearest10 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=10)# 10 nearest neighbors
pcorrn10=100*sum(family$FSRUNOUT[-train]==nearest10)/(n-nt)
pcorrn10 #K=10
## [1] 92.64706
qchisq(.95,2)##critical value for chi-square with alpha=.05, k-1 d.f. where k=3 here
## [1] 5.991465
numCorrn10=(pcorrn10/100)*n ## number correct implied by the test accuracy; note that the full sample size n, not the test-set size n-nt, is plugged in here
PressQ10=((n-(numCorrn10*3))^2)/(n*2) ## Press's Q = [N-(nc*k)]^2/(N(k-1)) with N=n, nc=number correct, k=3 groups
PressQ10
## [1] 1849.123
Near10<-data.frame(truetype=family$FSRUNOUT[-train],predtype=nearest10)
confusionMatrix(data=nearest10, reference=family$FSRUNOUT[-train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction OftenTrue SometimesTrue NeverTrue
## OftenTrue 0 0 0
## SometimesTrue 0 0 0
## NeverTrue 1 4 63
##
## Overall Statistics
##
## Accuracy : 0.9265
## 95% CI : (0.8367, 0.9757)
## No Information Rate : 0.9265
## P-Value [Acc > NIR] : 0.616
##
## Kappa : 0
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: OftenTrue Class: SometimesTrue
## Sensitivity 0.00000 0.00000
## Specificity 1.00000 1.00000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.98529 0.94118
## Prevalence 0.01471 0.05882
## Detection Rate 0.00000 0.00000
## Detection Prevalence 0.00000 0.00000
## Balanced Accuracy 0.50000 0.50000
## Class: NeverTrue
## Sensitivity 1.0000
## Specificity 0.0000
## Pos Pred Value 0.9265
## Neg Pred Value NaN
## Prevalence 0.9265
## Detection Rate 0.9265
## Detection Prevalence 1.0000
## Balanced Accuracy 0.5000
The Press's Q is far greater than the critical chi-square value. However, the model is no better at classifying the test set than classifying with no information at all, and its kappa is zero: it predicts "NeverTrue" for every case. The results are no better than those of someone who notices that "Never True" is the most common reply and classifies every unknown response as "Never True".
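For comparison, the majority-class baseline the text describes can be computed directly; a sketch (not part of the original output):
## predict the majority class ("NeverTrue") for every test case
baseline=factor(rep("NeverTrue",n-nt), levels=levels(family$FSRUNOUT))
100*mean(family$FSRUNOUT[-train]==baseline) ## 92.65 percent correct, identical to the k=10 model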