Data Set 1: HousePrices.csv
This data set includes prices and characteristics of n=128 houses. The following analysis attempts to predict which neighborhood a house is in based on its characteristics, using a K nearest neighbor approach.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(textir)
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
library(class)
library(car)
hpr<- read.csv("C:/DataMining/Data/HousePrices.csv")
head(hpr)
## HomeID Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 1 114300 1790 2 2 2 No East
## 2 2 114200 2030 4 2 3 No East
## 3 3 114800 1740 3 2 1 No East
## 4 4 94700 1980 3 2 3 No East
## 5 5 119800 2130 3 3 3 No East
## 6 6 114600 1780 3 2 2 No North
table(hpr$Neighborhood)
##
## East North West
## 45 44 39
dim(hpr)
## [1] 128 8
hpr$BrickTrue = ifelse(hpr$Brick == "Yes", 1, 0) ## dummy variable: 1 if the house is brick, 0 otherwise
house <- hpr[,c("Neighborhood","Price","SqFt","Bedrooms","Bathrooms","Offers","BrickTrue")]
Here we can see the six housing characteristic variables separated by neighborhood. The plots suggest that a house can plausibly be classified into a neighborhood based on its characteristics.
par(mfrow=c(3,3), mai=c(.3,.6,.1,.1))
plot(Price ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(SqFt ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Bedrooms ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Bathrooms ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(Offers ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))
plot(BrickTrue ~ Neighborhood, data=hpr, col=c(grey(.2),2:6))

The data set is split into a training set and a test set, and the predictor variables are rescaled (standardized) so that they contribute equally to the distance calculations.
n=length(house$Neighborhood)
n
## [1] 128
nt=100
set.seed(1) ### To make the calculations reproducible in repeated runs
train <- sample(1:n,nt)
x<-scale(house[,c(2:7)])
x[1:3,]
## Price SqFt Bedrooms Bathrooms Offers BrickTrue
## [1,] -0.6002263 -0.9969990 -1.40978793 -0.8655378 -0.5406451 -0.6961011
## [2,] -0.6039481 0.1373643 1.34521749 -0.8655378 0.3945248 -0.6961011
## [3,] -0.5816174 -1.2333247 -0.03228522 -0.8655378 -1.4758150 -0.6961011
mean(x)
## [1] 6.38084e-18
sd(x)
## [1] 0.9967352
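Each column individually has mean 0 and standard deviation exactly 1; the pooled standard deviation above is slightly below 1 only because it is computed over all six columns at once. A quick per-column check (a sketch, not part of the original output):
round(colMeans(x),10) ## column means, all effectively zero
apply(x,2,sd) ## column standard deviations, all exactly 1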
The nearest neighbor method is used to classify the 28 houses in our test set with k=1 and k=5. The results are then plotted (the test data are the solid points and the training data are the open points).
nearest1 <- knn(train=x[train,],test=x[-train,],cl=house$Neighborhood[train],k=1)#nearest neighbor
nearest5 <- knn(train=x[train,],test=x[-train,],cl=house$Neighborhood[train],k=5)#5 nearest neighbors
data.frame(house$Neighborhood[-train],nearest1,nearest5)
## house.Neighborhood..train. nearest1 nearest5
## 1 East North North
## 2 East East North
## 3 West West West
## 4 East North North
## 5 East East East
## 6 West West West
## 7 North East East
## 8 East East East
## 9 North North North
## 10 North East East
## 11 East North North
## 12 East North North
## 13 East East North
## 14 East North North
## 15 North North North
## 16 East East East
## 17 North North East
## 18 East North North
## 19 West West North
## 20 North North North
## 21 West West West
## 22 West West West
## 23 West West West
## 24 West West West
## 25 East West West
## 26 West West West
## 27 East North North
## 28 East West West
par(mfrow=c(1,2))
## plot for k=1 (single) nearest neighbor
## (plot() on a matrix uses its first two columns, here scaled Price and SqFt)
plot(x[train,],col=house$Neighborhood[train],cex=.8,main="1-nearest neighbor")
points(x[-train,],bg=nearest1,pch=21,col=grey(.9),cex=1.25)
## plot for k=5 nearest neighbors
plot(x[train,],col=house$Neighborhood[train],cex=.8,main="5-nearest neighbors")
points(x[-train,],bg=nearest5,pch=21,col=grey(.9),cex=1.25)
legend("topright",legend=levels(house$Neighborhood),fill=1:6,bty="n",cex=.75)

Press's Q is evaluated against the critical chi-square value, which is 5.99 (alpha=.05, 2 degrees of freedom). The Press's Q statistics for both the single nearest neighbor (43.2) and 5 nearest neighbors (9.9) exceed this critical value, suggesting that both models classify better than chance. The single nearest neighbor performs especially well by this measure.
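The code behind these Press's Q values is not shown above; a minimal sketch, following the same formula as the data set 2 computation near the end of this report (which, note, plugs the full sample size n into the formula rather than the 28-house test-set size), reproduces the quoted 43.2 and 9.9:
pcorr1=100*sum(house$Neighborhood[-train]==nearest1)/(n-nt) ## percent correct, k=1
pcorr5=100*sum(house$Neighborhood[-train]==nearest5)/(n-nt) ## percent correct, k=5
qchisq(.95,2) ## critical chi-square value, alpha=.05, 2 d.f.
PressQ1=((n-((pcorr1/100)*n*3))^2)/(n*2) ## Press's Q = [N-(nc*k)]^2/(N(k-1)), k=3 groups
PressQ5=((n-((pcorr5/100)*n*3))^2)/(n*2)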
To see whether a different number of neighbors classifies more accurately, a leave-one-out cross-validation is run and the most accurate value of k is selected.
## cross-validation (leave one out)
pcorr=numeric(10) ## percent correctly classified for each value of k
neighbors=numeric(10) ## records how many neighbors were used
for (k in 1:10) {
pred=knn.cv(x,cl=house$Neighborhood,k)
pcorr[k]=100*sum(house$Neighborhood==pred)/n
neighbors[k]=k ## Populates that count variable with k each time through
}
pcorr
## [1] 60.93750 57.81250 52.34375 54.68750 57.03125 57.81250 60.15625
## [8] 58.59375 58.59375 60.93750
neighbors
## [1] 1 2 3 4 5 6 7 8 9 10
maxAcc<-max(pcorr) ## Maximum accuracy (percent correct)
accTable<-rbind(pcorr,neighbors)
accTable
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## pcorr 60.9375 57.8125 52.34375 54.6875 57.03125 57.8125 60.15625
## neighbors 1.0000 2.0000 3.00000 4.0000 5.00000 6.0000 7.00000
## [,8] [,9] [,10]
## pcorr 58.59375 58.59375 60.9375
## neighbors 8.00000 9.00000 10.0000
kBest=neighbors[pcorr==maxAcc]
kBest
## [1] 1 10
Here k=1 ties with k=10 for the highest leave-one-out accuracy (60.9%); because knn.cv breaks distance ties at random, the selected value can vary across repeated runs, though k=1 is typically among the best.
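One way to stabilize the selection is to repeat the cross-validation and average the accuracies before choosing k; a minimal sketch (not part of the original run):
## repeat the leave-one-out CV several times and average the accuracy for each k,
## since knn.cv breaks distance ties at random
set.seed(1)
pcorrAvg=rowMeans(sapply(1:25, function(r)
sapply(1:10, function(k) 100*mean(house$Neighborhood==knn.cv(x,cl=house$Neighborhood,k)))))
which.max(pcorrAvg) ## k with the highest average leave-one-out accuracy
To do a final evaluation of the single-nearest-neighbor method, a confusion matrix is run.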
Near1<-data.frame(truetype=house$Neighborhood[-train],predtype=nearest1)
confusionMatrix(data=nearest1, reference=house$Neighborhood[-train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction East North West
## East 5 2 0
## North 7 4 0
## West 2 0 8
##
## Overall Statistics
##
## Accuracy : 0.6071
## 95% CI : (0.4058, 0.785)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1725
##
## Kappa : 0.4296
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: East Class: North Class: West
## Sensitivity 0.3571 0.6667 1.0000
## Specificity 0.8571 0.6818 0.9000
## Pos Pred Value 0.7143 0.3636 0.8000
## Neg Pred Value 0.5714 0.8824 1.0000
## Prevalence 0.5000 0.2143 0.2857
## Detection Rate 0.1786 0.1429 0.2857
## Detection Prevalence 0.2500 0.3929 0.3571
## Balanced Accuracy 0.6071 0.6742 0.9500
The confusion matrix output shows that the model does best on West houses (balanced accuracy 0.95), and that its East and West predictions are more trustworthy than its North predictions (positive predictive values of 0.71 and 0.80 versus 0.36). The overall accuracy of 61% exceeds the no-information rate of 50%, although not significantly so (p = 0.17), and the kappa of 0.43 indicates only moderate agreement, so other methods should be investigated to produce a more accurate predictive classification.
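The kappa reported above can be recomputed by hand from the confusion matrix; a sketch (not part of the original output):
## Cohen's kappa from the k=1 confusion matrix: (po - pe)/(1 - pe)
cm=table(Prediction=nearest1, Reference=house$Neighborhood[-train])
po=sum(diag(cm))/sum(cm) ## observed agreement (the accuracy)
pe=sum(rowSums(cm)*colSums(cm))/sum(cm)^2 ## agreement expected by chance
(po-pe)/(1-pe) ## about 0.43, matching the output above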
Data Set 2: ReducedFoodInsec.csv
This data set includes answers from an NHIS survey of 1168 individuals or families. The following analysis attempts to predict whether a respondent answered Often True, Sometimes True, or Never True when asked whether they worried that their food would run out before they got money to buy more. Classification is based on their answers to several other survey questions, using a K nearest neighbor approach.
famx<- read.csv("C:/DataMining/Data/ReducedFoodInsec.csv")
## keep the response (FSRUNOUT) and the eight predictor questions
fam <- famx[,c("FSRUNOUT","FM_EDUC1","INCGRP5","FSNAP","FHICOST","FM_TYPE","FM_SIZE","FDGLWCT1","FWICYN")]
## give the coded survey variables descriptive names
fam$WICBenefits<-fam$FWICYN
fam$FamilySize<-fam$FM_SIZE
fam$NumMembersWorking<-fam$FDGLWCT1
fam$EducationLevel<-fam$FM_EDUC1
fam$MedicalDentalCost<-fam$FHICOST
fam$SNAPBenefits<-fam$FSNAP
fam$FamilyType<-fam$FM_TYPE
fam$FamilyIncome<-fam$INCGRP5
## convert the 1/2/3 response codes to labeled factor levels
fam$FSRUNOUT = factor(fam$FSRUNOUT, levels=c("1","2","3"))
levels(fam$FSRUNOUT) = c("OftenTrue","SometimesTrue","NeverTrue")
fam$FSRUNOUT <- factor(fam$FSRUNOUT) ## re-factor (drops any unused levels)
## drop the original coded columns, keeping only the renamed ones
family=fam[,c(-2:-9)]
head(family)
## FSRUNOUT WICBenefits FamilySize NumMembersWorking EducationLevel
## 1 NeverTrue 2 3 1 8
## 2 SometimesTrue 1 12 2 5
## 3 SometimesTrue 2 3 1 6
## 4 SometimesTrue 1 6 3 8
## 5 NeverTrue 1 4 2 8
## 6 NeverTrue 2 4 2 6
## MedicalDentalCost SNAPBenefits FamilyType FamilyIncome
## 1 2 2 4 2
## 2 3 1 4 2
## 3 4 2 4 3
## 4 2 1 4 2
## 5 1 1 4 4
## 6 1 2 4 4
summary(family)
## FSRUNOUT WICBenefits FamilySize NumMembersWorking
## OftenTrue : 30 Min. :1.000 Min. : 2.000 Min. :0.000
## SometimesTrue: 87 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:1.000
## NeverTrue :1051 Median :2.000 Median : 4.000 Median :2.000
## Mean :1.942 Mean : 4.537 Mean :1.783
## 3rd Qu.:2.000 3rd Qu.: 5.000 3rd Qu.:2.000
## Max. :9.000 Max. :12.000 Max. :6.000
## EducationLevel MedicalDentalCost SNAPBenefits FamilyType
## Min. : 1.000 Min. :0.000 Min. :1.000 Min. :3.000
## 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:4.000
## Median : 8.000 Median :2.000 Median :2.000 Median :4.000
## Mean : 7.256 Mean :2.241 Mean :1.988 Mean :3.967
## 3rd Qu.: 9.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :99.000 Max. :9.000 Max. :9.000 Max. :4.000
## FamilyIncome
## Min. : 1.00
## 1st Qu.: 2.00
## Median : 4.00
## Mean :13.81
## 3rd Qu.: 4.00
## Max. :99.00
Here we can see the eight survey question variables separated by FSRUNOUT response. The plots suggest that an individual or family could plausibly be classified into a response group based on their answers to the other questions.
par(mfrow=c(3,3), mai=c(.3,.6,.1,.1))
plot(FamilySize ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(FamilyType ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(FamilyIncome ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(NumMembersWorking ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(MedicalDentalCost ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(EducationLevel ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(WICBenefits ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))
plot(SNAPBenefits ~ FSRUNOUT, data=family, col=c(grey(.2),2:6))

The data set is split into a training set and a test set, and the predictor variables are rescaled (standardized) so that they contribute equally to the distance calculations.
n=length(family$FSRUNOUT)
n
## [1] 1168
nt=1100
set.seed(1) ### To make the calculations reproducible in repeated runs
train <- sample(1:n,nt)
x<- scale(family[,c(2:9)])
x[1:3,]
## WICBenefits FamilySize NumMembersWorking EducationLevel
## [1,] 0.06215024 -1.08243 -0.9861999 0.1752954
## [2,] -1.00537152 5.25657 0.2740643 -0.5315345
## [3,] 0.06215024 -1.08243 -0.9861999 -0.2959245
## MedicalDentalCost SNAPBenefits FamilyType FamilyIncome
## [1,] -0.1467714 0.0139804 0.1833017 -0.3922275
## [2,] 0.4611329 -1.1523840 0.1833017 -0.3922275
## [3,] 1.0690373 0.0139804 0.1833017 -0.3590254
The nearest neighbor method is used to classify the 68 individuals/families in our test set with k=1 and k=5. The results are then plotted.
nearest1 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=1)#nearest neighbor
nearest5 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=5)#5 nearest neighbors
data.frame(family$FSRUNOUT[-train],nearest1,nearest5)
## family.FSRUNOUT..train. nearest1 nearest5
## 1 NeverTrue NeverTrue NeverTrue
## 2 NeverTrue NeverTrue NeverTrue
## 3 NeverTrue NeverTrue NeverTrue
## 4 NeverTrue NeverTrue NeverTrue
## 5 NeverTrue NeverTrue NeverTrue
## 6 NeverTrue NeverTrue NeverTrue
## 7 NeverTrue NeverTrue NeverTrue
## 8 SometimesTrue NeverTrue NeverTrue
## 9 NeverTrue NeverTrue NeverTrue
## 10 NeverTrue SometimesTrue NeverTrue
## 11 NeverTrue NeverTrue NeverTrue
## 12 NeverTrue NeverTrue NeverTrue
## 13 NeverTrue NeverTrue NeverTrue
## 14 SometimesTrue OftenTrue OftenTrue
## 15 NeverTrue NeverTrue NeverTrue
## 16 NeverTrue NeverTrue NeverTrue
## 17 NeverTrue NeverTrue NeverTrue
## 18 NeverTrue NeverTrue NeverTrue
## 19 NeverTrue NeverTrue NeverTrue
## 20 NeverTrue NeverTrue NeverTrue
## 21 NeverTrue NeverTrue NeverTrue
## 22 NeverTrue NeverTrue NeverTrue
## 23 NeverTrue NeverTrue NeverTrue
## 24 NeverTrue NeverTrue NeverTrue
## 25 NeverTrue NeverTrue NeverTrue
## 26 NeverTrue NeverTrue NeverTrue
## 27 OftenTrue SometimesTrue NeverTrue
## 28 NeverTrue OftenTrue NeverTrue
## 29 NeverTrue NeverTrue NeverTrue
## 30 NeverTrue NeverTrue NeverTrue
## 31 NeverTrue NeverTrue NeverTrue
## 32 NeverTrue NeverTrue NeverTrue
## 33 NeverTrue NeverTrue NeverTrue
## 34 NeverTrue NeverTrue NeverTrue
## 35 NeverTrue NeverTrue NeverTrue
## 36 NeverTrue NeverTrue NeverTrue
## 37 NeverTrue NeverTrue NeverTrue
## 38 NeverTrue NeverTrue NeverTrue
## 39 NeverTrue NeverTrue NeverTrue
## 40 NeverTrue NeverTrue NeverTrue
## 41 SometimesTrue NeverTrue NeverTrue
## 42 NeverTrue NeverTrue NeverTrue
## 43 NeverTrue NeverTrue NeverTrue
## 44 NeverTrue NeverTrue NeverTrue
## 45 NeverTrue NeverTrue NeverTrue
## 46 NeverTrue NeverTrue NeverTrue
## 47 NeverTrue NeverTrue NeverTrue
## 48 NeverTrue NeverTrue NeverTrue
## 49 NeverTrue NeverTrue NeverTrue
## 50 NeverTrue NeverTrue NeverTrue
## 51 NeverTrue NeverTrue NeverTrue
## 52 NeverTrue OftenTrue NeverTrue
## 53 NeverTrue NeverTrue NeverTrue
## 54 NeverTrue NeverTrue NeverTrue
## 55 NeverTrue NeverTrue NeverTrue
## 56 NeverTrue NeverTrue NeverTrue
## 57 NeverTrue NeverTrue NeverTrue
## 58 NeverTrue NeverTrue NeverTrue
## 59 NeverTrue NeverTrue NeverTrue
## 60 NeverTrue NeverTrue NeverTrue
## 61 NeverTrue NeverTrue NeverTrue
## 62 NeverTrue NeverTrue NeverTrue
## 63 NeverTrue NeverTrue NeverTrue
## 64 NeverTrue NeverTrue NeverTrue
## 65 NeverTrue NeverTrue NeverTrue
## 66 NeverTrue NeverTrue NeverTrue
## 67 SometimesTrue NeverTrue NeverTrue
## 68 NeverTrue NeverTrue NeverTrue
par(mfrow=c(1,2))
## plot for k=1 (single) nearest neighbor
plot(x[train,],col=family$FSRUNOUT[train],cex=.8,main="1-nearest neighbor")
points(x[-train,],bg=nearest1,pch=21,col=grey(.9),cex=1.25)
## plot for k=5 nearest neighbors
plot(x[train,],col=family$FSRUNOUT[train],cex=.8,main="5-nearest neighbors")
points(x[-train,],bg=nearest5,pch=21,col=grey(.9),cex=1.25)
legend("topright",legend=levels(family$FSRUNOUT),fill=1:6,bty="n",cex=.75)

Press's Q is again evaluated against the critical chi-square value of 5.99. The Press's Q statistics for both the single nearest neighbor and 5 nearest neighbors are far greater than the critical value, suggesting both models classify much better than chance. Here, unlike with the house data, the 5-nearest-neighbor model classifies more of the test set correctly than the single nearest neighbor.
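As with the first data set, the k=1 and k=5 Press's Q computations are not shown; a sketch following the same formula (and the same full-sample-size convention) as the k=10 code below:
pcorr1=100*sum(family$FSRUNOUT[-train]==nearest1)/(n-nt) ## percent correct, k=1
pcorr5=100*sum(family$FSRUNOUT[-train]==nearest5)/(n-nt) ## percent correct, k=5
PressQ1=((n-((pcorr1/100)*n*3))^2)/(n*2) ## Press's Q, same convention as the k=10 code below
PressQ5=((n-((pcorr5/100)*n*3))^2)/(n*2)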
To see whether a different number of neighbors classifies more accurately, a leave-one-out cross-validation is run and the most accurate value of k is selected.
## cross-validation (leave one out)
pcorr=numeric(10) ## percent correctly classified for each value of k
neighbors=numeric(10) ## records how many neighbors were used
for (k in 1:10) {
pred=knn.cv(x,cl=family$FSRUNOUT,k)
pcorr[k]=100*sum(family$FSRUNOUT==pred)/n
neighbors[k]=k ## Populates that count variable with k each time through
}
pcorr
## [1] 84.07534 84.41781 88.09932 88.35616 89.04110 89.64041 89.72603
## [8] 89.81164 89.98288 90.06849
neighbors
## [1] 1 2 3 4 5 6 7 8 9 10
maxAcc<-max(pcorr) ## Maximum accuracy (percent correct)
accTable<-rbind(pcorr,neighbors)
accTable
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## pcorr 84.07534 84.41781 88.09932 88.35616 89.0411 89.64041 89.72603
## neighbors 1.00000 2.00000 3.00000 4.00000 5.0000 6.00000 7.00000
## [,8] [,9] [,10]
## pcorr 89.81164 89.98288 90.06849
## neighbors 8.00000 9.00000 10.00000
kBest=neighbors[pcorr==maxAcc]
kBest
## [1] 10
This results in k=10 being the optimal number of neighbors for classifying responses. To evaluate the 10-nearest-neighbor method, a Press's Q measure of classification accuracy is computed along with a confusion matrix.
nearest10 <- knn(train=x[train,],test=x[-train,],cl=family$FSRUNOUT[train],k=10)# 10 nearest neighbors
pcorrn10=100*sum(family$FSRUNOUT[-train]==nearest10)/(n-nt)
pcorrn10 #K=10
## [1] 92.64706
qchisq(.95,2)##critical value for chi-square with alpha=.05, k-1 d.f. where k=3 here
## [1] 5.991465
numCorrn10=(pcorrn10/100)*n ## number correct implied by the test accuracy; note that the full sample size n, not the test-set size n-nt, is plugged in here
PressQ10=((n-(numCorrn10*3))^2)/(n*2) ## Press's Q = [N-(nc*k)]^2/(N(k-1)) with N=n, nc=number correct, k=3 groups
PressQ10
## [1] 1849.123
Near10<-data.frame(truetype=family$FSRUNOUT[-train],predtype=nearest10)
confusionMatrix(data=nearest10, reference=family$FSRUNOUT[-train])
## Confusion Matrix and Statistics
##
## Reference
## Prediction OftenTrue SometimesTrue NeverTrue
## OftenTrue 0 0 0
## SometimesTrue 0 0 0
## NeverTrue 1 4 63
##
## Overall Statistics
##
## Accuracy : 0.9265
## 95% CI : (0.8367, 0.9757)
## No Information Rate : 0.9265
## P-Value [Acc > NIR] : 0.616
##
## Kappa : 0
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: OftenTrue Class: SometimesTrue
## Sensitivity 0.00000 0.00000
## Specificity 1.00000 1.00000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.98529 0.94118
## Prevalence 0.01471 0.05882
## Detection Rate 0.00000 0.00000
## Detection Prevalence 0.00000 0.00000
## Balanced Accuracy 0.50000 0.50000
## Class: NeverTrue
## Sensitivity 1.0000
## Specificity 0.0000
## Pos Pred Value 0.9265
## Neg Pred Value NaN
## Prevalence 0.9265
## Detection Rate 0.9265
## Detection Prevalence 1.0000
## Balanced Accuracy 0.5000
The Press's Q is far greater than the critical chi-square value. However, the model is no better at classifying the test set than classifying with no information at all, and its kappa is zero: it predicts "NeverTrue" for every case. The results are no better than those of someone who notices that "Never True" is the most common reply and classifies every unknown response as "Never True".
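For comparison, the majority-class baseline the text describes can be computed directly; a sketch (not part of the original output):
## predict the majority class ("NeverTrue") for every test case
baseline=factor(rep("NeverTrue",n-nt), levels=levels(family$FSRUNOUT))
100*mean(family$FSRUNOUT[-train]==baseline) ## 92.65 percent correct, identical to the k=10 model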