Problem

The file BostonHousing.csv contains information on over 500 census tracts in Boston, where for each tract multiple variables are recorded. The last column (CAT.MEDV) was derived from MEDV, such that it obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the median value (MEDV) of a tract, given the information in the first 12 columns.

Data

Load the BostonHousing.csv and install/load any required packages. Partition the data into training (60%) and validation (40%) sets.

library(caret)  # preProcess() and RMSE()
library(class)  # knn()

housing.df <- read.csv("D:/MSBA/3-Winter 2020/560/data/BostonHousing.csv")

set.seed(123)
train.index <- sample(row.names(housing.df), 0.6*dim(housing.df)[1])  
valid.index <- setdiff(row.names(housing.df), train.index)  
train.df <- housing.df[train.index, -14]
valid.df <- housing.df[valid.index, -14]

Part A

Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV column), trying values of k from 1 to 5. Make sure to normalize the data, and choose function knn() from package class rather than package FNN. To make sure R is using the class package (when both packages are loaded), use class::knn(). What is the best k? What does it mean?

# initialize the normalized training, validation, and complete data frames as copies of the originals
train.norm.df <- train.df
valid.norm.df <- valid.df
housing.norm.df <- housing.df

# use preProcess() from the caret package to center and scale the columns,
# estimating the means and standard deviations from the training data only
# (MEDV is scaled as well, but the code below always reads MEDV from the unscaled data frames)
norm.values <- preProcess(train.df, method = c("center", "scale"))
train.norm.df <- as.data.frame(predict(norm.values, train.df))
valid.norm.df <- as.data.frame(predict(norm.values, valid.df))
housing.norm.df <- as.data.frame(predict(norm.values, housing.df))
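
As a rough illustration of what centering and scaling do, the sketch below standardizes a single toy predictor by hand. The vectors `train.x` and `valid.x` are hypothetical values, not taken from BostonHousing.csv; the point is only that the mean and standard deviation come from the training partition and are then reused for the validation partition.

```r
# Toy values standing in for one predictor (hypothetical, not from the data set)
train.x <- c(2, 4, 6, 8)
valid.x <- c(5, 10)

# Statistics are estimated on the training partition only
mu <- mean(train.x)
sigma <- sd(train.x)

# The same mu and sigma are applied to both partitions
train.z <- (train.x - mu) / sigma
valid.z <- (valid.x - mu) / sigma

print(train.z)
print(valid.z)
```

After this transformation the training values have mean 0 and standard deviation 1, while the validation values generally do not, which is expected.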

# initialize a data frame with two columns: k and RMSE
accuracy.df <- data.frame(k = seq(1, 5, 1), RMSE = rep(0, 5))

# run k-NN for each k and compute the RMSE on the validation set
for (i in 1:5) {
  knn.pred <- class::knn(train = train.norm.df[, -13],
                         test = valid.norm.df[, -13],
                         cl = train.df[, 13], k = i)
  # knn() returns a factor, so convert its labels back to numbers before scoring
  accuracy.df[i, 2] <- RMSE(as.numeric(as.character(knn.pred)), valid.df[, 13])
}

accuracy.df
##   k     RMSE
## 1 1 4.941440
## 2 2 5.143047
## 3 3 6.191194
## 4 4 6.772547
## 5 5 6.961959

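k = 1 has the lowest validation RMSE of the values tried, so it is nominally the best fit. However, k = 1 is prone to overfitting because each prediction rests on a single record, so we use the next-lowest RMSE, k = 2. This means that, for a given record, MEDV is predicted from the 2 closest training records, with proximity measured by the Euclidean distance between the vectors of normalized predictor values. Because class::knn() is a classifier, it treats each distinct MEDV value as a class label and returns the most common label among those neighbors (breaking ties at random) rather than their average; a regression version of k-NN would average the neighbors' MEDVs instead.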
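
One point worth illustrating is how the neighbors' MEDV values are combined. class::knn() is documented as a classifier that decides by majority vote with ties broken at random, whereas k-NN regression takes the mean. The sketch below contrasts the two rules on three hypothetical neighbor values (k = 3 is used here only to avoid a tie; the numbers are made up).

```r
# Hypothetical MEDV values of the 3 nearest neighbors of some record
neighbor.medv <- c(21.0, 21.0, 30.0)

# Classification-style prediction: most frequent label among the neighbors
votes <- table(neighbor.medv)
vote.pred <- as.numeric(names(votes)[which.max(votes)])

# Regression-style prediction: mean of the neighbors' values
mean.pred <- mean(neighbor.medv)

print(vote.pred)  # 21
print(mean.pred)  # 24
```

With k = 2 and two different neighbor values the vote is tied, so the classifier-style prediction is simply one of the two values picked at random.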

Part B

Predict the MEDV for a tract with the following information, using the best k:

     CRIM  ZN  INDUS  CHAS  NOX    RM  AGE  DIS  RAD  TAX  PTRATIO  LSTAT
A    0.2   0   7      0     0.538  6   62   4.7  4    307  21       10

# create a new data frame with the values from the table
new.df <- data.frame(0.2, 0, 7, 0, 0.538, 6, 62, 4.7, 4, 307, 21, 10)
names(new.df) <- names(train.norm.df)[-13]

# normalize the new record with the training means and standard deviations;
# centering a single record on itself would set every predictor to 0, and
# standard deviations cannot be computed from one row.
# norm.values also holds MEDV statistics, so refit on the 12 predictors only
pred.norm.values <- preProcess(train.df[, -13], method = c("center", "scale"))
new.norm.df <- predict(pred.norm.values, newdata = new.df)

#predict the MEDV
new.knn.pred <- class::knn(train = train.norm.df[,-13],
                       test = new.norm.df,
                       cl = train.df$MEDV, k = 2)
new.knn.pred
## [1] 21
## 180 Levels: 5 5.6 6.3 7 7.2 7.5 8.1 8.3 8.4 8.5 8.8 9.5 9.7 10.2 10.5 ... 50

Part C

If we used the above k-NN algorithm to score the training data, what would be the error of the training set?

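# score the training data itself with the same k = 2 model
train.knn.pred <- class::knn(train = train.norm.df[, -13], test = train.norm.df[, -13],
                             cl = train.df[, 13], k = 2)
train.rmse <- RMSE(as.numeric(as.character(train.knn.pred)), train.df[, 13])
train.rmse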
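
To see why training error is a poor guide for k-NN, consider the extreme case k = 1: when the training set is scored against itself, every record is its own nearest neighbor at distance 0, so the error is 0 as long as no two records share identical predictor values. The sketch below shows this on a tiny hypothetical data set (`toy.x` and `toy.y` are made-up values), using class::knn() the same way as above.

```r
# Tiny hypothetical training set: one predictor, five distinct records
toy.x <- matrix(c(1, 2, 3, 4, 5), ncol = 1)
toy.y <- c(10, 20, 30, 40, 50)

# Score the training set against itself with k = 1
toy.pred <- class::knn(train = toy.x, test = toy.x, cl = toy.y, k = 1)

# knn() returns a factor, so convert the labels back to numbers
toy.pred.num <- as.numeric(as.character(toy.pred))

# Root mean squared error on the training set
toy.rmse <- sqrt(mean((toy.pred.num - toy.y)^2))
print(toy.rmse)  # 0
```

For k = 2 the record itself is still one of the two neighbors, so the training error remains flattered by the record "voting" for its own value.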

Part D

Why is the validation data error overly optimistic compared to the error rate when applying this k-NN predictor to new data?

Our validation data came from the same original dataset as our training data, and we used it to compare the candidate values of k and select one. The chosen model is therefore tuned to this specific sample, so the validation error is overly optimistic compared with the error we should expect on genuinely new data.
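
One common way to get a less optimistic estimate is to hold out a third partition that plays no role in choosing k. The sketch below is only an illustration of such a three-way split on row indices; the record count `n` and the 50/30/20 proportions are assumptions for the example, not values used in this assignment.

```r
set.seed(1)

# Hypothetical number of records
n <- 100

# Random permutation of the row indices
idx <- sample(n)

# 50% training, 30% validation (used to choose k), 20% test (used once, at the end)
train.idx <- idx[1:50]
valid.idx <- idx[51:80]
test.idx  <- idx[81:100]

print(c(length(train.idx), length(valid.idx), length(test.idx)))
```

The error measured on `test.idx` would then be an estimate that was not influenced by the choice of k.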

Part E

If the purpose is to predict MEDV for several thousands of new tracts, what would be the disadvantage of using k-NN prediction? List the operations that the algorithm goes through in order to produce each prediction.

The major disadvantage is the time and number of comparisons required. k-NN has no training step that condenses the data into a model, so for every new tract it must compute the distance to ALL the records in the training data and then combine the MEDVs of the nearest neighbors. For several thousand new tracts, this work is repeated several thousand times. The algorithm goes through the following operations to produce each prediction:

  1. Normalize the new tract with the training means and standard deviations, then compute the Euclidean distance between it and every record in the training data.
  2. Order the training records by increasing distance.
  3. Select the k nearest neighbors (k is chosen once, beforehand, by comparing validation RMSE; it is not re-estimated for each prediction).
  4. Combine the MEDVs of the k nearest neighbors into a prediction: an average for k-NN regression, or a majority vote with class::knn().
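
The per-prediction work can be sketched in a few lines of base R. The function `knn.predict.one()` below is a hypothetical helper written for illustration; it assumes the predictors are already normalized, uses Euclidean distance, and takes an unweighted average of the neighbors, which are simplifying assumptions of this sketch rather than a description of how class::knn() is implemented.

```r
# Hypothetical helper: predict the response for one new record with k-NN
knn.predict.one <- function(train.x, train.y, new.x, k) {
  # Euclidean distance from the new record to every training record
  diffs <- sweep(train.x, 2, new.x, "-")
  d <- sqrt(rowSums(diffs^2))
  # Order the training records by increasing distance
  ord <- order(d)
  # Keep the k nearest records
  nearest <- ord[seq_len(k)]
  # Unweighted average of their responses
  mean(train.y[nearest])
}

# Made-up training data: four records, two (already normalized) predictors
toy.train.x <- matrix(c(0, 0,
                        1, 0,
                        0, 1,
                        5, 5), ncol = 2, byrow = TRUE)
toy.train.y <- c(10, 20, 30, 100)

toy.result <- knn.predict.one(toy.train.x, toy.train.y, new.x = c(0.1, 0.1), k = 3)
print(toy.result)  # 20
```

Every call touches all rows of the training matrix, so scoring several thousand tracts repeats the distance computation and the sort several thousand times.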