This is my first article on understanding Machine Learning. The methods and flow have been taken from gigadom.wordpress.com. I have replicated all the code for my own understanding.
The code below implements KNN regression in R for different numbers of neighbors, computing the prediction error on a held-out test set in each case. The same procedure is then repeated after feature scaling. It can be seen that the model fit improves after feature scaling.
library(dplyr) #For Data manipulation
library(class) #For knn(), the nearest-neighbour routine used below
library(ggplot2) #For making plots
df <- read.csv("auto_mpg.csv",stringsAsFactors = FALSE)
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>%
select(cylinders,displacement, horsepower,weight, acceleration, model.year,mpg)
df3 <- df2[complete.cases(df2),]
set.seed(1)
# Split train and test
train_idx <- sample(1:nrow(df3), floor(nrow(df3)*0.8))
train <- df3[train_idx, ]
test <- df3[-train_idx, ]
# Select the feature variables
train.X <- train[,1:6]
# Set the target for training
train.Y <- train[,7]
# Do the same for the test set
test.X <- test[,1:6]
test.Y <- test[,7]
error <- c()
set.seed(1)
# Create a list of neighbors
neighbors <-c(1,2,4,5,8,10,15,20)
for(i in seq_along(neighbors))
{
# Perform a KNN fit (knn() votes over neighbour mpg values and returns a factor)
knn_res <- knn(train.X, test.X, train.Y, k = neighbors[i])
# Convert the factor back to numeric mpg and compute the error
# (square root of the residual sum of squares on the test set)
error[i] <- sqrt(sum((test.Y - as.numeric(as.character(knn_res)))^2))
}
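If an R squared value is wanted instead of the raw error, it can be computed from the same predictions. A minimal sketch; the rsq helper below is my own addition, not part of the original post:
# R squared from test-set predictions (sketch)
rsq <- function(pred, actual) {
  rss <- sum((actual - pred)^2)          # residual sum of squares
  tss <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - rss/tss
}
# Example usage: rsq(as.numeric(as.character(knn_res)), test.Y)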
# Make a dataframe for plotting
dfx <- data.frame(neighbors,error = error)
# Plot the number of neighbors vs the error
ggplot(dfx,aes(x = neighbors,y = error)) +
geom_point() +
geom_line(color="blue") +
xlab("Number of neighbors") +
ylab("Error") +
ggtitle("KNN regression - error vs Number of Neighors (Unnormalized)")
# Scale the feature variables (center and scale computed on the training set)
train.X.scaled <- scale(train[,1:6])
# Set the target for training
train.Y <- train[,7]
# Scale the test features with the training set's center and scale,
# so that both sets are measured on the same footing
test.X.scaled <- scale(test[,1:6],
                       center = attr(train.X.scaled, "scaled:center"),
                       scale = attr(train.X.scaled, "scaled:scale"))
test.Y <- test[,7]
error_normal <- c()
set.seed(1)
# Create a list of neighbors
neighbors <-c(1,2,4,5,8,10,15,20)
for(i in seq_along(neighbors))
{
# Perform a KNN fit on the scaled features
knn_res <- knn(train.X.scaled, test.X.scaled, train.Y, k = neighbors[i])
# Convert the factor back to numeric mpg and compute the error
error_normal[i] <- sqrt(sum((test.Y - as.numeric(as.character(knn_res)))^2))
}
# Make a dataframe for plotting
dfx <- data.frame(neighbors,error = error_normal)
# Plot the number of neighbors vs the error
ggplot(dfx,aes(x = neighbors,y = error)) +
geom_point() +
geom_line(color="blue") +
xlab("Number of neighbors") +
ylab("Error") +
ggtitle("KNN regression - error vs Number of Neighors (Normalized)")
We see that there is a clear difference in the error between the normalized and the unnormalized models. Since scaling removes the dependence on each variable's units and range, so that features with large numeric values (such as weight) do not dominate the distance calculation, it is a good idea to scale the features before running K Nearest Neighbors regression.
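To see the comparison directly, the two error curves can be overlaid in a single plot. A short sketch using the error and error_normal vectors computed above:
# Combine both error vectors into one data frame for plotting
dfc <- data.frame(neighbors = rep(neighbors, 2),
                  error = c(error, error_normal),
                  scaled = rep(c("Unnormalized", "Normalized"), each = length(neighbors)))
ggplot(dfc, aes(x = neighbors, y = error, color = scaled)) +
  geom_point() +
  geom_line() +
  xlab("Number of neighbors") +
  ylab("Error") +
  ggtitle("KNN regression - error vs Number of Neighbors")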