This is an introduction to KNN (k-nearest neighbours) regression.
We will be using the mtcars dataset for this example.
We pick 12 observations from the mtcars data set and call this subset “known_data”, as we already know their responses. The attributes are disp, hp, drat, wt and qsec, and the response variable (mpg) is continuous.
We will then pick one observation from the original mtcars data set (observation number 15), call it “unknown_data”, treat its response as unknown, and predict its mpg.
In other words, we will predict the mpg of the Cadillac Fleetwood using KNN.
# known data points: first 12 rows of mtcars, dropping cyl, vs, am, gear and carb
known_data <- mtcars[1:12, -c(2, 8, 9, 10, 11)]
head(known_data)
## mpg disp hp drat wt qsec
## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22
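The same subset can also be built by naming the columns to keep rather than dropping positions, which some readers may find easier to follow (an equivalent sketch, not part of the original walkthrough):
# equivalent selection using column names instead of positional indices
known_data <- mtcars[1:12, c("mpg", "disp", "hp", "drat", "wt", "qsec")]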
Before proceeding further, let's normalize the attributes so that they are on a common scale and independent of their units.
# Function to min-max normalize an attribute to the [0, 1] range
normalize <- function(x) {
  norm <- (x - min(x)) / (max(x) - min(x))
  return(norm)
}
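As a quick illustration of what the function does (a toy example, not part of the original walkthrough), normalizing a small vector rescales it to the 0 to 1 range:
# toy example: normalize(c(2, 4, 6)) should return 0.0, 0.5 and 1.0
normalize(c(2, 4, 6))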
# Apply the function to each attribute column (the response mpg is excluded)
known_normalized_data <- apply(known_data[, -1], 2, normalize)
head(known_normalized_data)
## disp hp drat wt qsec
## Mazda RX4 0.2063492 0.2622951 0.9827586 0.1714286 0.0878187
## Mazda RX4 Wag 0.2063492 0.2622951 0.9827586 0.3171429 0.1671388
## Datsun 710 0.0000000 0.1693989 0.9396552 0.0000000 0.3923513
## Hornet 4 Drive 0.5952381 0.2622951 0.2758621 0.5114286 0.5099150
## Hornet Sportabout 1.0000000 0.6174863 0.3362069 0.6400000 0.1671388
## Valiant 0.4642857 0.2349727 0.0000000 0.6514286 0.6203966
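As a sanity check (an extra step, not in the original walkthrough), every normalized column should now run from exactly 0 to 1:
# range of each normalized attribute should be 0 to 1
apply(known_normalized_data, 2, range)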
# unknown data point: row 15 (Cadillac Fleetwood), keeping only the five attributes
unknown_data <- mtcars[15, -c(1, 2, 8, 9, 10, 11)]
unknown_data
## disp hp drat wt qsec
## Cadillac Fleetwood 472 205 2.93 5.25 17.98
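The attribute columns of the unknown observation line up, name for name, with the columns of the normalized known data, which is what the distance calculation below relies on (a quick check, not part of the original walkthrough):
# both data sets should carry the same attribute columns in the same order
identical(colnames(known_normalized_data), colnames(unknown_data))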
Let's build the KNN regression step by step. First, we need the Euclidean distance between the unknown observation and each known observation:
\[Euclidean\ distance = \sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2 + \dots}\]
# function to calculate the Euclidean distance between the unknown
# observation and every row of the known data
euclidean_dist <- function(k, unk) {
  distance <- rep(0, nrow(k))
  for (i in 1:nrow(k)) {
    distance[i] <- sqrt((k[, 1][i] - unk[, 1])^2 + (k[, 2][i] - unk[, 2])^2 +
                        (k[, 3][i] - unk[, 3])^2 + (k[, 4][i] - unk[, 4])^2 +
                        (k[, 5][i] - unk[, 5])^2)
  }
  return(distance)
}
# Euclidean distance from the unknown observation to each known observation,
# paired with the known mpg values
edist <- data.frame(dist = euclidean_dist(known_normalized_data, unknown_data),
                    mpg = mtcars[1:12, 1])
print(edist)
## dist mpg
## 1 514.6421 21.0
## 2 514.6379 21.0
## 3 514.8595 22.8
## 4 514.2679 21.4
## 5 513.7660 18.7
## 6 514.3938 18.1
## 7 513.6192 14.3
## 8 514.7746 24.4
## 9 514.7108 22.8
## 10 514.5728 19.2
## 11 514.5699 17.8
## 12 514.0566 16.4
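For reference, the same distances can be computed without an explicit loop by subtracting the unknown observation from every row of the known data and summing the squared differences row-wise (a vectorized sketch that should reproduce the values returned by euclidean_dist above):
# vectorized equivalent: subtract, square, sum across columns, take the square root
diffs <- sweep(known_normalized_data, 2, unlist(unknown_data), FUN = "-")
edist_vec <- sqrt(rowSums(diffs^2))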
A common rule of thumb is to take k as the square root of the number of known observations, rounded down:
\[k = \lfloor \sqrt{n} \rfloor\]
# calculate k, the number of nearest neighbours to use
k <- floor(sqrt(nrow(known_normalized_data)))
print(k)
## [1] 3
# rank the known observations by their distance to the unknown observation
rank_edist <- edist[order(edist$dist), ]
# print the k = 3 nearest neighbours (smallest distances)
rank_edist[1:3, ]
## dist mpg
## 7 513.6192 14.3
## 5 513.7660 18.7
## 12 514.0566 16.4
# predicted mpg: the average response of the k nearest neighbours
yhat <- mean(rank_edist[1:3, 2])
print(yhat)
## [1] 16.46667
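If a packaged implementation is preferred, knn.reg() from the FNN package can reproduce this prediction (a sketch, assuming the FNN package is installed; given the same inputs it should pick the same three neighbours):
library(FNN)
# KNN regression with k = 3 on the same training data and test point
fit <- knn.reg(train = known_normalized_data, test = unknown_data,
               y = mtcars[1:12, 1], k = 3)
fit$pred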
The predicted mpg for the Cadillac Fleetwood is the average of the mpg values of its three nearest neighbours, which is approximately 16.47.
Let's calculate the squared error against the actual mpg of observation 15.
# actual mpg for the Cadillac Fleetwood
y <- mtcars[15, 1]
error <- y - yhat
error_square <- error^2
print(error_square)
## [1] 36.80444
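The same idea extends to several held-out observations: repeat the distance, ranking and averaging steps for each test row and summarise the squared errors as a mean squared error (a sketch that simply wraps the steps above in a loop; the test row indices are only illustrative):
# evaluate the manual KNN on a few other mtcars rows (indices chosen for illustration)
test_rows <- 13:16
sq_errors <- sapply(test_rows, function(r) {
  test_point <- mtcars[r, -c(1, 2, 8, 9, 10, 11)]
  d <- euclidean_dist(known_normalized_data, test_point)
  neighbours <- order(d)[1:k]
  pred <- mean(mtcars[1:12, 1][neighbours])
  (mtcars[r, 1] - pred)^2
})
mean(sq_errors)  # mean squared error over the test rows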
My experience in using KNN for regression