This is an introduction to KNN (k-nearest neighbours) regression.
We will be using the mtcars dataset for this example.
We pick 12 observations from the mtcars data set and call this subset “known_data”, as we already know their responses. The attributes are disp, hp, drat, wt and qsec, and the response variable (mpg) is continuous.
We will then pick one observation from the original mtcars data set (observation number 15), call it “unknown_data”, treat its response as unknown, and predict its mpg.
In other words, we will predict the mpg of the Cadillac Fleetwood using KNN.
# known data points: first 12 rows of mtcars, dropping cyl, vs, am, gear and carb
known_data <- mtcars[1:12, -c(2, 8, 9, 10, 11)]
head(known_data)
## mpg disp hp drat wt qsec
## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22
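The same subset can also be built by naming the columns to keep rather than dropping positions, which some readers may find easier to follow (an equivalent sketch, not part of the original walkthrough):
# equivalent selection using column names instead of positional indices
known_data <- mtcars[1:12, c("mpg", "disp", "hp", "drat", "wt", "qsec")]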
Before proceeding further, let's normalize the attributes so that they are on a common scale and independent of their units.
# Function to min-max normalize an attribute to the [0, 1] range
normalize <- function(x) {
  norm <- (x - min(x)) / (max(x) - min(x))
  return(norm)
}
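As a quick illustration of what the function does (a toy example, not part of the original walkthrough), normalizing a small vector rescales it to the 0 to 1 range:
# toy example: normalize(c(2, 4, 6)) should return 0.0, 0.5 and 1.0
normalize(c(2, 4, 6))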
# Apply the function to each attribute column (the response mpg is excluded)
known_normalized_data <- apply(known_data[, -1], 2, normalize)
head(known_normalized_data)
## disp hp drat wt qsec
## Mazda RX4 0.2063492 0.2622951 0.9827586 0.1714286 0.0878187
## Mazda RX4 Wag 0.2063492 0.2622951 0.9827586 0.3171429 0.1671388
## Datsun 710 0.0000000 0.1693989 0.9396552 0.0000000 0.3923513
## Hornet 4 Drive 0.5952381 0.2622951 0.2758621 0.5114286 0.5099150
## Hornet Sportabout 1.0000000 0.6174863 0.3362069 0.6400000 0.1671388
## Valiant 0.4642857 0.2349727 0.0000000 0.6514286 0.6203966
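As a sanity check (an extra step, not in the original walkthrough), every normalized column should now run from exactly 0 to 1:
# range of each normalized attribute should be 0 to 1
apply(known_normalized_data, 2, range)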
# unknown data point: row 15 (Cadillac Fleetwood), keeping only the five attributes
unknown_data <- mtcars[15, -c(1, 2, 8, 9, 10, 11)]
unknown_data
## disp hp drat wt qsec
## Cadillac Fleetwood 472 205 2.93 5.25 17.98
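The attribute columns of the unknown observation line up, name for name, with the columns of the normalized known data, which is what the distance calculation below relies on (a quick check, not part of the original walkthrough):
# both data sets should carry the same attribute columns in the same order
identical(colnames(known_normalized_data), colnames(unknown_data))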
Let's build the KNN regression step by step. First, we need the Euclidean distance between the unknown observation and each known observation:
\[Euclidean\ distance = \sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2 + \dots}\]
# function to calculate the Euclidean distance between the unknown
# observation and every row of the known data
euclidean_dist <- function(k, unk) {
  distance <- rep(0, nrow(k))
  for (i in 1:nrow(k)) {
    distance[i] <- sqrt((k[, 1][i] - unk[, 1])^2 + (k[, 2][i] - unk[, 2])^2 +
                        (k[, 3][i] - unk[, 3])^2 + (k[, 4][i] - unk[, 4])^2 +
                        (k[, 5][i] - unk[, 5])^2)
  }
  return(distance)
}
# Euclidean distance from the unknown observation to each known observation,
# paired with the known mpg values
edist <- data.frame(dist = euclidean_dist(known_normalized_data, unknown_data),
                    mpg = mtcars[1:12, 1])
print(edist)
## dist mpg
## 1 514.6421 21.0
## 2 514.6379 21.0
## 3 514.8595 22.8
## 4 514.2679 21.4
## 5 513.7660 18.7
## 6 514.3938 18.1
## 7 513.6192 14.3
## 8 514.7746 24.4
## 9 514.7108 22.8
## 10 514.5728 19.2
## 11 514.5699 17.8
## 12 514.0566 16.4
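For reference, the same distances can be computed without an explicit loop by subtracting the unknown observation from every row of the known data and summing the squared differences row-wise (a vectorized sketch that should reproduce the values returned by euclidean_dist above):
# vectorized equivalent: subtract, square, sum across columns, take the square root
diffs <- sweep(known_normalized_data, 2, unlist(unknown_data), FUN = "-")
edist_vec <- sqrt(rowSums(diffs^2))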
A common rule of thumb is to take k as the square root of the number of known observations, rounded down:
\[k = \lfloor \sqrt{n} \rfloor\]
# calculate k, the number of nearest neighbours to use
k <- floor(sqrt(nrow(known_normalized_data)))
print(k)
## [1] 3
# rank the known observations by their distance to the unknown observation
rank_edist <- edist[order(edist$dist), ]
# print the k = 3 nearest neighbours (smallest distances)
rank_edist[1:3, ]
## dist mpg
## 7 513.6192 14.3
## 5 513.7660 18.7
## 12 514.0566 16.4
# predicted mpg: the average response of the k nearest neighbours
yhat <- mean(rank_edist[1:3, 2])
print(yhat)
## [1] 16.46667
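If a packaged implementation is preferred, knn.reg() from the FNN package can reproduce this prediction (a sketch, assuming the FNN package is installed; given the same inputs it should pick the same three neighbours):
library(FNN)
# KNN regression with k = 3 on the same training data and test point
fit <- knn.reg(train = known_normalized_data, test = unknown_data,
               y = mtcars[1:12, 1], k = 3)
fit$pred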
The predicted mpg for the Cadillac Fleetwood is the average of the mpg values of its three nearest neighbours, which is approximately 16.47.
Let's calculate the squared error against the actual mpg of observation 15.
# actual mpg for the Cadillac Fleetwood
y <- mtcars[15, 1]
error <- y - yhat
error_square <- error^2
print(error_square)
## [1] 36.80444
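The same idea extends to several held-out observations: repeat the distance, ranking and averaging steps for each test row and summarise the squared errors as a mean squared error (a sketch that simply wraps the steps above in a loop; the test row indices are only illustrative):
# evaluate the manual KNN on a few other mtcars rows (indices chosen for illustration)
test_rows <- 13:16
sq_errors <- sapply(test_rows, function(r) {
  test_point <- mtcars[r, -c(1, 2, 8, 9, 10, 11)]
  d <- euclidean_dist(known_normalized_data, test_point)
  neighbours <- order(d)[1:k]
  pred <- mean(mtcars[1:12, 1][neighbours])
  (mtcars[r, 1] - pred)^2
})
mean(sq_errors)  # mean squared error over the test rows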
My experience in using KNN for regression