1. Load and view dataset

The details about this dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html

Let us assume that the “Month” attribute is our class attribute, with five possible values denoting five different months (May to September). Our classification task is then to predict the month of the year from a given set of values of Ozone, Solar.R, Wind and Temp. (Important: This classification problem has been created purely to help beginners understand the method; it does not portray a real-life scenario in any way.)

require("class")
## Loading required package: class
require("datasets")
data("airquality")
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

The first four attributes in our dataset are numeric and will be used as predictors. The fifth attribute, “Month”, will act as our class attribute. The last attribute, “Day” of the month, is not needed and will be removed during preprocessing.

2. Preprocess the dataset

2.1) Remove extra attributes

Let’s remove the “Day” attribute from the dataset.

airquality$Day<- NULL
head(airquality)
##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5

2.2) Find and impute missing values. Normalize the dataset.

We will first find the predictors that have missing values. We then impute those missing values (NA) with the average for the same month (just to keep it simple; there are other methods though!). Let’s begin!

col1<- mapply(anyNA,airquality) # apply function anyNA() on all columns of airquality dataset
col1
##   Ozone Solar.R    Wind    Temp   Month 
##    TRUE    TRUE   FALSE   FALSE   FALSE

The output shows that only the Ozone and Solar.R attributes contain NA, i.e. missing values.
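
If you also want to know how many values are missing in each column, a minimal sketch is shown below. It works on a fresh copy of the built-in dataset so that it does not interfere with the objects used in this post; the names aq and na_counts are only illustrative choices.

```r
# Count the missing values per column on a fresh copy of the built-in data.
# 'aq' and 'na_counts' are illustrative names, not part of the walkthrough.
aq <- datasets::airquality

# is.na() returns a logical matrix; summing each column counts the TRUEs.
na_counts <- colSums(is.na(aq))
print(na_counts)
```

Columns with a count of zero need no imputation at all.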

# Impute the monthly mean for missing Ozone and Solar.R values
for (i in 1:nrow(airquality)){
  # Ozone: replace NA with the mean Ozone of the same month
  if(is.na(airquality[i,"Ozone"])){
    airquality[i,"Ozone"]<- mean(airquality[which(airquality[,"Month"]==airquality[i,"Month"]),"Ozone"],na.rm = TRUE)
  }

  # Solar.R: replace NA with the mean Solar.R of the same month
  if(is.na(airquality[i,"Solar.R"])){
    airquality[i,"Solar.R"]<- mean(airquality[which(airquality[,"Month"]==airquality[i,"Month"]),"Solar.R"],na.rm = TRUE)
  }
}
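
The row-by-row loop above can also be written without an explicit loop. The following is only a sketch of one possible alternative, using base R's ave() to apply a function within each month. It runs on its own copy of the data, and the names aq and fill_with_mean are hypothetical helpers introduced for illustration.

```r
# Vectorised monthly-mean imputation (an alternative sketch, not the code
# used to produce the outputs in this post).
aq <- datasets::airquality

# Hypothetical helper: replace the NAs in a vector with the mean of the rest.
fill_with_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

for (col in c("Ozone", "Solar.R")) {
  # ave() splits the column by Month, applies the helper to each group
  # and puts the results back in the original row order.
  aq[[col]] <- ave(as.numeric(aq[[col]]), aq$Month, FUN = fill_with_mean)
}

print(sapply(aq, anyNA))
```

Both versions fill a missing value with the mean of the observed values from the same month.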

# Normalize the predictor attributes so that attributes with large values do not dominate the distance calculation of the classification algorithm.
normalize<- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
# Note: called on all four columns at once, min(x) and max(x) are taken over the whole block, not per column.
airquality[,1:4]<- normalize(airquality[,1:4]) # replace contents of dataset with normalized values

Yay! We have removed missing values from our dataset. And since we have normalized it, all values now fall in the range 0 to 1, with the four columns sharing one common scale.
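
One thing to keep in mind: because normalize() above receives all four columns at once, min(x) and max(x) are computed over the whole block, so every column is scaled by the same overall minimum and maximum. If you would rather scale each predictor to its own 0 to 1 range, a possible variation (not what produced the outputs in this post) is to apply the function column by column with lapply(). The sketch below uses its own copy aq and a helper named rescale, both illustrative names, and drops rows with NA only to keep the example short.

```r
# Per-column min-max scaling (a variation, assuming each predictor should
# get its own 0..1 range). 'aq' and 'rescale' are illustrative names.
aq <- na.omit(datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

rescale <- function(x) (x - min(x)) / (max(x) - min(x))

# lapply() visits one column at a time, so min and max are per column;
# assigning into aq[] keeps the data frame structure and column names.
aq[] <- lapply(aq, rescale)

# Each column should now span exactly 0 to 1.
print(sapply(aq, range))
```

With per-column scaling, a unit of distance means "the full observed range" for every predictor, which is usually what distance-based methods such as k-NN expect. Results obtained this way will differ from the ones shown in the rest of this post.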

2.3) Separate Class Attribute from rest of the dataset

class<- data.frame("Month"=airquality$Month) # keep the class attribute in its own data frame
airquality[,5]<- NULL #remove "Month" from airquality
head(airquality)
##        Ozone   Solar.R       Wind      Temp
## 1 0.12012012 0.5675676 0.01921922 0.1981982
## 2 0.10510511 0.3513514 0.02102102 0.2132132
## 3 0.03303303 0.4444444 0.03483483 0.2192192
## 4 0.05105105 0.9369369 0.03153153 0.1831832
## 5 0.06791407 0.5414303 0.03993994 0.1651652
## 6 0.08108108 0.5414303 0.04174174 0.1951952

2.4) Create Training and Test data subsets

Since classification is a type of supervised learning, we need two sets of data: training data and test data (a split of around 80:20 is common; here we use 130 rows for training and 23 for testing, roughly 85:15). We will now divide the dataset into two subsets. Our k-NN classification model will then be built from the subset “airquality.train” and tested using “airquality.test”. Since the airquality dataset is sorted by “Month” by default, we will first shuffle the rows and then take the subsets.

set.seed(999) # required to reproduce the results
rnum<- sample(nrow(airquality)) # random permutation of the row numbers 1 to 153
airquality<- airquality[rnum,] #randomize "airquality" dataset
class<- as.data.frame(class[rnum,]) #apply same randomization on "class" attribute

airquality.train<- airquality[1:130,]
airquality.train.target<- class[1:130,]
airquality.test<- airquality[131:153,]
airquality.test.target<- class[131:153,]
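
The split above hard-codes the row positions 130 and 131. As a hedged alternative, the sizes can be derived from a chosen proportion instead. The sketch below assumes an 80:20 split and uses illustrative names (aq, idx, n_train, train, test) on a fresh copy of the data, so it does not change the objects created above.

```r
# Proportion-based train/test split (a sketch assuming an 80:20 ratio).
aq <- datasets::airquality

set.seed(999)                            # makes the shuffle reproducible
idx <- sample(nrow(aq))                  # shuffled row numbers
n_train <- floor(0.8 * nrow(aq))         # number of rows used for training

train <- aq[idx[seq_len(n_train)], ]     # first 80% of the shuffled rows
test  <- aq[idx[-seq_len(n_train)], ]    # the remaining rows

cat("training rows:", nrow(train), " test rows:", nrow(test), "\n")
```

Computing the sizes this way keeps the code valid if the number of rows changes.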

Finally, preprocessing is over :-)

3. Apply k-NN classification algorithm

neigh<- round(sqrt(nrow(airquality)))+1 # rule of thumb: k is roughly the square root of the total number of instances
model<- knn(train = airquality.train,  test = airquality.test, cl=airquality.train.target, k=neigh) # apply the k-NN algorithm

The number of neighbours considered by the algorithm is 13 (round(sqrt(153)) + 1 = 12 + 1).
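
To get a feel for what happens inside knn(), here is a minimal from-scratch sketch for a single query point: compute the Euclidean distance to every training point, pick the k nearest, and take a majority vote. The function knn_one and the toy data are made up for illustration; the real knn() in the class package is implemented differently (for example, it breaks ties at random).

```r
# A toy k-NN classifier for one query point (illustration only).
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every row of 'train'
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  # positions of the k closest training points
  nearest <- order(d)[seq_len(k)]
  # majority vote among their labels (first label wins on a tie)
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Two well-separated groups of made-up points
toy_train  <- matrix(c(0, 0,  0, 1,  1, 0,
                       5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
toy_labels <- c("a", "a", "a", "b", "b", "b")

knn_one(toy_train, toy_labels, query = c(0.5, 0.5), k = 3)  # "a"
knn_one(toy_train, toy_labels, query = c(5.5, 5.5), k = 3)  # "b"
```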

4. Visualize classification results

table(airquality.test.target, model)
##                       model
## airquality.test.target 5 6 7 8 9
##                      5 2 1 0 0 0
##                      6 2 1 0 0 0
##                      7 1 1 3 0 0
##                      8 0 1 1 2 1
##                      9 1 3 0 1 2

The values on the diagonal show the number of correctly classified instances out of the 23 instances in the test set. The values off the diagonal are instances that have been classified incorrectly. Hence, there is scope for further improvement of the classifier. One way to improve it is to try different values of “k” and choose the one with the highest accuracy. Other classification algorithms may also be tried to get a better result. There is no stopping in optimization!
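
The table above can also be summarised numerically. The sketch below re-enters the printed counts by hand (so it runs on its own) and computes the overall accuracy and the share of each actual month that was classified correctly. The names cm, overall and per_month are illustrative.

```r
# Confusion matrix copied from the table above: rows = actual month,
# columns = predicted month.
months <- as.character(5:9)
cm <- matrix(c(2, 1, 0, 0, 0,
               2, 1, 0, 0, 0,
               1, 1, 3, 0, 0,
               0, 1, 1, 2, 1,
               1, 3, 0, 1, 2),
             nrow = 5, byrow = TRUE,
             dimnames = list(actual = months, predicted = months))

# Correct predictions sit on the diagonal: 10 out of 23.
overall <- sum(diag(cm)) / sum(cm)

# Share of each actual month that was predicted correctly.
per_month <- diag(cm) / rowSums(cm)

print(round(overall, 7))    # 0.4347826
print(round(per_month, 2))
```

The per-month values show which months the classifier confuses most, something the single accuracy figure hides.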

5. Calculate Accuracy

mean(airquality.test.target== model)
## [1] 0.4347826

This accuracy is low, although still better than uniform random guessing among five classes (about 20%). Try and see if you can get a better accuracy by increasing or decreasing the value of “k”. Do comment with the value of “k” that gives a better accuracy.
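
If you would like a starting point for that experiment, the sketch below loops over several odd values of “k” and records the test accuracy for each. It is self-contained and therefore rebuilds the data in its own way: it uses per-column scaling and illustrative names (aq, fill_with_mean, rescale, k_values, results), so its numbers will not match the ones above exactly. No particular outcome is claimed here; run it and see.

```r
library(class)  # provides knn()

# Rebuild a small, self-contained version of the data.
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]

# Monthly-mean imputation for the two columns that contain NA
fill_with_mean <- function(v) { v[is.na(v)] <- mean(v, na.rm = TRUE); v }
for (col in c("Ozone", "Solar.R")) {
  aq[[col]] <- ave(as.numeric(aq[[col]]), aq$Month, FUN = fill_with_mean)
}

# Per-column min-max scaling of the four predictors (an assumption of this sketch)
rescale <- function(x) (x - min(x)) / (max(x) - min(x))
aq[1:4] <- lapply(aq[1:4], rescale)

# Shuffle and split: 130 training rows, 23 test rows
set.seed(999)
idx   <- sample(nrow(aq))
train <- aq[idx[1:130], ]
test  <- aq[idx[131:153], ]

# Try odd values of k and record the accuracy on the test set
k_values <- seq(1, 25, by = 2)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train[1:4], test = test[1:4],
              cl = factor(train$Month), k = k)
  mean(as.character(pred) == as.character(test$Month))
})

results <- data.frame(k = k_values, accuracy = accuracy)
print(results)

# First k that reaches the highest accuracy
best_k <- results$k[which.max(results$accuracy)]
```

Keep in mind that picking “k” by looking at the test set makes the reported accuracy optimistic; a separate validation set or cross-validation (for example knn.cv() from the same package) is a fairer way to choose it.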

You may also wish to try out data classification, clustering or linear regression in the following tutorials:

  1. k-NN Classification for beginners
    Using Iris Dataset
  2. k-means Clustering for beginners
    Using Iris Dataset
    Using Airquality Dataset
  3. Linear Regression for beginners
    Using Iris Dataset
    Using Airquality Dataset

Good luck! :)