The details about this dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html
Let us assume that the “Month” attribute is our class attribute, with five possible values denoting five different months (May to September). Our classification task is then to predict the month of the year from a given set of values of Ozone, Solar.R, Wind and Temp. (Important: this classification problem was created purely to help beginners understand the method and does not portray a real-life scenario in any way.)
require("class")
## Loading required package: class
require("datasets")
data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
The first four attributes of our dataset are numeric and will be used as predictors. The fifth attribute, “Month”, will act as our class attribute. The last attribute, “Day” of the month, can be ignored and will be removed during preprocessing.
Let’s remove the “Day” attribute from the dataset.
airquality$Day<- NULL
head(airquality)
## Ozone Solar.R Wind Temp Month
## 1 41 190 7.4 67 5
## 2 36 118 8.0 72 5
## 3 12 149 12.6 74 5
## 4 18 313 11.5 62 5
## 5 NA NA 14.3 56 5
## 6 28 NA 14.9 66 5
Next, we find the predictors that have missing values and impute those missing values (NA) with the average for the corresponding month (just to keep it simple; other imputation methods exist). Let’s begin!
col1<- mapply(anyNA,airquality) # apply function anyNA() on all columns of airquality dataset
col1
## Ozone Solar.R Wind Temp Month
## TRUE TRUE FALSE FALSE FALSE
The output shows that only the Ozone and Solar.R attributes contain NA, i.e. missing values.
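To see how many values are missing in each column, rather than only whether any are missing, the counts can be tallied with colSums(). This is a minimal sketch; it works on a fresh copy of the dataset under the hypothetical name `aq`, so the airquality object used in this post is left untouched.

```r
# Fresh copy under a hypothetical name; the tutorial's object is not modified
aq <- datasets::airquality

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(aq))
print(na_counts)

# Number of rows that have at least one missing value
rows_with_na <- sum(!complete.cases(aq))
print(rows_with_na)
```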
# Impute missing values with the mean of the same month
for (i in 1:nrow(airquality)) {
  # Impute monthly mean in Ozone
  if (is.na(airquality[i, "Ozone"])) {
    airquality[i, "Ozone"] <- mean(airquality[which(airquality[, "Month"] == airquality[i, "Month"]), "Ozone"], na.rm = TRUE)
  }
  # Impute monthly mean in Solar.R
  if (is.na(airquality[i, "Solar.R"])) {
    airquality[i, "Solar.R"] <- mean(airquality[which(airquality[, "Month"] == airquality[i, "Month"]), "Solar.R"], na.rm = TRUE)
  }
}
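The row-by-row loop can also be written without an explicit loop over rows. The sketch below uses base R's ave() to apply a small helper within each month. The names `aq` and `impute_mean` are hypothetical and not part of the original post. It is meant as an alternative illustration, not a replacement for the code above.

```r
# Fresh copy under a hypothetical name; the tutorial's object is not modified
aq <- datasets::airquality

# Hypothetical helper: replace NAs in a vector by the mean of its non-missing values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# ave() splits the column by month, applies impute_mean() to each group,
# and returns the values in the original row order
for (col in c("Ozone", "Solar.R")) {
  aq[[col]] <- ave(as.numeric(aq[[col]]), aq$Month, FUN = impute_mean)
}

print(anyNA(aq))   # checks whether any missing value remains
print(head(aq))
```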
# Normalize the predictor attributes to the range 0 to 1, so that attributes with large raw values do not dominate the distance computation in k-NN.
normalize<- function(x){
return((x-min(x))/(max(x)-min(x)))
}
airquality[,1:4]<- normalize(airquality[,1:4]) # note: min() and max() here are taken over all four columns together, not per column
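Yay! We have removed the missing values from our dataset, and after normalization all values fall in the range 0 to 1. Because normalize() was called on the four columns as one block, the columns keep their relative scales: Wind and Temp occupy only a narrow part of that range, as the output below shows.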
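If the aim is for every predictor to span the full 0 to 1 range on its own, the scaling has to be applied one column at a time. The following is a minimal sketch of that variant, assuming a fresh copy of the four predictors (hypothetical name `aq`) and a hypothetical helper `normalize_col`. The copy still contains NAs, so the helper ignores them when finding the range. Using it in place of the call above would change the numbers shown in the rest of this post.

```r
# Fresh copy of the four predictors under a hypothetical name
aq <- datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Hypothetical helper: min-max scaling of a single vector, ignoring NAs
normalize_col <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# lapply() visits the columns one at a time;
# assigning into aq[] keeps the data frame structure
aq[] <- lapply(aq, normalize_col)

# Minimum and maximum of each column after scaling
print(sapply(aq, range, na.rm = TRUE))
```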
class<- data.frame("month"=airquality$Month)
names(class)= "Month"
airquality[,5]<- NULL #remove "Month" from airquality
head(airquality)
## Ozone Solar.R Wind Temp
## 1 0.12012012 0.5675676 0.01921922 0.1981982
## 2 0.10510511 0.3513514 0.02102102 0.2132132
## 3 0.03303303 0.4444444 0.03483483 0.2192192
## 4 0.05105105 0.9369369 0.03153153 0.1831832
## 5 0.06791407 0.5414303 0.03993994 0.1651652
## 6 0.08108108 0.5414303 0.04174174 0.1951952
Since classification is a type of supervised learning, we need two sets of data: training data and testing data (a split such as 80:20 is common; here we use 130 rows for training and the remaining 23 for testing). We will now divide the dataset into two subsets. Our k-NN classification model will be trained on the subset “airquality.train” and tested on “airquality.test”. Since the airquality dataset is sorted by “Month” by default, we first shuffle the rows and then take the subsets.
set.seed(999) # required to reproduce the results
rnum<- sample(nrow(airquality)) # random permutation of the row indices 1 to 153
airquality<- airquality[rnum,] #randomize "airquality" dataset
class<- as.data.frame(class[rnum,]) #apply same randomization on "class" attribute
airquality.train<- airquality[1:130,]
airquality.train.target<- class[1:130,]
airquality.test<- airquality[131:153,]
airquality.test.target<- class[131:153,]
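The split above hard-codes 130 and 153, which ties it to this dataset's size. As a sketch, the same shuffle-and-split idea can be written with the training fraction as a parameter. The names `aq`, `train_frac`, `train_idx` and `test_idx` are hypothetical, and 0.85 is chosen only because 130 of 153 rows is roughly 85%.

```r
# Fresh copy under a hypothetical name
aq <- datasets::airquality

train_frac <- 0.85                        # hypothetical parameter
set.seed(999)                             # for reproducibility
shuffled <- sample(nrow(aq))              # random permutation of row indices
n_train <- floor(train_frac * nrow(aq))   # number of training rows

train_idx <- shuffled[seq_len(n_train)]   # first n_train shuffled rows
test_idx <- shuffled[-seq_len(n_train)]   # the remaining rows

# How many rows of each month ended up in each subset
print(table(aq$Month[train_idx]))
print(table(aq$Month[test_idx]))
```

Looking at the per-month counts is a quick way to check that no month is badly under-represented in either subset.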
Finally, preprocessing is over :-)
neigh<- round(sqrt(nrow(airquality)))+1 # rule of thumb: k is often set near the square root of the number of instances
model<- knn(train = airquality.train, test = airquality.test, cl=airquality.train.target, k=neigh) # apply the k-NN algorithm
The algorithm considers 13 neighbours.
table(airquality.test.target, model)
## model
## airquality.test.target 5 6 7 8 9
## 5 2 1 0 0 0
## 6 2 1 0 0 0
## 7 1 1 3 0 0
## 8 0 1 1 2 1
## 9 1 3 0 1 2
The values on the diagonal show the number of correctly classified instances out of the 23 instances in the test set. The off-diagonal values are instances that have been classified incorrectly. Hence, there is scope for further improvement in the classifier model. One way to improve it is to try different values of “k” and choose the one with the maximum accuracy. Other classification algorithms may also be tried to get a better result. There is no stopping in optimization!
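The diagonal-versus-total reading of the table can be checked by hand. The sketch below re-enters the confusion matrix shown above as a plain matrix (hypothetical name `cm`) and derives the overall accuracy and the share of each actual month that was classified correctly.

```r
# Confusion matrix copied from the table above
# (rows: actual month, columns: predicted month)
months <- as.character(5:9)
cm <- matrix(c(2, 1, 0, 0, 0,
               2, 1, 0, 0, 0,
               1, 1, 3, 0, 0,
               0, 1, 1, 2, 1,
               1, 3, 0, 1, 2),
             nrow = 5, byrow = TRUE,
             dimnames = list(actual = months, predicted = months))

correct <- sum(diag(cm))      # correctly classified test instances
total <- sum(cm)              # all test instances
accuracy <- correct / total
print(accuracy)

# Share of each actual month that was predicted correctly
per_month <- diag(cm) / rowSums(cm)
print(round(per_month, 2))
```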
mean(airquality.test.target== model)
## [1] 0.4347826
This accuracy is about twice what random guessing among five roughly equally frequent months would achieve (about 20%), but it is still low. Try and see if you can get better accuracy by increasing or decreasing the value of “k”. Do comment with the value of “k” that gives better accuracy.
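One way to explore different values of “k” is to loop over a range of candidates and record the test accuracy for each. The sketch below is self-contained and rebuilds the preprocessing on a fresh copy (hypothetical name `aq`). As an assumption that differs from the code above, it scales each predictor separately, so its numbers are not expected to match the results shown in this post. The range of odd k values from 1 to 25 is an arbitrary choice for illustration.

```r
library(class)   # provides knn()

# Fresh copy under a hypothetical name, without the Day column
aq <- datasets::airquality
aq$Day <- NULL
preds <- c("Ozone", "Solar.R", "Wind", "Temp")

# Impute monthly means for the columns that contain NAs
fill <- function(v) { v[is.na(v)] <- mean(v, na.rm = TRUE); v }
for (col in c("Ozone", "Solar.R")) {
  aq[[col]] <- ave(as.numeric(aq[[col]]), aq$Month, FUN = fill)
}

# Assumption: per-column min-max scaling (not the block-wise call used above)
aq[preds] <- lapply(aq[preds], function(x) (x - min(x)) / (max(x) - min(x)))

# Shuffle and split into 130 training rows and 23 test rows
set.seed(999)
idx <- sample(nrow(aq))
train <- idx[1:130]
test <- idx[131:153]

# Test accuracy for each candidate k
ks <- seq(1, 25, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = aq[train, preds], test = aq[test, preds],
              cl = factor(aq$Month[train]), k = k)
  mean(as.character(pred) == as.character(aq$Month[test]))
})

print(data.frame(k = ks, accuracy = round(acc, 3)))
```

With only 23 test rows, differences between neighbouring values of k can easily be noise, so picking the best k on this one test set gives an optimistic estimate. The class package also provides knn.cv() for leave-one-out cross-validation, which may give a steadier comparison.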
You may also wish to try out data classification, clustering or linear regression in the following posts:
k-NN Classification for beginners: Using Iris Dataset
k-means Clustering for beginners: Using Airquality Dataset
Linear Regression for beginners
Good luck! :)