1. Load and view dataset

The details about this dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html

data("airquality")
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

2. Preprocess the dataset

Let’s begin by finding which attributes have missing values. We then impute those missing values (NA) by replacing each NA with the average of that attribute for the same month. Let’s begin!

col1 <- sapply(airquality, anyNA) # apply anyNA() to every column of the airquality dataset
col1
##   Ozone Solar.R    Wind    Temp   Month     Day 
##    TRUE    TRUE   FALSE   FALSE   FALSE   FALSE

The output shows that only the Ozone and Solar.R attributes contain NA, i.e. missing values.
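To see how many values are missing, rather than only whether any are, a count per column can help. The following is a minimal sketch run on a fresh copy of the data; the names `aq` and `na_counts` are only illustrative.

```r
# Work on a fresh copy so the tutorial's `airquality` object is left untouched
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))
na_counts

# Share of rows with a missing value in each column, in percent
round(100 * na_counts / nrow(aq), 1)
```

Columns with a count of zero need no imputation.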

# Impute the monthly mean for missing Ozone and Solar.R values
for (i in seq_len(nrow(airquality))) {
  # rows that belong to the same month as row i
  same_month <- which(airquality[, "Month"] == airquality[i, "Month"])
  if (is.na(airquality[i, "Ozone"])) {
    airquality[i, "Ozone"] <- mean(airquality[same_month, "Ozone"], na.rm = TRUE)
  }
  if (is.na(airquality[i, "Solar.R"])) {
    airquality[i, "Solar.R"] <- mean(airquality[same_month, "Solar.R"], na.rm = TRUE)
  }
}
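# Normalize the dataset with min-max scaling (the aim: no attribute should outweigh the others in the clustering).
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Note: called on the whole data frame, min() and max() are taken over all columns together,
# so every attribute is rescaled with the same global minimum and maximum.
airquality <- normalize(airquality) # replace contents of dataset with normalized values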
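Calling `normalize()` on the whole data frame uses a single minimum and maximum taken over all columns, so the attributes keep different ranges. If the aim is for each attribute to span [0, 1] on its own, one option is to apply the scaling column by column. The following is a minimal sketch on a fresh copy; `aq`, `aq_scaled` and `normalize_col` are illustrative names, and `na.rm = TRUE` is an addition so that the sketch also runs on data that still contains NA.

```r
aq <- datasets::airquality

# Min-max scaling of a single numeric vector to [0, 1]
normalize_col <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply to each column separately; `aq_scaled[] <-` keeps the data frame structure
aq_scaled <- aq
aq_scaled[] <- lapply(aq, normalize_col)

# Minimum and maximum of every column after scaling
sapply(aq_scaled, range, na.rm = TRUE)
```

Clustering on `aq_scaled` would generally give different cluster sizes and centers from the ones printed in this tutorial.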

Yay! We have removed the missing values from our dataset and rescaled it. We can now perform k-means clustering on our dataset!
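The monthly-mean imputation can also be written without an explicit loop. The following is a minimal sketch using base R's `ave()` on a fresh copy; `aq` and `impute_group_mean` are illustrative names, not part of the original tutorial. It should give the same values as the loop, because filling a gap with the current monthly mean leaves that mean unchanged.

```r
aq <- datasets::airquality

# Replace NA in a vector by the mean of its group (here: the month)
impute_group_mean <- function(x, group) {
  ave(as.numeric(x), group, FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
}

aq$Ozone   <- impute_group_mean(aq$Ozone, aq$Month)
aq$Solar.R <- impute_group_mean(aq$Solar.R, aq$Month)

# Confirm that no missing values remain
anyNA(aq)
```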

3. Apply k-means clustering algorithm

result <- kmeans(airquality[c(1,2,3,4)], 3) # apply k-means using the first 4 attributes with k = 3 (number of required clusters)
result$size # number of records in each cluster
## [1] 47 69 37
result$centers # coordinates of the cluster centers (3 centers for k = 3)
##        Ozone   Solar.R       Wind      Temp
## 1 0.13504282 0.5084504 0.02332119 0.2387707
## 2 0.13747545 0.7907037 0.02754058 0.2345389
## 3 0.06695191 0.1710900 0.03024917 0.2140248
result$cluster # cluster vector: the cluster to which each record is assigned
##   [1] 1 1 1 2 1 1 2 3 3 1 1 2 2 2 3 2 2 3 2 3 3 2 3 3 3 2 1 3 2 2 2 2 2 2 1
##  [36] 2 2 1 2 2 2 2 2 1 2 2 1 2 3 1 1 1 3 3 2 1 1 3 3 3 1 2 2 2 3 1 2 2 2 2
##  [71] 1 1 2 1 2 3 2 2 2 1 2 3 2 2 2 2 3 3 1 2 2 2 3 3 3 1 1 1 2 2 1 2 1 1 2
## [106] 1 3 3 3 3 2 1 2 3 2 1 2 1 1 1 2 2 1 1 1 1 1 3 3 2 2 2 2 2 2 2 3 3 2 2
## [141] 3 2 1 2 3 1 3 3 1 1 1 1 2
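`kmeans()` starts from randomly chosen centers, so the cluster numbering, sizes and centers can change from run to run. The following is a minimal sketch of one way to make a run repeatable, assuming an arbitrary seed (42) and several random starts via `nstart`. For brevity it drops incomplete rows and uses z-score scaling with `scale()`, which differs from the preprocessing in this tutorial, so its numbers are not expected to match those printed above.

```r
# Fresh copy of the first four attributes; rows with NA are dropped here
# for brevity (the tutorial imputes them instead)
x <- na.omit(datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
x <- scale(x)  # z-score standardisation, not the tutorial's min-max scaling

set.seed(42)                                # arbitrary seed, fixes the random starts
fit <- kmeans(x, centers = 3, nstart = 25)  # keep the best of 25 random starts

fit$size                   # records per cluster
fit$tot.withinss           # total within-cluster sum of squares
fit$betweenss / fit$totss  # share of total variance explained by the clusters
```

Re-running the last block with the same seed should reproduce the same assignment, which makes it easier to compare different values of k or different preprocessing choices.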

4. Visualize clustering results

par(mfrow=c(1,2), mar=c(5,4,2,2))
plot(airquality[,1:2], col=result$cluster) # Plot to see how Ozone and Solar.R data points have been distributed in clusters

The graph shows three clearly distinguishable clusters for the Ozone and Solar.R data points. Let’s see how the clustering has performed on the Wind and Temp attributes.

plot(airquality[,3:4], col=result$cluster) # Plot to see how Wind and Temp data points have been distributed in clusters

This graph shows that the Wind and Temp data points have not been separated into distinct clusters. Let us find out which attributes have influenced the k-means algorithm the most. For this, we will plot all pairwise combinations of attributes!

plot(airquality, col=result$cluster) # Scatterplot matrix of all attribute combinations

From the above plot, it can be seen that the clusters are well separated in every panel that involves Solar.R: Ozone:Solar.R, Wind:Solar.R, Temp:Solar.R, Month:Solar.R and Day:Solar.R. (Month and Day were not used for clustering; they only appear in the plot.) Since Solar.R is common to all of these pairs, it can be concluded that the clustering has been driven mostly by the values of the Solar.R attribute.
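One possible explanation for the influence of Solar.R is the scaling step: because `normalize()` was applied to the whole data frame, all attributes share one global minimum and maximum, and the attribute with the widest raw range keeps the widest range afterwards. The following is a minimal sketch that checks this on a fresh copy; `aq`, `aq_global` and `spread` are illustrative names. It uses the raw data, since imputing monthly means does not change any column's minimum or maximum.

```r
aq <- datasets::airquality

# Global min-max scaling, as done by normalize(airquality) in this tutorial:
# min() and max() on a data frame run over all of its values at once
global_min <- min(aq, na.rm = TRUE)
global_max <- max(aq, na.rm = TRUE)
aq_global <- (aq - global_min) / (global_max - global_min)

# Width of each clustered attribute after scaling
spread <- sapply(aq_global[, 1:4], function(col) diff(range(col, na.rm = TRUE)))
round(spread, 3)

# Attribute with the widest range
names(which.max(spread))
```

Since k-means relies on Euclidean distance, an attribute with a much wider range contributes more to the distances and therefore tends to shape the clusters.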

Good luck! :)