The details about this dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html
data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Let’s begin by finding which attributes have missing values. We then need to impute those missing values(NA), which we will be doing simply by replacing NA with monthly average. Let’s begin!
col1<- mapply(anyNA,airquality) # apply function anyNA() on all columns of airquality dataset
col1
## Ozone Solar.R Wind Temp Month Day
## TRUE TRUE FALSE FALSE FALSE FALSE
The output shows that only Ozone and Solar.R attributes have NA i.e. some missing value.
# Impute monthly mean in Ozone
for (i in 1:nrow(airquality)){
if(is.na(airquality[i,"Ozone"])){
airquality[i,"Ozone"]<- mean(airquality[which(airquality[,"Month"]==airquality[i,"Month"]),"Ozone"],na.rm = TRUE)
}
# Impute monthly mean in Solar.R
if(is.na(airquality[i,"Solar.R"])){
airquality[i,"Solar.R"]<- mean(airquality[which(airquality[,"Month"]==airquality[i,"Month"]),"Solar.R"],na.rm = TRUE)
}
}
#Normalize the dataset so that no particular attribute has more impact on clustering algorithm than others.
normalize<- function(x){
return((x-min(x))/(max(x)-min(x)))
}
airquality<- normalize(airquality) # replace contents of dataset with normalized values
Yay! We have removed missing values from our dataset. We can now perform k-means clustering on our dataset!
result<- kmeans(airquality[c(1,2,3,4)],3) # apply k-means algorithmusing first 4 attributes and with k=3(no. of required clusters)
result$size # gives no. of records in each cluster
## [1] 47 69 37
result$centers # gives value of cluster center datapoint value(3 centers for k=3)
## Ozone Solar.R Wind Temp
## 1 0.13504282 0.5084504 0.02332119 0.2387707
## 2 0.13747545 0.7907037 0.02754058 0.2345389
## 3 0.06695191 0.1710900 0.03024917 0.2140248
result$cluster #gives cluster vector showing the custer where each record falls
## [1] 1 1 1 2 1 1 2 3 3 1 1 2 2 2 3 2 2 3 2 3 3 2 3 3 3 2 1 3 2 2 2 2 2 2 1
## [36] 2 2 1 2 2 2 2 2 1 2 2 1 2 3 1 1 1 3 3 2 1 1 3 3 3 1 2 2 2 3 1 2 2 2 2
## [71] 1 1 2 1 2 3 2 2 2 1 2 3 2 2 2 2 3 3 1 2 2 2 3 3 3 1 1 1 2 2 1 2 1 1 2
## [106] 1 3 3 3 3 2 1 2 3 2 1 2 1 1 1 2 2 1 1 1 1 1 3 3 2 2 2 2 2 2 2 3 3 2 2
## [141] 3 2 1 2 3 1 3 3 1 1 1 1 2
par(mfrow=c(1,2), mar=c(5,4,2,2))
plot(airquality[,1:2], col=result$cluster) # Plot to see how Ozone and Solar.R data points have been distributed in clusters
Graph shows that we have got 3 clearly distinguishable clusters for Ozone and Solar.R data points. Let’s see how clustering has performed on Wind and Temp attributes.
plot(airquality[,3:4], col=result$cluster) # Plot to see how Wind and Temp data points have been distributed in clusters
This graph shows that Wind and Temp data points have not been clustered properly. Let us find out which attributes have been taken into consideration more by k-means algorithm. For this, we will plot all possible combinations of attributes!
plot(airquality[,], col=result$cluster) # Plot to see all attribute combinations
From the above plot, it can be seen that k-means algorithm has successfully clustered Ozone:Solar.R, Wind:Solar.R, Temp:Solar.R, Month:Solar.R, Day:Solar.R. By observing the pattern, it can be seen that Solar.R is common amongst all the groups and therefore, it could be concluded that the clustering has been most affected by values in Solar.R attribute.
You may also wish to try out Data Classification, Clustering or Linear Regression from following links:
k-NN Classification for beginners
Using Airquality Datasetk-means Clustering for beginners
Using Iris DatasetLinear Regression for beginners
Good luck! :)