Link to project on GitHub
Link to project on RPubs

About

In this work we demonstrate clustering of the weather dataset from the rattle package (Williams, 2014).

Data processing

Loading necessary libraries:

library(rattle) # Load weather dataset. Normalise names normVarNames().
library(randomForest) # Impute missing using na.roughfix().
library(ggplot2) # Visualise the data through plots.
library(reshape2) # Reshape data for plotting.
library(fpc) # Estimate k for k-means via average silhouette width: kmeansruns().

Loading the dataset (we need only the first 1000 rows):

ds <- read.csv('weatherAUS.csv', nrows = 1000)        

Normalise the variable names and identify the roles of the variables (target, risk, and identifiers):

names(ds) <- normVarNames(names(ds))
vars <- names(ds)
target <- "rain_tomorrow"
risk <- "risk_mm"
id <- c("date", "location")        

Ignore the IDs and the risk variable:

ignore <- union(id, if (exists("risk")) risk)        

Ignore variables which are completely missing:

mvc <- sapply(ds[vars], function(x) sum(is.na(x))) # Missing value count.
mvn <- names(ds)[(which(mvc == nrow(ds)))] # Missing var names.
ignore <- union(ignore, mvn)        

Remove the ignored variables from the list of variables to use:

vars <- setdiff(vars, ignore)        

Identify the input variables and, among them, the numeric ones:

inputc <- setdiff(vars, target) # Input variable names.
inputi <- sapply(inputc, function(x) which(x == names(ds)), USE.NAMES=FALSE) # Their column indices.

numi <- intersect(inputi, which(sapply(ds, is.numeric))) # Indices of the numeric inputs.

Impute missing values:

if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])        

Show the size and the first rows of the dataset:

dim(ds)  
## [1] 1000   24
head(ds)
##         date location min_temp max_temp rainfall evaporation sunshine
## 1 2008-12-01   Albury     13.4     22.9      0.6          NA       NA
## 2 2008-12-02   Albury      7.4     25.1      0.0          NA       NA
## 3 2008-12-03   Albury     12.9     25.7      0.0          NA       NA
## 4 2008-12-04   Albury      9.2     28.0      0.0          NA       NA
## 5 2008-12-05   Albury     17.5     32.3      1.0          NA       NA
## 6 2008-12-06   Albury     14.6     29.7      0.2          NA       NA
##   wind_gust_dir wind_gust_speed wind_dir_9am wind_dir_3pm wind_speed_9am
## 1             W              44            W          WNW             20
## 2           WNW              44          NNW          WSW              4
## 3           WSW              46            W          WSW             19
## 4            NE              24           SE            E             11
## 5             W              41          ENE           NW              7
## 6           WNW              56            W            W             19
##   wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1             24           71           22       1007.7       1007.1
## 2             22           44           25       1010.6       1007.8
## 3             26           38           30       1007.6       1008.7
## 4              9           45           16       1017.6       1012.8
## 5             20           82           33       1010.8       1006.0
## 6             24           55           23       1009.2       1005.4
##   cloud_9am cloud_3pm temp_9am temp_3pm rain_today risk_mm rain_tomorrow
## 1         8         7     16.9     21.8         No     0.0            No
## 2         8         7     17.2     24.3         No     0.0            No
## 3         8         2     21.0     23.2         No     0.0            No
## 4         8         7     18.1     26.5         No     1.0            No
## 5         7         8     17.8     29.7         No     0.2            No
## 6         8         7     20.6     28.9         No     0.0            No

Clustering

K-means is a traditional and widely used clustering algorithm.
It begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified by the vector of the average (i.e., the mean) value of each of the variables for the observations within that cluster. A random clustering is first constructed, the k means are calculated, and then, using the distance measure, each observation is assigned to its nearest mean. The means are then recalculated and the observations reassigned, and so on until the means no longer change.
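
As a minimal illustration (k = 3 here is arbitrary, not the value we end up choosing), a single k-means run on the raw numeric inputs exposes the pieces just described:

km <- kmeans(ds[numi], centers=3, nstart=5) # One run of the algorithm.
km$centers       # The k means: one row per cluster.
head(km$cluster) # Cluster membership of each observation.
km$iter          # Iterations until the means stopped changing.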

A unit of distance means something different for differently measured variables. For example, a difference of one year in age seems like it should count for more than a difference of $1 in income. A common approach is to rescale the data by subtracting the mean and dividing by the standard deviation, so that every variable has mean 0 and a unit of difference is one standard deviation.
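
scale() performs exactly this standardisation; as a quick sketch on the numeric inputs defined above, every column ends up with mean 0 and standard deviation 1:

sds <- scale(ds[numi])
round(colMeans(sds), 10) # All zero (up to floating-point error).
apply(sds, 2, sd)        # All one.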

Selecting the number of clusters

To select k we will use the average silhouette width. The silhouette value measures how well each object lies within its cluster: how similar it is to its own cluster (cohesion) compared to the nearest other cluster (separation). Formally, for an observation i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the mean distance from i to the members of the nearest other cluster. The silhouette ranges from -1 to 1; a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate; if many points have a low or negative value, the configuration may have too many or too few clusters. We choose the k that maximises the average silhouette width over all observations.

nk <- 1:20 # Candidate values of k.
model <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw") # "asw" = average silhouette width.
model$bestk
## [1] 2

So, in our case k = 2 is the optimal choice.
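
For reference, here is a minimal sketch of computing the average silhouette width by hand for a single k, using silhouette() from the cluster package (assumed to be installed); kmeansruns() performs an equivalent computation for every k in krange:

library(cluster) # silhouette().
sds <- scale(ds[numi])
km <- kmeans(sds, centers=2, nstart=10)
sil <- silhouette(km$cluster, dist(sds)) # s(i) for every observation.
mean(sil[, "sil_width"])                 # Average silhouette width for k = 2.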

Plot clusters

Cluster centres:

model$centers
##     min_temp   max_temp    rainfall wind_gust_speed wind_speed_9am
## 1 -0.7343731 -0.7255936 -0.08635213      -0.4125834     -0.2999989
## 2  0.8381603  0.8281401  0.09855607       0.4708928      0.3423971
##   wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1     -0.2778298    0.5982247    0.4736027    0.4891541    0.5051701
## 2      0.3170949   -0.6827704   -0.5405359   -0.5582850   -0.5765646
##     cloud_9am    cloud_3pm   temp_9am   temp_3pm
## 1  0.03173184  0.004475099 -0.7559283 -0.7058590
## 2 -0.03621642 -0.005107554  0.8627618  0.8056164

Plot the cluster profiles:

nclust <- 2
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers) # Long format: one row per cluster/variable pair.
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust)) # Preserve the variable order for plotting.
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + ylim(-1, 1)
p <- p + ggtitle("Clusters profile (variables were scaled)")
p