Link to project on GitHub
Link to project on RPubs
In this work we will demonstrate clustering of the weather dataset from the rattle package (Williams, 2014).
Loading necessary libraries:
library(rattle) # Weather dataset. Normalise names with normVarNames().
library(randomForest) # Impute missing using na.roughfix().
library(ggplot2) # Visualise the data through plots.
library(reshape2) # Reshape data for plotting.
library(fpc) # K-means with estimation of k (average silhouette width criterion).
Loading the dataset (we need only the first 1000 rows):
ds <- read.csv('weatherAUS.csv', nrows = 1000)
Normalise the variable names and identify the roles of the variables (target, risk, identifiers):
names(ds) <- normVarNames(names(ds))
vars <- names(ds)
target <- "rain_tomorrow"
risk <- "risk_mm"
id <- c("date", "location")
Ignore the IDs and the risk variable:
ignore <- union(id, if (exists("risk")) risk)
Ignore variables which are completely missing:
mvc <- sapply(ds[vars], function(x) sum(is.na(x))) # Missing value count.
mvn <- names(ds)[(which(mvc == nrow(ds)))] # Missing var names.
ignore <- union(ignore, mvn)
Remove the ignored variables from the list of variables:
vars <- setdiff(vars, ignore)
Identify the numeric input variables:
inputc <- setdiff(vars, target) # Candidate input variables.
inputi <- sapply(inputc, function(x) which(x == names(ds)), USE.NAMES=FALSE) # Column indices of the inputs.
numi <- intersect(inputi, which(sapply(ds, is.numeric))) # Indices of the numeric inputs.
Impute missing values:
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
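For reference, na.roughfix() from randomForest fills missing numeric values with the column median and missing factor values with the most frequent level. A tiny illustration on a toy data frame (the toy object is ours, not part of the dataset):
toy <- data.frame(temp = c(10, NA, 14), dir = factor(c("N", "N", NA)))
na.roughfix(toy) # temp: NA becomes the median 12; dir: NA becomes the most frequent level "N"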
Show size and head of dataset:
dim(ds)
## [1] 1000 24
head(ds)
## date location min_temp max_temp rainfall evaporation sunshine
## 1 2008-12-01 Albury 13.4 22.9 0.6 NA NA
## 2 2008-12-02 Albury 7.4 25.1 0.0 NA NA
## 3 2008-12-03 Albury 12.9 25.7 0.0 NA NA
## 4 2008-12-04 Albury 9.2 28.0 0.0 NA NA
## 5 2008-12-05 Albury 17.5 32.3 1.0 NA NA
## 6 2008-12-06 Albury 14.6 29.7 0.2 NA NA
## wind_gust_dir wind_gust_speed wind_dir_9am wind_dir_3pm wind_speed_9am
## 1 W 44 W WNW 20
## 2 WNW 44 NNW WSW 4
## 3 WSW 46 W WSW 19
## 4 NE 24 SE E 11
## 5 W 41 ENE NW 7
## 6 WNW 56 W W 19
## wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1 24 71 22 1007.7 1007.1
## 2 22 44 25 1010.6 1007.8
## 3 26 38 30 1007.6 1008.7
## 4 9 45 16 1017.6 1012.8
## 5 20 82 33 1010.8 1006.0
## 6 24 55 23 1009.2 1005.4
## cloud_9am cloud_3pm temp_9am temp_3pm rain_today risk_mm rain_tomorrow
## 1 8 7 16.9 21.8 No 0.0 No
## 2 8 7 17.2 24.3 No 0.0 No
## 3 8 2 21.0 23.2 No 0.0 No
## 4 8 7 18.1 26.5 No 1.0 No
## 5 7 8 17.8 29.7 No 0.2 No
## 6 8 7 20.6 28.9 No 0.0 No
The k-means algorithm is a traditional and widely used clustering algorithm.
The algorithm begins by specifying the number of clusters we are interested in - this is k. Each of the k clusters is identified by the vector of the average (i.e., the mean) values of each of the variables for the observations within that cluster. A random clustering is first constructed and the k means are calculated; then, using a distance measure, each observation is assigned to its nearest mean. The means are recalculated and the observations reassigned, and so on, until the means no longer change.
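To make the update step concrete, here is a minimal sketch of one k-means iteration on a small random matrix (the objects x, centres and cl are illustrative only and are not used in the analysis below):
set.seed(42)
x <- matrix(rnorm(20), ncol = 2) # 10 observations, 2 variables.
centres <- x[sample(nrow(x), 2), ] # Start from 2 randomly chosen observations.
# Assign each observation to its nearest centre (squared Euclidean distance) ...
cl <- apply(x, 1, function(p) which.min(colSums((t(centres) - p)^2)))
# ... then recompute each centre as the mean of the observations assigned to it.
centres <- apply(x, 2, function(col) tapply(col, cl, mean))
# kmeans() repeats these two steps until the assignments stop changing.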
A unit of distance means something different for differently measured variables. For example, a difference of one year in age seems like it should count for more than a $1 difference in income. A common approach is to rescale the data by subtracting the mean and dividing by the standard deviation, so that every variable has mean 0 and a unit of difference corresponds to one standard deviation.
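In R this standardisation is what scale() does by default; for example, for the minimum temperature column:
mt <- ds$min_temp
all.equal(as.vector(scale(mt)), (mt - mean(mt)) / sd(mt)) # TRUE: scale() centres and divides by the standard deviation.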
For selecting k we will use the average silhouette width. The technique provides a succinct graphical representation of how well each object lies within its cluster. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
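For a single value of k the average silhouette width can be computed directly with the cluster package (an extra dependency not loaded above; this is only an optional cross-check of what kmeansruns() automates):
library(cluster)
km <- kmeans(scale(ds[numi]), centers = 2)
sil <- silhouette(km$cluster, dist(scale(ds[numi])))
mean(sil[, "sil_width"]) # Average silhouette width for k = 2.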
nk <- 1:20
model <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
model$bestk
## [1] 2
So, in our case k = 2 is the optimal choice.
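kmeansruns() also stores the criterion value for every candidate k in model$crit (the entry for k = 1 is left at 0, as the silhouette is not defined for a single cluster), so the choice can be inspected visually:
plot(nk, model$crit, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")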
Cluster centres:
model$centers
## min_temp max_temp rainfall wind_gust_speed wind_speed_9am
## 1 -0.7343731 -0.7255936 -0.08635213 -0.4125834 -0.2999989
## 2 0.8381603 0.8281401 0.09855607 0.4708928 0.3423971
## wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1 -0.2778298 0.5982247 0.4736027 0.4891541 0.5051701
## 2 0.3170949 -0.6827704 -0.5405359 -0.5582850 -0.5765646
## cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.03173184 0.004475099 -0.7559283 -0.7058590
## 2 -0.03621642 -0.005107554 0.8627618 0.8056164
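As a quick sanity check (not part of the plotting workflow below), the cluster assignment can be cross-tabulated against the target variable:
table(cluster = model$cluster, rain_tomorrow = ds[[target]])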
Visualise the cluster profiles:
nclust <- 2
model <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers) # Long format: one row per cluster/variable pair.
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust)) # Keep the original variable order for the plot.
p <- ggplot(dscm,
aes(x=reorder(Variable, Order),
y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + ylim(-1, 1)
p <- p + ggtitle("Cluster profiles (variables were scaled)")
p