Link to project on GitHub
Link to project on RPubs
In this work we will demonstrate clustering of the weather dataset from the rattle package (Williams, 2014).
Loading necessary libraries:
library(rattle) # Weather dataset. Normalise names with normVarNames().
library(randomForest) # Impute missing using na.roughfix().
library(ggplot2) # Visualise the data through plots.
library(reshape2) # Reshape data for plotting.
library(fpc) # K-means with estimation of k (average silhouette width criterion).
Loading the dataset (we need only the first 1000 rows):
ds <- read.csv('weatherAUS.csv', nrows = 1000)
Normalise the variable names and identify the roles of the variables (target, risk, identifiers):
names(ds) <- normVarNames(names(ds))
vars <- names(ds)
target <- "rain_tomorrow"
risk <- "risk_mm"
id <- c("date", "location")
Ignore the IDs and the risk variable:
ignore <- union(id, if (exists("risk")) risk)
Ignore variables which are completely missing:
mvc <- sapply(ds[vars], function(x) sum(is.na(x))) # Missing value count.
mvn <- names(ds)[(which(mvc == nrow(ds)))] # Missing var names.
ignore <- union(ignore, mvn)
Remove the ignored variables from the list of variables:
vars <- setdiff(vars, ignore)
Identify the numeric input variables:
inputc <- setdiff(vars, target) # Candidate input variables.
inputi <- sapply(inputc, function(x) which(x == names(ds)), USE.NAMES=FALSE) # Column indices of the inputs.
numi <- intersect(inputi, which(sapply(ds, is.numeric))) # Indices of the numeric inputs.
Impute missing values:
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
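For reference, na.roughfix() from randomForest fills missing numeric values with the column median and missing factor values with the most frequent level. A tiny illustration on a toy data frame (the toy object is ours, not part of the dataset):
toy <- data.frame(temp = c(10, NA, 14), dir = factor(c("N", "N", NA)))
na.roughfix(toy) # temp: NA becomes the median 12; dir: NA becomes the most frequent level "N"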
Show size and head of dataset:
dim(ds)
## [1] 1000 24
head(ds)
## date location min_temp max_temp rainfall evaporation sunshine
## 1 2008-12-01 Albury 13.4 22.9 0.6 NA NA
## 2 2008-12-02 Albury 7.4 25.1 0.0 NA NA
## 3 2008-12-03 Albury 12.9 25.7 0.0 NA NA
## 4 2008-12-04 Albury 9.2 28.0 0.0 NA NA
## 5 2008-12-05 Albury 17.5 32.3 1.0 NA NA
## 6 2008-12-06 Albury 14.6 29.7 0.2 NA NA
## wind_gust_dir wind_gust_speed wind_dir_9am wind_dir_3pm wind_speed_9am
## 1 W 44 W WNW 20
## 2 WNW 44 NNW WSW 4
## 3 WSW 46 W WSW 19
## 4 NE 24 SE E 11
## 5 W 41 ENE NW 7
## 6 WNW 56 W W 19
## wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1 24 71 22 1007.7 1007.1
## 2 22 44 25 1010.6 1007.8
## 3 26 38 30 1007.6 1008.7
## 4 9 45 16 1017.6 1012.8
## 5 20 82 33 1010.8 1006.0
## 6 24 55 23 1009.2 1005.4
## cloud_9am cloud_3pm temp_9am temp_3pm rain_today risk_mm rain_tomorrow
## 1 8 7 16.9 21.8 No 0.0 No
## 2 8 7 17.2 24.3 No 0.0 No
## 3 8 2 21.0 23.2 No 0.0 No
## 4 8 7 18.1 26.5 No 1.0 No
## 5 7 8 17.8 29.7 No 0.2 No
## 6 8 7 20.6 28.9 No 0.0 No
The k-means algorithm is a traditional and widely used clustering algorithm.
The algorithm begins by specifying the number of clusters we are interested in - this is k. Each of the k clusters is identified by the vector of the average (i.e., the mean) values of each of the variables for the observations within that cluster. A random clustering is first constructed and the k means are calculated; then, using a distance measure, each observation is assigned to its nearest mean. The means are recalculated and the observations reassigned, and so on, until the means no longer change.
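To make the update step concrete, here is a minimal sketch of one k-means iteration on a small random matrix (the objects x, centres and cl are illustrative only and are not used in the analysis below):
set.seed(42)
x <- matrix(rnorm(20), ncol = 2) # 10 observations, 2 variables.
centres <- x[sample(nrow(x), 2), ] # Start from 2 randomly chosen observations.
# Assign each observation to its nearest centre (squared Euclidean distance) ...
cl <- apply(x, 1, function(p) which.min(colSums((t(centres) - p)^2)))
# ... then recompute each centre as the mean of the observations assigned to it.
centres <- apply(x, 2, function(col) tapply(col, cl, mean))
# kmeans() repeats these two steps until the assignments stop changing.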
A unit of distance means something different for differently measured variables. For example, a difference of one year in age seems like it should count for more than a $1 difference in income. A common approach is to rescale the data by subtracting the mean and dividing by the standard deviation, so that every variable has mean 0 and a unit of difference corresponds to one standard deviation.
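In R this standardisation is what scale() does by default; for example, for the minimum temperature column:
mt <- ds$min_temp
all.equal(as.vector(scale(mt)), (mt - mean(mt)) / sd(mt)) # TRUE: scale() centres and divides by the standard deviation.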
For selecting k we will use the average silhouette width. The technique provides a succinct graphical representation of how well each object lies within its cluster. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
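For a single value of k the average silhouette width can be computed directly with the cluster package (an extra dependency not loaded above; this is only an optional cross-check of what kmeansruns() automates):
library(cluster)
km <- kmeans(scale(ds[numi]), centers = 2)
sil <- silhouette(km$cluster, dist(scale(ds[numi])))
mean(sil[, "sil_width"]) # Average silhouette width for k = 2.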
nk <- 1:20
model <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
model$bestk
## [1] 2
So, in our case k = 2 is the optimal choice.
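kmeansruns() also stores the criterion value for every candidate k in model$crit (the entry for k = 1 is left at 0, as the silhouette is not defined for a single cluster), so the choice can be inspected visually:
plot(nk, model$crit, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")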
Cluster centres:
model$centers
## min_temp max_temp rainfall wind_gust_speed wind_speed_9am
## 1 -0.7343731 -0.7255936 -0.08635213 -0.4125834 -0.2999989
## 2 0.8381603 0.8281401 0.09855607 0.4708928 0.3423971
## wind_speed_3pm humidity_9am humidity_3pm pressure_9am pressure_3pm
## 1 -0.2778298 0.5982247 0.4736027 0.4891541 0.5051701
## 2 0.3170949 -0.6827704 -0.5405359 -0.5582850 -0.5765646
## cloud_9am cloud_3pm temp_9am temp_3pm
## 1 0.03173184 0.004475099 -0.7559283 -0.7058590
## 2 -0.03621642 -0.005107554 0.8627618 0.8056164
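As a quick sanity check (not part of the plotting workflow below), the cluster assignment can be cross-tabulated against the target variable:
table(cluster = model$cluster, rain_tomorrow = ds[[target]])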
Visualise the cluster profiles:
nclust <- 2
model <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers) # Long format: one row per cluster/variable pair.
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust)) # Keep the original variable order for the plot.
p <- ggplot(dscm,
aes(x=reorder(Variable, Order),
y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + ylim(-1, 1)
p <- p + ggtitle("Cluster profiles (variables were scaled)")
p