Before working through this tutorial you will first need to install the gd package manually; we hope to get it onto CRAN soon. A number of dependencies are installed first. The package is experimental at present, and it is highly likely there will be some bugs!
# Get dependencies
install.packages("car")
install.packages("gplots")
install.packages("ggplot2")
# Get gd
temp <- tempfile(fileext = ".zip")
download.file("http://dl.dropbox.com/u/881843/RPubsData/gd/gd.zip", temp)
unzip(temp)
install.packages("gd", repos = NULL, type = "source", depend = TRUE)
unlink(temp)
library(gd)
This tutorial takes you through the basic process of creating a geodemographic classification. In this example, a classification will be built at Lower Super Output Area (LSOA) level for the North West of England, using data including 2010 population estimates, 2001 Census ethnic groups, benefit claimants (August 2010), Council Tax bands (2010), state pension claims (2010) and education deprivation (IMD 2010).
The first step is to download and integrate some data into a single dataframe. The CSV files are all data at LSOA level and from various online / open data sources.
# Download csv files from dropbox
population <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv",
stringsAsFactors = FALSE)
ethnic_group <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/Census_2001_Ethnic.csv",
stringsAsFactors = FALSE)
benefits <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/benefits_Aug_2010.csv",
stringsAsFactors = FALSE)
council_tax <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/Ctax_2010.csv",
stringsAsFactors = FALSE)
pensions <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/dwp_2010_state_pension_claims.csv",
stringsAsFactors = FALSE)
education <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/IMD_education_2010.csv",
stringsAsFactors = FALSE)
# Create a single dataframe for the North West (some of the input data are
# national, others are NW only)
input_data <- merge(population, ethnic_group, by.x = "LSOA.Code", by.y = "LSOA_CODE",
all.y = TRUE)
input_data <- merge(input_data, benefits, by.x = "LSOA.Code", by.y = "LSOA",
all.x = TRUE)
input_data <- merge(input_data, council_tax, by.x = "LSOA.Code", by.y = "LSOA_CODE",
all.x = TRUE)
input_data <- merge(input_data, pensions, by.x = "LSOA.Code", by.y = "LSOA",
all.x = TRUE)
input_data <- merge(input_data, education, by.x = "LSOA.Code", by.y = "LSOA_CODE",
all.x = TRUE)
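# Optional check (not part of the original workflow): the merged data frame
# should have one row per North West LSOA, and the all.x joins may have
# introduced some missing values
dim(input_data)
sum(is.na(input_data))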
# Rename the first column to give a common area ID code
colnames(input_data)[1] <- "LSOA"
# Check for any variables that contain non-numeric values. Obviously the
# 'area code' variable will contain non-numerics
count <- 0
non_numerics <- c()
for (i in 1:ncol(input_data)) {
    if (is.character(input_data[, i])) {
        count <- count + 1
        print(colnames(input_data)[i])
        non_numerics[count] <- colnames(input_data)[i]
    }
}
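The same check can be written more compactly in vectorised form; a small equivalent of the loop above:
# Names of all character (non-numeric) columns in one line
non_numerics <- colnames(input_data)[sapply(input_data, is.character)]
non_numerics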
Outside of R, the next step is to create a specification text file that details the numerator and denominator relationships and the desired transformations. The example used in this section is shown below and is available to download from here. For details of the formatting of specification files, please see the prepare_geodemographic function help file.
The specification file contains the following lines, detailing the count-data numerators and denominators, the non-count variables, the transformations required for each variable, and the area identifier code.
count pop_0_15-pop_65_plus | totpop
count Cen_White-Cen_Chinese | Cen_All_people
count BC_18_24-BC_over_6months | totpop
count CTAX_BandA-CTAX_BandH | CTAX_Total
count St_Pen_M_65_69-St_Pen_F_80_plus | St_Pen_Total
noncount Edu_skills_train_score, Not_Enter_HE_rate
transform pop_0_15-pop_65_plus |IND,BC
transform Cen_White-Cen_Chinese |IND,BC
transform BC_18_24-BC_over_6months |IND,BC
transform CTAX_BandA-CTAX_BandH |IND,BC
transform St_Pen_M_65_69-St_Pen_F_80_plus |IND,BC
transform Edu_skills_train_score, Not_Enter_HE_rate |NONE,BC
zonecodes LSOA
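The file can be created in any text editor. Alternatively, the lines above can be written out from within R; a small sketch, assuming the file is to be called spec.txt (in this tutorial a hosted copy of the same file is used in the next step, so this is optional):
# Write the specification shown above to 'spec.txt' in the working directory
spec <- c("count pop_0_15-pop_65_plus | totpop",
    "count Cen_White-Cen_Chinese | Cen_All_people",
    "count BC_18_24-BC_over_6months | totpop",
    "count CTAX_BandA-CTAX_BandH | CTAX_Total",
    "count St_Pen_M_65_69-St_Pen_F_80_plus | St_Pen_Total",
    "noncount Edu_skills_train_score, Not_Enter_HE_rate",
    "transform pop_0_15-pop_65_plus |IND,BC",
    "transform Cen_White-Cen_Chinese |IND,BC",
    "transform BC_18_24-BC_over_6months |IND,BC",
    "transform CTAX_BandA-CTAX_BandH |IND,BC",
    "transform St_Pen_M_65_69-St_Pen_F_80_plus |IND,BC",
    "transform Edu_skills_train_score, Not_Enter_HE_rate |NONE,BC",
    "zonecodes LSOA")
writeLines(spec, "spec.txt")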
Once the specification file is saved in the working directory (or hosted online), the next step is to use the input_geodemographic function to attach the attribute information to the corresponding columns. The zone information is attached to the resulting data set as a row-names attribute. The following code calls the input_geodemographic function, which outputs a data frame (data_with_attributes) that is ready to be transformed into input data. Any variables not included in the specification file will be printed to the terminal. The area ID variable is not returned in the final data set, as the zone codes have been attached to the data frame as row names.
data_with_attributes <- input_geodemographic(input_data, "http://dl.dropbox.com/u/881843/RPubsData/gd/spec.txt")
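A quick check confirms that the zone codes are now carried as row names and shows how many variables remain:
# The LSOA codes should appear as row names rather than as a column
head(rownames(data_with_attributes))
dim(data_with_attributes)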
The prepare_geodemographic function implements the relevant index score/percentage calculations and transformations specified in the specification file. At this stage, it is often useful to examine the correlation between variables and potentially remove those which are very similar. To help make these comparisons, the 'compare' parameter can be set to either “correlation_matrix” or “heatmap”. The heatmap is plotted in the plot window and the correlation matrix is printed to the console.
# prepare the data for clustering with the compare options turned off.
prepared <- prepare_geodemographic(data_with_attributes)
# prepare the data for clustering with the heatmap option turned on.
prepared_2 <- prepare_geodemographic(data_with_attributes, compare = "heatmap")
The other comparison option is to create a correlation matrix which is printed to the console.
prepared_3 <- prepare_geodemographic(data_with_attributes, compare = "correlation_matrix")
As an example, we can see that CTAX_BandD, CTAX_BandE and CTAX_BandF are strongly correlated, and as such we may decide to remove two of these variables from the clustering stage. This can be done by supplying the variable names as strings to the 'remove' parameter of the prepare_geodemographic function. Another approach might be to collapse these variables into a new variable prior to the data input stage, which would also require changes to the specification file.
prepared_4 <- prepare_geodemographic(data_with_attributes, remove = c("CTAX_BandD",
"CTAX_BandE"))
Now that the specified preparations have been done, we can move onto clustering the Type level of the classification.
The finest “Type” level in a geodemographic classification is often created using k-means. However, the results of k-means depend on the random starting centres, so multiple runs are required to find the “best” result. The cluster_types function runs k-means for user-specified values of k (the number of clusters) and a re-run frequency. For each value of k, an optimal result is identified by comparing the total within-cluster sum of squares across runs and selecting the run with the lowest value. If parameters are not specified, defaults of k = 60 and 100 re-runs are used. Values of k can be specified individually or as a vector.
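For intuition, the selection that cluster_types performs is essentially a “best of n” k-means: run the algorithm several times and keep the run with the lowest total within-cluster sum of squares. A minimal sketch of that idea using stats::kmeans (illustrative only, not the package's internal code):
# Illustration only: keep the best of several k-means runs for a single k
best_of_runs <- function(data, k, runs = 10) {
    fits <- lapply(seq_len(runs), function(i) kmeans(data, centers = k))
    fits[[which.min(sapply(fits, function(f) f$tot.withinss))]]
}
# e.g. best_of_runs(prepared_4, 47) would return the best of 10 runs at k = 47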
# k=60 with 2 runs
default_clusters_mr <- cluster_types(prepared_4, 60, 2)
# k=45, 46 and 47; each of which is run 10 times
user_clusters <- cluster_types(prepared_4, c(45, 46, 47), 10)
For each value of k, the total within-cluster sum of squares for each run of k-means is appended to a list, which can be accessed with the following code.
TWS <- user_clusters[[length(user_clusters)]]
These can also be easily graphed.
# Prepare the data to graph
t1 <- aggregate(TWSS ~ K, data = TWS, median)
t2 <- aggregate(TWSS ~ K, data = TWS, min)[2]
t3 <- aggregate(TWSS ~ K, data = TWS, max)[2]
t4 <- cbind(t1, t2, t3)
colnames(t4) <- c("K", "median", "min", "max")
t4$group <- "results"
# Plot the graph (ggplot2 should already be attached as a dependency of gd;
# load it explicitly if not)
library(ggplot2)
ggplot(t4, aes(x = as.character(K), y = median, group = group)) +
    geom_errorbar(aes(ymin = min, ymax = max), width = 0.1) +
    geom_line() + geom_point() +
    xlab("Values of K") + ylab("Within Sum of Squares")
For the rest of this tutorial, we will use the optimized results for k = 47.
At this stage, it is usually a good idea to examine the characteristics of the Type level clusters. The describe_types function creates a visual representation, printing the count variables as a heatmap and the non-count variables as bar charts. In the heatmap, the blue squares and green squares stand for high and low index scores respectively, with the following criteria for each colour:
Index Score < 50 = dark green
50 < Index Score < 80 = light green
80 < Index Score < 120 = grey
120 < Index Score < 150 = light blue
150 < Index Score = steel blue.
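These bandings amount to simple threshold cuts on the index scores; a minimal sketch with cut(), using a few hypothetical score values:
# Illustration only: banding hypothetical index scores into the colours above
index_scores <- c(35, 70, 100, 130, 180)
cut(index_scores, breaks = c(-Inf, 50, 80, 120, 150, Inf),
    labels = c("dark green", "light green", "grey", "light blue", "steel blue"))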
K_47 <- user_clusters[[3]]
cl <- K_47$cluster
out_desc <- describe_types(cl, data_with_attributes, bar = FALSE, heatmap = TRUE)
An alternative representation of the Type level results can be generated using the grand_index_generator function. This calculates index scores for each of the variables in the results of the cluster analysis.
# The cluster_types function assigns a cluster vector to the global workspace;
# in this example the k=47 result (Clusters_for_K_47) is used
grand_index1 <- grand_index_generator(Clusters_for_K_47, data_with_attributes)
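Index scores conventionally express a cluster's mean for a variable relative to the overall mean, scaled to 100 (so a score of 200 indicates twice the regional average). A sketch of that calculation for a single variable, which is illustrative rather than the package's internal code:
# Illustration only: index score = 100 * cluster mean / overall mean
v <- data_with_attributes[, 1]  # first variable, purely as an example
cluster_means <- tapply(v, Clusters_for_K_47, mean)
round(100 * cluster_means / mean(v))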
The size of each cluster is also printed to the R console by the cluster_types function. If clusters are considered too big or too small, it may be necessary to split or merge them using split_types or merge_types. This is done fairly arbitrarily in the following code.
K_47 <- user_clusters[[3]]
# Find the two biggest clusters to split:
pops <- K_47$size
maximum <- max(pops)
max_index <- match(maximum, pops)
temp_max <- pops[-max_index]
max_index2 <- match(max(temp_max), pops)
# Find the two smallest clusters to merge:
minimum <- min(pops)
min_index <- match(minimum, pops)
temp_min <- pops[-min_index]
min_index2 <- match(min(temp_min), pops)
# Both merge_types and split_types return data frames that are ready for
# clustering at 'Groups' level.
merged <- merge_types(Clusters_for_K_47, c(min_index, min_index2))
merged_and_split_clusters <- split_types(data_with_attributes, merged, c(max_index,
max_index2), c(2, 3))
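As an aside, the same indices can be found more compactly with order(); a small equivalent sketch:
# The two smallest and two largest clusters by size, via order()
size_order <- order(K_47$size)
smallest_two <- size_order[1:2]
largest_two <- rev(size_order)[1:2]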
Once you are happy with the Type level clustering results, Groups can be compiled using the hierarchical clustering function cluster_groups. This function requires a data source and the Type level clusters as parameters, plus a 'cut off' point that specifies the number of Groups the Types should be aggregated into.
Once a Group level result is created, the cluster_lookup function can be used to generate a new data frame that gives the classification a hierarchical coding scheme, where a letter represents each Group and a number each Type.
new_groups <- cluster_groups(Clusters_for_K_47, data_with_attributes, 10)
lookup <- cluster_lookup(new_groups)
Bar graphs can be printed using the visualise_classification function. In these bar charts each bar represents a Type and each colour represents a Group.
# visualise the results of the classification
analysis <- visualise_classification(lookup, data_with_attributes, variables = colnames(prepared_4)[1:4])
The package currently provides an evaluation function that calculates Gini coefficients and Lorenz curves for a set of variables. This can evaluate either the Group or Type level clusters, which must be specified as a string in the groups_or_types parameter. The variables to be examined are supplied as the final parameter and are plotted on the resulting bar chart and Lorenz curve.
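For background, the Gini coefficient summarises how unevenly a variable is concentrated across the clusters: it is twice the area between the Lorenz curve and the line of perfect equality, giving a value between 0 (evenly spread) and 1 (concentrated in one cluster). A minimal sketch of that calculation for a single variable, treating each cluster equally; illustrative only, not the package's implementation:
# Illustration only: Gini coefficient of one variable's cluster totals
cluster_totals <- tapply(data_with_attributes[, 1], Clusters_for_K_47, sum)
shares <- sort(cluster_totals) / sum(cluster_totals)
cum_share <- cumsum(shares)
n <- length(shares)
# Trapezoidal area under the Lorenz curve, then Gini = 1 - 2 * area
lorenz_area <- sum((cum_share + c(0, head(cum_share, -1))) / (2 * n))
1 - 2 * lorenz_area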
# evaluation for 'types' and the first ten variables
evaluation <- evaluate_classification(data_with_attributes, lookup, "types",
c(colnames(prepared_4)[1:10]))
# evaluation for 'groups' and the next five variables
evaluation <- evaluate_classification(data_with_attributes, lookup, "groups",
c(colnames(prepared_4)[11:15]))
# Plot graphs
plot(evaluation$lorenz)
plot(evaluation$gini)