1 Introduction

There are 8000 general practices in England and data for each practice is increasingly available in the public domain. There are however few tools which enable analysts and practices to compare practices with similar peers to help evaluate their performance. A previous tool developed by APHO seemingly considerably misclassified practices, and it was agreed by the National Practice Benchmarking and Indicators group that the issue of finding peer groups should be revisited.

This short note is a pilot evaluation of a simple method based on k-means analysis analysis of an extract of the national general practice profiles (NGPP). The NGPP contain about 250 publicly available indicators of the health, utilisation, demography and characteristics of general practice populations. For this analysis and to keep things simple we use demographic variables - total population, ethnicity, age break down and deprivation to apply unsupervised machine learning to group practices.

K-means analysis is a relatively simple and widely algorithm for unsupervised machine learning. It clusters data on the basis of similarity among all predictor variables into a pre-specified number of groups. The algorithm optimises clusters which are more similar to each other on the variables of interest, and more dissimilar to other clusters.

2 Method

The data for this analysis was extracted from the spreadsheets downloaded from the NGPP website and included the following variables for each practice:

  • Practice code
  • CCG
  • Total registered population (2013)
  • IMD 2010 practice score (% of population with some form of deprivation)
  • % population <5
  • % population 5-14
  • % population 65+
  • % population 75+
  • % population 85+
  • % population under 18
  • % eth - proportion of the population in white ethnic group

The analysis has been conducted in R and we have written this report in R Markdown using the R Studio package to allow us to embed relevant code which can then be shared and easily modified if we wish to change the variables or analysis. The core dataset can be regenerated from the Fingertips API.

The source files, code books and other analytical inputs are available here.

2.1 Import the data and create dataset for analysis

The first step is to import and clean the data.

The dataset contains 7544 practices because it excludes small practices and those with discrepancies between QOF reported practice size and the registered population.

2.2 The dataset

We can summarise the dataset:

options(digits = 2)
data %>%
  group_by(indicator) %>%
  summarise(practice = n(),
    missing = mean(is.na(value)),
            mean = mean(value),
            sd = sd(value),
            median = median(value),
            min = min(value), 
            max = max(value)
            ) %>%
  knitr::kable(caption = "Summary statistics", format = "pandoc" )
Summary statistics
indicator practice missing mean sd median min max
% GP registered population aged 5-14 years 7582 0 11.62 2.6 11.39 0.0 30
% GP registered population aged 75+ years 7582 0 7.65 3.4 7.66 0.0 79
% GP registered population aged 85+ years 7582 0 2.22 1.3 2.17 0.0 49
% GP registered population aged under 18 years 7582 0 20.85 4.2 20.35 0.0 54
% GP registered population aged under 5 years 7582 0 5.89 1.6 5.69 0.0 21
% population aged 65+ years 7582 0 17.01 6.8 17.24 0.0 92
%eth 7750 0 0.83 0.2 0.93 0.1 1
Index of multiple deprivation score (IMD 2015) 7586 0 23.74 11.9 21.88 3.2 66
pop 7750 0 7336.64 4421.0 6514.50 294.0 54848

Exploring the relationship between the variables suggests that deprivation has complex relationship with age structure. The under 18 variable is strongly correlated with %0-4 and %5-14 so we could exclude this. Similarly we could use 75+ and drop 65+ and 85+ variables.

There also seems to be an outlying practice with high proportions of older people. This practice is Y02625. Looking at the characteristics of this practice shows it is small with an exclusively older population suggesting it is probably a nursing home practice. We’ll exclude it from the analysis.

2.2.1 Missing data

There is no missing data but the dataset does not include all practices.

2.3 Boxplots

We can look practice variation for each variable.

data  %>%
  ggplot(aes(indicator, log(value))) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Boxplot of practice variation", 
       caption = "Note log scale",
       x = "")
## Warning: Removed 29 rows containing non-finite values (stat_boxplot).

2.4 Cluster analysis

We’ll start with all the variables and perform hierarchical cluster analysis produces the dendrogram below. Each branch is a general practice. The dendorgram partitions the data according to similarity - the further down the tree the more similar the practice are. It is hard to see all the detail (the chart shows all 8000 practices) but it suggests there are number of practice groupings. We can use this as a basis for assigning clusters, depending on how fine grained we want them to be. Note that we have scaled the data (z-scores) because clustering is sensitive to absolute values.

2.5 Running k-means analysis with 10 clusters

The k in k means to be specified and we’ll arbitrarily start with 10. It is important to set a seed - because there is a random start point, this needs to be fixed to be able to reproduce the analysis.

set.seed(2) ## this is needed for reproduciblity because there is an element of random sampling

k <-10 ## number of clusters

km <- kmeans(scale(data1[,-(1:2)]), k, nstart = 12, iter.max = 25) ## 10 clusters also repeat 25 times with 12 starting points

2.5.1 Exploring the results

km$size 
##  [1]  737  941  629  329 1064 1135  364  644 1138  562
library(broom)
tidy(km)
##        x1     x2     x3    x4    x5     x6    x7    x8    x9 size withinss
## 1   0.146 -0.993 -0.959  0.22  0.39 -1.083 -1.86  0.53 -0.31  737     2544
## 2   0.052 -0.112 -0.184  0.12  0.17 -0.094  0.30  1.09 -0.35  941     2185
## 3   0.097 -0.049 -0.006  0.13  0.19 -0.050  0.28 -0.32  1.84  629     2014
## 4  -2.022 -1.304 -1.169 -1.69 -1.16 -1.344 -0.69  0.23  0.18  329     2049
## 5   0.238 -0.242 -0.221  0.20  0.17 -0.170  0.30 -0.66 -0.29 1064     2213
## 6  -0.240  0.741  0.743 -0.35 -0.48  0.728  0.56 -0.78  0.74 1135     1846
## 7   2.408 -1.393 -1.272  2.50  2.03 -1.529 -2.09  1.46 -0.41  364     1802
## 8   1.144 -0.997 -0.910  1.32  1.42 -0.986 -0.11  0.88 -0.39  644     2334
## 9  -0.553  0.672  0.527 -0.66 -0.71  0.769  0.55 -0.38 -0.63 1138     2104
## 10 -0.919  1.954  1.973 -1.11 -1.17  1.833  0.68 -0.66 -0.12  562     1742
##    cluster
## 1        1
## 2        2
## 3        3
## 4        4
## 5        5
## 6        6
## 7        7
## 8        8
## 9        9
## 10      10

For 10 clusters, the mean values of each variable is shown in the table. The size of clusters varies between 329 and 1138 practices.

We can now attach the clusters to the original data and plot.

clus <- augment(km, data1)


clus %>%
  tidyr::gather(indicator, value, 3:11) %>%
  ggplot(aes(.cluster, value, fill = factor(.cluster))) +
  geom_boxplot() +
  coord_flip()+
  facet_wrap(~indicator, scale = "free", nrow = 3)+
  theme(legend.position = "") +
  labs(title = "Distribution of demographic variables by cluster", x = "")

Alternatively we can compare the profile of the clusters with radial plots. It helps to scale and centre the data so each cluster is compared to the average.

options(scipen = 2)

clusScale <- scale(clus[, -c(1:2, 12)])

clus1 <- data.frame(cbind(clusScale, clus$.cluster)) %>%
  rename(cluster = V10)

clus1 %>%
  gather(indicator, value,1:9) %>%
  group_by(cluster, indicator) %>%
  summarise(meanval = mean(value)) %>%
  ggplot(aes(indicator, meanval, colour = factor(cluster))) +
  geom_point() +
  geom_line(aes(group = cluster))+
  geom_hline(yintercept = 0) +
  coord_polar()+
  facet_wrap(~cluster) +
  labs(caption = "Black line = mean value") +
  theme(legend.position = "bottom")

This visualisation aids qualitative description. For example:

  • Cluster 1 practices tend to be younger and more deprived than average. They are ethnically diverse (low proportion of population of white ethnicity)
  • Cluster 2 practices are deprived but average in other respects
  • Cluster 3 practices are similar to cluster 2 but larger
  • Cluster 4 practices have a highly atypical age structure
  • Cluster 5 practices are similar to cluster 2, but less deprived than average
  • Cluster 6 practices are older, less deprived, less ethnically diverse and larger than average
  • Cluster 7 practices much younger, more ethnically diverse, deprived and smaller than average
  • Cluster 8 practices are similar to cluster 7 but less ethnically diverse
  • Cluster 9 practices are similar to cluster 6 but much smaller
  • Cluster 10 practices are much older than average, less deprived and less ethnically diverse

2.5.2 Download the data and cluster assignment for 10 clusters

For 10 clusters, the input data and cluster assignments can be explored further and downloaded below.

options(digits = 2)
DT::datatable(
  clus, extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    extensions = 'ColReorder', options = list(colReorder = TRUE)
  )
)

2.6 Running k-means analysis with 15 clusters

We can repeat the analysis with 15 (or any number of clusters)

set.seed(2) ## this is needed for reproduciblity because there is an element of random sampling

k <-15 ## number of clusters

km15 <- kmeans(scale(data1[,-(1:2)]), k, nstart = 12, iter.max = 25) ## 10 clusters also repeat 25 times with 12 starting points

2.6.1 Exploring the results

tidy(km15) %>%
  select(cluster, size) %>%
  arrange(-size)
##    cluster size
## 1        8  880
## 2       12  871
## 3       11  730
## 4        7  706
## 5        6  682
## 6       14  662
## 7        5  603
## 8        2  535
## 9       15  451
## 10       1  390
## 11      10  315
## 12      13  252
## 13       4  187
## 14       9  152
## 15       3  127

For 15 clusters, the mean values of each variable is shown in the table. The size of clusters varies between 127 and 880 practices.

2.6.2 Cluster plot with 15 clusters

options(scipen = 2)
clus15 <- augment(km15, data1)
clusScale1 <- scale(clus15[, -c(1:2, 12)])

clus1 <- data.frame(cbind(clusScale1, clus15$.cluster)) %>%
  rename(cluster = V10)

clus1 %>%
  gather(indicator, value,1:9) %>%
  group_by(cluster, indicator) %>%
  summarise(meanval = mean(value)) %>%
  ggplot(aes(indicator, meanval, colour = factor(cluster))) +
  geom_point() +
  geom_line(aes(group = cluster))+
  geom_hline(yintercept = 0) +
  coord_polar()+
  facet_wrap(~cluster, ncol = 5) +
  labs(caption = "Black line = mean value") +
  theme(legend.position = "bottom")

With 15 clusters the distinction between them some is much more nuanced and less clear than with 10 groups.

The largest practice cluster #8 has average values on most variables. Cluster #3, the smallest, are larger practices with lower proportions of younger and older people.

2.6.3 Download and explore 15 clusters

DT::datatable(
  clus15, extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    extensions = 'ColReorder', options = list(colReorder = TRUE)
  )
)

2.7 Comparison with other sources

We can use other sources such as practice profiles to provide some external validation. If we take cluster #3 for example, we can take a sample of practices assigned to cluster and review in the practice profiles.

clus15 %>%
  filter(area == "D81037" | area == "C84117" )
##     area                                     ccg percent_5_14 percent_75
## 1 C84117                 NHS Nottingham City CCG          1.8       0.82
## 2 D81037 NHS Cambridgeshire and Peterborough CCG          3.9       2.52
##   percent_85 percent_18 percent_5 percent_65 percent_eth imd2015 prac_pop
## 1       0.23         14       1.3        1.7        0.63      38    16277
## 2       0.63         12       2.0        6.4        0.81      11     9776
##   .cluster
## 1        3
## 2        3

This shows that these practices have an ‘atypical’ population profile which over rides deprivation scores. It may be advisable to subdivide some clusters by deprivation if this is important to take into account.

C84117 - cluster #3

C84117 - cluster #3

D81037 - cluster #3

D81037 - cluster #3

2.8 Distribution of clusters within CCGs

Further insights can be obtained by looking at the distribution of practice clusters within CCGs.

options(digits = 3)

clus15 %>%
  crosstab(ccg, .cluster, percent = "row") -> xtabccg

DT::datatable(xtabccg) %>%
  formatPercentage(2:12, 2)

Cursory analysis suggests that the the majority of practices in London are in clusters #2 and #3. Cluster #4 practices tend to be in coastal CCGs. Cluster #5 practices tend to be in more Northern, urban and deprived CCGs.

3 Conclusions

K-means analysis provides a data-driven way of clustering general practices into similar groups on the basis of simple demographic variables. Using the dataset and method outlined in this note practices will be consistently assigned. On the face of they seem to be produce separable groups of practices with age distribution and practice size being the largest drivers of the clustering.

Note that clusters create groups on the combination of demographic variables so there will be a range of deprivation scores within each cluster, but on average some clusters will be more deprived than others.

There are methods for determining the optimum number of clusters statistically - applying these would produce fewer clusters, but the purpose here is to define a set of groups of ‘similar’ practices.