There are 8000 general practices in England, and data for each practice are increasingly available in the public domain. There are, however, few tools that enable analysts and practices to compare practices with similar characteristics, or peer groups. A previous tool developed by APHO appears to have considerably misclassified practices, and the National Practice Benchmarking and Indicators group agreed that the issue of finding peer groups should be revisited.

This short note is a pilot evaluation of a simple method based on k-means analysis of an extract of the national general practice profiles (NGPP). The NGPP contain about 250 publicly available indicators of the health and wellbeing, care utilisation, demography and characteristics of general practice populations. For this analysis we have focussed on key demographic variables.

1 Method

1.1 Data

The data for this analysis was extracted from the spreadsheets downloaded from the NGPP website and included the following variables for each practice:

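  • practice: practice code
  • ccg: CCG
  • pop: total registered population (2014)
  • imd: IMD 2015 practice score (% of population with some form of deprivation), calculated by Dept Primary Care, Kings
  • %0-4: % population aged under 5
  • %5-14: % population aged 5-14
  • %65+: % population aged 65 and over
  • %75+: % population aged 75 and over
  • %85+: % population aged 85 and over
  • %<18: % population aged under 18
  • eth: % population of non-white ethnicity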

1.2 Machine learning (ML)

ML refers to the ability of computers to learn without explicit programming. In a data context this generally means being able to detect patterns in the data, or make predictions from existing data about new, ‘unseen’ data.

ML algorithms are often subdivided into:

  • Supervised - we build statistical models on datasets where we know the answer (i.e. the outcome or classification) - this is the training element - and these are applied to unseen or test data for evaluation. The statistical models used for supervised ML are either regression models (where the outcome is a continuous variable) or classification models (where the outcome is categorical).

  • Unsupervised - there is no outcome or known category, and the algorithms try to draw out patterns based on the data alone. Cluster analysis is usually categorised as unsupervised ML.

In this analysis we will do two things:

  1. Attempt to cluster general practices using a commonly used clustering algorithm known as k-means.
  2. Use this classification to build a model which will allow us to assign practices missing from this analysis, or new practices (or indeed groups of practices), to the identified clusters.

K-means analysis is a form of unsupervised machine learning designed to cluster data into groups whose members are more similar to each other than to members of other groups; it is entirely data driven. The value of k must be specified in advance, although there are methods for determining the optimum number of clusters. Based on previous iterations of this work and discussions with others we will use 15 clusters, but it is possible to rerun the analysis with other values of k.
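One widely used heuristic for choosing k is the "elbow" method: run k-means for a range of k and look for the point beyond which the total within-cluster sum of squares stops falling sharply. The sketch below is illustrative only. It uses synthetic data rather than the practice dataset, the names `x`, `ks` and `wss` are assumptions, and it is not the basis for the choice of 15 clusters in this note.

```r
## Synthetic stand-in for the scaled practice variables: three well separated groups
set.seed(1)
x <- scale(rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
))

## Total within-cluster sum of squares for k = 1 to 8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

## The "elbow" is the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

With real data the elbow is often less clear cut than in a toy example, so the plot is best read alongside subject knowledge about how fine grained the peer groups need to be.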

The analysis was conducted in RStudio and this report was written in R Markdown, which lets us embed the relevant code so that it can be shared and easily modified if we wish to change the variables or the analysis.

It updates the previous analysis with the latest data for populations and IMD scores, and adds ethnicity (% population non-white).

1.2.1 The dataset

##   practice                                     ccg  %0-4  %5-14   %<18
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2   A81008                      NHS South Tees CCG 5.993 11.008 20.538
## 3   A81009                      NHS South Tees CCG 6.659 12.142 22.584
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5   A81012                      NHS South Tees CCG 7.317 12.048 22.935
## 6   A81013                      NHS South Tees CCG 6.335 12.292 22.300
##     %65+  %75+  %85+    imd   pop    eth1
## 1 18.593 9.260 2.709 31.740  9525  2.5665
## 2 16.585 7.730 2.055 51.915  4088  4.3742
## 3 14.690 6.724 1.824 35.465  9265 12.9636
## 4 18.626 8.507 1.923 34.840 11285  2.3051
## 5 14.340 7.338 1.787 49.744  4756  8.9801
## 6 17.788 6.994 1.662 29.891  6077  0.8455

The dataset only contains 7750 practices because it excludes small practices and those with discrepancies between QOF reported practice size and the registered population.

1.2.2 Data summaries

We can summarise the dataset and see that there is considerable variation between practices. For example, the percentage of under 5s varies between 0 and 17.3%, the percentage of over 75s between 0 and 79.6%, and the percentage with white ethnicity (100 minus eth1, the non-white percentage) between about 9.5% and 99.6%.

Statistic    %0-4   %5-14    %<18    %65+    %75+    %85+     imd    pop     eth1
Min.        0.000    0.00    0.00    0.00   0.000   0.000   3.213    294   0.4367
1st Qu.     4.888   10.11   18.43   12.23   5.366   1.396  13.987   3956   2.7586
Median      5.790   11.28   20.40   17.15   7.670   2.131  21.927   6514   7.4807
Mean        5.985   11.49   20.89   16.85   7.623   2.183  23.724   7337  16.8606
3rd Qu.     6.869   12.66   22.82   21.20   9.671   2.822  31.731   9811  24.1855
Max.       17.268   30.32   53.51   92.51  79.648  48.196  66.479  54848  90.4988

Exploring the relationship between the variables suggests that deprivation has a complex relationship with age structure. The under 18 variable is strongly correlated with %0-4 and %5-14 so to simplify the analysis we could exclude this. Similarly we could use 75+ and drop 65+ and 85+.
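A correlation matrix is one simple way to check which variables carry overlapping information before deciding what to drop. The sketch below uses synthetic data; the data frame `kgp_demo`, its column names and the 0.7 threshold are all assumptions made for illustration, not values taken from this analysis.

```r
## Synthetic stand-in for the age variables (not the real practice data)
set.seed(1)
n <- 500
kgp_demo <- data.frame(p0_4 = runif(n, 3, 9))
kgp_demo$p5_14 <- 2 + 1.5 * kgp_demo$p0_4 + rnorm(n)
kgp_demo$p_u18 <- kgp_demo$p0_4 + kgp_demo$p5_14 + runif(n, 3, 5)
kgp_demo$p75   <- runif(n, 3, 12)

## Pairwise Pearson correlations
cors <- cor(kgp_demo)
round(cors, 2)

## List each pair once (upper triangle) where |r| exceeds an arbitrary threshold
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           r    = round(cors[high], 2))
```

Pairs flagged this way are candidates for dropping one member, which is the reasoning behind excluding the under 18 variable above.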

There also seems to be an outlying practice with high proportions of older people. This practice is Y02625. Looking at the characteristics of this practice shows it is small with an exclusively older population, suggesting it is probably a nursing home practice. We’ll keep it in for the analysis.

practice ccg %0-4 %5-14 %<18 %65+ %75+ %85+ imd pop eth1
7615 Y02625 NHS Salford CCG 0 0 0 92.507 79.648 48.196 35.006 1081 11.5939

2 Cluster analysis

A simple hierarchical cluster analysis produces the dendrogram below. Each branch is a general practice. The dendrogram partitions the data according to similarity based on Euclidean distance: the further down the tree, the more similar practices are. It is hard to see all the detail (the chart shows all 7750 practices) but it picks out the outlier (the single branch at the top left of the chart) and suggests there are a number of practice groupings. We can use this as a basis for assigning clusters, depending on how fine grained we want them to be. Note that we have scaled the data (z-scores) because clustering is sensitive to absolute values.
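The code behind the dendrogram is not shown in this note. A minimal sketch of the steps described (z-score scaling, Euclidean distance, hierarchical clustering, then cutting the tree) might look like the following. The data are synthetic, the names `demo`, `z`, `hc` and `groups` are illustrative, and the use of `hclust()` with its default complete linkage is an assumption, since the linkage method used for the chart is not stated.

```r
## Synthetic data: two broad groups plus one extreme "nursing home" style outlier
set.seed(1)
demo <- data.frame(
  young = c(rnorm(30, 6, 1),      rnorm(30, 4, 1),       0),
  old   = c(rnorm(30, 7, 1),      rnorm(30, 12, 1),      80),
  pop   = c(rnorm(30, 5000, 800), rnorm(30, 12000, 800), 1000)
)

z  <- scale(demo)                    # z-scores, so no variable dominates through its units
d  <- dist(z, method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d)                      # default linkage is "complete" (assumed here)

plot(hc, labels = FALSE)             # dendrogram
groups <- cutree(hc, k = 3)          # cut the tree into 3 groups
table(groups)
```

Cutting the tree higher or lower with `cutree()` is how the granularity of the groups is controlled.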


2.1 K means analysis

For context the average of each variable is:

require(knitr); require(dplyr)
## practice and ccg are not numeric, so their means are reported as NA
kable(kgp %>% summarise(across(everything(), mean)))
practice ccg %0-4 %5-14 %<18 %65+ %75+ %85+ imd pop eth1
NA NA 5.985418 11.49312 20.89179 16.84751 7.623197 2.183238 23.72421 7336.641 16.86065

2.1.1 Running the k means analysis

With 15 clusters

set.seed(1) ## this is needed because there is an element of random sampling
k <- kmeans(scale(kgp[, -(1:2)]), 15, nstart = 25)

2.2 Summary of results

require(knitr); require(tidyr)
kable(k$size) ## how many practices in each cluster
819
710
391
326
903
159
188
632
711
664
681
328
911
326
1
kgp$cluster <- k$cluster

head(kgp)
##   practice                                     ccg  %0-4  %5-14   %<18
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2   A81008                      NHS South Tees CCG 5.993 11.008 20.538
## 3   A81009                      NHS South Tees CCG 6.659 12.142 22.584
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5   A81012                      NHS South Tees CCG 7.317 12.048 22.935
## 6   A81013                      NHS South Tees CCG 6.335 12.292 22.300
##     %65+  %75+  %85+    imd   pop    eth1 cluster
## 1 18.593 9.260 2.709 31.740  9525  2.5665       5
## 2 16.585 7.730 2.055 51.915  4088  4.3742       8
## 3 14.690 6.724 1.824 35.465  9265 12.9636       8
## 4 18.626 8.507 1.923 34.840 11285  2.3051       5
## 5 14.340 7.338 1.787 49.744  4756  8.9801       8
## 6 17.788 6.994 1.662 29.891  6077  0.8455       8
## the average values for each cluster

agg <- kgp[, -(1:2)] %>% group_by(cluster) %>% summarise(across(everything(), mean))
kable(agg)
cluster %0-4 %5-14 %<18 %65+ %75+ %85+ imd pop eth1
1 5.458320 10.355247 19.10164 18.758365 8.504031 2.3674371 30.51271 5094.231 9.142872
2 4.226663 9.580172 16.94906 25.442985 11.581613 3.3530127 16.81817 4560.131 4.329802
3 9.166604 14.807522 27.60716 8.740724 3.657634 0.9811688 34.73493 4645.263 20.643653
4 4.087423 6.202715 13.65165 7.807482 3.358199 0.9053436 25.80130 8152.285 30.741583
5 6.009210 11.370726 20.72232 17.750556 8.051447 2.3480321 18.76671 11922.000 9.041510
6 5.730767 11.338849 20.60306 18.195553 8.220635 2.4128491 16.18125 21443.566 6.932090
7 3.778218 8.462202 15.07149 32.491319 16.223649 5.3119787 16.45609 6975.750 3.279205
8 6.844609 12.579426 23.12021 14.439660 6.404903 1.7113307 41.03463 5484.375 12.915971
9 6.639204 12.467626 22.50230 14.207638 6.086474 1.7180872 17.53106 5240.120 12.046672
10 6.599983 11.903497 21.80131 9.710417 4.431166 1.1201069 30.50566 5376.386 55.287250
11 4.808363 10.487881 18.61249 23.836206 11.129703 3.3470000 14.42905 11161.984 4.392428
12 7.598180 13.261655 24.33471 11.382899 5.147564 1.4655305 24.90804 12358.857 30.064052
13 5.191360 11.273181 19.87465 20.014585 8.843134 2.5422569 13.16405 5826.271 6.406844
14 9.283905 17.898613 31.64721 6.688423 3.123896 0.7610890 41.86686 5408.046 61.325038
15 0.000000 0.000000 0.00000 92.507000 79.648000 48.1960000 35.00600 1081.000 11.593900

The smallest cluster contains 1 practice, the largest 911, and the average cluster size is 516.7.
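The second aim set out in the Method section, assigning new or missing practices to the identified clusters, is not implemented in this note. One simple option is nearest-centroid assignment: scale the new practice with the means and standard deviations of the original data, then pick the closest cluster centre. The sketch below is an assumption about how this could be done, not the authors' method; it uses synthetic data, and `assign_cluster()`, `train`, `z` and `km` are hypothetical names.

```r
## Synthetic training data standing in for the practice variables
set.seed(1)
train <- data.frame(imd = runif(200, 3, 66),
                    pop = runif(200, 300, 20000),
                    p75 = runif(200, 0, 15))
z  <- scale(train)
km <- kmeans(z, centers = 4, nstart = 25)

## Hypothetical helper: assign each row of `new` to its nearest k-means centre
assign_cluster <- function(new, z, km) {
  ## scale new rows using the training means and standard deviations
  new_z <- scale(as.matrix(new[, colnames(z), drop = FALSE]),
                 center = attr(z, "scaled:center"),
                 scale  = attr(z, "scaled:scale"))
  ## squared Euclidean distance from each new row to each centre (centres x rows)
  d2 <- apply(new_z, 1, function(r) colSums((t(km$centers) - r)^2))
  unname(apply(d2, 2, which.min))
}

## Usage: a made-up practice that was not in the training data
new_practice <- data.frame(imd = 40, pop = 5000, p75 = 6)
assign_cluster(new_practice, z, km)
```

Reusing the training means and standard deviations matters: rescaling a new practice on its own would put it on a different scale from the cluster centres.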

2.2.1 Plotting the clusters

We can plot the mean values of each variable for each cluster as a parallel coordinate plot:

require(ggplot2);require(tidyr);require(directlabels)
agg$cluster <- as.factor(agg$cluster)

agg1 <- scale(agg[-1])
agg1 <- data.frame(cbind(agg$cluster, agg1))

aggw <- gather(agg1, ind, value, 2:10)

## V1 holds the cluster number; refer to it by name inside aes() rather than via aggw$V1
kplot <- ggplot(aggw, aes(ind, value, label = V1))
kplot <- kplot + geom_line(aes(group = factor(V1), colour = factor(V1))) + ggtitle("Cluster characteristics (scaled values)")
kplot + geom_text(size = 2) + geom_hline(yintercept = 0) + ylim(c(-3, 4))

kplot1 <- ggplot(aggw, aes(ind, value, colour = factor(V1))) + geom_point() + geom_line(aes(group = factor(V1)))
kplot1 + facet_wrap(~V1) + coord_polar() + geom_hline(yintercept = 0) + ggtitle("Cluster profiles")

We can begin to see the nature of the clusters this method identifies.

  • Cluster 15, for example, consists of a single small practice (Y02625, the outlier noted earlier), which is more deprived than average and has a much older population than average
  • Cluster 5 and 6 are very similar but 6 is a group of much larger practices
  • Clusters 9 and 10 are similar - younger than average, but c10 is more deprived and is more ethnically diverse
  • Cluster 3 is mainly deprived, young practices but less ethnically diverse than c10
  • Cluster 8 is similar to 3 but more deprived

It is possible to create qualitative labels for each group, and to enrich the clustering with additional variables, for example rurality.

We can explore the cluster characteristics further.

2.2.2 Interactive lookup table of practice values, clusters, IMD deciles and CCG

2.2.3 Interactive table of practices by CCG by cluster

##   practice                                     ccg cluster        imd1
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG       5 (28.5,34.8]
## 2   A81008                      NHS South Tees CCG       8 (47.5,53.8]
## 3   A81009                      NHS South Tees CCG       8 (34.8,41.2]
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG       5 (28.5,34.8]
## 5   A81012                      NHS South Tees CCG       8 (47.5,53.8]
## 6   A81013                      NHS South Tees CCG       8 (28.5,34.8]
##    ind value
## 1 %0-4 5.281
## 2 %0-4 5.993
## 3 %0-4 6.659
## 4 %0-4 5.529
## 5 %0-4 7.317
## 6 %0-4 6.335
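The imd1 intervals shown above each span roughly 6.3 points, which is what `cut()` produces when asked for 10 equal-width bins; true deciles would instead have unequal widths and equal numbers of practices. The sketch below contrasts the two on synthetic data. The vector `imd` and the gamma distribution used to generate it are assumptions for illustration, and the inference about how imd1 was built is based only on the interval labels shown.

```r
## Synthetic, right-skewed scores standing in for the IMD practice score
set.seed(1)
imd <- rgamma(1000, shape = 3, scale = 8)

## 10 equal-width bins: same interval width, unequal counts
equal_width <- cut(imd, breaks = 10)

## Deciles: unequal interval widths, equal counts
deciles <- cut(imd,
               breaks = quantile(imd, probs = seq(0, 1, 0.1)),
               include.lowest = TRUE)

table(equal_width)
table(deciles)
```

Which version is preferable depends on the use: equal-width bands are easier to describe, while deciles guarantee comparable numbers of practices in each band.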

2.2.4 Distribution of clusters by CCG

g <- ggplot(c, aes(ccg)) +
  geom_bar(aes(fill = factor(cluster)), position = "fill") +
  coord_flip() +
  xlab("") +
  theme(axis.text.y = element_text(size = 5), legend.position = "bottom")

g