There are 8000 general practices in England, and data for each practice is increasingly available in the public domain. There are, however, few tools which enable analysts and practices to compare practices with similar characteristics, or peer groups. A previous tool developed by APHO appeared to misclassify practices considerably, and the National Practice Benchmarking and Indicators group agreed that the issue of finding peer groups should be revisited.
This short note is a pilot evaluation of a simple method based on k-means analysis of an extract of the national general practice profiles (NGPP). The NGPP contain about 250 publicly available indicators of the health and wellbeing, care utilisation, demography and characteristics of general practice populations. For this analysis we have focussed on key demographic variables.
The data for this analysis was extracted from the spreadsheets downloaded from the NGPP website and included the following variables for each practice:
- practice: practice code
- ccg: CCG
- pop: total registered population (2014)
- imd: IMD 2015 practice score (% of population with some form of deprivation) - calculated by Dept Primary Care, Kings
- %0-4: population <5
- %5-14
- %65+
- %75+
- %85+
- <18
- eth: % population non-white ethnicity

Machine learning (ML) refers to the ability of computers to learn without explicit programming. In a data context this generally means being able to detect patterns in the data, or to make predictions from existing data about new, ‘unseen’ data.
ML algorithms are often subdivided into:
- Supervised: we build statistical models on datasets where we know the answer (i.e. the outcome or classification); this is the training element. The models are then applied to unseen or test data for evaluation. The statistical models used for supervised ML fall into regression models (where the outcome is a continuous variable) and classification models (where the outcome is categorical).
- Unsupervised: there is no known outcome or category, and the algorithms try to draw out patterns from the data alone. Cluster analysis is often categorised as unsupervised ML.
In this analysis we will do two things:
1. Attempt to cluster general practices using a commonly used clustering algorithm known as k-means.
2. Use this classification to build a model which will allow us to assign practices missing from this analysis, or new practices (or indeed groups of practices), to the identified clusters.
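
The second step can be approached in several ways. One simple option, sketched here, is to assign a new practice to the nearest k-means centroid. This is not necessarily the model used in this work. The data are simulated, and `toy`, `assign_cluster` and `new_practice` are hypothetical names introduced for illustration only.

```r
## Sketch: nearest-centroid assignment of a new practice to an existing
## k-means cluster. All data are simulated stand-ins, not the NGPP extract.
set.seed(1)
toy <- data.frame(
  under5 = c(rnorm(50, 4, 0.5), rnorm(50, 8, 0.5)),
  over75 = c(rnorm(50, 12, 1), rnorm(50, 4, 1)),
  imd    = c(rnorm(50, 15, 3), rnorm(50, 40, 3))
)

scaled <- scale(toy)                              ## z-scores
fit <- kmeans(scaled, centers = 2, nstart = 25)

## Hypothetical helper: scale new rows with the ORIGINAL means and SDs,
## then pick the centroid with the smallest squared Euclidean distance.
assign_cluster <- function(newdata, scaled, fit) {
  z <- scale(newdata,
             center = attr(scaled, "scaled:center"),
             scale  = attr(scaled, "scaled:scale"))
  unname(apply(z, 1, function(row) {
    which.min(colSums((t(fit$centers) - row)^2))
  }))
}

new_practice <- data.frame(under5 = 8.2, over75 = 3.9, imd = 42)
assign_cluster(new_practice, scaled, fit)
```

The important design point is that new practices are scaled with the means and standard deviations of the original data, so that they are placed on the same scale as the centroids.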
K-means analysis is a form of unsupervised machine learning designed to cluster data into groups whose members are similar to each other and dissimilar to members of other groups. It is entirely data driven. The value of k needs to be specified in advance, although there are methods for determining the optimum number of clusters. Based on previous iterations of this work and discussions with others we will use 15 clusters, but it is possible to rerun the analysis with other values of k.
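
One commonly used heuristic for choosing k is the "elbow" plot of total within-cluster sum of squares against k. The sketch below uses simulated data with three obvious groups. It illustrates the general idea only and is not the method used to arrive at 15 clusters here.

```r
## Sketch: elbow plot for choosing k, on simulated data (not the NGPP extract)
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)
toy <- scale(toy)  ## z-scores, so both columns contribute equally

ks <- 1:8
## total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The value of k at which the curve flattens is a candidate for the number of clusters. On real data the bend is often much less distinct than in this toy example.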
The analysis was conducted in RStudio and this report was written in R Markdown, which allows us to embed the relevant code so that it can be shared and easily modified if we wish to change the variables or the analysis.
It updates the previous analysis with the latest data for populations and IMD scores, and adds ethnicity (eth1, % population non-white).
## practice ccg %0-4 %5-14 %<18
## 1 A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2 A81008 NHS South Tees CCG 5.993 11.008 20.538
## 3 A81009 NHS South Tees CCG 6.659 12.142 22.584
## 4 A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5 A81012 NHS South Tees CCG 7.317 12.048 22.935
## 6 A81013 NHS South Tees CCG 6.335 12.292 22.300
## %65+ %75+ %85+ imd pop eth1
## 1 18.593 9.260 2.709 31.740 9525 2.5665
## 2 16.585 7.730 2.055 51.915 4088 4.3742
## 3 14.690 6.724 1.824 35.465 9265 12.9636
## 4 18.626 8.507 1.923 34.840 11285 2.3051
## 5 14.340 7.338 1.787 49.744 4756 8.9801
## 6 17.788 6.994 1.662 29.891 6077 0.8455
The dataset only contains 7750 practices because it excludes small practices and those with discrepancies between QOF reported practice size and the registered population.
We can summarise the dataset and see that there is considerable variation between practices. For example, the % of under 5s varies between 0 and 17.3%, the % of over 75s between 0 and 79.6%, and the % with white ethnicity between 9% and 99%.
%0-4 | %5-14 | %<18 | %65+ | %75+ | %85+ | imd | pop | eth1
---|---|---|---|---|---|---|---|---
Min. : 0.000 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Min. : 0.000 | Min. : 3.213 | Min. : 294 | Min. : 0.4367
1st Qu.: 4.888 | 1st Qu.:10.11 | 1st Qu.:18.43 | 1st Qu.:12.23 | 1st Qu.: 5.366 | 1st Qu.: 1.396 | 1st Qu.:13.987 | 1st Qu.: 3956 | 1st Qu.: 2.7586
Median : 5.790 | Median :11.28 | Median :20.40 | Median :17.15 | Median : 7.670 | Median : 2.131 | Median :21.927 | Median : 6514 | Median : 7.4807
Mean : 5.985 | Mean :11.49 | Mean :20.89 | Mean :16.85 | Mean : 7.623 | Mean : 2.183 | Mean :23.724 | Mean : 7337 | Mean :16.8606
3rd Qu.: 6.869 | 3rd Qu.:12.66 | 3rd Qu.:22.82 | 3rd Qu.:21.20 | 3rd Qu.: 9.671 | 3rd Qu.: 2.822 | 3rd Qu.:31.731 | 3rd Qu.: 9811 | 3rd Qu.:24.1855
Max. :17.268 | Max. :30.32 | Max. :53.51 | Max. :92.51 | Max. :79.648 | Max. :48.196 | Max. :66.479 | Max. :54848 | Max. :90.4988
Exploring the relationships between the variables suggests that deprivation has a complex relationship with age structure. The under 18 variable is strongly correlated with %0-4 and %5-14 so, to simplify the analysis, we could exclude it. Similarly, we could use %75+ and drop %65+ and %85+.
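
A correlation matrix is one way to check for redundant variables of this kind. The sketch below uses simulated columns whose names (`p0_4`, `p5_14`, `p_u18`, `imd`) are stand-ins for the real variables, and the 0.8 cut-off is an arbitrary illustrative threshold.

```r
## Sketch: spotting strongly correlated variables (simulated data only)
set.seed(1)
n <- 200
p0_4  <- rnorm(n, 6, 1.5)
p5_14 <- 1.5 * p0_4 + rnorm(n, 2.5, 1)
toy <- data.frame(
  p0_4  = p0_4,
  p5_14 = p5_14,
  p_u18 = p0_4 + p5_14 + rnorm(n, 3.5, 0.5),  ## built from the two columns above
  imd   = rnorm(n, 24, 10)                    ## unrelated to age in this toy data
)

cmat <- cor(toy)
round(cmat, 2)

## list each pair once (upper triangle) where |r| exceeds the threshold
high <- which(abs(cmat) > 0.8 & upper.tri(cmat), arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[high[, "row"]],
           var2 = colnames(cmat)[high[, "col"]],
           r    = round(cmat[high], 2))
```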
There also seems to be an outlying practice with high proportions of older people. This practice is Y02625. Looking at the characteristics of this practice shows it is small with an exclusively older population, suggesting it is probably a nursing home practice. We’ll keep it in for the analysis.
row | practice | ccg | %0-4 | %5-14 | %<18 | %65+ | %75+ | %85+ | imd | pop | eth1
---|---|---|---|---|---|---|---|---|---|---|---
7615 | Y02625 | NHS Salford CCG | 0 | 0 | 0 | 92.507 | 79.648 | 48.196 | 35.006 | 1081 | 11.5939
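
An outlier like this can also be found programmatically. The sketch below flags rows whose z-score on one variable is extreme. The data and practice codes are made up, and the threshold of 5 standard deviations is an arbitrary choice for illustration.

```r
## Sketch: flagging an extreme row by z-score (simulated data only)
set.seed(1)
toy <- data.frame(
  practice = sprintf("P%03d", 1:100),        ## made-up practice codes
  over75   = c(rnorm(99, 7.6, 2.5), 80),     ## last row mimics a nursing home practice
  pop      = c(round(runif(99, 2000, 15000)), 1000)
)

z <- scale(toy$over75)[, 1]   ## z-scores for the % aged 75+
toy[abs(z) > 5, ]             ## rows more than 5 SDs from the mean
```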
A simple hierarchical cluster analysis produces the dendrogram below. Each branch is a general practice. The dendrogram partitions the data according to similarity based on Euclidean distance: the further down the tree, the more similar the practices are. It is hard to see all the detail (the chart shows all 7750 practices), but it picks out the outlier (the single branch at the top left of the chart) and suggests there are a number of practice groupings. We can use this as a basis for assigning clusters, depending on how fine grained we want them to be. Note that we have scaled the data (z-scores) because clustering is sensitive to absolute values.
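
For readers who want to reproduce this kind of chart, a minimal sketch on simulated data follows. The linkage method (`"complete"`, the `hclust` default) is an assumption, as the choice used for the dendrogram above is not stated.

```r
## Sketch: hierarchical clustering on scaled data (simulated, not the NGPP extract)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2),
  matrix(c(30, 30), ncol = 2)       ## one extreme row, analogous to an outlying practice
)
colnames(toy) <- c("x1", "x2")

d  <- dist(scale(toy), method = "euclidean")  ## Euclidean distance on z-scores
hc <- hclust(d, method = "complete")          ## linkage method is an assumption
plot(hc, labels = FALSE, main = "Toy dendrogram")

groups <- cutree(hc, k = 3)   ## cut the tree into 3 groups
table(groups)
```

Cutting the tree at a different `k` (or at a height, via the `h` argument of `cutree`) gives coarser or finer groupings.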
## Loading required package: broom
For context the average of each variable is:
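require(knitr); require(dplyr) ## dplyr provides %>%, summarise_each() and funs()
## practice and ccg are not numeric, so their means come back as NA
kable(kgp %>% summarise_each(funs(mean)))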
practice | ccg | %0-4 | %5-14 | %<18 | %65+ | %75+ | %85+ | imd | pop | eth1 |
---|---|---|---|---|---|---|---|---|---|---|
NA | NA | 5.985418 | 11.49312 | 20.89179 | 16.84751 | 7.623197 | 2.183238 | 23.72421 | 7336.641 | 16.86065 |
With 15 clusters
set.seed(1) ## this is needed because there is an element of random sampling
k <- kmeans(scale(kgp[, -(1:2)]), 15, nstart = 25)
require(knitr); require(tidyr)
kable(k$size) ## how many practices in each cluster
819 |
710 |
391 |
326 |
903 |
159 |
188 |
632 |
711 |
664 |
681 |
328 |
911 |
326 |
1 |
kgp$cluster <- k$cluster
head(kgp)
## practice ccg %0-4 %5-14 %<18
## 1 A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2 A81008 NHS South Tees CCG 5.993 11.008 20.538
## 3 A81009 NHS South Tees CCG 6.659 12.142 22.584
## 4 A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5 A81012 NHS South Tees CCG 7.317 12.048 22.935
## 6 A81013 NHS South Tees CCG 6.335 12.292 22.300
## %65+ %75+ %85+ imd pop eth1 cluster
## 1 18.593 9.260 2.709 31.740 9525 2.5665 5
## 2 16.585 7.730 2.055 51.915 4088 4.3742 8
## 3 14.690 6.724 1.824 35.465 9265 12.9636 8
## 4 18.626 8.507 1.923 34.840 11285 2.3051 5
## 5 14.340 7.338 1.787 49.744 4756 8.9801 8
## 6 17.788 6.994 1.662 29.891 6077 0.8455 8
## the average values for each cluster
agg <- kgp[, -(1:2)] %>% group_by(cluster) %>% summarise_each(funs(mean))
kable(agg)
cluster | %0-4 | %5-14 | %<18 | %65+ | %75+ | %85+ | imd | pop | eth1 |
---|---|---|---|---|---|---|---|---|---|
1 | 5.458320 | 10.355247 | 19.10164 | 18.758365 | 8.504031 | 2.3674371 | 30.51271 | 5094.231 | 9.142872 |
2 | 4.226663 | 9.580172 | 16.94906 | 25.442985 | 11.581613 | 3.3530127 | 16.81817 | 4560.131 | 4.329802 |
3 | 9.166604 | 14.807522 | 27.60716 | 8.740724 | 3.657634 | 0.9811688 | 34.73493 | 4645.263 | 20.643653 |
4 | 4.087423 | 6.202715 | 13.65165 | 7.807482 | 3.358199 | 0.9053436 | 25.80130 | 8152.285 | 30.741583 |
5 | 6.009210 | 11.370726 | 20.72232 | 17.750556 | 8.051447 | 2.3480321 | 18.76671 | 11922.000 | 9.041510 |
6 | 5.730767 | 11.338849 | 20.60306 | 18.195553 | 8.220635 | 2.4128491 | 16.18125 | 21443.566 | 6.932090 |
7 | 3.778218 | 8.462202 | 15.07149 | 32.491319 | 16.223649 | 5.3119787 | 16.45609 | 6975.750 | 3.279205 |
8 | 6.844609 | 12.579426 | 23.12021 | 14.439660 | 6.404903 | 1.7113307 | 41.03463 | 5484.375 | 12.915971 |
9 | 6.639204 | 12.467626 | 22.50230 | 14.207638 | 6.086474 | 1.7180872 | 17.53106 | 5240.120 | 12.046672 |
10 | 6.599983 | 11.903497 | 21.80131 | 9.710417 | 4.431166 | 1.1201069 | 30.50566 | 5376.386 | 55.287250 |
11 | 4.808363 | 10.487881 | 18.61249 | 23.836206 | 11.129703 | 3.3470000 | 14.42905 | 11161.984 | 4.392428 |
12 | 7.598180 | 13.261655 | 24.33471 | 11.382899 | 5.147564 | 1.4655305 | 24.90804 | 12358.857 | 30.064052 |
13 | 5.191360 | 11.273181 | 19.87465 | 20.014585 | 8.843134 | 2.5422569 | 13.16405 | 5826.271 | 6.406844 |
14 | 9.283905 | 17.898613 | 31.64721 | 6.688423 | 3.123896 | 0.7610890 | 41.86686 | 5408.046 | 61.325038 |
15 | 0.000000 | 0.000000 | 0.00000 | 92.507000 | 79.648000 | 48.1960000 | 35.00600 | 1081.000 | 11.593900 |
The smallest cluster contains 1 practice, the largest 911, and the average cluster size is about 517.
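
These figures come straight from the `size` element of the object returned by `kmeans`. A minimal sketch on simulated data (not the practice dataset):

```r
## Sketch: summarising cluster sizes from a kmeans fit (simulated data only)
set.seed(1)
toy <- scale(matrix(rnorm(600), ncol = 3))   ## 200 rows, 3 variables
fit <- kmeans(toy, centers = 4, nstart = 25)

c(smallest = min(fit$size),
  largest  = max(fit$size),
  mean     = round(mean(fit$size), 1))
```

The mean cluster size is simply the number of rows divided by k, so for the analysis above it is 7750 / 15.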
We can plot the mean values of each variable for each cluster as a parallel coordinate plot:
require(ggplot2);require(tidyr);require(directlabels)
## Loading required package: directlabels
agg$cluster <- as.factor(agg$cluster)
agg1 <- scale(agg[-1]) ## z-score the cluster means so all variables share one axis
agg1 <- data.frame(cbind(agg$cluster, agg1)) ## the cluster id becomes column V1
aggw <- gather(agg1, ind, value, 2:10) ## long format: one row per cluster and variable
kplot <- ggplot(aggw, aes(ind, value, label = V1))
kplot <- kplot + geom_line(aes(group = factor(V1), colour = factor(V1))) + ggtitle("Cluster characteristics (scaled values)")
kplot + geom_text(size = 2) + geom_hline(yintercept = 0) + ylim(c(-3, 4))
kplot1 <- ggplot(aggw, aes(ind, value, colour = factor(V1))) + geom_point() + geom_line(aes(group = factor(V1)))
kplot1 + facet_wrap(~V1) + coord_polar() + geom_hline(yintercept = 0) + ggtitle("Cluster profiles")
We can begin to see the nature of the clusters this method identifies.
It is possible to create qualitative labels for each group, and to enrich the clustering with additional variables, for example rurality.
We can explore the cluster characteristics in more detail:
## practice ccg cluster imd1
## 1 A81007 NHS Hartlepool And Stockton-On-Tees CCG 5 (28.5,34.8]
## 2 A81008 NHS South Tees CCG 8 (47.5,53.8]
## 3 A81009 NHS South Tees CCG 8 (34.8,41.2]
## 4 A81011 NHS Hartlepool And Stockton-On-Tees CCG 5 (28.5,34.8]
## 5 A81012 NHS South Tees CCG 8 (47.5,53.8]
## 6 A81013 NHS South Tees CCG 8 (28.5,34.8]
## ind value
## 1 %0-4 5.281
## 2 %0-4 5.993
## 3 %0-4 6.659
## 4 %0-4 5.529
## 5 %0-4 7.317
## 6 %0-4 6.335
g <- ggplot(c, aes(ccg)) + geom_bar(aes(fill = factor(cluster)), position = "fill") + coord_flip() + theme(axis.text.y = element_text(size = 5)) + xlab("") + theme(legend.position = "bottom")
g