1 Method
- 1.1 Data
- 1.2 Machine learning (ML)
  - 1.2.1 The dataset
  - 1.2.2 Data summaries
2 Cluster analysis
- 2.1 K means analysis
  - 2.1.1 Running the k means analysis
- 2.2 Summary of results

There are 8000 general practices in England and data for each practice is increasingly available in the public domain. There are, however, few tools which enable analysts and practices to compare practices with similar characteristics, or peer groups. A previous tool developed by APHO appeared to misclassify a considerable number of practices, and the National Practice Benchmarking and Indicators group agreed that the issue of finding peer groups should be revisited.

This short note is a pilot evaluation of a simple method based on k-means analysis of an extract of the national general practice profiles (NGPP). The NGPP contain about 250 publicly available indicators of the health and wellbeing, care utilisation, demography and characteristics of general practice populations. For this analysis we have focussed on key demographic variables.

1 Method

1.1 Data

The data for this analysis was extracted from the spreadsheets downloaded from the NGPP website and included the following variables for each practice:

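practice: practice code
ccg: CCG
pop: total registered population (2014)
imd: IMD 2015 practice score (% of population with some form of deprivation) - calculated by Dept Primary Care, Kings
%0-4: % population aged under 5
%5-14: % population aged 5-14
%65+: % population aged 65 and over
%75+: % population aged 75 and over
%85+: % population aged 85 and over
%<18: % population aged under 18
eth1: % population non-white ethnicity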

1.2 Machine learning (ML)

ML refers to the ability of computers to learn without explicit programming. In a data context this generally means being able to detect patterns in the data, or make predictions from existing data about new, ‘unseen’ data.

ML algorithms are often subdivided into:

Supervised - we build statistical models on datasets where we know the answer (i.e. the outcome or classification) - this is the training element - and these are then applied to unseen or test data for evaluation. The statistical models used for supervised ML are either regression models (where the outcome is a continuous variable) or classification models (where the outcome is categorical).
Unsupervised - there is no outcome or known category, and the algorithms try to draw out patterns from the data alone. Cluster analysis is usually categorised as unsupervised ML.
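As a minimal illustration of this distinction (not part of the original analysis), the sketch below uses R's built-in iris data. A linear regression is trained on rows with a known outcome and evaluated on held-out rows, while k-means groups the same measurements without being given any outcome. The train/test split, the variables and the choice of three clusters are arbitrary assumptions.

```r
## Illustrative only: built-in iris data, not the practice dataset
set.seed(1)
test_rows <- sample(nrow(iris), 30)           # hold out 30 rows as 'unseen' data
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

## Supervised (regression): the outcome Petal.Width is known in the training data
fit  <- lm(Petal.Width ~ Petal.Length, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Petal.Width - pred)^2))   # prediction error on unseen data

## Unsupervised: no outcome is supplied; k-means looks for groups in the measurements
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(km$cluster)                             # number of rows in each cluster
```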

In this analysis we will do two things:

1 Attempt to cluster general practices using a commonly used clustering algorithm known as k-means.
2 Use this classification to build a model which will allow us to assign practices missing from this analysis, or new practices (or indeed groups of practices), to the identified clusters.
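One simple form the second step could take, assuming we assign by nearest cluster centre rather than fitting a separate classification model, is sketched below. The data are simulated, and assign_cluster() is a hypothetical helper written for this illustration; it is not the model used in this work.

```r
## Illustrative only: simulated data; assign_cluster() is a hypothetical helper
set.seed(1)
x  <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
xs <- scale(x)                                   # z-scores
km <- kmeans(xs, centers = 4, nstart = 25)

assign_cluster <- function(new, xs, km) {
  ## rescale the new rows with the means and SDs of the original data
  new_s <- scale(new, center = attr(xs, "scaled:center"),
                      scale  = attr(xs, "scaled:scale"))
  ## squared Euclidean distance from each new row to each cluster centre
  d <- apply(km$centers, 1, function(ctr) rowSums(sweep(new_s, 2, ctr)^2))
  d <- matrix(d, nrow = nrow(new_s))             # keep matrix shape for a single row
  max.col(-d, ties.method = "first")             # index of the nearest centre
}

new_practices <- x[1:5, , drop = FALSE]          # stand-in for new or missing practices
assign_cluster(new_practices, xs, km)
```

The important detail is that new rows are standardised with the centre and scale of the data used for clustering, not with their own mean and standard deviation.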

K-means analysis is a form of unsupervised machine learning designed to cluster data into groups whose members are more similar to each other than to members of other groups - it is entirely data driven. The number of clusters, k, needs to be specified in advance, although there are methods for determining the optimum number of clusters. Based on previous iterations of this work and discussions with others we will use 15 clusters, but it is possible to rerun the analysis with other values of k.
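One commonly used heuristic for choosing k is the "elbow" method: run k-means for a range of k and look for the point at which the total within-cluster sum of squares stops falling sharply. The sketch below applies it to simulated data with three well-separated groups; it is an illustration under those assumptions and was not used to arrive at the 15 clusters in this note.

```r
## Illustrative only: simulated data with three well-separated groups
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
xs <- scale(x)

## total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(xs, centers = k, nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

With real data the elbow is often much less distinct, so the choice of k usually also rests on interpretability.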

The analysis was conducted in RStudio and this report is written in R Markdown, which allows us to embed the relevant code so that it can be shared and easily modified if we wish to change the variables or the analysis.

It updates the previous analysis with the latest data for populations and IMD scores and adds ethnicity (% population non-white).

1.2.1 The dataset

##   practice                                     ccg  %0-4  %5-14   %<18
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2   A81008                      NHS South Tees CCG 5.993 11.008 20.538
## 3   A81009                      NHS South Tees CCG 6.659 12.142 22.584
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5   A81012                      NHS South Tees CCG 7.317 12.048 22.935
## 6   A81013                      NHS South Tees CCG 6.335 12.292 22.300
##     %65+  %75+  %85+    imd   pop    eth1
## 1 18.593 9.260 2.709 31.740  9525  2.5665
## 2 16.585 7.730 2.055 51.915  4088  4.3742
## 3 14.690 6.724 1.824 35.465  9265 12.9636
## 4 18.626 8.507 1.923 34.840 11285  2.3051
## 5 14.340 7.338 1.787 49.744  4756  8.9801
## 6 17.788 6.994 1.662 29.891  6077  0.8455

The dataset only contains 7750 practices because it excludes small practices and those with discrepancies between QOF reported practice size and the registered population.

1.2.2 Data summaries

We can summarise the dataset and see that there is considerable variation between practices. For example, the % of under 5s varies between 0 and 17.3%, the % of over 75s between 0 and 79.6%, and the % of non-white ethnicity between 0.4% and 90.5% (equivalently, white ethnicity between about 9.5% and 99.6%).

%0-4	%5-14	%<18	%65+	%75+	%85+	imd	pop	eth1
Min. : 0.000	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.000	Min. : 0.000	Min. : 3.213	Min. : 294	Min. : 0.4367
1st Qu.: 4.888	1st Qu.:10.11	1st Qu.:18.43	1st Qu.:12.23	1st Qu.: 5.366	1st Qu.: 1.396	1st Qu.:13.987	1st Qu.: 3956	1st Qu.: 2.7586
Median : 5.790	Median :11.28	Median :20.40	Median :17.15	Median : 7.670	Median : 2.131	Median :21.927	Median : 6514	Median : 7.4807
Mean : 5.985	Mean :11.49	Mean :20.89	Mean :16.85	Mean : 7.623	Mean : 2.183	Mean :23.724	Mean : 7337	Mean :16.8606
3rd Qu.: 6.869	3rd Qu.:12.66	3rd Qu.:22.82	3rd Qu.:21.20	3rd Qu.: 9.671	3rd Qu.: 2.822	3rd Qu.:31.731	3rd Qu.: 9811	3rd Qu.:24.1855
Max. :17.268	Max. :30.32	Max. :53.51	Max. :92.51	Max. :79.648	Max. :48.196	Max. :66.479	Max. :54848	Max. :90.4988

Exploring the relationship between the variables suggests that deprivation has a complex relationship with age structure. The under 18 variable is strongly correlated with %0-4 and %5-14 so to simplify the analysis we could exclude this. Similarly we could use 75+ and drop 65+ and 85+.
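A correlation matrix is one way to check for this kind of redundancy. The sketch below uses simulated age-band percentages (the variable names and the 0.7 cut-off are assumptions made for illustration) and lists the pairs of variables whose absolute correlation exceeds the cut-off.

```r
## Illustrative only: simulated age-band percentages, not the practice data
set.seed(1)
n     <- 500
youth <- rnorm(n)                                # shared 'young population' factor
p0_4  <- 6    + 1.5 * youth + rnorm(n, 0, 0.5)
p5_14 <- 11.5 + 2.0 * youth + rnorm(n, 0, 0.7)
dat <- data.frame(p0_4  = p0_4,
                  p5_14 = p5_14,
                  p_u18 = p0_4 + p5_14 + rnorm(n, 3.5, 0.5),  # includes both bands
                  imd   = rnorm(n, 24, 10))      # generated independently of age

cm <- cor(dat)
round(cm, 2)

## variable pairs whose absolute correlation exceeds a (hypothetical) 0.7 cut-off
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, "row"]],
           var2 = colnames(cm)[high[, "col"]],
           r    = cm[high])
```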

There also seems to be an outlying practice with high proportions of older people. This practice is Y02625. Looking at the characteristics of this practice shows it is small with an exclusively older population, suggesting it is probably a nursing home practice. We’ll keep it in for the analysis.

	practice	ccg	%0-4	%5-14	%<18	%65+	%75+	%85+	imd	pop	eth1
7615	Y02625	NHS Salford CCG	0	0	0	92.507	79.648	48.196	35.006	1081	11.5939

2 Cluster analysis

A simple hierarchical cluster analysis produces the dendrogram below. Each branch is a general practice. The dendrogram partitions the data according to similarity based on Euclidean distance - the further down the tree, the more similar the practices are. It is hard to see all the detail (the chart shows all 7750 practices) but it picks out the outlier (single branch at the top left of the chart) and suggests there are a number of practice groupings. We can use this as a basis for assigning clusters, depending on how fine grained we want them to be. Note that we have scaled the data (z-scores) because distance-based clustering is sensitive to the absolute scale of each variable.
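The code behind the dendrogram is not shown in this note. A minimal sketch of this kind of analysis, using base R on simulated data with one artificial outlier, might look like the following; the linkage method (hclust's default, complete linkage) and the cut into four groups are assumptions.

```r
## Illustrative only: simulated data with one extreme row added as an outlier
set.seed(1)
x <- rbind(matrix(rnorm(60 * 3), ncol = 3),
           c(12, 12, 12))                        # artificial outlier (row 61)

## z-score the columns, then cluster on Euclidean distance
hc <- hclust(dist(scale(x), method = "euclidean"))  # complete linkage by default
plot(hc, labels = FALSE, main = "Dendrogram (simulated data)")

groups <- cutree(hc, k = 4)                      # cut the tree into 4 groups
table(groups)
```

Cutting the tree at a larger k gives finer-grained groups; an extreme row such as the one added here tends to sit on its own branch.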


2.1 K means analysis

For context the average of each variable is:

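require(knitr); require(dplyr) ## dplyr supplies %>% and summarise_each()
## practice and ccg are not numeric, so their means are reported as NA
kable(kgp %>% summarise_each(funs(mean)))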

practice	ccg	%0-4	%5-14	%<18	%65+	%75+	%85+	imd	pop	eth1
NA	NA	5.985418	11.49312	20.89179	16.84751	7.623197	2.183238	23.72421	7336.641	16.86065

2.1.1 Running the k means analysis

With 15 clusters

set.seed(1) ## this is needed because there is an element of random sampling
k <- kmeans(scale(kgp[, -(1:2)]), 15, nstart = 25)

2.2 Summary of results

require(knitr); require(tidyr)
kable(k$size) ## how many practices in each cluster

819
710
391
326
903
159
188
632
711
664
681
328
911
326
1

kgp$cluster <- k$cluster

head(kgp)

##   practice                                     ccg  %0-4  %5-14   %<18
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG 5.281 12.483 21.493
## 2   A81008                      NHS South Tees CCG 5.993 11.008 20.538
## 3   A81009                      NHS South Tees CCG 6.659 12.142 22.584
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG 5.529 10.572 19.716
## 5   A81012                      NHS South Tees CCG 7.317 12.048 22.935
## 6   A81013                      NHS South Tees CCG 6.335 12.292 22.300
##     %65+  %75+  %85+    imd   pop    eth1 cluster
## 1 18.593 9.260 2.709 31.740  9525  2.5665       5
## 2 16.585 7.730 2.055 51.915  4088  4.3742       8
## 3 14.690 6.724 1.824 35.465  9265 12.9636       8
## 4 18.626 8.507 1.923 34.840 11285  2.3051       5
## 5 14.340 7.338 1.787 49.744  4756  8.9801       8
## 6 17.788 6.994 1.662 29.891  6077  0.8455       8

## the average values for each cluster

agg <- kgp[, -(1:2)] %>% group_by(cluster) %>% summarise_each(funs(mean))
kable(agg)

cluster	%0-4	%5-14	%<18	%65+	%75+	%85+	imd	pop	eth1
1	5.458320	10.355247	19.10164	18.758365	8.504031	2.3674371	30.51271	5094.231	9.142872
2	4.226663	9.580172	16.94906	25.442985	11.581613	3.3530127	16.81817	4560.131	4.329802
3	9.166604	14.807522	27.60716	8.740724	3.657634	0.9811688	34.73493	4645.263	20.643653
4	4.087423	6.202715	13.65165	7.807482	3.358199	0.9053436	25.80130	8152.285	30.741583
5	6.009210	11.370726	20.72232	17.750556	8.051447	2.3480321	18.76671	11922.000	9.041510
6	5.730767	11.338849	20.60306	18.195553	8.220635	2.4128491	16.18125	21443.566	6.932090
7	3.778218	8.462202	15.07149	32.491319	16.223649	5.3119787	16.45609	6975.750	3.279205
8	6.844609	12.579426	23.12021	14.439660	6.404903	1.7113307	41.03463	5484.375	12.915971
9	6.639204	12.467626	22.50230	14.207638	6.086474	1.7180872	17.53106	5240.120	12.046672
10	6.599983	11.903497	21.80131	9.710417	4.431166	1.1201069	30.50566	5376.386	55.287250
11	4.808363	10.487881	18.61249	23.836206	11.129703	3.3470000	14.42905	11161.984	4.392428
12	7.598180	13.261655	24.33471	11.382899	5.147564	1.4655305	24.90804	12358.857	30.064052
13	5.191360	11.273181	19.87465	20.014585	8.843134	2.5422569	13.16405	5826.271	6.406844
14	9.283905	17.898613	31.64721	6.688423	3.123896	0.7610890	41.86686	5408.046	61.325038
15	0.000000	0.000000	0.00000	92.507000	79.648000	48.1960000	35.00600	1081.000	11.593900

The smallest cluster contains 1 practice, the largest 911, and the average cluster size is about 517.

2.2.1 Plotting the clusters

We can plot the mean values of each variable for each cluster as a parallel coordinate plot.

require(ggplot2);require(tidyr);require(directlabels)

## Loading required package: directlabels

agg$cluster <- as.factor(agg$cluster)

agg1 <- scale(agg[-1])
agg1 <- data.frame(cbind(agg$cluster, agg1))

aggw <- gather(agg1, ind, value, 2:10)

## refer to columns by name inside aes() rather than via aggw$
kplot <- ggplot(aggw, aes(ind, value, label = V1))
kplot <- kplot + geom_line(aes(group = factor(V1), colour = factor(V1))) + ggtitle("Cluster characteristics (scaled values)")
kplot + geom_text(size = 2) + geom_hline(yintercept = 0) + ylim(c(-3, 4))

kplot1 <- ggplot(aggw, aes(ind, value, colour = factor(V1))) + geom_point() + geom_line(aes(group = factor(V1)))
kplot1 + facet_wrap(~V1) + coord_polar() + geom_hline(yintercept = 0) + ggtitle("Cluster profiles")

We can begin to see the nature of the clusters this method identifies.

Cluster 15, for example, is the single outlying practice (Y02625): small, more deprived than average and much older than average.
Clusters 5 and 6 are very similar, but 6 is a group of much larger practices.
Clusters 9 and 10 are similar - younger than average - but c10 is more deprived and more ethnically diverse.
Cluster 3 is mainly deprived, young practices, but less ethnically diverse than c10.
Cluster 8 is similar to 3 but more deprived.

It is possible to create qualitative labels for each group, and to enrich the clustering with additional variables - for example rurality.

We can explore cluster characteristics further.

2.2.2 Interactive lookup table of practice values, clusters, IMD deciles and CCG

2.2.3 Interactive table of practices by CCG by cluster

##   practice                                     ccg cluster        imd1
## 1   A81007 NHS Hartlepool And Stockton-On-Tees CCG       5 (28.5,34.8]
## 2   A81008                      NHS South Tees CCG       8 (47.5,53.8]
## 3   A81009                      NHS South Tees CCG       8 (34.8,41.2]
## 4   A81011 NHS Hartlepool And Stockton-On-Tees CCG       5 (28.5,34.8]
## 5   A81012                      NHS South Tees CCG       8 (47.5,53.8]
## 6   A81013                      NHS South Tees CCG       8 (28.5,34.8]
##    ind value
## 1 %0-4 5.281
## 2 %0-4 5.993
## 3 %0-4 6.659
## 4 %0-4 5.529
## 5 %0-4 7.317
## 6 %0-4 6.335
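The imd1 labels in the output above, such as (28.5,34.8], are all of similar width, which is consistent with base R's cut(x, 10) dividing the range into ten equal-width intervals rather than into deciles. This is an inference from the output, not something the code shown in this note confirms. If groups of roughly equal size were wanted instead, quantile-based breaks would be one option; the sketch below contrasts the two on simulated scores.

```r
## Illustrative only: simulated, right-skewed deprivation-like scores
set.seed(1)
imd <- rgamma(1000, shape = 3, rate = 0.125)     # mean of about 24

equal_width <- cut(imd, 10)                      # ten intervals of equal width
deciles <- cut(imd,
               breaks = quantile(imd, probs = seq(0, 1, 0.1)),
               include.lowest = TRUE)            # ten groups of about equal size

table(equal_width)   # counts differ widely between intervals
table(deciles)       # counts are roughly equal
```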

2.2.4 Distribution of clusters by CCG

g <- ggplot(c, aes(ccg)) +
  geom_bar(aes(fill = factor(cluster)), position = "fill") +
  coord_flip() +
  xlab("") +
  theme(axis.text.y = element_text(size = 5), legend.position = "bottom")

g

Classifying general practices on demographic variables: a machine learning approach. Part 1. DRAFT

Julian Flowers

7 August 2016