This note outlines how to use the new Fingertips API to extract data, create data frames and perform further analysis. The data is accessible in JSON format, and the hierarchical, nested nature of Fingertips architecture makes extracting data from the API quite complex.
The tutorial makes extensive use of the jsonlite
and tidyjson
packages which have made it easier to convert complex JSON data to data frames and into tidy format for analysis. Extensions of this script could be used to extract and recombine data from different profiles, areas or time periods.
First we will load the libraries.
As a simple example we can extract a list of ageband and gender mappings and area types directly from the API.
sex <- fromJSON("http://fingertips.phe.org.uk/api/sexes")
sex %>% head %>% knitr::kable(format = "pandoc", caption = "Sex groupings")
Id | Name |
---|---|
-1 | Not applicable |
4 | Persons |
1 | Male |
2 | Female |
age <- fromJSON("http://fingertips.phe.org.uk/api/ages")
age %>% head %>% knitr::kable(format = "pandoc", caption = "Age groupings")
Id | Name |
---|---|
-1 | Not applicable |
1 | All ages |
2 | < 1 yr |
3 | 1-4 yrs |
4 | 5-9 yrs |
5 | 10-14 yrs |
areatype <- fromJSON("http://fingertips.phe.org.uk/api/area_types")
areatype %>% head %>% knitr::kable(format = "pandoc", caption = "Areatypes")
Id | Name | Short | IsSearchable |
---|---|---|---|
1 | Local Authority District | LA | FALSE |
2 | Primary Care Trust | Primary Care Trust | FALSE |
3 | Middle Super Output Area | MSOA | FALSE |
4 | Lower Super Output Area | LSOA | FALSE |
5 | Strategic Health Authority | SHA | FALSE |
6 | Government Office Region | Region | TRUE |
Overall Fingertips has 281 ageband categories and 51 area types.
As a more complex example, we can extract information about which profiles the API gives access to. The API gives access to 52 profiles - we just show the first 6 here.
profiles <- fromJSON("http://fingertips.phe.org.uk/api/profiles", simplifyDataFrame = TRUE)
profiles %>% head %>% select(-Key) %>%
knitr::kable(format = "pandoc", caption = "Profiles")
Id | Name | GroupIds |
---|---|---|
8 | Adult Social Care | 1000101, 1000102, 1000103, 1000104, 1000105, 1938132733 |
18 | Local Tobacco Control Profiles | 1938132885, 1938132886, 1938132887, 1938132888, 1938132889, 1938132890, 1938132900 |
19 | Public Health Outcomes Framework | 1000041, 1000042, 1000043, 1000044, 1000049, 1938132983 |
20 | National General Practice Profiles | 2000002, 2000003, 2000004, 2000005, 2000006, 2000008, 2000009, 2000011, 3000007, 3000008, 3000009, 3000010, 1938132829, 1938132970, 1938133086 |
21 | National General Practice Profiles (supporting indicators) | 1200006 |
22 | Longer Lives | 1000001 |
Extracting profile information produces a nested list of the GroupIds for each profile, and the number of groups or domains within each profile. For example, health profiles have 7 groups, the national-child-measurement-programme 2 and so on.
We can extract the profile IDs, groupIDs and names as a tidy data frame using the tidyjson
package.
library(tidyjson)
prof_json <- "http://fingertips.phe.org.uk/api/profiles"
prof <- prof_json %>%
gather_array %>%
spread_values(ID = jnumber("Id"),
Name = jstring("Name"),
Key = jstring("Key")
) %>%
enter_object("GroupIds") %>%
gather_array %>%
append_values_number("groupid") %>%
select(ID, Name, Key, groupid)
array.index column name already exists, changing to array.index.2
prof
The next level of complexity is to extract the latest data for a single profile. We’ll start with just the data for a single area type. The structure of Fingertips means we need to pass a profile id, group id, area code and parent_area code to the API. We’ll do this for the Health Profile for UTLAs.
We’ll first import the data using the jsonlite
package and look at the data structure
hp_df <- fromJSON(hp_url, simplifyDataFrame = TRUE)
hp_df %>% str(max.level = 3, list.len = 3)
'data.frame': 3 obs. of 11 variables:
$ Grouping :List of 3
..$ :'data.frame': 2 obs. of 6 variables:
.. ..$ GroupId : int 1938132694 1938132694
.. ..$ ComparatorId : int 1 4
.. ..$ ComparatorMethodId: int 1 1
.. .. [list output truncated]
..$ :'data.frame': 2 obs. of 6 variables:
.. ..$ GroupId : int 1938132694 1938132694
.. ..$ ComparatorId : int 1 4
.. ..$ ComparatorMethodId: int 1 1
.. .. [list output truncated]
..$ :'data.frame': 2 obs. of 6 variables:
.. ..$ GroupId : int 1938132694 1938132694
.. ..$ ComparatorId : int 1 4
.. ..$ ComparatorMethodId: int 1 1
.. .. [list output truncated]
$ Data :List of 3
..$ :'data.frame': 9 obs. of 12 variables:
.. ..$ AgeId : int 168 168 168 168 168 168 168 168 168
.. ..$ SexId : int 4 4 4 4 4 4 4 4 4
.. ..$ AreaCode: chr "E06000015" "E06000016" "E06000017" "E06000018" ...
.. .. [list output truncated]
..$ :'data.frame': 9 obs. of 12 variables:
.. ..$ AgeId : int 164 164 164 164 164 164 164 164 164
.. ..$ SexId : int 4 4 4 4 4 4 4 4 4
.. ..$ AreaCode: chr "E06000015" "E06000016" "E06000017" "E06000018" ...
.. .. [list output truncated]
..$ :'data.frame': 9 obs. of 12 variables:
.. ..$ AgeId : int 164 164 164 164 164 164 164 164 164
.. ..$ SexId : int 4 4 4 4 4 4 4 4 4
.. ..$ AreaCode: chr "E06000015" "E06000016" "E06000017" "E06000018" ...
.. .. [list output truncated]
$ IID : int 92443 90275 90640
[list output truncated]
The data is complex. The data values are contained in a nested series of data frames as part of the $Data
variable. The indicator IDs are stored in the $IID
variable but the variable names are not stored here and will need to be extracted with a different call to the API.
We can use the tidyjson
package to extract the data and indicator ids.
hp1 <- hp_url %>%
gather_array %>%
spread_values(ind_id = jnumber("IID")) %>%
enter_object("Data") %>%
gather_array %>%
spread_values(
area = jstring("AreaCode"),
value = jnumber("Val"),
lci = jnumber("LoCI"),
uci = jnumber("UpCI"),
denom = jnumber("Denom"),
count = jnumber("Count"),
age = jnumber("AgeId"),
sex = jnumber("SexId")
)%>%
select(ind_id, area, age, sex, value, lci, uci, denom, count)
array.index column name already exists, changing to array.index.2
hp2 <- hp_url %>%
gather_array() %>%
spread_values(ind_id = jnumber("IID")) %>%
enter_object("Grouping") %>%
gather_array %>%
spread_values(
time = jstring("Period")
)
array.index column name already exists, changing to array.index.2
hp1 %>%
left_join(hp2) %>%
select(ind_id, area, time, value, lci, uci, count, denom, age, sex) %>%
distinct
Joining, by = "ind_id"
We have now extracted a data frame of indicator values, time periods, sex and age ids for each indicator for each area.
To obtain indicator names from the API we need to call IndicatorMetadata https://fingertips.phe.org.uk/api#!/IndicatorMetadata/IndicatorMetadata_GetIndicatorMetadata to which we can pass a list of indicator ids.
[1] "92443%2C90275%2C90640"
List of 3
$ 90275:List of 7
..$ IID : int 90275
..$ Unit :List of 3
.. ..$ Id : int 5
.. ..$ Value: num 100
.. ..$ Label: chr "%"
..$ ValueType :List of 2
.. ..$ Id : int 5
.. ..$ Name: chr "Proportion"
..$ YearType :List of 2
.. ..$ Id : int 1
.. ..$ Name: chr "Calendar"
..$ ConfidenceLevel : num 95
..$ ConfidenceIntervalMethod:List of 3
.. ..$ Id : int 5
.. ..$ Name : chr "Normal approximation"
.. ..$ Description: chr "<p>A confidence interval is a range of values that is used to quantify the imprecision in the estimate of a particular indicato"| __truncated__
..$ Descriptive :List of 3
.. ..$ Name : chr "2.13i - Percentage of physically active and inactive adults - active adults"
.. ..$ NameLong : chr "2.13i - Percentage of adults achieving at least 150 minutes of physical activity per week in accordance with UK Chief Medical O"| __truncated__
.. ..$ DataSource: chr "Active People Survey, Sport England"
$ 90640:List of 7
..$ IID : int 90640
..$ Unit :List of 3
.. ..$ Id : int 5
.. ..$ Value: num 100
.. ..$ Label: chr "%"
..$ ValueType :List of 2
.. ..$ Id : int 5
.. ..$ Name: chr "Proportion"
..$ YearType :List of 2
.. ..$ Id : int 1
.. ..$ Name: chr "Calendar"
..$ ConfidenceLevel : num 95
..$ ConfidenceIntervalMethod:List of 3
.. ..$ Id : int 5
.. ..$ Name : chr "Normal approximation"
.. ..$ Description: chr "<p>A confidence interval is a range of values that is used to quantify the imprecision in the estimate of a particular indicato"| __truncated__
..$ Descriptive :List of 3
.. ..$ Name : chr "2.12 - Excess weight in Adults"
.. ..$ NameLong : chr "2.12 - Percentage of adults classified as overweight or obese"
.. ..$ DataSource: chr "Active People Survey, Sport England"
$ 92443:List of 7
..$ IID : int 92443
..$ Unit :List of 3
.. ..$ Id : int 5
.. ..$ Value: num 100
.. ..$ Label: chr "%"
..$ ValueType :List of 2
.. ..$ Id : int 5
.. ..$ Name: chr "Proportion"
..$ YearType :List of 2
.. ..$ Id : int 1
.. ..$ Name: chr "Calendar"
..$ ConfidenceLevel : num 95
..$ ConfidenceIntervalMethod:List of 3
.. ..$ Id : int 5
.. ..$ Name : chr "Normal approximation"
.. ..$ Description: chr "<p>A confidence interval is a range of values that is used to quantify the imprecision in the estimate of a particular indicato"| __truncated__
..$ Descriptive :List of 3
.. ..$ Name : chr "Smoking Prevalence in adults - current smokers (APS)"
.. ..$ NameLong : chr "Smoking Prevalence in adults - current smokers (APS)"
.. ..$ DataSource: chr "Annual Population Survey (APS)"
Joining, by = "ind_id"
Joining, by = "ind_id"
ind_id | area | time | value | lci | uci | count | denom | age | sex | Descriptive.NameLong |
---|---|---|---|---|---|---|---|---|---|---|
92443 | E06000015 | 2015 | 18.7 | 16.31 | 21.0 | -1 | 1045 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
92443 | E06000016 | 2015 | 20.1 | 17.82 | 22.3 | -1 | 1226 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
92443 | E06000017 | 2015 | 11.6 | 8.46 | 14.7 | -1 | 399 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
92443 | E06000018 | 2015 | 24.0 | 21.78 | 26.3 | -1 | 1368 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
92443 | E10000007 | 2015 | 17.9 | 15.82 | 19.9 | -1 | 1332 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
92443 | E10000018 | 2015 | 17.4 | 15.26 | 19.4 | -1 | 1262 | 168 | 4 | Smoking Prevalence in adults - current smokers (APS) |
To extract all the indicators for the profiles we need to loop through group IDs and parent codes. We’ll use a different example - data from the National Child Measurement Programme (NCMP).
## Firstly extract the profile ID
hpid <- prof[prof$Key == "national-child-measurement-programme", 1][[1]][1]
## Then extract group (domain) IDs
groupids <- prof[prof$Key == "national-child-measurement-programme", 4]
## Create and empty data frame
df <- data_frame()
## Loop through group IDs and parent codes
for(groupid in groupids){
for(parentcode in parentareacodes$Code) {
## Create API URLs
hp_url1 <- paste0("http://fingertips.phe.org.uk/api/latest_data/all_indicators_in_profile_group_for_child_areas",
"?profile_id=",hpid,
"&group_id=",groupid,
"&area_type_id=",102,
"&parent_area_code=",parentcode)
## Extract data
data <- hp_url1 %>%
gather_array %>%
spread_values(ind_id = jnumber("IID")) %>%
enter_object("Data") %>%
gather_array %>%
spread_values(
area = jstring("AreaCode"),
value = jnumber("Val"),
lci = jnumber("LoCI"),
uci = jnumber("UpCI"),
denom = jnumber("Denom"),
count = jnumber("Count"),
age = jnumber("AgeId"),
sex = jnumber("SexId")
)%>%
select(ind_id, area, age, sex, value, lci, uci, denom, count)
## Build data frame
df <- bind_rows(df, data)
}
}
array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2
## Add time period (uses the same looping process)
dft <- data_frame()
for(groupid in groupids){
for(parentcode in parentareacodes$Code) {
hp_url1 <- paste0("http://fingertips.phe.org.uk/api/latest_data/all_indicators_in_profile_group_for_child_areas",
"?profile_id=",hpid,
"&group_id=",groupid,
"&area_type_id=",102,
"&parent_area_code=",parentcode)
data1 <- hp_url1 %>%
gather_array() %>%
spread_values(ind_id = jnumber("IID")) %>%
enter_object("Grouping") %>%
gather_array %>%
spread_values(
time = jstring("Period")
)
dft <- bind_rows(dft, data1)
}
}
array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2array.index column name already exists, changing to array.index.2
## Join time periods to main data frame
df <- df %>%
left_join(dft) %>%
select(ind_id, area, time, value, lci, uci, count, denom, age, sex) %>%
distinct
Joining, by = "ind_id"
df %>% head
NA
We can add age and gender descriptions and indicator names to the main data frame.
sex <- sex %>%
rename(sex = Id, gender = Name)
age <- age %>%
rename(age = Id)
ind_hp <- paste(unique(df$ind_id), collapse = "%2C")
ind_url <- paste0("https://fingertips.phe.org.uk/api/indicator_metadata/by_indicator_id?indicator_ids=",ind_hp)
ind_url %>%
gather_object() %>%
spread_all() %>%
select(ind_id = IID, Descriptive.NameLong)->ind_names
df <- df %>%
left_join(ind_names, by = "ind_id")
df %>%
# select(-c(Descriptive.NameLong.x, Descriptive.NameLong.y.y, Descriptive.NameLong.x.x)) %>%
select(Ind_name = Descriptive.NameLong, 2:10) %>%
left_join(age) %>%
left_join(sex) -> df_analysis
Joining, by = "age"
Joining, by = "sex"
We now have an analytical data frame df_analysis
we can explore further. We can follow the same processes to extract and combine data from other profiles or other areas, add area names, and add other elements of metadata.
Given the df_analysis
data frame we can explore further.
Scaling up further, the following code adapted from above, extracts the latest data from the General Practice Profiles.
We now have an analytical data set of the latest general practice indicators which contains 204 indicators - we are just showing the first few rows of 1710254.