The Government Digital Service (GDS) is promoting a new analytical workflow based on R Markdown. R Markdown is a way of writing reports with the R statistical software and RStudio. It combines analysis and reporting in a single document, which can be automated, reproduced and output in HTML format or as Word or PDF documents.
The proposed data flow means that documents can easily be prepared in an appropriate format for publication to .gov.uk. The GDS data science team have produced some graphical templates for use on the .gov.uk platform. The workflow cuts down the number of steps involved in creating reports and reduces the risk of error. It can be automated to produce multiple reports in one go, or adapted as a template to report on different topics or issues without too much effort.
Fingertips is a major publication platform for Official Statistics in PHE, so producing reports, analysis and interpretation alongside the published statistical data is increasingly important.
We have produced an R package, fingertipsR, to facilitate data extraction from the Fingertips Application Programming Interface (API).
This report shows how to:
extract data with the fingertipsR package
produce a report with rmarkdown
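A good starting point for R Markdown is the Cheat Sheet. There are 3 parts to any R Markdown document:
a YAML header, which sets options such as the title and output format
text, written in markdown
code chunks, which contain the R code
In addition, R code can be run inside the text to produce figures and tables.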
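A minimal sketch of such a document is shown below, with each part marked in the comments. The R code writes a small R Markdown file to a temporary folder. The title, chunk contents and file name are illustrative assumptions rather than anything from the original report, and the built-in `cars` dataset stands in for real data.

```r
# Three backticks, built here so they can be embedded in the strings below
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document (illustrative content only)
rmd_lines <- c(
  "---",                                          # 1. YAML header
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "This is ordinary text written in markdown.",   # 2. narrative text
  "",
  paste0(fence, "{r example-plot, echo = FALSE}"), # 3. code chunk
  "plot(cars)",
  fence,
  "",
  "The dataset has `r nrow(cars)` rows."          # R code run inside the text
)

# Write the document to a temporary file
rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_file)

# To knit it (assumes the rmarkdown package and pandoc are installed):
# rmarkdown::render(rmd_file)
```

In practice the file would normally be written by hand in RStudio; generating it from R here only keeps the example self-contained.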
R uses additional packages[1] to perform some functions. These have to be loaded before they can be used. For this analysis we will use:
fingertipsR
ggplot2
dplyr
readr
govstyle
The last of these is a ggplot2 theme which complies with gov.uk colours and layouts.
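library(dplyr)    # data manipulation
library(ggplot2)  # plotting
## library(fingertipsR)  # only needed when downloading from the Fingertips API
library(readxl)
library(readr)    # read_csv()
# install govstyle from GitHub only if it is not already installed
if (!requireNamespace("govstyle", quietly = TRUE)) {
  devtools::install_github(repo = "ivyleavedtoadflax/govstyle")
}
library(govstyle) # theme_gov()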
To get the data we will use the fingertipsR package to extract data for teenage conceptions. We need to identify the indicator ID for the teenage conceptions data in Fingertips, and an area type code. We’ll use data for lower tier local authorities (LAs). With those two codes it’s straightforward to extract the data:
library(stringr)
# which indicator ID is teenage pregnancy?
# ind <- indicators()
# ind <- ind[str_detect(ind$IndicatorName, "Rate of conceptions per"),] ## identify relevant indicator ID
#
# areas <- area_types("district") ## Identify area type code
#
# df <- fingertips_data(IndicatorID = 20401,
#                       AreaTypeID = 101,
#                       ParentAreaTypeID = 6) ## download the dataset
We can check that we have the correct indicator:
# read a previously downloaded copy of the data instead of calling the API
df <- read_csv("~/Downloads/Teenage_pregnancy.zip")
df %>%
  glimpse()
## Observations: 6,568
## Variables: 21
## $ IndicatorID <int> 20401, 20401, 20...
## $ IndicatorName <chr> "Under 18s conce...
## $ ParentCode <chr> NA, NA, NA, NA, ...
## $ ParentName <chr> NA, NA, NA, NA, ...
## $ AreaCode <chr> "E92000001", "E9...
## $ AreaName <chr> "England", "Engl...
## $ AreaType <chr> "Country", "Coun...
## $ Sex <chr> "Female", "Femal...
## $ Age <chr> "<18 yrs", "<18 ...
## $ CategoryType <chr> NA, "General Pra...
## $ Category <chr> NA, "Most depriv...
## $ Timeperiod <int> 1998, 1998, 1998...
## $ Value <dbl> 46.64402, NA, NA...
## $ LowerCIlimit <dbl> 46.19409, NA, NA...
## $ UpperCIlimit <dbl> 47.09724, NA, NA...
## $ Count <int> 41089, NA, NA, N...
## $ Denominator <int> 880906, NA, NA, ...
## $ Valuenote <chr> NA, NA, NA, NA, ...
## $ RecentTrend <chr> NA, NA, NA, NA, ...
## $ ComparedtoEnglandvalueorpercentiles <chr> "Not compared", ...
## $ Comparedtosubnationalparentvalueorpercentiles <chr> "Not compared", ...
##unique(df$AreaName)
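The same check can be made more directly by listing the distinct indicator IDs and names in the data. The sketch below uses a small stand-in data frame with the same column names as the output above. The indicator name is a placeholder because the real one is truncated in that output, so swap in the real df when running it.

```r
library(dplyr)

# Stand-in for the downloaded data: same column names, placeholder values
df_example <- data.frame(
  IndicatorID   = c(20401L, 20401L, 20401L),
  IndicatorName = rep("placeholder indicator name", 3),
  AreaName      = c("England", "Cambridge", "Fenland")
)

# One row per indicator present in the data
indicators_present <- df_example %>%
  distinct(IndicatorID, IndicatorName)

# Stop early if anything other than the expected indicator is present
stopifnot(nrow(indicators_present) == 1,
          indicators_present$IndicatorID == 20401L)

indicators_present
```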
Next we can choose a single area and plot the trend. We’ll use England as an example: we filter the data to that area and, in this case, to the deprivation deciles.
We can plot the data with ggplot2 and apply the govstyle format.
# category type shared by the lines and their labels
decile_type <- "District & UA deprivation deciles in England (IMD2010)"
plot <- df %>%
  filter(AreaName == "England", !is.na(Value), CategoryType == decile_type) %>%
  ggplot(aes(Timeperiod, Value, colour = Category)) +
  geom_line(aes(group = Category)) +
  theme_gov() +
  expand_limits(y = c(0, 70), x = c(1990, 2015)) +
  labs(y = "Teenage pregnancy rate",
       x = "Year",
       title = "Trends in teenage pregnancy rate by deprivation decile\n1998-2014")
# label each line at its first year (1998) with the decile name
plot +
  geom_text(data = df %>% filter(Timeperiod == 1998, CategoryType == decile_type),
            aes(label = Category),
            size = 2, hjust = 1, vjust = 0, fontface = "bold")
Under 18 conception rates have fallen substantially since 1998, and the ‘gap’ between rates in the most and least deprived tenths of areas has fallen from 51.39 conceptions per 1,000 in 2008 to 29.48 in 2014.
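A gap of this kind can be computed from the data rather than typed by hand. The sketch below assumes one row per year and decile. The category labels and rates are made-up placeholders (the real labels are truncated in the output above), so it illustrates the calculation only and does not reproduce the figures quoted.

```r
library(dplyr)

# Made-up rates for two deciles in two years (placeholders, not Fingertips data)
deciles <- data.frame(
  Timeperiod = c(2000, 2000, 2001, 2001),
  Category   = c("most deprived", "least deprived",
                 "most deprived", "least deprived"),
  Value      = c(60, 20, 50, 25)
)

# Gap = rate in the most deprived decile minus rate in the least deprived
gap_by_year <- deciles %>%
  group_by(Timeperiod) %>%
  summarise(
    gap = Value[Category == "most deprived"] -
      Value[Category == "least deprived"]
  )

gap_by_year
```

In an R Markdown document the result can then be quoted with inline R code, so the sentence updates whenever the data change.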
Let us say we want to create the same plots for every area. This can be achieved with a for loop.
## Single area
df %>%
  filter(AreaName == "Cambridge", !is.na(Value), AreaType == "District & UA") %>%
  ggplot(aes(Timeperiod, Value)) +
  geom_line() +
  theme_gov() +
  expand_limits(y = c(0, 70), x = c(1996, 2015)) +
  labs(y = "Teenage pregnancy rate",
       x = "Year",
       title = paste0("Trends in teenage pregnancy rate\n1998-2015: ", "Cambridge")) +
  # label the first and last years with the rounded rate
  geom_text(data = df %>% filter(Timeperiod %in% c(1998, 2015), AreaName == "Cambridge"),
            aes(label = round(Value, 2)),
            size = 3, hjust = 0.5, vjust = 0, fontface = "bold")
### Multiple areas
## Example areas
areas <- c("Cambridge", "East Cambridgeshire", "Fenland", "Blackburn with Darwen")
for (area in areas) {
  # ggplot objects are not drawn automatically inside a loop, so print() each one
  print(
    df %>%
      filter(AreaName == area, !is.na(Value), AreaType == "District & UA") %>%
      ggplot(aes(Timeperiod, Value)) +
      geom_line() +
      theme_gov() +
      expand_limits(y = c(0, 70), x = c(1997, 2015)) +
      labs(y = "Teenage pregnancy rate",
           x = "Year",
           title = paste0("Trends in teenage pregnancy rate\n1998-2015: ", area)) +
      # label the first and last years with the rounded rate
      geom_text(data = df %>% filter(Timeperiod %in% c(1998, 2015), AreaName == area),
                aes(label = round(Value, 2)),
                size = 3, hjust = 0.5, vjust = 0, fontface = "bold") +
      geom_smooth(lwd = 0.5, lty = "dotted")
  )
}
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
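An alternative to writing the plotting code inside the loop is to wrap it in a function and apply that function to each area. The sketch below is one way to do this under some assumptions: plot_area_trend() is a hypothetical helper, the data are made-up placeholders, and theme_minimal() stands in for theme_gov() so that the example runs without govstyle.

```r
library(ggplot2)

# Made-up rates for two areas (placeholders, not Fingertips data)
trend <- data.frame(
  AreaName   = rep(c("Area A", "Area B"), each = 3),
  Timeperiod = rep(c(1998, 2006, 2014), times = 2),
  Value      = c(40, 35, 20, 50, 45, 30)
)

# Hypothetical helper: one trend plot for one area
plot_area_trend <- function(data, area) {
  area_data <- data[data$AreaName == area & !is.na(data$Value), ]
  ggplot(area_data, aes(Timeperiod, Value)) +
    geom_line() +
    theme_minimal() +
    expand_limits(y = 0) +
    labs(y = "Teenage pregnancy rate",
         x = "Year",
         title = paste0("Trends in teenage pregnancy rate: ", area))
}

# Build one plot per area and keep them in a named list
areas <- unique(trend$AreaName)
plots <- lapply(areas, plot_area_trend, data = trend)
names(plots) <- areas

# As in the loop above, each plot has to be printed explicitly
for (p in plots) print(p)
```

Keeping the plots in a named list means a single area can be pulled out later, for example plots[["Area A"]], without rerunning the loop.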
[1] A package is a set of functions for a specific purpose.