We are going to look at house prices per square metre in England and Wales, using data from ONS.
This code illustrates a few commonly used techniques:
make.names to make variable names ggplot-friendlyFirst step is to load the required packages.
library(tidyverse)
library(readxl)
library(ggridges)
library(viridis)
Next we create a data folder in our project directory, where we are going to save the data we download.
dir.create("data")
Next download the Excel file from ONS and save it in the data directory. NB, make sure you include mode = "wb" or the file won’t be read properly.
download.file(url="https://www.ons.gov.uk/file?uri=/economy/inflationandpriceindices/datasets/housepricepersquaremetreandhousepriceperroomenglandandwales/2004to2016/priceperareadata.xls", destfile="data/priceperareadata.xls", mode = "wb")
Now we need to read the Excel data into R. There are a number of ways of doing this but as the tidyverse package already lincludes the readxl library let’s use that. First we get the list of sheet names
excel_sheets("data/priceperareadata.xls")
## [1] "Content" "Table1" "Table2" "Table3" "Table4" "Table5" "Table6"
## [8] "Table7" "Table8" "Table9" "Table10" "Table11" "Table12"
Then we import the sheet we want.
prices <- read_excel("data/priceperareadata.xls", sheet="Table10")
Take a look at what you imported using:
head(prices)
## # A tibble: 6 x 17
## `Table 10: House ~ X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> <NA> <NA> <NA> NA NA NA NA NA NA
## 2 Region code Regi~ LA c~ LA n~ 2004 2005 2006 2007 2008 2009
## 3 E12000001 Nort~ E060~ Hart~ 770. 962. 1101. 1203. 1141. 1059.
## 4 E12000001 Nort~ E060~ Midd~ 781. 971. 1127. 1211. 1172. 1167.
## 5 E12000001 Nort~ E060~ Redc~ 951. 1121. 1209. 1266. 1242. 1202.
## 6 E12000001 Nort~ E060~ Stoc~ 1013. 1163. 1266. 1357. 1292. 1250.
## # ... with 7 more variables: X__10 <dbl>, X__11 <dbl>, X__12 <dbl>,
## # X__13 <dbl>, X__14 <dbl>, X__15 <dbl>, X__16 <dbl>
The problem here is that there are two rows in the Excel file above the table we want. So let’s import the data again starting at cell A3 - though we need to tell it to start at A4 in order for it to pick up the column headers.
prices <- read_excel("data/priceperareadata.xls", sheet="Table10",
range=cell_limits(c(4,1),c(NA,NA)))
Check what you imported.
head(prices)
## # A tibble: 6 x 17
## `Region code` `Region name` `LA code` `LA name` `2004` `2005` `2006`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 E12000001 North East E06000001 Hartlepool 770. 962. 1101.
## 2 E12000001 North East E06000002 Middlesbrough 781. 971. 1127.
## 3 E12000001 North East E06000003 Redcar and C~ 951. 1121. 1209.
## 4 E12000001 North East E06000004 Stockton-on-~ 1013. 1163. 1266.
## 5 E12000001 North East E06000005 Darlington 1078. 1186. 1291.
## 6 E12000001 North East E06000047 County Durham 894. 1051. 1141.
## # ... with 10 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>
Note that the data is in ‘wide’ format, with the data for each year split into separate columns. In order to analyse it more easily we’d like to convert this into ‘long’ format, with one column for the year variable and one for the price variable. For this we use the gather function from the tidyr package, preloaded with the tidyverse.
prices <- gather(data=prices,
key="year", # The new variable you're creating
value="ppsqm", # The new value field
"2004":"2016") # The columns to gather together
Check what the first several rows of your data looks like now.
head(prices)
## # A tibble: 6 x 6
## `Region code` `Region name` `LA code` `LA name` year ppsqm
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 E12000001 North East E06000001 Hartlepool 2004 770.
## 2 E12000001 North East E06000002 Middlesbrough 2004 781.
## 3 E12000001 North East E06000003 Redcar and Cleveland 2004 951.
## 4 E12000001 North East E06000004 Stockton-on-Tees 2004 1013.
## 5 E12000001 North East E06000005 Darlington 2004 1078.
## 6 E12000001 North East E06000047 County Durham 2004 894.
ggplot2 won’t work well with any column names that have spaces. So use make.names to turn your column names into something usable.
names(prices) <- make.names(names(prices))
Let’s also replace ‘Yorkshire and the Humber’ with something shorter.
prices <- prices %>%
mutate(Region.name=replace(Region.name, Region.name=="Yorkshire and The Humber",
"Yorks & Humber")) %>%
as.data.frame()
Now for some plotting. First, let’s try to make a boxplot for 2016.
p <- prices %>%
filter(year==2016) %>%
ggplot(mapping=aes(x=Region.name, y=ppsqm))
p + geom_boxplot()
A better way to shows these distributions might be an ordered ridgeplot with a log scale
p1 <- prices %>%
filter(year==2016) %>%
ggplot(mapping=aes(x=ppsqm,
y=reorder(Region.name,ppsqm)))
p1 <- p1 + geom_density_ridges() + labs(y=NULL) + scale_x_log10()
p1
## Picking joint bandwidth of 0.0438
We can put two ridges on the same line by adding a fill aesthetic to geom_density_ridges
p2 <- prices %>%
filter(year==2004 | year==2016) %>%
ggplot(mapping=aes(x=ppsqm, y=reorder(Region.name,ppsqm)))
p2 <- p2 + geom_density_ridges(mapping=aes(fill=year)) + labs(y=NULL) + scale_x_log10()
p2
## Picking joint bandwidth of 0.0416
Let’s prettify p1 a bit.
p1 +
labs(title="Price per square metre by region",
caption="Data from ONS",
x="Price per square metre (log)",
y=NULL) +
theme_ridges(font_size=11)
## Picking joint bandwidth of 0.0438
Now let’s add some colour (using tips from Len Kiefer)
p1 <- prices %>%
filter(year==2016) %>%
ggplot(mapping=aes(x=ppsqm,
y=reorder(Region.name,ppsqm),
fill=..x..))
p1 + geom_density_ridges_gradient(rel_min_height=0.01) +
scale_fill_viridis() +
guides(fill=F) +
scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=",")) +
labs(title="Local authority average house price per square metre by region, 2016",
subtitle="Data from ONS",
x="Price per square metre (log)",
y=NULL) +
theme_ridges(font_size = 11)
## Picking joint bandwidth of 0.0438
Here’s a version that uses transparency, which can’t be combined with a gradient fill.
p <- prices %>%
filter(year==2016) %>%
ggplot(mapping=aes(x=ppsqm,
y=reorder(Region.name,ppsqm),
fill=reorder(Region.name,ppsqm)))
p + geom_density_ridges(alpha=0.75) +
guides(fill=F) +
scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=",")) +
labs(title="Local authority average house price per square metre by region, 2016",
subtitle="Data from ONS",
x="Price per square metre (log)",
y=NULL) +
theme_ridges(font_size = 11)
## Picking joint bandwidth of 0.0438
Let’s try the two-ridge p2 with transparency.
p <- prices %>%
filter(year==2004 | year==2016) %>%
ggplot(mapping=aes(x=ppsqm,
y=reorder(Region.name,ppsqm)))
p + geom_density_ridges(aes(fill=year, scale=1.5), alpha=0.75,
rel_min_height=0.01) +
scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=","),
breaks=c(500,1000,2500,10000),
expand=c(0.01,0)) +
scale_y_discrete(expand=c(0,0)) +
scale_fill_manual(name="Year", values=c("#4477AA","#CCBB44")) +
labs(title="Distribution of local authority average house prices per square \nmetre by region, 2004 and 2016",
x="Price per square metre (log)",
y=NULL) +
theme(text=element_text(family="Foundry Form Sans", size = 11),
plot.title = element_text(face = "bold", size=12, hjust=0.5),
axis.text.x = element_text(size=11, colour="black"),
axis.text.y = element_text(vjust=-0.5, size = 11, colour="black"),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour="Gray40"),
panel.background = element_rect(fill="white"),
axis.line.x = element_line(colour="Gray40"),
legend.position = "top",
legend.spacing.y=unit(1,"cm"),
legend.text = element_text(size=11, colour="black"))
## Picking joint bandwidth of 0.0416