We are going to look at house prices per square metre in England and Wales, using data from ONS.

This code illustrates a few commonly used techniques:

First step is to load the required packages.

library(tidyverse)
library(readxl)
library(ggridges)
library(viridis)

Next we create a data folder in our project directory, where we are going to save the data we download.

dir.create("data")

Next download the Excel file from ONS and save it in the data directory. NB, make sure you include mode = "wb" or the file won’t be read properly.

download.file(url="https://www.ons.gov.uk/file?uri=/economy/inflationandpriceindices/datasets/housepricepersquaremetreandhousepriceperroomenglandandwales/2004to2016/priceperareadata.xls", destfile="data/priceperareadata.xls", mode = "wb")

Now we need to read the Excel data into R. There are a number of ways of doing this but as the tidyverse package already lincludes the readxl library let’s use that. First we get the list of sheet names

excel_sheets("data/priceperareadata.xls")
##  [1] "Content" "Table1"  "Table2"  "Table3"  "Table4"  "Table5"  "Table6" 
##  [8] "Table7"  "Table8"  "Table9"  "Table10" "Table11" "Table12"

Then we import the sheet we want.

prices <- read_excel("data/priceperareadata.xls", sheet="Table10")

Take a look at what you imported using:

head(prices)
## # A tibble: 6 x 17
##   `Table 10: House ~ X__1  X__2  X__3   X__4  X__5  X__6  X__7  X__8  X__9
##   <chr>              <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA>               <NA>  <NA>  <NA>    NA    NA    NA    NA    NA    NA 
## 2 Region code        Regi~ LA c~ LA n~ 2004  2005  2006  2007  2008  2009 
## 3 E12000001          Nort~ E060~ Hart~  770.  962. 1101. 1203. 1141. 1059.
## 4 E12000001          Nort~ E060~ Midd~  781.  971. 1127. 1211. 1172. 1167.
## 5 E12000001          Nort~ E060~ Redc~  951. 1121. 1209. 1266. 1242. 1202.
## 6 E12000001          Nort~ E060~ Stoc~ 1013. 1163. 1266. 1357. 1292. 1250.
## # ... with 7 more variables: X__10 <dbl>, X__11 <dbl>, X__12 <dbl>,
## #   X__13 <dbl>, X__14 <dbl>, X__15 <dbl>, X__16 <dbl>

The problem here is that there are two rows in the Excel file above the table we want. So let’s import the data again starting at cell A3 - though we need to tell it to start at A4 in order for it to pick up the column headers.

prices <- read_excel("data/priceperareadata.xls", sheet="Table10",
                     range=cell_limits(c(4,1),c(NA,NA)))

Check what you imported.

head(prices)
## # A tibble: 6 x 17
##   `Region code` `Region name` `LA code` `LA name`     `2004` `2005` `2006`
##   <chr>         <chr>         <chr>     <chr>          <dbl>  <dbl>  <dbl>
## 1 E12000001     North East    E06000001 Hartlepool      770.   962.  1101.
## 2 E12000001     North East    E06000002 Middlesbrough   781.   971.  1127.
## 3 E12000001     North East    E06000003 Redcar and C~   951.  1121.  1209.
## 4 E12000001     North East    E06000004 Stockton-on-~  1013.  1163.  1266.
## 5 E12000001     North East    E06000005 Darlington     1078.  1186.  1291.
## 6 E12000001     North East    E06000047 County Durham   894.  1051.  1141.
## # ... with 10 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>

Note that the data is in ‘wide’ format, with the data for each year split into separate columns. In order to analyse it more easily we’d like to convert this into ‘long’ format, with one column for the year variable and one for the price variable. For this we use the gather function from the tidyr package, preloaded with the tidyverse.

prices <-  gather(data=prices, 
                  key="year", # The new variable you're creating
                  value="ppsqm", # The new value field
                  "2004":"2016") # The columns to gather together

Check what the first several rows of your data looks like now.

head(prices)
## # A tibble: 6 x 6
##   `Region code` `Region name` `LA code` `LA name`            year  ppsqm
##   <chr>         <chr>         <chr>     <chr>                <chr> <dbl>
## 1 E12000001     North East    E06000001 Hartlepool           2004   770.
## 2 E12000001     North East    E06000002 Middlesbrough        2004   781.
## 3 E12000001     North East    E06000003 Redcar and Cleveland 2004   951.
## 4 E12000001     North East    E06000004 Stockton-on-Tees     2004  1013.
## 5 E12000001     North East    E06000005 Darlington           2004  1078.
## 6 E12000001     North East    E06000047 County Durham        2004   894.

ggplot2 won’t work well with any column names that have spaces. So use make.names to turn your column names into something usable.

names(prices) <- make.names(names(prices))

Let’s also replace ‘Yorkshire and the Humber’ with something shorter.

prices <- prices %>% 
  mutate(Region.name=replace(Region.name, Region.name=="Yorkshire and The Humber",
                             "Yorks & Humber")) %>%
  as.data.frame()

Now for some plotting. First, let’s try to make a boxplot for 2016.

p <- prices %>%
  filter(year==2016) %>%
  ggplot(mapping=aes(x=Region.name, y=ppsqm))
p + geom_boxplot()

A better way to shows these distributions might be an ordered ridgeplot with a log scale

p1 <- prices %>%
  filter(year==2016) %>%
  ggplot(mapping=aes(x=ppsqm,
                y=reorder(Region.name,ppsqm)))
p1 <- p1 + geom_density_ridges() + labs(y=NULL) + scale_x_log10()
p1
## Picking joint bandwidth of 0.0438

We can put two ridges on the same line by adding a fill aesthetic to geom_density_ridges

p2 <- prices %>%
  filter(year==2004 | year==2016) %>%
  ggplot(mapping=aes(x=ppsqm, y=reorder(Region.name,ppsqm)))
p2 <- p2 + geom_density_ridges(mapping=aes(fill=year)) + labs(y=NULL) + scale_x_log10()
p2
## Picking joint bandwidth of 0.0416

Let’s prettify p1 a bit.

p1 + 
  labs(title="Price per square metre by region",
       caption="Data from ONS",
       x="Price per square metre (log)",
       y=NULL) +
  theme_ridges(font_size=11)
## Picking joint bandwidth of 0.0438

Now let’s add some colour (using tips from Len Kiefer)

p1 <- prices %>%
  filter(year==2016) %>%
  ggplot(mapping=aes(x=ppsqm,
                y=reorder(Region.name,ppsqm),
                fill=..x..))
p1 + geom_density_ridges_gradient(rel_min_height=0.01) + 
  scale_fill_viridis() +
  guides(fill=F) +
  scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=",")) +
  labs(title="Local authority average house price per square metre by region, 2016",
       subtitle="Data from ONS",
       x="Price per square metre (log)",
       y=NULL) + 
  theme_ridges(font_size = 11)
## Picking joint bandwidth of 0.0438

Here’s a version that uses transparency, which can’t be combined with a gradient fill.

p <- prices %>%
  filter(year==2016) %>%
  ggplot(mapping=aes(x=ppsqm,
                y=reorder(Region.name,ppsqm),
                fill=reorder(Region.name,ppsqm)))
p + geom_density_ridges(alpha=0.75) + 
  guides(fill=F) +
  scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=",")) +
  labs(title="Local authority average house price per square metre by region, 2016",
       subtitle="Data from ONS",
       x="Price per square metre (log)",
       y=NULL) + 
  theme_ridges(font_size = 11)
## Picking joint bandwidth of 0.0438

Let’s try the two-ridge p2 with transparency.

p <- prices %>%
  filter(year==2004 | year==2016) %>%
  ggplot(mapping=aes(x=ppsqm,
                y=reorder(Region.name,ppsqm)))
p + geom_density_ridges(aes(fill=year, scale=1.5), alpha=0.75,
                         rel_min_height=0.01) + 
  scale_x_log10(labels=scales::dollar_format(prefix="£", big.mark=","),
                breaks=c(500,1000,2500,10000),
                expand=c(0.01,0)) +
  scale_y_discrete(expand=c(0,0)) +
  scale_fill_manual(name="Year", values=c("#4477AA","#CCBB44")) +
  labs(title="Distribution of local authority average house prices per square \nmetre by region, 2004 and 2016",
       x="Price per square metre (log)",
       y=NULL) + 
  theme(text=element_text(family="Foundry Form Sans", size = 11),
        plot.title = element_text(face = "bold", size=12, hjust=0.5),
        axis.text.x = element_text(size=11, colour="black"),
        axis.text.y = element_text(vjust=-0.5, size = 11, colour="black"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour="Gray40"),
        panel.background = element_rect(fill="white"),
        axis.line.x = element_line(colour="Gray40"),
        legend.position = "top",
        legend.spacing.y=unit(1,"cm"),
        legend.text = element_text(size=11, colour="black"))
## Picking joint bandwidth of 0.0416