A Practical Example

Initial Notes

Interested in Crime Data From Houston, Texas
http://www.houstontx.gov/police/cs/index.htm
Potential for Space/Time Investigation
Real World Issues to journal
- Use R to obtain data
- Use R to clean data - deal with junk
- Use R to tidy data - put in a useful format
Then use R to explore

Open Data

The five star approach (Tim Berners-Lee)
- ✷ Data is accessible on the web and is under an open licence. It is readable by the human eye but not by software.
- ✷✷ Data is accessible on the web in a structured, machine-readable format, but this is propriety.
- ✷✷✷ Data is accessible on the web in an open structured, machine-readable format.
- ✷✷✷✷ A URI exists for the data (ie a location on the web allowing direct access to the data)
- ✷✷✷✷✷ URI’s for related data are provided
Conditions/stars are cumulative

How many stars for Houston?

TL/DR - ✷✷
Why
- ‘You must have Microsoft Access or Microsoft Excel installed on your computer to view these files.’ (sic)
- Not open data format
Although: URI can be deduced (ie hacked) but with oddities
Other issues
- One column disappears from data at one point
- Column names change over period data is available
- Other irregulaties mean data harvesting is very ‘sellotape and string’-like

In other words…

… but that’s how things often are in reality

Some code to pull together a data set:

# First load some packages
library(dplyr) # Pipeline manipulation etc
library(Hmisc) # Reading in .mdb files


crimes <- NULL # Location for results

stems <- array(outer(c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"),
                     c("10","11","12","13"),paste0)) # 'stem' for filenames in URI

for (stem in stems) {
  holder <- tempfile(fileext='.mdb') # Create a temporary file name ending in mdb
  download.file(paste0('http://www.houstontx.gov/police/cs/mdbs/',stem,'.mdb'),
                holder) # Download one months data into this file
  df <- mdb.get(holder) # Read this temporary mdb file into a data frame
  colnames(df) <- c("date","hour","offense","beat","premise",
                     "block_range","street","type","suffix","n_events") # Standardise column names
  if (is.null(crimes)) { # Do this for first month - create the data frame
    crimes<- df
  } else { # Do this for the rest - append latest data
  crimes <- rbind(crimes,df[,1:10])
  }
}

crimes <- as.tbl(crimes) # Convert to more modern tbl form - useful later

First temporal issue - proper dates:

Ones you can do useful things with
- eg days between dates, day of week, extract month, year etc.
- Rather than just characters "29-01-1977" or a factor
Currently they are just factors:

crimes %>% head(n=5)

# A tibble: 5 × 10
               date  hour offense   beat premise block_range     street
             <fctr> <chr>  <fctr> <fctr>  <fctr>      <fctr>     <fctr>
1 01/08/10 00:00:00    22  Murder  11H30     20R   3000-3099   BROADWAY
2 01/17/10 00:00:00    18  Murder  10H50     18C   2800-2899    MCGOWEN
3 01/01/10 00:00:00     0  Murder  15E30     18A   9600-9699    MARLIVE
4 01/09/10 00:00:00    14  Murder   5F20     13R   9000-9099 LONG POINT
5 01/18/10 00:00:00    14  Murder  10H40     13R   4300-4399    GARROTT
# ... with 3 more variables: type <fctr>, suffix <fctr>, n_events <chr>

`lubridate` for dates

Turns characters into dates
- Turns factors into dates if you use as.character first:

library(lubridate)
crimes %>% 
  transmute(date=mdy_hms(as.character(date)),day=wday(date,label=TRUE),d_o_y=yday(date)) %>%
  head(n=4)

# A tibble: 4 × 3
        date   day d_o_y
      <dttm> <ord> <dbl>
1 2010-01-08   Fri     8
2 2010-01-17   Sun    17
3 2010-01-01   Fri     1
4 2010-01-09   Sat     9

Can also combine with hours column…

crimes %>% transmute(
    date=mdy_hms(as.character(date)) + hours(as.numeric(as.character(hour)))) %>%
  head(n=4)

Warning in period(hour = x): NAs introduced by coercion

# A tibble: 4 × 1
                 date
               <dttm>
1 2010-01-08 22:00:00
2 2010-01-17 18:00:00
3 2010-01-01 00:00:00
4 2010-01-09 14:00:00

Comments:
- hour is a character variable, is made numeric before using hours
- There are some NA values for hour - time of event not known
- hours function issues a warning in that case

Thus, data can be tidied…

Tidy up the date format

crimes %>% mutate(date=mdy_hms(as.character(date)) + hours(as.numeric(as.character(hour)))) -> crimes

Get rid of variables we won’t use (select) and NA values of date (filter)

crimes %>% select(-hour,-beat,-premise,-block_range,-suffix,-n_events) %>% filter(!is.na(date)) -> crimes

Make street and type a single character variable

crimes %>% mutate_at(vars(street,type),as.character) %>% 
  mutate(addr=paste(street,type)) %>% select(-street,-type) -> crimes
crimes %>% head(n=3)

# A tibble: 3 × 3
                 date offense        addr
               <dttm>  <fctr>       <chr>
1 2010-01-08 22:00:00  Murder BROADWAY ST
2 2010-01-17 18:00:00  Murder  MCGOWEN ST
3 2010-01-01 00:00:00  Murder  MARLIVE LN

Finally, some simple exploration…

library(ggplot2)
crimes %>%  ggplot(aes(x=offense)) + geom_bar()

Notes

ggplot2 - a library of plot routines based on The Grammar of Graphics
- Leland Wilkinson (Springer 2005) ISBN: 978-0-387-24544-7
- Very expensive, see http://ggplot2.org/resources/2007-vanderbilt.pdf
In summary
- ggplot creates the ‘framework’ for a plot
- aes - aesthetics, mapping input data to plot characteristics
- geom - a geometry turns characteristics into visual entity
- … possibly transforming data via a stat - here it counts the number of observations in each x category (default for geom_bar, so nothing shown in code)

Room for Improvement?

Nicer axes
- Default is variable names
- Default format for axes is exponential notation 1e+05 not 100,000
Order of bars - easier to read when counts are sorted
- This depends on order of labelling of a factor
- Default is alphabetical
- Alphabetical ordering in dictionaries is great
- Alphabetical ordering in graphics is often dreadful

Some things are quite easy to fix

crimes %>%  ggplot(aes(x=offense)) + geom_bar() + labs(x='Offense Type',y="Number of Offenses (1000's)") + 
  scale_y_continuous(breaks=c(0,100000,200000,300000),labels=c(0,100,200,300))

Notes

labs re-specifies axes labels
scale_y_continuous re-specifies points on the y axis where values are marked.
- breaks - where the labels go
- labels what is written there
Here ‘clutter’ is reduced using labs in conjunction with scale_y_continuous
- Axis title tells you that there are multiples of 1000
- The labels at the value points are then just multiples of 100.

Other things more complicated but possible: bar ordering

Here we use the forcats package
‘For categories’ - tools for factor variables
- Factor order in a variable is arbitrary
- Default is alphabetical - not great for graphics
Function fct_infreq reorders factors by frequency of occurrence
- Use this as a further ‘tidying’ operationon crimes

library(forcats)
crimes %>% mutate(offense = fct_infreq(offense)) -> crimes

Improved plot

crimes %>%  ggplot(aes(x=offense)) + geom_bar() + labs(x='Offense Type',y="Number of Offenses (1000's)") + 
  scale_y_continuous(breaks=c(0,100000,200000,300000),labels=c(0,100,200,300))

Storing sufficiently-specified plots for later use:

crimes %>%  ggplot(aes(x=offense)) + geom_bar() -> cplot # A plot object not yet drawn
cplot + labs(x='Offense Type',y="Number of Offenses (1000's)") + 
  scale_y_continuous(breaks=c(0,100000,200000,300000),labels=c(0,100,200,300)) + theme_light() # Clean up/new theme

More themes

library(ggthemes)
cplot + labs(x='Offense Type',y="Number of Offenses (1000's)") + 
  scale_y_continuous(breaks=c(0,100000,200000,300000),labels=c(0,100,200,300)) + theme_economist() # Re-use cplot

Back to time

crimes %>%  ggplot(aes(x=hour(date))) + geom_bar()  + labs(x='Time of Day',y="Number of Offenses") + 
  scale_x_continuous(breaks=c(3,9,12,15,21),labels=c("3am","9am","Noon","3pm","9pm")) -> crime_time
crime_time

Facets

crime_time + facet_wrap(~offense,scale='free')

Facets 2 - overwrite axis format

crime_time + facet_wrap(~offense,scale='free') +
  scale_x_continuous(breaks=c(3,9,15,21),labels=c("3am","9am","3pm","9pm"))

Observations/Notes

Wow factor vs, Ah-ha! factor
- Impressive looking graphics vs.
- Visual revelations
Remain Critical of Data
- Why do some many offenses have a mid day spike?
- Because offenses commited at mid-day?
- Some thing to do with data recording?
You are visualising entire process
- Event happens (?)
- Someone chooses to report it (?)
- Someone chooses to record it (?)
- Someone records it correctly (?)

Focus on murder …

crimes %>% filter(offense=="Murder") %>% mutate(day=wday(date,label=TRUE)) %>% 
  filter(date %within% interval(dmy("01012010"),dmy("01012014")))  -> murders
murders %>% ggplot(aes(x=day)) + geom_bar() + labs(y='Number of Murders',x='Day of Week')

Time of day again…

murders %>% ggplot(aes(x=hour(date))) + geom_bar() + labs(y='Number of Murders',x='Time of Day') + 
  scale_x_continuous(breaks=c(3,9,15,21),labels=c("3am","9am","3pm","9pm")) -> murder_time
murder_time

A weekly pattern?

murder_time + facet_wrap(~day)

Cyclic Coordinates

murder_time + facet_wrap(~day,nrow=2) + coord_polar(start=-pi/24)

Aside - Cyclic Heresy

Pie charts considered harmful!

crimes %>%  ggplot(aes(x=factor(1),fill=offense)) + geom_bar(width=1)  + scale_x_discrete(breaks=NULL) + 
  scale_y_continuous(breaks=seq(100000,600000,by=100000),labels=paste0(1:6,"00,000")) + 
  coord_polar(theta="y") + labs(x='',y='',fill="Offense Type") + scale_fill_brewer(type='qual')

Why are pie charts not so great?

Hard to compare values - People less good at comparing angles to height
Creative Commons Attribution 1.0, Link to source item

Geographical Aspects of Crime Pattern

First task is to Geocode the address data
- Can do this with tmap’s geocode_OSM function
- Take an address, attempt to identify longitude and latitude

crimes  %>% filter(offense=="Murder") %>% transmute(addr=paste0(addr,', Houston, TX')) -> addresses
addresses$addr %>% geocode_OSM -> crime_loc

OpenStreetMap geocoding service - Nominatim.
For large data sets see usage policy
- http://wiki.openstreetmap.org/wiki/Nominatim_usage_policy

head(crime_loc,n=3)

                     query      lat       lon  lat_min  lat_max   lon_min
1 BROADWAY ST, Houston, TX 29.69464 -95.27777 29.68736 29.70153 -95.27795
2  MCGOWEN ST, Houston, TX 29.74280 -95.36977 29.74071 29.74536 -95.37403
3  MARLIVE LN, Houston, TX 29.67905 -95.43738 29.67694 29.67905 -95.43738
    lon_max
1 -95.27759
2 -95.36631
3 -95.43736

Link the referenced data back to the murders

library(sp)
crime_loc %>% group_by(query) %>% summarise(lat=first(lat),lon=first(lon)) %>%
  transmute(addr=query,lat,lon) -> crime_loc2
crimes %>% mutate(addr=paste0(addr,', Houston, TX')) %>% filter(offense=="Murder") %>%
  left_join(crime_loc2) %>% filter(!is.na(lon)) -> murders
SpatialPointsDataFrame(murders[,c("lon","lat")],murders) -> murders

First, create a unique look-up for location coordinates
Select the murders from the other crimes
Join the locations to the full crime description
Make it a SpatialPointsDataFrame (like a shapefile)

A first look…

library(tmap)
tmap_mode('view')
murders %>% qtm

Does Day of Event Have a Geography

Could use faceted mapping in standard mode
But this works better in ‘view’ (interactive mode)
So we create a layer for each day and then make a plot…

# Compute the day of the week for the murder
murders@data %>% mutate(day=wday(date,label=TRUE)) -> murders@data
# Create the sub-layers.  Easiset via a new R function
day_layer <- function(spdf,day) spdf[spdf$day==day,]
murders %>% day_layer("Sun") -> sun_murders
murders %>% day_layer("Mon") -> mon_murders
murders %>% day_layer("Tues") -> tue_murders
murders %>% day_layer("Wed") -> wed_murders
murders %>% day_layer("Thurs") -> thu_murders
murders %>% day_layer("Fri") -> fri_murders
murders %>% day_layer("Sat") -> sat_murders

Data is ready - produce the map!

tm_shape(sun_murders) + tm_dots(col='red') + tm_shape(mon_murders) + tm_dots(col='dodgerblue') + 
  tm_shape(tue_murders) + tm_dots(col='dodgerblue') + tm_shape(wed_murders) + tm_dots(col='dodgerblue') +
  tm_shape(thu_murders) + tm_dots(col='dodgerblue') + tm_shape(fri_murders) + tm_dots(col='dodgerblue') +
  tm_shape(sat_murders) + tm_dots(col='red')

What about time of day?

day_point <- function(x) {
  temp1 <- cut(hour(x),c(0,3,9,15,21,24))
  temp1 <- recode(temp1, `(0,3]`='9pm-3am',`(3,9]`='3am-9am',`(9,15]`='9am-3pm',
                  `(15,21]`='3pm-9pm',`(21,24]`='9pm-3am')
  return(factor(temp1))
}

murders@data %>% mutate(dpoint = day_point(date)) -> murders@data
tm_shape(murders) + tm_dots(col='dpoint',title='Time of Day')

Looking at Crime Data

Chris Brunsdon

A Practical Example

Initial Notes

Open Data

How many stars for Houston?

In other words…

Some code to pull together a data set:

First temporal issue - proper dates:

`lubridate` for dates

Can also combine with hours column…

Thus, data can be tidied…

Finally, some simple exploration…

Notes

Room for Improvement?

Some things are quite easy to fix

Notes

Other things more complicated but possible: bar ordering

Improved plot

Storing sufficiently-specified plots for later use:

More themes

Back to time

Facets

Facets 2 - overwrite axis format

Observations/Notes

Focus on murder …

Time of day again…

A weekly pattern?

Cyclic Coordinates

Aside - Cyclic Heresy

Why are pie charts not so great?

Geographical Aspects of Crime Pattern

Link the referenced data back to the murders

A first look…

Does Day of Event Have a Geography

Data is ready - produce the map!

What about time of day?

Looking at Crime Data

Chris Brunsdon

A Practical Example

Initial Notes

Open Data

How many stars for Houston?

In other words…

Some code to pull together a data set:

First temporal issue - proper dates:

lubridate for dates

Can also combine with hours column…

Thus, data can be tidied…

Finally, some simple exploration…

Notes

Room for Improvement?

Some things are quite easy to fix

Notes

Other things more complicated but possible: bar ordering

Improved plot

Storing sufficiently-specified plots for later use:

More themes

Back to time

Facets

Facets 2 - overwrite axis format

Observations/Notes

Focus on murder …

Time of day again…

A weekly pattern?

Cyclic Coordinates

Aside - Cyclic Heresy

Why are pie charts not so great?

Geographical Aspects of Crime Pattern

Link the referenced data back to the murders

A first look…

Does Day of Event Have a Geography

Data is ready - produce the map!

What about time of day?

`lubridate` for dates