Maps and Time Series

Heike Hofmann
Stat 579, Fall 2013

Exam

We cannot quite discuss the exam yet, since a few exams still need to be written; if you check into the class lectures on Thursday, there will be some discussion.

A bit of motivation and picker-upper

Outline

  • Melting
  • Maps
  • Time Series

Warmup

  • Start R and load data ‘fbi’ from http://www.hofroe.net/stat579/crimes-2012.csv

  • This data set contains the number of crimes by type for each state in the U.S. from 1960 to 2012.

  • Exclude data for the whole of the U.S.

  • Investigate which states have the highest number of crimes (almost independently of type)

  • Plot a scatterplot of population size against the number of violent crimes. What is your conclusion?

fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
head(fbi)
                                           Source Year
1 http://www.disastercenter.com/crime/alcrime.htm 1960
2 http://www.disastercenter.com/crime/alcrime.htm 1961
3 http://www.disastercenter.com/crime/alcrime.htm 1962
4 http://www.disastercenter.com/crime/alcrime.htm 1963
5 http://www.disastercenter.com/crime/alcrime.htm 1964
6 http://www.disastercenter.com/crime/alcrime.htm 1965
  Population Violent Property Murder Forcible.Rape Robbery
1    3266740    6097    33823    406           281     898
2    3302000    5564    32541    427           252     630
3    3358000    5283    35829    316           218     754
4    3347000    6115    38521    340           192     828
5    3407000    7260    46290    316           397     992
6    3462000    6916    48215    395           367     992
  Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft
1               4512    11626         19344          2853
2               4255    11205         18801          2535
3               3995    11722         21306          2801
4               4755    12614         22874          3033
5               5555    15898         26713          3679
6               5162    16398         28115          3702
  abbr   state
1   AL Alabama
2   AL Alabama
3   AL Alabama
4   AL Alabama
5   AL Alabama
6   AL Alabama
library(ggplot2)
qplot(Violent, Property, data=subset(fbi, Year==2012)) + geom_text(aes(label=abbr), data=subset(fbi, (Violent > 50000) & (Year == 2012)), hjust=1.25)

Data Format

What we are really interested in with this data set are crime rates rather than raw counts.

It would be quite tedious (and inconsistent with the DRY principle, "don't repeat yourself") to convert each type of crime to a rate separately.

Instead, we will use reshape again.

Melting

Two step process:

  • melt: get data into a “convenient” shape, i.e. one that is particularly flexible

  • cast data into new shape(s) that are better suited for analysis

  • melt.data.frame(data, id.vars, measure.vars, na.rm = F, ...)
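The two steps can be sketched on a toy example. The data frame wide and all of its numbers are made up for illustration; only melt and cast come from the reshape package.

```r
library(reshape)

# Hypothetical wide data: one row per state and year, one column per crime type
wide <- data.frame(
  state      = c("A", "A", "B", "B"),
  Year       = c(2011, 2012, 2011, 2012),
  Population = c(1000, 1100, 2000, 2100),
  Violent    = c(10, 12, 30, 28),
  Property   = c(100, 90, 250, 260)
)

# melt: stack the measured columns into variable/value pairs
molten <- melt(wide, id.vars = c("state", "Year", "Population"),
               measure.vars = c("Violent", "Property"))

# cast: reshape the molten data, here total offenses per state and crime type
totals <- cast(molten, state ~ variable, fun.aggregate = sum)

# In molten form, a single line converts every crime type to a rate
molten$rate <- molten$value / molten$Population * 50000
```

Each row of molten holds one measurement, so any computation on value applies to all crime types at once.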

In our dataset ...

library(reshape)
fbi.melt <- melt(fbi, id.vars=c("Source","state","abbr","Year", "Population"), measure.vars=4:12)
head(fbi.melt[,-1])
    state abbr Year Population variable value
1 Alabama   AL 1960    3266740  Violent  6097
2 Alabama   AL 1961    3302000  Violent  5564
3 Alabama   AL 1962    3358000  Violent  5283
4 Alabama   AL 1963    3347000  Violent  6115
5 Alabama   AL 1964    3407000  Violent  7260
6 Alabama   AL 1965    3462000  Violent  6916
summary(fbi.melt[,-1])
        state            abbr            Year     
 Alabama   :  477   AK     :  477   Min.   :1960  
 Alaska    :  477   AL     :  477   1st Qu.:1973  
 Arizona   :  477   AR     :  477   Median :1986  
 Arkansas  :  477   AZ     :  477   Mean   :1986  
 California:  477   CA     :  477   3rd Qu.:1999  
 Colorado  :  477   CO     :  477   Max.   :2012  
 (Other)   :21420   (Other):21420                 
   Population                     variable   
 Min.   :  226167   Violent           :2698  
 1st Qu.: 1179000   Property          :2698  
 Median : 3211500   Murder            :2698  
 Mean   : 4751993   Forcible.Rape     :2698  
 3rd Qu.: 5689170   Robbery           :2698  
 Max.   :38041430   Aggravated.Assault:2698  
                    (Other)           :8094  
     value        
 Min.   :      1  
 1st Qu.:    884  
 Median :   7122  
 Mean   :  45152  
 3rd Qu.:  35694  
 Max.   :2384280  

Rates are now easy to compute

fbi.melt$rate <- fbi.melt$value/fbi.melt$Population*50000
summary(fbi.melt$rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      18     130     440     516    4760 

This rate is now on the 'Ames standard', i.e. crimes per 50,000 residents, so the numbers compare directly to the number of crimes in Ames in a year.

Recreate this chart

Note: this chart is based on old data - your chart might look a bit different

Hint: if you are completely stuck, start with:

qplot(state, weight=rate, data=fbi.melt, geom="bar", facets=.~variable)+coord_flip()

Practising casting

Using cast:

  • find the number of all offenses in 2012

  • find the number of offenses by type of crime

  • find the number of all offenses by state

What is a map?

  • Statistically, a map is made up of points, each with a latitude and a longitude

library(ggplot2)
library(maps)
states <- map_data("state")
head(states)
    long   lat group order  region subregion
1 -87.46 30.39     1     1 alabama      <NA>
2 -87.48 30.37     1     2 alabama      <NA>
3 -87.53 30.37     1     3 alabama      <NA>
4 -87.53 30.33     1     4 alabama      <NA>
5 -87.57 30.33     1     5 alabama      <NA>
6 -87.59 30.33     1     6 alabama      <NA>
qplot(long, lat, data=subset(states, region=="iowa"))

  • Those points need to be connected

  • … but in the right order: the variable order records the sequence in which the points are joined

  • and only points from the same “right” region should be connected

  • the variable group does that: it identifies which points belong to the same outline
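The roles of order and group can be sketched on a toy example. The data frame squares is hypothetical (two unit squares standing in for two regions); the plotting calls are standard ggplot2.

```r
library(ggplot2)

# Two hypothetical "regions": unit squares side by side, listed corner by corner
squares <- data.frame(
  long  = c(0, 1, 1, 0,   2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,   0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  order = 1:8
)

# Without group, one path runs through all points and links the two squares
p_all <- ggplot(squares, aes(long, lat)) + geom_path()

# With group, each region is drawn as its own closed outline
p_grouped <- ggplot(squares, aes(long, lat, group = group)) +
  geom_polygon(fill = NA, colour = "grey30")

# Shuffled rows scramble the outlines; sorting by order restores them
shuffled <- squares[c(3, 1, 4, 2, 7, 5, 8, 6), ]
restored <- shuffled[order(shuffled$order), ]
```

Printing p_all and p_grouped side by side shows the stray connecting line disappear once group is set.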

Maps in ggplot2

qplot(long, lat, data=states)

qplot(long, lat, data=states, geom="path", order=order, group=group)

qplot(long, lat, data=states, geom="polygon", order=order, group=group, colour=I("grey30"))

qplot(long, lat, data=states, geom="polygon", order=order, group=group, fill=long)

Your Turn

  • Using the package maps, pull out map data for all countries in the world: world <- map_data("world")

  • Draw a map of the world

  • Pick one country and color it

Merging Data

  • We would like to draw a choropleth map of the US, i.e. a map in which each state is colored according to its crime rate

  • We have two data sources

head(states)
    long   lat group order  region subregion
1 -87.46 30.39     1     1 alabama      <NA>
2 -87.48 30.37     1     2 alabama      <NA>
3 -87.53 30.37     1     3 alabama      <NA>
4 -87.53 30.33     1     4 alabama      <NA>
5 -87.57 30.33     1     5 alabama      <NA>
6 -87.59 30.33     1     6 alabama      <NA>
head(fbi.melt)
                                           Source   state
1 http://www.disastercenter.com/crime/alcrime.htm Alabama
2 http://www.disastercenter.com/crime/alcrime.htm Alabama
3 http://www.disastercenter.com/crime/alcrime.htm Alabama
4 http://www.disastercenter.com/crime/alcrime.htm Alabama
5 http://www.disastercenter.com/crime/alcrime.htm Alabama
6 http://www.disastercenter.com/crime/alcrime.htm Alabama
  abbr Year Population variable value   rate
1   AL 1960    3266740  Violent  6097  93.32
2   AL 1961    3302000  Violent  5564  84.25
3   AL 1962    3358000  Violent  5283  78.66
4   AL 1963    3347000  Violent  6115  91.35
5   AL 1964    3407000  Violent  7260 106.55
6   AL 1965    3462000  Violent  6916  99.88

Merging Data

The idea is to match rows between data sets using one or more columns that hold the same information.

The common element between states and fbi.melt is the state name.

… but the state name is spelled with a lower-case first letter in states and an upper-case first letter in fbi.melt.

For ease, we will introduce a new variable in fbi.melt called region that matches the region variable in states.

fbi.melt$region <- tolower(fbi.melt$state)
head(fbi.melt$region)
[1] "alabama" "alabama" "alabama" "alabama" "alabama"
[6] "alabama"
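Before merging, it can help to check that the keys really match. The sketch below uses two short, made-up name vectors rather than the real data; setdiff() lists the keys that appear in one set but not in the other.

```r
# Hypothetical state names as spelled in the crime data
crime_states <- c("Alabama", "Alaska", "Iowa")
# Hypothetical region names as spelled in the map data
map_regions <- c("alabama", "iowa")

# Lower-casing makes the two spellings comparable
crime_keys <- tolower(crime_states)

# Keys in the crime data without a counterpart in the map data
unmatched <- setdiff(crime_keys, map_regions)
unmatched
```

With its default arguments, merge() keeps only rows whose key appears in both data sets, so unmatched states silently drop out of the result.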

What does merging do?

The resulting dataset could be quite big!

dim(subset(states, region=="iowa"))
[1] 256   6
dim(subset(fbi.melt, region=="iowa"))
[1] 477   9

The result from the merge will have 256*477 = 122112 rows for Iowa!
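The row count follows from how merge() pairs rows: every row of one data set is combined with every row of the other that shares the same key. A small sketch with made-up data (the data frames outline and crime and all their values are hypothetical):

```r
# Hypothetical outline: 3 points for iowa, 2 points for ohio
outline <- data.frame(
  region = c("iowa", "iowa", "iowa", "ohio", "ohio"),
  long   = c(-96.6, -90.1, -91.2, -84.8, -80.5),
  lat    = c(43.5, 42.1, 40.4, 39.1, 41.9)
)

# Hypothetical crime data: 2 rows per region, one per crime type
crime <- data.frame(
  region   = rep(c("iowa", "ohio"), each = 2),
  variable = rep(c("Violent", "Property"), times = 2),
  rate     = c(10, 100, 20, 200)
)

# Every outline point is paired with every crime row of the same region
merged <- merge(outline, crime, by = "region")
nrow(merged)      # 3 * 2 + 2 * 2 = 10

# Subsetting to one crime type first keeps one row per outline point
merged_one <- merge(outline, subset(crime, variable == "Violent"), by = "region")
nrow(merged_one)  # 3 * 1 + 2 * 1 = 5
```

Subsetting one of the data sets before merging is therefore the simplest way to keep the result small.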

Merging ... finally!

fbi.map <- merge(states, subset(fbi.melt, Year==2012), by="region")
dim(fbi.map)  # huge!!!
[1] 139743     14
head(fbi.map)
   region   long   lat group order subregion
1 alabama -87.46 30.39     1     1      <NA>
2 alabama -87.46 30.39     1     1      <NA>
3 alabama -87.46 30.39     1     1      <NA>
4 alabama -87.46 30.39     1     1      <NA>
5 alabama -87.46 30.39     1     1      <NA>
6 alabama -87.46 30.39     1     1      <NA>
                                           Source   state
1 http://www.disastercenter.com/crime/alcrime.htm Alabama
2 http://www.disastercenter.com/crime/alcrime.htm Alabama
3 http://www.disastercenter.com/crime/alcrime.htm Alabama
4 http://www.disastercenter.com/crime/alcrime.htm Alabama
5 http://www.disastercenter.com/crime/alcrime.htm Alabama
6 http://www.disastercenter.com/crime/alcrime.htm Alabama
  abbr Year Population           variable  value     rate
1   AL 2012    4822023             Murder    342    3.546
2   AL 2012    4822023            Robbery   5020   52.053
3   AL 2012    4822023           Property 168878 1751.112
4   AL 2012    4822023      Forcible.Rape   1296   13.438
5   AL 2012    4822023 Aggravated.Assault  15035  155.899
6   AL 2012    4822023      Vehicle.Theft   9874  102.384

Rates of crimes ....

qplot(long, lat, geom="polygon", group=group, order=order, fill=rate, data=subset(fbi.map, variable == "Property"))

… something strange is going on with Louisiana

other than that, we can see some North-South trend
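One thing to watch for when plotting merged map data, sketched here on made-up data: merge() sorts its result on the key column and does not promise to keep the outline points in their original sequence. The qplot call relies on order=order to deal with this; newer ggplot2 releases may ignore the order aesthetic, so sorting the merged data by group and order before plotting is the safer assumption. The data frames outline and rates below are hypothetical.

```r
# Hypothetical outline: two regions, points numbered in drawing order
outline <- data.frame(
  region = c("ohio", "ohio", "iowa", "iowa", "iowa"),
  group  = c(1, 1, 2, 2, 2),
  order  = 1:5
)

# Hypothetical rates, one row per region
rates <- data.frame(region = c("iowa", "ohio"), rate = c(100, 200))

# merge() sorts on the key, so the iowa rows now come first
merged <- merge(outline, rates, by = "region")

# Restore the drawing sequence before plotting
sorted <- merged[order(merged$group, merged$order), ]
```

The same two sorting columns exist in fbi.map, so the idea carries over directly.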