Heike Hofmann
Stat 579, Fall 2013
We cannot discuss the exam quite yet, because a few exams still need to be written. There will be some discussion in the class lecture on Thursday.
Start R and load data ‘fbi’ from http://www.hofroe.net/stat579/crimes-2012.csv
This data set contains the number of crimes by type for each state in the U.S. for 1960 to 2012.
Exclude data for the whole of the U.S.
Investigate which states have the highest number of crimes (almost independently of type)
Plot a scatterplot of population size against the number of violent crimes. What is your conclusion?
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
head(fbi)
Source Year
1 http://www.disastercenter.com/crime/alcrime.htm 1960
2 http://www.disastercenter.com/crime/alcrime.htm 1961
3 http://www.disastercenter.com/crime/alcrime.htm 1962
4 http://www.disastercenter.com/crime/alcrime.htm 1963
5 http://www.disastercenter.com/crime/alcrime.htm 1964
6 http://www.disastercenter.com/crime/alcrime.htm 1965
Population Violent Property Murder Forcible.Rape Robbery
1 3266740 6097 33823 406 281 898
2 3302000 5564 32541 427 252 630
3 3358000 5283 35829 316 218 754
4 3347000 6115 38521 340 192 828
5 3407000 7260 46290 316 397 992
6 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft
1 4512 11626 19344 2853
2 4255 11205 18801 2535
3 3995 11722 21306 2801
4 4755 12614 22874 3033
5 5555 15898 26713 3679
6 5162 16398 28115 3702
abbr state
1 AL Alabama
2 AL Alabama
3 AL Alabama
4 AL Alabama
5 AL Alabama
6 AL Alabama
library(ggplot2)
qplot(Violent, Property, data=subset(fbi, Year==2012)) + geom_text(aes(label=abbr), data=subset(fbi, (Violent > 50000) & (Year == 2012)), hjust=1.25)
What we are really interested in with this data set are crime rates rather than raw counts.
It would be quite tedious (and inconsistent with the DRY principle) to convert each type of crime to a rate separately.
Instead, we will use reshape again.
Two step process:
melt: get data into a “convenient” shape, i.e. one that is particularly flexible
cast data into new shape(s) that are better suited for analysis
melt.data.frame(data, id.vars, measure.vars, na.rm = F, ...)
library(reshape)
fbi.melt <- melt(fbi, id.vars=c("Source","state","abbr","Year", "Population"), measure.vars=4:12)
head(fbi.melt[,-1])
state abbr Year Population variable value
1 Alabama AL 1960 3266740 Violent 6097
2 Alabama AL 1961 3302000 Violent 5564
3 Alabama AL 1962 3358000 Violent 5283
4 Alabama AL 1963 3347000 Violent 6115
5 Alabama AL 1964 3407000 Violent 7260
6 Alabama AL 1965 3462000 Violent 6916
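To see what the melt step does on a small scale, here is a minimal base-R sketch on a hypothetical two-state toy data frame. The object names toy and molten and all numbers are made up for illustration; the slides themselves use melt from reshape. Every measured column is stacked into a variable/value pair while the id columns are repeated:

```r
# Toy wide data: one row per state, one column per crime type (made-up numbers)
toy <- data.frame(state = c("Alabama", "Alaska"),
                  Year = c(2012, 2012),
                  Violent = c(10, 20),
                  Property = c(100, 200))

id <- c("state", "Year")             # columns that identify a row
measure <- c("Violent", "Property")  # columns that hold measurements

# Stack each measured column below the previous one, repeating the id columns
molten <- do.call(rbind, lapply(measure, function(v) {
  data.frame(toy[id], variable = v, value = toy[[v]])
}))
rownames(molten) <- NULL
print(molten)
```

The result has one row per combination of state, year and crime type, which is the same long layout that head(fbi.melt) shows.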
summary(fbi.melt[,-1])
state abbr Year
Alabama : 477 AK : 477 Min. :1960
Alaska : 477 AL : 477 1st Qu.:1973
Arizona : 477 AR : 477 Median :1986
Arkansas : 477 AZ : 477 Mean :1986
California: 477 CA : 477 3rd Qu.:1999
Colorado : 477 CO : 477 Max. :2012
(Other) :21420 (Other):21420
Population variable
Min. : 226167 Violent :2698
1st Qu.: 1179000 Property :2698
Median : 3211500 Murder :2698
Mean : 4751993 Forcible.Rape :2698
3rd Qu.: 5689170 Robbery :2698
Max. :38041430 Aggravated.Assault:2698
(Other) :8094
value
Min. : 1
1st Qu.: 884
Median : 7122
Mean : 45152
3rd Qu.: 35694
Max. :2384280
fbi.melt$rate <- fbi.melt$value/fbi.melt$Population*50000
summary(fbi.melt$rate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 18 130 440 516 4760
This rate is now a number of crimes per 50,000 residents per year, the 'Ames standard': the numbers compare directly to the number of crimes in a city the size of Ames in a year
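As a quick check of the scaling, the rate for a single row can be worked out by hand. This small sketch uses the Alabama 1960 figures from the head(fbi) output; the variable names value, population and rate are only for illustration:

```r
# Alabama, 1960, violent crimes: numbers taken from the head(fbi) output
value <- 6097
population <- 3266740

# Crimes per 50,000 residents, the same scaling as fbi.melt$rate
rate <- value / population * 50000
print(round(rate, 2))
```

The rounded result agrees with the 93.32 that head(fbi.melt) reports for this row.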
Note: this chart is based on old data - your chart might look a bit different
Hint: if you are completely stuck start with
qplot(state, weight=rate, data=fbi.melt, geom="bar", facets=.~variable)+coord_flip()
Using cast:
find the number of all offenses in 2012
find the number of offenses by type of crime
find the number of all offenses by state
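As a starting point for these questions, the kind of aggregation that cast performs can be sketched in base R on a hypothetical molten toy data frame (all numbers are made up). With reshape, a call along the lines of cast(fbi.melt, variable ~ ., sum, subset = Year == 2012) should give the per-type totals; treat that call as an assumption to check against ?cast rather than as the official solution:

```r
# Hypothetical molten data: two states, two years, two crime types
molten <- data.frame(
  state = rep(c("Alabama", "Alaska"), each = 4),
  Year = rep(c(2011, 2012), times = 4),
  variable = rep(rep(c("Violent", "Property"), each = 2), times = 2),
  value = c(1, 2, 10, 20, 3, 4, 30, 40)
)

# Keep only the year of interest
recent <- subset(molten, Year == 2012)

# All offenses in 2012
total_2012 <- sum(recent$value)

# Offenses by type of crime, and by state
by_type <- aggregate(value ~ variable, data = recent, FUN = sum)
by_state <- aggregate(value ~ state, data = recent, FUN = sum)

print(total_2012)
print(by_type)
print(by_state)
```

Each aggregate call collapses the molten data over everything that does not appear on the right-hand side of its formula, which is the same idea as choosing which variables to keep in a cast formula.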
library(ggplot2)
library(maps)
states <- map_data("state")
head(states)
long lat group order region subregion
1 -87.46 30.39 1 1 alabama <NA>
2 -87.48 30.37 1 2 alabama <NA>
3 -87.53 30.37 1 3 alabama <NA>
4 -87.53 30.33 1 4 alabama <NA>
5 -87.57 30.33 1 5 alabama <NA>
6 -87.59 30.33 1 6 alabama <NA>
qplot(long, lat, data=subset(states, region=="iowa"))
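qplot(long, lat, data=states)
To turn the points into outlines, they have to be connected in the right order (the order variable) and separately for each outline (the group variable does that)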
qplot(long, lat, data=states, geom="path", order=order, group=group)
qplot(long, lat, data=states, geom="polygon", order=order, group=group, colour=I("grey30"))
qplot(long, lat, data=states, geom="polygon", order=order, group=group, fill=long)
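The roles of order and group can be illustrated without any plotting. The following is a minimal base-R sketch on two hypothetical outlines (a square and a triangle, with made-up coordinates): order gives the sequence in which points are connected, and group says which points belong to the same outline.

```r
# Two hypothetical outlines: a square (group 1) and a triangle (group 2)
pts <- data.frame(long = c(0, 1, 1, 0, 2, 3, 3),
                  lat = c(0, 0, 1, 1, 0, 0, 1),
                  group = c(1, 1, 1, 1, 2, 2, 2),
                  order = 1:7)

# Scramble the rows, as can happen after merging with another data set
shuffled <- pts[c(3, 6, 1, 7, 2, 5, 4), ]

# Restore the drawing sequence, then separate the outlines by group
restored <- shuffled[order(shuffled$group, shuffled$order), ]
outlines <- split(restored[c("long", "lat")], restored$group)
print(outlines)
```

Connecting the scrambled rows as they come would zigzag between the two shapes; sorting by order and splitting by group recovers one clean outline per shape.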
Using the package maps, pull out map data for all countries in the world:
world <- map_data("world")
Draw a map of the world
Pick one country and color it
We would like to draw a choropleth map (one with color) of the US, and indicate crime rates on it
We have two data sources
head(states)
long lat group order region subregion
1 -87.46 30.39 1 1 alabama <NA>
2 -87.48 30.37 1 2 alabama <NA>
3 -87.53 30.37 1 3 alabama <NA>
4 -87.53 30.33 1 4 alabama <NA>
5 -87.57 30.33 1 5 alabama <NA>
6 -87.59 30.33 1 6 alabama <NA>
head(fbi.melt)
Source state
1 http://www.disastercenter.com/crime/alcrime.htm Alabama
2 http://www.disastercenter.com/crime/alcrime.htm Alabama
3 http://www.disastercenter.com/crime/alcrime.htm Alabama
4 http://www.disastercenter.com/crime/alcrime.htm Alabama
5 http://www.disastercenter.com/crime/alcrime.htm Alabama
6 http://www.disastercenter.com/crime/alcrime.htm Alabama
abbr Year Population variable value rate
1 AL 1960 3266740 Violent 6097 93.32
2 AL 1961 3302000 Violent 5564 84.25
3 AL 1962 3358000 Violent 5283 78.66
4 AL 1963 3347000 Violent 6115 91.35
5 AL 1964 3407000 Violent 7260 106.55
6 AL 1965 3462000 Violent 6916 99.88
The idea is to match rows between the two data sets by one or more columns that carry the same information
The common element between states and fbi.melt is the state name
… but the state name starts with a lower case letter in states and with an upper case letter in fbi.melt
for ease, we will introduce a new variable in fbi.melt called region that matches the region variable in states
fbi.melt$region <- tolower(fbi.melt$state)
head(fbi.melt$region)
[1] "alabama" "alabama" "alabama" "alabama" "alabama"
[6] "alabama"
Merging the two data sets on region could produce a very big result!
dim(subset(states, region=="iowa"))
[1] 256 6
dim(subset(fbi.melt, region=="iowa"))
[1] 477 9
The result from the merge will have 256*477 = 122112 rows for Iowa!
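The row count multiplies because every map point of a region is paired with every crime record of the same region. A minimal sketch on hypothetical toy data (the data frames outline and crimes and all their values are made up) shows the effect:

```r
# Hypothetical map outline: 3 points for one region
outline <- data.frame(region = "iowa",
                      long = c(-96, -91, -93),
                      lat = c(43, 43, 40),
                      order = 1:3)

# Hypothetical molten crime data: 2 records for the same state,
# capitalised the way the state names are in fbi
crimes <- data.frame(state = "Iowa",
                     variable = c("Violent", "Property"),
                     rate = c(5, 50))

# Build a lower-case key that matches the map data
crimes$region <- tolower(crimes$state)

# Every outline point is combined with every crime record
merged <- merge(outline, crimes, by = "region")
print(nrow(merged))
```

Three points times two records gives six rows, so restricting the crime data to a single year before merging keeps the result much smaller.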
fbi.map <- merge(states, subset(fbi.melt, Year==2012), by="region")
dim(fbi.map) # huge!!!
[1] 139743 14
head(fbi.map)
region long lat group order subregion
1 alabama -87.46 30.39 1 1 <NA>
2 alabama -87.46 30.39 1 1 <NA>
3 alabama -87.46 30.39 1 1 <NA>
4 alabama -87.46 30.39 1 1 <NA>
5 alabama -87.46 30.39 1 1 <NA>
6 alabama -87.46 30.39 1 1 <NA>
Source state
1 http://www.disastercenter.com/crime/alcrime.htm Alabama
2 http://www.disastercenter.com/crime/alcrime.htm Alabama
3 http://www.disastercenter.com/crime/alcrime.htm Alabama
4 http://www.disastercenter.com/crime/alcrime.htm Alabama
5 http://www.disastercenter.com/crime/alcrime.htm Alabama
6 http://www.disastercenter.com/crime/alcrime.htm Alabama
abbr Year Population variable value rate
1 AL 2012 4822023 Murder 342 3.546
2 AL 2012 4822023 Robbery 5020 52.053
3 AL 2012 4822023 Property 168878 1751.112
4 AL 2012 4822023 Forcible.Rape 1296 13.438
5 AL 2012 4822023 Aggravated.Assault 15035 155.899
6 AL 2012 4822023 Vehicle.Theft 9874 102.384
qplot(long, lat, geom="polygon", group=group, order=order, fill=rate, data=subset(fbi.map, variable == "Property"))
… something strange is going on with Louisiana
other than that, we can see a North-South trend