Murder Minder
========================================================
Gapminder Data

I obtained data on deaths from interpersonal violence in 2002 and 2004, and on total population in 2004, from Gapminder. I then loaded the data and kept the cases that did not have missing values.

The data came from two separate spreadsheets downloaded from Gapminder: the population spreadsheet and the interpersonal violence spreadsheet. The population spreadsheet had data for over 200 years and, since it included countries that have since disappeared, had many more entries than the violence data. I merged the data by hand in Mac Numbers, which may have been a mistake, since Numbers does not seem to have an option for deleting individual cells.
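The hand merge described above could also be done in R itself. The following is only a sketch under stated assumptions: the two small data frames are stand-ins for the downloaded spreadsheets, the column names follow the `head()` output shown further down, and "Former Country X" is a made-up placeholder for a country that no longer exists.

```r
# Stand-in for the population spreadsheet (values taken from the head() output,
# plus one hypothetical row for a country missing from the violence data).
population <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Former Country X"),
  population = c(26693486, 3124861, 32396048, NA),
  stringsAsFactors = FALSE
)

# Stand-in for the interpersonal violence spreadsheet.
violence <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria"),
  Death_02 = c(916, 187, 3745),
  Death_04 = c(813, 208, 3102),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default, so countries that appear
# in only one of the two tables are dropped.
merged <- merge(population, violence, by = "Country")
print(merged)
```

Passing `all = TRUE` to `merge()` would keep the unmatched rows instead and fill their missing columns with `NA`.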

The assignment is to create 2-5 plots that make use of the techniques from Lesson 3. These include simple histograms, boxplots split over a categorical variable, and frequency polygons. The assignment also includes saving the pictures created and posting them to the discussion board, as well as submitting my code on the Udacity website.

I will begin with a histogram of worldwide murders in 2004.

The first thing I realize is that I have not followed the naming conventions from the Fox and Weisberg text, which capitalize data set names and keep variable names in lowercase. Indeed, I seem to have done just the opposite.

#install.packages("knitr")
#install.packages("ggplot2")
library(knitr)
library(ggplot2)
worldHomicide <- read.csv("/Users/michaelreinhard/Google Drive/R/worldHomicide.csv")
head(worldHomicide)
##               Country population Death_02 Death_04 deathRate04 dper100k
## 1         Afghanistan   26693486      916      813   3.046e-05    3.046
## 2             Albania    3124861      187      208   6.656e-05    6.656
## 3             Algeria   32396048     3745     3102   9.575e-05    9.575
## 4             Andorra      75292        1        1   1.328e-05    1.328
## 5              Angola   15957460     5217     6226   3.902e-04   39.016
## 6 Antigua and Barbuda      82838        7        6   7.243e-05    7.243
WrdMurder <- as.data.frame(worldHomicide) # this seems to be sufficient without adding the table() command in between
head(WrdMurder)
##               Country population Death_02 Death_04 deathRate04 dper100k
## 1         Afghanistan   26693486      916      813   3.046e-05    3.046
## 2             Albania    3124861      187      208   6.656e-05    6.656
## 3             Algeria   32396048     3745     3102   9.575e-05    9.575
## 4             Andorra      75292        1        1   1.328e-05    1.328
## 5              Angola   15957460     5217     6226   3.902e-04   39.016
## 6 Antigua and Barbuda      82838        7        6   7.243e-05    7.243
#Now make variable names lower case
names(WrdMurder) <- tolower(names(WrdMurder))
names(WrdMurder)
## [1] "country"     "population"  "death_02"    "death_04"    "deathrate04"
## [6] "dper100k"
names(WrdMurder) <- c("nat","pop","death02","death04","dRate04","d100k")
head(WrdMurder)
##                   nat      pop death02 death04   dRate04  d100k
## 1         Afghanistan 26693486     916     813 3.046e-05  3.046
## 2             Albania  3124861     187     208 6.656e-05  6.656
## 3             Algeria 32396048    3745    3102 9.575e-05  9.575
## 4             Andorra    75292       1       1 1.328e-05  1.328
## 5              Angola 15957460    5217    6226 3.902e-04 39.016
## 6 Antigua and Barbuda    82838       7       6 7.243e-05  7.243
names(WrdMurder)
## [1] "nat"     "pop"     "death02" "death04" "dRate04" "d100k"

Now I remove the rows with missing data and take a look to see how many cases remain. I had planned to attach the data set to save a lot of typing, but the code below refers to each column through the data frame instead.

WrdMurderNA <- na.omit(WrdMurder)
head(WrdMurderNA)
##                   nat      pop death02 death04   dRate04  d100k
## 1         Afghanistan 26693486     916     813 3.046e-05  3.046
## 2             Albania  3124861     187     208 6.656e-05  6.656
## 3             Algeria 32396048    3745    3102 9.575e-05  9.575
## 4             Andorra    75292       1       1 1.328e-05  1.328
## 5              Angola 15957460    5217    6226 3.902e-04 39.016
## 6 Antigua and Barbuda    82838       7       6 7.243e-05  7.243
summary(WrdMurderNA)
##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k      
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92  
##  Median :  328   Median :6.59e-05   Median : 6.59  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39  
## 
str(WrdMurderNA)
## 'data.frame':    191 obs. of  6 variables:
##  $ nat    : Factor w/ 195 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ pop    : int  26693486 3124861 32396048 75292 15957460 82838 38340778 3062612 20103822 8185553 ...
##  $ death02: int  916 187 3745 1 5217 7 3329 112 284 75 ...
##  $ death04: int  813 208 3102 1 6226 6 2596 100 253 63 ...
##  $ dRate04: num  3.05e-05 6.66e-05 9.58e-05 1.33e-05 3.90e-04 ...
##  $ d100k  : num  3.05 6.66 9.58 1.33 39.02 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:4] 25 85 153 191
##   .. ..- attr(*, "names")= chr [1:4] "25" "85" "153" "191"
hist(WrdMurderNA$pop)

plot of chunk removeNA

hist(WrdMurderNA$d100k)

plot of chunk removeNA

The first thing we see is that the distribution seems to be very skewed to the right.

Now I switch over to ggplot.

p <- ggplot(data=WrdMurderNA, aes(x=d100k))
p + geom_bar()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk basicGgplot

p + geom_histogram(binwidth = 1)

plot of chunk basicGgplot

p + geom_density()

plot of chunk basicGgplot

WrdMurderNA_large <- subset(WrdMurderNA, pop >= 1000000)

I have imported the data. Now I want to inspect the basic properties of the data set. There are 191 cases in the NA-purged data set and 149 in the subset of countries with a population of at least one million.
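The case counts can be read directly with `nrow()`. This is a minimal sketch on the six countries printed by `head()` earlier, so the counts it prints are for that toy sample only; the data frame name `toy` is an assumption made for illustration.

```r
# Six rows copied from the head() output; not the full data set.
toy <- data.frame(
  nat = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
          "Antigua and Barbuda"),
  pop = c(26693486, 3124861, 32396048, 75292, 15957460, 82838),
  stringsAsFactors = FALSE
)

# Same filter as WrdMurderNA_large: keep countries with at least 1 million people.
toy_large <- subset(toy, pop >= 1000000)

n_all <- nrow(toy)
n_large <- nrow(toy_large)
cat("all cases:", n_all, "| population >= 1 million:", n_large, "\n")
```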

summary(WrdMurderNA$pop)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 1.73e+03 1.35e+06 6.47e+06 3.34e+07 2.15e+07 1.30e+09
class(WrdMurderNA$pop)
## [1] "integer"
summary(WrdMurderNA$death04)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      63     328    3130    2060   61200

This is a good illustration of how right-skewed data pull the mean above the median: the mean number of deaths in 2004 is about 3,130, while the median is only 328.
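The same effect shows up even in a tiny sample. The sketch below uses only the six `death04` values printed by `head()` earlier, so the numbers differ from the full-data summary.

```r
# 2004 deaths for the six countries shown by head().
death04 <- c(813, 208, 3102, 1, 6226, 6)

m <- mean(death04)      # pulled upward by the two large values
med <- median(death04)  # midpoint of the two central values

cat("mean:", m, " median:", med, "\n")
```

A couple of large counts are enough to drag the mean well above the median.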

summary(WrdMurderNA$d100k)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.92    6.59   10.80   16.30   86.40
WrdMurderNA[186,]
##         nat    pop death02 death04   dRate04  d100k
## 189 Vanuatu 205561       3       2 9.729e-06 0.9729
WrdMurderNA[185,]
##            nat      pop death02 death04   dRate04 d100k
## 188 Uzbekistan 25708188     951     921 3.583e-05 3.583
WrdMurderNA[183,]
##               nat       pop death02 death04   dRate04 d100k
## 186 United States 294063120   15726   17647 6.001e-05 6.001
#WrdMurderNA$nat["Afghanistan"]

#WrdMurderNA[nat=="Afghanistan",]
#WrdMurderNA[nat=="United States",]
#WrdMurderNA[nat=="United Kingdom",]
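The commented-out lookups above fail because `nat` is not visible on its own unless the data frame has been attached. A minimal sketch of two ways to select a row by country name, using a toy data frame built from the `head()` output (the name `toy` is an assumption):

```r
toy <- data.frame(
  nat = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
          "Antigua and Barbuda"),
  death04 = c(813, 208, 3102, 1, 6226, 6),
  stringsAsFactors = FALSE
)

# Option 1: refer to the column through the data frame.
angola <- toy[toy$nat == "Angola", ]

# Option 2: subset() evaluates the condition inside the data frame.
albania <- subset(toy, nat == "Albania")

print(angola)
print(albania)
```

Selecting by name avoids having to know the row position, which shifts once `na.omit()` has dropped rows.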

I want to do a few things with this data set.

First I create a simple histogram of the world’s homicides. I adjust the bin width to see if there is an optimal trade-off between granularity and detecting an overall pattern. Then I create a single boxplot to identify the outliers. I also create a violin plot.
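The boxplot and violin plot mentioned above might be drawn along the following lines. This is a sketch, not the code used for the report: it runs on the six countries printed by `head()`, and mapping `x` to a constant `factor(1)` is just one way to get a single box.

```r
library(ggplot2)

toy <- data.frame(death04 = c(813, 208, 3102, 1, 6226, 6))

# A single boxplot: a constant x puts all countries in one group.
box <- ggplot(toy, aes(x = factor(1), y = death04)) + geom_boxplot()

# The violin plot uses the same mapping with a different geom.
violin <- ggplot(toy, aes(x = factor(1), y = death04)) + geom_violin()

# Values beyond the whiskers, as the boxplot would flag them.
outliers <- boxplot.stats(toy$death04)$out

print(box)
print(violin)
print(outliers)
```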

pMurder <- ggplot(aes(x = death04), data = WrdMurderNA)
pMurder + geom_histogram(binwidth = 100)

plot of chunk histograms

pMurder + geom_histogram(binwidth = 1, fill = "steelblue") + xlim(0,50) 

plot of chunk histograms

pMurder + geom_histogram(binwidth = 1000, fill = "steelblue") + xlim(5000,70000)

plot of chunk histograms

Next, I look at the population variable to see how population size is distributed among the nations of the world. I first create a general histogram and experiment with a few different bin sizes.

pPop <- ggplot(aes(x = pop), data = WrdMurderNA)
pPop + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk histogramPop

#pPop + geom_histogram(binwidth = range(WrdMurder)pop/10) # doesn't work
pPop + geom_density(fill = "steelblue")

plot of chunk histogramPop

ggsave('population_density.jpg')
## Saving 7 x 5 in image
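The commented-out bin width attempt above does not parse, because `range()` has to be called on a column and returns two numbers (minimum and maximum) rather than a width. A sketch of what was probably intended, assuming the goal is ten equal bins across the observed range, on a toy sample from the `head()` output:

```r
library(ggplot2)

toy <- data.frame(
  pop = c(26693486, 3124861, 32396048, 75292, 15957460, 82838)
)

# diff(range(x)) is max(x) - min(x); dividing by 10 gives ten bins.
bw <- diff(range(toy$pop)) / 10

p <- ggplot(toy, aes(x = pop)) + geom_histogram(binwidth = bw)
print(p)
```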

Next, I attempt to fit a power distribution to the histogram to get a sense of the most appropriate model for the distribution of population.

Another thing that would be nice would be to make murders per capita into an ordered factor and use it as the fill color in a uniform-size bar chart showing the proportion of countries at each level of population that are high-, medium- or low-murder countries. This is done with the cut function:

WrdMurderNA$d100k5cat <- cut(WrdMurderNA$d100k,quantile(WrdMurderNA$d100k, probs=seq(0,1,0.2)))
WrdMurderNA$death04cat5 <- cut(WrdMurderNA$death04,quantile(WrdMurderNA$death04, probs=seq(0,1,0.2)))
summary(WrdMurderNA$death04cat5) #still gives missing values
##              (0,38]            (38,208]           (208,680] 
##                  33                  38                  38 
##      (680,2.56e+03] (2.56e+03,6.12e+04]                NA's 
##                  38                  38                   6
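The six missing values reported above come from `cut()` itself: by default its intervals are open on the left, so any value equal to the lowest break (here 0) falls outside every interval. A small sketch of the effect and of the `include.lowest = TRUE` argument that closes the first interval; the two zeros are stand-ins for the countries with no recorded deaths, and the other rates come from the `head()` output.

```r
d100k <- c(0, 0, 1.328, 3.046, 6.656, 7.243, 9.575, 39.016)
breaks <- quantile(d100k, probs = seq(0, 1, 0.2))

# Default: first interval is (0, b], so the zeros become NA.
default_cut <- cut(d100k, breaks)

# include.lowest = TRUE makes the first interval [0, b].
fixed_cut <- cut(d100k, breaks, include.lowest = TRUE)

n_na_default <- sum(is.na(default_cut))
n_na_fixed <- sum(is.na(fixed_cut))
cat("NAs by default:", n_na_default, " with include.lowest:", n_na_fixed, "\n")
```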
WrdMurderNA$d100k5cat
##   [1] (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (0,1.4]     (18.6,86.4]
##   [6] (3.4,9.91]  (3.4,9.91]  (1.4,3.4]   (0,1.4]     (0,1.4]    
##  [11] (1.4,3.4]   (18.6,86.4] (0,1.4]     (3.4,9.91]  (9.91,18.6]
##  [16] (9.91,18.6] (1.4,3.4]   (18.6,86.4] (9.91,18.6] (3.4,9.91] 
##  [21] (3.4,9.91]  (1.4,3.4]   (18.6,86.4] (18.6,86.4] (0,1.4]    
##  [26] (1.4,3.4]   (9.91,18.6] (18.6,86.4] (18.6,86.4] (9.91,18.6]
##  [31] (0,1.4]     (9.91,18.6] (18.6,86.4] (18.6,86.4] (3.4,9.91] 
##  [36] (1.4,3.4]   (18.6,86.4] (9.91,18.6] (18.6,86.4] (18.6,86.4]
##  [41] <NA>        (3.4,9.91]  (18.6,86.4] (1.4,3.4]   (3.4,9.91] 
##  [46] (0,1.4]     (0,1.4]     (0,1.4]     (1.4,3.4]   (9.91,18.6]
##  [51] (9.91,18.6] (18.6,86.4] (0,1.4]     (18.6,86.4] (18.6,86.4]
##  [56] (9.91,18.6] (3.4,9.91]  (18.6,86.4] (0,1.4]     (1.4,3.4]  
##  [61] (0,1.4]     (9.91,18.6] (9.91,18.6] (3.4,9.91]  (0,1.4]    
##  [66] (9.91,18.6] (0,1.4]     (3.4,9.91]  (18.6,86.4] (9.91,18.6]
##  [71] (18.6,86.4] (18.6,86.4] (3.4,9.91]  (18.6,86.4] (1.4,3.4]  
##  [76] (0,1.4]     (3.4,9.91]  (3.4,9.91]  (1.4,3.4]   (3.4,9.91] 
##  [81] (0,1.4]     (3.4,9.91]  (0,1.4]     (0,1.4]     (3.4,9.91] 
##  [86] (9.91,18.6] (18.6,86.4] (3.4,9.91]  (18.6,86.4] (1.4,3.4]  
##  [91] (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (9.91,18.6] (1.4,3.4]  
##  [96] (9.91,18.6] (9.91,18.6] (1.4,3.4]   (3.4,9.91]  (0,1.4]    
## [101] (3.4,9.91]  (9.91,18.6] (9.91,18.6] (3.4,9.91]  (1.4,3.4]  
## [106] (9.91,18.6] (0,1.4]     (1.4,3.4]   (9.91,18.6] (1.4,3.4]  
## [111] (3.4,9.91]  (0,1.4]     (3.4,9.91]  <NA>        (1.4,3.4]  
## [116] (0,1.4]     (18.6,86.4] (9.91,18.6] (9.91,18.6] (3.4,9.91] 
## [121] (9.91,18.6] (0,1.4]     (0,1.4]     (9.91,18.6] (18.6,86.4]
## [126] (9.91,18.6] <NA>        (0,1.4]     (1.4,3.4]   (3.4,9.91] 
## [131] <NA>        (9.91,18.6] (9.91,18.6] (9.91,18.6] (1.4,3.4]  
## [136] (18.6,86.4] (1.4,3.4]   (1.4,3.4]   (0,1.4]     (1.4,3.4]  
## [141] (18.6,86.4] (18.6,86.4] (9.91,18.6] (18.6,86.4] (9.91,18.6]
## [146] (0,1.4]     <NA>        (3.4,9.91]  (1.4,3.4]   (9.91,18.6]
## [151] (1.4,3.4]   (3.4,9.91]  (18.6,86.4] (0,1.4]     (1.4,3.4]  
## [156] (1.4,3.4]   (1.4,3.4]   (1.4,3.4]   (18.6,86.4] (1.4,3.4]  
## [161] (3.4,9.91]  (18.6,86.4] (9.91,18.6] (18.6,86.4] (0,1.4]    
## [166] (0,1.4]     (1.4,3.4]   (1.4,3.4]   (18.6,86.4] (3.4,9.91] 
## [171] (9.91,18.6] (9.91,18.6] (0,1.4]     (9.91,18.6] (1.4,3.4]  
## [176] (1.4,3.4]   (3.4,9.91]  <NA>        (18.6,86.4] (9.91,18.6]
## [181] (0,1.4]     (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (3.4,9.91] 
## [186] (0,1.4]     (18.6,86.4] (3.4,9.91]  (1.4,3.4]   (18.6,86.4]
## [191] (18.6,86.4]
## Levels: (0,1.4] (1.4,3.4] (3.4,9.91] (9.91,18.6] (18.6,86.4]
summary(WrdMurderNA)
##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k             d100k5cat 
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00   (0,1.4]    :33  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92   (1.4,3.4]  :38  
##  Median :  328   Median :6.59e-05   Median : 6.59   (3.4,9.91] :38  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76   (9.91,18.6]:38  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27   (18.6,86.4]:38  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39   NA's       : 6  
##                                                                     
##               death04cat5
##  (0,38]             :33  
##  (38,208]           :38  
##  (208,680]          :38  
##  (680,2.56e+03]     :38  
##  (2.56e+03,6.12e+04]:38  
##  NA's               : 6  
## 
WrdMurderNAnone <- WrdMurderNA
summary(WrdMurderNAnone)
##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k             d100k5cat 
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00   (0,1.4]    :33  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92   (1.4,3.4]  :38  
##  Median :  328   Median :6.59e-05   Median : 6.59   (3.4,9.91] :38  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76   (9.91,18.6]:38  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27   (18.6,86.4]:38  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39   NA's       : 6  
##                                                                     
##               death04cat5
##  (0,38]             :33  
##  (38,208]           :38  
##  (208,680]          :38  
##  (680,2.56e+03]     :38  
##  (2.56e+03,6.12e+04]:38  
##  NA's               : 6  
## 
WrdMurderNAnone <- subset(WrdMurderNA, is.na(death04cat5==FALSE))
summary(WrdMurderNAnone)
##          nat         pop           death02     death04     dRate04 
##  Cook Is   :1   Min.   : 1728   Min.   :0   Min.   :0   Min.   :0  
##  Monaco    :1   1st Qu.:12003   1st Qu.:0   1st Qu.:0   1st Qu.:0  
##  Niue      :1   Median :19439   Median :0   Median :0   Median :0  
##  Palau     :1   Mean   :19209   Mean   :0   Mean   :0   Mean   :0  
##  San Marino:1   3rd Qu.:27242   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
##  Tuvalu    :1   Max.   :35282   Max.   :0   Max.   :0   Max.   :0  
##  (Other)   :0                                                      
##      d100k         d100k5cat              death04cat5
##  Min.   :0   (0,1.4]    :0   (0,38]             :0   
##  1st Qu.:0   (1.4,3.4]  :0   (38,208]           :0   
##  Median :0   (3.4,9.91] :0   (208,680]          :0   
##  Mean   :0   (9.91,18.6]:0   (680,2.56e+03]     :0   
##  3rd Qu.:0   (18.6,86.4]:0   (2.56e+03,6.12e+04]:0   
##  Max.   :0   NA's       :6   NA's               :6   
## 
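The summary above contains only the six rows with missing categories, the opposite of what was wanted. The cause appears to be the placement of the parenthesis: `is.na(death04cat5==FALSE)` tests whether the comparison is missing, which is true exactly for the `NA` rows. A sketch on a three-row toy data frame (the category labels are invented for illustration):

```r
toy <- data.frame(
  nat = c("Andorra", "Monaco", "Angola"),
  cat = factor(c("low", NA, "high")),
  stringsAsFactors = FALSE
)

# As written in the report: keeps ONLY the rows where cat is NA.
wrong <- subset(toy, is.na(cat == FALSE))

# Intended: keep the rows where cat is not NA.
right <- subset(toy, !is.na(cat))

print(wrong)
print(right)
```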
#death04nrm <- death04/sum(death04)
#death04nrm5cat <- cut(death04nrm, quantile(death04nrm, probs=seq(0,1,0.2)))

#WrdMurder$d100knrmQuintiles100 <- cut(WrdMurder$d100knrm,quantile(WrdMurder$d100knrm, probs=seq(0,1,0.2)))

Now I will make the histogram of population with murder rates as the color. Since the population variable is so spread out I will make the x scale logarithmic.

# Chunk popMurder: population histograms filled by murder-rate quintile.
greys <- c("grey80", "grey70", "grey60", "grey50", "grey40")
pPop <- ggplot(aes(x = pop), data = WrdMurderNA)
pPop + geom_histogram(aes(fill = d100k5cat)) + scale_fill_manual(values = greys) + scale_x_log10()
# dPercentiles5 and death04nrm5cat are never created above, so d100k5cat and death04cat5 stand in for them.
pPop + geom_density(aes(fill = d100k5cat)) + scale_fill_manual(values = greys) + scale_x_log10()
# position is a layer argument, not an aesthetic, so it goes outside aes().
pPop + geom_bar(aes(fill = d100k5cat), position = "stack") + scale_fill_manual(values = greys) + scale_x_log10()
pPop + geom_bar(aes(fill = death04cat5), position = "stack") + scale_fill_manual(values = greys) + scale_x_log10()

# Each country's share of total population. WrdMurderNA is used because it has no missing values and holds the category columns.
popNorm <- as.numeric(WrdMurderNA$pop)/sum(as.numeric(WrdMurderNA$pop))
popNorm
sum(popNorm)
class(popNorm)
WrdMurderNA$popNorm <- popNorm
nrow(WrdMurderNA)

# data is an argument of ggplot(), not of aes().
pPopNorm <- ggplot(aes(x = popNorm), data = WrdMurderNA)
pPopNorm + geom_bar(aes(fill = death04cat5), position = "stack") + scale_fill_manual(values = greys) + scale_x_log10()
pop_dist <- ggplot(aes(x = pop), data = na.omit(WrdMurderNA))
pop_dist + geom_histogram(aes(fill = d100k5cat), binwidth = 0.1, position = "fill") + scale_x_log10()
pop_dist + geom_histogram(aes(fill = d100k5cat), binwidth = 0.1, position = "fill") + scale_fill_manual(values = greys) + scale_x_log10()

plot <- ggplot(aes(x = pop), data = na.omit(WrdMurderNA)) 
plot + geom_histogram(aes(y = ..density..)) + facet_grid(.~d100k5cat) # this makes the facets side by side. 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk reproducing page 68 with murder

plot + geom_histogram(aes(y = ..density..)) + 
   facet_grid(d100k5cat~.) + 
   scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk reproducing page 68 with murder

plot + geom_histogram(aes(fill = d100k5cat), position = "fill") +
   scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk reproducing page 68 with murder

plot + geom_histogram(aes(y = ..density.., color = d100k5cat, fill = d100k5cat)) +
   scale_x_log10() # this is the stacked effect I was looking for and have now found by accident.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk reproducing page 68 with murder

plot + geom_freqpoly(aes(y = ..density.., color = d100k5cat)) + 
   scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk reproducing page 68 with murder

sum(WrdMurderNA$d100k)
d100knrm <- WrdMurderNA$d100k/sum(WrdMurderNA$d100k)

Ok, what I really want here is a big rectangle with the different bands of color. What is that called?

Next, I focus on the extremes of the distribution. I use the xlim() function to find the ten largest and ten smallest countries in my data.
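Note that `xlim()` only restricts the range of an axis; it does not pick out countries. One way to list the largest and smallest countries is to sort by population, sketched here on the six countries from the `head()` output, taking the top and bottom two rather than ten because the toy sample is so small.

```r
toy <- data.frame(
  nat = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
          "Antigua and Barbuda"),
  pop = c(26693486, 3124861, 32396048, 75292, 15957460, 82838),
  stringsAsFactors = FALSE
)

# Sort from largest to smallest population.
by_size <- toy[order(toy$pop, decreasing = TRUE), ]

largest <- head(by_size, 2)   # first rows are the largest countries
smallest <- tail(by_size, 2)  # last rows are the smallest countries

print(largest)
print(smallest)
```

With the full data set, `head(by_size, 10)` and `tail(by_size, 10)` would give the ten largest and ten smallest.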

I then go back to an earlier plot and use the labeling capabilities of R to mark them out on the larger graph.

I then compare murder to population to see if there is a pattern connecting the size of a country and the number of murders it has.

I obtain the Pearson correlation coefficient to see if there is a relation. To the extent there is a strong relation between population and number of murders, we can infer that small countries are inherently no safer than large countries. On the other hand, if large countries are systematically safer, then we can speculate that Collier’s thesis, that the provision of security operates under significant economies of scale, has some support in these data.
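A sketch of how the coefficient could be obtained with `cor()`, again on the six-country toy sample. Correlating population with the rate per 100,000 as well as with the raw count is an addition made here for comparison, not something the plan above specifies, and six rows are far too few to support any conclusion.

```r
toy <- data.frame(
  pop = c(26693486, 3124861, 32396048, 75292, 15957460, 82838),
  death04 = c(813, 208, 3102, 1, 6226, 6)
)

# Pearson correlation between population and the raw number of deaths.
r_count <- cor(toy$pop, toy$death04, method = "pearson")

# Pearson correlation between population and deaths per 100,000 people.
r_rate <- cor(toy$pop, toy$death04 / toy$pop * 1e5, method = "pearson")

cat("population vs deaths:", r_count, "\n")
cat("population vs rate:  ", r_rate, "\n")
```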

Next, I show the countries in order of their size on the x axis and their murders on the y axis as a histogram first.

And now as a scatter plot.

Now I add lines showing the median, 25th and 75th percentiles. (Does this even make sense?)

Now I calculate the murder rate by dividing the raw number of murders by the country’s population and display the results in a histogram.

To make the results more interpretable I employ some transformations of the data. First, I simply multiply by 100k to get the murder rate per 100,000 and display the results in a histogram.
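The two steps above amount to one division and one multiplication. A minimal sketch using the six countries from the `head()` output, whose `dRate04` and `d100k` columns can be reproduced this way:

```r
toy <- data.frame(
  nat = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
          "Antigua and Barbuda"),
  pop = c(26693486, 3124861, 32396048, 75292, 15957460, 82838),
  death04 = c(813, 208, 3102, 1, 6226, 6),
  stringsAsFactors = FALSE
)

# Deaths per person, then rescaled to deaths per 100,000 people.
toy$dRate04 <- toy$death04 / toy$pop
toy$d100k <- toy$dRate04 * 1e5

print(toy)
hist(toy$d100k, main = "Deaths per 100,000 (toy sample)", xlab = "d100k")
```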

Now, I do the same thing by adjusting the scale.

Now I employ some statistical transformations to make the distribution more normal. First I employ the log transformation.

Next, I try a square-root function.
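A sketch of both transformations on a handful of rates from the `head()` output plus one zero, since the full data include countries with a rate of 0. The `+ 1` shift before taking logs is an assumption added here to keep zeros finite; it is not taken from the original analysis.

```r
d100k <- c(0, 1.328, 3.046, 6.656, 7.243, 9.575, 39.016)

# A plain log sends zero rates to -Inf, which a histogram cannot place.
raw_log <- log10(d100k)

# Shifting by 1 first keeps every value finite (assumed workaround).
log_rate <- log10(d100k + 1)

# The square root is defined at zero, so no shift is needed.
sqrt_rate <- sqrt(d100k)

print(raw_log)
print(log_rate)
print(sqrt_rate)
```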

Now, I add the last layer of the plot to find the interesting cases and outliers. There are two kinds of outliers: those that have unusually large or small values of the dependent variable in their own right, and those that have an unusual value on the dependent variable in terms of their relationship to population. As I identify these outliers I will

Inspect the structure of the data.

#str(Homicide)

Why does it treat population as a factor? I decide to change it to an integer, since you can't have a fraction of a person.
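One common reason a numeric column is read as a factor is that the values contain non-numeric characters such as thousands separators; whether that was the cause here is an assumption. The sketch below shows why the conversion has to go through `as.character()` first: calling `as.integer()` directly on a factor returns the internal level codes, not the numbers.

```r
# Hypothetical population column that was read in as text with commas.
pop_factor <- factor(c("26,693,486", "3,124,861", "75,292"))

# Wrong: these are level codes, not populations.
codes <- as.integer(pop_factor)

# Right: convert to text, strip the commas, then convert to integer.
pop_int <- as.integer(gsub(",", "", as.character(pop_factor)))

print(codes)
print(pop_int)
```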

hist(Homicide$04, breaks = "Sturges")

Turns out that you can’t use a number as a variable name because it gets interpreted as a numeric constant, so I am changing it back to death!

#names(Homicide) <- c("country","pop","death02","death04")
#head(Homicide)
hist(Homicide$death04, breaks = c(0,70000,5000)) #Why has this become a density plot?
#Homicide$pop
#rm(Homicide$pop)

Now make a per capita variable.

Homicide$perCap04 <- Homicide$death04/Homicide$popNum
head(Homicide)
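On the density-plot question above: `c(0,70000,5000)` supplies three break points (0, 5000 and 70000 once sorted) rather than a sequence from 0 to 70000 in steps of 5000. The resulting bins have unequal widths, and `hist()` switches to densities in that case. A sketch on the six `death04` values from the `head()` output, with `plot = FALSE` so that only the bin structure is inspected:

```r
death04 <- c(813, 208, 3102, 1, 6226, 6)

# Three literal break points: two bins of very different widths.
h_uneven <- hist(death04, breaks = c(0, 70000, 5000), plot = FALSE)

# seq() builds the evenly spaced breaks that were probably intended.
h_even <- hist(death04, breaks = seq(0, 70000, 5000), plot = FALSE)

print(h_uneven$breaks)
print(h_uneven$equidist)  # unequal widths: drawn as densities by default
print(h_even$equidist)    # equal widths: drawn as counts by default
```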