Murder Minder
========================================================

Gapminder Data

I obtained data on death through interpersonal violence in 2002 and 2004 and total population for 2004 from Gapminder. I then loaded in the cases that did not have missing data.

The data came from two separate spreadsheets downloaded from Gapminder: the population spreadsheet and the interpersonal violence spreadsheet. The population spreadsheet had data for over 200 years and, since it included countries that have since disappeared, had many more entries than the violence data. I merged the data by hand in Mac Numbers, which may have been a mistake, since Numbers does not seem to have an option for deleting individual cells.
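The merge could also be done in R rather than by hand. A minimal sketch, assuming two small toy data frames that stand in for the Gapminder spreadsheets: `popToy`, `violenceToy`, and the extra population-only row are hypothetical, while the three real rows copy values from the `head()` output shown later in this report.

```r
# Toy stand-in for the population spreadsheet.
# The last row is invented to mimic a country missing from the violence data.
popToy <- data.frame(
  Country    = c("Afghanistan", "Albania", "Algeria", "Hypothetical Former State"),
  population = c(26693486, 3124861, 32396048, 1000000),
  stringsAsFactors = FALSE
)

# Toy stand-in for the interpersonal violence spreadsheet
violenceToy <- data.frame(
  Country  = c("Afghanistan", "Albania", "Algeria"),
  Death_04 = c(813, 208, 3102),
  stringsAsFactors = FALSE
)

# Default merge is an inner join: only countries present in both tables survive
mergedToy <- merge(popToy, violenceToy, by = "Country")

# all.x = TRUE keeps every population row and fills the missing deaths with NA
mergedAll <- merge(popToy, violenceToy, by = "Country", all.x = TRUE)

print(mergedToy)
print(sum(is.na(mergedAll$Death_04)))
```

The inner join drops unmatched countries silently, while the `all.x = TRUE` version leaves them in as `NA` so they can be inspected before being removed.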

The assignment is to create 2-5 plots that make use of the techniques from Lesson 3. These include simple histograms, boxplots split over a categorical variable, or frequency polygons. The assignment also includes the task of saving the pictures created and posting them to the discussion board as well as submitting my code on the Udacity website.

I will begin with a histogram of worldwide murders in 2004.

The first thing I realize is that I have not followed the naming conventions from the Fox and Weisberg text of capitalizing data set names and keeping variable names in lowercase. Indeed, I seem to have done just the opposite.

#install.packages("knitr")
#install.packages("ggplot2")
library(knitr)
library(ggplot2)
worldHomicide <- read.csv("/Users/michaelreinhard/Google Drive/R/worldHomicide.csv")
head(worldHomicide)

##               Country population Death_02 Death_04 deathRate04 dper100k
## 1         Afghanistan   26693486      916      813   3.046e-05    3.046
## 2             Albania    3124861      187      208   6.656e-05    6.656
## 3             Algeria   32396048     3745     3102   9.575e-05    9.575
## 4             Andorra      75292        1        1   1.328e-05    1.328
## 5              Angola   15957460     5217     6226   3.902e-04   39.016
## 6 Antigua and Barbuda      82838        7        6   7.243e-05    7.243
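The two rate columns can be checked against the raw counts. A minimal sketch, assuming that `deathRate04` is simply 2004 deaths divided by population and that `dper100k` is that rate times 100,000 (the numbers below are the Afghanistan row from the output above; the variable names are only illustrative):

```r
# Afghanistan row copied from the head() output
pop04   <- 26693486
death04 <- 813

deathRate <- death04 / pop04   # deaths per person
dper100k  <- deathRate * 1e5   # deaths per 100,000 people

print(signif(deathRate, 4))
print(round(dper100k, 3))
```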

WrdMurder <- as.data.frame(worldHomicide) #this seems to be sufficient without adding the table() command in between.
head(WrdMurder)

##               Country population Death_02 Death_04 deathRate04 dper100k
## 1         Afghanistan   26693486      916      813   3.046e-05    3.046
## 2             Albania    3124861      187      208   6.656e-05    6.656
## 3             Algeria   32396048     3745     3102   9.575e-05    9.575
## 4             Andorra      75292        1        1   1.328e-05    1.328
## 5              Angola   15957460     5217     6226   3.902e-04   39.016
## 6 Antigua and Barbuda      82838        7        6   7.243e-05    7.243

#Now make variable names lower case
names(WrdMurder) <- tolower(names(WrdMurder))
names(WrdMurder)

## [1] "country"     "population"  "death_02"    "death_04"    "deathrate04"
## [6] "dper100k"

names(WrdMurder) <- c("nat","pop","death02","death04","dRate04","d100k")
head(WrdMurder)

##                   nat      pop death02 death04   dRate04  d100k
## 1         Afghanistan 26693486     916     813 3.046e-05  3.046
## 2             Albania  3124861     187     208 6.656e-05  6.656
## 3             Algeria 32396048    3745    3102 9.575e-05  9.575
## 4             Andorra    75292       1       1 1.328e-05  1.328
## 5              Angola 15957460    5217    6226 3.902e-04 39.016
## 6 Antigua and Barbuda    82838       7       6 7.243e-05  7.243

names(WrdMurder)

## [1] "nat"     "pop"     "death02" "death04" "dRate04" "d100k"

Now I am going to attach the data set to save a lot of typing. I also take a look to see how many cases there are after missing data is eliminated.

WrdMurderNA <- na.omit(WrdMurder)
head(WrdMurderNA)

##                   nat      pop death02 death04   dRate04  d100k
## 1         Afghanistan 26693486     916     813 3.046e-05  3.046
## 2             Albania  3124861     187     208 6.656e-05  6.656
## 3             Algeria 32396048    3745    3102 9.575e-05  9.575
## 4             Andorra    75292       1       1 1.328e-05  1.328
## 5              Angola 15957460    5217    6226 3.902e-04 39.016
## 6 Antigua and Barbuda    82838       7       6 7.243e-05  7.243

summary(WrdMurderNA)

##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k      
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92  
##  Median :  328   Median :6.59e-05   Median : 6.59  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39  
##

str(WrdMurderNA)

## 'data.frame':    191 obs. of  6 variables:
##  $ nat    : Factor w/ 195 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ pop    : int  26693486 3124861 32396048 75292 15957460 82838 38340778 3062612 20103822 8185553 ...
##  $ death02: int  916 187 3745 1 5217 7 3329 112 284 75 ...
##  $ death04: int  813 208 3102 1 6226 6 2596 100 253 63 ...
##  $ dRate04: num  3.05e-05 6.66e-05 9.58e-05 1.33e-05 3.90e-04 ...
##  $ d100k  : num  3.05 6.66 9.58 1.33 39.02 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:4] 25 85 153 191
##   .. ..- attr(*, "names")= chr [1:4] "25" "85" "153" "191"
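The `na.action` attribute in the `str()` output above records which rows `na.omit()` dropped (here rows 25, 85, 153 and 191, so 195 rows became 191). A minimal sketch of how to read that attribute, on a made-up four-row data frame (`toy` is hypothetical):

```r
# Made-up data with two incomplete rows
toy <- data.frame(nat     = c("A", "B", "C", "D"),
                  death04 = c(10, NA, 3, NA),
                  stringsAsFactors = FALSE)

toyClean <- na.omit(toy)

# Row indices that na.omit() removed
dropped <- attr(toyClean, "na.action")

print(nrow(toy) - nrow(toyClean))   # number of rows removed
print(as.integer(dropped))          # which rows they were
print(sum(!complete.cases(toy)))    # same count, computed before dropping
```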

hist(WrdMurderNA$pop)

plot of chunk removeNA

hist(WrdMurderNA$d100k)

plot of chunk removeNA

The first thing we see is that the distribution seems to be very skewed to the right.

Now I switch over to ggplot.

p <- ggplot(data=WrdMurderNA, aes(x=d100k))
p + geom_bar()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk basicGgplot

p + geom_histogram(binwidth = 1)

plot of chunk basicGgplot

p + geom_density()

plot of chunk basicGgplot

WrdMurderNA_large <- subset(WrdMurderNA, pop >= 1000000)

I have imported the data. Now I want to inspect the basic properties of the data set. There are 191 cases in the NA-purged data set and, it appears, 149 in the subset of countries with at least one million people (WrdMurderNA_large).

summary(WrdMurderNA$pop)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 1.73e+03 1.35e+06 6.47e+06 3.34e+07 2.15e+07 1.30e+09

class(WrdMurderNA$pop)

## [1] "integer"

summary(WrdMurderNA$death04)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      63     328    3130    2060   61200

This is a good illustration of how right-skewed data pull the mean above the median: here the mean (3130) is nearly ten times the median (328).
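The same effect shows up even in a handful of values. A small sketch, assuming only the six `death04` values from the `head()` output rather than the full data set:

```r
# death04 for the first six countries in the head() output
deathHead <- c(813, 208, 3102, 1, 6226, 6)

# A couple of large values drag the mean up; the median barely notices them
print(mean(deathHead))
print(median(deathHead))
```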

summary(WrdMurderNA$d100k)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.92    6.59   10.80   16.30   86.40

WrdMurderNA[186,]

##         nat    pop death02 death04   dRate04  d100k
## 189 Vanuatu 205561       3       2 9.729e-06 0.9729

WrdMurderNA[185,]

##            nat      pop death02 death04   dRate04 d100k
## 188 Uzbekistan 25708188     951     921 3.583e-05 3.583

WrdMurderNA[183,]

##               nat       pop death02 death04   dRate04 d100k
## 186 United States 294063120   15726   17647 6.001e-05 6.001

#WrdMurderNA$nat["Afghanistan"]

#WrdMurderNA[nat=="Afghanistan",]
#WrdMurderNA[nat=="United States",]
#WrdMurderNA[nat=="United Kingdom",]
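The commented-out lookups above refer to `nat` as a bare name, which only works if the data frame has been attached. A minimal sketch of looking rows up by country name without attaching, using a toy data frame built from the three rows printed above (`toy` is a hypothetical stand-in for `WrdMurderNA`):

```r
toy <- data.frame(
  nat     = c("United States", "Uzbekistan", "Vanuatu"),
  pop     = c(294063120, 25708188, 205561),
  death04 = c(17647, 921, 2),
  stringsAsFactors = FALSE
)

# Refer to the column through the data frame ...
us <- toy[toy$nat == "United States", ]

# ... or let subset() evaluate the condition inside the data frame
sel <- subset(toy, nat %in% c("Uzbekistan", "Vanuatu"))

print(us)
print(sel)
```

Unlike positional indexing such as `[186,]`, a name-based lookup keeps working if rows are added or removed.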

I want to do a few things with this data set.

First I create a simple histogram of the world’s homicides. I adjust the bin width to see if there is an optimal trade-off between granularity and detecting an overall pattern. Then I create a single boxplot to identify the outliers. I also create a violin plot.
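The bin-width trade-off can be seen without drawing anything. A minimal sketch in base R, assuming only the six `death04` values from the `head()` output (not the full data set); `hist(plot = FALSE)` returns the counts per bin:

```r
deathHead <- c(813, 208, 3102, 1, 6226, 6)

# Narrow bins: lots of detail, but most bins are empty
fine <- hist(deathHead, breaks = seq(0, 6300, by = 100), plot = FALSE)

# Wide bins: the overall shape is visible, but small countries blur together
coarse <- hist(deathHead, breaks = seq(0, 7000, by = 1000), plot = FALSE)

print(length(fine$counts))      # number of narrow bins
print(sum(fine$counts > 0))     # how many of them hold any data
print(coarse$counts)            # counts per wide bin
```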

pMurder <- ggplot(aes(x = death04), data = WrdMurderNA)
pMurder + geom_histogram(binwidth = 100)

plot of chunk histograms

pMurder + geom_histogram(binwidth = 1, fill = "steelblue") + xlim(0,50)

plot of chunk histograms

pMurder + geom_histogram(binwidth = 1000, fill = "steelblue") + xlim(5000,70000)

plot of chunk histograms

Next, I look at the population variable to see how population size is distributed among the nations of the world. I first create a general histogram and experiment with a few different bin sizes.

pPop <- ggplot(aes(x = pop), data = WrdMurderNA)
pPop + geom_histogram()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk histogramPop

#pPop + geom_histogram(binwidth = diff(range(WrdMurderNA$pop))/10) # the original attempt, range(WrdMurder)pop/10, was a syntax error
pPop + geom_density(fill = "steelblue")

plot of chunk histogramPop

ggsave('population_density.jpg')

## Saving 7 x 5 in image

Next, I attempt to fit a power distribution to the histogram to get a sense of the most appropriate model for the distribution of population.
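No fitting code appears in this report, so the following is only a rough sketch of one common approach, assuming that "power distribution" means a power-law (Pareto-type) tail. It uses simulated data, not the Gapminder populations, and `alpha`, `sizes`, and `ranked` are hypothetical names. On a log-log plot of rank against size, a power law shows up as a roughly straight line whose slope estimates the negative of the tail exponent.

```r
set.seed(1)

alpha <- 1.2                              # hypothetical tail exponent
sizes <- 1e4 * runif(200)^(-1 / alpha)    # simulated Pareto draws via the inverse CDF

# Rank 1 is the largest value
ranked <- sort(sizes, decreasing = TRUE)
ranks  <- seq_along(ranked)

# Straight-line fit on the log-log scale
fit   <- lm(log(ranks) ~ log(ranked))
slope <- unname(coef(fit)[2])

print(slope)   # should be negative and in the neighbourhood of -alpha
```

This rank-size regression is a quick diagnostic rather than a rigorous estimator; a clearly curved log-log plot would suggest that a power law is not a good description.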

Another thing that would be nice would be to make murders per capita into an ordered factor and use it as the fill color in a uniform size bar chart showing the proportions of countries at each level of population that are high, medium, or low murder countries. This is done with the cut function:

WrdMurderNA$d100k5cat <- cut(WrdMurderNA$d100k,quantile(WrdMurderNA$d100k, probs=seq(0,1,0.2)))
WrdMurderNA$death04cat5 <- cut(WrdMurderNA$death04,quantile(WrdMurderNA$death04, probs=seq(0,1,0.2)))
summary(WrdMurderNA$death04cat5) #still gives missing values

##              (0,38]            (38,208]           (208,680] 
##                  33                  38                  38 
##      (680,2.56e+03] (2.56e+03,6.12e+04]                NA's 
##                  38                  38                   6
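The six missing values are most likely a side effect of how `cut()` treats its lowest break. By default the intervals are open on the left, so any value equal to the smallest break (here the minimum, 0) falls outside every interval and becomes `NA`. A minimal sketch on a made-up vector (`x` is hypothetical), showing the `include.lowest = TRUE` argument that closes the first interval:

```r
x <- c(0, 0, 1, 2, 3, 4, 5, 6, 7, 8)
breaks <- quantile(x, probs = seq(0, 1, 0.2))

# Default: first interval is (min, q20], so the minimum itself is excluded
openLeft <- cut(x, breaks)

# include.lowest = TRUE makes the first interval [min, q20]
closedLeft <- cut(x, breaks, include.lowest = TRUE)

print(sum(is.na(openLeft)))
print(sum(is.na(closedLeft)))
```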

WrdMurderNA$d100k5cat

##   [1] (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (0,1.4]     (18.6,86.4]
##   [6] (3.4,9.91]  (3.4,9.91]  (1.4,3.4]   (0,1.4]     (0,1.4]    
##  [11] (1.4,3.4]   (18.6,86.4] (0,1.4]     (3.4,9.91]  (9.91,18.6]
##  [16] (9.91,18.6] (1.4,3.4]   (18.6,86.4] (9.91,18.6] (3.4,9.91] 
##  [21] (3.4,9.91]  (1.4,3.4]   (18.6,86.4] (18.6,86.4] (0,1.4]    
##  [26] (1.4,3.4]   (9.91,18.6] (18.6,86.4] (18.6,86.4] (9.91,18.6]
##  [31] (0,1.4]     (9.91,18.6] (18.6,86.4] (18.6,86.4] (3.4,9.91] 
##  [36] (1.4,3.4]   (18.6,86.4] (9.91,18.6] (18.6,86.4] (18.6,86.4]
##  [41] <NA>        (3.4,9.91]  (18.6,86.4] (1.4,3.4]   (3.4,9.91] 
##  [46] (0,1.4]     (0,1.4]     (0,1.4]     (1.4,3.4]   (9.91,18.6]
##  [51] (9.91,18.6] (18.6,86.4] (0,1.4]     (18.6,86.4] (18.6,86.4]
##  [56] (9.91,18.6] (3.4,9.91]  (18.6,86.4] (0,1.4]     (1.4,3.4]  
##  [61] (0,1.4]     (9.91,18.6] (9.91,18.6] (3.4,9.91]  (0,1.4]    
##  [66] (9.91,18.6] (0,1.4]     (3.4,9.91]  (18.6,86.4] (9.91,18.6]
##  [71] (18.6,86.4] (18.6,86.4] (3.4,9.91]  (18.6,86.4] (1.4,3.4]  
##  [76] (0,1.4]     (3.4,9.91]  (3.4,9.91]  (1.4,3.4]   (3.4,9.91] 
##  [81] (0,1.4]     (3.4,9.91]  (0,1.4]     (0,1.4]     (3.4,9.91] 
##  [86] (9.91,18.6] (18.6,86.4] (3.4,9.91]  (18.6,86.4] (1.4,3.4]  
##  [91] (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (9.91,18.6] (1.4,3.4]  
##  [96] (9.91,18.6] (9.91,18.6] (1.4,3.4]   (3.4,9.91]  (0,1.4]    
## [101] (3.4,9.91]  (9.91,18.6] (9.91,18.6] (3.4,9.91]  (1.4,3.4]  
## [106] (9.91,18.6] (0,1.4]     (1.4,3.4]   (9.91,18.6] (1.4,3.4]  
## [111] (3.4,9.91]  (0,1.4]     (3.4,9.91]  <NA>        (1.4,3.4]  
## [116] (0,1.4]     (18.6,86.4] (9.91,18.6] (9.91,18.6] (3.4,9.91] 
## [121] (9.91,18.6] (0,1.4]     (0,1.4]     (9.91,18.6] (18.6,86.4]
## [126] (9.91,18.6] <NA>        (0,1.4]     (1.4,3.4]   (3.4,9.91] 
## [131] <NA>        (9.91,18.6] (9.91,18.6] (9.91,18.6] (1.4,3.4]  
## [136] (18.6,86.4] (1.4,3.4]   (1.4,3.4]   (0,1.4]     (1.4,3.4]  
## [141] (18.6,86.4] (18.6,86.4] (9.91,18.6] (18.6,86.4] (9.91,18.6]
## [146] (0,1.4]     <NA>        (3.4,9.91]  (1.4,3.4]   (9.91,18.6]
## [151] (1.4,3.4]   (3.4,9.91]  (18.6,86.4] (0,1.4]     (1.4,3.4]  
## [156] (1.4,3.4]   (1.4,3.4]   (1.4,3.4]   (18.6,86.4] (1.4,3.4]  
## [161] (3.4,9.91]  (18.6,86.4] (9.91,18.6] (18.6,86.4] (0,1.4]    
## [166] (0,1.4]     (1.4,3.4]   (1.4,3.4]   (18.6,86.4] (3.4,9.91] 
## [171] (9.91,18.6] (9.91,18.6] (0,1.4]     (9.91,18.6] (1.4,3.4]  
## [176] (1.4,3.4]   (3.4,9.91]  <NA>        (18.6,86.4] (9.91,18.6]
## [181] (0,1.4]     (1.4,3.4]   (3.4,9.91]  (3.4,9.91]  (3.4,9.91] 
## [186] (0,1.4]     (18.6,86.4] (3.4,9.91]  (1.4,3.4]   (18.6,86.4]
## [191] (18.6,86.4]
## Levels: (0,1.4] (1.4,3.4] (3.4,9.91] (9.91,18.6] (18.6,86.4]

summary(WrdMurderNA)

##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k             d100k5cat 
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00   (0,1.4]    :33  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92   (1.4,3.4]  :38  
##  Median :  328   Median :6.59e-05   Median : 6.59   (3.4,9.91] :38  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76   (9.91,18.6]:38  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27   (18.6,86.4]:38  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39   NA's       : 6  
##                                                                     
##               death04cat5
##  (0,38]             :33  
##  (38,208]           :38  
##  (208,680]          :38  
##  (680,2.56e+03]     :38  
##  (2.56e+03,6.12e+04]:38  
##  NA's               : 6  
##

WrdMurderNAnone <- WrdMurderNA
summary(WrdMurderNAnone)

##                   nat           pop              death02     
##  Afghanistan        :  1   Min.   :1.73e+03   Min.   :    0  
##  Albania            :  1   1st Qu.:1.35e+06   1st Qu.:   57  
##  Algeria            :  1   Median :6.47e+06   Median :  345  
##  Andorra            :  1   Mean   :3.34e+07   Mean   : 2921  
##  Angola             :  1   3rd Qu.:2.15e+07   3rd Qu.: 1618  
##  Antigua and Barbuda:  1   Max.   :1.30e+09   Max.   :57516  
##  (Other)            :185                                     
##     death04         dRate04             d100k             d100k5cat 
##  Min.   :    0   Min.   :0.00e+00   Min.   : 0.00   (0,1.4]    :33  
##  1st Qu.:   63   1st Qu.:1.92e-05   1st Qu.: 1.92   (1.4,3.4]  :38  
##  Median :  328   Median :6.59e-05   Median : 6.59   (3.4,9.91] :38  
##  Mean   : 3129   Mean   :1.08e-04   Mean   :10.76   (9.91,18.6]:38  
##  3rd Qu.: 2062   3rd Qu.:1.63e-04   3rd Qu.:16.27   (18.6,86.4]:38  
##  Max.   :61229   Max.   :8.64e-04   Max.   :86.39   NA's       : 6  
##                                                                     
##               death04cat5
##  (0,38]             :33  
##  (38,208]           :38  
##  (208,680]          :38  
##  (680,2.56e+03]     :38  
##  (2.56e+03,6.12e+04]:38  
##  NA's               : 6  
##

WrdMurderNAnone <- subset(WrdMurderNA, is.na(death04cat5==FALSE))
summary(WrdMurderNAnone)

##          nat         pop           death02     death04     dRate04 
##  Cook Is   :1   Min.   : 1728   Min.   :0   Min.   :0   Min.   :0  
##  Monaco    :1   1st Qu.:12003   1st Qu.:0   1st Qu.:0   1st Qu.:0  
##  Niue      :1   Median :19439   Median :0   Median :0   Median :0  
##  Palau     :1   Mean   :19209   Mean   :0   Mean   :0   Mean   :0  
##  San Marino:1   3rd Qu.:27242   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
##  Tuvalu    :1   Max.   :35282   Max.   :0   Max.   :0   Max.   :0  
##  (Other)   :0                                                      
##      d100k         d100k5cat              death04cat5
##  Min.   :0   (0,1.4]    :0   (0,38]             :0   
##  1st Qu.:0   (1.4,3.4]  :0   (38,208]           :0   
##  Median :0   (3.4,9.91] :0   (208,680]          :0   
##  Mean   :0   (9.91,18.6]:0   (680,2.56e+03]     :0   
##  3rd Qu.:0   (18.6,86.4]:0   (2.56e+03,6.12e+04]:0   
##  Max.   :0   NA's       :6   NA's               :6   
##
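The summary above contains only the six `NA` rows, the opposite of what the name `WrdMurderNAnone` suggests. The cause appears to be the placement of the closing parenthesis: `is.na(death04cat5==FALSE)` tests whether the comparison is missing, which is true exactly for the `NA` rows. A minimal sketch of the difference on a made-up vector (`cat5` is hypothetical):

```r
cat5 <- c("(0,38]", NA, "(38,208]", NA)

# Parenthesis in the wrong place: TRUE for the NA entries
keepsNA <- is.na(cat5 == FALSE)

# Intended test: TRUE for the non-missing entries
dropsNA <- !is.na(cat5)

print(keepsNA)
print(dropsNA)
```

Used inside `subset()`, the second form keeps the rows that have a category and drops the ones that do not.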

#death04nrm <- death04/sum(death04)
#death04nrm5cat <- cut(death04nrm, quantile(death04nrm, probs=seq(0,1,0.2)))

#WrdMurder$d100knrmQuintiles100 <- cut(WrdMurder$d100knrm,quantile(WrdMurder$d100knrm, probs=seq(0,1,0.2)))

Now I will make the histogram of population with murder rates as the color. Since the population variable is so spread out I will make the x scale logarithmic.
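To see why a logarithmic x scale helps, here is a small sketch using only the six populations from the `head()` output (an assumption made for brevity; the full data set spans an even wider range):

```r
popHead <- c(26693486, 3124861, 32396048, 75292, 15957460, 82838)

# On the raw scale the largest value is hundreds of times the smallest
ratio <- max(popHead) / min(popHead)

# On the log10 scale the same values sit within a few units of each other
logSpread <- diff(range(log10(popHead)))

print(ratio)
print(logSpread)
```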

# Chunk popMurder. Note: dPercentiles5 and death04nrm5cat are never created above
# (death04nrm5cat appears only in commented-out code), so the calls that use them
# will fail until those columns exist.
greys <- c("grey80", "grey70", "grey60", "grey50", "grey40")
pPop <- ggplot(aes(x = pop), data = WrdMurderNA)
pPop + geom_histogram(aes(fill = d100k5cat)) + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()
pPop + geom_density(aes(fill = dPercentiles5)) + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()
# position is a layer argument, not an aesthetic, so it goes outside aes()
pPop + geom_bar(aes(fill = dPercentiles5), position = "stack") + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()
pPop + geom_bar(aes(fill = death04nrm5cat), position = "stack") + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()

# attach() is never called above, so pop is referenced through the data frame
popNorm <- as.numeric(WrdMurder$pop)/sum(as.numeric(WrdMurder$pop), na.rm = TRUE)
popNorm
sum(popNorm, na.rm = TRUE)
class(popNorm)
WrdMurder$popNorm <- popNorm
nrow(WrdMurder)

# data is an argument of ggplot(), not of aes()
pPopNorm <- ggplot(aes(x = popNorm), data = WrdMurder)
pPopNorm + geom_bar(aes(fill = death04nrm5cat), position = "stack") + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()
pop_dist <- ggplot(aes(x = pop), data = na.omit(WrdMurderNA))
pop_dist + geom_histogram(aes(fill = d100k5cat), binwidth = 0.1, position = "fill") + scale_x_log10()

pop_dist + geom_histogram(aes(fill = d100k5cat), binwidth = 0.1, position = "fill") + scale_color_manual(values = greys) + scale_fill_manual(values = greys) + scale_x_log10()

plot <- ggplot(aes(x = pop), data = na.omit(WrdMurderNA)) 
plot + geom_histogram(aes(y = ..density..)) + facet_grid(.~d100k5cat) # this makes the facets side by side.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.