CUNY - Bridge R HW 2

Amanda Arce

July 23, 2018

BONUS - Place the original .csv file in a GitHub repository and have R read it from the link. This will be a very useful skill as you progress in your data science education and career.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df <- read_csv('https://raw.githubusercontent.com/mandiemannz/MSDS_Bridge-2018/master/terrorism.csv')
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_integer(),
##   year = col_integer(),
##   methodology = col_character(),
##   method = col_character(),
##   incidents = col_integer(),
##   incidents.us = col_integer(),
##   suicide = col_integer(),
##   suicide.us = col_integer(),
##   nkill = col_integer(),
##   nkill.us = col_integer(),
##   nwound = col_integer(),
##   nwound.us = col_integer()
## )
## See spec(...) for full column specifications.
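
The "Missing column names filled in: 'X1'" warning above suggests that the first column of the CSV has an empty header, which is what typically happens when a file is saved together with its row names. The following is a minimal sketch of dropping such an index column. It uses base R's `read.csv()` on a small inline CSV so that it runs without a network connection. The rows are made up, and `csv_text`, `raw`, and `clean` are hypothetical names. Note that base R fills the blank header as `X`, whereas readr used `X1` above.

```r
# Toy CSV with an empty first header, mimicking a file saved with row names.
# The values are made up for illustration and are not the terrorism data.
csv_text <- '"","year","nkill"
"1",1970,10
"2",1971,20
"3",1972,30'

raw <- read.csv(text = csv_text)
names(raw)   # base R fills the blank header as "X"

# Drop the index column by name; the remaining columns are unchanged.
clean <- raw[, setdiff(names(raw), "X")]
names(clean)
```

With the tibble read above, the same idea would apply to the `X1` column.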
head(df)
## # A tibble: 6 x 26
##      X1  year methodology method incidents incidents.us suicide suicide.us
##   <int> <int> <chr>       <chr>      <int>        <int>   <int>      <int>
## 1     1  1970 PGIS        p            651          468       0          0
## 2     2  1971 PGIS        p            470          247       0          0
## 3     3  1972 PGIS        p            494           64       0          0
## 4     4  1973 PGIS        p            473           58       0          0
## 5     5  1974 PGIS        p            580           94       0          0
## 6     6  1975 PGIS        p            740          149       0          0
## # ... with 18 more variables: nkill <int>, nkill.us <int>, nwound <int>,
## #   nwound.us <int>, pNA.nkill <dbl>, pNA.nkill.us <dbl>,
## #   pNA.nwound <dbl>, pNA.nwound.us <dbl>, worldPopulation <dbl>,
## #   USpopulation <dbl>, worldDeathRate <dbl>, USdeathRate <dbl>,
## #   worldDeaths <dbl>, USdeaths <dbl>, kill.pmp <dbl>, kill.pmp.us <dbl>,
## #   pkill <dbl>, pkill.us <dbl>
summary(df)
##        X1             year      methodology           method         
##  Min.   : 1.00   Min.   :1970   Length:46          Length:46         
##  1st Qu.:12.25   1st Qu.:1981   Class :character   Class :character  
##  Median :23.50   Median :1992   Mode  :character   Mode  :character  
##  Mean   :23.50   Mean   :1992                                        
##  3rd Qu.:34.75   3rd Qu.:2004                                        
##  Max.   :46.00   Max.   :2015                                        
##                                                                      
##    incidents      incidents.us       suicide      suicide.us 
##  Min.   :  470   Min.   :  6.00   Min.   :  0   Min.   :0.0  
##  1st Qu.: 1348   1st Qu.: 27.00   1st Qu.:  1   1st Qu.:0.0  
##  Median : 2865   Median : 40.00   Median : 11   Median :0.0  
##  Mean   : 3516   Mean   : 59.84   Mean   :106   Mean   :0.2  
##  3rd Qu.: 4213   3rd Qu.: 64.00   3rd Qu.:119   3rd Qu.:0.0  
##  Max.   :16840   Max.   :468.00   Max.   :906   Max.   :4.0  
##                  NA's   :1        NA's   :1     NA's   :1    
##      nkill          nkill.us           nwound        nwound.us      
##  Min.   :  171   Min.   :   6.00   Min.   :   82   Min.   :   3.00  
##  1st Qu.: 3646   1st Qu.:  16.25   1st Qu.: 3423   1st Qu.:  28.25  
##  Median : 6715   Median :  25.50   Median : 6620   Median :  47.00  
##  Mean   : 7803   Mean   : 113.91   Mean   : 9842   Mean   : 117.63  
##  3rd Qu.: 9226   3rd Qu.:  59.75   3rd Qu.:12630   3rd Qu.:  81.75  
##  Max.   :43550   Max.   :2910.00   Max.   :43495   Max.   :1214.00  
##                                                                     
##    pNA.nkill         pNA.nkill.us        pNA.nwound      
##  Min.   :0.001544   Min.   :0.000000   Min.   :0.005145  
##  1st Qu.:0.012920   1st Qu.:0.009448   1st Qu.:0.036557  
##  Median :0.030534   Median :0.898408   Median :0.079877  
##  Mean   :0.072987   Mean   :0.552738   Mean   :0.141271  
##  3rd Qu.:0.103594   3rd Qu.:0.958357   3rd Qu.:0.173619  
##  Max.   :0.315914   Max.   :0.990177   Max.   :0.668016  
##  NA's   :1          NA's   :1          NA's   :1         
##  pNA.nwound.us     worldPopulation    USpopulation    worldDeathRate  
##  Min.   :0.00000   Min.   :3682488   Min.   :209486   Min.   : 7.748  
##  1st Qu.:0.01759   1st Qu.:4538702   1st Qu.:232313   1st Qu.: 8.322  
##  Median :0.90068   Median :5527580   Median :259218   Median : 9.090  
##  Mean   :0.55578   Mean   :5498924   Mean   :262592   Mean   : 9.321  
##  3rd Qu.:0.95807   3rd Qu.:6420073   3rd Qu.:292900   3rd Qu.:10.087  
##  Max.   :0.98986   Max.   :7349472   Max.   :321774   Max.   :11.966  
##  NA's   :1                                                            
##   USdeathRate     worldDeaths          USdeaths          kill.pmp      
##  Min.   :7.900   Min.   :44064648   Min.   :1892879   Min.   :0.04604  
##  1st Qu.:8.325   1st Qu.:45780625   1st Qu.:1980946   1st Qu.:0.60046  
##  Median :8.600   Median :50246087   Median :2222083   Median :1.14272  
##  Mean   :8.580   Mean   :49978598   Mean   :2224823   Mean   :1.28422  
##  3rd Qu.:8.800   3rd Qu.:53356899   3rd Qu.:2425626   3rd Qu.:1.53252  
##  Max.   :9.500   Max.   :57421426   Max.   :2665996   Max.   :5.99385  
##                                                                        
##   kill.pmp.us           pkill              pkill.us        
##  Min.   : 0.02715   Min.   :3.881e-06   Min.   :3.142e-06  
##  1st Qu.: 0.06778   1st Qu.:6.988e-05   1st Qu.:8.137e-06  
##  Median : 0.09195   Median :1.353e-04   Median :1.138e-05  
##  Mean   : 0.41003   Mean   :1.488e-04   Mean   :4.842e-05  
##  3rd Qu.: 0.23054   3rd Qu.:1.701e-04   3rd Qu.:2.692e-05  
##  Max.   :10.18208   Max.   :7.736e-04   Max.   :1.204e-03  
## 

1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

mean(df$nkill)
## [1] 7802.543
median(df$nkill)
## [1] 6715

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

mean(df$nkill.us)
## [1] 113.913
median(df$nkill.us)
## [1] 25.5
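
Two details of the mean and median calls above are worth a small illustration. First, the summary output reports NA's for several columns, and `mean()` and `median()` return `NA` on such columns unless `na.rm = TRUE` is given. Second, a single extreme value pulls the mean far above the median, which is consistent with the gap seen for `nkill.us`. The sketch below uses a made-up vector (`killed` is a hypothetical name), not values from the data set.

```r
# Made-up counts: one missing value and one extreme value.
killed <- c(6, 16, 25, 60, NA, 2910)

mean(killed)                   # NA, because one value is missing
mean(killed, na.rm = TRUE)     # 603.4, pulled up by the extreme value
median(killed, na.rm = TRUE)   # 25, unaffected by the extreme value
```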

3. Create new column names for the new data frame.

data_terror <- df %>%
  select(year, nkill.us, nkill, USpopulation, USdeathRate) %>%
  rename(Year = year, NumkilledUS = nkill.us, Numkilled = nkill, USpop = USpopulation, USDeath = USdeathRate)
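
Question 2 also asks for a subset of rows, while the pipeline above keeps all 46. The following is a minimal sketch of subsetting rows and columns and then renaming, written in base R on a toy data frame so that it is self-contained. The values and the year-2000 cutoff are arbitrary assumptions, and `toy` and `recent` are hypothetical names. In the pipeline above, a dplyr `filter()` step would play the same role.

```r
# Toy stand-in for df; the values are made up for illustration.
toy <- data.frame(
  year     = c(1998L, 1999L, 2000L, 2001L),
  nkill    = c(100L, 200L, 300L, 400L),
  nkill.us = c(1L, 2L, 3L, 4L)
)

# Keep rows from 2000 onward (an arbitrary cutoff) and two of the columns.
recent <- subset(toy, year >= 2000, select = c(year, nkill.us))

# Rename the columns of the new data frame.
names(recent) <- c("Year", "NumkilledUS")
recent
```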

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare$

summary(data_terror)  
##       Year       NumkilledUS        Numkilled         USpop       
##  Min.   :1970   Min.   :   6.00   Min.   :  171   Min.   :209486  
##  1st Qu.:1981   1st Qu.:  16.25   1st Qu.: 3646   1st Qu.:232313  
##  Median :1992   Median :  25.50   Median : 6715   Median :259218  
##  Mean   :1992   Mean   : 113.91   Mean   : 7803   Mean   :262592  
##  3rd Qu.:2004   3rd Qu.:  59.75   3rd Qu.: 9226   3rd Qu.:292900  
##  Max.   :2015   Max.   :2910.00   Max.   :43550   Max.   :321774  
##     USDeath     
##  Min.   :7.900  
##  1st Qu.:8.325  
##  Median :8.600  
##  Mean   :8.580  
##  3rd Qu.:8.800  
##  Max.   :9.500
mean(data_terror$NumkilledUS)
## [1] 113.913
median(data_terror$NumkilledUS)
## [1] 25.5
mean(data_terror$Numkilled)
## [1] 7802.543
median(data_terror$Numkilled)
## [1] 6715
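
The means and medians above match the ones computed on `df`, as expected: selecting and renaming columns changes labels, not values. The comparison can also be made programmatically. This is a small sketch on made-up data, where `original` and `renamed` are hypothetical names.

```r
# Toy stand-in for df (made-up values).
original <- data.frame(year = 1970:1974, nkill = c(10, 20, 30, 40, 500))

# Select and rename, mirroring the data_terror step.
renamed <- original[, c("year", "nkill")]
names(renamed) <- c("Year", "Numkilled")

# Renaming changes labels only, so the statistics are identical.
identical(mean(original$nkill), mean(renamed$Numkilled))       # TRUE
identical(median(original$nkill), median(renamed$Numkilled))   # TRUE
```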
ggplot(data_terror, aes(Year, Numkilled)) + 
  geom_point() +
  geom_smooth() +
  ggtitle("Number killed worldwide")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data_terror, aes(Year, NumkilledUS)) + 
  geom_point() +
  ggtitle("Number killed in the US")

ggplot(data_terror, aes(Year)) +
  geom_line(aes(y = NumkilledUS), color="blue") +
  geom_line(aes(y = Numkilled), color="grey") +
  xlab("Year") +
  ylab("Number killed") +
  ggtitle("Number killed worldwide vs. USA")

ggplot(data_terror, aes(x = cut(Year, breaks = 9), y = Numkilled)) + geom_boxplot()
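
The boxplot above groups years with `cut(Year, breaks = 9)`, which splits the range of years into nine equal-width intervals. The sketch below shows what that call produces for 1970 to 2015. As one possible alternative, it also shows explicit decade boundaries with readable labels. The decade grouping is an assumption for illustration and is not what the plot above uses.

```r
years <- 1970:2015

# breaks = 9 splits the range into nine equal-width intervals.
bins <- cut(years, breaks = 9)
nlevels(bins)   # 9
table(bins)     # number of years falling in each interval

# Alternative: explicit decade boundaries with readable labels.
decade <- cut(years,
              breaks = seq(1970, 2020, by = 10),
              right  = FALSE,
              labels = paste0(seq(1970, 2010, by = 10), "s"))
table(decade)
```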

5. For at least 3 values in a column, rename them so that every occurrence of each value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 show as “excellent”.

i <- 1
for (x in data_terror$Year){
  if (x == 1970){
    data_terror$Year[i] <- "Nineteen seventy"
  }else if (x == 1971){
    data_terror$Year[i] <- "Nineteen seventy one"
  }else if (x == 1977){
    data_terror$Year[i] <- "Nineteen seventy seven"
  }
  i <- i + 1
}
data_terror
## # A tibble: 46 x 5
##    Year                   NumkilledUS Numkilled   USpop USDeath
##    <chr>                        <int>     <int>   <dbl>   <dbl>
##  1 Nineteen seventy                28       171 209486.     9.5
##  2 Nineteen seventy one            15       173 211358.     9.3
##  3 1972                            12       566 213220.     9.4
##  4 1973                            73       370 215093.     9.3
##  5 1974                            17       542 217002.     9.1
##  6 1975                            22       617 218964.     8.8
##  7 1976                             6       672 220993.     8.8
##  8 Nineteen seventy seven           7       456 223091.     8.6
##  9 1978                            11      1459 225239.     8.7
## 10 1979                            16      2100 227412.     8.5
## # ... with 36 more rows
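
The loop above works, and the `<chr>` type in the output shows that `Year` became a character column once text was assigned into it. The same recoding can also be done without a loop. This is a minimal sketch using a named lookup vector in base R. It is an alternative technique, not the method used above, and `year`, `lookup`, `key`, and `recoded` are hypothetical names.

```r
# A few years from the table above.
year <- c(1970L, 1971L, 1972L, 1977L, 1978L)

# Lookup table: names are the old values, elements are the new labels.
lookup <- c("1970" = "Nineteen seventy",
            "1971" = "Nineteen seventy one",
            "1977" = "Nineteen seventy seven")

# Replace values found in the lookup; leave all others as they are.
key <- as.character(year)
recoded <- ifelse(key %in% names(lookup), unname(lookup[key]), key)
recoded
```

Adding another value to rename then only requires a new entry in `lookup`, rather than another `else if` branch.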

6. Display enough rows to see examples of all of steps 1-5 above.
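
A tibble prints only its first ten rows by default, as in the output for step 5. To display more, one option is `head()` with an explicit `n`. Another is to select only the rows of interest. The sketch below uses a toy data frame (`toy`, `first_rows`, and `renamed_rows` are hypothetical names) rather than `data_terror` itself.

```r
# Toy stand-in with one row per year, 1970 to 2015.
toy <- data.frame(Year = 1970:2015, value = seq_len(46))

# Show the first 15 rows instead of the default.
first_rows <- head(toy, n = 15)
first_rows

# Or show only the rows for the years of interest.
renamed_rows <- toy[toy$Year %in% c(1970, 1971, 1977), ]
renamed_rows
```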