Section 6.2

1. Load the data.table package and the murders dataset, convert it to a data.table object, and add the population_in_millions and rate columns:
library(data.table)
library(dslabs)
data("murders")
murders <- setDT(murders)  # convert the data frame to a data.table
murders[, population_in_millions := population / 10^6]
murders[, rate := total / population * 100000]  # murders per 100,000 people
head(murders)
##         state abb region population total population_in_millions     rate
## 1:    Alabama  AL  South    4779736   135               4.779736 2.824424
## 2:     Alaska  AK   West     710231    19               0.710231 2.675186
## 3:    Arizona  AZ   West    6392017   232               6.392017 3.629527
## 4:   Arkansas  AR  South    2915918    93               2.915918 3.189390
## 5: California  CA   West   37253956  1257              37.253956 3.374138
## 6:   Colorado  CO   West    5029196    65               5.029196 1.292453
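As a minimal sketch of what step 1 relies on, the following uses a small hand-built data frame to show that `setDT` converts an object in place and that `:=` adds a column without reassignment. The object name `toy` is hypothetical; its values are copied from the first three rows of the output above.

```r
library(data.table)

# Hypothetical toy data frame; values copied from the first three rows above
toy <- data.frame(state = c("Alabama", "Alaska", "Arizona"),
                  population = c(4779736, 710231, 6392017),
                  total = c(135, 19, 232))

setDT(toy)                                 # converts in place; no assignment needed
toy[, rate := total / population * 10^5]   # adds the column by reference

is.data.table(toy)
names(toy)
```

Because both operations work by reference, the `murders <- setDT(murders)` assignment in the solution is harmless but not required.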
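2. Add a column rank containing the rank, from highest to lowest murder rate.
# -rank(rate) stores negated ranks: -51 marks the highest rate, -1 the lowest.
# rank(-rate) would instead give positive ranks, with 1 for the highest rate.
murders[, rank := -rank(rate)]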
head(murders)
##         state abb region population total population_in_millions     rate rank
## 1:    Alabama  AL  South    4779736   135               4.779736 2.824424  -29
## 2:     Alaska  AK   West     710231    19               0.710231 2.675186  -25
## 3:    Arizona  AZ   West    6392017   232               6.392017 3.629527  -42
## 4:   Arkansas  AR  South    2915918    93               2.915918 3.189390  -35
## 5: California  CA   West   37253956  1257              37.253956 3.374138  -38
## 6:   Colorado  CO   West    5029196    65               5.029196 1.292453  -14
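The solution above stores `-rank(rate)`, which orders states correctly but yields negative values. The sketch below compares the three variants on a hypothetical toy vector (the name `rate` and its values are illustrative only, not taken from the dataset).

```r
# Hypothetical toy vector of rates
rate <- c(2.8, 2.7, 3.6, 1.3)

asc     <- rank(rate)    # 3 2 4 1: 1 marks the lowest rate
negated <- -rank(rate)   # -3 -2 -4 -1: the most negative value marks the highest rate
desc    <- rank(-rate)   # 2 3 1 4: 1 marks the highest rate

asc
negated
desc
```

Either `negated` or `desc` can be used for filtering; only the comparison changes (for example, `<= -47` on negated ranks of 51 states corresponds to `<= 5` on descending ranks).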

3. If we want to only show the states and population sizes, we can use:

murders[, .(state, population)] 
##                    state population
##  1:              Alabama    4779736
##  2:               Alaska     710231
##  3:              Arizona    6392017
##  4:             Arkansas    2915918
##  5:           California   37253956
##  6:             Colorado    5029196
##  7:          Connecticut    3574097
##  8:             Delaware     897934
##  9: District of Columbia     601723
## 10:              Florida   19687653
## 11:              Georgia    9920000
## 12:               Hawaii    1360301
## 13:                Idaho    1567582
## 14:             Illinois   12830632
## 15:              Indiana    6483802
## 16:                 Iowa    3046355
## 17:               Kansas    2853118
## 18:             Kentucky    4339367
## 19:            Louisiana    4533372
## 20:                Maine    1328361
## 21:             Maryland    5773552
## 22:        Massachusetts    6547629
## 23:             Michigan    9883640
## 24:            Minnesota    5303925
## 25:          Mississippi    2967297
## 26:             Missouri    5988927
## 27:              Montana     989415
## 28:             Nebraska    1826341
## 29:               Nevada    2700551
## 30:        New Hampshire    1316470
## 31:           New Jersey    8791894
## 32:           New Mexico    2059179
## 33:             New York   19378102
## 34:       North Carolina    9535483
## 35:         North Dakota     672591
## 36:                 Ohio   11536504
## 37:             Oklahoma    3751351
## 38:               Oregon    3831074
## 39:         Pennsylvania   12702379
## 40:         Rhode Island    1052567
## 41:       South Carolina    4625364
## 42:         South Dakota     814180
## 43:            Tennessee    6346105
## 44:                Texas   25145561
## 45:                 Utah    2763885
## 46:              Vermont     625741
## 47:             Virginia    8001024
## 48:           Washington    6724540
## 49:        West Virginia    1852994
## 50:            Wisconsin    5686986
## 51:              Wyoming     563626
##                    state population

Show the state names and abbreviations in murders.

murders[, .(state, abb)]
##                    state abb
##  1:              Alabama  AL
##  2:               Alaska  AK
##  3:              Arizona  AZ
##  4:             Arkansas  AR
##  5:           California  CA
##  6:             Colorado  CO
##  7:          Connecticut  CT
##  8:             Delaware  DE
##  9: District of Columbia  DC
## 10:              Florida  FL
## 11:              Georgia  GA
## 12:               Hawaii  HI
## 13:                Idaho  ID
## 14:             Illinois  IL
## 15:              Indiana  IN
## 16:                 Iowa  IA
## 17:               Kansas  KS
## 18:             Kentucky  KY
## 19:            Louisiana  LA
## 20:                Maine  ME
## 21:             Maryland  MD
## 22:        Massachusetts  MA
## 23:             Michigan  MI
## 24:            Minnesota  MN
## 25:          Mississippi  MS
## 26:             Missouri  MO
## 27:              Montana  MT
## 28:             Nebraska  NE
## 29:               Nevada  NV
## 30:        New Hampshire  NH
## 31:           New Jersey  NJ
## 32:           New Mexico  NM
## 33:             New York  NY
## 34:       North Carolina  NC
## 35:         North Dakota  ND
## 36:                 Ohio  OH
## 37:             Oklahoma  OK
## 38:               Oregon  OR
## 39:         Pennsylvania  PA
## 40:         Rhode Island  RI
## 41:       South Carolina  SC
## 42:         South Dakota  SD
## 43:            Tennessee  TN
## 44:                Texas  TX
## 45:                 Utah  UT
## 46:              Vermont  VT
## 47:             Virginia  VA
## 48:           Washington  WA
## 49:        West Virginia  WV
## 50:            Wisconsin  WI
## 51:              Wyoming  WY
##                    state abb
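Column selection with `.( )` is one of several equivalent forms. The sketch below, on a hypothetical two-row table named `toy`, shows the same selection written with a character vector and with column names held in a variable (`cols` is an illustrative name).

```r
library(data.table)

# Hypothetical toy table; values copied from the first two rows of murders
toy <- data.table(state  = c("Alabama", "Alaska"),
                  abb    = c("AL", "AK"),
                  region = c("South", "West"))

by_list <- toy[, .(state, abb)]        # list of unquoted column names
by_chr  <- toy[, c("state", "abb")]    # character vector of names

cols   <- c("state", "abb")            # names held in a variable need the .. prefix
by_var <- toy[, ..cols]

by_var
```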

4. You can show just the New York row like this:

murders[state == "New York"]
##       state abb    region population total population_in_millions    rate rank
## 1: New York  NY Northeast   19378102   517                19.3781 2.66796  -23

You can use other logical vectors to filter rows.

Show the top 5 states with the highest murder rates. Having added the murder rate and rank, do not change the murders dataset; just show the result. Remember that you can filter based on the rank column.

# rank holds negated ranks, so the five highest rates are -51 through -47
murders[rank <= -47]
##                   state abb        region population total
## 1: District of Columbia  DC         South     601723    99
## 2:            Louisiana  LA         South    4533372   351
## 3:             Maryland  MD         South    5773552   293
## 4:             Missouri  MO North Central    5988927   321
## 5:       South Carolina  SC         South    4625364   207
##    population_in_millions      rate rank
## 1:               0.601723 16.452753  -51
## 2:               4.533372  7.742581  -50
## 3:               5.773552  5.074866  -48
## 4:               5.988927  5.359892  -49
## 5:               4.625364  4.475323  -47
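An alternative that does not depend on the rank column is to sort by rate and keep the first five rows. This is a sketch on a hypothetical seven-row table named `toy`, with rates copied from outputs shown in this section; it is not the exercise's intended solution.

```r
library(data.table)

# Hypothetical subset; rates copied from outputs shown in this section
toy <- data.table(
  state = c("New York", "Maryland", "Vermont", "District of Columbia",
            "South Carolina", "Louisiana", "Missouri"),
  rate  = c(2.66796, 5.074866, 0.3196211, 16.452753,
            4.475323, 7.742581, 5.359892))

# Sort by decreasing rate and keep the first five rows; toy itself is unchanged
top5 <- head(toy[order(-rate)], 5)
top5$state
```

Unlike the rank filter, this returns the rows already sorted from highest to lowest rate.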

5. We can remove rows using the != operator. For example, to remove Florida, we would do this:

no_florida <- murders[state != "Florida"]

Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.

no_south <- murders[region != "South"]
nrow(no_south)
## [1] 34
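When only the count is needed, data.table's `.N` symbol can return it without creating an intermediate table. A minimal sketch on a hypothetical six-row table named `toy` (regions copied from the first six rows of murders):

```r
library(data.table)

# Hypothetical toy table; regions copied from the first six rows of murders
toy <- data.table(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West"))

n_rows <- nrow(toy[region != "South"])   # filter first, then count rows
n_dotN <- toy[region != "South", .N]     # count inside the same call

c(n_rows, n_dotN)
```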

6. We can also use %in% to filter. You can therefore see the data from New York and Texas as follows:

murders[state %in% c("New York", "Texas")]
##       state abb    region population total population_in_millions    rate rank
## 1: New York  NY Northeast   19378102   517               19.37810 2.66796  -23
## 2:    Texas  TX     South   25145561   805               25.14556 3.20136  -36

Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?

murders_nw <- murders[region %in% c("Northeast", "West")]
nrow(murders_nw)
## [1] 22
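To make clear what `%in%` does here, the sketch below checks that it selects the same rows as two equality tests joined with `|`. The table `toy` is hypothetical; its regions are copied from rows shown elsewhere in this section.

```r
library(data.table)

# Hypothetical toy table; regions copied from outputs in this section
toy <- data.table(
  state  = c("Alabama", "Alaska", "Connecticut", "Missouri"),
  region = c("South", "West", "Northeast", "North Central"))

with_in <- toy[region %in% c("Northeast", "West")]
with_or <- toy[region == "Northeast" | region == "West"]

with_in
```

The `%in%` form scales better as the list of accepted values grows.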

7. Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these conditions. Note that you can combine conditions with logical operators when filtering rows. Here is an example in which we keep only small states in the Northeast region.

murders[population < 5000000 & region == "Northeast"]
##            state abb    region population total population_in_millions
## 1:   Connecticut  CT Northeast    3574097    97               3.574097
## 2:         Maine  ME Northeast    1328361    11               1.328361
## 3: New Hampshire  NH Northeast    1316470     5               1.316470
## 4:  Rhode Island  RI Northeast    1052567    16               1.052567
## 5:       Vermont  VT Northeast     625741     2               0.625741
##         rate rank
## 1: 2.7139722  -27
## 2: 0.8280881   -8
## 3: 0.3798036   -2
## 4: 1.5200933  -17
## 5: 0.3196211   -1

Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both conditions: they are in the Northeast or West, and the murder rate is less than 1. Show only the state name, the rate, and the rank.

my_states <- murders[rate < 1 & region %in% c("Northeast", "West")]
my_states[, .(state, rate, rank)]
##            state      rate rank
## 1:        Hawaii 0.5145920   -3
## 2:         Idaho 0.7655102   -6
## 3:         Maine 0.8280881   -8
## 4: New Hampshire 0.3798036   -2
## 5:        Oregon 0.9396843  -10
## 6:          Utah 0.7959810   -7
## 7:       Vermont 0.3196211   -1
## 8:       Wyoming 0.8871131   -9
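The filter and the column selection can also be combined in a single call, since data.table accepts the row condition and the column list together. A sketch on a hypothetical four-row table named `toy`, with values copied from outputs in this section:

```r
library(data.table)

# Hypothetical toy table; values copied from outputs in this section
toy <- data.table(
  state  = c("Hawaii", "Vermont", "New York", "Texas"),
  region = c("West", "Northeast", "Northeast", "South"),
  rate   = c(0.5145920, 0.3196211, 2.66796, 3.20136),
  rank   = c(-3, -1, -23, -36))

# Row condition first, column list second
one_step <- toy[rate < 1 & region %in% c("Northeast", "West"),
                .(state, rate, rank)]
one_step
```

Keeping `my_states` as a separate object, as the solution does, is still useful when the filtered rows are needed again later.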