Section 6.2
1. Load the data.table package and the murders dataset, and convert murders to a data.table object:
library(data.table)
library(dslabs)
data("murders")
murders <- setDT(murders)
murders[, population_in_millions := population / 10^6]
murders[, rate := total / population * 100000]
head(murders)
## state abb region population total population_in_millions rate
## 1: Alabama AL South 4779736 135 4.779736 2.824424
## 2: Alaska AK West 710231 19 0.710231 2.675186
## 3: Arizona AZ West 6392017 232 6.392017 3.629527
## 4: Arkansas AR South 2915918 93 2.915918 3.189390
## 5: California CA West 37253956 1257 37.253956 3.374138
## 6: Colorado CO West 5029196 65 5.029196 1.292453
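The conversion and the := assignment above both work by reference, which may not be obvious from the transcript. The following is a minimal sketch of that behavior, assuming a small stand-in table named toy (a hypothetical name) built from the first three rows printed above, so that it runs without the dslabs package.

```r
library(data.table)

# Toy data frame; the figures are copied from the first three rows printed above.
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  population = c(4779736, 710231, 6392017),
  total = c(135, 19, 232)
)

setDT(toy)                                  # converts toy in place, no assignment needed
toy[, rate := total / population * 100000]  # adds rate by reference, without copying toy

is_dt <- is.data.table(toy)
print(toy)
```

Because setDT modifies its argument in place, the assignment in murders <- setDT(murders) is harmless but not strictly required.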
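2. Add a column rank containing the rank, from highest to lowest murder rate. The solution below stores negated ranks, -rank(rate), so the state with the highest rate has rank -51 and the state with the lowest has rank -1; rank(-rate) would give the conventional positive ranks 1 to 51 instead.
murders[, rank := -rank(rate)]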
head(murders)
## state abb region population total population_in_millions rate rank
## 1: Alabama AL South 4779736 135 4.779736 2.824424 -29
## 2: Alaska AK West 710231 19 0.710231 2.675186 -25
## 3: Arizona AZ West 6392017 232 6.392017 3.629527 -42
## 4: Arkansas AR South 2915918 93 2.915918 3.189390 -35
## 5: California CA West 37253956 1257 37.253956 3.374138 -38
## 6: Colorado CO West 5029196 65 5.029196 1.292453 -14
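The difference between -rank(x) and rank(-x) is easy to miss, so here is a small base R sketch comparing the two. It assumes only the first three rates printed above; the variable names ascending, negated, and descending are illustrative.

```r
# First three rates printed above (Alabama, Alaska, Arizona).
rate <- c(2.824424, 2.675186, 3.629527)

ascending  <- rank(rate)    # 1 = lowest rate
negated    <- -rank(rate)   # the form stored in the rank column above
descending <- rank(-rate)   # 1 = highest rate

print(data.frame(rate, ascending, negated, descending))
```

When there are no ties, the two forms are related by descending = n + 1 + negated, where n is the number of rows, so either can be recovered from the other.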
3. If we want to only show the states and population sizes, we can use:
murders[, .(state, population)]
## state population
## 1: Alabama 4779736
## 2: Alaska 710231
## 3: Arizona 6392017
## 4: Arkansas 2915918
## 5: California 37253956
## 6: Colorado 5029196
## 7: Connecticut 3574097
## 8: Delaware 897934
## 9: District of Columbia 601723
## 10: Florida 19687653
## 11: Georgia 9920000
## 12: Hawaii 1360301
## 13: Idaho 1567582
## 14: Illinois 12830632
## 15: Indiana 6483802
## 16: Iowa 3046355
## 17: Kansas 2853118
## 18: Kentucky 4339367
## 19: Louisiana 4533372
## 20: Maine 1328361
## 21: Maryland 5773552
## 22: Massachusetts 6547629
## 23: Michigan 9883640
## 24: Minnesota 5303925
## 25: Mississippi 2967297
## 26: Missouri 5988927
## 27: Montana 989415
## 28: Nebraska 1826341
## 29: Nevada 2700551
## 30: New Hampshire 1316470
## 31: New Jersey 8791894
## 32: New Mexico 2059179
## 33: New York 19378102
## 34: North Carolina 9535483
## 35: North Dakota 672591
## 36: Ohio 11536504
## 37: Oklahoma 3751351
## 38: Oregon 3831074
## 39: Pennsylvania 12702379
## 40: Rhode Island 1052567
## 41: South Carolina 4625364
## 42: South Dakota 814180
## 43: Tennessee 6346105
## 44: Texas 25145561
## 45: Utah 2763885
## 46: Vermont 625741
## 47: Virginia 8001024
## 48: Washington 6724540
## 49: West Virginia 1852994
## 50: Wisconsin 5686986
## 51: Wyoming 563626
## state population
Show the state names and abbreviations in murders.
murders[, .(state, abb)]
## state abb
## 1: Alabama AL
## 2: Alaska AK
## 3: Arizona AZ
## 4: Arkansas AR
## 5: California CA
## 6: Colorado CO
## 7: Connecticut CT
## 8: Delaware DE
## 9: District of Columbia DC
## 10: Florida FL
## 11: Georgia GA
## 12: Hawaii HI
## 13: Idaho ID
## 14: Illinois IL
## 15: Indiana IN
## 16: Iowa IA
## 17: Kansas KS
## 18: Kentucky KY
## 19: Louisiana LA
## 20: Maine ME
## 21: Maryland MD
## 22: Massachusetts MA
## 23: Michigan MI
## 24: Minnesota MN
## 25: Mississippi MS
## 26: Missouri MO
## 27: Montana MT
## 28: Nebraska NE
## 29: Nevada NV
## 30: New Hampshire NH
## 31: New Jersey NJ
## 32: New Mexico NM
## 33: New York NY
## 34: North Carolina NC
## 35: North Dakota ND
## 36: Ohio OH
## 37: Oklahoma OK
## 38: Oregon OR
## 39: Pennsylvania PA
## 40: Rhode Island RI
## 41: South Carolina SC
## 42: South Dakota SD
## 43: Tennessee TN
## 44: Texas TX
## 45: Utah UT
## 46: Vermont VT
## 47: Virginia VA
## 48: Washington WA
## 49: West Virginia WV
## 50: Wisconsin WI
## 51: Wyoming WY
## state abb
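Column selection with .() can be written in a few equivalent ways. The sketch below is one illustration, assuming a three-row stand-in table named toy and a helper vector named cols (both hypothetical names), with values taken from the output above.

```r
library(data.table)

toy <- data.table(
  state = c("Alabama", "Alaska", "Arizona"),
  abb = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017)
)

by_dot  <- toy[, .(state, abb)]     # .() is shorthand for list()
by_list <- toy[, list(state, abb)]  # same selection, spelled out

cols <- c("state", "abb")           # hypothetical vector of column names
by_name <- toy[, ..cols]            # the .. prefix looks up cols outside the table

print(by_dot)
```

The ..cols form is convenient when the column names are only known at run time.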
4. You can show just the New York row like this:
murders[state == "New York"]
## state abb region population total population_in_millions rate rank
## 1: New York NY Northeast 19378102 517 19.3781 2.66796 -23
You can use other logical vectors to filter rows.
Show the top 5 states with the highest murder rates. After we add
murder rate and rank, do not change the murders dataset, just show the
result. Remember that you can filter based on the rank
column. Because rank was stored as -rank(rate), the five highest rates
have ranks -51 through -47.
murders[rank <= -47]
## state abb region population total
## 1: District of Columbia DC South 601723 99
## 2: Louisiana LA South 4533372 351
## 3: Maryland MD South 5773552 293
## 4: Missouri MO North Central 5988927 321
## 5: South Carolina SC South 4625364 207
## population_in_millions rate rank
## 1: 0.601723 16.452753 -51
## 2: 4.533372 7.742581 -50
## 3: 5.773552 5.074866 -48
## 4: 5.988927 5.359892 -49
## 5: 4.625364 4.475323 -47
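An alternative that does not depend on the rank column is to sort by rate and keep the first rows. This is only a sketch of that idea, assuming a six-row stand-in table named toy (hypothetical) with the rates printed earlier, and keeping the top 2 rather than the top 5 because the toy table is small.

```r
library(data.table)

# Rates copied from the first six rows printed earlier.
toy <- data.table(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  rate = c(2.824424, 2.675186, 3.629527, 3.189390, 3.374138, 1.292453)
)

# order(-rate) sorts from highest to lowest; the result is a new table,
# so toy itself keeps its original row order.
top2 <- head(toy[order(-rate)], 2)
print(top2)
```

Like the rank filter, this only shows the result and leaves the original table unchanged.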
5. We can remove rows using the != operator. For
example, to remove Florida, we would do this:
no_florida <- murders[state != "Florida"]
Create a new data frame called no_south that removes
states from the South region. How many states are in this category? You
can use the function nrow for this.
no_south <- murders[region != "South"]
nrow(no_south)
## [1] 34
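If only the count is needed, data.table's .N symbol can return it without creating an intermediate table. A minimal sketch, assuming a six-row stand-in table named toy (hypothetical) whose regions match the first six rows printed earlier:

```r
library(data.table)

toy <- data.table(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West")
)

n_via_nrow <- nrow(toy[region != "South"])  # filter first, then count rows
n_via_N    <- toy[region != "South", .N]    # .N is the row count after filtering

print(c(n_via_nrow, n_via_N))
```

Both approaches give the same number; the nrow version is the one the exercise asks for.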
6. We can also use %in% to filter. You can therefore see
the data from New York and Texas as follows:
murders[state %in% c("New York", "Texas")]
## state abb region population total population_in_millions rate rank
## 1: New York NY Northeast 19378102 517 19.37810 2.66796 -23
## 2: Texas TX South 25145561 805 25.14556 3.20136 -36
Create a new data frame called murders_nw with only the
states from the Northeast and the West. How many states are in this
category?
murders_nw <- murders[region %in% c("Northeast", "West")]
nrow(murders_nw)
## [1] 22
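The %in% filter is shorthand for a chain of == comparisons joined with |. The sketch below illustrates the equivalence, assuming a four-row stand-in table named toy (hypothetical) with states and regions taken from the output in this section.

```r
library(data.table)

toy <- data.table(
  state = c("Alabama", "Alaska", "Connecticut", "Maine"),
  region = c("South", "West", "Northeast", "Northeast")
)

with_in <- toy[region %in% c("Northeast", "West")]
with_or <- toy[region == "Northeast" | region == "West"]  # same rows, written out

print(with_in)
```

The %in% form tends to stay readable as the list of values grows.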
7. Suppose you want to live in the Northeast or
West and want the murder rate to be less than 1. We
want to see the data for the states satisfying these conditions. Note that
you can combine logical operators when filtering rows. Here is an
example in which we keep only small states in the Northeast
region.
murders[population < 5000000 & region == "Northeast"]
## state abb region population total population_in_millions
## 1: Connecticut CT Northeast 3574097 97 3.574097
## 2: Maine ME Northeast 1328361 11 1.328361
## 3: New Hampshire NH Northeast 1316470 5 1.316470
## 4: Rhode Island RI Northeast 1052567 16 1.052567
## 5: Vermont VT Northeast 625741 2 0.625741
## rate rank
## 1: 2.7139722 -27
## 2: 0.8280881 -8
## 3: 0.3798036 -2
## 4: 1.5200933 -17
## 5: 0.3196211 -1
Make sure murders has been defined
with rate and rank and still has all states.
Create a table called my_states that contains rows for
states satisfying both conditions: they are in the Northeast or West
and the murder rate is less than 1. Show only the state name, the rate,
and the rank.
my_states <- murders[rate < 1 & region %in% c("Northeast", "West")]
my_states[, .(state, rate, rank)]
## state rate rank
## 1: Hawaii 0.5145920 -3
## 2: Idaho 0.7655102 -6
## 3: Maine 0.8280881 -8
## 4: New Hampshire 0.3798036 -2
## 5: Oregon 0.9396843 -10
## 6: Utah 0.7959810 -7
## 7: Vermont 0.3196211 -1
## 8: Wyoming 0.8871131 -9
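The row filter and the column selection can also be combined in a single call when the intermediate table is not needed. A minimal sketch, assuming a five-row stand-in table named toy and a result named my_toy_states (both hypothetical), with rates copied from the output above:

```r
library(data.table)

toy <- data.table(
  state = c("Alabama", "Colorado", "Connecticut", "Maine", "Vermont"),
  region = c("South", "West", "Northeast", "Northeast", "Northeast"),
  rate = c(2.824424, 1.292453, 2.7139722, 0.8280881, 0.3196211)
)

# The first argument filters rows, the second selects columns.
my_toy_states <- toy[rate < 1 & region %in% c("Northeast", "West"), .(state, rate)]
print(my_toy_states)
```

Keeping my_states as a separate table, as in the solution above, is still useful when the filtered rows are reused later.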