September 22, 2016
All source code at https://github.com/rladies/rtp-data-manipulation-09-22-16
R: Statistical programming language
Both are free and open-source
Install R: https://cran.r-project.org/
Install RStudio: https://www.rstudio.com/products/RStudio/#Desktop
R Splash Screen
The version of R is text that pops up in the Console when you start RStudio
To find out the version of RStudio go to Help \(\rightarrow\) About RStudio
It's good practice to keep both R and RStudio up to date
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)
Install these packages by running the following in the Console:
install.packages("readr")
install.packages("tidyr")
install.packages("dplyr")
library(readr) library(tidyr) library(dplyr)
Do not just type code in the Console, it's error prone (especially typo prone!) and hard to keep track of
R Markdown: File -> New File -> R Markdown
Type your code in the R Script
Use curser + Run or highlight + Run
Shortcut for Run button: Command + Enter
Use # for comments
[ Demo ]
R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.
R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).
Code goes in code chunks
Comments can go in the chunks wih # or just as plain text outside the chunks
Source: http://rmarkdown.rstudio.com/
bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv",
na = c("NA", "", "."))
View the names of variables via
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.
You might be familiar with the read.csv() function for loading data from a csv file into R.
If using the read.csv() function R will convert character vectors into factors when they are included in a data frame.
Sometimes this is useful, sometimes it isn't -– either way it is important to know what type/class you are working with.
This behavior in read.csv can be changed using the stringsAsFactors = FALSE option when loading a data drame.
In the Environment, click on the name of the data frame to view it in the data viewer
Use the str() function to compactly display the internal structure of an R object
str(bike)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5716 obs. of 54 variables: ## $ FID : int 18 29 33 35 49 53 56 60 63 66 ... ## $ OBJECTID : int 19 30 34 36 50 54 57 61 64 67 ... ## $ AmbulanceR: chr "No" "Yes" "No" "Yes" ... ## $ BikeAge_Gr: chr NA "50-59" NA "16-19" ... ## $ Bike_Age : int 6 51 10 17 6 52 18 40 6 7 ... ## $ Bike_Alc_D: chr "No" "No" "No" "No" ... ## $ Bike_Dir : chr "Not Applicable" "With Traffic" "With Traffic" NA ... ## $ Bike_Injur: chr "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ... ## $ Bike_Pos : chr "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ... ## $ Bike_Race : chr "Black" "Black" "Black" "White" ... ## $ Bike_Sex : chr "Female" "Male" "Male" "Male" ... ## $ City : chr "Durham" "Greenville" "Farmville" "Charlotte" ... ## $ County : chr "Durham" "Pitt" "Pitt" "Mecklenburg" ... ## $ CrashAlcoh: chr "No" "No" "No" "No" ... ## $ CrashDay : chr "01-01-06" "01-01-02" "01-01-07" "01-01-05" ... ## $ Crash_Date: Date, format: "2007-01-06" "2007-01-09" ... ## $ Crash_Grp : chr "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ... ## $ Crash_Hour: int 13 23 16 19 12 20 19 14 16 0 ... ## $ Crash_Loc : chr "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ... ## $ Crash_Mont: chr NA NA NA NA ... ## $ Crash_Time: POSIXct, format: "0001-01-01 13:17:58" "0001-01-01 23:08:58" ... ## $ Crash_Type: chr "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ... ## $ Crash_Ty_1: int 353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ... ## $ Crash_Year: int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ... ## $ Crsh_Sevri: chr "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ... ## $ Developmen: chr "Residential" "Commercial" "Residential" "Residential" ... ## $ DrvrAge_Gr: chr "60-69" "30-39" "50-59" "30-39" ... ## $ Drvr_Age : int 66 34 52 33 NA 20 40 NA 17 51 ... ## $ Drvr_Alc_D: chr "No" "No" "No" "No" ... ## $ Drvr_EstSp: chr "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ... ## $ Drvr_Injur: chr "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ... ## $ Drvr_Race : chr "Black" "Black" "White" "White" ... ## $ Drvr_Sex : chr "Male" "Male" "Female" "Female" ... ## $ Drvr_VehTy: chr "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ... ## $ ExcsSpdInd: chr "No" "No" "No" "No" ... ## $ Hit_Run : chr "No" "No" "No" "No" ... ## $ Light_Cond: chr "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ... ## $ Locality : chr "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ... ## $ Num_Lanes : chr "2 lanes" "5 lanes" "2 lanes" "4 lanes" ... ## $ Num_Units : int 2 2 2 3 2 2 2 2 2 2 ... ## $ Rd_Charact: chr "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ... ## $ Rd_Class : chr "Local Street" "Local Street" "Local Street" "NC Route" ... ## $ Rd_Conditi: chr "Dry" "Dry" "Dry" "Dry" ... ## $ Rd_Config : chr "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ... ## $ Rd_Defects: chr "None" "None" "None" "None" ... ## $ Rd_Feature: chr "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ... ## $ Rd_Surface: chr "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ... ## $ Region : chr "Piedmont" "Coastal" "Coastal" "Piedmont" ... ## $ Rural_Urba: chr "Urban" "Urban" "Rural" "Urban" ... ## $ Speed_Limi: chr "20 - 25 MPH" "40 - 45 MPH" "30 - 35 MPH" "40 - 45 MPH" ... ## $ Traff_Cntr: chr "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ... ## $ Weather : chr "Clear" "Clear" "Clear" "Cloudy" ... ## $ Workzone_I: chr "No" "No" "No" "No" ... ## $ Location : chr "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ... ## - attr(*, "spec")=List of 2 ## ..$ cols :List of 54 ## .. ..$ FID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ OBJECTID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ AmbulanceR: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ BikeAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Bike_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Dir : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Pos : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ City : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ County : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashAlcoh: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashDay : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Date:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_date" "collector" ## .. ..$ Crash_Grp : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Hour: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Loc : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Mont: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Time:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector" ## .. ..$ Crash_Type: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Ty_1: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Year: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crsh_Sevri: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Developmen: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ DrvrAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Drvr_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_EstSp: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_VehTy: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ ExcsSpdInd: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Hit_Run : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Light_Cond: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Locality : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Lanes : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Units : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Rd_Charact: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Class : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Conditi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Config : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Defects: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Feature: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Surface: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Region : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rural_Urba: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Speed_Limi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Traff_Cntr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Weather : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Workzone_I: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Location : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## ..$ default: list() ## .. ..- attr(*, "class")= chr "collector_guess" "collector" ## ..- attr(*, "class")= chr "col_spec"
Use the glimpse() function to see all variables and the data in them
glimpse(bike)
## Observations: 5,716 ## Variables: 54 ## $ FID <int> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82,... ## $ OBJECTID <int> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83,... ## $ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No",... ## $ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40... ## $ Bike_Age <int> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20,... ## $ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Bike_Dir <chr> "Not Applicable", "With Traffic", "With Traffic", N... ## $ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury... ## $ Bike_Pos <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "... ## $ Bike_Race <chr> "Black", "Black", "Black", "White", "Black", "White... ## $ Bike_Sex <chr> "Female", "Male", "Male", "Male", "Male", "Male", "... ## $ City <chr> "Durham", "Greenville", "Farmville", "Charlotte", "... ## $ County <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenbu... ## $ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ CrashDay <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA,... ## $ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 20... ## $ Crash_Grp <chr> "Bicyclist Failed to Yield - Midblock", "Crossing P... ## $ Crash_Hour <int> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, ... ## $ Crash_Loc <chr> "Non-Intersection", "Intersection-Related", "Inters... ## $ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01... ## $ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-... ## $ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Cross... ## $ Crash_Ty_1 <int> 353311, 211180, 111144, 119139, 112114, 311231, 119... ## $ Crash_Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200... ## $ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No ... ## $ Developmen <chr> "Residential", "Commercial", "Residential", "Reside... ## $ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "4... ## $ Drvr_Age <int> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50,... ## $ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Mis... ## $ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "... ## $ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O:... ## $ Drvr_Race <chr> "Black", "Black", "White", "White", "/Missing", "Wh... ## $ Drvr_Sex <chr> "Male", "Male", "Female", "Female", NA, "Female", "... ## $ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport ... ## $ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Hit_Run <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "... ## $ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "... ## $ Locality <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Develo... ## $ Num_Lanes <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lane... ## $ Num_Units <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ... ## $ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight -... ## $ Rd_Class <chr> "Local Street", "Local Street", "Local Street", "NC... ## $ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "D... ## $ Rd_Config <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotec... ## $ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "No... ## $ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Fou... ## $ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt... ## $ Region <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Pied... ## $ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban... ## $ Speed_Limi <chr> "20 - 25 MPH", "40 - 45 MPH", "30 - 35 MPH", "40... ## $ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop S... ## $ Weather <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clea... ## $ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Location <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35....
tidyr package is helpful for converting messy data to tidy data
We'll start with tidy data, and then move on to tidying up messy data
Using base R functions
Using the tidyr and dplyr packages \(\leftarrow\) our focus today
Using a variety of other packages from the tidyverse like plyr, lubridate, etc.
dplyrThe dplyr package is based on the concepts of functions as verbs that manipulate data frames:
filter(): pick rows matching criteriaselect(): pick columns by namerename(): rename specific columnsarrange(): reorder rowsmutate(): add new variablestransmute(): create new data frame with variablessample_n() / sample_frac(): randomly sample rowssummarise(): reduce variables to valuesdplyr rulesfilter()filter()for crashes in Durham County
bike %>% filter(County == "Durham")
## # A tibble: 253 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 53 54 Yes 50-59 52 No With Traffic ## 3 56 57 Yes 16-19 18 No <NA> ## 4 209 210 No 16-19 16 No Facing Traffic ## 5 228 229 Yes 40-49 40 No With Traffic ## 6 620 621 Yes 50-59 55 No With Traffic ## 7 667 668 Yes 60-69 61 No Not Applicable ## 8 458 459 Yes 60-69 62 No With Traffic ## 9 576 577 No 40-49 49 No With Traffic ## 10 618 619 No 20-24 23 No With Traffic ## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
filter()for crashes in Durham County where biker was < 10 yrs old
bike %>% filter(County == "Durham", Bike_Age < 10)
## # A tibble: 20 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 47 48 No 10-Jun 9 No Not Applicable ## 3 124 125 Yes 10-Jun 8 No With Traffic ## 4 531 532 Yes 10-Jun 7 No With Traffic ## 5 704 705 Yes 10-Jun 9 No Not Applicable ## 6 42 43 No 10-Jun 8 No With Traffic ## 7 392 393 Yes 0-5 2 No Not Applicable ## 8 941 942 No 10-Jun 9 No With Traffic ## 9 436 437 Yes 10-Jun 6 No Not Applicable ## 10 160 161 Yes 10-Jun 7 No With Traffic ## 11 273 274 Yes 10-Jun 7 No Facing Traffic ## 12 78 79 Yes 10-Jun 7 No With Traffic ## 13 422 423 No 10-Jun 9 No Not Applicable ## 14 570 571 No <NA> 0 Missing Not Applicable ## 15 683 684 Yes 10-Jun 8 No Not Applicable ## 16 62 63 Yes 10-Jun 7 No With Traffic ## 17 248 249 No 0-5 4 No Not Applicable ## 18 306 307 Yes 10-Jun 8 No With Traffic ## 19 231 232 Yes 10-Jun 8 No With Traffic ## 20 361 362 Yes 10-Jun 9 No With Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
| operator | definition |
|---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
x | y |
x OR y |
x & y |
x AND y |
| operator | definition |
|---|---|
is.na(x) |
test if x is NA |
!is.na(x) |
test if x is not NA |
x %in% y |
test if x is in y |
!(x %in% y) |
test if x is not in y |
!x |
not x |
What in the world does a BikeAge_gr of 10-Jun or 15-Nov mean?
bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n())
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 10-Jun 421 ## 3 15-Nov 747 ## 4 16-19 605 ## 5 20-24 680 ## 6 25-29 430 ## 7 30-39 658 ## 8 40-49 920 ## 9 50-59 739 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
10-Jun should be 6-1015-Nov should be 11-15stringrstringrinstall.packages("stringr")
library(stringr)
str_replace() and add new variables with mutate()BikeAge_Gr variable: 10-Jun should be 6-10 and 15-Nov should be 11-15bike <- bike %>% mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10")) %>% mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"))
Always check your changes and confirm code did what you wanted it to do
bike %>% group_by(BikeAge_Gr) %>% summarise(count = n())
## # A tibble: 13 × 2 ## BikeAge_Gr count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
You can also count using the count() function
bike %>% group_by(BikeAge_Gr) %>% count()
## # A tibble: 13 × 2 ## BikeAge_Gr n ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
We can use the same format for calculating other summary statistics
bike %>%
group_by(CrashAlcoh) %>%
summarise(mean_age = mean(Bike_Age, na.rm = TRUE),
median_age = median(Bike_Age, na.rm = TRUE))
## # A tibble: 2 × 3 ## CrashAlcoh mean_age median_age ## <chr> <dbl> <dbl> ## 1 No 29.90971 25 ## 2 Yes 39.41266 43
slice() for certain row numbersFirst five
bike %>% slice(1:5)
## # A tibble: 5 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 29 30 Yes 50-59 51 No With Traffic ## 3 33 34 No <NA> 10 No With Traffic ## 4 35 36 Yes 16-19 17 No <NA> ## 5 49 50 No <NA> 6 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
slice() for certain row numbersLast five
last_row <- nrow(bike) bike %>% slice((last_row-4):last_row)
## # A tibble: 5 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 460 461 Yes 6-10 7 No Not Applicable ## 2 474 475 Yes 50-59 50 No With Traffic ## 3 479 480 Yes 16-19 16 No Not Applicable ## 4 487 488 No 40-49 47 Yes With Traffic ## 5 488 489 Yes 30-39 35 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
select() to keep only the variables you mentionbike %>% select(Crash_Loc, Hit_Run) %>% table()
## Hit_Run ## Crash_Loc No Yes ## Intersection 2223 275 ## Intersection-Related 252 42 ## Location 3 7 ## Non-Intersection 2213 462 ## Non-Roadway 205 30
select()to exclude variablesbike %>% select(-OBJECTID)
## # A tibble: 5,716 × 53 ## FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <chr> <chr> <int> <chr> <chr> ## 1 18 No <NA> 6 No Not Applicable ## 2 29 Yes 50-59 51 No With Traffic ## 3 33 No <NA> 10 No With Traffic ## 4 35 Yes 16-19 17 No <NA> ## 5 49 No <NA> 6 No Facing Traffic ## 6 53 Yes 50-59 52 No With Traffic ## 7 56 Yes 16-19 18 No <NA> ## 8 60 No 40-49 40 No Facing Traffic ## 9 63 Yes 6-10 6 No Facing Traffic ## 10 66 Yes 6-10 7 No <NA> ## # ... with 5,706 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
rename() specific columnsCorrect typos and rename to make variable names shorter and/or more informative
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
Speed_Limi to Speed_Limit:bike <- bike %>% rename(Speed_Limit = Speed_Limi)
Always check your changes and confirm code did what you wanted it to do
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limit" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
summarise() in a new data framebike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(crash_count)
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 70 12 ## 2 70+ 58 ## 3 0-5 60 ## 4 <NA> 112 ## 5 60-69 274 ## 6 6-10 421 ## 7 25-29 430 ## 8 16-19 605 ## 9 30-39 658 ## 10 20-24 680 ## 11 50-59 739 ## 12 11-15 747 ## 13 40-49 920
arrange() to order rowsbike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(desc(crash_count))
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 40-49 920 ## 2 11-15 747 ## 3 50-59 739 ## 4 20-24 680 ## 5 30-39 658 ## 6 16-19 605 ## 7 25-29 430 ## 8 6-10 421 ## 9 60-69 274 ## 10 <NA> 112 ## 11 0-5 60 ## 12 70+ 58 ## 13 70 12
sample_n() or sample_frac()sample_n(): randomly sample 5 observationsbike_n5 <- bike %>% sample_n(5, replace = FALSE) dim(bike_n5)
## [1] 5 54
sample_frac(): randomly sample 20% of observationsbike_perc20 <-bike %>% sample_frac(0.2, replace = FALSE) dim(bike_perc20)
## [1] 1143 54
dplyr resourcesVisit https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html for the package vignette.
Refer to the dplyr cheatsheet.
For when not working with dplyr
Refer to a variable in a dataset as bike$Crash_Loc
Access any element in a dataframe using square brackets
bike[1,5] # row 1, column 5
## # A tibble: 1 × 1 ## Bike_Age ## <int> ## 1 6
- For all observations in row 1: `bike[1, ]` - For all observations in column 5: `bike[, 5]`
bike <- separate(bike, col = Location, into = c("Lat", "Lon"),
sep = ", ", remove = FALSE)
bike %>%
select(Location, Lat, Lon) %>%
head(5)
## # A tibble: 5 × 3 ## Location Lat Lon ## <chr> <chr> <chr> ## 1 36.002743, -78.8785 36.002743 -78.8785 ## 2 35.612984, -77.39265 35.612984 -77.39265 ## 3 35.595676, -77.59074 35.595676 -77.59074 ## 4 35.076767, -80.7728 35.076767 -80.7728 ## 5 35.19999, -80.75713 35.19999 -80.75713
bike <- unite(bike, col = ID, FID, OBJECTID, remove = FALSE) bike %>% select(FID, OBJECTID, ID) %>% head(5)
## # A tibble: 5 × 3 ## FID OBJECTID ID ## <int> <int> <chr> ## 1 18 19 18_19 ## 2 29 30 29_30 ## 3 33 34 33_34 ## 4 35 36 35_36 ## 5 49 50 49_50
Let's take a look at a new dataset:
hdi <- read_csv("https://stat.duke.edu/~mc301/data/hdi.csv")
gatherhdi_long <- gather(hdi, key = year, value = hd_index, hdi_1980:hdi_2011) head(hdi_long)
## # A tibble: 6 × 3 ## country year hd_index ## <chr> <chr> <dbl> ## 1 Abkhazia hdi_1980 NA ## 2 Afghanistan hdi_1980 0.198 ## 3 Akrotiri and Dhekelia hdi_1980 NA ## 4 Albania hdi_1980 NA ## 5 Algeria hdi_1980 0.454 ## 6 American Samoa hdi_1980 NA
hdi_long <- hdi_long %>% mutate(year = as.numeric(str_replace(year, "hdi_", ""))) head(hdi_long)
## # A tibble: 6 × 3 ## country year hd_index ## <chr> <dbl> <dbl> ## 1 Abkhazia 1980 NA ## 2 Afghanistan 1980 0.198 ## 3 Akrotiri and Dhekelia 1980 NA ## 4 Albania 1980 NA ## 5 Algeria 1980 0.454 ## 6 American Samoa 1980 NA
spreadhdi_wide <- spread(hdi_long, key = year, value = hd_index) head(hdi_wide)
## # A tibble: 6 × 10 ## country `1980` `1990` `2000` `2005` `2006` `2007` `2008` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Abkhazia NA NA NA NA NA NA NA ## 2 Afghanistan 0.198 0.246 0.230 0.340 0.354 0.363 0.370 ## 3 Akrotiri and Dhekelia NA NA NA NA NA NA NA ## 4 Albania NA 0.656 0.691 0.721 0.724 0.729 0.733 ## 5 Algeria 0.454 0.551 0.624 0.667 0.673 0.680 0.686 ## 6 American Samoa NA NA NA NA NA NA NA ## # ... with 2 more variables: `2009` <dbl>, `2011` <dbl>
Create a new dataframe that doesn't include observations where Bike_Injur = "Injury" since it's not clear what this means.
This new dataframe only should include observations in Durham County, and where the biker is a teenager (13 to 19 years, inclusive).
Then, calculate the number of accidents for each type of bike direction (Bike_Dir).
bike2 <- bike %>% filter(Bike_Injur != "Injury") %>% filter(County == "Durham") %>% filter(between(Bike_Age, 13, 19)) bike2 %>% group_by(Bike_Dir) %>% count()
## # A tibble: 4 × 2 ## Bike_Dir n ## <chr> <int> ## 1 Facing Traffic 9 ## 2 Not Applicable 6 ## 3 With Traffic 18 ## 4 <NA> 5