September 22, 2016

Welcome to R-Ladies RTP!

Outline

  • Introduction to R and RStudio
  • Documenting your code
  • Loading data
  • Data manipulation with dplyr
  • Basic R syntax
  • Tidying up messy data with tidyr
  • What next?
  • Hands on exercises

Materials

Introduction to R and RStudio

What is R and RStudio

  • R: Statistical programming language

  • RStudio:
    • Inregtrated development environment for R
    • Powerful and productive user interface for R
  • Both are free and open-source

Getting started

Anatomy of RStudio

  • Left: Console
    • Text on top at launch: version of R that you’re running
    • Below that is the prompt
  • Upper right: Workspace and command history
  • Lower right: Plots, access to files, help, packages, data viewer

R Splash Screen

What version am I using?

  • The version of R is text that pops up in the Console when you start RStudio

  • To find out the version of RStudio go to Help \(\rightarrow\) About RStudio

  • It's good practice to keep both R and RStudio up to date

R packages

  • Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)

  • Install these packages by running the following in the Console:

install.packages("readr")
install.packages("tidyr")
install.packages("dplyr")
  • Then, load the packages by running the following:
library(readr)
library(tidyr)
library(dplyr)
  • This is just one way of installing a package, there is also a GUI approach in the Packages pane in RStudio

Keeping track of your code

Two options: R Script or R Markdown

  • Do not just type code in the Console, it's error prone (especially typo prone!) and hard to keep track of

  • Use one of the two following options:
  • R Script: File -> New File -> R Script
  • R Markdown: File -> New File -> R Markdown

R Script

  • Type your code in the R Script

  • Use curser + Run or highlight + Run

  • Shortcut for Run button: Command + Enter

  • Use # for comments

[ Demo ]

R Markdown

  • R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.

  • R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).

  • Code goes in code chunks

  • Comments can go in the chunks wih # or just as plain text outside the chunks

Source: http://rmarkdown.rstudio.com/

Loading data

NC DOT Fatal Crashes in North Carolina

bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", 
                 na = c("NA", "", "."))

View the names of variables via

names(bike)
##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"

and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.

Other functions for loading data

  • You might be familiar with the read.csv() function for loading data from a csv file into R.

  • If using the read.csv() function R will convert character vectors into factors when they are included in a data frame.

  • Sometimes this is useful, sometimes it isn't -– either way it is important to know what type/class you are working with.

  • This behavior in read.csv can be changed using the stringsAsFactors = FALSE option when loading a data drame.

Viewing your data 1

In the Environment, click on the name of the data frame to view it in the data viewer

Viewing your data 2

Use the str() function to compactly display the internal structure of an R object

str(bike)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5716 obs. of  54 variables:
##  $ FID       : int  18 29 33 35 49 53 56 60 63 66 ...
##  $ OBJECTID  : int  19 30 34 36 50 54 57 61 64 67 ...
##  $ AmbulanceR: chr  "No" "Yes" "No" "Yes" ...
##  $ BikeAge_Gr: chr  NA "50-59" NA "16-19" ...
##  $ Bike_Age  : int  6 51 10 17 6 52 18 40 6 7 ...
##  $ Bike_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Bike_Dir  : chr  "Not Applicable" "With Traffic" "With Traffic" NA ...
##  $ Bike_Injur: chr  "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ...
##  $ Bike_Pos  : chr  "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ...
##  $ Bike_Race : chr  "Black" "Black" "Black" "White" ...
##  $ Bike_Sex  : chr  "Female" "Male" "Male" "Male" ...
##  $ City      : chr  "Durham" "Greenville" "Farmville" "Charlotte" ...
##  $ County    : chr  "Durham" "Pitt" "Pitt" "Mecklenburg" ...
##  $ CrashAlcoh: chr  "No" "No" "No" "No" ...
##  $ CrashDay  : chr  "01-01-06" "01-01-02" "01-01-07" "01-01-05" ...
##  $ Crash_Date: Date, format: "2007-01-06" "2007-01-09" ...
##  $ Crash_Grp : chr  "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ...
##  $ Crash_Hour: int  13 23 16 19 12 20 19 14 16 0 ...
##  $ Crash_Loc : chr  "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ...
##  $ Crash_Mont: chr  NA NA NA NA ...
##  $ Crash_Time: POSIXct, format: "0001-01-01 13:17:58" "0001-01-01 23:08:58" ...
##  $ Crash_Type: chr  "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ...
##  $ Crash_Ty_1: int  353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ...
##  $ Crash_Year: int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Crsh_Sevri: chr  "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ...
##  $ Developmen: chr  "Residential" "Commercial" "Residential" "Residential" ...
##  $ DrvrAge_Gr: chr  "60-69" "30-39" "50-59" "30-39" ...
##  $ Drvr_Age  : int  66 34 52 33 NA 20 40 NA 17 51 ...
##  $ Drvr_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Drvr_EstSp: chr  "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ...
##  $ Drvr_Injur: chr  "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ...
##  $ Drvr_Race : chr  "Black" "Black" "White" "White" ...
##  $ Drvr_Sex  : chr  "Male" "Male" "Female" "Female" ...
##  $ Drvr_VehTy: chr  "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ...
##  $ ExcsSpdInd: chr  "No" "No" "No" "No" ...
##  $ Hit_Run   : chr  "No" "No" "No" "No" ...
##  $ Light_Cond: chr  "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ...
##  $ Locality  : chr  "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ...
##  $ Num_Lanes : chr  "2 lanes" "5 lanes" "2 lanes" "4 lanes" ...
##  $ Num_Units : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ Rd_Charact: chr  "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ...
##  $ Rd_Class  : chr  "Local Street" "Local Street" "Local Street" "NC Route" ...
##  $ Rd_Conditi: chr  "Dry" "Dry" "Dry" "Dry" ...
##  $ Rd_Config : chr  "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ...
##  $ Rd_Defects: chr  "None" "None" "None" "None" ...
##  $ Rd_Feature: chr  "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ...
##  $ Rd_Surface: chr  "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ...
##  $ Region    : chr  "Piedmont" "Coastal" "Coastal" "Piedmont" ...
##  $ Rural_Urba: chr  "Urban" "Urban" "Rural" "Urban" ...
##  $ Speed_Limi: chr  "20 - 25  MPH" "40 - 45  MPH" "30 - 35  MPH" "40 - 45  MPH" ...
##  $ Traff_Cntr: chr  "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ...
##  $ Weather   : chr  "Clear" "Clear" "Clear" "Cloudy" ...
##  $ Workzone_I: chr  "No" "No" "No" "No" ...
##  $ Location  : chr  "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 54
##   .. ..$ FID       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ OBJECTID  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ AmbulanceR: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BikeAge_Gr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Age  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Bike_Alc_D: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Dir  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Injur: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Pos  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Race : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Bike_Sex  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ City      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ County    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CrashAlcoh: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CrashDay  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Date:List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
##   .. ..$ Crash_Grp : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Hour: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crash_Loc : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Mont: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Time:List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ Crash_Type: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Crash_Ty_1: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crash_Year: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Crsh_Sevri: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Developmen: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ DrvrAge_Gr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Age  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Drvr_Alc_D: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_EstSp: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Injur: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Race : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_Sex  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Drvr_VehTy: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ExcsSpdInd: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Hit_Run   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Light_Cond: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Locality  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Num_Lanes : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Num_Units : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Rd_Charact: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Class  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Conditi: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Config : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Defects: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Feature: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rd_Surface: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Region    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Rural_Urba: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Speed_Limi: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Traff_Cntr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Weather   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Workzone_I: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Location  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Viewing your data 3

Use the glimpse() function to see all variables and the data in them

glimpse(bike)
## Observations: 5,716
## Variables: 54
## $ FID        <int> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82,...
## $ OBJECTID   <int> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83,...
## $ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No",...
## $ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40...
## $ Bike_Age   <int> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20,...
## $ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Bike_Dir   <chr> "Not Applicable", "With Traffic", "With Traffic", N...
## $ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury...
## $ Bike_Pos   <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "...
## $ Bike_Race  <chr> "Black", "Black", "Black", "White", "Black", "White...
## $ Bike_Sex   <chr> "Female", "Male", "Male", "Male", "Male", "Male", "...
## $ City       <chr> "Durham", "Greenville", "Farmville", "Charlotte", "...
## $ County     <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenbu...
## $ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ CrashDay   <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA,...
## $ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 20...
## $ Crash_Grp  <chr> "Bicyclist Failed to Yield - Midblock", "Crossing P...
## $ Crash_Hour <int> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, ...
## $ Crash_Loc  <chr> "Non-Intersection", "Intersection-Related", "Inters...
## $ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01...
## $ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-...
## $ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Cross...
## $ Crash_Ty_1 <int> 353311, 211180, 111144, 119139, 112114, 311231, 119...
## $ Crash_Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200...
## $ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No ...
## $ Developmen <chr> "Residential", "Commercial", "Residential", "Reside...
## $ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "4...
## $ Drvr_Age   <int> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50,...
## $ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Mis...
## $ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "...
## $ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O:...
## $ Drvr_Race  <chr> "Black", "Black", "White", "White", "/Missing", "Wh...
## $ Drvr_Sex   <chr> "Male", "Male", "Female", "Female", NA, "Female", "...
## $ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport ...
## $ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Hit_Run    <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "...
## $ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "...
## $ Locality   <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Develo...
## $ Num_Lanes  <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lane...
## $ Num_Units  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight -...
## $ Rd_Class   <chr> "Local Street", "Local Street", "Local Street", "NC...
## $ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "D...
## $ Rd_Config  <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotec...
## $ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "No...
## $ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Fou...
## $ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt...
## $ Region     <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Pied...
## $ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban...
## $ Speed_Limi <chr> "20 - 25  MPH", "40 - 45  MPH", "30 - 35  MPH", "40...
## $ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop S...
## $ Weather    <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clea...
## $ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No...
## $ Location   <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35....

Tidy data

  • In tidy data
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
  • Messy data is any other other arrangement of the data
  • tidyr package is helpful for converting messy data to tidy data

  • We'll start with tidy data, and then move on to tidying up messy data

Data manipulation with dplyr

Data manipulation in R

  • Using base R functions

  • Using the tidyr and dplyr packages \(\leftarrow\) our focus today

  • Using a variety of other packages from the tidyverse like plyr, lubridate, etc.

Verbs of dplyr

The dplyr package is based on the concepts of functions as verbs that manipulate data frames:

  • filter(): pick rows matching criteria
  • select(): pick columns by name
  • rename(): rename specific columns
  • arrange(): reorder rows
  • mutate(): add new variables
  • transmute(): create new data frame with variables
  • sample_n() / sample_frac(): randomly sample rows
  • summarise(): reduce variables to values

dplyr rules

  • First argument is a data frame
  • Subsequent arguments say what to do with data frame
  • Always return a data frame
  • Avoid modify in place

Filter rows with filter()

  • Select a subset of rows in a data frame.
  • Easily filter for many conditions at once.

filter()

for crashes in Durham County

bike %>%
  filter(County == "Durham")
## # A tibble: 253 × 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1     18       19         No       <NA>        6         No Not Applicable
## 2     53       54        Yes      50-59       52         No   With Traffic
## 3     56       57        Yes      16-19       18         No           <NA>
## 4    209      210         No      16-19       16         No Facing Traffic
## 5    228      229        Yes      40-49       40         No   With Traffic
## 6    620      621        Yes      50-59       55         No   With Traffic
## 7    667      668        Yes      60-69       61         No Not Applicable
## 8    458      459        Yes      60-69       62         No   With Traffic
## 9    576      577         No      40-49       49         No   With Traffic
## 10   618      619         No      20-24       23         No   With Traffic
## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>,
## #   Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
## #   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
## #   Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

filter()

for crashes in Durham County where biker was < 10 yrs old

bike %>%
  filter(County == "Durham", Bike_Age < 10)
## # A tibble: 20 × 54
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1     18       19         No       <NA>        6         No Not Applicable
## 2     47       48         No     10-Jun        9         No Not Applicable
## 3    124      125        Yes     10-Jun        8         No   With Traffic
## 4    531      532        Yes     10-Jun        7         No   With Traffic
## 5    704      705        Yes     10-Jun        9         No Not Applicable
## 6     42       43         No     10-Jun        8         No   With Traffic
## 7    392      393        Yes        0-5        2         No Not Applicable
## 8    941      942         No     10-Jun        9         No   With Traffic
## 9    436      437        Yes     10-Jun        6         No Not Applicable
## 10   160      161        Yes     10-Jun        7         No   With Traffic
## 11   273      274        Yes     10-Jun        7         No Facing Traffic
## 12    78       79        Yes     10-Jun        7         No   With Traffic
## 13   422      423         No     10-Jun        9         No Not Applicable
## 14   570      571         No       <NA>        0    Missing Not Applicable
## 15   683      684        Yes     10-Jun        8         No Not Applicable
## 16    62       63        Yes     10-Jun        7         No   With Traffic
## 17   248      249         No        0-5        4         No Not Applicable
## 18   306      307        Yes     10-Jun        8         No   With Traffic
## 19   231      232        Yes     10-Jun        8         No   With Traffic
## 20   361      362        Yes     10-Jun        9         No   With Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

Commonly used logical operators in R

operator definition
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
x | y x OR y
x & y x AND y

Commonly used logical operators in R

operator definition
is.na(x) test if x is NA
!is.na(x) test if x is not NA
x %in% y test if x is in y
!(x %in% y) test if x is not in y
!x not x

Aside: real data is messy!

What in the world does a BikeAge_gr of 10-Jun or 15-Nov mean?

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())
## # A tibble: 13 × 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
## 1         0-5          60
## 2      10-Jun         421
## 3      15-Nov         747
## 4       16-19         605
## 5       20-24         680
## 6       25-29         430
## 7       30-39         658
## 8       40-49         920
## 9       50-59         739
## 10      60-69         274
## 11         70          12
## 12        70+          58
## 13       <NA>         112

Careful data scientists clean up their data first!

  • We're going to need to do some text parsing to clean up these data
    • 10-Jun should be 6-10
    • 15-Nov should be 11-15
  • New R package: stringr

Install and load: stringr

Replace with str_replace() and add new variables with mutate()

  • Remember we want to do the following in the BikeAge_Gr variable: 10-Jun should be 6-10 and 15-Nov should be 11-15
bike <- bike %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10")) %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"))
  • Note that we're overwriting existing data and columns, so be careful!
    • But remember, it's easy to revert if you make a mistake since we didn't touch the raw data, we can always reload it and start over

Check before you move on

Always check your changes and confirm code did what you wanted it to do

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(count = n())
## # A tibble: 13 × 2
##    BikeAge_Gr count
##         <chr> <int>
## 1         0-5    60
## 2       11-15   747
## 3       16-19   605
## 4       20-24   680
## 5       25-29   430
## 6       30-39   658
## 7       40-49   920
## 8       50-59   739
## 9        6-10   421
## 10      60-69   274
## 11         70    12
## 12        70+    58
## 13       <NA>   112

Alternative for counting

You can also count using the count() function

bike %>%
  group_by(BikeAge_Gr) %>%
  count()
## # A tibble: 13 × 2
##    BikeAge_Gr     n
##         <chr> <int>
## 1         0-5    60
## 2       11-15   747
## 3       16-19   605
## 4       20-24   680
## 5       25-29   430
## 6       30-39   658
## 7       40-49   920
## 8       50-59   739
## 9        6-10   421
## 10      60-69   274
## 11         70    12
## 12        70+    58
## 13       <NA>   112

Summarise any summary statistics

We can use the same format for calculating other summary statistics

bike %>%
  group_by(CrashAlcoh) %>%
  summarise(mean_age = mean(Bike_Age, na.rm = TRUE), 
            median_age = median(Bike_Age, na.rm = TRUE))
## # A tibble: 2 × 3
##   CrashAlcoh mean_age median_age
##        <chr>    <dbl>      <dbl>
## 1         No 29.90971         25
## 2        Yes 39.41266         43

slice() for certain row numbers

First five

bike %>%
  slice(1:5)
## # A tibble: 5 × 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1    18       19         No       <NA>        6         No Not Applicable
## 2    29       30        Yes      50-59       51         No   With Traffic
## 3    33       34         No       <NA>       10         No   With Traffic
## 4    35       36        Yes      16-19       17         No           <NA>
## 5    49       50         No       <NA>        6         No Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

slice() for certain row numbers

Last five

last_row <- nrow(bike)
bike %>%
  slice((last_row-4):last_row)
## # A tibble: 5 × 54
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   <int>    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1   460      461        Yes       6-10        7         No Not Applicable
## 2   474      475        Yes      50-59       50         No   With Traffic
## 3   479      480        Yes      16-19       16         No Not Applicable
## 4   487      488         No      40-49       47        Yes   With Traffic
## 5   488      489        Yes      30-39       35         No Facing Traffic
## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>,
## #   Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>,
## #   CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>,
## #   Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

select() to keep only the variables you mention

bike %>%
  select(Crash_Loc, Hit_Run) %>%
  table()
##                       Hit_Run
## Crash_Loc                No  Yes
##   Intersection         2223  275
##   Intersection-Related  252   42
##   Location                3    7
##   Non-Intersection     2213  462
##   Non-Roadway           205   30

or select()to exclude variables

bike %>%
  select(-OBJECTID)
## # A tibble: 5,716 × 53
##      FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    <int>      <chr>      <chr>    <int>      <chr>          <chr>
## 1     18         No       <NA>        6         No Not Applicable
## 2     29        Yes      50-59       51         No   With Traffic
## 3     33         No       <NA>       10         No   With Traffic
## 4     35        Yes      16-19       17         No           <NA>
## 5     49         No       <NA>        6         No Facing Traffic
## 6     53        Yes      50-59       52         No   With Traffic
## 7     56        Yes      16-19       18         No           <NA>
## 8     60         No      40-49       40         No Facing Traffic
## 9     63        Yes       6-10        6         No Facing Traffic
## 10    66        Yes       6-10        7         No           <NA>
## # ... with 5,706 more rows, and 47 more variables: Bike_Injur <chr>,
## #   Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>,
## #   County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>,
## #   Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>,
## #   Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>,
## #   Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>,
## #   DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>,
## #   Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>,
## #   ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>,
## #   Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>,
## #   Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>,
## #   Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>,
## #   Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>

rename() specific columns

Correct typos and rename to make variable names shorter and/or more informative

  • Original names:
names(bike)
##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"
  • Rename Speed_Limi to Speed_Limit:
bike <- bike %>%
  rename(Speed_Limit = Speed_Limi)

Check before you move on

Always check your changes and confirm code did what you wanted it to do

names(bike)
##  [1] "FID"         "OBJECTID"    "AmbulanceR"  "BikeAge_Gr"  "Bike_Age"   
##  [6] "Bike_Alc_D"  "Bike_Dir"    "Bike_Injur"  "Bike_Pos"    "Bike_Race"  
## [11] "Bike_Sex"    "City"        "County"      "CrashAlcoh"  "CrashDay"   
## [16] "Crash_Date"  "Crash_Grp"   "Crash_Hour"  "Crash_Loc"   "Crash_Mont" 
## [21] "Crash_Time"  "Crash_Type"  "Crash_Ty_1"  "Crash_Year"  "Crsh_Sevri" 
## [26] "Developmen"  "DrvrAge_Gr"  "Drvr_Age"    "Drvr_Alc_D"  "Drvr_EstSp" 
## [31] "Drvr_Injur"  "Drvr_Race"   "Drvr_Sex"    "Drvr_VehTy"  "ExcsSpdInd" 
## [36] "Hit_Run"     "Light_Cond"  "Locality"    "Num_Lanes"   "Num_Units"  
## [41] "Rd_Charact"  "Rd_Class"    "Rd_Conditi"  "Rd_Config"   "Rd_Defects" 
## [46] "Rd_Feature"  "Rd_Surface"  "Region"      "Rural_Urba"  "Speed_Limit"
## [51] "Traff_Cntr"  "Weather"     "Workzone_I"  "Location"

summarise() in a new data frame

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(crash_count)
## # A tibble: 13 × 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
## 1          70          12
## 2         70+          58
## 3         0-5          60
## 4        <NA>         112
## 5       60-69         274
## 6        6-10         421
## 7       25-29         430
## 8       16-19         605
## 9       30-39         658
## 10      20-24         680
## 11      50-59         739
## 12      11-15         747
## 13      40-49         920

and arrange() to order rows

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(desc(crash_count))
## # A tibble: 13 × 2
##    BikeAge_Gr crash_count
##         <chr>       <int>
## 1       40-49         920
## 2       11-15         747
## 3       50-59         739
## 4       20-24         680
## 5       30-39         658
## 6       16-19         605
## 7       25-29         430
## 8        6-10         421
## 9       60-69         274
## 10       <NA>         112
## 11        0-5          60
## 12        70+          58
## 13         70          12

Select rows with sample_n() or sample_frac()

  • sample_n(): randomly sample 5 observations
bike_n5 <- bike %>%
  sample_n(5, replace = FALSE)
dim(bike_n5)
## [1]  5 54
  • sample_frac(): randomly sample 20% of observations
bike_perc20 <-bike %>%
  sample_frac(0.2, replace = FALSE)
dim(bike_perc20)
## [1] 1143   54

More dplyr resources

Basic R syntax

Few important R syntax notes

For when not working with dplyr

  • Refer to a variable in a dataset as bike$Crash_Loc

  • Access any element in a dataframe using square brackets

bike[1,5] # row 1, column 5
## # A tibble: 1 × 1
##   Bike_Age
##      <int>
## 1        6
- For all observations in row 1: `bike[1, ]`
- For all observations in column 5: `bike[, 5]`

Tidying up messy data with tidyr

Separating a column into multiple columns

bike <- separate(bike, col = Location, into = c("Lat", "Lon"), 
                 sep = ", ", remove = FALSE)

bike %>%
  select(Location, Lat, Lon) %>%
  head(5)
## # A tibble: 5 × 3
##               Location       Lat       Lon
##                  <chr>     <chr>     <chr>
## 1  36.002743, -78.8785 36.002743  -78.8785
## 2 35.612984, -77.39265 35.612984 -77.39265
## 3 35.595676, -77.59074 35.595676 -77.59074
## 4  35.076767, -80.7728 35.076767  -80.7728
## 5  35.19999, -80.75713  35.19999 -80.75713

Uniting multiple columns into a single column

bike <- unite(bike, col = ID, FID, OBJECTID, remove = FALSE)

bike %>%
  select(FID, OBJECTID, ID) %>%
  head(5)
## # A tibble: 5 × 3
##     FID OBJECTID    ID
##   <int>    <int> <chr>
## 1    18       19 18_19
## 2    29       30 29_30
## 3    33       34 33_34
## 4    35       36 35_36
## 5    49       50 49_50

Changing gears…

Let's take a look at a new dataset:

hdi <- read_csv("https://stat.duke.edu/~mc301/data/hdi.csv")

Wide to long data with gather

hdi_long <- gather(hdi, key = year, value = hd_index, hdi_1980:hdi_2011)

head(hdi_long)
## # A tibble: 6 × 3
##                 country     year hd_index
##                   <chr>    <chr>    <dbl>
## 1              Abkhazia hdi_1980       NA
## 2           Afghanistan hdi_1980    0.198
## 3 Akrotiri and Dhekelia hdi_1980       NA
## 4               Albania hdi_1980       NA
## 5               Algeria hdi_1980    0.454
## 6        American Samoa hdi_1980       NA

Let's do a little bit better…

hdi_long <- hdi_long %>%
  mutate(year = as.numeric(str_replace(year, "hdi_", "")))

head(hdi_long)
## # A tibble: 6 × 3
##                 country  year hd_index
##                   <chr> <dbl>    <dbl>
## 1              Abkhazia  1980       NA
## 2           Afghanistan  1980    0.198
## 3 Akrotiri and Dhekelia  1980       NA
## 4               Albania  1980       NA
## 5               Algeria  1980    0.454
## 6        American Samoa  1980       NA

Long to wide data with spread

hdi_wide <- spread(hdi_long, key = year, value = hd_index)

head(hdi_wide)
## # A tibble: 6 × 10
##                 country `1980` `1990` `2000` `2005` `2006` `2007` `2008`
##                   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1              Abkhazia     NA     NA     NA     NA     NA     NA     NA
## 2           Afghanistan  0.198  0.246  0.230  0.340  0.354  0.363  0.370
## 3 Akrotiri and Dhekelia     NA     NA     NA     NA     NA     NA     NA
## 4               Albania     NA  0.656  0.691  0.721  0.724  0.729  0.733
## 5               Algeria  0.454  0.551  0.624  0.667  0.673  0.680  0.686
## 6        American Samoa     NA     NA     NA     NA     NA     NA     NA
## # ... with 2 more variables: `2009` <dbl>, `2011` <dbl>

What's next?

Want more R?

  • Resources for learning R:
    • Coursera
    • DataCamp
    • Many many online demos, resources, examples, as well as books
  • Debugging R errors:
  • Keeping up with what's new in R land:

Exercise

Your turn

Create a new dataframe that doesn't include observations where Bike_Injur = "Injury" since it's not clear what this means.

This new dataframe only should include observations in Durham County, and where the biker is a teenager (13 to 19 years, inclusive).

Then, calculate the number of accidents for each type of bike direction (Bike_Dir).

Your turn - solution

bike2 <- bike %>%
  filter(Bike_Injur != "Injury") %>%
  filter(County == "Durham") %>%
  filter(between(Bike_Age, 13, 19))

bike2 %>%
  group_by(Bike_Dir) %>%
  count()
## # A tibble: 4 × 2
##         Bike_Dir     n
##            <chr> <int>
## 1 Facing Traffic     9
## 2 Not Applicable     6
## 3   With Traffic    18
## 4           <NA>     5