R Workshop / Duke MEM

October 14, 2015

Intro

Outline

Introduction to R and RStudio
Reproducible data analysis with R Markdown
Loading data
Data visualization
Data wrangling
Basic R syntax
What next?
Hands on exercises

Materials

All source code at https://github.com/mine-cetinkaya-rundel/rworkshop-mem
Slides at http://rpubs.com/minebocek/117428

Introduction to R and RStudio

What are R and RStudio?

R: Statistical programming language
RStudio:
- Integrated development environment for R
- Powerful and productive user interface for R
Both are free and open-source

Getting started

Traditionally you would install R and RStudio on your computer
We will skip over that step for now for efficiency and use the RStudio server at

https://vm-manage.oit.duke.edu/containers

(Log in with your Duke Net ID and password)

Local installation instructions will be provided at the end of the workshop

Anatomy of RStudio

Left: Console
- Text on top at launch: version of R that you’re running
- Below that is the prompt
Upper right: Workspace and command history
Lower right: Plots, access to files, help, packages, data viewer

What version am I using?

The version of R is text that pops up in the Console when you start RStudio
To find out the version of RStudio go to Help $\rightarrow$ About RStudio
It's good practice to keep both R and RStudio up to date
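
The version information can also be checked from the Console. A minimal sketch, assuming only base R; `stats` is used purely as an illustration of a package that ships with R:

```r
# Version of R currently running, as a single string
r_version <- R.version.string
print(r_version)

# Version of an installed package (stats ships with R)
stats_version <- packageVersion("stats")
print(stats_version)
```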

R packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)
We will use the ggplot2 package for plots and dplyr for data wrangling in this workshop
Install these packages by running the following in the Console:

install.packages("ggplot2")
install.packages("dplyr")

Then, load the packages by running the following:

library(ggplot2)
library(dplyr)

This is just one way of installing a package. There is also a GUI approach in the Packages pane in RStudio.
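
Installing a package only needs to happen once per machine, while loading happens in every session. One possible sketch for installing only what is missing; `missing_packages()` is a hypothetical helper written for this illustration, not part of R or RStudio:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "dplyr"))
print(to_install)

# Install only what is missing (needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```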

Reproducible data analysis with R Markdown

What is R Markdown?

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.
It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.
R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).

Source: http://rmarkdown.rstudio.com/

Your turn!

Create your first R Markdown document, knit it, and examine the source code and the output.

File $\rightarrow$ R Markdown…
Enter a title (e.g. "My first R Markdown document") and author info
Choose Document as file type, and HTML as the output
Hit OK
Click Knit HTML in the new document, which will prompt you to save your document
- Naming tip: Do not use spaces
- Viewing tip: Click on the down arrow next to Knit HTML and select View in Pane

Markdown basics

Markdown is a simple formatting language designed to make authoring content easy for everyone.
Rather than writing complex markup code (e.g. HTML or LaTeX), Markdown enables the use of a syntax much more like plain-text email.

R Code Chunks

Within an R Markdown file, R Code Chunks can be embedded using the native Markdown syntax for fenced code regions.

Your turn!

How many code chunks are in your R Markdown document?

What does each code chunk do? You may not understand the R syntax yet, but you should be able to compare the source file and the output to answer this question.

Inline R Code

You can also evaluate R expressions inline by enclosing the expression within a single back-tick qualified with ‘r’. For example, the following code:

Results in this output: "I counted 2 red trucks on the highway."

Your turn!

Suppose Sammy works on average 8.37 hours per day, 5 days per week. How many hours does Sammy work on average per week?

Add a sentence to your document that includes simple inline R code that answers this question, along the lines of…

"Sammy works 8.37 * 5 hours per week, on average."

Workspaces

R Markdown workspace and Console workspace are independent of each other

If you define a variable in your Console and it shows up in the Environment tab, it is not going to be automatically included in your R Markdown document
If you define a variable in your R Markdown document, it won't automatically be available in your Console

[ Demo ]

Tip: To help manage workspaces, use the buttons in each code chunk to run all previous chunks in the source file or to run the current chunk.

Workspaces and reproducibility

The fact that the two workspaces do not automatically have access to the same variables might / will be frustrating at first.
This is not a bug. It is a feature that helps reproducibility: it ensures that all variables, functions, etc. used in the R Markdown document are explicitly defined or loaded in the document itself.
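
The idea of two independent workspaces can be imitated in plain R with two environments. This is only an analogy, not how knitting is implemented; the names `console_env` and `document_env` are made up for the illustration:

```r
# A variable defined in one workspace...
console_env <- new.env()
assign("x", 2, envir = console_env)

# ...is not visible from a separate, clean workspace
# (parent = baseenv() keeps base functions such as `*` available)
document_env <- new.env(parent = baseenv())

in_console <- eval(quote(x * 3), envir = console_env)
in_document <- tryCatch(
  eval(quote(x * 3), envir = document_env),
  error = function(e) conditionMessage(e)
)

print(in_console)   # 6
print(in_document)  # an error message saying that x was not found
```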

Your turn!

Define x = 2 in the Console. Then, in your Console run x * 3. Does your code run as expected?
Now, insert a new code chunk in your R Markdown document and in this chunk type x * 3 only. Knit your document. Does the document compile, or do you get an error? If you get an error, what does the error say, and how can you fix it? Implement the fix and Knit your document. Make sure you are able to compile without errors before you move on.

Tip: Insert a new code chunk by clicking Chunks $\rightarrow$ Insert Chunk.

Next insert another code chunk in your R Markdown document and define y = 4 and calculate y + 5. Knit your document. Does everything work as expected?
Now run y + 5 in your Console. Does your code run as expected or do you get an error? If you get an error, what does the error say, and how can you fix it? Implement the fix.

Code chunk options

You can hide the code, hide the result, hide warnings, messages, etc.
Refer to the handy R Markdown cheatsheet
Another good reference: http://rmarkdown.rstudio.com/authoring_rcodechunks.html
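
Options can be set per chunk in the chunk header, or once for all chunks. A minimal sketch of the second approach using knitr's `opts_chunk`, typically placed in a setup chunk near the top of the document; the particular option values are only an example:

```r
library(knitr)

# Set defaults for every chunk that follows in the document:
# hide code, warnings, and messages, but keep results
opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

# Inspect a current default
opts_chunk$get("echo")
```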

Loading data

NC DOT Fatal Crashes in North Carolina

bike <- read.csv("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv", 
                 sep = ";", stringsAsFactors = FALSE, na.strings = c("NA", "", ".")) %>%
  tbl_df()

View the names of variables via

names(bike)

##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"

and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.
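
To see what `sep` and `na.strings` do in the call above, here is a small sketch that reads a made-up, semicolon-separated snippet from a string instead of a URL. The three rows are invented for illustration and are not taken from the real file:

```r
# Made-up data: "." and empty fields stand in for missing values
toy_csv <- "City;Bike_Age;Bike_Sex\nDurham;6;Female\nGreenville;.;Male\nFarmville;10;"

toy <- read.csv(text = toy_csv, sep = ";", stringsAsFactors = FALSE,
                na.strings = c("NA", "", "."))

toy
is.na(toy$Bike_Age)  # FALSE TRUE FALSE
is.na(toy$Bike_Sex)  # FALSE FALSE TRUE
```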

Aside: Strings (characters) vs factors

In versions of R before 4.0.0, character vectors are converted into factors by default when they are included in a data frame (from R 4.0.0 on, the default is stringsAsFactors = FALSE).
Sometimes this is useful, sometimes it isn't; either way it is important to know what type/class you are working with.
This behavior can be controlled with the stringsAsFactors argument (e.g. stringsAsFactors = FALSE) when loading a data frame.
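
A short sketch of the difference, using a made-up column and setting `stringsAsFactors` explicitly so the result does not depend on the version of R:

```r
counties <- c("Durham", "Pitt", "Durham")

df_chr <- data.frame(county = counties, stringsAsFactors = FALSE)
df_fct <- data.frame(county = counties, stringsAsFactors = TRUE)

class(df_chr$county)   # "character"
class(df_fct$county)   # "factor"
levels(df_fct$county)  # "Durham" "Pitt"
```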

Viewing your data

In the Environment, click on the name of the data frame to view it in the data viewer
Use the str() function to compactly display the internal structure of an R object

str(bike)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5716 obs. of  54 variables:
##  $ FID       : int  18 29 33 35 49 53 56 60 63 66 ...
##  $ OBJECTID  : int  19 30 34 36 50 54 57 61 64 67 ...
##  $ AmbulanceR: chr  "No" "Yes" "No" "Yes" ...
##  $ BikeAge_Gr: chr  NA "50-59" NA "16-19" ...
##  $ Bike_Age  : int  6 51 10 17 6 52 18 40 6 7 ...
##  $ Bike_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Bike_Dir  : chr  "Not Applicable" "With Traffic" "With Traffic" NA ...
##  $ Bike_Injur: chr  "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ...
##  $ Bike_Pos  : chr  "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ...
##  $ Bike_Race : chr  "Black" "Black" "Black" "White" ...
##  $ Bike_Sex  : chr  "Female" "Male" "Male" "Male" ...
##  $ City      : chr  "Durham" "Greenville" "Farmville" "Charlotte" ...
##  $ County    : chr  "Durham" "Pitt" "Pitt" "Mecklenburg" ...
##  $ CrashAlcoh: chr  "No" "No" "No" "No" ...
##  $ CrashDay  : chr  "01-01-06" "01-01-02" "01-01-07" "01-01-05" ...
##  $ Crash_Date: chr  "2007-01-06" "2007-01-09" "2007-01-14" "2007-01-12" ...
##  $ Crash_Grp : chr  "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ...
##  $ Crash_Hour: int  13 23 16 19 12 20 19 14 16 0 ...
##  $ Crash_Loc : chr  "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ...
##  $ Crash_Mont: chr  NA NA NA NA ...
##  $ Crash_Time: chr  "0001-01-01T08:21:58-04:56" "0001-01-01T18:12:58-04:56" "0001-01-01T11:48:58-04:56" "0001-01-01T14:59:58-04:56" ...
##  $ Crash_Type: chr  "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ...
##  $ Crash_Ty_1: int  353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ...
##  $ Crash_Year: int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Crsh_Sevri: chr  "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ...
##  $ Developmen: chr  "Residential" "Commercial" "Residential" "Residential" ...
##  $ DrvrAge_Gr: chr  "60-69" "30-39" "50-59" "30-39" ...
##  $ Drvr_Age  : int  66 34 52 33 NA 20 40 NA 17 51 ...
##  $ Drvr_Alc_D: chr  "No" "No" "No" "No" ...
##  $ Drvr_EstSp: chr  "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ...
##  $ Drvr_Injur: chr  "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ...
##  $ Drvr_Race : chr  "Black" "Black" "White" "White" ...
##  $ Drvr_Sex  : chr  "Male" "Male" "Female" "Female" ...
##  $ Drvr_VehTy: chr  "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ...
##  $ ExcsSpdInd: chr  "No" "No" "No" "No" ...
##  $ Hit_Run   : chr  "No" "No" "No" "No" ...
##  $ Light_Cond: chr  "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ...
##  $ Locality  : chr  "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ...
##  $ Num_Lanes : chr  "2 lanes" "5 lanes" "2 lanes" "4 lanes" ...
##  $ Num_Units : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ Rd_Charact: chr  "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ...
##  $ Rd_Class  : chr  "Local Street" "Local Street" "Local Street" "NC Route" ...
##  $ Rd_Conditi: chr  "Dry" "Dry" "Dry" "Dry" ...
##  $ Rd_Config : chr  "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ...
##  $ Rd_Defects: chr  "None" "None" "None" "None" ...
##  $ Rd_Feature: chr  "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ...
##  $ Rd_Surface: chr  "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ...
##  $ Region    : chr  "Piedmont" "Coastal" "Coastal" "Piedmont" ...
##  $ Rural_Urba: chr  "Urban" "Urban" "Rural" "Urban" ...
##  $ Speed_Limi: chr  "20 - 25  MPH" "40 - 45  MPH" "30 - 35  MPH" "40 - 45  MPH" ...
##  $ Traff_Cntr: chr  "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ...
##  $ Weather   : chr  "Clear" "Clear" "Clear" "Cloudy" ...
##  $ Workzone_I: chr  "No" "No" "No" "No" ...
##  $ Location  : chr  "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ...

Data visualization

Data visualization in R

Using base R functions
Using the ggplot2 package $\leftarrow$ our focus today
Using a variety of other packages like lattice, ggvis, etc.

The Grammar of Graphics

Visualisation concept created by Wilkinson (1999)
- to define the basic elements of a statistical graphic
Adapted for R by Wickham (2009) who created the ggplot2 package
- consistent and compact syntax to describe statistical graphics
- highly modular as it breaks up graphs into semantic components
It is not a guide to which graph to choose or how best to convey information!

Source: https://rpubs.com/timwinke/ggplot2workshop

The Grammar of Graphics - Terminology

A statistical graphic is a…

mapping of data
to aesthetic attributes (color, size, xy-position)
using geometric objects (points, lines, bars)
with data being statistically transformed (summarised, log-transformed)
and mapped onto a specific facet and coordinate system

Biker age vs. crash hour

Which data is used as an input?
What geometric objects are chosen for visualization?
What variables are mapped onto which attributes?
What type of scales are used to map data to aesthetics?
Are the variables statistically transformed before plotting?

Biker age vs. crash hour - code

ggplot(data = bike, aes(x = Crash_Hour, y = Bike_Age)) +
  geom_point()

## Warning: Removed 130 rows containing missing values (geom_point).
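
The warning means that rows with a missing value in either plotted variable were dropped. One way to count such rows yourself, sketched here on a made-up data frame rather than on `bike`:

```r
toy <- data.frame(
  Crash_Hour = c(13, NA, 16, 19),
  Bike_Age   = c(6, 51, NA, 17)
)

# Rows where at least one of the two plotted variables is NA
n_dropped <- sum(!complete.cases(toy[, c("Crash_Hour", "Bike_Age")]))
n_dropped  # 2
```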

Altering features

How did the plot change?
Are these changes based on data (i.e. can be mapped to variables in the dataset) or changes in preferences for geometric objects?

Altering features - code

ggplot(data = bike, aes(x = Crash_Hour, y = Bike_Age)) +
  geom_point(alpha = 0.5, color = "blue")

## Warning: Removed 130 rows containing missing values (geom_point).

More alterations

How did the plot change?
Are these changes based on data (i.e. can be mapped to variables in the dataset) or changes in preferences for geometric objects?

More alterations - code

ggplot(data = bike, aes(x = Crash_Hour, y = Bike_Age, color = AmbulanceR)) +
  geom_point(alpha = 0.5) +
  facet_grid(. ~ Bike_Sex)

Anatomy of ggplots

ggplot(data = [dataframe], aes(x = [var_x], y = [var_y], 
       color = [var_for_color], fill = [var_for_fill], 
       shape = [var_for_shape])) +
    geom_[some_geom] +
    ... # other options
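
A concrete instance of the template, as a sketch. The `toy` data frame is invented for illustration (it only borrows column names from `bike`), and `labs()` stands in for the "other options":

```r
library(ggplot2)

# Made-up data with the same column names as in bike
toy <- data.frame(
  Crash_Hour = c(8, 13, 17, 19, 23),
  Bike_Age   = c(12, 34, 9, 51, 27),
  Bike_Sex   = c("Male", "Female", "Male", "Male", "Female")
)

p <- ggplot(data = toy, aes(x = Crash_Hour, y = Bike_Age, color = Bike_Sex)) +
  geom_point(alpha = 0.5) +                  # geometric object
  labs(x = "Crash hour", y = "Biker age")    # other options

print(p)
```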

Histograms

ggplot(data = bike, aes(x = Bike_Age)) +
  geom_histogram(binwidth = 5)

Boxplots

ggplot(data = bike, aes(y = Bike_Age, x = Bike_Sex)) +
  geom_boxplot()

Barplots

ggplot(data = bike, aes(x = Bike_Injur)) +
  geom_bar()

Segmented barplots

ggplot(data = bike, aes(x = Crash_Loc, fill = Bike_Injur)) +
  geom_bar()

Segmented barplots - proportions

ggplot(data = bike, aes(x = Crash_Loc, fill = Bike_Injur)) +
  geom_bar(position="fill")
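
With `position = "fill"` each bar is rescaled to height 1, so the segments show proportions within each `Crash_Loc` rather than counts. The same numbers can be computed as a table; a sketch on made-up data:

```r
# Made-up data with the same column names as in bike
toy <- data.frame(
  Crash_Loc  = c("Intersection", "Intersection", "Intersection", "Intersection",
                 "Non-Intersection", "Non-Intersection"),
  Bike_Injur = c("B: Evident Injury", "C: Possible Injury", "C: Possible Injury",
                 "O: No Injury", "B: Evident Injury", "O: No Injury")
)

counts <- table(toy$Crash_Loc, toy$Bike_Injur)

# margin = 1: proportions are computed within each row (each Crash_Loc)
props <- prop.table(counts, margin = 1)
print(props)
```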

More `ggplot2` resources

Visit http://docs.ggplot2.org/current/ for documentation on the current version of the ggplot2 package. It's full of examples!
Refer to the ggplot2 cheatsheet.

Data wrangling

Data wrangling in R

Using base R functions
Using the dplyr package $\leftarrow$ our focus today
Using a variety of other packages like plyr, tidyr, lubridate, etc.

Data wrangling with `dplyr`

The dplyr package is based on the concepts of functions as verbs that manipulate data frames:

filter(): pick rows matching criteria
select(): pick columns by name
rename(): rename specific columns
arrange(): reorder rows
mutate(): add new variables
transmute(): create new data frame with variables
sample_n() / sample_frac(): randomly sample rows
summarise(): reduce variables to values

`dplyr` rules

First argument is a data frame
Subsequent arguments say what to do with data frame
Always return a data frame
Avoid modifying in place
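
These rules can be seen in a small sketch on a made-up data frame: the pipe simply passes the data frame as the first argument, and the original object is left untouched.

```r
library(dplyr)

# Made-up data with the same column names as in bike
toy <- data.frame(
  County   = c("Durham", "Pitt", "Durham", "Wake"),
  Bike_Age = c(6, 51, 17, 40)
)

# These two calls are equivalent: the pipe passes `toy` as the first argument
piped  <- toy %>% filter(County == "Durham")
direct <- filter(toy, County == "Durham")

# The result is a new data frame; `toy` itself still has all four rows
nrow(piped)  # 2
nrow(toy)    # 4
```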

Filter rows with `filter()`

Select a subset of rows in a data frame.
Easily filter for many conditions at once.

`filter()`

for crashes in Durham County

bike %>%
  filter(County == "Durham")

## Source: local data frame [253 x 54]
## 
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    (int)    (int)      (chr)      (chr)    (int)      (chr)          (chr)
## 1     18       19         No         NA        6         No Not Applicable
## 2     53       54        Yes      50-59       52         No   With Traffic
## 3     56       57        Yes      16-19       18         No             NA
## 4    209      210         No      16-19       16         No Facing Traffic
## 5    228      229        Yes      40-49       40         No   With Traffic
## 6    620      621        Yes      50-59       55         No   With Traffic
## 7    667      668        Yes      60-69       61         No Not Applicable
## 8    458      459        Yes      60-69       62         No   With Traffic
## 9    576      577         No      40-49       49         No   With Traffic
## 10   618      619         No      20-24       23         No   With Traffic
## ..   ...      ...        ...        ...      ...        ...            ...
## Variables not shown: Bike_Injur (chr), Bike_Pos (chr), Bike_Race (chr),
##   Bike_Sex (chr), City (chr), County (chr), CrashAlcoh (chr), CrashDay
##   (chr), Crash_Date (chr), Crash_Grp (chr), Crash_Hour (int), Crash_Loc
##   (chr), Crash_Mont (chr), Crash_Time (chr), Crash_Type (chr), Crash_Ty_1
##   (int), Crash_Year (int), Crsh_Sevri (chr), Developmen (chr), DrvrAge_Gr
##   (chr), Drvr_Age (int), Drvr_Alc_D (chr), Drvr_EstSp (chr), Drvr_Injur
##   (chr), Drvr_Race (chr), Drvr_Sex (chr), Drvr_VehTy (chr), ExcsSpdInd
##   (chr), Hit_Run (chr), Light_Cond (chr), Locality (chr), Num_Lanes (chr),
##   Num_Units (int), Rd_Charact (chr), Rd_Class (chr), Rd_Conditi (chr),
##   Rd_Config (chr), Rd_Defects (chr), Rd_Feature (chr), Rd_Surface (chr),
##   Region (chr), Rural_Urba (chr), Speed_Limi (chr), Traff_Cntr (chr),
##   Weather (chr), Workzone_I (chr), Location (chr)

`filter()`

for crashes in Durham County where biker was < 10 yrs old

bike %>%
  filter(County == "Durham", Bike_Age < 10)

## Source: local data frame [20 x 54]
## 
##      FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    (int)    (int)      (chr)      (chr)    (int)      (chr)          (chr)
## 1     18       19         No         NA        6         No Not Applicable
## 2     47       48         No     10-Jun        9         No Not Applicable
## 3    124      125        Yes     10-Jun        8         No   With Traffic
## 4    531      532        Yes     10-Jun        7         No   With Traffic
## 5    704      705        Yes     10-Jun        9         No Not Applicable
## 6     42       43         No     10-Jun        8         No   With Traffic
## 7    392      393        Yes        0-5        2         No Not Applicable
## 8    941      942         No     10-Jun        9         No   With Traffic
## 9    436      437        Yes     10-Jun        6         No Not Applicable
## 10   160      161        Yes     10-Jun        7         No   With Traffic
## 11   273      274        Yes     10-Jun        7         No Facing Traffic
## 12    78       79        Yes     10-Jun        7         No   With Traffic
## 13   422      423         No     10-Jun        9         No Not Applicable
## 14   570      571         No         NA        0    Missing Not Applicable
## 15   683      684        Yes     10-Jun        8         No Not Applicable
## 16    62       63        Yes     10-Jun        7         No   With Traffic
## 17   248      249         No        0-5        4         No Not Applicable
## 18   306      307        Yes     10-Jun        8         No   With Traffic
## 19   231      232        Yes     10-Jun        8         No   With Traffic
## 20   361      362        Yes     10-Jun        9         No   With Traffic
## Variables not shown: Bike_Injur (chr), Bike_Pos (chr), Bike_Race (chr),
##   Bike_Sex (chr), City (chr), County (chr), CrashAlcoh (chr), CrashDay
##   (chr), Crash_Date (chr), Crash_Grp (chr), Crash_Hour (int), Crash_Loc
##   (chr), Crash_Mont (chr), Crash_Time (chr), Crash_Type (chr), Crash_Ty_1
##   (int), Crash_Year (int), Crsh_Sevri (chr), Developmen (chr), DrvrAge_Gr
##   (chr), Drvr_Age (int), Drvr_Alc_D (chr), Drvr_EstSp (chr), Drvr_Injur
##   (chr), Drvr_Race (chr), Drvr_Sex (chr), Drvr_VehTy (chr), ExcsSpdInd
##   (chr), Hit_Run (chr), Light_Cond (chr), Locality (chr), Num_Lanes (chr),
##   Num_Units (int), Rd_Charact (chr), Rd_Class (chr), Rd_Conditi (chr),
##   Rd_Config (chr), Rd_Defects (chr), Rd_Feature (chr), Rd_Surface (chr),
##   Region (chr), Rural_Urba (chr), Speed_Limi (chr), Traff_Cntr (chr),
##   Weather (chr), Workzone_I (chr), Location (chr)

Commonly used logical operators in R

operator	definition
`<`	less than
`<=`	less than or equal to
`>`	greater than
`>=`	greater than or equal to
`==`	exactly equal to
`!=`	not equal to
`x | y`	`x` OR `y`
`x & y`	`x` AND `y`

Commonly used logical operators in R

operator	definition
`is.na(x)`	test if `x` is `NA`
`!is.na(x)`	test if `x` is not `NA`
`x %in% y`	test if `x` is in `y`
`!(x %in% y)`	test if `x` is not in `y`
`!x`	not `x`
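
A few of these operators applied to short made-up vectors. Note how `NA` propagates through comparisons; `filter()` drops rows where the condition evaluates to `NA`.

```r
age    <- c(6, 51, NA, 17)
county <- c("Durham", "Pitt", "Pitt", "Mecklenburg")

age < 10                          # TRUE FALSE NA FALSE
!is.na(age) & age < 10            # TRUE FALSE FALSE FALSE
county %in% c("Durham", "Pitt")   # TRUE TRUE TRUE FALSE
county == "Durham" | age > 50     # TRUE TRUE NA FALSE
```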

Aside: real data is messy!

What in the world does a BikeAge_Gr of 10-Jun or 15-Nov mean?

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n())

## Source: local data frame [13 x 2]
## 
##    BikeAge_Gr crash_count
##         (chr)       (int)
## 1         0-5          60
## 2      10-Jun         421
## 3      15-Nov         747
## 4       16-19         605
## 5       20-24         680
## 6       25-29         430
## 7       30-39         658
## 8       40-49         920
## 9       50-59         739
## 10      60-69         274
## 11         70          12
## 12        70+          58
## 13         NA         112

Careful data scientists clean up their data first!

We're going to need to do some text parsing to clean up these data
- 10-Jun should be 6-10
- 15-Nov should be 11-15
New R package: stringr

Install and load: `stringr`

Install:

install.packages("stringr")

Load:

library(stringr)

Package reference: Most R packages come with a vignette that describes in detail what each function does and how to use it. Vignettes are incredibly useful resources (in addition to other worked-out examples on the web): https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html

Replace with `str_replace()` and add new variables with `mutate()`

Remember we want to do the following in the BikeAge_Gr variable: 10-Jun should be 6-10 and 15-Nov should be 11-15

bike <- bike %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10")) %>%
  mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"))

Note that we're overwriting existing data and columns, so be careful!
- But remember, it's easy to revert if you make a mistake: since we didn't touch the raw data, we can always reload it and start over
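
To see what `str_replace()` does on its own, it can be tried on a plain character vector first. A sketch with made-up values, including an `NA` to show that missing values stay missing:

```r
library(stringr)

age_groups <- c("0-5", "10-Jun", "15-Nov", "16-19", NA)

cleaned <- str_replace(age_groups, "10-Jun", "6-10")
cleaned <- str_replace(cleaned, "15-Nov", "11-15")

cleaned  # "0-5" "6-10" "11-15" "16-19" NA
```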

Check before you move on

Always check your changes and confirm that the code did what you wanted it to do

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(count = n())

## Source: local data frame [13 x 2]
## 
##    BikeAge_Gr count
##         (chr) (int)
## 1         0-5    60
## 2       11-15   747
## 3       16-19   605
## 4       20-24   680
## 5       25-29   430
## 6       30-39   658
## 7       40-49   920
## 8       50-59   739
## 9        6-10   421
## 10      60-69   274
## 11         70    12
## 12        70+    58
## 13         NA   112

`slice()` for certain row numbers

First five

bike %>%
  slice(1:5)

## Source: local data frame [5 x 54]
## 
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   (int)    (int)      (chr)      (chr)    (int)      (chr)          (chr)
## 1    18       19         No         NA        6         No Not Applicable
## 2    29       30        Yes      50-59       51         No   With Traffic
## 3    33       34         No         NA       10         No   With Traffic
## 4    35       36        Yes      16-19       17         No             NA
## 5    49       50         No         NA        6         No Facing Traffic
## Variables not shown: Bike_Injur (chr), Bike_Pos (chr), Bike_Race (chr),
##   Bike_Sex (chr), City (chr), County (chr), CrashAlcoh (chr), CrashDay
##   (chr), Crash_Date (chr), Crash_Grp (chr), Crash_Hour (int), Crash_Loc
##   (chr), Crash_Mont (chr), Crash_Time (chr), Crash_Type (chr), Crash_Ty_1
##   (int), Crash_Year (int), Crsh_Sevri (chr), Developmen (chr), DrvrAge_Gr
##   (chr), Drvr_Age (int), Drvr_Alc_D (chr), Drvr_EstSp (chr), Drvr_Injur
##   (chr), Drvr_Race (chr), Drvr_Sex (chr), Drvr_VehTy (chr), ExcsSpdInd
##   (chr), Hit_Run (chr), Light_Cond (chr), Locality (chr), Num_Lanes (chr),
##   Num_Units (int), Rd_Charact (chr), Rd_Class (chr), Rd_Conditi (chr),
##   Rd_Config (chr), Rd_Defects (chr), Rd_Feature (chr), Rd_Surface (chr),
##   Region (chr), Rural_Urba (chr), Speed_Limi (chr), Traff_Cntr (chr),
##   Weather (chr), Workzone_I (chr), Location (chr)

`slice()` for certain row numbers

Last five

last_row <- nrow(bike)
bike %>%
  slice((last_row-4):last_row)

## Source: local data frame [5 x 54]
## 
##     FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##   (int)    (int)      (chr)      (chr)    (int)      (chr)          (chr)
## 1   460      461        Yes       6-10        7         No Not Applicable
## 2   474      475        Yes      50-59       50         No   With Traffic
## 3   479      480        Yes      16-19       16         No Not Applicable
## 4   487      488         No      40-49       47        Yes   With Traffic
## 5   488      489        Yes      30-39       35         No Facing Traffic
## Variables not shown: Bike_Injur (chr), Bike_Pos (chr), Bike_Race (chr),
##   Bike_Sex (chr), City (chr), County (chr), CrashAlcoh (chr), CrashDay
##   (chr), Crash_Date (chr), Crash_Grp (chr), Crash_Hour (int), Crash_Loc
##   (chr), Crash_Mont (chr), Crash_Time (chr), Crash_Type (chr), Crash_Ty_1
##   (int), Crash_Year (int), Crsh_Sevri (chr), Developmen (chr), DrvrAge_Gr
##   (chr), Drvr_Age (int), Drvr_Alc_D (chr), Drvr_EstSp (chr), Drvr_Injur
##   (chr), Drvr_Race (chr), Drvr_Sex (chr), Drvr_VehTy (chr), ExcsSpdInd
##   (chr), Hit_Run (chr), Light_Cond (chr), Locality (chr), Num_Lanes (chr),
##   Num_Units (int), Rd_Charact (chr), Rd_Class (chr), Rd_Conditi (chr),
##   Rd_Config (chr), Rd_Defects (chr), Rd_Feature (chr), Rd_Surface (chr),
##   Region (chr), Rural_Urba (chr), Speed_Limi (chr), Traff_Cntr (chr),
##   Weather (chr), Workzone_I (chr), Location (chr)
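
The last rows can also be picked without storing the row count first. A sketch on a made-up data frame, using `n()` inside `slice()` and, for comparison, base R's `tail()`:

```r
library(dplyr)

toy <- data.frame(id = 1:10, value = letters[1:10])

# n() is the number of rows, so this keeps rows 6 through 10
last_five_slice <- toy %>% slice((n() - 4):n())

# Base R equivalent
last_five_tail <- tail(toy, 5)

last_five_slice$id  # 6 7 8 9 10
```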

`select()` to keep only the variables you mention

bike %>%
  select(Crash_Loc, Hit_Run) %>%
  table()

##                       Hit_Run
## Crash_Loc                No  Yes
##   Intersection         2223  275
##   Intersection-Related  252   42
##   Location                3    7
##   Non-Intersection     2213  462
##   Non-Roadway           205   30

or `select()` to exclude variables

bike %>%
  select(-OBJECTID)

## Source: local data frame [5,716 x 53]
## 
##      FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D       Bike_Dir
##    (int)      (chr)      (chr)    (int)      (chr)          (chr)
## 1     18         No         NA        6         No Not Applicable
## 2     29        Yes      50-59       51         No   With Traffic
## 3     33         No         NA       10         No   With Traffic
## 4     35        Yes      16-19       17         No             NA
## 5     49         No         NA        6         No Facing Traffic
## 6     53        Yes      50-59       52         No   With Traffic
## 7     56        Yes      16-19       18         No             NA
## 8     60         No      40-49       40         No Facing Traffic
## 9     63        Yes       6-10        6         No Facing Traffic
## 10    66        Yes       6-10        7         No             NA
## ..   ...        ...        ...      ...        ...            ...
## Variables not shown: Bike_Injur (chr), Bike_Pos (chr), Bike_Race (chr),
##   Bike_Sex (chr), City (chr), County (chr), CrashAlcoh (chr), CrashDay
##   (chr), Crash_Date (chr), Crash_Grp (chr), Crash_Hour (int), Crash_Loc
##   (chr), Crash_Mont (chr), Crash_Time (chr), Crash_Type (chr), Crash_Ty_1
##   (int), Crash_Year (int), Crsh_Sevri (chr), Developmen (chr), DrvrAge_Gr
##   (chr), Drvr_Age (int), Drvr_Alc_D (chr), Drvr_EstSp (chr), Drvr_Injur
##   (chr), Drvr_Race (chr), Drvr_Sex (chr), Drvr_VehTy (chr), ExcsSpdInd
##   (chr), Hit_Run (chr), Light_Cond (chr), Locality (chr), Num_Lanes (chr),
##   Num_Units (int), Rd_Charact (chr), Rd_Class (chr), Rd_Conditi (chr),
##   Rd_Config (chr), Rd_Defects (chr), Rd_Feature (chr), Rd_Surface (chr),
##   Region (chr), Rural_Urba (chr), Speed_Limi (chr), Traff_Cntr (chr),
##   Weather (chr), Workzone_I (chr), Location (chr)
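
`select()` can also pick columns by pattern or drop several at once. A sketch on a made-up data frame; `starts_with()` is one of the selection helpers available when dplyr is loaded:

```r
library(dplyr)

# Made-up data with a few of the column names from bike
toy <- data.frame(
  FID      = 1:2,
  OBJECTID = 2:3,
  Bike_Age = c(6, 51),
  Bike_Sex = c("Female", "Male"),
  County   = c("Durham", "Pitt")
)

bike_cols <- toy %>% select(starts_with("Bike_"))
no_ids    <- toy %>% select(-FID, -OBJECTID)

names(bike_cols)  # "Bike_Age" "Bike_Sex"
names(no_ids)     # "Bike_Age" "Bike_Sex" "County"
```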

`rename()` specific columns

Correct typos and rename to make variable names shorter and/or more informative

Original names:

names(bike)

##  [1] "FID"        "OBJECTID"   "AmbulanceR" "BikeAge_Gr" "Bike_Age"  
##  [6] "Bike_Alc_D" "Bike_Dir"   "Bike_Injur" "Bike_Pos"   "Bike_Race" 
## [11] "Bike_Sex"   "City"       "County"     "CrashAlcoh" "CrashDay"  
## [16] "Crash_Date" "Crash_Grp"  "Crash_Hour" "Crash_Loc"  "Crash_Mont"
## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri"
## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age"   "Drvr_Alc_D" "Drvr_EstSp"
## [31] "Drvr_Injur" "Drvr_Race"  "Drvr_Sex"   "Drvr_VehTy" "ExcsSpdInd"
## [36] "Hit_Run"    "Light_Cond" "Locality"   "Num_Lanes"  "Num_Units" 
## [41] "Rd_Charact" "Rd_Class"   "Rd_Conditi" "Rd_Config"  "Rd_Defects"
## [46] "Rd_Feature" "Rd_Surface" "Region"     "Rural_Urba" "Speed_Limi"
## [51] "Traff_Cntr" "Weather"    "Workzone_I" "Location"

Rename Speed_Limi to Speed_Limit:

bike <- bike %>%
  rename(Speed_Limit = Speed_Limi)

Check before you move on

Always check your changes and confirm that the code did what you wanted it to do

names(bike)

##  [1] "FID"         "OBJECTID"    "AmbulanceR"  "BikeAge_Gr"  "Bike_Age"   
##  [6] "Bike_Alc_D"  "Bike_Dir"    "Bike_Injur"  "Bike_Pos"    "Bike_Race"  
## [11] "Bike_Sex"    "City"        "County"      "CrashAlcoh"  "CrashDay"   
## [16] "Crash_Date"  "Crash_Grp"   "Crash_Hour"  "Crash_Loc"   "Crash_Mont" 
## [21] "Crash_Time"  "Crash_Type"  "Crash_Ty_1"  "Crash_Year"  "Crsh_Sevri" 
## [26] "Developmen"  "DrvrAge_Gr"  "Drvr_Age"    "Drvr_Alc_D"  "Drvr_EstSp" 
## [31] "Drvr_Injur"  "Drvr_Race"   "Drvr_Sex"    "Drvr_VehTy"  "ExcsSpdInd" 
## [36] "Hit_Run"     "Light_Cond"  "Locality"    "Num_Lanes"   "Num_Units"  
## [41] "Rd_Charact"  "Rd_Class"    "Rd_Conditi"  "Rd_Config"   "Rd_Defects" 
## [46] "Rd_Feature"  "Rd_Surface"  "Region"      "Rural_Urba"  "Speed_Limit"
## [51] "Traff_Cntr"  "Weather"     "Workzone_I"  "Location"

`summarise()` in a new data frame

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(crash_count)

## Source: local data frame [13 x 2]
## 
##    BikeAge_Gr crash_count
##         (chr)       (int)
## 1          70          12
## 2         70+          58
## 3         0-5          60
## 4          NA         112
## 5       60-69         274
## 6        6-10         421
## 7       25-29         430
## 8       16-19         605
## 9       30-39         658
## 10      20-24         680
## 11      50-59         739
## 12      11-15         747
## 13      40-49         920

and `arrange()` to order rows

bike %>%
  group_by(BikeAge_Gr) %>%
  summarise(crash_count = n()) %>%
  arrange(desc(crash_count))

## Source: local data frame [13 x 2]
## 
##    BikeAge_Gr crash_count
##         (chr)       (int)
## 1       40-49         920
## 2       11-15         747
## 3       50-59         739
## 4       20-24         680
## 5       30-39         658
## 6       16-19         605
## 7       25-29         430
## 8        6-10         421
## 9       60-69         274
## 10         NA         112
## 11        0-5          60
## 12        70+          58
## 13         70          12
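
`summarise()` can compute several summaries per group in one call. A sketch on made-up data; `mean_age` is a name chosen for this illustration, and `na.rm = TRUE` is needed because one age is missing:

```r
library(dplyr)

# Made-up data with the same column names as in bike
toy <- data.frame(
  Bike_Sex = c("Male", "Female", "Male", "Male", "Female"),
  Bike_Age = c(12, 34, NA, 51, 27)
)

by_sex <- toy %>%
  group_by(Bike_Sex) %>%
  summarise(crash_count = n(),
            mean_age = mean(Bike_Age, na.rm = TRUE)) %>%
  arrange(desc(crash_count))

by_sex
```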

Select rows with `sample_n()` or `sample_frac()`

sample_n(): randomly sample 5 observations

bike_n5 <- bike %>%
  sample_n(5, replace = FALSE)
dim(bike_n5)

## [1]  5 54

sample_frac(): randomly sample 20% of observations

bike_perc20 <- bike %>%
  sample_frac(0.2, replace = FALSE)
dim(bike_perc20)

## [1] 1143   54
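
Random samples differ from run to run, which works against reproducibility. Setting a seed first makes the draw repeatable; a sketch on a made-up data frame, where the seed value 2015 is arbitrary:

```r
library(dplyr)

toy <- data.frame(id = 1:100)

set.seed(2015)
first_draw <- toy %>% sample_n(5)

set.seed(2015)
second_draw <- toy %>% sample_n(5)

# Same seed, same sample
identical(first_draw, second_draw)
```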

More `dplyr` resources

Visit https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html for the package vignette.
Refer to the dplyr cheatsheet.

Basic R syntax

A few important R syntax notes

For when not working with dplyr or ggplot2

Refer to a variable in a dataset as bike$Crash_Loc
Access any element in a dataframe using square brackets

bike[1,5] # row 1, column 5

## Source: local data frame [1 x 1]
## 
##   Bike_Age
##      (int)
## 1        6

- For all observations in row 1: `bike[1, ]`
- For all observations in column 5: `bike[, 5]`
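
The same indexing on a plain base R data frame, sketched with made-up values. Unlike the `tbl_df` output above, a plain data frame returns a bare value or vector when a single column is selected, unless `drop = FALSE` is given:

```r
toy <- data.frame(
  FID      = c(18, 29, 33),
  Bike_Age = c(6, 51, 10),
  County   = c("Durham", "Pitt", "Pitt"),
  stringsAsFactors = FALSE
)

toy$Bike_Age            # whole column as a vector: 6 51 10
toy[1, 2]               # row 1, column 2: 6
toy[1, ]                # all columns for row 1
toy[, 2]                # all rows for column 2, as a vector
toy[, 2, drop = FALSE]  # same column, kept as a one-column data frame
```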

What's next?

Want more R?

Local install
- R: https://cran.r-project.org/
- RStudio: https://www.rstudio.com/products/RStudio/#Desktop
Resources for learning R:
- Coursera
- DataCamp
- Many many online demos, resources, examples, as well as books
Debugging R errors:
- Read the error!
- StackOverflow
Keeping up with what's new in R land:
- R-bloggers
- Twitter: #rstats

Exercise

Your turn

Create a new dataframe that doesn't include observations where Bike_Injur = Injury since it's not clear what this means.

This new dataframe should also include only observations in Durham where the biker is a teenager (13 to 19 years, inclusive).

Create a visualization that will help answer whether facing traffic or riding with traffic (Bike_Dir) is more dangerous in bike crashes for teenagers in Durham, based on the Bike_Injur variable.
