January 10, 2017

First things first

Some things you should know

  • Typical uses for R
    • Data analysis
    • Data visualization
    • Reshaping data
    • Automation
  • Mental models and programming
    • Excel/SPSS/Stata –> R/Python/Julia/JavaScript
  • What all R programmers do
    • Google for answers
    • Borrow code
    • Ask friends and pros for help
  • Useful references (see Reference Materials in the Syllabus for more)

Quick Tour of R Studio

Quick Tour of R Studio

  • Make sure you can install packages
    • 2 ways to install packages
      • Command line/Console
      • R Studio GUI
install.packages('devtools')
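
One way to confirm that packages are installed is to ask R which ones it cannot find. The sketch below is only an illustration: `missing_packages()` is a hypothetical helper, not part of R or R Studio, and the install line is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the names of packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("devtools", "tidyverse")   # packages named in these slides
to_install <- missing_packages(wanted)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result (`character(0)`) means everything in `wanted` is already available.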

Libraries, functions, data

Libraries, functions, data

  • Libraries (or packages) are collections of functions (and datasets)
  • Over 10,000 libraries on CRAN
install.packages('tidyverse')
install.packages(c('readxl', 'rmarkdown'))
library(tidyverse)

Libraries, functions, data

  • Functions perform operations in R
foo <- c(1,2,4)
foo %>% min()
foo %>% mean()
foo %>% max()
foo %>% sd()
  • Function arguments
    • Arguments are inputs
    • Arguments are comma separated
    • Let’s look at arguments in the lm function
help(lm)
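
Besides `help(lm)`, the arguments of a function can be inspected from the console. The sketch below uses the built-in `cars` dataset as a stand-in and shows that arguments can be matched by name or by position.

```r
# List the argument names that lm() accepts
names(formals(lm))

# Arguments can be matched by name...
fit_named <- lm(formula = dist ~ speed, data = cars)
# ...or by position (formula first, data second)
fit_positional <- lm(dist ~ speed, cars)

# Both calls fit the same model
all.equal(coef(fit_named), coef(fit_positional))
```

Naming arguments is usually easier to read once a call has more than one or two inputs.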

Libraries, functions, data

  • Function arguments
    • Arguments are inputs
    • Arguments are comma separated
    • Let’s look at arguments in the lm function
cat_function <- function(love = TRUE) {
    # `love` is a logical argument that defaults to TRUE
    if (isTRUE(love)) {
        print('I love cats!')
    } else {
        print('I am not a cool person.')
    }
}
# Base R signature: min(..., na.rm = FALSE)
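
Default values are what let you leave an argument out. A minimal sketch, assuming a made-up function (`repeat_word()` is hypothetical and exists only for this example):

```r
# Hypothetical function: `times` has a default, `word` does not
repeat_word <- function(word, times = 2) {
  paste(rep(word, times), collapse = " ")
}

repeat_word("meow")             # uses the default, times = 2
repeat_word("meow", times = 3)  # overrides the default
```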

Libraries, functions, data

  • Function arguments
    • Arguments are inputs
    • Arguments are comma separated
    • Let’s look at arguments in the lm function
# Base R signature: min(..., na.rm = FALSE), so a missing value in the input returns NA
foo <- c(1,2,NA, 4)
foo %>% min()
## [1] NA
foo %>% mean()
## [1] NA
foo %>% max()
## [1] NA
foo %>% sd()
## [1] NA

Libraries, functions, data

  • Function arguments
    • Arguments are inputs
    • Arguments are comma separated
    • Let’s look at arguments in the lm function
# Base R signature: min(..., na.rm = FALSE); set na.rm = TRUE to drop missing values
foo <- c(1,2,NA, 4)
min(foo, na.rm = TRUE)
## [1] 1
mean(foo, na.rm = TRUE)
## [1] 2.333333
max(foo, na.rm = TRUE)
## [1] 4
sd(foo, na.rm = TRUE)
## [1] 1.527525
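
Before reaching for `na.rm = TRUE`, it can help to check how many values are actually missing. A short sketch using the same `foo` vector:

```r
foo <- c(1, 2, NA, 4)

is.na(foo)                # TRUE marks the missing element
sum(is.na(foo))           # number of missing values
mean(foo)                 # NA, because na.rm defaults to FALSE
mean(foo, na.rm = TRUE)   # drops the NA before averaging
mean(foo[!is.na(foo)])    # same result, dropping the NA by hand
```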

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
    • Integers, numbers, characters, factors
  • Confirm data type
c('foo', 'moo', 'boo') %>% class()
## [1] "character"
c('foo', 'moo', 'boo') %>% is.character()
## [1] TRUE
c('foo', 'moo', 'boo') %>% is.factor()
## [1] FALSE
  • Coerce data type
c('foo', 'moo', 'boo') %>% as.factor()
## [1] foo moo boo
## Levels: boo foo moo
c('foo', 'moo', 'boo') %>% as.factor() %>% class()
## [1] "factor"
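
Coercion can surprise you when a factor holds numbers stored as text. The sketch below uses a made-up factor to show the difference between converting the level codes and converting the labels.

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("10", "20", "10"))

as.integer(f)                 # internal level codes, not the labels
as.integer(as.character(f))   # convert to text first to recover the labels
```

Going through `as.character()` first is the usual way to get the original numbers back.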

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Data frames look a little like spreadsheets
# tibble() replaces data_frame(), which is deprecated in current tibble releases
tibble(
  x = 1:3,
  y = 4:6,
  z = c('foo', 'boo', 'moo')
)
## # A tibble: 3 x 3
##       x     y z    
##   <int> <int> <chr>
## 1     1     4 foo  
## 2     2     5 boo  
## 3     3     6 moo
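
The same table can be built with base R alone. This is a sketch, not the approach used on the slide; `stringsAsFactors = FALSE` is set explicitly because older R versions convert text columns to factors by default.

```r
df <- data.frame(
  x = 1:3,
  y = 4:6,
  z = c("foo", "boo", "moo"),
  stringsAsFactors = FALSE   # keep z as character in every R version
)

# Each column can have its own type
sapply(df, class)
```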

Libraries, functions, data

  • Call a specific variable by name or location in the data frame
    • Use the $ between the data frame name and the variable name
    • Call the column (or row) by index number
head(cars$speed, 7)
## [1]  4  4  7  7  8  9 10
head(cars[, 1], 7)
## [1]  4  4  7  7  8  9 10
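
A few more ways to reach into a data frame, sketched with the built-in `cars` dataset. The square brackets take a row position first and a column position (or name) second.

```r
# Columns: by name with [[ ]], or by position
by_name     <- cars[["speed"]]
by_position <- cars[, 1]
identical(by_name, by_position)

# Rows: leave the column slot empty
cars[1, ]

# A single cell: row 1 of the dist column
cars[1, "dist"]
```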

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Matrices and data frames have similar shapes
  • A matrix typically holds data of a single type, while each column of a data frame can have its own type
matrix(data = 1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
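
Matrix elements are indexed by row and column, and mixing types forces everything to one type. A minimal sketch (the `mixed` matrix is made up for illustration):

```r
m <- matrix(data = 1:6, nrow = 3, ncol = 2)

m[2, 2]   # row 2, column 2
m[, 1]    # the whole first column

# Mixing numbers and text coerces every element to character
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1])
```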

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Vectors contain one or more values of the same type, in order
c('foo', 'moo', 'boo')
## [1] "foo" "moo" "boo"
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
rep(1:2, times = 2)
## [1] 1 2 1 2
rep(c(1,2), times = 2)
## [1] 1 2 1 2
seq(from = 0, to = 100, by = 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
seq(0, 100, 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
cars$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Call a specific element by its location in a vector
c('foo', 'moo', 'boo')[2]
## [1] "moo"
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
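
Indexing is not limited to a single position. The sketch below shows three other common patterns on the same sequence.

```r
x <- seq(from = 0, to = 100, by = 10)

x[c(1, 3)]   # several positions at once
x[-1]        # everything except the first element
x[x > 50]    # elements that satisfy a condition
```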

Let’s apply what we learned to real data

Let’s apply what we learned to real data

Questions
- What type of data object is donor?
- What are the min() and mean() for amount?
- What is the max() for legislative_district?

To answer the questions, import the small donations dataset

donor <- read.csv('https://goo.gl/tm9JQ5')

Let’s apply what we learned to real data

Questions
- What type of data object is donor?
- What are the min() and mean() for amount?
- What is the max() for legislative_district?

donor %>% class()
## [1] "data.frame"
donor$amount %>% min()
## [1] 0
donor$amount %>% mean()
## [1] 255.7491
donor$legislative_district %>% max(na.rm = TRUE)
## [1] 49

Base R functions to explore your data

Base R functions to explore your data

  • Get in the habit of running these functions when you open a new dataset
    • head() shows you the first rows of a data frame
      • Number of rows is a head() argument (n) and defaults to 6
      • tail() shows the last rows of a data frame
    • summary() shows summary statistics on all variables in a dataset
      • Reported summary statistics depend on the variable type
    • ls() shows all variables in a data frame, sorted alphabetically
      • You can also use ls() without calling an object between the parentheses to see all objects in your workspace
    • str() shows the type and the first few values of every variable in a data frame
    • dim() tells you the dimensions of your dataset
      • nrow() reports the number of rows only
      • ncol() reports the number of columns only
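
These functions can be tried on any data frame. The sketch below uses the built-in `cars` dataset as a stand-in for your own data.

```r
head(cars, n = 3)   # first 3 rows instead of the default
tail(cars, n = 3)   # last 3 rows
summary(cars)       # summary statistics for every variable
str(cars)           # type and first values of every variable

dim(cars)           # rows, then columns
nrow(cars)
ncol(cars)
names(cars)         # variable names in column order
```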

Base R functions to explore your data

Questions
- How many rows are in donor?
- What is the median value for amount?
- How many variables are in donor?

Hint: There are multiple ways to answer these questions with the functions you know

Base R functions to explore your data

Questions
- How many rows are in donor?
- What is the median value for amount?
- How many variables are in donor?

donor %>% dim()
## [1] 9491   38
donor$amount %>% median(na.rm = TRUE)
## [1] 35
donor %>% ncol()
## [1] 38

Other methods to answer questions

donor %>% summary()
donor %>% ls()

Base R functions to explore vectors

  • table() shows you the distribution of values in a vector
  • length() tells you the number of elements in a vector
  • unique() shows you the unique values in a vector
  • summary() shows you descriptive statistics for a vector
    • summary() can run on a vector or data frame
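
A quick sketch of these four functions on one vector, using the `cyl` column of the built-in `mtcars` dataset as a stand-in:

```r
cyl <- mtcars$cyl

table(cyl)             # how many cars have 4, 6, or 8 cylinders
length(cyl)            # number of elements
unique(cyl)            # distinct values, in order of first appearance
length(unique(cyl))    # number of distinct values
summary(cyl)           # descriptive statistics
```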

Base R functions to explore vectors

Questions
- How many ‘DEMOCRAT’ values are there in party?
- Which value in contributor_employer_state is most frequent?
- Is ‘Mayoral Race’ a type value?
- How many distinct first_name values are there?

Hint: There are multiple ways to answer these questions with the functions you know

Base R functions to explore vectors

Questions
- How many ‘DEMOCRAT’ values are there in party?
- Which value in contributor_employer_state is most frequent?
- Is ‘Mayoral Race’ a type value?
- How many distinct first_name values are there?

table(donor$party) %>% c() %>% .[1]
## DEMOCRAT 
##     1366
table(donor$contributor_employer_state) %>% c() %>% tail()
##   PA   SE   TX   UT   VA   WA 
##    8    1   14    2    2 2741
donor$type %>% unique()
## [1] Candidate           Political Committee
## Levels: Candidate Political Committee
donor$first_name %>% unique() %>% length()
## [1] 486
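
Reading the largest count off a table by eye works for short tables. As an alternative sketch, the counts can be sorted or queried directly. The `states` and `types` vectors below are made-up toy values, not the donor data.

```r
# Toy vectors (hypothetical values for illustration only)
states <- c("WA", "TX", "WA", "PA", "WA", "TX")
types  <- c("Candidate", "Political Committee", "Candidate")

counts <- table(states)
sort(counts, decreasing = TRUE)   # most frequent value first
names(which.max(counts))          # name of the most frequent value

# Is a given value present in a vector?
"Mayoral Race" %in% types
```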

Troubleshooting your R code

Make sure…

  • You loaded your libraries
  • Arguments in your functions are comma-separated
  • Your functions have a closing paren
  • You have an end quotation mark where relevant
  • The variable type is correct to perform the function you call
    • You might want to run min() on a character string, but probably not
  • Your assignment operator looks like this
<-
  • Your pipe operator looks like this
%>%

Practice with police data

Questions
- How many rows and columns are there in police?
- What is the most recent event_clearance_date?
- Which district_sector sees the most incident activity?
- How many ‘RECKLESS BURNING’ incidents are there in event_clearance_subgroup?
- What is the smallest census_tract value?

Import the small police dataset in R Studio

police <- read.csv('https://goo.gl/T42fHz')

Practice with police data

Questions
- How many rows and columns are there in police?
- What is the most recent event_clearance_date?
- Which district_sector sees the most incident activity?
- How many ‘RECKLESS BURNING’ incidents are there in event_clearance_subgroup?
- What is the smallest census_tract value?

police %>% dim()
## [1] 10000    25
police$event_clearance_date %>% as.Date() %>% max(na.rm = TRUE)
## [1] "2017-11-10"
table(police$district_sector) %>% c() %>% .[5:12]
##   E   F   G   J   K   L   M   N 
## 771 471 412 522 901 539 876 622
police$event_clearance_subgroup %>% table() %>% c() %>% .[32:33]
## PUBLIC GATHERINGS  RECKLESS BURNING 
##                 4                 3
police$census_tract %>% as.integer() %>% min(na.rm = TRUE)
## [1] 1
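
When dates arrive as text, `as.Date()` may need to be told how they are written. The sketch below assumes month/day/year strings. The values are made up, and the actual format of event_clearance_date is not shown here, so check it with head() first.

```r
# Made-up date strings in month/day/year order, with one missing value
raw_dates <- c("11/10/2017", "01/05/2017", NA)

# Tell as.Date() how to read the text
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

class(parsed)
max(parsed, na.rm = TRUE)   # most recent date, ignoring the NA
```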