October 5, 2019

First things first

Some things you should know

  • Typical uses for R
    • Data analysis
    • Data visualization
    • Reshaping data
    • Automation
  • Mental models and programming
    • Excel/SPSS/Stata → R/Python/Julia/JavaScript
  • What all R programmers do
    • Google for answers
    • Borrow code
    • Ask friends and pros for help
  • Useful references (see Reference Materials in the syllabus for more)

Quick Tour of R Studio

  • Open and save (locally) an R Notebook
  • Run code from the notebook and the console
  • Make sure you can install packages
    • 2 ways to install packages
      • Command line/Console
      • R Studio GUI

install.packages('devtools')
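
Before installing, it can help to check whether a package is already on your machine. A minimal sketch, assuming a helper named is_installed (a hypothetical name, not part of R or the course materials):

```r
# is_installed() is a hypothetical helper name used only for this sketch.
# requireNamespace() returns TRUE when the package can be loaded.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with every R installation, so this is TRUE.
is_installed("stats")

# Only suggest an install when the package is missing.
if (!is_installed("devtools")) {
  message("devtools is missing; run install.packages('devtools')")
}
```

The check avoids re-downloading a package every time a notebook is run.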

Libraries, functions, data

  • Libraries (or packages) are collections of functions (and datasets)
  • Over 10,000 libraries on CRAN
install.packages('tidyverse')
install.packages(c('ggthemes', 'officer'))
library(tidyverse)

Libraries, functions, data

  • Functions perform operations in R
foo <- c(1,2,4)
foo %>% min()
## [1] 1
foo %>% mean()
## [1] 2.333333
foo %>% max()
## [1] 4
foo %>% sd()
## [1] 1.527525
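
The %>% operator used above comes from the magrittr package and becomes available when you load the tidyverse. It passes the value on its left as the first argument of the function on its right. A small sketch, assuming magrittr is installed:

```r
library(magrittr)  # provides %>%; loading the tidyverse also makes it available

foo <- c(1, 2, 4)

# These two lines do the same thing.
piped  <- foo %>% mean()
nested <- mean(foo)
identical(piped, nested)

# Pipes read left to right, which helps when chaining several steps.
foo %>% sqrt() %>% sum() %>% round(2)
# same as round(sum(sqrt(foo)), 2)
```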

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the lm and min functions
    • Arguments are inputs
    • Arguments are comma separated
help(lm)
help(min)

cat_function <- function(love = TRUE) {
    # love is a logical argument that defaults to TRUE
    if (love) {
        print('I love cats!')
    } else {
        print('I am not a cool person.')
    }
}
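
To see how arguments and defaults behave, the function can be called in a few ways. This sketch repeats the definition so it runs on its own:

```r
cat_function <- function(love = TRUE) {
  if (love) {
    print('I love cats!')
  } else {
    print('I am not a cool person.')
  }
}

cat_function()              # uses the default, love = TRUE
cat_function(love = FALSE)  # argument supplied by name
cat_function(FALSE)         # argument supplied by position
```

Naming the argument is optional here, but it makes the call easier to read once a function has several arguments.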

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the lm and min functions
    • Arguments are inputs
    • Arguments are comma separated
foo <- c(1,2,NA, 4)
foo %>% min()
## [1] NA
foo %>% mean()
## [1] NA
foo %>% max()
## [1] NA
foo %>% sd()
## [1] NA

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the lm and min functions
    • Arguments are inputs
    • Arguments are comma separated
foo <- c(1,2,NA, 4)
min(foo, na.rm = TRUE)
## [1] 1
mean(foo, na.rm = TRUE)
## [1] 2.333333
max(foo, na.rm = TRUE)
## [1] 4
sd(foo, na.rm = TRUE)
## [1] 1.527525
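
Before reaching for na.rm = TRUE, it is often worth counting how many values are missing. A short base R sketch on the same vector:

```r
foo <- c(1, 2, NA, 4)

is.na(foo)                     # TRUE where a value is missing
n_missing <- sum(is.na(foo))   # each TRUE counts as 1
n_missing

# Dropping the NA by hand gives the same answer as na.rm = TRUE.
by_hand <- mean(foo[!is.na(foo)])
by_arg  <- mean(foo, na.rm = TRUE)
all.equal(by_hand, by_arg)
```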

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
    • Integers, numbers, characters, factors
  • Confirm data type
c('foo', 'moo', 'boo') %>% class()
## [1] "character"
c('foo', 'moo', 'boo') %>% is.character()
## [1] TRUE
c('foo', 'moo', 'boo') %>% is.factor()
## [1] FALSE
  • Coerce data type
c('foo', 'moo', 'boo') %>% as.factor()
## [1] foo moo boo
## Levels: boo foo moo
c('foo', 'moo', 'boo') %>% as.factor() %>% class()
## [1] "factor"
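
Coercion has one common trap: a factor stores integer codes behind its labels, so converting a factor of number-like labels straight to numeric returns the codes. A small sketch with made-up values:

```r
# Character strings that look like numbers can be coerced to numeric.
as.numeric(c('1', '2', '3'))

# Factors store integer codes, so convert to character first.
f <- factor(c('10', '20', '10'))
codes  <- as.numeric(f)                 # the internal level codes
values <- as.numeric(as.character(f))   # the numbers as written
codes
values
```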

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Data frames look a little like spreadsheets
tibble(x = 1:3, y = 4:6, z = c('foo', 'boo', 'moo'))
## # A tibble: 3 x 3
##       x     y z    
##   <int> <int> <chr>
## 1     1     4 foo  
## 2     2     5 boo  
## 3     3     6 moo

Libraries, functions, data

  • Call a specific variable by name or location in the data frame
    • Use the $ between the data frame name and the variable name
    • pull() does the same job as $ in function form, so it works inside a pipe
    • You can also call a column (or row) by index number
cars$speed %>% head(7)
## [1]  4  4  7  7  8  9 10
cars %>% pull(speed) %>% head(7)
## [1]  4  4  7  7  8  9 10
cars[,1] %>% head(7)
## [1]  4  4  7  7  8  9 10
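
A few more ways to reach into a data frame, sketched with the built-in cars dataset and base R only:

```r
# cars ships with R: 50 rows, columns speed and dist.
by_dollar   <- cars$speed
by_brackets <- cars[['speed']]   # same column, name given as a string
by_position <- cars[, 1]         # same column, by index

identical(by_dollar, by_brackets)
identical(by_dollar, by_position)

cars[2, ]           # the second row, all columns
cars[1:3, 'speed']  # rows 1 to 3 of one column
```

Inside square brackets, the value before the comma selects rows and the value after it selects columns.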

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Matrices and data frames have similar shapes
  • A matrix holds data of a single type; each column of a data frame can hold a different type
  • In this class, prefer data frames over matrices
matrix(data = 1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
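
The single-type rule for matrices is easy to see by mixing numbers and text. A minimal sketch with made-up values:

```r
# Mixing numbers and text in a matrix coerces everything to character.
m <- matrix(c(1, 2, 'a', 'b'), nrow = 2, ncol = 2)
typeof(m)

# A data frame keeps a separate type for each column.
df <- data.frame(x = c(1, 2), y = c('a', 'b'))
class(df$x)
```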

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Vectors hold one or more values of the same type, in order
c('foo', 'moo', 'boo')
## [1] "foo" "moo" "boo"
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
rep(1:2, times = 2)
## [1] 1 2 1 2
rep(c(1,2), times = 2)
## [1] 1 2 1 2
seq(from = 0, to = 100, by = 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
seq(0, 100, 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
cars$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25

Libraries, functions, data

  • Oh my, so many types of data!
    • Data frames, matrices, vectors
  • Call a specific element by its location in a vector
c('foo', 'moo', 'boo')[2]
## [1] "moo"
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
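
Square brackets accept more than a single position. A short sketch of the most common patterns:

```r
x <- c('foo', 'moo', 'boo')

x[c(1, 3)]     # first and third elements
x[2:3]         # a range of positions
x[-1]          # everything except the first element
x[length(x)]   # the last element
```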

Let’s apply what we learned to real data

Exercise - 5 minutes
- What type of data object is contr?
- What are the min() and mean() for amount?
- What is the max() for election_year?

To answer the questions, import the political contributions dataset

contr <- read_csv('https://bit.ly/2lQySrQ') %>% as.data.frame()

Let’s apply what we learned to real data

Exercise - 5 minutes
- What type of data object is contr?
- What are the min() and mean() for amount?
- What is the max() for election_year?

contr %>% class()
## [1] "data.frame"
contr$amount %>% min(na.rm = TRUE)
## [1] 0
contr$amount %>% mean(na.rm = TRUE)
## [1] 241.1717
contr$election_year %>% max(na.rm = TRUE)
## [1] 2023

Base R functions to explore data frames

  • Get in the habit of running these functions when you open a new dataset
    • head() shows the first rows of a data frame
      • The n argument sets how many rows to show and defaults to 6
      • tail() shows the last rows of a data frame
    • summary() shows summary statistics for all variables in a dataset
      • Reported summary statistics depend on the variable type
    • ls() lists the variable names in a data frame, sorted alphabetically
      • Called with empty parentheses, ls() lists all objects in your workspace
    • str() shows the type and first few values of every variable in a data frame
    • dim() tells you the dimensions of your dataset
      • nrow() reports the number of rows only
      • ncol() reports the number of columns only
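
These functions can be tried on the built-in cars dataset before moving to the course data. A quick sketch; names() is added here as a base R alternative to ls() that keeps the original column order:

```r
head(cars, n = 3)   # first 3 rows instead of the default 6
tail(cars, n = 3)   # last 3 rows
summary(cars)       # summary statistics for each variable
str(cars)           # variable types and first few values
names(cars)         # variable names in their original order
dim(cars)           # rows, then columns
```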

Base R functions to explore data frames

Exercise - 5 minutes
- How many observations (or rows) are in contr?
- What is the median() value for amount?
- How many variables are in contr?

Hint: There are multiple ways to answer these questions with the functions you know

Base R functions to explore data frames

Exercise - 5 minutes
- How many observations (or rows) are in contr?
- What is the median value for amount?
- How many variables are in contr?

contr %>% dim()
## [1] 45000    22
contr$amount %>% median(na.rm = TRUE)
## [1] 100
contr %>% ncol()
## [1] 22

Other methods to answer questions

contr %>% summary()
contr %>% ls()

Base R functions to explore vectors

  • table() shows you the distribution of values in a vector
  • length() tells you the number of elements in a vector
  • unique() shows you the unique values in a vector
  • summary() shows you descriptive statistics for a vector
    • summary() can run on a vector or data frame
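
A quick sketch of these functions on a small made-up vector (the values are illustrative, not taken from the course data):

```r
# A made-up vector for illustration; not from the course data.
party <- c('DEM', 'REP', 'DEM', 'IND', 'DEM', 'REP')

table(party)            # how many times each value appears
length(party)           # number of elements
unique(party)           # distinct values, in order of first appearance
length(unique(party))   # number of distinct values
summary(party)          # for character vectors: length, class, mode
```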

Base R functions to explore vectors

Exercise - 7 minutes
- How many ‘DEMOCRAT’ values are there in party?
- Which value in contributor_state is most frequent?
- Is ‘Mayoral Race’ a value in the type variable?
- How many distinct contributor_zip values are there?

Hint: There are multiple ways to answer these questions with the functions you know

table(contr$party)[1]
## DEMOCRAT 
##    14748
table(contr$contributor_state) %>% tail()
## 
##    VA    VT    WA    WI    WV    WY 
##   119     8 38394    19    27     3
contr$type %>% unique()
## [1] NA          "Candidate"
contr$contributor_zip %>% unique() %>% length()
## [1] 1845
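
Reading the answer off the end of an alphabetical table only works when the top value happens to sort last. A sketch of a more direct route, using a small made-up vector rather than the course data:

```r
# A made-up vector for illustration; not from the course data.
state <- c('WA', 'OR', 'WA', 'CA', 'WA', 'OR')

counts <- sort(table(state), decreasing = TRUE)
counts
top_state <- names(counts)[1]    # most frequent value

counts[['OR']]                   # look up a count by name, not position
'ID' %in% state                  # is a value present at all?
```

Looking up a count by name is safer than by position, because positions shift when the data changes.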

Troubleshooting your R code

Make sure…

  • You loaded your libraries
  • Arguments in your functions are comma-separated
  • Your functions have a closing parenthesis
  • You have an end quotation mark where relevant
  • The variable type is correct to perform the function you call
    • For example, you probably don’t want to run min() on a character string
  • Your assignment operator looks like this
<-
  • Your pipe operator looks like this
%>%
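
The variable type check matters because R will often return an answer for the wrong type instead of an error. A small sketch with made-up values:

```r
x <- c('1', '2', '10')   # numbers stored as character strings

class(x)
max(x)               # compares text, so '2' sorts after '10'
max(as.numeric(x))   # compares numbers
```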

Practice with crime data

Exercise - 10 minutes
- How many rows and columns are there in crime?
- What is the most recent occurred_date?
- Which neighborhood sees the most incident activity?
- How many ‘THEFT-BICYCLE’ incidents are there in crime_subcategory?
- What are the earliest and latest reported_time values?

Begin the exercise by importing the crime dataset in R Studio

crime <- read_csv('https://bit.ly/2mcZLq4') %>% as.data.frame()

Practice with crime data

Exercise - 10 minutes
- How many rows and columns are there in crime?
- What is the most recent occurred_date?
- Which neighborhood sees the most incident activity?
- How many ‘THEFT-BICYCLE’ incidents are there in crime_subcategory?
- What are the earliest and latest reported_time values?

crime %>% dim()
## [1] 101141     13
crime$occurred_date %>% max(na.rm = TRUE)
## [1] "2019-03-20"

crime$neighborhood %>% table() %>% sort() %>% tail()
## .
##          UNIVERSITY         SLU/CASCADE          QUEEN ANNE 
##                3524                4276                4671 
##        CAPITOL HILL           NORTHGATE DOWNTOWN COMMERCIAL 
##                5091                5186                8813

crime$crime_subcategory %>% table() %>% tail(5)
## .
##  THEFT-BICYCLE THEFT-BUILDING THEFT-SHOPLIFT       TRESPASS         WEAPON 
##           1411           3855           9225           2505            917
crime$reported_time %>% min(na.rm = TRUE)
## [1] 0
crime$reported_time %>% max(na.rm = TRUE)
## [1] 2359
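
The range of 0 to 2359 suggests that reported_time stores clock times as HHMM numbers. That is an assumption, not something the dataset documentation confirms here. Under that assumption, a sketch of splitting such values into hours and minutes, using made-up values:

```r
# Assumption: times are stored as HHMM numbers (2359 means 23:59).
reported_time <- c(0, 930, 2359)   # made-up values for illustration

hours   <- reported_time %/% 100   # integer division keeps the hour
minutes <- reported_time %% 100    # remainder keeps the minutes
clock   <- sprintf('%02d:%02d', as.integer(hours), as.integer(minutes))
clock
```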