March 22nd, 2018

Background

What is R?

  • R is a free, open-source software environment for statistical computing.
  • It is effective at handling and storing data and contains many capabilities for graphing, analyzing, and performing computation on that data.

Why should you learn it?

  • Easy to pick up.
  • Numerous useful packages and tools.
  • Large and dedicated community.
  • Used widely in academia and industry.

What to expect from this presentation

  • High-level overview covering lots of information, but not diving too deeply into any single topic (breadth over depth).
  • Applied focus.
  • The real learning will happen during DataFest!

R Basics

Intro to RStudio

  • RStudio is an integrated development environment (IDE) that provides many useful features for coding in R, as well as the ability to easily output your analyses in almost any desired format.
  • Examples: RMarkdown, Shiny, Bookdown

General R Workflow

  • Interact via the console, using the arrow keys to move up and down through previous commands. Save code in scripts if you want to reuse it later.
  • Generally, experiment in the console, then copy what you will reuse into a script.
  • Leave (frequent) comments using #.
#this is an R comment

Getting more information

  • Type ? followed by a function name (e.g. ?class) into your RStudio console to open that function's help page.
  • Use ??"stringOrRegEx" to search for any pattern among the documentation for packages in your library.
#?class
#??"linear model"

Arithmetic in R

  • R can do math in the console, as you might expect. Sequences of integers can easily be created using colon notation.
(7 + 14) * 12
## [1] 252
1:5 #also works in reverse
## [1] 1 2 3 4 5
sum(1:5)
## [1] 15
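
As a small illustration of the colon notation above, the sketch below shows a reversed sequence. The seq() call is an addition not on the original slide; it is base R's function for sequences with a step other than one.

```r
# Colon notation counts down when the first number is larger
5:1

# seq() builds sequences with a custom step size
evens <- seq(2, 10, by = 2)
evens

# Arithmetic applies to every element of a sequence
(1:5) * 2
```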

Types in R

  • Check the type of an object using the class function.
  • The most common types in R are numerics, characters, factors, and dataframes.
  • To declare objects in R, use <- or = (<- is generally preferred).
num <- 3
char <- 'c' #use "" or '' to denote character
class(num) #functions take objects as arguments - num.class() is wrong!
## [1] "numeric"
class(char) #all strings are characters in R
## [1] "character"
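
Two more basic types come up often enough to be worth a quick look: integers and logicals. This is a minimal sketch; the names whole and flag are made up for illustration.

```r
whole <- 3L        # the L suffix makes an integer rather than a numeric
flag <- TRUE       # logical values are TRUE or FALSE

class(whole)
class(flag)
is.numeric(whole)  # integers still count as numeric
```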

Converting between types

  • R comes built in with many functions to convert between data types to get them into the right form for your analysis.
as.character(4)
## [1] "4"
as.numeric("4")
## [1] 4
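
Conversions do not always succeed, so it helps to know what a failed or lossy conversion looks like. A short sketch (the variable name bad is illustrative, and suppressWarnings() is used only to keep the output quiet):

```r
# Text that is not a number becomes NA (R normally also emits a warning)
bad <- suppressWarnings(as.numeric("four"))
bad

# as.integer() drops the decimal part rather than rounding
as.integer(3.9)

# Logical values convert to 1 and 0
as.numeric(c(TRUE, FALSE))
```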

Vectors and Matrices

  • A vector is an array of objects of the same type, and is created using the c() function.
  • Vectors are the most important data structure in R (we'll go into why later).
  • A matrix is a collection of objects of the same type (generally numeric) in a rectangular layout (i.e. a 2-dimensional vector).
dat <- c(1:5)
dat #prints contents of dat to console
## [1] 1 2 3 4 5
class(dat) #"vector" is not a class; class() reports the type of the elements
## [1] "integer"
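
The slide describes matrices without showing one, so here is a minimal sketch using base R's matrix() function. It also shows what "same type" means in practice for c(). The names m and mixed are illustrative.

```r
# Fill a 2 x 3 matrix column by column (R's default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)    # rows, then columns
m[2, 3]   # row 2, column 3

# Mixing types in c() silently converts everything to one type
mixed <- c(1, "a", TRUE)
class(mixed)
```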

Dataframes

  • Think of dataframes as matrices that can hold any combination of object types (not just one single type).
  • These will be your bread-and-butter data structure in R!
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
dat <- data.frame(employee, salary)

Dataframes 2

dat
##     employee salary
## 1   John Doe  21000
## 2 Peter Gynn  23400
## 3 Jolie Hope  26800
class(dat) #dataframe is a class, however
## [1] "data.frame"
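
One detail worth checking after building a dataframe is the type of each column. Before R 4.0.0, data.frame() converted character columns to factors by default; passing stringsAsFactors explicitly gives the same result on any version. A sketch with illustrative names (employee_names, pay, staff):

```r
employee_names <- c("John Doe", "Peter Gynn", "Jolie Hope")
pay <- c(21000, 23400, 26800)

# Keep text columns as character regardless of R version
staff <- data.frame(employee = employee_names, salary = pay,
                    stringsAsFactors = FALSE)

class(staff$employee)
class(staff$salary)
str(staff)
```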

Working with data

  • We can use a built-in dataset in R to begin testing some of R's data analysis functionality.
  • View the data dictionary for the mtcars dataset by typing ?mtcars in the console.
mtcars <- mtcars
mtcars 
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#View(mtcars) opens the data in RStudio's data viewer

Reading in data

  • Generally, you supply R with the path to a file, then use a function made for that file type to import it.
  • Commonly, the data comes as a .csv file.
#data <- read.csv("path/to/file/data.csv")
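
To try read.csv() without having a file on hand, the sketch below first writes a few rows of mtcars to a temporary file and then reads them back. The temporary path and the names csv_path and cars_in are assumptions made for the example; in practice you would point read.csv() at your own file.

```r
# Write a small data frame to a temporary .csv file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_path, row.names = FALSE)

cars_in <- read.csv(csv_path)
dim(cars_in)
names(cars_in)

file.remove(csv_path)  # clean up the temporary file
```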

Other tips for reading in data

  • Make things easier by setting a working directory at the beginning of the session, so you don't have to type a long file path every time you import data.
  • Use the fread function from the data.table package for reading in large datasets (will be especially useful during DataFest).
#setwd("your/desired/directory")
#data <- data.table::fread("data.csv")

Common data analysis functions

  • Upon loading the data, we can use common base R functions to get a sense of the data's size and variables included.
class(mtcars)
## [1] "data.frame"
dim(mtcars) #also nrow(mtcars) and ncol(mtcars)
## [1] 32 11
names(mtcars) #also rownames(mtcars) colnames(mtcars) 
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
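
Printing an entire dataset gets unwieldy quickly. As a complement to the functions above, base R's head() and tail() show just a few rows:

```r
head(mtcars, 3)   # first three rows
tail(mtcars, 2)   # last two rows
```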

Structure

  • Initially checking the structure of your data is extremely important for future modeling and visualization.
  • Make sure everything corresponds to the right type (e.g. categorical variables are labelled as such). Categorical variables are called factors in R.
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Factors in R

  • Factors are variables in R that take on a limited number of specified values, called levels.
  • They are important because categorical variables are treated very differently from numeric variables in statistical and computational methods.
gender <- c("Male", "Female")
gender <- as.factor(gender) #convert to factor from character
class(gender)
## [1] "factor"

Modifying factors

  • Use the levels function to get or change the names assigned to the different levels of your factor.
levels(gender)
## [1] "Female" "Male"
levels(gender) <- c("F","M")
levels(gender)
## [1] "F" "M"

Ordinal factors

  • Some factors are ordinal in nature, while others are not (e.g. one's score on a test vs. one's race). Ordinal factors are fairly rare, however.
  • Specify ordered = T in the factor function to create an ordinal factor.
rank <- c(1:5)
rank <- factor(rank, levels = c(1:5), ordered = T)
rank
## [1] 1 2 3 4 5
## Levels: 1 < 2 < 3 < 4 < 5
levels(rank)
## [1] "1" "2" "3" "4" "5"
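
The practical payoff of an ordinal factor is that comparisons respect the level order. A minimal sketch, using a made-up sizes factor with text labels:

```r
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

sizes < "high"     # comparisons follow the declared level order
max(sizes)         # largest level present
sort(sizes)        # sorted by level, not alphabetically
```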

Changing variable types

  • It's clear that some of the variables in the mtcars dataset are in fact categorical, even though they are coded as numeric right now. Now, with knowledge of how factors work in R, we can change them.
  • As we will see later, this is especially useful for visualization.
mtcars$cyl <- as.factor(mtcars$cyl)
class(mtcars$cyl)
## [1] "factor"
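
Going the other way, from factor back to numeric, has a well-known pitfall: as.numeric() returns the internal level codes rather than the labels. A short sketch with an illustrative cyl_factor vector:

```r
cyl_factor <- as.factor(c(6, 4, 8, 6))

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(cyl_factor)

# Convert to character first to recover the original numbers
as.numeric(as.character(cyl_factor))
```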

Using the summary function

  • The summary function can give you a high-level overview of data distribution and missingness.
  • Check for out-of-range or nonsensical values upon loading a dataset.

Summary

summary(mtcars)
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec             vs               am        
##  Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :3.325   Median :17.71   Median :0.0000   Median :0.0000  
##  Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
##  3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
##       gear            carb      
##  Min.   :3.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.688   Mean   :2.812  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :8.000

Missing values in R

  • Missing values in data can be represented as either NA (not available) or NaN (not a number).
  • They will interfere with many common R operations, so always make sure to check for them before analysis.
na <- c(4, 5, NA, 7, NaN)
mean(na)
## [1] NA
is.na(na) #check for missingness
## [1] FALSE FALSE  TRUE FALSE  TRUE

Dealing with missing values

  • Simply removing the missing values can sometimes be an effective way of dealing with them, especially when they are missing completely at random or not high in number.
  • Missing value imputation can also be effective, but that's for another talk.
na.not <- na.omit(na)
is.na(na.not)
## [1] FALSE FALSE FALSE
mean(na.not)
## [1] 5.333333
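
Many base R functions can also skip missing values on their own through the na.rm argument, and complete.cases() does the equivalent job for the rows of a dataframe. A sketch with illustrative names (vals, scores):

```r
vals <- c(4, 5, NA, 7, NaN)

sum(is.na(vals))          # count the missing values
mean(vals, na.rm = TRUE)  # skip missing values inside the call

# For dataframes, complete.cases() flags rows with no missing values
scores <- data.frame(id = 1:3, score = c(90, NA, 75))
scores[complete.cases(scores), ]
```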

Data selection 1

  • Datasets can be referenced either by individual variable or through an indexed selection.
  • Bracket notation works by specifying the desired row, then column (R is one-indexed!) and can return anything from a single observation, to a vector, to a dataframe.
mtcars[2,4]
## [1] 110
mtcars[,3]
##  [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
## [12] 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0
## [23] 304.0 350.0 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Data selection 2

  • Dollar sign notation works with named columns in the dataframe, and returns a vector.
mtcars[1:3,]
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
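
Bracket notation also accepts column names and logical conditions, which is often more readable than numeric positions. A minimal sketch (the name efficient and the mpg cutoff of 30 are arbitrary choices for illustration):

```r
# Select columns by name instead of position
mtcars[1:3, c("mpg", "hp")]

# Keep only the rows that satisfy a condition
efficient <- mtcars[mtcars$mpg > 30, ]
rownames(efficient)

# drop = FALSE keeps a single column as a dataframe rather than a vector
class(mtcars[, 3, drop = FALSE])
```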

Useful statistical functions

  • Base R provides many useful statistical functions for data analysis.
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$hp)
## [1] 123
sd(mtcars$disp)
## [1] 123.9387

Useful statistical functions 2

  • A robust suite of statistical testing tools is included as well.
cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684
cor.test(mtcars$mpg, mtcars$hp)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
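
The printed test output is convenient to read, but the individual numbers can also be pulled out of the result by name, which helps when you want to reuse them in later code. A sketch (the name result is illustrative):

```r
result <- cor.test(mtcars$mpg, mtcars$hp)

result$estimate   # the correlation coefficient
result$p.value    # the p-value as a plain number
result$conf.int   # the 95 percent confidence interval

result$p.value < 0.05
```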

Vectorized operations

  • Vectors are so important to R because many operations apply element-wise to vectors by default, and R is optimized for operating on vectors.
  • In practice, this means most operations in R treat vectors the same way they would scalars, so you don't have to alter your code or write loops.
  • This is because even a single number in R is a vector of length one in its underlying representation (even if this isn't apparent to us).

Vectorized Operations in Action

a <- c(1:3)
b <- c(4:6)
a + b
## [1] 5 7 9
mean(a + b)
## [1] 7
sd(a + b)
## [1] 2
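
To make the contrast concrete, the sketch below shows a few more element-wise operations and then the same doubling written as an explicit loop. The names x and doubled are illustrative.

```r
x <- c(1, 2, 3)

x * 2        # a single number is recycled across the vector
x > 1        # comparisons are element-wise too
sqrt(x)      # so are most math functions

# The same doubling written as an explicit loop, for comparison
doubled <- numeric(length(x))
for (i in seq_along(x)) {
  doubled[i] <- x[i] * 2
}
identical(doubled, x * 2)
```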

Apply functions

  • Apply functions take advantage of this by performing some function over a specified set of data in a vectorized way.
  • There are many different kinds, and we won't get into all of them; the most useful, however, are lapply, sapply, and vapply.
  • vapply - returns a vector with specifiable return type.
  • lapply - returns a list (a vector that can hold R objects of different types; we won't be covering lists, as they are fairly specialized in practice).
  • sapply - like lapply, but attempts to return a simplified version since lists are often unwieldy to work with.

Apply functions in action

sapply(1:10, function(x) x^2)
##  [1]   1   4   9  16  25  36  49  64  81 100
vapply(c("a", "b", "c", "d"), function(x) x=="d", logical(1))
##     a     b     c     d 
## FALSE FALSE FALSE  TRUE
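
The difference between the three functions is easiest to see by checking what each one returns for similar input. A minimal sketch with illustrative names:

```r
squares_list <- lapply(1:3, function(x) x^2)
class(squares_list)    # lapply always returns a list

squares_vec <- sapply(1:3, function(x) x^2)
class(squares_vec)     # sapply simplified the result to a vector

# vapply checks that every result matches the declared template
name_lengths <- vapply(c("mpg", "cyl", "disp"), nchar, integer(1))
name_lengths
```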

Apply functions 2

mtcars[,8:11] <- lapply(mtcars[,8:11], as.factor)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Apply functions vs. loops

  • Loops are rarely the most idiomatic choice in R. Instead, consider a vectorized operation or a suitable apply function.
  • The most common use for a for loop in R is looping through the files in a folder, but you probably won't need to do this for DataFest.
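
For reference, looping through files might look like the sketch below. The folder, the file names, and the data are all made up for the example: it creates two small .csv files in a temporary folder, then reads and stacks them.

```r
# Create a temporary folder holding two small .csv files (illustrative data)
folder <- file.path(tempdir(), "loop_demo")
dir.create(folder, showWarnings = FALSE)
write.csv(head(mtcars, 3), file.path(folder, "part1.csv"), row.names = FALSE)
write.csv(tail(mtcars, 2), file.path(folder, "part2.csv"), row.names = FALSE)

# Loop over every .csv file in the folder and stack the contents
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
combined <- data.frame()
for (f in files) {
  combined <- rbind(combined, read.csv(f))
}
nrow(combined)

unlink(folder, recursive = TRUE)  # clean up the temporary folder
```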

Plotting in Base R

  • We will spend most of this presentation focusing on other ways of visualizing data, but base R plotting is easy enough that it is still worth a mention.
plot(mtcars) #pairs plot

Bar chart (base R)

plot(mtcars$cyl)

Histogram (base R)

hist(mtcars$mpg)

Scatter plot (base R)

plot(mtcars$mpg~mtcars$hp)
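
Base R plots accept a few common arguments for titles and axis labels. The sketch below uses the formula interface with a data argument, which avoids repeating mtcars$; the label text is an illustrative choice.

```r
# main, xlab and ylab set the title and axis labels
plot(mpg ~ hp, data = mtcars,
     main = "Fuel economy vs. horsepower",
     xlab = "Horsepower (hp)",
     ylab = "Miles per gallon (mpg)")

# hist() also returns the bin counts it drew
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
sum(h$counts)
```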

Data Manipulation

Introduction to the tidyverse

  • By far the most useful and widespread packages for data manipulation are dplyr and tidyr, both part of Hadley Wickham's tidyverse.
  • We will look at another one of his packages for visualization, ggplot2, later.
  • We can install all the necessary packages at once by installing the tidyverse package.

tidyverse

  • lubridate is another package in the tidyverse that we are not covering, but it could come in handy during DataFest for working with dates in R.