R Programming for Institutional Research

2017-11-08

Introduction - What is R?

"R is a free software environment for statistical computing and graphics"
R is a multifaceted programming language that allows users to:
- Import and export data from a variety of formats
- "Data Wrangling": The process of manipulating raw data into a format to be used for analysis
- Data Analysis: Summarizing data, exploratory data analysis, statistical/machine learning algorithms
- Create interactive apps/dashboards
- Create reports and presentations in PDF, HTML, and Microsoft Word formats

Introduction - Why use R?

R is free and open source
It can be used on Windows, macOS, and Linux operating systems
R is a general purpose programming language: it can be used to automate analyses and for development
There is a large community of users and developers
Comprehensive R Archive Network (CRAN)
- A network of mirrors that act as the primary service for distributing R binaries, packages and documentation
- Over 11,000 user-contributed packages (as of November 2017)
R makes reproducible research easy!
R is widely used in industry and academia

Introduction - Integrating into R

R can be challenging for new users
R is a scripting language
You can use a Graphical User Interface (GUI)
You can use only the command line
R is interactive: users request output interactively
R can be used in batch processes (automation)

Introduction - RStudio

RStudio is an integrated development environment (IDE) for R programming
There is a console for executing R code
Syntax highlighting colors R scripts to help users identify important R language features.
- Helpful for avoiding typos and syntax errors and for identifying the functions being used
There are tools for plotting, debugging and workspace management

Interactive R

R is an interpreted language
R statements are converted to machine instructions as they are encountered
A prompt ">" is presented to users
R statements are evaluated and then a result is returned

Operators - Arithmetic

+ : add
- : subtract
* : multiply
^ : raise to the power
/ : divide
%% : give the remainder of dividing the first element by the second (x mod y)
%/% : give the integer quotient of dividing the first element by the second (the result of the division rounded down)
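
As a quick illustration of the less familiar operators, the sketch below uses arbitrary example values (17 and 5 are not from the slides) to show how integer division and the remainder fit together.

```r
# Arbitrary example values chosen for illustration
quotient  <- 17 %/% 5   # integer division: 3
remainder <- 17 %% 5    # remainder: 2
power     <- 2^10       # exponentiation: 1024

# The two results are linked: quotient * divisor + remainder gives back 17
rebuilt <- quotient * 5 + remainder
rebuilt
```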

Operators - Relational and Logical

> : Produces a Boolean value by checking to see whether one element is greater than another
< : Produces a Boolean value by checking to see whether one element is less than another
>= : Produces a Boolean value by checking to see whether one element is greater than or equal to another
<= : Produces a Boolean value by checking to see whether one element is less than or equal to another
== : Produces a Boolean value by checking to see whether one element is equal to another
!= : Produces a Boolean value by checking to see whether one element is not equal to another
! : Produces a Boolean value by negating a logical value (TRUE becomes FALSE and vice versa)
& : Produces a Boolean value by checking whether both corresponding elements are TRUE
| : Produces a Boolean value by checking whether at least one of the corresponding elements is TRUE
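
These operators work element by element on vectors. A minimal sketch, assuming a made-up vector of ACT scores (the name `scores` and its values are illustrative only):

```r
scores <- c(18, 25, 31)   # hypothetical ACT scores

above_20   <- scores > 20                  # FALSE TRUE TRUE
equal_25   <- scores == 25                 # FALSE TRUE FALSE
not_25     <- scores != 25                 # TRUE FALSE TRUE
in_between <- scores > 20 & scores < 30    # both conditions: FALSE TRUE FALSE
extreme    <- scores < 20 | scores > 30    # either condition: TRUE FALSE TRUE
negated    <- !above_20                    # logical negation: TRUE FALSE FALSE
```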

Operators - Assignment

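<-, =, <<- : left assignment to create an object (<<- assigns in an enclosing environment rather than the current one)
->, ->> : right assignment to create an object (->> is the rightward form of <<-)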

Operators - Other

%in% : A Boolean operator that determines whether an element in one vector is in another
: (colon operator) : creates a sequence of consecutive numbers, e.g. 1:10
$ : Select a certain column in a data frame
%*% : matrix multiplication
~ : formula operator mostly used for statistical modeling
? : help
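
A short sketch of these operators in use. The vectors and the `students` data frame are invented for illustration; `stringsAsFactors = FALSE` is set explicitly so the column stays a character vector on any R version.

```r
majors <- c("Biology", "Physics", "History")    # hypothetical values
stem   <- c("Biology", "Chemistry", "Physics")

in_stem <- majors %in% stem    # TRUE TRUE FALSE
ids     <- 1:5                 # 1 2 3 4 5

students <- data.frame(id = 1:3, major = majors, stringsAsFactors = FALSE)
major_column <- students$major # the "major" column as a vector

m <- matrix(1:4, nrow = 2)     # filled column by column
m_squared <- m %*% m           # matrix product, not element-wise

# ?mean opens the help page for mean() in an interactive session
```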

Intro to R Exercise

Type "Hello, World"
56934 + 34356
$49^{15}$
Calculate the circumference (in miles) of the earth: $2 \times \pi \times 3959$ (in R, pi represents $\pi$)

Intro to R Exercise

"Hello, World"

## [1] "Hello, World"

56934 + 34356

## [1] 91290

49^15

## [1] 2.253934e+25

2*pi*3959

## [1] 24875.13

Objects and Environments

R stores data and functions in active memory as named objects
Users can perform operations on objects
A variable is a binding between a symbol (name) and an object
An environment is a place to store variables
When you create new variables at the R prompt, you're adding them to the global environment
To create a variable, you use an assignment operator ("<-", "->" or "=") or the assign() function
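
The four ways of creating a variable can be sketched as follows. The names `x`, `y`, `z` and `w` are arbitrary, and the example assumes it is run at the top level (for instance at the prompt), so the variables land in the global environment.

```r
x <- 10              # left assignment
20 -> y              # right assignment
z = x + y            # "=" also assigns at the top level
assign("w", z * 2)   # assign() takes the variable name as a string

w_exists <- exists("w")                   # TRUE: the name is now bound
env_name <- environmentName(globalenv())  # "R_GlobalEnv"
w                                         # 60
```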

Object Attributes and Classes

R objects have attributes, which store their metadata
Attributes can be obtained for an object using the attr() and attributes() functions
A class describes the object's abstract type

x <- data.frame(VAR1 = c(1,2), VAR2 = c(3,4)) ; attributes(x); mode(x)

## $names
## [1] "VAR1" "VAR2"
## 
## $row.names
## [1] 1 2
## 
## $class
## [1] "data.frame"

## [1] "list"

Data Structures: Vectors

The most basic object in R is a vector
Every element in a vector must have the same data type
R has no separate scalar type: a single number is just a vector of length 1 with the mode numeric

x <- 1
x

## [1] 1

A single character string is a vector of length 1, with the mode character

my_name <- "Brian Pattiz"
my_name

## [1] "Brian Pattiz"

Other vectors

Integer vectors store whole numbers; writing a value with the L suffix (e.g. 5L) keeps it an integer rather than a double
Logical vectors (TRUE/FALSE)
Complex vectors (complex numbers)

Multiple-length vectors

Using the combine function c(), users can create vectors with multiple elements

my_vec <- c(1:10, 13, 55, 63, 536)
my_vec

##  [1]   1   2   3   4   5   6   7   8   9  10  13  55  63 536

Data Structures: Factors

A factor is a vector used to store categorical data
They have two attributes: the class "factor" and the levels, which define the set of allowed values
They are useful for knowing the set of possible values, even when some aren't present in the data

stem_majors <- factor(c("Biology", "Chemistry", "Biology","Computer Science"),
  levels = c("Biology", "Chemistry", "Physics", "Engineering", "Mathematics", "Computer Science", "Statistics"))
stem_majors

## [1] Biology          Chemistry        Biology          Computer Science
## 7 Levels: Biology Chemistry Physics Engineering ... Statistics

Data Structures: Matrices and arrays

Adding the dimension attribute to a vector creates an array
Matrices are a special case of an array with two dimensions: rows and columns

mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2)
mat

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Data Structures: Lists

Lists are similar to vectors, but allow elements of different data types

my_list <- list("foo", c(pi, exp(2)), TRUE)
my_list

## [[1]]
## [1] "foo"
## 
## [[2]]
## [1] 3.141593 7.389056
## 
## [[3]]
## [1] TRUE

my_list[[2]]

## [1] 3.141593 7.389056

Data Structures: Dataframes

Data frames are the data structure most commonly used for storing data in R
A data frame is a list of equal-length vectors, so each column can store a different mode of data

my_df <- data.frame(ID = paste0("000", 1:10), ACT = round(runif(10, 17, 34)))
my_df

##       ID ACT
## 1   0001  31
## 2   0002  32
## 3   0003  32
## 4   0004  23
## 5   0005  23
## 6   0006  20
## 7   0007  17
## 8   0008  21
## 9   0009  34
## 10 00010  19

Functions

Functions are a fundamental building block of R and many other programming languages
A function is a set of instructions that takes inputs and returns an output
Functions are very useful for repeatable tasks
R contains many base functions that are very useful
R allows users to create their own functions to accomplish a particular task

myFunction <- function(arguments){
  
  # instructions based on the arguments
  # generate output
  
  return(output)
  
}
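
To make the template concrete, here is a minimal sketch of a user-defined function. The name `pct_at_or_above`, its arguments and the default cutoff of 21 are all hypothetical choices for illustration.

```r
# Hypothetical helper: percentage of scores at or above a cutoff
pct_at_or_above <- function(scores, cutoff = 21) {
  # mean() of a logical vector is the proportion of TRUE values
  proportion <- mean(scores >= cutoff, na.rm = TRUE)
  return(round(100 * proportion, 1))
}

default_cutoff <- pct_at_or_above(c(18, 21, 25, 30))       # 75
higher_cutoff  <- pct_at_or_above(c(18, 21, 25, 30), 26)   # 25
```

Giving `cutoff` a default value means callers only need to supply it when they want something other than 21.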

Character strings

Character strings are an important part of data analysis
Using strings in base R can be challenging to learn
The package stringr was created to make working with strings easier

Functions

str_length(string): Calculates the length of a string
str_c(string1, string2): Join multiple strings together
str_sub(string, start, end): Extract and replace substrings from a character vector
str_split(string, pattern): Splits up one string into multiple strings
str_detect(string, pattern): A Boolean function that indicates if there is a pattern match
str_subset(string, pattern): Find the matching elements and return them in a vector
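
The functions above might be used as in the following sketch. It assumes the stringr package is installed; the course codes are made up for illustration.

```r
library(stringr)

course <- c("MATH 101", "BIOL 210", "MATH 250")   # hypothetical course codes

lengths   <- str_length(course)            # 8 8 8
dept      <- str_sub(course, 1, 4)         # "MATH" "BIOL" "MATH"
labelled  <- str_c("Dept: ", dept)         # "Dept: MATH" "Dept: BIOL" "Dept: MATH"
is_math   <- str_detect(course, "MATH")    # TRUE FALSE TRUE
math_only <- str_subset(course, "MATH")    # "MATH 101" "MATH 250"
parts     <- str_split(course[1], " ")     # a list holding c("MATH", "101")
```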

Loops and Conditions: Conditionals

Statements that run only when a Boolean (TRUE/FALSE) condition is met

If Structure

The expression should evaluate to a single logical (or numeric) value; older versions of R tested only the first element of a longer vector, and since R 4.2.0 a longer vector is an error

if (expression){
  
    statement
}

Loops and Conditions: If-else Structure

For statement2 to be evaluated, statement1 must be false.

if (expression) {
   statement1
} else {
   statement2
}
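
A small sketch of the if-else structure, using a hypothetical GPA value and cutoff:

```r
gpa <- 3.4   # hypothetical value

if (gpa >= 3.5) {
  standing <- "Dean's list"
} else {
  # Runs because the condition above is FALSE
  standing <- "Good standing"
}

standing   # "Good standing"
```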

Loops and Conditions: for Loop

This control structure allows code to be executed iteratively.
A loop variable takes each value of the sequence in turn, so the statement knows which iteration it is on.
Used when the number of iterations is known.

for (value in sequence) {

  statement

}
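
As an illustration, the sketch below loops over a made-up list of credit hours per student and stores each total. The names `credits` and `totals` are hypothetical.

```r
# Hypothetical credit hours for three students
credits <- list(c(3, 4), c(3, 3, 3), c(4))

# Pre-allocate the result, then fill it one iteration at a time
totals <- numeric(length(credits))
for (i in seq_along(credits)) {
  totals[i] <- sum(credits[[i]])
}

totals   # 7 9 4
```

Using `seq_along()` rather than `1:length(credits)` keeps the loop from running when the list is empty.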

Loops and Conditions: While Loops

This control structure allows code to be evaluated repeatedly as long as the Boolean test condition is true
One of the main drawbacks is that if the loop is not carefully constructed, the code can execute indefinitely.

while(test_condition){
  
  statement
  
}
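
A minimal sketch of a while loop that is guaranteed to stop: the value doubles on every pass, so the test condition eventually becomes false. The variable names are arbitrary.

```r
n     <- 1
steps <- 0

# n doubles each pass, so n < 100 must eventually be FALSE
while (n < 100) {
  n     <- n * 2
  steps <- steps + 1
}

n       # 128
steps   # 7
```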

Some Useful Functions

seq(from, to, by): A function that generates a sequence
rep(vector, times): A function that repeats a vector a given number of times
any(vector): A Boolean function that determines whether a logical vector has any TRUE values
all(vector): A Boolean function that determines whether a logical vector has all TRUE values
ifelse(condition, yes, no): A vectorized version of the if-else construct
print(object): a generic function that prints its argument
example(topic): a function that executes the example in R's help documentation
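
A brief sketch of several of these functions, using invented scores and an arbitrary benchmark of 21:

```r
evens    <- seq(2, 10, by = 2)            # 2 4 6 8 10
repeated <- rep(c("A", "B"), times = 2)   # "A" "B" "A" "B"

scores <- c(17, 22, 30)                   # hypothetical scores
any_high   <- any(scores > 28)            # TRUE: at least one element passes
all_passed <- all(scores > 18)            # FALSE: 17 fails the test

# ifelse() applies the test to every element at once
status <- ifelse(scores >= 21, "met", "not met")   # "not met" "met" "met"
print(status)
```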

Testing and Coercing Vectors

Testing a vector's data type is useful before operations such as merging
You can find a vector's data type using the function typeof()
Testing for a particular data type can be done with the "is" family of functions, such as is.numeric() and is.character()
Coercion of elements in a vector can occur automatically, since the elements in a vector must all be one type
Forced coercion is sometimes necessary for operations such as merging; the "as" family of functions, such as as.numeric() and as.character(), is useful for this
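
The sketch below shows testing, automatic coercion and forced coercion on two hypothetical ID vectors that were stored with different types.

```r
ids_num <- c(101, 102)       # hypothetical IDs stored as numbers
ids_chr <- c("101", "103")   # hypothetical IDs stored as text

num_type <- typeof(ids_num)        # "double"
chr_test <- is.character(ids_chr)  # TRUE

# Automatic coercion: mixing types in c() promotes everything to character
mixed <- c(1, "a", TRUE)           # "1" "a" "TRUE"

# Forced coercion so the two ID vectors can be compared or merged
ids_converted <- as.numeric(ids_chr)   # 101 103
shared <- ids_num %in% ids_converted   # TRUE FALSE
```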

Missing and Null Values

Missing values are represented in R as NA
NULL represents a value that does not exist, rather than being unknown

x <- c(3, NA, 3, NA, 5, 9)

is.na(x)

## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

Statistical and Mathematical Functions

sum(vector): sum the values of a vector
rowSums(dataframe) and colSums(dataframe): sum the row/columns of a dataframe
min(vector): the minimum value of a vector
max(vector): the maximum value of a vector
mean(vector): the average value of a vector
sd(vector): the standard deviation of a vector
If the vector that you want to apply these functions to contains missing values, use the 'na.rm = TRUE' argument
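
To show why na.rm matters, the sketch below applies a few of these functions to a vector containing missing values.

```r
x <- c(3, NA, 3, NA, 5, 9)

total_with_na <- sum(x)                   # NA: a missing value makes the sum unknown
total         <- sum(x, na.rm = TRUE)     # 20
average       <- mean(x, na.rm = TRUE)    # 5 (20 divided by the 4 non-missing values)
largest       <- max(x, na.rm = TRUE)     # 9
spread        <- sd(x, na.rm = TRUE)      # standard deviation of 3, 3, 5, 9
```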

The Tidyverse Package Suite

A collection of packages created by Hadley Wickham, the chief scientist of RStudio

install.packages("tidyverse")

Base R comes with many useful functions for data analysis; however, it can be extended with these powerful packages
These packages improve upon existing functions by:
- Increasing speed
- Making code easier to read and write
- Being less reliant on loops and conditionals

Reading Data into R

R can read in data from flat files, databases, and scrape it from the internet

Flat File Packages

readr: provides a fast and easy way to read csv, tsv and fwf formats
readxl: makes it easier to extract data from Excel
haven: allows R to read and write data formats from other statistical software including SAS, SPSS and Stata

Database Packages

DBI: This package helps R connect to an external database management system
ROracle, RMySQL, RSQLite, RPostgres, RSQLServer are examples of interfaces to database management systems

Web data

jsonlite and xml2 make it easier to parse JSON and XML data
rvest: provides tools to scrape data from webpages

Reading Data into R - Using the readr package

readr's read_csv has advantages over the base R read.csv function:
- Faster
- Never converts character strings to factors (read.csv did so by default before R 4.0.0)
- Parses common date/time formats with ease

library(readr)
my_dataset <- read_csv("path/my_file.csv")

Reading Data into R - the DBI package

Connects to a DBMS
Creates and executes statements sent to the DBMS
Gathers results from those statements

Useful Functions

dbConnect(driver,...): Using a driver, a connection is created and opened.
dbGetQuery(conn, statement): This function submits a query and returns the result as a data frame
dbWriteTable(conn, name, value): This function writes a data frame to a database table
dbListTables(conn): returns a vector of table names
dbListFields(conn, name): returns a vector of field names
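
The workflow above can be sketched against a throwaway in-memory SQLite database. This assumes the DBI and RSQLite packages are installed; the `students` table and its values are invented for illustration, and a real project would connect to its own DBMS with the appropriate driver.

```r
library(DBI)

# In-memory SQLite database: nothing is written to disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

students <- data.frame(id = 1:3, act = c(21, 28, 33))   # hypothetical data
dbWriteTable(con, "students", students)

tables <- dbListTables(con)               # "students"
fields <- dbListFields(con, "students")   # "id" "act"

# The query result comes back as a data frame
high_scores <- dbGetQuery(con, "SELECT id, act FROM students WHERE act > 25")

dbDisconnect(con)   # always close the connection when finished
```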

Manipulating Data

Principles of Tidy Data

From Hadley Wickham in R for Data Science:

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.

Each observation must have its own row.

Each value must have its own cell.

Packages for Manipulating Data

dplyr: Provides a grammar of data manipulation
tidyr: Helps with creating 'tidy' data
data.table: Provides a faster, enhanced version of data.frame and useful syntax for manipulating data
sqldf: Manipulate dataframes with SQL statements

Manipulating Data - Base R

Useful functions

subset(dataframe, test_condition): filters a dataframe using a conditional statement
merge(dataframe1, dataframe2, by = "ColumnName"): merges two dataframes by a common column name
unique(dataframe): returns a vector/dataframe with the duplicated elements removed
sort(vector, decreasing = FALSE): sorts a vector or dataframe column in ascending or descending order
aggregate(dataframe, by, FUN): splits a dataframe into groups and reduces each group to a single summary value
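
A minimal sketch of these base R functions on two small, made-up data frames (`students` and `advisors` and all of their values are hypothetical):

```r
students <- data.frame(
  id    = 1:4,
  major = c("Biology", "Physics", "Biology", "Physics"),
  act   = c(24, 30, 28, 26)
)
advisors <- data.frame(
  major   = c("Biology", "Physics"),
  advisor = c("Lee", "Ortiz")
)

high_act  <- subset(students, act > 25)             # rows with act above 25
merged    <- merge(students, advisors, by = "major")  # adds the advisor column
n_majors  <- length(unique(students$major))         # 2 distinct majors
act_desc  <- sort(students$act, decreasing = TRUE)  # 30 28 26 24

# Mean ACT score per major: one row per group
by_major <- aggregate(students["act"],
                      by = list(major = students$major),
                      FUN = mean)
```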

Manipulating Data - Overview of dplyr

The verbs that dplyr uses to provide a grammar of data manipulation:

select(): get a subset of the data by choosing the columns
mutate(): create new columns or modify existing ones
filter(): get a subset of data based off a conditional statement
arrange(): reorder the rows of the data
summarise(): reduce the data into single values

Other functions

group_by(): allows any sequence of operations to be performed by group
rename(): rename the columns
inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join(): join dataframes (by default, dplyr joins on the columns the two dataframes have in common)
case_when(): a vectorized version of multiple if-else statements

Manipulating Data - Pipeline Operator (from the magrittr package)

The pipeline operator %>% is used to string together multiple functions in a sequence of operations

Eliminates the need to create multiple objects
This makes code very easy to write and read
Easy to go back and add to the sequence of operations

x %>% f(y) %>% g(z)   # equivalent to g(f(x, y), z)
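
The sketch below chains several dplyr verbs with the pipeline operator. It assumes the dplyr package is installed; the `students` data frame and the cutoff of 24 are made up for illustration.

```r
library(dplyr)

students <- data.frame(
  id    = 1:5,
  major = c("Biology", "Physics", "Biology", "Physics", "Biology"),
  act   = c(24, 30, 28, 26, 20)
)

summary_tbl <- students %>%
  filter(act >= 24) %>%                       # keep rows meeting the cutoff
  mutate(act_scaled = act / 36) %>%           # add a column (36 is the top ACT score)
  group_by(major) %>%                         # later steps run per major
  summarise(mean_act = mean(act), n = n()) %>%
  arrange(desc(mean_act))                     # highest average first

summary_tbl
```

Each step receives the result of the previous one as its first argument, so no intermediate objects are needed.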

Exploratory Data Analysis - Overview

The goal of exploratory data analysis is to generate questions about your data. This can lead you to:

Summarize your data
Create visualizations
Transform variables
Create simple models to detect patterns/trends

Advanced Methods

Cluster analysis: grouping the data into "natural" clusters
Factor analysis: identify latent factors to explain variation in the data
Analysis of Variance (ANOVA): Comparing the means of subgroups of the data
Principal Component Analysis: detect systematic patterns of variations in the data

Exploratory Data Analysis - Summarizing Data

An easy first step is to look at the structure of your data
- the str() function displays the structure of any object
Look at the data: View(), head() and tail()
Generate frequencies and contingency tables
- a table using table(), a flat multidimensional table using ftable(), a table of proportions using prop.table(), and table margins using margin.table()
- cross tables using xtabs() and the gmodels package
Generate summary statistics using summary(); the psych package provides summary statistics by group, and the doBy package mimics PROC SUMMARY from SAS
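
These base R summaries can be sketched on a small invented data frame (the `students` data and its columns are hypothetical):

```r
students <- data.frame(
  major     = c("Biology", "Physics", "Biology", "Physics", "Biology"),
  residency = c("In", "In", "Out", "In", "Out"),
  act       = c(24, 30, 28, 26, 20)
)

str(students)        # column names, types and first values
head(students, 3)    # first three rows

counts      <- table(students$major)       # Biology 3, Physics 2
proportions <- prop.table(counts)          # Biology 0.6, Physics 0.4
cross_tab   <- xtabs(~ major + residency, data = students)
row_totals  <- margin.table(cross_tab, 1)  # totals by major

summary(students$act)   # min, quartiles, mean and max
```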

Exploratory Data Analysis - Data Visualization Overview

Data visualizations are very useful for helping analysts understand the data and helping to share this understanding with others.
R contains very powerful plotting systems

Data Visualization - Base R Plotting

Base R's graphics engine is contained in the following packages:

graphics: contains the functions to produce scatter plots, line charts, histograms, box and whisker plots, etc.
grDevices: functions for writing output to formats such as png, pdf, ps, etc.
grid: An extension of base R plotting that makes it easier to create plots at specific locations in the plotting area.

Base R plotting functions

plot(x, y, type): the generic plot function, where x and y are the coordinates of the points in the plot.
- type is an argument which determines what type of plot should be drawn
- main is the argument for the title of the plot
- xlab, ylab are titles for the x and y axes
Once a plot is made, you can add to it with the functions lines(), text() and points()
hist(): This plot displays the distribution of the data by grouping it into bins and showing the frequency of each bin
boxplot(): The classic box and whisker plot is very useful for visualizing the variation in the data
barplot(): This function makes it easy to create a plot with vertical or horizontal bars
mosaicplot(): A mosaic plot is useful for displaying the proportions of factors for categorical data
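
A short sketch of these plotting functions, using the mtcars dataset that ships with R (chosen here only because it is always available; it is not part of the original slides). The titles and axis labels are illustrative.

```r
# Scatter plot with a title and axis labels
plot(mtcars$wt, mtcars$mpg, type = "p",
     main = "Fuel economy by weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Add a smoothed trend line to the existing plot
lines(lowess(mtcars$wt, mtcars$mpg))

# hist() also returns the bin counts it plotted
hist_info <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Box and whisker plot of mpg for each number of cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Bar plot of the number of cars with each gear count
barplot(table(mtcars$gear), xlab = "Gears", ylab = "Number of cars")
```

In a non-interactive session these plots are written to the default graphics device rather than displayed on screen.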

Data Visualization - the ggplot2 package

Created by Hadley Wickham as an implementation of Leland Wilkinson's The Grammar of Graphics
Hadley describes a statistical graphic as

a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system

Data Visualization - ggplot2 Plotting

Unlike plotting in base R, ggplot2 requires the data to be in a dataframe
The plotting system uses the following:
- aesthetics: transformation of the data into color, size and shape
- geoms: transformation of the data into points, lines and shapes
- facets: how the plot is arranged
- stats: transformation of the data into bins, quantiles and smoothing
- scales: controls how the data is transformed into aesthetics
- coordinates: draws the geoms in a systematic way in a specified location

Data Visualization - ggplot2 functions

qplot(x, y, data = dataframe): ggplot2's version of plot()
ggplot(dataset, aes()): the base function, which takes a dataset and an aesthetic mapping created with aes()

geom layers

geom_point(): add a layer of points
geom_smooth(): add a smoother from a model
geom_bar(): add bars
geom_boxplot(): add a boxplot
geom_text(): add text

facet layers

facet_grid(): lay out panels in a grid defined by faceting variables

Data Visualization - more ggplot2 functions

stat layers

stat_identity(): leave the data as is (by contrast, the "bin" stat used for histograms is suitable only for continuous data and drops missing values)
stat_unique(): remove duplicated values

scale layers

ggtitle(), labs(), xlab(), ylab(): add and modify title, axis and plot labels
lims(), xlim(), ylim(): Set scale limits
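
The layers listed above can be combined as in the following sketch. It assumes the ggplot2 package is installed and uses the built-in mtcars dataset purely for illustration; the labels are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # data plus aesthetic mapping
  geom_point() +                              # layer of points
  geom_smooth(method = "lm", se = FALSE) +    # fitted line from a linear model
  facet_grid(. ~ cyl) +                       # one panel per cylinder count
  labs(title = "Fuel economy by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

# print(p) draws the plot; at the interactive prompt, typing p is enough
```

Because each layer is added with `+`, the plot object can be built up or modified one piece at a time before it is drawn.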

Statistical Analysis - Statistical Modeling/Regression Overview

R was built for statistical analysis
Regression analysis is very simple to perform with R
R has many built-in functions for fitting many types of models
- lm(formula, data): linear models; glm(formula, family, data): generalized linear models
- arima(time_series): ARIMA models
- loess(formula, data): Local Polynomial Regression
- nls(formula, data): Nonlinear Least-Squares Regression
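
A minimal sketch of fitting two of these models, using the built-in mtcars dataset (an illustrative choice, not from the original slides):

```r
# Linear model: miles per gallon as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)   # intercept and slope
summary(fit)         # coefficient table, R-squared, residual summary

# Generalized linear model: logistic regression for transmission type (am is 0/1)
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted mpg for a hypothetical car weighing 3,000 lbs
predicted <- predict(fit, newdata = data.frame(wt = 3))
```

The `~` formula operator reads as "is modeled by": the response goes on the left and the predictors on the right.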

Statistical Analysis - Useful Modeling Packages

tree: Classification and Regression Trees
mlogit: Multinomial-Logit Models
caret: Functions for Regression and Classification
forecast: Functions for Time Series Forecasting
mgcv and VGAM: Generalized Additive Models
glmnet: Lasso and elastic-net regularized models
survival: Survival Analysis
lme4: Linear Mixed Models

Statistical Analysis - Probability Distributions

R contains functions for most statistical distributions, named with a one-letter prefix plus the distribution name (e.g. dnorm, pnorm, qnorm, rnorm)
- the p function is for cumulative distribution functions (CDFs)
- the d function is for density functions (pdfs)
- the r function is for random number generation
- the q function is for quantiles
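
For the normal distribution, the four prefixes look like this. The test statistic of 2.5 is an arbitrary illustrative value.

```r
density_at_0 <- dnorm(0)       # height of the standard normal density: 0.3989423
prob_below   <- pnorm(1.96)    # P(Z <= 1.96): 0.9750021
critical     <- qnorm(0.975)   # value with 97.5% of the area below it: 1.959964

set.seed(42)                   # make the random draws reproducible
draws <- rnorm(3)              # three random values from N(0, 1)

# Two-sided p-value for a hypothetical test statistic z = 2.5
z <- 2.5
p_value <- 2 * pnorm(-abs(z))
```

`pnorm()` and `qnorm()` are inverses of each other, which is why `qnorm(0.975)` returns roughly 1.96.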

Statistical Analysis - Hypothesis Testing

Write functions to obtain test statistics
Quantile functions such as qnorm can help obtain critical values
Cumulative distribution functions such as pnorm help obtain p-values

Functions for Hypothesis Testing

t.test(vector1, vector2, alternative = "two.sided", mu): Student's t-test for one and two samples
prop.test(x, n, p): Test that the probabilities of success in groups are the same, where x holds the counts of successes, n the counts of trials and p the hypothesized probabilities
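
A sketch of both tests on simulated data. The group names, sample sizes, means and counts below are all invented for illustration, so the results carry no real-world meaning.

```r
set.seed(1)   # reproducible simulated scores
group_a <- rnorm(30, mean = 21, sd = 4)   # hypothetical scores, group A
group_b <- rnorm(30, mean = 23, sd = 4)   # hypothetical scores, group B

# Two-sample test of equal means
two_sample <- t.test(group_a, group_b, alternative = "two.sided")

# One-sample test that the mean of group A equals 21
one_sample <- t.test(group_a, mu = 21)

# Test that two groups have the same success probability:
# 45 successes out of 100 versus 60 successes out of 100
prop_res <- prop.test(x = c(45, 60), n = c(100, 100))

two_sample$p.value   # each result is a list; the p-value is one element
```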

Markdown

What is Markdown?

Markdown is a lightweight markup language
Markup languages produce documents from plain text
Originally designed to make it easier to write HTML

Why use Markdown?

Markdown's goal is to be "as easy-to-read and easy-to-write as is feasible"
Simple syntax
Focus on content

Markdown converted to HTML

<h2>Header</h2>
<h3>Subheading</h3>
<p>Plain text</p>
<em>italic</em>, <em>italic</em>
<strong>bold</strong>, <strong>bold</strong>, 
<h3>Unordered List</h3>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<h3>Ordered List</h3>
<ol>
<li>Item 1</li>
<li>Item 2</li>
</ol>
<a href="http://rstudio.com">RStudio</a>
<blockquote><p>Block quote.</p></blockquote>

R Markdown

What is R Markdown?

Creation of dynamic documents, presentations and reports.
Ease of Markdown syntax with the rendering of R code to produce output

Why R Markdown?

Ease of use
Flexibility: Can be converted into different formats
Get new data? Use different parameters? Regenerate the report without copy/paste
Encourages transparency: Displaying code + output will help colleagues and yourself
Interactive nature. Explore the code and analysis.

How does R Markdown Work?

Create an .Rmd report that includes Markdown and R code chunks
knitr is an R package that runs the R code chunks and integrates their output into the rendered R Markdown document
Pandoc is a universal document converter that converts the knitted Markdown to HTML, Word, PDF, etc.

Code Chunk Options

echo: Whether to show code
eval: Whether to evaluate code
message: Whether to show messages
warning: Whether to show warnings
fig.width and fig.height: Modify the figure output
cache: Save the output of the chunk.

Interactive Data Visualization Packages and Tools

shiny: This package allows users to create apps with a front-end user interface and a server component
htmlwidgets: This package provides a framework for using JavaScript visualization libraries from R
flexdashboard: This package allows users to create interactive dashboards

Free Resources
