2017-11-08

Introduction - What is R?

  • "R is a free software environment for statistical computing and graphics"
  • R is a multifaceted programming language that allows users to:
    • Import and export data from a variety of formats
    • "Data Wrangling": The process of manipulating raw data into a format to be used for analysis
    • Data Analysis: Summarizing data, exploratory data analysis, statistical/machine learning algorithms
    • Create interactive apps/dashboards
    • Create reports and presentations in pdf, html, Microsoft Word

Introduction - Why use R?

  • R is free and open source
  • It can be used on Windows, OS X, and Linux operating systems
  • R is a general-purpose programming language: it can be used to automate analyses and for development
  • There is a large community of users and developers
  • Comprehensive R Archive Network (CRAN)
    • A network of mirrors that act as the primary service for distributing R binaries, packages and documentation
    • Over 11,000 user contributed packages
  • R makes reproducible research easy!
  • R is widely used in industry and academia

Introduction - Integrating into R

  • R can be challenging for new users
  • R is a scripting language
  • You can use a Graphical User Interface (GUI)
  • You can use only the command line
  • R is interactive: users interactively request output
  • R can be used in batch processes (automation)

Introduction - RStudio

  • RStudio is an integrated development environment (IDE) for R programming
  • There is a console for executing R code
  • Syntax highlighting colors R scripts to help users identify important R language features.
    • Helpful in avoiding typos/syntax errors and identifying the functions being used
  • There are tools for plotting, debugging and workspace management

Interactive R

  • R is an interpreted language
  • R statements are converted to machine instructions as they are encountered
  • A prompt ">" is presented to users
  • R statements are evaluated and then a result is returned

Operators - Arithmetic

  • + : add
  • - : subtract
  • * : multiply
  • ^ : raise to the power
  • / : divide
  • %% : give the remainder of dividing the first by the second (x mod y)
  • %/% : give the integer quotient of dividing the first by the second (integer division)
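
A minimal sketch of the arithmetic operators listed above. The values of x and y are arbitrary and chosen only for illustration.

```r
# Two example values (hypothetical, for illustration only)
x <- 17
y <- 5

x + y    # addition
x - y    # subtraction
x * y    # multiplication
x / y    # division
x ^ 2    # raise to the power
x %% y   # remainder: 17 = 3 * 5 + 2, so this gives 2
x %/% y  # integer division: gives 3
```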

Operators - Relational and Logical

  • > : Produces a Boolean value by checking to see whether one element is greater than another
  • < : Produces a Boolean value by checking to see whether one element is less than another
  • >= : Produces a Boolean value by checking to see whether one element is greater than or equal to another
  • <= : Produces a Boolean value by checking to see whether one element is less than or equal to another
  • == : Produces a Boolean value by checking to see whether one element is equal to another
  • != : Produces a Boolean value by checking to see whether one element is not equal to another
  • ! : Negates a Boolean value (logical NOT)
  • & : Produces a Boolean value by checking whether both corresponding elements are TRUE
  • | : Produces a Boolean value by checking whether at least one of the corresponding elements is TRUE
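
The operators above work element by element on vectors. A short sketch, using two made-up vectors a and b:

```r
# Example vectors (hypothetical values)
a <- c(1, 5, 10)
b <- c(2, 5, 3)

a > b               # FALSE FALSE  TRUE
a == b              # FALSE  TRUE FALSE
a != b              #  TRUE FALSE  TRUE
(a > 1) & (b < 5)   # FALSE FALSE  TRUE  (both conditions must hold)
(a > 1) | (b < 5)   #  TRUE  TRUE  TRUE  (at least one condition holds)
!(a > b)            #  TRUE  TRUE FALSE  (negation)
```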

Operators - Assignment

  • <-, =, <<- : left assignment to create an object
  • ->, ->> : right assignment to create an object

Operators - Other

  • %in% : A Boolean operator that determines whether an element in one vector is in another
  • : : the colon operator, useful for creating a sequence of numbers (e.g. 1:10)
  • $ : Select a certain column in a data frame
  • %*% : matrix multiplication
  • ~ : formula operator mostly used for statistical modeling
  • ? : help
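
A brief sketch of these operators in use. The data frame df and the matrix m are hypothetical examples, not part of the original material.

```r
# %in%: is each element on the left found in the vector on the right?
in_set <- c(2, 7) %in% c(1, 2, 3)   # TRUE FALSE

# Colon operator: integer sequence from 1 to 5
one_to_five <- 1:5

# $ : select a column from a (hypothetical) data frame
df <- data.frame(id = 1:3, score = c(90, 85, 70))
df$score

# %*% : matrix multiplication (not element-wise)
m <- matrix(1:4, nrow = 2)
prod_m <- m %*% m

# ~ : a formula object, as used in modeling functions
f <- score ~ id
class(f)   # "formula"

# ?mean would open the help page for mean() in an interactive session
```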

Intro to R Exercise

  1. Type "Hello, World!"
  2. 56934 + 34356
  3. \(49^{15}\)
  4. Calculate the circumference (in miles) of the earth: \(2 \times \pi \times 3959\) (in R, pi represents \(\pi\))

Intro to R Exercise - Solutions

"Hello, World"
## [1] "Hello, World"
56934 + 34356
## [1] 91290
49^15
## [1] 2.253934e+25
2*pi*3959
## [1] 24875.13

Objects and Environments

  • R stores data and results in active memory as named objects
  • Users can perform operations on objects
  • A variable is a binding between symbols and objects
  • An environment is a place to store variables
  • When you create new variables in R, you're adding them into the global environment
  • To create a variable, you use one of the assignment operators "<-", "->", "=", or the assign() function
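
A small sketch of the different ways to create a variable. The names x, y, z and w are arbitrary.

```r
x <- 10           # leftward assignment (the most common style)
20 -> y           # rightward assignment
z = 30            # "=" also assigns at the top level
assign("w", 40)   # assign() takes the variable name as a string

ls()              # lists the variables in the current environment
exists("w")       # TRUE: the name w is now bound to an object
```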

Object Attributes and Classes

  • R objects have attributes, which store their metadata
  • Attributes can be obtained for an object using the attr() and attributes() functions
  • A class describes the object's abstract type
x <- data.frame(VAR1 = c(1,2), VAR2 = c(3,4)) ; attributes(x); mode(x)
## $names
## [1] "VAR1" "VAR2"
## 
## $row.names
## [1] 1 2
## 
## $class
## [1] "data.frame"
## [1] "list"

Data Structures: Vectors

  • The most basic object in R is a vector
  • Every element in a vector must have the same data type
  • Scalars do not exist in R; a single number is just a vector of length 1 with the mode numeric
x <- 1
x
## [1] 1
  • A character string is a vector of length 1 with the mode character
my_name <- "Brian Pattiz"
my_name
## [1] "Brian Pattiz"

Other vectors

  • Integer vectors store whole numbers (written with an L suffix, e.g. 1L) rather than double-precision numeric values
  • Boolean (TRUE/FALSE)
  • Complex Numbers
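
A quick sketch showing how these vector types can be created and checked with typeof(). The variable names are illustrative only.

```r
n    <- 5L        # the L suffix creates an integer
flag <- TRUE      # logical (Boolean)
z    <- 3 + 2i    # complex number

typeof(n)      # "integer"
typeof(5)      # "double": plain numbers are double-precision by default
typeof(flag)   # "logical"
typeof(z)      # "complex"
```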

Multiple-length vectors

  • Using the concatenate function c(), users can create vectors with multiple elements
my_vec <- c(1:10, 13, 55, 63, 536)
my_vec
##  [1]   1   2   3   4   5   6   7   8   9  10  13  55  63 536

Data Structures: Factors

  • A factor is a vector used to store categorical data
  • They contain two attributes: the class factor and levels, which define the set of allowed values
  • They are useful for knowing the set of possible values, even when some aren't present in the data
stem_majors <- factor(c("Biology", "Chemistry", "Biology","Computer Science"),
  levels = c("Biology", "Chemistry", "Physics", "Engineering", "Mathematics", "Computer Science", "Statistics"))
stem_majors
## [1] Biology          Chemistry        Biology          Computer Science
## 7 Levels: Biology Chemistry Physics Engineering ... Statistics

Data Structures: Matrices and arrays

  • Adding the dimension attribute to a vector creates an array
  • Matrices are a special case of an array with two dimensions: the number of rows and the number of columns
mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2)
mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
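
To illustrate the first point, a vector can be turned into a matrix simply by setting its dim attribute. This is a sketch with made-up values.

```r
v <- 1:6
dim(v) <- c(2, 3)   # v becomes a 2 x 3 matrix, filled column by column
v
is.matrix(v)        # TRUE

# An array with three dimensions: 2 rows, 3 columns, 4 "layers"
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)
```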

Data Structures: Lists

  • Lists are similar to vectors, but allow for multiple data types
my_list <- list("foo", c(pi, exp(2)), TRUE)
my_list
## [[1]]
## [1] "foo"
## 
## [[2]]
## [1] 3.141593 7.389056
## 
## [[3]]
## [1] TRUE
my_list[[2]]
## [1] 3.141593 7.389056

Data Structures: Dataframes

  • The data structure most commonly used for storing data in R
  • A data frame is a list of equal-length vectors, so its columns can store different modes of data
my_df <- data.frame(ID = paste0("000", 1:10), ACT = round(runif(10, 17, 34)))
my_df 
##       ID ACT
## 1   0001  31
## 2   0002  32
## 3   0003  32
## 4   0004  23
## 5   0005  23
## 6   0006  20
## 7   0007  17
## 8   0008  21
## 9   0009  34
## 10 00010  19

Functions

  • Functions are a fundamental building block of R and many other programming languages
  • They are a set of instructions that takes inputs and returns an output
  • Functions are very useful for repeatable tasks
  • R contains many base functions that are very useful
  • R allows users to create their own functions to accomplish a particular task
myFunction <- function(arguments){
  
  # instructions based on the arguments
  # generate output
  
  return(output)
  
}
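
As a concrete instance of the template above, here is a sketch of a small user-defined function. The function name and the temperature conversion are hypothetical examples.

```r
# Convert temperatures from Fahrenheit to Celsius
fahrenheit_to_celsius <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9
  return(temp_c)
}

# Functions work on whole vectors at once
fahrenheit_to_celsius(c(32, 212))   # 0 100
```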

Character strings

  • Character strings are an important part of data analysis
  • Using strings in base R can be challenging to learn
  • The package stringr was created to make working with strings easier

Functions

  • str_length(string): Calculates the length of a string
  • str_c(string1, string2): Join multiple strings together
  • str_sub(string, start, end): Extract and replace substrings from a character vector
  • str_split(string, pattern): Splits up one string into multiple strings
  • str_detect(string, pattern): A Boolean function that indicates if there is a pattern match
  • str_subset(string, pattern): Find the elements that match the pattern and return them in a vector
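
A short sketch of these functions, assuming the stringr package is installed. The example strings are made up.

```r
library(stringr)   # assumes install.packages("stringr") has been run

s <- "Data Wrangling"

str_length(s)                            # 14
str_c("Data", "Wrangling", sep = " ")    # "Data Wrangling"
str_sub(s, 1, 4)                         # "Data"
str_split(s, " ")                        # a list holding c("Data", "Wrangling")
str_detect(c("apple", "banana"), "an")   # FALSE TRUE
str_subset(c("apple", "banana"), "an")   # "banana"
```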

Loops and Conditions: Conditionals

Statements/expressions that perform computations depending on whether a Boolean (TRUE/FALSE) condition is met

If Structure

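The expression can be either a logical or numeric vector, but it should have length one: older versions of R test only the first element, and R 4.2.0 and later raise an error for longer vectors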

if (expression){
  
    statement
}

Loops and Conditions: If-else Structure

For statement2 to be evaluated, the expression must be FALSE.

if (expression) {
   statement1
} else {
   statement2
}
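
A minimal sketch of the if-else structure. The score and the 60-point cutoff are hypothetical.

```r
score <- 72   # hypothetical test score

if (score >= 60) {
  result <- "pass"   # runs when the expression is TRUE
} else {
  result <- "fail"   # runs when the expression is FALSE
}

result   # "pass"
```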

Loops and Conditionals: for Loop

  • This control structure allows code to be executed iteratively.
  • Uses a loop variable that takes each value of the sequence in turn, so the statement knows where it is in the iteration.
  • Used when the number of iterations is known.
for (value in sequence) {

  statement

}
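
A small sketch of a for loop that adds up the numbers 1 to 5. The variable names are illustrative.

```r
total <- 0

for (i in 1:5) {
  total <- total + i   # i takes the values 1, 2, 3, 4, 5 in turn
}

total   # 15
```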

Loops and Conditions: While Loops

  • This control structure allows code to be evaluated repeatedly while the Boolean test condition is true
  • One of the main drawbacks is that if the loop is not carefully constructed, the code can execute indefinitely.
while(test_condition){
  
  statement
  
}
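
A sketch of a while loop that doubles a number until it reaches at least 100. The starting value and the limit are arbitrary.

```r
n <- 1

while (n < 100) {
  n <- n * 2   # updating n inside the loop ensures the condition eventually fails
}

n   # 128
```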

Some Useful Functions

  • seq(from, to, by): A function that generates a sequence
  • rep(vector, times): A function that repeats a vector a given number of times
  • any(vector): A Boolean function that determines whether given a condition, a vector has any true values
  • all(vector): A Boolean function that determines whether given a condition, a vector has all true values
  • ifelse(condition, yes, no): A vectorized version of the if-else construct
  • print(object): a generic function that prints its argument
  • example(topic): a function that executes the example in R's help documentation
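
A quick sketch of several of these functions. The vector v is a made-up example.

```r
seq(from = 0, to = 10, by = 2)   # 0 2 4 6 8 10
rep(c("a", "b"), times = 2)      # "a" "b" "a" "b"

v <- c(3, 8, 12)
any(v > 10)                      # TRUE: at least one element exceeds 10
all(v > 10)                      # FALSE: not every element exceeds 10
ifelse(v > 5, "high", "low")     # "low" "high" "high"
```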

Testing and Coercing Vectors

  • Testing a vector's data type is useful for operations such as merging
  • You can find a vector's data type using the function typeof()
  • Testing for a particular data type can be done with the "is" family of functions
  • Coercion of elements in a vector can occur automatically since the elements in a vector must be one type
  • Forced coercion is sometimes necessary for operations such as merging; the "as" family of functions is useful for this
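
A short sketch of automatic and forced coercion. The vector contents are made up for illustration.

```r
# Mixing types in c() coerces everything to the most flexible type
v <- c(1, "2", TRUE)
typeof(v)          # "character"
v                  # "1" "2" "TRUE"

# Testing with the "is" family
is.character(v)    # TRUE
is.numeric(v)      # FALSE

# Forced coercion with the "as" family
nums <- as.numeric(c("1", "2", "3.5"))
typeof(nums)       # "double"
as.integer(TRUE)   # 1
```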

Missing and Null Values

  • Missing values are represented in R as NA
  • NULL represents a value that does not exist, rather than being unknown
x <- c(3, NA, 3, NA, 5, 9)

is.na(x)
## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

Statistical and Mathematical Functions

  • sum(vector): sum the values of a vector
  • rowSums(dataframe) and colSums(dataframe): sum the row/columns of a dataframe
  • min(vector): the minimum value of a vector
  • max(vector): the maximum value of a vector
  • mean(vector): the average value of a vector
  • sd(vector): the standard deviation of a vector

  • If the vector that you want to apply these functions to contains missing values, use the 'na.rm = TRUE' argument
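
A sketch of how missing values affect these functions, using a small vector with two NA entries.

```r
x <- c(3, NA, 3, NA, 5, 9)

sum(x)                  # NA: missing values propagate by default
sum(x, na.rm = TRUE)    # 20
mean(x, na.rm = TRUE)   # 5
min(x, na.rm = TRUE)    # 3
max(x, na.rm = TRUE)    # 9
```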

The Tidyverse Package Suite

A collection of packages created by Hadley Wickham, the Chief Scientist at RStudio

install.packages("tidyverse")
  • R comes with many useful functions for data analysis already installed; however, it can be extended with these powerful packages
  • These packages improve upon existing functions by:
    • Increasing speed
    • Making code easier to read and write
    • Being less reliant on loops and conditionals

Reading Data into R

R can read in data from flat files and databases, and can scrape it from the internet

Flat File Packages

  • readr: provides a fast and easy way to read csv, tsv and fwf formats
  • readxl: makes it easier to extract data from Excel
  • haven: allows R to read and write data formats from other statistical software including SAS, SPSS and Stata

Database Packages

  • DBI: This package helps R connect to an external database management system
  • ROracle, RMySQL, RSQLite, RPostgres, RSQLServer are examples of interfaces to database management systems

Web data

  • jsonlite and xml2 make it easier to parse JSON and XML data
  • rvest: provides tools to scrape data from webpages

Reading Data into R - Using the readr package

  • readr's read_csv has advantages over the base R read.csv function:
    • Faster
    • Does not convert character strings to factors (read.csv did so by default before R 4.0.0)
    • Parses common date/time formats with ease
my_dataset <- read_csv("path/my_file.csv")

Reading Data into R - the DBI package

  • connects to a DBMS
  • creates and executes statements sent
  • gathers results from those statements

Useful Functions

  • dbConnect(driver,...): Using a driver, a connection is created and opened.
  • dbGetQuery(conn, statement): This function submits a query, gathers the output, and returns it as a dataframe
  • dbWriteTable(conn, name, value): This function will write an object or dataframe to a database
  • dbListTables(conn): returns a vector of table names
  • dbListFields(conn, name): returns a vector of field names
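
A sketch of these functions working together. It assumes the DBI and RSQLite packages are installed and uses a temporary in-memory SQLite database, so no real server is needed. The table name "cars" is arbitrary, and mtcars is a dataset that ships with R.

```r
library(DBI)   # assumes DBI and RSQLite are installed

# Open a connection to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a built-in dataframe to the database as the table "cars"
dbWriteTable(con, "cars", mtcars)

tables <- dbListTables(con)           # names of the tables in the database
fields <- dbListFields(con, "cars")   # column names of the "cars" table

# Submit a query and collect the result as a dataframe
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl")
res

# Close the connection when finished
dbDisconnect(con)
```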

Manipulating Data

Principles of Tidy Data

From Hadley Wickham in R for Data Science:

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Packages for Manipulating Data

  • dplyr: Provides a grammar of data manipulation
  • tidyr: Helps with creating 'tidy' data
  • data.table: Provides a faster, enhanced version of data.frame and useful syntax for manipulating data
  • sqldf: Manipulate dataframes with SQL statements

Manipulating Data - Base R

Useful functions

  • subset(dataframe, test_condition): Filters a dataframe using a conditional statement
  • merge(dataframe1, dataframe2, by = "ColumnName"): merge two dataframes together by a common column name
  • unique(dataframe): returns a vector/dataframe with the duplicated elements removed
  • sort(vector, decreasing = FALSE): sort a vector or dataframe column in ascending or descending order
  • aggregate(dataframe, grouped_variable, function): applies a function to each group of a dataframe, reducing each group to a single value
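
A sketch of these base R functions on two small made-up dataframes. The column names ID, ACT and Major are illustrative.

```r
# Hypothetical example data
students <- data.frame(ID = c(1, 2, 3, 4), ACT = c(31, 23, 20, 34))
majors   <- data.frame(ID = c(1, 2, 3, 4),
                       Major = c("Biology", "Physics", "Biology", "Physics"))

# Keep only the rows that meet a condition
high <- subset(students, ACT > 22)

# Join the two dataframes on their common column
merged <- merge(students, majors, by = "ID")

# Distinct values and sorting
unique(merged$Major)
sort(students$ACT, decreasing = TRUE)

# Mean ACT score for each major
avg <- aggregate(merged["ACT"], by = list(Major = merged$Major), FUN = mean)
avg
```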

Manipulating Data - Overview of dplyr

The verbs that dplyr uses to provide a grammar of data manipulation:

  • select(): get a subset of the data by choosing the columns
  • mutate(): create new columns or modify existing ones
  • filter(): get a subset of data based off a conditional statement
  • arrange(): reorder the rows of the data
  • summarise(): reduce the data into single values

Other functions

  • group_by(): allows you to perform any sequence of operations by group
  • rename(): rename the columns
  • inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join(): join dataframes (by default, dplyr joins on the columns the dataframes have in common)
  • case_when(): a vectorized version of multiple if-else statements

Manipulating Data - Pipeline Operator (from the magrittr package)

The pipeline operator %>% is used to string together multiple functions in a sequence of operations

  • Eliminates the need to create multiple objects
  • Makes code easy to write and read
  • Easy to go back and add to the sequence of operations
x %>% f(y) %>% g(z)   # equivalent to g(f(x, y), z)
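
A sketch of a pipeline that chains several dplyr verbs, assuming the dplyr package is installed. It uses mtcars, a dataset that ships with R. The mpg cutoff of 20 and the new column name wt_lb are arbitrary choices for illustration.

```r
library(dplyr)   # assumes dplyr is installed; it also provides %>%

result <- mtcars %>%
  filter(mpg > 20) %>%                  # keep fuel-efficient cars
  select(mpg, cyl, wt) %>%              # keep three columns
  mutate(wt_lb = wt * 1000) %>%         # wt is recorded in units of 1000 lbs
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(mean_mpg = mean(mpg),       # reduce each group to single values
            mean_wt_lb = mean(wt_lb),
            n = n()) %>%
  arrange(desc(mean_mpg))               # highest average mpg first

result
```

Each step receives the output of the previous one as its first argument, so no intermediate objects are needed.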

Exploratory Data Analysis - Overview

The goal of exploratory data analysis is to generate questions about your data. This can lead you to:

  • Summarize your data
  • Create visualizations
  • Transform variables
  • Create simple models to detect patterns/trends

Advanced Methods

  • Cluster analysis: grouping the data into "natural" clusters
  • Factor analysis: identify latent factors to explain variation in the data
  • Analysis of Variance (ANOVA): Comparing the means of subgroups of the data
  • Principal Component Analysis: detect systematic patterns of variations in the data

Exploratory Data Analysis - Summarizing Data

  • An easy thing to do is to look at the structure of your data
    • the str() function will look at the structure of any object
  • Look at the data: View(), head() and tail()
  • Generate frequencies and contingency tables
    • table using table(), multidimensional table using ftable(object), table of proportions using prop.table(v), and table margin using margin.table()
    • crosstables using xtabs() and the gmodels package
  • Generate summary statistics using summary(); for summary statistics by group, use the psych package or the doBy package, which mimics PROC SUMMARY from SAS
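
A sketch of these base R summary tools applied to mtcars, a dataset that ships with R. View() is left out because it only works in an interactive session.

```r
str(mtcars)        # structure: column names, types and first values
head(mtcars, 3)    # first three rows
tail(mtcars, 3)    # last three rows

# Contingency table of cylinders by gears
tab <- table(mtcars$cyl, mtcars$gear)
tab
prop.table(tab)        # cell proportions (sum to 1)
margin.table(tab, 1)   # row totals: number of cars per cylinder count

# The same cross-tabulation with the formula interface
xtabs(~ cyl + gear, data = mtcars)

# Five-number summary plus the mean
summary(mtcars$mpg)
```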

Exploratory Data Analysis - Data Visualization Overview

  • Data visualizations are very useful for helping analysts understand the data and helping to share this understanding with others.
  • R contains very powerful plotting systems

Data Visualization - Base R Plotting

Base R's graphics engine is contained in the following packages:

  • graphics: contains the functions to produce scatter plots, line charts, histograms, box-and-whisker plots, etc.
  • grDevices: functions for writing output to formats such as png, pdf, ps, etc.
  • grid: An extension of base R plotting that makes it easier to create plots at specific locations in the plotting area.

Base R plotting functions

  • plot(x, y, type): the generic plot function where x and y are their respective coordinates of points in the plot.
    • type is an argument which determines what type of plot should be drawn
    • main is the argument for the title of the plot
    • xlab, ylab are titles for the x and y axes
  • Once a plot is made, you can add to it with the functions lines(), text() and points()
  • hist(): This plot displays the distribution of the data by grouping it into bins and showing the frequency of each bin
  • boxplot(): The classic box-and-whisker plot is very useful for visualizing the variation in the data
  • barplot(): This function makes it easy to create a plot with vertical or horizontal bars
  • mosaicplot(): A mosaic plot is useful for displaying the proportions of factors for categorical data
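
A sketch of several of these plotting functions, using mtcars (a dataset that ships with R). Writing to a temporary PDF file is an assumption made here so the example also runs outside an interactive session; in RStudio the pdf() and dev.off() lines can be dropped.

```r
# Send the plots to a temporary PDF file (one page per plot)
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Scatter plot with a title and axis labels
plot(mtcars$wt, mtcars$mpg, type = "p",
     main = "Fuel economy by weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Add a smoothed trend line to the existing plot
lines(lowess(mtcars$wt, mtcars$mpg))

hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
barplot(table(mtcars$cyl), xlab = "Cylinders", ylab = "Number of cars")

# Close the device so the file is written
invisible(dev.off())
```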

Data Visualization - the ggplot2 package

  • Created by Hadley Wickham as an implementation of Leland Wilkinson's The Grammar of Graphics
  • Hadley describes a statistical graphic as

a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system

Data Visualization - ggplot2 Plotting

  • Unlike plotting in base R, ggplot requires the use of a dataframe
  • The plotting system uses the following:
    • aesthetics: transformation of the data into color, size and shape
    • geoms: transformation of the data into points, lines and shapes
    • facets: how the plot is arranged
    • stats: transformation of the data into bins, quantiles and smoothing
    • scales: controls how the data is transformed into aesthetics
    • coordinates: draws the geoms in a systematic way in a specified location

Data Visualization - ggplot2 functions

  • qplot(Variable1, Variable2, data = dataframe): ggplot2's version of plot()
  • ggplot(dataset): the base function where it takes a dataset and an aesthetic mapping aes()

geom layers

  • geom_point(): add a layer of points
  • geom_smooth(): add a smoother from a model
  • geom_bar(): add bars
  • geom_boxplot(): add a boxplot
  • geom_text(): add text

facet layers

  • facet_grid(): lay out panels in a grid defined by faceting variables

Data Visualization - more ggplot2 functions

stat layers

  • stat_identity(): leave data as is (default is "bin" which is suitable for only continuous data and drops missing values)
  • stat_unique(): remove duplicated values

scale layers

  • ggtitle(), labs(), xlab(), ylab(): add and modify title, axis and plot labels
  • lims(), xlim(), ylim(): Set scale limits
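
A sketch that combines several of these layers, assuming the ggplot2 package is installed. It uses mtcars, a dataset that ships with R. The choice of variables and labels is illustrative only.

```r
library(ggplot2)   # assumes ggplot2 is installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +                     # points colored by cylinders
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # linear smoother per panel
  facet_grid(. ~ gear) +                                      # one panel per number of gears
  labs(title = "Fuel economy by weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

print(p)   # draws the plot on the active graphics device
```

The plot is built up by adding layers with +, so individual layers can be added or removed without rewriting the rest.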

Statistical Analysis - Statistical Modeling/Regression Overview

  • R was built for statistical analysis
  • Regression analysis is very simple to perform with R
  • R has many built-in functions to fit many types of models
    • lm(formula, data) and glm(formula, family, data): linear and generalized linear models
    • arima(time_series): ARIMA models
    • loess(formula, data): Local Polynomial Regression
    • nls(formula, data): Nonlinear Least-Squares Regression
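
A sketch of fitting a linear and a generalized linear model with mtcars, a dataset that ships with R. The choice of variables is for illustration only.

```r
# Linear regression: miles per gallon as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)   # coefficients, standard errors, R-squared
coef(fit)      # just the estimated intercept and slope

# Predict mpg for a hypothetical car weighing 3000 lbs
predict(fit, newdata = data.frame(wt = 3))

# Logistic regression: probability of a manual transmission given weight
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit_fit)
```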

Statistical Analysis - Useful Modeling Packages

  • tree: Classification and Regression Trees
  • mlogit: Multinomial-Logit Models
  • caret: Functions for Regression and Classification
  • forecast: Functions for Time Series Forecasting
  • mgcv and VGAM: Generalized Additive Models
  • glmnet: Lasso models
  • survival: Survival Analysis
  • lme4: Linear Mixed Models

Statistical Analysis - Probability Distributions

  • R contains functions for most statistical distributions
    • the p function is for cumulative distribution functions (CDFs)
    • the d function is for density functions (pdfs)
    • the r function is for random number generation
    • the q function is for quantiles
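
A sketch of the four function families for the standard normal distribution. Other distributions follow the same naming pattern (for example dbinom, pbinom, qbinom and rbinom).

```r
dnorm(0)        # density at 0, about 0.3989
pnorm(1.96)     # P(Z <= 1.96), about 0.975
qnorm(0.975)    # the 97.5% quantile, about 1.96

set.seed(42)    # fix the seed so the random draws are reproducible
draws <- rnorm(5)   # five random values from N(0, 1)
draws
```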

Statistical Analysis - Hypothesis Testing

  • Write functions to obtain test statistics
  • Quantile functions such as qnorm can help obtain critical values
  • Cumulative distribution functions such as pnorm help obtain p-values

Functions for Hypothesis Testing

  • t.test(vector1, vector2, alternative = "two.sided", mu): Student's t-test for one and two samples
  • prop.test(vector, n, probability_vector): Test that probabilities of success in groups are the same
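
A sketch of these tests. The two groups are simulated, and the z statistic and success counts are hypothetical values chosen only to show the calls.

```r
# Critical value and p-value by hand for a hypothetical z statistic
z <- 2.1
crit  <- qnorm(0.975)          # two-sided critical value at the 5% level
p_val <- 2 * pnorm(-abs(z))    # two-sided p-value

# Two-sample t-test on simulated data
set.seed(1)
group_a <- rnorm(30, mean = 50, sd = 10)
group_b <- rnorm(30, mean = 55, sd = 10)
t_res <- t.test(group_a, group_b, alternative = "two.sided")
t_res$statistic
t_res$p.value

# Test whether two groups have the same probability of success
# (45 of 100 successes versus 55 of 100, hypothetical counts)
p_res <- prop.test(x = c(45, 55), n = c(100, 100))
p_res$p.value
```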

Markdown

What is Markdown?

  • Markdown is a lightweight markup language
  • Markup languages produce documents from plain text
  • Originally designed to make it easier to write html

Why use Markdown?

  • Markdown's goal is to be "as easy-to-read and easy-to-write as is feasible"
  • Simple syntax
  • Focus on content

Markdown converted to HTML

<h2>Header</h2>
<h3>Subheading</h3>
<p>Plain text</p>
<em>italic</em>, <em>italic</em>
<strong>bold</strong>, <strong>bold</strong>, 
<h3>Unordered List</h3>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<h3>Ordered List</h3>
<ol>
<li>Item 1</li>
<li>Item 2</li>
</ol>
<a href="http://rstudio.com">RStudio</a>
<blockquote><p>Block quote.</p></blockquote>

R Markdown

What is R Markdown?

  • Creation of dynamic documents, presentations and reports.
  • Ease of Markdown syntax with the rendering of R code to produce output

Why R Markdown?

  • Ease of use
  • Flexibility: Can be converted into different formats
  • Get new data? Use different parameters? Regenerate the report without copy/paste
  • Encourages transparency: Displaying code + output will help colleagues and yourself
  • Interactive nature. Explore the code and analysis.

How does R Markdown Work?

  • Create an .Rmd report that includes Markdown and R code chunks
  • knitr is an R package that runs the R code chunks and integrates their output into the document
  • Pandoc is a universal document converter that converts the result to HTML, Word, PDF, etc.

Code Chunk Options

  • echo: Whether to show code
  • eval: Whether to evaluate code
  • message: Whether to show messages
  • warning: Whether to show warnings
  • fig.width and fig.height: Modify the figure output
  • cache: Save the output of the chunk.

Interactive Data Visualization Packages and Tools

  • shiny: This package allows users to create web apps with a front-end user interface and a server component
  • htmlwidgets: This package provides a framework for using JavaScript visualization libraries from R
  • flexdashboard: This package allows users to create interactive dashboards

Free Resources