5/25/2018

Introduction

Outline

  • What is R?
  • What is RStudio?

Specifics on R:

  • Basic R syntax
  • Reading in data
  • Creating plots
  • Regression models
  • Analysis of Variance

What next?

  • Environmental analyses in R
  • The power of R + Markdown: easy formatting that can be re-used in the future

Materials

Introduction to R and RStudio

What is R?

  • R: Statistical programming language
    • Objects belong to different classes (similar to object-oriented programming)
    • Common data classes include: vector (set of numbers or words), matrix or data.frame (similar to a spreadsheet: columns of vectors), list (usually a mixture of vectors, matrices, or other objects–one key feature is that they are not forced to be the same length)
    • Often, issues with analyses or errors are due to using the wrong type of input class for a function.
    • The great strength of R and other open-source languages (e.g. Python, Julia, C, Fortran) is that they are free and you can reproduce your analyses by simply re-running your code!
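
As a quick illustration of the classes listed above, class() and the is.*() functions report what kind of object you have. The object names and values below are made up for this sketch.

```r
# Hypothetical objects, one per common class
counts <- c(3, 7, 2)                            # numeric vector
sites  <- c("A", "B", "C")                      # character vector
grid   <- matrix(1:6, nrow = 2)                 # matrix
survey <- data.frame(site = sites, n = counts)  # data.frame
mixed  <- list(counts, sites, grid)             # list: elements may differ in length

class(counts)    # "numeric"
class(sites)     # "character"
class(survey)    # "data.frame"
class(mixed)     # "list"
is.matrix(grid)  # TRUE
```

Checking the class of an input is often the fastest way to see why a function is complaining.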

A manifesto for Open-Source Languages

  • By embracing code, you free yourself from painful trips down memory lane to recall exactly which buttons you pressed to perform a specific task.
  • "Open-source" means that you can inspect the code yourself and understand how and what functions are doing!
  • R is especially well-suited for ecological statistical analyses (and increasingly more complex forms of modeling). There are TONS of packages (more than 10,000) with easy installation. So for pretty much any type of analysis you can think of, there is some way to do it in R!

What is RStudio?

  • RStudio:
    • "Integrated development environment" (IDE) for R
      • The term IDE is a fancy way of saying that you can run code and see output within the same program.
    • Powerful, helpful, and free user interface for R
    • All of this is just another way of saying that RStudio is a software package where you can run the R programming language more easily.

Accessing R and RStudio

  • Debugging R errors:
    • "Debugging" means to remove bugs–deal with errors in your code
    • You can use the error to help you learn more and to fix the problem!
    • Often, searching on the internet with the error message will be helpful.
    • StackOverflow is a great website where people contribute questions and get answers.
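
A minimal sketch of capturing an error message so that you can read or search for it. The mistake here (adding a number to text) is invented for illustration; tryCatch() and conditionMessage() are part of base R.

```r
# A common mistake: passing text where a number is expected
result <- tryCatch(
  "5" + 1,
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
result  # the error text, which you can paste into a search engine
```

In everyday use you will simply see the error printed in the Console; the point is that the message itself is the best search term.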

Anatomy of RStudio

  • Left: Console
    • Text on top at launch: version of R that you’re running
    • Below that is the prompt
  • Upper right: Workspace and command history
  • Lower right: Plots, access to files, help, packages, data viewer

(from rworkshop-mem)

Key tips about R

  • Version numbers are used to track changes to software, such as bug fixes and updates. For instance, Windows 10 is a different version than Windows XP.

  • When you open R or RStudio, the version number pops up in the Console (the area where code is run in RStudio; in R's graphic user interface (GUI) the default screen is just the console).

  • You can also run the command R.Version() or R.version.string

  • To find out the version of RStudio go to Help \(\rightarrow\) About RStudio

  • It's good practice to keep both R and RStudio up to date
    • It can be painful to deal with re-installing R to update it, but there are good resources on how to carry your packages over, such as r-bloggers. Note that R and RStudio are updated separately: updating RStudio does not update R, and a new RStudio release may require a reasonably recent version of R.
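
The version checks mentioned above can also be run from code. This short sketch uses only base R functions; the version number "3.5.0" is an arbitrary example threshold.

```r
R.version.string          # full version string of the running R
getRversion()             # the version as an object that can be compared
getRversion() >= "3.5.0"  # TRUE if R is at least version 3.5.0
packageVersion("stats")   # version of an installed package
```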

Using R: Syntax and a general introduction

Standard R syntax: Overview

  • Syntax refers to the rules that each programming language has about correctly defining functions, declaring variables, etc.
  • Standard R syntax is similar to, but not the same as, that of other major programming languages (especially Python and Julia).
  • Usually, you would be working with datasets (which are often matrices or data.frames).

R syntax: Helpfiles

  • One excellent feature of R is its help files.
    • For any function, you can always use the command: ? and then type in the function name.
    • The help page will tell you about the types of objects (classes) that it can accept, and what the function will produce.
    • For example:
?plot # Tell me what plot does.
?mean # How does R calculate the mean of several values?
?glm # What is generalized linear modeling in R?
  • Note that # means make a comment (text that is not interpreted as code). Comments are a great way to keep track of what you were doing in the past.
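
Besides ?, a couple of base R functions give a quick look at a function without opening its help page. A small sketch using sd as the example function:

```r
args(sd)                        # prints the argument list of sd()
sd_args <- names(formals(sd))   # the same information as a character vector
sd_args                         # "x" "na.rm"
# example(sd)                   # would run the examples from the ?sd help page
```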

Variables in R: vectors, matrices, and data.frames

The most fundamental objects you will create in R are vectors, matrices, and data.frames.

c(1:5) # A vector
## [1] 1 2 3 4 5
matrix( rep(c(1:5), 3), nrow=3, byrow=TRUE) # A matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
data.frame(Site=c("A","B","C","D","E"), # A data frame
           MouseCount=rpois(5,10), 
           MuntjacCount=rpois(5,5),
           GaurCount=rpois(5,0.25))
##   Site MouseCount MuntjacCount GaurCount
## 1    A         13            5         0
## 2    B         10            4         0
## 3    C         10            5         0
## 4    D          9            5         0
## 5    E         12            3         0
days <- c(1:5) # make a variable named days 
               # and assign the values 1-5
days # show me days
## [1] 1 2 3 4 5

Matrices versus data.frames

  • Matrices and data.frames are very similar–except that data.frames can support multiple classes of objects.
    • That means that you can have one column be strings (Site) and the rest be numeric (Count).
  • We can see this if we pass the same columns to the matrix command instead of data.frame:
matrix( c( c("A","B","C","D","E"),
           MouseCount=rpois(5,10),
           MuntjacCount=rpois(5,5),
           GaurCount=rpois(5,0.25)), ncol=4)
##      [,1] [,2] [,3] [,4]
## [1,] "A"  "13" "4"  "0" 
## [2,] "B"  "17" "6"  "1" 
## [3,] "C"  "15" "5"  "0" 
## [4,] "D"  "9"  "5"  "0" 
## [5,] "E"  "12" "3"  "0"
# Everything is interpreted as a string and the counts are not numeric
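
To check the difference without relying on random numbers, here is a small sketch with hand-typed (invented) counts. typeof() shows the single storage type of a matrix, while sapply(..., class) shows one class per data.frame column.

```r
# Hypothetical counts, typed in by hand so the result is reproducible
site  <- c("A", "B", "C")
mouse <- c(13, 10, 9)

as_matrix <- cbind(site, mouse)                           # matrix: one type for all cells
as_df     <- data.frame(Site = site, MouseCount = mouse)  # data.frame: one type per column

typeof(as_matrix)       # "character": the counts were converted to text
sapply(as_df, class)    # one class per column; MouseCount stays "numeric"
mean(as_df$MouseCount)  # arithmetic works on the data.frame column
```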

Workspaces

  • A workspace is "your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions) that you have entered. At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started." Quick-R

  • For instance, we can see what is in the workspace right now using:

ls()
  • We can also see what directory (file location or path) R is running out of.
getwd()
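
A short sketch of adding an object to the workspace and removing it again. The object name demo_counts is made up for this example.

```r
demo_counts <- c(4, 8, 15)   # hypothetical object
"demo_counts" %in% ls()      # TRUE: it is now in the workspace
rm(demo_counts)              # remove it again
"demo_counts" %in% ls()      # FALSE
# rm(list = ls())            # would clear the whole workspace; use with care
```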

Workspaces + Quitting

  • Note that when you exit (or quit) R, it will always ask Save workspace image?
    • If you select Save, then everything that is in your current workspace (all the variables, analyses, etc.) will be stored and re-loaded when you re-open R.
    • Personally, I do not like to Save my workspace–I always want it to be empty so that I can control what I do and do not see in every session (a session is each instance where you open and run a program).
  • For more information: R-bloggers

Using save to preserve specific objects

  • If you have specific objects (such as analyses or tables, or anything else) that you want to save to re-load later (and avoid re-running certain bits of code that might be slow or memory-intensive), you can use the command save or save.image.
  • Example: I have a regression (regression_mod) that I want to save:
save(regression_mod, file="~/Rscripts/regression_mod.RData") 
# Unix (Linux/Mac) style path
  • The alternative is to call save.image which would save everything in the current workspace.
    • This is similar to the Quit R Session: Save Workspace Image? prompt, but better:
    • You control where the objects are located (file name)!
save.image(file="C:/Documents/MyWorkspace-5-26-2018.RData") # Windows path: write it with / (or \\), because a single \ starts an escape sequence in R strings
  • You can then re-load these objects into a new session with the commands:
load(file="C:/Documents/MyWorkspace-5-26-2018.RData")
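
The save and load round trip can be tried safely with a temporary file. In this sketch the model is a stand-in fitted to the built-in cars dataset, and tempfile() supplies a throwaway path, so nothing on your machine is overwritten.

```r
regression_mod <- lm(dist ~ speed, data = cars)  # stand-in model from a built-in dataset
out_file <- tempfile(fileext = ".RData")         # temporary path for this demo

save(regression_mod, file = out_file)            # write the object to disk
rm(regression_mod)                               # simulate starting a new session
loaded_names <- load(out_file)                   # restore; returns the object names
loaded_names                                     # "regression_mod"
coef(regression_mod)                             # the model is usable again
```

Note that load() restores objects under their original names, so it can silently overwrite an object of the same name in the current workspace.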

R packages

install.packages("gdata") # a package that specializes in dealing with excel spreadsheets
install.packages("ggplot2") # a popular package for plots that is an alternative to base R
  • Note that you only have to install each package once!
    • It is like downloading a file or software: once it is on your machine, it's on your machine!

Using R packages

  • When you want to use a specific package, you load it by using the command library:
library("gdata")
library("ggplot2")
  • You must run the library command every time you want to use the functions from a specific package
    • or you can use the package name to call a specific function, but that is more advanced (e.g. gdata::read.xls - take the function to read excel files from gdata without me having to use the library call to load the whole package into the workspace)
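
A small sketch of checking whether a package is installed before using it, and of the package::function form. requireNamespace() is part of base R; the median example uses the stats package that ships with R, so it runs even if no extra packages are installed.

```r
# requireNamespace() returns TRUE or FALSE instead of stopping with an error
if (requireNamespace("ggplot2", quietly = TRUE)) {
  message("ggplot2 is installed")
} else {
  message("ggplot2 is missing; run install.packages(\"ggplot2\") once")
}

# The package::function form works for any installed package
stats::median(c(2, 9, 4))   # 4
```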

R syntax: data.frame example

Let's move on to an example that resembles real-world data.

Here is a simulated (fake!) dataset entitled xishuangbanna_birds_df with bird counts and an ecological covariate (eco_cov).

## # A tibble: 26 x 6
##    site   eco_cov Bulbul Hornbill TitBabbler Babbler
##    <fctr>   <dbl>  <dbl>    <dbl>      <dbl>   <dbl>
##  1 A         7.49   5.00     1.00       3.00    2.00
##  2 B        10.7    7.00     1.00      11.0     1.00
##  3 C         9.61   9.00     1.00       2.00    1.00
##  4 D        14.4    9.00     1.00      12.0     4.00
##  5 E        10.6    9.00     1.00       8.00    1.00
##  6 F        11.6    7.00     1.00       5.00    3.00
##  7 G         7.09   6.00     1.00       6.00    3.00
##  8 H        13.6   12.0      1.00       7.00    1.00
##  9 I         5.87   6.00     1.00       3.00    1.00
## 10 J         8.20   7.00     1.00       5.00    1.00
## # ... with 16 more rows

R syntax: data.frame example

Let's say we wanted to extract observations for bulbuls. In traditional R syntax (often called base R), we would use the command:

xishuangbanna_birds_df$Bulbul # Extracting a column from a dataset
##  [1]  5  7  9  9  9  7  6 12  6  7  8  8  7  9  8  9 12 12  6 21  9  7 14
## [24] 18  6  8

If we wanted to see the bulbul counts at the first five sites (A-E):

xishuangbanna_birds_df$Bulbul[1:5] # Extracting values from vectors
## [1] 5 7 9 9 9

R syntax: data.frame example

To preview the data frame, you can use the command str().

str(xishuangbanna_birds_df)
## 'data.frame':    26 obs. of  6 variables:
##  $ site      : Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ eco_cov   : num  7.49 10.66 9.61 14.43 10.58 ...
##  $ Bulbul    : num  5 7 9 9 9 7 6 12 6 7 ...
##  $ Hornbill  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ TitBabbler: num  3 11 2 12 8 5 6 7 3 5 ...
##  $ Babbler   : num  2 1 1 4 1 3 3 1 1 1 ...
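
Beyond $ and str(), square brackets select rows and columns at the same time. The sketch below uses a small stand-in data.frame named birds with invented values, since xishuangbanna_birds_df itself is not provided here.

```r
# Small stand-in for xishuangbanna_birds_df (values invented for this example)
birds <- data.frame(site    = c("A", "B", "C", "D"),
                    Bulbul  = c(5, 7, 9, 9),
                    Babbler = c(2, 1, 1, 4))

head(birds, 2)              # first two rows
dim(birds)                  # 4 rows, 3 columns
birds[2, "Bulbul"]          # row 2 of the Bulbul column: 7
birds[birds$site == "D", ]  # all columns for site D
birds[["Babbler"]]          # same as birds$Babbler
```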

The "Tidyverse": a new style of R syntax

Some R users dislike the rules for extracting columns from datasets or values from vectors.

They argue that it seems clunky and not that elegant.

There is an alternate "ecosystem" of packages called the Tidyverse (developed by Hadley Wickham; the link contains excellent instruction materials). In the Tidyverse, you can perform standard R commands using a different set of rules.

Tidyverse examples

To call the bulbul column, we can use this tidyverse command:

library(dplyr) # select() and filter() are provided by the dplyr package
select(xishuangbanna_birds_df, Bulbul)

If we wanted to only see the sites where Bulbul observations were greater than 0:

filter(xishuangbanna_birds_df, Bulbul > 0)
##    site   eco_cov Bulbul Hornbill TitBabbler Babbler
## 1     A  7.489038      5        1          3       2
## 2     B 10.657656      7        1         11       1
## 3     C  9.605415      9        1          2       1
## 4     D 14.433924      9        1         12       4
## 5     E 10.584856      9        1          8       1
## 6     F 11.593150      7        1          5       3
## 7     G  7.091047      6        1          6       3
## 8     H 13.572664     12        1          7       1
## 9     I  5.873703      6        1          3       1
## 10    J  8.200689      7        1          5       1
## 11    K 10.449431      8        1          7       2
## 12    L 10.481372      8        1          5       4
## 13    M  8.991830      7        1          2       1
## 14    N 13.699202      9        1         15       1
## 15    O 10.616898      8        1         11       2
## 16    P  9.853416      9        1          6       3
## 17    Q  8.055729     12        1          3       3
## 18    R 12.554281     12        1          5       3
## 19    S  5.430929      6        1          5       1
## 20    T 21.551484     21        2         13       4
## 21    U  7.809550      9        1          5       2
## 22    V 13.820303      7        1          3       2
## 23    W 11.309806     14        1          6       2
## 24    X 13.867023     18        1          4       1
## 25    Y  5.928104      6        1          4       1
## 26    Z  7.807747      8        1          5       1
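
For comparison, the same two operations can be written in base R. This sketch uses a small invented data.frame named birds (one site has zero bulbuls so that the filter actually removes a row).

```r
# Invented values; site B has no bulbuls
birds <- data.frame(site     = c("A", "B", "C"),
                    Bulbul   = c(5, 0, 9),
                    Hornbill = c(1, 1, 2))

birds["Bulbul"]              # like select(birds, Bulbul): a one-column data.frame
birds[birds$Bulbul > 0, ]    # like filter(birds, Bulbul > 0)

# subset() does both steps in one call
kept <- subset(birds, Bulbul > 0, select = c(site, Bulbul))
kept                         # sites A and C only
```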

Fundamental analyses in R

Reading in Data

Reading in data

  • For the rest of this short course, we will continue with instruction focused on standard base R syntax (not tidyverse syntax).
  • Oftentimes, you will want to import data from a spreadsheet. read.csv is perfect for Comma Separated Value (CSV) spreadsheets
read.csv(file="C:/Documents/FieldData.csv", header=TRUE) 
  # Note that you have to specify the path (directory) where your file is located.
      # In this example, this would be a Windows path; write it with / (or \\),
      # because a single \ starts an escape sequence in R strings
  # header: do you have variable names for each column?
  • The more general file read-in function is read.table, which can read .txt, .csv, and other delimited files as long as you tell it the separator with sep.
read.table(file="~/Documents/FieldData.csv", header=FALSE, sep=",")
  # In this example, the file directory is for a Unix (Linux/Mac OS) system
  # sep="," tells read.table that the columns are separated by commas
  # header=FALSE means that I did not have names for the columns 
  # (no names for the variables), 
  # so they will be named V1, V2, V3, etc.
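
To try the read-in functions without a real field dataset, you can write a tiny CSV to a temporary file and read it back. The data values and object names below are invented for this sketch; note that read.table needs sep="," to split a CSV into columns.

```r
field_data <- data.frame(site = c("A", "B"), count = c(3, 8))  # invented values
csv_path   <- tempfile(fileext = ".csv")                        # throwaway file

write.csv(field_data, csv_path, row.names = FALSE)  # create a small CSV to read
from_csv   <- read.csv(csv_path, header = TRUE)
from_table <- read.table(csv_path, header = TRUE, sep = ",")

str(from_csv)                      # check column names and classes after import
all.equal(from_csv, from_table)    # the two calls agree for this file
```

Running str() right after import is a cheap way to catch columns that were read as text when you expected numbers.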

Reading in data: Excel

  • For Excel spreadsheets (*.xls, *.xlsx), the package gdata has a function to read in excel files that avoids common input issues.
install.packages("gdata") # recall that you only need to do this ONCE!
library("gdata") # you must do this every time 
                 # you start a new R session (re-open R)

## There are two ways to use the function read.xls from the gdata package:

    # 1: call it directly
read.xls("C:/Documents/FieldData.xlsx", sheet=1) 
# read.xls can only extract from one spreadsheet page at a time, 
# so tell it which page to look at

    # 2: tell R to go and look for the function "read.xls" from package "gdata":
gdata::read.xls("C:/Documents/FieldData.xlsx", sheet="Birds") 
# if your sheets have names, you can also refer to them by the names of each tab.

Fundamental analyses in R

Basic math

Basic math in R

  • You can use R to perform addition, subtraction, multiplication (including matrix multiplication), division, and pretty much any mathematical task you desire.
  • Let's start by creating a variable x and assigning the value of 5 to it:
x <- 5
x
## [1] 5
x + 5
## [1] 10
x / 5
## [1] 1
x * 5
## [1] 25
x - 5
## [1] 0

Basic math in R

  • We can do the exact same sorts of things with vectors, matrices, and data.frames.
    • For instance, let's revisit xishuangbanna_birds_df
# Find the total number of babblers in each site
xishuangbanna_birds_df$TitBabbler + xishuangbanna_birds_df$Babbler
##  [1]  5 12  3 16  9  8  9  8  4  6  9  9  3 16 13  9  6  8  6 17  7  5  8
## [24]  5  5  6
  • We could also define a vector and multiply it by another vector or matrix:
tree_area <- rnorm(mean = 20, sd=3, n=5)
  # stand area of trees simulated from a normal distribution
tree_carbon <- rnorm(mean = 50, sd=5, n=5)
  # carbon per unit area simulated from a normal distribution
tree_area %*% tree_carbon
##          [,1]
## [1,] 5052.161
tree_sites_no <- matrix(rnorm(mean = 20, sd= 3, n=25), ncol = 5, nrow = 5)
tree_sites_no 
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 16.52729 16.46395 22.52863 23.72030 15.71235
## [2,] 18.40911 16.47790 15.62602 19.67770 19.00707
## [3,] 27.33705 19.00123 18.79908 20.51778 20.38516
## [4,] 17.50251 24.08934 17.67075 20.76380 23.05436
## [5,] 21.24056 18.59256 18.89211 18.15640 19.23328
tree_sites_no %*% tree_carbon
##          [,1]
## [1,] 4918.302
## [2,] 4622.457
## [3,] 5531.381
## [4,] 5288.817
## [5,] 4970.695
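
The difference between * (element-wise) and %*% (matrix multiplication) is easy to check by hand with small numbers. The values below are invented so that the arithmetic is simple.

```r
area   <- c(2, 3, 4)      # invented values
carbon <- c(10, 20, 30)

area * carbon             # element-wise products: 20 60 120
sum(area * carbon)        # 200
area %*% carbon           # inner product: a 1 x 1 matrix holding 200
```

So for two vectors of the same length, %*% returns the sum of the element-wise products.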

Fundamental analyses in R

Statistics

Calculating descriptive statistics in R

  • R has many standard functions for descriptive statistics, for example:
mean(xishuangbanna_birds_df$Hornbill)
## [1] 1.038462
sd(xishuangbanna_birds_df$Hornbill) 
## [1] 0.1961161
median(xishuangbanna_birds_df$Hornbill)
## [1] 1
summary(xishuangbanna_birds_df$Hornbill)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.038   1.000   2.000
  • Sometimes, you may have missing values, which R represents as NA. To handle them, many functions accept na.rm=TRUE, which removes missing values before the calculation.
# Sample 8 observations from a Poisson distribution with mean 10
spotty_data <- rpois(8, 10)
# Replace the 5th observation with NA
spotty_data[5] <- NA
spotty_data[5] <- NA
mean(spotty_data) # uh-oh!
## [1] NA
mean(spotty_data,na.rm=T) # phew!
## [1] 10.57143
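
Before removing missing values, it helps to know how many there are and where. A short sketch with invented counts:

```r
spotty <- c(12, 9, NA, 11, NA, 8)   # invented counts with two gaps
is.na(spotty)                       # TRUE where a value is missing
sum(is.na(spotty))                  # 2 missing values
which(is.na(spotty))                # positions 3 and 5
mean(spotty, na.rm = TRUE)          # 10
```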

Visualizing data: Basic plots

  • In base R, the command plot creates scatterplots, and related commands such as boxplot, hist, and barplot create many other types of plots!
  • The package ggplot2 further extends the plotting capability of R
par(mfcol=c(1,3), oma=c(0,0,0,0))
plot(cars,xlab="Speed",ylab="Distance",main="Cars Dataset Scatterplot", pch=19, col="grey60")
boxplot(count ~ spray, data = InsectSprays, col = "lightgray",xlab="Treatment",ylab="Insect Count", main="Insect Sprays Dataset")
hist(islands,xlab=expression(paste("Land area (miles"^2,")")))

Analysis of Variance (ANOVA)

  • Oftentimes, we may be interested in how a continuous variable relates to a categorical variable
  • One example is the dataset PlantGrowth which describes plant weight as a function of control (ctrl) or experimental (trt1, trt2) treatment.
# call the built-in "PlantGrowth" dataset
data("PlantGrowth") 
# ANOVA
anova(lm(weight ~ group, data = PlantGrowth))
## Analysis of Variance Table
## 
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Boxplot
boxplot(weight ~ group, data = PlantGrowth, 
        main = "Plant Growth data",
        ylab = "Dried weight of plants", 
        xlab="Treatment", col = "lightgray")
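
The ANOVA table says that the groups differ, but not which ones. One common follow-up (not part of the analysis above, shown here only as a sketch) is Tukey's honest significant difference test, available in base R through aov() and TukeyHSD().

```r
plant_aov <- aov(weight ~ group, data = PlantGrowth)  # same model, fitted with aov()
summary(plant_aov)                                    # same F test as anova(lm(...))
plant_tukey <- TukeyHSD(plant_aov)                    # pairwise differences between groups
plant_tukey                                           # one row per pair of treatments
```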

Regressions: Ordinary least squares (OLS)

  • It is straightforward to build regression models, which ask how a dependent variable y relates to one or more predictors (independent variables) x.
  • As an example, let's regress braking distance against car speed using the cars dataset (?cars):
data(cars)
# build linear regression model on full cars dataset
# and store in a variable named linearMod
linearMod <- lm(dist ~ speed, data=cars)  
linearMod # show me linearMod
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
# What was the model fit?
summary(linearMod)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
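
Once a model is fitted, coef() extracts the estimates and predict() applies them to new data. In this sketch the speeds 10 and 20 are arbitrary example values.

```r
linearMod <- lm(dist ~ speed, data = cars)
coef(linearMod)                              # intercept and slope

new_speeds <- data.frame(speed = c(10, 20))  # hypothetical speeds
predicted  <- predict(linearMod, newdata = new_speeds)
predicted                                    # about 21.7 and 61.1
```

The column name in newdata must match the predictor name used in the formula (speed).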

Generalized linear models (GLMs)

  • The standard regression formula takes the form: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon\).
  • GLMs are regression models that generalize OLS:
    • OLS assumes that the error term, \(\varepsilon\) (unexplained variation) comes from a Normal distribution (note: this does not place assumptions on \(Y\)!)
  • On the other hand, GLMs can accommodate many other distribution families, such as the types of error that would arise for count data (Poisson distributed) or binomial data (binary "yes" or "no" responses)
  • For example, in the xishuangbanna_birds_df dataset, the bird counts are non-negative integers, so a Poisson distribution is a natural choice.
summary(glm(Bulbul~eco_cov, data=xishuangbanna_birds_df, family=poisson))
## 
## Call:
## glm(formula = Bulbul ~ eco_cov, family = poisson, data = xishuangbanna_birds_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4064  -0.3553  -0.1331   0.2338   1.7857  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.40698    0.19725   7.133 9.82e-13 ***
## eco_cov      0.07433    0.01628   4.566 4.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 33.048  on 25  degrees of freedom
## Residual deviance: 14.189  on 24  degrees of freedom
## AIC: 122.54
## 
## Number of Fisher Scoring iterations: 4
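
A Poisson GLM uses a log link by default, so its coefficients are on the log scale; exponentiating them gives multiplicative effects on the expected count. The sketch below fits the same kind of model to simulated data (the intercept 1.4 and slope 0.07 are assumed values chosen for the simulation, not estimates from the real dataset) and then interprets the slope reported above.

```r
set.seed(1)                                  # make the simulation repeatable
eco_cov <- runif(200, min = 5, max = 20)     # simulated covariate
counts  <- rpois(200, lambda = exp(1.4 + 0.07 * eco_cov))  # true log-scale slope: 0.07

pois_mod <- glm(counts ~ eco_cov, family = poisson)
coef(pois_mod)        # estimates on the log scale, close to 1.4 and 0.07
exp(coef(pois_mod))   # multiplicative change in the expected count per unit of eco_cov

exp(0.07433)          # slope from the output above: about 1.077
```

In other words, the fitted model above suggests roughly 7.7% more bulbuls for each one-unit increase in eco_cov.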

Environmental analyses in R

R: spatial statistics

  • There is a growing and exciting set of tools in R to do complex geographical data cleaning, coordinate projection, and spatial manipulation.
  • Arguably, a major advantage of using R and its spatial packages (e.g. sp, rgeos, rgdal, raster, ggmap, sf, and more) over point-and-click GIS software is that you can easily replicate your work in the future by simply re-executing your code!
ggmap::ggmap(China)

Conservation social science

  • As ecologists and conservationists, we often strive to understand what drives people to follow rules that protect natural resources.
  • More importantly, we want to identify what leads to rule-breaking, such as: illegal hunting, logging, fishing, or other forms of resource exploitation.
  • One problem is that asking questions about these activities directly can be uncomfortable for respondents, so they may not answer truthfully.
  • We can deal with these issues by adopting indirect surveys, but then we need specialized methods to analyze the data (which are both binary–"yes" or "no"–and also cannot be taken at face value)
  • Charlotte Chang and others have built packages for sensitive surveys in conservation.

Bayesian inference

  • Bayesian approaches can be particularly helpful when you are dealing with data that come from a complex, hierarchical generative process.
  • Take bird survey data as an example.
  • In bird surveys, we face multiple problems:
    • A species might be present, but we may fail to detect it.
    • Certain species may be harder to detect at some times of the day or year.
    • Habitat quality can affect both the abundance of species and their detectability (e.g. how much are they singing, how active are they)?
  • Bayesian models can allow us to specify the set of interactions leading to changes in detectability and abundance.
  • There are many helpful resources on Bayesian methods in R:

R Markdown: A way to make your code easy to share and easy to revisit!

What is (R) Markdown?

  • Markdown is a straightforward language for turning plain text into nicely formatted HTML or PDF, which is a great format for sharing with collaborators.

  • R Markdown combines the syntax (language rules) of markdown with embedded R code chunks that are run so their output can be included in the final document.

  • R Markdown lets you create documents (you could even write an entire scientific manuscript in R Markdown!), presentations (this presentation is written in R Markdown), and reports from R.

  • R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes). This is a huge advantage over a static workflow such as Microsoft Word + SPSS. Whenever you want to re-run your analyses, simply alter your code (usually just some small part of your code)–this is much easier than having to go back into something like SPSS or SAS and re-do all of the manual data input and analyses.

Source: http://rmarkdown.rstudio.com/ and https://github.com/mine-cetinkaya-rundel/rworkshop-mem/

How do you create an R Markdown document?

In RStudio, you can create an R Markdown document, "knit" it (render the code into a nice format such as HTML or PDF), and examine the source code and the output.

  1. File \(\rightarrow\) New File \(\rightarrow\) R Markdown…

  2. Enter a title (e.g. "Spatial analysis of predators and prey") and author info

  3. Choose Document as file type, and HTML as the output

  4. Hit OK

  5. Click Knit HTML in the new document, which will prompt you to save your document
    • Naming tip: Do not use spaces
      • Instead, use "camelCase" or underscores ("snake_case").
    • Viewing tip: Click on the down arrow next to Knit HTML and select View in Pane

Markdown basics

  • Markdown is a very simple formatting language based on plain text (the standard text format used in emails, at least historically)

  • Rather than writing complex code (e.g. HTML), Markdown enables the use of a syntax much more like plain-text email.

Markdown overview

R Code Chunks

Within an R Markdown file, R Code Chunks can be embedded using the native Markdown syntax for fenced code regions.

Code chunks

Thank you!

If you have any questions, please talk to Mingxia, or feel free to email Charlotte.

The slides are online