5/25/2018
Specifics on R:
What next?
All source code at https://github.com/charlottehchang/RintroductionCEPF
Slides at http://rpubs.com/chwchang/RintroCEPF
This presentation borrows heavily from Dr. Nicholas Matzke and Dr. Mine Cetinkaya-Rundel
R's basic object types are the vector (a set of numbers or words), the matrix or data.frame (similar to a spreadsheet: columns of vectors), and the list (usually a mixture of vectors, matrices, or other objects; one key feature is that its elements are not forced to be the same length) (from rworkshop-mem)
Version numbers are used to track changes to software, such as bug fixes and updates. For instance, Windows 10 is a different version than Windows XP.
When you open R or RStudio, the version number pops up in the Console (the area where code is run in RStudio; in R's graphical user interface (GUI) the default screen is just the console).
You can also run the command R.Version() or print the variable R.version.string
To find out the version of RStudio go to Help → About RStudio
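As a small sketch of how the version information can be used inside a script (the 3.0.0 threshold below is an arbitrary value chosen for illustration, not a requirement of this presentation):

```r
# Print the full version string of the running R installation
R.version.string

# getRversion() returns the version as an object that can be compared
current_version <- getRversion()
current_version

# Compare against a minimum version (3.0.0 is an arbitrary example threshold)
is_recent_enough <- current_version >= "3.0.0"
is_recent_enough
```

A check like this can be useful at the top of a shared script, so collaborators find out early if their R installation is too old.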
python and Julia).data.frames).help files.
To open the help file for a function, type ? and then type in the function name.
?plot # Tell me what plot does.
?mean # How does R calculate the mean of several values?
?glm # What is generalized linear modeling in R?
# means make a comment (text that is not interpreted as code). Comments are a great way to keep track of what you were doing in the past.
R: vectors, matrices, and data.frames
The most fundamental objects you will create in R are vectors, matrices, and data.frames
c(1:5) # A vector
## [1] 1 2 3 4 5
matrix( rep(c(1:5), 3), nrow=3, byrow=TRUE) # A matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
data.frame(Site=c("A","B","C","D","E"), # A data frame
MouseCount=rpois(5,10),
MuntjacCount=rpois(5,5),
GaurCount=rpois(5,0.25))
##   Site MouseCount MuntjacCount GaurCount
## 1    A         13            5         0
## 2    B         10            4         0
## 3    C         10            5         0
## 4    D          9            5         0
## 5    E         12            3         0
In R, you use <- or = (but <- is better IMO) to assign values to named variables, for example:
days <- c(1:5) # make a variable named days
# and assign the values 1-5
days # show me days
## [1] 1 2 3 4 5
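Once a vector is stored in a variable, you can pull out individual values and do arithmetic on all of them at once. A minimal sketch (the day_names vector is a made-up example, not part of the original slides):

```r
days <- c(1:5)          # same vector as above

length(days)            # how many values are stored
days[2]                 # the second value (R counts from 1)
days[c(1, 5)]           # the first and fifth values
days * 2                # arithmetic is applied to every element

day_names <- c("Mon", "Tue", "Wed", "Thu", "Fri")  # hypothetical labels
class(days)             # "integer"
class(day_names)        # "character"
```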
data.frames
Matrices and data.frames are very similar, except that data.frames can support multiple classes of objects.
For example, a data.frame lets one column be text (Site) and the rest be numeric (Count).
Here is what happens when we call matrix on the data.frame that we created above:
matrix( c( c("A","B","C","D","E"),
MouseCount=rpois(5,10),
MuntjacCount=rpois(5,5),
GaurCount=rpois(5,0.25)), ncol=4)
##      [,1] [,2] [,3] [,4]
## [1,] "A"  "13" "4"  "0"
## [2,] "B"  "17" "6"  "1"
## [3,] "C"  "15" "5"  "0"
## [4,] "D"  "9"  "5"  "0"
## [5,] "E"  "12" "3"  "0"
# Everything is interpreted as a string and the counts are not numeric
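To make the difference concrete, the sketch below checks the class of each column in both structures. The object names (counts_df, counts_mat) and the seed are chosen here for illustration only:

```r
set.seed(1)  # make the simulated counts repeatable
site <- c("A", "B", "C", "D", "E")
mouse_count <- rpois(5, 10)

# data.frame: each column keeps its own class
counts_df <- data.frame(Site = site, MouseCount = mouse_count)
sapply(counts_df, class)

# matrix: everything is coerced to a single class (character here)
counts_mat <- cbind(Site = site, MouseCount = mouse_count)
class(counts_mat[, "MouseCount"])

# Math works directly on the data.frame column;
# the matrix column must be converted back to numbers first
mean(counts_df$MouseCount)
mean(as.numeric(counts_mat[, "MouseCount"]))
```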
A workspace is "your current R working environment and includes any user-defined objects (vectors, matrices, data frames, lists, functions) that you have entered. At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started." Quick-R
For instance, we can see what is in my workspace right now using:
ls()
getwd()
When you close (quit) R, it will always ask Save workspace image?
If you choose Save, then everything that is in your current workspace (all the variables, analyses, etc.) will be stored and re-loaded when you re-open R.
I never Save my workspace. I always want it to be empty so that I can control what I do and do not see in every session (a session is each instance where you open and run a program).
Using save to preserve specific objects
You can store objects yourself with save or save.image.
For example, say I have an object (regression_mod) that I want to save:
save(file="~/Rscripts/regression_mod.RData",regression_mod) # Unix (Linux/Mac) style path
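Alternatively, you can use save.image which would save everything in the current workspace.
This does the same thing as the Quit R Session: Save Workspace Image? prompt, but better:
save.image(file="C:/Documents/MyWorkspace-5-26-2018.RData") # Windows-style path; write / or \\ because a single \ is an escape character in R strings
load(file="C:/Documents/MyWorkspace-5-26-2018.RData") # re-load the saved workspace in a later session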
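The save and load steps can be tried safely with a temporary file. This is a minimal sketch: site_counts is a made-up object, and tempfile() stands in for a real path of your own:

```r
# Hypothetical object to save
site_counts <- c(A = 13, B = 10, C = 10)

# tempfile() gives a throwaway path; in practice use a path of your own
rdata_path <- tempfile(fileext = ".RData")
save(site_counts, file = rdata_path)

rm(site_counts)            # remove the object from the workspace

loaded_names <- load(rdata_path)  # load() returns the names it restored
loaded_names
site_counts                # the object is back
```

Because load() restores objects under their original names, it can silently overwrite an object of the same name in your current workspace.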
"Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data." (From Hadley Wickham, a prominent R software developer with many helpful resources, emphasis mine.)
R packages are stored online at the Comprehensive R Archive Network (CRAN), which had 12,579 packages as of May 2018.
The command install.packages lets you easily install any of the packages in CRAN.
install.packages("gdata") # a package that specializes in dealing with excel spreadsheets
install.packages("ggplot2") # a popular package for plots that is an alternative to base R
To use an installed package, load it with library:
library("gdata")
library("ggplot2")
You need to run the library command every time you want to use the functions from a specific package
Alternatively, you can call a single function with :: (e.g. gdata::read.xls - take the function to read Excel files from gdata without me having to use the library call to load the whole package into the workspace)
data.frame example
Let's move on to an example that looks like real-world data.
Here is a simulated (fake!) dataset entitled xishuangbanna_birds_df with bird counts and an ecological covariate (eco_cov).
## # A tibble: 26 x 6
##    site   eco_cov Bulbul Hornbill TitBabbler Babbler
##    <fctr>   <dbl>  <dbl>    <dbl>      <dbl>   <dbl>
##  1 A         7.49   5.00     1.00       3.00    2.00
##  2 B        10.7    7.00     1.00      11.0     1.00
##  3 C         9.61   9.00     1.00       2.00    1.00
##  4 D        14.4    9.00     1.00      12.0     4.00
##  5 E        10.6    9.00     1.00       8.00    1.00
##  6 F        11.6    7.00     1.00       5.00    3.00
##  7 G         7.09   6.00     1.00       6.00    3.00
##  8 H        13.6   12.0      1.00       7.00    1.00
##  9 I         5.87   6.00     1.00       3.00    1.00
## 10 J         8.20   7.00     1.00       5.00    1.00
## # ... with 16 more rows
data.frame example
Let's say we wanted to extract observations for bulbuls. In traditional R syntax (often called Base R), we would use the command:
xishuangbanna_birds_df$Bulbul # Extracting a column from a dataset
##  [1]  5  7  9  9  9  7  6 12  6  7  8  8  7  9  8  9 12 12  6 21  9  7 14
## [24] 18  6  8
If we wanted to see the bulbul counts at the first five sites (A-E):
xishuangbanna_birds_df$Bulbul[1:5] # Extracting values from vectors
## [1] 5 7 9 9 9
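The $ and [ ] rules generalize to rows and to several columns at once. The sketch below retypes the first five rows shown above into a small data frame (birds_small is a name chosen here for illustration) so that it can be run on its own:

```r
# First five rows of the dataset shown above, retyped by hand
birds_small <- data.frame(
  site    = c("A", "B", "C", "D", "E"),
  Bulbul  = c(5, 7, 9, 9, 9),
  Babbler = c(2, 1, 1, 4, 1)
)

birds_small$Bulbul                   # column by name, returns a vector
birds_small[["Bulbul"]]              # same result with double brackets
birds_small[1:3, "Bulbul"]           # rows 1-3 of one column
birds_small[2, ]                     # the whole second row
birds_small[, c("site", "Babbler")]  # two columns, still a data.frame
```

The pattern inside the square brackets is always [rows, columns]; leaving one side empty means "all of them".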
data.frame example
To preview the data frame, you can use the command str().
str(xishuangbanna_birds_df)
## 'data.frame':    26 obs. of  6 variables:
##  $ site      : Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ eco_cov   : num  7.49 10.66 9.61 14.43 10.58 ...
##  $ Bulbul    : num  5 7 9 9 9 7 6 12 6 7 ...
##  $ Hornbill  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ TitBabbler: num  3 11 2 12 8 5 6 7 3 5 ...
##  $ Babbler   : num  2 1 1 4 1 3 3 1 1 1 ...
Some R users dislike the rules for extracting columns from datasets or values from vectors.
They argue that it seems clunky and not that elegant.
There is an alternate "ecosystem" of packages called the Tidyverse (developed by Hadley Wickham; the link contains excellent instruction materials). In the Tidyverse, you can perform standard R commands using a different set of rules.
To call the bulbul column, we can use this tidyverse command:
select(xishuangbanna_birds_df, Bulbul)
If we wanted to only see the sites where Bulbul observations were greater than 0:
filter(xishuangbanna_birds_df, Bulbul > 0)
##    site   eco_cov Bulbul Hornbill TitBabbler Babbler
## 1     A  7.489038      5        1          3       2
## 2     B 10.657656      7        1         11       1
## 3     C  9.605415      9        1          2       1
## 4     D 14.433924      9        1         12       4
## 5     E 10.584856      9        1          8       1
## 6     F 11.593150      7        1          5       3
## 7     G  7.091047      6        1          6       3
## 8     H 13.572664     12        1          7       1
## 9     I  5.873703      6        1          3       1
## 10    J  8.200689      7        1          5       1
## 11    K 10.449431      8        1          7       2
## 12    L 10.481372      8        1          5       4
## 13    M  8.991830      7        1          2       1
## 14    N 13.699202      9        1         15       1
## 15    O 10.616898      8        1         11       2
## 16    P  9.853416      9        1          6       3
## 17    Q  8.055729     12        1          3       3
## 18    R 12.554281     12        1          5       3
## 19    S  5.430929      6        1          5       1
## 20    T 21.551484     21        2         13       4
## 21    U  7.809550      9        1          5       2
## 22    V 13.820303      7        1          3       2
## 23    W 11.309806     14        1          6       2
## 24    X 13.867023     18        1          4       1
## 25    Y  5.928104      6        1          4       1
## 26    Z  7.807747      8        1          5       1
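For comparison, the same two operations can be written in base R without loading any extra package. This is a sketch on a small hand-typed data frame (birds_small and the threshold of 8 are illustrative choices, not from the original slides):

```r
birds_small <- data.frame(
  site     = c("A", "B", "C", "D", "E"),
  Bulbul   = c(5, 7, 9, 9, 9),
  Hornbill = c(1, 1, 1, 1, 1)
)

# Like select(): keep only the Bulbul column (drop = FALSE keeps a data.frame)
birds_small[, "Bulbul", drop = FALSE]

# Like filter(): keep rows that meet a condition
# (8 is an arbitrary threshold chosen for illustration)
busy_sites <- birds_small[birds_small$Bulbul > 8, ]
busy_sites

# subset() does both at once
subset(birds_small, Bulbul > 8, select = c(site, Bulbul))
```

Which style to use is largely a matter of taste; both give the same rows and columns.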
Reading in Data
The examples below use base R syntax (not tidyverse syntax).
read.csv is perfect for Comma Separated Value (CSV) spreadsheets
read.csv(file="C:/Documents/FieldData.csv", header=TRUE)
# Note that you have to specify the path (directory) where your file is located.
# In this example, this would be a Windows-style path
# header: do you have variable names for each column?
A more general function is read.table which can take .txt, .csv, etc.
read.table(file="~/Documents/FieldData.csv", header=FALSE, sep=",")
# In this example, the file directory is for a Unix (Linux/Mac OS) system
# sep="," tells read.table that the values are separated by commas
# header=FALSE means that I did not have names for the columns
# (no names for the variables),
# so they will be named V1, V2, V3, etc.
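To see what the header argument does without needing a real data file, the sketch below writes a tiny made-up table to a temporary CSV and reads it back in different ways (field_data and csv_path are hypothetical names):

```r
# Hypothetical field data written to a temporary CSV file
field_data <- data.frame(Site = c("A", "B", "C"), Count = c(13, 10, 9))
csv_path <- tempfile(fileext = ".csv")
write.csv(field_data, csv_path, row.names = FALSE)

# header = TRUE: the first line supplies the column names
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)

# read.table needs sep = "," for comma-separated files
same_data <- read.table(csv_path, header = TRUE, sep = ",")

# header = FALSE: the first line is treated as data and columns get default names
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)
nrow(no_header)   # one more row than with_header
```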
The package gdata has a function to read in Excel files that avoids common input issues.
install.packages("gdata") # recall that you only need to do this ONCE!
library("gdata") # you must do this every time
# you start a new R session (re-open R)
## There are two ways to use the function read.xls from the gdata package:
# 1: call it directly
read.xls(file="C:/Documents/FieldData.xlsx", sheet=1)
# read.xls can only extract from one spreadsheet page at a time,
# so tell it which page to look at
# 2: tell R to go and look for the function "read.xls" from package "gdata":
gdata::read.xls(file="C:/Documents/FieldData.xlsx",sheet="Birds")
# if your sheets have names, you can also refer to them by the names of each tab.
Basic math
You can use R to perform addition, subtraction, multiplication (including matrix multiplication), division, and pretty much any mathematical task you desire.
Let's start by creating a variable named x and assigning the value of 5 to it:
x <- 5
x
## [1] 5
x + 5
## [1] 10
x / 5
## [1] 1
x * 5
## [1] 25
x - 5
## [1] 0
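A few more operators follow the same pattern. This short sketch reuses x from above:

```r
x <- 5

x^2          # exponent: 25
x %% 2       # remainder after division: 1
x %/% 2      # integer division: 2
sqrt(x)      # square root
(x + 5) / 2  # parentheses control the order of operations: 5
```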
You can also do math in R on vectors and on the columns of data.frames.
For example, using xishuangbanna_birds_df:
# Find the total number of babblers in each site
xishuangbanna_birds_df$TitBabbler + xishuangbanna_birds_df$Babbler
##  [1]  5 12  3 16  9  8  9  8  4  6  9  9  3 16 13  9  6  8  6 17  7  5  8
## [24]  5  5  6
tree_area <- rnorm(mean = 20, sd=3, n=5) # stand area of trees simulated from a normal distribution
tree_carbon <- rnorm(mean = 50, sd=5, n=5) # carbon per unit area simulated from a normal distribution
tree_area %*% tree_carbon
##          [,1]
## [1,] 5052.161
tree_sites_no <- matrix(rnorm(mean = 20, sd= 3, n=25), ncol = 5, nrow = 5)
tree_sites_no
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 16.52729 16.46395 22.52863 23.72030 15.71235
## [2,] 18.40911 16.47790 15.62602 19.67770 19.00707
## [3,] 27.33705 19.00123 18.79908 20.51778 20.38516
## [4,] 17.50251 24.08934 17.67075 20.76380 23.05436
## [5,] 21.24056 18.59256 18.89211 18.15640 19.23328
tree_sites_no %*% tree_carbon
##          [,1]
## [1,] 4918.302
## [2,] 4622.457
## [3,] 5531.381
## [4,] 5288.817
## [5,] 4970.695
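Because the values above are random, it can be hard to see what %*% is doing. The sketch below repeats the idea with small made-up numbers so the arithmetic can be checked by hand (area, carbon, and areas_by_plot are illustrative names):

```r
# Fixed (made-up) values so the arithmetic can be checked by hand
area   <- c(2, 3, 4)      # stand area at three sites
carbon <- c(10, 20, 30)   # carbon per unit area at the same sites

# %*% on two vectors of equal length gives their inner product:
# 2*10 + 3*20 + 4*30 = 200, returned as a 1 x 1 matrix
total_carbon <- area %*% carbon
total_carbon

# The same number from element-wise multiplication followed by sum()
sum(area * carbon)

# A matrix times a vector: one inner product per row
areas_by_plot <- matrix(c(2, 3, 4,
                          1, 1, 1), nrow = 2, byrow = TRUE)
areas_by_plot %*% carbon   # 200 for row 1, 60 for row 2
```

Note the difference between * (element by element) and %*% (matrix multiplication); mixing them up is a common source of errors.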
Statistics
R has many standard functions for descriptive statistics, for example:
mean(xishuangbanna_birds_df$Hornbill)
## [1] 1.038462
sd(xishuangbanna_birds_df$Hornbill)
## [1] 0.1961161
median(xishuangbanna_birds_df$Hornbill)
## [1] 1
summary(xishuangbanna_birds_df$Hornbill)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   1.000   1.038   1.000   2.000
Real-world data often contain missing values (NAs). To deal with them, you can use na.rm=T: take out any missing values from my data before you do the analysis
# Sample 8 observations from a Poisson distribution with mean 10
spotty_data <- rpois(8, 10)
# Remove the 5th observation and replace with NA
spotty_data[5] <- NA
mean(spotty_data) # uh-oh!
## [1] NA
mean(spotty_data,na.rm=T) # phew!
## [1] 10.57143
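Before dropping missing values it helps to know how many there are. A minimal sketch with made-up counts (spotty is a hypothetical vector, fixed so the results can be checked by hand):

```r
spotty <- c(12, 9, NA, 11, 8)   # made-up counts with one missing value

is.na(spotty)          # TRUE where a value is missing
sum(is.na(spotty))     # number of missing values: 1

mean(spotty, na.rm = TRUE)   # (12 + 9 + 11 + 8) / 4 = 10
sum(spotty, na.rm = TRUE)    # 40

complete <- spotty[!is.na(spotty)]  # keep only the observed values
length(complete)                    # 4
```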
In R, you can use the command plot and related commands such as boxplot, hist, and barplot to create scatterplots, boxplots, histograms, barplots, and many other types of plots!
The package ggplot2 further extends the plotting capability of R
par(mfcol=c(1,3), oma=c(0,0,0,0))
plot(cars,xlab="Speed",ylab="Distance",main="Cars Dataset Scatterplot", pch=19, col="grey60")
boxplot(count ~ spray, data = InsectSprays, col = "lightgray",xlab="Treatment",ylab="Insect Count", main="Insect Sprays Dataset")
hist(islands,xlab=expression(paste("Land area (1000 miles"^2,")"))) # islands is recorded in thousands of square miles (?islands)
For example, we can run an analysis of variance (ANOVA) on the built-in dataset PlantGrowth which describes plant weight as a function of control (ctrl) or experimental (trt1, trt2) treatment.
# call the built-in "PlantGrowth" dataset
data("PlantGrowth")
# ANOVA
anova(lm(weight ~ group, data = PlantGrowth))
## Analysis of Variance Table
##
## Response: weight
##           Df  Sum Sq Mean Sq F value  Pr(>F)
## group      2  3.7663  1.8832  4.8461 0.01591 *
## Residuals 27 10.4921  0.3886
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Boxplot
boxplot(weight ~ group, data = PlantGrowth,
main = "Plant Growth data",
ylab = "Dried weight of plants",
xlab="Treatment", col = "lightgray")
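The ANOVA table printed earlier is itself a data.frame, so individual numbers can be pulled out for reporting. A sketch under that assumption (fit, anova_table, f_value, and p_value are names chosen here):

```r
data("PlantGrowth")

fit <- lm(weight ~ group, data = PlantGrowth)
anova_table <- anova(fit)

# The ANOVA table is a data.frame, so values can be pulled out by name
f_value <- anova_table["group", "F value"]
p_value <- anova_table["group", "Pr(>F)"]
round(c(F = f_value, p = p_value), 4)

# Mean weight in each treatment group, to see which groups differ
tapply(PlantGrowth$weight, PlantGrowth$group, mean)
```

The ANOVA only says that at least one group differs; the group means (and the boxplot) show where the difference lies.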
Regression (how does y (dependent variable) relate to x (predictors/independent variables)?)
Here is a linear regression on the built-in cars dataset (?cars):
data(cars)
# build linear regression model on full cars dataset
# and store in a variable named linearMod
linearMod <- lm(dist ~ speed, data=cars)
linearMod # show me linearMod
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932
# What was the model fit?
summary(linearMod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
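A fitted model can also be used to make predictions. The sketch below refits the same model and predicts stopping distance at two speeds (the speeds 10 and 20 are arbitrary illustrative values):

```r
data(cars)
linearMod <- lm(dist ~ speed, data = cars)

coef(linearMod)   # intercept and slope as a named vector

# Predicted stopping distance for speeds of 10 and 20 (units as in ?cars)
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(linearMod, newdata = new_speeds)
predicted

# The same prediction by hand: intercept + slope * speed
by_hand <- coef(linearMod)[["(Intercept)"]] + coef(linearMod)[["speed"]] * 20
by_hand

summary(linearMod)$r.squared   # proportion of variance explained
```

The column name in newdata must match the predictor name used in the model formula (speed), otherwise predict cannot find it.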
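In the xishuangbanna_birds_df dataset, the bird counts are count data (non-negative whole numbers), so a Poisson distribution is a natural choice. We can fit a Poisson generalized linear model with glm:
summary(glm(Bulbul~eco_cov, data=xishuangbanna_birds_df, family=poisson))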
##
## Call:
## glm(formula = Bulbul ~ eco_cov, family = poisson, data = xishuangbanna_birds_df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4064  -0.3553  -0.1331   0.2338   1.7857
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.40698    0.19725   7.133 9.82e-13 ***
## eco_cov      0.07433    0.01628   4.566 4.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 33.048  on 25  degrees of freedom
## Residual deviance: 14.189  on 24  degrees of freedom
## AIC: 122.54
##
## Number of Fisher Scoring iterations: 4
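Poisson GLM coefficients are on the log scale, so exponentiating them gives multiplicative effects on the expected count. The sketch below illustrates this on freshly simulated data; the sample size, seed, and true coefficients (1.4 and 0.07) are assumptions made here for illustration and are not the data used in the slides:

```r
set.seed(42)  # make the simulation repeatable

# Simulate a covariate and Poisson counts whose log-mean depends on it
eco_cov <- rnorm(200, mean = 10, sd = 3)
counts  <- rpois(200, lambda = exp(1.4 + 0.07 * eco_cov))

fit <- glm(counts ~ eco_cov, family = poisson)

coef(fit)        # estimates on the log scale
exp(coef(fit))   # multiplicative change in expected count per unit of eco_cov
```

Applied to the output above, exp(0.07433) is roughly 1.077, i.e. each one-unit increase in eco_cov is associated with about an 8% higher expected Bulbul count.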
R: spatial statistics
You can use R to do complex geographical data cleaning, coordinate projection, and spatial manipulation.
One advantage of R and its spatial packages (e.g. sp, rgeos, rgdal, raster, ggmap, sf, and more) over point-and-click GIS software is that you can easily replicate your work in the future by simply re-executing your code!
ggmap::ggmap(China) # China is a map object created beforehand (its code is not shown here)
R:
Markdown is a straightforward language for turning plain text into nicely formatted HTML or PDF, which is a great format for sharing with collaborators.
R Markdown combines the syntax (language rules) of markdown with embedded R code chunks that are run so their output can be included in the final document.
R Markdown lets you create documents (you could even write an entire scientific manuscript in R Markdown!), presentations (this presentation is written in R Markdown), and reports from R.
R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes). This is a huge advantage over a static workflow such as Microsoft Word + SPSS. Whenever you want to re-run your analyses, simply alter your code (usually just some small part of it) and re-knit the document. This is much easier than having to go back into something like SPSS or SAS and re-do all of the manual data input and analyses.
Source: http://rmarkdown.rstudio.com/ and https://github.com/mine-cetinkaya-rundel/rworkshop-mem/
In RStudio, you can create an R Markdown document, "knit" it (render the code into a nice format such as HTML or PDF), and examine the source code and the output.
File → R Markdown…
Enter a title (e.g. "Spatial analysis of predators and prey") and author info
Choose Document as file type, and HTML as the output
Hit OK
Markdown is a very simple formatting language based on plain text (the standard text format used in emails, at least historically)
Rather than writing complex code (e.g. HTML), Markdown enables the use of a syntax much more like plain-text email.
Markdown overview
Within an R Markdown file, R Code Chunks can be embedded using the native Markdown syntax for fenced code regions.
Code chunks
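To show what such a file looks like on disk, the sketch below builds a minimal R Markdown document as plain text from within R. The file contents are an illustrative assumption (only the title is taken from the steps above), and the final knitting step is left commented out because it requires the rmarkdown package:

```r
fence <- strrep("`", 3)   # three backticks, the Markdown marker for a fenced code region

rmd_lines <- c(
  "---",
  'title: "Spatial analysis of predators and prey"',
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text describing the analysis.",
  "",
  paste0(fence, "{r}"),      # start of an R code chunk
  "summary(cars)",
  fence                      # end of the chunk
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)
readLines(rmd_path)

# Knitting needs the rmarkdown package (and pandoc), so it is left commented out:
# rmarkdown::render(rmd_path)
```

In everyday use you would write the .Rmd file directly in RStudio and press the Knit button rather than generating it from code.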
If you have any questions, please talk to Mingxia, or feel free to email Charlotte.
The slides are online