Change this setting: Tools > Global Options > Code Editing > Soft-wrap R source files
The R code for specific tasks will appear in a box like this that mimics an R script.
The scripts are annotated in plain text to help you along. The following scripts refer to object names that you can or need to edit in CAPS (e.g. INDEPENDENT.VARIABLE, DEPENDENT.VARIABLE, etc.) and these are indicated after each section. All other script text affects functionality of what is performed and can of course be edited if you know what you are doing.
Note that two-parted names including a “.” (e.g. my.name) need not be two-parted and you can edit them as desired (e.g. “myname”). Object names are case sensitive.
In addition to showing the R code, some sections in this cookbook contain examples.
## [1] "outputs will be shown like this, as it would appear in the R console when run."
The examples shown use ‘iris’ dataset that is provided by R. If you wish to follow along with the examples, everything can be reproduced by copy/pasting & running the provided code. To access the “iris” dataset in R, simply run:
data(iris)
This dataset
To test this, type any equation in your script and run to see the solution.
For example:
7*9/(3+4)^2-8
## [1] -6.714286
define where R will look for and put files.
setwd("C:/Users/YOURNETID/Desktop/")
Replace “YOURNETID” with your case-sensitive net id
PC vs Mac: “C:” Assumes you are on a PC. For Mac, delete “C:”.
To determine your current working directory
getwd()
## [1] "/Users/colleennell/Dropbox/Projects/R_ed"
csv:
DF <- read.csv(file="YOURDATA.csv", header=TRUE, sep=",", na.strings = "NA")
txt:
DF <- read.table(file="YOURDATA.txt", header=TRUE, sep=",", na.strings = "NA")
“YOURDATA” must be located in your working directory. You can also provide the absolute path (working directory + filename) as the file location.
**Note that ‘<-’ is used to assign your data file to ‘DF’. Now, ‘DF’ contains your data and will need to be called on to work with it.
Print in console:
DF
Open dataframe in new window:
View(DF)
Print the first few rows (head) or last few rows (tail) of your data. For example, using the ‘iris’ dataset:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Look at the structure (str) of your data:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
‘num’ after Sepal.Length indicates that the variable is numeric. Other types include ‘int’ (integer), ‘factor’ (categorical factor), ‘ord’ (ordered) and ‘chr’ (character). This output also tells us that the dataframe contains 150 observations (rows) of 5 variables (columns).
R provides many built-in functions that allow you to explore & plot your data. The majority of what is covered in this cookbook utilizes these base R tools. In addition, there is a diversity of R packages that can be downloaded and implemented in R to increase its’ power and functionality.
Download packages to your computer:
install.packages("PACKAGE.NAME")
Load them for use:
library(PACKAGE.NAME)
To use a package, it only needs to be loaded once per session.
Two very useful packages for data manipulation and plotting:
install.packages("dplyr") #data manipulation
install.packages("ggplot2") #plotting
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
library(ggplot2)
To be able to work with your data you need to be able to refer to your variables. There are several ways to call on a variable within your dataframe.
DF$VARIABLE
Example:
head(iris$Sepal.Width)
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
Selects the ‘Sepal.Width’ variable from the ‘iris’ dataframe.
DF[ROW,COL]
Replace ROW and COL with numbers indicating the row and column of interest within the dataframe
Column example:
head(iris[,2])
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
As shown by the output, this is the same as iris$Sepal.Width
Row example:
iris[2,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 4.9 3 1.4 0.2 setosa
This prints the second row of the iris dataframe.
Providing values for ROW and COL returns the corresponding cell value:
iris[2,2]
## [1] 3
This is the second value in the 2nd row.
You can reference multiple rows and/or columns simultaneously using the ‘:’ operator. ‘:’ = ‘to’, such that
DF[,1:4]
selects columns 1 through 4 from the dataframe. Or listed using ‘c()’
Example:
iris[1:5,c(2,4,5)]
## Sepal.Width Petal.Width Species
## 1 3.5 0.2 setosa
## 2 3.0 0.2 setosa
## 3 3.2 0.2 setosa
## 4 3.1 0.2 setosa
## 5 3.6 0.2 setosa
This outputs rows 1 through 5 from columns 2, 4, and 5 of the iris dataframe.
Rather than providing the column number, column names can also be used in place of COLNAME (when using names they must be in " “).
DF[,"COLNAME"]
or for multiple columns list the column names in ‘c()’
DF[,c("COLNAME1","COLNAME2")]
Example:
head(iris[,"Petal.Width"])
## [1] 0.2 0.2 0.2 0.2 0.2 0.4
names can be combined with row numbers as well:
iris[1:5,c("Sepal.Width","Species")]
## Sepal.Width Species
## 1 3.5 setosa
## 2 3.0 setosa
## 3 3.2 setosa
## 4 3.1 setosa
## 5 3.6 setosa
Calculate summary statistics on your data frame for each level of a factor.
The resulting data frame calculates statistics for each treatment.
In base R there are built in functions to calculate the mean, standard deviation, and n
DATA.SUMMARY <- data.frame(
treatment=levels(rdata$location),
mean=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, mean),
n=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, length),
sd=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, sd)
)
For example, using the ‘iris dataset where our treatment is ’Species’ and our dependent variable of interest is ‘Sepal.Length’:
iris.summary <- data.frame(
treatment=levels(iris$Species),
mean=tapply(iris$Sepal.Width, iris$Species, mean),
n=tapply(iris$Sepal.Width,iris$Species, length),
sd=tapply(iris$Sepal.Width, iris$Species, sd)
)
There is not a built in function for standard error in base R.
To add standard error of the mean (SEM) & view summary statistics:
DATA.SUMMARY$sem <- DATA.SUMMARY$sd/sqrt(DATA.SUMMARY$n)
DATA.SUMMARY
Using ‘iris’ data:
iris.summary$sem<-iris.summary$sd/sqrt(iris.summary$n)
iris.summary
## treatment mean n sd sem
## setosa setosa 3.428 50 0.3790644 0.05360780
## versicolor versicolor 2.770 50 0.3137983 0.04437778
## virginica virginica 2.974 50 0.3224966 0.04560791
Standard error can also be written as a function and included in the original code for the same result:
se<-function(x) sqrt(var(x)/length(x))
DATA.SUMMARY <- data.frame(
treatment=levels(rdata$location),
mean=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, mean),
n=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, length),
sd=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, sd),
se=tapply(rdata$DEPENDENT.VARIABLE, rdata$location, se)
)
Use summary statistics to make figure with mean and SE error bars Load the ggplot2 library if you have not already:
library(ggplot2)
ggplot(DATA.SUMMARY, aes(x = treatment, y = mean)) +
geom_bar(position = position_dodge(), stat="identity", fill="grey") +
geom_errorbar(aes(ymin=mean-sem, ymax=mean+sem),width=.3) +
labs("TITLE (MOST PLOTS DONT NEED ONE)",x="X AXIS", y="Y AXIS")+
theme_bw() + theme(panel.grid.major = element_blank())
Make this plot using summary statistics for ‘iris$Sepal.Length’:
ggplot(iris.summary, aes(x = treatment, y = mean)) +
geom_bar(position = position_dodge(), stat="identity", fill="grey") +
geom_errorbar(aes(ymin=mean-sem, ymax=mean+sem),width=.3) +
labs(" ",x="Species", y="Sepal length")+
theme_bw() + theme(panel.grid.major = element_blank())
write.csv(DATA.SUMMARY,"DATA.SUMMARY.csv")
Saves the R data frame named “DATA.SUMMARY” as a comma delimited file called “SUMMARY.STATS.csv” to wherever you have set your working directory.
RDATA$INDEPENDENT.VARIABLE<-as.factor(RDATA$INDEPENDENT.VARIABLE)
This defines your independent variable as being discontinuous i.e.categorical.
str(RDATA$INDEPENDENT.VARIABLE)
This displays the structure of the variable INDEPENDENT.VARIABLE to check that it is now a factor variable.
TEST.RESULTS<-aov(DEPENDENT.VARIABLE~INDEPENDENT.VARIABLE, data=RDATA)
TEST.RESULTS
Performs an ANOVA on data frame called “RDATA” with the defined dependent and independent variable and saves the output into the file called “TEST.RESULTS” and displays that object.
summary(TEST.RESULTS)
Provides summary of the ANOVA results which is slightly different from displaying the full results but you could do this too.
Example: One way Anova using ‘iris’ data
Does petal width differ by Species?
iris$Species<-as.factor(iris$Species)
str(iris$Species)
## Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
test.results<-aov(Petal.Width~Species, data=iris)
test.results
## Call:
## aov(formula = Petal.Width ~ Species, data = iris)
##
## Terms:
## Species Residuals
## Sum of Squares 80.41333 6.15660
## Deg. of Freedom 2 147
##
## Residual standard error: 0.20465
## Estimated effects may be unbalanced
summary(test.results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 960 <2e-16 ***
## Residuals 147 6.16 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(DEPENDENT.VARIABLE~INDEPENDENT.VARIABLE, data=RDATA, main="TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LABEL")
Make a boxplot of the results for each level of the independent variable. These plots have the median (center line), the 25th and 75th quartiles (outside of box), the extreem values (the ends of the wiskers) and outliers (individual circles). These figures are not appropriate in most cases because we want mean and standard error, but as a quick and dirty way of visaulizing your data these plots are nevertheless useful.
TukeyHSD(TEST.RESULTS)
Conducts Tukey Honest Significant Difference test, a post-hoc tests that provides P value for each pairwise comparisons among levels of the independent variable. If your independent variable has 3 levels (e.g. small, medium and large), the P value for the for the ANOVA tells you at least two of these levels differ from each other, but you don’t know which two. This test gives you separate P values for every pair-wise comparison (e.g. small vs. medium, small vs. large, medium vs. large).
Tukey example using ‘iris’ data, testing for pairwise differences between treatment groups (Species), for petal width.
TukeyHSD(test.results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 1.08 0.9830903 1.1769097 0
## virginica-setosa 1.78 1.6830903 1.8769097 0
## virginica-versicolor 0.70 0.6030903 0.7969097 0
RDATA$INDEPENDENT.VARIABLE1<-as.factor(RDATA$INDEPENDENT.VARIABLE1)
RDATA$INDEPENDENT.VARIABLE2<-as.factor(RDATA$INDEPENDENT.VARIABLE2)
This defines your independent variable as being discontinuous i.e.categorical.
str(RDATA$INDEPENDENT.VARIABLE1)
str(RDATA$INDEPENDENT.VARIABLE2)
This displays the structure of the variables INDEPENDENT.VARIABLE1 and INDEPENDENT.VARIABLE2 to check that they are now factor variables.
TEST.RESULTS<-aov(DEPENDENT.VARIABLE~INDEPENDENT.VARIABLE1+INDEPENDENT.VARIABLE2+INDEPENDENT.VARIABLE1*INDEPENDENT.VARIABLE2,data=RDATA)
Performs a two-factor ANOVA on data frame called “RDATA” with the defined dependent and two independent variables. This analysis includes the main effects of each independent variables as well as the interaction between them. It then saves the output into the file called " TEST.RESULTS“.
summary(TEST.RESULTS)
Provides summary of the two-factor ANOVA results which is slightly different from displaying the full results but you could do this too.
boxplot(DEPENDENT.VARIABLE ~ INDEPENDENT.VARIABLE1, data=RDATA, main="TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LABEL")
Make a boxplot of the results for each level of the independent variable1. These plots have the median (center line), the 25th and 75th quartiles (outside of box), the extreem values (the ends of the wiskers) and outliers (individual circles). These figures are not appropriate in most cases because we want mean and standard error, but as a quick and dirty way of visaulizing your data these plots are nevertheless useful.
boxplot(DEPENDENT.VARIABLE ~ INDEPENDENT.VARIABLE2, data=RDATA, main="TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LABEL")
Makes a boxplot of the results with means and errors for each level of independent variable2.These plots have the median (center line), the 25th and 75th quartiles (outside of box), the extreem values (the ends of the wiskers) and outliers (individual circles). These figures are not appropriate in most cases because we want mean and standard error, but as a quick and dirty way of visaulizing your data these plots are nevertheless useful.
boxplot(DEPENDENT.VARIABLE ~ INDEPENDENT.VARIABLE1*INDEPENDENT.VARIABLE2, data=RDATA, main="TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LABEL")
Makes a boxplot of the results with means and errors for each level of independent variable1 and 2 in combination. These plots have the median (center line), the 25th and 75th quartiles (outside of box), the extreem values (the ends of the wiskers) and outliers (individual circles). These figures are not appropriate in most cases because we want mean and standard error, but as a quick and dirty way of visaulizing your data these plots are nevertheless useful.
TukeyHSD(TEST.RESULTS)
Conducts Tukey Honest Significant Difference test, a post-hoc tests that provides P value for each pairwise comparisons among all combinations of the levels of the two independent variable. If one of your independent variable has 3 levels (e.g. small, medium and large) and the other has 3 levels (e.g. slow, fast, extrafast), the P values for these factors for the for the ANOVA tells you at least two of these levels differ from each other, but you don’t know which two. This test gives you separate P values for every pair-wise comparison within each level of each factor (e.g. small vs. medium, small vs. large, medium vs. large) but also for pairwise combinations among factors (e.g. small-slow vs. small-fast, etc.).
RDATA$INDEPENDENT.VARIABLE.FACTOR<-as.factor(RDATA$INDEPENDENT.VARIABLE.FACTOR)
This defines INDEPENENT.VARIABLE.FACTOR as being discontinuous i.e. a factor variable.
str(RDATA$INDEPENDENT.VARIABLE.FACTOR)
This displays the structure of the variable INDEPENDENT.VARIABLE.FACTOR to check that it is now a factor variable.
RDATA$INDEPENDENT.VARIABLE.NUMERIC<-as.numeric(RDATA$INDEPENDENT.VARIABLE.NUMERIC)
This defines INDEPENDENT.VARIABLE.NUMERIC as being continuous i.e. numeric.
str(RDATA$INDEPENDENT.VARIABLE.NUMERIC)
This displays the structure of the variable INDEPENDENT.VARIABLE.NUMERIC to check that it is now a numeric variable.
TEST.RESULTS<-lm(DEPENDENT.VARIABLE ~ INDEPENDENT.VARIABLE.FACTOR + INDEPENDENT.VARIABLE.NUMERIC + INDEPENDENT.VARIABLE.FACTOR*INDEPENDENT.VARIABLE.NUMERIC, data=RDATA)
Performs an ANCOVA on data frame called “RDATA” with the defined dependent and two independent variables, one continuous and one discontinuous. This analysis includes the main effects of each independent variables as well as the interaction between them. It then saves the output into the object called " TEST.RESULTS“.
summary(TEST.RESULTS)
Provides summary of the ANCOVA results which is slightly different from displaying the full results but you could do this too.
plot(INDEPENDENT.VARIABLE.NUMERIC,DEPENDENT.VARIABLE,as.numeric(INDEPENDENT.VARIABLE.FACTOR),col=c("blue","red")[as.numeric(INDEPENDENT.VARIABLE.FACTOR)], data=RDATA, main="TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LABEL")
This plots the levels of your discrete independent variable blue for the first alphabetically and red for the second alphabetically.
abline(lm(DEPENDENT.VARIABLE[INDEPENDENT.VARIABLE.FACTOR=="LEVEL1.INDEPENDENT.VARIABLE.FACTOR"]~INDEPENDENT.VARIABLE.NUMERIC[INDEPENDENT.VARIABLE.FACTOR=="LEVEL1.INDEPENDENT.VARIABLE.FACTOR"]),lty=2,col="blue")
Enter the actual value for the level of your discrete independent variable that is fist alphabetically (twice)
abline(lm(DEPENDENT.VARIABLE[INDEPENDENT.VARIABLE.FACTOR=="LEVEL2.INDEPENDENT.VARIABLE.FACTOR"]~INDEPENDENT.VARIABLE.NUMERIC[INDEPENDENT.VARIABLE.FACTOR=="LEVEL2.INDEPENDENT.VARIABLE.FACTOR"]),lty=2,col="red")
Enter the actual value for the level of your discrete independent variable that is second alphabetically (twice)
RDATA$INDEPENDENT.VARIABLE<-as.numeric(RDATA$INDEPENDENT.VARIABLE)
This defines your independent variable as being discontinuous i.e.categorical.
str(RDATA$INDEPENDENT.VARIABLE)
This displays the structure of the variable INDEPENDENT.VARIABLE to check that it is now a factor variable.
TEST.RESULTS<-lm(DEPENDENT.VARIABLE ~ INDEPENDENT.VARIABLE, data=RDATA)
TEST.RESULTS
Very similar to a simple regression.
RDATA$INDEPENDENT.VARIABLE1<-as.numeric(RDATA$INDEPENDENT.VARIABLE1)
str(RDATA$INDEPENDENT.VARIABLE1)
This defines your independent variable as being discontinuous i.e.categorical, and displays the structure of the variable INDEPENDENT.VARIABLE1 to check that it is now a factor variable.
RDATA$INDEPENDENT.VARIABLE2<-as.numeric(RDATA$INDEPENDENT.VARIABLE2)
str(RDATA$INDEPENDENT.VARIABLE2)
This defines your independent variable as being discontinuous i.e.categorical, and displays the structure of the variable INDEPENDENT.VARIABLE2 to check that it is now a factor variable.
TEST.RESULTS<-cor.test(RDATA$INDEPENDENT.VARIABLE1, RDATA$INDEPENDENT.VARIABLE2, method = c("pearson"))
TEST.RESULTS
plot(INDEPENDENT.VARIABLE1~INDEPENDENT.VARIABLE2, data=RDATA, main="YOUR TITLE", xlab="X AXIS LABEL", ylab="Y AXIS LAGEL")
To add the correlation line to the plot:
abline(lm(INDEPENDENT.VARIABLE1~INDEPENDENT.VARIABLE2, data=RDATA), col="red", lty=2)
The types of variables you have and your experiemntal design are reflected in the formula written to model your data.
As shown above an ANOVA compares the fixed effects of 1 or more dependent variables (X1, X2, …), on your independent variable (Y).
For data frame ‘DF’:
aov(Y~ X1 + X2, data = DF)
This models the additive effects of X1 and X2. If X1 and X2 are crossed,
to test for an interaction between X1 & X1:
aov(Y~X1*X2, data= DF)
or
aov(Y~X1+X2+X1:X2, data= DF)
is the same, where we indidate we want the additive effects (X1 + X2) AND the interaction (X1:X2). Or simply (X1*X2) to include all combinations.
[Learn more about Formulae in R for ANOVA & mixed effects models] (http://conjugateprior.org/2013/01/formulae-in-r-anova/)
To create more beautiful custom plots using R try the ‘ggplot2’ package.
www.rseek.org Google search for R only
http://www.statmethods.net –> Quick R, Basic stats for good help outside of R help
http://www.ats.ucla.edu/stat/r/
http://grad.bio.uci.edu/ecoevo/petryw/Introduction_to_R.html
Google is your friend. Searching “how to do ‘X’ in r” will almost always return helpful sites.
include dplyrcheatsheet dplyr tutorial
Formulae for ANOVAs, mixed models, nested models
http://conjugateprior.org/2013/01/formulae-in-r-anova/
http://www.cookbook-r.com/Graphs/ –> Cookbook For R for GGPLOT2
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf –> GGPLOT2 cheatsheet
The ‘dplyr’ package is design to make data manipulation easy. ‘%>%’ is a pipe. This basically means apply the step following the pipe to the dataframe preceeding the pipe. Using this and some of the functions in dplyr data summarization and manipulation can be very easy.
DATA.SUMMARY<-iris%>%
group_by(Species)%>%
summarize_each(funs(mean,sd,se,length))
DATA.SUMMARY
## # A tibble: 3 × 17
## Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean
## <fctr> <dbl> <dbl> <dbl>
## 1 setosa 5.006 3.428 1.462
## 2 versicolor 5.936 2.770 4.260
## 3 virginica 6.588 2.974 5.552
## # ... with 13 more variables: Petal.Width_mean <dbl>,
## # Sepal.Length_sd <dbl>, Sepal.Width_sd <dbl>, Petal.Length_sd <dbl>,
## # Petal.Width_sd <dbl>, Sepal.Length_se <dbl>, Sepal.Width_se <dbl>,
## # Petal.Length_se <dbl>, Petal.Width_se <dbl>,
## # Sepal.Length_length <int>, Sepal.Width_length <int>,
## # Petal.Length_length <int>, Petal.Width_length <int>