I use the RStudio menu to select my working directory.
setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab6_review")
I am going to work with a file called light.RData. I can use list.files() to check that the working directory that I have set contains that file.
list.files()
## [1] "finchs.csv"
## [2] "FUNCTIONplot2means.R"
## [3] "Lab_6_basic_data_analysis_review.html"
## [4] "Lab_6_basic_data_analysis_review.Rmd"
## [5] "Lab_6_basic_data_analysis_review_INCLASS_EXERCISE.html"
## [6] "Lab_6_basic_data_analysis_review_INCLASS_EXERCISE.Rmd"
## [7] "light.RData"
## [8] "plot2means.jpg"
## [9] "plot2means.png"
## [10] "plot2means.RData"
## [11] "R_on_youtube_assignment.xlsx"
## [12] "rsconnect"
## [13] "subset.jpg"
## [14] "tapply.jpg"
R can save data into its own species format. Here, I have saved data from a spreadsheet into an .RData file. This data can be loaded into my R session using the load() command. NOTE: the name of the file has to be in parentheses.
load("light.RData")
I can confirm that the data is loaded using the ls() command. The file “light.Rdata” loads an R object called “light”
ls()
## [1] "light"
I can use the standard command head, tail, and summary to check out the data.
head(light)
## treatment response
## 869 C -4.504766
## 871 C -3.164909
## 873 C -4.320002
## 875 C -4.486886
## 877 C -4.207244
## 879 C -4.591058
tail(light)
## treatment response
## 1002 E -3.310792
## 1004 E -4.135877
## 1006 E -5.112481
## 1008 E -4.780870
## 1010 E -4.822527
## 1012 E -4.761931
summary(light)
## treatment response
## C:36 Min. :-5.537
## E:36 1st Qu.:-5.026
## Median :-4.758
## Mean :-4.592
## 3rd Qu.:-4.330
## Max. :-2.357
These are data from deer exclusion study at Trillium Trail in Pittsburgh, PA. C = control plots, E = exclusion plots. The numeric response variable is the ratio of sunlight at the forest floor to sunlight above the canopy. Actually, its the log of that ratio, eg log(light below / light above). The hypothesis is that when deer are excluded, vegetation increases in density and therefore light reachign the forest floor decreases.
The data is organized into groups so we will want to split it up for our real analysis. HOwever, it can be informative just to look at the raw data with a histogram.
hist(light$response)
The data are skewed. Such is life.
The boxplot() function naturally handles grouped data so was can easily visualize these dat.
boxplot(response ~ treatment, data = light)
Light levels are higher in the “C” treatment (the control where deer have access and can eat the vegetation)
These data are split into two groups. Some rows of data are from control plots (“C”) and some are from exclosures (“E”). The boxplot() function naturally handles data structured like this. (The t.test function also does, but we won’t work with that today).
Not not all R functions handle grouped data so easily. We’ll use the subset() command to split the data into two seperate R objects that we can then work with seperately.
The subset() command takes three arguements
The command we will use to subset the data is “treatment ==”C" “, which tells the subset command,”look at the treatment column and give me just those rows that have “C” in them“.
We use the arguement “select = c(”treatment“,”response“)” to tell subset to give us both the treatment and response columns in our output.
The picture below is an annotated version of the subset command, though applied to a differen dataset, not the light data.
control.data <- subset(light,
treatment == "C",
select = c("treatment",
"response"))
We can see if this worked by using the summary() command. Under the treatment column, summary() indicates that only the variable “C” appears
summary(control.data)
## treatment response
## C:36 Min. :-5.097
## E: 0 1st Qu.:-4.857
## Median :-4.517
## Mean :-4.354
## 3rd Qu.:-4.167
## Max. :-2.407
exclosure.data <- subset(light,
treatment == "E",
select = c("treatment",
"response"))
Now that the data are split up we can make a histogram to look at the distributions of the data.
With the hist() command we have to tell R exactly which column we want it to use. We can’t just tell R hist(control.data), because there are two columns in the control.data dataframe. We have to use the dollar sign operator $ to tell hist() to select just the “response”" column, eg hist(control.data$response)
hist(control.data$response)
To plot the two datasets side by side we need to use the par() command to change the plotting window settings. Note the x-axes are set differently because the ranges of the data are somewhat different.
par(mfrow = c(1,2))
hist(control.data$response)
hist(exclosure.data$response)
We can get the means, standard deviations, etc of each set of data easily using the mean(), sd() command. Note that again we must use the dollar sign $ to define the column we want to use.
Means
mean(control.data$response)
## [1] -4.35354
mean(exclosure.data$response)
## [1] -4.829708
Standard deviations
sd(control.data$response)
## [1] 0.6722443
sd(exclosure.data$response)
## [1] 0.6073773
We can get the sample size using the length() command
length(control.data$response)
## [1] 36
length(exclosure.data$response)
## [1] 36
Learning to split up data using the subset() is an important skill, but it isn’t always necessary. We can get the means of each group of data more quickly by using a special function caleld tapply() which will split our data up “on the fly” and apply the mean() or other functions to subsets of the data for us.
The tapply() functions takes 3 main argument.
i) The numeric variable that you want to get information from (ie, mean values by group).
ii) The categorical variable that defines the groups in the data.
iii)The mathematical function to apply to the data (mean, sd, min, etc)
my.means <- tapply(light$response,
light$treatment,
FUN = mean,
na.rm = T)
The output
my.means
## C E
## -4.353540 -4.829708
If we want to make a standard plot with means and error bars we need these things
We have the means for each group. We can use the tapply function to calculate the standard deviations using the sd() function.
my.sd <- tapply(light$response,
light$treatment,
FUN = sd,
na.rm = T)
We can get the samples sizes (n) using summary(). Both treatment have 36 samples.
summary(light)
## treatment response
## C:36 Min. :-5.537
## E:36 1st Qu.:-5.026
## Median :-4.758
## Mean :-4.592
## 3rd Qu.:-4.330
## Max. :-2.357
Let’s store that information in an R object
my.n <- c(36,36)
We can calcualte the standard error as sd/sqrt(N)
my.se <- my.sd/sqrt(my.n)
To plot the means we’ll use my function plot2means. We could load this code by hand by cutting and pasting code from http://rpubs.com/brouwern/plot2means , but instead I have saved the function as an .RData file that we can load. .RData files can hold any R object, whether its a dataframe like the light.RData we loaded earlier or a function.
The load() command adds the function plot2means() to our R sessions.
load("plot2means.RData")
The plot2means arguement takes on three major arguements.
Here’s an annotated version of the function
plot2means(means = my.means,
se = my.se,
categories = c("C","E"))