Con Bio Lab R Markdown Tutorial

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. See the R Markdown Cheat Sheet and Reference Guide for more formatting options.

Con Bio Lab Assignment (2pts)

Run a linear model for the old faithful dataset

Below is sample code to use R markdown. Use this code as a template to test the assumptions of the linear model and results for the data("faithful") data set. This data set measures waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Workflow:

Step #1: Register for a free R Pubs account at http://rpubs.com

Step #2: Open a new R markdown file in RStudio following: File > New File > R Markdown…

Step #3 Copy and paste the pre-amble above in your new R Markdown File. You can change the Title, Author, and Date.

Step #4: Work through the template below to understand the basic features of the R Markdown Language. You will find after a short time that the code makes sense. You may even find that you can become more organized in your workflow.

Step #5: The assignment.

Your task is to create a similar HTML document as below using the data("faithful") data set. Once complete, publish your data to rpubs.com (blue ‘eye-like’ icon at the top). On rpubs.com, click share. Copy and paste the web address. Submit the link to me Sakai in Assignments. I will grade your assignment directly from the link.

Running an ANOVA model with R Markdown

Step One: Load libraries and transformation functions

# ANOVA WORKFLOW - The Nuts and Bolts
library(ggplot2)
library(plyr)

# Data Transformation Section ----

Cube.Tns <- function (x) { x ^ 3 }

Square.Tns <- function (x) { x ^ 2 }

Raw.Tns <- function (x) { x }

Sqrt.Tns <- function (x) { sqrt(x) }

Log.Tns <- function (x) { log10(x + 0.00001) }

RecipRoot.Tns <- function (x) { -1 / sqrt(x) }

Recip.Tns <- function (x) { -1 / (x) }

InvSquare.Tns <- function (x) { -1 / (x ^ 2) }

Step Two: Import the data

data("iris")  # Import iris data

head(iris)  # first six rows of data

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

summary(iris)  # summary = mean, median, min, etc.

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

str(iris)  # attributes of the dataframe

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Step Three: Inspect the data distributions

hist(iris$Sepal.Length) # histogram

Figure: Histogram of Sepal Length from the iris data set.

qqnorm(iris$Sepal.Length) # QQ-plot of raw data
qqline(iris$Sepal.Length)

Figure: QQ-Norm plot of Sepal Length from the iris data set.

boxplot(iris$Sepal.Length ~ iris$Species) # Boxplot

Figure: Boxplot of Sepal Length from the iris data set.

Step Four: Transformation needed?

iris$Sepal.LengthLOG <- Log.Tns(iris$Sepal.Length) # Log transform

qqnorm(iris$Sepal.LengthLOG) # QQ-plot of Log transformation
qqline(iris$Sepal.LengthLOG)

Figure: Log Transformation QQ-Norm plot of Sepal Length from the iris data set.

Step Five: Run the ANOVA model

A1 <- aov(Sepal.Length ~ Species, data = iris)
plot(A1) # Assumptions met

Figure: Diagnostic plots for untransformed data for Sepal Length vs. Iris Species.

summary(A1) # Significant p-value

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(A1) # Significant multiple comparisons

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0

Step Six: Plot the results

# ddply() is loaded in the plyr package
graph.df <- ddply(iris, c("Species"), summarise,
                  N    = sum(!is.na(Sepal.Length)),
                  mean = mean(Sepal.Length, na.rm=TRUE),
                  sd   = sd(Sepal.Length, na.rm=TRUE),
                  se   = sd / sqrt(N))

graph.df # summary statistics used for ggplot()

##      Species  N  mean        sd         se
## 1     setosa 50 5.006 0.3524897 0.04984957
## 2 versicolor 50 5.936 0.5161711 0.07299762
## 3  virginica 50 6.588 0.6358796 0.08992695

Figure: Mean +/- standard deviation for raw data of Sepal Length vs. Iris Species. (n = 50)