This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. See the R Markdown Cheat Sheet and Reference Guide for more formatting options.
Run a linear model for the old faithful dataset
Below is sample code to use R markdown. Use this code as a template to test the assumptions of the linear model and results for the data("faithful")
data set. This data set measures waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
Step #1: Register for a free R Pubs account at http://rpubs.com
Step #2: Open a new R markdown file in RStudio following: File > New File > R Markdown…
Step #3 Copy and paste the pre-amble above in your new R Markdown File. You can change the Title, Author, and Date.
Step #4: Work through the template below to understand the basic features of the R Markdown Language. You will find after a short time that the code makes sense. You may even find that you can become more organized in your workflow.
Step #5: The assignment.
Your task is to create a similar HTML document as below using the data("faithful")
data set. Once complete, publish your data to rpubs.com (blue ‘eye-like’ icon at the top). On rpubs.com, click share. Copy and paste the web address. Submit the link to me Sakai in Assignments. I will grade your assignment directly from the link.
# ANOVA WORKFLOW - The Nuts and Bolts
library(ggplot2)
library(plyr)
# Data Transformation Section ----
Cube.Tns <- function (x) { x ^ 3 }
Square.Tns <- function (x) { x ^ 2 }
Raw.Tns <- function (x) { x }
Sqrt.Tns <- function (x) { sqrt(x) }
Log.Tns <- function (x) { log10(x + 0.00001) }
RecipRoot.Tns <- function (x) { -1 / sqrt(x) }
Recip.Tns <- function (x) { -1 / (x) }
InvSquare.Tns <- function (x) { -1 / (x ^ 2) }
data("iris") # Import iris data
head(iris) # first six rows of data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris) # summary = mean, median, min, etc.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris) # attributes of the dataframe
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
hist(iris$Sepal.Length) # histogram
Figure: Histogram of Sepal Length from the iris data set.
qqnorm(iris$Sepal.Length) # QQ-plot of raw data
qqline(iris$Sepal.Length)
Figure: QQ-Norm plot of Sepal Length from the iris data set.
boxplot(iris$Sepal.Length ~ iris$Species) # Boxplot
Figure: Boxplot of Sepal Length from the iris data set.
iris$Sepal.LengthLOG <- Log.Tns(iris$Sepal.Length) # Log transform
qqnorm(iris$Sepal.LengthLOG) # QQ-plot of Log transformation
qqline(iris$Sepal.LengthLOG)
Figure: Log Transformation QQ-Norm plot of Sepal Length from the iris data set.
A1 <- aov(Sepal.Length ~ Species, data = iris)
plot(A1) # Assumptions met
Figure: Diagnostic plots for untransformed data for Sepal Length vs. Iris Species.
Figure: Diagnostic plots for untransformed data for Sepal Length vs. Iris Species.
Figure: Diagnostic plots for untransformed data for Sepal Length vs. Iris Species.
Figure: Diagnostic plots for untransformed data for Sepal Length vs. Iris Species.
summary(A1) # Significant p-value
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(A1) # Significant multiple comparisons
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
# ddply() is loaded in the plyr package
graph.df <- ddply(iris, c("Species"), summarise,
N = sum(!is.na(Sepal.Length)),
mean = mean(Sepal.Length, na.rm=TRUE),
sd = sd(Sepal.Length, na.rm=TRUE),
se = sd / sqrt(N))
graph.df # summary statistics used for ggplot()
## Species N mean sd se
## 1 setosa 50 5.006 0.3524897 0.04984957
## 2 versicolor 50 5.936 0.5161711 0.07299762
## 3 virginica 50 6.588 0.6358796 0.08992695
Figure: Mean +/- standard deviation for raw data of Sepal Length vs. Iris Species. (n = 50)