R is a programming language often used for statistics. It is object-oriented which means data, analyses, plots, and everything else is attached to “objects”. You can make whatever kinds of objects you want!
Each object will have a name (or “variable”) associated with it. You define this by:
“Object Name <– Object parameters/code”
In this section, we will create an object using the “PlantGrowth” dataset (a common example data used when introducing statistical analysis in R, then create some plots and analyze the data statistically. Also note our formatting to create a “code chunk”. We will go over more specific formatting later in this script.
In the “Your Turn” sections, you will parallel the examples using a separate dataset of plant growth over time.
# First step: Read in the "PlantGrowth" Data (located in a .csv file of the local directory) (NOTE: This is a comment. It is distinguished by the "#" sign preceding the line, and is only for our information. R does not interpret this line)
# Read in the data from out "PlantGrowth.csv" file
# To run this line, move your carriage to the line and press "CMD+ENTER" (Mac) or "CTRL+ENTER" (Windows). You will see it run in the console below.
Plant <- read.csv("PlantGrowth.csv")
# OR We could look at the data structure using "str()" (NOTE: This displays some of the data, along with the type. We should notice "group" is currently a character datatype, when it should be a factor)
str(Plant)
## 'data.frame': 30 obs. of 3 variables:
## $ Time : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : chr "Control" "Control" "Control" "Control" ...
#The "group" column is current a "chr" (or character) datatype. This is incorrect, it's actually a factor (e.g. grouping variable). We correct the datatype and re-display the structure here.
Plant$group <- as.factor(Plant$group)
str(Plant)
## 'data.frame': 30 obs. of 3 variables:
## $ Time : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "Control","Fertilizer",..: 1 1 1 1 1 1 1 1 1 1 ...
Now complete the exercise above, but use the new data set “PlantGrowth V2”. Fill in the _________ sections with the correct parameter. Be sure to add your own comments with “#” to describe what is happening.
This dataset is an example time series of plant growth using the same grouping structure. Note those differences and remeber them for plotting later.
Plant2 <- read.csv("PlantGrowth V2.csv")
#HINT: What .csv file should we use?
Plant2$group <- as.factor(Plant2$group)
# HINT: Let's change the "chr" datatype to a "Factor" type.
str(Plant2)
## 'data.frame': 30 obs. of 3 variables:
## $ Time : int 1 2 3 4 5 6 7 8 9 10 ...
## $ biomass: num 4.54 4.03 4.87 5.04 6.24 ...
## $ group : Factor w/ 3 levels "Control","Fertilizer",..: 1 1 1 1 1 1 1 1 1 1 ...
Plotting is one of the most important functions of R! It allows you to detect trends and display those trends for a broader audience. For this reason, it’s a great first AND last step in any R project. The first graphs help you plan your data analysis. The last graphs go into your reports/publications and presentations.
For this class, we will use an R package called “ggplot2”. ggplot is a useful collection of tools and formating commands for plotting in R
# First we install ggplot (if we don't already have it).
# Run : " install.packages("ggplot2") " in your console below
# Then we load the package into our current environment. This makes the ggplot2 commands avaialible to us.
library(ggplot2)
# Let's make a boxplot of the change in weight by treatment group
ggplot(data = Plant, mapping=aes(x=group, y=weight))+
geom_boxplot()+
xlab("Treatment")+
ylab("Biomass (g)")+
theme_bw()
Now recreate the exercise above using the Plant2 object we previously created. This dataset is a time sequence, so your plots should look very different! Be sure to add comments to describe what each line does.
# NOTE: We already loaded ggplot2 above, so we don't need to do it again
# Let's make a point plot of the change in biomass by treatment group to accurately display the time series.
ggplot(Plant2, aes(x=Time, y=biomass, color=group))+
geom_boxplot()+ #HINT: Make a point plot. geom_?
xlab("Time Period (Days)")+
ylab("Biomass (g)")+
theme_bw()
Statistics further define what you can say about your data (and what you cannot). Often we are looking for significant differences between groups of data.
Here we will use an Analysis of Variance and T-tests with a linear model to determine if there is a significant difference between treatment groups for dataset one (“Plant”)
Answer each question within the code block or right below it
# First, we make a linear model of the data using the "lm()" function
model <- lm(formula = weight ~ group, data = Plant)
# Summary shows us some basic information about the model.
#QUESTION: Compare the "estimate" column under coefficients with the boxplot from earlier. What is this column telling us?
##The "estimate" column under coefficients is telling us the estimate standard deviation of each group.
summary(model)
##
## Call:
## lm(formula = weight ~ group, data = Plant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0710 -0.4180 -0.0060 0.2627 1.3690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0320 0.1971 25.527 <2e-16 ***
## groupFertilizer 0.4940 0.2788 1.772 0.0877 .
## groupHerbicide -0.3710 0.2788 -1.331 0.1944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096
## F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591
# Using "plot()" on a linear model will generate four diagnostic plots. There is more information here than can be presented in a model.
plot(model)
# Let's create an object containing an Analysis of Variance model, then display the results. This will allow us to run a Tukey's post-hoc test
aov.model <- aov(model)
aov.model
## Call:
## aov(formula = model)
##
## Terms:
## group Residuals
## Sum of Squares 3.76634 10.49209
## Deg. of Freedom 2 27
##
## Residual standard error: 0.6233746
## Estimated effects may be unbalanced
# A Tukey post-hoc test is a common test for significant differences between treatment groups. Look at the "p adj" (p-value) column.
# QUESTION: Which groups were significantly different from each other? What was the average difference in biomass for this group?
##Each group was significantly different from one another as the Fertilizer-Control had a p-value of 0.1786670 while the Herbicide-Control group had a p-value of 0.0027926 and the Herbicide-Fertilizer had a p-value of 0.0000220. The average difference in biomass was 2.721042.
TukeyHSD(x = aov.model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = model)
##
## $group
## diff lwr upr p adj
## Fertilizer-Control 0.494 -0.1972161 1.1852161 0.1979960
## Herbicide-Control -0.371 -1.0622161 0.3202161 0.3908711
## Herbicide-Fertilizer -0.865 -1.5562161 -0.1737839 0.0120064
Congrats on loading, plotting, and analyzing a dataset. You completed this activity in a “.RMD” file (e.g. an R Markdown file). This is a format that allows text, individual code blocks, and “knitted” final reports for ease of use and collaboration.
You will run all of you analyses this semester in R Markdown files. When turning in an assignment, you will be required to attach the source code file (this file), and a knitted report
Once you have completed all of the above activities, press “knit”>“knit to HTML” in the toolbar above. You should generate an “HTML” file which displays all your graphs. This file should be saved to your local directory (the same as the .RMD file) automatically.
Be sure to include this file in your Canvas submission for full points!