are found at https://github.com/THOMASELOVE/431-2020/blob/master/software/README.md
This is meant to walk you through the steps of completing some elementary analyses using R, R Studio, and R Markdown. Working through this document will definitely help you get rolling on our labs and other assignments.
pqhs431/2020-08-31_my-first-R-project
431-r-template.Rmd
file into your project folder.A R Markdown file is just a plain text document, with interspersed R code that lets you produce reports that combine narration with results, and that can be easily exported as an HTML, Word or PDF file. It’s a great tool. Dr. Love builds virtually everything you’ll see in this class with R Markdown. R Markdown files use the .Rmd
extension.
Open the template file 431-r-template.Rmd
by clicking on it in the Files tab on the lower right of your R Studio setup, or selecting File … Open from the main menu. + The start of the template file is a top-line set of instructions to R Markdown about how to process the rest of the document. It is referred to as the YAML material, and looks like this:
---
title: "R Markdown Template"
author: "Your Name"
date: "2020-08-28"
output:
html_document:
toc: yes
code_folding: show
---
Actually, next to the word date:
, your template has a little bit of code that inserts today’s date. I’d leave that as it is - that way you don’t need to switch dates around, and your document will always show the date on which it was last generated.
Now, edit the file to include a meaningful Title for this work, and place your actual name in the author section. Then use File … Save as to save the Markdown file under a new project-specific name, rather than the generic 431-r-template.Rmd
. Your result should look something like this:
---
title: "My extremely exciting first data analysis"
author: "Chris Traeger"
date: "2020-08-28"
output:
html_document:
toc: yes
code_folding: show
---
In most cases, changing only the title and author, but otherwise leaving this as is, will work well for our purposes.
To learn more about R Markdown, we recommend the tutorials at http://rmarkdown.rstudio.com/lesson-1.html
To begin, we’ll load the packages (libraries) and set up an option for commenting that we will use in our analyses.
knitr::opts_chunk$set(comment=NA)
library(magrittr); library(tidyverse)
The chickwts
data, available as part of the base installation of R (in the datasets
package) describe an experiment conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. For more on the chickwts
data, type ?(chickwts)
into the R console.
We’ll begin by placing the data in a tibble called chick
.
chick <- as_tibble(chickwts)
chick
# A tibble: 71 x 2
weight feed
<dbl> <fct>
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
7 108 horsebean
8 124 horsebean
9 143 horsebean
10 140 horsebean
# ... with 61 more rows
weight
variable is numeric (double-precision) and gives the chick’s weight.feed
variable is categorical (a factor in R) and gives the feed type.feed
The regular summary
function can provide some useful results.
chick %>%
select(feed) %>%
summary()
feed
casein :12
horsebean:10
linseed :12
meatmeal :11
soybean :14
sunflower:12
There are lots of ways to generate a table for a factor (categorical variable) like this.
chick %>%
select(feed) %>%
table() %>%
addmargins()
.
casein horsebean linseed meatmeal soybean sunflower Sum
12 10 12 11 14 12 71
weight
, numericallyThe regular summary
function provides a five-number summary, plus the mean. We can do this with …
chick %>%
select(weight) %>%
summary()
weight
Min. :108.0
1st Qu.:204.5
Median :258.0
Mean :261.3
3rd Qu.:323.5
Max. :423.0
Or, we can use a different pipe than the usual %>%
- here %$%
exposes the pieces of the chick
tibble (the variables) to the function summary()
.
chick %$% summary(weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
108.0 204.5 258.0 261.3 323.5 423.0
The favstats
function from the mosaic
package produces a more extensive set of numerical summaries. Here, we are forced to use the new pipe %$%
to identify the variables for the function favstats
in mosaic
.
chick %$%
mosaic::favstats(weight)
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
min Q1 median Q3 max mean sd n missing
108 204.5 258 323.5 423 261.3099 78.0737 71 0
Another way to accomplish the same end is
mosaic::favstats(chick$weight)
min Q1 median Q3 max mean sd n missing
108 204.5 258 323.5 423 261.3099 78.0737 71 0
Here is a smaller numerical summary of the weights broken down by feed category.
chick %>%
group_by(feed) %>%
summarize(mean(weight), sd(weight), median(weight))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 4
feed `mean(weight)` `sd(weight)` `median(weight)`
<fct> <dbl> <dbl> <dbl>
1 casein 324. 64.4 342
2 horsebean 160. 38.6 152.
3 linseed 219. 52.2 221
4 meatmeal 277. 64.9 263
5 soybean 246. 54.1 248
6 sunflower 329. 48.8 328
weight
dataHere is the default approach.
ggplot(chick, aes(x = weight)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note the warning about the number of bins with which the histogram is constructed by default. We could ignore this warning, or we could play around with the number of bins to get something that shows the variation in the data more effectively. We’ll try something a little smaller than the default number of 30 bins.
As we’re doing that, on the top of the next page, we’ll make something that is slightly more attractive (to my eye), revise the labels, and place a title.
ggplot(chick, aes(x = weight)) +
geom_histogram(bins = 10, color = "black", fill = "turquoise") +
labs(x = "Chick's Weight", y = "Number of Chicks",
title = "Size of 71 Chicks in the chickwts data")
Another option would be to plot the density function, rather than the raw counts, and compare it directly to what we would expect from a Normal model with the same mean and standard deviation as the weights in the chick
data.
ggplot(chick, aes(x = weight)) +
geom_histogram(aes(y = ..density..), bins = 12, color = "white", fill = "turquoise") +
stat_function(fun = dnorm,
args = list(mean = mean(chick$weight), sd = sd(chick$weight)),
lwd = 1.5, col = "violetred") +
labs(x = "Chick's Weight", y = "Probability Density",
title = "Size of 71 Chicks in the chickwts data",
subtitle = "with superimposed Normal density function")
A boxplot might, for instance, compare the weight distributions for each of the various types of feed.
ggplot(chick, aes(x = feed, y = weight, fill = feed)) +
geom_boxplot() +
labs(title = "Weight by Feed Type in the Chickwts data")
We could drop the labels on the right hand side by adding guides(fill = FALSE) +
somewhere in our ggplot
call.
A Normal Q-Q plot of the weights is probably most easily obtained using base graphics, rather than ggplot. For example,
qqnorm(chick$weight, main = "Normal Q-Q plot of Chick Weights")
qqline(chick$weight, col = "red")
The Orange data frame has 35 rows and 3 columns of records of the growth of orange trees. Let’s get the data into a tibble.
orange <- tbl_df(Orange)
Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
orange
# A tibble: 35 x 3
Tree age circumference
<ord> <dbl> <dbl>
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
7 1 1582 145
8 2 118 33
9 2 484 69
10 2 664 111
# ... with 25 more rows
tree
is an ordinal factor, which indicates the tree on which the measurement was made. The ordering is by increaing maximum diameter of the five trees.age
is a numerical variable, containing the age of the tree as measured in days since 1968-12-31.circumference
is a numerical variable, containing the trunk circumference (probably at “breast height”) in mm.And here’s the standard numerical summary for the full data set.
summary(orange)
Tree age circumference
3:7 Min. : 118.0 Min. : 30.0
1:7 1st Qu.: 484.0 1st Qu.: 65.5
5:7 Median :1004.0 Median :115.0
2:7 Mean : 922.1 Mean :115.9
4:7 3rd Qu.:1372.0 3rd Qu.:161.5
Max. :1582.0 Max. :214.0
Next, we’ll look at the mean age and circumference, within each of the seven measurements per tree.
orange %>%
group_by(Tree) %>%
summarize(mean(age), mean(circumference))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 3
Tree `mean(age)` `mean(circumference)`
<ord> <dbl> <dbl>
1 3 922. 94
2 1 922. 99.6
3 5 922. 111.
4 2 922. 135.
5 4 922. 139.
It looks like each of the trees was measured at exactly the same time (age).
table(orange$age, orange$Tree)
3 1 5 2 4
118 1 1 1 1 1
484 1 1 1 1 1
664 1 1 1 1 1
1004 1 1 1 1 1
1231 1 1 1 1 1
1372 1 1 1 1 1
1582 1 1 1 1 1
Yes, each tree was measured at precisely the same five times.
Here’s another case that calls for the %$%
pipe.
orange %$% cor(age, circumference)
[1] 0.9135189
Or, obtain the identical result with…
cor(orange$age, orange$circumference)
[1] 0.9135189
The Pearson correlation of age and circumference is 0.91 which is pretty strong, indicating that we’d expect to see a fairly positive and mostly linear association in a scatterplot. So, let’s see if that’s what we get.
ggplot(orange, aes(x = age, y = circumference)) +
geom_point()
OK. Let’s add a linear model to this plot, as well as some better labels, and we’ll change from mapping the points as observed to using geom_jitter
to add a little horizontal (x-axis) jitter to the points, so that we don’t have so much overlap.
ggplot(orange, aes(x = age, y = circumference)) +
geom_jitter(width = 20, height = 0) +
geom_smooth(method = "lm", col = "orange") +
labs(title = "Growth of Orange Trees",
x = "Age (days since 1968-12-31)",
y = "Circumference in mm") +
theme_bw()
`geom_smooth()` using formula 'y ~ x'
The linear model fitted here is summarized below:
model1 <- lm(circumference ~ age, data = orange)
summary(model1)
Call:
lm(formula = circumference ~ age, data = orange)
Residuals:
Min 1Q Median 3Q Max
-46.310 -14.946 -0.076 19.697 45.111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.399650 8.622660 2.018 0.0518 .
age 0.106770 0.008277 12.900 1.93e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.74 on 33 degrees of freedom
Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
So the linear regression model is: circumference = 17.4 + 0.107 age.
So our predicted circumference for a tree of age 1000 days is 124 mm.
As a third option, let’s fit separate smooth (loess
) curves to each of the five individual trees, and plot each of them in different colors.
ggplot(orange, aes(x = age, y = circumference, col = Tree)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Growth of Orange Trees, with Scatterplot Smooths",
x = "Age (days since 1968-12-31)", y = "Circumference in mm")
`geom_smooth()` using formula 'y ~ x'
Or we could facet the plots, showing multiple scatterplots, one for each Tree.
ggplot(orange, aes(x = age, y = circumference, col = Tree)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~ Tree) +
guides(col = FALSE) +
labs(title = "Growth of Orange Trees over Time, with Scatterplot Smooths",
x = "Days since 1968-12-31", y = "Circumference (in mm)")
`geom_smooth()` using formula 'y ~ x'
The easiest way to get data from another software package into R is to save the file (from within the other software package) in a form that R can read. What you want is to end up with an Excel file that looks like this…
An Excel sheet with a tidy data set
This tidy data set contains:
The variable names are in the first row, and the data are in the remaining rows (2-10 in this small example). Categorical variables are most easily indicated by letters (drug A or B, for instance) while continuous variables, like response, are indicated by numbers. Leave missing cells blank or use the symbol NA
, rather than indicating them with, say, -99
or some other code.
Within Excel, this file can be saved as a .csv
(comma-separated text file) or just as an Excel .XLS file, and then imported directly into R, via RStudio by clicking Import Dataset under the Workspace tab, then selecting From Text File. If you’ve saved the file in Excel as a .csv
file, R Studio will generally make correct guesses about how to import the file. Once imported, you just need to save the workspace when you quit RStudio and you’ll avoid the need to re-import.