are found at https://github.com/THOMASELOVE/431-2021/blob/main/software/README.md
431-first-r-template.Rmd file I’ve provided.This is meant to walk you through the steps of completing some elementary analyses using R, R Studio, and R Markdown. Working through this document will definitely help you get rolling on our labs and other assignments.
pqhs431/2021-08-31_my-first-R-project431-first-r-template.Rmd file into your project folder.A R Markdown file is just a plain text document, with interspersed R code that lets you produce reports that combine narration with results, and that can be easily exported as an HTML, Word or PDF file. It’s a great tool. Dr. Love builds virtually everything you’ll see in this class with R Markdown. R Markdown files use the .Rmd extension.
Open the template file 431-first-r-template.Rmd by clicking on it in the Files tab on the lower right of your R Studio setup, or selecting File … Open from the main menu. + The start of the template file is a top-line set of instructions to R Markdown about how to process the rest of the document. It is referred to as the YAML material, and looks like this:
---
title: "A First R Markdown Template"
author: "Your Name"
date: "2021-08-21"
output:
  html_document: 
    toc: yes
    code_folding: show
---Actually, next to the word date:, your template has a little bit of code that inserts today’s date. I’d leave that as it is - that way you don’t need to switch dates around, and your document will always show the date on which it was last generated.
Now, edit the file to include a meaningful Title for this work, and place your actual name in the author section. Then use File … Save as to save the Markdown file under a new project-specific name, rather than the generic 431-first-r-template.Rmd. Your result should look something like this:
---
title: "My extremely exciting first data analysis"
author: "Chris Traeger"
date: "2021-08-21"
output:
  html_document: 
    toc: yes
    code_folding: show
---In most cases, changing only the title and author, but otherwise leaving this as is, will work well for our purposes.
To learn more about R Markdown, we recommend the tutorials at http://rmarkdown.rstudio.com/lesson-1.html
To begin, we’ll load the packages (libraries) and set up an option for commenting that we will use in our analyses.
knitr::opts_chunk$set(comment=NA)
library(magrittr); library(tidyverse)The chickwts data, available as part of the base installation of R (in the datasets package) describe an experiment conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. For more on the chickwts data, type ?(chickwts) into the R console.
We’ll begin by placing the data in a tibble called chick.
chick <- as_tibble(chickwts)
chick# A tibble: 71 x 2
   weight feed     
    <dbl> <fct>    
 1    179 horsebean
 2    160 horsebean
 3    136 horsebean
 4    227 horsebean
 5    217 horsebean
 6    168 horsebean
 7    108 horsebean
 8    124 horsebean
 9    143 horsebean
10    140 horsebean
# ... with 61 more rowsweight variable is numeric (double-precision) and gives the chick’s weight.feed variable is categorical (a factor in R) and gives the feed type.feedThe regular summary function can provide some useful results.
chick %>% 
    select(feed) %>% 
    summary()        feed   
 casein   :12  
 horsebean:10  
 linseed  :12  
 meatmeal :11  
 soybean  :14  
 sunflower:12  There are lots of ways to generate a table for a factor (categorical variable) like this.
chick %>%
  select(feed) %>%
    table() %>%
    addmargins() .
   casein horsebean   linseed  meatmeal   soybean sunflower       Sum 
       12        10        12        11        14        12        71 weight, numericallyThe regular summary function provides a five-number summary, plus the mean. We can do this with …
chick %>%
    select(weight) %>%
    summary()     weight     
 Min.   :108.0  
 1st Qu.:204.5  
 Median :258.0  
 Mean   :261.3  
 3rd Qu.:323.5  
 Max.   :423.0  Or, we can use a different pipe than the usual %>% - here %$% exposes the pieces of the chick tibble (the variables) to the function summary().
chick %$% summary(weight)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  108.0   204.5   258.0   261.3   323.5   423.0 The favstats function from the mosaic package produces a more extensive set of numerical summaries. Here, we are forced to use the new pipe %$% to identify the variables for the function favstats in mosaic.
chick %$%
    mosaic::favstats(weight)Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2 min    Q1 median    Q3 max     mean      sd  n missing
 108 204.5    258 323.5 423 261.3099 78.0737 71       0Another way to accomplish the same end is
mosaic::favstats(chick$weight) min    Q1 median    Q3 max     mean      sd  n missing
 108 204.5    258 323.5 423 261.3099 78.0737 71       0Here is a smaller numerical summary of the weights broken down by feed category.
chick %>%
  group_by(feed) %>%
  summarize(mean(weight), sd(weight), median(weight))# A tibble: 6 x 4
  feed      `mean(weight)` `sd(weight)` `median(weight)`
  <fct>              <dbl>        <dbl>            <dbl>
1 casein              324.         64.4             342 
2 horsebean           160.         38.6             152.
3 linseed             219.         52.2             221 
4 meatmeal            277.         64.9             263 
5 soybean             246.         54.1             248 
6 sunflower           329.         48.8             328 weight dataHere is the default approach.
ggplot(chick, aes(x = weight)) +
  geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.Note the warning about the number of bins with which the histogram is constructed by default. We could ignore this warning, or we could play around with the number of bins to get something that shows the variation in the data more effectively. We’ll try something a little smaller than the default number of 30 bins.
As we’re doing that, on the top of the next page, we’ll make something that is slightly more attractive (to my eye), revise the labels, and place a title.
ggplot(chick, aes(x = weight)) +
  geom_histogram(bins = 10, color = "black", fill = "turquoise") + 
  labs(x = "Chick's Weight", y = "Number of Chicks", 
       title = "Size of 71 Chicks in the chickwts data")Another option would be to plot the density function, rather than the raw counts, and compare it directly to what we would expect from a Normal model with the same mean and standard deviation as the weights in the chick data.
ggplot(chick, aes(x = weight)) +
  geom_histogram(aes(y = ..density..), bins = 12, color = "white", fill = "turquoise") + 
  stat_function(fun = dnorm,
                args = list(mean = mean(chick$weight), sd = sd(chick$weight)),
                lwd = 1.5, col = "violetred") +
  labs(x = "Chick's Weight", y = "Probability Density", 
       title = "Size of 71 Chicks in the chickwts data",
       subtitle = "with superimposed Normal density function")A boxplot might, for instance, compare the weight distributions for each of the various types of feed.
ggplot(chick, aes(x = feed, y = weight, fill = feed)) +
    geom_boxplot() +
    labs(title = "Weight by Feed Type in the Chickwts data")We could drop the labels on the right hand side by adding guides(fill = FALSE) + somewhere in our ggplot call.
A Normal Q-Q plot of the weights is probably most easily obtained using base graphics, rather than ggplot. For example,
qqnorm(chick$weight, main = "Normal Q-Q plot of Chick Weights")
qqline(chick$weight, col = "red")The Orange data frame has 35 rows and 3 columns of records of the growth of orange trees. Let’s get the data into a tibble.
orange <- tbl_df(Orange)Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.orange# A tibble: 35 x 3
   Tree    age circumference
   <ord> <dbl>         <dbl>
 1 1       118            30
 2 1       484            58
 3 1       664            87
 4 1      1004           115
 5 1      1231           120
 6 1      1372           142
 7 1      1582           145
 8 2       118            33
 9 2       484            69
10 2       664           111
# ... with 25 more rowstree is an ordinal factor, which indicates the tree on which the measurement was made. The ordering is by increaing maximum diameter of the five trees.age is a numerical variable, containing the age of the tree as measured in days since 1968-12-31.circumference is a numerical variable, containing the trunk circumference (probably at “breast height”) in mm.And here’s the standard numerical summary for the full data set.
summary(orange) Tree       age         circumference  
 3:7   Min.   : 118.0   Min.   : 30.0  
 1:7   1st Qu.: 484.0   1st Qu.: 65.5  
 5:7   Median :1004.0   Median :115.0  
 2:7   Mean   : 922.1   Mean   :115.9  
 4:7   3rd Qu.:1372.0   3rd Qu.:161.5  
       Max.   :1582.0   Max.   :214.0  Next, we’ll look at the mean age and circumference, within each of the seven measurements per tree.
orange %>%
  group_by(Tree) %>%
  summarize(mean(age), mean(circumference))# A tibble: 5 x 3
  Tree  `mean(age)` `mean(circumference)`
  <ord>       <dbl>                 <dbl>
1 3            922.                  94  
2 1            922.                  99.6
3 5            922.                 111. 
4 2            922.                 135. 
5 4            922.                 139. It looks like each of the trees was measured at exactly the same time (age).
table(orange$age, orange$Tree)       
       3 1 5 2 4
  118  1 1 1 1 1
  484  1 1 1 1 1
  664  1 1 1 1 1
  1004 1 1 1 1 1
  1231 1 1 1 1 1
  1372 1 1 1 1 1
  1582 1 1 1 1 1Yes, each tree was measured at precisely the same five times.
Here’s another case that calls for the %$% pipe.
orange %$% cor(age, circumference)[1] 0.9135189Or, obtain the identical result with…
cor(orange$age, orange$circumference)[1] 0.9135189The Pearson correlation of age and circumference is 0.91 which is pretty strong, indicating that we’d expect to see a fairly positive and mostly linear association in a scatterplot. So, let’s see if that’s what we get.
ggplot(orange, aes(x = age, y = circumference)) +
  geom_point()OK. Let’s add a linear model to this plot, as well as some better labels, and we’ll change from mapping the points as observed to using geom_jitter to add a little horizontal (x-axis) jitter to the points, so that we don’t have so much overlap.
ggplot(orange, aes(x = age, y = circumference)) +
    geom_jitter(width = 20, height = 0) + 
    geom_smooth(method = "lm", col = "orange") +
    labs(title = "Growth of Orange Trees", 
         x = "Age (days since 1968-12-31)", 
         y = "Circumference in mm") +
    theme_bw()`geom_smooth()` using formula 'y ~ x'The linear model fitted here is summarized below:
model1 <- lm(circumference ~ age, data = orange)
summary(model1)
Call:
lm(formula = circumference ~ age, data = orange)
Residuals:
    Min      1Q  Median      3Q     Max 
-46.310 -14.946  -0.076  19.697  45.111 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.399650   8.622660   2.018   0.0518 .  
age          0.106770   0.008277  12.900 1.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.74 on 33 degrees of freedom
Multiple R-squared:  0.8345,    Adjusted R-squared:  0.8295 
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14So the linear regression model is: circumference = 17.4 + 0.107 age.
So our predicted circumference for a tree of age 1000 days is 124 mm.
As a third option, let’s fit separate smooth (loess) curves to each of the five individual trees, and plot each of them in different colors.
ggplot(orange, aes(x = age, y = circumference, col = Tree)) +
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Growth of Orange Trees, with Scatterplot Smooths", 
       x = "Age (days since 1968-12-31)", y = "Circumference in mm")`geom_smooth()` using formula 'y ~ x'Or we could facet the plots, showing multiple scatterplots, one for each Tree.
ggplot(orange, aes(x = age, y = circumference, col = Tree)) +
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ Tree) +
  guides(col = FALSE) +
  labs(title = "Growth of Orange Trees over Time, with Scatterplot Smooths", 
       x = "Days since 1968-12-31", y = "Circumference (in mm)")Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
"none")` instead.`geom_smooth()` using formula 'y ~ x'The easiest way to get data from another software package into R is to save the file (from within the other software package) in a form that R can read. What you want is to end up with an Excel file that looks like this…
An Excel sheet with a tidy data set
This tidy data set contains:
The variable names are in the first row, and the data are in the remaining rows (2-10 in this small example). Categorical variables are most easily indicated by letters (drug A or B, for instance) while continuous variables, like response, are indicated by numbers. Leave missing cells blank or use the symbol NA, rather than indicating them with, say, -99 or some other code.
Within Excel, this file can be saved as a .csv (comma-separated text file) or just as an Excel .XLS file, and then imported directly into R, via RStudio by clicking Import Dataset under the Workspace tab, then selecting From Text File. If you’ve saved the file in Excel as a .csv file, R Studio will generally make correct guesses about how to import the file. Once imported, you just need to save the workspace when you quit RStudio and you’ll avoid the need to re-import.