Please make sure to show all R code and output after each question so that we can see your work. You should run or enter in your own console all example code as well as the direct questions. Write a sentence for each numerical value produced describing its meaning in context with the proper units. Be sure to submit both the knit Word document and the Rmd file to get full credit on your lab.
In this lab, we’re going to be learning both Exploratory Data Analysis for single quantitative (numeric) variables, and also R Markdown. R Markdown allows us to create documents with embedded code and outputs.
The first thing you will want to do is open an R markdown document. We first need to install the package rmarkdown
into our RStudio. Recall that packages are sets of functions or datasets that we can load into R as needed. If you go to the Packages pane, you will see a list of packages installed on the R Server (or your own computer). Packages with a checkmark are the packages that have been loaded into R to use in that current session.
To install the rmarkdown
package, use the function install.packages()
and provide the package to be installed as the only argument in quotation marks.
install.packages("rmarkdown")
If rmarkdown
appears on your list of packages, you don’t have to run that line of code again.
To load the package, you can do one of the following:
- click the box next to the package name
- use the function library() and supply the package name
- use the function require() and supply the package name
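The practical difference between library() and require() shows up when a package is not installed. The sketch below uses a deliberately made-up package name (notARealPackage123, an assumption chosen so that the load fails) and the stats package, which ships with R.

```r
# library() stops with an error when the package cannot be found
result <- tryCatch(
  library("notARealPackage123", character.only = TRUE),
  error = function(e) "library() raised an error"
)
result

# require() returns FALSE (with a warning) instead of stopping
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
ok

# Both functions succeed for an installed package; require() then returns TRUE
loaded <- require("stats", character.only = TRUE)
loaded
```

Because library() fails loudly, it is usually the safer choice at the top of an Rmd file: a missing package is reported at once rather than several chunks later.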
Now open an R Markdown document (.Rmd). You can do so by either:
- Click File > New File > R Markdown
- Click the paper with the green plus icon in the left corner > R Markdown
Provide a name and author, then for now, keep the output type as html. You will notice that a new pane opened in your RStudio workspace that has your Rmd document. You can delete everything in it after the first code chunk. Go back and refresh your memory on the RStudio Orientation to Panes chapter we read in Lab 1, this time paying attention to the R Markdown pane as well.
Before we continue, check out this R Markdown Overview (10 min) (all chapters through “Output Formats”; remember, you will want to take notes as you move through the tutorials) as well as this introduction to R Markdown (30 min, all of Chapter 4).
You may find it useful to work through some more basic R commands using an R Markdown document before continuing in the lab. This is a good refresher, but is not required: Intro R using R Markdown (Chapter 5).
As you work through the lab in R Markdown, it might be useful to be able to write yourself notes as reminders or to return to a specific question before you end. The markdown language uses the wrappers <!---
and --->
to surround comments. Those comments won’t appear in the knitted document, but they can serve as visual reminders for you as you work through the lab.
Whenever you create a new R Markdown file, a “setup” code chunk will generate automatically (see below). This chunk allows you to specify formatting standards for the whole document. You will not see this chunk once you knit the document because we set the chunk option to include=FALSE
. We like to modify the base code with comment = ""
so that R code output does not have ##
in front of it when it is knitted.
The other option listed is echo=TRUE
. This tells R to include all code and output when the document is knitted. If echo=FALSE
, then all the code will be hidden and only the output will be displayed.
Each code chunk can modify these global options by including the argument after {r
, like our argument in this code chunk, include=FALSE
. We set the global options by passing the arguments to the function knitr::opts_chunk$set()
. Think of global options as the code chunk defaults; we can always override them for a particular code chunk, but it is the settings we will most often want.
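As a rough sketch of what the body of such a setup chunk might contain (the chunk header itself would read {r setup, include=FALSE}), the lines below set the two global options discussed above and then read one of them back. This assumes the knitr package is installed, which it is whenever rmarkdown is.

```r
# Global chunk defaults for the whole document
knitr::opts_chunk$set(
  echo = TRUE,   # show the code along with its output
  comment = ""   # drop the default "##" prefix on printed output
)

# Read an option back to confirm the current default
knitr::opts_chunk$get("comment")
```

Any single chunk can still override these defaults in its own header, for example with echo=FALSE.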
In order to load the data into an R Markdown file so that it knits correctly, you must specify the exact pathway to the data. Remember, when we knit/render an R Markdown file, every line of code that is needed to run your code chunks has to be included in the Rmd file – it doesn’t interact with your global environment. Essentially, every time you knit, it acts as if the environment is empty and only runs the code in your Rmd file. Thus, we have to direct the R Markdown file where to find our data. It can make life easier to set a working directory or create a project for each lab so that the file pathway is short (we discussed this in Lab #1).
We are going to use the data collected from prior semesters of STAT 250 students to investigate the amount of time per week students spend either exercising or watching TV.
Modify the code below to read in the StudentSurvey_Lab2.csv
data. Remember, you can test a code chunk by running the code in the chunk without knitting the whole document.
survey<-read.csv("Lab2_StudentSurvey.csv", header=TRUE)
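If a knit fails because the file cannot be found, it can help to check the path before reading. The sketch below is self-contained: it writes a tiny stand-in CSV (hypothetical data and file name, not the lab's file) to a temporary folder and reads it back with the same read.csv() call.

```r
# Write a small stand-in CSV to a temporary folder (hypothetical data)
csv_path <- file.path(tempdir(), "example_survey.csv")
write.csv(data.frame(age = c(20, 21, NA), tv = c(5, 0, 12)),
          csv_path, row.names = FALSE)

# Confirm the file exists at that path, then read it
file.exists(csv_path)
example <- read.csv(csv_path, header = TRUE)
example

# A file name without a folder is looked up relative to this directory
getwd()
```

When knitting, a file name without a folder is normally resolved relative to the folder that holds the Rmd file, which is why keeping the data next to the Rmd (or in a project) keeps the path short.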
We have now created a dataframe called survey.
Now we’ll want to investigate the data within our dataframe.
Recall that in properly organized data, each row is an individual and each column is a variable. Once the data is loaded, you will have to “call” the variables for use in each function.
First, we need to know the names of the variables in our dataframe. Modify the code below to print out the list of variable names.
names(survey)
[1] "semester" "gender" "age" "class" "exercise" "tv"
We can also get a quick overview of the data using summary()
. With the whole data set named in the function, you will get a summary for each variable in the data set. Modify the code below based on the description in the prior sentence.
summary(survey)
semester gender age class
Fall-2014 : 80 Female:347 Min. :17.00 Freshman : 9
Fall-2015 : 80 Male :166 1st Qu.:20.00 Junior :278
Fall-2016 :103 NA's : 1 Median :21.00 Senior :135
Spring-2015: 86 Mean :21.89 Sophomore: 86
Spring-2016: 93 3rd Qu.:23.00 NA's : 6
Spring-2018: 72 Max. :56.00
NA's :3
exercise tv
Min. : 0.000 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 2.000
Median : 4.000 Median : 5.000
Mean : 5.108 Mean : 6.641
3rd Qu.: 7.000 3rd Qu.:10.000
Max. :24.000 Max. :60.000
NA's :20 NA's :26
- Semester: Qualitative ordinal
- Gender: Qualitative nominal
- Age: Quantitative continuous
- Class: Qualitative ordinal
- Exercise: Quantitative discrete
- TV: Quantitative discrete
For age, exercise, and TV, R provides the five-number summary plus the mean (and a count of missing values). For semester, gender, and class, R provides the number of students in each category.
What type of variable should age be? It would help for us to inspect the raw data of just the age variable. Recall that to call a variable inside a dataframe, we can use $
to direct first to a dataframe object, then to the variable within it. Add code within the code chunk below to use the $
option to print out the age
variable to the screen.
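A minimal illustration of the $ operator, using a small made-up dataframe called toy (hypothetical, so that the example runs without the survey file):

```r
# A toy dataframe with two variables
toy <- data.frame(age = c(20L, 21L, 19L), tv = c(5, 0, 12))

toy$age        # the age column, returned as a vector
toy[["age"]]   # an equivalent way to pull out the same column
head(toy$age, 2)  # only the first two values, handy for long columns
```

For the lab data the same pattern applies with survey in place of toy.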
Age should be quantitative continuous.
summary(survey$gender)
Female Male NA's
347 166 1
It makes it quantitative discrete.
Let’s verify that R is reading the data the same way that we are. Each of the following codes can help us do this. Don’t run them yet – let’s discuss them first. Before we do, notice a new feature of R code below. In R, using a #
essentially tells R to ignore anything after the #
. You can use it to annotate your R code, such as below, by writing comments explaining what each line of code does. Note: #
has a different meaning in R chunks than in the general Markdown document. In R chunks, it precedes comments. In Markdown, it denotes headers.
class(survey$age) # print the class of age (integer, numeric, character, etc.)
[1] "integer"
is.integer(survey$age) # is age an integer?
[1] TRUE
as.integer(survey$age) # convert age to integer; this prints the result but does not save it
[1] 20 21 22 24 18 18 19 19 18 20 24 21 21 22 21 23 21 26 21 20 20 23 21
[24] 21 19 19 21 20 22 19 21 22 23 21 21 21 20 20 20 20 23 20 18 27 20 31
[47] 20 20 21 20 19 32 21 23 23 28 20 20 23 21 21 20 22 22 22 20 20 26 22
[70] 20 21 20 19 21 30 38 38 22 22 20 NA 23 23 23 22 20 33 23 20 22 21 20
[93] 21 19 21 35 26 19 21 20 19 20 19 26 24 23 19 20 22 22 33 20 20 22 22
[116] 19 18 21 23 21 21 19 23 29 29 25 32 20 26 21 28 18 22 23 20 19 18 20
[139] 20 19 22 19 19 24 20 22 21 21 19 19 21 26 23 25 21 20 24 20 18 22 22
[162] 20 21 26 19 19 56 32 21 21 27 20 22 20 22 22 22 22 22 22 31 36 20 44
[185] 20 22 26 22 19 21 21 20 23 21 19 23 20 20 22 19 21 30 20 20 20 23 21
[208] 23 27 21 20 22 21 21 28 22 19 30 30 28 23 22 19 22 32 20 19 21 20 19
[231] 23 20 19 21 27 20 19 19 21 23 25 19 NA 23 22 22 20 24 20 20 22 20 24
[254] 21 19 24 19 19 28 32 20 19 19 19 20 20 19 19 23 20 19 25 19 20 24 22
[277] 22 21 19 19 19 23 21 21 21 23 20 19 19 19 20 21 18 29 18 21 22 19 22
[300] 27 24 24 23 20 27 21 20 19 19 41 20 20 19 27 19 19 20 20 27 21 26 20
[323] 19 20 20 20 20 20 21 19 19 37 20 19 21 19 20 20 20 21 20 19 19 20 21
[346] 19 21 19 23 21 19 23 24 23 21 22 20 19 23 21 23 20 23 22 21 21 20 33
[369] 21 23 20 20 20 20 20 21 20 19 18 23 22 21 22 20 22 18 33 20 21 25 22
[392] 54 19 23 19 20 22 19 22 21 22 20 26 20 22 19 20 25 20 17 22 20 22 19
[415] 25 21 20 24 21 27 21 20 21 23 23 22 20 29 23 20 18 21 23 23 19 19 19
[438] 19 20 26 20 20 26 19 19 23 19 19 21 20 19 22 22 20 20 20 19 21 19 20
[461] 19 20 19 19 19 20 20 21 20 19 21 23 20 21 19 20 NA 20 20 21 21 20 20
[484] 20 20 19 24 28 23 23 33 21 22 20 23 26 21 36 22 21 23 19 21 34 25 19
[507] 18 22 22 20 20 20 19 20
The first two functions essentially serve to ask the same question: either “what class of data is this variable” or “is the variable an integer class.” If class()
returns “integer” then is.integer()
will return TRUE; if class()
returns anything else, then is.integer()
will return FALSE. Either will tell us whether our variable is correctly stored as an integer. If not, we can run as.integer()
and assign the result back to the variable so that R treats it as an integer. Run one of the first two codes and see how R has classed age.
If necessary, then run the last line of code.
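Note that as.integer() on its own only prints a converted copy; the stored variable changes only when the result is assigned back. A small sketch with a made-up dataframe (toy is hypothetical):

```r
# Values typed without an L suffix are stored as doubles, so the class is "numeric"
toy <- data.frame(age = c(20, 21, 19))
class(toy$age)

as.integer(toy$age)   # prints a converted copy; toy itself is unchanged
class(toy$age)        # still "numeric"

toy$age <- as.integer(toy$age)   # assign the result back to save the change
class(toy$age)                   # now "integer"
```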
As scientists, we need to be able to identify the research question, population, sample, and variables of interest not only for our own questions, but for other studies as well. The following is an abstract of a recent publication. Read it carefully (perhaps check the Strategies for Reading Comprehension document on iLearn for helpful hints!) and then answer the questions about it.
Abstract Basal metabolic rate (BMR) is posited to be a fundamental control on the structure and dynamics of ecological networks, influencing organism resource use and rates of senescence. Differences in the maintenance energy requirements of individual species therefore potentially predict extinction likelihood. If validated, this would comprise an important link between organismic ecology and macroevolutionary dynamics. To test this hypothesis, the BMRs of organisms within fossil species were determined using body size and temperature data, and considered in the light of species’ survival and extinction through time. Our analysis focused on the high-resolution record of Pliocene to recent molluscs (bivalves and gastropods) from the Western Atlantic. Species-specific BMRs were calculated by measuring the size range of specimens from museum collections, determining ocean temperature using the HadCM3 global climate model, and deriving values based on relevant equations. Intriguingly, a statistically significant difference in metabolic rate exists between those bivalve and gastropod taxa that went extinct and those that survived throughout the course of the Neogene. This indicates that there is a scaling up from organismic properties to species survival for these communities. Metabolic rate could therefore represent an important metric for predicting future extinction patterns, with changes in global climate potentially affecting the lifespan of individuals, ultimately leading to the extinction of the species they are contained within. We also find that, at the assemblage level, there are no significant differences in metabolic rates for different time intervals throughout the entire study period. This may suggest that Neogene mollusc communities have remained energetically stable, despite many extinctions.
Indicate the correct answer by bolding your chosen response(s).
Now let’s build some descriptive statistics and graphics with the survey
data.
A summary statistic gives a summary of each variable, whereas raw data shows all of the data for each individual in the variables.
We can use R to calculate summary statistics for single quantitative variables, such as the mean and standard deviation. Modify the code below so it produces the mean and standard deviation for both the number of hours per week spent exercising and spent watching tv.
mean(survey$exercise, na.rm= TRUE) # calculates the mean for the variable exercise
[1] 5.107794
sd(survey$exercise, na.rm= TRUE) # calculates the standard deviation for exercise
[1] 4.107598
mean(survey$tv, na.rm= TRUE) # calculate the mean for tv
[1] 6.641393
sd(survey$tv, na.rm= TRUE) # calculate the standard deviation for tv
[1] 7.001062
We often also want to know the number of individuals in our sample (our sample size). If individuals are rows in a column variable, then how long that column is would be the number of individuals. Thus, the code for sample size is length()
and we supply the specific variable we want to know the sample size for. Note: if you supply just the dataframe to length()
, it will give you the length of the dataframe, or the number of variables in the dataframe, not the sample size of each one; so be careful when you apply this function.
Modify the code below to find the sample size for both our exercise and tv variables
length(survey$exercise) # find the sample size of the variable exercise
[1] 514
length(survey$tv) # find the sample size of the variable tv
[1] 514
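One caveat worth keeping in mind: length() counts every row, including missing values. The sketch below, on a made-up dataframe (toy is hypothetical), contrasts the different counts and shows one common way to count only the non-missing responses.

```r
toy <- data.frame(exercise = c(2, NA, 7, 4), tv = c(NA, NA, 10, 3))

length(toy)                 # 2: the number of variables (columns)
nrow(toy)                   # 4: the number of individuals (rows)
length(toy$exercise)        # 4: rows in the column, NA included
sum(!is.na(toy$exercise))   # 3: students who actually reported a value
```

In the survey data, summary() reported 20 missing values for exercise and 26 for tv, so of the 514 rows only 514 - 20 = 494 and 514 - 26 = 488 responses, respectively, feed into the means and standard deviations computed with na.rm = TRUE.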
We ran code before to produce the mean and standard deviation alone – it just prints to the screen. I often will want to save the output to an object
name so I can use it within the R Markdown document. When you save output to an object
name it will not print out the value unless you tell R to provide the value.
mean(survey$exercise, na.rm = TRUE) # prints mean but does not save it
[1] 5.107794
mEx<-mean(survey$exercise, na.rm = TRUE) # calculates mean and saves it as mEx
mEx # prints to screen the value saved to mEx
[1] 5.107794
sdEx<-sd(survey$exercise, na.rm = TRUE) # calculates standard deviations and saves it as sdEx
sdEx # prints to screen the value saved to sdEx
[1] 4.107598
Example: The mean amount of time per week spent exercising by all prior STAT 250 students is 5.1077935 hrs.
The mean amount of time per week spent exercising by all prior STAT 250 students is 5.1077935 hrs. The average spread about the mean is 4.107598 hrs.
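Inline values are usually easier to read when rounded. As a sketch, the code below rounds the mean printed earlier (copied into a stand-in object, mEx_example, so the example runs on its own) and builds the sentence with paste(). In the Rmd text itself the same value would be inserted by wrapping the expression in backticks with a leading r, for example `r round(mEx, 2)`.

```r
mEx_example <- 5.107794    # the value printed for mEx above

round(mEx_example, 2)      # 5.11

# The sentence as it would read once the inline value is filled in
sentence <- paste("The mean time spent exercising is", round(mEx_example, 2), "hrs.")
sentence
```

If the backticks or the leading r are left out, the object name itself appears in the knitted text instead of its value.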
Below, insert a code chunk and calculate the same set of descriptive statistics about the variable tv
. You will need to save the values as objects to use it later inline with the text.
mean(survey$tv, na.rm = TRUE) # prints mean but does not save it
[1] 6.641393
mTv<-mean(survey$tv, na.rm = TRUE) # calculates mean and saves it as mTv
mTv # prints to screen the value saved to mTv
[1] 6.641393
sdTv<-sd(survey$tv, na.rm = TRUE) # calculates standard deviations and saves it as sdTv
sdTv # prints to screen the value saved to sdTv
[1] 7.001062
The mean amount of time per week spent watching TV by all prior STAT 250 students is 6.6413934 hrs. The average spread about the mean is 7.001062 hrs.
Using a stored object name is useful if you are going to use the value often (to cut down on typing), but if you only need to use the value a small number of times, you can also write the code directly inline. For example, the maximum number of hours of tv watched in a week in our dataset of prior STAT 250 students is 60 (computed with na.rm = TRUE; without that argument, max() returns NA because tv contains missing values).
The maximum amount of hours spent watching TV by all prior STAT 250 students is 60.
We often write descriptive statistics using mathematical notation; we can do this in R Markdown too. Mathematical equations are surrounded by $
. We can call greek letters by their full names preceded by a backslash (e.g. \mu
, \sigma
), use special notation as well by calling the name preceded by a backslash and then applying it to a numeric value (e.g. \bar{x}
), or use roman characters directly (e.g. s
). An equation would then look like, for example: \(\bar{x} = 3.9\).
exercise
in its last code chunk? Why? The mean of exercise would be written as
\bar{x}
because it is a sample mean. The standard deviation would be written as
s
.
exercise
referenced above, using mathematical notation. \(\bar{x} = 5.107794\) hrs, \(s = 4.107598\) hrs
There are two main types of descriptive graphics we can use for numeric data: histograms and boxplots. We’ll learn how to make both in R and customize them using arguments.
Below is the basic code for a histogram of exercise
. The function hist() tells R to make a histogram. The first argument provided is always the data that you want graphed in the histogram. I have added two additional arguments to the code. Try running it several times and see what changes as you change the text within the quotation marks. Notice that this code is spaced out differently than we’ve seen before. This is just to make commenting each argument easier: R will keep reading the function until it reaches the closing )
.
Comment on each line of code to indicate what it does and modify the code to add labels to the graph axes.
hist(survey$exercise, # creates a histogram using the exercise data from the survey
xlab = "Number of hours spent exercising", # labels the x-axis
ylab = "Number of students exercising for x number of hours") # labels the y-axis
We would prefer it if our axes extended all the way to our maximum values. We can calculate the minimum and maximum values seen in our dataset using the min()
and max()
functions. We can then see the largest bin value we would want our axis to extend to. Modify the code below to calculate the minimum and maximum values. Then modify the hist()
function to insert arguments labeling the axes and also set the minimum and maximum limits of the x axis. Annotate your code.
min(survey$exercise, na.rm = TRUE) # calculates the minimum value in exercise
[1] 0
max(survey$exercise, na.rm = TRUE) # calculates the maximum value in exercise
[1] 24
hist(survey$exercise, xlim=c(0,24)) # creates a histogram with the x-axis running from the minimum (0) to the maximum (24)
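Rather than typing the limits by hand, one option is to compute them with range() and pass the result straight to xlim. This is only a sketch on made-up values (hours is hypothetical), with axis labels included.

```r
hours <- c(0, 2, 4, 7, 24, NA)   # hypothetical weekly exercise hours

# range() returns the minimum and maximum together as a length-2 vector
lims <- range(hours, na.rm = TRUE)
lims

hist(hours,
     xlim = lims,                             # x-axis spans min to max
     xlab = "Number of hours spent exercising",
     ylab = "Number of students",
     main = "Weekly exercise (example data)")
```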
min()
and max()
functions for them to run? Why? Yes, the variable had NA values, so I had to add na.rm = TRUE to ensure that those were excluded.
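The effect of a single missing value is easy to see on a short made-up vector (hours is hypothetical):

```r
hours <- c(3, NA, 8)

anyNA(hours)                # TRUE: at least one value is missing
max(hours)                  # NA: the missing value makes the result unknown
max(hours, na.rm = TRUE)    # 8: NA is dropped before taking the maximum
mean(hours, na.rm = TRUE)   # 5.5: the mean of 3 and 8
```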
It is unimodal right-skewed.
Most STAT 250 students spend less than 10 hours a week exercising.
Below is a basic code for a histogram. Modify the code to plot a histogram for the variable tv
, being sure to label your axes appropriately.
hist(survey$tv, # creates a histogram using the tv data from the survey
xlab = "Number of hours spent watching TV", # labels the x-axis
ylab = "Number of students watching x number of hours of TV") # labels the y-axis
Now let’s create a boxplot to look at the distribution of exercise
. There are two ways to print the boxplot. Add labels to the x and y axes appropriate for each style.
boxplot(survey$exercise)
boxplot(survey$exercise, horizontal = TRUE)
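Which axis carries the data depends on the orientation, so the label moves with it. A sketch on made-up values (hours is hypothetical):

```r
hours <- c(0, 2, 4, 5, 7, 24)   # hypothetical weekly exercise hours

# Vertical boxplot: the data values run along the y-axis
boxplot(hours, ylab = "Number of hours spent exercising")

# Horizontal boxplot: the data values run along the x-axis
boxplot(hours, horizontal = TRUE, xlab = "Number of hours spent exercising")
```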
horizontal = TRUE
to the code do to the boxplot? It rotated the box plot from vertical to horizontal.
Insert a code chunk to create a boxplot for tv
.
boxplot(survey$tv, horizontal = TRUE)
tv
and exercise
. The distributions for exercise and tv are both right-skewed.
We can calculate the 5-number summary for both variables by supplying the variable name to the function summary()
. Insert a code chunk below to do so.
summary(survey$exercise)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 4.000 5.108 7.000 24.000 20
summary(survey$tv)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 5.000 6.641 10.000 60.000 26
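If only the five numbers are wanted, without the mean and the NA count, quantile() returns them directly, and IQR() gives the spread of the middle half of the data. A sketch on a made-up vector (hours is hypothetical):

```r
hours <- c(1, 2, 3, 4, 5, NA)

# Minimum, first quartile, median, third quartile, maximum
quantile(hours, na.rm = TRUE)

# Interquartile range: third quartile minus first quartile
IQR(hours, na.rm = TRUE)
```

Applied to the summaries above, the interquartile range is 7 - 2 = 5 hrs for exercise and 10 - 2 = 8 hrs for tv.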
Sam walks along the same two mile stretch of beach each morning walking his dog. While he walks, he picks up any trash he encounters. For the last ninety days, he has been keeping track of the number of pieces of trash he picks up each day. Load in the dataset trash.csv
.
The population is all of the days in a year. The sample is the past 90 days. Each individual is one day, and the one variable recorded is the number of pieces of trash picked up that day.
It was an observational study, as Sam did nothing to manipulate the variables. He did not make the trash appear; he only walked along the beach to observe it. While he did pick the trash up after his observation, this did not affect the actual study, and in fact may have made it easier to ensure that he wasn’t counting the same piece of trash multiple days in a row.
trash
? A quantitative discrete variable, as it counts whole pieces of trash picked up each day.
trash<-read.csv("Lab 2_trash.csv", header=TRUE)
summary(trash)
items
Min. : 5.00
1st Qu.:12.00
Median :15.00
Mean :15.26
3rd Qu.:18.00
Max. :34.00
Provide mathematical (e.g. \(equation\)) and verbal interpretation of each value calculated.
hist(trash$items, # creates a histogram using the items data from trash
xlab = "Number of pieces of trash picked up", # labels the x-axis
ylab = "Number of days x pieces of trash were picked up") # labels the y-axis
The distribution of the amount of trash that Sam picked up on the beach is unimodal skewed-right.
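The right skew can be cross-checked against the summary values printed above. The sketch below applies the common 1.5 × IQR rule for flagging outliers; the rule is an assumption here (the lab does not prescribe it), and the numbers are copied by hand from the summary(trash) output.

```r
# Values copied from the summary(trash) output above
q1 <- 12
q3 <- 18
smallest <- 5
largest <- 34

iqr <- q3 - q1                   # 6 pieces of trash
lower_fence <- q1 - 1.5 * iqr    # 3
upper_fence <- q3 + 1.5 * iqr    # 27

smallest < lower_fence   # FALSE: no unusually low days
largest > upper_fence    # TRUE: at least one unusually high day
```

A high value beyond the upper fence with nothing beyond the lower fence is consistent with a long right tail.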
You can run specific code chunks to check their output by pressing the green play button to run the specific code chunk, or the icon next to it to run all code chunks prior to that code chunk. If you want to check your R Markdown editing, you can knit the document and view it in the ‘Viewer’ Pane by pressing the ‘Knit’ button at the top of the R Markdown pane. Make sure the setting (gear icon above) is set to ‘Preview in Viewer pane’. When you are ready to knit your document, change the output type in the YAML header from html_document
to word_document
and press the ‘Knit’ button. Your Word document will be created in the project folder!
For quick reference of R Markdown syntax, this website has the basics.