Lab #2: Exploratory Data Analysis

Name: Sarah Munro-Kennedy

Skills

Identify the population, sample, and variables of interest in a study
Identify variable types
Use R to generate and interpret descriptive statistics and graphics for numeric variables
Edit graphics in R using arguments
Learn and apply the Markdown language for text editing
Learn and use R Markdown documents for R code and output generation

Please make sure to show all R code and output after each question so that we can see your work. You should run or enter in your own console all example code as well as the direct questions. Write a sentence for each numerical value produced describing its meaning in context with the proper units. Be sure to submit both the knit Word document and the Rmd file to get full credit on your lab.

Introduction

In this lab, we’re going to be learning both Exploratory Data Analysis for single quantitative (numeric) variables, and also R Markdown. R Markdown allows us to create documents with embedded code and outputs.

Part 1: Introduction to R Markdown

The first thing you will want to do is open an R markdown document. We first need to install the package rmarkdown into our RStudio. Recall that packages are sets of functions or datasets that we can load into R as needed. If you go to the Packages pane, you will see a list of packages installed on the R Server (or your own computer). Packages with a checkmark are the packages that have been loaded into R to use in that current session.

To install the rmarkdown package, use the function install.packages() and provide the package to be installed as the only argument in quotation marks.

install.packages("rmarkdown")

If rmarkdown appears on your list of packages, you don’t have to run that line of code again.

To load the package, you can do one of the following:
- click the box next to the package name
- use the function library() and supply the package name
- use the function require() and supply the package name
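The two function-based options differ mainly in how they react to a missing package. The sketch below uses the built-in stats package as a stand-in (an assumption made so the code runs even where rmarkdown is not installed); the object name loaded is hypothetical.

```r
# "stats" ships with every R installation; it stands in for rmarkdown here
library(stats)             # stops with an error if the package is not installed
loaded <- require(stats)   # returns TRUE or FALSE instead of stopping
loaded
```

In a lab document, library() is usually the safer choice because a missing package halts the knit immediately instead of failing later.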

Now open an R Markdown document (.Rmd). You can do so by either:
- Click File > New File > R Markdown
- Click the paper with the green plus icon in the left corner > R Markdown

Provide a name and author, then for now, keep the output type as html. You will notice that a new pane opened in your RStudio workspace that has your Rmd document. You can delete everything in it after the first code chunk. Go back and refresh your memory on the RStudio Orientation to Panes chapter we read in Lab 1, this time paying attention to the R Markdown pane as well.

Before we continue, check out this R Markdown Overview (10 min) (all chapters through “Output Formats”; remember, you will want to take notes as you move through the tutorials) as well as this introduction to R Markdown (30 min, all of Chapter 4).

You may find it useful to work through some more basic R commands using an R Markdown document before continuing in the lab. This is a good refresher, but is not required: Intro R using R Markdown (Chapter 5).

As you work through the lab in R Markdown, it might be useful to be able to write yourself notes as reminders or to return to a specific question before you end. The Markdown language uses the wrappers <!-- and --> to surround comments. Those comments won’t appear in the knitted document, but can also serve as visual reminders for you as you work through the lab. For example, <!-- come back to this question -->.

Whenever you create a new R Markdown file, a “setup” code chunk will generate automatically (see below). This chunk allows you to specify formatting standards for the whole document. You will not see this chunk once you knit the document because we set the chunk option to include=FALSE. We like to modify the base code with comment = "" so that R code output does not have ## in front of it when it is knitted.

The other option listed is echo=TRUE. This tells R to include all code and output when the document is knitted. If echo=FALSE, then all the code will be hidden and only the output will be displayed.

Each code chunk can modify these global options by including the argument after {r, like our argument in this code chunk, include=FALSE. We set the global options by passing the arguments to the function knitr::opts_chunk$set(). Think of global options as the code chunk defaults; we can always override them for a particular code chunk, but it is the settings we will most often want.
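Because the setup chunk is hidden in the knitted output, here is a minimal sketch of what its body might contain. It assumes the default RStudio template with comment = "" added as described above; in the .Rmd file this line would sit inside a chunk whose header reads {r setup, include=FALSE}.

```r
# Assumed body of the setup chunk (RStudio default plus comment = "")
# echo = TRUE   : show the code as well as its output in the knitted document
# comment = ""  : drop the "##" prefix from printed output
knitr::opts_chunk$set(echo = TRUE, comment = "")
```

A single chunk can still override these defaults, for instance by putting echo=FALSE in its own header.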

1.1: Loading data

In order to load the data into an R Markdown file so that it knits correctly, you must specify the exact pathway to the data. Remember, when we knit/render an R Markdown file, every line of code that is needed to run your code chunks has to be included in the Rmd file – it doesn’t interact with your global environment. Essentially, every time you knit, it acts as if the environment is empty and only runs the code in your Rmd file. Thus, we have to direct the R Markdown file where to find our data. It can make life easier to set a working directory or create a project for each lab so that the file pathway is short (we discussed this in Lab #1).

We are going to use the data collected from prior semesters of STAT 250 students to investigate the amount of time per week students spend either exercising or watching TV.

Modify the code below to read in the StudentSurvey_Lab2.csv data. Remember, you can test a code chunk by running the code in the chunk without knitting the whole document.

survey<-read.csv("Lab2_StudentSurvey.csv", header=TRUE)

We have now created a dataframe called survey. Now we’ll want to investigate the data within our dataframe.
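When a knit fails at the read.csv() step, the usual cause is a path that does not resolve from the folder holding the Rmd file. The sketch below shows one way to check a path before reading it. The file name toy_survey.csv and its contents are hypothetical, written to a temporary folder only so the example runs anywhere.

```r
# Hypothetical stand-in file, written to a temporary folder for illustration
csv_path <- file.path(tempdir(), "toy_survey.csv")
write.csv(data.frame(age = c(20L, 21L, NA), tv = c(5, 0, 12)),
          csv_path, row.names = FALSE)

getwd()                                    # folder that relative paths start from
file.exists(csv_path)                      # TRUE when the path is right
toy <- read.csv(csv_path, header = TRUE)   # same call shape as in the lab
str(toy)                                   # quick look at what was read in
```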

1.2: Using Data in R

Recall that in properly organized data, each row is an individual and each column is a variable. Once the data is loaded, you will have to “call” the variables for use in each function.

First, we need to know the names of the variables in our dataframe. Modify the code below to print out the list of variable names.

names(survey)

[1] "semester" "gender"   "age"      "class"    "exercise" "tv"

We can also get a quick overview of the data using summary(). With the whole data set named in the function, you will get a summary for each variable in the data set. Modify the code below based on the description in the prior sentence.

summary(survey)

        semester      gender         age              class    
 Fall-2014  : 80   Female:347   Min.   :17.00   Freshman :  9  
 Fall-2015  : 80   Male  :166   1st Qu.:20.00   Junior   :278  
 Fall-2016  :103   NA's  :  1   Median :21.00   Senior   :135  
 Spring-2015: 86                Mean   :21.89   Sophomore: 86  
 Spring-2016: 93                3rd Qu.:23.00   NA's     :  6  
 Spring-2018: 72                Max.   :56.00                  
                                NA's   :3                      
    exercise            tv        
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 2.000   1st Qu.: 2.000  
 Median : 4.000   Median : 5.000  
 Mean   : 5.108   Mean   : 6.641  
 3rd Qu.: 7.000   3rd Qu.:10.000  
 Max.   :24.000   Max.   :60.000  
 NA's   :20       NA's   :26

Identify the variable type (e.g. numeric discrete) for each of the variables in our dataset.

- Semester: Qualitative ordinal
- Gender: Qualitative nominal
- Age: Quantitative continuous
- Class: Qualitative ordinal
- Exercise: Quantitative discrete
- TV: Quantitative discrete

What type of summary does R provide for each of the variable types?

For age, exercise, and TV, R provides the five-number summary plus the mean, along with a count of missing values (NA's). For semester, gender, and class, R provides the number of students in each category.

What type of variable should age be? It would help for us to inspect the raw data of just the age variable. Recall that to call a variable inside a dataframe, we can use $ to direct first to a dataframe object, then to the variable within it. Add code within the code chunk below to use the $ option to print out the age variable to the screen.

Age should be quantitative continuous.

summary(survey$gender)

Female   Male   NA's 
   347    166      1
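The chunk above summarizes gender, whereas the prompt asks for the age values themselves. A minimal sketch of printing a single column with $, using a small hypothetical data frame named toy (not the survey data):

```r
# Hypothetical four-row data frame, for illustration only
toy <- data.frame(age = c(20L, 21L, NA, 19L),
                  gender = c("Female", "Male", "Female", NA))

toy$age            # prints every raw value stored in the age column
head(toy$age, 3)   # prints only the first three values
```

Applied to the lab data, the equivalent call would be survey$age.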

You’ll notice that age was recorded in whole years only. What type of variable does that make it?

It makes it quantitative discrete.

Let’s verify that R is reading the data the same way that we are. Each of the following lines of code can help us do this. Don’t run them yet – let’s discuss them first. Before we do, notice a new feature of R code below. In R, using a # essentially tells R to ignore anything after the #. You can use it to annotate your R code, such as below, by writing comments explaining what each line of code does. Note: # has a different meaning in R chunks than in the general Markdown document. In R chunks, it precedes comments. In Markdown, it denotes headers.

class(survey$age)          # print the class of age (integer, character, etc.)

[1] "integer"

is.integer(survey$age)     # is age an integer?

[1] TRUE

as.integer(survey$age)     # convert age to integer and print it (survey itself is not changed)

  [1] 20 21 22 24 18 18 19 19 18 20 24 21 21 22 21 23 21 26 21 20 20 23 21
 [24] 21 19 19 21 20 22 19 21 22 23 21 21 21 20 20 20 20 23 20 18 27 20 31
 [47] 20 20 21 20 19 32 21 23 23 28 20 20 23 21 21 20 22 22 22 20 20 26 22
 [70] 20 21 20 19 21 30 38 38 22 22 20 NA 23 23 23 22 20 33 23 20 22 21 20
 [93] 21 19 21 35 26 19 21 20 19 20 19 26 24 23 19 20 22 22 33 20 20 22 22
[116] 19 18 21 23 21 21 19 23 29 29 25 32 20 26 21 28 18 22 23 20 19 18 20
[139] 20 19 22 19 19 24 20 22 21 21 19 19 21 26 23 25 21 20 24 20 18 22 22
[162] 20 21 26 19 19 56 32 21 21 27 20 22 20 22 22 22 22 22 22 31 36 20 44
[185] 20 22 26 22 19 21 21 20 23 21 19 23 20 20 22 19 21 30 20 20 20 23 21
[208] 23 27 21 20 22 21 21 28 22 19 30 30 28 23 22 19 22 32 20 19 21 20 19
[231] 23 20 19 21 27 20 19 19 21 23 25 19 NA 23 22 22 20 24 20 20 22 20 24
[254] 21 19 24 19 19 28 32 20 19 19 19 20 20 19 19 23 20 19 25 19 20 24 22
[277] 22 21 19 19 19 23 21 21 21 23 20 19 19 19 20 21 18 29 18 21 22 19 22
[300] 27 24 24 23 20 27 21 20 19 19 41 20 20 19 27 19 19 20 20 27 21 26 20
[323] 19 20 20 20 20 20 21 19 19 37 20 19 21 19 20 20 20 21 20 19 19 20 21
[346] 19 21 19 23 21 19 23 24 23 21 22 20 19 23 21 23 20 23 22 21 21 20 33
[369] 21 23 20 20 20 20 20 21 20 19 18 23 22 21 22 20 22 18 33 20 21 25 22
[392] 54 19 23 19 20 22 19 22 21 22 20 26 20 22 19 20 25 20 17 22 20 22 19
[415] 25 21 20 24 21 27 21 20 21 23 23 22 20 29 23 20 18 21 23 23 19 19 19
[438] 19 20 26 20 20 26 19 19 23 19 19 21 20 19 22 22 20 20 20 19 21 19 20
[461] 19 20 19 19 19 20 20 21 20 19 21 23 20 21 19 20 NA 20 20 21 21 20 20
[484] 20 20 19 24 28 23 23 33 21 22 20 23 26 21 36 22 21 23 19 21 34 25 19
[507] 18 22 22 20 20 20 19 20

The first two functions essentially ask the same question: either “what class of data is this variable” or “is the variable an integer class.” If class() returns “integer”, then is.integer() will return TRUE; if class() returns anything else, then is.integer() will return FALSE. Either will tell us whether our variable is correctly stored as an integer. If not, we can run as.integer() and assign the result back to the variable to tell R to treat it as an integer. Run one of the first two lines of code and see how R has classed age. If necessary, then run the last line of code.
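One detail worth seeing in action: as.integer() on its own only prints the converted values, so the column keeps its old class unless the result is assigned back. The sketch below uses a hypothetical data frame named toy whose age column starts out as a double.

```r
# Hypothetical ages stored as doubles (class "numeric"), not the survey data
toy <- data.frame(age = c(20, 21, NA, 19))

class(toy$age)                   # "numeric"
is.integer(toy$age)              # FALSE
as.integer(toy$age)              # prints integers, but toy is unchanged
class(toy$age)                   # still "numeric"

toy$age <- as.integer(toy$age)   # assigning the result makes the change stick
class(toy$age)                   # "integer"
```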

Part 2: Data Collection

As scientists, we need to be able to identify the research question, population, sample, and variables of interest not only for our own questions, but for other studies as well. The following is an abstract of a recent publication. Read it carefully (perhaps check the Strategies for Reading Comprehension document on iLearn for helpful hints!) and then answer the questions about it.

Abstract Basal metabolic rate (BMR) is posited to be a fundamental control on the structure and dynamics of ecological networks, influencing organism resource use and rates of senescence. Differences in the maintenance energy requirements of individual species therefore potentially predict extinction likelihood. If validated, this would comprise an important link between organismic ecology and macroevolutionary dynamics. To test this hypothesis, the BMRs of organisms within fossil species were determined using body size and temperature data, and considered in the light of species’ survival and extinction through time. Our analysis focused on the high-resolution record of Pliocene to recent molluscs (bivalves and gastropods) from the Western Atlantic. Species-specific BMRs were calculated by measuring the size range of specimens from museum collections, determining ocean temperature using the HadCM3 global climate model, and deriving values based on relevant equations. Intriguingly, a statistically significant difference in metabolic rate exists between those bivalve and gastropod taxa that went extinct and those that survived throughout the course of the Neogene. This indicates that there is a scaling up from organismic properties to species survival for these communities. Metabolic rate could therefore represent an important metric for predicting future extinction patterns, with changes in global climate potentially affecting the lifespan of individuals, ultimately leading to the extinction of the species they are contained within. We also find that, at the assemblage level, there are no significant differences in metabolic rates for different time intervals throughout the entire study period. This may suggest that Neogene mollusc communities have remained energetically stable, despite many extinctions.

Indicate the correct answer by bolding your chosen response(s).

What is the research question in this study?

a. What is the metabolic energy rate of molluscs?
**b. Can differences in metabolic requirements predict extinction likelihood?**
c. Does basal metabolic rate influence resource use?

What is the population about which the researchers want to make inferences?

**a. All organisms in the fossil record**
b. All bivalves and gastropods
c. Pliocene to recent bivalves and gastropods
d. Pliocene to recent molluscs from the Western Atlantic

What is the sample that the researchers collected to make inferences about the population?

a. Bivalves and gastropods
b. Pliocene to recent bivalves and gastropods
**c. Pliocene to recent molluscs from museum collections**
d. Organisms in the fossil record

What is/are the variable(s) that the researchers directly measured for each individual? (choose all that apply)

a. Basal metabolic rate (BMR)
b. Ocean Temperature
**c. Specimen Size**

What type of study was this?

**a. Observational**
b. Experimental

Part 3: Exploratory Data Analysis

Now let’s build some descriptive statistics and graphics with the survey data.

3.1: Summary Statistics

What is a summary statistic? How does it differ from raw data?

A summary statistic condenses a variable into a single number, such as a mean, whereas raw data lists the value recorded for every individual.

We can use R to calculate summary statistics for single quantitative variables, such as the mean and standard deviation. Modify the code below so it produces the mean and standard deviation for both the number of hours per week spent exercising and spent watching tv.

mean(survey$exercise, na.rm= TRUE) # calculates the mean for the variable exercise

[1] 5.107794

sd(survey$exercise, na.rm= TRUE) # calculates the standard deviation for exercise

[1] 4.107598

mean(survey$tv, na.rm= TRUE) # calculate the mean for tv

[1] 6.641393

sd(survey$tv, na.rm= TRUE) # calculate the standard deviation for tv

[1] 7.001062

We often also want to know the number of individuals in our sample (our sample size). If individuals are rows in a column variable, then how long that column is would be the number of individuals. Thus, the code for sample size is length() and we supply the specific variable we want to know the sample size for. Note: if you supply just the dataframe to length(), it will give you the length of the dataframe, or the number of variables in the dataframe, not the sample size of each one; so be careful when you apply this function.

Modify the code below to find the sample size for both our exercise and tv variables

length(survey$exercise)  # find the sample size of the variable exercise

[1] 514

length(survey$tv) # find the sample size of the variable tv

[1] 514
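Note that length() counts every row, including missing values. The summary() output earlier reports 20 NA's for exercise and 26 for tv, so the numbers of students who actually answered would be 514 - 20 = 494 and 514 - 26 = 488. A minimal sketch of the difference, using a hypothetical data frame named toy:

```r
# Hypothetical six-row data frame with some missing answers
toy <- data.frame(exercise = c(2, 4, NA, 7, 0, NA),
                  tv       = c(5, NA, 3, 10, 1, 2))

length(toy$exercise)        # 6: every row, missing or not
sum(!is.na(toy$exercise))   # 4: rows with a recorded value
length(toy)                 # 2: number of columns, not individuals
nrow(toy)                   # 6: number of rows in the data frame
```

Counting non-missing values with sum(!is.na(...)) gives the sample size that mean() and sd() actually use when na.rm = TRUE.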

We ran code before to produce the mean and standard deviation alone – it just prints to the screen. I often will want to save the output to an object name so I can use it within the R Markdown document. When you save output to an object name it will not print out the value unless you tell R to provide the value.

mean(survey$exercise, na.rm = TRUE) # prints mean but does not save it

[1] 5.107794

mEx<-mean(survey$exercise, na.rm = TRUE) # calculates mean and saves it as mEx
mEx # prints to screen the value saved to mEx

[1] 5.107794

sdEx<-sd(survey$exercise, na.rm = TRUE) # calculates the standard deviation and saves it as sdEx
sdEx # prints to screen the value saved to sdEx

[1] 4.107598

Use the named objects for mean and standard deviation and write out a description of each value with the output inline of the text.

Example: The mean amount of time per week spent exercising by all prior STAT 250 students is 5.1077935 hrs.

The mean amount of time per week spent exercising by all prior STAT 250 students is 5.1077935 hrs. The standard deviation, a measure of the typical distance from the mean, is 4.107598 hrs.
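Inline code is written in the text as a backtick, the letter r, a space, the expression, and a closing backtick, for example `r round(mEx, 2)`. An object name typed without that wrapper is knitted as plain text. The sketch below reproduces the same idea in ordinary R so the rounding can be checked in the console; the two values are copied from the output above, and the name sentence is hypothetical.

```r
# Values copied from the printed output above
mEx  <- 5.107794
sdEx <- 4.107598

round(mEx, 2)   # rounds the mean to two decimal places

# Build the sentence that inline code would produce in the knitted text
sentence <- sprintf(
  "The mean is %.2f hrs and the standard deviation is %.2f hrs.",
  mEx, sdEx
)
sentence
```

Rounding inside the inline expression keeps the knitted sentence readable without changing the stored value.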

Below, insert a code chunk and calculate the same set of descriptive statistics about the variable tv. You will need to save the values as objects to use it later inline with the text.

mean(survey$tv, na.rm = TRUE) # prints mean but does not save it

[1] 6.641393

mTv<-mean(survey$tv, na.rm = TRUE) # calculates mean and saves it as mTv
mTv # prints to screen the value saved to mTv

[1] 6.641393

sdTv<-sd(survey$tv, na.rm = TRUE) # calculates the standard deviation and saves it as sdTv
sdTv # prints to screen the value saved to sdTv

[1] 7.001062

Use the named objects for mean and standard deviation and write out a description of each value with the output inline of the text.

The mean amount of time per week spent watching TV by all prior STAT 250 students is 6.6413934 hrs. The standard deviation, a measure of the typical distance from the mean, is 7.001062 hrs.

Using a stored object name is useful if you are going to use the value often (to cut down on typing), but if you only need to use the value a small number of times, you can also write the code directly in line. For example, the maximum number of hours of tv watched in a week in our dataset of prior STAT 250 students is 60.

Use in-line code directly to describe the maximum hours of exercise a prior STAT 250 student had in a week.

The maximum number of hours spent exercising in a week by a prior STAT 250 student is 24.

We often write descriptive statistics using mathematical notation; we can do this in R Markdown too. Mathematical equations are surrounded by $. We can call greek letters by their full names preceded by a backslash (e.g. \mu, \sigma), use special notation as well by calling the name preceded by a backslash and then applying it to a numeric value (e.g. \bar{x}), or use roman characters directly (e.g. s). An equation would then look like, for example: $\bar{x} = 3.9$.

Which set of symbols would we use for the values we calculated on exercise in its last code chunk? Why?

The mean of exercise would be written as \bar{x} because it is a sample mean. The standard deviation would be written as s because it is a sample standard deviation.

Write the values for exercise referenced above, using mathematical notation.

$\bar{x} = 5.107794$ $s = 4.107598$

3.2: Descriptive Graphics

There are two main types of descriptive graphics we can use for numeric data: histograms and boxplots. We’ll learn how to make both in R and customize them using arguments.

3.2.a: Histograms

Below is the basic code for a histogram of exercise. The function tells R to make a histogram. The first argument provided is always the data that you want graphed in the histogram. I have added two additional arguments to the code. Try running it several times and see what changes as you change the text within the quotation marks. Notice that this code is spaced out differently than we’ve seen before. It is just to make commenting each argument easier – R will keep reading the function until it reaches the closing parenthesis ).

Comment on each line of code to indicate what it does and modify the code to add labels to the graph axes.

hist(survey$exercise, # creates a histogram using the exercise data from the survey
     xlab = "Number of hours spent exercising", # labels the x-axis
     ylab = "Number of students exercising for x number of hours") # labels the y-axis

We would prefer it if our axes extended all the way to our maximum values. We can calculate the minimum and maximum values seen in our dataset using the min() and max() functions. We can then see the largest bin value we would want our axis to extend to. Modify the code below to calculate the minimum and maximum values. Then modify the hist() function to insert arguments labeling the axes and also set the minimum and maximum limits of the x axis. Annotate your code.

min(survey$exercise, na.rm = TRUE) # calculates the minimum value in exercise

[1] 0

max(survey$exercise, na.rm = TRUE) # calculates the maximum value in exercise

[1] 24

hist(survey$exercise, xlab = "Hours of exercise per week", ylab = "Number of students", # labels the axes
     xlim = c(0, 24)) # sets the x-axis limits to the min and max hours of exercise

Did you need to add an argument to your min() and max() functions for them to run? Why?

Yes, it had NA values so I had to add na.rm = TRUE to ensure that those were excluded.
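The effect of na.rm is easy to see on a short vector. The values below are hypothetical; the behavior is the same for min(), max(), mean(), and sd().

```r
# Hypothetical hours with one missing answer
tv <- c(5, NA, 60, 0)

max(tv)                   # NA: one missing value makes the result missing
max(tv, na.rm = TRUE)     # 60: missing values are dropped first
range(tv, na.rm = TRUE)   # 0 60: minimum and maximum in one call
```

The same thing happens with inline code: a call to max() without na.rm = TRUE on a variable that contains missing values is knitted as NA.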

Describe the histogram in terms of the four characteristics we discussed in class.

It is unimodal and right-skewed, with a median of 4 hours and values ranging from 0 to 24 hours per week.

Look at the frequency distribution in the histogram. What does this tell us about the amount of time spent exercising per week for STAT 250 students?

Most STAT 250 students spend less than 10 hours a week exercising.

Below is a basic code for a histogram. Modify the code to plot a histogram for the variable tv, being sure to label your axes appropriately.

hist(survey$tv, # creates a histogram using the tv data from the survey
     xlab = "Number of hours spent watching TV", # labels the x-axis
     ylab = "Number of students watching x number of hours of TV") # labels the y-axis
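The arguments introduced in this section can be combined in a single call. The sketch below uses simulated right-skewed values (not the survey data) and adds a title with main; the object names hours, upper, and h are hypothetical.

```r
set.seed(250)                             # makes the simulated values repeatable
hours <- round(rexp(200, rate = 1 / 5))   # hypothetical right-skewed weekly hours
upper <- max(hours, na.rm = TRUE)         # largest value, used for the axis limit

h <- hist(hours,
          main = "Weekly hours (simulated)",   # plot title
          xlab = "Hours per week",             # x-axis label
          ylab = "Number of students",         # y-axis label
          xlim = c(0, upper))                  # x-axis from 0 to the largest value
h$breaks                                       # bin edges that R chose
```

Saving the result to an object is optional, but it lets you inspect the bin edges and counts behind the picture.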

3.2.b: Boxplots

Now let’s create a boxplot to look at the distribution of exercise. There are two ways to print the boxplot. Add labels to the x and y axis appropriate for each style.

boxplot(survey$exercise, ylab = "Hours of exercise per week")

boxplot(survey$exercise, horizontal = TRUE, xlab = "Hours of exercise per week")

What did adding horizontal = TRUE to the code do to the boxplot?

It rotated the box plot from vertical to horizontal.

Insert a code chunk to create a boxplot for tv.

boxplot(survey$tv, horizontal = TRUE, xlab = "Hours of TV watched per week")

Describe the distributions based on the boxplots for tv and exercise.

The distributions for exercise and tv are both right-skewed.

We can calculate the 5-number summary for both variables by supplying the variable name to the function summary(). Insert a code chunk below to do so.

summary(survey$exercise)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   4.000   5.108   7.000  24.000      20

summary(survey$tv)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   5.000   6.641  10.000  60.000      26
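Note that summary() reports the mean and the NA count alongside the five-number summary. For just the five numbers, base R also offers fivenum(), and IQR() gives the width of the box in a boxplot. A minimal sketch on hypothetical values (the names hours and five are illustrative):

```r
# Hypothetical weekly hours with one missing answer
hours <- c(0, 2, 2, 4, 5, 7, 24, NA)

five <- fivenum(hours)          # min, lower hinge, median, upper hinge, max
five                            # NAs are dropped by default in fivenum()
quantile(hours, na.rm = TRUE)   # quartiles as computed by summary()
IQR(hours, na.rm = TRUE)        # third quartile minus first quartile
```

The hinges from fivenum() and the quartiles from summary() can differ slightly on some datasets because they use different interpolation rules.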

Part 4: Scenario: Beach Trash

Sam walks along the same two mile stretch of beach each morning walking his dog. While he walks, he picks up any trash he encounters. For the last ninety days, he has been keeping track of the number of pieces of trash he picks up each day. Load in the dataset trash.csv.

What is the population and sample for this scenario?

The population is all of the days in a year. The sample is the past 90 days. There are two variables given: the day and the amount of trash picked up that day.

What type of study was this?

It was an observational study, as Sam did nothing to manipulate the variables. He did not make the trash appear; he only walked along the beach in order to see it. While he did pick the trash up after his observation, this did not affect the study itself, and in fact may have made it easier to ensure that he wasn’t counting the same piece of trash multiple days in a row.

What type of variable is recorded in trash?

A quantitative discrete variable, as the number of pieces of trash is a count that can only take whole-number values.

Insert a code chunk that will read in your data and then calculate the relevant summary statistics. After your code chunk, describe each statistic using mathematical and verbal notation.

trash<-read.csv("Lab 2_trash.csv", header=TRUE)
summary(trash)

     items      
 Min.   : 5.00  
 1st Qu.:12.00  
 Median :15.00  
 Mean   :15.26  
 3rd Qu.:18.00  
 Max.   :34.00

Provide mathematical (e.g. $equation$) and verbal interpretation of each value calculated.

Create an appropriate graphic of your sample data. It should include any modifier as necessary to label the graphic.

hist(trash$items, # creates a histogram using the items data from the trash dataset
     xlab = "Number of pieces of trash picked up", # labels the x-axis
     ylab = "Number of days x pieces of trash was picked up") # labels the y-axis

Describe the distribution of items of trash on the beach Sam picked up.

The distribution of the amount of trash that Sam picked up on the beach is unimodal and right-skewed.

Checking code & Knitting your Document

You can run specific code chunks to check their output by pressing the green play button to run the specific code chunk, or the icon next to it to run all code chunks prior to that code chunk. If you want to check your R Markdown editing, you can knit the document and view it in the ‘Viewer’ Pane by pressing the ‘Knit’ button at the top of the R Markdown pane. Make sure the setting (gear icon above) is set to ‘Preview in Viewer pane’. When you are ready to knit your document, change the output type in the YAML header from html_document to word_document and press the ‘Knit’ button. Your Word document will be created in the project folder!

Appendix

For quick reference of R Markdown syntax, this website has the basics.