Lab: Data Visualization and Summarization

Deliverable

Create a Word or Markdown file that contains the following:

All plots from Plots section
- include x labels, y labels, and titles
Answer all questions in Questions section
Paste all code you wrote during this lab to the bottom of your file

Protips

R is case sensitive. Using lower case, single words to name your objects in R will make life easier. Short names also make life easier.

# expl instead of explanatory
expl <- c(1, 2)

When naming objects, avoid spaces, don’t start with a number, and don’t use words that are already function names (e.g., mean() or sum()). Stick to names you will remember.

mean_penguins <- c(1, 2) # is better than
mean <- c(1, 2) # which could be the mean of many things
weddell.seals <- c(1,2)
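To see why case and name choice matter, here is a minimal sketch. The object name and values are made up for illustration and are not part of the lab data.

```r
# Made-up values, for illustration only
seals <- c(410, 455, 390)

exists("seals")   # the object we just created
exists("Seals")   # a different name as far as R is concerned (capital S)

# Naming an object after a function does not break the function,
# but it makes code harder to read later
mean <- c(1, 2)
mean(seals)       # R still finds the function mean() when it is called
rm(mean)          # remove our object so only the function remains
```

After rm(mean), typing mean at the console prints the function again rather than the vector.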

Your data files for these labs have all been ‘cleaned’. All column headers and explanatory variables are lower case, and one word. Most real data has to be cleaned up before it can be analyzed.

Finally, don’t forget to use the help button to the right if you ever need to see what a function does.

Preamble

We will be using R to calculate the summary statistics for our data, then we will plot the data in a number of different ways.

Getting Started

Open and save R Script files

Open a new R Script file by choosing File > New File > R Script and save it as: yourlastname_Datavisualization.

Copy code from this page into your R Script and edit it as necessary. Select the line(s) of code you want to run, and try out the keyboard shortcut Ctrl + Enter for Run.

Indentations and comments make code easier for you to read and understand when you revisit your labs, or when you share your code with someone else.

irisSepalLength <- subset(iris, Sepal.Length == 6.5) 

# This function gave us all rows with a sepal length of 6.5 (a number, so no quotes are needed).

# We can come back and read this comment later to determine what this function did.

Read in and view data

Download the caffeine.csv file to your computer from Canvas and place it into your Biostats folder. Be sure your working directory is set to the proper file path.
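A quick way to confirm that R is looking in the right place is sketched below. The folder path in the commented setwd() line is hypothetical; replace it with the location of your own Biostats folder.

```r
getwd()                          # prints the current working directory
# setwd("~/Biostats")            # hypothetical path; edit to match your own folder
list.files(pattern = "\\.csv$")  # lists the csv files R can see in that directory
file.exists("caffeine.csv")      # TRUE if the file is in the working directory
```

If file.exists() returns FALSE, the file is either in a different folder or has a different name.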

Read your data into RStudio Desktop

You have some options for reading in your data. You can write the code to directly read your data into R, which would look something like this:

x <- read.csv("caffeine.csv")

You can use the following code which opens up a window in which you can choose your csv file.

x <- read.csv(file = file.choose())

You can use the menu button on the top right that says ‘Import Dataset’.

These data are now in a specific kind of R object: a dataframe. I arbitrarily named that dataframe ‘x’, but I am going to copy it into a new object with a more descriptive name, ‘caffeineData’.

caffeineData <- x

Dataframes are objects that can hold different types of variables in columns. You can name the dataframe whatever you want. In this instance, ‘caffeineData’ holds 3 different types of variables: an integer, a categorical variable with two values (read in as text; we convert it to a factor below), and a number.

This dataset was collected to determine if caffeine intake (yes or no) influenced the amount of time exercised (hr).

Typing your object’s name will print the data.

caffeineData

If your dataframe is large, it might be a little overwhelming to print the whole thing, which is why the full output is not shown here. We can use the head() function to list just the first six rows:

head(caffeineData)

##   id caff exer
## 1  1   no 2.00
## 2  2   no 1.00
## 3  3   no 2.00
## 4  4   no 0.00
## 5  5   no 1.25
## 6  6   no 1.00

str(caffeineData) # can be used to show the 'structure' of the object

## 'data.frame':    40 obs. of  3 variables:
##  $ id  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ caff: chr  "no" "no" "no" "no" ...
##  $ exer: num  2 1 2 0 1.25 1 1.5 0 1.75 1.5 ...

summary() provides the minimum and maximum values for each quantitative variable, as well as means and various percentiles, along with frequency counts for qualitative variables stored as factors. Most importantly, it identifies the type of variable in each column.

summary(caffeineData)

##        id            caff                exer      
##  Min.   : 1.00   Length:40          Min.   :0.000  
##  1st Qu.:10.75   Class :character   1st Qu.:0.750  
##  Median :20.50   Mode  :character   Median :1.125  
##  Mean   :20.50                      Mean   :1.269  
##  3rd Qu.:30.25                      3rd Qu.:2.000  
##  Max.   :40.00                      Max.   :4.000

We want R to recognize caff as a factor with two levels rather than a character variable. We can use the function ‘as.factor’ to make that conversion.

caffeineData$caff <- as.factor(caffeineData$caff)

See the difference?

summary(caffeineData)

##        id         caff         exer      
##  Min.   : 1.00   no :17   Min.   :0.000  
##  1st Qu.:10.75   yes:23   1st Qu.:0.750  
##  Median :20.50            Median :1.125  
##  Mean   :20.50            Mean   :1.269  
##  3rd Qu.:30.25            3rd Qu.:2.000  
##  Max.   :40.00            Max.   :4.000
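The same conversion can be tried on a tiny data frame to see exactly what changes. This is a sketch with made-up values; the object name toy is hypothetical.

```r
# Made-up data, for illustration only
toy <- data.frame(id = 1:4,
                  caff = c("no", "no", "yes", "yes"),
                  exer = c(1, 2, 1.5, 3),
                  stringsAsFactors = FALSE)  # keep text as character, as in recent R versions

class(toy$caff)                  # character before the conversion
toy$caff <- as.factor(toy$caff)
class(toy$caff)                  # factor after the conversion
levels(toy$caff)                 # the distinct categories, in alphabetical order by default
table(toy$caff)                  # how many rows fall in each level
```

The order shown by levels() is the order R uses when it lays out groups in plots and summaries.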

Note the column names in your data frame, ‘id’, ‘caff’ and ‘exer’. Each one can be used as its own vector of data. For example:

caffeineData$caff

##  [1] no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  no  yes yes
## [20] yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes
## [39] yes yes
## Levels: no yes

Subsetting a dataframe

Subset by explanatory variable

Now we can calculate some summary statistics for our response variable at each level of the explanatory variable. First we separate our dataframe into individual vectors for “yes” and “no”.

level_1  <-  caffeineData$exer[caffeineData$caff == 'yes']
level_2  <-  caffeineData$exer[caffeineData$caff == 'no']
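It may help to see what the square brackets are doing. In this sketch (made-up data, hypothetical object names), the comparison is stored first so you can look at it before using it to pick out values.

```r
# Made-up data, for illustration only
toy <- data.frame(caff = c("no", "yes", "no", "yes"),
                  exer = c(1, 2.5, 0.5, 3))

keep <- toy$caff == "yes"  # one TRUE or FALSE per row
keep
toy$exer[keep]             # only the exer values where keep is TRUE
sum(keep)                  # TRUE counts as 1, so this is the number of "yes" rows
```

Writing toy$exer[toy$caff == "yes"] does the same thing in a single step.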

Summary statistics

Calculate the mean value for level 1 and print to the screen. Add comments to your script to help you remember what the code is doing by starting the line with a #.

mean_1  <-  mean(level_1)
# The mean of level_1 is
mean_1

## [1] 1.391304

Now we can calculate the rest of the summary statistics for level 1 and level 2. We can print them to the screen if desired, but even if we don’t view them, they will be stored in R for later use. These values will appear in the Global Environment (upper right window). Having the Global Environment window is one of the great features of RStudio.

sd_1  <-  sd(level_1) # standard deviation of level_1
cv.1  <-  sd_1/mean_1 # coefficient of variation of level_1
mean_2  <-  mean(level_2)
sd_2  <-  sd(level_2)
cv.2  <-  sd_2/mean_2
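If you want to check what sd() is doing, the sample standard deviation can be computed by hand and compared. This is a sketch on made-up values; it assumes the usual sample formula with n - 1 in the denominator, which is what sd() uses.

```r
# Made-up values, for illustration only
x <- c(2, 1, 2, 0, 1.25)

n <- length(x)
m <- sum(x) / n                      # mean by hand
s <- sqrt(sum((x - m)^2) / (n - 1))  # sample standard deviation by hand

all.equal(m, mean(x))                # TRUE: matches mean()
all.equal(s, sd(x))                  # TRUE: matches sd()

cv <- s / m                          # coefficient of variation (unitless)
cv
```

Because the coefficient of variation divides by the mean, it is only meaningful when the mean is not zero.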

Basic plot() formatting

R has a great basic plot function that tries to give us the best plot for the types of variables we provide. When we provide a categorical predictor and a numerical response, it gives us a boxplot. If we had two numerical variables, it would give us a scatterplot. One thing to note: many (but not all) plot functions accept directions for a plot in two different formats.

Format 1: ‘x = , y =’

In this format we explicitly call out our x and y variables:

plot(x = caffeineData$caff, y = caffeineData$exer)

Format 2: ‘y ~ x’

In this format we provide a formula, exer ~ caff, which can be read as ‘exer as a function of caff’. Generically, we can state this as: y ~ x

plot(caffeineData$exer ~ caffeineData$caff)
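Many functions that accept a formula also accept a data argument, which lets you write the column names without the $ prefix. The sketch below uses made-up data (the object toy is hypothetical) and assumes the grouping column is already a factor.

```r
# Made-up data, for illustration only
toy <- data.frame(caff = factor(c("no", "no", "no", "yes", "yes", "yes")),
                  exer = c(1, 2, 0.5, 2.5, 3, 1.5))

plot(exer ~ caff, data = toy)  # formula plus data argument
plot(toy$exer ~ toy$caff)      # same boxes; only the default axis labels differ
```

Either style is fine for this lab; the $ style used in the rest of this handout works in every function.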

Customize using ‘arguments’

To give each plot proper axis labels with units, as well as an appropriate title, we can add additional commands within the plot() function. These additional commands are called ‘arguments’, and every function has its own set. See the example below for arguments that work in most plot functions. There are many more plot() arguments, which allow customization of symbol shapes, colors, legends, plot background, and more! You might notice RStudio prompting with arguments (and object names) as you type in code.

plot(caffeineData$exer ~ caffeineData$caff, 
     xlab='Explanatory variable (units)', ylab='Response variable (units)', 
     main='Some Title') # can be helpful to put arguments on separate lines to keep code organized

Plots

In addition to the generic plot function, there are functions for specific kinds of plots. Here we look at hist(), boxplot(), stripchart(), and barplot().

Copy plots into your submission file by clicking: Export > Copy to Clipboard > Copy Plot > Paste

1. Histogram

hist(level_1)

Use the arguments from above to customize your histogram. You can also specify the number of breaks within the hist() command. R treats your request as a suggestion: it calculates a number of intervals (bins) close to your request while taking into account the range of values in your dataset.

hist(level_1, breaks = 50, 
     xlab='what I measured (units)', ylab='Still...Frequency!', 
     main='Some Title')
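To see which bins R actually chose for a given request, you can ask hist() to compute the histogram without drawing it. This sketch uses made-up random values rather than the lab data.

```r
set.seed(1)                             # makes the made-up data reproducible
x <- runif(30, min = 0, max = 4)        # 30 made-up values between 0 and 4

h <- hist(x, breaks = 5, plot = FALSE)  # compute the bins, but do not draw
h$breaks                                # the bin boundaries R settled on
length(h$counts)                        # the number of bins actually used
sum(h$counts)                           # every value lands in exactly one bin
```

Comparing length(h$counts) with the number you asked for shows how closely R followed the request.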

Make a histogram for both yes and no levels.

2. Strip Chart

stripchart(caffeineData$exer ~ caffeineData$caff, vertical = T)

# fancier strip chart with points jittered
par(bty = "l")
stripchart(caffeineData$exer ~ caffeineData$caff, vertical = TRUE, method = "jitter", jitter = 0.2, pch = 1, col = "firebrick", las = 1)

3. Box Plot

Dress your plots up using the “main”, “xlab”, and “ylab” arguments. Try “col” if you enjoy color in your life. One argument that is not intuitive but fun to play with is pch, which sets the shape of the symbol being plotted (try ?pch in your command line; it is useful for jittered strip charts and point plots). When you have a plot with appropriate axis labels, a title, and level names, copy it and paste it in your assignment document.

boxplot(caffeineData$exer ~ caffeineData$caff)

boxplot(caffeineData$exer ~ caffeineData$caff, col="#a8d1df")
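As a sketch of how these arguments fit together, the example below relabels the groups with the names argument. The data are made up, and the label text is only a placeholder to replace with your own wording.

```r
# Made-up data, for illustration only
toy <- data.frame(caff = factor(c("no", "no", "no", "yes", "yes", "yes")),
                  exer = c(1, 2, 0.5, 2.5, 3, 1.5))

levels(toy$caff)  # check the order of the groups before relabelling them

boxplot(toy$exer ~ toy$caff,
        names = c("No caffeine", "Caffeine"),  # must follow the order of levels()
        xlab = "Caffeine intake",
        ylab = "Time exercised (hr)",
        main = "Placeholder title",
        col = "#a8d1df")
```

Checking levels() first matters because names relabels the boxes by position, not by matching text.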

4. Dot and Whisker

For the dot-and-whisker plot, we need to do a little data manipulation first.

Create 3 vectors

The first object should be the names of the levels of your explanatory variable. The second should be their means, and the third should be their standard deviations. Typically error bars would represent standard errors or confidence intervals, but today we will use standard deviations since that is where we are in the course material.

level_names  <-  c("yes", "no") # not 'levels', which is already a function in R
means  <-  c(mean_1, mean_2)    
sds  <-  c(sd_1, sd_2)

The following code will make a basic dot-and-whisker plot. Make modifications to the “barplot” command so that it includes axis labels and a title.

bp <- barplot(means, names.arg = level_names, density = 0, border = FALSE, ylim = c(0, max(caffeineData$exer)))
points(x = bp, y = means, pch = 16)
arrows(bp, means, bp, means + sds, angle = 90) # upper whisker
arrows(bp, means, bp, means - sds, angle = 90) # lower whisker
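As an optional variation, both whiskers can be drawn with one arrows() call by setting code = 3, which puts a cap on each end. This sketch uses made-up means and standard deviations and hypothetical object names, so it runs on its own.

```r
# Made-up summary values, for illustration only
grp <- c("yes", "no")
m   <- c(1.4, 1.1)  # group means
s   <- c(0.9, 0.8)  # group standard deviations

# Invisible bars are used only to set up the axes and the x positions
bp <- barplot(m, names.arg = grp, density = 0, border = FALSE,
              ylim = c(0, max(m + s) * 1.1))  # leave room for the upper whiskers
points(bp, m, pch = 16)

# code = 3 draws a flat cap (angle = 90) at both ends of each line
arrows(bp, m - s, bp, m + s, angle = 90, code = 3, length = 0.1)
```

Setting ylim from max(m + s) keeps the upper caps inside the plotting area.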

5. Barplot

Tapply() island data

Download and read in the island.csv data

islandData <- read.csv("island.csv")

Summarize data with tapply()

In R, we can use tapply() to summarize data across a whole table. See Table 1 below for a summary of the function, or type:

?tapply

into R and run it to open the help page.

Table 1: Tapply() function summary

Data container

What is in the data?

head(islandData)

##    Island  Face AlgalDensity Nutrients
## 1 SanJuan North        47.59 11.296981
## 2 SanJuan North        39.56  8.796913
## 3 SanJuan North        47.62 10.399961
## 4 SanJuan North        53.82  9.903147
## 5 SanJuan North        50.64  6.947449
## 6 SanJuan South        46.43 14.702849

For example, an easy way to get the mean AlgalDensity for each value of Face (North, South, and West in these data) is like this:

tapply(islandData$AlgalDensity, islandData$Face, FUN = "mean")

##    North    South     West 
## 47.43733 42.11800 42.23533

tapply(islandData$AlgalDensity, list(islandData$Island, islandData$Face),FUN = "mean")

##          North  South   West
## Lopez   46.792 43.350 48.950
## SanJuan 47.846 38.774 39.522
## Shaw    47.674 44.230 38.234

Note that you need to use the object$column notation here, even within functions, so R knows which data you want to apply the function to.

Use tapply() to summarize the table. Note that you can save this summary to a new object.

IsleM <-tapply(islandData$AlgalDensity, islandData$Island, FUN = "mean")

barplot(IsleM, names=c("Lopez","San Juan","Shaw"))
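tapply() is not limited to the mean. Any function that takes a vector and returns a single number can be passed as FUN, including one you write yourself. The sketch below uses made-up data and hypothetical object names to get a standard deviation and a coefficient of variation per group.

```r
# Made-up data, for illustration only
toy <- data.frame(face    = c("North", "North", "North", "South", "South", "South"),
                  density = c(47, 40, 48, 46, 41, 39))

tapply(toy$density, toy$face, FUN = sd)  # standard deviation per group

# A small custom function: coefficient of variation of one group's values
cv_by_face <- tapply(toy$density, toy$face,
                     FUN = function(v) sd(v) / mean(v))
cv_by_face
cv_by_face["North"]                      # pull out one group by name
```

The result is named by group, so individual values can be pulled out by name as in the last line.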

Questions

Answer all of the following questions using kenyanfinches.csv data.

This dataset contains information about three finch species in Kenya: the Crimson-rumped waxbill (CRU.WAXB), the Cutthroat finch (CUTTHROA), and the White-browed sparrow weaver (WB.SPARW).

6. Data Types

Data container

6a: What are the columns named?

6b: What types of data do they contain? (continuous numerical?, nominal categories?, etc…)

7. Separate

Separate the three finch species.

7a: What are the summary statistics for each species?

7b: Find the mean, standard deviation, and coefficient of variation for each.

8. Histograms

Create 3 histograms, one for each species’ mass.

8a: Describe the shape of the distribution. Is it symmetric or strongly skewed? Is it unimodal or bimodal?

8b: Are there outliers?

9. Boxplots

Create a boxplot displaying each species’ mass.

Describe the data. Reference your summary statistics.

Extra Skills

If you have extra time, here is something fun to try. Run the code below. Respond “yes” to the dialog box that comes up. The first command below will install a package, ‘scales’, that adds additional functionality to R. In this case, the additional functionality is the alpha function, which creates a transparent color for the second histogram. Packages like this one are developed by a community of people who use R. Note that once a package is installed in your version of R, you only need to load it using the library command. You will not need to reinstall it, but you will want to check for updates periodically.

install.packages("scales", repos="https://cran.rstudio.com") # this may take a minute, be patient

library(scales)

hist(level_1, freq = F, col = "pink", border = F,
  main = "Composite Histogram", xlab = "Response Variable")
hist(level_2, freq = F, col = alpha("blue",.5),
  add = T, border = F)

curve(dnorm(x, mean(level_1), sd(level_1)), 
  add = T, col = "red")
curve(dnorm(x, mean(level_2), sd(level_2)), 
  add = T, col = "blue")

legend("topright", c("Level 1", "Level 2"), bty = "n", 
  col = c("red","blue"), pch = 15)
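If installing a package is not possible on your computer, a transparent color can also be made with adjustcolor(), which comes with R. This is a sketch of that alternative on made-up random data; it is not the method used above, and the object names are hypothetical.

```r
# Base R alternative to scales::alpha(): no package installation needed
see_through_blue <- adjustcolor("blue", alpha.f = 0.5)  # 50% transparent blue
see_through_blue                                        # a hex color with an alpha channel

# Made-up data, for illustration only
set.seed(2)
a <- rnorm(50, mean = 1.4, sd = 0.9)
b <- rnorm(50, mean = 1.1, sd = 0.8)

hist(a, freq = FALSE, col = "pink", border = FALSE,
     main = "Composite Histogram", xlab = "Response Variable")
hist(b, freq = FALSE, col = see_through_blue, border = FALSE, add = TRUE)
```

Smaller values of alpha.f make the color more transparent.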