Create a Word or Markdown file that contains the following:
R is case sensitive. Using lower case, single words to name your objects in R will make life easier. Short names also make life easier.
# expl instead of explanatory
expl <- c(1, 2)
When naming objects, avoid spaces, don’t start with a number, and don’t reuse names that already belong to functions (e.g., mean() or sum()). Stick to names you will remember.
mean_penguins <- c(1, 2) # is better than
mean <- c(1, 2) # which could be the mean of many things
weddell.seals <- c(1,2)
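To see these naming rules in action, here is a small sketch. The object names are made up for illustration, and make.names() is a base R function used here only to show how R would repair a name that breaks the rules.

```r
# R is case sensitive: these are two different objects
penguins <- c(1, 2)
Penguins <- c(3, 4)
identical(penguins, Penguins)   # FALSE

# make.names() shows how R would repair a name that has a space and starts with a number
make.names("2nd sample")        # "X2nd.sample"

# A descriptive name keeps the mean() function and its result easy to tell apart
mean_penguins <- mean(penguins)
mean_penguins                   # 1.5
```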
Your data files for these labs have all been ‘cleaned’. All column headers and explanatory variables are lower case, and one word. Most real data has to be cleaned up before it can be analyzed.
Finally, don’t forget to use the help button to the right if you ever need to see what a function does.
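The same help pages can also be opened from the console. A minimal sketch, using mean() and sd() only as examples:

```r
?mean        # opens the help page for mean()
help("sd")   # the same request written as a function call
args(sd)     # prints the arguments a function accepts
```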
We will be using R to calculate the summary statistics for our data, then we will plot the data in a number of different ways.
Open a new R Script file by choosing File > New File > R Script and name it: yourlastname_Datavisualization.
Copy code from this page into your R Script and edit it as necessary. Select the line(s) of code you want to run, and try out the keyboard shortcut Ctrl + Enter for Run.
Indentations and comments make code easier for you to read and understand when you revisit your labs, or when you share your code with someone else.
irisSepalLength <- subset(iris, Sepal.Length == 6.5)
# subset() returned every row of iris with a sepal length of 6.5.
# We can come back and read this comment later to see what this line did.
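For comparison, the same filter can be written with square brackets instead of subset(). This is only a sketch of an equivalent approach using the built-in iris data; the name irisBrackets is made up here.

```r
# Rows where the condition is TRUE, all columns (note the comma before the closing bracket)
irisBrackets <- iris[iris$Sepal.Length == 6.5, ]

# Both approaches keep the same number of rows
nrow(irisBrackets) == nrow(subset(iris, Sepal.Length == 6.5))   # TRUE
```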
Download the caffeine.csv file to your computer from Canvas and place it into your Biostats folder. Be sure your working directory is set to the proper file path.
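One way to check the working directory from the console is sketched below. The folder path in the commented-out line is hypothetical; replace it with the location of your own Biostats folder.

```r
getwd()                       # prints the current working directory
# setwd("~/Biostats")         # hypothetical path; uncomment and edit to match your computer
file.exists("caffeine.csv")   # TRUE if the file is in the working directory
```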
You have some options for reading in your data. You can write the code to directly read your data into R, which would look something like this:
x <- read.csv("caffeine.csv")
You can use the following code which opens up a window in which you can choose your csv file.
x <- read.csv(file = file.choose())
You can use the menu button on the top right that says ‘Import Dataset’.
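If you want to try read.csv() before downloading anything, the sketch below writes a tiny CSV to a temporary file and reads it back. The column names match caffeine.csv, but the file itself is a stand-in: the "yes" rows are invented for illustration, and the object names toy and toy_factor are made up.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,caff,exer",
             "1,no,2.00",
             "2,no,1.00",
             "3,yes,1.50",    # invented row
             "4,yes,3.00"),   # invented row
           csv_path)

toy <- read.csv(csv_path)
class(toy)        # "data.frame"
class(toy$caff)   # "character" in R 4.0 and later

# stringsAsFactors = TRUE converts text columns to factors while reading
toy_factor <- read.csv(csv_path, stringsAsFactors = TRUE)
class(toy_factor$caff)   # "factor"
```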
These data are now in a specific kind of R object: a dataframe. I arbitrarily named that dataframe ‘x’, but I am going to copy it into a new object with a more descriptive name, ‘caffeineData’.
caffeineData <- x
Dataframes are objects that can hold different types of variables in columns. You can name the dataframe whatever you want. In this instance, ‘caffeineData’ holds 3 different types of variables: an integer, a character variable with two values (which we will convert to a factor with two levels below), and a number.
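To make the idea of mixed column types concrete, here is a small dataframe built by hand. The values are made up, and the name toy is only for illustration.

```r
toy <- data.frame(
  id   = 1:3,                    # integer column
  caff = c("no", "no", "yes"),   # text column
  exer = c(2, 1, 1.5)            # numeric column
)

sapply(toy, class)   # reports one class per column
dim(toy)             # 3 rows, 3 columns
```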
This dataset was collected to determine if caffeine intake (yes or no) influenced the amount of time exercised (hr).
Typing your object’s name will print the data.
caffeineData
If your dataframe is large, it might be a little overwhelming to print the whole thing, which is why we did not print it here. We can use the head() function to list just the first six rows:
head(caffeineData)
##   id caff exer
## 1  1   no 2.00
## 2  2   no 1.00
## 3  3   no 2.00
## 4  4   no 0.00
## 5  5   no 1.25
## 6  6   no 1.00
str(caffeineData) # can be used to show the 'structure' of the object
## 'data.frame': 40 obs. of 3 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ caff: chr "no" "no" "no" "no" ...
## $ exer: num 2 1 2 0 1.25 1 1.5 0 1.75 1.5 ...
summary() provides the minimum and maximum values for each quantitative variable, as well as means and various percentiles, along with frequency counts for factors. Most importantly, it identifies the type of variable in each column.
summary(caffeineData)
##        id            caff                exer
##  Min.   : 1.00   Length:40          Min.   :0.000
##  1st Qu.:10.75   Class :character   1st Qu.:0.750
##  Median :20.50   Mode  :character   Median :1.125
##  Mean   :20.50                      Mean   :1.269
##  3rd Qu.:30.25                      3rd Qu.:2.000
##  Max.   :40.00                      Max.   :4.000
We want R to recognize caff as a factor with two levels rather than a character variable. We can use the function ‘as.factor’ to make that conversion.
caffeineData$caff <- as.factor(caffeineData$caff)
See the difference?
summary(caffeineData)
##        id         caff         exer
##  Min.   : 1.00   no :17   Min.   :0.000
##  1st Qu.:10.75   yes:23   1st Qu.:0.750
##  Median :20.50            Median :1.125
##  Mean   :20.50            Mean   :1.269
##  3rd Qu.:30.25            3rd Qu.:2.000
##  Max.   :40.00            Max.   :4.000
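A few base R functions make it easy to inspect a factor after the conversion. The sketch below uses a short made-up vector rather than the real caff column; the names caff_chr and caff_fac are hypothetical.

```r
caff_chr <- c("no", "yes", "yes", "no", "yes")   # made-up values
caff_fac <- as.factor(caff_chr)

levels(caff_fac)    # "no" "yes" (alphabetical by default)
nlevels(caff_fac)   # 2
table(caff_fac)     # counts per level: no = 2, yes = 3
```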
Note the column names in your data frame, ‘id’, ‘caff’ and ‘exer’. Each one can be used as its own vector of data. For example:
caffeineData$caff
## [1] no no no no no no no no no no no no no no no no no yes yes
## [20] yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes yes
## [39] yes yes
## Levels: no yes
Now we can calculate some summary statistics for our response variable at each level of the explanatory variable. First we separate the response values into individual vectors for “yes” and “no”.
level_1 <- caffeineData$exer[caffeineData$caff == 'yes']
level_2 <- caffeineData$exer[caffeineData$caff == 'no']
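The square brackets above work by keeping only the positions where a logical test is TRUE. Here is a small sketch of that step on made-up vectors; exer, caff, and keep are stand-in names and not the real data.

```r
exer <- c(2, 1, 1.5, 3)               # made-up exercise times (hr)
caff <- c("no", "no", "yes", "yes")   # made-up caffeine labels

keep <- caff == "yes"   # logical vector: FALSE FALSE TRUE TRUE
keep
exer[keep]              # 1.5 3.0, only the values where keep is TRUE
```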
Calculate the mean value for level 1 and print to the screen. Add comments to your script to help you remember what the code is doing by starting the line with a #.
mean_1 <- mean(level_1)
# The mean of level_1 is
mean_1
## [1] 1.391304
Now we can calculate the rest of the summary statistics for level 1 and level 2. We can print them to the screen if desired, but even if we don’t view them, they will be stored in R for later use. These values will appear in the Global Environment (upper right window). Having the Global Environment window is one of the great features of RStudio.
sd_1 <- sd(level_1) # standard deviation of level_1
cv.1 <- sd_1/mean_1 # coefficient of variation of level_1
mean_2 <- mean(level_2)
sd_2 <- sd(level_2)
cv.2 <- sd_2/mean_2
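To see what sd() is doing, the sketch below computes the sample standard deviation and the coefficient of variation by hand. The vector holds the six exer values shown in the head() output above; the names x, x_bar, s, and cv are made up for this example.

```r
x <- c(2, 1, 2, 0, 1.25, 1)   # the six exer values from the head() output

n     <- length(x)
x_bar <- sum(x) / n                          # mean by hand
s     <- sqrt(sum((x - x_bar)^2) / (n - 1))  # sample standard deviation by hand
cv    <- s / x_bar                           # coefficient of variation

all.equal(x_bar, mean(x))   # TRUE
all.equal(s, sd(x))         # TRUE
```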
R has a great basic plot function that tries to give us the best plot for the types of variables we provide. When we provide a categorical predictor (stored as a factor) and a numerical response, it gives us a boxplot. If we provide two numerical variables, it gives us a scatterplot. One thing to note: many (but not all) plot functions accept directions for a plot in two different formats.
In this format we explicitly call out our x and y variables:
plot(x = caffeineData$caff, y = caffeineData$exer)
In this format we provide a formula, exer ~ caff, which can be read as ‘exer as a function of caff’. Generically, we can state this as: y ~ x
plot(caffeineData$exer ~ caffeineData$caff)
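The formula format also accepts a data argument, which lets you write the column names once without the object$ prefix. The sketch below uses a made-up dataframe called toy so that it runs on its own.

```r
toy <- data.frame(
  caff = factor(c("no", "no", "no", "yes", "yes", "yes")),
  exer = c(2, 1, 0, 1.5, 3, 2.5)   # made-up values
)

# Same plot as plot(toy$exer ~ toy$caff): a boxplot, because caff is a factor
plot(exer ~ caff, data = toy)
```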
To give each plot proper axis labels with units, as well as an appropriate title, we can add additional commands within the plot() function. These additional commands are called ‘arguments’, and most functions accept them. See the example below for arguments that work in most plot functions. There are many more plot() arguments, which allow customization of symbol shapes, colors, legends, plot background and more! You might notice RStudio prompting with arguments (and object names) as you type in code.
plot(caffeineData$exer ~ caffeineData$caff,
     xlab = 'Caffeine intake', ylab = 'Time exercised (hr)',
     main = 'Some Title') # it can be helpful to put arguments on separate lines to keep code organized
In addition to the generic plot function, there are functions for specific kinds of plots. Here we look at hist(), stripchart(), boxplot(), and barplot().
hist(level_1)
Use the arguments from above to customize the histogram. Note that you can also specify the number of breaks within the hist() command. R treats your request as a suggestion and chooses a number of intervals (bins) close to it, taking into account the range of values in your dataset.
hist(level_1, breaks = 50,
xlab='what I measured (units)', ylab='Still...Frequency!',
main='Some Title')
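To check which bins R actually chose, you can ask hist() to return its calculations without drawing anything. A small sketch, using the six exer values from the head() output above; the names x and h are made up.

```r
x <- c(2, 1, 2, 0, 1.25, 1)   # the six exer values from the head() output

h <- hist(x, breaks = 3, plot = FALSE)   # plot = FALSE returns the bins without drawing
h$breaks           # the bin edges R actually used
h$counts           # how many values fell in each bin
length(h$counts)   # the number of bins, which may differ from the 3 requested
```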
Make a histogram for both yes and no levels.
stripchart(caffeineData$exer ~ caffeineData$caff, vertical = TRUE)
# fancier strip chart with points jittered
par(bty = "l")
stripchart(caffeineData$exer ~ caffeineData$caff, vertical = TRUE, method = "jitter", jitter = 0.2, pch = 1, col = "firebrick", las = 1)
boxplot(caffeineData$exer ~ caffeineData$caff)
boxplot(caffeineData$exer ~ caffeineData$caff, col="#a8d1df")
Create 3 vectors
The first object should be the names of the levels of your explanatory variable. The second should be their means, and the third should be their standard deviations. Typically error bars would represent standard errors or confidence intervals, but today we will use standard deviations since that is where we are in the course material.
level_names <- c("yes", "no") # not 'levels', which is already the name of an R function
means <- c(mean_1, mean_2)
sds <- c(sd_1, sd_2)
The following code will make a basic dot-and-whisker plot. Make modifications to the “barplot” command so that it includes axis labels and a title.
bp <- barplot(means, names.arg = level_names, density = 0, border = FALSE, ylim = c(0, max(caffeineData$exer)))
points(x = bp, y = means, pch = 16)
arrows(bp, means, bp, means+sds, angle = 90)
arrows(bp, means, bp, means-sds, angle = 90)
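For reference, the standard error mentioned above is the standard deviation divided by the square root of the sample size. The sketch below computes it for the six exer values from the head() output; it is only an illustration, and the names x, n, and se are made up.

```r
x <- c(2, 1, 2, 0, 1.25, 1)   # the six exer values from the head() output

n  <- length(x)
se <- sd(x) / sqrt(n)   # standard error of the mean

se < sd(x)   # TRUE: with more than one observation, the SE is smaller than the SD

# Whisker endpoints if the error bars showed mean +/- 1 standard error
c(lower = mean(x) - se, upper = mean(x) + se)
```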
islandData <- read.csv("island.csv")
In R, we can use tapply() to summarize data across a whole table. See Table 1 below for a summary of the function, or type:
?tapply
into R and run it to open the help page.
Table 1: tapply() function summary
What is in the data?
head(islandData)
##    Island  Face AlgalDensity Nutrients
## 1 SanJuan North        47.59 11.296981
## 2 SanJuan North        39.56  8.796913
## 3 SanJuan North        47.62 10.399961
## 4 SanJuan North        53.82  9.903147
## 5 SanJuan North        50.64  6.947449
## 6 SanJuan South        46.43 14.702849
For example, an easy way to get the mean AlgalDensity for each shore face (North, South, and West) is like this:
tapply(islandData$AlgalDensity, islandData$Face, FUN = "mean")
##    North    South     West
## 47.43733 42.11800 42.23533
tapply(islandData$AlgalDensity, list(islandData$Island, islandData$Face), FUN = "mean")
##          North  South   West
## Lopez   46.792 43.350 48.950
## SanJuan 47.846 38.774 39.522
## Shaw    47.674 44.230 38.234
Note that you need to use the object$column notation here, even within functions, so R knows what data you want to apply the function to.
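To see how tapply() groups values, here is a sketch on a small stand-in for island.csv that holds only the six rows printed by head() above. The names islandToy, faceMeans, and faceSDs are made up for this example.

```r
islandToy <- data.frame(
  Face         = c("North", "North", "North", "North", "North", "South"),
  AlgalDensity = c(47.59, 39.56, 47.62, 53.82, 50.64, 46.43)
)

# One mean per level of Face
faceMeans <- tapply(islandToy$AlgalDensity, islandToy$Face, FUN = mean)
faceMeans   # North 47.846, South 46.43

# Swap in another function to get a different summary
faceSDs <- tapply(islandToy$AlgalDensity, islandToy$Face, FUN = sd)
faceSDs     # South is NA because this toy group has only one value
```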
Use tapply() to summarize the table. Note that you can save this summary to a new object.
IsleM <- tapply(islandData$AlgalDensity, islandData$Island, FUN = "mean")
barplot(IsleM, names.arg = c("Lopez", "San Juan", "Shaw"))
This dataset contains information about three finches in Kenya: the Crimson-rumped waxbill (CRU.WAXB), the Cutthroat finch (CUTTHROA), and the White-browed sparrow weaver (WB.SPARW).
6a: What are the columns named?
6b: What types of data do they contain? (continuous numerical?, nominal categories?, etc…)
7a: What are the summary statistics for each species?
7b: Find the mean, standard deviation, and coefficient of variation for each.
8a: Describe the shape of the distribution. Is it symmetric or strongly skewed? Is it unimodal or bimodal?
8b: Are there outliers?
Describe the data. Reference your summary statistics.
If you have extra time, here is something fun to try. Run the code below. Respond “yes” to the dialog box that comes up. The first command below will install a package, ‘scales’, that adds additional functionality to R. In this case the additional function is alpha(), which we use to create a transparent color for the second histogram. Packages like this one are developed by a community of people who use R. Note that once a package is installed in your version of R, you only need to load it using the library command. You will not need to reinstall it, but you will want to check for updates periodically.
install.packages("scales", repos="https://cran.rstudio.com") # this may take a minute, be patient
library(scales)
hist(level_1, freq = FALSE, col = "pink", border = FALSE,
     main = "Composite Histogram", xlab = "Response Variable")
hist(level_2, freq = FALSE, col = alpha("blue", 0.5),
     add = TRUE, border = FALSE)
curve(dnorm(x, mean(level_1), sd(level_1)),
      add = TRUE, col = "red")
curve(dnorm(x, mean(level_2), sd(level_2)),
      add = TRUE, col = "blue")
legend("topright", c("Level 1", "Level 2"), bty = "n",
       col = c("red", "blue"), pch = 15)
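If you are unsure whether scales is already installed, the sketch below checks first and falls back to adjustcolor(), a base R function from grDevices that also produces a transparent color. This is one possible approach and not part of the lab code; the names has_scales and see_through_blue are made up.

```r
# TRUE if the scales package is installed; this does not install or attach it
has_scales <- requireNamespace("scales", quietly = TRUE)

if (has_scales) {
  see_through_blue <- scales::alpha("blue", 0.5)
} else {
  # Base R fallback that needs no extra package
  see_through_blue <- adjustcolor("blue", alpha.f = 0.5)
}

see_through_blue   # a hex color string that includes a transparency channel
```

The result can be passed to col = in hist() in the same way as alpha("blue", 0.5).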