Instructors: Luis E. Escobar, PhD and Mariana Castaneda-Guzman

Class: Foundations of Fish & Wildlife Diseases, FiW 4984, Fall 2019.

Start R Session

If you have not yet installed the software, follow the links to download R and RStudio.

Step 1: Download Data

To begin, go to the desktop, create a new folder, and give it a name. Then copy and paste the following link into your web browser to download the data for this tutorial. Save the file in your new folder, and make a note of the file's name and location.

  http://bit.ly/2rv5kmm

Step 2: Open a new session of RStudio

After opening RStudio, your window should look similar to this:

Step 3: Open a new R script

Once you have opened RStudio, open a new empty R script by clicking the highlighted icon in the image below and selecting R Script.

If you have successfully opened RStudio and created a new R script, your screen should now look like the image below. Now save your script by clicking the save icon, and save it in the same folder as your data.

Step 4: Understand the different panes

RStudio contains four different panes:

1. R script pane: this is where you’ll write your main code.
2. Console pane: this is where you’ll see the outputs (except for plots).
3. Environment, history, and connections pane: this is where you’ll see the values of the variables in memory, the R code history, and any open connections to outside databases.
4. Files, plots, packages, help, and viewer pane: in the Files tab, you’ll generally see the files inside your working directory. In the Plots tab, you’ll see your output plots. In the Help tab, you’ll see descriptions of R functions (to get help on any function, type ?name_of_function in your console).

Load/Read Necessary Data

Step 1: Set working directory

Click on the Session tab in the menu bar, then click Set Working Directory, and then click Choose Directory…. Choose the folder where you saved the data.

Step 2: Import data into R

One way to import data into R is to use text files. Our data is contained in a CSV (comma-separated values) file, which uses commas to separate the elements in each line. The data resembles an Excel worksheet. To read the file into R, we use the R function read.csv().

my_data_frame <-
        read.csv("Distemper.csv",
                 header = TRUE,
                 stringsAsFactors = FALSE)

Note: You can check if the data you read in Step 2 is correct by using the R built-in function View(data_frame_name).

View(my_data_frame)

Below you can see the first 10 rows of the data frame. Important things to notice about a data frame are:

1. Number of columns
2. Number of rows
3. Spelling of things
4. Type of data within each column
5. Basic formats

Keep these things in mind as we clean the data in the next few steps.

MoNTH weight species RESUTLT
January 26.5 S Positive
February 13.0 R Positive
February 8.0 R Negative
March 18.5 S Negative
March 10.5 R Positive
April 6.5 S Negative
April 1.0 R Negative
April 5.0 R Negative
May 26.0 R Negative
May 0.5 R Negative
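
Most items on the checklist above can also be checked from the console. The sketch below is only an illustration: it rebuilds the first three rows shown above by hand in a stand-in object called toy_data (a hypothetical name, not part of the tutorial data), so it runs without the CSV file.

```r
# Hypothetical stand-in for the first rows of the data frame shown above
toy_data <- data.frame(
        MoNTH = c("January", "February", "February"),
        weight = c(26.5, 13.0, 8.0),
        species = c("S", "R", "R"),
        RESUTLT = c("Positive", "Positive", "Negative"),
        stringsAsFactors = FALSE
)

dim(toy_data)          # number of rows, then number of columns
names(toy_data)        # column names, useful for spotting spelling problems
str(toy_data)          # type of data within each column
head(toy_data, n = 2)  # print only the first two rows
```

With the real data you would call the same functions on my_data_frame.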

Data Cleaning

Once you have loaded and reviewed the data, we can start cleaning the data.

There are two essential steps in cleaning data: 1. make aesthetic changes, and 2. specify data types. Making these changes will make the data frame easier to use later.

Aesthetic Changes

You can check the current column names by using the function names(data_frame_name).

names(my_data_frame)
## [1] "MoNTH"   "weight"  "species" "RESUTLT"

To change the column names, assign to colnames(data_frame_name) and specify the new names using c("name1", "name2", ...). The function c() combines its arguments into a vector, in this case a character vector of names.

colnames(my_data_frame) <-
        c("Month", "Weight", "Species", "Diagnostic")

Double check that you have correctly named your columns by rerunning or typing the function names( ).

names(my_data_frame)
## [1] "Month"      "Weight"     "Species"    "Diagnostic"
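
If only one column name needs fixing, it can be changed on its own instead of retyping all the names. The sketch below assumes a one-row stand-in data frame called toy_data (hypothetical, built by hand from the first row shown earlier) and renames just the misspelled column.

```r
# Hypothetical one-row stand-in with the original column names
toy_data <- data.frame(
        MoNTH = "January",
        weight = 26.5,
        species = "S",
        RESUTLT = "Positive"
)

# Replace only the name that matches "RESUTLT"; the other names are untouched
names(toy_data)[names(toy_data) == "RESUTLT"] <- "Diagnostic"
names(toy_data)
```

This approach does not depend on the column's position, so it still works if columns are later added or reordered.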

Assessing Data Types

A data type is defined by the values a data item can take. R's basic data types include character, numeric, integer, complex, and logical.

Step 1: Check data type for each of the columns in data frame.

To check the type/class of each column in the data frame, we use the R function sapply(). Here sapply() takes two arguments: 1. your data frame, and 2. the function to apply to each column, in this case class, which returns the data type.

sapply(my_data_frame, class)
##       Month      Weight     Species  Diagnostic 
## "character"   "numeric" "character" "character"
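
To see what the data types named above look like in practice, class() can also be called on single values. The values below are made up for illustration only.

```r
class("Raccoon")  # "character"
class(26.5)       # "numeric"
class(50L)        # "integer" (the L suffix marks a whole number)
class(1 + 2i)     # "complex"
class(TRUE)       # "logical"
```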

Step 2: Change the class of the columns according to what each column represents

Categorical data in R is called a factor. A categorical variable can take only one of a fixed, finite set of possible values, called levels.

For example, the column denoting Species needs to be set as a factor with levels “R” (Raccoon) and “S” (Skunk). We do this by using the R function as.factor(). Note: since the Species column contains only two distinct values, “S” and “R”, R will automatically set these two values as the levels.

my_data_frame$Species <- as.factor(my_data_frame$Species)

We do the same thing for the column Month. Note that for Month we set the levels to month.name, a built-in character vector in R. This keeps the months in calendar order (instead of alphabetical order) and keeps every month as a level, even if our data frame does not include all months of the year.

my_data_frame$Month <-
        factor(my_data_frame$Month, levels = month.name)

If we run month.name by itself, we can see that the object is a character vector with all the months of the year in order.

month.name 
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"
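
The effect of setting levels = month.name is easiest to see on a small example. The sketch below uses a made-up vector called toy_months (hypothetical, not the tutorial data) that covers only three months, and compares as.factor() with factor(..., levels = month.name).

```r
# Hypothetical months, out of calendar order and with most months missing
toy_months <- c("March", "January", "March", "February")

# as.factor() keeps only the values present and sorts the levels alphabetically
alphabetical_levels <- levels(as.factor(toy_months))
alphabetical_levels

# factor() with levels = month.name keeps all 12 months in calendar order
month_factor <- factor(toy_months, levels = month.name)
month_table <- table(month_factor)
month_table  # months that never appear are counted as 0
```

Keeping the empty months matters later for plotting, because a bar plot of month_table then shows every month in order, including those with no cases.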

Step 3: Re-check data types

After changing the class/type of all of the desirable columns, we can re-check the data frame’s data types

sapply(my_data_frame, class)
##       Month      Weight     Species  Diagnostic 
##    "factor"   "numeric"    "factor" "character"

Basic plots in R

R has built-in functions that allow us to plot different kinds of graphics, for example line charts, time series, bar plots, box plots, histograms, density plots, and scatter plots, among others.

For this tutorial we will focus on three basic plots: bar plots, box plots, and density plots.

Bar plots

A bar plot represents categorical data with heights proportional to the values they represent.

Before we can plot our graphic, we need to transform the data frame into the correct data structure, in this case a table. We can use the R function table() to do this transformation. The function table() uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels. These counts will become the heights of the bars.

For this example, we only want to transform the column Month into a table. The function table() counts the number of times each month appears in the column Month of our data frame.

month_count <- table(my_data_frame$Month)
month_count
## 
##   January  February     March     April       May      June      July 
##         1         2         2         3         4         5         8 
##    August September   October  November  December 
##        10         7         4         3         1

Now that we have transformed our data into the correct data structure, we can continue the steps to create our bar plot.

Step 1: Call the function barplot() with the desired data

The most basic bar plot you can make only requires you to add the height argument inside the function barplot().

barplot(height = month_count)

Step 2: Add a title

To add a title to the plot use the argument main = "your title here"

barplot(height = month_count,
        main = "Number of Reported Cases per Month")

Step 3: Add a label to x-axis

To add a label to the x-axis, use the argument xlab = "your x-axis label here"

barplot(height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month")

Step 4: Add a label to y-axis

To add a label to the y axis use the argument ylab = "your y-axis label here"

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases"
)

Step 5: Change border color

To change the border of the bars use the argument border = "color name here"

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow"
)

Step 6: Change fill color

To change the fill color of the bars use the argument col = "color name here"

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "purple"
)

Step 7: Add texture to bars

To add texture (shading lines) to the bars, use the argument density = number, where the number is the density of the shading lines in lines per inch.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "purple",
        density = 10
)

Step 8: Change angle of texture

To change the angle of the shading lines, use the argument angle = number. The angle is given in degrees, measured counter-clockwise. Use any angle from 0 to 360 degrees.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "purple",
        density = 10,
        angle = 20
)

If you want to alternate angles, specify the desired angles using the function c() to create a vector of angles. If there are fewer angles than bars, R recycles the angles, so they alternate between the bars.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "purple",
        density = 10,
        angle = c(45, 90)
)
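
Recycling means the short vector of angles is reused from the start until every bar has a value. The sketch below only makes that visible with rep_len(); you do not need to call it yourself, since the plotting code recycles the angles for you. The name n_bars is an assumption for illustration.

```r
n_bars <- 12                          # one bar per month
angles <- rep_len(c(45, 90), n_bars)  # repeat 45, 90 until there are 12 values
angles
```

The result alternates 45 and 90, which is why neighboring bars in the plot above are shaded in different directions.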

Step 9: Change space between bars

To change the space between the bars, use the argument space = number. The value is expressed as a fraction of the average bar width.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "orange",
        density = 10,
        angle = c(45, 90),
        space = 2
)

Step 10: Change direction of bars

To change the direction of the bars, use the argument horiz = TRUE. With this argument, the bars are drawn horizontally.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "orange",
        density = 15,
        angle = c(45, 90),
        space = 2,
        horiz = TRUE
)

Step 11: Change direction of axis labels

To change the axis label orientation, use the argument las = number, where ‘las’ sets the style (orientation) of the axis labels.

The number for the axis orientation can be:

0: parallel to the axis
1: always horizontal
2: perpendicular to the axis
3: always vertical

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Month",
        ylab = "Number of Cases",
        border = "yellow",
        col = "orange",
        density = 15,
        angle = c(45, 90),
        space = 2,
        horiz = TRUE,
        las = 2
)

Step 12: Reverse axis labels

Because the bars are now horizontal, the months run along the y-axis and the counts along the x-axis. To reverse the axis labels, simply swap the values of the arguments xlab and ylab.

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        ylab = "Month",
        xlab = "Number of Cases",
        border = "yellow",
        col = "orange",
        density = 15,
        angle = c(45, 90),
        space = 2,
        horiz = TRUE,
        las = 2
)

Step 13: Fit the plot to the window

As you may notice, some of the y-axis labels are cut off. To fix this, we can set the plot margins to a size that makes all of the plot fit properly. To do this we use the function par(mar = c(bottom, left, top, right)), where par stands for (graphical) parameters and ‘mar’ stands for margin. The four numbers give the margin sizes, in lines of text, for the bottom, left, top, and right sides of the plot. You need to call the function before your plot code. The default values are par(mar = c(5.1,4.1,4.1,2.1)). Note that for the plot below we deleted the ylab, just for aesthetics.

par(mar = c(5.1,8.1,4.1,2.1))

barplot(
        height = month_count,
        main = "Number of Reported Cases per Month",
        xlab = "Number of Cases",
        border = "yellow",
        col = "orange",
        density = 15,
        angle = c(45, 90),
        space = 2,
        horiz = TRUE,
        las = 2
)
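
A margin set with par() stays in effect for later plots on the same graphics device. One common pattern, sketched below, is to save the previous settings and restore them afterwards. The names old_par and toy_count are assumptions for illustration, and the counts are made up.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mar = c(5.1, 8.1, 4.1, 2.1))

# Hypothetical counts, standing in for month_count
toy_count <- c(January = 1, February = 2, September = 7)

barplot(
        height = toy_count,
        xlab = "Number of Cases",
        horiz = TRUE,
        las = 2
)

# Restore the margins that were in effect before
par(old_par)
```

After the last line, new plots use the earlier margins again.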

Density Plot

A density plot is a representation of the distribution of a numeric variable. The height of the curve at a given point is estimated from how many values in our data fall near that point (a kernel density estimate), so the curve is higher where the data are more concentrated.

In order to plot a density graph, we need to transform the data using the R function density(). For this example, we will transform only the Weight column of our data frame.

density_of_Weight <- density(my_data_frame$Weight)
density_of_Weight
## 
## Call:
##  density.default(x = my_data_frame$Weight)
## 
## Data: my_data_frame$Weight (50 obs.);    Bandwidth 'bw' = 3.379
## 
##        x                y            
##  Min.   :-9.636   Min.   :2.654e-05  
##  1st Qu.: 5.182   1st Qu.:2.974e-03  
##  Median :20.000   Median :8.106e-03  
##  Mean   :20.000   Mean   :1.685e-02  
##  3rd Qu.:34.818   3rd Qu.:2.954e-02  
##  Max.   :49.636   Max.   :5.487e-02
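
The y values of a density estimate are not counts: they are scaled so that the total area under the curve is approximately 1. The sketch below checks this on a small made-up vector, toy_weights (hypothetical; it reuses the ten weights shown in the data preview), by adding up the heights of the curve multiplied by the spacing between the x values.

```r
# Hypothetical stand-in: the ten weights from the data preview
toy_weights <- c(26.5, 13.0, 8.0, 18.5, 10.5, 6.5, 1.0, 5.0, 26.0, 0.5)

toy_density <- density(toy_weights)

# The x values are equally spaced, so one spacing is enough
step <- diff(toy_density$x)[1]

# Approximate area under the curve: sum of (height * width)
area <- sum(toy_density$y) * step
area
```

This is also why the x range in the summary above extends below zero: the smoothed curve spreads a little beyond the smallest and largest observed weights.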

Step 1: Run basic plot function

After transforming the Weight column into a density object and saving it as a new object, we can draw our density plot. To plot the density curve, we use the R function plot().

plot(density_of_Weight)

Step 2: Add a title

To add a title to the plot use the argument main = "your title here"

plot(density_of_Weight,
     main = "Density plot")

Step 3: Add a label to x-axis

To add a label to the x-axis, use the argument xlab = "your x-axis label here"

plot(density_of_Weight,
     main = "Density plot",
     xlab = "Weight")

Step 4: Add a label to y-axis

To add a label to the y-axis, use the argument ylab = "your y-axis label here"

plot(density_of_Weight,
     main = "Density plot",
     xlab = "Weight",
     ylab = "Probability Density")

Step 5: Change color of the line

To change the color of the line use the argument col = "new color here".

plot(density_of_Weight,
     main = "Density plot",
     xlab = "Weight", 
     ylab = "Probability Density",
     col = "red")

Step 6: Change color of area under the curve

If you want to change the color of the area under the curve, you must use the function polygon() and specify the new fill color. Note that for this example we use an RGB code for the color. The rgb() function takes three arguments representing the intensities of red, green, and blue, each between 0 and 1 by default.

plot(density_of_Weight,
     main = "Density plot",
     xlab = "Weight",
     ylab = "Probability Density")
polygon(density_of_Weight,
        col = rgb(0.2, 0.4, 0.6))
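
The value returned by rgb() is an ordinary color string, which is why it can be passed to col. The sketch below shows the hexadecimal code produced by the call used above, plus an optional alpha argument for transparency; the object names are assumptions for illustration.

```r
# Same color as in the plot above, returned as a hexadecimal color code
fill_color <- rgb(0.2, 0.4, 0.6)
fill_color  # "#336699"

# alpha sets the opacity: 0 is fully transparent, 1 is fully opaque
see_through <- rgb(0.2, 0.4, 0.6, alpha = 0.5)
see_through
```

A semi-transparent fill can be useful when the filled area would otherwise hide other elements drawn on the same plot.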

Box Plot

Relationship between a box plot and a density curve (image: Wikipedia)

Step 1: Make basic box plot function call

For the basic box plot call, you need to pass two arguments to the function: 1. The data to look at: the function boxplot() takes a formula of the form y ~ x, where y is a numeric vector that is grouped according to the value of x. 2. Your data frame.

boxplot(Weight ~ Species, data = my_data_frame) # "Weight ~ Species" reads: Weight grouped by Species
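
To see what "grouped by" means, the numbers behind a box plot can be inspected without drawing it. The sketch below uses a small made-up data frame, toy_data (hypothetical values, not the tutorial data), and the argument plot = FALSE, which makes boxplot() return the summary statistics instead of a picture.

```r
# Hypothetical weights for five raccoons (R) and three skunks (S)
toy_data <- data.frame(
        Weight = c(1, 5, 8, 10.5, 13, 6.5, 18.5, 26.5),
        Species = factor(c("R", "R", "R", "R", "R", "S", "S", "S"))
)

box_stats <- boxplot(Weight ~ Species, data = toy_data, plot = FALSE)
box_stats$names       # one box per Species level
box_stats$stats[3, ]  # row 3 holds the median of each box

# The same medians, computed directly for each group
group_medians <- tapply(toy_data$Weight, toy_data$Species, median)
group_medians
```

Each column of box_stats$stats corresponds to one box, so the formula has split Weight into one set of values per species.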

Step 2: Add a title

To add a title to the plot use the argument main = "your title here"

boxplot(Weight ~ Species, data = my_data_frame,
        main = "Box Plot Weight and Species")

Step 3: Add a label to x-axis

To add a label to the x-axis, use the argument xlab = "your x-axis label here"

boxplot(Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species")

Step 4: Add a label to y-axis

To add a label to the y axis use the argument ylab = "your y-axis label here"

boxplot(
        Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species",
        ylab = "Weight"
)

Step 5: Change fill color

To change the fill color of the boxes, use the argument col = "color name here". Use c() to give one color per box.

boxplot(
        Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species",
        ylab = "Weight",
        col = c("pink", "skyblue")
)

Step 6: Change border color

To change the border color of the boxes, use the argument border = "color name here"

boxplot(
        Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species",
        ylab = "Weight",
        col = c("pink", "skyblue"),
        border = c("red", "gray")
)

Step 7: Change the names of boxes

To change the names of the individual boxes, use the argument names = "new name here", and use c() to list all the names if there is more than one. The names are applied in the order of the factor levels (here “R”, then “S”).

boxplot(
        Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species",
        ylab = "Weight",
        col = c("pink", "skyblue"),
        border = c("red", "gray"),
        names = c("Raccoon", "Skunk")
)

Step 8: Change the color of the axis

To change the color of the x- and y-axis annotation (the tick labels), use the argument col.axis = "color here".

boxplot(
        Weight ~ Species,
        data = my_data_frame,
        main = "Box Plot Weight and Species",
        xlab = "Species",
        ylab = "Weight",
        col = c("pink", "skyblue"),
        border = c("red", "gray"),
        names = c("Raccoon", "Skunk"),
        col.axis = "#69b3a2"
)

Calculating Prevalence and Incidence

Step 1: Transforming Data Frame

In order to obtain prevalence and incidence, we need to count the number of positive and negative diagnostics. For this we will use, once again, the function table(). Remember that the function table() in R creates a table of the counts at each combination of factor levels. In this example, the function table() will count the number of times each diagnostic was made, either positive or negative (the values of the column Diagnostic).

diagnostic <- table(my_data_frame$Diagnostic)
diagnostic
## 
## Negative Positive 
##       24       26

Afterwards, we need to transform the table into a data frame using the function as.data.frame().

diagnostic <- as.data.frame(diagnostic)
diagnostic
##       Var1 Freq
## 1 Negative   24
## 2 Positive   26

Step 2: Get count of positive diagnostics

To obtain only the number of positive samples, we need to access our new data frame by using a specific index. In this case our count of positive diagnostics is located in the second (2) row and the second (2) column. In R this is written as [2, 2], that is, [row, column].

positive <- diagnostic[2, 2]
positive
## [1] 26
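
Indexing by position works here, but it depends on "Positive" being in the second row. As a hedged alternative, the count can be looked up by its label instead. The sketch below rebuilds the counts reported above (24 negative, 26 positive) in a stand-in object called toy_diagnostic (a hypothetical name) so that it runs on its own.

```r
# Stand-in for the diagnostic data frame, rebuilt from the counts shown above
toy_results <- rep(c("Negative", "Positive"), times = c(24, 26))
toy_diagnostic <- as.data.frame(table(toy_results))
names(toy_diagnostic) <- c("Var1", "Freq")  # match the column names shown above

# Select the Freq value in the row whose label is "Positive"
positive_by_name <- toy_diagnostic$Freq[toy_diagnostic$Var1 == "Positive"]
positive_by_name
```

The label-based lookup gives the same count even if the order of the rows changes.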

Step 3: Get total number of animals sampled

Each row of our data frame is one sampled animal, so the total number of animals (stored below as number_of_Species) is equal to our sample size, that is, the number of rows in our data frame (50 rows). To obtain the number of rows in our data frame, we can use the function nrow().

number_of_Species <- nrow(my_data_frame)
number_of_Species
## [1] 50

Step 4: Get prevalence

Prevalence is the number of animals in the sample that tested positive, divided by the total number of animals in the sample. Multiplying by 100 expresses it as a percentage.

# Basic formula to calculate prevalence
prevalence <- (positive / number_of_Species) * 100
prevalence
## [1] 52
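
The same percentage can be cross-checked in one line, because the mean of a TRUE/FALSE vector is the proportion of TRUE values. The sketch below is an illustration only: it rebuilds the 24 negative and 26 positive results reported above in a stand-in vector called toy_results (hypothetical name).

```r
# Stand-in for the Diagnostic column, rebuilt from the counts shown above
toy_results <- rep(c("Negative", "Positive"), times = c(24, 26))

# Proportion of results equal to "Positive", expressed as a percentage
prevalence_check <- mean(toy_results == "Positive") * 100
prevalence_check
```

With the real data, the comparison would be made on my_data_frame$Diagnostic.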

Step 5: Get incidence

Incidence is the number of animals in the sample that tested positive, divided by the population number. Multiplying by 10,000 expresses it as the number of cases per 10,000 individuals.

Hence, to calculate incidence we need to set the population number. For this example we are using the population of Guatemala (16 million inhabitants).

population <- 16000000 

# Basic formula to calculate incidence
incidence <- (positive / population) * 10000

incidence
## [1] 0.01625

Extra Step: Calculate prevalence per Species

# Keep only the two columns needed for this calculation
subset_my_data_frame_2 <- my_data_frame[, c("Species", "Diagnostic")]
head(subset_my_data_frame_2)
##   Species Diagnostic
## 1       S   Positive
## 2       R   Positive
## 3       R   Negative
## 4       S   Negative
## 5       R   Positive
## 6       S   Negative

# Count the diagnostics for each species
Species_frequency <- table(subset_my_data_frame_2)
Species_frequency
##        Diagnostic
## Species Negative Positive
##       R       13       15
##       S       11       11

# Convert the table of counts into a data frame
Species_frequency <- as.data.frame(Species_frequency)
Species_frequency
##   Species Diagnostic Freq
## 1       R   Negative   13
## 2       S   Negative   11
## 3       R   Positive   15
## 4       S   Positive   11

# Keep only the rows for raccoons
Raccoon_frequency <- subset(Species_frequency, Species == "R")
head(Raccoon_frequency)
##   Species Diagnostic Freq
## 1       R   Negative   13
## 3       R   Positive   15

# Keep only the rows for skunks
Skunk_frequency <- subset(Species_frequency, Species == "S")
head(Skunk_frequency)
##   Species Diagnostic Freq
## 2       S   Negative   11
## 4       S   Positive   11

# Prevalence = positives (row 2, column 3) / all animals of that species * 100
Raccoon_prevalence <-
        (Raccoon_frequency[2, 3] / sum(Raccoon_frequency$Freq)) * 100
Raccoon_prevalence
## [1] 53.57143

Skunk_prevalence <-
        (Skunk_frequency[2, 3] / sum(Skunk_frequency$Freq)) * 100
Skunk_prevalence
## [1] 50

# Plot the prevalence of each species side by side
barplot(
        height = c(Raccoon_prevalence, Skunk_prevalence),
        main = "Prevalence per Species",
        xlab = "Species",
        ylab = "Prevalence (%)",
        names.arg = c("Raccoon", "Skunk"),
        col = c("orange", "cyan"),
        border = rgb(0, 0.5, 1),
        col.axis = "#69b3a2", # changes the color of the axis tick labels
        cex.axis = 1.5, # changes the size of the axis tick labels
        cex.lab = 0.8, # changes the size of the axis labels
        cex.main = 1.5 # changes the size of the main title
)
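
The per-species calculation can also be written more compactly. The sketch below is one possible alternative, not the method used above: it uses tapply() to take the proportion of positive results within each species. The data frame toy_data is a hypothetical stand-in rebuilt from the counts in the table above (raccoons: 13 negative, 15 positive; skunks: 11 negative, 11 positive).

```r
# Stand-in data rebuilt from the counts shown in the table above
toy_data <- data.frame(
        Species = rep(c("R", "R", "S", "S"), times = c(13, 15, 11, 11)),
        Diagnostic = rep(
                c("Negative", "Positive", "Negative", "Positive"),
                times = c(13, 15, 11, 11)
        )
)

# For each species, the mean of TRUE/FALSE is the proportion of positives
prevalence_by_species <-
        tapply(toy_data$Diagnostic == "Positive", toy_data$Species, mean) * 100
prevalence_by_species
```

The result is a named vector with one value per species, which could be passed directly to the height argument of barplot().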