Instructors: Luis E. Escobar, PhD and Mariana Castaneda-Guzman
Class: Foundations of Fish & Wildlife Diseases, FiW 4984, Fall 2019.
If you have not yet downloaded the software, follow the links to download R and RStudio.
To begin, go to the desktop, create a new folder, and give it a name. After creating the folder, copy and paste the following link into your web browser to download the data for this tutorial. Download the file and save it in your new folder. Make sure to remember the name and location of the file.
http://bit.ly/2rv5kmm
After opening RStudio, your window should look similar to this:
Once you have opened RStudio, open a new empty R script by clicking the highlighted icon in the image below and selecting R Script.
If you have successfully opened RStudio and created a new R script window, your screen should now look like the image below. Now save your script by clicking the save icon, and save the R script in the same folder as your data.
RStudio contains four different panes:
1. R Script pane: this is where you’ll write your main code.
2. Console pane: this is where you’ll see the outputs (except for plots).
3. Environment, history, and connections pane: this is where you’ll see the values of the variables in memory, the R code history, and any open connections to outside databases.
4. Files, plots, packages, help, and viewer pane: in the Files tab, you’ll be able to see the files inside your working directory. In the Plots tab, you’ll be able to see your output plots. In the Help tab, you’ll be able to see the descriptions of R functions (to get help on any function, simply type ?name_of_function in your console).
Click on the Session tab on the task bar, then click on Set Working Directory, and then click Choose Directory…. Choose the folder where you saved the data.
One of the ways to import data into R is to use text files. Our data is contained in a CSV (comma-separated values) file. A CSV file uses commas to separate the different elements in a line, so the data resembles an Excel worksheet. To read the file into R we use the R function read.csv().
my_data_frame <-
read.csv("Distemper.csv",
header = TRUE,
stringsAsFactors = FALSE)
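To see what read.csv() does with a comma-separated file, here is a minimal sketch that does not depend on the tutorial data. It writes three made-up rows (hypothetical values, not taken from Distemper.csv) to a temporary file and reads them back. The object names `csv_lines`, `tmp_file`, and `demo` are assumptions used only for illustration.

```r
# Hypothetical mini data set, written as CSV text
csv_lines <- c(
  "Month,Weight,Species",
  "January,26.5,S",
  "February,13.0,R",
  "March,10.5,R"
)

# Save it to a temporary file so nothing in your folder is touched
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# header = TRUE: the first line holds the column names
# stringsAsFactors = FALSE: keep text columns as character
demo <- read.csv(tmp_file, header = TRUE, stringsAsFactors = FALSE)

demo
dim(demo)   # number of rows, then number of columns
```

Each comma in a line marks the boundary between two columns, which is why the three text lines become three rows with three columns.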
Note: You can check if the data you read in Step 2 is correct by using the R built-in function View(data_frame_name).
View(my_data_frame)
Below you can see the first 10 lines of the data frame. Important things to notice about the data frame are:
1. Number of columns
2. Number of rows
3. Spelling of things
4. Type of data within each column
5. Basic formats
Keep these things in mind as we clean the data in the next few steps.
| MoNTH | weight | species | RESUTLT |
|---|---|---|---|
| January | 26.5 | S | Positive |
| February | 13.0 | R | Positive |
| February | 8.0 | R | Negative |
| March | 18.5 | S | Negative |
| March | 10.5 | R | Positive |
| April | 6.5 | S | Negative |
| April | 1.0 | R | Negative |
| April | 5.0 | R | Negative |
| May | 26.0 | R | Negative |
| May | 0.5 | R | Negative |
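The five checks listed above can also be run from the console. The sketch below rebuilds the first four rows of the table by hand (the object name `preview` is an assumption, and the real file has more rows) and applies one base R function to each check.

```r
# First four rows of the table above, typed in by hand
preview <- data.frame(
  MoNTH   = c("January", "February", "February", "March"),
  weight  = c(26.5, 13.0, 8.0, 18.5),
  species = c("S", "R", "R", "S"),
  RESUTLT = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

ncol(preview)             # 1. number of columns
nrow(preview)             # 2. number of rows
names(preview)            # 3. spelling of the column names
sapply(preview, class)    # 4. type of data within each column
str(preview)              # 5. compact overview of the basic format
```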
Once you have loaded and reviewed the data, we can start cleaning it.
There are two essential steps in cleaning data: 1. make aesthetic changes, and 2. specify the data types. Making these changes will make the data frame easier to use later.
You can check the current column names by using the function names(data_frame_name).
names(my_data_frame)
## [1] "MoNTH" "weight" "species" "RESUTLT"
To make changes to the column names, use the function colnames(data_frame_name) and specify the new names using c("name1", "name2", ...). The c() function in R combines its arguments into a vector, in this case a character vector of names.
colnames(my_data_frame) <-
c("Month", "Weight", "Species", "Diagnostic")
Double check that you have correctly named your columns by rerunning or typing the function names( ).
names(my_data_frame)
## [1] "Month" "Weight" "Species" "Diagnostic"
A data type describes the kind of values a data item can take. R has five commonly used basic data types: character, numeric, integer, complex, and logical.
To check the type/class of the columns in the data frame we use the R function sapply(). The function sapply() takes two arguments: 1. your data frame, and 2. the function to apply to each column, in this case class, which returns the data type.
sapply(my_data_frame, class)
## Month Weight Species Diagnostic
## "character" "numeric" "character" "character"
Categorical data in R is called a factor. Categorical data refers to a variable that can take only one of a fixed, finite set of possibilities, or levels.
For example, the column denoting Species needs to be set as a factor with levels “R” (Raccoon) and “S” (Skunk). We do this by using the R function as.factor(). Note, since the Species column only contains two different values, “S” and “R”, R will automatically set these two possibilities as levels.
my_data_frame$Species <- as.factor(my_data_frame$Species)
We do the same thing for the column Month. Note, for Month we need to set the levels to month.name, a built-in character vector in R. This keeps the months in calendar order and covers the case where our data frame does not include every month of the year.
my_data_frame$Month <-
factor(my_data_frame$Month, levels = month.name)
If we run month.name by itself we can see that the object is a character vector with all the months of the year in order.
month.name
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
After changing the class/type of all of the desired columns, we can re-check the data frame’s data types.
sapply(my_data_frame, class)
## Month Weight Species Diagnostic
## "factor" "numeric" "factor" "character"
R has built-in functions that allow us to plot different kinds of graphics, for example, line charts, time series, bar plots, box plots, histograms, density plots, and scatter plots, among others.
For this tutorial we will focus on three basic plots: bar plots, box plots, and density plots.
A bar plot represents categorical data with heights proportional to the values they represent.
Before we can plot our graphic we need to transform the data frame into the correct data structure, in this case, a table. We can use the R function table() to do this transformation. The function table() in R uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels. These counts will become the height of each of the bars.
For this example we only want to transform the column Month into a table. The function table() allows us to count the number of times a given month appears in the column Month of our data frame.
month_count <- table(my_data_frame$Month)
month_count
##
## January February March April May June July
## 1 2 2 3 4 5 8
## August September October November December
## 10 7 4 3 1
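Because Month was stored as a factor with levels = month.name, table() lists the months in calendar order and would report a month with no cases as 0. A small sketch with made-up months (the vector `some_months` is hypothetical) shows the difference from a plain character vector.

```r
some_months <- c("March", "January", "March", "May")

# Character vector: only the observed months, in alphabetical order
table(some_months)

# Factor with all twelve levels: calendar order, zeros for missing months
months_factor <- factor(some_months, levels = month.name)
month_table <- table(months_factor)
month_table

month_table["February"]   # 0, because February never occurs in some_months
```

Keeping the zero-count months matters for the bar plot, since every month then gets its own position on the axis.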
Now that we have transformed our data into the correct data structure, we can continue the steps to create our bar plot.
The most basic bar plot you can make only requires you to add the height argument inside the function barplot()
barplot(height = month_count)
To add a title to the plot use the argument main = "your title here"
barplot(height = month_count,
main = "Number of Reported Cases per Month")
To add a label to the x-axis use the argument xlab = "your x-axis label here"
barplot(height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month")
To add a label to the y axis use the argument ylab = "your y-axis label here"
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases"
)
To change the border color of the bars use the argument border = "color name here"
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow"
)
To change the fill color of the bars use the argument col = "color name here"
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "purple"
)
To add texture to the bars use the argument density = number, where the number is the density of the shading lines in lines per inch.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "purple",
density = 10
)
To change the angle of the shading lines use the argument angle = number, where the number is the angle in degrees. Use any angle from 0 to 360 degrees.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "purple",
density = 10,
angle = 20
)
If you want to alternate angles, specify the desired angles using the function c() to create a vector of angles. If there are fewer angles than bars, the angles are reused in order, so they alternate between the bars.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "purple",
density = 10,
angle = c(45, 90)
)
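R reuses (recycles) a short vector when more values are needed than were supplied. This sketch, assuming 12 bars as in month_count, uses rep_len() only to display the pattern of angles across the bars; the names `n_bars` and `angle_per_bar` are made up for illustration.

```r
n_bars <- 12                      # one bar per month
angles <- c(45, 90)               # only two angles supplied

# rep_len() repeats the vector until it reaches the requested length
angle_per_bar <- rep_len(angles, n_bars)
names(angle_per_bar) <- month.name
angle_per_bar
```

The first, third, fifth, ... bars get 45 degrees and the others get 90 degrees.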
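To change the space between the bars use the argument space = number, where the number is given as a fraction of the average bar width. Note, for the plots below we also changed the fill color to orange.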
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "orange",
density = 10,
angle = c(45, 90),
space = 2
)
To change the direction of the bars use the argument horiz = TRUE. By adding this argument, the bars will be drawn horizontally instead of vertically.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "orange",
density = 15,
angle = c(45, 90),
space = 2,
horiz = TRUE
)
To change the orientation of the axis tick labels use the argument las = number, where las sets the style (orientation) of the labels.
The number for the orientation can be: 0: parallel to the axis (the default); 1: always horizontal; 2: perpendicular to the axis; 3: always vertical.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Month",
ylab = "Number of Cases",
border = "yellow",
col = "orange",
density = 15,
angle = c(45, 90),
space = 2,
horiz = TRUE,
las = 2
)
Because the bars are now horizontal, the axis titles need to be swapped. To do this, simply exchange the argument names: change “xlab” to “ylab”, and vice versa.
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
ylab = "Month",
xlab = "Number of Cases",
border = "yellow",
col = "orange",
density = 15,
angle = c(45, 90),
space = 2,
horiz = TRUE,
las = 2
)
Notice that some of the y-axis labels are cut off. To fix this we can set the plot margins to our desired size so that all of the plot fits properly. To do this we use the function par(mar = c(bottom, left, top, right)), where each number is the size of one margin in lines of text. par stands for parameter and ‘mar’ stands for margin. You need to call the function before your plot code. The default values are par(mar = c(5.1,4.1,4.1,2.1)); below we widen the left margin to 8.1. Note, for the plot below we deleted the ylab, just for aesthetics.
par(mar = c(5.1,8.1,4.1,2.1))
barplot(
height = month_count,
main = "Number of Reported Cases per Month",
xlab = "Number of Cases",
border = "yellow",
col = "orange",
density = 15,
angle = c(45, 90),
space = 2,
horiz = TRUE,
las = 2
)
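Margins set with par() stay in effect for every later plot on the same graphics device. A common pattern, sketched here with made-up counts (the vector `demo_counts` is hypothetical), is to store the previous settings that par() returns and restore them afterwards.

```r
demo_counts <- c(September = 7, October = 4, November = 3)

# par() returns the old values of the settings it changes
old_par <- par(mar = c(5.1, 8.1, 4.1, 2.1))   # bottom, left, top, right

barplot(height = demo_counts, horiz = TRUE, las = 2,
        xlab = "Number of Cases")

par(old_par)   # restore the previous margins
par("mar")     # check the margins now in effect
```

Restoring the old values keeps the wider left margin from carrying over into the density and box plots that follow.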
A density plot is a representation of the distribution of a numeric variable. The height of the curve at a given point is a smoothed estimate (a kernel density estimate) of how concentrated the observed values are around that point.
In order to plot a density graph, we need to transform the data using the R function density(). For this example, we will transform only the Weight column of our data frame.
density_of_Weight <- density(my_data_frame$Weight)
density_of_Weight
##
## Call:
## density.default(x = my_data_frame$Weight)
##
## Data: my_data_frame$Weight (50 obs.); Bandwidth 'bw' = 3.379
##
## x y
## Min. :-9.636 Min. :2.654e-05
## 1st Qu.: 5.182 1st Qu.:2.974e-03
## Median :20.000 Median :8.106e-03
## Mean :20.000 Mean :1.685e-02
## 3rd Qu.:34.818 3rd Qu.:2.954e-02
## Max. :49.636 Max. :5.487e-02
After transforming the Weight column into a density object and saving the result in a new object, we can draw our density plot. To plot the density curve we use the R function plot().
plot(density_of_Weight)
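The curve returned by density() is a kernel density estimate: a smoothed curve whose total area is 1. The sketch below uses ten made-up weights (hypothetical, not the tutorial data) to show the paired x and y values that plot() draws and to check the area numerically.

```r
demo_weights <- c(0.5, 1, 5, 6.5, 8, 10.5, 13, 18.5, 26, 26.5)

demo_density <- density(demo_weights)

# The result stores the curve as paired x (weight) and y (density) values
head(data.frame(x = demo_density$x, y = demo_density$y))

# Approximate the area under the curve: sum of heights times grid spacing
step <- diff(demo_density$x[1:2])
area <- sum(demo_density$y) * step
area   # close to 1
```

This also explains why the x range in the printed summary extends below the smallest weight: the smoothed curve spreads a little beyond the observed values on both sides.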
To add a title to the plot use the argument main = "your title here"
plot(density_of_Weight,
main = "Density plot")
To add a label to the x-axis use the argument xlab = "your x-axis label here"
plot(density_of_Weight,
main = "Density plot",
xlab = "Weight")
To add a label to the y-axis use the argument ylab = "your y-axis label here"
plot(density_of_Weight,
main = "Density plot",
xlab = "Weight",
ylab = "Probability Density")
To change the color of the line use the argument col = "new color here".
plot(density_of_Weight,
main = "Density plot",
xlab = "Weight",
ylab = "Probability Density",
col = "red")
If you want to change the color of the area under the curve, call the function polygon() after plot() and specify the fill color with col. Note, for this example we use an RGB code for the fill color. The rgb() function takes three arguments representing the intensities of red, green, and blue, each between 0 and 1 by default.
plot(density_of_Weight,
main = "Density plot",
xlab = "Weight",
ylab = "Probability Density")
polygon(density_of_Weight,
col = rgb(0.2, 0.4, 0.6))
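rgb() returns the color as a hexadecimal string, and it accepts an optional alpha argument for transparency. This sketch shows the string behind the color used above and a half-transparent variant; the weights and the object names are made up for illustration.

```r
fill_color <- rgb(0.2, 0.4, 0.6)                 # red, green, blue between 0 and 1
fill_color                                       # "#336699"

see_through <- rgb(0.2, 0.4, 0.6, alpha = 0.5)   # same color, half transparent

demo_density <- density(c(0.5, 1, 5, 6.5, 8, 10.5, 13, 18.5, 26, 26.5))
plot(demo_density, main = "Density plot", xlab = "Weight")
polygon(demo_density, col = see_through, border = fill_color)
```

A transparent fill is mainly useful when several curves overlap, because the curves underneath stay visible.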
Relationship between a box plot and a density curve (image: Wikipedia)
For the basic box plot call, you need to feed two arguments into the function: 1. The data to look at: the function boxplot() takes formulas of the form y ~ x, where y is a numeric vector that is grouped according to the values of x. 2. Your data frame.
boxplot(Weight ~ Species, data = my_data_frame) # "Weight ~ Species" reads: Weight grouped by Species
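A box plot summarizes each group with five numbers: lower whisker, lower hinge (roughly the first quartile), median, upper hinge (roughly the third quartile), and upper whisker. The sketch below uses made-up weights (the data frame `demo` is hypothetical) and plot = FALSE so that boxplot() returns those numbers instead of drawing.

```r
demo <- data.frame(
  Weight  = c(13, 8, 10.5, 1, 5, 26.5, 18.5, 6.5, 20, 12),
  Species = factor(c("R", "R", "R", "R", "R", "S", "S", "S", "S", "S"))
)

# plot = FALSE returns the statistics without drawing
box_stats <- boxplot(Weight ~ Species, data = demo, plot = FALSE)

box_stats$names   # one column of statistics per group: "R" "S"
box_stats$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker

# The median row matches a direct group-wise calculation
tapply(demo$Weight, demo$Species, median)
```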
To add a title to the plot use the argument main = "your title here"
boxplot(Weight ~ Species, data = my_data_frame,
main = "Box Plot Weight and Species")
To add a label to the x-axis use the argument xlab = "your x-axis label here"
boxplot(Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species")
To add a label to the y axis use the argument ylab = "your y-axis label here"
boxplot(
Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species",
ylab = "Weight"
)
To change the fill color of the boxes use the argument col = "color name here". Use c() to give each box its own color.
boxplot(
Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species",
ylab = "Weight",
col = c("pink", "skyblue")
)
To change the border color of the boxes use the argument border = "color name here"
boxplot(
Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species",
ylab = "Weight",
col = c("pink", "skyblue"),
border = c("red", "gray")
)
To change the names of the individual boxes, use argument names = "new names here" and use c( ) to list all names if more than one.
boxplot(
Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species",
ylab = "Weight",
col = c("pink", "skyblue"),
border = c("red", "gray"),
names = c("Raccoon", "Skunk")
)
### Step 8: Change the color of the axis
To change the color of the tick labels on the y and x axes, use the argument col.axis = "color here".
boxplot(
Weight ~ Species,
data = my_data_frame,
main = "Box Plot Weight and Species",
xlab = "Species",
ylab = "Weight",
col = c("pink", "skyblue"),
border = c("red", "gray"),
names = c("Raccoon", "Skunk"),
col.axis = "#69b3a2"
)
In order to obtain prevalence and incidence we need to count the number of positive and negative diagnostics. For this, we will use, once again, the function table(). Remember that the function table() in R creates a table of the counts at each combination of factor levels. In this example the function table() will count the number of times each diagnostic was made, either positive or negative (the values of the column Diagnostic).
diagnostic <- table(my_data_frame$Diagnostic)
diagnostic
##
## Negative Positive
## 24 26
Next, we need to transform the table into a data frame using the function as.data.frame().
diagnostic <- as.data.frame(diagnostic)
diagnostic
## Var1 Freq
## 1 Negative 24
## 2 Positive 26
To obtain only the number of positive samples we need to access our new data frame by using a specific index. In this case our count of positive diagnostics is located in the second (2) row and the second (2) column. In R this is written as [2, 2], that is, [row, column].
positive <- diagnostic[2, 2]
positive
## [1] 26
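A positional index such as [2, 2] can return the wrong count if the row order ever changes. As a hedged alternative, the row can be selected by its label. The sketch rebuilds the counts shown above (24 negative, 26 positive) in a hypothetical vector `results`; because of how the table is built here, the label column is called results rather than Var1.

```r
# Rebuild the diagnostic counts shown above
results <- rep(c("Negative", "Positive"), times = c(24, 26))
demo_counts <- as.data.frame(table(results))
demo_counts

# Select the Freq value in the row whose label is "Positive"
positive_by_name <- demo_counts$Freq[demo_counts$results == "Positive"]
positive_by_name   # 26
```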
The number of animals in our data frame is equal to our sample size. Each row is one animal, so our sample size is equal to the number of rows in our data frame (i.e., rows = 50). To obtain the number of rows in our data frame, we can use the function nrow().
number_of_Species <- nrow(my_data_frame)
number_of_Species
## [1] 50
Prevalence is the number of animals in the sample that tested positive, divided by the total number of animals in the sample. Multiplying by 100 expresses it as a percentage.
# Basic formula to calculate prevalence
prevalence <- (positive / number_of_Species) * 100
prevalence
## [1] 52
Incidence is the number of animals in the sample that tested positive, divided by the population number. Here we express it per 10,000 inhabitants.
Hence, to calculate incidence we need to set the population number. For this example we are using the population of Guatemala (16 million inhabitants).
population <- 16000000
# Basic formula to calculate incidence
incidence <- (positive / population) * 10000
incidence
## [1] 0.01625
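Both formulas above follow the same pattern: a count of cases divided by a denominator, times a multiplier. A small helper function makes the multiplier explicit. The name `rate_per` and its arguments are assumptions made for this sketch; it is not a base R function.

```r
# Hypothetical helper: cases per `per` individuals
rate_per <- function(cases, denominator, per = 100) {
  (cases / denominator) * per
}

cases <- 26   # positive diagnostics counted above

rate_per(cases, 50)                      # prevalence in percent: 52
rate_per(cases, 16000000, per = 10000)   # cases per 10,000 inhabitants: 0.01625
```

Writing the multiplier as an argument makes it harder to forget which unit a result is reported in.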
subset_my_data_frame_2 <- my_data_frame[, c("Species", "Diagnostic")]
head(subset_my_data_frame_2)
## Species Diagnostic
## 1 S Positive
## 2 R Positive
## 3 R Negative
## 4 S Negative
## 5 R Positive
## 6 S Negative
Species_frequency <- table(subset_my_data_frame_2)
Species_frequency
## Diagnostic
## Species Negative Positive
## R 13 15
## S 11 11
Species_frequency <- as.data.frame(Species_frequency)
Species_frequency
## Species Diagnostic Freq
## 1 R Negative 13
## 2 S Negative 11
## 3 R Positive 15
## 4 S Positive 11
Raccoon_frequency <- subset(Species_frequency, Species == "R")
head(Raccoon_frequency)
## Species Diagnostic Freq
## 1 R Negative 13
## 3 R Positive 15
Skunk_frequency <- subset(Species_frequency, Species == "S")
head(Skunk_frequency)
## Species Diagnostic Freq
## 2 S Negative 11
## 4 S Positive 11
Raccoon_prevalence <-
(Raccoon_frequency[2, 3] / sum(Raccoon_frequency$Freq)) * 100
Raccoon_prevalence
## [1] 53.57143
Skunk_prevalence <-
(Skunk_frequency[2, 3] / sum(Skunk_frequency$Freq)) * 100
Skunk_prevalence
## [1] 50
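The per-species steps above can also be written in one pass. This sketch rebuilds the counts from the contingency table (13 negative and 15 positive raccoons, 11 negative and 11 positive skunks) in a hypothetical data frame `demo` and uses tapply() to take the share of positives within each species.

```r
demo <- data.frame(
  Species    = rep(c("R", "R", "S", "S"), times = c(13, 15, 11, 11)),
  Diagnostic = rep(c("Negative", "Positive", "Negative", "Positive"),
                   times = c(13, 15, 11, 11))
)

# mean() of a TRUE/FALSE vector is the proportion of TRUE values
prevalence_by_species <-
  tapply(demo$Diagnostic == "Positive", demo$Species, mean) * 100
prevalence_by_species
```

The result holds one prevalence value per species, so it can be passed directly as the height argument of barplot().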
barplot(
height = c(Raccoon_prevalence, Skunk_prevalence),
main = "Prevalence per Species",
xlab = "Species",
ylab = "Prevalence (%)",
names.arg = c("Raccoon", "Skunk"),
col = c("orange", "cyan"),
border = rgb(0,0.5,1),
col.axis = "#69b3a2", # changes the color of the axis tick labels
cex.axis = 1.5, # changes the size of the axis tick labels
cex.lab = 0.8, # changes the size of the axis titles
cex.main = 1.5 # changes the size of the main title
)