Lets Prepare our files:

Make sure R and R Studio are downloaded.
Create a folder on your desktop called “Herpetology CURE”
Open RStudio. Click on the “New” button “white sheet with a green plus sign on it” and select “R Markdown”. Name it with your last name and a title.
Go to “File” and save it to your “Herpetology CURE” folder.
Download the Excel data file from BlackBoard. Click on the “SceloporusData tab”. Click on “File” –> “Save as” –> “ScelopData” as a file format of “Comma Separated Values”. Make sure this is also saved in the “Herpetology CURE” file.

At this point, you now have a file that will allow you run your analysis and make your plots (the RMarkdown file) and a data file that will run properly in R (CSV file).

Important Notes on Formatting CSV File:

R does not like to have spaces in the column variable names or file names
Safe special character choices are “-”, “_” and “.”

Helpful hints of using RStudio:

To run a line of code, put your cursor at the end of the line and press “Ctrl Enter” on Windows and “Cmd Enter” on a Mac
Save often, it does not autosave
You have to install packages once, and load the libraries every time

Annotating in R Markdown:

Read in Data

#import your data set to R
# header = T tells R that the first line of the file you are reading in is actually your variable names and not your first data point. 
dat=read.csv("ScelopData.csv", header = T)

At this stage, you want to check and make sure that the data appears in your environment (top right hand corner of the RStudio dashboard). If you see “dat” in the enviroment, then it is properly loaded into R. Click on it, and look at the file to make sure it contains the data and appears the way that you could expect, matching your excel file.

Format Variable Types

# Identify each variable type as labeled by R on import
# This code in sentence form is essentially "for dataframe "dat", tell me that class of each variable"
lapply(dat, class)

## $OvipDate
## [1] "character"
## 
## $Cage
## [1] "integer"
## 
## $Egg
## [1] "integer"
## 
## $Treatment
## [1] "integer"
## 
## $Temperature
## [1] "character"
## 
## $Moisture
## [1] "character"
## 
## $EggMass
## [1] "numeric"
## 
## $HatchingSuccess
## [1] "integer"
## 
## $HatchDate
## [1] "character"
## 
## $IncDuration
## [1] "integer"
## 
## $Sex
## [1] "character"
## 
## $SVL
## [1] "integer"
## 
## $TL
## [1] "character"
## 
## $HatchMass
## [1] "numeric"

Often, the data variables are read in as numerical when they should be categorical, or categorical when they should be numeric. If that is the case, you will need to change them. In this case, multiple variables were the incorrect class.

Treatment should be categorical (character)
Cage should be categorical (character)
Egg should be categorical (character)
Tail Length should be numerical
HatchingSuccess (yes/no) should be categorical (character)
SVL should be numerical

#load the dplyr package each time you open R
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#change each variable to the appropriate type. 
#This can be done by telling R to take a variable from a dataframe and to format it "as" a certain type. 
# Example: So you can say "look at treatment within the dat file,  overwrite the current numerical Treatment varaible "as" a character type"
  dat$Treatment=as.character(dat$Treatment) 
  dat$Cage=as.character(dat$Cage) 
  dat$Egg=as.character(dat$Egg)
  dat$TL=as.numeric(dat$TL)

## Warning: NAs introduced by coercion

  dat$HatchingSuccess=as.character(dat$HatchingSuccess)
  dat$SVL=as.numeric(dat$SVL)
  
#double check column class after changes
lapply(dat, class)

## $OvipDate
## [1] "character"
## 
## $Cage
## [1] "character"
## 
## $Egg
## [1] "character"
## 
## $Treatment
## [1] "character"
## 
## $Temperature
## [1] "character"
## 
## $Moisture
## [1] "character"
## 
## $EggMass
## [1] "numeric"
## 
## $HatchingSuccess
## [1] "character"
## 
## $HatchDate
## [1] "character"
## 
## $IncDuration
## [1] "integer"
## 
## $Sex
## [1] "character"
## 
## $SVL
## [1] "numeric"
## 
## $TL
## [1] "numeric"
## 
## $HatchMass
## [1] "numeric"

Now each variable is the proper class to allow analysis. We can move forward with preparing the data for analysis.

Create additional necessary columns

Many of you chose to include analyses of measures that do not exist yet, such as Body Condition and Total Offspring Mass Investment. We will need to calculate those values for each row of data.

Body Condition

#Principle Component Analyses in R only works when you have "complete data". So if you have any rows in your data frame that have HatchlingMass or SVL measurements missing, the analysis will fail and produce an error. 
#This code tells it to only keep rows in which individuals have a SVL value entered. 
#This by default eliminates all eggs that did not survive to hatching and will only be a portion of the 67 original eggs (number of rows in the "dat" dataset).
dat.comp=dat[complete.cases(dat$SVL),]

You now have a dataset named “dat.comp”, as it is the data we intend to use in the component analysis. This dataframe only has 28 rows/observations, as there were only 28 individuals that have both a SVL and hatchling mass value.

For more information on understanding Principle Component Analysis, see:

https://www.youtube.com/watch?v=HMOI_lkzW08
https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186#:~:text=Principal%20component%20analysis%2C%20or%20PCA,more%20easily%20visualized%20and%20analyzed

#Create a subset of data that includes only the variables you will run the PCA on. We want to compare only HatchlingMass and SVL. So we are creating a dataframe called "PCA" that includes only the SVL and HatchMass columns from the dat.comp dataframe. 
PCA=select(dat.comp, SVL, HatchMass)
#You will notice that if you look in the environment, the PCA dataset now only has two variables (SVL and HatchlingMass)

#The PCA test requires input on whether or not your two variables of interest are correlated with one another. So we will run a cor.test function of SVL and HatchMass.
cor.test(PCA$SVL, PCA$HatchMass)

## 
##  Pearson's product-moment correlation
## 
## data:  PCA$SVL and PCA$HatchMass
## t = 3.0755, df = 26, p-value = 0.004896
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1776261 0.7458411
## sample estimates:
##      cor 
## 0.516477

#Your p-value is 0.004 is significant, indicating that they are correlated. So in the next line of code, you will want to put "TRUE" for the "cor =" function. Although  note the "cor" value (equivalent to R value for the equation) is only 0.51. So it is not hugely correlated. 

#Run a "princomp" test to run a PCA anlaysis. 
#Start by naming the test "off.meas.pr" so that later, we can use the saved values. 
#run "princomp" on dataset "PCA", indicating correlation between variables, and printing scores for each row. 
off.meas.pr=princomp(PCA, cor=T, scores=TRUE)

#Ensure that the PCA ran by looking at the output
summary(off.meas.pr, loadings=T)

## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.2314532 0.6953582
## Proportion of Variance 0.7582385 0.2417615
## Cumulative Proportion  0.7582385 1.0000000
## 
## Loadings:
##           Comp.1 Comp.2
## SVL        0.707  0.707
## HatchMass  0.707 -0.707

You now have run the PCA successfully to get a single value of “Body Condition” for each observation in the dataset. However, they are stored in the PCA results “off.meas.pr”. We need them to be in the dataset that was used to calculate the value. So now we need to extract the scores from “off.meas.pr” and put them in the “dat.comp” file.

dat.comp= dat.comp %>%
#merge the scores from "off.meas.pr" into the dat.comp file
  cbind(off.meas.pr$scores) %>%
#remove Principle Comp.2 from the file, as we will be using Comp.1 as the Body condition value
  subset(select = -c(Comp.2))

#Rename Comp.1 to something with meaning. In this case, Comp.1, or Principle Component 1, is a measure of Body Condition which we will name "BodyConPCA"
names(dat.comp)[names(dat.comp) == "Comp.1"] <- "BodyConPCA"

The dat.comp file now has all 28 individuals with SVL and HatchlingMass values, but also with the newly calculated Body Condition values.

Offspring Mass Investment

We will follow the exact same process to calculate a single value of offspring inventment including a measure of egg mass and hatchling mass

#Get a single value of investment into offspring mass 
PCA2=select(dat.comp, EggMass, HatchMass)
cor.test(PCA2$EggMass, PCA2$HatchMass)

## 
##  Pearson's product-moment correlation
## 
## data:  PCA2$EggMass and PCA2$HatchMass
## t = 4.0748, df = 26, p-value = 0.0003845
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3274793 0.8089515
## sample estimates:
##       cor 
## 0.6242843

off.mass.pr=princomp(PCA2, cor=T, scores=TRUE)

summary(off.mass.pr, loadings=T)

## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.2744741 0.6129565
## Proportion of Variance 0.8121422 0.1878578
## Cumulative Proportion  0.8121422 1.0000000
## 
## Loadings:
##           Comp.1 Comp.2
## EggMass    0.707  0.707
## HatchMass  0.707 -0.707

dat.comp= dat.comp %>%
  cbind(off.mass.pr$scores) %>%
  subset(select = -c(Comp.2))

names(dat.comp)[names(dat.comp) == "Comp.1"] <- "OffMassPCA"

At this point you have created a measure of “OffMassPCA” that is a single value of mass investment and merged it back into the dat.comp file. You now have the single dat.comp file which includes all individuals that had recorded SVL, hatchling mass, egg mass, BodyConPCA, and OffMassPCA.

However, we want one single data set that can be used for the whole class, rather than jumping back and forth between files. That often leads to messy analysis and mistakes. So we need to merge these newly calculated variables (24 individuals) back into the “dat” file with 67 individuals.

Merge to Create Final DataSet

#Create our new "go-to" file by merging the new varibales (BodyConPCA and OffMassPCA)from the dat.comp file into the original file "dat" by matching the "Egg" ID in both files. "All.x = T" tells R to keep all of the rows regardless if they have a value for Body Condition and Offspring Mass or not. In this case, it puts in an "NA" for any individual who does not have the value. If this is "All.x=F", it will only keep the 24 that have values. 

dat.full= merge(dat, dat.comp[,c("Egg", "BodyConPCA", "OffMassPCA")],by="Egg", all.x = T)

So… what data sets do we have now? dat = all data as read in from excel dat.comp = only complete data with newly calculated variables. dat.full = all data + the newly calculated values of Body Condition and Offspring Mass.

We will be using the dat.full dataset to perform all of our analyses moving forward. So we need to expore it for use in all of your R code moving forward and to archive it for use in publication.

#Take the dataset we just created ("dat.full"), and export it as a CSV file named "Scelop.Dat.Full". Use the row.names=F command to tell it not to print the row numbers to the file. If you do not include this, the first column is just a numbered list from 1-67, which is not useful to us. 
write.csv(dat.full, "Scelop.Dat.Full.csv", row.names = F)

Herpetology: Data Curation for Analysis

Abby Beatty & Class

2024-02-12

Lets Prepare our files:

Important Notes on Formatting CSV File:

Helpful hints of using RStudio:

Annotating in R Markdown:

Read in Data

Format Variable Types

Create additional necessary columns

Body Condition

Offspring Mass Investment

Merge to Create Final DataSet