Hello! Welcome to my final re-analysis of my senior thesis data (hopefully). If you’re not familiar with the project, feel free to check out my Hough Grant presentation here. In short, I built a bunch of environmental sensors and put them next to an imperiled dune thistle on Beaver Island, Michigan for 6 weeks.
My initial analysis was lacking for several reasons. The first is the usage of the environmental data from the Arduino-based sensors. Time series data are already tricky, but when the data has some inconsistencies, it gets even trickier. Our initial analysis led to us throwing away a lot of data and not using the data we could us to its full potential. The second area of weakness came in the measurement of the growth of our Pitcher’s thistles Cirsium pitcheri. While this task may seem simple, Cirsium’s rosette growing pattern means the purposeful die-back of older leaves to make way for younger leaves. This means, that within our time frame, not one metric (change in total leaf length, average leaf length, or number of leaves) was completely representative of an individual’s growth.
Since completing my senior thesis, I have taken several statistics courses and thought deeply about how to answer the question of Cirsium’s growth along the stark environmental gradient that the freshwater dune systems allow. In the following analysis, I believe I have addressed the aforementioned issues and created an analysis that I am proud of. Thank you for reading and any notes are appreciated!
Here are the libraries that I used in this analysis.
library(gdata)
library(tidyverse)
library(readr)
library(tseries)
library(MASS)
library(vegan)
library(rgl)
library(cluster)
library(gganimate)
library(transformr)
library(FactoMineR)
library(factoextra)
I am going to start by building out our environmental data that we will later used in a redundancy analysis (RDA). First, we focus on expanding upon soil temperature, then on soil moisture, and finally, we bring it all together along with our static environmental variables.
First and foremost, I load in all of the data from the arduino sensors. This is includes all the raw data (NAs and all) along with each measurement’s time. All the ‘Time’ variables are in hours since the first sensor went online.
EnvAll<-read.csv("C:/Users/benny/Dropbox/__SIP/DATA/AllEnvData.csv") #Grabbing environmental data
EnvAll$Sensor<-as.factor(EnvAll$Sensor)
Clean <- subset(EnvAll, SoilTemp > 5) #Cleaning data
As with any good journey, I have to start with some good-ole graphs. Here, I map our two time-based environmental factors: Soil Temperature and Soil Moisture. There are 40 sensors in total, which is a lot of graphs. I’ve tried to reduce the size of the graphs enough not overwhelm you, but I also didn’t want to shrink them too much as to affect readability (It’s a trade-off!).
x <- unique(as.numeric(Clean$Sensor)) #get a numeric value for each sensor to iterate through
for (y in x){ #for loop to graph each sensor
ModelData <- subset(Clean, Sensor == as.character(y)) #extract each sensor individually
r<-ggplot(data = (ModelData), aes(Time, SoilTemp, group = as.factor(Sensor), color =(Sensor))) + geom_line(aes(color = "Soil Temperature")) + geom_line(aes(y =SoilMoist, color = "Soil Moisture"))+ scale_y_continuous(sec.axis = sec_axis(~.*5, name = "Soil Moisture"))+ scale_colour_manual(values = c("blue", "red"))+ labs(y = "Soil Temperature", colour = y)+ theme(legend.position = c(0.8,0.9)) + xlim(0,835)
print(r)
}
As we can see, the data aren’t perfect (if you find a perfect dataset, rest well with your Nobel prize), but we can certainly do some cool analyses with this!
The concept behind what I wanted to do with the Temperature data is the heart of what has motivated me to attempt this analysis. Even when first working with this data, I was fixated on how sinusoidal the temperature data were. I thought it would be really cleaver to map a sine wave to each individual sensor’s soil temperature data. This would smooth out gaps in the data while still providing us with a picture of what each plant was feeling. I would then extract the amplitude and intercept from the model to use in later analysis. I didn’t know how to do that on my first go around, but - with a help from my friends and fellow Purdue Master’s students, Sarah Rademacher and Sarah Cuprewich - I was able to do exactly that!
The first step in making a sinusoidal model is checking to see if the data you are trying to use is ‘stationary’. That’s exactly what I check here with the adf tests. Low p-values means it is stationary. Spoiler alert, they are!
#########Testing if stationary###########
#For this analysis, time series data need to be stationary
adf.test(Clean$SoilTemp) #test if entire dataset is stationary
##
## Augmented Dickey-Fuller Test
##
## data: Clean$SoilTemp
## Dickey-Fuller = -12.863, Lag order = 23, p-value = 0.01
## alternative hypothesis: stationary
#p<0.05 which means it IS stationary
#Now we will test if the sime series for each individual sensor is stationary or not
x <- unique(as.numeric(Clean$Sensor)) #get a numeric value for each sensor to iterate through
for (n in x){ #for loop to test each sensor
ModelData <- subset(Clean, Sensor == as.character(n)) #extract each sensor individually
adf<-adf.test(ModelData$SoilTemp) #test the data
ifelse(adf$p.value > 0.05, print(as.character(n)), print('Acceptable')) #print sensor number if unacceptable
}
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
## [1] "Acceptable"
This, right here, is the neat part. This is what have been wanting to do for 3 years! Here is where I make a sine was model for each sensor and extract its amplitude and intercept for later analysis. The first step to doing that is to subset each sensor’s temperature data. Then, we have to make a a periodogram using the spectrum command. The periodigram is the other graph seen in this chunk. Using this, we can extract the periodicity of each of the sensor’s sine waves (the object per). We use this to feed the sine wave model! From there, it is just a matter of graphing the model and real data together and actually extracting the amplitude and intercepts. All sensors with fewer than 50 readings were excluded.
Note that I took the absolute value of the amplitude. Some were coming out negative, which makes sense in a sine wave, but would provide fake variation in our later analysis. In short, a 2 and a -2 amplitude show the same amount of temperature fluctuation in this data.
##########Building models##############
#empty vectors to fill with our models
TemperatureAmplitude <- vector()
TemperatureIntercept <- vector()
for (p in x){ #for loop creating sine wave models and extracting amplitude and intercept
detach()
ModelData <- subset(Clean, Sensor == as.character(p)) #subseting single sensor
ModelData <- ModelData[, -c(2:4)]
attach(ModelData) #attaching single sensor
ssp <- spectrum(SoilTemp) #creates a Periodogram
per <- 1/ssp$freq[ssp$spec==max(ssp$spec)] #extracting periodicity from periodogram
reslm3 <- lm(ModelData$SoilTemp ~ sin(2*pi/per*ModelData$Time)) #Creates model using sine wave
summary(reslm3) #checking it worked
#rg <- diff(range(SoilTemp)) # taking range of values
z <- ggplot(aes(x = Time, y = SoilTemp), data = na.omit(ModelData)) +
geom_point(shape = 1, size = 4, aes(color = Sensor))+
labs(y = "Soil Temperature", title = paste("Actual Data and Modeled Sine Wave from Sensor", p))+
geom_line(aes(y = fitted(reslm3), x =na.omit(Time)), color = "black", alpha=0.6, size = 1.25) + #adding fitted values as a line
theme(legend.position = c(0.8,0.9))+
scale_color_discrete(name = "Actual Data", labels = paste("Sensor",p)) +
xlim(0,835)
print(z)
TemperatureAmplitude <- append(TemperatureAmplitude,ifelse(length(ModelData$SoilTemp)>50,abs(reslm3$coefficients[2]),"")) #extracting ABSOLUTE VALUE amplitude from model
TemperatureIntercept <- append(TemperatureIntercept,ifelse(length(ModelData$SoilTemp)>50,abs(reslm3$coefficients[1]),"")) #extracting intercept from model
}
Now that the Soil Temperature stuff is taken care of, it is time to turn towards the Soil Moisture. Now, unfortunately, the Soil Moisture does not follow the same nice wavy pattern as the soil temperature data. Additionally, if you were paying close attention to the exploratory graph section, you may have noticed the wildly inconsistent y-axis for soil moisture! Ah! how do we handle this? Well, there is no perfect answer, but here is an answer. The first thing we do is alter our measurement. Because I know that there were several heavy rain events on the dunes during the six weeks the sensors were out there, I can assume that the maximum value is the soil at full saturation. We then alter the other measurements to be a percentage of saturation. The mean and standard deviation of this proportion are what we extract from each sensor for Soil Moisture. Because this doesn’t have the benefit of the sine wave to smooth out the gaps, I set a high barrier to be included in the analysis at 100 readings.
#Establish empty vectos
MoistMean<- vector()
MoistSD<- vector()
MoistCount<- vector()
for (m in x){ #for loop to alter the soil moisture to a percentage of saturation metric and then extract if enough readings
max<- max(subset(na.omit(Clean), Sensor == m)$SoilMoist) #find max for that sensor
min<- min(subset(na.omit(Clean), Sensor == m)$SoilMoist) #adjust for the minimum which was above 0 for all sensors
Moist<- (subset(na.omit(Clean), Sensor == m)$SoilMoist-min)/(max-min) # create propotion
#add values to our vectors
MoistMean<- append(MoistMean, mean(Moist))
MoistSD<- append(MoistSD, sd(Moist))
MoistCount<- append(MoistCount, length(subset(na.omit(Clean), Sensor == m)$SoilMoist))
}
Moisture<- cbind(MoistMean, MoistSD, MoistCount) #combine them into one datafram
#adjust to only include data for sensors with over 100 readings
Moisture<- mutate(as.data.frame(Moisture), SoilMoistureMean = ifelse(MoistCount >100, MoistMean,"" ))
Moisture<- mutate(as.data.frame(Moisture), SoilMoistureSD = ifelse(MoistCount >100, MoistSD, ""))
Moisture<- na_if(Moisture, "") #making blanks into proper 'NAs'. This is very important and kinda neat!
As I mentioned at the top, one of my frustrations was trying to decide on just a single metric for growth. So, instead of doing something crazy like, ‘making a decision’ (gross), I found a way to use all of them! Here I find the find the delta (d) in Total Leaf Length, Average Leaf Length, and Number of Leaves from the beginning of the study to the end. Finding the trend of each of the growth variables felt more powerful than just the change from the begining until the end. Because we had measurements from every two weeks, I was able to model a linear regressions each plant’s growth and extracted each metric’s slope to us in the late RDA
#Read in and adjust the data
leaf <- read.csv("C:/Users/benny/Dropbox/__SIP/DATA/Leaf Measurements csv.csv")
colnames(leaf)[1]<- "Sensor"
#Create blank vectors
dTotalLeafLength<- vector()
dAverageLeafLength<- vector()
dNumberLeaves<- vector()
#A list of all plants measured including those with no sensor denoted by a letter rather than a number
SensorList<- unique(leaf$Sensor)
#For loop to make models for each plant's growth and extrating the slope
for (b in 1:60) {
#model
TotalLeafTrend<-lm(TotalLeaf~Measurement, data = subset(leaf, Sensor == SensorList[b]) )
AverageLeafTrend<- lm(AvgLength~Measurement, data = subset(leaf, Sensor == SensorList[b]) )
NumLeafTrend<- lm(NumLvs~Measurement, data = subset(leaf, Sensor == SensorList[b]) )
#Extract
dTotalLeafLength<- append(dTotalLeafLength,TotalLeafTrend$coefficients[2])
dAverageLeafLength<- append(dAverageLeafLength,AverageLeafTrend$coefficients[2])
dNumberLeaves<- append(dNumberLeaves,NumLeafTrend$coefficients[2])
}
I know this is everybody’s favorite part: data frame management! Essentially what I am doing is preparing one matrix of the dependent, response, growth variables and one of the independent, explanatory, environmental variables. Really fun stuff!
I won’t bore you with too many details here
TooBig<-cbind(subset(leaf, Measurement==4), dTotalLeafLength, dAverageLeafLength, dNumberLeaves) #binding them together
#removing the sensor name as a vraible, re-adding the sensor names as the row names (helpful later), and then removing all the other things we do not need
Growth<- TooBig[,-1]
rownames(Growth)<- TooBig[,1]
Growth<- Growth[,44:46]
Growth<-Growth[-33,] #Sensor 33 was an outlier in Growth and thus removed from futur analysis
Making the Environmental data into one, properly formatted matrix.
GisInfo <- read.csv("GisInfo.csv") #adding in data from ArcGIS (position on the dune and distance from waterline)
#this dataframe (gdata) and command (cbindX) lets me combine data frame with different numbers of rows. This becomes important later because the rda will accept some NAs and still work. This lets me add the 20 plants we took growth measurements on without an arduino sensor. This increased the sample size by 50%!
library(gdata)
Environment<- cbindX( as.data.frame(TemperatureAmplitude), as.data.frame(TemperatureIntercept), as.data.frame(Moisture[,4:5]), as.data.frame(TooBig[,4:6]), as.data.frame(GisInfo[,2:3]))
rownames(Environment)<- TooBig[,1]#making the row names be the sensors
Environment<- na_if(Environment, "") #making proper NAs
#for some reason the vectors from our old for loops put everything into strings. In order to properly keep the values within, you have to first make them a character then a number. If you try to go straight to numeric, R spits out a bunch of weird number! How strange!
Environment$TemperatureAmplitude<- as.numeric(as.character(Environment$TemperatureAmplitude))
Environment$TemperatureIntercept<- as.numeric(as.character(Environment$TemperatureIntercept))
Environment$SoilMoistureMean<- as.numeric(as.character(Environment$SoilMoistureMean))
Environment$SoilMoistureSD<- as.numeric(as.character(Environment$SoilMoistureSD))
Environment<-Environment[-33,] #Sensor 33 also removed here
Okay folks, this is the fun part. It is the reason that we are all gathered here today. The Redundancy Analysis (RDA) is this neat test that let’s us take a group of X’s (Our environmental data) and use them to explain a group of Y’s (the growth data). This is what we have been building towards! It does this by binding a PCA co-linearly. It is pretty mathy, but very cool! It is essentially a linear regression that uses multiple explanatory variables to explain multiple response variables.
The plan here is two-fold. First, we will run the RDA on the actual data we collected as is. However, because our sample size is only 60 (with some NAs in some variables), we wanted to add in a bootstrap to understand the potential and distribution of possibilities. If we had no chance in capturing a explaining much of the variation between the environmental data and the growth of Pitcher’s thistle, then we have to question the quality of the data. But, if the potential is there in randomized simulations, but the actual data showed no relationship, then we have some evidence that these environmental factors are not effecting the growth of Pitcher’s thistle. Let’s find out!
Here is the RDA of the actual data. RDA’s are great because they produce pseudo-F values, p-values based on that, and R-squared effect sizes. All are classic metrics most people are familiar with. Before we launch into the code, I want to note a couple of things. First, within the rda command, the na.action = na.exclude is of particular importance. Because we are multidimensional space, some NAs are acceptable. There are limits within the command (important in the bootstrap), but this setting allows the NAs to be excluded without throwing the whole row of data away. For example, Sensor #4 did not meat the criteria to have the Soil Moisture data included. In different tests, the entire row would have to be thrown away, but because of this setting and this test, the other environmental factors (ex. Cover, Slope, TemperatureAmplitudes, etc), can still be used! This keeps our sample size at 60. Here, the RDA is also scaled to get all of our factors on the same page. Additionally, there are two types of biplots that an RDA can produce: a distance biplot and a correlation biplot. Because this analysis is focused on the relationships between variables and not differences between sites, we use the correlation biplot. This is indicated by the scaling = 2 setting within the plot() command.
RealRDA<- rda(Growth ~., Environment, na.action= na.exclude, scale = TRUE ) #The RDA itself.
summary(RealRDA) #This is a very long explaination of the PCAs created complete with Eiganvalues and everything. Very important for analysis of a significant model.
##
## Call:
## rda(formula = Growth ~ TemperatureAmplitude + TemperatureIntercept + SoilMoistureMean + SoilMoistureSD + Cover + Slope + Aspect + Position + Distance, data = Environment, scale = TRUE, na.action = na.exclude)
##
## Partitioning of correlations:
## Inertia Proportion
## Total 3.000 1.0000
## Constrained 1.433 0.4776
## Unconstrained 1.567 0.5224
##
## Eigenvalues, and their contribution to the correlations
##
## Importance of components:
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## Eigenvalue 0.9676 0.4584 0.006896 0.8032 0.7536 0.010211
## Proportion Explained 0.3225 0.1528 0.002299 0.2677 0.2512 0.003404
## Cumulative Proportion 0.3225 0.4753 0.477649 0.7454 0.9966 1.000000
##
## Accumulated constrained eigenvalues
## Importance of components:
## RDA1 RDA2 RDA3
## Eigenvalue 0.9676 0.4584 0.006896
## Proportion Explained 0.6753 0.3199 0.004813
## Cumulative Proportion 0.6753 0.9952 1.000000
##
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores: 3.631929
##
##
## Species scores
##
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## dTotalLeafLength 1.4055 -0.4953 -0.1120 -0.1172 -1.4608 0.12573
## dAverageLeafLength -0.1096 -1.2783 0.0752 1.4811 -0.7361 -0.09833
## dNumberLeaves 1.5057 0.3693 0.1101 -1.1508 -0.7986 -0.13936
##
##
## Site scores (weighted sums of species scores)
##
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## 1 -1.07909 0.380997 0.346660 2.444e-01 1.023e+00 -7.436e-02
## 2 -0.12333 0.635791 0.841904 2.702e-01 5.406e-01 -7.629e-01
## 3 0.09505 0.316165 0.987596 -3.784e-01 5.014e-01 -5.582e-01
## 4 -0.40072 1.729220 0.968161 -8.735e-01 1.105e+00 -1.341e-01
## 5 0.04672 2.089461 -0.179351 -9.032e-01 4.416e-01 4.563e-01
## 6 0.72379 -0.428218 -0.994673 -5.258e-01 -5.374e-01 4.611e-01
## 7 -0.60754 0.284341 0.402315 1.721e-01 7.021e-01 4.790e-02
## 8 1.21986 2.076090 2.363355 -8.136e-01 -3.551e-02 -1.357e+00
## 9 0.79050 1.738505 -0.933860 -1.573e+00 -4.460e-02 8.360e-01
## 10 1.78405 0.128209 -2.940846 -1.453e-01 -2.451e-01 1.092e+00
## 11 -0.54395 -1.699414 0.005812 1.405e+00 -1.812e-01 -1.870e-01
## 12 -0.93097 2.074939 -1.180673 -6.716e-01 3.995e-01 1.568e-01
## 13 0.56847 0.309478 -1.344912 -4.117e-01 -1.445e+00 -2.164e-01
## 14 -0.96635 -0.700719 0.372833 7.873e-01 6.411e-01 -6.378e-01
## 15 0.73246 0.001351 -0.493775 -1.219e-16 6.134e-17 -7.669e-16
## 16 0.87333 -1.164358 -2.631248 9.471e-01 -1.019e+00 1.455e+00
## 17 2.01285 2.657743 8.109862 -2.880e+00 -7.801e-01 -6.944e+00
## 18 0.80875 -0.727497 -1.152762 -1.673e-01 -4.226e-01 -2.908e-01
## 19 -0.63330 0.450441 0.197423 -2.837e-01 -8.654e-02 4.202e-01
## 20 -0.82423 -0.706961 0.510287 1.015e+00 7.223e-01 1.047e-01
## 21 0.01024 -0.999653 0.380261 8.031e-02 -3.992e-01 -2.641e-01
## 22 -0.58989 -0.607445 1.092041 3.485e-01 -3.763e-01 -1.050e+00
## 23 -0.01227 -2.126163 0.346283 2.955e-01 -3.316e-01 -7.556e-01
## 24 0.12714 -0.370409 0.791407 5.391e-01 1.206e-01 -2.463e-01
## 25 0.61670 -0.890088 -0.040738 3.828e-01 -1.342e+00 -1.366e-01
## 26 -0.06292 0.016087 0.034010 5.828e-01 3.325e-01 5.988e-01
## 27 -0.27150 -1.265815 1.006078 3.209e-01 -1.444e-02 -7.885e-01
## 28 -0.50299 -0.448926 -0.628491 5.810e-01 2.925e-01 6.459e-01
## 29 -0.16700 0.264769 0.623437 2.176e-01 3.605e-01 -7.442e-01
## 30 -0.34788 -0.485558 -0.566741 3.097e-01 4.414e-01 6.531e-01
## 31 -1.04971 -0.032161 -0.729806 -5.322e-01 6.777e-02 6.636e-01
## 32 0.09876 -1.121290 -0.422524 4.407e-01 -5.229e-01 3.444e-01
## 34 -1.09297 -0.046727 -0.089187 -3.756e-01 8.044e-01 8.542e-01
## 35 -0.09894 0.339638 -0.558737 -6.024e-01 -8.619e-01 -1.024e-01
## 36 -1.42436 -0.840974 0.926843 1.656e+00 2.210e-01 -1.717e-02
## 37 0.22031 2.660168 1.713277 -1.351e+00 1.063e+00 -1.167e-01
## 38 -0.58186 0.206416 0.312402 2.071e-01 6.447e-01 8.810e-02
## 39 -0.73664 0.967276 -0.122870 -2.736e-01 7.347e-01 9.316e-01
## 40 2.48571 -0.196826 0.915148 -1.002e+00 -7.753e-01 -7.203e-02
## A -0.81277 1.073849 0.560952 -2.135e-01 1.226e+00 2.302e-01
## B -1.65890 1.479694 0.470808 2.030e-02 2.204e+00 8.329e-01
## C -1.82935 0.546084 -0.247170 7.176e-01 1.977e+00 1.258e+00
## D -1.26354 0.485413 0.458950 4.260e-01 1.414e+00 3.812e-01
## E 0.32308 -1.284653 -0.882379 6.342e-01 -8.460e-01 2.207e-01
## F 1.01564 -0.769763 -1.780478 -9.866e-02 -1.302e+00 7.501e-01
## G -0.07159 0.252652 -0.641008 -1.195e-01 1.691e-01 6.286e-01
## H -0.89566 0.747563 0.728948 4.353e-02 1.172e+00 4.669e-02
## I -0.08726 0.047079 0.262999 2.061e-02 1.047e-01 -1.598e-01
## J -0.61392 0.377449 0.177157 1.164e-01 7.453e-01 2.605e-01
## K -1.11323 -0.125298 0.451547 7.292e-01 1.018e+00 1.498e-01
## L -0.86050 -1.345834 1.089188 1.363e+00 2.749e-01 -8.239e-01
## M 0.46146 -1.659112 -1.519105 7.936e-01 -1.137e+00 5.744e-01
## N -0.26673 -0.035779 0.935356 1.779e-01 2.465e-01 -6.440e-01
## O -1.95109 1.606375 -0.078384 1.100e-01 2.533e+00 1.462e+00
## P 0.94269 -0.294991 -2.375402 -3.596e-01 -1.040e+00 1.401e+00
## Q -0.62138 0.393879 0.224074 1.102e-01 7.596e-01 2.301e-01
## R 0.16346 1.628507 1.830873 -1.139e+00 5.278e-01 -1.149e+00
## S 0.27001 -0.288008 0.621109 2.645e-02 -3.739e-01 -7.204e-01
## T -0.48060 0.324561 0.591530 7.224e-02 5.983e-01 -1.597e-01
##
##
## Site constraints (linear combinations of constraining variables)
##
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## 1 -0.19780 -0.039509 -0.41494 2.444e-01 1.023e+00 -7.436e-02
## 2 0.40689 0.575994 -0.42924 2.702e-01 5.406e-01 -7.629e-01
## 3 0.28308 -0.445552 -0.05453 -3.784e-01 5.014e-01 -5.582e-01
## 4 -8.735e-01 1.105e+00 -1.341e-01
## 5 -0.05861 0.773922 0.01349 -9.032e-01 4.416e-01 4.563e-01
## 6 0.07058 -0.655370 -0.11013 -5.258e-01 -5.374e-01 4.611e-01
## 7 1.721e-01 7.021e-01 4.790e-02
## 8 0.79232 1.195976 0.68188 -8.136e-01 -3.551e-02 -1.357e+00
## 9 -1.573e+00 -4.460e-02 8.360e-01
## 10 1.53651 0.125312 -1.46045 -1.453e-01 -2.451e-01 1.092e+00
## 11 1.405e+00 -1.812e-01 -1.870e-01
## 12 -0.95720 1.049528 -1.30672 -6.716e-01 3.995e-01 1.568e-01
## 13 -0.71068 0.831263 -0.66373 -4.117e-01 -1.445e+00 -2.164e-01
## 14 -0.11009 -0.249709 -0.77754 7.873e-01 6.411e-01 -6.378e-01
## 15 0.73246 0.001351 -0.49378 -1.219e-16 6.134e-17 -7.669e-16
## 16 0.57918 0.581289 -0.11152 9.471e-01 -1.019e+00 1.455e+00
## 17 -2.880e+00 -7.801e-01 -6.944e+00
## 18 0.41088 -0.626219 -1.23318 -1.673e-01 -4.226e-01 -2.908e-01
## 19 -0.83303 0.188636 0.74580 -2.837e-01 -8.654e-02 4.202e-01
## 20 0.20662 -0.060565 0.22294 1.015e+00 7.223e-01 1.047e-01
## 21 -0.25016 -0.636799 0.33392 8.031e-02 -3.992e-01 -2.641e-01
## 22 -0.70750 0.045316 0.09468 3.485e-01 -3.763e-01 -1.050e+00
## 23 -0.12064 -1.565001 -0.32712 2.955e-01 -3.316e-01 -7.556e-01
## 24 0.47680 0.153564 0.44931 5.391e-01 1.206e-01 -2.463e-01
## 25 -0.20019 0.451772 0.72391 3.828e-01 -1.342e+00 -1.366e-01
## 26 0.47050 0.440082 0.57849 5.828e-01 3.325e-01 5.988e-01
## 27 -0.13107 -0.891541 0.08105 3.209e-01 -1.444e-02 -7.885e-01
## 28 5.810e-01 2.925e-01 6.459e-01
## 29 0.20343 0.268224 -0.50748 2.176e-01 3.605e-01 -7.442e-01
## 30 0.13488 -0.442065 -0.04899 3.097e-01 4.414e-01 6.531e-01
## 31 -1.25330 -0.678855 -0.00726 -5.322e-01 6.777e-02 6.636e-01
## 32 -0.07628 -0.273926 0.37888 4.407e-01 -5.229e-01 3.444e-01
## 34 -0.67007 -1.019521 0.38000 -3.756e-01 8.044e-01 8.542e-01
## 35 -1.03431 0.250468 -0.14551 -6.024e-01 -8.619e-01 -1.024e-01
## 36 -0.45795 0.865349 0.87429 1.656e+00 2.210e-01 -1.717e-02
## 37 0.35905 0.423995 0.76039 -1.351e+00 1.063e+00 -1.167e-01
## 38 2.071e-01 6.447e-01 8.810e-02
## 39 -0.31596 0.155756 0.49436 -2.736e-01 7.347e-01 9.316e-01
## 40 1.42167 -0.793166 1.27871 -1.002e+00 -7.753e-01 -7.203e-02
## A -2.135e-01 1.226e+00 2.302e-01
## B 2.030e-02 2.204e+00 8.329e-01
## C 7.176e-01 1.977e+00 1.258e+00
## D 4.260e-01 1.414e+00 3.812e-01
## E 6.342e-01 -8.460e-01 2.207e-01
## F -9.866e-02 -1.302e+00 7.501e-01
## G -1.195e-01 1.691e-01 6.286e-01
## H 4.353e-02 1.172e+00 4.669e-02
## I 2.061e-02 1.047e-01 -1.598e-01
## J 1.164e-01 7.453e-01 2.605e-01
## K 7.292e-01 1.018e+00 1.498e-01
## L 1.363e+00 2.749e-01 -8.239e-01
## M 7.936e-01 -1.137e+00 5.744e-01
## N 1.779e-01 2.465e-01 -6.440e-01
## O 1.100e-01 2.533e+00 1.462e+00
## P -3.596e-01 -1.040e+00 1.401e+00
## Q 1.102e-01 7.596e-01 2.301e-01
## R -1.139e+00 5.278e-01 -1.149e+00
## S 2.645e-02 -3.739e-01 -7.204e-01
## T 7.224e-02 5.983e-01 -1.597e-01
##
##
## Biplot scores for constraining variables
##
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## TemperatureAmplitude 0.41136 -0.3671546 0.333473 0 0 0
## TemperatureIntercept -0.22596 -0.5350788 -0.268211 0 0 0
## SoilMoistureMean 0.01568 -0.2479766 -0.182140 0 0 0
## SoilMoistureSD -0.22974 -0.0359403 0.076181 0 0 0
## Cover -0.11877 0.5151761 -0.185489 0 0 0
## Slope 0.04844 -0.0126253 0.314022 0 0 0
## AspectNE 0.05642 0.0573740 0.175166 0 0 0
## AspectNW 0.09690 0.1521536 -0.420275 0 0 0
## AspectSE -0.30488 -0.3193271 -0.004701 0 0 0
## AspectW -0.11161 -0.0642583 -0.228860 0 0 0
## PositionLake -0.36997 -0.0829240 0.616759 0 0 0
## PositionSwale 0.20490 0.0003779 -0.138129 0 0 0
## PositionTop 0.44859 -0.2141919 -0.342799 0 0 0
## Distance 0.18645 -0.4030000 0.400670 0 0 0
##
##
## Centroids for factor constraints
##
## RDA1 RDA2 RDA3 PC1 PC2 PC3
## AspectE 0.20037 0.144159 0.311222 0 0 0
## AspectNE 0.08418 0.085600 0.261342 0 0 0
## AspectNW 0.14457 0.227008 -0.627037 0 0 0
## AspectSE -0.36992 -0.387453 -0.005704 0 0 0
## AspectW -0.18960 -0.109154 -0.388760 0 0 0
## PositionInland -0.09942 0.425502 -0.496779 0 0 0
## PositionLake -0.19648 -0.044039 0.327546 0 0 0
## PositionSwale 0.73246 0.001351 -0.493775 0 0 0
## PositionTop 0.59955 -0.286270 -0.458155 0 0 0
AnovaRDA<-anova.cca(RealRDA) #This is what gives us F and P values
RealRsq<- RsquareAdj(RealRDA)$r.squared #extracting R-squared
RealF<-AnovaRDA$F[1] #extracting only the F value
AnovaRDA #readout of the RDA
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
##
## Model: rda(formula = Growth ~ TemperatureAmplitude + TemperatureIntercept + SoilMoistureMean + SoilMoistureSD + Cover + Slope + Aspect + Position + Distance, data = Environment, scale = TRUE, na.action = na.exclude)
## Df Variance F Pr(>F)
## Model 14 1.4329 1.1104 0.399
## Residual 17 1.5671
RealRsq #read out of effect size (Rsq)
## [1] 0.4776486
plot(RealRDA, display=c("wa", "bp"), cex = 2, scaling=2, main = "Cirsium Redundancy Analysis"); RDAscores <- scores(RealRDA, choices=1:2, display="sp"); arrows(0,0, RDAscores[,1], RDAscores[,2], length=0, lty=1, col="darkred", lwd = 2.5); text(RealRDA,display = "species", scaling = 2, cex = 1, col = "darkred", font = 2, adj = -.01) # Correlation Biplot of our Rda. Any notes on design/color would be appreciated!
As this output would indicate, we have a p-value > 0.05 which means our environmental variables are not explaining the the growth variables. Had we gotten a significant p-value, we would interpret the biplot by looking at how the arrows correlate. For example, because Position on the top of the dune and delta total leaf length are headed in similar directions, we could have concluded that those two variables we highly correlated. But, because the model is not significant, we can’t make any conclusions. The question is now whether or not we saw no significance due to our data being insufficient, or if there is really no biological relationship. That is why we are bootstraping
Okay, now is the randomization test part. The first step is to re-arrange our environmental matrix. However, there are a could of considerations for the Environmental data. As I have talked about before, the RDA can handle some NULLs, but not an overwhelming amount. When I first did this, I rearranged the two temperature variable and the two soil moisture variables separately. This made the matrix too “holey” even when using na.exclude. At first, I wanted to punch a hole through RStudio, but this I realized that moving the two Temperature and the two Soil Moisture variables separately didn’t make any sense because each set is intrinsically tied together. I adjusted that to move the two temperature variables together and also move the two soil moisture variables together. Temperature variable and soil moisture variables were moved independently one another, though. This almost fixed the issue entirely. However, about 20 times out of 10,000, the matrix still ends up too holey. Again, I almost threw my computer out of a window. These occurrences would halt the while loop and abort the randomization test! This, of course, was unacceptable. I utilized the tryCatch command which tests if the rda can run, if it can the while loops continues as normal, but if it doesn’t work, it just iterates to the next loop which lets the data be randomized again. It’s pretty neat, if you ask me.
Fvalues<- vector()
PsuedoRsq<- vector()
BootGrowth<- Growth
BootEnv <- Environment
v <- 0
while(v < 10000){
#Shuffling Environmental data
BootEnv$Cover<- sample(BootEnv$Cover, replace = FALSE)
BootEnv$Slope<- sample(BootEnv$Slope, replace = FALSE)
BootEnv$Aspect<- sample(BootEnv$Aspect, replace = FALSE)
BootEnv$Position<- sample(BootEnv$Position, replace = FALSE)
BootEnv[,1:2]<-BootEnv[,1:2][sample(nrow(BootEnv),nrow(BootEnv)),]
BootEnv[,3:4]<-BootEnv[,3:4][sample(nrow(BootEnv),nrow(BootEnv)),]
#trycatch
skip_to_next <- FALSE
tryCatch(rda(BootGrowth ~., BootEnv, na.action = na.exclude, scale = TRUE), error = function(e) { skip_to_next <<- TRUE})
if(skip_to_next) { next }
#Run RDA
BootRDA <- rda(BootGrowth ~., BootEnv, na.action = na.exclude, scale = TRUE)
#Capture Variables
BootF<- anova.cca(BootRDA)$F[1]
BootRsq<- RsquareAdj(BootRDA)$r.squared
ifelse(BootF > 100, skip_to_next <<- TRUE, (v <- v + 1 )) #Another little check to prevent ludicrous F values that were showing up
if(skip_to_next) { next }
#Add to vectors
Fvalues<- append(Fvalues, BootF )
PsuedoRsq <- append(PsuedoRsq, BootRsq)
}
This is the final part of our journey! We have the RDA from our real data and 10,000 simulations of randomized RDAs. Now let’s compare the F-values and Rsquares (effect sizes). This will place our reality within the actual potential of our data. No assumed distribution here! Just reality! The real F-value and Rsquare will appear as red lines in histograms of the simulations.
#F-value graph
hist(Fvalues, col = "blue", xlim = c(0,20), breaks = 40, main = ""); abline(v=as.numeric(RealF), col ="red"); abline(v = quantile(Fvalues, 0.95), col = "green", lty = 2) # The green line represents an F value that would result in a p value of 0.05 (significant). Anything to the right of this line would be significant
pvalueF<- length(which(Fvalues>RealF)) / length(Fvalues) #extract real p-value from this!
pvalueF #Looks insignificant!
## [1] 0.476
hist(PsuedoRsq, col = "gold", main = ""); abline(v=RealRsq, col ="red") #Effect size graph
# we are on the low side of this too
Phew! This has been a journey! But after all of that, I think I can confidently say that our environmental variables are not having a strong effect on the growth of Pitcher’s thistle! As we see from the last two graphs, the collected data’s Fvalue and Rsq value are both on the low end of what our data could have produced. I’ll admit, that I am surprised that our data had the potential to capture a significant relationship (shown by the green line in the f-value histogram). The randomization test outputted a p-value of 0.476, which is not significant. With such a large simulated distribution, I am confident that we are not seeing a relationship between the environmental factors we measured and the growth of Pitcher’s thistle rather than just missing the relationship because of sample size.
Thanks again for checking out this analysis! I would love any feedback or questions. I hope you enjoyed it!
-Benjamin J Rivera
Updates in v.2