This R Publication is intended to provide updates on current and ongoing data analysis to the researches involved in analysis of data from the SuperTarantula experiment at UC Berkeley. Using this approach provides transparency regarding the data analysis methods used, and improves communication between researchers.
# Load Libraries ----------------------------------------------------------
library(tidyverse); library(dplyr); library(lubridate); library(reshape2)
library(ggplot2); library(cowplot); library(RColorBrewer); library(ggpubr)
library(DescTools); library(scales)
# Import and Process Text File -------------------------------------------
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
####IMPORT LOGGED DATA####
#Pull data in from text files and format timestamps
setwd(paste(dirname(rstudioapi::getActiveDocumentContext()$path),
"/May 2024 Data",sep=""))
rawData <- read.table("N-min_sensors_raw.csv",sep=',', header=TRUE)
rawData <- select(rawData, -Day)
names(rawData)[names(rawData) == "Timestamp"] <- "dateTime"
#Convert columns to rows for data analysis
allData <- melt(rawData,id.vars=c("dateTime"))
names(allData)[names(allData) == "variable"] <- "pot"
names(allData)[names(allData) == "value"] <- "reading"
allData$dateTime <- as.POSIXct(allData$dateTime,format="%m/%d/%Y %H:%M")
allData <- allData[order(allData$dateTime),] #Order by timestamp
# Normalize Resistances ---------------------------------------------------
#R_o will be the mean resistance value between these dateTimes:
rStart <- as.POSIXct("2024-04-26 00:00:00"); rEnd <- rStart + hours(12)
endDate <- max(allData$dateTime)
Ro <- allData[allData$dateTime > rStart & allData$dateTime < rEnd,] %>%
group_by(pot) %>% summarize(Ro = mean(reading))
allData <- merge(allData,Ro,by="pot"); allData$Rnorm=allData$reading/allData$Ro
# Map to treatments -------------------------------------------------------
allData$treatment <- as.factor(sub("\\..*","",allData$pot)) #Get treatment types
The code above converts the data into a single data.frame in which each line is a single reading.
# Plot the data -----------------------------------------------------------
treatmentColors <- c("Peat" = "darkgreen", "Sandy" = "darkorange", "Clay" = "darkred")
treatmentShapes <- c("Peat" = 1, "Sandy" = 2, "Clay" = 5)
####Plot raw data with free Y scale####
rawPlot <- ggplot() +
geom_point(data=allData, aes(dateTime,Rnorm,col=treatment, alpha=0.75), size=1) +
scale_color_manual(name = "Soil Type", values = treatmentColors) +
scale_shape_manual(values=treatmentShapes, guide="none") +
scale_alpha(guide="none") +
xlab("") + ylab(expression("Normalized Resistance (R/R"[o]*")")) +
xlim(rStart, endDate) + theme_bw()
rawPlot
####Plot raw data with constrained Y scale####
rawPlot + ylim(0.9, 1.25)
To the naked eye, it appears that the three treatments (soil types)
result in a different sensor response, on average. Let’s dig deeper!
rawPlot +
stat_smooth(data=allData, aes(dateTime,Rnorm,col=treatment), method="lm",
se=FALSE, linewidth = 1,formula = y~x) +
facet_wrap(~pot, nrow=3)
We see some interesting things here. The Peat installations are
clearly reacting more than the other two soil types, but also in a more
stochastic way. The reactions start largely near the end of the
experiment, suggesting that we might be only seeing the beginning of a
process that would be more clear with a longer experimental run-time.
The Sandy installations appear to have a general increasing
slope which is more gradual. The Clay soils appear to have a
mostly flat but slightly parabolic shape - surprising, since we don’t
generally expect the resistances of these sensors to
decrease.
####Average sensor signals by treatment####
aveData <- allData %>%
mutate(dateTime = cut(dateTime, breaks='1 hour')) %>%
group_by(treatment, dateTime) %>%
dplyr::summarize(stDev = sd(Rnorm), Rnorm = mean(Rnorm, na.rm=T))
aveData$dateTime <- as.POSIXct(aveData$dateTime)
####Plot the data####
treatmentPlot <- ggplot() +
geom_point(data=aveData,
aes(dateTime,Rnorm,col=treatment,shape=treatment), size=1) +
stat_smooth(data=aveData,
aes(dateTime,Rnorm,col=treatment), method="lm", se=FALSE,
linewidth = 0.5,formula = y~x) +
xlab("") + ylab(expression("Normalized Resistance (R/R"[o]*")")) +
scale_color_manual(name = "Pot Treatment", values = treatmentColors) +
scale_shape_manual(values = treatmentShapes, guide="none") +
xlim(rStart, endDate) + theme_bw() + ylim(0.95,1.25)
treatmentPlot
In general, our approach for decomposition sensor data analysis has been to extract individual sensor slopes and then look to correlate these slopes to other parameters of interest, for example the enzymatic activity within each instrumented plot. In this case, I will provide two different tables of slopes: One for the complete slope across the entire experiment (after a 12-hour period used to baseline the initial resistances), and one for the slope within some “decomposition window”, the part of the signal that we think contains the most useful information.
####First, smooth the data a bit####
sData <- allData %>%
mutate(dateTime = cut(dateTime, breaks='6 hour')) %>%
group_by(pot, dateTime) %>%
dplyr::summarize(stDev = sd(Rnorm), Rnorm = mean(Rnorm, na.rm=T))
#Add a time elapsed column, in days, called tDay
sData$dateTime <- as.POSIXct(sData$dateTime)
sData$tDay <- as.numeric(sData$dateTime - min(sData$dateTime))/60/60/24
#Get slopes ACROSS THE WHOLE EXPERIMENT
slopesAll <- sData %>%
group_by(pot) %>%
dplyr::summarize(slope = summary(lm(Rnorm ~ tDay + tDay^2))$coefficients[2])
slopesAll$treatment <- as.factor(sub("\\..*","",slopesAll$pot)) #Get treatments
allPlot <- ggplot(data=slopesAll) + theme_bw()+theme(legend.position = "none")+
geom_boxplot(aes(x=treatment, y=slope, col=treatment)) +
scale_color_manual(name = "Pot Treatment", values = treatmentColors) +
xlab("") + ylab("Sensor Slope") + ggtitle("Overall Slope")
####Define the beginning and end of a decomposition window####
decompStart <- rStart + hours(1)
decompEnd <- rStart + days(10)
#Get slopes WITHIN THE DECOMPOSITION WINDOW
slopesWindow <- sData[sData$dateTime >= decompStart &
sData$dateTime <= decompEnd,] %>%
group_by(pot) %>%
dplyr::summarize(slope = summary(lm(Rnorm ~ tDay + tDay^2))$coefficients[2])
slopesWindow$treatment <- as.factor(sub("\\..*","",slopesWindow$pot))
windowPlot <- ggplot(data=slopesWindow) + theme_bw() +
geom_boxplot(aes(x=treatment, y=slope, col=treatment)) +
scale_color_manual(name = "Pot Treatment", values = treatmentColors) +
xlab("") + ylab("") + ggtitle("Slope Within Decomp Window")
bothBox <- ggarrange(allPlot, windowPlot, ncol=2) #Show both box plots
show(bothBox)
These slopes are the linear term of a 2nd-order polynomial fit. When we look at the overall slope distribution across all of the experiment, there is some separation; when we subset the data to only include a decomposition window that starts pretty much immediately after the experiment begins, and ends 10 days after the experiment begins, we see much better separation between soil types.
In the plot below, the decomposition window will be denoted using vertical black dashed lines.
####Remove the outlier sensor's data####
#slopesAll <- slopesAll[slopesAll$pot != "Peat.soil.5",]
#slopesWindow <- slopesWindow[slopesWindow$pot != "Peat.soil.5",]
slopesBoxPlot <- ggplot(data=slopesWindow) + theme_bw() +
geom_boxplot(aes(x=treatment, y=slope, col=treatment)) +
geom_jitter(aes(x=treatment, y=slope, col=treatment), size=2, alpha=0.75) +
scale_color_manual(name = "Pot Treatment", values = treatmentColors) +
xlab("") + ylab("") + ggtitle("Slope Within Decomp Window")
####Get new averaged data####
aveData <- allData %>%
mutate(dateTime = cut(dateTime, breaks='12 hour')) %>%
group_by(treatment, dateTime) %>%
dplyr::summarize(stDev = sd(Rnorm), Rnorm = mean(Rnorm, na.rm=T))
aveData$dateTime <- as.POSIXct(aveData$dateTime)
treatmentPlot <- ggplot() + ggtitle("Averaged Sensor Response by Treatment") +
geom_point(data=aveData,
aes(dateTime,Rnorm,col=treatment,shape=treatment), size=1) +
geom_errorbar(data=aveData[aveData$dateTime >= decompStart &
aveData$dateTime <= decompEnd,],
aes(x=dateTime, ymin=Rnorm-stDev, ymax=Rnorm+stDev,
col=treatment, alpha=0.5), width=50000,
position=position_dodge(10000)) +
stat_smooth(data=aveData[aveData$dateTime >= decompStart &
aveData$dateTime <= decompEnd,],
aes(dateTime,Rnorm,col=treatment), method="lm", se=FALSE,
linewidth = 1.25,formula = y~x) +
geom_vline(xintercept = c(decompStart,decompEnd),
color="black", linewidth = 1, linetype = "dashed") +
xlab("") + ylab(expression("Normalized Resistance (R/R"[o]*")")) +
scale_color_manual(name = "Pot Treatment", values = treatmentColors) +
scale_shape_manual(values = treatmentShapes, guide="none") +
xlim(rStart, endDate) + theme_bw() + ylim(0.95,1.25) +
theme(legend.position = "none")
ggarrange(treatmentPlot, slopesBoxPlot, ncol=2, widths=c(2,1))
Visually, it looks like we are seeing a statistically-significant impact of treatment (soil type) on the slopes that we see from these sensors. Let’s check this with an ANOVA:
windowAnova <- aov(slopesWindow$slope~slopesWindow$treatment)
summary(windowAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## slopesWindow$treatment 2 2.243e-04 1.121e-04 113.9 1.57e-08 ***
## Residuals 12 1.182e-05 9.800e-07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These tables contain the slopes taken from each individual sensor, both across the whole experiment (which we expect will be less useful) and within the decomposition window.
show(slopesAll)
## # A tibble: 15 × 3
## pot slope treatment
## <fct> <dbl> <fct>
## 1 Peat.soil.1 0.00312 Peat
## 2 Peat.soil.2 0.000950 Peat
## 3 Peat.soil.3 0.00703 Peat
## 4 Peat.soil.4 0.00226 Peat
## 5 Peat.soil.5 0.00689 Peat
## 6 Sandy.soil.1 0.00229 Sandy
## 7 Sandy.soil.2 0.00141 Sandy
## 8 Sandy.soil.3 0.00367 Sandy
## 9 Sandy.soil.4 0.00320 Sandy
## 10 Sandy.soil.5 0.00152 Sandy
## 11 Clay.soil.1 0.000873 Clay
## 12 Clay.soil.2 -0.000234 Clay
## 13 Clay.soil.3 0.000254 Clay
## 14 Clay.soil.4 0.000689 Clay
## 15 Clay.soil.5 0.000490 Clay
show(slopesWindow)
## # A tibble: 15 × 3
## pot slope treatment
## <fct> <dbl> <fct>
## 1 Peat.soil.1 0.000668 Peat
## 2 Peat.soil.2 -0.000604 Peat
## 3 Peat.soil.3 -0.000556 Peat
## 4 Peat.soil.4 -0.000191 Peat
## 5 Peat.soil.5 -0.00126 Peat
## 6 Sandy.soil.1 0.0103 Sandy
## 7 Sandy.soil.2 0.00735 Sandy
## 8 Sandy.soil.3 0.00790 Sandy
## 9 Sandy.soil.4 0.0102 Sandy
## 10 Sandy.soil.5 0.00829 Sandy
## 11 Clay.soil.1 0.00636 Clay
## 12 Clay.soil.2 0.00601 Clay
## 13 Clay.soil.3 0.00619 Clay
## 14 Clay.soil.4 0.00509 Clay
## 15 Clay.soil.5 0.00739 Clay
We have defined a “decomposition window” where we take individual normalized resistance slopes from each sensor. In this case, we define this window as starting 1 week into the experiment, and ending when the experimental data ends.
We take slopes by running a 2nd-order polynomial regression, and taking just the linear coefficient of the fit as the slope associated with that particular sensor.
Based on a one-way ANOVA test, we see a significant difference between the slope responses from sensors as a function of the soil in which they were installed.
Next, we should use the slopes extracted from these sensor signals to see if they correlate to other measurements in these experimental pots: