In this practical we will explore the thermistor data collected during the Ecolab experiments from week 5. You completed two activities during the practical:
a three point calibration of the thermistor at ~1 ℃, ~18 ℃ and ~38 ℃ .
an assessment of the impact of ADC resolution (8, 10 and 12 bit) on sensor output during a warming cycle/ air temperature measurement period.
The main aim of this practical is to introduce you to the tools and
approaches for manipulating digital data sets (particular data
from environmental sensors). You will gain hands-on experience of data
visualization using ggplot2 then use plots to identify
patterns and help inform statistical analysis for assessing sensor
calibration and key specifications (e.g. accuracy and precision).
Figure 1: Schematic representation of the experimental setup used to calibrate the NTC thermistors at Ecolab
Download and import the sensor data measured using the Arduino mkr zero.
Restructure the data so it is in tidy form (ideal for plotting using ggplot2).
Apply sensor calibrations and assess different curve fitting approaches to the data collected during the Week 5 practical.
Assess sensor accuracy, linearity and precision.
Explore the impact of ADC resolution 8, 10 12 bit) on sensor measurements.
Gain a deeper understanding of the grammar of graphics and how it has been implemented in ggplot2.
Explore ways to overcome some of the challenges associated with plotting high frequency sensor data.
Develop interactive plots to aid with visualization and interpretation.
All code that is in grey boxes should be copied into your R script (or ideally your R Notebook) and then run/edited as required.
We will be using a variety of R packages to import, wrangle and
visualize the data. These include the tidyverse,
specifically ggplot2 and
dplyr, plotly for
interactive plotting, cowplot for arranging plots
and ggpubr for creating some nice plot themes.
Recap : If you are not already familiar with it,
something that can be confusing is the %>% or
|> operator - known as a pipe. This essentially pushes
the result from the bit before the pipe into the first argument of the
bit after the pipe. It’s not essential you grasp the concept, but will
help you understand the code structure.
Also it is worth thinking about the data analysis pipeline I mentioned in the lecture as we work through the practical. We will be undertaking the import, tidy, transform, model and visualize steps today.
The r packages in the code chunk below need to be installed on your
computer before running through the example code in this document. These
can either be installed from the command line with the
install.packages() function or using the R Studio
GUI and selecting the Packagestab in
the bottom right panel then install - this opens a
new window with a search tab for selecting the package you want to
install.
library(tidyverse)
library(plotly)
library(ggpubr)
library(ggpmisc)
library(cowplot)
library(lubridate)
library(viridis)
library(colourpicker)
library(cowplot)
library(gganimate)
library(thermocouple)##package for working with thermistor data
library(Metrics)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,eval = FALSE)
Make sure you have the data you collected during the practical during
week 5. If you haven’t already done so download the csv file which
provides a data template which you can then use to format your data (Week
7 canvas page). The first challenge to overcome is associated with
data import - i.e. getting the data frame into the R
environment so you begin analysis and visualization. Here we use a
function from the tidyverse - read_csv()
- as we have data that is separated by commas. We then write it to an
object called df_xxx.
NOTE - Make sure you change the code below to match the
location where you have saved the file
e.g. file = "C:/Downloads/NTC_cal_example.csv".
df_cal<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/NTC_cal_example.csv")
df_adc_water<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_water_bath_example.csv")
df_adc_air<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_air_logging_example.csv")
df_adc_refT<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_ref_temp.csv",col_types = "?dc")
First we need to check the data has imported correctly. We can print the data frame to the console just by typing the name of the data frame object - df_cal, df_xxx. The physical structure of the data frame is then displayed in the R console.
Some things to check:
Have all the variables been imported?
Do the dimensions (number of columns and rows) look correct?
Are the columns the expected format (date, numeric or factor)?
Are the data tidy?
Are variables spread across column names?
Is the time variable imported as a time variable or a character variable?
#Remove the # below and run to inspect the data frames
#df_cal
#df_adc_water
#df_adc_water
#df_adc_refT
If there is a problem with time for one of the data frames you can
fix using the following code chunk. Change the data object if necessary
xxxx$Time.
df_adc_refT$Time<-parse_time(df_adc_refT$Time,"%H.%M.%S")
As observed above, the data are still in the wide format for the ADC resolution with the variable ADC resolution embedded in column headers. This is a very common problem with data sets so it is important to understand how to restructure the data into a tidy or long format.
We need to create a long version of the data frame so each row is an observation for a single time and ADC resolution - adhering to the principles of tidy data (Wickham 2014). We will call this new data frame “df_ADC_water_L” for the water bath data and “df_ADC_air_L” for the air logging data.
###ADC water bath
df_ADC_water_L<-df_adc_water %>%
pivot_longer(!c(Time),
names_to = "ADC_Bits",values_to = "Resistance")
###ADC air
df_ADC_air_L<-df_adc_air %>%
pivot_longer(!c(Time),
names_to = "ADC_Bits",values_to = "Resistance")
Given we have high resolution, time series data (i.e. 1 sec
frequency), it may be useful to calculate some additional time related
variables that can make data summaries and plotting easier. We can
create a minute variable. This is done using some very useful functions
in the lubridate package (minute) in
combination with the the mutate function After running the
code below print the data frame to the console you can check the new
variables have been saved. You may which to explore plotting using this
variable later in the practical.
df_ADC_water_L<-df_ADC_water_L %>%
mutate(min=minute(Time))
df_ADC_air_L<-df_ADC_air_L %>%
mutate(min=minute(Time))
Now for the exciting part - plotting the data! As we will be using ggplot2 for this - a package which implements the grammar of graphics (see Figure 2 below) - we should take a moment to consider how this is put into practice.
First we define the data frame to be used df_cal and specify
that we want to plot the - this is the DATA part of the layers in the
figure above. We then define the mapping and aesthetics of the plot by
defining which variables we want for x and y.
If you run the first steps of the code and write to a plot object -
p1 - we can then print this by running the name of the plot
(i.e. p1).
What has been plotted?
p1<-ggplot(data = df_cal, mapping = aes(x = Resistance,
y = Ref_temp))
p1
We now have the mapping for the plot but need a geometric
representation of the data. we can call different geometric using the
geom_xx. We can create points
(geom_point) and can add these elements to the base mapping
p1 with the + sign.
p2<-p1 + geom_point()
p2
p2<-p1 + geom_line()
p2
We can manipulate the visual properties of the geom by adding
arguments to the function such as colour, shape, size and line type.
Note the colourPicker() function is a very useful
tool for exploring the wide range of colours available in the R
environment. We can also add additional theme related
information such as axis labels (more on how to manipulate themes
later).
#size and shape
p2<-p1 + geom_point(size =3, shape="circle")
p2
#Axis labels
p3<-p2 + ylab("Water temperature (\u00b0C)")+ xlab("Resistance (\u2126)")
p3
###Find any R colour
#colourPicker()
p1 + geom_point(size =3, shape="circle",colour="darkgoldenrod4")+ ylab("Water temperature (\u00b0C)")+ xlab("Resistance (\u2126)")
Take some time to explore the different options for
geom_point() have a look here for more aesthetics that you
can modify beyond size= , shape=,
colour= https://ggplot2.tidyverse.org/reference/geom_point.html
Note: Think about the difference between assigning a mapped aesthetic to this vs a fixed value. When would one approach be more useful than the other? This is a useful Chapter to read: Mastering Data Viz - Chapter 8.
You can also use predefined themes to style your plots as in the code
chunk below. there are numerous themes you can explore but a few of my
favs are theme_cowplot() from the cowplot package
, theme_light() and theme_pubr() from the
ggpubr package.
p1 + geom_point(size=3.5, shape=1,colour="black") + ylab("Water temperature (\u00b0C)") + xlab("Resistance (\u2126)") + theme_cowplot()
Thermistor manufacturers provide a beta coefficient for calculating
temperature from resistance. For the thermistor we used the beta
coefficient (𝛽) = 3950 ± 1%. We use this value to fit a curve to the
data which approximates the Steinhart equation (see notes from week 5
practical). This can be implemented in R using the following function
from the thermocouple package. Here the argument
R is the measured thermistor resistance and T0 is the
reference temperature (= 25 ℃) and R0 reference resistance
(=10,000 Ω). Note the input R can be a vector. If you want to find out
more about the beta coefficient then the technical paper by Fraden (2000) is an excellent, but maths heavy,
resource.
ThermistorTemperature(R0 = 10000,T0 = 25,
R =15000, betaTH = 3950)
#remove
df_cal$Temp_beta<-ThermistorTemperature(R0 = 10000,T0 = 25,
R =df_cal$Resistance, betaTH = 3950)
Question(s):
Can you calculate the temperature based on the beta coef and save it to the df_cal? Make sure you call this variable something intuitive (e.g. Temp_beta)
Can you plot this data against the reference temperature measurements?
Can you log transform resistance and create a new variable called log_res?
Hint: log(df_cal$Resistance)
We now need to fit a calibration curve to the data. We can explore
fits visually using the geom_smooth()
##fits a simple linear regression model
p4 <- p3 + geom_smooth(method = "lm")
p4 + stat_regline_equation(label.x=20000, label.y=40)
##for a polynomial fit we need to specify the formula - a quadratic seems appropriate but you can explore higher order polynomial fits by changing the number in the poly() function
p3 + stat_poly_line(formula = y ~ poly(x, 2, raw = TRUE)) + stat_poly_eq(formula = y ~ poly(x, 2, raw = TRUE), use_label("eq"),label.x = 20000, label.y = 40)
Question(s):
We now need to create the models and save them as model objects so we can fit them to the data (i.e. convert all measurements of resistance to temperature).
linear_model<-lm(data = df_cal, formula = Ref_temp ~ Resistance)
poly_model<-lm(data = df_cal,
formula = Ref_temp ~ poly(x = Resistance,
degree = 2,raw = T))
You can now view the model coefficients and goodness of fit using
summary().
Question(s):
Can you repeat the model fitting for the log transformed data?
Which model has the best fit to the data?
How many coefficients do the models have and are these all significant?
You can extract the calculated temperature for all your measured
resistance values and for both models using the
$fitted.values. An example for the linear model is shown
below
linear_model$fitted.values
df_cal$Temp_linear<-linear_model$fitted.values
Question(s):
Extract the modelled temperature for both models and save it to the df_cal? Make sure you call this variable something intuitive (e.g. Temp_linear/poly)
For each temperature calibration point can you calculate the mean and SD and create a table (see Table 1 template below)
You can take this further by calculating the coefficient of variation (CV% = [SD/mean] *100) which can be used as a measure of sensor precision (Sousan et al. 2016)
Do you think there are other factors beyond the sensor precision that may influence the CV%? How might we account for these?
\[ \text{CV(%)} = \left( \frac{\sigma_i}{\mu_i} \right) \times 100 \]
Where the CV is for the ith temperature calibration point:
σ = standard deviation of a specific calibration model for the ith temperature calibration point
μ = mean of a specific calibration model for the ith temperature calibration point
| Ref Temp (℃) |
Linear model (mean±SD ℃) |
Log model (mean±SD ℃) |
Beta model (mean±SD ℃) |
Poly model (mean±SD ℃) |
|---|---|---|---|---|
| 2.2 | xx ± xx | |||
| 18.1 | ||||
| 38.2 |
Before we chose a calibration model to apply to the ADC resolution data we need to explore the data both visually and using a robust measure of accuracy or suite of measures. To make it easier to plot the data on a single graph we need to tidy the data and convert into a long format. Here we want create a new factor variable which embeds the model name and a new numeric variable that includes the calibrated temperature.
df_cal_long<-df_cal |> pivot_longer(!c(Cal_medium, Resistance,
Voltage, Ref_temp,
Log_resistance),
names_to = "cal_model",
values_to = "cal_temp")
Run the code below to create box plots.
Question(s):
#plot 1 ??
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp))+geom_boxplot()
##plot 2
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()
##plot 3
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()+facet_wrap(~Cal_medium)
##plot 4
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()+facet_wrap(~Cal_medium,scales = "free_y")
##plot 5
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_jitter()+facet_wrap(~Cal_medium,scales = "free_y")+theme_pubclean()
Now we need to plot the calibrated temperature against the reference temperature.
The plots below do this and include a 1:1 line.
Question(s):
ggplot(data = df_cal_long,
mapping = aes(x=Ref_temp, y = cal_temp,
colour = cal_model))+geom_point(size=2)+
facet_wrap(~cal_model,scales = "fixed")+
theme_pubclean()+
geom_abline(slope = 1)+
geom_smooth(se = F,method = "lm",linetype = "dashed")+
stat_regline_equation(label.x=5, label.y=40)
We can assess the fit using R2 (coefficient of determination) but it is useful to get an estimate of the error in the measurement units. For this we can use the Root Mean Square Error (RMSE).
RMSE is a statistical measure that is often used to evaluate the accuracy of a regression model (or in our case the calibration model used). It measures the difference between the predicted values \((\hat{y})\) and the actual values \((y)\) in the units of the response variable.
The RMSE is calculated as the square root of the average of the squared differences between the predicted values and the actual values. In other words, it is the square root of the mean of the squared errors.
\[ \text{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=0}^{N - 1} (y_i - \hat{y}_i)^2}{N}} \]
We can calculate this using the rmse() function
in the metrics package.
#assess error RMSE
rmse(df_cal$Ref_temp, df_cal$Temp_beta)
rmse(df_cal$Ref_temp, df_cal$Temp_linear)
##Repeat for other models
Question(s):
Can you create a table to enable easy comparisons of the important
variables related to the four models (See below). You can extract the
R2 value by calling summary() on the model name
e.g. summary(linear_model). For the 1:1 line plots you
generatedabove you can extract the slope (equation displayed on the
plot) the closer to 1 the value is the more linear the sensor response
is over the measurement range.
| Variable | Linear model | Log linear model | Beta model | Polynomial model |
|---|---|---|---|---|
| R2 | ||||
| RMSE | ||||
| Slope |
Now you have decided on the model to take forward you need to apply
it to the temperature data collected for the ADC experiment. To do this
we use the predict() function.
NOTE: I suggest generating this for both the best calibration model and the beta coefficient model as the later should be more generalisable and may be less prone to errors associated with the specific thermistor and arduino combination.
df_ADC_water_L$Temp<-predict(object = poly_model,df_ADC_water_L)
df_ADC_air_L$Temp<-predict(object =poly_model,df_ADC_air_L)
df_ADC_air_L$Temp_beta<-ThermistorTemperature(R0 = 10000,T0 = 25,
R =df_ADC_air_L$Resistance, betaTH = 3950)
Now we can plot the time series of temperature.
df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_point(size=0.7)
df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_point(size=0.7)+facet_wrap(~ADC_Bits)
df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_line()+facet_wrap(~ADC_Bits)
Question(s):
Which geom is best for representing time series data?
Try plotting multiple geoms (e.g. point and line) on a single plot - why might this be useful for time series data?
What do you notice about the plots?
What do you think the resolution (smallest step) is for 8 bit temperature measurements using an Arduino?
You can create an interactive plot to explore the data using
plotly. You can pass a ggplot object (i.e. a saved plot)
directly to the ggplotly() function.
p1<-df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_line()+facet_wrap(~ADC_Bits)
ggplotly(p1,width = 600 ,height = 400 )
Question(s):
What do you think the minimum temperature resolution is for the 10 and 12 bit measurements
Why might it be difficult to calculate this using the water bath data?
Extract the reference temperature for the warming cycle. Can you calculate the RMSE and generate a scatter plot with 1:1 line for each adc resolution?
First we need to merge the temperature data measured using the YSI
(handheld device - i.e. the reference temperature) with the thermistor
data. We can do this using the left_join() function. Have a
look at the code below and think about why we are using the
filter() function? Also take note of how we are using the
Time variable to enable the matching (a quick way to combine
lots of values from different datasets!). The output is then written to
a new dataframe (df_adc_ref_xxx, with the xxx denoting the measurement
matrix [water/air]).
df_adc_ref_water<-df_adc_refT |> filter(Medium=="Water") |> left_join(df_ADC_water_L,by = "Time")
df_adc_ref_air<-df_adc_refT |> filter(Medium=="Air") |> left_join(df_ADC_air_L,by = "Time")
Plotting the relationship between thermistor air temperature measured using the thermistor and the reference measurements.
##we can now plot the data for the Beta coefficent model
df_adc_ref_air |> ggplot(aes(x = ADC_Bits,y = Temp_beta ))+geom_boxplot()+geom_point()
df_adc_ref_air |> ggplot(aes(x = Temp_ref ,y = Temp_beta ))+geom_point()+geom_smooth(method = "lm")+facet_wrap(~ADC_Bits)+ stat_regline_equation(label.x=13.8, label.y=14)
Question(s):
What do you notice about the relationship between the reference and thermistor temperature? What are the implications of ADC resolution on the relationship.
Can you generate a plot for your best calibration model? How does this compare to the Beta model?
Additional activities:
For the air temperature logging can you calculate the mean, SD, CV%. Remeber CV% gives us an idea of sensor precision. How does this vary with ADC resolution? Is it what you expected?
A few important settings that will help with your write-up of the practical.
To add a figure caption to plots generated in a code chunk use
the following argument fig.cap in the curly braces at the
top of the code chunk (e.g. fig.cap=“A plot that shows something
interesting”). If your figure caption is long then you need to use the
following formatting:
{r}
#| echo = TRUE, message=FALSE, warning=FALSE,
#| fig.width = 10,
#| fig.cap = "Figure 2: Representation of the graphial elements used in a plot generated using ggplot2"
You can insert tables to R notebooks using the Visual
Editor in R studio. Just select table from the
drop down menu then insert table or use the following shortcut
ctrl+alt+T
To style tables you can right click on the table to set cell
alignment etc. (similar to how you would in word). You can control line
breaks in your table headers using the following <br>
. See this tutorial for more info: Link
to tutorial.
Give this section of your lab book a title – e.g “using environmental sensors”.
Briefly explain how you collected the data (i.e. the methods you used). Here you can include images, schematics and text. Be sure to highlight the instruments used.
For the calibration generate a scatter plot of resistance vs observed temperature. Ensure you include a figure caption and some plain text with a brief description of the pattern(s) observed.
Now fit a linear model and a nonlinear model (polynomial). Extract information on the model fit and present these in a table with a short interpretation of the results (which model appear to have the better fit)?
Apply both models to the data and calculate the mean and SD for each calibration data point. Present this as a table, what do you notice about the data?
Calculate the sensor linearity and accuracy (RMSE) based on all the models. Present this data as a table or integrated into a short paragraph that describes the results. Would you be happy to use this sensor for another research project? How does it compare to commercial sensor specifications?
For the ADC experiment plot the data as a boxplot (8-bit, 10-bit and 12-bit). What do you notice? Now plot the data as a dot plot – what does this tell us about the difference in the step size (i.e. resolution in °C)? You could also include an interactive plot of the time series here.
For the air temperature logging calculate the mean and SD for each resolution along with the mean and SD of the reference sensor (10 data points required). What do you notice?
End your notebook with some short reflections on using the Arduino platform – what worked well, what didn’t and what you’d do differently if you were using the platform for another project.
Which model gave the lowest RMSE and most uniform residuals across temperature? (Linear, polynomial, Beta, log_linear)
What other factors may have impacted the calibration accuracy and transferability. Hint think about stability of the calibration tanks and also sensors used as the reference. Also have a look at this white paper Here