Introduction

In this practical we will explore the thermistor data collected during the Ecolab experiments from week 5. You completed two activities during the practical:

  1. a three point calibration of the thermistor at ~1 ℃, ~18 ℃ and ~38 ℃ .

  2. an assessment of the impact of ADC resolution (8, 10 and 12 bit) on sensor output during a warming cycle/ air temperature measurement period.

The main aim of this practical is to introduce you to the tools and approaches for manipulating digital data sets (particular data from environmental sensors). You will gain hands-on experience of data visualization using ggplot2 then use plots to identify patterns and help inform statistical analysis for assessing sensor calibration and key specifications (e.g. accuracy and precision).

Figure 1: Schematic representation of the experimental setup 
  used to calibrate the NTC thermistors at Ecolab

Figure 1: Schematic representation of the experimental setup used to calibrate the NTC thermistors at Ecolab

Objectives

  • Download and import the sensor data measured using the Arduino mkr zero.

  • Restructure the data so it is in tidy form (ideal for plotting using ggplot2).

  • Apply sensor calibrations and assess different curve fitting approaches to the data collected during the Week 5 practical.

  • Assess sensor accuracy, linearity and precision.

  • Explore the impact of ADC resolution 8, 10 12 bit) on sensor measurements.

  • Gain a deeper understanding of the grammar of graphics and how it has been implemented in ggplot2.

  • Explore ways to overcome some of the challenges associated with plotting high frequency sensor data.

  • Develop interactive plots to aid with visualization and interpretation.

A note on script and R packages to be used:

All code that is in grey boxes should be copied into your R script (or ideally your R Notebook) and then run/edited as required.

We will be using a variety of R packages to import, wrangle and visualize the data. These include the tidyverse, specifically ggplot2 and dplyr, plotly for interactive plotting, cowplot for arranging plots and ggpubr for creating some nice plot themes.

Recap : If you are not already familiar with it, something that can be confusing is the %>% or |> operator - known as a pipe. This essentially pushes the result from the bit before the pipe into the first argument of the bit after the pipe. It’s not essential you grasp the concept, but will help you understand the code structure.

Also it is worth thinking about the data analysis pipeline I mentioned in the lecture as we work through the practical. We will be undertaking the import, tidy, transform, model and visualize steps today.

Data import and manipulation

R packages to load

The r packages in the code chunk below need to be installed on your computer before running through the example code in this document. These can either be installed from the command line with the install.packages() function or using the R Studio GUI and selecting the Packagestab in the bottom right panel then install - this opens a new window with a search tab for selecting the package you want to install.

library(tidyverse)
library(plotly)
library(ggpubr)
library(ggpmisc)
library(cowplot)
library(lubridate)
library(viridis)
library(colourpicker)
library(cowplot)
library(gganimate)
library(thermocouple)##package for working with thermistor data
library(Metrics)

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Import and inspect the data

Make sure you have the data you collected during the practical during week 5. If you haven’t already done so download the csv file which provides a data template which you can then use to format your data (Week 7 canvas page). The first challenge to overcome is associated with data import - i.e. getting the data frame into the R environment so you begin analysis and visualization. Here we use a function from the tidyverse - read_csv() - as we have data that is separated by commas. We then write it to an object called df_xxx.


NOTE - Make sure you change the code below to match the location where you have saved the file e.g. file = "C:/Downloads/NTC_cal_example.csv".

df_cal<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/NTC_cal_example.csv")

df_adc_water<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_water_bath_example.csv")

df_adc_air<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_air_logging_example.csv")

df_adc_refT<-read_csv(file = "C:/Users/khamisk/Dropbox/Teaching 2021 S1 and S2/Data viz/Practical class/Thermistor practical/ADC_ref_temp.csv",col_types = "?dc")

First we need to check the data has imported correctly. We can print the data frame to the console just by typing the name of the data frame object - df_cal, df_xxx. The physical structure of the data frame is then displayed in the R console.

Some things to check:

  1. Have all the variables been imported?

  2. Do the dimensions (number of columns and rows) look correct?

  3. Are the columns the expected format (date, numeric or factor)?

  4. Are the data tidy?

  5. Are variables spread across column names?

  6. Is the time variable imported as a time variable or a character variable?

#Remove the # below and run to inspect the data frames

#df_cal

#df_adc_water

#df_adc_water

#df_adc_refT

If there is a problem with time for one of the data frames you can fix using the following code chunk. Change the data object if necessary xxxx$Time.

df_adc_refT$Time<-parse_time(df_adc_refT$Time,"%H.%M.%S")

Restructure the data and create new variables

As observed above, the data are still in the wide format for the ADC resolution with the variable ADC resolution embedded in column headers. This is a very common problem with data sets so it is important to understand how to restructure the data into a tidy or long format.

We need to create a long version of the data frame so each row is an observation for a single time and ADC resolution - adhering to the principles of tidy data (Wickham 2014). We will call this new data frame “df_ADC_water_L” for the water bath data and “df_ADC_air_L” for the air logging data.

###ADC water bath
df_ADC_water_L<-df_adc_water %>% 
  pivot_longer(!c(Time),
              names_to = "ADC_Bits",values_to = "Resistance") 


###ADC air
df_ADC_air_L<-df_adc_air %>% 
  pivot_longer(!c(Time),
              names_to = "ADC_Bits",values_to = "Resistance") 

Given we have high resolution, time series data (i.e. 1 sec frequency), it may be useful to calculate some additional time related variables that can make data summaries and plotting easier. We can create a minute variable. This is done using some very useful functions in the lubridate package (minute) in combination with the the mutate function After running the code below print the data frame to the console you can check the new variables have been saved. You may which to explore plotting using this variable later in the practical.

df_ADC_water_L<-df_ADC_water_L %>% 
  mutate(min=minute(Time))


df_ADC_air_L<-df_ADC_air_L %>% 
  mutate(min=minute(Time))

Thermistor calibration

Plotting the data

The basic grammar of a plot

Now for the exciting part - plotting the data! As we will be using ggplot2 for this - a package which implements the grammar of graphics (see Figure 2 below) - we should take a moment to consider how this is put into practice.

Figure 2: Representation of the graphial elements used in a plot generated using ggplot2

Figure 2: Representation of the graphial elements used in a plot generated using ggplot2

First we define the data frame to be used df_cal and specify that we want to plot the - this is the DATA part of the layers in the figure above. We then define the mapping and aesthetics of the plot by defining which variables we want for x and y. If you run the first steps of the code and write to a plot object - p1 - we can then print this by running the name of the plot (i.e. p1).

What has been plotted?

#eval=FALSE,
p1<-ggplot(data = df_cal, mapping = aes(x = Resistance, 
                                 y = Ref_temp))

p1

We now have the mapping for the plot but need a geometric representation of the data. we can call different geometric using the geom_xx. We can create points (geom_point) and can add these elements to the base mapping p1 with the + sign.

#eval=FALSE,
p2<-p1 + geom_point()
p2

p2<-p1 + geom_line()
p2

Plot styling

We can manipulate the visual properties of the geom by adding arguments to the function such as colour, shape, size and line type. Note the colourPicker() function is a very useful tool for exploring the wide range of colours available in the R environment. We can also add additional theme related information such as axis labels (more on how to manipulate themes later).

#eval=FALSE, message=FALSE, warning=FALSE
#size and shape 
p2<-p1 + geom_point(size =3, shape="circle")
p2

#Axis labels
p3<-p2 + ylab("Water temperature (\u00b0C)")+ xlab("Resistance (\u2126)")
p3

###Find any R colour 
#colourPicker()

p1 + geom_point(size =3, shape="circle",colour="darkgoldenrod4")+ ylab("Water temperature (\u00b0C)")+ xlab("Resistance (\u2126)")

Take some time to explore the different options for geom_point() have a look here for more aesthetics that you can modify beyond size= , shape=, colour= https://ggplot2.tidyverse.org/reference/geom_point.html

Note: Think about the difference between assigning a mapped aesthetic to this vs a fixed value. When would one approach be more useful than the other? This is a useful Chapter to read: Mastering Data Viz - Chapter 8.

Plot themes

You can also use predefined themes to style your plots as in the code chunk below. there are numerous themes you can explore but a few of my favs are theme_cowplot() from the cowplot package , theme_light() and theme_pubr() from the ggpubr package.

p1 + geom_point(size=3.5, shape=1,colour="black") + ylab("Water temperature (\u00b0C)") + xlab("Resistance (\u2126)") + theme_cowplot()

Fitting curves to your plots

Thermistor manufacturers provide a beta coefficient for calculating temperature from resistance. For the thermistor we used the beta coefficient (𝛽) = 3950 ± 1%. We use this value to fit a curve to the data which approximates the Steinhart equation (see notes from week 5 practical). This can be implemented in R using the following function from the thermocouple package. Here the argument R is the measured thermistor resistance and T0 is the reference temperature (= 25 ℃) and R0 reference resistance (=10,000 Ω). Note the input R can be a vector. If you want to find out more about the beta coefficient then the technical paper by Fraden (2000) is an excellent, but maths heavy, resource.

ThermistorTemperature(R0 = 10000,T0 = 25, 
                      R =15000, betaTH = 3950)
## [1] 16.14612
#remove
df_cal$Temp_beta<-ThermistorTemperature(R0 = 10000,T0 = 25, 
                      R =df_cal$Resistance, betaTH = 3950)

Question(s):

  1. Can you calculate the temperature based on the beta coef and save it to the df_cal? Make sure you call this variable something intuitive (e.g. Temp_beta)

  2. Can you plot this data against the reference temperature measurements?

  3. Can you log transform resistance and create a new variable called log_res?

Hint: log(df_cal$Resistance)

We now need to fit a calibration curve to the data. We can explore fits visually using the geom_smooth()

df_cal$Log_resistance<-log(df_cal$Resistance)


##fits a simple linear regression model

p4 <- p3 + geom_smooth(method = "lm") 
p4 + stat_regline_equation(label.x=20000, label.y=40) 

##for a polynomial fit we need to specify the formula - a quadratic seems appropriate but you can explore higher order polynomial fits by changing the number in the poly() function 

p3 + stat_poly_line(formula = y ~ poly(x, 2, raw = TRUE)) +   stat_poly_eq(formula = y ~ poly(x, 2, raw = TRUE), use_label("eq"),label.x = 20000, label.y = 40)  

Question(s):

  1. Can you fit a linear model (line of best fit) to the log transformed resistance data
#include = FALSE
###log linear model

df_cal |> ggplot(mapping = aes(y = Ref_temp,x = Log_resistance))+geom_point()+ geom_smooth(method = "lm")  + stat_regline_equation(label.x=9, label.y=40)

Fit the calibration model

We now need to create the models and save them as model objects so we can fit them to the data (i.e. convert all measurements of resistance to temperature).

linear_model<-lm(data = df_cal, formula = Ref_temp ~ Resistance)

poly_model<-lm(data = df_cal, 
                 formula = Ref_temp ~ poly(x = Resistance,degree = 2,
                                           raw = T))                    

You can now view the model coefficients and goodness of fit using summary().

Question(s):

  1. Can you repeat the model fitting for the log transformed data?

  2. Which model has the best fit to the data?

  3. How many coefficients do the models have and are these all significant?

#include=FALSE
log_linear_model<-lm(data = df_cal, formula = Ref_temp ~ Log_resistance)

You can extract the calculated temperature for all your measured resistance values and for both models using the $fitted.values. An example for the linear model is shown below

linear_model$fitted.values
##          1          2          3          4          5          6          7 
##  0.3933014  0.4804046  0.2547383  0.1399002  0.5832688  0.5512827  0.4575150 
##          8          9         10         11         12         13         14 
##  0.4620818  0.4758193  0.3496756 21.3792095 21.4187141 21.3085914 21.3610538 
##         15         16         17         18         19         20         21 
## 21.3160913 21.3235912 21.3300144 21.3128797 21.3578422 21.2968031 36.8239480 
##         22         23         24         25         26         27         28 
## 36.8856740 36.8870477 36.7786514 36.6946113 36.8617447 36.8410085 36.8589787 
##         29         30 
## 36.8031561 36.8124011
df_cal$Temp_linear<-linear_model$fitted.values

Question(s):

  1. Extract the modelled temperature for both models and save it to the df_cal? Make sure you call this variable something intuitive (e.g. Temp_linear/poly)

  2. For each temperature calibration point can you calculate the mean and SD and create a table (see Table 1 template below)

  3. You can take this further by calculating the coefficient of variation (CV% = [SD/mean] *100) which can be used as a measure of sensor precision (Sousan et al. 2016)

  4. Do you think there are other factors beyond the sensor precision that may influence the CV%? How might we account for these?

\[ \text{CV(%)} = \left( \frac{\sigma_i}{\mu_i} \right) \times 100 \]

Where the CV is for the ith temperature calibration point:

  • σ = standard deviation of a specific calibration model for the ith temperature calibration point

  • μ = mean of a specific calibration model for the ith temperature calibration point

#include=FALSE
#remove
df_cal$Temp_linear<-linear_model$fitted.values
df_cal$Temp_poly<-poly_model$fitted.values
df_cal$Temp_log_linear<-log_linear_model$fitted.values
Table 1. Suggested data structure for notebook entry for this section of the course
Ref Temp (℃)
Linear model
(mean±SD ℃)
Log model
(mean±SD ℃)
Beta model
(mean±SD ℃)
Poly model
(mean±SD ℃)
2.2 xx ± xx
18.1
38.2

Before we chose a calibration model to apply to the ADC resolution data we need to explore the data both visually and using a robust measure of accuracy or suite of measures. To make it easier to plot the data on a single graph we need to tidy the data and convert into a long format. Here we want create a new factor variable which embeds the model name and a new numeric variable that includes the calibrated temperature.

df_cal_long<-df_cal |> pivot_longer(!c(Cal_medium, Resistance,
                                       Voltage, Ref_temp, 
                                       Log_resistance),
                                    names_to = "cal_model", 
                                    values_to = "cal_temp")

Run the code below to create box plots.

Question(s):

  1. What’s wrong with plot 1?
#plot 1 ??
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp))+geom_boxplot()

##plot 2
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()

##plot 3
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()+facet_wrap(~Cal_medium)

##plot 4
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_boxplot()+facet_wrap(~Cal_medium,scales = "free_y")

##plot 5 
ggplot(data = df_cal_long,mapping = aes(x=cal_model, y = cal_temp, colour = Cal_medium))+geom_jitter()+facet_wrap(~Cal_medium,scales = "free_y")+theme_pubclean()

Now we need to plot the calibrated temperature against the reference temperature.

The plots below do this and include a 1:1 line.

Question(s):

  1. Can you find out what it is and when normally used. Think about why it useful in this case?
ggplot(data = df_cal_long,
       mapping = aes(x=Ref_temp, y = cal_temp,
       colour = cal_model))+geom_point(size=2)+
  facet_wrap(~cal_model,scales = "fixed")+
  theme_pubclean()+
  geom_abline(slope = 1)+
  geom_smooth(se = F,method = "lm",linetype = "dashed")+  
  stat_regline_equation(label.x=5, label.y=40) 

We can assess the fit using R2 (coefficient of determination) but it is useful to get an estimate of the error in the measurement units. For this we can use the Root Mean Square Error (RMSE).

RMSE is a statistical measure that is often used to evaluate the accuracy of a regression model (or in our case the calibration model used). It measures the difference between the predicted values \((\hat{y})\) and the actual values \((y)\) in the units of the response variable.

The RMSE is calculated as the square root of the average of the squared differences between the predicted values and the actual values. In other words, it is the square root of the mean of the squared errors.

\[ \text{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=0}^{N - 1} (y_i - \hat{y}_i)^2}{N}} \]

We can calculate this using the rmse() function in the metrics package.

#assess error RMSE
rmse(df_cal$Ref_temp, df_cal$Temp_beta)
## [1] 2.253915
rmse(df_cal$Ref_temp, df_cal$Temp_linear)
## [1] 3.655814
rmse(df_cal$Ref_temp, df_cal$Temp_poly)
## [1] 0.2423138
rmse(df_cal$Ref_temp, df_cal$Temp_log_linear)
## [1] 0.6319984

Question(s):

  1. Based on the visual integration of the data and the RMSE which calibration model should you apply to the ADC measured data?

Can you create a table to enable easy comparisons of the important variables related to the four models (See below). You can extract the R2 value by calling summary() on the model name e.g. summary(linear_model) . For the 1:1 lines

Table 2. Suggested format of your table
Variable Linear model Log linear model Beta model Polynomial model
R2
RMSE
Slope

ADC data

Applying the calibration model to the data

Now you have decided on the model to take forward you need to apply it to the temperature data collected for the ADC experiment. To do this we use the predict() function.

NOTE: I suggest generating this for both the best calibration model and the beta coefficient model as the later should be more generalisable and may be less prone to errors associated with the specific thermistor and arduino combination.

df_ADC_water_L$Temp<-predict(object = poly_model,df_ADC_water_L)

df_ADC_air_L$Temp<-predict(object =poly_model,df_ADC_air_L)


df_ADC_air_L$Temp_beta<-ThermistorTemperature(R0 = 10000,T0 = 25, 
                      R =df_ADC_air_L$Resistance, betaTH = 3950)

Now we can plot the time series of temperature.

df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_point(size=0.7)

df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_point(size=0.7)+facet_wrap(~ADC_Bits)

df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_line()+facet_wrap(~ADC_Bits)

Question(s):

  1. Which geom is best for representing time series data?

  2. Try plotting multiple geoms (e.g. point and line) on a single plot - why might this be useful for time series data?

  3. What do you notice about the plots?

  4. What do you think the resolution (smallest step) is for 8 bit temperature measurements using an Arduino?

Interactive plotting

You can create an interactive plot to explore the data using plotly. You can pass a ggplot object (i.e. a saved plot) directly to the ggplotly() function.

p1<-df_ADC_water_L |> ggplot(aes(x = Time,y = Temp,colour = ADC_Bits))+geom_line()+facet_wrap(~ADC_Bits)

ggplotly(p1,width = 600 ,height = 400 )

Question(s):

  1. What do you think the minimum temperature resolution is for the 10 and 12 bit measurements

  2. Why might it be difficult to calculate this using the water bath data?

  3. Extract the reference temperature for the warming cycle. Can you calculate the RMSE and generate a scatter plot with 1:1 line for each adc resolution?

Air temperature logging

First we need to merge the temperature data measured using the YSI (handheld device - i.e. the reference temperature) with the thermistor data. We can do this using the left_join() function. Have a look at the code below and think about why we are using the filter() function? Also take note of how we are using the Time variable to enable the matching (a quick way to combine lots of values from different datasets!). The output is then written to a new dataframe (df_adc_ref_xxx, with the xxx denoting the measurement matrix [water/air]).

df_adc_ref_water<-df_adc_refT |> filter(Medium=="Water") |> left_join(df_ADC_water_L,by = "Time")

df_adc_ref_air<-df_adc_refT |> filter(Medium=="Air") |> left_join(df_ADC_air_L,by = "Time")

Plotting the relationship between thermistor air temperature measured using the thermistor and the reference measurements.

##we can now plot the data for the Beta coefficent model
df_adc_ref_air |> ggplot(aes(x = ADC_Bits,y = Temp_beta ))+geom_boxplot()+geom_point()

df_adc_ref_air |> ggplot(aes(x = Temp_ref ,y = Temp_beta ))+geom_point()+geom_smooth(method = "lm")+facet_wrap(~ADC_Bits)+ stat_regline_equation(label.x=13.8, label.y=14) 

Question(s):

  1. What do you notice about the relationship between the reference and thermistor temperature? What are the implications of ADC resolution on the relationship.

  2. Can you generate a plot for your best calibration model? How does this compare to the Beta model?

Additional activities:

For the air temperature logging can you calculate the mean, SD, CV%. Remeber CV% gives us an idea of sensor precision. How does this vary with ADC resolution? Is it what you expected?

Information for your notebook write‑up

Formatting tips

A few important settings that will help with your write-up of the practical.

  1. To add a figure caption to plots generated in a code chunk use the following argument fig.cap in the curly braces at the top of the code chunk (e.g. fig.cap=“A plot that shows something interesting”). If your figure caption is long then you need to use the following formatting:

    {r}

    #| echo = TRUE, message=FALSE, warning=FALSE,

    #| fig.width = 10,

    #| fig.cap = "Figure 2: Representation of the graphial elements used in a plot generated using ggplot2"

  2. You can insert tables to R notebooks using the Visual Editor in R studio. Just select table from the drop down menu then insert table or use the following shortcut ctrl+alt+T

  3. To style tables you can right click on the table to set cell alignment etc. (similar to how you would in word). You can control line breaks in your table headers using the following <br> . See this tutorial for more info: Link to tutorial.

Key elements to include in your notebook entry

  1. Give this section of your lab book a title – e.g “using environmental sensors”.

  2. Briefly explain how you collected the data (i.e. the methods you used). Here you can include images, schematics and text. Be sure to highlight the instruments used.

  3. For the calibration generate a scatter plot of resistance vs observed temperature. Ensure you include a figure caption and some plain text with a brief description of the pattern(s) observed.

  4. Now fit a linear model and a nonlinear model (polynomial). Extract information on the model fit and present these in a table with a short interpretation of the results (which model appear to have the better fit)?

  5. Apply both models to the data and calculate the mean and SD for each calibration data point. Present this as a table, what do you notice about the data?

  6. Calculate the sensor linearity and accuracy (RMSE) based on all the models. Present this data as a table or integrated into a short paragraph that describes the results. Would you be happy to use this sensor for another research project? How does it compare to commercial sensor specifications?

  7. For the ADC experiment plot the data as a boxplot (8-bit, 10-bit and 12-bit). What do you notice? Now plot the data as a dot plot – what does this tell us about the difference in the step size (i.e. resolution in °C)? You could also include an interactive plot of the time series here.

  8. For the air temperature logging calculate the mean and SD for each resolution along with the mean and SD of the reference sensor (10 data points required). What do you notice?

  9. End your notebook with some short reflections on using the Arduino platform – what worked well, what didn’t and what you’d do differently if you were using the platform for another project.

Discussion prompts

  • Granularity: What step size (°C per ADC count) is effectively observed for 8-, 10-, 12‑bit? How does this compare to the spread you see in the dot plot?
  • Noise vs resolution: How might averaging multiple samples reduce quantization artefacts? Where do diminishing returns appear?
  • Calibration impact:
    • Which model gave the lowest RMSE and most uniform residuals across temperature? (Linear, polynomial, Beta, log_linear)

    • What other factors may have impacted the calibration accuracy and transferability. Hint think about stability of the calibration tanks and also sensors used as the reference. Also have a look at this white paper Here

  • Field relevance: Given typical environmental ranges (e.g., 0–30 °C), which ADC resolution is sufficient considering noise and logging interval?

References

Fraden, Jacob. 2000. “A Two-Point Calibration of Negative Temperature Coefficient Thermistors.” Review of Scientific Instruments 71 (4): 1901–5. https://doi.org/10.1063/1.1150560.
Sousan, Sinan, Kirsten Koehler, Geb Thomas, Jae Hong Park, Michael Hillman, Andrew Halterman, and Thomas M. Peters. 2016. “Inter-Comparison of Low-Cost Sensors for Measuring the Mass Concentration of Occupational Aerosols.” Aerosol Science and Technology 50 (5): 462–73. https://doi.org/10.1080/02786826.2016.1162901.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.