During the course of the Limnogeology lecture you have been given different kind of data related to sediment cores. In many cases, limnogeology is a rather disciplinary approach and one is sometimes confronted and overwhelmed with a plentitude of information. Following some ideas how you can extract the most important information. These ideas generally work for different data handling and visualisation pipelines, be it with Microsoft Excel or with a programming language, but some functions might not be available everywhere.

3 Exploring your data

The best way to explore your data is visually. Start with some simple scatter or line plots, add colours and shapes, use facets and transformations.

Quite often, you’ll have some prior knowledge as to what exactly you are looking for or where you are looking for something. Linescans/core photos can help with the decision too. In the next step, you will often look at XRF Corescan data. It is helpful to look for correlation patterns and/or focus on elements of interest.

3.1 Exploring correlations with Microsoft Excel

Microsoft Excel and other spreadsheet applications allow for some statistical analyses including the calculation of a correlation matrix. To this end, make sure that your XRF element traces are in a wide format, i.e. every line is a different depth in the sediment core and every column is a different element.

You can find the data analysis tools on the Data ribbon tab. Maybe you will have to activate the tools first in the settings (above red square in screenshot).

Microsoft Excel data analysis tools

You will have to mark all the element traces that you want to include in the correlation matrix (do not include core name and depths, just the data you want to compare), check if you have a header and where you want to create the correlation matrix. The result is in the next screenshot:

Using Excel’s conditional formatting it’s possible to fill the cells with different colours according to their value. One disadvantage of this method is, that the correlation matrix has the same order as the original input data, i.e. there is no clustering of similar traces, which makes spotting patterns difficult.

3.2 Exploring correlations with R

Using R we can use the corrplot package to easily visually distinguish between groups of related element traces. We’re using sample data from Lake Zug to this end. The only prior information in this case is that we expect some lead pollution.

We start by loading the necessary packages for our analysis and loading the previously parsed xrf data.

# Loading necessary packages - if this fails, chances are that you haven't installed a package
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(corrplot)

## corrplot 0.84 loaded

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# Load XRF data from file in same folder
load("xrfdata_limno21_sampledata.RData")

# Looking at our data
glimpse(xrfdata_lakezug)

## Rows: 38,068
## Columns: 21
## $ CoreID     <chr> "ZUG18-7", "ZUG18-7", "ZUG18-7", "ZUG18-7", "ZUG18-7", "ZUG…
## $ Depth      <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, …
## $ Date       <chr> "2019-02-20", "2019-02-20", "2019-02-20", "2019-02-20", "20…
## $ Time       <chr> "09-54-32", "09-55-32", "09-56-35", "09-57-31", "09-58-31",…
## $ Duration   <dbl> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,…
## $ Run        <chr> "Run1", "Run1", "Run1", "Run1", "Run1", "Run1", "Run1", "Ru…
## $ Rep        <chr> "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Re…
## $ Voltage    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ Current    <dbl> 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500,…
## $ Filter     <chr> "No-Filter", "No-Filter", "No-Filter", "No-Filter", "No-Fil…
## $ SlitDown   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ SlitCross  <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Excitation <chr> "XRay Tube Generator", "XRay Tube Generator", "XRay Tube Ge…
## $ Throughput <dbl> 175000, 246000, 288000, 216000, 242000, 255000, 262000, 262…
## $ Element    <chr> "Al", "Al", "Al", "Al", "Al", "Al", "Al", "Al", "Al", "Al",…
## $ AbsLine    <chr> "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka",…
## $ Area       <dbl> 3059, 7327, 10347, 4845, 7133, 7822, 11253, 13465, 8139, 76…
## $ AreaStd    <dbl> 86, 117, 133, 102, 116, 121, 137, 145, 122, 119, 127, 135, …
## $ Chi2       <dbl> 1.0, 1.2, 2.2, 2.8, 1.2, 3.1, 2.7, 2.8, 2.9, 3.2, 2.3, 6.6,…
## $ cps        <dbl> 102.0, 244.2, 344.9, 161.5, 237.8, 260.7, 375.1, 448.8, 271…
## $ cpsStd     <dbl> 2.9, 3.9, 4.4, 3.4, 3.9, 4.0, 4.6, 4.8, 4.1, 4.0, 4.2, 4.5,…

The output of the glimpse command shows us the structure of our data. There are 21 variables/columns and 38068 observations/rows. <chr> shows us that a variable contains texts (chr is abbreviated for char), while <dbl> stands for double – which in turn means floating point number with double precision. There is also additional data that was created by the XRF corescanner that is not needed for our analysis. We only need to keep the variables/columns CoreID (which is effectively the core name), Depth (which is the core depth without green stuff in mm), Element and cps (short for counts per second – counts or integral area from older XRF data, but corrected for different measuring times). It is a good idea to tell R that Element and CoreID are categorical variables or so called factors in R. Factors have some added benefits over raw text variables.

# We mutate two variables in our dataframe, turning them into factors and saving the changes back into the same dataframe.
xrfdata_lakezug <- xrfdata_lakezug %>% 
  mutate(CoreID = as_factor(CoreID), Element = as_factor(Element))

glimpse(xrfdata_lakezug)

## Rows: 38,068
## Columns: 21
## $ CoreID     <fct> ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18…
## $ Depth      <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, …
## $ Date       <chr> "2019-02-20", "2019-02-20", "2019-02-20", "2019-02-20", "20…
## $ Time       <chr> "09-54-32", "09-55-32", "09-56-35", "09-57-31", "09-58-31",…
## $ Duration   <dbl> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,…
## $ Run        <chr> "Run1", "Run1", "Run1", "Run1", "Run1", "Run1", "Run1", "Ru…
## $ Rep        <chr> "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Rep0", "Re…
## $ Voltage    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ Current    <dbl> 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500,…
## $ Filter     <chr> "No-Filter", "No-Filter", "No-Filter", "No-Filter", "No-Fil…
## $ SlitDown   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ SlitCross  <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Excitation <chr> "XRay Tube Generator", "XRay Tube Generator", "XRay Tube Ge…
## $ Throughput <dbl> 175000, 246000, 288000, 216000, 242000, 255000, 262000, 262…
## $ Element    <fct> Al, Al, Al, Al, Al, Al, Al, Al, Al, Al, Al, Al, Al, Al, Al,…
## $ AbsLine    <chr> "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka", "Ka",…
## $ Area       <dbl> 3059, 7327, 10347, 4845, 7133, 7822, 11253, 13465, 8139, 76…
## $ AreaStd    <dbl> 86, 117, 133, 102, 116, 121, 137, 145, 122, 119, 127, 135, …
## $ Chi2       <dbl> 1.0, 1.2, 2.2, 2.8, 1.2, 3.1, 2.7, 2.8, 2.9, 3.2, 2.3, 6.6,…
## $ cps        <dbl> 102.0, 244.2, 344.9, 161.5, 237.8, 260.7, 375.1, 448.8, 271…
## $ cpsStd     <dbl> 2.9, 3.9, 4.4, 3.4, 3.9, 4.0, 4.6, 4.8, 4.1, 4.0, 4.2, 4.5,…

Additionally, we need to turn this long format table (one column for Elements and one column for cps) into a wide-format table (as in the Excel screenshot above) to satisfy the cor function that computes the correlation matrix:

# We take our dataframe with the name xrfdata_lakezug and "feed it" with the pipe symbol %>% into the next function, where we select the columns of interest. In the next step, we expand the long format table back to a wide format table. The result of this pipeline is assigned to the new dataframe xrfwide. This syntax is also called a tidyverse pipeline and is a modern, more consistent form of R.
xrfwide <- xrfdata_lakezug %>% 
  select(CoreID, Depth, Element, cps) %>% 
  pivot_wider(names_from = Element, values_from = cps)

glimpse(xrfwide)

## Rows: 1,228
## Columns: 33
## $ CoreID <fct> ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, ZUG18-7, …
## $ Depth  <dbl> 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, …
## $ Al     <dbl> 102.0, 244.2, 344.9, 161.5, 237.8, 260.7, 375.1, 448.8, 271.3, …
## $ Ar     <dbl> 250.05, 625.41, 568.57, 1081.60, 1120.00, 997.66, 691.07, 612.1…
## $ As     <dbl> 5.2, 4.6, 6.1, 4.6, 6.7, 5.9, 1.2, 4.7, 0.8, 2.2, 1.6, 0.8, 4.8…
## $ Ba     <dbl> 391.82, 300.18, 296.96, 138.19, 93.94, 104.94, 112.65, 112.18, …
## $ Br     <dbl> 0.8, 4.6, 6.7, 11.1, 13.4, 15.5, 12.7, 12.5, 14.9, 8.5, 7.6, 7.…
## $ Ca     <dbl> 14448.44, 67611.48, 102092.69, 50920.85, 72133.43, 82528.54, 86…
## $ Cl     <dbl> 658.25, 542.35, 585.65, 614.98, 631.11, 2941.11, 1921.01, 637.9…
## $ Co     <dbl> 221.00, 156.25, 142.89, 221.30, 225.97, 184.29, 207.81, 200.81,…
## $ Cr     <dbl> -89.13, 92.00, 142.35, 136.43, 156.40, 141.68, 190.15, 195.20, …
## $ Cu     <dbl> 343.38, 241.21, 203.01, 242.16, 221.03, 181.63, 200.46, 196.21,…
## $ Fe     <dbl> 2124.63, 8664.00, 11717.93, 9877.10, 13268.81, 11531.54, 13253.…
## $ Ga     <dbl> 2.50, 2.70, 3.00, 2.69, 4.26, 1.59, 1.58, 3.12, 0.67, 1.42, 2.3…
## $ Hg     <dbl> 1.50, 1.10, 1.20, 0.90, 2.90, 1.25, 2.30, 0.79, 0.88, 2.60, 1.1…
## $ K      <dbl> 470.48, 1507.05, 2042.88, 1107.47, 1964.22, 1796.58, 2485.30, 2…
## $ Mg     <dbl> 1.5, 9.8, 8.3, 6.6, 6.8, 9.4, 19.7, 19.5, 13.8, 6.6, 13.2, 6.3,…
## $ Mn     <dbl> 210.56, 327.67, 386.29, 383.62, 447.49, 396.79, 432.37, 444.27,…
## $ Mo     <dbl> 21.1, 27.8, 22.4, 10.0, 19.3, 20.5, 10.8, 9.7, 15.3, 20.7, 9.3,…
## $ Ni     <dbl> 237.5, 238.2, 231.1, 206.1, 196.0, 195.0, 196.1, 195.1, 196.7, …
## $ P      <dbl> 21.8, 72.0, 102.1, 38.4, 76.7, 88.1, 89.2, 94.5, 83.7, 79.5, 91…
## $ Pb     <dbl> 5.9, 10.6, 12.4, 13.8, 16.3, 16.0, 23.4, 21.3, 27.3, 33.0, 33.1…
## $ Rb     <dbl> 1.7, 10.5, 19.6, 42.1, 47.2, 36.3, 42.4, 43.8, 35.4, 36.1, 32.3…
## $ S      <dbl> 708.60, 1695.61, 2435.91, 1109.16, 1344.35, 1327.84, 1520.71, 1…
## $ Sc     <dbl> 2.6, 6.5, 9.8, 7.3, 5.7, 7.0, 9.6, 7.5, 7.4, 8.0, 13.7, 8.7, 8.…
## $ Si     <dbl> 2514.90, 4436.40, 5889.32, 3370.77, 4318.57, 4668.40, 5986.48, …
## $ Sr     <dbl> 34.8, 72.3, 134.2, 239.4, 243.3, 249.5, 280.4, 293.2, 319.8, 36…
## $ Ti     <dbl> 185.40, 459.71, 557.53, 416.52, 649.27, 526.48, 739.09, 855.52,…
## $ U      <dbl> 5.8, 5.9, 4.0, 6.0, 4.0, 2.9, 6.4, 2.2, 5.2, 4.0, 3.8, 1.7, 5.8…
## $ V      <dbl> -81.44, -20.85, -12.18, 35.79, 39.23, 20.92, 67.37, 51.09, 31.8…
## $ Y      <dbl> 4.0, 4.0, 8.6, 13.6, 12.6, 23.7, 13.4, 14.0, 16.3, 11.3, 13.7, …
## $ Zn     <dbl> 12.8, 40.2, 52.2, 50.0, 44.3, 39.1, 39.4, 42.1, 47.2, 54.9, 71.…
## $ Zr     <dbl> 26.4, 27.2, 29.7, 48.7, 52.0, 53.6, 52.3, 62.2, 50.0, 46.3, 53.…

The previous not needed variables disappeared and we can now see that we have the same wide structure as before with Excel.

We can now inspect every core separately, but need to know how many cores there are first (of course you could also check out the linescans or the dataframe structure visually if you are using RStudio). Since the CoreID variable is a factor now, R knows how many distinct states, so called levels there are:

levels(xrfwide$CoreID)

## [1] "ZUG18-7" "ZUG18-8" "ZUG18-9"

We can now create the correlation matrices for every core and plot a correlation plot. We choose to display only the upper half of the correlation plot and we use hierarchical clustering to group very similar variables together:

# Read from the inside to the outside: We filter xrfwide to only have the data of one core, then remove the core name and depth for the correlation analysis. The correlation table is then assigned to xrfwide_zug18_7.
xrfwide_zug18_7 <- cor(xrfwide %>% filter(CoreID == "ZUG18-7") %>% select(-CoreID, -Depth))

corrplot(xrfwide_zug18_7, type = "upper", order = "hclust", tl.col="black", mar=c(0,0,1,0), title = "Correlation plot for XRF data of ZUG18-7")

xrfwide_zug18_8 <- cor(xrfwide %>% filter(CoreID == "ZUG18-8") %>% select(-CoreID, -Depth))

corrplot(xrfwide_zug18_8, type = "upper", order = "hclust", tl.col="black", mar=c(0,0,1,0), title = "Correlation plot for XRF data of ZUG18-8")

xrfwide_zug18_9 <- cor(xrfwide %>% filter(CoreID == "ZUG18-9") %>% select(-CoreID, -Depth))

corrplot(xrfwide_zug18_9, type = "upper", order = "hclust", tl.col="black", mar=c(0,0,1,0), title = "Correlation plot for XRF data of ZUG18-9")

We can see from the correlation plots that there are some common groups of positive correlations between elements. However, we don’t know how valid these are (interferences are possible), we don’t know how high the counts/cps are etc. It does seem though that Pb is mostly co-occuring with Zn.

# We are creating a ggplot2 plot object. The plot consists of three separate panels, so called facets. The panels are facetted by core name. Plotly is a plotting library used for web graphics that helps us to bring our static plots to life.
plotobj <- ggplot(xrfdata_lakezug, aes(x = Depth, y = cps, colour = Element)) + geom_line() + coord_flip() + scale_x_reverse() + facet_wrap(~ CoreID)

ggplotly(plotobj)

Luckily for us, we can make interactive plots where we can turn every element separately on or off. Additionally, we can scale and zoom into our data. A good strategy would be to check for the previously discovered groups.

If we look e.g. only at Pb and Zn, our plot will look similar to this:

ggplot(xrfdata_lakezug %>% filter(Element %in% c("Pb", "Zn")), aes(x = Depth, y = cps, colour = Element)) + geom_line() + coord_flip() + scale_x_reverse() + facet_wrap(~ CoreID, scales = "free")

Lead and zinc clearly occur around the same time. The lead peak in ZUG18-9 could be an outlier/measurement error.

Data Exploration and visualisation in Limnogeology

Remo Roethlin (remo.roethlin@eawag.ch)

May 19, 2021

1 The first step: Measuring and creating data

2 Importing, cleaning and inspecting data

3 Exploring your data

3.1 Exploring correlations with Microsoft Excel

3.2 Exploring correlations with R