Introduction

This report presents an exploratory data analysis (EDA) of pH data from Lake 239 at the IISD Experimental Lakes Area (IISD-ELA). The analysis was performed with R packages.

About the Data

The dataset I analyzed comprises monitoring records of chemical parameters for Lake 239 at the IISD Experimental Lakes Area (IISD-ELA) from 1990 to 2019. Lake 239 is located east of Kenora. A zoomed-in view of Lake 239 is available on the IISD-ELA map.

This lake is important because it has not been manipulated in any way, so any changes we see in the lake are a product of natural variation and atmospheric change [1]. The dataset is available from the repository.

The primary sample information is listed in Table 1 and Table 2.

Table 1: Sampling Location

Information                             Description
Latitude                                49.66103
Longitude                               -93.71315
Location ID                             239LAEIF
Location Name                           East Inflow
Horizontal Coordinate Reference System  NAD83, WGS84

Table 2: Sample Information

Information      Description
Location Type    River/Stream
Location         Inflow
Sample Type      Surface Water
Characteristics  pH
                 Organic Carbon, Filtered
                 Inorganic Carbon, Dissolved
                 Total Nitrogen, mixed forms, Filtered
                 Total Nitrogen, mixed forms, Suspended
                 Total Phosphorus, mixed forms, Filtered
                 Total Phosphorus, mixed forms, Suspended
                 Chlorophyll a, Filtered



The structure of the dataset is as follows:

## 'data.frame':    5337 obs. of  36 variables:
##  $ DatasetName                                          : chr  "ELA LTER Chemistry" "ELA LTER Chemistry" "ELA LTER Chemistry" "ELA LTER Chemistry" ...
##  $ MonitoringLocationID                                 : chr  "239LAEIF" "239LAEIF" "239LAEIF" "239LAEIF" ...
##  $ MonitoringLocationName                               : chr  "Lake 239 East Inflow" "Lake 239 East Inflow" "Lake 239 East Inflow" "Lake 239 East Inflow" ...
##  $ MonitoringLocationLatitude                           : num  49.7 49.7 49.7 49.7 49.7 ...
##  $ MonitoringLocationLongitude                          : num  -93.7 -93.7 -93.7 -93.7 -93.7 ...
##  $ MonitoringLocationHorizontalCoordinateReferenceSystem: chr  "NAD83" "NAD83" "NAD83" "NAD83" ...
##  $ MonitoringLocationType                               : chr  "River/Stream" "River/Stream" "River/Stream" "River/Stream" ...
##  $ ActivityType                                         : chr  "Sample-Routine" "Sample-Routine" "Sample-Routine" "Sample-Routine" ...
##  $ ActivityMediaName                                    : chr  "Surface Water" "Surface Water" "Surface Water" "Surface Water" ...
##  $ ActivityStartDate                                    : chr  "1990-03-24" "1990-03-24" "1990-03-24" "1990-03-24" ...
##  $ ActivityStartTime                                    : chr  "" "" "" "" ...
##  $ ActivityEndDate                                      : chr  "" "" "" "" ...
##  $ ActivityEndTime                                      : chr  "" "" "" "" ...
##  $ ActivityDepthHeightMeasure                           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ActivityDepthHeightUnit                              : chr  "m" "m" "m" "m" ...
##  $ SampleCollectionEquipmentName                        : chr  "Pump/Submersible" "Pump/Submersible" "Pump/Submersible" "Pump/Submersible" ...
##  $ CharacteristicName                                   : chr  "Organic carbon" "Organic carbon" "Organic carbon" "Organic carbon" ...
##  $ MethodSpeciation                                     : chr  "" "" "" "" ...
##  $ ResultSampleFraction                                 : chr  "Filtered, Lab" "Filtered, Lab" "Filtered, Lab" "Filtered, Lab" ...
##  $ ResultValue                                          : num  3510 3770 3900 3540 3420 3200 3180 3360 94 88 ...
##  $ ResultUnit                                           : chr  "umol/L" "umol/L" "umol/L" "umol/L" ...
##  $ ResultValueType                                      : chr  "" "" "" "" ...
##  $ ResultDetectionCondition                             : chr  "" "" "" "" ...
##  $ ResultDetectionQuantitationLimitMeasure              : num  10 10 10 10 5 5 5 5 1 1 ...
##  $ ResultDetectionQuantitationLimitUnit                 : chr  "umol/L" "umol/L" "umol/L" "umol/L" ...
##  $ ResultDetectionQuantitationLimitType                 : chr  "Method Detection Level" "Method Detection Level" "Method Detection Level" "Method Detection Level" ...
##  $ ResultStatusID                                       : chr  "Validated" "Validated" "Validated" "Validated" ...
##  $ ResultComment                                        : chr  NA NA NA NA ...
##  $ ResultAnalyticalMethodID                             : logi  NA NA NA NA NA NA ...
##  $ ResultAnalyticalMethodContext                        : chr  "" "" "" "" ...
##  $ ResultAnalyticalMethodName                           : chr  "" "" "" "" ...
##  $ AnalysisStartDate                                    : chr  "" "" "" "" ...
##  $ AnalysisStartTime                                    : logi  NA NA NA NA NA NA ...
##  $ AnalysisStartTimeZone                                : logi  NA NA NA NA NA NA ...
##  $ LaboratoryName                                       : chr  "" "" "" "" ...
##  $ LaboratorySampleID                                   : chr  "K141" "K137" "K139" "K140" ...

Reused Lab ID

While exploring the dataset, I found that 24 Lab IDs had each been assigned to samples collected on different dates.
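
One way to surface reused Lab IDs is to group the sampling dates by ID and keep the IDs that map to more than one date. The sketch below is written in Python for illustration (the report itself used R). The function name `reused_lab_ids` and the sample rows are hypothetical: the IDs echo the structure listing above, but the second date is made up. It is not necessarily how the count of 24 was obtained.

```python
from collections import defaultdict


def reused_lab_ids(records):
    """Return {lab_id: sorted dates} for IDs seen on more than one date.

    `records` is an iterable of (LaboratorySampleID, ActivityStartDate) pairs.
    """
    dates_by_id = defaultdict(set)
    for lab_id, start_date in records:
        dates_by_id[lab_id].add(start_date)
    return {
        lab_id: sorted(dates)
        for lab_id, dates in dates_by_id.items()
        if len(dates) > 1
    }


# Made-up rows for illustration only; not taken from the dataset.
sample = [
    ("K141", "1990-03-24"),
    ("K137", "1990-03-24"),
    ("K141", "1991-05-02"),  # same ID, different date -> reused
    ("K137", "1990-03-24"),  # same ID, same date -> not reused
]
reused = reused_lab_ids(sample)
print(reused)
```

Repeated rows with the same ID and date are not counted as reuse here, because a single sample typically yields several rows (one per measured characteristic).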

pH Measurement Replicates

There were 735 pH records in total. pH was measured more than once on 40 of the 724 days (shown in the following table). The dataset lacks AnalysisStartTime data, so I cannot tell whether the repeated measurements were taken at different times.

Table 3: pH Measurement Replicates
Date Freq
1990-03-24 4
1990-03-31 2
1990-04-01 2
1990-04-02 2
1990-04-03 2
1990-04-12 4
1990-06-02 3
1990-06-17 4
1990-06-19 3
1990-06-20 5
1990-06-21 3
1990-07-07 2
1994-04-13 2
2009-06-24 12
2009-07-01 4
2016-06-13 2
2016-08-15 2
2016-09-01 14
2016-09-07 7
2016-09-12 2
2016-09-28 8
2016-10-03 2
2016-10-17 2
2016-10-24 2
2017-10-16 2
2018-06-12 2
2018-07-10 2
2019-06-25 2
2019-07-09 2
2019-07-16 2
2019-08-13 2
2019-09-03 2
2019-09-10 2
2019-09-17 2
2019-09-24 2
2019-10-01 4
2019-10-08 2
2019-10-22 2
2019-10-29 2
2019-11-04 2



The sampling frequency for pH is illustrated in the following heatmap.

Figure 1: Number of pH Records per Week
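
The weekly counts behind a heatmap of this kind can be sketched by mapping each record date to a (year, week) pair and tallying. The Python sketch below (the report itself used R) assumes ISO 8601 week numbering, which may differ from the week definition used for Figure 1. The helper name `weekly_counts` is hypothetical, and the dates and frequencies are taken from the first rows of the replicates table.

```python
from collections import Counter
from datetime import date


def weekly_counts(iso_dates):
    """Count records per (ISO year, ISO week) from 'YYYY-MM-DD' strings."""
    counts = Counter()
    for text in iso_dates:
        iso = date.fromisoformat(text).isocalendar()
        counts[(iso[0], iso[1])] += 1  # iso[0] = ISO year, iso[1] = ISO week
    return dict(counts)


# One entry per pH record, using dates from the replicates table.
record_dates = (
    ["1990-03-24"] * 4
    + ["1990-03-31"] * 2
    + ["1990-04-01"] * 2
    + ["1990-04-02"] * 2
)
per_week = weekly_counts(record_dates)
print(per_week)
```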

pH Abnormal Values

Any pH value that falls outside \(\textrm{median} \pm 3 \times \textrm{MAD}\), where MAD is the median absolute deviation, is considered abnormal.
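
The rule can be sketched as follows in Python (the report itself used R). The sketch uses the unscaled MAD, that is, the median of the absolute deviations from the median. Note that R's `mad()` multiplies by a constant of 1.4826 by default, and the report does not state which variant was applied, so the bounds here are an assumption. The function names and the pH values are made up for illustration.

```python
from statistics import median


def mad_bounds(values, k=3.0):
    """Return (lower, upper) = median -/+ k * MAD, using the unscaled MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - k * mad, med + k * mad


def flag_abnormal(values, k=3.0):
    """Return the values that fall outside the median +/- k * MAD band."""
    lower, upper = mad_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up pH readings: most sit near 6.2, one sits far below.
ph = [6.1, 6.2, 6.3, 6.2, 6.4, 6.0, 4.2, 6.3]
abnormal = flag_abnormal(ph)
print(mad_bounds(ph), abnormal)
```

With these readings the median is 6.2 and the MAD is about 0.1, so the band is roughly 5.9 to 6.5 and only 4.2 is flagged.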




The largest number of abnormal pH values occurred in 1990, and the highest ratio of abnormal pH values occurred in 2007.
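
Count and ratio can point to different years because the ratio divides by the number of records in each year. The Python sketch below (the report itself used R) illustrates this with made-up flags; the helper name `abnormal_by_year` and the record counts are hypothetical and are not the dataset's actual figures.

```python
from collections import Counter


def abnormal_by_year(flags):
    """Return {year: (abnormal_count, abnormal_ratio)}.

    `flags` is an iterable of (year, is_abnormal) pairs, one per pH record.
    """
    total = Counter()
    abnormal = Counter()
    for year, is_abnormal in flags:
        total[year] += 1
        if is_abnormal:
            abnormal[year] += 1
    return {y: (abnormal[y], abnormal[y] / total[y]) for y in sorted(total)}


# Made-up example: 1990 has more abnormal records, 2007 a higher ratio.
flags = (
    [(1990, True)] * 4 + [(1990, False)] * 6
    + [(2007, True)] * 3 + [(2007, False)] * 1
)
summary = abnormal_by_year(flags)
most_records = max(summary, key=lambda y: summary[y][0])
highest_ratio = max(summary, key=lambda y: summary[y][1])
print(summary, most_records, highest_ratio)
```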

pH Measurements in 1990 - 2019

For the following plots, multiple pH values recorded on the same day are averaged into a single daily value. In the following heatmap, 2007 stands out from the other years because pH shifted from low to high as the seasons changed.
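
The daily averaging step can be sketched as below in Python (the report itself used R). It assumes a plain arithmetic mean of the pH values, which is my reading of "averaged" here; averaging hydrogen ion concentrations instead would give slightly different results. The helper name `daily_mean` and the pH values are made up, while the dates come from the replicates table.

```python
from collections import defaultdict
from statistics import mean


def daily_mean(records):
    """Collapse (date, pH) pairs to one arithmetic-mean pH per date."""
    by_day = defaultdict(list)
    for day, ph in records:
        by_day[day].append(ph)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up pH values on dates that appear in the replicates table.
records = [
    ("2016-09-01", 6.0),
    ("2016-09-01", 6.5),
    ("2016-09-07", 6.2),
]
daily = daily_mean(records)
print(daily)
```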

Figure 3: pH Heatmap in 1990 - 2019


In the following scatter plot, most yearly pH medians fall between 6.0 and 6.5. The year 2007 deserves attention because its median is 4.2.

Figure 4: pH Measurement with Trend

Future Work

  1. The years 1990, 2006, and 2007 have the three highest ratios of abnormal pH records. A decision must be made on whether to keep these abnormal records for further modeling.
  2. Further EDA on the other chemical characteristics should be done.
  3. Examining the correlations among the chemical characteristics could be helpful.

Created Date: 2022-05-18

Last Modified Date: 2022-09-27