LBB-UL: Understanding the Characteristics of Ocean Water

Wayan K.

May 8, 2021

1 About the Data

This report uses data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI). Formed in 1949, the organization is a partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service, and Scripps Institution of Oceanography, dedicated to the study of the marine environment off the coast of California, the management of its living resources, and the monitoring of indicators of El Niño and climate change.

Data collected include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.

This report evaluates the ocean water variables gathered by CalCOFI and tries to extract as much information as possible from them using dimensionality reduction and clustering techniques. We will also look at which variables are strongly correlated with each other or characteristic of a particular cluster, and at outlier conditions in the ocean water.

Due to time constraints, the ocean water variables used in this report are limited to the following:
- Sta_ID: CalCOFI Line and Station
- Depthm: Depth in meters
- T_degC: Temperature of Water
- Salnty: Salinity of water
- O2ml_L: Milliliters of dissolved oxygen per Liter seawater
- STheta: Potential Density of Water
- O2Sat: Oxygen Saturation
- Oxy_umol/kg: Oxygen in micro moles per kilogram of seawater
- ChlorA: Acetone extracted chlorophyll-a measured fluorometrically
- Phaeop: Phaeophytin concentration measured fluorometrically
- PO4µM: Phosphate concentration
- SiO3µM: Silicate concentration
- NO2µM: Nitrite concentration
- NO3µM: Nitrate concentration
- NH3µM: Ammonium concentration
- LightP: Light intensities expressed as a percentage
- R_PRES: Pressure in decibars

library(dplyr)       # provides %>% and select()
library(FactoMineR)  # provides PCA()
ocean <- read.csv("bottle.csv", stringsAsFactors = TRUE, na.strings = c("", "NA"))
tail(ocean)

1.1 Data Dictionary:

Cst_Cnt: Auto-numbered Cast Count - all casts consecutively numbered. 1 is first station done
Btl_Cnt: Auto-numbered Bottle count- all bottles ever sampled, consecutively numbered
Sta_ID: CalCOFI Line and Station
Depth_ID: [Century]-[YY][MM][ShipCode]-[CastType][Julian Day]-[CastTime]-[Line][Sta][Depth][Bottle]-[Rec_Ind]
Depthm: Depth in meters
T_degC: Temperature of Water
Salnty: Salinity of water
O2ml_L: Milliliters of dissolved oxygen per Liter seawater
STheta: Potential Density of Water
O2Sat: Oxygen Saturation
Oxy_umol/kg: Oxygen in micro moles per kilogram of seawater
BtlNum: Bottle Number
RecInd: Record Indicator
T_prec: Temperature Units of Precision
T_qual: Temperature Quality Code
S_prec: Salinity Units of Precision
S_qual: Salinity Quality Code
P_qual: Pressure Quality Code
O_qual: Oxygen Quality Code
SThtaq: Sigma Theta Quality Code
O2Satq: Oxygen Saturation Quality Code
ChlorA: Acetone extracted chlorophyll-a measured fluorometrically
Chlqua: Chlorophyll-a Quality Code
Phaeop: Phaeophytin concentration measured fluorometrically
Phaqua: Phaeophytin Quality Code
PO4µM: Phosphate concentration
PO4q: Phosphate Quality Code
SiO3µM: Silicate concentration
SiO3qu: Silicate Quality Code
NO2µM: Nitrite concentration
NO2q: Nitrite Quality Code
NO3µM: Nitrate concentration
NO3q: Nitrate Quality Code
NH3µM: Ammonium concentration
NH3q: Ammonium Quality Code
C14As1: 14C Assimilation of replicate 1
C14A1p: 14C Assimilation of replicate 1 precision
C14A1q: 14C As1 Quality Code
C14As2: 14C Assimilation of replicate 2
C14A2p: 14C Assimilation of replicate 2 precision
C14A2q: 14C As2 Quality Code
DarkAs: 14C Assimilation Dark Bottle
DarkAp: 14C Assimilation Dark Bottle precision
DarkAq: 14C Assimilation Dark Bottle Quality Code
MeanAs: Mean 14C Assimilation of Bottle 1 and 2
MeanAp: Mean 14C Assimilation of Bottle 1 and 2 precision
MeanAq: Mean 14C Assimilation Quality Code
IncTim: Incubation time
LightP: Light intensities expressed as a percentage
R_Depth: Reported Depth in Meters
R_TEMP: Reported Temperature
R_POTEMP: Reported Potential Temperature
R_SALINITY: Reported Salinity
R_SIGMA: Reported Potential Density of water
R_SVA: Reported Specific Volume Anomaly
R_DYNHT: Reported Dynamic Height
R_O2: Reported milliliters of oxygen per liter of seawater
R_O2Sat: Reported Oxygen Saturation
R_SIO3: Reported Silicate Concentration
R_PO4: Reported Phosphate Concentration
R_NO3: Reported Nitrate Concentration
R_NO2: Reported Nitrite Concentration
R_NH4: Reported Ammonium Concentration
R_CHLA: Reported Chlorophyll-a
R_PHAEO: Reported Phaeophytin
R_PRES: Pressure in decibars
R_SAMP: Sample number
R_Oxy_µmol/kg: Reported Oxygen in micro moles per kilogram of seawater
DIC1: Replicate 1 - Dissolved Inorganic Carbon in micro moles per kilogram of seawater
DIC2: Replicate 2 - Dissolved Inorganic Carbon in micro moles per kilogram of seawater
TA1: Replicate 1 - Total Alkalinity in micro moles per kilogram of seawater
TA2: Replicate 2 - Total Alkalinity in micro moles per kilogram of seawater
DIC Quality (Comments): Quality Comments associated with DIC sampling, drawing and analysis

Checking the data for NA values:

anyNA(ocean)

#> [1] TRUE

colSums(is.na(ocean))

#>             Cst_Cnt             Btl_Cnt              Sta_ID            Depth_ID 
#>                   0                   0                   0                   0 
#>              Depthm              T_degC              Salnty              O2ml_L 
#>                   0               10963               47354              168662 
#>              STheta               O2Sat        Oxy_Âµmol.Kg              BtlNum 
#>               52689              203589              203595              746196 
#>              RecInd              T_prec              T_qual              S_prec 
#>                   0               10963              841736               47354 
#>              S_qual              P_qual              O_qual              SThtaq 
#>              789949              191108              680187              799040 
#>              O2Satq              ChlorA              Chlqua              Phaeop 
#>              647066              639591              225697              639592 
#>              Phaqua               PO4uM                PO4q              SiO3uM 
#>              225693              451546              413077              510772 
#>              SiO3qu               NO2uM                NO2q               NO3uM 
#>              353997              527287              335389              527460 
#>                NO3q               NH3uM                NH3q              C14As1 
#>              334930              799901               56564              850431 
#>              C14A1p              C14A1q              C14As2              C14A2p 
#>              852103               16258              850449              852121 
#>              C14A2q              DarkAs              DarkAp              DarkAq 
#>               16240              842214              844406               24423 
#>              MeanAs              MeanAp              MeanAq              IncTim 
#>              842213              844406               24424              850426 
#>              LightP             R_Depth              R_TEMP            R_POTEMP 
#>              846212                   0               10963               46047 
#>          R_SALINITY             R_SIGMA               R_SVA             R_DYNHT 
#>               47354               52856               52771               46657 
#>                R_O2             R_O2Sat              R_SIO3               R_PO4 
#>              168662              198415              510764              451538 
#>               R_NO3               R_NO2               R_NH4              R_CHLA 
#>              527452              527279              799881              639587 
#>             R_PHAEO              R_PRES              R_SAMP                DIC1 
#>              639588                   0              742857              862864 
#>                DIC2                 TA1                 TA2                 pH2 
#>              864639              862779              864629              864853 
#>                 pH1 DIC.Quality.Comment 
#>              864779              864808

str(ocean)

#> 'data.frame':    864863 obs. of  74 variables:
#>  $ Cst_Cnt            : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Btl_Cnt            : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Sta_ID             : Factor w/ 2634 levels "001.0 168.0",..: 313 313 313 313 313 313 313 313 313 313 ...
#>  $ Depth_ID           : Factor w/ 864850 levels "19-4903CR-HY-060-0930-05400560-0000A-3",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Depthm             : int  0 8 10 19 20 30 39 50 58 75 ...
#>  $ T_degC             : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ Salnty             : num  33.4 33.4 33.4 33.4 33.4 ...
#>  $ O2ml_L             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ STheta             : num  25.6 25.7 25.7 25.6 25.6 ...
#>  $ O2Sat              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ Oxy_Âµmol.Kg       : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ BtlNum             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ RecInd             : int  3 3 7 3 7 7 3 7 3 7 ...
#>  $ T_prec             : int  1 2 2 2 2 2 2 2 2 2 ...
#>  $ T_qual             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ S_prec             : int  2 2 3 2 3 3 2 3 2 3 ...
#>  $ S_qual             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ P_qual             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ O_qual             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ SThtaq             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ O2Satq             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ ChlorA             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ Chlqua             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ Phaeop             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ Phaqua             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ PO4uM              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ PO4q               : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ SiO3uM             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ SiO3qu             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ NO2uM              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ NO2q               : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ NO3uM              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ NO3q               : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ NH3uM              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ NH3q               : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ C14As1             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ C14A1p             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ C14A1q             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ C14As2             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ C14A2p             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ C14A2q             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ DarkAs             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DarkAp             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DarkAq             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ MeanAs             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ MeanAp             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ MeanAq             : int  9 9 9 9 9 9 9 9 9 9 ...
#>  $ IncTim             : Factor w/ 199 levels "12/30/1899 00:01:00",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ LightP             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_Depth            : num  0 8 10 19 20 30 39 50 58 75 ...
#>  $ R_TEMP             : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ R_POTEMP           : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ R_SALINITY         : num  33.4 33.4 33.4 33.4 33.4 ...
#>  $ R_SIGMA            : num  25.6 25.6 25.6 25.6 25.6 ...
#>  $ R_SVA              : num  233 232 233 234 234 ...
#>  $ R_DYNHT            : num  0 0.01 0.02 0.04 0.04 0.07 0.09 0.11 0.13 0.17 ...
#>  $ R_O2               : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_O2Sat            : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_SIO3             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_PO4              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_NO3              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_NO2              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_NH4              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_CHLA             : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_PHAEO            : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ R_PRES             : int  0 8 10 19 20 30 39 50 58 75 ...
#>  $ R_SAMP             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DIC1               : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DIC2               : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ TA1                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ TA2                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ pH2                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ pH1                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ DIC.Quality.Comment: Factor w/ 37 levels "Bottle tripped at correct depth",..: NA NA NA NA NA NA NA NA NA NA ...

Note: Many columns of the dataset contain NA/missing values. Further data cleansing is therefore required before modeling.
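Raw NA counts are hard to compare at a glance, so it can help to express them as a share of all rows. The following is a minimal sketch on a toy data frame; the object `toy` and its column names are illustrative assumptions and not part of the CalCOFI data. The same expression could be applied to `ocean`.

```r
# Toy data frame standing in for `ocean`; names are illustrative only.
toy <- data.frame(
  depth = c(0, 8, 10, 19),
  temp  = c(10.5, NA, 10.4, 10.4),
  chl   = c(NA, NA, NA, 0.3)
)

# Percentage of missing values per column, sorted from most to least missing.
na_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(round(na_pct, 1))
```

Columns with a very high share of missing values carry little measured information, which is worth keeping in mind when they are imputed later.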

2 Data Cleansing & Exploratory Data Analysis

As stated earlier, this report only processes the following columns of the dataset:
- Sta_ID: CalCOFI Line and Station
- Depthm: Depth in meters
- T_degC: Temperature of Water
- Salnty: Salinity of water
- O2ml_L: Milliliters of dissolved oxygen per Liter seawater
- STheta: Potential Density of Water
- O2Sat: Oxygen Saturation
- Oxy_umol/kg: Oxygen in micro moles per kilogram of seawater
- ChlorA: Acetone extracted chlorophyll-a measured fluorometrically
- Phaeop: Phaeophytin concentration measured fluorometrically
- PO4µM: Phosphate concentration
- SiO3µM: Silicate concentration
- NO2µM: Nitrite concentration
- NO3µM: Nitrate concentration
- NH3µM: Ammonium concentration
- LightP: Light intensities expressed as a percentage
- R_PRES: Pressure in decibars

The following data cleansing step is applied to obtain a leaner dataset:
- All columns other than those listed above are omitted, as they are not used in this report.

ocean_clean <- ocean %>% 
  dplyr::select(Sta_ID, Depthm, T_degC, Salnty, O2ml_L,  O2Sat, Oxy_Âµmol.Kg, ChlorA, Phaeop, PO4uM, SiO3uM, NO2uM, NO3uM, NH3uM, LightP, R_PRES)

Note: The preliminary check shows NA/missing values in many of the columns.

To obtain a better overall result, we replace missing/NA values according to the data type of each column:

Missing values in numeric columns are replaced by the column mean (using the mean() function).
Missing values in factor columns are replaced by the most frequent value in the column (using the custom Mode() function defined below; base R's mode() returns the storage type, not the statistical mode).

Creating the helper function for data cleansing:

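# Return the most frequent value(s) of x; NA when every value occurs equally often.
Mode <- function(x) {
  counts <- table(x)
  top <- max(counts)
  if (all(counts == top)) {
    return(NA)
  }
  modes <- names(counts)[counts == top]
  # table() stores the values as names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}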

ocean_clean$Sta_ID[is.na(ocean_clean$Sta_ID)] <-  Mode(ocean_clean$Sta_ID)
ocean_clean$Depthm[is.na(ocean_clean$Depthm)] <-  mean(ocean_clean$Depthm, na.rm = T)
ocean_clean$T_degC[is.na(ocean_clean$T_degC)] <- mean(ocean_clean$T_degC, na.rm = T)
ocean_clean$Salnty[is.na(ocean_clean$Salnty)] <-  mean(ocean_clean$Salnty, na.rm = T)
ocean_clean$O2ml_L[is.na(ocean_clean$O2ml_L)] <-  mean(ocean_clean$O2ml_L, na.rm = T)
ocean_clean$O2Sat[is.na(ocean_clean$O2Sat)] <-  mean(ocean_clean$O2Sat, na.rm = T)
ocean_clean$Oxy_Âµmol.Kg[is.na(ocean_clean$Oxy_Âµmol.Kg)] <-  mean(ocean_clean$Oxy_Âµmol.Kg, na.rm = T)
ocean_clean$ChlorA[is.na(ocean_clean$ChlorA)] <- mean(ocean_clean$ChlorA, na.rm = T)
ocean_clean$Phaeop[is.na(ocean_clean$Phaeop)] <-  mean(ocean_clean$Phaeop, na.rm = T)
ocean_clean$PO4uM[is.na(ocean_clean$PO4uM)] <-  mean(ocean_clean$PO4uM, na.rm = T)
ocean_clean$SiO3uM[is.na(ocean_clean$SiO3uM)] <-  mean(ocean_clean$SiO3uM, na.rm = T)
ocean_clean$NO2uM[is.na(ocean_clean$NO2uM)] <-  mean(ocean_clean$NO2uM, na.rm = T)
ocean_clean$NO3uM[is.na(ocean_clean$NO3uM)] <-  mean(ocean_clean$NO3uM, na.rm = T)
ocean_clean$NH3uM[is.na(ocean_clean$NH3uM)] <-  mean(ocean_clean$NH3uM, na.rm = T)
ocean_clean$LightP[is.na(ocean_clean$LightP)] <-  mean(ocean_clean$LightP, na.rm = T)
ocean_clean$R_PRES[is.na(ocean_clean$R_PRES)] <-  mean(ocean_clean$R_PRES, na.rm = T)
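The column-by-column replacements above can also be written as one helper applied to every column. This is only a sketch under stated assumptions: `toy` and `impute_column()` are hypothetical names introduced here, and ties between factor levels are resolved by taking the first most frequent level, which differs from `Mode()` above.

```r
# Toy data with one factor and two numeric columns (names are illustrative).
toy <- data.frame(
  station = factor(c("A", "A", NA, "B")),
  temp    = c(10, NA, 14, 12),
  salt    = c(33.0, 33.4, NA, 33.8)
)

# Replace NAs: column mean for numeric data, most frequent level for factors.
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.factor(x)) {
    top_level <- names(which.max(table(x)))  # first level among ties
    x[is.na(x)] <- top_level
  }
  x
}

# Apply the helper to every column while keeping the data frame structure.
toy_clean <- toy
toy_clean[] <- lapply(toy, impute_column)
print(toy_clean)
```

Writing the rule once avoids copy-paste slips when the list of columns changes.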

Check NA/ Missing Values for ocean_clean object:

colSums(is.na(ocean_clean))

#>       Sta_ID       Depthm       T_degC       Salnty       O2ml_L        O2Sat 
#>            0            0            0            0            0            0 
#> Oxy_Âµmol.Kg       ChlorA       Phaeop        PO4uM       SiO3uM        NO2uM 
#>            0            0            0            0            0            0 
#>        NO3uM        NH3uM       LightP       R_PRES 
#>            0            0            0            0

summary(ocean_clean)

#>          Sta_ID           Depthm           T_degC          Salnty     
#>  090.0 045.0: 10043   Min.   :   0.0   Min.   : 1.44   Min.   :28.43  
#>  090.0 070.0: 10039   1st Qu.:  46.0   1st Qu.: 7.72   1st Qu.:33.50  
#>  090.0 037.0:  9771   Median : 125.0   Median :10.13   Median :33.84  
#>  090.0 060.0:  9521   Mean   : 226.8   Mean   :10.80   Mean   :33.84  
#>  080.0 060.0:  9393   3rd Qu.: 300.0   3rd Qu.:13.83   3rd Qu.:34.18  
#>  080.0 070.0:  9318   Max.   :5351.0   Max.   :31.14   Max.   :37.03  
#>  (Other)    :806778                                                   
#>      O2ml_L           O2Sat        Oxy_Âµmol.Kg          ChlorA       
#>  Min.   :-0.010   Min.   : -0.1   Min.   : -0.4349   Min.   :-0.0010  
#>  1st Qu.: 1.890   1st Qu.: 31.5   1st Qu.: 90.4764   1st Qu.: 0.4502  
#>  Median : 3.392   Median : 57.1   Median :148.8087   Median : 0.4502  
#>  Mean   : 3.392   Mean   : 57.1   Mean   :148.8087   Mean   : 0.4502  
#>  3rd Qu.: 5.240   3rd Qu.: 88.3   3rd Qu.:225.2387   3rd Qu.: 0.4502  
#>  Max.   :11.130   Max.   :214.1   Max.   :485.7018   Max.   :66.1100  
#>                                                                       
#>      Phaeop            PO4uM           SiO3uM           NO2uM        
#>  Min.   :-3.8900   Min.   :0.000   Min.   :  0.00   Min.   :0.00000  
#>  1st Qu.: 0.1986   1st Qu.:1.565   1st Qu.: 26.61   1st Qu.:0.02000  
#>  Median : 0.1986   Median :1.565   Median : 26.61   Median :0.04232  
#>  Mean   : 0.1986   Mean   :1.565   Mean   : 26.61   Mean   :0.04232  
#>  3rd Qu.: 0.1986   3rd Qu.:1.565   3rd Qu.: 26.61   3rd Qu.:0.04232  
#>  Max.   :65.3000   Max.   :5.210   Max.   :196.00   Max.   :8.19000  
#>                                                                      
#>      NO3uM          NH3uM              LightP          R_PRES      
#>  Min.   :-0.4   Min.   : 0.00000   Min.   : 0.00   Min.   :   0.0  
#>  1st Qu.:17.3   1st Qu.: 0.08488   1st Qu.:18.36   1st Qu.:  46.0  
#>  Median :17.3   Median : 0.08488   Median :18.36   Median : 126.0  
#>  Mean   :17.3   Mean   : 0.08488   Mean   :18.36   Mean   : 228.4  
#>  3rd Qu.:17.3   3rd Qu.: 0.08488   3rd Qu.:18.36   3rd Qu.: 302.0  
#>  Max.   :95.0   Max.   :15.63000   Max.   :99.90   Max.   :5458.0  
#>

str(ocean_clean)

#> 'data.frame':    864863 obs. of  16 variables:
#>  $ Sta_ID      : Factor w/ 2634 levels "001.0 168.0",..: 313 313 313 313 313 313 313 313 313 313 ...
#>  $ Depthm      : num  0 8 10 19 20 30 39 50 58 75 ...
#>  $ T_degC      : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ Salnty      : num  33.4 33.4 33.4 33.4 33.4 ...
#>  $ O2ml_L      : num  3.39 3.39 3.39 3.39 3.39 ...
#>  $ O2Sat       : num  57.1 57.1 57.1 57.1 57.1 ...
#>  $ Oxy_Âµmol.Kg: num  149 149 149 149 149 ...
#>  $ ChlorA      : num  0.45 0.45 0.45 0.45 0.45 ...
#>  $ Phaeop      : num  0.199 0.199 0.199 0.199 0.199 ...
#>  $ PO4uM       : num  1.56 1.56 1.56 1.56 1.56 ...
#>  $ SiO3uM      : num  26.6 26.6 26.6 26.6 26.6 ...
#>  $ NO2uM       : num  0.0423 0.0423 0.0423 0.0423 0.0423 ...
#>  $ NO3uM       : num  17.3 17.3 17.3 17.3 17.3 ...
#>  $ NH3uM       : num  0.0849 0.0849 0.0849 0.0849 0.0849 ...
#>  $ LightP      : num  18.4 18.4 18.4 18.4 18.4 ...
#>  $ R_PRES      : num  0 8 10 19 20 30 39 50 58 75 ...

This report uses dimensionality reduction with PCA (Principal Component Analysis) as part of the exploratory analysis of the ocean dataset. The goal of dimensionality reduction is to reduce the number of variables (dimensions/features) in the data while retaining as much information as possible. In this report, we assume that the selected principal components (PCs) should retain at least 80% of the information (variance) in the data.
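The 80% criterion can be turned into a small calculation: take the eigenvalue of each component, accumulate the shares of variance, and pick the first component at which the cumulative share reaches the threshold. The sketch below uses base R's `prcomp()` on the built-in `USArrests` data purely for illustration; it is an assumption-laden stand-in, not the PCA run of this report.

```r
# Base-R PCA on a built-in data set, centered and scaled to unit variance.
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eig <- pca_toy$sdev^2
cum_pct <- cumsum(eig) / sum(eig) * 100

# Smallest number of components whose cumulative share reaches the threshold.
threshold <- 80
n_comp <- which(cum_pct >= threshold)[1]
print(round(cum_pct, 1))
print(n_comp)
```

The same logic applies to any table of eigenvalues with a cumulative percentage column.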

PCA is very useful for retaining information while reducing the dimension of the data. However, the data must be properly scaled for PCA to be useful: because the variables in this dataset span very different ranges, the data is scaled before running PCA.

To conduct PCA, all non-numeric columns must be omitted, as PCA only processes numeric data.

ocean_num <- ocean_clean %>% 
  dplyr::select(-Sta_ID)
tail(ocean_num)

3 Scaling the Data

ocean_scale <- scale(ocean_num)
head(ocean_scale)

#>          Depthm      T_degC     Salnty                  O2ml_L
#> [1,] -0.7177085 -0.07106668 -0.8916057 0.000000000000002148652
#> [2,] -0.6923961 -0.08055245 -0.8916057 0.000000000000002148652
#> [3,] -0.6860679 -0.08055245 -0.8982869 0.000000000000002148652
#> [4,] -0.6575915 -0.08292389 -0.9361470 0.000000000000002148652
#> [5,] -0.6544274 -0.08292389 -0.9339199 0.000000000000002148652
#> [6,] -0.6227869 -0.08292389 -0.9116493 0.000000000000002148652
#>                          O2Sat            Oxy_Âµmol.Kg                   ChlorA
#> [1,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [2,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [3,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [4,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [5,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [6,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#>                       Phaeop                    PO4uM                  SiO3uM
#> [1,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [2,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [3,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [4,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [5,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [6,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#>                          NO2uM NO3uM                   NH3uM
#> [1,] -0.0000000000000008527376     0 0.000000000000006675914
#> [2,] -0.0000000000000008527376     0 0.000000000000006675914
#> [3,] -0.0000000000000008527376     0 0.000000000000006675914
#> [4,] -0.0000000000000008527376     0 0.000000000000006675914
#> [5,] -0.0000000000000008527376     0 0.000000000000006675914
#> [6,] -0.0000000000000008527376     0 0.000000000000006675914
#>                       LightP     R_PRES
#> [1,] -0.00000000000001458751 -0.7149503
#> [2,] -0.00000000000001458751 -0.6899078
#> [3,] -0.00000000000001458751 -0.6836472
#> [4,] -0.00000000000001458751 -0.6554744
#> [5,] -0.00000000000001458751 -0.6523440
#> [6,] -0.00000000000001458751 -0.6210409

summary(ocean_scale)

#>      Depthm            T_degC            Salnty             O2ml_L       
#>  Min.   :-0.7177   Min.   :-2.2196   Min.   :-12.0470   Min.   :-1.8291  
#>  1st Qu.:-0.5722   1st Qu.:-0.7303   1st Qu.: -0.7491   1st Qu.:-0.8077  
#>  Median :-0.3222   Median :-0.1588   Median :  0.0000   Median : 0.0000  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   :  0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.2315   3rd Qu.: 0.7186   3rd Qu.:  0.7564   3rd Qu.: 0.9932  
#>  Max.   :16.2131   Max.   : 4.8236   Max.   :  7.1125   Max.   : 4.1596  
#>      O2Sat          Oxy_Âµmol.Kg         ChlorA             Phaeop      
#>  Min.   :-1.7636   Min.   :-1.8925   Min.   : -0.7315   Min.   :-21.28  
#>  1st Qu.:-0.7894   1st Qu.:-0.7397   1st Qu.:  0.0000   1st Qu.:  0.00  
#>  Median : 0.0000   Median : 0.0000   Median :  0.0000   Median :  0.00  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   :  0.0000   Mean   :  0.00  
#>  3rd Qu.: 0.9618   3rd Qu.: 0.9692   3rd Qu.:  0.0000   3rd Qu.:  0.00  
#>  Max.   : 4.8402   Max.   : 4.2720   Max.   :106.4507   Max.   :338.76  
#>      PO4uM            SiO3uM           NO2uM              NO3uM       
#>  Min.   :-2.185   Min.   :-1.504   Min.   : -0.6501   Min.   :-1.944  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: -0.3428   1st Qu.: 0.000  
#>  Median : 0.000   Median : 0.000   Median :  0.0000   Median : 0.000  
#>  Mean   : 0.000   Mean   : 0.000   Mean   :  0.0000   Mean   : 0.000  
#>  3rd Qu.: 0.000   3rd Qu.: 0.000   3rd Qu.:  0.0000   3rd Qu.: 0.000  
#>  Max.   : 5.090   Max.   : 9.575   Max.   :125.1611   Max.   : 8.531  
#>      NH3uM             LightP           R_PRES       
#>  Min.   : -1.134   Min.   :-4.187   Min.   :-0.7149  
#>  1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:-0.5710  
#>  Median :  0.000   Median : 0.000   Median :-0.3205  
#>  Mean   :  0.000   Mean   : 0.000   Mean   : 0.0000  
#>  3rd Qu.:  0.000   3rd Qu.: 0.000   3rd Qu.: 0.2304  
#>  Max.   :207.722   Max.   :18.601   Max.   :16.3703
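As a sanity check on what `scale()` does, the sketch below standardizes a small vector by hand, as (x - mean) / sd, and compares the result with `scale()`. It also shows why mean-imputed cells appear as values around 1e-15 in the output above: an imputed value equals the column mean, so it standardizes to zero up to floating-point error. The vector `x` is an illustrative assumption.

```r
# Toy vector with one missing value, imputed by the mean as in this report.
x <- c(2, 4, NA, 8, 10)
x[is.na(x)] <- mean(x, na.rm = TRUE)

# Standardize with scale() and by hand.
z <- as.numeric(scale(x))
z_manual <- (x - mean(x)) / sd(x)

print(z)
print(isTRUE(all.equal(z, z_manual)))
```

After scaling, the column has mean 0 and standard deviation 1, and the imputed third element sits at the center.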

4 Outlier Detection

Before proceeding with the PCA and clustering procedures, we check whether there are any outliers in the data. Because outliers greatly affect clustering results, they need to be removed from the data. Outliers can be visualized with a biplot of the PCA results.
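To illustrate the idea, the sketch below builds a toy dataset with one deliberately extreme row and locates it by its distance from the origin in the plane of the first two principal components, which is where such a point would stand apart in a biplot. Base R's `prcomp()` is used as a stand-in, and the distance rule is an assumption made for this example rather than the procedure used in this report.

```r
# Toy data: 50 well-behaved points plus one extreme observation (row 51).
set.seed(42)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy <- rbind(toy, data.frame(a = 15, b = -15, c = 15))

# PCA on centered and scaled data.
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Distance of every observation from the origin of the PC1-PC2 plane.
dist_pc <- sqrt(rowSums(pca_toy$x[, 1:2]^2))

# Row with the largest distance, a candidate outlier.
suspect <- unname(which.max(dist_pc))
print(suspect)
```

Calling `biplot(pca_toy)` on this object draws the corresponding plot, with the extreme row far from the main cloud of points.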

4.1 PCA Procedure

pca_ocean <- PCA(X = ocean_scale, scale.unit = F, ncp = 15)

summary(pca_ocean)

#> 
#> Call:
#> PCA(X = ocean_scale, scale.unit = F, ncp = 15) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
#> Variance               6.736   1.743   1.427   1.053   0.998   0.857   0.801
#> % of var.             44.909  11.623   9.513   7.019   6.652   5.713   5.342
#> Cumulative % of var.  44.909  56.532  66.045  73.064  79.716  85.429  90.772
#>                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
#> Variance               0.489   0.373   0.221   0.170   0.094   0.035   0.003
#> % of var.              3.260   2.488   1.471   1.131   0.625   0.234   0.018
#> Cumulative % of var.  94.032  96.520  97.991  99.122  99.748  99.982 100.000
#>                       Dim.15
#> Variance               0.000
#> % of var.              0.000
#> Cumulative % of var. 100.000
#> 
#> Individuals (the 10 first)
#>                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
#> 1            |  1.351 | -0.628  0.000  0.216 | -0.276  0.000  0.042 |  0.789
#> 2            |  1.325 | -0.611  0.000  0.213 | -0.268  0.000  0.041 |  0.760
#> 3            |  1.323 | -0.610  0.000  0.212 | -0.266  0.000  0.041 |  0.754
#> 4            |  1.321 | -0.605  0.000  0.210 | -0.261  0.000  0.039 |  0.727
#> 5            |  1.316 | -0.603  0.000  0.209 | -0.259  0.000  0.039 |  0.723
#> 6            |  1.269 | -0.579  0.000  0.208 | -0.248  0.000  0.038 |  0.689
#> 7            |  1.227 | -0.558  0.000  0.206 | -0.238  0.000  0.038 |  0.658
#> 8            |  1.226 | -0.534  0.000  0.190 | -0.226  0.000  0.034 |  0.615
#> 9            |  1.215 | -0.510  0.000  0.176 | -0.215  0.000  0.031 |  0.581
#> 10           |  1.052 | -0.418  0.000  0.158 | -0.182  0.000  0.030 |  0.507
#>                 ctr   cos2  
#> 1             0.000  0.340 |
#> 2             0.000  0.329 |
#> 3             0.000  0.325 |
#> 4             0.000  0.302 |
#> 5             0.000  0.302 |
#> 6             0.000  0.295 |
#> 7             0.000  0.288 |
#> 8             0.000  0.252 |
#> 9             0.000  0.229 |
#> 10            0.000  0.232 |
#> 
#> Variables (the 10 first)
#>                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
#> Depthm       |  0.703  7.339  0.494 |  0.199  2.265  0.039 | -0.623 27.213
#> T_degC       | -0.807  9.660  0.651 | -0.120  0.826  0.014 |  0.223  3.476
#> Salnty       |  0.763  8.644  0.582 |  0.099  0.565  0.010 | -0.072  0.367
#> O2ml_L       | -0.922 12.628  0.851 |  0.011  0.008  0.000 | -0.067  0.312
#> O2Sat        | -0.926 12.722  0.857 |  0.008  0.004  0.000 | -0.092  0.592
#> Oxy_µmol.Kg  | -0.921 12.590  0.848 |  0.017  0.016  0.000 | -0.096  0.643
#> ChlorA       | -0.095  0.135  0.009 |  0.808 37.447  0.653 |  0.234  3.834
#> Phaeop       | -0.052  0.040  0.003 |  0.828 39.321  0.686 |  0.276  5.328
#> PO4uM        |  0.840 10.479  0.706 | -0.058  0.191  0.003 |  0.329  7.563
#> SiO3uM       |  0.776  8.944  0.602 | -0.068  0.269  0.005 |  0.345  8.349
#>                cos2  
#> Depthm        0.388 |
#> T_degC        0.050 |
#> Salnty        0.005 |
#> O2ml_L        0.004 |
#> O2Sat         0.008 |
#> Oxy_µmol.Kg   0.009 |
#> ChlorA        0.055 |
#> Phaeop        0.076 |
#> PO4uM         0.108 |
#> SiO3uM        0.119 |

pca_ocean$eig

#>             eigenvalue percentage of variance cumulative percentage of variance
#> comp 1  6.736372917691         44.90920471100                          44.90920
#> comp 2  1.743394087571         11.62264068919                          56.53185
#> comp 3  1.427004711776          9.51337574504                          66.04522
#> comp 4  1.052814963842          7.01877454109                          73.06400
#> comp 5  0.997849502027          6.65233770530                          79.71633
#> comp 6  0.856951799943          5.71301860532                          85.42935
#> comp 7  0.801360117322          5.34240695932                          90.77176
#> comp 8  0.488981352978          3.25987945577                          94.03164
#> comp 9  0.373257962127          2.48838929139                          96.52003
#> comp 10 0.220716696020          1.47144634150                          97.99147
#> comp 11 0.169638596089          1.13092528156                          99.12240
#> comp 12 0.093814278923          0.62542924931                          99.74783
#> comp 13 0.035130045478          0.23420057398                          99.98203
#> comp 14 0.002692486270          0.01794992922                          99.99998
#> comp 15 0.000003138149          0.00002092102                         100.00000
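The cumulative percentages in the table above can be used to read off how many components are needed to reach a given share of the variance. The sketch below copies the rounded cumulative values from the summary output and assumes an 80% target; the names `cum_var`, `threshold`, and `n_comp` are illustrative and not part of the original analysis.

```r
# Cumulative percentage of variance, copied from the eigenvalue table above
cum_var <- c(44.909, 56.532, 66.045, 73.064, 79.716, 85.429, 90.772,
             94.032, 96.520, 97.991, 99.122, 99.748, 99.982, 100.000, 100.000)

# Assumed target share of variance, in percent
threshold <- 80

# Smallest number of components whose cumulative share reaches the target
n_comp <- which(cum_var >= threshold)[1]
n_comp
# → 6
```

With the fitted object, the same vector should be available as the third column of `pca_ocean$eig`, as printed above.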

4.2 Visualization of PCA Result

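We visualize the PCA result to identify the most extreme outliers. The plot below shows the 20 individuals with the largest contributions (select = "contrib 20").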

plot.PCA(x = pca_ocean, 
         choix = "ind", 
         select = "contrib 20", 
         habillage = "Depthm")

From the plot, the observations below are the most extreme outliers; they will be omitted to improve the clustering result.

outlier <- ocean_num[c(525549,525548,758530,514616,514615),]

ocean_no_outlier <- ocean_clean[-c(525549,525548,758530,514616,514615),]
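One point worth checking here: negative indexing drops rows by position, whereas a PCA individuals plot labels points by row name. If rows were removed earlier in the pipeline, the two can differ. Whether that applies to `ocean_clean` is not visible in this section, so the following is only a toy illustration; the data frame `toy` and the id in `outlier_ids` are hypothetical.

```r
# Toy data frame whose row names no longer match the row positions
toy <- data.frame(x = c(1, 2, 3, 4, 100),
                  row.names = c("2", "5", "7", "9", "12"))

# Hypothetical outlier label, as it would be read from a plot
outlier_ids <- c("2")

# Dropping by row name removes the row labelled "2"
by_name <- toy[!(rownames(toy) %in% outlier_ids), , drop = FALSE]

# Dropping by position removes the second row, which is labelled "5"
by_position <- toy[-2, , drop = FALSE]

rownames(by_name)
rownames(by_position)
```

When row names and positions coincide, both approaches give the same result.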

5 Clustering Procedure

Prior to the clustering procedure, we need to make sure that the data only consists of numeric columns that will be used for clustering.

ocean_nooutnum <- ocean_no_outlier %>% 
  select_if(is.numeric)
tail(ocean_nooutnum)

str(ocean_nooutnum)

#> 'data.frame':    864858 obs. of  15 variables:
#>  $ Depthm      : num  0 8 10 19 20 30 39 50 58 75 ...
#>  $ T_degC      : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ Salnty      : num  33.4 33.4 33.4 33.4 33.4 ...
#>  $ O2ml_L      : num  3.39 3.39 3.39 3.39 3.39 ...
#>  $ O2Sat       : num  57.1 57.1 57.1 57.1 57.1 ...
#>  $ Oxy_µmol.Kg : num  149 149 149 149 149 ...
#>  $ ChlorA      : num  0.45 0.45 0.45 0.45 0.45 ...
#>  $ Phaeop      : num  0.199 0.199 0.199 0.199 0.199 ...
#>  $ PO4uM       : num  1.56 1.56 1.56 1.56 1.56 ...
#>  $ SiO3uM      : num  26.6 26.6 26.6 26.6 26.6 ...
#>  $ NO2uM       : num  0.0423 0.0423 0.0423 0.0423 0.0423 ...
#>  $ NO3uM       : num  17.3 17.3 17.3 17.3 17.3 ...
#>  $ NH3uM       : num  0.0849 0.0849 0.0849 0.0849 0.0849 ...
#>  $ LightP      : num  18.4 18.4 18.4 18.4 18.4 ...
#>  $ R_PRES      : num  0 8 10 19 20 30 39 50 58 75 ...

Because the data was scaled earlier, we will use the object ocean_nooutnum as the basis for the clustering procedure.
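Since K-means is distance-based, variables with large ranges (such as depth in meters) can dominate the result unless columns are standardized. The str() output above prints values in measurement units, so it is not clear from this section at which point scaling is applied. As a reference only, standardization with base R `scale()` can be sketched on a hypothetical toy data frame:

```r
# Hypothetical toy data with two variables on very different ranges
toy <- data.frame(Depthm = c(0, 50, 500, 3000),
                  T_degC = c(18, 15, 8, 2))

# Center each column to mean 0 and rescale to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

# Check the result: means are (numerically) 0 and standard deviations are 1
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
round(col_means, 10)
col_sds
```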

5.1 Choosing Optimum K

Before clustering, we use the Elbow Method to determine the optimum \(k\). This report sets maxK to 5, so the elbow plot compares solutions with 2 to 5 clusters.

# function to plot the elbow method
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(123)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(data = ocean_nooutnum, maxK = 5)

Based on the graph above, the optimal number of clusters is 4 (based on the 'elbow' in the graph). Therefore, we will use 4 clusters for the K-means procedure.
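To make the elbow reading concrete, here is a minimal sketch on simulated data, not on the CalCOFI data. It assumes three well-separated groups and uses `nstart = 10` so that each value of k is less sensitive to a single random start; the objects `toy`, `ks`, and `wss` are illustrative.

```r
set.seed(123)

# Simulated data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the decrease in wss flattens out
data.frame(k = ks, tot_withinss = wss)
# plot(ks, wss, type = "o") would draw the corresponding elbow curve
```

On data like this the drop in `wss` is steep up to k = 3 and small afterwards, which is the pattern the elbow method looks for.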

ocean_km <- kmeans(x = ocean_nooutnum,
                 centers = 4)

5.2 K-means Clustering

K-means is a clustering algorithm that groups observations based on distance. A clustering is considered good when the distances between observations in the same cluster are small and the distances between observations in different clusters are large.
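These two notions of distance correspond to components of a `kmeans()` result: `tot.withinss` (spread inside clusters) and `betweenss` (spread between cluster centers), which add up to `totss`. The sketch below uses the built-in `iris` data purely as a stand-in; the choice of 3 centers and `nstart = 10` is an assumption for illustration.

```r
set.seed(123)

# Stand-in data: the four numeric columns of iris, standardized
x <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 10)

# Within-cluster and between-cluster sums of squares
km$tot.withinss
km$betweenss

# The total sum of squares decomposes into the two parts
decomposition_holds <- isTRUE(all.equal(km$totss, km$tot.withinss + km$betweenss))
decomposition_holds

# Share of the total spread explained by the clustering (closer to 1 is better)
ratio <- km$betweenss / km$totss
ratio
```

The same quantities could be inspected on `ocean_km` to judge how well the 4 clusters separate the data.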

head(ocean_km$cluster)

#> 1 2 3 4 5 6 
#> 1 1 1 1 1 1

# Note: ocean_km was already fitted above, so this seed does not affect the cluster labels
RNGkind(sample.kind = "Rounding")
set.seed(123)

ocean_nooutnum$cluster <-  as.numeric(ocean_km$cluster)
tail(ocean_nooutnum)

str(ocean_nooutnum)

#> 'data.frame':    864858 obs. of  16 variables:
#>  $ Depthm      : num  0 8 10 19 20 30 39 50 58 75 ...
#>  $ T_degC      : num  10.5 10.5 10.5 10.4 10.4 ...
#>  $ Salnty      : num  33.4 33.4 33.4 33.4 33.4 ...
#>  $ O2ml_L      : num  3.39 3.39 3.39 3.39 3.39 ...
#>  $ O2Sat       : num  57.1 57.1 57.1 57.1 57.1 ...
#>  $ Oxy_µmol.Kg : num  149 149 149 149 149 ...
#>  $ ChlorA      : num  0.45 0.45 0.45 0.45 0.45 ...
#>  $ Phaeop      : num  0.199 0.199 0.199 0.199 0.199 ...
#>  $ PO4uM       : num  1.56 1.56 1.56 1.56 1.56 ...
#>  $ SiO3uM      : num  26.6 26.6 26.6 26.6 26.6 ...
#>  $ NO2uM       : num  0.0423 0.0423 0.0423 0.0423 0.0423 ...
#>  $ NO3uM       : num  17.3 17.3 17.3 17.3 17.3 ...
#>  $ NH3uM       : num  0.0849 0.0849 0.0849 0.0849 0.0849 ...
#>  $ LightP      : num  18.4 18.4 18.4 18.4 18.4 ...
#>  $ R_PRES      : num  0 8 10 19 20 30 39 50 58 75 ...
#>  $ cluster     : num  1 1 1 1 1 1 1 1 1 1 ...

5.3 Cluster Profiling

Attach each observation's cluster label back to the original data (before scaling) from which the outliers have already been removed.

ocean_profile <- ocean_nooutnum %>% 
  group_by(cluster) %>% 
  summarise_all(.funs = "mean")

ocean_pca <- PCA(ocean_nooutnum,
                  scale.unit = F, 
                  quali.sup = 16, 
                  graph = F)


fviz_pca_biplot(ocean_pca,
                habillage = 16, 
                geom.ind = "point",
                addEllipses = T)

Based on the graph above, we will conduct a cluster profiling to understand the characteristics of each cluster.

ocean_profile <- ocean_nooutnum %>% 
  group_by(cluster) %>% 
  summarise_all(.funs = "mean")
ocean_profile

ocean_profile %>% 
  tidyr::pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))
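The profiling step above can be illustrated on a small hypothetical data frame. The sketch uses base R (`aggregate`) instead of dplyr/tidyr so that it runs without extra packages; the values in `toy` are made up and only the column names are borrowed from the report.

```r
# Hypothetical toy data: three clusters, two variables
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  T_degC  = c(18, 20, 10, 12, 5, 7),
  Depthm  = c(5, 15, 100, 140, 500, 700)
)

# Mean of every variable per cluster (the cluster profile)
profile <- aggregate(. ~ cluster, data = toy, FUN = mean)

# For each variable, the cluster with the lowest and the highest mean
vars <- setdiff(names(profile), "cluster")
summary_tbl <- data.frame(
  name = vars,
  cluster_min_val = unname(sapply(vars, function(v) profile$cluster[which.min(profile[[v]])])),
  cluster_max_val = unname(sapply(vars, function(v) profile$cluster[which.max(profile[[v]])]))
)
summary_tbl
```

Here the cluster label is looked up explicitly through `profile$cluster`. Using `which.min(value)` directly, as in the report, returns a row position, which matches the cluster label as long as the profile rows are ordered by cluster 1 to k.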

6 Conclusion

From the Principal Component Analysis (PCA) and clustering procedures executed above, several conclusions can be drawn about the ocean water's characteristics:

From the PCA procedure, given the ocean water variables selected for this report, retaining at least 80% of the variance in the data requires only 6 principal components (out of 15 available), which together explain about 85.4% of the variance.
Based on the clustering procedure, the optimum number of clusters is 4, which gives the best representation of the overall data; cluster profiling further indicates the following:

Cluster 1: relatively high water temperature, dissolved oxygen (milliliters per liter of seawater), oxygen saturation, oxygen (micromoles per kilogram of seawater), acetone-extracted chlorophyll-a, nitrite concentration, ammonium concentration, and light intensity, but relatively low water depth, nitrate concentration, phaeophytin concentration, water pressure, water salinity, and silicate concentration.

Cluster 2: relatively high nitrate concentration, but relatively low acetone-extracted chlorophyll-a, light intensity, ammonium concentration, nitrite concentration, oxygen saturation, and oxygen (micromoles per kilogram of seawater).

Cluster 3: relatively high phosphate concentration, but relatively low dissolved oxygen (milliliters per liter of seawater).

Cluster 4: relatively high water depth, phaeophytin concentration, water pressure, water salinity, and silicate concentration, but relatively low water temperature.