This report gathers data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI). Formed in 1949, the organization is a partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service, and Scripps Institution of Oceanography dedicated to the study of the marine environment off the coast of California, the management of its living resources, and the monitoring of indicators of El Niño and climate change.
Data collected include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.
This report evaluates the ocean water variables gathered by CalCOFI and tries to extract as much information from them as possible using dimensionality reduction and clustering techniques. We also look at which variables are strongly correlated with each other or characteristic of a given cluster, and at some outlying conditions of the ocean water.
Due to time constraints, the ocean water variables used in this report are limited to the following:
- Sta_ID: CalCOFI Line and Station
- Depthm: Depth in meters
- T_degC: Temperature of Water
- Salnty: Salinity of water
- O2ml_L: Milliliters of dissolved oxygen per Liter seawater
- STheta: Potential Density of Water
- O2Sat: Oxygen Saturation
- Oxy_umol/kg: Oxygen in micro moles per kilogram of seawater
- ChlorA: Acetone extracted chlorophyll-a measured fluorometrically
- Phaeop: Phaeophytin concentration measured fluorometrically
- PO4µM: Phosphate concentration
- SiO3µM: Silicate concentration
- NO2µM: Nitrite concentration
- NO3µM: Nitrate concentration
- NH3µM: Ammonium concentration
- LightP: Light intensities expressed as a percentage
- R_PRES: Pressure in decibars
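# Packages used later in the report: dplyr (%>%, select), FactoMineR (PCA), factoextra (fviz_pca_biplot)
library(dplyr); library(FactoMineR); library(factoextra)
ocean <- read.csv("bottle.csv", stringsAsFactors = T, na.strings=c("","","NA"))
tail(ocean)
Checking for NA values in the data: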
anyNA(ocean)
#> [1] TRUE
colSums(is.na(ocean))
#> Cst_Cnt Btl_Cnt Sta_ID Depth_ID
#> 0 0 0 0
#> Depthm T_degC Salnty O2ml_L
#> 0 10963 47354 168662
#> STheta O2Sat Oxy_µmol.Kg BtlNum
#> 52689 203589 203595 746196
#> RecInd T_prec T_qual S_prec
#> 0 10963 841736 47354
#> S_qual P_qual O_qual SThtaq
#> 789949 191108 680187 799040
#> O2Satq ChlorA Chlqua Phaeop
#> 647066 639591 225697 639592
#> Phaqua PO4uM PO4q SiO3uM
#> 225693 451546 413077 510772
#> SiO3qu NO2uM NO2q NO3uM
#> 353997 527287 335389 527460
#> NO3q NH3uM NH3q C14As1
#> 334930 799901 56564 850431
#> C14A1p C14A1q C14As2 C14A2p
#> 852103 16258 850449 852121
#> C14A2q DarkAs DarkAp DarkAq
#> 16240 842214 844406 24423
#> MeanAs MeanAp MeanAq IncTim
#> 842213 844406 24424 850426
#> LightP R_Depth R_TEMP R_POTEMP
#> 846212 0 10963 46047
#> R_SALINITY R_SIGMA R_SVA R_DYNHT
#> 47354 52856 52771 46657
#> R_O2 R_O2Sat R_SIO3 R_PO4
#> 168662 198415 510764 451538
#> R_NO3 R_NO2 R_NH4 R_CHLA
#> 527452 527279 799881 639587
#> R_PHAEO R_PRES R_SAMP DIC1
#> 639588 0 742857 862864
#> DIC2 TA1 TA2 pH2
#> 864639 862779 864629 864853
#> pH1 DIC.Quality.Comment
#> 864779 864808
str(ocean)
#> 'data.frame': 864863 obs. of 74 variables:
#> $ Cst_Cnt : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Btl_Cnt : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Sta_ID : Factor w/ 2634 levels "001.0 168.0",..: 313 313 313 313 313 313 313 313 313 313 ...
#> $ Depth_ID : Factor w/ 864850 levels "19-4903CR-HY-060-0930-05400560-0000A-3",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ Depthm : int 0 8 10 19 20 30 39 50 58 75 ...
#> $ T_degC : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ Salnty : num 33.4 33.4 33.4 33.4 33.4 ...
#> $ O2ml_L : num NA NA NA NA NA NA NA NA NA NA ...
#> $ STheta : num 25.6 25.7 25.7 25.6 25.6 ...
#> $ O2Sat : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Oxy_µmol.Kg : num NA NA NA NA NA NA NA NA NA NA ...
#> $ BtlNum : int NA NA NA NA NA NA NA NA NA NA ...
#> $ RecInd : int 3 3 7 3 7 7 3 7 3 7 ...
#> $ T_prec : int 1 2 2 2 2 2 2 2 2 2 ...
#> $ T_qual : int NA NA NA NA NA NA NA NA NA NA ...
#> $ S_prec : int 2 2 3 2 3 3 2 3 2 3 ...
#> $ S_qual : int NA NA NA NA NA NA NA NA NA NA ...
#> $ P_qual : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ O_qual : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ SThtaq : int NA NA NA NA NA NA NA NA NA NA ...
#> $ O2Satq : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ ChlorA : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Chlqua : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ Phaeop : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Phaqua : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ PO4uM : num NA NA NA NA NA NA NA NA NA NA ...
#> $ PO4q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ SiO3uM : num NA NA NA NA NA NA NA NA NA NA ...
#> $ SiO3qu : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ NO2uM : num NA NA NA NA NA NA NA NA NA NA ...
#> $ NO2q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ NO3uM : num NA NA NA NA NA NA NA NA NA NA ...
#> $ NO3q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ NH3uM : num NA NA NA NA NA NA NA NA NA NA ...
#> $ NH3q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ C14As1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ C14A1p : int NA NA NA NA NA NA NA NA NA NA ...
#> $ C14A1q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ C14As2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ C14A2p : int NA NA NA NA NA NA NA NA NA NA ...
#> $ C14A2q : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ DarkAs : num NA NA NA NA NA NA NA NA NA NA ...
#> $ DarkAp : int NA NA NA NA NA NA NA NA NA NA ...
#> $ DarkAq : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ MeanAs : num NA NA NA NA NA NA NA NA NA NA ...
#> $ MeanAp : int NA NA NA NA NA NA NA NA NA NA ...
#> $ MeanAq : int 9 9 9 9 9 9 9 9 9 9 ...
#> $ IncTim : Factor w/ 199 levels "12/30/1899 00:01:00",..: NA NA NA NA NA NA NA NA NA NA ...
#> $ LightP : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_Depth : num 0 8 10 19 20 30 39 50 58 75 ...
#> $ R_TEMP : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ R_POTEMP : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ R_SALINITY : num 33.4 33.4 33.4 33.4 33.4 ...
#> $ R_SIGMA : num 25.6 25.6 25.6 25.6 25.6 ...
#> $ R_SVA : num 233 232 233 234 234 ...
#> $ R_DYNHT : num 0 0.01 0.02 0.04 0.04 0.07 0.09 0.11 0.13 0.17 ...
#> $ R_O2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_O2Sat : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_SIO3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_PO4 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_NO3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_NO2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_NH4 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_CHLA : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_PHAEO : num NA NA NA NA NA NA NA NA NA NA ...
#> $ R_PRES : int 0 8 10 19 20 30 39 50 58 75 ...
#> $ R_SAMP : int NA NA NA NA NA NA NA NA NA NA ...
#> $ DIC1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ DIC2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ TA1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ TA2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ pH2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ pH1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ DIC.Quality.Comment: Factor w/ 37 levels "Bottle tripped at correct depth",..: NA NA NA NA NA NA NA NA NA NA ...
Note: Several columns of the dataset contain NA (missing) values, so further data cleansing is required before modeling.
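Raw NA counts are easier to compare as a share of all rows. In the output above, for instance, LightP is missing in 846,212 of 864,863 rows, roughly 98%. A minimal sketch of that calculation follows; the `toy` data frame and its values are made up for illustration and only mimic a few of the column names.

```r
# Toy data frame standing in for the bottle data (values are made up).
toy <- data.frame(
  Depthm = c(0, 10, 20, 30),
  T_degC = c(10.5, NA, 10.4, 10.3),
  ChlorA = c(NA, NA, NA, 0.4)
)

# Share of missing values per column, as a percentage.
na_share <- round(colMeans(is.na(toy)) * 100, 1)
print(na_share)

# Columns where more than half of the rows are missing.
mostly_missing <- names(na_share)[na_share > 50]
print(mostly_missing)
```

On the real data the same expression would be `colMeans(is.na(ocean))`, which makes it easier to judge how much of a column would end up being imputed.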
As stated earlier in the scope of this report, only the following columns of the dataset are gathered and processed:
- Sta_ID: CalCOFI Line and Station
- Depthm: Depth in meters
- T_degC: Temperature of Water
- Salnty: Salinity of water
- O2ml_L: Milliliters of dissolved oxygen per Liter seawater
- STheta: Potential Density of Water
- O2Sat: Oxygen Saturation
- Oxy_umol/kg: Oxygen in micro moles per kilogram of seawater
- ChlorA: Acetone extracted chlorophyll-a measured fluorometrically
- Phaeop: Phaeophytin concentration measured fluorometrically
- PO4µM: Phosphate concentration
- SiO3µM: Silicate concentration
- NO2µM: Nitrite concentration
- NO3µM: Nitrate concentration
- NH3µM: Ammonium concentration
- LightP: Light intensities expressed as a percentage
- R_PRES: Pressure in decibars
The following data cleansing step is applied to create a more suitable dataset:
- All columns other than those listed above are omitted, as they are not used in this report. Note that STheta, although listed, is not included in the selection code that follows, so the cleaned dataset has 16 columns.
ocean_clean <- ocean %>%
  dplyr::select(Sta_ID, Depthm, T_degC, Salnty, O2ml_L, O2Sat, Oxy_µmol.Kg, ChlorA, Phaeop, PO4uM, SiO3uM, NO2uM, NO3uM, NH3uM, LightP, R_PRES)
Check NA/missing values:
colSums(is.na(ocean))
#> Cst_Cnt Btl_Cnt Sta_ID Depth_ID
#> 0 0 0 0
#> Depthm T_degC Salnty O2ml_L
#> 0 10963 47354 168662
#> STheta O2Sat Oxy_µmol.Kg BtlNum
#> 52689 203589 203595 746196
#> RecInd T_prec T_qual S_prec
#> 0 10963 841736 47354
#> S_qual P_qual O_qual SThtaq
#> 789949 191108 680187 799040
#> O2Satq ChlorA Chlqua Phaeop
#> 647066 639591 225697 639592
#> Phaqua PO4uM PO4q SiO3uM
#> 225693 451546 413077 510772
#> SiO3qu NO2uM NO2q NO3uM
#> 353997 527287 335389 527460
#> NO3q NH3uM NH3q C14As1
#> 334930 799901 56564 850431
#> C14A1p C14A1q C14As2 C14A2p
#> 852103 16258 850449 852121
#> C14A2q DarkAs DarkAp DarkAq
#> 16240 842214 844406 24423
#> MeanAs MeanAp MeanAq IncTim
#> 842213 844406 24424 850426
#> LightP R_Depth R_TEMP R_POTEMP
#> 846212 0 10963 46047
#> R_SALINITY R_SIGMA R_SVA R_DYNHT
#> 47354 52856 52771 46657
#> R_O2 R_O2Sat R_SIO3 R_PO4
#> 168662 198415 510764 451538
#> R_NO3 R_NO2 R_NH4 R_CHLA
#> 527452 527279 799881 639587
#> R_PHAEO R_PRES R_SAMP DIC1
#> 639588 0 742857 862864
#> DIC2 TA1 TA2 pH2
#> 864639 862779 864629 864853
#> pH1 DIC.Quality.Comment
#> 864779 864808
Note: The preliminary check shows NA (missing) values in many of the columns.
To improve the overall result, missing values are replaced according to column type: the mode for the factor column Sta_ID and the mean for the numeric columns.
Creating a function for data cleansing:
# Return the most frequent value(s) of x; NA if all values are equally frequent
Mode = function(x){
  a = table(x)
  b = max(a)
  if(all(a == b))
    mod = NA
  else if(is.numeric(x))
    mod = as.numeric(names(a))[a==b]
  else
    mod = names(a)[a==b]
  return(mod)
}
ocean_clean$Sta_ID[is.na(ocean_clean$Sta_ID)] <- Mode(ocean_clean$Sta_ID)
ocean_clean$Depthm[is.na(ocean_clean$Depthm)] <- mean(ocean_clean$Depthm, na.rm = T)
ocean_clean$T_degC[is.na(ocean_clean$T_degC)] <- mean(ocean_clean$T_degC, na.rm = T)
ocean_clean$Salnty[is.na(ocean_clean$Salnty)] <- mean(ocean_clean$Salnty, na.rm = T)
ocean_clean$O2ml_L[is.na(ocean_clean$O2ml_L)] <- mean(ocean_clean$O2ml_L, na.rm = T)
ocean_clean$O2Sat[is.na(ocean_clean$O2Sat)] <- mean(ocean_clean$O2Sat, na.rm = T)
ocean_clean$Oxy_µmol.Kg[is.na(ocean_clean$Oxy_µmol.Kg)] <- mean(ocean_clean$Oxy_µmol.Kg, na.rm = T)
ocean_clean$ChlorA[is.na(ocean_clean$ChlorA)] <- mean(ocean_clean$ChlorA, na.rm = T)
ocean_clean$Phaeop[is.na(ocean_clean$Phaeop)] <- mean(ocean_clean$Phaeop, na.rm = T)
ocean_clean$PO4uM[is.na(ocean_clean$PO4uM)] <- mean(ocean_clean$PO4uM, na.rm = T)
ocean_clean$SiO3uM[is.na(ocean_clean$SiO3uM)] <- mean(ocean_clean$SiO3uM, na.rm = T)
ocean_clean$NO2uM[is.na(ocean_clean$NO2uM)] <- mean(ocean_clean$NO2uM, na.rm = T)
ocean_clean$NO3uM[is.na(ocean_clean$NO3uM)] <- mean(ocean_clean$NO3uM, na.rm = T)
ocean_clean$NH3uM[is.na(ocean_clean$NH3uM)] <- mean(ocean_clean$NH3uM, na.rm = T)
ocean_clean$LightP[is.na(ocean_clean$LightP)] <- mean(ocean_clean$LightP, na.rm = T)
ocean_clean$R_PRES[is.na(ocean_clean$R_PRES)] <- mean(ocean_clean$R_PRES, na.rm = T)
Check NA/missing values for the ocean_clean object:
colSums(is.na(ocean_clean))
#> Sta_ID Depthm T_degC Salnty O2ml_L O2Sat
#> 0 0 0 0 0 0
#> Oxy_µmol.Kg ChlorA Phaeop PO4uM SiO3uM NO2uM
#> 0 0 0 0 0 0
#> NO3uM NH3uM LightP R_PRES
#> 0 0 0 0
summary(ocean_clean)
#> Sta_ID Depthm T_degC Salnty
#> 090.0 045.0: 10043 Min. : 0.0 Min. : 1.44 Min. :28.43
#> 090.0 070.0: 10039 1st Qu.: 46.0 1st Qu.: 7.72 1st Qu.:33.50
#> 090.0 037.0: 9771 Median : 125.0 Median :10.13 Median :33.84
#> 090.0 060.0: 9521 Mean : 226.8 Mean :10.80 Mean :33.84
#> 080.0 060.0: 9393 3rd Qu.: 300.0 3rd Qu.:13.83 3rd Qu.:34.18
#> 080.0 070.0: 9318 Max. :5351.0 Max. :31.14 Max. :37.03
#> (Other) :806778
#> O2ml_L O2Sat Oxy_µmol.Kg ChlorA
#> Min. :-0.010 Min. : -0.1 Min. : -0.4349 Min. :-0.0010
#> 1st Qu.: 1.890 1st Qu.: 31.5 1st Qu.: 90.4764 1st Qu.: 0.4502
#> Median : 3.392 Median : 57.1 Median :148.8087 Median : 0.4502
#> Mean : 3.392 Mean : 57.1 Mean :148.8087 Mean : 0.4502
#> 3rd Qu.: 5.240 3rd Qu.: 88.3 3rd Qu.:225.2387 3rd Qu.: 0.4502
#> Max. :11.130 Max. :214.1 Max. :485.7018 Max. :66.1100
#>
#> Phaeop PO4uM SiO3uM NO2uM
#> Min. :-3.8900 Min. :0.000 Min. : 0.00 Min. :0.00000
#> 1st Qu.: 0.1986 1st Qu.:1.565 1st Qu.: 26.61 1st Qu.:0.02000
#> Median : 0.1986 Median :1.565 Median : 26.61 Median :0.04232
#> Mean : 0.1986 Mean :1.565 Mean : 26.61 Mean :0.04232
#> 3rd Qu.: 0.1986 3rd Qu.:1.565 3rd Qu.: 26.61 3rd Qu.:0.04232
#> Max. :65.3000 Max. :5.210 Max. :196.00 Max. :8.19000
#>
#> NO3uM NH3uM LightP R_PRES
#> Min. :-0.4 Min. : 0.00000 Min. : 0.00 Min. : 0.0
#> 1st Qu.:17.3 1st Qu.: 0.08488 1st Qu.:18.36 1st Qu.: 46.0
#> Median :17.3 Median : 0.08488 Median :18.36 Median : 126.0
#> Mean :17.3 Mean : 0.08488 Mean :18.36 Mean : 228.4
#> 3rd Qu.:17.3 3rd Qu.: 0.08488 3rd Qu.:18.36 3rd Qu.: 302.0
#> Max. :95.0 Max. :15.63000 Max. :99.90 Max. :5458.0
#>
str(ocean_clean)
#> 'data.frame': 864863 obs. of 16 variables:
#> $ Sta_ID : Factor w/ 2634 levels "001.0 168.0",..: 313 313 313 313 313 313 313 313 313 313 ...
#> $ Depthm : num 0 8 10 19 20 30 39 50 58 75 ...
#> $ T_degC : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ Salnty : num 33.4 33.4 33.4 33.4 33.4 ...
#> $ O2ml_L : num 3.39 3.39 3.39 3.39 3.39 ...
#> $ O2Sat : num 57.1 57.1 57.1 57.1 57.1 ...
#> $ Oxy_µmol.Kg: num 149 149 149 149 149 ...
#> $ ChlorA : num 0.45 0.45 0.45 0.45 0.45 ...
#> $ Phaeop : num 0.199 0.199 0.199 0.199 0.199 ...
#> $ PO4uM : num 1.56 1.56 1.56 1.56 1.56 ...
#> $ SiO3uM : num 26.6 26.6 26.6 26.6 26.6 ...
#> $ NO2uM : num 0.0423 0.0423 0.0423 0.0423 0.0423 ...
#> $ NO3uM : num 17.3 17.3 17.3 17.3 17.3 ...
#> $ NH3uM : num 0.0849 0.0849 0.0849 0.0849 0.0849 ...
#> $ LightP : num 18.4 18.4 18.4 18.4 18.4 ...
#> $ R_PRES : num 0 8 10 19 20 30 39 50 58 75 ...
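In the summary above, several heavily imputed columns (ChlorA, Phaeop, NO3uM, and others) show the same value for the first quartile, median, mean, and third quartile. That is the expected side effect of mean imputation when most of a column is missing. The sketch below reproduces the effect on simulated data; the vector `x` and the 70% missing rate are assumptions chosen for illustration, not values taken from the dataset.

```r
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 2)    # hypothetical complete variable
x_missing <- x
x_missing[sample(1000, 700)] <- NA     # knock out 70% of the values

# Mean imputation, the same pattern applied to ocean_clean.
x_imputed <- x_missing
x_imputed[is.na(x_imputed)] <- mean(x_missing, na.rm = TRUE)

# The spread shrinks because 700 values now sit exactly on the mean.
sd_observed <- sd(x_missing, na.rm = TRUE)
sd_imputed  <- sd(x_imputed)
print(c(observed = sd_observed, imputed = sd_imputed))

# The median collapses onto the imputed value.
print(quantile(x_imputed, c(0.25, 0.5, 0.75)))
```

Because the imputed cells carry no variation, they also weaken the correlations that PCA and clustering rely on, which is worth keeping in mind when reading the later results.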
This report uses dimensionality reduction with PCA (Principal Component Analysis) as part of the exploratory data analysis of the ocean dataset. The goal is to reduce the number of variables (dimensions/features) in the data while retaining as much information as possible. In this report, we assume that the selected principal components (PCs) should retain at least 80% of the information (variance) in the data.
PCA is useful for retaining information while reducing the dimension of the data. However, the data must be properly scaled for PCA to be meaningful. Because the variables in this dataset span very different ranges, the data are scaled before PCA.
To conduct PCA, all non-numeric columns must be omitted, as PCA only processes numeric data.
ocean_num <- ocean_clean %>%
  dplyr::select(-Sta_ID)
tail(ocean_num)
ocean_scale <- scale(ocean_num)
head(ocean_scale)
#> Depthm T_degC Salnty O2ml_L
#> [1,] -0.7177085 -0.07106668 -0.8916057 0.000000000000002148652
#> [2,] -0.6923961 -0.08055245 -0.8916057 0.000000000000002148652
#> [3,] -0.6860679 -0.08055245 -0.8982869 0.000000000000002148652
#> [4,] -0.6575915 -0.08292389 -0.9361470 0.000000000000002148652
#> [5,] -0.6544274 -0.08292389 -0.9339199 0.000000000000002148652
#> [6,] -0.6227869 -0.08292389 -0.9116493 0.000000000000002148652
#> O2Sat Oxy_µmol.Kg ChlorA
#> [1,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [2,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [3,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [4,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [5,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> [6,] -0.0000000000000002190625 0.000000000000001441613 -0.000000000000003329897
#> Phaeop PO4uM SiO3uM
#> [1,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [2,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [3,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [4,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [5,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> [6,] 0.000000000000002310878 -0.000000000000004340651 0.000000000000001807393
#> NO2uM NO3uM NH3uM
#> [1,] -0.0000000000000008527376 0 0.000000000000006675914
#> [2,] -0.0000000000000008527376 0 0.000000000000006675914
#> [3,] -0.0000000000000008527376 0 0.000000000000006675914
#> [4,] -0.0000000000000008527376 0 0.000000000000006675914
#> [5,] -0.0000000000000008527376 0 0.000000000000006675914
#> [6,] -0.0000000000000008527376 0 0.000000000000006675914
#> LightP R_PRES
#> [1,] -0.00000000000001458751 -0.7149503
#> [2,] -0.00000000000001458751 -0.6899078
#> [3,] -0.00000000000001458751 -0.6836472
#> [4,] -0.00000000000001458751 -0.6554744
#> [5,] -0.00000000000001458751 -0.6523440
#> [6,] -0.00000000000001458751 -0.6210409
summary(ocean_scale)
#> Depthm T_degC Salnty O2ml_L
#> Min. :-0.7177 Min. :-2.2196 Min. :-12.0470 Min. :-1.8291
#> 1st Qu.:-0.5722 1st Qu.:-0.7303 1st Qu.: -0.7491 1st Qu.:-0.8077
#> Median :-0.3222 Median :-0.1588 Median : 0.0000 Median : 0.0000
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.2315 3rd Qu.: 0.7186 3rd Qu.: 0.7564 3rd Qu.: 0.9932
#> Max. :16.2131 Max. : 4.8236 Max. : 7.1125 Max. : 4.1596
#> O2Sat Oxy_µmol.Kg ChlorA Phaeop
#> Min. :-1.7636 Min. :-1.8925 Min. : -0.7315 Min. :-21.28
#> 1st Qu.:-0.7894 1st Qu.:-0.7397 1st Qu.: 0.0000 1st Qu.: 0.00
#> Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.00
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00
#> 3rd Qu.: 0.9618 3rd Qu.: 0.9692 3rd Qu.: 0.0000 3rd Qu.: 0.00
#> Max. : 4.8402 Max. : 4.2720 Max. :106.4507 Max. :338.76
#> PO4uM SiO3uM NO2uM NO3uM
#> Min. :-2.185 Min. :-1.504 Min. : -0.6501 Min. :-1.944
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: -0.3428 1st Qu.: 0.000
#> Median : 0.000 Median : 0.000 Median : 0.0000 Median : 0.000
#> Mean : 0.000 Mean : 0.000 Mean : 0.0000 Mean : 0.000
#> 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.000
#> Max. : 5.090 Max. : 9.575 Max. :125.1611 Max. : 8.531
#> NH3uM LightP R_PRES
#> Min. : -1.134 Min. :-4.187 Min. :-0.7149
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:-0.5710
#> Median : 0.000 Median : 0.000 Median :-0.3205
#> Mean : 0.000 Mean : 0.000 Mean : 0.0000
#> 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.2304
#> Max. :207.722 Max. :18.601 Max. :16.3703
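As a small illustration of what `scale()` does, the sketch below standardizes a toy data frame and checks the result against the manual formula (value minus column mean, divided by column standard deviation). The `toy` values are made up. The near-zero entries such as 0.000000000000002148652 in the `head()` output above are consistent with mean-imputed cells: a value equal to the column mean scales to zero, up to floating-point error.

```r
# Two columns on very different ranges (made-up values).
toy <- data.frame(
  Depthm = c(0, 50, 125, 300, 5000),
  Salnty = c(33.4, 33.5, 33.8, 34.2, 34.6)
)

# scale() centers each column on its mean and divides by its sd.
toy_scaled <- scale(toy)

col_means <- colMeans(toy_scaled)        # approximately 0 for each column
col_sds   <- apply(toy_scaled, 2, sd)    # 1 for each column
print(round(col_means, 10))
print(col_sds)

# The same transformation written out by hand for one column.
manual <- (toy$Depthm - mean(toy$Depthm)) / sd(toy$Depthm)
print(manual)
```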
Before proceeding with the PCA and clustering procedures, we check for outliers in the data. Because outliers can greatly affect the clustering results, we remove them from the data. Outliers can be visualized with a PCA individuals plot.
pca_ocean <- PCA(X = ocean_scale, scale.unit = F, ncp = 15)
summary(pca_ocean)
#>
#> Call:
#> PCA(X = ocean_scale, scale.unit = F, ncp = 15)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 6.736 1.743 1.427 1.053 0.998 0.857 0.801
#> % of var. 44.909 11.623 9.513 7.019 6.652 5.713 5.342
#> Cumulative % of var. 44.909 56.532 66.045 73.064 79.716 85.429 90.772
#> Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
#> Variance 0.489 0.373 0.221 0.170 0.094 0.035 0.003
#> % of var. 3.260 2.488 1.471 1.131 0.625 0.234 0.018
#> Cumulative % of var. 94.032 96.520 97.991 99.122 99.748 99.982 100.000
#> Dim.15
#> Variance 0.000
#> % of var. 0.000
#> Cumulative % of var. 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
#> 1 | 1.351 | -0.628 0.000 0.216 | -0.276 0.000 0.042 | 0.789
#> 2 | 1.325 | -0.611 0.000 0.213 | -0.268 0.000 0.041 | 0.760
#> 3 | 1.323 | -0.610 0.000 0.212 | -0.266 0.000 0.041 | 0.754
#> 4 | 1.321 | -0.605 0.000 0.210 | -0.261 0.000 0.039 | 0.727
#> 5 | 1.316 | -0.603 0.000 0.209 | -0.259 0.000 0.039 | 0.723
#> 6 | 1.269 | -0.579 0.000 0.208 | -0.248 0.000 0.038 | 0.689
#> 7 | 1.227 | -0.558 0.000 0.206 | -0.238 0.000 0.038 | 0.658
#> 8 | 1.226 | -0.534 0.000 0.190 | -0.226 0.000 0.034 | 0.615
#> 9 | 1.215 | -0.510 0.000 0.176 | -0.215 0.000 0.031 | 0.581
#> 10 | 1.052 | -0.418 0.000 0.158 | -0.182 0.000 0.030 | 0.507
#> ctr cos2
#> 1 0.000 0.340 |
#> 2 0.000 0.329 |
#> 3 0.000 0.325 |
#> 4 0.000 0.302 |
#> 5 0.000 0.302 |
#> 6 0.000 0.295 |
#> 7 0.000 0.288 |
#> 8 0.000 0.252 |
#> 9 0.000 0.229 |
#> 10 0.000 0.232 |
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
#> Depthm | 0.703 7.339 0.494 | 0.199 2.265 0.039 | -0.623 27.213
#> T_degC | -0.807 9.660 0.651 | -0.120 0.826 0.014 | 0.223 3.476
#> Salnty | 0.763 8.644 0.582 | 0.099 0.565 0.010 | -0.072 0.367
#> O2ml_L | -0.922 12.628 0.851 | 0.011 0.008 0.000 | -0.067 0.312
#> O2Sat | -0.926 12.722 0.857 | 0.008 0.004 0.000 | -0.092 0.592
#> Oxy_µmol.Kg | -0.921 12.590 0.848 | 0.017 0.016 0.000 | -0.096 0.643
#> ChlorA | -0.095 0.135 0.009 | 0.808 37.447 0.653 | 0.234 3.834
#> Phaeop | -0.052 0.040 0.003 | 0.828 39.321 0.686 | 0.276 5.328
#> PO4uM | 0.840 10.479 0.706 | -0.058 0.191 0.003 | 0.329 7.563
#> SiO3uM | 0.776 8.944 0.602 | -0.068 0.269 0.005 | 0.345 8.349
#> cos2
#> Depthm 0.388 |
#> T_degC 0.050 |
#> Salnty 0.005 |
#> O2ml_L 0.004 |
#> O2Sat 0.008 |
#> Oxy_µmol.Kg 0.009 |
#> ChlorA 0.055 |
#> Phaeop 0.076 |
#> PO4uM 0.108 |
#> SiO3uM 0.119 |
pca_ocean$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 6.736372917691 44.90920471100 44.90920
#> comp 2 1.743394087571 11.62264068919 56.53185
#> comp 3 1.427004711776 9.51337574504 66.04522
#> comp 4 1.052814963842 7.01877454109 73.06400
#> comp 5 0.997849502027 6.65233770530 79.71633
#> comp 6 0.856951799943 5.71301860532 85.42935
#> comp 7 0.801360117322 5.34240695932 90.77176
#> comp 8 0.488981352978 3.25987945577 94.03164
#> comp 9 0.373257962127 2.48838929139 96.52003
#> comp 10 0.220716696020 1.47144634150 97.99147
#> comp 11 0.169638596089 1.13092528156 99.12240
#> comp 12 0.093814278923 0.62542924931 99.74783
#> comp 13 0.035130045478 0.23420057398 99.98203
#> comp 14 0.002692486270 0.01794992922 99.99998
#> comp 15 0.000003138149 0.00002092102 100.00000
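The number of components needed to reach the 80% target can be read from the cumulative column, or computed directly. The sketch below uses the rounded percentages reported in the eigenvalue table above; the variable names `pct_var`, `cum_var`, and `n_pc` are introduced here for illustration only.

```r
# Percentage of variance per component, copied from the PCA summary (rounded).
pct_var <- c(44.909, 11.623, 9.513, 7.019, 6.652, 5.713, 5.342,
             3.260, 2.488, 1.471, 1.131, 0.625, 0.234, 0.018, 0.000)

# Running total of explained variance.
cum_var <- cumsum(pct_var)

# First component at which the running total reaches 80%.
n_pc <- which(cum_var >= 80)[1]
print(n_pc)
print(cum_var[n_pc])

# On the fitted object the same check would presumably read:
# which(pca_ocean$eig[, "cumulative percentage of variance"] >= 80)[1]
```

Five components stop just short of the target at about 79.7%, so the sixth is the first to cross it.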
Visualize the PCA and identify the most extreme outliers.
plot.PCA(x = pca_ocean,
         choix = "ind",
         select = "contrib 20",
         habillage = "Depthm")
From the graph, the observations below are the most extreme outliers; they are omitted later to improve the clustering result.
outlier <- ocean_num[c(525549,525548,758530,514616,514615),]
ocean_no_outlier <- ocean_clean[-c(525549,525548,758530,514616,514615),]
Prior to the clustering procedure, we need to make sure that the data consist only of the numeric columns that will be used for clustering.
ocean_nooutnum <- ocean_no_outlier %>%
  select_if(is.numeric)
tail(ocean_nooutnum)
str(ocean_nooutnum)
#> 'data.frame': 864858 obs. of 15 variables:
#> $ Depthm : num 0 8 10 19 20 30 39 50 58 75 ...
#> $ T_degC : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ Salnty : num 33.4 33.4 33.4 33.4 33.4 ...
#> $ O2ml_L : num 3.39 3.39 3.39 3.39 3.39 ...
#> $ O2Sat : num 57.1 57.1 57.1 57.1 57.1 ...
#> $ Oxy_µmol.Kg: num 149 149 149 149 149 ...
#> $ ChlorA : num 0.45 0.45 0.45 0.45 0.45 ...
#> $ Phaeop : num 0.199 0.199 0.199 0.199 0.199 ...
#> $ PO4uM : num 1.56 1.56 1.56 1.56 1.56 ...
#> $ SiO3uM : num 26.6 26.6 26.6 26.6 26.6 ...
#> $ NO2uM : num 0.0423 0.0423 0.0423 0.0423 0.0423 ...
#> $ NO3uM : num 17.3 17.3 17.3 17.3 17.3 ...
#> $ NH3uM : num 0.0849 0.0849 0.0849 0.0849 0.0849 ...
#> $ LightP : num 18.4 18.4 18.4 18.4 18.4 ...
#> $ R_PRES : num 0 8 10 19 20 30 39 50 58 75 ...
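The five rows dropped before clustering were chosen by reading the individuals plot. A more reproducible way to rank candidate outliers is to sort observations by their distance from the origin in the first two principal components. The sketch below does this with base R's `prcomp()` on simulated data; it is a simplification and not the contribution measure that `plot.PCA(select = "contrib 20")` uses, and the planted rows 17 and 120 are assumptions made for the demonstration.

```r
set.seed(42)
# 200 simulated observations on three variables.
toy <- matrix(rnorm(200 * 3), ncol = 3,
              dimnames = list(NULL, c("a", "b", "c")))

# Plant two extreme rows so there is something to find.
toy[c(17, 120), ] <- toy[c(17, 120), ] + 15

# PCA on standardized data.
pc <- prcomp(toy, center = TRUE, scale. = TRUE)

# Squared distance from the origin in the PC1-PC2 plane.
dist2 <- rowSums(pc$x[, 1:2]^2)

# Row indices of the two most extreme observations.
top2 <- order(dist2, decreasing = TRUE)[1:2]
print(top2)
```

The resulting indices can then be passed to negative indexing, in the same way the report removes its five rows.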
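Note that ocean_nooutnum is built from ocean_clean, so it holds the original, unscaled values (as the str() output above shows). It is used as the base for the clustering procedure; because K-means relies on Euclidean distance, wide-range variables such as Depthm and R_PRES will dominate the result unless the data are scaled first.
Before clustering, we use the elbow method to determine the optimal \(k\). This report caps maxK at 5, so at most 5 clusters are considered.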
# function to plot the elbow method
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(123)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}
kmeansTunning(data = ocean_nooutnum, maxK = 5)
Based on the graph above, the optimal number of clusters is 4 (based on the ‘elbow’ of the graph). Therefore we use 4 clusters for the K-means calculation.
ocean_km <- kmeans(x = ocean_nooutnum,
                   centers = 4)
K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are considered optimal if the distance between data points in the same cluster is low and the distance between data points from different clusters is high.
head(ocean_km$cluster)
#> 1 2 3 4 5 6
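The "low within, high between" criterion can be inspected directly on a `kmeans` result. The sketch below uses R's built-in `iris` measurements rather than the ocean data, and the choices of three centers, `nstart = 10`, and scaling beforehand are assumptions made for this illustration.

```r
# Standardize the four numeric iris columns so each contributes equally.
x <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(x, centers = 3, nstart = 10)

# Within-cluster sum of squares: lower means tighter clusters.
print(km$tot.withinss)

# Between-cluster sum of squares: higher means better separated clusters.
print(km$betweenss)

# The two add up to the total sum of squares, so their ratio
# summarizes how much of the spread the clustering accounts for.
ratio <- km$betweenss / km$totss
print(ratio)
```

Applied to the report's object, the same fields would be `ocean_km$tot.withinss`, `ocean_km$betweenss`, and `ocean_km$totss`.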
#> 1 1 1 1 1 1
RNGkind(sample.kind = "Rounding")
set.seed(123)
ocean_nooutnum$cluster <- as.numeric(ocean_km$cluster)
tail(ocean_nooutnum)
str(ocean_nooutnum)
#> 'data.frame': 864858 obs. of 16 variables:
#> $ Depthm : num 0 8 10 19 20 30 39 50 58 75 ...
#> $ T_degC : num 10.5 10.5 10.5 10.4 10.4 ...
#> $ Salnty : num 33.4 33.4 33.4 33.4 33.4 ...
#> $ O2ml_L : num 3.39 3.39 3.39 3.39 3.39 ...
#> $ O2Sat : num 57.1 57.1 57.1 57.1 57.1 ...
#> $ Oxy_µmol.Kg: num 149 149 149 149 149 ...
#> $ ChlorA : num 0.45 0.45 0.45 0.45 0.45 ...
#> $ Phaeop : num 0.199 0.199 0.199 0.199 0.199 ...
#> $ PO4uM : num 1.56 1.56 1.56 1.56 1.56 ...
#> $ SiO3uM : num 26.6 26.6 26.6 26.6 26.6 ...
#> $ NO2uM : num 0.0423 0.0423 0.0423 0.0423 0.0423 ...
#> $ NO3uM : num 17.3 17.3 17.3 17.3 17.3 ...
#> $ NH3uM : num 0.0849 0.0849 0.0849 0.0849 0.0849 ...
#> $ LightP : num 18.4 18.4 18.4 18.4 18.4 ...
#> $ R_PRES : num 0 8 10 19 20 30 39 50 58 75 ...
#> $ cluster : num 1 1 1 1 1 1 1 1 1 1 ...
Attach each observation's cluster label back to the original, unscaled data from which the outliers have been removed.
ocean_profile <- ocean_nooutnum %>%
  group_by(cluster) %>%
  summarise_all(.funs = "mean")
ocean_pca <- PCA(ocean_nooutnum,
                 scale.unit = F,
                 quali.sup = 16,
                 graph = F)
fviz_pca_biplot(ocean_pca,
                habillage = 16,
                geom.ind = "point",
                addEllipses = T)
Based on the graph above, we conduct cluster profiling to understand the characteristics of each cluster.
ocean_profile <- ocean_nooutnum %>%
  group_by(cluster) %>%
  summarise_all(.funs = "mean")
ocean_profile
ocean_profile %>%
  tidyr::pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))
From the Principal Component Analysis (PCA) and clustering procedures executed above, the following conclusions can be drawn about the ocean water’s characteristics:
From the PCA procedure, given the selected ocean water variables, retaining at least 80% of the information requires only 6 principal components (out of 15 available), which together explain about 85.4% of the variance.
From the clustering procedure, the elbow method indicates that 4 is the optimal number of clusters for representing the data. Cluster profiling further indicates the following:
Cluster 1: Relatively high temperature, dissolved oxygen (milliliters per liter of seawater), oxygen saturation, oxygen (micromoles per kilogram of seawater), acetone-extracted chlorophyll-a, nitrite concentration, ammonium concentration, and light intensity, but relatively low water depth, nitrate concentration, phaeophytin concentration, water pressure, salinity, and silicate concentration.
Cluster 2: Relatively high nitrate concentration, but relatively low acetone-extracted chlorophyll-a, light intensity, ammonium concentration, nitrite concentration, oxygen saturation, and oxygen (micromoles per kilogram of seawater).
Cluster 3: Relatively high phosphate concentration, but relatively low dissolved oxygen (milliliters per liter of seawater).
Cluster 4: Relatively high water depth, phaeophytin concentration, water pressure, salinity, and silicate concentration, but relatively low water temperature.
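The "relatively high" and "relatively low" statements come from comparing cluster means variable by variable. The sketch below shows that comparison in base R on a toy data frame; the values, the three clusters, and the names `profile` and `extremes` are made up for illustration.

```r
# Toy observations with a cluster label already attached (made-up values).
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  T_degC  = c(18, 16, 9, 11, 4, 6),
  Depthm  = c(5, 15, 150, 250, 900, 1100)
)

# Mean of every variable within each cluster.
profile <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(profile)

# For each variable, the cluster with the lowest and highest mean.
vars <- setdiff(names(profile), "cluster")
extremes <- data.frame(
  variable    = vars,
  cluster_min = sapply(vars, function(v) profile$cluster[which.min(profile[[v]])]),
  cluster_max = sapply(vars, function(v) profile$cluster[which.max(profile[[v]])]),
  row.names   = NULL
)
print(extremes)
```

This version looks up the cluster label explicitly instead of returning the row position from `which.min()`, so it stays correct even if the profile rows are not ordered 1 to k.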