LBB-RM: Ocean Water and its Properties

Wayan K.

4/19/2021


This report uses data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI). Formed in 1949, the organization is a partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service, and Scripps Institution of Oceanography. It is dedicated to the study of the marine environment off the coast of California, the management of its living resources, and the monitoring of indicators of El Niño and climate change.

Data collected include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.

ocean <- read.csv("bottle.csv", stringsAsFactors = TRUE)
head(ocean)  # preview the first rows of the data

Data Dictionary:

Data Wrangling:

Checking for NA value of Data:

anyNA(ocean)
#> [1] TRUE
colSums(is.na(ocean))
#>             Cst_Cnt             Btl_Cnt              Sta_ID            Depth_ID 
#>                   0                   0                   0                   0 
#>              Depthm              T_degC              Salnty              O2ml_L 
#>                   0               10963               47354              168662 
#>              STheta               O2Sat        Oxy_µmol.Kg              BtlNum 
#>               52689              203589              203595              746196 
#>              RecInd              T_prec              T_qual              S_prec 
#>                   0               10963              841736               47354 
#>              S_qual              P_qual              O_qual              SThtaq 
#>              789949              191108              680187              799040 
#>              O2Satq              ChlorA              Chlqua              Phaeop 
#>              647066              639591              225697              639592 
#>              Phaqua               PO4uM                PO4q              SiO3uM 
#>              225693              451546              413077              510772 
#>              SiO3qu               NO2uM                NO2q               NO3uM 
#>              353997              527287              335389              527460 
#>                NO3q               NH3uM                NH3q              C14As1 
#>              334930              799901               56564              850431 
#>              C14A1p              C14A1q              C14As2              C14A2p 
#>              852103               16258              850449              852121 
#>              C14A2q              DarkAs              DarkAp              DarkAq 
#>               16240              842214              844406               24423 
#>              MeanAs              MeanAp              MeanAq              IncTim 
#>              842213              844406               24424                   0 
#>              LightP             R_Depth              R_TEMP            R_POTEMP 
#>              846212                   0               10963               46047 
#>          R_SALINITY             R_SIGMA               R_SVA             R_DYNHT 
#>               47354               52856               52771               46657 
#>                R_O2             R_O2Sat              R_SIO3               R_PO4 
#>              168662              198415              510764              451538 
#>               R_NO3               R_NO2               R_NH4              R_CHLA 
#>              527452              527279              799881              639587 
#>             R_PHAEO              R_PRES              R_SAMP                DIC1 
#>              639588                   0              742857              862864 
#>                DIC2                 TA1                 TA2                 pH2 
#>              864639              862779              864629              864853 
#>                 pH1 DIC.Quality.Comment 
#>              864779                   0
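With this many columns, the raw NA counts are easier to compare as proportions. The following is a minimal sketch on a small invented data frame (`toy` and the 50% threshold are assumptions for illustration, not part of the CalCOFI data or of this analysis) that ranks columns by their share of missing values:

```r
# Toy data frame standing in for `ocean` (values are invented for illustration)
toy <- data.frame(
  Depthm = c(0, 10, 20, 30),
  T_degC = c(15.2, NA, 14.1, 13.8),
  Salnty = c(33.4, NA, NA, 33.6),
  pH1    = c(NA, NA, NA, NA)
)

# Share of missing values per column, sorted from most to least complete
na_share <- sort(colMeans(is.na(toy)))
print(round(na_share, 2))

# Columns with at most 50% missing values (the threshold is an arbitrary choice)
usable_cols <- names(na_share)[na_share <= 0.5]
print(usable_cols)
```

Applied to the real data, the same idea would replace `toy` with `ocean`; columns such as the DIC and pH measurements, which are missing in almost every row, would fall well above any reasonable threshold.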

Data Cleansing:

library(dplyr)  # provides %>% and select()
ocean_clean <- ocean %>% 
  dplyr::select(Depthm, T_degC, Salnty, R_PRES)
ocean_clean <- na.omit(ocean_clean)
head(ocean_clean)

Visualising Correlation of Variables of Data:

library(GGally)
ggcorr(ocean_clean, label = T, hjust = 0.5)

Interpretation of Variables of Data: some pairs of variables may be strongly correlated with each other, such as:

  • Water Pressure vs. Water Salinity
  • Water Pressure vs. Depth
  • Salinity vs. Depth
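The correlation plot can be backed by the numeric correlation matrix. The sketch below uses synthetic data, not the CalCOFI measurements: the generating formulas are assumptions chosen only so that pressure tracks depth closely, as pressure in decibars roughly does with depth in metres.

```r
set.seed(42)
n <- 200

# Synthetic stand-ins for the four cleaned columns (all relationships are assumed)
depth    <- runif(n, 0, 1000)                     # metres
pressure <- 1.01 * depth + rnorm(n, sd = 1)       # assumed to track depth closely
temp     <- 18 - 0.012 * depth + rnorm(n, sd = 1) # colder with depth
salinity <- 33.4 + 0.001 * depth + rnorm(n, sd = 0.2)

toy <- data.frame(Depthm = depth, T_degC = temp,
                  Salnty = salinity, R_PRES = pressure)

# Pearson correlation matrix: the numbers behind a ggcorr-style plot
cor_mat <- cor(toy)
print(round(cor_mat, 2))
```

A depth-pressure correlation very close to 1 means the two predictors carry almost the same information, which is worth keeping in mind when reading their separate regression coefficients.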

Visualization between Water Salinity vs. Water Temperature:

ocean_clean %>% 
  ggplot(aes(x= T_degC, y = Salnty)) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = "lm")+
  theme_minimal()

boxplot(ocean_clean$Salnty)

hist(ocean_clean$T_degC, breaks = 50)

This report examines how the salinity of ocean water relates to the Depth, Temp (temperature), and Pressure of the water.

Some of the Business Questions that can be gathered are:

  • Is there a relationship between water salinity & water temperature?
  • Can we predict the water temperature based on salinity?

Using the cleaned data, we first fit a model on the most strongly correlated variables.

Target Variable: Water Salinity

Predictor Variables: Depth, Temperature, Pressure

Creating a Model:

model_ocean <- lm(formula = Salnty ~ ., data = ocean_clean)
summary(model_ocean)
#> 
#> Call:
#> lm(formula = Salnty ~ ., data = ocean_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.1002 -0.1940  0.0078  0.1779  3.7983 
#> 
#> Coefficients:
#>               Estimate Std. Error  t value            Pr(>|t|)    
#> (Intercept) 33.4658549  0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm       0.1508489  0.0004072   370.45 <0.0000000000000002 ***
#> T_degC       0.0042172  0.0001439    29.31 <0.0000000000000002 ***
#> R_PRES      -0.1483664  0.0004022  -368.89 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared:  0.4463, Adjusted R-squared:  0.4463 
#> F-statistic: 2.187e+05 on 3 and 814243 DF,  p-value: < 0.00000000000000022

Interpreting Model based on common Regression model formula:

\[ \hat{y}=\beta_0+\beta_1 x_1+\dots+\beta_n x_n \] where \(x_1,\dots,x_n\) are the predictor variables included in the model.

Interpretation of Model Output summary:

  • Based on the p-values, all of the predictors (Depth, Temp, Pressure) are statistically significant for the salinity of ocean water.
  • Fitted model: Salinity = 33.466 + 0.151 * Depth + 0.004 * Temp - 0.148 * Pressure
  • The intercept of 33.466 is the predicted water salinity when depth, temperature, and pressure are all 0 (zero).
  • Each additional unit of water Depth increases the predicted salinity by 0.151, holding Temp and Pressure constant.
  • Each additional unit of water Temp increases the predicted salinity by 0.004, holding Depth and Pressure constant.
  • Each additional unit of water Pressure decreases the predicted salinity by 0.148, holding Depth and Temp constant.
  • Although all predictors are significant, the adjusted R-squared of 0.4463 means the model explains only about 44.6% of the variance in salinity, so it is not a very good model overall. (An adjusted R-squared closer to 1 is considered better.)
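The fitted equation can be checked by hand. The sketch below copies the coefficients from the summary(model_ocean) output above and applies them to one made-up observation; the depth, temperature, and pressure values are illustrative assumptions, not CalCOFI records.

```r
# Coefficients copied from the summary(model_ocean) output
coefs <- c(intercept = 33.4658549, Depthm = 0.1508489,
           T_degC = 0.0042172, R_PRES = -0.1483664)

# Hypothetical observation (input values are assumptions for illustration)
obs <- c(intercept = 1, Depthm = 100, T_degC = 10, R_PRES = 100.7)

# Linear predictor: beta_0 + beta_1 * x_1 + ... + beta_n * x_n
salinity_hat <- sum(coefs * obs)
print(round(salinity_hat, 4))

# Net change in predicted salinity if depth and pressure both rise by one unit
net_depth_effect <- coefs[["Depthm"]] + coefs[["R_PRES"]]
print(net_depth_effect)
```

Because pressure and depth rise together in practice, the large positive Depthm coefficient and the large negative R_PRES coefficient mostly cancel. Under the assumption that one metre of depth adds about one unit of pressure, the net change in predicted salinity is much smaller than either coefficient alone suggests.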

Comparing predictor variables in various models: we fit several models, each using a different pair of predictor variables, and compare them.

model_ocean1 <- lm(formula = Salnty ~ Depthm+T_degC, data = ocean_clean)
model_ocean2 <- lm(formula = Salnty ~ T_degC+R_PRES, data = ocean_clean)
model_ocean3 <- lm(formula = Salnty ~ R_PRES+Depthm, data = ocean_clean)

summary(model_ocean1)$adj.r.squared
#> [1] 0.3537267
summary(model_ocean2)$adj.r.squared
#> [1] 0.3529415
summary(model_ocean3)$adj.r.squared
#> [1] 0.4456854

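Of the three additional models, the one using pressure and depth (model_ocean3) has the highest adjusted R-squared (0.4457), while the two models that pair temperature with only one of depth or pressure reach about 0.353. The forward and backward stepwise models fitted below keep all three predictors and reach the highest adjusted R-squared overall (0.4463).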
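When several candidate formulas are compared, collecting them in a list keeps the comparison in one place. This is a sketch on synthetic data (the data-generating formulas and the list names are assumptions for illustration), not a re-run of the models above:

```r
set.seed(1)
n <- 300

# Synthetic stand-in for ocean_clean (all relationships are assumed)
toy <- data.frame(Depthm = runif(n, 0, 500), T_degC = rnorm(n, 12, 3))
toy$R_PRES <- 1.01 * toy$Depthm + rnorm(n, sd = 0.5)
toy$Salnty <- 33.5 + 0.002 * toy$Depthm - 0.02 * toy$T_degC + rnorm(n, sd = 0.2)

# One formula per candidate model
formulas <- list(
  depth_temp     = Salnty ~ Depthm + T_degC,
  temp_pressure  = Salnty ~ T_degC + R_PRES,
  pressure_depth = Salnty ~ R_PRES + Depthm
)

# Fit each model and extract its adjusted R-squared
adj_r2 <- sapply(formulas, function(f) summary(lm(f, data = toy))$adj.r.squared)
print(round(sort(adj_r2, decreasing = TRUE), 4))
```

The ranking on synthetic data says nothing about the ocean data itself; the point is only the pattern of fitting and comparing models in one pass.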

Next, we generate predictions from a new data sample:

sampling_ocean <- read.csv("bottle_sample.csv", stringsAsFactors = T)
head(sampling_ocean)
ocean_sampclean <- sampling_ocean %>% 
  dplyr::select(Depthm, T_degC, Salnty, R_PRES)
ocean_sampclean <- na.omit(ocean_sampclean)

head(ocean_sampclean)

Prediction Modeling from new Data Sampling:

predict_ocean <- predict(object = model_ocean, newdata = ocean_sampclean)
tail(predict_ocean)
#>      495      496      497      498      499      500 
#> 34.25362 34.33411 34.56082 34.43223 34.53098 34.62977
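Point predictions such as those above can be accompanied by prediction intervals, which indicate how far an individual observation may plausibly fall from the predicted value. The sketch below uses a small synthetic training set and a single predictor (both assumptions for illustration) and the `interval` argument of `predict()` for linear models:

```r
set.seed(7)

# Synthetic training data (the relationship is assumed for illustration)
train <- data.frame(T_degC = runif(100, 5, 20))
train$Salnty <- 34.5 - 0.05 * train$T_degC + rnorm(100, sd = 0.1)

fit <- lm(Salnty ~ T_degC, data = train)

# New observations to predict
new_obs <- data.frame(T_degC = c(8, 12, 16))

# 95% prediction intervals: columns fit (point prediction), lwr, upr
pred <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
print(round(pred, 3))
```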

Model Evaluation:

Evaluating the prediction model and its error, in order to find the least error value, by using:

  • Mean Absolute Error (MAE)
  • Mean Absolute Percentage Error (MAPE)
  • Root Mean Squared Error (RMSE)

library(MLmetrics)  # provides MAPE() and RMSE()
mae <- mean(abs(predict_ocean - ocean_sampclean$Salnty))
mae
#> [1] 0.3505954
MAPE(y_pred = predict_ocean, y_true = ocean_sampclean$Salnty)*100
#> [1] 1.053392
RMSE(y_pred = predict_ocean, y_true = ocean_sampclean$Salnty)
#> [1] 0.4497452

Interpretation of MAE, MAPE, and RMSE Result:

  • The MAE of 0.35 shows that, on average, predictions deviate from the actual salinity by about 0.35 units, in either direction.
  • The MAPE of 1.05% shows that, on average, predictions deviate from the actual values by only about 1.05%.
  • The RMSE of 0.45 is the square root of the mean squared error. Because squaring weights large errors more heavily, it is larger than the MAE and indicates a typical deviation of about 0.45 units.
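The three metrics are simple enough to write out in base R, which makes their differences explicit. The function names below (`mae_fn`, `mape_fn`, `rmse_fn`) and the example vectors are hypothetical, chosen for illustration:

```r
# Mean Absolute Error: average size of the errors
mae_fn  <- function(y_true, y_pred) mean(abs(y_pred - y_true))

# Mean Absolute Percentage Error: average error relative to the actual value
# (returned as a fraction; multiply by 100 for a percentage)
mape_fn <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))

# Root Mean Squared Error: squaring penalizes large errors more heavily
rmse_fn <- function(y_true, y_pred) sqrt(mean((y_pred - y_true)^2))

# Made-up actual and predicted values
y_true <- c(10, 20, 30)
y_pred <- c(11, 18, 33)

print(mae_fn(y_true, y_pred))
print(mape_fn(y_true, y_pred) * 100)
print(rmse_fn(y_true, y_pred))
```

In this example the errors are 1, -2, and 3, so the RMSE exceeds the MAE: the single largest error dominates once the errors are squared.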

Feature Selection of Model using Step-wise Regression & Visualization using Scatterplot between prediction value (fitted value) with error value:

model_backward <- step(object = model_ocean, direction = "backward", trace = 0)
summary(model_backward)
#> 
#> Call:
#> lm(formula = Salnty ~ Depthm + T_degC + R_PRES, data = ocean_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.1002 -0.1940  0.0078  0.1779  3.7983 
#> 
#> Coefficients:
#>               Estimate Std. Error  t value            Pr(>|t|)    
#> (Intercept) 33.4658549  0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm       0.1508489  0.0004072   370.45 <0.0000000000000002 ***
#> T_degC       0.0042172  0.0001439    29.31 <0.0000000000000002 ***
#> R_PRES      -0.1483664  0.0004022  -368.89 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared:  0.4463, Adjusted R-squared:  0.4463 
#> F-statistic: 2.187e+05 on 3 and 814243 DF,  p-value: < 0.00000000000000022
model_ocean_none <- lm(formula = Salnty~1, data = ocean_clean)
model_forward <- step(object = model_ocean_none, direction = "forward", scope = list(lower = model_ocean_none, upper = model_ocean), trace = 0)
summary(model_forward)
#> 
#> Call:
#> lm(formula = Salnty ~ Depthm + R_PRES + T_degC, data = ocean_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.1002 -0.1940  0.0078  0.1779  3.7983 
#> 
#> Coefficients:
#>               Estimate Std. Error  t value            Pr(>|t|)    
#> (Intercept) 33.4658549  0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm       0.1508489  0.0004072   370.45 <0.0000000000000002 ***
#> R_PRES      -0.1483664  0.0004022  -368.89 <0.0000000000000002 ***
#> T_degC       0.0042172  0.0001439    29.31 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared:  0.4463, Adjusted R-squared:  0.4463 
#> F-statistic: 2.187e+05 on 3 and 814243 DF,  p-value: < 0.00000000000000022
summary(model_backward)$adj.r.squared
#> [1] 0.4462688
summary(model_forward)$adj.r.squared
#> [1] 0.4462688
plot(x = model_forward$fitted.values, y = model_forward$residuals)

Interpretation of Feature Selection Results:

  • Both methods, backward and forward stepwise selection, kept all three predictors and therefore produced the same model, with an adjusted R-squared of 0.446. Although the p-values are very small, the model explains only 44.63% of the variance in salinity (a value closer to 100% indicates a better fit).
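The adjusted R-squared reported by summary() follows from the R-squared, the number of observations n, and the number of predictors p. The sketch below first recomputes it by hand on R's built-in mtcars data (a stand-in chosen for illustration, not the ocean data), then applies the same formula to the rounded values from the output above, with n = 814247 inferred from the 814243 residual degrees of freedom plus 4 estimated coefficients:

```r
# Check the formula against summary() on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)
n   <- nrow(mtcars)
p   <- 2  # number of predictors (wt, hp)

adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
print(c(manual = adj_manual, reported = s$adj.r.squared))

# Same formula with the rounded values reported for the ocean model
r2    <- 0.4463
n_big <- 814247
p_big <- 3
adj_big <- 1 - (1 - r2) * (n_big - 1) / (n_big - p_big - 1)
print(adj_big)
```

With several hundred thousand observations and only three predictors, the adjustment is negligible, which is why the multiple and adjusted R-squared values in the output agree to four decimal places.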

Answering the Business Questions:

  • It can be concluded that the salinity of ocean water is related to water temperature, depth, and pressure: all three predictors are statistically significant. However, the relationship is only moderate (adjusted R-squared of 0.446), and other predictor variables may result in a stronger model.

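  • We can predict salinity from temperature (together with depth and pressure), and the prediction errors on the new sample are small in relative terms (MAPE of about 1.05%). Predicting temperature from salinity, as the question asks, would require refitting the model with temperature as the target. In either case, based on the calculations above, additional predictor variables may be needed to make the model stronger.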