This report uses data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI). Formed in 1949, CalCOFI is a partnership of the California Department of Fish & Wildlife, NOAA Fisheries Service, and Scripps Institution of Oceanography dedicated to studying the marine environment off the coast of California, managing its living resources, and monitoring the indicators of El Niño and climate change.
Data collected include: temperature, salinity, oxygen, phosphate, silicate, nitrate and nitrite, chlorophyll, transmissometer, PAR, C14 primary productivity, phytoplankton biodiversity, zooplankton biomass, and zooplankton biodiversity.
# Load the bottle data; character columns are read as factors
ocean <- read.csv("bottle.csv", stringsAsFactors = TRUE)
sample(ocean)
Data Dictionary:
Data Wrangling:
Checking for NA value of Data:
anyNA(ocean)
#> [1] TRUE
colSums(is.na(ocean))
#> Cst_Cnt Btl_Cnt Sta_ID Depth_ID
#> 0 0 0 0
#> Depthm T_degC Salnty O2ml_L
#> 0 10963 47354 168662
#> STheta O2Sat Oxy_µmol.Kg BtlNum
#> 52689 203589 203595 746196
#> RecInd T_prec T_qual S_prec
#> 0 10963 841736 47354
#> S_qual P_qual O_qual SThtaq
#> 789949 191108 680187 799040
#> O2Satq ChlorA Chlqua Phaeop
#> 647066 639591 225697 639592
#> Phaqua PO4uM PO4q SiO3uM
#> 225693 451546 413077 510772
#> SiO3qu NO2uM NO2q NO3uM
#> 353997 527287 335389 527460
#> NO3q NH3uM NH3q C14As1
#> 334930 799901 56564 850431
#> C14A1p C14A1q C14As2 C14A2p
#> 852103 16258 850449 852121
#> C14A2q DarkAs DarkAp DarkAq
#> 16240 842214 844406 24423
#> MeanAs MeanAp MeanAq IncTim
#> 842213 844406 24424 0
#> LightP R_Depth R_TEMP R_POTEMP
#> 846212 0 10963 46047
#> R_SALINITY R_SIGMA R_SVA R_DYNHT
#> 47354 52856 52771 46657
#> R_O2 R_O2Sat R_SIO3 R_PO4
#> 168662 198415 510764 451538
#> R_NO3 R_NO2 R_NH4 R_CHLA
#> 527452 527279 799881 639587
#> R_PHAEO R_PRES R_SAMP DIC1
#> 639588 0 742857 862864
#> DIC2 TA1 TA2 pH2
#> 864639 862779 864629 864853
#> pH1 DIC.Quality.Comment
#> 864779 0
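The raw counts above are easier to compare as a share of all rows. The following is a minimal sketch on a small, made-up data frame (`toy` and its values are hypothetical, and the 0.4 cutoff is an arbitrary assumption); the same two lines could be pointed at `ocean` to rank its columns by missingness.

```r
# Toy data frame standing in for `ocean` (hypothetical values).
toy <- data.frame(
  Depthm = c(0, 10, 20, 30),
  T_degC = c(10.5, NA, 10.1, 9.8),
  Salnty = c(33.4, 33.5, NA, NA)
)

# Share of missing values per column, sorted from most to least missing.
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(na_share, 2))

# Columns whose missing share exceeds an arbitrary cutoff (assumption: 0.4).
mostly_missing <- names(na_share[na_share > 0.4])
print(mostly_missing)
```

Columns with a very high missing share are poor candidates for a model fitted after `na.omit()`, because every row with a missing value in a selected column is dropped.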
Data Cleansing:
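library(lubridate)
library(dplyr)  # provides %>% and select()
# Keep only the depth, temperature, salinity, and pressure columns
ocean_clean <- ocean %>%
  dplyr::select(Depthm, T_degC, Salnty, R_PRES)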
ocean_clean <- na.omit(ocean_clean)
head(ocean_clean)
Visualising Correlation of Variables of Data:
library(GGally)
ggcorr(ocean_clean, label = TRUE, hjust = 0.5)
Interpretation of Variables of Data:
- Some variables may be strongly correlated with each other, such as:
  - Water Pressure vs. Water Salinity
  - Water Pressure vs. Depth
  - Salinity vs. Depth
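The correlation plot can be cross-checked numerically with base R's `cor()`. The sketch below uses synthetic data (the data frame `toy` and every coefficient in it are made up for illustration, with pressure assumed to track depth closely); replacing `toy` with `ocean_clean` would give the actual values behind the plot.

```r
set.seed(42)
n <- 200
depth <- runif(n, 0, 500)

# Synthetic stand-in for ocean_clean (all relationships are assumptions).
toy <- data.frame(
  Depthm = depth,
  R_PRES = depth * 1.01 + rnorm(n, sd = 0.5),
  T_degC = 15 - 0.02 * depth + rnorm(n, sd = 0.5),
  Salnty = 33.5 + 0.002 * depth + rnorm(n, sd = 0.1)
)

# Pairwise Pearson correlations, using only complete rows.
cor_mat <- cor(toy, use = "complete.obs")
print(round(cor_mat, 2))
```

A correlation close to 1 between two predictors is worth noting before modeling, because the model cannot easily separate their individual effects.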
Visualization of Water Salinity vs. Water Temperature:
ocean_clean %>%
  ggplot(aes(x = T_degC, y = Salnty)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  theme_minimal()
boxplot(ocean_clean$Salnty)
hist(ocean_clean$T_degC, breaks = 50)
This report examines how the salinity of ocean water relates to the depth, temperature, and pressure of the water.
Some of the Business Questions that can be asked are:
- Is there a relationship between water salinity & water temperature?
- Can we predict the water temperature based on salinity?
Using the cleaned data, we first build a model on the most correlated variables.
Target Variable: Water Salinity
Predictor Variables: Depth, Temperature, Pressure
Creating a Model:
model_ocean <- lm(formula = Salnty ~ ., data = ocean_clean)
summary(model_ocean)
#>
#> Call:
#> lm(formula = Salnty ~ ., data = ocean_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.1002 -0.1940 0.0078 0.1779 3.7983
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.4658549 0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm 0.1508489 0.0004072 370.45 <0.0000000000000002 ***
#> T_degC 0.0042172 0.0001439 29.31 <0.0000000000000002 ***
#> R_PRES -0.1483664 0.0004022 -368.89 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared: 0.4463, Adjusted R-squared: 0.4463
#> F-statistic: 2.187e+05 on 3 and 814243 DF, p-value: < 0.00000000000000022
Interpreting the model using the general regression formula:
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \] where \(x_1, \dots, x_n\) are the predictor variables included in the model, \(\beta_0\) is the intercept, and \(\beta_1, \dots, \beta_n\) are the coefficients.
Interpretation of the model output summary:
- Based on the p-values, all of the predictors (Depth, Temp, Pressure) are statistically significant for the salinity of ocean water. With more than 800,000 observations, significance alone does not imply a strong relationship.
- Fitted model: Salinity = 33.466 + 0.151 * Depth + 0.004 * Temp - 0.148 * Pressure
- The intercept of 33.466 is the predicted salinity when depth, temperature, and pressure are all 0 (zero).
- Each 1-unit increase in Depth increases the predicted salinity by 0.151 units, holding Temp and Pressure constant.
- Each 1-unit increase in Temp increases the predicted salinity by 0.004 units, holding Depth and Pressure constant.
- Each 1-unit increase in Pressure decreases the predicted salinity by 0.148 units, holding Depth and Temp constant.
- The Depth and Pressure coefficients are nearly equal in size and opposite in sign, which suggests that these two predictors are strongly correlated with each other, so their individual coefficients should be read with caution.
- Although the predictors are significant, the adjusted R-squared of 0.4463 means the model explains only about 45% of the variance in salinity, so it is not a very good model overall. (An adjusted R-squared closer to 1 is considered better.)
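To make the coefficient interpretation concrete, the fitted equation can be applied by hand. The sketch below copies the estimates from the `summary(model_ocean)` output; the helper `predict_salinity()` and the example observation (depth 100, temperature 10, pressure 101, in the dataset's own units) are hypothetical and only illustrate the arithmetic.

```r
# Estimates copied from the summary(model_ocean) output.
coefs <- c(
  intercept = 33.4658549,
  Depthm    = 0.1508489,
  T_degC    = 0.0042172,
  R_PRES    = -0.1483664
)

# Hypothetical helper: evaluate the fitted linear equation for one observation.
predict_salinity <- function(depth_m, temp_c, pres) {
  unname(
    coefs["intercept"] +
      coefs["Depthm"] * depth_m +
      coefs["T_degC"] * temp_c +
      coefs["R_PRES"] * pres
  )
}

# Made-up observation: depth 100, temperature 10, pressure 101.
base_case <- predict_salinity(100, 10, 101)
print(base_case)

# Raising temperature by 1 while holding depth and pressure fixed
# changes the prediction by exactly the T_degC coefficient.
temp_effect <- predict_salinity(100, 11, 101) - base_case
print(temp_effect)
```

With the fitted model object available, `predict(model_ocean, newdata = ...)` performs the same calculation at full precision.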
Comparing predictor variables in various models:
We will create several models, each using a different pair of predictor variables, and compare them.
model_ocean1 <- lm(formula = Salnty ~ Depthm+T_degC, data = ocean_clean)
model_ocean2 <- lm(formula = Salnty ~ T_degC+R_PRES, data = ocean_clean)
model_ocean3 <- lm(formula = Salnty ~ R_PRES+Depthm, data = ocean_clean)
summary(model_ocean1)$adj.r.squared
#> [1] 0.3537267
summary(model_ocean2)$adj.r.squared
#> [1] 0.3529415
summary(model_ocean3)$adj.r.squared
#> [1] 0.4456854
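Of the three additional models, model_ocean3 (pressure and depth) has the highest adjusted R-squared (0.4457), nearly matching the full three-predictor model (0.4463). The two models that include temperature but only one of depth or pressure score about 0.35. The forward and backward stepwise models later in this report also reach an adjusted R-squared of 0.4463.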
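Comparing adjusted R-squared across candidate formulas can also be written as a single loop. This is a minimal sketch on synthetic data (the data frame `toy`, its coefficients, and the names in `candidates` are assumptions for illustration); with the real data, `toy` would be replaced by `ocean_clean`.

```r
set.seed(1)
n <- 300
depth <- runif(n, 0, 500)

# Synthetic stand-in for ocean_clean (all relationships are assumptions).
toy <- data.frame(
  Depthm = depth,
  R_PRES = depth * 1.01 + rnorm(n, sd = 0.5),
  T_degC = 15 - 0.02 * depth + rnorm(n, sd = 0.5)
)
toy$Salnty <- 33.5 + 0.002 * toy$Depthm + rnorm(n, sd = 0.1)

# Candidate formulas mirroring the models compared in the text.
candidates <- list(
  depth_temp = Salnty ~ Depthm + T_degC,
  temp_pres  = Salnty ~ T_degC + R_PRES,
  pres_depth = Salnty ~ R_PRES + Depthm,
  full       = Salnty ~ Depthm + T_degC + R_PRES
)

# Fit each candidate and collect its adjusted R-squared.
adj_r2 <- sapply(candidates, function(f) {
  summary(lm(f, data = toy))$adj.r.squared
})
print(round(sort(adj_r2, decreasing = TRUE), 4))
```

Keeping the formulas in a named list makes it easy to add or remove a candidate without repeating the `lm()` and `summary()` calls.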
Next, we will make predictions on a new data sample:
sampling_ocean <- read.csv("bottle_sample.csv", stringsAsFactors = TRUE)
head(sampling_ocean)
ocean_sampclean <- sampling_ocean %>%
  dplyr::select(Depthm, T_degC, Salnty, R_PRES)
ocean_sampclean <- na.omit(ocean_sampclean)
head(ocean_sampclean)
Prediction Modeling from new Data Sampling:
predict_ocean <- predict(object = model_ocean, newdata = ocean_sampclean)
tail(predict_ocean)
#> 495 496 497 498 499 500
#> 34.25362 34.33411 34.56082 34.43223 34.53098 34.62977
Evaluating the prediction error of the model using:
* Mean Absolute Error (MAE)
* Mean Absolute Percentage Error (MAPE)
* Root Mean Squared Error (RMSE)
library(MLmetrics)  # assumed source of MAPE() and RMSE()
mae <- mean(abs(predict_ocean - ocean_sampclean$Salnty))
mae
#> [1] 0.3505954
MAPE(y_pred = predict_ocean, y_true = ocean_sampclean$Salnty) * 100
#> [1] 1.053392
RMSE(y_pred = predict_ocean, y_true = ocean_sampclean$Salnty)
#> [1] 0.4497452
Interpretation of MAE, MAPE, and RMSE Result:
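One way to read the results above: an MAE of about 0.35 means the predictions miss the observed salinity by about 0.35 units on average, a MAPE of about 1.05% puts that miss at roughly 1% of the observed value, and an RMSE of about 0.45, being larger than the MAE, suggests that some errors are noticeably larger than the typical one. The sketch below shows how the three metrics are defined, using base R only and made-up vectors (`y_true` and `y_pred` are hypothetical values, not taken from the dataset).

```r
# Hypothetical observed and predicted salinity values.
y_true <- c(33.4, 33.6, 34.0, 34.2)
y_pred <- c(33.5, 33.5, 34.3, 34.0)

err <- y_pred - y_true

# Mean Absolute Error: average size of the miss, in salinity units.
mae <- mean(abs(err))

# Mean Absolute Percentage Error: average miss relative to the observed value.
mape <- mean(abs(err / y_true)) * 100

# Root Mean Squared Error: like MAE, but penalizes large misses more heavily.
rmse <- sqrt(mean(err^2))

print(c(MAE = mae, MAPE = mape, RMSE = rmse))
```

RMSE is never smaller than MAE for the same errors; the wider the gap between them, the more uneven the errors are.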
Feature selection using stepwise regression, and a scatterplot of the fitted (predicted) values against the residuals (errors):
model_backward <- step(object = model_ocean, direction = "backward", trace = 0)
summary(model_backward)
#>
#> Call:
#> lm(formula = Salnty ~ Depthm + T_degC + R_PRES, data = ocean_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.1002 -0.1940 0.0078 0.1779 3.7983
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.4658549 0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm 0.1508489 0.0004072 370.45 <0.0000000000000002 ***
#> T_degC 0.0042172 0.0001439 29.31 <0.0000000000000002 ***
#> R_PRES -0.1483664 0.0004022 -368.89 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared: 0.4463, Adjusted R-squared: 0.4463
#> F-statistic: 2.187e+05 on 3 and 814243 DF, p-value: < 0.00000000000000022
model_ocean_none <- lm(formula = Salnty~1, data = ocean_clean)
model_forward <- step(object = model_ocean_none, direction = "forward", scope = list(lower = model_ocean_none, upper = model_ocean), trace = 0)
summary(model_forward)
#>
#> Call:
#> lm(formula = Salnty ~ Depthm + R_PRES + T_degC, data = ocean_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.1002 -0.1940 0.0078 0.1779 3.7983
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.4658549 0.0021197 15787.83 <0.0000000000000002 ***
#> Depthm 0.1508489 0.0004072 370.45 <0.0000000000000002 ***
#> R_PRES -0.1483664 0.0004022 -368.89 <0.0000000000000002 ***
#> T_degC 0.0042172 0.0001439 29.31 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3435 on 814243 degrees of freedom
#> Multiple R-squared: 0.4463, Adjusted R-squared: 0.4463
#> F-statistic: 2.187e+05 on 3 and 814243 DF, p-value: < 0.00000000000000022
summary(model_backward)$adj.r.squared
#> [1] 0.4462688
summary(model_forward)$adj.r.squared
#> [1] 0.4462688
plot(x = model_forward$fitted.values, y = model_forward$residuals)
Interpretation of Feature Selection Results:
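In the output above, the backward and forward searches both end at the same three-predictor model, so the stepwise search did not drop any variable and its adjusted R-squared (0.4462688) equals that of the full model. Note that `step()` ranks candidate models by AIC rather than by adjusted R-squared. The sketch below illustrates this on R's built-in `mtcars` data, used here only as a stand-in; the chosen predictors (`wt`, `hp`, `qsec`) are an arbitrary assumption.

```r
# Built-in mtcars data stands in for the ocean data (illustration only).
full_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Backward: start from the full model and drop terms while AIC improves.
backward <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the intercept-only model and add terms while AIC improves.
forward <- step(
  null_model,
  direction = "forward",
  scope = list(lower = ~ 1, upper = ~ wt + hp + qsec),
  trace = 0
)

# Lower AIC is better; compare the start and end points of each search.
aic_table <- c(
  null     = AIC(null_model),
  full     = AIC(full_model),
  backward = AIC(backward),
  forward  = AIC(forward)
)
print(round(aic_table, 2))
print(formula(backward))
print(formula(forward))
```

When both directions return the same formula, as in this report, the selection step confirms the starting model rather than simplifying it.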
Answering the Business Questions:
It can be concluded that the salinity of ocean water is correlated with water temperature, depth, and pressure. However, the correlation is not strong (adjusted R-squared of about 0.45), and other predictor variables may give a stronger model.
As for predicting temperature from salinity: the models above predict salinity from depth, temperature, and pressure rather than the reverse, so they show only that the two are related. Based on the calculations above, better predictor variables are needed before the model can give reliable predictions.