Predicting Water Quality with a Neural Network

Musthofa Syarifudin

June 13, 2021

Introduction

Description

This notebook walks through using supervised learning with Keras to predict whether water is safe to drink or not. The objective is to classify which samples in the dataset are safe drinking water.

Structure

Here is the structure of the notebook:

  1. Exploratory Data Analysis
  2. Keras Baseline Model
  3. Keras Fine-Tuned Model
  4. Verdict

Exploratory Data Analysis

Import Dataset

First we load the packages used throughout the notebook, then read the dataset.
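The rest of the code depends on these packages (the function-to-package mappings are standard: glimpse() from dplyr, the split helpers from rsample, the EDA plots from inspectdf and GGally, confusionMatrix() from caret):

library(dplyr)     # glimpse() and general data wrangling
library(rsample)   # initial_split(), training(), testing()
library(inspectdf) # inspect_num() and show_plot()
library(GGally)    # ggcorr()
library(caret)     # confusionMatrix()
library(keras)     # model layers, fitting, array_reshape()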

data <- read.csv("data/water_potability.csv")
glimpse(data)
#> Rows: 3,276
#> Columns: 10
#> $ ph              <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, ~
#> $ Hardness        <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.~
#> $ Solids          <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874~
#> $ Chloramines     <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54~
#> $ Sulfate         <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66~
#> $ Conductivity    <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.~
#> $ Organic_carbon  <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,~
#> $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.~
#> $ Turbidity       <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55~
#> $ Potability      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

Note that Potability is stored as an integer; since it is the class label, we will convert it to a factor later, in the class-proportion step.

NA Values

Check the number of NA values in each column:

colSums(is.na(data))
#>              ph        Hardness          Solids     Chloramines         Sulfate 
#>             491               0               0               0             781 
#>    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
#>               0               0             162               0               0

We will fill the NA values with the mean of each variable: the number of missing values is quite large, so dropping those rows would discard a lot of data, and mean imputation keeps each variable's mean unchanged.

Check the rows that contain NA values:

rmarkdown::paged_table(head(data[!complete.cases(data), ]))

We fill the NA values with the mean of each variable:

data$ph[is.na(data$ph)] <- mean(data$ph, na.rm = TRUE) # replace missing ph with the mean of ph
data$Sulfate[is.na(data$Sulfate)] <- mean(data$Sulfate, na.rm = TRUE) # replace missing Sulfate with the mean of Sulfate
data$Trihalomethanes[is.na(data$Trihalomethanes)] <- mean(data$Trihalomethanes, na.rm = TRUE) # replace missing Trihalomethanes with the mean of Trihalomethanes
rmarkdown::paged_table(head(data))
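As an aside, the same imputation can be written once for all numeric columns instead of repeating the assignment per column; a sketch of the equivalent one-pass version (Potability has no NA values, so it is unaffected):

num_cols <- sapply(data, is.numeric)
data[num_cols] <- lapply(data[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE) # impute each column with its own mean
  x
})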

Now we can re-check the NA values in each column:

colSums(is.na(data))
#>              ph        Hardness          Solids     Chloramines         Sulfate 
#>               0               0               0               0               0 
#>    Conductivity  Organic_carbon Trihalomethanes       Turbidity      Potability 
#>               0               0               0               0               0

After confirming there are no more NA values, we can move on to the next step.

Check Class Proportion

We convert Potability to a factor and check the class proportion:

data$Potability <- as.factor(data$Potability)
barplot(data$Potability %>% table() %>% prop.table())

The bar plot shows the classes are imbalanced: non-potable water (Potability = 0) is the majority class.
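For the exact shares rather than a bar plot:

prop.table(table(data$Potability)) # proportion of each class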

Data Distribution

We can inspect the distribution of each numeric variable with inspectdf:

data %>%   
  inspect_num() %>% 
  show_plot()

Pearson Correlation

We also check the Pearson correlation between each pair of variables to watch for multicollinearity:

ggcorr(data = data, label = T)

Split Data

We split the data into training and validation sets with an 80:20 proportion. The predictors are centered and scaled; the validation set reuses the training set's center and scale so no information leaks from the validation data into the preprocessing.

set.seed(99)

index <- initial_split(data, 0.8)

train <- training(index)
val <- testing(index)

train_x <- train %>% select(-Potability) %>% scale() %>% as.matrix()
train_y <- train$Potability %>% as.numeric() - 1 # factor levels 1/2 back to labels 0/1

# scale the validation set with the training set's center and scale
val_x <- val %>% select(-Potability) %>% scale(center = attr(train_x, "scaled:center"), 
        scale = attr(train_x, "scaled:scale")) %>% as.matrix()
val_y <- val$Potability %>% as.numeric() - 1

Finally we pass the predictor matrices through array_reshape(), which fills arrays in row-major (C-style) order, the layout Keras expects.

train_x <- array_reshape(train_x, dim(train_x))

val_x <- array_reshape(val_x, dim(val_x))
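The row-major fill is what distinguishes array_reshape() from base R, which fills column-wise; a quick illustration:

array_reshape(1:6, c(2, 3)) # fills by row:    1 2 3 / 4 5 6
matrix(1:6, nrow = 2)       # fills by column: 1 3 5 / 2 4 6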

Keras Baseline Model

Keras Model Architecture

We create a Keras baseline that stacks six dense layers. We pick ReLU as the hidden activation function because it is computationally cheap and robust on larger datasets, and sigmoid as the final activation to squash the output into the range 0 to 1, which represents the class probability.
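For reference, both activations are simple element-wise functions; a minimal base-R sketch of what they compute (not how Keras implements them internally):

relu <- function(x) pmax(0, x)            # zeroes out negative inputs
sigmoid <- function(x) 1 / (1 + exp(-x))  # squashes any real value into (0, 1)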

set.seed(99)
tensorflow::set_random_seed(99)
model_baseline <- keras_model_sequential(name = "model_baseline") %>% 
  
  layer_dense(units = 128, 
              activation = "relu",
              input_shape = 9
              ) %>% 
  layer_dense(units = 64, 
              activation = "relu"
              ) %>% 

  layer_dense(units = 32, 
              activation = "relu"
              ) %>% 
  layer_dense(units = 16, 
              activation = "relu"
              ) %>% 
  layer_dense(units = 8, 
              activation = "relu"
              ) %>% 
  # Output Layer
  layer_dense(units = 1, activation = "sigmoid")

model_baseline
#> Model
#> Model: "model_baseline"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> dense_5 (Dense)                     (None, 128)                     1280        
#> ________________________________________________________________________________
#> dense_4 (Dense)                     (None, 64)                      8256        
#> ________________________________________________________________________________
#> dense_3 (Dense)                     (None, 32)                      2080        
#> ________________________________________________________________________________
#> dense_2 (Dense)                     (None, 16)                      528         
#> ________________________________________________________________________________
#> dense_1 (Dense)                     (None, 8)                       136         
#> ________________________________________________________________________________
#> dense (Dense)                       (None, 1)                       9           
#> ================================================================================
#> Total params: 12,289
#> Trainable params: 12,289
#> Non-trainable params: 0
#> ________________________________________________________________________________

Model Compile

We compile the model with binary cross-entropy loss and the Adam optimizer using its default parameters (learning rate 0.001), tracking accuracy.

model_baseline %>% 
  compile(loss = "binary_crossentropy", 
          optimizer = "adam", 
          metrics = "accuracy"
          )
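Binary cross-entropy is the natural loss for a sigmoid output: for a label y in {0, 1} and a predicted probability p, it averages -[y * log(p) + (1 - y) * log(1 - p)] over the batch. A minimal base-R sketch of the same quantity:

binary_crossentropy <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p)) # confident wrong predictions are penalized heavily
}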

Model Fitting

We fit the model on the training data, monitoring the validation set, with a batch size of 32 for 20 epochs.

history <- model_baseline %>% 
  fit(x = train_x, 
      y = train_y, 
      batch_size = 32, 
      epochs = 20, 
      validation_data = list(val_x, val_y),
      verbose = 1 
      )

The model seems to overfit: it reaches 0.8677 as its highest accuracy on the training data but only 0.6841 on the validation data.

We can visualize the training history:

plot(history)

The gap between training accuracy and validation accuracy is quite large, so we can assume the model overfits the training data.

Confusion Matrix

To verify this assumption we can build a confusion matrix and inspect the model's performance on the validation set.

# predict_classes() thresholds the sigmoid output at 0.5; it is deprecated in
# newer Keras versions, where predict() plus a manual threshold (used below for
# the tuned model) is the equivalent
prediction <- predict_classes(model_baseline, val_x)
confusionMatrix(as.factor(prediction), as.factor(val_y))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 286  88
#>          1 150 131
#>                                           
#>                Accuracy : 0.6366          
#>                  95% CI : (0.5985, 0.6736)
#>     No Information Rate : 0.6656          
#>     P-Value [Acc > NIR] : 0.9461          
#>                                           
#>                   Kappa : 0.2374          
#>                                           
#>  Mcnemar's Test P-Value : 7.684e-05       
#>                                           
#>             Sensitivity : 0.6560          
#>             Specificity : 0.5982          
#>          Pos Pred Value : 0.7647          
#>          Neg Pred Value : 0.4662          
#>              Prevalence : 0.6656          
#>          Detection Rate : 0.4366          
#>    Detection Prevalence : 0.5710          
#>       Balanced Accuracy : 0.6271          
#>                                           
#>        'Positive' Class : 0               
#> 

Keras Fine-Tuned Model

Keras Model Architecture

As we saw, the model overfits the training data and fails to generalize to the validation data. There are many ways to deal with overfitting; one of them is to use a smaller model and add a regularizer, which constrains the complexity of the network by forcing its weights to take only small values, making the distribution of weight values more “regular”. We create a network of four dense layers with 3,265 total parameters, roughly a quarter of the baseline model, and add L2 regularization, where the added cost is proportional to the square of the weight coefficients (i.e. the “L2 norm” of the weights), with a factor of 0.001.
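Conceptually, each regularized layer adds a penalty like the following to the loss, pushing the optimizer toward small weights (a base-R sketch of the quantity, not the Keras internals):

l2_penalty <- function(w, l = 0.001) {
  l * sum(w^2) # proportional to the squared magnitude of the weights
}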

tensorflow::set_random_seed(99)
model_tuned <- keras_model_sequential(name = "model_tuned") %>% 

  
  layer_dense(units = 64, 
              activation = "relu",
              input_shape = 9,
              kernel_regularizer = regularizer_l2(l = 0.001)
              ) %>% 
  layer_dense(units = 32, 
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)
              ) %>% 
  layer_dense(units = 16, 
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)
              ) %>% 
  # Output Layer
  layer_dense(units = 1, activation = "sigmoid")

model_tuned
#> Model
#> Model: "model_tuned"
#> ________________________________________________________________________________
#> Layer (type)                        Output Shape                    Param #     
#> ================================================================================
#> dense_9 (Dense)                     (None, 64)                      640         
#> ________________________________________________________________________________
#> dense_8 (Dense)                     (None, 32)                      2080        
#> ________________________________________________________________________________
#> dense_7 (Dense)                     (None, 16)                      528         
#> ________________________________________________________________________________
#> dense_6 (Dense)                     (None, 1)                       17          
#> ================================================================================
#> Total params: 3,265
#> Trainable params: 3,265
#> Non-trainable params: 0
#> ________________________________________________________________________________

Model Compile

We compile with the same loss and metric as before, but lower the Adam learning rate to 0.0001.

model_tuned %>% 
  compile(loss = "binary_crossentropy", 
          optimizer = optimizer_adam(0.0001), 
          metrics = "accuracy" 
          )

Model Fitting

We fit the model, again with a batch size of 32 for 20 epochs.

history <- model_tuned %>% 
  fit(x = train_x, 
      y = train_y,
      batch_size = 32, 
      epochs = 20, 
      validation_data = list(val_x, val_y),
      verbose = 1 
      )

It looks like we have reduced the overfitting and increased the accuracy on the validation data.

We can plot the fine-tuned model's accuracy and loss:

plot(history)

The model still overfits slightly, but not as badly as before.

Confusion Matrix

We can inspect the confusion matrix for the fine-tuned model; this time we use predict() and apply the 0.5 threshold ourselves.

prediction_ <- predict(model_tuned, val_x)
prediction_ <- ifelse(prediction_ > 0.5, 1, 0)
confusionMatrix(as.factor(prediction_), as.factor(val_y))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 397 156
#>          1  39  63
#>                                           
#>                Accuracy : 0.7023          
#>                  95% CI : (0.6657, 0.7371)
#>     No Information Rate : 0.6656          
#>     P-Value [Acc > NIR] : 0.02499         
#>                                           
#>                   Kappa : 0.2286          
#>                                           
#>  Mcnemar's Test P-Value : < 2e-16         
#>                                           
#>             Sensitivity : 0.9106          
#>             Specificity : 0.2877          
#>          Pos Pred Value : 0.7179          
#>          Neg Pred Value : 0.6176          
#>              Prevalence : 0.6656          
#>          Detection Rate : 0.6061          
#>    Detection Prevalence : 0.8443          
#>       Balanced Accuracy : 0.5991          
#>                                           
#>        'Positive' Class : 0               
#> 

Verdict

We successfully built a model that predicts the potability of water with an accuracy of 0.7023 on the validation data, up from the baseline's 0.6366. We were also able to deal with the overfitting by building a smaller model and adding a regularizer to keep the weights small, which yielded better validation accuracy. Note from the confusion matrix, though, that the tuned model leans toward the majority class (specificity of 0.2877), so detecting potable water remains an area for improvement.