Introduction
Description
This notebook shows how to use supervised learning with Keras to predict whether water is safe to drink or not. The objective is to classify potable versus non-potable water from the dataset.
Structure
Here is the structure of the notebook:
- Exploratory Data Analysis
- Keras Baseline Model
- Keras Fine-Tuned Model
- Verdict
Exploratory Data Analysis
Import Dataset
First, we load the packages the notebook relies on and read the dataset.
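Here is a minimal sketch of the packages the later chunks assume (the exact setup chunk may differ):
library(dplyr)       # glimpse(), select(), the %>% pipe
library(keras)       # layers, compile(), fit(), array_reshape()
library(rsample)     # initial_split(), training(), testing()
library(caret)       # confusionMatrix()
library(inspectdf)   # inspect_num(), show_plot()
library(GGally)      # ggcorr()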
data <- read.csv("data/water_potability.csv")
glimpse(data)
#> Rows: 3,276
#> Columns: 10
#> $ ph <dbl> NA, 3.716080, 8.099124, 8.316766, 9.092223, 5.584087, ~
#> $ Hardness <dbl> 204.8905, 129.4229, 224.2363, 214.3734, 181.1015, 188.~
#> $ Solids <dbl> 20791.32, 18630.06, 19909.54, 22018.42, 17978.99, 2874~
#> $ Chloramines <dbl> 7.300212, 6.635246, 9.275884, 8.059332, 6.546600, 7.54~
#> $ Sulfate <dbl> 368.5164, NA, NA, 356.8861, 310.1357, 326.6784, 393.66~
#> $ Conductivity <dbl> 564.3087, 592.8854, 418.6062, 363.2665, 398.4108, 280.~
#> $ Organic_carbon <dbl> 10.379783, 15.180013, 16.868637, 18.436524, 11.558279,~
#> $ Trihalomethanes <dbl> 86.99097, 56.32908, 66.42009, 100.34167, 31.99799, 54.~
#> $ Turbidity <dbl> 2.963135, 4.500656, 3.055934, 4.628771, 4.075075, 2.55~
#> $ Potability <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
We will need to convert Potability into a factor data type later, since it is the class label rather than a number.
NA Values
Check the number of NA values in each column:
colSums(is.na(data))
#> ph Hardness Solids Chloramines Sulfate
#> 491 0 0 0 781
#> Conductivity Organic_carbon Trihalomethanes Turbidity Potability
#> 0 0 162 0 0
We will fill the NA values with the mean of each variable: the number of NA values is fairly large, so dropping those rows would lose a lot of data, and mean imputation keeps each variable's distribution roughly the same.
Check the rows that contain NA values:
rmarkdown::paged_table(head(data[!complete.cases(data), ]))
We fill the NA values with the mean of each variable:
data$ph[is.na(data$ph)] <- mean(data$ph, na.rm = TRUE)                            # replace NA ph values with the mean of ph
data$Sulfate[is.na(data$Sulfate)] <- mean(data$Sulfate, na.rm = TRUE)             # replace NA Sulfate values with the mean of Sulfate
data$Trihalomethanes[is.na(data$Trihalomethanes)] <- mean(data$Trihalomethanes, na.rm = TRUE)  # replace NA Trihalomethanes values with the mean of Trihalomethanes
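The same imputation can be written more compactly with dplyr; a sketch that is equivalent here, since only ph, Sulfate, and Trihalomethanes contain NA values:
data <- data %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))  # mean-impute every numeric column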
rmarkdown::paged_table(head(data))
Now we can re-check the NA values in each column:
colSums(is.na(data))
#> ph Hardness Solids Chloramines Sulfate
#> 0 0 0 0 0
#> Conductivity Organic_carbon Trihalomethanes Turbidity Potability
#> 0 0 0 0 0
After confirming there are no more NA values, we can move on to the next step.
Check Class Proportion
We convert Potability to a factor and check the class proportion:
data$Potability <- as.factor(data$Potability)
barplot(data$Potability %>% table() %>% prop.table())
The classes are imbalanced: there are more non-potable (0) samples than potable (1) samples.
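To read the exact proportions rather than eyeballing the bars, we can also print the table; a quick sketch:
data$Potability %>% table() %>% prop.table() %>% round(3)  # exact class proportions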
Data Distribution
We can inspect the distribution of each numeric variable:
data %>%
inspect_num() %>%
show_plot()
Pearson Correlation
We also check the correlation between variables to watch for multicollinearity:
ggcorr(data = data, label = T)
Split Data
We split the data into training and validation sets with an 80:20 proportion.
set.seed(99)
index <- initial_split(data, 0.8)
train <- training(index)
val <- testing(index)
train_x <- train %>% select(-Potability) %>% scale() %>% as.matrix()
train_y <- train$Potability %>% as.numeric() -1
val_x <- val %>% select(-Potability) %>% scale(center = attr(train_x, "scaled:center") ,
scale = attr(train_x, "scaled:scale")) %>% as.matrix()
val_y <- val$Potability %>% as.numeric() - 1
We reshape our predictors into matrices for Keras:
train_x <- array_reshape(train_x, dim(train_x))
val_x <- array_reshape(val_x, dim(val_x))
Keras Baseline Model
Keras Model Architecture
We create a Keras baseline model with six dense layers. We pick ReLU as the hidden activation function because it is computationally cheap and works well on larger datasets, and sigmoid as the final activation so the output is squashed into the range 0 to 1, which can be read as the class probability.
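For reference, the sigmoid maps any real-valued output z to 1 / (1 + exp(-z)), which is why the final layer's output can be read directly as the probability that a sample belongs to class 1 (potable).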
set.seed(99)
tensorflow::set_random_seed(99)
model_baseline <- keras_model_sequential(name = "model_baseline") %>%
layer_dense(units = 128,
activation = "relu",
input_shape = 9
) %>%
layer_dense(units = 64,
activation = "relu"
) %>%
layer_dense(units = 32,
activation = "relu"
) %>%
layer_dense(units = 16,
activation = "relu",
) %>%
layer_dense(units = 8,
activation = "relu",
) %>%
# Output Layer
layer_dense(units = 1, activation = "sigmoid")
model_baseline
#> Model
#> Model: "model_baseline"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> dense_5 (Dense) (None, 128) 1280
#> ________________________________________________________________________________
#> dense_4 (Dense) (None, 64) 8256
#> ________________________________________________________________________________
#> dense_3 (Dense) (None, 32) 2080
#> ________________________________________________________________________________
#> dense_2 (Dense) (None, 16) 528
#> ________________________________________________________________________________
#> dense_1 (Dense) (None, 8) 136
#> ________________________________________________________________________________
#> dense (Dense) (None, 1) 9
#> ================================================================================
#> Total params: 12,289
#> Trainable params: 12,289
#> Non-trainable params: 0
#> ________________________________________________________________________________
Model Compile
We compile with binary cross-entropy loss, accuracy as the metric, and the Adam optimizer with its default learning rate (0.001).
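For a single observation with label y and predicted probability p, binary cross-entropy is -(y * log(p) + (1 - y) * log(1 - p)), averaged over the batch.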
model_baseline %>%
compile(loss = "binary_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
Model Fitting
We fit the model on the training data, validating on the validation set, with a batch size of 32 for 20 epochs.
history <- model_baseline %>%
fit(x = train_x,
y = train_y,
batch_size = 32,
epochs = 20,
validation_data = list(val_x, val_y),
verbose = 1
)it’s seem the model overfit gaining 0.8677 as highest accuracy on train data while getting 0.6841 as highest accuracy on validation data
We can visualize the training history:
plot(history)
The gap between training accuracy and validation accuracy is quite large, so we can assume the model overfits the training data.
Confusion Matrix
To check this assumption, we build a confusion matrix to see the model's performance on the validation set.
prediction <- predict_classes(model_baseline, val_x)
confusionMatrix(as.factor(prediction), as.factor(val_y))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 286 88
#> 1 150 131
#>
#> Accuracy : 0.6366
#> 95% CI : (0.5985, 0.6736)
#> No Information Rate : 0.6656
#> P-Value [Acc > NIR] : 0.9461
#>
#> Kappa : 0.2374
#>
#> Mcnemar's Test P-Value : 7.684e-05
#>
#> Sensitivity : 0.6560
#> Specificity : 0.5982
#> Pos Pred Value : 0.7647
#> Neg Pred Value : 0.4662
#> Prevalence : 0.6656
#> Detection Rate : 0.4366
#> Detection Prevalence : 0.5710
#> Balanced Accuracy : 0.6271
#>
#> 'Positive' Class : 0
#>
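Note that predict_classes() has been removed from newer versions of the keras R package (TensorFlow >= 2.6); if it errors, the same labels can be obtained with the predict()-plus-threshold pattern used for the fine-tuned model below.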
Keras Fine-Tuned Model
Keras Model Architecture
As we can see, the baseline model overfits the training data and fails to generalize to the validation data. There are many ways to deal with overfitting; one of them is to use a smaller model and add a regularizer, which constrains the complexity of the network by forcing its weights to take only small values, making the distribution of weight values more "regular". Here we build a network of 4 dense layers with 3,265 parameters in total, roughly a quarter of the first model, and add L2 regularization, where the cost added is proportional to the square of the weight coefficients (the "L2 norm" of the weights), with the factor set to 0.001.
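Concretely, regularizer_l2(l = 0.001) adds 0.001 * sum(w^2) of each layer's kernel weights to the loss, so large weights are penalized quadratically and pushed toward zero.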
tensorflow::set_random_seed(99)
model_tuned <- keras_model_sequential(name = "model_tuned") %>%
layer_dense(units = 64,
activation = "relu",
input_shape = 9,
kernel_regularizer = regularizer_l2(l = 0.001)
) %>%
layer_dense(units = 32,
activation = "relu",
kernel_regularizer = regularizer_l2(l = 0.001)
) %>%
layer_dense(units = 16,
activation = "relu",
kernel_regularizer = regularizer_l2(l = 0.001)
) %>%
# Output Layer
layer_dense(units = 1, activation = "sigmoid")
model_tuned
#> Model
#> Model: "model_tuned"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> dense_9 (Dense) (None, 64) 640
#> ________________________________________________________________________________
#> dense_8 (Dense) (None, 32) 2080
#> ________________________________________________________________________________
#> dense_7 (Dense) (None, 16) 528
#> ________________________________________________________________________________
#> dense_6 (Dense) (None, 1) 17
#> ================================================================================
#> Total params: 3,265
#> Trainable params: 3,265
#> Non-trainable params: 0
#> ________________________________________________________________________________
Model Compile
We use the same loss and metric as before, but lower the Adam learning rate to 0.0001.
model_tuned %>%
compile(loss = "binary_crossentropy",
optimizer = optimizer_adam(0.0001),
metrics = "accuracy"
)
Model Fitting
We fit the tuned model, now trained with the 0.0001 learning rate set above, using the same batch size and number of epochs as before.
history <- model_tuned %>%
fit(x = train_x,
y = train_y,
batch_size = 32,
epochs = 20,
validation_data = list(val_x, val_y),
verbose = 1
)
It looks like we have reduced the overfitting and increased the accuracy on the validation data.
We can plot the fine-tuned model's accuracy and loss:
plot(history)
The model still overfits slightly, but far less than before.
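We can also ask Keras directly for the validation loss and accuracy; a quick sketch:
model_tuned %>% evaluate(val_x, val_y, verbose = 0)  # returns loss and accuracy on the validation set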
Confusion Matrix
We can look at the confusion matrix for the fine-tuned model:
prediction_ <- predict(model_tuned, val_x)
prediction_ <- ifelse(prediction_ > 0.5, 1, 0)
confusionMatrix(as.factor(prediction_), as.factor(val_y))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 397 156
#> 1 39 63
#>
#> Accuracy : 0.7023
#> 95% CI : (0.6657, 0.7371)
#> No Information Rate : 0.6656
#> P-Value [Acc > NIR] : 0.02499
#>
#> Kappa : 0.2286
#>
#> Mcnemar's Test P-Value : < 2e-16
#>
#> Sensitivity : 0.9106
#> Specificity : 0.2877
#> Pos Pred Value : 0.7179
#> Neg Pred Value : 0.6176
#> Prevalence : 0.6656
#> Detection Rate : 0.6061
#> Detection Prevalence : 0.8443
#> Balanced Accuracy : 0.5991
#>
#> 'Positive' Class : 0
#>
Verdict
We successfully built a model that predicts the potability of water with an accuracy of 0.7023 on the validation data, up from 0.6366 for the baseline. We also dealt with the overfitting by using a smaller model and adding a regularizer to keep the weights small. The result is better accuracy on the validation data, although the confusion matrix shows part of the gain comes from predicting the majority (non-potable) class more often (specificity 0.2877), so there is still room for improvement.