dados = read.csv("df_data.csv", header = TRUE)
table(dados$target)
##
## high low med
## 3000 6000 1000
The number of observations per class is unbalanced, which can reduce the model's predictive power. To balance the classes, I applied random oversampling with caret's upSample():
dados$target <- as.factor(dados$target)
library(caret)  # provides upSample(), createDataPartition(), confusionMatrix()
dados_b <- upSample(x = dados[, 1:3],
                    y = dados$target,
                    list = FALSE,
                    yname = "target")
dados_sc <- data.frame(scale(dados_b[, 1:3], center = TRUE),
                       target = as.factor(dados_b$target))
table(dados_sc$target)
##
## high low med
## 6000 6000 6000
The data now have 6000 observations in each class.
pca <- princomp(dados_sc[, 1:3])  # PCA on the scaled predictors
biplot(pca)
According to the PCA biplot, the variables contribute roughly equally to the variation in the data (the loading vectors have similar lengths). The correlations between the variables are weak, and similar in magnitude across pairs.
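This visual reading can be checked numerically; a short sketch using the pca object and the scaled data defined above:
summary(pca)            # proportion of variance explained by each component
unclass(pca$loadings)   # loading vectors; similar magnitudes support equal importance
cor(dados_sc[, 1:3])    # weak pairwise correlations between the predictors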
# Compare distributions before (left column) and after (right column) scaling
par(mfcol = c(3, 2))
for(i in 1:3){
  hist(dados_b[,i], col = "red", breaks = 100)
}
for(i in 1:3){
  hist(dados_sc[,i], col = "red", breaks = 100)
}
As expected, the original and scaled data sets have the same distribution shapes, since centering and scaling are linear transformations. The Shapiro-Wilk test can verify the assumption of normality; normally distributed data would allow the use of many linear models.
# shapiro.test() accepts at most 5000 observations, hence the subsetting
shap_test <- list()
for(i in 1:3){
  shap_test[[i]] <- shapiro.test(dados_b[,i][1:5000])
}
shap_test
## [[1]]
##
## Shapiro-Wilk normality test
##
## data: dados_b[, i][1:5000]
## W = 0.97461, p-value < 2.2e-16
##
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: dados_b[, i][1:5000]
## W = 0.93975, p-value < 2.2e-16
##
##
## [[3]]
##
## Shapiro-Wilk normality test
##
## data: dados_b[, i][1:5000]
## W = 0.95644, p-value < 2.2e-16
The Shapiro-Wilk tests show that the data are not normally distributed. Therefore, non-linear machine learning models were chosen: Random Forest and an Artificial Neural Network.
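QQ-plots give a complementary visual check of the same conclusion; a minimal sketch using base R graphics:
par(mfrow = c(1, 3))
for(i in 1:3){
  qqnorm(dados_b[,i], main = colnames(dados_b)[i])  # sample vs. theoretical quantiles
  qqline(dados_b[,i], col = "red")                  # reference line under normality
}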
First, the dataset was partitioned into training and validation sets, using an 80/20 split:
validation_index <- createDataPartition(dados_sc$target, p=0.80, list=FALSE)
validation <- dados_sc[-validation_index,]
dataset <- dados_sc[validation_index,]
summary(dataset[,1:3])
## x1 x2 x3
## Min. :-2.428763 Min. :-1.724342 Min. :-1.748330
## 1st Qu.:-0.812627 1st Qu.:-0.722934 1st Qu.:-0.849059
## Median : 0.006169 Median :-0.178309 Median :-0.010334
## Mean :-0.002364 Mean : 0.001865 Mean : 0.002045
## 3rd Qu.: 0.838325 3rd Qu.: 0.528337 3rd Qu.: 0.873552
## Max. : 2.292088 Max. : 6.651952 Max. : 1.724187
summary(validation[,1:3])
## x1 x2 x3
## Min. :-2.428763 Min. :-1.702869 Min. :-1.747337
## 1st Qu.:-0.845979 1st Qu.:-0.728791 1st Qu.:-0.897571
## Median : 0.050249 Median :-0.190022 Median :-0.013312
## Mean : 0.009454 Mean :-0.007459 Mean :-0.008181
## 3rd Qu.: 0.864461 3rd Qu.: 0.516624 3rd Qu.: 0.862634
## Max. : 2.279605 Max. : 4.783829 Max. : 1.724187
library(randomForest)  # provides randomForest()
m0 <- randomForest(target ~ ., data = dataset, ntree = 2000, replace = TRUE)
predictions <- predict(m0, validation)
confusionMatrix(predictions, validation$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low med
## high 941 323 3
## low 247 810 0
## med 12 67 1197
##
## Overall Statistics
##
## Accuracy : 0.8189
## 95% CI : (0.8059, 0.8313)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7283
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: high Class: low Class: med
## Sensitivity 0.7842 0.6750 0.9975
## Specificity 0.8642 0.8971 0.9671
## Pos Pred Value 0.7427 0.7663 0.9381
## Neg Pred Value 0.8890 0.8466 0.9987
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.2614 0.2250 0.3325
## Detection Prevalence 0.3519 0.2936 0.3544
## Balanced Accuracy 0.8242 0.7860 0.9823
The Random Forest model obtained about 82% overall accuracy and showed high predictive power in all classes (for the “med” class, sensitivity of 0.9975 and balanced accuracy of 0.9823).
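To see which variables drive these predictions, the fitted forest's importance measures can be inspected; a short check using the m0 object above:
importance(m0)  # mean decrease in Gini impurity for each predictor
varImpPlot(m0)  # the same importances as a dot chart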
For the Artificial Neural Network, the data were again split 80/20, this time converted to the numeric matrix format that keras expects:
dados_sc[,4] <- as.numeric(dados_sc[,4]) - 1   # recode target as 0/1/2
dados_sc <- as.matrix(dados_sc)
dimnames(dados_sc) <- NULL
sample_set <- sample(2, nrow(dados_sc), replace = TRUE, prob = c(0.8, 0.2))
dados.training <- dados_sc[sample_set == 1, 1:3]
dados.test <- dados_sc[sample_set == 2, 1:3]
dados.trainingtarget <- dados_sc[sample_set == 1, 4]
dados.testtarget <- dados_sc[sample_set == 2, 4]
summary(dados.training[,1:3])
## V1 V2 V3
## Min. :-2.428763 Min. :-1.724342 Min. :-1.748330
## 1st Qu.:-0.832521 1st Qu.:-0.724886 1st Qu.:-0.877347
## Median : 0.012411 Median :-0.186118 Median :-0.018275
## Mean :-0.002281 Mean :-0.002371 Mean :-0.007144
## 3rd Qu.: 0.846420 3rd Qu.: 0.524433 3rd Qu.: 0.860152
## Max. : 2.279605 Max. : 6.651952 Max. : 1.724187
summary(dados.test[,1:3])
## V1 V2 V3
## Min. :-2.428763 Min. :-1.698965 Min. :-1.74436
## 1st Qu.:-0.776153 1st Qu.:-0.701462 1st Qu.:-0.82201
## Median : 0.028404 Median :-0.166597 Median : 0.02441
## Mean : 0.009051 Mean : 0.009407 Mean : 0.02835
## 3rd Qu.: 0.828671 3rd Qu.: 0.535169 3rd Qu.: 0.90978
## Max. : 2.292088 Max. : 4.932186 Max. : 1.72419
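The model-building step is not shown above, yet the plot and predictions below use objects named m1 and history. A minimal sketch of how they could have been created with the keras package follows; the layer sizes, optimizer, and epoch count here are illustrative assumptions, not the original settings:
library(keras)
# One-hot encode the 0/1/2 labels for categorical crossentropy
train_labels <- to_categorical(dados.trainingtarget, num_classes = 3)
# Hypothetical architecture: one hidden layer, softmax output over 3 classes
m1 <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 3) %>%
  layer_dense(units = 3, activation = "softmax")
m1 %>% compile(loss = "categorical_crossentropy",
               optimizer = "adam",
               metrics = "accuracy")
# history stores the per-epoch loss used in the plot below
history <- m1 %>% fit(dados.training, train_labels,
                      epochs = 50, batch_size = 32,
                      validation_split = 0.2)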
plot(history$metrics$loss, main="Model Loss", xlab = "epoch", ylab="loss", col="blue", type="l")
classes <- m1 %>% predict_classes(dados.test[,1:3])
table(dados.testtarget, classes)
## classes
## dados.testtarget 0 1 2
## 0 322 40 788
## 1 314 54 911
## 2 256 36 902
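The overall accuracy implied by this table can be computed directly from the objects above:
mean(classes == dados.testtarget)  # proportion of correct test-set predictions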
Judging by the confusion table, the ANN model struggled to separate the classes, with most predictions concentrated in a single class. More data would be needed for more efficient learning, and a larger set of predictor variables would likely improve the parameterization of the model.