R libraries used
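
The package list was not shown in the original code. Based on the functions called below (upSample, createDataPartition and confusionMatrix from caret, randomForest, and the keras pipeline), a minimal setup would be:

library(caret)         # upSample(), createDataPartition(), confusionMatrix()
library(randomForest)  # randomForest()
library(keras)         # neural network layers, fit(), predict_classes(), %>%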

Data Loading

dados = read.csv("df_data.csv", header = TRUE)

table(dados$target)
## 
## high  low  med 
## 3000 6000 1000

The number of observations per class is unbalanced, which can reduce the models' predictive power. Therefore, I applied data augmentation (up-sampling) to correct the under-represented classes:

Data transformation and augmentation (up-sampling the minority classes)

dados$target <- as.factor(dados$target)

# Up-sample the minority classes to the size of the largest class
dados_b <- upSample(x = dados[,1:3],
                    y = dados$target,
                    list = FALSE,
                    yname = "target")
# Center and scale the predictors
dados_sc = data.frame(scale(dados_b[,1:3], center = T),
                      target = as.factor(dados_b$target))
table(dados_sc$target)
## 
## high  low  med 
## 6000 6000 6000

The data now have 6,000 observations in each class.

PCA for variable analysis

pca = princomp(dados_sc[,1:3])
biplot(pca)

According to the PCA, the variables contribute almost equally to the variation of the data (the loading vectors have similar lengths in the biplot). The correlations between them are weak and of similar magnitude across pairs.
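
The visual reading of the biplot can be checked numerically on the fitted pca object; a short sketch:

summary(pca)   # proportion of variance explained by each component
pca$loadings   # similar loading magnitudes indicate similar variable importance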

Data distribution comparison

par(mfcol=c(3,2))
# Histograms of the up-sampled (unscaled) variables
for(i in 1:3){
  hist(dados_b[,i], col = "red", breaks = 100)
}
# Histograms of the scaled variables
for(i in 1:3){
  hist(dados_sc[,i], col = "red", breaks = 100)
}

Note that the original and the scaled data sets still have the same distribution shape; scaling only centers and rescales the values. The Shapiro-Wilk test can be used to verify the assumption of normality; normally distributed data would allow many linear models to be applied.

Shapiro-Wilk test for normal distribution

# shapiro.test() accepts at most 5000 observations, so only the first 5000 are used
shap_test = list()
for(i in 1:3){
  shap_test[[i]] = shapiro.test(dados_b[,i][1:5000])
}
shap_test
## [[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  dados_b[, i][1:5000]
## W = 0.97461, p-value < 2.2e-16
## 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  dados_b[, i][1:5000]
## W = 0.93975, p-value < 2.2e-16
## 
## 
## [[3]]
## 
##  Shapiro-Wilk normality test
## 
## data:  dados_b[, i][1:5000]
## W = 0.95644, p-value < 2.2e-16

The Shapiro-Wilk test shows that the data are not normally distributed. Therefore, non-linear machine learning models were chosen: Random Forest and an Artificial Neural Network.

First, the dataset was partitioned into training and test sets:

Testing and training datasets:

A proportion of 0.80 was used for the training data, with the remaining 20% reserved for validation:

validation_index <- createDataPartition(dados_sc$target, p=0.80, list=FALSE)
validation <- dados_sc[-validation_index,]
dataset <- dados_sc[validation_index,]

Summary comparison

summary(dataset[,1:3])
##        x1                  x2                  x3           
##  Min.   :-2.428763   Min.   :-1.724342   Min.   :-1.748330  
##  1st Qu.:-0.812627   1st Qu.:-0.722934   1st Qu.:-0.849059  
##  Median : 0.006169   Median :-0.178309   Median :-0.010334  
##  Mean   :-0.002364   Mean   : 0.001865   Mean   : 0.002045  
##  3rd Qu.: 0.838325   3rd Qu.: 0.528337   3rd Qu.: 0.873552  
##  Max.   : 2.292088   Max.   : 6.651952   Max.   : 1.724187
summary(validation[,1:3])
##        x1                  x2                  x3           
##  Min.   :-2.428763   Min.   :-1.702869   Min.   :-1.747337  
##  1st Qu.:-0.845979   1st Qu.:-0.728791   1st Qu.:-0.897571  
##  Median : 0.050249   Median :-0.190022   Median :-0.013312  
##  Mean   : 0.009454   Mean   :-0.007459   Mean   :-0.008181  
##  3rd Qu.: 0.864461   3rd Qu.: 0.516624   3rd Qu.: 0.862634  
##  Max.   : 2.279605   Max.   : 4.783829   Max.   : 1.724187

Random Forest method:

m0 = randomForest(target~., data=dataset, ntree=2000, replace = T)
predictions <- predict(m0, validation)

Results:

confusionMatrix(predictions, validation$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high  low  med
##       high  941  323    3
##       low   247  810    0
##       med    12   67 1197
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8189          
##                  95% CI : (0.8059, 0.8313)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7283          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: med
## Sensitivity               0.7842     0.6750     0.9975
## Specificity               0.8642     0.8971     0.9671
## Pos Pred Value            0.7427     0.7663     0.9381
## Neg Pred Value            0.8890     0.8466     0.9987
## Prevalence                0.3333     0.3333     0.3333
## Detection Rate            0.2614     0.2250     0.3325
## Detection Prevalence      0.3519     0.2936     0.3544
## Balanced Accuracy         0.8242     0.7860     0.9823

The Random Forest model obtained about 82% overall accuracy and achieved high predictive power in all classes (with a sensitivity of 0.99 and a balanced accuracy of 98% for the “med” class).
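
To see which variables drive the Random Forest predictions, the importance measures of the fitted model m0 can be inspected; a minimal sketch:

importance(m0)   # mean decrease in Gini for each variable
varImpPlot(m0)   # graphical ranking of variable importance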

Neural Network method:

Data assembly

A split of approximately 0.8/0.2 was used for the training and test data:

# Convert the factor target to numeric classes 0, 1, 2 and drop the data.frame structure
dados_sc[,4] <- as.numeric(dados_sc[,4]) - 1
dados_sc = as.matrix(dados_sc)
dimnames(dados_sc) <- NULL
# Random split of roughly 80% training / 20% test
sample_set <- sample(2, nrow(dados_sc), replace=TRUE, prob=c(0.8, 0.2))
dados.training = dados_sc[sample_set==1, 1:3]
dados.test = dados_sc[sample_set==2, 1:3]
dados.trainingtarget <- dados_sc[sample_set==1, 4]
dados.testtarget = dados_sc[sample_set==2, 4]

Summary comparison

summary(dados.training[,1:3])
##        V1                  V2                  V3           
##  Min.   :-2.428763   Min.   :-1.724342   Min.   :-1.748330  
##  1st Qu.:-0.832521   1st Qu.:-0.724886   1st Qu.:-0.877347  
##  Median : 0.012411   Median :-0.186118   Median :-0.018275  
##  Mean   :-0.002281   Mean   :-0.002371   Mean   :-0.007144  
##  3rd Qu.: 0.846420   3rd Qu.: 0.524433   3rd Qu.: 0.860152  
##  Max.   : 2.279605   Max.   : 6.651952   Max.   : 1.724187
summary(dados.test[,1:3])
##        V1                  V2                  V3          
##  Min.   :-2.428763   Min.   :-1.698965   Min.   :-1.74436  
##  1st Qu.:-0.776153   1st Qu.:-0.701462   1st Qu.:-0.82201  
##  Median : 0.028404   Median :-0.166597   Median : 0.02441  
##  Mean   : 0.009051   Mean   : 0.009407   Mean   : 0.02835  
##  3rd Qu.: 0.828671   3rd Qu.: 0.535169   3rd Qu.: 0.90978  
##  Max.   : 2.292088   Max.   : 4.932186   Max.   : 1.72419

Model generation and adjustment
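
The code that defines and fits the network was not shown in the original text. Below is a minimal sketch of a model compatible with the objects used later (m1 and history: three scaled inputs, three-class softmax output); the layer size, optimizer and number of epochs are assumptions:

# One-hot encode the 0/1/2 targets for categorical cross-entropy
train_labels <- to_categorical(dados.trainingtarget, num_classes = 3)

m1 <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = c(3)) %>%  # hidden layer size assumed
  layer_dense(units = 3, activation = "softmax")                       # one unit per class

m1 %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)

history <- m1 %>% fit(
  dados.training, train_labels,
  epochs = 100, batch_size = 32,   # training length assumed
  validation_split = 0.2
)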

Learning evidence - Loss:

plot(history$metrics$loss, main="Model Loss", xlab = "epoch", ylab="loss", col="blue", type="l")

Predictions:

classes <- m1 %>% predict_classes(dados.test[,1:3])
table(dados.testtarget, classes)
##                 classes
## dados.testtarget   0   1   2
##                0 322  40 788
##                1 314  54 911
##                2 256  36 902
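
For a direct comparison with the Random Forest result, the overall accuracy of the ANN can be computed from the same confusion table; a short sketch:

tab_ann <- table(dados.testtarget, classes)
sum(diag(tab_ann)) / sum(tab_ann)   # overall accuracy on the test set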

The ANN model showed some ability to differentiate the classes, but more data are needed for more efficient learning. In addition, a larger set of variables would be ideal to improve the parameterization of the model.

Considerations about models

Both models have the potential to be used as predictors, with emphasis on RF, which obtained better results with the available dataset.

Considerations about the dataset

The data set is suitable for preliminary predictions; however, more observations are needed for more accurate results.