1 Exploratory Data Analysis

Based on the available data, this report carries out a descriptive analysis of the variables and presents the tools used for that interpretation.

The dataset used here is associated with the papers cited below:

1.1 References Consulted

  1. I-Cheng Yeh, “Modeling of strength of high performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).
  2. I-Cheng Yeh, “Modeling Concrete Strength with Augment-Neuron Networks,” J. of Materials in Civil Engineering, ASCE, Vol. 10, No. 4, pp. 263-268 (1998).
  3. I-Cheng Yeh, “Design of High Performance Concrete Mixture Using Neural Networks,” J. of Computing in Civil Engineering, ASCE, Vol. 13, No. 1, pp. 36-42 (1999).
  4. I-Cheng Yeh, “Prediction of Strength of Fly Ash and Slag Concrete By The Use of Artificial Neural Networks,” Journal of the Chinese Institute of Civil and Hydraulic Engineering, Vol. 15, No. 4, pp. 659-663 (2003).
  5. I-Cheng Yeh, “A mix Proportioning Methodology for Fly Ash and Slag Concrete Using Artificial Neural Networks,” Chung Hua Journal of Science and Engineering, Vol. 1, No. 1, pp. 77-84 (2003).
  6. I-Cheng Yeh, “Analysis of strength of concrete using design of experiments and neural networks,” Journal of Materials in Civil Engineering, ASCE, Vol. 18, No. 4, pp. 597-604 (2006).

The dataset can be downloaded from:

http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength

The variables in this dataset are:

  • Cement (component 1) – quantitative – kg in a m3 mixture – Input Variable
  • Blast Furnace Slag (component 2) – quantitative – kg in a m3 mixture – Input Variable
  • Fly Ash (component 3) – quantitative – kg in a m3 mixture – Input Variable
  • Water (component 4) – quantitative – kg in a m3 mixture – Input Variable
  • Superplasticizer (component 5) – quantitative – kg in a m3 mixture – Input Variable
  • Coarse Aggregate (component 6) – quantitative – kg in a m3 mixture – Input Variable
  • Fine Aggregate (component 7) – quantitative – kg in a m3 mixture – Input Variable
  • Age – quantitative – Day (1~365) – Input Variable
  • Concrete compressive strength - CCS – quantitative – MPa – Output Variable

2 Packages Used

library(knitr)
library(data.table)
library(lattice)
library(ggplot2)
library(bpca)
library(Amelia)
library(corrplot)
library(factoextra)
library(dplyr)
library(car)
library(caret)
library(Metrics)
library(MLmetrics)
library(tidyverse)
library(kableExtra)
library(tibble)
library(gridExtra)
library(plot3D)
library(FactoMineR)
library(GGally)
library(PerformanceAnalytics)
library(cowplot)

3 Dataset ETL

3.1 Loading the Dataset

Analysis Data
cement blast_furnace_slag fly_ash water superplasticizer coarse_aggregate fine_aggregate age CCS w.c s.c
540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99 0.3000000 1.251852
540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89 0.3000000 1.251852
332.5 142.5 0 228 0.0 932.0 594.0 270 40.27 0.6857143 1.786466
332.5 142.5 0 228 0.0 932.0 594.0 365 41.05 0.6857143 1.786466
198.6 132.4 0 192 0.0 978.4 825.5 360 44.30 0.9667674 4.156596
266.0 114.0 0 228 0.0 932.0 670.0 90 47.03 0.8571429 2.518797
380.0 95.0 0 228 0.0 932.0 594.0 365 43.70 0.6000000 1.563158
380.0 95.0 0 228 0.0 932.0 594.0 28 36.45 0.6000000 1.563158
266.0 114.0 0 228 0.0 932.0 670.0 28 45.85 0.8571429 2.518797
475.0 0.0 0 228 0.0 932.0 594.0 28 39.29 0.4800000 1.250526
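
The last two columns, w.c and s.c, are not among the variables listed for the original dataset. In the rows shown above they match water / cement and fine_aggregate / cement (for example, 162 / 540 = 0.30 and 676 / 540 ≈ 1.2519). The sketch below shows how such ratio columns could be derived, assuming that this is how they were built; the data frame `amostra` is a hypothetical stand-in holding three rows copied from the table.

```r
# Hypothetical three-row sample copied from the table above
amostra <- data.frame(
  cement         = c(540.0, 332.5, 198.6),
  water          = c(162, 228, 192),
  fine_aggregate = c(676.0, 594.0, 825.5)
)

# Assumed definitions: water/cement and sand/cement ratios
amostra$w.c <- amostra$water / amostra$cement
amostra$s.c <- amostra$fine_aggregate / amostra$cement

print(amostra)
```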

3.2 Analysis - Adjusting the Dataset Observations

The univariate statistics are shown below.

summary(dados_expl)
##      cement      blast_furnace_slag    fly_ash           water      
##  Min.   :102.0   Min.   :  0.0      Min.   :  0.00   Min.   :121.8  
##  1st Qu.:190.7   1st Qu.:  0.0      1st Qu.:  0.00   1st Qu.:165.6  
##  Median :273.0   Median : 24.0      Median :  0.00   Median :185.7  
##  Mean   :280.4   Mean   : 74.0      Mean   : 57.25   Mean   :182.3  
##  3rd Qu.:350.0   3rd Qu.:142.5      3rd Qu.:118.60   3rd Qu.:193.0  
##  Max.   :540.0   Max.   :359.4      Max.   :260.00   Max.   :252.5  
##  NA's   :211     NA's   :231        NA's   :234      NA's   :211    
##  superplasticizer coarse_aggregate fine_aggregate       age        
##  Min.   : 0.000   Min.   : 708.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 931.2   1st Qu.:724.3   1st Qu.: 14.00  
##  Median : 6.500   Median : 967.1   Median :778.5   Median : 28.00  
##  Mean   : 6.274   Mean   : 969.8   Mean   :772.3   Mean   : 43.88  
##  3rd Qu.:10.200   3rd Qu.:1028.4   3rd Qu.:822.0   3rd Qu.: 28.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##  NA's   :229      NA's   :211      NA's   :211                     
##       CCS             w.c              s.c       
##  Min.   : 1.50   Min.   :0.2600   Min.   :0.000  
##  1st Qu.:23.52   1st Qu.:0.7178   1st Qu.:2.084  
##  Median :33.73   Median :1.1060   Median :2.782  
##  Mean   :35.08   Mean   :1.2381   Mean   :3.067  
##  3rd Qu.:44.56   3rd Qu.:1.6599   3rd Qu.:4.031  
##  Max.   :82.60   Max.   :3.7468   Max.   :9.235  
## 

3.3 Cleaning and Organizing the Dataset

Merging several datasets into a single file can produce duplicated rows and missing values (NAs).

Duplicated rows are handled as follows:

# number of observations initially loaded
nrow(dados_expl)
## [1] 3427
# remove the duplicated observations
dados_expl_1 <- unique(dados_expl)

# final number of observations, without duplicates
nrow(dados_expl_1)
## [1] 2656
print(paste0("Removed ", nrow(dados_expl) - nrow(dados_expl_1), " duplicated observations."))
## [1] "Removed 771 duplicated observations."
print(paste0("The new dataset (dados_expl_1) contains ", nrow(dados_expl_1), " observations."))
## [1] "The new dataset (dados_expl_1) contains 2656 observations."
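
As a small illustration of the step above, the sketch below applies `duplicated()` and `unique()` to a hypothetical toy data frame (`toy` and its values are made up for the example). A row counts as a duplicate only when every column matches an earlier row.

```r
# Hypothetical toy data: row 4 repeats row 1 exactly; row 2 differs in CCS
toy <- data.frame(cement = c(540, 540, 332.5, 540),
                  water  = c(162, 162, 228, 162),
                  CCS    = c(79.99, 61.89, 40.27, 79.99))

# duplicated() flags every repeat of an earlier row
n_rep <- sum(duplicated(toy))

# unique() keeps the first occurrence of each distinct row
toy_unique <- unique(toy)

cat("Removed", n_rep, "of", nrow(toy), "rows;", nrow(toy_unique), "remain\n")
```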

Missing observations are handled as follows:

  • A map of the missing entries of each variable is generated;
  • The variable with the highest share of missing entries is identified (fly_ash = 8.51%); observations are dropped when this share is up to 30%, and for higher shares the variable itself should be reviewed;
  • The observations with missing entries in that variable are removed;
  • The missing entries are checked again, and none remain;
  • The missing-value map is generated once more and shows no gaps;
  • “Problem solved”
# Missing-value map (univariate).
missmap(dados_expl, col=c("black", "grey"), legend=FALSE)

NAs <- round(colSums(is.na(dados_expl_1))*100/nrow(dados_expl_1), 2)
NAs
##             cement blast_furnace_slag            fly_ash              water 
##               7.64               8.40               8.51               7.64 
##   superplasticizer   coarse_aggregate     fine_aggregate                age 
##               8.32               7.64               7.64               0.00 
##                CCS                w.c                s.c 
##               0.00               0.00               0.00
dados_expl_2 <- dados_expl_1[!is.na(dados_expl_1$fly_ash),]

NAs <- round(colSums(is.na(dados_expl_2))*100/nrow(dados_expl_2), 2)
NAs
##             cement blast_furnace_slag            fly_ash              water 
##                  0                  0                  0                  0 
##   superplasticizer   coarse_aggregate     fine_aggregate                age 
##                  0                  0                  0                  0 
##                CCS                w.c                s.c 
##                  0                  0                  0
# Missing-value map (univariate).
missmap(dados_expl_2, col=c("black", "grey"), legend=FALSE)
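
The procedure above can be sketched on a small example. The data frame `toy`, the names `worst` and `limit`, and the values are hypothetical; only the 30% limit comes from the text. In this toy case, as in the report, removing the rows with a missing fly_ash also removes the missing values of the other columns.

```r
# Hypothetical toy data with missing values
toy <- data.frame(cement  = c(540, NA, 332.5, 198.6, 266, 380, 475, 190),
                  fly_ash = c(0, NA, NA, 0, 0, 0, 0, 0),
                  age     = c(28, 28, 270, 365, 90, 28, 28, 90))

# Percentage of NAs per column
pct_na <- round(colSums(is.na(toy)) * 100 / nrow(toy), 2)

# Variable with the highest share of missing values
worst <- names(which.max(pct_na))

# Drop rows only while the share stays within the 30% limit from the text
limit <- 30
if (pct_na[[worst]] <= limit) {
  toy_clean <- toy[!is.na(toy[[worst]]), ]
} else {
  toy_clean <- toy  # above the limit: review the variable instead
}

print(pct_na)
print(colSums(is.na(toy_clean)))
```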

3.4 Updated Univariate Statistics

glimpse(dados_expl_2) 
## Rows: 2,430
## Columns: 11
## $ cement             <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 266.0, 380.0, 38~
## $ blast_furnace_slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 114.0, 95.0, 95.0, 1~
## $ fly_ash            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ water              <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 228, 228, 1~
## $ superplasticizer   <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0~
## $ coarse_aggregate   <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, ~
## $ fine_aggregate     <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 670.0, 594.0, 59~
## $ age                <dbl> 28, 28, 270, 365, 360, 90, 365, 28, 28, 28, 90, 28,~
## $ CCS                <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 47.03, 43.70, 36~
## $ w.c                <dbl> 0.3000000, 0.3000000, 0.6857143, 0.6857143, 0.96676~
## $ s.c                <dbl> 1.251852, 1.251852, 1.786466, 1.786466, 4.156596, 2~
summary(dados_expl_2)
##      cement      blast_furnace_slag    fly_ash           water      
##  Min.   :102.0   Min.   :  0.00     Min.   :  0.00   Min.   :121.8  
##  1st Qu.:190.3   1st Qu.:  0.00     1st Qu.:  0.00   1st Qu.:164.0  
##  Median :252.0   Median : 19.00     Median : 81.90   Median :182.9  
##  Mean   :273.7   Mean   : 65.05     Mean   : 67.53   Mean   :181.0  
##  3rd Qu.:337.9   3rd Qu.:129.80     3rd Qu.:123.00   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.40     Max.   :260.00   Max.   :247.0  
##  superplasticizer coarse_aggregate fine_aggregate       age        
##  Min.   : 0.000   Min.   : 708.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:737.0   1st Qu.: 14.00  
##  Median : 6.860   Median : 971.8   Median :780.1   Median : 28.00  
##  Mean   : 6.452   Mean   : 974.5   Mean   :775.5   Mean   : 43.96  
##  3rd Qu.:10.100   3rd Qu.:1040.0   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##       CCS             w.c              s.c       
##  Min.   : 2.33   Min.   :0.2669   Min.   :1.135  
##  1st Qu.:24.00   1st Qu.:0.7282   1st Qu.:2.231  
##  Median :34.34   Median :1.0996   Median :3.016  
##  Mean   :35.47   Mean   :1.2055   Mean   :3.272  
##  3rd Qu.:45.03   3rd Qu.:1.5376   3rd Qu.:4.217  
##  Max.   :82.60   Max.   :3.7468   Max.   :9.235

4 Creating a Variable through Clustering

4.1 Non-Hierarchical Clustering

# Standardize the variables (zero mean, unit variance)
dado.padronizado <- scale(dados_expl_2)

# Fit k-means for k = 2 to 7
concreto.k2 <- kmeans(dado.padronizado, centers = 2)
concreto.k3 <- kmeans(dado.padronizado, centers = 3)
concreto.k4 <- kmeans(dado.padronizado, centers = 4)
concreto.k5 <- kmeans(dado.padronizado, centers = 5)
concreto.k6 <- kmeans(dado.padronizado, centers = 6)
concreto.k7 <- kmeans(dado.padronizado, centers = 7)

# Plot the clusters obtained for each candidate k
G1 <- fviz_cluster(concreto.k2, geom = "point", data = dado.padronizado) + ggtitle("k = 2")
G2 <- fviz_cluster(concreto.k3, geom = "point", data = dado.padronizado) + ggtitle("k = 3")
G3 <- fviz_cluster(concreto.k4, geom = "point", data = dado.padronizado) + ggtitle("k = 4")
G4 <- fviz_cluster(concreto.k5, geom = "point", data = dado.padronizado) + ggtitle("k = 5")
G5 <- fviz_cluster(concreto.k6, geom = "point", data = dado.padronizado) + ggtitle("k = 6")
G6 <- fviz_cluster(concreto.k7, geom = "point", data = dado.padronizado) + ggtitle("k = 7")

# Show all plots on the same page
grid.arrange(G1, G2, G3, G4, G5, G6, ncol = 2)

# Check the elbow (WSS) and silhouette criteria
fviz_nbclust(dado.padronizado, kmeans, method = "wss")

fviz_nbclust(dado.padronizado, kmeans, method = "silhouette")

# Following the recommendation of the silhouette plot, 7 clusters are adopted
dados_expl_2$cluster <- concreto.k7$cluster

# Preview the dataset
kable(dados_expl_2[1:10,]) %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = TRUE, 
                font_size = 12)
cement blast_furnace_slag fly_ash water superplasticizer coarse_aggregate fine_aggregate age CCS w.c s.c cluster
540.0 0.0 0 162 2.5 1040.0 676.0 28 79.99 0.3000000 1.251852 7
540.0 0.0 0 162 2.5 1055.0 676.0 28 61.89 0.3000000 1.251852 1
332.5 142.5 0 228 0.0 932.0 594.0 270 40.27 0.6857143 1.786466 2
332.5 142.5 0 228 0.0 932.0 594.0 365 41.05 0.6857143 1.786466 2
198.6 132.4 0 192 0.0 978.4 825.5 360 44.30 0.9667674 4.156596 2
266.0 114.0 0 228 0.0 932.0 670.0 90 47.03 0.8571429 2.518797 4
380.0 95.0 0 228 0.0 932.0 594.0 365 43.70 0.6000000 1.563158 2
380.0 95.0 0 228 0.0 932.0 594.0 28 36.45 0.6000000 1.563158 4
266.0 114.0 0 228 0.0 932.0 670.0 28 45.85 0.8571429 2.518797 4
475.0 0.0 0 228 0.0 932.0 594.0 28 39.29 0.4800000 1.250526 1
## the cluster also serves as a categorical variable for the exploratory analysis
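
Because kmeans() starts from random centers, cluster labels and memberships can change from one run to the next unless a seed is fixed. The sketch below is one possible way to make the elbow check repeatable in base R. It uses the built-in `iris` measurements as a stand-in for the concrete data, and the seed value and `nstart = 25` are arbitrary choices for the example, not settings taken from the analysis.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
x <- scale(iris[, 1:4])

set.seed(123)  # arbitrary seed, only to make the run repeatable

# Total within-cluster sum of squares for k = 2..7 (elbow criterion)
ks <- 2:7
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 2)))
```

Plotting `wss` against `ks` gives the same kind of curve as the "wss" method used above; the "elbow" is the k after which the decrease flattens.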

5 Principal Component Analysis (PCA)

5.1 PCA

## First, the correlations between the variables are checked

# The ggpairs() function from the GGally package shows the variable distributions,
# scatter plots, correlation values and their significance
ggpairs(dados_expl_2[,1:8])

# The chart.Correlation() function from the PerformanceAnalytics package also shows
# the variable distributions, scatter plots, correlation values and their
# significance
chart.Correlation(dados_expl_2[,1:8], histogram = TRUE)

correlacao <- cor(dados_expl_2[,1:8])
cores <- colorRampPalette(c("red", "white", "blue"))
corrplot(correlacao, order="AOE", method="square", col=cores(20), tl.srt=45, tl.cex=0.75, tl.col="black")
corrplot(correlacao, add=TRUE, type="lower", method="number", order="AOE", col="black", diag=FALSE, tl.pos="n", cl.pos="n", number.cex=0.75)

# select the variables used in the PCA (the eight input variables)
dados_expl_3 <- dados_expl_2[, 1:8]
plot(bpca(dados_expl_3))

res.pca <- PCA(dados_expl_3,  graph = FALSE)

# Extract eigenvalues/variances
get_eig(res.pca)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.27634802       28.4543502                    28.45435
## Dim.2 1.39429097       17.4286371                    45.88299
## Dim.3 1.27749929       15.9687412                    61.85173
## Dim.4 1.01180933       12.6476166                    74.49935
## Dim.5 0.96983672       12.1229590                    86.62230
## Dim.6 0.83664602       10.4580753                    97.08038
## Dim.7 0.20157156        2.5196446                    99.60002
## Dim.8 0.03199809        0.3999761                   100.00000
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))
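
The eigenvalue table can be read as follows: for a PCA on standardized variables the eigenvalues add up to the number of variables (8 here), so the first dimension explains 2.27634802 / 8 × 100 ≈ 28.45% of the variance. The sketch below reproduces the same three columns with base R's prcomp(), used here as a stand-in for the PCA() call above, on the built-in `iris` data (a hypothetical substitute for the concrete data).

```r
# Stand-in data: four numeric columns of the built-in iris dataset
x <- iris[, 1:4]

# PCA on centered and scaled variables
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca$sdev^2
pct <- eig / sum(eig) * 100

tab <- data.frame(eigenvalue = eig,
                  variance.percent = pct,
                  cumulative.variance.percent = cumsum(pct))
print(round(tab, 4))
```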

# Extract the results for variables
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
# Coordinates of variables
head(var$coord)
##                         Dim.1      Dim.2         Dim.3       Dim.4       Dim.5
## cement             -0.2305701  0.4252639  0.8399345852 -0.01944138  0.16221684
## blast_furnace_slag -0.3462523 -0.7686563  0.0005870709 -0.44748003 -0.06080994
## fly_ash             0.6315563 -0.1298539 -0.3752186984  0.36231207  0.52032135
## water              -0.8132067 -0.1619217 -0.1953682521  0.35891444  0.05495657
## superplasticizer    0.7203124 -0.2863339  0.4143713141 -0.04263821  0.28008634
## coarse_aggregate    0.0876480  0.6628116 -0.4694115845 -0.53926663  0.08908167
# Contribution of variables
head(var$contrib)
##                         Dim.1     Dim.2        Dim.3       Dim.4      Dim.5
## cement              2.3354326 12.970707 5.522431e+01  0.03735558  2.7132715
## blast_furnace_slag  5.2667973 42.375126 2.697866e-05 19.79012962  0.3812857
## fly_ash            17.5220755  1.209363 1.102068e+01 12.97379218 27.9154521
## water              29.0511446  1.880428 2.987771e+00 12.73160552  0.3114157
## superplasticizer   22.7930828  5.880201 1.344060e+01  0.17967976  8.0888211
## coarse_aggregate    0.3374779 31.508430 1.724833e+01 28.74143262  0.8182351
# Graph of variables: default plot
fviz_pca_var(res.pca, col.var = "black")

# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 8)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 8)

# Contributions of variables to PC3
fviz_contrib(res.pca, choice = "var", axes = 3, top = 8)

# Contributions of variables to PC4
fviz_contrib(res.pca, choice = "var", axes = 4, top = 8)

# Contributions of variables to PC5
fviz_contrib(res.pca, choice = "var", axes = 5, top = 8)
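
The contributions plotted above follow from the coordinates and eigenvalues listed earlier: contribution = coordinate² / eigenvalue × 100. For cement on Dim.1, (−0.2305701)² / 2.27634802 × 100 ≈ 2.335, which matches the value in the contribution table. The sketch below shows the same relationship with base R's prcomp() on the built-in `iris` data, both used as stand-ins; it is an illustration of the formula, not the code behind the report.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# Variable coordinates: loadings scaled by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, FUN = "*")

# Quality of representation and contribution (in %) of each variable
cos2    <- coord^2
contrib <- sweep(cos2, 2, eig, FUN = "/") * 100

print(round(contrib, 2))
```

Each column of `contrib` adds up to 100, so a variable contributing more than 100 / (number of variables) percent is above the average contribution for that component.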

fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
)
## Warning: ggrepel: 2398 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE)
## Warning: ggrepel: 2389 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## for comparison, the relationships between variables and dimensions are shown again after the PCA (cos2 and contributions)

corrplot(var$cos2, is.corr=FALSE)

corrplot(var$contrib, is.corr=FALSE)

6 Graphical Presentation of the Data

Analysis of the independent variables in relation to the dependent variable.

6.1 Scatter Plots with LOESS Smoothing

plot_grid(
  ggplot(data = dados_expl_2, aes(x = CCS, y = cement))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Cement"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = water))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Water"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = blast_furnace_slag))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Blast Furnace Slag"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = fly_ash))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Fly Ash"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = superplasticizer))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Superplasticizer"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = coarse_aggregate))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Coarse Aggregate - Gravel"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = fine_aggregate))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Fine Aggregate - Sand"),

  ggplot(data = dados_expl_2, aes(x = CCS, y = age))+
    geom_point()+
    geom_smooth(method = "loess")+
    labs(x = "Compressive Strength", y = "Age"),
  nrow = 3, 
  label_x = 0, label_y = 0,
  hjust = -0.5, vjust = -0.5)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

## note that some plots show null (zero) values;
# these must be excluded at analysis time, because the corresponding components were not used in the concrete mix

### THIS ITEM SHOULD BE ADDED IN THE FUTURE
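
One possible way to handle the zero values mentioned above is to filter each component before plotting or modelling it. The sketch below is only an illustration: the data frame `toy` and its values are hypothetical, and fly_ash is used as the example component.

```r
# Hypothetical toy data: fly_ash is zero when the component was not used
toy <- data.frame(fly_ash = c(0, 0, 81.9, 123.0, 0, 94.1),
                  CCS     = c(79.99, 61.89, 34.34, 45.03, 40.27, 30.10))

# Keep only the mixes that actually contain fly ash
with_fly_ash <- subset(toy, fly_ash > 0)

print(with_fly_ash)
```

The filter has to be applied per component, since a mix with no fly ash can still contain slag or superplasticizer.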

6.2 Box Plots

The plots below are grouped by cluster.

layout(matrix(1:6, ncol = 3))
boxplot(water ~ cluster, data = dados_expl_2, main = "Water")
boxplot(cement ~ cluster, data = dados_expl_2, main = "Cement")
boxplot(fine_aggregate ~ cluster, data = dados_expl_2, main = "Sand")
boxplot(coarse_aggregate ~ cluster, data = dados_expl_2, main = "Gravel")
boxplot(age ~ cluster, data = dados_expl_2, main = "Age")
boxplot(CCS ~ cluster, data = dados_expl_2, main = "Strength")

layout(matrix(1:1, ncol = 1))

The plots below are built from the variables combined in long format.

library(data.table)
## convert the variables to long format
long <- melt(setDT(dados_expl_2[,c(1,4,6,7,8,12)]), id.vars = c("cluster"), variable.name = c("componente"))

# Basic box plot - the red point marks the mean
ggplot(long, aes(x = componente, y = value, 
                 color = componente)) +  
  geom_boxplot(width = .95, outlier.colour = NA, coef = 100, alpha=0.7) +
  stat_summary(fun = mean, geom="point", shape=20, size=5, color="red", fill="red", alpha=0.9)

# Basic box plot - jittered points show how the data of each variable are distributed
ggplot(long, aes(x = componente, y = value, 
                 color = componente)) + 
  geom_boxplot(width = .95, outlier.colour = NA, coef = 100, alpha=0.7) +
  geom_jitter(
    width = .05,
    alpha = .9,
    size = 1,
    color = "orange") 

# Basic box plot - outliers shown
ggplot(long, aes(x = componente, y = value, 
                 color = componente)) +
  geom_boxplot()

# Basic box plot - the components compared across the different clusters
ggplot(long, aes(x = componente, y = value, 
                 color = componente )) +
  geom_boxplot() +
  facet_wrap(~cluster, scales = "free")

# Basic box plot - each variable individually
# cluster is still numeric here, so it is converted to a factor to get one box per cluster
ggplot(long, aes(x = factor(cluster), y = value, 
                 color = factor(cluster))) +
  geom_boxplot() +
  facet_wrap(~componente, scales = "free") + 
  labs(x = "cluster", color = "cluster") +
  ggtitle("Distribution of each component by cluster")
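
The wide-to-long conversion done with melt() above can be pictured with a base R equivalent. The sketch below is a simplified illustration on a hypothetical three-row data frame (`toy`); it keeps the column names `componente` and `value` used in the report.

```r
# Hypothetical wide data: one column per component plus the cluster label
toy <- data.frame(cement  = c(540, 332.5, 198.6),
                  water   = c(162, 228, 192),
                  cluster = c(1, 2, 2))

measure <- c("cement", "water")

# Stack the measured columns: one row per (observation, component) pair
long <- data.frame(
  cluster    = rep(toy$cluster, times = length(measure)),
  componente = factor(rep(measure, each = nrow(toy)), levels = measure),
  value      = unlist(toy[measure], use.names = FALSE)
)

print(long)
```

The long table has nrow(toy) × length(measure) rows, which is the layout ggplot2 needs to draw one box per component from a single `value` column.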

6.3 Histogram and Density Plots

The plots below combine the two plot types and show the variables together.

layout(matrix(1:2, ncol = 1))
## Frequency
hist(dados_expl_2$CCS, freq = TRUE, labels = TRUE)
## Density
hist(dados_expl_2$CCS, freq = FALSE, labels = TRUE)

layout(matrix(1:1, ncol = 1))
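
The two histograms above differ only in the vertical scale. With freq = FALSE each bar height is the bin count divided by (number of observations × bin width), so the bar areas add up to 1. A minimal check of that relationship, on a hypothetical simulated sample rather than the CCS data:

```r
set.seed(1)
x <- rnorm(200, mean = 35, sd = 16)  # hypothetical sample

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Density scale rebuilt from the counts
width <- diff(h$breaks)
density_manual <- h$counts / (sum(h$counts) * width)

print(all.equal(h$density, density_manual))
print(sum(h$density * width))
```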

# Histogram (univariate).
par(mfrow=c(3,3))
for(i in 1:9) {
  hist(dados_expl_2[,i], main=names(dados_expl_2)[i])
}

# Density plot (univariate).
par(mfrow=c(3,3))
for(i in 1:9) {
  plot(density(dados_expl_2[,i]), main=names(dados_expl_2)[i])
}

layout(matrix(1:1, ncol = 1))

The plots below combine the two plot types and show each variable individually.

dados <- dados_expl_2

# Histogram on the density scale with a normal curve that uses the
# sample mean and standard deviation of the chosen variable
hist_normal <- function(data, variable) {
  x <- data[[variable]]
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(aes(y = after_stat(density)), bins = 20,
                   colour = "blue",
                   fill = "cornflowerblue") +
    stat_function(fun = dnorm,
                  args = list(mean = mean(x), sd = sd(x)),
                  col = "red", lwd = 1.2) +
    labs(x = variable) +
    theme_light()
}

# One plot per variable, in the same order as before
for (v in c("water", "cement", "coarse_aggregate", "fine_aggregate",
            "age", "w.c", "s.c")) {
  print(hist_normal(dados, v))
}

6.4 Density Plots

The plots below show the variables individually.

a <- ggplot(dados_expl_2, aes(x = CCS))

# y axis scale = density (default behaviour)
a + geom_density() +
  geom_vline(aes(xintercept  = mean(CCS)), 
             linetype = "dashed", size = 0.6)

# Change y axis to count instead of density
a + geom_density(aes(y = after_stat(count)), fill = "lightgray") +
  geom_vline(aes(xintercept = mean(CCS)), 
             linetype = "dashed", size = 0.6,
             color = "#FC4E07")

The plots below show the variables together.

# Change outline and fill colors by groups 
# Use a custom palette
# the cluster variable is converted to a factor (categorical)
dados_expl_2$cluster <- as.factor(dados_expl_2$cluster)

# Change point colors by groups
ggplot(dados_expl_2, aes(sample = CCS)) +
  stat_qq(aes(color = cluster)) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))+
  labs(y = "CCS")
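
For reference, a normal QQ plot such as the one drawn by stat_qq() pairs the sorted sample values with the corresponding quantiles of a standard normal distribution; points close to a straight line suggest approximate normality. The sketch below builds those pairs by hand in base R on a hypothetical simulated sample. It follows the construction used by base R's qqnorm(), which is assumed here to be a fair illustration of what the plot shows.

```r
set.seed(1)
x <- rnorm(50, mean = 35, sd = 16)  # hypothetical sample

n <- length(x)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the observed values in increasing order
sample_q <- sort(x)

qq <- data.frame(theoretical = theoretical, sample = sample_q)
print(head(qq))
```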

library(ggridges)

ggplot(dados_expl_2, aes(x = CCS, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 3.66

ggplot(dados_expl_2, aes(x = cement, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 17.2

ggplot(dados_expl_2, aes(x = water, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 4.21

ggplot(dados_expl_2, aes(x = coarse_aggregate, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 16.8

ggplot(dados_expl_2, aes(x = fine_aggregate, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 19.1

ggplot(dados_expl_2, aes(x = fly_ash, y = cluster)) +
  geom_density_ridges(aes(fill = cluster)) +
  scale_fill_manual(values = c("#868686FF", "#EFC000FF", "#FF0000", "#00FF00", "#0000FF", "#ffea00", "#ff00ae"))
## Picking joint bandwidth of 9.75

6.5 Bar Charts

The plots below show the variables individually.

library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
## 
##     get_legend
ggplot(dados_expl_2, aes(cluster)) +
  geom_bar(fill = "#0073C2FF") +
  theme_pubclean()

# Count the observations in each cluster
df <- dados_expl_2 %>%
  group_by(cluster) %>%
  summarise(counts = n())
df
## # A tibble: 7 x 2
##   cluster counts
##   <fct>    <int>
## 1 1          389
## 2 2           95
## 3 3          233
## 4 4          218
## 5 5          822
## 6 6          303
## 7 7          370
# Create the bar plot. Use theme_pubclean() [in ggpubr]
ggplot(df, aes(x = cluster, y = counts)) +
  geom_bar(fill = "#0073C2FF", stat = "identity") +
  geom_text(aes(label = counts), vjust = -0.3) + 
  theme_pubclean()

counts <- table(dados_expl_2$cluster)
barplot(counts, main="Cluster Distribution",
        xlab="Cluster")

# Stacked Bar Plot with Colors and Legend
# (rows of the table are clusters, columns are the distinct cement values)
counts <- table(dados_expl_2$cluster, dados_expl_2$cement)
barplot(counts, main="Cluster distribution by cement content",
        xlab="Cement (kg per m3 of mixture)", col=c("darkblue","red"),
        legend = rownames(counts))

# Component names, repeated once per cluster
med <- rep(c("Cement", "Water", "Gravel", "Sand", "Age"), times = 5)

# Hard-coded mean of each component in each of five clusters
val <- c(203.4354, 167.9896, 1040.671, 773.8045, 38.93557,
         255.5413, 195.1651, 866.1292, 749.5544, 27.16923,
         297.1801, 195.5849, 1004.152, 742.0937, 63.77561,
         380.8699, 159.3411, 938.4668, 791.905,  31.15119,
         232.0802, 173.2611, 967.8469, 881.7636, 38.13882)

# Cluster label of each value
clus <- rep(1:5, each = 5)

grupos <- data.frame(clus, val, med)

# Create stacked bar chart
ggp <- ggplot(grupos, aes(x = as.factor(clus), y = val, fill = med, label = val)) +
  geom_bar(stat = "identity", width = 0.9) +
  # Add the values inside the bars
  geom_text(size = 5, position = position_stack(vjust = 0.5)) +
  xlab("Clusters") + 
  ylab("Means") + 
  ggtitle("Mean component amounts by cluster") + 
  labs(fill = "Components")
ggp
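
The per-cluster means in the chart above are typed in as literals. As an alternative, they could be computed directly from the data; the sketch below shows one way to do that with aggregate() on a hypothetical toy data frame (`toy` and its values are made up). It is not the code that produced the numbers above.

```r
# Hypothetical toy data with a cluster label per observation
toy <- data.frame(cluster = c(1, 1, 2, 2, 2),
                  cement  = c(200, 210, 250, 260, 270),
                  water   = c(160, 170, 190, 200, 195))

# Mean of each component within each cluster
means <- aggregate(cbind(cement, water) ~ cluster, data = toy, FUN = mean)

print(means)
```

Computing the means this way keeps the chart in step with the clustering, for instance when the number of clusters changes.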

6.6 Scatter Plots

The plots below combine several variables.

INSERT THE CAPTIONS

ggplot(data = dados_expl_2, aes(cement, water))+
  geom_point(aes(colour = cluster))

ggplot(data = dados_expl_2, aes(x = CCS, y = water, colour = as.factor(cluster), shape = as.factor(age), size = cement))+
  geom_point()

ggplot(data = dados_expl_2, aes(age, CCS))+
  geom_point(aes(colour = as.factor(cluster)))

6.7 Treemaps

The plots below are built from the variables and their internal subsets.

INSERT THE CAPTIONS

# data preparation
dados_expl_2 %>%
  filter(cluster == 1) %>%
  summary()
##      cement      blast_furnace_slag    fly_ash            water      
##  Min.   :200.0   Min.   : 0.000     Min.   : 0.0000   Min.   :142.0  
##  1st Qu.:296.0   1st Qu.: 0.000     1st Qu.: 0.0000   1st Qu.:186.0  
##  Median :339.0   Median : 0.000     Median : 0.0000   Median :192.0  
##  Mean   :345.9   Mean   : 1.185     Mean   : 0.3059   Mean   :192.7  
##  3rd Qu.:382.5   3rd Qu.: 0.000     3rd Qu.: 0.0000   3rd Qu.:193.0  
##  Max.   :540.0   Max.   :50.000     Max.   :60.0000   Max.   :234.0  
##                                                                      
##  superplasticizer coarse_aggregate fine_aggregate       age        
##  Min.   :0.0000   Min.   : 838.4   Min.   :594.0   Min.   :  1.00  
##  1st Qu.:0.0000   1st Qu.: 968.0   1st Qu.:754.0   1st Qu.:  7.00  
##  Median :0.0000   Median :1012.0   Median :784.0   Median : 28.00  
##  Mean   :0.2612   Mean   :1017.5   Mean   :776.4   Mean   : 35.96  
##  3rd Qu.:0.0000   3rd Qu.:1075.0   3rd Qu.:821.0   3rd Qu.: 28.00  
##  Max.   :8.0000   Max.   :1125.0   Max.   :945.0   Max.   :180.00  
##                                                                    
##       CCS             w.c              s.c        cluster
##  Min.   : 6.27   Min.   :0.2989   Min.   :1.135   1:389  
##  1st Qu.:17.54   1st Qu.:0.5766   1st Qu.:1.959   2:  0  
##  Median :26.06   Median :0.8220   Median :2.318   3:  0  
##  Mean   :27.61   Mean   :1.1479   Mean   :2.386   4:  0  
##  3rd Qu.:36.94   3rd Qu.:1.7240   3rd Qu.:2.705   5:  0  
##  Max.   :71.99   Max.   :2.7778   Max.   :4.225   6:  0  
##                                                   7:  0
dados_expl_2 %>%
  filter(cluster == 2) %>%
  summary()
##      cement      blast_furnace_slag    fly_ash      water      
##  Min.   :139.6   Min.   :  0.0      Min.   :0   Min.   :173.0  
##  1st Qu.:266.0   1st Qu.:  0.0      1st Qu.:0   1st Qu.:197.0  
##  Median :339.0   Median : 38.0      Median :0   Median :228.0  
##  Mean   :343.0   Mean   : 66.6      Mean   :0   Mean   :214.9  
##  3rd Qu.:380.0   3rd Qu.:114.0      3rd Qu.:0   3rd Qu.:228.0  
##  Max.   :540.0   Max.   :237.5      Max.   :0   Max.   :228.0  
##                                                                
##  superplasticizer coarse_aggregate fine_aggregate       age       
##  Min.   :0        Min.   : 932.0   Min.   :594.0   Min.   :180.0  
##  1st Qu.:0        1st Qu.: 932.0   1st Qu.:594.0   1st Qu.:180.0  
##  Median :0        Median : 932.0   Median :670.0   Median :270.0  
##  Mean   :0        Mean   : 967.4   Mean   :677.9   Mean   :281.9  
##  3rd Qu.:0        3rd Qu.: 968.0   3rd Qu.:722.5   3rd Qu.:365.0  
##  Max.   :0        Max.   :1125.0   Max.   :885.0   Max.   :365.0  
##                                                                   
##       CCS             w.c              s.c        cluster
##  Min.   :25.08   Min.   :0.3204   Min.   :1.135   1: 0   
##  1st Qu.:40.02   1st Qu.:0.6240   1st Qu.:1.563   2:95   
##  Median :43.70   Median :0.9668   Median :2.204   3: 0   
##  Mean   :46.14   Mean   :1.1378   Mean   :2.212   4: 0   
##  3rd Qu.:52.91   3rd Qu.:1.5453   3rd Qu.:2.519   5: 0   
##  Max.   :74.17   Max.   :3.1214   Max.   :5.780   6: 0   
##                                                   7: 0
# library
library(treemap)

# treemap
treemap(dados_expl_2,
        index="cluster",
        vSize="water",
        type="index")
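As a rough check on what the rectangles encode: with type = "index", treemap aggregates vSize within each level of index (its default fun.aggregate is "sum"), so each area is proportional to a group total. The sketch below reproduces that aggregation in base R. The data frame `toy` and its values are made-up assumptions for illustration, not the concrete data.

```r
# Hypothetical toy data standing in for dados_expl_2 (values are made up)
toy <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2, 3)),
  water   = c(190, 200, 220, 230, 210, 150)
)

# Sum of water per cluster: the quantity each rectangle is sized by
totals <- aggregate(water ~ cluster, data = toy, FUN = sum)

# Share of the total plotting area each cluster would occupy
totals$share <- totals$water / sum(totals$water)
print(totals)
```

Because the size is a sum, a cluster with many rows gets a large rectangle even when its per-row water content is ordinary; the area reflects row count times mean, not the mean alone.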

# treemap
treemap(grupos,
        index=c("clus","med"),
        vSize="val",
        fontsize.labels=c(20, 16),
        type="index")

library(treemapify)
ggplot(dados_expl_2, aes(area = cement, fill = cluster)) +
  geom_treemap()

ggplot(grupos, aes(area = val, fill = clus)) +
  geom_treemap()+
  facet_wrap( ~ clus) +
  labs(
    title = "Title",
    caption = "Caption",
    fill = "Cluster")

names(dados_expl_2)
##  [1] "cement"             "blast_furnace_slag" "fly_ash"           
##  [4] "water"              "superplasticizer"   "coarse_aggregate"  
##  [7] "fine_aggregate"     "age"                "CCS"               
## [10] "w.c"                "s.c"                "cluster"
library(data.table)
long <- melt(setDT(dados_expl_2[,c(1,4,6,7,8,12)]), id.vars = c("cluster"), variable.name = c("componente"))

# treemap
treemap(long,
        index=c("cluster","componente"),
        vSize="value",
        fontsize.labels=c(20, 16),
        type="index",
        vColor = "cluster", 
        palette="Pastel1")

## palette options =>  https://colorbrewer2.org/#type=qualitative&scheme=Pastel1&n=5

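6.8 CART Charts

The charts below are generated from decision trees: CART models fitted with rpart and conditional inference trees fitted with party.

CART (classification and regression trees) is a single-tree method rather than an "ensemble" technique, although it is the usual building block of ensembles such as random forests. Even though it is a predictive model, a single tree can help a great deal in understanding and exploring the data.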
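To make the idea of a regression tree concrete, the sketch below searches one numeric predictor for the threshold that minimises the summed squared error of the two resulting groups, which is the criterion behind an "anova" split. The helper names `sse` and `best_split` and the toy values are assumptions for illustration. rpart runs a comparable search over every predictor and then recurses, with stopping rules this sketch omits.

```r
# Sum of squared errors around the mean of a numeric vector
sse <- function(y) sum((y - mean(y))^2)

# Exhaustive search for the best threshold on a single numeric predictor
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  cost <- vapply(cuts, function(thr) sse(y[x < thr]) + sse(y[x >= thr]), numeric(1))
  best <- which.min(cost)
  list(cut = cuts[best], sse_after = cost[best], sse_before = sse(y))
}

# Hypothetical toy data: strength rising with curing age (made-up values)
age      <- c(3, 3, 7, 7, 28, 28, 90, 90)
strength <- c(12, 14, 20, 22, 38, 40, 45, 47)
res <- best_split(age, strength)
print(res)
```

On this toy data the search picks the threshold between 7 and 28 days, because that cut separates the low-strength from the high-strength observations most cleanly.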

library(rpart)
library(rpart.plot)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
## 
##     Predict
## Loading required package: strucchange
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
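# Split the data into training and test sets
library(caret)  # createDataPartition(), train(), RMSE()
# Drop the cluster column (12) and any incomplete rows
pima.data <- na.omit(dados_expl_2[c(-12)])
set.seed(123)
# Partition on the cleaned data so the row indices match pima.data
training.samples <- pima.data$CCS %>%
  createDataPartition(p = 0.75, list = FALSE)
train.data  <- pima.data[training.samples, ]
test.data <- pima.data[-training.samples, ]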
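For a numeric outcome, caret documents that createDataPartition groups the values by percentiles and samples within each group, so the training set covers the whole range of CCS. The base R sketch below imitates that idea and is not caret's actual implementation. The function name `stratified_split`, the default of five groups, and the simulated `y` are assumptions for illustration.

```r
# Stratified split for a numeric outcome: bin by quantiles, sample within bins
stratified_split <- function(y, p = 0.75, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
    # Index into i explicitly so a bin with one element is handled correctly
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  }))
  sort(unname(idx))
}

set.seed(123)
y <- runif(200, min = 5, max = 80)   # made-up strengths, not the real CCS
train_idx <- stratified_split(y, p = 0.75)
test_idx  <- setdiff(seq_along(y), train_idx)

# Fraction of rows that ended up in the training set
print(length(train_idx) / length(y))
```

Sampling within bins keeps weak and strong mixes in both sets, which matters when the test error is later compared across the strength range.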

library(party)
set.seed(123)
model <- train(
  CCS ~., data = train.data, method = "ctree2",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(maxdepth = 3, mincriterion = 0.95 )
)
plot(model$finalModel)

# Create a decision tree model
tree <- rpart(CCS~., data = dados_expl_2[c(-12)], cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

fancyRpartPlot(tree, caption = NULL)

plotcp(tree)

fit <- rpart(CCS ~ ., method = "anova", data = dados_expl_2[c(-12)])
plot(fit, uniform = TRUE, main = "Regression Tree for CCS")
text(fit, use.n = TRUE, all = TRUE, cex = .8)

fit <- ctree(CCS ~ ., data = dados_expl_2[c(-12)])
plot(fit, main = "Conditional Inference Tree for CCS")

gtree <- ctree(CCS ~ ., data = dados_expl_2[c(-12)])
plot(gtree)

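# CCS is continuous, so fit a regression ("anova") tree; extra = 1 shows the node counts
a <- rpart(CCS ~ ., data = dados_expl_2[c(-12)], method = "anova", cp = .001, minsplit = 5)
rpart.plot(a, type = 0, extra = 1, under = TRUE, branch = .3)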

# Step1: Begin with a small cp. 
set.seed(123)
tree <- rpart(CCS ~ ., data = dados_expl_2[c(-12)], 
                      control = rpart.control(cp = 0.0001))
printcp(tree)
## 
## Regression tree:
## rpart(formula = CCS ~ ., data = dados_expl_2[c(-12)], control = rpart.control(cp = 1e-04))
## 
## Variables actually used in tree construction:
## [1] age                blast_furnace_slag cement             coarse_aggregate  
## [5] fine_aggregate     fly_ash            s.c                superplasticizer  
## [9] water             
## 
## Root node error: 610152/2430 = 251.09
## 
## n= 2430 
## 
##             CP nsplit rel error   xerror      xstd
## 1   0.26907068      0  1.000000 1.000889 0.0267347
## 2   0.13667044      1  0.730929 0.732221 0.0210169
## 3   0.07018919      2  0.594259 0.597817 0.0175576
## 4   0.05361979      3  0.524070 0.528792 0.0154163
## 5   0.04345721      4  0.470450 0.477827 0.0141349
## 6   0.02997631      5  0.426993 0.433772 0.0127434
## 7   0.02880672      6  0.397016 0.407046 0.0124516
## 8   0.02332465      7  0.368210 0.374631 0.0118797
## 9   0.02203690      8  0.344885 0.358359 0.0113467
## 10  0.01481403      9  0.322848 0.332636 0.0101677
## 11  0.01258723     10  0.308034 0.320823 0.0101306
## 12  0.00884509     11  0.295447 0.303074 0.0098086
## 13  0.00880508     12  0.286602 0.295473 0.0099057
## 14  0.00854518     13  0.277797 0.294884 0.0098997
## 15  0.00847648     14  0.269251 0.293064 0.0097912
## 16  0.00825518     15  0.260775 0.291545 0.0097303
## 17  0.00802649     16  0.252520 0.284042 0.0094680
## 18  0.00791332     17  0.244493 0.275804 0.0092073
## 19  0.00685384     18  0.236580 0.263352 0.0084330
## 20  0.00548474     19  0.229726 0.244359 0.0079897
## 21  0.00530424     20  0.224241 0.243421 0.0080357
## 22  0.00490457     21  0.218937 0.240568 0.0079490
## 23  0.00471301     22  0.214033 0.235440 0.0077831
## 24  0.00443272     23  0.209320 0.228660 0.0076866
## 25  0.00428865     24  0.204887 0.218450 0.0073296
## 26  0.00422457     26  0.196310 0.216649 0.0072702
## 27  0.00384991     27  0.192085 0.214947 0.0072619
## 28  0.00384954     28  0.188235 0.212443 0.0072337
## 29  0.00362786     29  0.184386 0.211036 0.0072247
## 30  0.00343244     30  0.180758 0.207461 0.0072618
## 31  0.00333735     32  0.173893 0.205230 0.0072278
## 32  0.00329594     33  0.170555 0.204730 0.0072243
## 33  0.00306182     35  0.163964 0.203532 0.0072163
## 34  0.00296884     36  0.160902 0.198300 0.0070531
## 35  0.00259727     37  0.157933 0.194256 0.0071361
## 36  0.00258373     38  0.155336 0.187874 0.0068979
## 37  0.00255136     40  0.150168 0.185232 0.0068321
## 38  0.00252734     41  0.147617 0.183781 0.0068189
## 39  0.00250391     42  0.145090 0.183094 0.0068116
## 40  0.00247208     43  0.142586 0.180857 0.0067146
## 41  0.00244862     44  0.140114 0.179872 0.0066687
## 42  0.00244128     45  0.137665 0.179088 0.0066567
## 43  0.00235057     46  0.135224 0.174671 0.0065958
## 44  0.00213024     47  0.132873 0.169311 0.0062737
## 45  0.00210225     49  0.128613 0.165840 0.0060708
## 46  0.00209625     51  0.124408 0.166084 0.0061255
## 47  0.00189329     52  0.122312 0.164831 0.0061407
## 48  0.00174604     53  0.120419 0.161902 0.0060586
## 49  0.00173567     54  0.118672 0.156030 0.0058408
## 50  0.00163052     55  0.116937 0.154258 0.0058271
## 51  0.00157897     56  0.115306 0.149849 0.0056940
## 52  0.00155800     57  0.113727 0.149910 0.0057243
## 53  0.00152325     59  0.110611 0.148937 0.0057173
## 54  0.00150656     60  0.109088 0.148414 0.0057103
## 55  0.00149121     61  0.107582 0.146648 0.0055398
## 56  0.00146684     62  0.106090 0.145814 0.0055301
## 57  0.00144448     63  0.104623 0.144027 0.0054979
## 58  0.00144297     64  0.103179 0.143356 0.0054999
## 59  0.00142606     65  0.101736 0.143022 0.0054823
## 60  0.00142193     66  0.100310 0.142861 0.0054827
## 61  0.00141828     67  0.098888 0.142757 0.0054803
## 62  0.00139428     70  0.094569 0.141481 0.0054724
## 63  0.00137856     73  0.090386 0.141012 0.0054707
## 64  0.00123616     74  0.089007 0.138237 0.0054379
## 65  0.00120596     76  0.086535 0.134080 0.0052977
## 66  0.00119621     77  0.085329 0.131957 0.0052770
## 67  0.00105951     78  0.084133 0.127098 0.0052033
## 68  0.00104530     79  0.083073 0.125338 0.0051449
## 69  0.00093666     80  0.082028 0.122969 0.0050327
## 70  0.00084541     81  0.081091 0.119073 0.0049569
## 71  0.00084279     82  0.080246 0.116876 0.0049005
## 72  0.00079491     83  0.079403 0.115771 0.0048804
## 73  0.00072182     84  0.078608 0.113672 0.0047505
## 74  0.00072065     85  0.077886 0.111894 0.0046391
## 75  0.00071339     86  0.077166 0.111782 0.0046393
## 76  0.00070541     87  0.076452 0.111612 0.0046358
## 77  0.00067232     88  0.075747 0.110276 0.0046341
## 78  0.00064813     89  0.075075 0.109373 0.0046448
## 79  0.00063945     90  0.074426 0.109481 0.0046456
## 80  0.00062175     91  0.073787 0.108888 0.0046308
## 81  0.00060900     92  0.073165 0.108742 0.0046310
## 82  0.00060653     93  0.072556 0.108338 0.0046268
## 83  0.00057929     94  0.071950 0.108531 0.0046244
## 84  0.00054857     95  0.071370 0.106417 0.0045389
## 85  0.00054270     96  0.070822 0.105866 0.0045349
## 86  0.00053774     98  0.069736 0.105508 0.0045347
## 87  0.00053463    100  0.068661 0.105067 0.0045319
## 88  0.00053332    102  0.067592 0.104806 0.0045268
## 89  0.00051173    103  0.067058 0.104135 0.0045203
## 90  0.00050639    104  0.066547 0.103624 0.0045207
## 91  0.00049746    105  0.066040 0.103397 0.0045228
## 92  0.00046981    106  0.065543 0.102105 0.0044889
## 93  0.00046845    107  0.065073 0.101153 0.0044791
## 94  0.00046329    108  0.064604 0.100791 0.0044773
## 95  0.00045663    109  0.064141 0.100449 0.0044717
## 96  0.00044203    110  0.063685 0.100492 0.0044762
## 97  0.00043694    111  0.063243 0.099136 0.0044418
## 98  0.00043674    112  0.062806 0.098949 0.0044422
## 99  0.00042992    113  0.062369 0.098647 0.0044412
## 100 0.00042269    114  0.061939 0.098613 0.0044454
## 101 0.00040872    115  0.061516 0.098348 0.0044495
## 102 0.00040768    116  0.061108 0.097861 0.0044330
## 103 0.00040624    117  0.060700 0.097626 0.0044322
## 104 0.00039624    119  0.059887 0.097381 0.0044311
## 105 0.00036548    120  0.059491 0.096397 0.0044284
## 106 0.00034583    121  0.059126 0.095818 0.0044232
## 107 0.00033607    122  0.058780 0.095435 0.0044198
## 108 0.00033452    123  0.058444 0.095015 0.0043990
## 109 0.00033190    124  0.058109 0.094864 0.0043986
## 110 0.00032988    125  0.057777 0.094755 0.0043988
## 111 0.00032532    127  0.057118 0.094813 0.0044008
## 112 0.00031721    128  0.056792 0.094685 0.0044012
## 113 0.00029800    129  0.056475 0.094435 0.0043952
## 114 0.00029314    130  0.056177 0.093832 0.0043800
## 115 0.00028457    132  0.055591 0.092951 0.0043757
## 116 0.00028147    133  0.055306 0.092687 0.0043740
## 117 0.00026273    134  0.055025 0.091994 0.0043700
## 118 0.00026151    137  0.054237 0.091358 0.0043608
## 119 0.00025999    138  0.053975 0.091358 0.0043608
## 120 0.00025819    139  0.053715 0.091304 0.0043618
## 121 0.00024384    140  0.053457 0.090822 0.0043601
## 122 0.00023650    141  0.053213 0.090164 0.0043586
## 123 0.00023385    142  0.052976 0.089884 0.0043483
## 124 0.00023312    143  0.052743 0.089832 0.0043480
## 125 0.00022673    144  0.052510 0.089772 0.0043472
## 126 0.00021980    145  0.052283 0.089302 0.0043453
## 127 0.00021952    146  0.052063 0.089226 0.0043462
## 128 0.00021418    147  0.051843 0.089145 0.0043465
## 129 0.00021100    148  0.051629 0.089100 0.0043461
## 130 0.00021075    149  0.051418 0.089181 0.0043462
## 131 0.00020179    150  0.051208 0.088957 0.0043322
## 132 0.00019441    151  0.051006 0.088296 0.0043131
## 133 0.00017584    152  0.050811 0.087830 0.0043113
## 134 0.00016983    154  0.050460 0.087522 0.0043130
## 135 0.00016944    155  0.050290 0.087361 0.0043100
## 136 0.00016589    156  0.050120 0.087194 0.0043107
## 137 0.00016322    157  0.049954 0.087170 0.0043108
## 138 0.00016250    158  0.049791 0.087110 0.0043125
## 139 0.00015946    159  0.049629 0.086651 0.0043033
## 140 0.00015259    160  0.049469 0.086642 0.0043014
## 141 0.00014679    161  0.049317 0.086496 0.0043004
## 142 0.00014343    162  0.049170 0.086307 0.0043004
## 143 0.00014284    163  0.049026 0.086288 0.0043004
## 144 0.00014094    164  0.048884 0.086202 0.0043008
## 145 0.00013741    165  0.048743 0.086092 0.0042999
## 146 0.00013389    166  0.048605 0.085715 0.0042959
## 147 0.00011307    167  0.048471 0.084939 0.0042904
## 148 0.00010345    168  0.048358 0.084606 0.0042892
## 149 0.00010000    169  0.048255 0.084560 0.0042883
bestcp <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = bestcp)

plot(tree.pruned)
text(tree.pruned, cex = 0.8, use.n = TRUE, xpd = TRUE)

prp(tree.pruned, faclen = 0, cex = 0.8, extra = 1)

tot_count <- function(x, labs, digits, varlen)
{
  paste(labs, "\n\nn =", x$frame$n)
}
prp(tree.pruned, faclen = 0, cex = 0.8, node.fun=tot_count)

only_count <- function(x, labs, digits, varlen)
{
  paste(x$frame$n)
}
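# tree.pruned is a regression tree, so colour each node by whether its
# predicted CCS is below or at/above the overall mean
ccs_mean <- mean(dados_expl_2$CCS)
boxcols <- ifelse(tree.pruned$frame$yval < ccs_mean, "pink", "palegreen3")
par(xpd=TRUE)
prp(tree.pruned, faclen = 0, cex = 0.8, node.fun=only_count, box.col = boxcols)
legend("bottomleft", legend = c("below mean","at or above mean"), fill = c("pink", "palegreen3"),
       title = "Predicted CCS")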

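# Despite the name, this is a regression tree: CCS is numeric, so rpart defaults to method = "anova"
binary.model <- rpart(CCS ~ ., data = dados_expl_2[c(-12)], cp = .02)
rpart.plot(binary.model)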

anova.model <- rpart(CCS ~ ., data = dados_expl_2[c(-12)])
rpart.plot(anova.model)

m1 <- rpart(
  formula = CCS ~ .,
  data    = dados_expl_2[c(-12)],
  method  = "anova")
m1
## n= 2430 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 2430 610151.700 35.46851  
##     2) age< 21 758 103497.700 23.26084  
##       4) cement< 354.5 562  45897.870 19.38107  
##         8) age< 10.5 397  19549.520 15.77574 *
##         9) age>=10.5 165   8771.887 28.05570 *
##       5) cement>=354.5 196  24883.640 34.38551 *
##     3) age>=21 1672 342480.000 41.00284  
##       6) cement< 354.5 1342 205711.500 37.50082  
##        12) cement< 164.8 295  23568.650 26.85841  
##          24) blast_furnace_slag< 0.15 62   3373.118 16.12774 *
##          25) blast_furnace_slag>=0.15 233  11156.730 29.71378 *
##        13) cement>=164.8 1047 139316.800 40.49940  
##          26) water>=175.98 643  60500.700 36.51042  
##            52) blast_furnace_slag< 13 336  18321.210 32.13935 *
##            53) blast_furnace_slag>=13 307  28733.640 41.29440 *
##          27) water< 175.98 404  52300.590 46.84819  
##            54) blast_furnace_slag< 47.63 265  22795.440 42.54966  
##             108) s.c>=3.984321 117   5204.612 36.49487 *
##             109) s.c< 3.984321 148   9910.706 47.33622 *
##            55) blast_furnace_slag>=47.63 139  15273.580 55.04324 *
##       7) cement>=354.5 330  53378.840 55.24439  
##        14) water>=183.05 143  17260.350 46.73098 *
##        15) water< 183.05 187  17828.400 61.75465 *
rpart.plot(m1)
plotcp(m1)

m2 <- rpart(
  formula = CCS ~ .,
  data    = dados_expl_2[c(-12)],
  method  = "anova", 
  control = list(cp = 0, xval = 10)
)
rpart.plot(m2)

plotcp(m2)
abline(v = 12, lty = "dashed")

# Tuning
m3 <- rpart(
  formula = CCS ~ .,
  data    = dados_expl_2[c(-12)],
  method  = "anova", 
  control = list(minsplit = 10, maxdepth = 12, xval = 10))
m3$cptable
##            CP nsplit rel error    xerror       xstd
## 1  0.26907068      0 1.0000000 1.0012305 0.02674818
## 2  0.13667044      1 0.7309293 0.7319427 0.02099423
## 3  0.07018919      2 0.5942589 0.5977102 0.01754427
## 4  0.05361979      3 0.5240697 0.5312068 0.01546320
## 5  0.04345721      4 0.4704499 0.4809029 0.01427868
## 6  0.02997631      5 0.4269927 0.4421931 0.01321163
## 7  0.02880672      6 0.3970164 0.4213547 0.01302233
## 8  0.02332465      7 0.3682097 0.3762610 0.01185103
## 9  0.02203690      8 0.3448850 0.3673092 0.01151320
## 10 0.01481403      9 0.3228481 0.3439957 0.01065425
## 11 0.01258723     10 0.3080341 0.3305256 0.01056481
## 12 0.01000000     11 0.2954468 0.3134969 0.01014127
rpart.plot(m3)

plotcp(m3)
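plotcp draws a horizontal line one standard error above the minimum cross-validated error. A common reading of that line is the "1-SE rule": choose the smallest tree whose xerror falls at or below it. The sketch below applies the rule to the m3$cptable values printed above, copied by hand. The rule is shown here as an optional alternative to taking the plain minimum, not as the method this analysis uses.

```r
# Cross-validation table copied from the m3$cptable output above
cptable <- data.frame(
  CP     = c(0.26907068, 0.13667044, 0.07018919, 0.05361979, 0.04345721,
             0.02997631, 0.02880672, 0.02332465, 0.02203690, 0.01481403,
             0.01258723, 0.01000000),
  nsplit = 0:11,
  xerror = c(1.0012305, 0.7319427, 0.5977102, 0.5312068, 0.4809029,
             0.4421931, 0.4213547, 0.3762610, 0.3673092, 0.3439957,
             0.3305256, 0.3134969),
  xstd   = c(0.02674818, 0.02099423, 0.01754427, 0.01546320, 0.01427868,
             0.01321163, 0.01302233, 0.01185103, 0.01151320, 0.01065425,
             0.01056481, 0.01014127)
)

# Row with the lowest cross-validated error
i_min <- which.min(cptable$xerror)

# 1-SE threshold and the first (smallest) tree that falls under it
threshold <- cptable$xerror[i_min] + cptable$xstd[i_min]
i_1se     <- min(which(cptable$xerror <= threshold))

print(cptable[unique(c(i_min, i_1se)), ])
```

For this table the two choices coincide at 11 splits, since the 10-split tree's xerror is still above the threshold. On the much deeper table from cp = 0.0001 the rule would typically pick a smaller tree than the minimum.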

hyper_grid <- expand.grid(
  minsplit = seq(5, 20, 1),
  maxdepth = seq(8, 15, 1))
head(hyper_grid)
##   minsplit maxdepth
## 1        5        8
## 2        6        8
## 3        7        8
## 4        8        8
## 5        9        8
## 6       10        8
nrow(hyper_grid)
## [1] 128
models <- list()

for (i in 1:nrow(hyper_grid)) {
  
  # get minsplit, maxdepth values at row i
  minsplit <- hyper_grid$minsplit[i]
  maxdepth <- hyper_grid$maxdepth[i]
  
  # train a model and store in the list
  models[[i]] <- rpart(
    formula = CCS ~ .,
    data    = dados_expl_2[c(-12)],
    method  = "anova",
    control = list(minsplit = minsplit, maxdepth = maxdepth)
  )
}

# function to get optimal cp
get_cp <- function(x) {
  best <- which.min(x$cptable[, "xerror"])
  x$cptable[best, "CP"]
}

# function to get minimum error
get_min_error <- function(x) {
  best <- which.min(x$cptable[, "xerror"])
  x$cptable[best, "xerror"]
}

hyper_grid %>%
  mutate(
    cp    = purrr::map_dbl(models, get_cp),
    error = purrr::map_dbl(models, get_min_error)
  ) %>%
  arrange(error) %>%
  top_n(-5, wt = error)
##   minsplit maxdepth   cp     error
## 1       17        8 0.01 0.2991030
## 2       18        8 0.01 0.2995492
## 3       20       14 0.01 0.2996629
## 4        9       15 0.01 0.3015332
## 5        5       15 0.01 0.3018493
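# Refit with the best combination from the grid search above (minsplit = 17, maxdepth = 8)
optimal_tree <- rpart(
  formula = CCS ~ .,
  data    = dados_expl_2[c(-12)],
  method  = "anova",
  control = list(minsplit = 17, maxdepth = 8, cp = 0.01)
)

# In-sample RMSE: the tree is evaluated on the same data it was fitted to
pred <- predict(optimal_tree, newdata = dados_expl_2[c(-12)])
RMSE(pred, dados_expl_2$CCS)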
## [1] 8.61302
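The RMSE above can be cross-checked against the tables already printed. The "rel error" column of a cptable is the tree's training deviance divided by the root deviance, so the training MSE is rel error times root deviance over n. The sketch below does that arithmetic with numbers copied from the m1 and m3$cptable outputs, under the assumption that optimal_tree is the same 11-split tree reached at cp = 0.01.

```r
# Values copied from the outputs above
root_deviance <- 610151.7   # deviance at the root node of the printed tree m1
n             <- 2430       # number of observations
rel_error     <- 0.2954468  # rel error at 11 splits (cp = 0.01) in m3$cptable

# Training MSE, RMSE and R-squared implied by the cptable
mse_train  <- rel_error * root_deviance / n
rmse_train <- sqrt(mse_train)
r2_train   <- 1 - rel_error

print(c(rmse = rmse_train, r2 = r2_train))
```

The implied RMSE comes out at about 8.61, in line with the value printed above, and the implied training R-squared is roughly 0.70. Both are in-sample figures; the xerror column is the better guide to out-of-sample performance.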
library(DataExplorer)

plot_str(dados_expl_2[c(1,2,3,4,5,6,7,8,9)], type = "r", max_level = 1, print_network = TRUE, 
         fontSize = 40, width = 1000, margin = 10)
plot_str(dados_expl_2[c(1,2,3,4,5,6,7,8,9)], type = "d", max_level = 1, print_network = TRUE, 
         fontSize = 40, width = 800, margin = 10)

6.9 Packages Dedicated to Data Exploration

6.9.0.1 XRAY Package

# install.packages("xray")
library(xray)
anomalies(dados_expl_2)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## $variables
##              Variable    q qNA pNA qZero  pZero qBlank pBlank qInf pInf
## 1  blast_furnace_slag 2430   0   -  1149 47.28%      0      -    0    -
## 2             fly_ash 2430   0   -  1104 45.43%      0      -    0    -
## 3    superplasticizer 2430   0   -   758 31.19%      0      -    0    -
## 4             cluster 2430   0   -     0      -      0      -    0    -
## 5                 age 2430   0   -     0      -      0      -    0    -
## 6               water 2430   0   -     0      -      0      -    0    -
## 7    coarse_aggregate 2430   0   -     0      -      0      -    0    -
## 8              cement 2430   0   -     0      -      0      -    0    -
## 9      fine_aggregate 2430   0   -     0      -      0      -    0    -
## 10                s.c 2430   0   -     0      -      0      -    0    -
## 11                w.c 2430   0   -     0      -      0      -    0    -
## 12                CCS 2430   0   -     0      -      0      -    0    -
##    qDistinct    type anomalous_percent
## 1        238 Numeric            47.28%
## 2        236 Numeric            45.43%
## 3        176 Numeric            31.19%
## 4          7  Factor                 -
## 5         14 Numeric                 -
## 6        274 Numeric                 -
## 7        346 Numeric                 -
## 8        348 Numeric                 -
## 9        380 Numeric                 -
## 10       508 Numeric                 -
## 11       824 Numeric                 -
## 12       917 Numeric                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)
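The pZero column is simply the share of rows equal to zero, which can be verified from the qZero counts in the table above. The sketch below redoes that arithmetic and adds a small base R helper that computes the same share for any data frame. The helper name `zero_share` and the `toy` data frame are assumptions for illustration.

```r
# Counts copied from the anomalies() output above
n      <- 2430
q_zero <- c(blast_furnace_slag = 1149, fly_ash = 1104, superplasticizer = 758)

# Percentage of zeros per variable, rounded like the pZero column
p_zero <- round(100 * q_zero / n, 2)
print(p_zero)

# Hypothetical helper: share of zeros in every numeric column of a data frame
zero_share <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  vapply(num, function(col) mean(col == 0, na.rm = TRUE), numeric(1))
}

# Made-up toy data to exercise the helper
toy <- data.frame(fly_ash = c(0, 0, 120, 95), water = c(180, 192, 175, 200))
print(zero_share(toy))
```

In this dataset a zero most likely means the component was left out of the mix rather than a recording error, so these "anomalous" percentages describe the mix designs more than data quality.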
distributions(dados_expl_2) 
##              Variable     p_1   p_10   p_25   p_50    p_75    p_90    p_99
## 1  blast_furnace_slag       0      0      0     19   129.8     189   282.8
## 2             fly_ash       0      0      0   81.9     123     161 237.923
## 3    superplasticizer       0      0      0   6.86    10.1      12    23.4
## 4                 age       3      3     14     28      56     100     365
## 5               water 126.716 155.52    164  182.9     192   203.5     228
## 6    coarse_aggregate     801  852.1    932  971.8    1040  1076.2    1125
## 7              cement   122.6  154.8  190.3    252   337.9     425   531.3
## 8      fine_aggregate     594    670    737  780.1     824 875.949   943.1
## 9                 s.c  1.1676 1.7585 2.2312 3.0163  4.2165  4.9681  6.5261
## 10                w.c  0.3204 0.5333 0.7282 1.0996  1.5376  2.0699  3.1214
## 11                CCS    6.94  14.64     24 34.345 45.0275   56.62   74.99

6.9.1 VISDAT Package

# install.packages("visdat")
library(visdat)
vis_dat(dados_expl_2)

vis_guess(dados_expl_2)

vis_miss(dados_expl_2)

vis_cor(dados_expl_2[,1:8])

6.9.2 DLOOKR Package

# install.packages("dlookr")
library(dlookr)
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:rattle':
## 
##     binning
## The following objects are masked from 'package:PerformanceAnalytics':
## 
##     kurtosis, skewness
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following object is masked from 'package:base':
## 
##     transform
eda_report(dados_expl_2, "CCS", output_format = "html")
## 
## 
## processing file: EDA_Report.Rmd
## 
## output file: EDA_Report.knit.md
## Output created: C:\Users\Fernando\AppData\Local\Temp\RtmpI9H8QY/EDA_Report.html

6.9.3 DataExplorer Package

# install.packages("DataExplorer")
library(DataExplorer)

# Columns 1 to 8 are the predictors; column 9 (CCS) is the response.
# Overview of dimensions, column types, and missing values
introduce(dados_expl_2[1:8])
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 2430       8                0                  8                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0          2430              19440       158120
plot_intro(dados_expl_2[1:8])

# Missing-value profile per column (plot and table)
plot_missing(dados_expl_2[1:8])

profile_missing(dados_expl_2[1:8])
##              feature num_missing pct_missing
## 1             cement           0           0
## 2 blast_furnace_slag           0           0
## 3            fly_ash           0           0
## 4              water           0           0
## 5   superplasticizer           0           0
## 6   coarse_aggregate           0           0
## 7     fine_aggregate           0           0
## 8                age           0           0
# Univariate distributions of the predictors
plot_histogram(dados_expl_2[1:8])

plot_density(dados_expl_2[1:8])

# plot_bar() is left commented out: introduce() reports no discrete columns
# plot_bar(dados_expl_2[1:8])
plot_qq(dados_expl_2[1:8])

# Correlation matrix and principal component analysis
plot_correlation(dados_expl_2[1:8])

plot_prcomp(dados_expl_2[1:8])

# Bivariate views of each predictor against the response CCS
plot_scatterplot(dados_expl_2[1:9], by = "CCS")

plot_boxplot(dados_expl_2[1:9], by = "CCS")

# Structure of the data frame as a network diagram
plot_str(dados_expl_2[1:9])
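The figures returned by introduce() and profile_missing() can be cross-checked with base R. For the concrete data, total_observations is simply rows times columns (2430 x 8 = 19440). The sketch below reproduces the same kind of summary on a small, hypothetical data frame named toy (not the concrete dataset) that contains one missing value. It only illustrates what the summary columns mean and is not DataExplorer's own implementation.

```r
# Hypothetical toy data with one missing value (not the concrete dataset)
toy <- data.frame(
  cement = c(540, 332.5, NA, 266),
  water  = c(162, 228, 228, 228),
  age    = c(28, 270, 365, 90)
)

# Counterparts of the columns reported by introduce()
n_rows             <- nrow(toy)
n_cols             <- ncol(toy)
total_observations <- n_rows * n_cols            # rows x columns
total_missing      <- sum(is.na(toy))            # NA cells in the whole table
complete_rows      <- sum(complete.cases(toy))   # rows without any NA

# Counterpart of profile_missing(): missing count per column and the
# missing share, expressed here as a proportion between 0 and 1
missing_profile <- data.frame(
  feature     = names(toy),
  num_missing = unname(colSums(is.na(toy))),
  pct_missing = unname(colMeans(is.na(toy)))
)
print(missing_profile)
```

Running the same base R expressions on dados_expl_2[1:8] should agree with the DataExplorer output above, which is a quick way to confirm that no missing values were introduced during preprocessing.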

6.9.4 DataExplorer Package

# Render the complete DataExplorer HTML report for all nine columns
create_report(dados_expl_2[1:9])
## Output created: report.html
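
create_report() also accepts a response variable and output settings through its y, output_file, output_dir, and report_title arguments. The following is a minimal sketch, assuming CCS is the response column among the nine selected columns (as the by = "CCS" calls suggest). The wrapper name build_eda_report, the file name, and the report title are hypothetical choices made for illustration.

```r
# Hypothetical wrapper around DataExplorer::create_report(); nothing is
# rendered until the function is called.
build_eda_report <- function(data, response = "CCS",
                             file = "concrete_eda.html",
                             dir = getwd()) {
  # Fail early if the input is not a data frame or lacks the response column
  stopifnot(is.data.frame(data), response %in% names(data))
  DataExplorer::create_report(
    data,
    y            = response,  # response variable for the report
    output_file  = file,      # name of the rendered HTML file
    output_dir   = dir,       # folder where the file is written
    report_title = "Concrete Compressive Strength: EDA"
  )
}

# Usage (assumes dados_expl_2 from the earlier sections):
# build_eda_report(dados_expl_2[1:9])
```

Passing the response explicitly keeps the report focused on how each predictor relates to CCS, and a fixed output_file avoids overwriting the default report.html on every run.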

by Ramires Engenharia