Análise - Escribo Panamá (Matemática)
1. Carregamento da base de dados
Aqui iniciamos a importação da base de dados onde foi necessário utilizar o pacote read_spss para realizar a importação no formato .sav do SPSS.
#importando a base de dados
dados <- read_spss("data/Math dados.sav")2. Pré-processamento dos dados
Realizamos um pré processamento: limitando apenas alunos em um intervalo de idade; apenas alunos do Panamá; alunos que jogaram os games de matemática (grupo experimental); Apenas características gerais, pré-pós-teste e interação com os jogos e por fim retiramos todos os alunos que deixaram de fazer um dos testes (pré ou pós)
#Pré-processamento
dados <- dados %>%
filter(age >= 4 & age <= 7) %>% #Apenas alunos entre 4 à 7 anos.
filter(`@1_Games` == 1) %>% # Apenas grupo experimental.
select(3:12, 56, 74, 159:238) %>%
mutate(Gain_Math = POS_Math_Score - PRE_Math_Score) %>% #Criação da variável de ganho.
drop_na() #Apenas estudantes sem NA's.3. Análise Descritiva
Nesta seção mostraremos as tabela de contingência para as variáveis qualitativas, bem como testes de relação entre as variáveis….
3.1 Base de dados
Aqui apresentamos a base de dados utilizada para as análises após o pré-processamento dos dados
#library(rmarkdown)
paged_table(dados)3.2 Tabela de contingência
Nesta tabela de contingência fizemos uma dupla entrada entre as variáveis qualitativas cruzando com uma única variável escolhida. É apresentado os valores de p-valor referente ao teste de relação de qui-quadrado.
Geral
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd})"))| Characteristic | N = 6921 |
|---|---|
| special | |
| no | 623 (90%) |
| not-informed | 50 (7.2%) |
| yes | 19 (2.7%) |
| child_school | |
| centro-educativo-particular | 135 (20%) |
| centro-educativo-publico | 544 (79%) |
| otro | 13 (1.9%) |
| child_school_grade | |
| kinder | 167 (24%) |
| otro | 44 (6.4%) |
| pre-kinder | 72 (10%) |
| primary-basica-general | 210 (30%) |
| secondary-basica-general | 199 (29%) |
| age | |
| 4 | 94 (14%) |
| 5 | 188 (27%) |
| 6 | 208 (30%) |
| 7 | 202 (29%) |
| gender | |
| f | 349 (50%) |
| m | 343 (50%) |
| PRE_Math_Score | 0.96 (0.07) |
| POS_Math_Score | 0.98 (0.04) |
|
1
n (%); Mean (SD)
|
|
Sexo
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(
by = gender,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | f, N = 3491 | m, N = 3431 | p-value2 |
|---|---|---|---|
| special | 0.014 | ||
| no | 324 (93%) | 299 (87%) | |
| not-informed | 21 (6.0%) | 29 (8.5%) | |
| yes | 4 (1.1%) | 15 (4.4%) | |
| child_school | 0.5 | ||
| centro-educativo-particular | 64 (18%) | 71 (21%) | |
| centro-educativo-publico | 280 (80%) | 264 (77%) | |
| otro | 5 (1.4%) | 8 (2.3%) | |
| child_school_grade | 0.3 | ||
| kinder | 86 (25%) | 81 (24%) | |
| otro | 23 (6.6%) | 21 (6.1%) | |
| pre-kinder | 36 (10%) | 36 (10%) | |
| primary-basica-general | 94 (27%) | 116 (34%) | |
| secondary-basica-general | 110 (32%) | 89 (26%) | |
| age | 0.077 | ||
| 4 | 49 (14%) | 45 (13%) | |
| 5 | 90 (26%) | 98 (29%) | |
| 6 | 94 (27%) | 114 (33%) | |
| 7 | 116 (33%) | 86 (25%) | |
| PRE_Math_Score | 0.96 (0.08) | 0.96 (0.06) | 0.2 |
| POS_Math_Score | 0.97 (0.05) | 0.98 (0.04) | >0.9 |
|
1
n (%); Mean (SD)
2
Pearson's Chi-squared test; Wilcoxon rank sum test
|
|||
Especial
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(
by = special,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | no, N = 6231 | not-informed, N = 501 | yes, N = 191 | p-value2 |
|---|---|---|---|---|
| child_school | 0.12 | |||
| centro-educativo-particular | 123 (20%) | 8 (16%) | 4 (21%) | |
| centro-educativo-publico | 491 (79%) | 39 (78%) | 14 (74%) | |
| otro | 9 (1.4%) | 3 (6.0%) | 1 (5.3%) | |
| child_school_grade | ||||
| kinder | 148 (24%) | 13 (26%) | 6 (32%) | |
| otro | 41 (6.6%) | 1 (2.0%) | 2 (11%) | |
| pre-kinder | 61 (9.8%) | 9 (18%) | 2 (11%) | |
| primary-basica-general | 191 (31%) | 16 (32%) | 3 (16%) | |
| secondary-basica-general | 182 (29%) | 11 (22%) | 6 (32%) | |
| age | 0.2 | |||
| 4 | 82 (13%) | 9 (18%) | 3 (16%) | |
| 5 | 163 (26%) | 16 (32%) | 9 (47%) | |
| 6 | 192 (31%) | 14 (28%) | 2 (11%) | |
| 7 | 186 (30%) | 11 (22%) | 5 (26%) | |
| gender | 0.014 | |||
| f | 324 (52%) | 21 (42%) | 4 (21%) | |
| m | 299 (48%) | 29 (58%) | 15 (79%) | |
| PRE_Math_Score | 0.96 (0.07) | 0.96 (0.06) | 0.95 (0.07) | 0.057 |
| POS_Math_Score | 0.98 (0.05) | 0.98 (0.03) | 0.98 (0.05) | >0.9 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Pearson's Chi-squared test; Kruskal-Wallis rank sum test
|
||||
Tipo Escola
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(
by = child_school,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | centro-educativo-particular, N = 1351 | centro-educativo-publico, N = 5441 | otro, N = 131 | p-value2 |
|---|---|---|---|---|
| special | 0.12 | |||
| no | 123 (91%) | 491 (90%) | 9 (69%) | |
| not-informed | 8 (5.9%) | 39 (7.2%) | 3 (23%) | |
| yes | 4 (3.0%) | 14 (2.6%) | 1 (7.7%) | |
| child_school_grade | ||||
| kinder | 35 (26%) | 129 (24%) | 3 (23%) | |
| otro | 10 (7.4%) | 32 (5.9%) | 2 (15%) | |
| pre-kinder | 17 (13%) | 51 (9.4%) | 4 (31%) | |
| primary-basica-general | 39 (29%) | 169 (31%) | 2 (15%) | |
| secondary-basica-general | 34 (25%) | 163 (30%) | 2 (15%) | |
| age | ||||
| 4 | 24 (18%) | 64 (12%) | 6 (46%) | |
| 5 | 39 (29%) | 146 (27%) | 3 (23%) | |
| 6 | 34 (25%) | 172 (32%) | 2 (15%) | |
| 7 | 38 (28%) | 162 (30%) | 2 (15%) | |
| gender | 0.5 | |||
| f | 64 (47%) | 280 (51%) | 5 (38%) | |
| m | 71 (53%) | 264 (49%) | 8 (62%) | |
| PRE_Math_Score | 0.96 (0.06) | 0.96 (0.08) | 0.93 (0.06) | 0.017 |
| POS_Math_Score | 0.97 (0.05) | 0.98 (0.04) | 0.99 (0.02) | 0.4 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Pearson's Chi-squared test; Kruskal-Wallis rank sum test
|
||||
Série
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(
by = child_school_grade,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | kinder, N = 1671 | otro, N = 441 | pre-kinder, N = 721 | primary-basica-general, N = 2101 | secondary-basica-general, N = 1991 | p-value2 |
|---|---|---|---|---|---|---|
| special | ||||||
| no | 148 (89%) | 41 (93%) | 61 (85%) | 191 (91%) | 182 (91%) | |
| not-informed | 13 (7.8%) | 1 (2.3%) | 9 (12%) | 16 (7.6%) | 11 (5.5%) | |
| yes | 6 (3.6%) | 2 (4.5%) | 2 (2.8%) | 3 (1.4%) | 6 (3.0%) | |
| child_school | ||||||
| centro-educativo-particular | 35 (21%) | 10 (23%) | 17 (24%) | 39 (19%) | 34 (17%) | |
| centro-educativo-publico | 129 (77%) | 32 (73%) | 51 (71%) | 169 (80%) | 163 (82%) | |
| otro | 3 (1.8%) | 2 (4.5%) | 4 (5.6%) | 2 (1.0%) | 2 (1.0%) | |
| age | <0.001 | |||||
| 4 | 27 (16%) | 2 (4.5%) | 65 (90%) | 0 (0%) | 0 (0%) | |
| 5 | 140 (84%) | 4 (9.1%) | 6 (8.3%) | 38 (18%) | 0 (0%) | |
| 6 | 0 (0%) | 0 (0%) | 1 (1.4%) | 168 (80%) | 39 (20%) | |
| 7 | 0 (0%) | 38 (86%) | 0 (0%) | 4 (1.9%) | 160 (80%) | |
| gender | 0.3 | |||||
| f | 86 (51%) | 23 (52%) | 36 (50%) | 94 (45%) | 110 (55%) | |
| m | 81 (49%) | 21 (48%) | 36 (50%) | 116 (55%) | 89 (45%) | |
| PRE_Math_Score | 0.95 (0.10) | 0.97 (0.04) | 0.94 (0.07) | 0.97 (0.06) | 0.97 (0.06) | 0.003 |
| POS_Math_Score | 0.97 (0.04) | 0.97 (0.06) | 0.98 (0.04) | 0.98 (0.04) | 0.98 (0.05) | 0.4 |
|
1
n (%); Mean (SD)
2
Pearson's Chi-squared test; Kruskal-Wallis rank sum test
|
||||||
Idade
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Math_Score,
POS_Math_Score) %>%
tbl_summary(
by = age,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | 4, N = 941 | 5, N = 1881 | 6, N = 2081 | 7, N = 2021 | p-value2 |
|---|---|---|---|---|---|
| special | 0.2 | ||||
| no | 82 (87%) | 163 (87%) | 192 (92%) | 186 (92%) | |
| not-informed | 9 (9.6%) | 16 (8.5%) | 14 (6.7%) | 11 (5.4%) | |
| yes | 3 (3.2%) | 9 (4.8%) | 2 (1.0%) | 5 (2.5%) | |
| child_school | |||||
| centro-educativo-particular | 24 (26%) | 39 (21%) | 34 (16%) | 38 (19%) | |
| centro-educativo-publico | 64 (68%) | 146 (78%) | 172 (83%) | 162 (80%) | |
| otro | 6 (6.4%) | 3 (1.6%) | 2 (1.0%) | 2 (1.0%) | |
| child_school_grade | <0.001 | ||||
| kinder | 27 (29%) | 140 (74%) | 0 (0%) | 0 (0%) | |
| otro | 2 (2.1%) | 4 (2.1%) | 0 (0%) | 38 (19%) | |
| pre-kinder | 65 (69%) | 6 (3.2%) | 1 (0.5%) | 0 (0%) | |
| primary-basica-general | 0 (0%) | 38 (20%) | 168 (81%) | 4 (2.0%) | |
| secondary-basica-general | 0 (0%) | 0 (0%) | 39 (19%) | 160 (79%) | |
| gender | 0.077 | ||||
| f | 49 (52%) | 90 (48%) | 94 (45%) | 116 (57%) | |
| m | 45 (48%) | 98 (52%) | 114 (55%) | 86 (43%) | |
| PRE_Math_Score | 0.94 (0.10) | 0.95 (0.08) | 0.97 (0.06) | 0.97 (0.06) | 0.001 |
| POS_Math_Score | 0.97 (0.05) | 0.98 (0.04) | 0.98 (0.04) | 0.97 (0.05) | 0.3 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Pearson's Chi-squared test; Kruskal-Wallis rank sum test
|
|||||
3.3 Análise de Densidade do Pré e Pós teste
Análise gráfica utilizando gráficos de densidade para verificar a diferença entre o pré e pós teste de acordo com as características: geral, idade, sexo, escola e série.
Geral
Plotando a densidade para os scores de pré e pós teste.
#gráfico de densidade
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1)Idade
Plotando a densidade para os scores de pré e pós teste de acordo com a idade do estudante
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~age,1)Sexo
Plotando a densidade para os scores de pré e pós teste em relação ao sexo.
#gráfico de densidade em relação ao sexo
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~gender,1)Escola
Plotando a densidade para os scores de pré e pós teste em relação ao tipo de escola.
#gráfico de densidade em relação a escola
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~child_school,1)Série
Plotando a densidade para os scores de pré e pós teste em relação a série do estudante.
#gráfico de densidade em relação a escola
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~child_school_grade,1)3.4 Análise com BoxPlot do Pré e Pós teste
Análise gráfica utilizando gráficos de boxplot para verificar a diferença entre o pré e pós teste de acordo com as características: geral, idade, sexo, escola e série.
Geral
Plotando a densidade para os scores de pré e pós teste.
#gráfico de boxplot geral
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
coord_flip() +
xlim(0.75,1)Idade
Plotando o boxplot para os scores de pré e pós teste de acordo com a idade do estudante
#gráfico de boxplot em relação a idade
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~age, 1)Sexo
Plotando o boxplot para os scores de pré e pós teste em relação ao sexo.
#gráfico de boxplot em relação ao sexo
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~gender,1)Escola
Plotando o boxplot para os scores de pré e pós teste em relação ao tipo de escola.
#gráfico de boxplot em relação a escola
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~child_school,1)Série
Plotando o boxplot para os scores de pré e pós teste em relação a série do estudante.
#gráfico de boxplot em relação a escola
dados %>%
gather(PRE_Math_Score,
POS_Math_Score,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~child_school_grade,1)4. Cluster Analysis
4.1 Assessing Clustering Tendency
Hopkins Statistic Method
V
dadosCluster <- dados %>%
select(ends_with("_V")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1132266
R
dadosCluster <- dados %>%
select(ends_with("_R")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09341127
W
dadosCluster <- dados %>%
select(ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.06698298
C
dadosCluster <- dados %>%
select(ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08534153
V + R
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1082903
V + W
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09320807
V + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1071213
R + W
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08472295
R + C
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09169524
W + C
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08375064
V + R + W
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_R"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08458943
V + R + C
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_R"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08821345
V + W + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09671276
R + W + C
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08819709
V + R + W + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09750174
4.2 K-means Clustering
4.2.1 Estimating the optimal number of clusters
WSS Method
fviz_nbclust(dadosCluster, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, kmeans, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.2.2 Computing and Visualizing k-means clustering
K = 2
set.seed(123)
km.res2 <- kmeans(dadosCluster, 2, nstart = 25)
fviz_cluster(km.res2, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 3
set.seed(123)
km.res3 <- kmeans(dadosCluster, 3, nstart = 25)
fviz_cluster(km.res3, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 4
set.seed(123)
km.res4 <- kmeans(dadosCluster, 4, nstart = 25)
fviz_cluster(km.res4, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 5
set.seed(123)
km.res5 <- kmeans(dadosCluster, 5, nstart = 25)
fviz_cluster(km.res5, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())4.2.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = km.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 0.16448, df = 1, p-value = 0.6851
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = km.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 4.6034, df = 2, p-value = 0.1001
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = km.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 2.2109, df = 3, p-value = 0.5298
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = km.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 1.7409, df = 4, p-value = 0.7833
4.2.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 40 652 -0.406 0.685 0.685 ns
K = 3
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 5 577 -1.57 0.116 0.348 ns
## 2 Gain_Math 1 3 5 110 -1.20 0.231 0.693 ns
## 3 Gain_Math 2 3 577 110 1.52 0.129 0.387 ns
K = 4
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 36 520 -0.775 0.438 1 ns
## 2 Gain_Math 1 3 36 2 0.242 0.808 1 ns
## 3 Gain_Math 1 4 36 134 -0.0473 0.962 1 ns
## 4 Gain_Math 2 3 520 2 0.437 0.662 1 ns
## 5 Gain_Math 2 4 520 134 1.29 0.198 1 ns
## 6 Gain_Math 3 4 2 134 -0.260 0.795 1 ns
K = 5
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 2 7 0.0218 0.983 1 ns
## 2 Gain_Math 1 3 2 40 -0.187 0.852 1 ns
## 3 Gain_Math 1 4 2 479 -0.418 0.676 1 ns
## 4 Gain_Math 1 5 2 164 -0.386 0.700 1 ns
## 5 Gain_Math 2 3 7 40 -0.373 0.709 1 ns
## 6 Gain_Math 2 4 7 479 -0.824 0.410 1 ns
## 7 Gain_Math 2 5 7 164 -0.756 0.449 1 ns
## 8 Gain_Math 3 4 40 479 -0.977 0.328 1 ns
## 9 Gain_Math 3 5 40 164 -0.788 0.431 1 ns
## 10 Gain_Math 4 5 479 164 0.242 0.809 1 ns
4.3 K-Medoids
4.3.1 Estimating the optimal number of clusters
WSS Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, pam, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, pam, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, pam, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.3.2 Computing and Visualizing PAM clustering
K = 2
set.seed(123)
pam.res2 <- pam(dadosCluster, 2)
fviz_cluster(pam.res2,
palette = c("#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 3
set.seed(123)
pam.res3 <- pam(dadosCluster, 3)
fviz_cluster(pam.res3,
palette = c("#2E9FDF", "#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 4
set.seed(123)
pam.res4 <- pam(dadosCluster, 4)
fviz_cluster(pam.res4,
palette = c("#2E9FDF", "#E7B800", "#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 5
set.seed(123)
pam.res5 <- pam(dadosCluster, 5)
fviz_cluster(pam.res5,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())4.3.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = pam.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 0.078735, df = 1, p-value = 0.779
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = pam.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 0.64694, df = 2, p-value = 0.7236
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = pam.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 2.0662, df = 3, p-value = 0.5588
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = pam.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 2.948, df = 4, p-value = 0.5666
4.3.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 509 183 0.281 0.779 0.779 ns
K = 3
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 301 307 0.406 0.685 1 ns
## 2 Gain_Math 1 3 301 84 0.790 0.430 1 ns
## 3 Gain_Math 2 3 307 84 0.524 0.600 1 ns
K = 4
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 258 307 0.741 0.459 1 ns
## 2 Gain_Math 1 3 258 103 1.37 0.171 1 ns
## 3 Gain_Math 1 4 258 24 0.619 0.536 1 ns
## 4 Gain_Math 2 3 307 103 0.852 0.394 1 ns
## 5 Gain_Math 2 4 307 24 0.328 0.743 1 ns
## 6 Gain_Math 3 4 103 24 -0.121 0.904 1 ns
K = 5
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 258 307 0.741 0.459 1 ns
## 2 Gain_Math 1 3 258 103 1.37 0.171 1 ns
## 3 Gain_Math 1 4 258 23 0.791 0.429 1 ns
## 4 Gain_Math 1 5 258 1 -0.786 0.432 1 ns
## 5 Gain_Math 2 3 307 103 0.852 0.394 1 ns
## 6 Gain_Math 2 4 307 23 0.506 0.613 1 ns
## 7 Gain_Math 2 5 307 1 -0.848 0.396 1 ns
## 8 Gain_Math 3 4 103 23 0.0542 0.957 1 ns
## 9 Gain_Math 3 5 103 1 -0.942 0.346 1 ns
## 10 Gain_Math 4 5 23 1 -0.939 0.348 1 ns
4.4 CLARA - Clustering Large Applications
4.4.1 Estimating the optimal number of clusters
WSS Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, clara, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, clara, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
fviz_nbclust(dadosCluster, clara, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.4.2 Computing and Visualizing CLARA clustering
K = 2
set.seed(123)
clara.res2 <- clara(dadosCluster, 2, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res2,
palette = c("#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 3
set.seed(123)
clara.res3 <- clara(dadosCluster, 3, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res3,
palette = c("#2E9FDF", "#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 4
set.seed(123)
clara.res4 <- clara(dadosCluster, 4, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res4,
palette = c("#2E9FDF", "#E7B800", "#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 5
set.seed(123)
clara.res5 <- clara(dadosCluster, 5, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res5,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
geom = "point", pointsize = 1,
ggtheme = theme_classic())4.4.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = clara.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 0.078735, df = 1, p-value = 0.779
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = clara.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 0.50904, df = 2, p-value = 0.7753
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = clara.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 2.7583, df = 3, p-value = 0.4304
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = clara.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Math)
kruskal.test(Gain_Math ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Math by Cluster
## Kruskal-Wallis chi-squared = 3.2859, df = 4, p-value = 0.5112
4.4.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 509 183 0.281 0.779 0.779 ns
K = 3
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 277 405 0.506 0.613 1 ns
## 2 Gain_Math 1 3 277 10 0.571 0.568 1 ns
## 3 Gain_Math 2 3 405 10 0.450 0.652 1 ns
K = 4
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 318 307 0.434 0.664 1 ns
## 2 Gain_Math 1 3 318 62 0.622 0.534 1 ns
## 3 Gain_Math 1 4 318 5 1.56 0.118 0.710 ns
## 4 Gain_Math 2 3 307 62 0.371 0.711 1 ns
## 5 Gain_Math 2 4 307 5 1.48 0.138 0.827 ns
## 6 Gain_Math 3 4 62 5 1.33 0.184 1 ns
K = 5
dunn_test(Gain_Math ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Math 1 2 318 307 0.434 0.664 1 ns
## 2 Gain_Math 1 3 318 62 0.622 0.534 1 ns
## 3 Gain_Math 1 4 318 3 1.67 0.0948 0.948 ns
## 4 Gain_Math 1 5 318 2 0.431 0.666 1 ns
## 5 Gain_Math 2 3 307 62 0.371 0.711 1 ns
## 6 Gain_Math 2 4 307 3 1.61 0.107 1 ns
## 7 Gain_Math 2 5 307 2 0.382 0.702 1 ns
## 8 Gain_Math 3 4 62 3 1.49 0.135 1 ns
## 9 Gain_Math 3 5 62 2 0.306 0.760 1 ns
## 10 Gain_Math 4 5 3 2 -0.726 0.468 1 ns
4.5 Agglomerative Clustering
4.4.1 Dendrogram
Euclidean Distance Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
res.dist <- dist(dadosCluster, method = "euclidean")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)Manhattan Distance Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
res.dist <- dist(dadosCluster, method = "manhattan")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)Maximun Distance Method
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale()
res.dist <- dist(dadosCluster, method = "maximum")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)