Análise - Escribo Panamá (Leitura)
1. Carregamento da base de dados
Aqui iniciamos a importação da base de dados onde foi necessário utilizar o pacote read_spss para realizar a importação no formato .sav do SPSS.
#importando a base de dados
dados <- read_spss("data/Reading dados.sav")2. Pré-processamento dos dados
Realizamos um pré processamento: limitando apenas alunos em um intervalo de idade; apenas alunos do Panamá; alunos que jogaram os games de matemática (grupo experimental); Apenas caracterÃsticas gerais, pré-pós-teste e interação com os jogos e por fim retiramos todos os alunos que deixaram de fazer um dos testes (pré ou pós)
#Pré-processamento
dadosControle <- dados %>%
filter(age >= 4 & age <= 5) %>% #Apenas alunos entre 4 Ã 7 anos.
filter(`Reading_games` == 0) %>% # Apenas grupo experimental.
select(5:12, 26, 40) %>%
mutate(Gain_Read = POS_Read_SCORE - PRE_Read_Score) %>% #Criação da variável de ganho.
drop_na() #Apenas estudantes sem NA's.
dados <- dados %>%
filter(age >= 4 & age <= 5) %>% #Apenas alunos entre 4 Ã 7 anos.
filter(`Reading_games` == 1) %>% # Apenas grupo experimental.
select(5:12, 26, 40, 79:154, 247:250) %>%
mutate(Gain_Read = POS_Read_SCORE - PRE_Read_Score) %>% #Criação da variável de ganho.
drop_na() #Apenas estudantes sem NA's.3. Análise Descritiva
Nesta seção mostraremos as tabela de contingência para as variáveis qualitativas, bem como testes de relação entre as variáveis….
3.1 Base de dados
Aqui apresentamos a base de dados utilizada para as análises após o pré-processamento dos dados
#library(rmarkdown)
paged_table(dados)3.2 Tabela de contingência
Nesta tabela de contingência fizemos uma dupla entrada entre as variáveis qualitativas cruzando com uma única variável escolhida. É apresentado os valores de p-valor referente ao teste de relação de qui-quadrado.
Geral
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(statistic = list(all_continuous() ~ "{mean} ({sd})"))| Characteristic | N = 2021 |
|---|---|
| special | |
| no | 178 (88%) |
| not-informed | 15 (7.4%) |
| yes | 9 (4.5%) |
| child_school | |
| centro-educativo-particular | 42 (21%) |
| centro-educativo-publico | 152 (75%) |
| otro | 8 (4.0%) |
| child_school_grade | |
| kinder | 92 (46%) |
| otro | 6 (3.0%) |
| pre-kinder | 69 (34%) |
| primary-basica-general | 35 (17%) |
| age | |
| 4 | 84 (42%) |
| 5 | 118 (58%) |
| gender | |
| f | 87 (43%) |
| m | 115 (57%) |
| PRE_Read_Score | 0.93 (0.14) |
| POS_Read_SCORE | 0.96 (0.10) |
|
1
n (%); Mean (SD)
|
|
Sexo
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(
by = gender,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | f, N = 871 | m, N = 1151 | p-value2 |
|---|---|---|---|
| special | 0.11 | ||
| no | 78 (90%) | 100 (87%) | |
| not-informed | 8 (9.2%) | 7 (6.1%) | |
| yes | 1 (1.1%) | 8 (7.0%) | |
| child_school | 0.9 | ||
| centro-educativo-particular | 17 (20%) | 25 (22%) | |
| centro-educativo-publico | 66 (76%) | 86 (75%) | |
| otro | 4 (4.6%) | 4 (3.5%) | |
| child_school_grade | 0.5 | ||
| kinder | 44 (51%) | 48 (42%) | |
| otro | 2 (2.3%) | 4 (3.5%) | |
| pre-kinder | 29 (33%) | 40 (35%) | |
| primary-basica-general | 12 (14%) | 23 (20%) | |
| age | 0.8 | ||
| 4 | 37 (43%) | 47 (41%) | |
| 5 | 50 (57%) | 68 (59%) | |
| PRE_Read_Score | 0.97 (0.06) | 0.91 (0.18) | 0.052 |
| POS_Read_SCORE | 0.95 (0.12) | 0.97 (0.09) | 0.3 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Pearson's Chi-squared test; Wilcoxon rank sum test
|
|||
Especial
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(
by = special,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | no, N = 1781 | not-informed, N = 151 | yes, N = 91 | p-value2 |
|---|---|---|---|---|
| child_school | 0.8 | |||
| centro-educativo-particular | 37 (21%) | 2 (13%) | 3 (33%) | |
| centro-educativo-publico | 133 (75%) | 13 (87%) | 6 (67%) | |
| otro | 8 (4.5%) | 0 (0%) | 0 (0%) | |
| child_school_grade | 0.6 | |||
| kinder | 84 (47%) | 5 (33%) | 3 (33%) | |
| otro | 5 (2.8%) | 1 (6.7%) | 0 (0%) | |
| pre-kinder | 58 (33%) | 7 (47%) | 4 (44%) | |
| primary-basica-general | 31 (17%) | 2 (13%) | 2 (22%) | |
| age | 0.7 | |||
| 4 | 72 (40%) | 7 (47%) | 5 (56%) | |
| 5 | 106 (60%) | 8 (53%) | 4 (44%) | |
| gender | 0.11 | |||
| f | 78 (44%) | 8 (53%) | 1 (11%) | |
| m | 100 (56%) | 7 (47%) | 8 (89%) | |
| PRE_Read_Score | 0.94 (0.13) | 0.85 (0.25) | 0.94 (0.07) | 0.2 |
| POS_Read_SCORE | 0.96 (0.11) | 0.98 (0.03) | 0.99 (0.02) | 0.7 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Kruskal-Wallis rank sum test
|
||||
Tipo Escola
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(
by = child_school,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | centro-educativo-particular, N = 421 | centro-educativo-publico, N = 1521 | otro, N = 81 | p-value2 |
|---|---|---|---|---|
| special | 0.8 | |||
| no | 37 (88%) | 133 (88%) | 8 (100%) | |
| not-informed | 2 (4.8%) | 13 (8.6%) | 0 (0%) | |
| yes | 3 (7.1%) | 6 (3.9%) | 0 (0%) | |
| child_school_grade | 0.3 | |||
| kinder | 16 (38%) | 72 (47%) | 4 (50%) | |
| otro | 0 (0%) | 5 (3.3%) | 1 (12%) | |
| pre-kinder | 16 (38%) | 50 (33%) | 3 (38%) | |
| primary-basica-general | 10 (24%) | 25 (16%) | 0 (0%) | |
| age | 0.10 | |||
| 4 | 23 (55%) | 57 (38%) | 4 (50%) | |
| 5 | 19 (45%) | 95 (62%) | 4 (50%) | |
| gender | 0.9 | |||
| f | 17 (40%) | 66 (43%) | 4 (50%) | |
| m | 25 (60%) | 86 (57%) | 4 (50%) | |
| PRE_Read_Score | 0.93 (0.17) | 0.93 (0.14) | 0.94 (0.10) | 0.8 |
| POS_Read_SCORE | 0.97 (0.05) | 0.96 (0.11) | 0.96 (0.05) | 0.6 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Kruskal-Wallis rank sum test
|
||||
Série
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(
by = child_school_grade,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | kinder, N = 921 | otro, N = 61 | pre-kinder, N = 691 | primary-basica-general, N = 351 | p-value2 |
|---|---|---|---|---|---|
| special | 0.6 | ||||
| no | 84 (91%) | 5 (83%) | 58 (84%) | 31 (89%) | |
| not-informed | 5 (5.4%) | 1 (17%) | 7 (10%) | 2 (5.7%) | |
| yes | 3 (3.3%) | 0 (0%) | 4 (5.8%) | 2 (5.7%) | |
| child_school | 0.3 | ||||
| centro-educativo-particular | 16 (17%) | 0 (0%) | 16 (23%) | 10 (29%) | |
| centro-educativo-publico | 72 (78%) | 5 (83%) | 50 (72%) | 25 (71%) | |
| otro | 4 (4.3%) | 1 (17%) | 3 (4.3%) | 0 (0%) | |
| age | <0.001 | ||||
| 4 | 16 (17%) | 3 (50%) | 65 (94%) | 0 (0%) | |
| 5 | 76 (83%) | 3 (50%) | 4 (5.8%) | 35 (100%) | |
| gender | 0.5 | ||||
| f | 44 (48%) | 2 (33%) | 29 (42%) | 12 (34%) | |
| m | 48 (52%) | 4 (67%) | 40 (58%) | 23 (66%) | |
| PRE_Read_Score | 0.95 (0.13) | 0.93 (0.09) | 0.91 (0.17) | 0.94 (0.14) | 0.14 |
| POS_Read_SCORE | 0.97 (0.11) | 0.97 (0.05) | 0.95 (0.11) | 0.97 (0.05) | 0.8 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Kruskal-Wallis rank sum test
|
|||||
Idade
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE) %>%
tbl_summary(
by = age,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | 4, N = 841 | 5, N = 1181 | p-value2 |
|---|---|---|---|
| special | 0.7 | ||
| no | 72 (86%) | 106 (90%) | |
| not-informed | 7 (8.3%) | 8 (6.8%) | |
| yes | 5 (6.0%) | 4 (3.4%) | |
| child_school | 0.10 | ||
| centro-educativo-particular | 23 (27%) | 19 (16%) | |
| centro-educativo-publico | 57 (68%) | 95 (81%) | |
| otro | 4 (4.8%) | 4 (3.4%) | |
| child_school_grade | <0.001 | ||
| kinder | 16 (19%) | 76 (64%) | |
| otro | 3 (3.6%) | 3 (2.5%) | |
| pre-kinder | 65 (77%) | 4 (3.4%) | |
| primary-basica-general | 0 (0%) | 35 (30%) | |
| gender | 0.8 | ||
| f | 37 (44%) | 50 (42%) | |
| m | 47 (56%) | 68 (58%) | |
| PRE_Read_Score | 0.91 (0.17) | 0.95 (0.12) | 0.011 |
| POS_Read_SCORE | 0.96 (0.10) | 0.96 (0.10) | 0.6 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Pearson's Chi-squared test; Wilcoxon rank sum test
|
|||
3.3 Análise de Densidade do Pré e Pós teste
Análise gráfica utilizando gráficos de densidade para verificar a diferença entre o pré e pós teste de acordo com as caracterÃsticas: geral, idade, sexo, escola e série.
Geral
Plotando a densidade para os scores de pré e pós teste.
#gráfico de densidade
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1)Idade
Plotando a densidade para os scores de pré e pós teste de acordo com a idade do estudante
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~age,1)Sexo
Plotando a densidade para os scores de pré e pós teste em relação ao sexo.
#gráfico de densidade em relação ao sexo
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~gender,1)Escola
Plotando a densidade para os scores de pré e pós teste em relação ao tipo de escola.
#gráfico de densidade em relação a escola
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~child_school,1)Série
Plotando a densidade para os scores de pré e pós teste em relação a série do estudante.
#gráfico de densidade em relação a escola
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~child_school_grade,1)3.4 Análise com BoxPlot do Pré e Pós teste
Análise gráfica utilizando gráficos de boxplot para verificar a diferença entre o pré e pós teste de acordo com as caracterÃsticas: geral, idade, sexo, escola e série.
Geral
Plotando a densidade para os scores de pré e pós teste.
#gráfico de boxplot geral
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
coord_flip() +
xlim(0.75,1)Idade
Plotando o boxplot para os scores de pré e pós teste de acordo com a idade do estudante
#gráfico de boxplot em relação a idade
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~age, 1)Sexo
Plotando o boxplot para os scores de pré e pós teste em relação ao sexo.
#gráfico de boxplot em relação ao sexo
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~gender,1)Escola
Plotando o boxplot para os scores de pré e pós teste em relação ao tipo de escola.
#gráfico de boxplot em relação a escola
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~child_school,1)Série
Plotando o boxplot para os scores de pré e pós teste em relação a série do estudante.
#gráfico de boxplot em relação a escola
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
xlim(0.75,1) +
coord_flip() +
facet_wrap(~child_school_grade,1)4. Cluster Analysis
4.1 Assessing Clustering Tendency
Hopkins Statistic Method
V
dadosCluster <- dados %>%
select(ends_with("_V")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1368253
R
dadosCluster <- dados %>%
select(ends_with("_R")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1026321
W
dadosCluster <- dados %>%
select(ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1171089
C
dadosCluster <- dados %>%
select(ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.08886933
V + R
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1230036
V + W
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1405325
V + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1183648
R + W
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1176674
R + C
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.09787017
W + C
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1111217
V + R + W
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_R"), ends_with("_W")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1172034
V + R + C
dadosCluster <- dados %>%
select(ends_with("_W"), ends_with("_R"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1112065
V + W + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1261339
R + W + C
dadosCluster <- dados %>%
select(ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1114106
V + R + W + C
dadosCluster <- dados %>%
select(ends_with("_V"), ends_with("_R"), ends_with("_W"), ends_with("_C")) %>%
scale() #Escalonando os valores
set.seed(123)
hopkins(dadosCluster, n = nrow(dadosCluster)-1)## $H
## [1] 0.1230248
dadosCluster <- dados %>%
select(ends_with("_C")) %>%
scale() #Escalonando os valores4.2 K-means Clustering
4.2.1 Estimating the optimal number of clusters
WSS Method
fviz_nbclust(dadosCluster, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
fviz_nbclust(dadosCluster, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
fviz_nbclust(dadosCluster, kmeans, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.2.2 Computing and Visualizing k-means clustering
K = 2
set.seed(123)
km.res2 <- kmeans(dadosCluster, 2, nstart = 25)
fviz_cluster(km.res2, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 3
set.seed(123)
km.res3 <- kmeans(dadosCluster, 3, nstart = 25)
fviz_cluster(km.res3, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 4
set.seed(123)
km.res4 <- kmeans(dadosCluster, 4, nstart = 25)
fviz_cluster(km.res4, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())K = 5
set.seed(123)
km.res5 <- kmeans(dadosCluster, 5, nstart = 25)
fviz_cluster(km.res5, data = dadosCluster,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal())4.2.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = km.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 0.78432, df = 1, p-value = 0.3758
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = km.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 1.1736, df = 2, p-value = 0.5561
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = km.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 4.9958, df = 3, p-value = 0.1721
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = km.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 7.5359, df = 4, p-value = 0.1101
4.2.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 201 1 -0.886 0.376 0.376 ns
K = 3
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 13 1 -0.694 0.488 1 ns
## 2 Gain_Read 1 3 13 188 0.624 0.533 1 ns
## 3 Gain_Read 2 3 1 188 0.897 0.370 1 ns
K = 4
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 5 34 1.78 0.0751 0.451 ns
## 2 Gain_Read 1 3 5 162 1.24 0.217 1 ns
## 3 Gain_Read 1 4 5 1 -0.266 0.790 1 ns
## 4 Gain_Read 2 3 34 162 -1.55 0.122 0.733 ns
## 5 Gain_Read 2 4 34 1 -1.13 0.260 1 ns
## 6 Gain_Read 3 4 162 1 -0.850 0.395 1 ns
K = 5
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 148 40 2.21 0.0274 0.274 ns
## 2 Gain_Read 1 3 148 5 -1.15 0.249 1 ns
## 3 Gain_Read 1 4 148 1 -0.813 0.416 1 ns
## 4 Gain_Read 1 5 148 8 0.476 0.634 1 ns
## 5 Gain_Read 2 3 40 5 -1.93 0.0531 0.531 ns
## 6 Gain_Read 2 4 40 1 -1.19 0.232 1 ns
## 7 Gain_Read 2 5 40 8 -0.569 0.569 1 ns
## 8 Gain_Read 3 4 5 1 -0.266 0.790 1 ns
## 9 Gain_Read 3 5 5 8 1.22 0.222 1 ns
## 10 Gain_Read 4 5 1 8 0.932 0.351 1 ns
4.3 K-Medoids
4.3.1 Estimating the optimal number of clusters
WSS Method
fviz_nbclust(dadosCluster, pam, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
fviz_nbclust(dadosCluster, pam, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
fviz_nbclust(dadosCluster, pam, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.3.2 Computing and Visualizing PAM clustering
K = 2
set.seed(123)
pam.res2 <- pam(dadosCluster, 2)
fviz_cluster(pam.res2,
palette = c("#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 3
set.seed(123)
pam.res3 <- pam(dadosCluster, 3)
fviz_cluster(pam.res3,
palette = c("#2E9FDF", "#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 4
set.seed(123)
pam.res4 <- pam(dadosCluster, 4)
fviz_cluster(pam.res4,
palette = c("#2E9FDF", "#E7B800", "#00AFBB", "#FC4E07"), # color palette
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())K = 5
set.seed(123)
pam.res5 <- pam(dadosCluster, 5)
fviz_cluster(pam.res5,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic())4.3.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = pam.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 2.9914, df = 1, p-value = 0.08371
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = pam.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 4.2377, df = 2, p-value = 0.1202
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = pam.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 8.1659, df = 3, p-value = 0.0427
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = pam.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 6.0918, df = 4, p-value = 0.1924
4.3.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 47 1.73 0.0837 0.0837 ns
K = 3
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 46 1.86 0.0631 0.189 ns
## 2 Gain_Read 1 3 155 1 -0.814 0.416 1 ns
## 3 Gain_Read 2 3 46 1 -1.12 0.264 0.793 ns
K = 4
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 41 2.36 0.0184 0.110 ns
## 2 Gain_Read 1 3 155 1 -0.814 0.416 1 ns
## 3 Gain_Read 1 4 155 5 -1.16 0.248 1 ns
## 4 Gain_Read 2 3 41 1 -1.22 0.224 1 ns
## 5 Gain_Read 2 4 41 5 -1.98 0.0475 0.285 ns
## 6 Gain_Read 3 4 1 5 0.266 0.790 1 ns
K = 5
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 102 38 1.86 0.0632 0.632 ns
## 2 Gain_Read 1 3 102 56 0.403 0.687 1 ns
## 3 Gain_Read 1 4 102 1 -0.811 0.417 1 ns
## 4 Gain_Read 1 5 102 5 -1.14 0.253 1 ns
## 5 Gain_Read 2 3 38 56 -1.36 0.174 1 ns
## 6 Gain_Read 2 4 38 1 -1.15 0.249 1 ns
## 7 Gain_Read 2 5 38 5 -1.84 0.0653 0.653 ns
## 8 Gain_Read 3 4 56 1 -0.875 0.382 1 ns
## 9 Gain_Read 3 5 56 5 -1.27 0.206 1 ns
## 10 Gain_Read 4 5 1 5 0.266 0.790 1 ns
4.4 CLARA - Clustering Large Applications
4.4.1 Estimating the optimal number of clusters
WSS Method
fviz_nbclust(dadosCluster, clara, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")Silhouette Method
fviz_nbclust(dadosCluster, clara, method = "silhouette") +
labs(subtitle = "Silhouette method")Gap Statistic Method
fviz_nbclust(dadosCluster, clara, method = "gap_stat", nboot = 50) +
labs(subtitle = "Gap statistic method")4.4.2 Computing and Visualizing CLARA clustering
K = 2
set.seed(123)
clara.res2 <- clara(dadosCluster, 2, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res2,
palette = c("#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 3
set.seed(123)
clara.res3 <- clara(dadosCluster, 3, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res3,
palette = c("#2E9FDF", "#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 4
set.seed(123)
clara.res4 <- clara(dadosCluster, 4, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res4,
palette = c("#2E9FDF", "#E7B800", "#00AFBB", "#FC4E07"), # color palette
geom = "point", pointsize = 1,
ggtheme = theme_classic())K = 5
set.seed(123)
clara.res5 <- clara(dadosCluster, 5, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res5,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#0000CD"),
geom = "point", pointsize = 1,
ggtheme = theme_classic())4.4.3 Teste de Kruskall Wallis
K = 2
dadosKruskal_k2 <- cbind(dados, Cluster = clara.res2$cluster)
dadosKruskal_k2 <- dadosKruskal_k2 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k2)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 0.78432, df = 1, p-value = 0.3758
K = 3
dadosKruskal_k3 <- cbind(dados, Cluster = clara.res3$cluster)
dadosKruskal_k3 <- dadosKruskal_k3 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k3)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 4.2377, df = 2, p-value = 0.1202
K = 4
dadosKruskal_k4 <- cbind(dados, Cluster = clara.res4$cluster)
dadosKruskal_k4 <- dadosKruskal_k4 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k4)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 8.1659, df = 3, p-value = 0.0427
K = 5
dadosKruskal_k5 <- cbind(dados, Cluster = clara.res5$cluster)
dadosKruskal_k5 <- dadosKruskal_k5 %>%
select(Cluster, Gain_Read)
kruskal.test(Gain_Read ~ Cluster, data = dadosKruskal_k5)##
## Kruskal-Wallis rank sum test
##
## data: Gain_Read by Cluster
## Kruskal-Wallis chi-squared = 10.235, df = 4, p-value = 0.03665
4.4.4 Teste de Dunn com ajuste do p-valor
K = 2
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k2, p.adjust.method = "bonferroni")## # A tibble: 1 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 201 1 -0.886 0.376 0.376 ns
K = 3
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k3, p.adjust.method = "bonferroni")## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 46 1.86 0.0631 0.189 ns
## 2 Gain_Read 1 3 155 1 -0.814 0.416 1 ns
## 3 Gain_Read 2 3 46 1 -1.12 0.264 0.793 ns
K = 4
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k4, p.adjust.method = "bonferroni")## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 41 2.36 0.0184 0.110 ns
## 2 Gain_Read 1 3 155 1 -0.814 0.416 1 ns
## 3 Gain_Read 1 4 155 5 -1.16 0.248 1 ns
## 4 Gain_Read 2 3 41 1 -1.22 0.224 1 ns
## 5 Gain_Read 2 4 41 5 -1.98 0.0475 0.285 ns
## 6 Gain_Read 3 4 1 5 0.266 0.790 1 ns
K = 5
dunn_test(Gain_Read ~ Cluster, data = dadosKruskal_k5, p.adjust.method = "bonferroni")## # A tibble: 10 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Gain_Read 1 2 155 41 2.36 0.0184 0.184 ns
## 2 Gain_Read 1 3 155 1 -0.814 0.416 1 ns
## 3 Gain_Read 1 4 155 4 -1.67 0.0946 0.946 ns
## 4 Gain_Read 1 5 155 1 0.759 0.448 1 ns
## 5 Gain_Read 2 3 41 1 -1.22 0.224 1 ns
## 6 Gain_Read 2 4 41 4 -2.41 0.0161 0.161 ns
## 7 Gain_Read 2 5 41 1 0.344 0.731 1 ns
## 8 Gain_Read 3 4 1 4 -0.0269 0.979 1 ns
## 9 Gain_Read 3 5 1 1 1.12 0.264 1 ns
## 10 Gain_Read 4 5 4 1 1.44 0.150 1 ns
4.5 Agglomerative Clustering
4.5.1 Dendrogram
Euclidean Distance Method
res.dist <- dist(dadosCluster, method = "euclidean")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)Manhattan Distance Method
res.dist <- dist(dadosCluster, method = "manhattan")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)Maximun Distance Method
res.dist <- dist(dadosCluster, method = "maximum")
res.hc <- hclust(d = res.dist, method = "ward.D2")
fviz_dend(res.hc, k = 3, # Cut in three groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE, # Add rectangle around groups
rect_border = c("#2E9FDF", "#00AFBB", "#E7B800"),
rect_fill = TRUE)5. Explorando os Clusters
5.1 Criando a variável Cluster na base de dados
##Criando coluna de cluster na tabela de dados do grupo experimental
dados <- cbind(dados, Cluster = pam.res4$cluster)
dados$Cluster <-
factor(dados$Cluster, label = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), levels = c(1,2,3,4))5.2 Tabela de variáveis em relação aos clusters
dados %>%
select( #aqui ta selecionando todas as variáveis na tabela
special,
child_school,
child_school_grade,
age,
gender,
PRE_Read_Score,
POS_Read_SCORE,
Cluster) %>%
tbl_summary(by = Cluster,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>% #o by informa a varilavel do cruzamento
add_p() #adiciona os p-valores dos testes de qui-quadrado| Characteristic | Cluster 1, N = 1551 | Cluster 2, N = 411 | Cluster 3, N = 11 | Cluster 4, N = 51 | p-value2 |
|---|---|---|---|---|---|
| special | 0.3 | ||||
| no | 140 (90%) | 32 (78%) | 1 (100%) | 5 (100%) | |
| not-informed | 9 (5.8%) | 6 (15%) | 0 (0%) | 0 (0%) | |
| yes | 6 (3.9%) | 3 (7.3%) | 0 (0%) | 0 (0%) | |
| child_school | >0.9 | ||||
| centro-educativo-particular | 34 (22%) | 7 (17%) | 0 (0%) | 1 (20%) | |
| centro-educativo-publico | 115 (74%) | 32 (78%) | 1 (100%) | 4 (80%) | |
| otro | 6 (3.9%) | 2 (4.9%) | 0 (0%) | 0 (0%) | |
| child_school_grade | 0.6 | ||||
| kinder | 69 (45%) | 20 (49%) | 0 (0%) | 3 (60%) | |
| otro | 6 (3.9%) | 0 (0%) | 0 (0%) | 0 (0%) | |
| pre-kinder | 51 (33%) | 16 (39%) | 0 (0%) | 2 (40%) | |
| primary-basica-general | 29 (19%) | 5 (12%) | 1 (100%) | 0 (0%) | |
| age | 0.8 | ||||
| 4 | 63 (41%) | 18 (44%) | 0 (0%) | 3 (60%) | |
| 5 | 92 (59%) | 23 (56%) | 1 (100%) | 2 (40%) | |
| gender | 0.4 | ||||
| f | 70 (45%) | 14 (34%) | 0 (0%) | 3 (60%) | |
| m | 85 (55%) | 27 (66%) | 1 (100%) | 2 (40%) | |
| PRE_Read_Score | 0.95 (0.12) | 0.87 (0.20) | 1.00 (NA) | 0.96 (0.04) | 0.008 |
| POS_Read_SCORE | 0.96 (0.11) | 0.96 (0.07) | 0.99 (NA) | 0.95 (0.06) | 0.2 |
|
1
n (%); Mean (SD)
2
Fisher's exact test; Kruskal-Wallis rank sum test
|
|||||
5.3 Gráficos de Densidade em relação a formação dos clusters
Plotando a densidade para os scores de pré e pós teste de acordo com o Cluster
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_density(alpha=0.5) +
xlim(0.75,1) +
facet_wrap(~Cluster,1)5.4 Gráficos de Boxplot em relação a formação dos clusters
Plotando a densidade para os scores de pré e pós teste de acordo com o Cluster
dados %>%
gather(PRE_Read_Score,
POS_Read_SCORE,
key = test, value = scores) %>%
ggplot(aes(scores, fill = test)) +
geom_boxplot(alpha=0.5) +
coord_flip() +
facet_wrap(~Cluster,1)