library(pacman)
p_load('factoextra','cowplot','cluster','NbClust','questionr','tidyr','vcd','tidyverse', "haven","networkD3","plot3D","plotly")Thesis
introduction research
Motivation is a critical component of academic success and has been the subject of extensive research in the field of education. The present study seeks to contribute to this body of literature by analyzing the motivation profiles of Flemish students over time, with a specific focus on categorization into five distinct groups. Drawing on existing literature, these groups are defined as qualitative-low, qualitative-high, quantitative-low, quantitative-high, and amotive. The primary objective of the study is to establish the feasibility of clustering the motivation profiles of the participants into these five categories.
The study applies a mixed-methods approach, combining both quantitative and qualitative data to analyze the motivation profiles of the participants over time. The results of the study suggest that the theoretical framework, which posits the existence of five distinct motivation profiles, is validated by the data. The participants’ motivation profiles were indeed found to cluster into five groups, supporting the validity of the theoretical framework. However, it is important to note that these clusters were observed across the entire dataset, rather than within each phase of the study. The findings further suggest that only three different profiles were observed per phase, indicating that motivation profiles may vary over time.
Overall, this study contributes to the existing literature on motivation and provides valuable insights into the motivation profiles of Flemish students. The findings may have important implications for educators and policymakers, as understanding the motivation profiles of students can help to design more effective teaching strategies and interventions to enhance student learning and success.
load packages
Data cleaning
We wish to obtain the columns containing the values for automotive, controle and amotive motivation
df = read_sav("dataset_OLB.sav")
namen_col = names(df)
auto_col = namen_col[which(str_detect(namen_col, "mot_auto"))]
controle_col = namen_col[which(str_detect(namen_col, "cont"))]
amot_col = namen_col[which(str_detect(namen_col, "amot") & !str_detect(namen_col,"verz"))]
selectie = c(auto_col, controle_col, amot_col)
basetable = df %>%
select(Toegangscode, all_of(selectie))
knitr::kable(basetable[1:3, ], "html")| Toegangscode | FASE1_mot_auto | FASE2_mot_auto | FASE3_mot_auto | FASE4_mot_auto | Fase5_mot_auto | FASE1_mot_cont | FASE2_mot_cont | FASE3_mot_cont | FASE4_mot_cont | Fase5_mot_cont | FASE1_mot_amot | FASE2_mot_amot | FASE3_mot_amot | FASE4_mot_amot | Fase5_mot_amot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A0301 | 2.750 | 2.5 | 3.75 | 2.375 | 3.75 | 3.125 | 2.875 | 3.25 | 2.25 | 3.25 | 1.50 | 1.75 | 1 | 3.75 | 1.25 |
| A0302 | 2.500 | NA | NA | NA | 2.25 | 2.750 | NA | NA | NA | 3.00 | 1.25 | NA | NA | NA | 1.00 |
| A0303 | 2.625 | NA | NA | NA | NA | 3.625 | NA | NA | NA | NA | 1.00 | NA | NA | NA | NA |
We pivot our dataset to obtain a more readable format
basetable = pivot_longer(basetable, -Toegangscode, names_to = "measurement_phase",values_to = "score") %>%
mutate(fase = str_extract(measurement_phase, "\\d")) %>%
mutate(motivation = str_sub(measurement_phase, -4,-1)) %>%
select(-measurement_phase) %>%
distinct(Toegangscode, fase,motivation, .keep_all = T) %>%
pivot_wider(names_from = motivation, values_from = score)
knitr::kable(basetable[1:5, ], caption = "Overview dataset.")| Toegangscode | fase | auto | cont | amot |
|---|---|---|---|---|
| A0301 | 1 | 2.750 | 3.125 | 1.50 |
| A0301 | 2 | 2.500 | 2.875 | 1.75 |
| A0301 | 3 | 3.750 | 3.250 | 1.00 |
| A0301 | 4 | 2.375 | 2.250 | 3.75 |
| A0301 | 5 | 3.750 | 3.250 | 1.25 |
Data exploration
Prior to our analysis, we conducted an examination of missing values in the dataset. Given the low completion rate observed in phase 5, we made the decision to use either phase 3 or 4 as the final phase for our analysis.
codes = unique(basetable$Toegangscode)
basetable %>%
group_by(fase) %>%
summarise(drop_outs = sum(is.na(auto)), drop_out_percentage = drop_outs*100/length(codes)) %>%
ggplot(aes(x = fase, y = drop_outs)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(format(round(drop_out_percentage), nsmall = 2), "%")), vjust = -0.2)+
theme_minimal()+
ggtitle("drop out per phase")However when looking at the completion rate rate across all phase, we observe that in phase 3 we have 250 complete measurements and in phase 4 we have 182 complete measurements, thus we opt for ending it at phase 3
basetable %>%
group_by(Toegangscode) %>%
filter(fase %in% c(1:4)) %>%
filter(sum(is.na(auto)) == 0) %>%
ungroup() %>%
summarise(total_groups_phase_4 = length(unique(Toegangscode)))# A tibble: 1 x 1
total_groups_phase_4
<int>
1 182
basetable %>%
group_by(Toegangscode) %>%
filter(fase %in% c(1:3)) %>%
filter(sum(is.na(auto)) == 0) %>%
ungroup() %>%
summarise(total_groups_phase_3 = length(unique(Toegangscode)))# A tibble: 1 x 1
total_groups_phase_3
<int>
1 250
The distrubtion of male and female is
codes_fase_3 = basetable %>%
group_by(Toegangscode) %>%
filter(fase %in% c(1:3)) %>%
filter(sum(is.na(auto)) == 0) %>%
ungroup() %>%
summarise(Toegangscode = unique(Toegangscode))Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
distribution_male_female = merge(codes_fase_3, df, "Toegangscode") %>%
select(Fase3_Geslacht) %>%
table() %>%
prop.table()
ggplot(data = data.frame(x = 1, y = c(38.4, 61.6), Sex = c("Male", "Female")),
aes(x, y, fill = Sex)) +
geom_col() +
coord_polar(theta = "y") +
theme_void() +
geom_text(aes(label = paste0(y, "%")), colour = "white",
position = position_stack(vjust = 0.5), size = 6) +
scale_fill_manual(values = c("deepskyblue4", "tomato2")) +
theme(title = element_text(size = 16, face = 2, hjust = 0.5))+
ggtitle("distribution male/female")clustering
cluster validation global dataset
#define final basetable
basetable_final = basetable %>%
group_by(Toegangscode) %>%
filter(fase %in% c(1:3)) %>%
filter(sum(is.na(auto)) == 0) %>%
ungroup()
#define final dataframe for clustering
phase_global = basetable %>%
group_by(Toegangscode) %>%
filter(fase %in% c(1:3)) %>%
filter(sum(is.na(auto)) == 0) %>%
ungroup() %>%
select(auto, cont, amot)
#validation before clustering
# Test tendency on original dataset
res.o <- get_clust_tendency(phase_global, n = nrow(phase_global)-1, graph=FALSE)
H_dataset <- round(res.o$hopkins_stat,3)
# random dataset
random_dffull <- as.data.frame(apply(phase_global,2,function(x){runif(length(x),min(x),max(x))}))
# testing tendency
res.r <- get_clust_tendency(random_dffull,
n = nrow(random_dffull)-1,
graph=FALSE)
H_random <- round(res.r$hopkins_stat,3)
# make table for output
table_H <- cbind(H_dataset,H_random)
colnames(table_H) <- c('original dataset','random dataset')
rownames(table_H) <- 'Hopkin statistic'
knitr::kable(table_H, "html")| original dataset | random dataset | |
|---|---|---|
| Hopkin statistic | 0.728 | 0.507 |
cluster validation per phase
phase1 = basetable_final %>% filter(fase == 1) %>% select(auto, cont, amot)
phase2 = basetable_final %>% filter(fase == 2) %>% select(auto, cont, amot)
phase3 = basetable_final %>% filter(fase == 3) %>% select(auto, cont, amot)
phase1_cluster_tendency <- get_clust_tendency(phase1, n = nrow(phase1)-1, graph=FALSE)
phase_1_value <- round(phase1_cluster_tendency$hopkins_stat,3)
phase2_cluster_tendency <- get_clust_tendency(phase2, n = nrow(phase2)-1, graph=FALSE)
phase_2_value <- round(phase2_cluster_tendency$hopkins_stat,3)
phase3_cluster_tendency <- get_clust_tendency(phase3, n = nrow(phase3)-1, graph=FALSE)
phase_3_value <- round(phase3_cluster_tendency$hopkins_stat,3)
# make table for output
overview_phases <- cbind(phase_1_value,phase_2_value,phase_3_value)
colnames(overview_phases) <- c("phase 1", "phase 2", "phase 3")
rownames(overview_phases) <- 'Hopkin statistic'
knitr::kable(overview_phases, "html")| phase 1 | phase 2 | phase 3 | |
|---|---|---|---|
| Hopkin statistic | 0.724 | 0.696 | 0.699 |
cluster estimation
In order to determine the optimal cluster size, we examined all possible distance metrics, including Euclidean, Maximum, Manhattan, Canberra, and Minkowski, utilizing all available methods, namely Ward.D, Ward.D2, Complete, Average, Mcquitty, Median, Centroid, and K-means.
#define param
distances = c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
methods = c("ward.D","ward.D2", "complete", "average", "mcquitty", "median", "centroid","kmeans")
n_distance = length(distances)
n_methods = length(methods)
#create function
sensitivity_cluster_analysis = function(basetable, distance, method){
tryCatch({
freq_table = freq(NbClust(basetable, distance = distance,
min.nc = 3, max.nc = 6,
method = method,
index ="all")$Best.nc[1,])[,c(1,3)]
freq_table["cluster"] =rownames(freq_table)
return(freq_table)
},warning = function(w){
print(paste(distance,method))
})
}
#call function
total_overview = lapply(1:n_distance, function(i) lapply(1:n_methods, function(j) sensitivity_cluster_analysis(phase1, distances[i], methods[j])))
#give names
names(total_overview) = distances
tibble_total_overview = tibble(total_overview = total_overview)
#write to RDS
#write_rds(tibble_total_overview,"final_clustering_1.rds")The function was implemented on four datasets, namely phase 1, phase 2, phase 3, and the basetable_final, resulting in the following output. Notably, for phase 3, a total of five clustering methods failed, specifically those utilizing the Euclidean distance metric with median and centroid linkage, the Manhattan distance metric with median linkage, and the maximum distance metric with median and centroid linkage.
#define param
distances = c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
methods = c("ward.D","ward.D2", "complete", "average", "mcquitty", "median", "centroid","kmeans")
n_distance = length(distances)
n_methods = length(methods)
#read in data
final_clustering_1 = read_rds("final_clustering_1.rds")
final_clustering_2 = read_rds("final_clustering_2.rds")
final_clustering_3 = read_rds("final_clustering_3_cleaned.rds")
final_clustering_1_3 = read_rds("final_clustering_1_3.rds")
#output
library(cowplot)
#function to create tables
create_table = function(x){
cleaned_tibble = x %>%
unnest(total_overview) %>%
mutate(distance = rep(c(1:n_distance),each = n_methods)) %>%
mutate(method = rep(c(1:n_methods),n_distance)) %>%
unnest(total_overview) %>%
group_by(distance,method) %>%
arrange(desc(n)) %>%
slice(1) %>%
select(cluster)
freq_tibble = freq(cleaned_tibble$cluster)
freq_tibble["cluster"] = rownames(freq_tibble)
return(freq_tibble)
}
plot_cluster_1 = create_table(final_clustering_1) %>%
ggplot(aes(x = cluster, y = n, fill = cluster))+
geom_bar(stat = "identity")
plot_cluster_2 = create_table(final_clustering_2)%>%
ggplot(aes(x = cluster, y = n, fill = cluster))+
geom_bar(stat = "identity")
plot_cluster_3 = final_clustering_3 %>%
ggplot(aes(x = cluster, y = n, fill = cluster))+
geom_bar(stat = "identity")
plot_cluster_1_3 = create_table(final_clustering_1_3)%>%
ggplot(aes(x = cluster, y = n, fill = cluster))+
geom_bar(stat = "identity")
plot_grid(plot_cluster_1, plot_cluster_2,plot_cluster_3,plot_cluster_1_3, labels = c("phase 1","phase 2","phase 3","all phases"))`summarise()` has grouped output by 'distance'. You can override using the
`.groups` argument.
cluster introspection
#we take 5 clusters
#fase 1
fase1_cluster <- hkmeans(basetable_final %>% filter(fase == 1) %>% select(auto, cont, amot),
k = 5,
hc.metric="manhattan",
hc.method='ward.D2',
iter.max = 10,
km.algorithm = "Hartigan-Wong")
#fase 2
fase2_cluster <- hkmeans(basetable_final %>% filter(fase == 2) %>% select(auto, cont, amot),
k = 5,
hc.metric="manhattan",
hc.method='ward.D2',
iter.max = 10,
km.algorithm = "Hartigan-Wong")
#fase 3
fase3_cluster <- hkmeans(basetable_final %>% filter(fase == 3) %>% select(auto, cont, amot),
k = 5,
hc.metric="manhattan",
hc.method='ward.D2',
iter.max = 10,
km.algorithm = "Hartigan-Wong")
clusters_evolution = tibble(clusters_fase1 = fase1_cluster$cluster,
clusters_fase2 = fase2_cluster$cluster,
clusters_fase3 = fase3_cluster$cluster)
#data generation
df = tibble(
source_target_1_2 = as_factor(do.call(paste, c("1_2", clusters_evolution[1:2], sep="-"))),
source_target_2_3 = as_factor(do.call(paste, c("2_3",clusters_evolution[2:3], sep="-")))
)
df = data.frame(source_target = c(df$source_target_1_2, df$source_target_2_3))
df = df %>%
group_by(source_target) %>%
summarise(n = n()) %>%
mutate(source = str_sub(source_target, start = 5, 5)) %>%
mutate(target = str_sub(source_target, start = 7, 7)) %>%
mutate(fase = str_sub(source_target, start = 1,3))
profiles_1 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_1", "quantitative-LOW_1", "qualitative-LOW_1","quantitative-HIGH_1","amotive_1"))
profiles_2 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_2", "quantitative-LOW_2", "qualitative-LOW_2","quantitative-HIGH_2","amotive_2"))
profiles_3 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_3", "quantitative-LOW_3", "qualitative-LOW_3","quantitative-HIGH_3","amotive_3"))
fase_1_2 = df %>%
filter(fase == "1_2") %>%
merge(profiles_1, by.x = "source", by.y = "source") %>%
merge(profiles_2, by.x = "target", by.y = "source") %>%
select(n, profile.x, profile.y)
fase_2_3 = df %>%
filter(fase == "2_3") %>%
merge(profiles_2, by.x = "source", by.y = "source") %>%
merge(profiles_3, by.x = "target", by.y = "source") %>%
select(n, profile.x, profile.y)
names(fase_1_2) = c("link","source","target")
names(fase_2_3) = c("link","source","target")
final_df = bind_rows(fase_1_2, fase_2_3)
links = data.frame(
source = final_df$source,
target = final_df$target,
value = final_df$link
)
# From these flows we need to create a node data frame: it lists every entities involved in the flow
nodes <- data.frame(
name=c(as.character(links$source),
as.character(links$target)) %>% unique()
)
# With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it.
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
sinksRight=FALSE, fontSize = 14)
pcluster characteristics
#take cluster size 5.
# function for two-step clustering procedure
basetable_clusters <- hkmeans(phase_global,
k = 5,
hc.metric="manhattan",
hc.method='average',
iter.max = 10,
km.algorithm = "Hartigan-Wong")
# assign cluster variable to your dataset
phase_global$clusters <- as.factor(basetable_clusters$cluster)
# check proportions of participants into the clusters
freq(phase_global$clusters) n % val%
1 132 17.6 17.6
2 79 10.5 10.5
3 191 25.5 25.5
4 171 22.8 22.8
5 177 23.6 23.6
#visual representation
# calculate means of variables by cluster levels
final_clusters = phase_global %>%
group_by(clusters) %>%
summarise(across(c(auto, cont, amot), mean)) %>%
pivot_longer(!clusters, names_to = "learning_profile", values_to = "score")
ggplot(data = final_clusters, aes(x = "", y = score, fill = learning_profile))+
geom_bar(stat = "identity",position = "dodge", width = 1)+
facet_wrap(vars(clusters))+
geom_text(aes(label = paste(format(round(score,2), nsmall = 2))), vjust = -0.2,
position = position_dodge(width = .9))+
theme_minimal()conclusion
This study presents an analysis of the motivation profiles of Flemish students over time, with a focus on categorization into five distinct groups. The literature highlights five categories, namely qualitative-low, qualitative-high, quantitative-low, quantitative-high, and amotive. The present research aimed to establish the feasibility of clustering the findings and identify the aforementioned categories. The results validate the theoretical framework and suggest that the motivation profiles of the participants were clustered into five distinct groups. However, it is important to note that these clusters were observed across the entire dataset, as opposed to within each phase of the study. In fact, the findings suggest that only three different profiles were observed per phase.