Thesis

Author

Example Project: sociology

Published

May 22, 2021

Introduction

Motivation is a critical component of academic success and has been the subject of extensive research in the field of education. The present study seeks to contribute to this body of literature by analyzing the motivation profiles of Flemish students over time, with a specific focus on categorization into five distinct groups. Drawing on existing literature, these groups are defined as qualitative-low, qualitative-high, quantitative-low, quantitative-high, and amotive. The primary objective of the study is to establish the feasibility of clustering the motivation profiles of the participants into these five categories.

The study applies a mixed-methods approach, combining quantitative and qualitative data to analyze the motivation profiles of the participants over time. The results suggest that the theoretical framework, which posits the existence of five distinct motivation profiles, is supported by the data: the participants' motivation profiles do cluster into five groups. However, it is important to note that these clusters were observed across the entire dataset, rather than within each phase of the study. The findings further suggest that only three different profiles were observed per phase, indicating that motivation profiles may vary over time.

Overall, this study contributes to the existing literature on motivation and provides valuable insights into the motivation profiles of Flemish students. The findings may have important implications for educators and policymakers, as understanding the motivation profiles of students can help to design more effective teaching strategies and interventions to enhance student learning and success.

load packages

library(pacman)
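# p_load() installs any packages that are missing and then attaches them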
p_load("factoextra", "cowplot", "cluster", "NbClust", "questionr", "tidyr", "vcd", "tidyverse", "haven", "networkD3", "plot3D", "plotly")

Data cleaning

We wish to obtain the columns containing the values for autonomous, controlled, and amotive motivation.

df = read_sav("dataset_OLB.sav")

namen_col = names(df)

auto_col = namen_col[which(str_detect(namen_col, "mot_auto"))]
controle_col = namen_col[which(str_detect(namen_col, "cont"))]
amot_col = namen_col[which(str_detect(namen_col, "amot") & !str_detect(namen_col,"verz"))]

selectie = c(auto_col, controle_col, amot_col)

basetable = df %>%
  select(Toegangscode, all_of(selectie))

knitr::kable(basetable[1:3, ], "html")
Toegangscode FASE1_mot_auto FASE2_mot_auto FASE3_mot_auto FASE4_mot_auto Fase5_mot_auto FASE1_mot_cont FASE2_mot_cont FASE3_mot_cont FASE4_mot_cont Fase5_mot_cont FASE1_mot_amot FASE2_mot_amot FASE3_mot_amot FASE4_mot_amot Fase5_mot_amot
A0301 2.750 2.5 3.75 2.375 3.75 3.125 2.875 3.25 2.25 3.25 1.50 1.75 1 3.75 1.25
A0302 2.500 NA NA NA 2.25 2.750 NA NA NA 3.00 1.25 NA NA NA 1.00
A0303 2.625 NA NA NA NA 3.625 NA NA NA NA 1.00 NA NA NA NA

We pivot the dataset into a longer format and then spread the motivation types back into separate columns, giving one row per participant and phase.

basetable = pivot_longer(basetable, -Toegangscode, names_to = "measurement_phase",values_to = "score") %>% 
  mutate(fase = str_extract(measurement_phase, "\\d")) %>% 
  mutate(motivation = str_sub(measurement_phase, -4,-1)) %>% 
  select(-measurement_phase) %>% 
  distinct(Toegangscode, fase,motivation, .keep_all = T) %>% 
  pivot_wider(names_from = motivation, values_from = score)

knitr::kable(basetable[1:5, ], caption = "Overview dataset.")
Overview dataset.
Toegangscode fase auto cont amot
A0301 1 2.750 3.125 1.50
A0301 2 2.500 2.875 1.75
A0301 3 3.750 3.250 1.00
A0301 4 2.375 2.250 3.75
A0301 5 3.750 3.250 1.25

Data exploration

Prior to our analysis, we conducted an examination of missing values in the dataset. Given the low completion rate observed in phase 5, we made the decision to use either phase 3 or 4 as the final phase for our analysis.

codes = unique(basetable$Toegangscode)

basetable %>% 
  group_by(fase) %>% 
  summarise(drop_outs = sum(is.na(auto)), drop_out_percentage = drop_outs*100/length(codes)) %>% 
  ggplot(aes(x = fase, y = drop_outs)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(format(round(drop_out_percentage, 2), nsmall = 2), "%")), vjust = -0.2)+
  theme_minimal()+
  ggtitle("Drop-out per phase")

However, when looking at the completion rate across all phases, we observe that 250 participants have complete measurements up to phase 3 whereas only 182 have complete measurements up to phase 4, so we opt to end the analysis at phase 3.

basetable %>% 
  group_by(Toegangscode) %>% 
  filter(fase %in% c(1:4)) %>% 
  filter(sum(is.na(auto)) == 0) %>% 
  ungroup() %>% 
  summarise(total_groups_phase_4 = length(unique(Toegangscode)))
# A tibble: 1 x 1
  total_groups_phase_4
                 <int>
1                  182
basetable %>% 
  group_by(Toegangscode) %>% 
  filter(fase %in% c(1:3)) %>% 
  filter(sum(is.na(auto)) == 0) %>% 
  ungroup() %>% 
  summarise(total_groups_phase_3 = length(unique(Toegangscode)))
# A tibble: 1 x 1
  total_groups_phase_3
                 <int>
1                  250

The distribution of male and female participants is:

codes_fase_3 = basetable %>% 
  group_by(Toegangscode) %>% 
  filter(fase %in% c(1:3)) %>% 
  filter(sum(is.na(auto)) == 0) %>%
  ungroup() %>% 
  distinct(Toegangscode)
distribution_male_female = merge(codes_fase_3, df, "Toegangscode") %>% 
  select(Fase3_Geslacht) %>% 
  table() %>% 
  prop.table()
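
# NOTE: the percentages used in the plot below (38.4 / 61.6) are hard-coded labels;
# they are assumed to match the proportions in distribution_male_female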

ggplot(data = data.frame(x = 1, y = c(38.4, 61.6), Sex = c("Male", "Female")),
       aes(x, y, fill = Sex)) +
  geom_col() +
  coord_polar(theta = "y") +
  theme_void() +
  geom_text(aes(label = paste0(y, "%")), colour = "white", 
            position = position_stack(vjust = 0.5), size = 6) +
  scale_fill_manual(values = c("deepskyblue4", "tomato2")) +
  theme(title = element_text(size = 16, face = "bold", hjust = 0.5))+
  ggtitle("Distribution male/female")

clustering

cluster validation global dataset

#define final basetable
basetable_final = basetable %>% 
  group_by(Toegangscode) %>% 
  filter(fase %in% c(1:3)) %>% 
  filter(sum(is.na(auto)) == 0) %>% 
  ungroup()

#define final dataframe for clustering (all phase 1-3 scores stacked, three rows per participant)
phase_global = basetable %>% 
  group_by(Toegangscode) %>% 
  filter(fase %in% c(1:3)) %>% 
  filter(sum(is.na(auto)) == 0) %>% 
  ungroup() %>% 
  select(auto, cont, amot)


#validation before clustering

  # Test tendency on original dataset
  res.o <- get_clust_tendency(phase_global, n = nrow(phase_global)-1, graph=FALSE)
  H_dataset <- round(res.o$hopkins_stat,3)
  
  # random reference dataset: uniform draws within the observed range of each variable
  random_dffull <- as.data.frame(apply(phase_global,2,function(x){runif(length(x),min(x),max(x))}))
  
  # testing tendency
  res.r <- get_clust_tendency(random_dffull, 
                              n = nrow(random_dffull)-1,
                              graph=FALSE)
  
  H_random <- round(res.r$hopkins_stat,3)
  
  # make table for output
  table_H <- cbind(H_dataset,H_random)
  colnames(table_H) <- c('original dataset','random dataset')
  rownames(table_H) <- 'Hopkins statistic'
knitr::kable(table_H, "html")
original dataset random dataset
Hopkins statistic 0.728 0.507

A Hopkins statistic close to 0.5 indicates essentially uniform data, whereas values approaching 1 indicate a clustering tendency; the gap between the original data (0.728) and the random reference (0.507) therefore suggests the dataset is clusterable.

cluster validation per phase

phase1 = basetable_final %>% filter(fase == 1) %>% select(auto, cont, amot)
phase2 = basetable_final %>% filter(fase == 2) %>% select(auto, cont, amot)
phase3 = basetable_final %>% filter(fase == 3) %>% select(auto, cont, amot)

phase1_cluster_tendency <- get_clust_tendency(phase1, n = nrow(phase1)-1, graph=FALSE)
phase_1_value <- round(phase1_cluster_tendency$hopkins_stat,3)

phase2_cluster_tendency <- get_clust_tendency(phase2, n = nrow(phase2)-1, graph=FALSE)
phase_2_value <- round(phase2_cluster_tendency$hopkins_stat,3)

phase3_cluster_tendency <- get_clust_tendency(phase3, n = nrow(phase3)-1, graph=FALSE)
phase_3_value <- round(phase3_cluster_tendency$hopkins_stat,3)

# make table for output
overview_phases <- cbind(phase_1_value,phase_2_value,phase_3_value)
colnames(overview_phases) <- c("phase 1", "phase 2", "phase 3")
rownames(overview_phases) <- 'Hopkins statistic'
knitr::kable(overview_phases, "html")
phase 1 phase 2 phase 3
Hopkins statistic 0.724 0.696 0.699

cluster estimation

To determine the optimal number of clusters, we examined five distance metrics (Euclidean, maximum, Manhattan, Canberra, and Minkowski) in combination with eight clustering methods (Ward.D, Ward.D2, complete, average, McQuitty, median, centroid, and k-means).

#define param
distances = c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
methods = c("ward.D","ward.D2", "complete", "average", "mcquitty", "median", "centroid","kmeans")

n_distance = length(distances)
n_methods = length(methods)

#create function: for each distance/method pair, NbClust computes a battery of validity
#indices for 3 to 6 clusters; Best.nc[1,] holds the number of clusters preferred by each
#index and freq() tallies those votes. Combinations that raise a warning are caught and
#their distance/method is printed instead.
sensitivity_cluster_analysis = function(basetable, distance, method){
  tryCatch({
    freq_table = freq(NbClust(basetable, distance = distance,
                              min.nc = 3, max.nc = 6,
                              method = method,
                              index ="all")$Best.nc[1,])[,c(1,3)]
    freq_table["cluster"] =rownames(freq_table)
    return(freq_table)
  },warning = function(w){
    print(paste(distance,method))
  })
}

#call function
total_overview = lapply(1:n_distance, function(i) lapply(1:n_methods, function(j) sensitivity_cluster_analysis(phase1, distances[i], methods[j])))

#give names
names(total_overview) = distances
tibble_total_overview = tibble(total_overview = total_overview)

#write to RDS
#write_rds(tibble_total_overview,"final_clustering_1.rds")

The function was applied to four datasets, namely phase 1, phase 2, phase 3, and basetable_final, resulting in the output below. Notably, for phase 3 a total of five clustering combinations failed: the Euclidean distance with median and centroid linkage, the Manhattan distance with median linkage, and the maximum distance with median and centroid linkage.
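
Only the phase 1 run is shown above. A minimal sketch of how the remaining runs could be produced and cached is given below; the helper run_overview() is illustrative, and we assume the "all phases" dataset corresponds to phase_global defined earlier (the phase 3 result was additionally cleaned of failing combinations before being saved, hence the _cleaned suffix in the file read further down).

#sketch: run the same grid for the remaining datasets and cache the results
#(run_overview() is an illustrative helper; it assumes distances, methods and
# sensitivity_cluster_analysis() are defined as above)
run_overview = function(data){
  overview = lapply(1:n_distance, function(i) lapply(1:n_methods, function(j) sensitivity_cluster_analysis(data, distances[i], methods[j])))
  names(overview) = distances
  tibble(total_overview = overview)
}

#write_rds(run_overview(phase2), "final_clustering_2.rds")
#write_rds(run_overview(phase3), "final_clustering_3.rds")        #cleaned afterwards
#write_rds(run_overview(phase_global), "final_clustering_1_3.rds")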

#define param
distances = c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
methods = c("ward.D","ward.D2", "complete", "average", "mcquitty", "median", "centroid","kmeans")

n_distance = length(distances)
n_methods = length(methods)

#read in data
final_clustering_1 = read_rds("final_clustering_1.rds")
final_clustering_2 = read_rds("final_clustering_2.rds")
final_clustering_3 = read_rds("final_clustering_3_cleaned.rds")
final_clustering_1_3 = read_rds("final_clustering_1_3.rds")

#output
library(cowplot)

#function to create tables: for each distance/method pair keep the cluster solution chosen
#most often by the NbClust indices, then count how often each solution wins overall
create_table = function(x){
  cleaned_tibble = x %>% 
    unnest(total_overview) %>% 
    mutate(distance = rep(c(1:n_distance),each = n_methods)) %>% 
    mutate(method = rep(c(1:n_methods),n_distance)) %>% 
    unnest(total_overview) %>% 
    group_by(distance,method) %>% 
    arrange(desc(n)) %>% 
    slice(1) %>% 
    select(cluster)
  
  freq_tibble = freq(cleaned_tibble$cluster)
  freq_tibble["cluster"] = rownames(freq_tibble)
  return(freq_tibble)
}

plot_cluster_1 = create_table(final_clustering_1) %>% 
  ggplot(aes(x = cluster, y = n, fill = cluster))+
  geom_bar(stat = "identity")

plot_cluster_2 = create_table(final_clustering_2)%>% 
  ggplot(aes(x = cluster, y = n, fill = cluster))+
  geom_bar(stat = "identity")

plot_cluster_3 = final_clustering_3 %>% 
  ggplot(aes(x = cluster, y = n, fill = cluster))+
  geom_bar(stat = "identity")

plot_cluster_1_3 = create_table(final_clustering_1_3)%>% 
  ggplot(aes(x = cluster, y = n, fill = cluster))+
  geom_bar(stat = "identity")

plot_grid(plot_cluster_1, plot_cluster_2,plot_cluster_3,plot_cluster_1_3, labels = c("phase 1","phase 2","phase 3","all phases"))


cluster introspection

#we take 5 clusters
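# hkmeans() (factoextra) first cuts a hierarchical clustering into k groups and uses the
# group centres as starting centres for k-means, making the result less dependent on
# random initialisation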

  #fase 1
  fase1_cluster <- hkmeans(basetable_final %>% filter(fase == 1) %>% select(auto, cont, amot), 
                              k = 5,
                              hc.metric="manhattan",
                              hc.method='ward.D2',
                              iter.max = 10,
                              km.algorithm = "Hartigan-Wong")

  #fase 2  
  fase2_cluster <- hkmeans(basetable_final %>% filter(fase == 2) %>% select(auto, cont, amot), 
                           k = 5,
                           hc.metric="manhattan",
                           hc.method='ward.D2',
                           iter.max = 10,
                           km.algorithm = "Hartigan-Wong")

  #fase 3
  fase3_cluster <- hkmeans(basetable_final %>% filter(fase == 3) %>% select(auto, cont, amot), 
                           k = 5,
                           hc.metric="manhattan",
                           hc.method='ward.D2',
                           iter.max = 10,
                           km.algorithm = "Hartigan-Wong")
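
# the three cluster vectors below can be combined row-wise because basetable_final keeps
# only participants with complete phase 1-3 data, so each phase subset contains the same
# participants in the same order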
  
clusters_evolution = tibble(clusters_fase1 = fase1_cluster$cluster, 
                            clusters_fase2 = fase2_cluster$cluster,
                            clusters_fase3 = fase3_cluster$cluster)



#data generation
df = tibble(
  source_target_1_2 = as_factor(do.call(paste, c("1_2", clusters_evolution[1:2], sep="-"))),
  source_target_2_3 = as_factor(do.call(paste, c("2_3",clusters_evolution[2:3], sep="-")))
)

df = data.frame(source_target = c(df$source_target_1_2, df$source_target_2_3))

df = df %>% 
  group_by(source_target) %>% 
  summarise(n = n()) %>% 
  mutate(source  = str_sub(source_target, start = 5, 5)) %>% 
  mutate(target  = str_sub(source_target, start = 7, 7)) %>% 
  mutate(fase  = str_sub(source_target, start = 1,3))

profiles_1 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_1", "quantitative-LOW_1", "qualitative-LOW_1","quantitative-HIGH_1","amotive_1"))
profiles_2 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_2", "quantitative-LOW_2", "qualitative-LOW_2","quantitative-HIGH_2","amotive_2"))
profiles_3 = data.frame(source = c(1:5), profile = c("qualitative-HIGH_3", "quantitative-LOW_3", "qualitative-LOW_3","quantitative-HIGH_3","amotive_3"))


fase_1_2 = df %>% 
  filter(fase == "1_2") %>% 
  merge(profiles_1, by.x = "source", by.y = "source") %>% 
  merge(profiles_2, by.x = "target", by.y = "source") %>% 
  select(n, profile.x, profile.y)

fase_2_3 = df %>% 
  filter(fase == "2_3") %>% 
  merge(profiles_2, by.x = "source", by.y = "source") %>% 
  merge(profiles_3, by.x = "target", by.y = "source") %>% 
  select(n, profile.x, profile.y)

names(fase_1_2) = c("link","source","target")
names(fase_2_3) = c("link","source","target")

final_df = bind_rows(fase_1_2, fase_2_3)

links = data.frame(
  source = final_df$source,
  target = final_df$target,
  value = final_df$link
)

# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name=c(as.character(links$source), 
         as.character(links$target)) %>% unique()
)

# With networkD3, connections must be provided as zero-based node ids rather than names, so we add id columns to the links data frame
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name", 
                   sinksRight=FALSE, fontSize = 14)
p
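
The Sankey diagram is an interactive HTML widget. If a standalone copy is needed outside the rendered document, it can be written to file with htmlwidgets (a minimal sketch; the file name is only illustrative):

#sketch: save the interactive diagram as a standalone HTML file (file name illustrative)
htmlwidgets::saveWidget(p, "sankey_motivation_profiles.html", selfcontained = TRUE)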

cluster characteristics

#take 5 clusters.

# two-step clustering procedure: hierarchical initialisation followed by k-means (hkmeans)
basetable_clusters <- hkmeans(phase_global, 
                              k = 5,
                              hc.metric="manhattan",
                              hc.method='average',
                              iter.max = 10,
                              km.algorithm = "Hartigan-Wong")

# assign cluster variable to your dataset
phase_global$clusters <- as.factor(basetable_clusters$cluster)

# check proportions of participants into the clusters
freq(phase_global$clusters)
    n    % val%
1 132 17.6 17.6
2  79 10.5 10.5
3 191 25.5 25.5
4 171 22.8 22.8
5 177 23.6 23.6
#visual representation
# calculate means of variables by cluster levels
final_clusters = phase_global %>% 
  group_by(clusters) %>% 
  summarise(across(c(auto, cont, amot), mean)) %>% 
  pivot_longer(!clusters, names_to = "learning_profile", values_to = "score")

ggplot(data  = final_clusters, aes(x = "", y = score, fill = learning_profile))+
  geom_bar(stat = "identity",position = "dodge", width = 1)+
  facet_wrap(vars(clusters))+
  geom_text(aes(label = paste(format(round(score,2), nsmall = 2))), vjust = -0.2,
            position = position_dodge(width = .9))+
  theme_minimal()

conclusion

This study presents an analysis of the motivation profiles of Flemish students over time, with a focus on categorization into five distinct groups. The literature highlights five categories, namely qualitative-low, qualitative-high, quantitative-low, quantitative-high, and amotive. The present research aimed to establish whether the participants' motivation profiles can indeed be clustered into these categories. The results support the theoretical framework: the motivation profiles of the participants cluster into five distinct groups. However, it is important to note that these clusters were observed across the entire dataset, as opposed to within each phase of the study. In fact, the findings suggest that only three different profiles were observed per phase.