Overview

Summary

After our initial exploration of the dataset, we found that our 116 students could be divided into three main groups: the ones who used only the SQL course (77), the ones who used only the Python course (2), and those who used both (37). We decided to explore the two main groups: students who stopped at the first course, (SQL) versus the ones who continued to the second one (SQL+Python). To understand what makes those two groups different, we analyzed the differences in the behavior of those students in what they have in common, the SQL activities.

Right from the start, we can observe a stark difference between the two groups. Students who proceeded to the Python course had a median of 247 attempts in the SQL course, while the median for those who did not proceed was only 20. Breaking the usage by “app” (four activity types in the SQL course), we could also observe statically significant differences between the two groups in every app but SQL Tutor. SQL Tutor’s also presented other anomalies, such as having its usage completely ceased by all the students early in the term - possibly an indicator of technical issues in the app. Furthermore, it is worth noticing that the “only SQL” has an extremely low usage of all the apps but SQL Knot, the one app which was made mandatory by the instructor for extra credits. Still, the data shows that all the students of both groups tried at least one activity of each app. Even in the mandatory SQL Knot app, more students from the “only SQL” category did not reach the minimum requirement of two activities per topic. However, surprisingly those students also had a highest rate of above minimum usage, with the “SQL+Python” students being the majority only in attempting exactly the required 2 activities per topic.

Regarding their behavior in the SQL course within sessions and over time, the two groups also present some differences. First of all, when looking at the number of activities attempted per section, the group that proceeded to Python reached higher maximums, while the students of the group “SQL Only” are mostly concentrated in sessions with few activities. Similarly, the students who proceeded to Python also attempted more activities and reached higher peaks when we look at their behavior by day. Perhaps surprisingly, the “SQL+Python” group is more active in the first and last months of the trimester, while the “only SQL” group is more active in the intermediate month. This behavior remains the same when we analyze it by app, but there are two details worth noticing. Even in its most active times, the “only SQL” group only surpasses the “SQL+Python” one in one app: SQL Knot, the mandatory app. This could point to a reminder by the instructor about the mandatory nature of these activities if the students are interested in extra credits. It could also mean that those less active students received partial grades and recognized their need for extra credits. Finally, as mentioned before, the SQL Knot has a unique pattern - its activity completely ceases after a couple of weeks, which could be an indicator of unreported technical issues with that type of activity.

Dataset Columns

  • User
    User log-in identifier in the system.

  • Group
    Non-essential information about which KT course was opened.
    In the Temple course, all students opened both KT courses.
    (Use this column as domain.)

  • Session
    Web browser session ID counter per user.

  • Timebin
    When the user opened during a session.
    Larger numbers mean later in a given session.

  • Appid
    UM2 DB reference number of the human-readable app name.

  • Parentname / Activityname / Targetname
    Contextual columns to keep track of hierarchy in the log
    (order in which a student attempts a given activity).

  • Applabel
    Human-readable app code:

    • WEBEX (non-assessed)
      SQL line-by-line explanations.

      • System returns -1 for the Result column.
      • Each entry shows which line the student clicked.
      • Line number is stored in activityname / targetname.
      • Parentname has the activity ID.
    • DBQA (non-assessed)
      Animated SQL example (intentional construction).

      • System returns a value between 0 and 1 for the number of lines read.
    • SQLKnot (assessed)
      PCRS equivalent in SQL (intentional behavioral).

    • SQLTutor (assessed)
      QuizPET + PCEX Challenge equivalent in SQL, with additional construction.

    • PCEX Example (non-assessed)
      Worked Example in Python/Java with explanations.

      • System returns -1.
      • Each line is tracked in a separate row.
      • Targetname = PCEX set name.
      • Parentname / Activityname = ID of the individual example.
      • If activityname and targetname are missing but parentname has a value,
        that value indicates the line clicked by the student.
      • (Currently no column shows the line number directly.)
    • PCEX Challenge (assessed)
      Worked Example in Python/Java with distractors.

      • System returns 0 (incorrect) or 1 (correct).
      • Targetname = PCEX set name.
      • Parentname / Activityname = individual PCEX challenge attempt.
    • Animated Example (non-assessed)
      Step-by-step tracing examples in Python/Java.

      • System returns -1.
      • Each entry shows which line was clicked (targetname / activityname).
    • QuizPET (assessed)
      Student submits correct/incorrect output after tracing code.

    • Parsons (assessed)
      Jigsaw puzzle where students reorder code to get correct output.

  • topicname
    Order of the topic in the KT course.

  • Courseroder
    Order of the activity among all cells in the entire KT course.

  • Topciname
    KT topic name.

  • Attemptno
    For a given (activityname, targetname, parentname) tuple,
    the number of student attempts.

  • Result
    System result for a given (activityname, targetname, parentname) tuple.

  • Datestring
    When the activity was attempted.

Dataset Exploration

data <- read.csv("./TempleMIS2502Spring2025_raw_activity_combined_20250822223624.csv", stringsAsFactors = TRUE)

data <- data %>%
  filter(topicname != "sqllab") %>%
  droplevels()

str(data)
## 'data.frame':    31805 obs. of  18 variables:
##  $ user           : Factor w/ 116 levels "tum0","tum1",..: 12 72 72 72 72 72 72 72 72 72 ...
##  $ group          : Factor w/ 2 levels "TempleMIS2502Spring2025",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ session        : int  0 2 2 2 2 2 2 2 2 2 ...
##  $ timebin        : int  8 27 28 29 33 34 38 39 43 44 ...
##  $ appid          : int  3 3 3 3 3 3 53 53 3 3 ...
##  $ applabel       : Factor w/ 9 levels "ANIMATED_EXAMPLE",..: 9 9 9 9 9 9 2 2 9 9 ...
##  $ activityname   : Factor w/ 304 levels "","0","1","10",..: 2 2 24 3 2 3 288 288 2 3 ...
##  $ targetname     : Factor w/ 304 levels "","0","1","10",..: 2 2 24 3 2 3 288 288 2 3 ...
##  $ parentname     : Factor w/ 314 levels "","ae_adl_arithmetics2",..: 53 53 53 53 58 58 298 298 63 63 ...
##  $ topicname      : Factor w/ 23 levels "Aggregate Functions",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ courseorder    : int  7 7 7 7 8 8 18 18 9 9 ...
##  $ topicorder     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ attemptno      : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ result         : num  -1 -1 -1 -1 -1 -1 0.5 1 -1 -1 ...
##  $ datestring     : Factor w/ 31795 levels "2025-01-29 20:01:53.335",..: 9 6252 6253 6254 6255 6256 6257 6258 6259 6260 ...
##  $ unixtimestamp  : int  1738253053 1739461896 1739461899 1739461903 1739461909 1739461913 1739461927 1739461940 1739461952 1739461954 ...
##  $ durationseconds: num  0.01 2.93 4.21 2.63 3.72 ...
##  $ concepts       : Factor w/ 218 levels "","_dict_call_eq_assign_for_or_str_index_int_if_gt_functiondef",..: 7 7 7 210 166 210 214 214 193 210 ...
levels(data$group)
## [1] "TempleMIS2502Spring2025"       "TempleMIS2502Spring2025Python"

SQL Course

sql_data <- droplevels(data[data$group == "TempleMIS2502Spring2025", ])

n_obs   <- nrow(sql_data)

applabel_summary <- sql_data %>%
  count(applabel) %>%
  rename(`App Label` = applabel, Observations = n) %>%
  bind_rows(tibble(`App Label` = "Total", Observations = n_obs))

kable(applabel_summary)
App Label Observations
DBQA 4823
SQLKNOT 5572
SQLTUTOR 1565
WEBEX 3879
Total 15839

Python Course

python_data <- droplevels(data[data$group == "TempleMIS2502Spring2025Python", ])

n_obs <- nrow(python_data)

applabel_summary <- python_data %>%
  group_by(applabel) %>%
  summarise(
    Observations = n(),
    `Distinct Activities` = n_distinct(activityname)  ) %>%
  ungroup() %>%
  rename(`App Label` = applabel) %>%
  bind_rows(
    tibble(
      `App Label` = "Total",
      Observations = n_obs,
      `Distinct Activities` = n_distinct(python_data$activityname)
    )
  )

kable(applabel_summary)
App Label Observations Distinct Activities
ANIMATED_EXAMPLE 5115 23
PARSONS 5193 49
PCEX_CHALLENGE 2312 53
PCEX_EXAMPLE 2671 44
QUIZPET 675 38
Total 15966 207

SQL and Python Intersection

sql_users    <- unique(sql_data$user)
python_users <- unique(python_data$user)

common_users <- intersect(sql_users, python_users)
sql_only     <- setdiff(sql_users, python_users)
python_only  <- setdiff(python_users, sql_users)

user_summary <- tibble(
  Category = c("Only in SQL data", "Only in Python data", "In both"),
  Users    = c(length(sql_only), length(python_only), length(common_users))
)

kable(user_summary, col.names = c("User Category", "Number of Users"))
User Category Number of Users
Only in SQL data 77
Only in Python data 2
In both 37

Usage Distribution

This section has only some preliminary analysis done by me before I defined what questions I wanted to explore. Please proceed to the next section for more relevant results.

SQL Course

user_activity_counts <- sql_data %>%
  group_by(user) %>%
  summarise(n_activities = n_distinct(activityname)) %>%
  ungroup()

usage_distribution <- user_activity_counts %>%
  count(n_activities) %>%
  rename(`Number of Activities` = n_activities,
         `Number of Users` = n)

ggplot(user_activity_counts, aes(x = n_activities)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(
    title = "Density of Distinct SQL Activities per User",
    x = "Number of Distinct Activities",
    y = "Density"
  ) +
  theme_minimal()

Python Course

user_activity_counts <- python_data %>%
  group_by(user) %>%
  summarise(n_activities = n_distinct(activityname)) %>%
  ungroup()

usage_distribution <- user_activity_counts %>%
  count(n_activities) %>%
  rename(`Number of Activities` = n_activities,
         `Number of Users` = n)

ggplot(user_activity_counts, aes(x = n_activities)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(
    title = "Density of Distinct Python Activities per User",
    x = "Number of Distinct Activities",
    y = "Density"
  ) +
  theme_minimal()

Comparison of Student Profiles

Student Categories (SQL only, or SQL+Python)

Here, the each student was categorized into “SQL_only” or “SQL_plus_Python”, based on their logs.

I also added an activity_id column for later, using this mapping:

  • DBQA, SQLTUTOR, SQLKNOT: activityname;

  • WEBEX: parentname.

sql_users <- unique(sql_data$user)
python_users <- unique(python_data$user)

sql_only <- setdiff(sql_users, python_users)
sql_plus_python <- intersect(sql_users, python_users)

both_categories <- data.frame(
  user = c(sql_only, sql_plus_python),
  user_category = factor(
    c(rep("SQL_only", length(sql_only)),
      rep("SQL_plus_Python", length(sql_plus_python))),
    levels = c("SQL_only", "SQL_plus_Python")
  )
)

data_with_user_cat <- data %>%
  inner_join(both_categories, by = "user")

sql_subset <- data_with_user_cat %>%
  filter(group == "TempleMIS2502Spring2025") %>%
  mutate(
    activity_id = case_when(
      applabel %in% c("DBQA", "SQLTUTOR", "SQLKNOT") ~ as.character(activityname),
      applabel == "WEBEX" ~ as.character(parentname),
      TRUE ~ as.character(activityname)
    )
  )

SQL Attempts per User

Here you can find the summary of number of attempts in the SQL activities per user, for users in both categories.

sql_attempts_per_user <- sql_subset %>%
  group_by(user, user_category) %>%
  summarise(
    sql_attempts = n(),
    .groups = "drop"
  )

sql_attempts_summary <- sql_attempts_per_user %>%
  group_by(user_category) %>%
  summarise(
    mean_attempts   = mean(sql_attempts),
    median_attempts = median(sql_attempts),
    sd_attempts     = sd(sql_attempts),
    min_attempts    = min(sql_attempts),
    max_attempts    = max(sql_attempts),
    n_users         = n(),
    .groups = "drop"
  )

sql_attempts_summary
## # A tibble: 2 × 7
##   user_category   mean_attempts median_attempts sd_attempts min_attempts
##   <fct>                   <dbl>           <int>       <dbl>        <int>
## 1 SQL_only                 87.6              20        131.            1
## 2 SQL_plus_Python         246.              247        139.            1
## # ℹ 2 more variables: max_attempts <int>, n_users <int>
ggplot(sql_attempts_per_user, aes(x = user_category, y = sql_attempts, fill = user_category)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  labs(
    title = "SQL Attempts per User",
    x = "User Category",
    y = "Number of SQL Attempts"
  ) +
  theme_minimal()

SQL Attempts per User by App

Here we have similar information, but we the usage divided by the four “apps” in the SQL course. We also have the stastistical significance between the two groups, using Wilcox.

sql_attempts_by_app <- sql_subset %>%
  group_by(user, user_category, applabel) %>%
  summarise(
    attempts = n(),
    .groups = "drop"
  )

sql_app_summary <- sql_attempts_by_app %>%
  group_by(user_category, applabel) %>%
  summarise(
    mean_attempts   = mean(attempts),
    median_attempts = median(attempts),
    sd_attempts     = sd(attempts),
    min_attempts    = min(attempts),
    max_attempts    = max(attempts),
    n_users         = n(),
    .groups = "drop"
  )

sql_app_summary
## # A tibble: 8 × 8
##   user_category  applabel mean_attempts median_attempts sd_attempts min_attempts
##   <fct>          <fct>            <dbl>           <dbl>       <dbl>        <int>
## 1 SQL_only       DBQA              39.3             9          46.1            1
## 2 SQL_only       SQLKNOT           54.7            45          45.2            1
## 3 SQL_only       SQLTUTOR          41.6            10.5        52.7            1
## 4 SQL_only       WEBEX             24.5             9          31.0            1
## 5 SQL_plus_Pyth… DBQA              80.5            97          37.7            3
## 6 SQL_plus_Pyth… SQLKNOT          101.             95          59.5            3
## 7 SQL_plus_Pyth… SQLTUTOR          73.4            75.5        55.0            1
## 8 SQL_plus_Pyth… WEBEX             53.8            59          27.2            1
## # ℹ 2 more variables: max_attempts <int>, n_users <int>
ggplot(sql_attempts_by_app, aes(x = applabel, y = attempts, fill = user_category)) +
  geom_boxplot(alpha = 0.6, position = position_dodge(width = 0.8)) +
  labs(
    title = "SQL Attempts per User by Activity Type",
    x = "Activity Type (applabel)",
    y = "Number of Attempts"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

wilcox_results <- sql_attempts_by_app %>%
  group_by(applabel) %>%
  wilcox_test(attempts ~ user_category) %>%
  adjust_pvalue(method = "holm") %>% 
  add_significance()

wilcox_results
## # A tibble: 4 × 10
##   applabel .y.      group1   group2           n1    n2 statistic       p   p.adj
##   <fct>    <chr>    <chr>    <chr>         <int> <int>     <dbl>   <dbl>   <dbl>
## 1 DBQA     attempts SQL_only SQL_plus_Pyt…    51    35     443   7.77e-5 2.33e-4
## 2 SQLKNOT  attempts SQL_only SQL_plus_Pyt…    37    35     304.  1.14e-4 2.33e-4
## 3 SQLTUTOR attempts SQL_only SQL_plus_Pyt…    20    10      65.5 1.34e-1 1.34e-1
## 4 WEBEX    attempts SQL_only SQL_plus_Pyt…    77    37     662.  3.99e-6 1.60e-5
## # ℹ 1 more variable: p.adj.signif <chr>

Experimentation of Each SQL App

sql_engagement <- sql_subset %>%
  group_by(user, user_category, applabel) %>%
  summarise(attempted = n() > 0, .groups = "drop")

sql_engagement_summary <- sql_engagement %>%
  group_by(user_category, applabel) %>%
  summarise(
    n_users = n(),
    n_attempted = sum(attempted),
    prop_attempted = n_attempted / n_users,
    .groups = "drop"
  )

sql_engagement_summary
## # A tibble: 8 × 5
##   user_category   applabel n_users n_attempted prop_attempted
##   <fct>           <fct>      <int>       <int>          <dbl>
## 1 SQL_only        DBQA          51          51              1
## 2 SQL_only        SQLKNOT       37          37              1
## 3 SQL_only        SQLTUTOR      20          20              1
## 4 SQL_only        WEBEX         77          77              1
## 5 SQL_plus_Python DBQA          35          35              1
## 6 SQL_plus_Python SQLKNOT       35          35              1
## 7 SQL_plus_Python SQLTUTOR      10          10              1
## 8 SQL_plus_Python WEBEX         37          37              1

All the students are at least trying each type of activity, but the ones who also used the Python course went way deeper in their SQL usage!

Distribution of Students in Effort Categories

bare_min_sqlknot <- sql_subset %>%
  filter(applabel == "SQLKNOT") %>%
  distinct(user, user_category, topicname, activity_id) %>%
  group_by(user, user_category, topicname) %>%
  summarise(
    n_sqlknot = n(),
    .groups = "drop"
  ) %>%
  mutate(
    effort_category = case_when(
      n_sqlknot < 2 ~ "Below minimum",
      n_sqlknot == 2 ~ "Minimum",
      n_sqlknot >= 3 & n_sqlknot <= 4 ~ "Slightly above minimum",
      n_sqlknot >= 5 ~ "High effort"
    ),
    effort_category = factor(
      effort_category,
      levels = c("High effort", "Slightly above minimum", "Minimum", "Below minimum")
    )
  )

effort_summary <- bare_min_sqlknot %>%
  group_by(user_category, effort_category) %>%
  summarise(
    n_topics = n(),
    .groups = "drop"
  ) %>%
  group_by(user_category) %>%
  mutate(
    prop = n_topics / sum(n_topics)
  )

effort_summary
## # A tibble: 8 × 4
## # Groups:   user_category [2]
##   user_category   effort_category        n_topics   prop
##   <fct>           <fct>                     <int>  <dbl>
## 1 SQL_only        High effort                  62 0.328 
## 2 SQL_only        Slightly above minimum       30 0.159 
## 3 SQL_only        Minimum                      78 0.413 
## 4 SQL_only        Below minimum                19 0.101 
## 5 SQL_plus_Python High effort                  80 0.302 
## 6 SQL_plus_Python Slightly above minimum       30 0.113 
## 7 SQL_plus_Python Minimum                     144 0.543 
## 8 SQL_plus_Python Below minimum                11 0.0415
  • n_sqlknot < 2 ~ “Below minimum”;

  • n_sqlknot == 2 ~ “Minimum”;

  • n_sqlknot >= 3 & n_sqlknot <= 4 ~ “Slightly above minimum”;

  • n_sqlknot >= 5 ~ “High effort”.

Used n_distinct(activity_id)!

ggplot(effort_summary, aes(x = effort_category, y = prop, fill = user_category)) +
  geom_col(position = "dodge") +
  labs(
    title = "SQLKNOT Effort per Topic",
    x = "Effort Category",
    y = "Proportion of Topics"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()

ggplot(effort_summary, aes(x = user_category, y = prop, fill = effort_category)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Distribution of SQLKNOT Effort per Topic",
    x = "User Category",
    y = "Proportion of Topics",
    fill = "Effort Level"
  ) +
  theme_minimal()

effort_table <- table(bare_min_sqlknot$user_category, bare_min_sqlknot$effort_category)

effort_table
##                  
##                   High effort Slightly above minimum Minimum Below minimum
##   SQL_only                 62                     30      78            19
##   SQL_plus_Python          80                     30     144            11
chisq_test <- chisq.test(effort_table)
chisq_test
## 
##  Pearson's Chi-squared test
## 
## data:  effort_table
## X-squared = 11.64, df = 3, p-value = 0.008722
chisq_test$residuals
##                  
##                   High effort Slightly above minimum    Minimum Below minimum
##   SQL_only          0.3752911              1.0048480 -1.4998253     1.8424044
##   SQL_plus_Python  -0.3169394             -0.8486104  1.2666267    -1.5559403

Summary of total SQL Knot attempts per topic per student:

sqlknot_attempts_summary <- sql_subset %>%
  filter(applabel == "SQLKNOT") %>%
  group_by(user, user_category, topicname) %>%
  summarise(n_sqlknot = n(), .groups = "drop") %>%
  group_by(user_category) %>%
  summarise(
    mean_attempts = mean(n_sqlknot),
    median_attempts = median(n_sqlknot),
    sd_attempts = sd(n_sqlknot)
  )

sqlknot_attempts_summary
## # A tibble: 2 × 4
##   user_category   mean_attempts median_attempts sd_attempts
##   <fct>                   <dbl>           <int>       <dbl>
## 1 SQL_only                 10.7               7        11.7
## 2 SQL_plus_Python          13.4               8        17.6

Wilcox comparison of the SQL Knot attempts per topic - no significant differences.

sqlknot_per_topic_wilcox_test <- sql_subset %>%
  filter(applabel == "SQLKNOT") %>%
  group_by(user, user_category, topicname) %>%
  summarise(n_sqlknot = n(), .groups = "drop") %>%
  group_by(topicname) %>%
  wilcox_test(n_sqlknot ~ user_category) %>%
  adjust_pvalue(method = "holm") %>%
  add_significance()

sqlknot_per_topic_wilcox_test
## # A tibble: 8 × 10
##   topicname  .y.   group1 group2    n1    n2 statistic      p p.adj p.adj.signif
##   <fct>      <chr> <chr>  <chr>  <int> <int>     <dbl>  <dbl> <dbl> <chr>       
## 1 Aggregate… n_sq… SQL_o… SQL_p…    22    33      360. 0.958  1     ns          
## 2 GROUP-BY   n_sq… SQL_o… SQL_p…    22    33      234  0.0269 0.188 ns          
## 3 Join       n_sq… SQL_o… SQL_p…    15    32      317  0.0744 0.446 ns          
## 4 Left Join  n_sq… SQL_o… SQL_p…    19    32      183  0.0188 0.150 ns          
## 5 ORDER-BY   n_sq… SQL_o… SQL_p…    23    32      366  0.979  1     ns          
## 6 Pattern M… n_sq… SQL_o… SQL_p…    24    34      408. 1      1     ns          
## 7 SELECT-FR… n_sq… SQL_o… SQL_p…    37    35      604. 0.623  1     ns          
## 8 SELECT-FR… n_sq… SQL_o… SQL_p…    27    34      464  0.948  1     ns

Wilcox comparison of total SQL Knot attempts (not divided by topic) - significant differences!

sqlknot_total_attempts_wilcox_test <- sql_subset %>%
  filter(applabel == "SQLKNOT") %>%
  group_by(user, user_category) %>%
  summarise(n_sqlknot = n(), .groups = "drop") %>%
  wilcox_test(n_sqlknot ~ user_category) %>%
  add_significance()

sqlknot_total_attempts_wilcox_test
## # A tibble: 1 × 8
##   .y.       group1   group2             n1    n2 statistic        p p.signif
##   <chr>     <chr>    <chr>           <int> <int>     <dbl>    <dbl> <chr>   
## 1 n_sqlknot SQL_only SQL_plus_Python    37    35      304. 0.000114 ***

Wilcox comparison of SQL Knot attempts (not divided by topic, not counting multiple attempts in the same activity) - significant differences!

sqlknot_unique_activities_wilcox_test <- sql_subset %>%
  filter(applabel == "SQLKNOT") %>%
  group_by(user, user_category) %>%
  summarise(n_sqlknot = n_distinct(activity_id), .groups = "drop") %>%
  wilcox_test(n_sqlknot ~ user_category) %>%
  add_significance()

sqlknot_unique_activities_wilcox_test
## # A tibble: 1 × 8
##   .y.       group1   group2             n1    n2 statistic       p p.signif
##   <chr>     <chr>    <chr>           <int> <int>     <dbl>   <dbl> <chr>   
## 1 n_sqlknot SQL_only SQL_plus_Python    37    35      388. 0.00297 **

Effort Categories by Topic

topic_effort_summary <- bare_min_sqlknot %>%
  group_by(user, user_category, topicname, effort_category) %>%
  summarise(n_unique_activities = n_sqlknot, .groups = "drop") %>%
  group_by(user_category, topicname, effort_category) %>%
  summarise(n_topics = sum(n_unique_activities), .groups = "drop") %>%
  group_by(user_category, topicname) %>%
  mutate(prop = n_topics / sum(n_topics)) %>%
  ungroup()

topic_effort_summary
## # A tibble: 43 × 5
##    user_category topicname           effort_category        n_topics   prop
##    <fct>         <fct>               <fct>                     <int>  <dbl>
##  1 SQL_only      Aggregate Functions High effort                  70 0.805 
##  2 SQL_only      Aggregate Functions Slightly above minimum        3 0.0345
##  3 SQL_only      Aggregate Functions Minimum                      14 0.161 
##  4 SQL_only      GROUP-BY            Minimum                      36 0.9   
##  5 SQL_only      GROUP-BY            Below minimum                 4 0.1   
##  6 SQL_only      Join                Minimum                      30 1     
##  7 SQL_only      Left Join           Minimum                      26 0.812 
##  8 SQL_only      Left Join           Below minimum                 6 0.188 
##  9 SQL_only      ORDER-BY            Slightly above minimum       62 0.827 
## 10 SQL_only      ORDER-BY            Minimum                      12 0.16  
## # ℹ 33 more rows
ggplot(topic_effort_summary, 
       aes(x = topicname, y = prop, fill = effort_category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ user_category, ncol = 1, scales = "free_y") +
  scale_fill_manual(
    values = c(
      "High effort" = "#1b9e77",
      "Slightly above minimum" = "#d95f02",
      "Minimum" = "#7570b3",
      "Below minimum" = "#e7298a"
    ),
    drop = FALSE
  ) +
  scale_y_continuous(
    labels = percent_format(),
    expand = expansion(mult = c(0, 0.1))  # add 10% padding above bars
  ) +
  labs(
    x = "Topic",
    y = "Proportion of Activities",
    fill = "Effort Category",
    title = "Topic-specific SQLKNOT Effort by User Category"
  ) +
  coord_flip() +  # horizontal bars
  theme_minimal(base_size = 14) +
  theme(
    axis.text.y = element_text(size = 12),
    strip.text = element_text(size = 12),
    legend.position = "top"
  )

SQL_only = 1, SQL_plus_Python = 2.

Order of Topics in the SQL Course

I could not find a pattern between the order and the data in the previous graph.

topic_order_table <- sql_subset %>%
  distinct(topicorder, topicname) %>%
    filter(topicorder != -1) %>%
  arrange(topicorder) 

kable(topic_order_table, caption = "Mapping of Topic Order to Topic Name")
Mapping of Topic Order to Topic Name
topicorder topicname
1 SELECT-FROM
2 SELECT-FROM-WHERE
3 Pattern Matching
4 ORDER-BY
5 Aggregate Functions
6 GROUP-BY
7 Join
8 Left Join

Summary of Activities Distribution per Topic

It is worth noting that some topics (e.g. Join) only had 2 SQL Knot activities, the minimum required for extra credit.

topic_activity_summary <- sql_subset %>%
  group_by(topicname, applabel) %>%
  summarise(n_activities = n_distinct(activity_id), .groups = "drop") %>%
  pivot_wider(
    names_from = applabel,
    values_from = n_activities,
    values_fill = 0
  ) %>%
  mutate(Total = rowSums(across(where(is.numeric))))

kable(topic_activity_summary, caption = "Number of Distinct Activities per Topic by App Label")
Number of Distinct Activities per Topic by App Label
topicname DBQA SQLKNOT SQLTUTOR WEBEX Total
Aggregate Functions 4 5 4 4 17
GROUP-BY 1 2 2 2 7
Join 5 2 4 0 11
Left Join 4 2 0 1 7
ORDER-BY 4 4 6 1 15
Pattern Matching 4 5 2 1 12
SELECT-FROM 4 6 7 4 21
SELECT-FROM-WHERE 4 9 9 6 28

Binging Behavior

Activities per Session

binging_summary <- sql_subset %>%
  group_by(user, user_category, session, topicname) %>%
  summarise(
    n_activities_session = n_distinct(activity_id),
    .groups = "drop"
  ) %>%
  group_by(user, user_category) %>%
  summarise(
    max_activities_in_session = max(n_activities_session),
    mean_activities_in_session = mean(n_activities_session),
    .groups = "drop"
  )

binging_summary
## # A tibble: 114 × 4
##    user   user_category   max_activities_in_session mean_activities_in_session
##    <chr>  <fct>                               <int>                      <dbl>
##  1 tum0   SQL_only                                4                       3   
##  2 tum1   SQL_plus_Python                        28                       6.94
##  3 tum100 SQL_plus_Python                        13                       7.45
##  4 tum101 SQL_only                               17                       7.33
##  5 tum102 SQL_plus_Python                        14                       3.52
##  6 tum103 SQL_plus_Python                        19                       7.73
##  7 tum104 SQL_only                               19                       6.71
##  8 tum106 SQL_plus_Python                        23                       5.33
##  9 tum107 SQL_only                                9                       4.33
## 10 tum109 SQL_only                               10                       4.09
## # ℹ 104 more rows
ggplot(binging_summary, aes(x = user_category, y = max_activities_in_session, fill = user_category)) +
  geom_boxplot() +
  labs(
    x = "User Category",
    y = "Max Activities per Session",
    title = "Binge Behavior Across All Activities"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

binging_per_session <- sql_subset %>%
  group_by(user, user_category, session) %>%
  summarise(n_activities_session = n_distinct(activity_id), .groups = "drop")

ggplot(binging_per_session, aes(x = n_activities_session, fill = user_category)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Number of Activities per Session",
    y = "Density",
    title = "Distribution of Binge Behavior Across All Activities"
  ) +
  theme_minimal()

## Activities per Day

sql_subset <- sql_subset %>%
  mutate(date = as.Date(datestring))

binging_per_day <- sql_subset %>%
  group_by(user, user_category, date, topicname) %>%
  summarise(
    n_activities_day = n_distinct(activity_id),
    .groups = "drop"
  ) %>%
  group_by(user, user_category) %>%
  summarise(
    max_activities_in_day = max(n_activities_day),
    mean_activities_in_day = mean(n_activities_day),
    .groups = "drop"
  )

binging_per_day
## # A tibble: 114 × 4
##    user   user_category   max_activities_in_day mean_activities_in_day
##    <chr>  <fct>                           <int>                  <dbl>
##  1 tum0   SQL_only                           12                   4.5 
##  2 tum1   SQL_plus_Python                    28                   7.35
##  3 tum100 SQL_plus_Python                    13                   7.8 
##  4 tum101 SQL_only                           20                  11   
##  5 tum102 SQL_plus_Python                    19                   6.15
##  6 tum103 SQL_plus_Python                    19                   7.73
##  7 tum104 SQL_only                           19                   8.7 
##  8 tum106 SQL_plus_Python                    23                   6.47
##  9 tum107 SQL_only                            9                   4.33
## 10 tum109 SQL_only                           10                   4.5 
## # ℹ 104 more rows
ggplot(binging_per_day, aes(x = user_category, y = max_activities_in_day, fill = user_category)) +
  geom_boxplot() +
  labs(
    x = "User Category",
    y = "Max Activities per Day",
    title = "Binge Behavior Across All Activities (Daily)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(binging_per_day, aes(x = max_activities_in_day, fill = user_category)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Number of Activities per Day",
    y = "Density",
    title = "Distribution of Daily Binge Behavior"
  ) +
  theme_minimal()

ggplot(binging_per_day, aes(x = user_category, y = max_activities_in_day, fill = user_category)) +
  geom_boxplot() +
  labs(
    x = "User Category",
    y = "Max Activities per Day",
    title = "Binge Behavior Across All Activities (Daily)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

ggplot(binging_per_day, aes(x = max_activities_in_day, fill = user_category)) +
  geom_density(alpha = 0.5) +
  labs(
    x = "Number of Activities per Day",
    y = "Density",
    title = "Distribution of Daily Binge Behavior"
  ) +
  theme_minimal()

Activity over Time

Activities at Each Day

daily_activity_summary <- sql_subset %>%
  mutate(date = as.Date(datestring)) %>%
  group_by(user_category, user, date) %>%
  summarise(
    n_activities_user_day = n_distinct(activity_id),
    .groups = "drop"
  ) %>%
  group_by(user_category, date) %>%
  summarise(
    mean_activities = mean(n_activities_user_day),
    .groups = "drop"
  )

ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category)) +
  geom_line(size = 1) +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed") +
  scale_x_date(date_labels = "%b %d") +
  labs(
    x = "Date",
    y = "Mean Activities per User",
    title = "Daily Mean Activities by User Category",
    color = "User Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Activities at Each Day, with Confidence Interval

ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category, fill = user_category)) +
  geom_smooth(method = "loess", se = TRUE, linetype = "dashed", alpha = 0.2) +
  scale_x_date(date_labels = "%b %d") +
  labs(
    x = "Date",
    y = "Mean Activities per User (±95% CI)",
    title = "Daily Mean Activities with Confidence Interval",
    color = "User Category",
    fill = "User Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Activities at Each Day, with Standard Error Bars

daily_activity_summary <- sql_subset %>%
  mutate(date = as.Date(datestring)) %>%
  group_by(user_category, user, date) %>%
  summarise(
    n_activities_user_day = n_distinct(activity_id),
    .groups = "drop"
  ) %>%
  group_by(user_category, date) %>%
  summarise(
    mean_activities = mean(n_activities_user_day),
    se_activities = sd(n_activities_user_day) / sqrt(n()),
    .groups = "drop"
  )

ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category)) +
  geom_line(size = 1) +
  geom_point(size = 1.5, alpha = 0.6) +
  geom_errorbar(aes(ymin = mean_activities - se_activities,
                    ymax = mean_activities + se_activities),
                width = 0.3, alpha = 0.6) +
  scale_x_date(date_labels = "%b %d") +
  labs(
    x = "Date",
    y = "Mean Activities per User (±SE)",
    title = "Daily Mean Activities with Standard Error Bars",
    color = "User Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Activities at Each Day, by App

daily_activity_summary <- sql_subset %>%
  mutate(date = as.Date(datestring)) %>%
  group_by(user_category, applabel, user, date) %>%
  summarise(
    n_activities_user_day = n_distinct(activity_id),
    .groups = "drop"
  ) %>%
  group_by(user_category, applabel, date) %>%
  summarise(
    mean_activities = mean(n_activities_user_day),
    se_activities = sd(n_activities_user_day) / sqrt(n()),
    .groups = "drop"
  )

ggplot(daily_activity_summary, aes(x = date, y = mean_activities,
                                   color = user_category, group = user_category)) +
  geom_line(size = 1) +
  geom_point(size = 1.2, alpha = 0.6) +
  facet_wrap(~applabel, scales = "free_y") +
  scale_x_date(date_labels = "%b %d") +
  labs(
    x = "Date",
    y = "Mean Activities per User",
    title = "Daily Mean Activities by User Category and Activity Type",
    color = "User Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

Activities at Each Day, by App, with Confidence Interval

ggplot(daily_activity_summary, aes(x = date, y = mean_activities,
                                   color = user_category)) +
  geom_point(size = 1.2, alpha = 0.6) +
  geom_smooth(method = "loess", se = TRUE, linetype = "solid") +
  facet_wrap(~applabel, scales = "free_y") +
  scale_x_date(date_labels = "%b %d") +
  labs(
    x = "Date",
    y = "Mean Activities per User (±95% CI)",
    title = "Daily Mean Activities by User Category and Activity Type",
    color = "User Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )