After our initial exploration of the dataset, we found that our 116 students could be divided into three main groups: the ones who used only the SQL course (77), the ones who used only the Python course (2), and those who used both (37). We decided to explore the two main groups: students who stopped at the first course, (SQL) versus the ones who continued to the second one (SQL+Python). To understand what makes those two groups different, we analyzed the differences in the behavior of those students in what they have in common, the SQL activities.
Right from the start, we can observe a stark difference between the two groups. Students who proceeded to the Python course had a median of 247 attempts in the SQL course, while the median for those who did not proceed was only 20. Breaking the usage by “app” (four activity types in the SQL course), we could also observe statically significant differences between the two groups in every app but SQL Tutor. SQL Tutor’s also presented other anomalies, such as having its usage completely ceased by all the students early in the term - possibly an indicator of technical issues in the app. Furthermore, it is worth noticing that the “only SQL” has an extremely low usage of all the apps but SQL Knot, the one app which was made mandatory by the instructor for extra credits. Still, the data shows that all the students of both groups tried at least one activity of each app. Even in the mandatory SQL Knot app, more students from the “only SQL” category did not reach the minimum requirement of two activities per topic. However, surprisingly those students also had a highest rate of above minimum usage, with the “SQL+Python” students being the majority only in attempting exactly the required 2 activities per topic.
Regarding their behavior in the SQL course within sessions and over time, the two groups also present some differences. First of all, when looking at the number of activities attempted per section, the group that proceeded to Python reached higher maximums, while the students of the group “SQL Only” are mostly concentrated in sessions with few activities. Similarly, the students who proceeded to Python also attempted more activities and reached higher peaks when we look at their behavior by day. Perhaps surprisingly, the “SQL+Python” group is more active in the first and last months of the trimester, while the “only SQL” group is more active in the intermediate month. This behavior remains the same when we analyze it by app, but there are two details worth noticing. Even in its most active times, the “only SQL” group only surpasses the “SQL+Python” one in one app: SQL Knot, the mandatory app. This could point to a reminder by the instructor about the mandatory nature of these activities if the students are interested in extra credits. It could also mean that those less active students received partial grades and recognized their need for extra credits. Finally, as mentioned before, the SQL Knot has a unique pattern - its activity completely ceases after a couple of weeks, which could be an indicator of unreported technical issues with that type of activity.
User
User log-in identifier in the system.
Group
Non-essential information about which KT course was opened.
In the Temple course, all students opened both KT courses.
(Use this column as domain.)
Session
Web browser session ID counter per user.
Timebin
When the user opened during a session.
Larger numbers mean later in a given session.
Appid
UM2 DB reference number of the human-readable app name.
Parentname / Activityname / Targetname
Contextual columns to keep track of hierarchy in the log
(order in which a student attempts a given activity).
Applabel
Human-readable app code:
WEBEX (non-assessed)
SQL line-by-line explanations.
-1
for the Result column.DBQA (non-assessed)
Animated SQL example (intentional construction).
0
and 1
for
the number of lines read.SQLKnot (assessed)
PCRS equivalent in SQL (intentional behavioral).
SQLTutor (assessed)
QuizPET + PCEX Challenge equivalent in SQL, with additional
construction.
PCEX Example (non-assessed)
Worked Example in Python/Java with explanations.
-1
.PCEX Challenge (assessed)
Worked Example in Python/Java with distractors.
0
(incorrect) or 1
(correct).Animated Example (non-assessed)
Step-by-step tracing examples in Python/Java.
-1
.QuizPET (assessed)
Student submits correct/incorrect output after tracing code.
Parsons (assessed)
Jigsaw puzzle where students reorder code to get correct
output.
topicname
Order of the topic in the KT course.
Courseroder
Order of the activity among all cells in the entire KT course.
Topciname
KT topic name.
Attemptno
For a given (activityname, targetname,
parentname) tuple,
the number of student attempts.
Result
System result for a given (activityname, targetname,
parentname) tuple.
Datestring
When the activity was attempted.
data <- read.csv("./TempleMIS2502Spring2025_raw_activity_combined_20250822223624.csv", stringsAsFactors = TRUE)
data <- data %>%
filter(topicname != "sqllab") %>%
droplevels()
str(data)
## 'data.frame': 31805 obs. of 18 variables:
## $ user : Factor w/ 116 levels "tum0","tum1",..: 12 72 72 72 72 72 72 72 72 72 ...
## $ group : Factor w/ 2 levels "TempleMIS2502Spring2025",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ session : int 0 2 2 2 2 2 2 2 2 2 ...
## $ timebin : int 8 27 28 29 33 34 38 39 43 44 ...
## $ appid : int 3 3 3 3 3 3 53 53 3 3 ...
## $ applabel : Factor w/ 9 levels "ANIMATED_EXAMPLE",..: 9 9 9 9 9 9 2 2 9 9 ...
## $ activityname : Factor w/ 304 levels "","0","1","10",..: 2 2 24 3 2 3 288 288 2 3 ...
## $ targetname : Factor w/ 304 levels "","0","1","10",..: 2 2 24 3 2 3 288 288 2 3 ...
## $ parentname : Factor w/ 314 levels "","ae_adl_arithmetics2",..: 53 53 53 53 58 58 298 298 63 63 ...
## $ topicname : Factor w/ 23 levels "Aggregate Functions",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ courseorder : int 7 7 7 7 8 8 18 18 9 9 ...
## $ topicorder : int 1 1 1 1 1 1 1 1 1 1 ...
## $ attemptno : int 1 1 1 1 1 1 1 2 1 1 ...
## $ result : num -1 -1 -1 -1 -1 -1 0.5 1 -1 -1 ...
## $ datestring : Factor w/ 31795 levels "2025-01-29 20:01:53.335",..: 9 6252 6253 6254 6255 6256 6257 6258 6259 6260 ...
## $ unixtimestamp : int 1738253053 1739461896 1739461899 1739461903 1739461909 1739461913 1739461927 1739461940 1739461952 1739461954 ...
## $ durationseconds: num 0.01 2.93 4.21 2.63 3.72 ...
## $ concepts : Factor w/ 218 levels "","_dict_call_eq_assign_for_or_str_index_int_if_gt_functiondef",..: 7 7 7 210 166 210 214 214 193 210 ...
levels(data$group)
## [1] "TempleMIS2502Spring2025" "TempleMIS2502Spring2025Python"
sql_data <- droplevels(data[data$group == "TempleMIS2502Spring2025", ])
n_obs <- nrow(sql_data)
applabel_summary <- sql_data %>%
count(applabel) %>%
rename(`App Label` = applabel, Observations = n) %>%
bind_rows(tibble(`App Label` = "Total", Observations = n_obs))
kable(applabel_summary)
App Label | Observations |
---|---|
DBQA | 4823 |
SQLKNOT | 5572 |
SQLTUTOR | 1565 |
WEBEX | 3879 |
Total | 15839 |
python_data <- droplevels(data[data$group == "TempleMIS2502Spring2025Python", ])
n_obs <- nrow(python_data)
applabel_summary <- python_data %>%
group_by(applabel) %>%
summarise(
Observations = n(),
`Distinct Activities` = n_distinct(activityname) ) %>%
ungroup() %>%
rename(`App Label` = applabel) %>%
bind_rows(
tibble(
`App Label` = "Total",
Observations = n_obs,
`Distinct Activities` = n_distinct(python_data$activityname)
)
)
kable(applabel_summary)
App Label | Observations | Distinct Activities |
---|---|---|
ANIMATED_EXAMPLE | 5115 | 23 |
PARSONS | 5193 | 49 |
PCEX_CHALLENGE | 2312 | 53 |
PCEX_EXAMPLE | 2671 | 44 |
QUIZPET | 675 | 38 |
Total | 15966 | 207 |
sql_users <- unique(sql_data$user)
python_users <- unique(python_data$user)
common_users <- intersect(sql_users, python_users)
sql_only <- setdiff(sql_users, python_users)
python_only <- setdiff(python_users, sql_users)
user_summary <- tibble(
Category = c("Only in SQL data", "Only in Python data", "In both"),
Users = c(length(sql_only), length(python_only), length(common_users))
)
kable(user_summary, col.names = c("User Category", "Number of Users"))
User Category | Number of Users |
---|---|
Only in SQL data | 77 |
Only in Python data | 2 |
In both | 37 |
This section has only some preliminary analysis done by me before I defined what questions I wanted to explore. Please proceed to the next section for more relevant results.
user_activity_counts <- sql_data %>%
group_by(user) %>%
summarise(n_activities = n_distinct(activityname)) %>%
ungroup()
usage_distribution <- user_activity_counts %>%
count(n_activities) %>%
rename(`Number of Activities` = n_activities,
`Number of Users` = n)
ggplot(user_activity_counts, aes(x = n_activities)) +
geom_density(fill = "steelblue", alpha = 0.5) +
labs(
title = "Density of Distinct SQL Activities per User",
x = "Number of Distinct Activities",
y = "Density"
) +
theme_minimal()
user_activity_counts <- python_data %>%
group_by(user) %>%
summarise(n_activities = n_distinct(activityname)) %>%
ungroup()
usage_distribution <- user_activity_counts %>%
count(n_activities) %>%
rename(`Number of Activities` = n_activities,
`Number of Users` = n)
ggplot(user_activity_counts, aes(x = n_activities)) +
geom_density(fill = "steelblue", alpha = 0.5) +
labs(
title = "Density of Distinct Python Activities per User",
x = "Number of Distinct Activities",
y = "Density"
) +
theme_minimal()
Here, the each student was categorized into “SQL_only” or “SQL_plus_Python”, based on their logs.
I also added an activity_id column for later, using this mapping:
DBQA, SQLTUTOR, SQLKNOT: activityname;
WEBEX: parentname.
sql_users <- unique(sql_data$user)
python_users <- unique(python_data$user)
sql_only <- setdiff(sql_users, python_users)
sql_plus_python <- intersect(sql_users, python_users)
both_categories <- data.frame(
user = c(sql_only, sql_plus_python),
user_category = factor(
c(rep("SQL_only", length(sql_only)),
rep("SQL_plus_Python", length(sql_plus_python))),
levels = c("SQL_only", "SQL_plus_Python")
)
)
data_with_user_cat <- data %>%
inner_join(both_categories, by = "user")
sql_subset <- data_with_user_cat %>%
filter(group == "TempleMIS2502Spring2025") %>%
mutate(
activity_id = case_when(
applabel %in% c("DBQA", "SQLTUTOR", "SQLKNOT") ~ as.character(activityname),
applabel == "WEBEX" ~ as.character(parentname),
TRUE ~ as.character(activityname)
)
)
Here you can find the summary of number of attempts in the SQL activities per user, for users in both categories.
sql_attempts_per_user <- sql_subset %>%
group_by(user, user_category) %>%
summarise(
sql_attempts = n(),
.groups = "drop"
)
sql_attempts_summary <- sql_attempts_per_user %>%
group_by(user_category) %>%
summarise(
mean_attempts = mean(sql_attempts),
median_attempts = median(sql_attempts),
sd_attempts = sd(sql_attempts),
min_attempts = min(sql_attempts),
max_attempts = max(sql_attempts),
n_users = n(),
.groups = "drop"
)
sql_attempts_summary
## # A tibble: 2 × 7
## user_category mean_attempts median_attempts sd_attempts min_attempts
## <fct> <dbl> <int> <dbl> <int>
## 1 SQL_only 87.6 20 131. 1
## 2 SQL_plus_Python 246. 247 139. 1
## # ℹ 2 more variables: max_attempts <int>, n_users <int>
ggplot(sql_attempts_per_user, aes(x = user_category, y = sql_attempts, fill = user_category)) +
geom_boxplot(alpha = 0.6) +
geom_jitter(width = 0.2, alpha = 0.4) +
labs(
title = "SQL Attempts per User",
x = "User Category",
y = "Number of SQL Attempts"
) +
theme_minimal()
Here we have similar information, but we the usage divided by the four “apps” in the SQL course. We also have the stastistical significance between the two groups, using Wilcox.
sql_attempts_by_app <- sql_subset %>%
group_by(user, user_category, applabel) %>%
summarise(
attempts = n(),
.groups = "drop"
)
sql_app_summary <- sql_attempts_by_app %>%
group_by(user_category, applabel) %>%
summarise(
mean_attempts = mean(attempts),
median_attempts = median(attempts),
sd_attempts = sd(attempts),
min_attempts = min(attempts),
max_attempts = max(attempts),
n_users = n(),
.groups = "drop"
)
sql_app_summary
## # A tibble: 8 × 8
## user_category applabel mean_attempts median_attempts sd_attempts min_attempts
## <fct> <fct> <dbl> <dbl> <dbl> <int>
## 1 SQL_only DBQA 39.3 9 46.1 1
## 2 SQL_only SQLKNOT 54.7 45 45.2 1
## 3 SQL_only SQLTUTOR 41.6 10.5 52.7 1
## 4 SQL_only WEBEX 24.5 9 31.0 1
## 5 SQL_plus_Pyth… DBQA 80.5 97 37.7 3
## 6 SQL_plus_Pyth… SQLKNOT 101. 95 59.5 3
## 7 SQL_plus_Pyth… SQLTUTOR 73.4 75.5 55.0 1
## 8 SQL_plus_Pyth… WEBEX 53.8 59 27.2 1
## # ℹ 2 more variables: max_attempts <int>, n_users <int>
ggplot(sql_attempts_by_app, aes(x = applabel, y = attempts, fill = user_category)) +
geom_boxplot(alpha = 0.6, position = position_dodge(width = 0.8)) +
labs(
title = "SQL Attempts per User by Activity Type",
x = "Activity Type (applabel)",
y = "Number of Attempts"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
wilcox_results <- sql_attempts_by_app %>%
group_by(applabel) %>%
wilcox_test(attempts ~ user_category) %>%
adjust_pvalue(method = "holm") %>%
add_significance()
wilcox_results
## # A tibble: 4 × 10
## applabel .y. group1 group2 n1 n2 statistic p p.adj
## <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 DBQA attempts SQL_only SQL_plus_Pyt… 51 35 443 7.77e-5 2.33e-4
## 2 SQLKNOT attempts SQL_only SQL_plus_Pyt… 37 35 304. 1.14e-4 2.33e-4
## 3 SQLTUTOR attempts SQL_only SQL_plus_Pyt… 20 10 65.5 1.34e-1 1.34e-1
## 4 WEBEX attempts SQL_only SQL_plus_Pyt… 77 37 662. 3.99e-6 1.60e-5
## # ℹ 1 more variable: p.adj.signif <chr>
sql_engagement <- sql_subset %>%
group_by(user, user_category, applabel) %>%
summarise(attempted = n() > 0, .groups = "drop")
sql_engagement_summary <- sql_engagement %>%
group_by(user_category, applabel) %>%
summarise(
n_users = n(),
n_attempted = sum(attempted),
prop_attempted = n_attempted / n_users,
.groups = "drop"
)
sql_engagement_summary
## # A tibble: 8 × 5
## user_category applabel n_users n_attempted prop_attempted
## <fct> <fct> <int> <int> <dbl>
## 1 SQL_only DBQA 51 51 1
## 2 SQL_only SQLKNOT 37 37 1
## 3 SQL_only SQLTUTOR 20 20 1
## 4 SQL_only WEBEX 77 77 1
## 5 SQL_plus_Python DBQA 35 35 1
## 6 SQL_plus_Python SQLKNOT 35 35 1
## 7 SQL_plus_Python SQLTUTOR 10 10 1
## 8 SQL_plus_Python WEBEX 37 37 1
All the students are at least trying each type of activity, but the ones who also used the Python course went way deeper in their SQL usage!
bare_min_sqlknot <- sql_subset %>%
filter(applabel == "SQLKNOT") %>%
distinct(user, user_category, topicname, activity_id) %>%
group_by(user, user_category, topicname) %>%
summarise(
n_sqlknot = n(),
.groups = "drop"
) %>%
mutate(
effort_category = case_when(
n_sqlknot < 2 ~ "Below minimum",
n_sqlknot == 2 ~ "Minimum",
n_sqlknot >= 3 & n_sqlknot <= 4 ~ "Slightly above minimum",
n_sqlknot >= 5 ~ "High effort"
),
effort_category = factor(
effort_category,
levels = c("High effort", "Slightly above minimum", "Minimum", "Below minimum")
)
)
effort_summary <- bare_min_sqlknot %>%
group_by(user_category, effort_category) %>%
summarise(
n_topics = n(),
.groups = "drop"
) %>%
group_by(user_category) %>%
mutate(
prop = n_topics / sum(n_topics)
)
effort_summary
## # A tibble: 8 × 4
## # Groups: user_category [2]
## user_category effort_category n_topics prop
## <fct> <fct> <int> <dbl>
## 1 SQL_only High effort 62 0.328
## 2 SQL_only Slightly above minimum 30 0.159
## 3 SQL_only Minimum 78 0.413
## 4 SQL_only Below minimum 19 0.101
## 5 SQL_plus_Python High effort 80 0.302
## 6 SQL_plus_Python Slightly above minimum 30 0.113
## 7 SQL_plus_Python Minimum 144 0.543
## 8 SQL_plus_Python Below minimum 11 0.0415
n_sqlknot < 2 ~ “Below minimum”;
n_sqlknot == 2 ~ “Minimum”;
n_sqlknot >= 3 & n_sqlknot <= 4 ~ “Slightly above minimum”;
n_sqlknot >= 5 ~ “High effort”.
Used n_distinct(activity_id)!
ggplot(effort_summary, aes(x = effort_category, y = prop, fill = user_category)) +
geom_col(position = "dodge") +
labs(
title = "SQLKNOT Effort per Topic",
x = "Effort Category",
y = "Proportion of Topics"
) +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal()
ggplot(effort_summary, aes(x = user_category, y = prop, fill = effort_category)) +
geom_col(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Distribution of SQLKNOT Effort per Topic",
x = "User Category",
y = "Proportion of Topics",
fill = "Effort Level"
) +
theme_minimal()
effort_table <- table(bare_min_sqlknot$user_category, bare_min_sqlknot$effort_category)
effort_table
##
## High effort Slightly above minimum Minimum Below minimum
## SQL_only 62 30 78 19
## SQL_plus_Python 80 30 144 11
chisq_test <- chisq.test(effort_table)
chisq_test
##
## Pearson's Chi-squared test
##
## data: effort_table
## X-squared = 11.64, df = 3, p-value = 0.008722
chisq_test$residuals
##
## High effort Slightly above minimum Minimum Below minimum
## SQL_only 0.3752911 1.0048480 -1.4998253 1.8424044
## SQL_plus_Python -0.3169394 -0.8486104 1.2666267 -1.5559403
Summary of total SQL Knot attempts per topic per student:
sqlknot_attempts_summary <- sql_subset %>%
filter(applabel == "SQLKNOT") %>%
group_by(user, user_category, topicname) %>%
summarise(n_sqlknot = n(), .groups = "drop") %>%
group_by(user_category) %>%
summarise(
mean_attempts = mean(n_sqlknot),
median_attempts = median(n_sqlknot),
sd_attempts = sd(n_sqlknot)
)
sqlknot_attempts_summary
## # A tibble: 2 × 4
## user_category mean_attempts median_attempts sd_attempts
## <fct> <dbl> <int> <dbl>
## 1 SQL_only 10.7 7 11.7
## 2 SQL_plus_Python 13.4 8 17.6
Wilcox comparison of the SQL Knot attempts per topic - no significant differences.
sqlknot_per_topic_wilcox_test <- sql_subset %>%
filter(applabel == "SQLKNOT") %>%
group_by(user, user_category, topicname) %>%
summarise(n_sqlknot = n(), .groups = "drop") %>%
group_by(topicname) %>%
wilcox_test(n_sqlknot ~ user_category) %>%
adjust_pvalue(method = "holm") %>%
add_significance()
sqlknot_per_topic_wilcox_test
## # A tibble: 8 × 10
## topicname .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Aggregate… n_sq… SQL_o… SQL_p… 22 33 360. 0.958 1 ns
## 2 GROUP-BY n_sq… SQL_o… SQL_p… 22 33 234 0.0269 0.188 ns
## 3 Join n_sq… SQL_o… SQL_p… 15 32 317 0.0744 0.446 ns
## 4 Left Join n_sq… SQL_o… SQL_p… 19 32 183 0.0188 0.150 ns
## 5 ORDER-BY n_sq… SQL_o… SQL_p… 23 32 366 0.979 1 ns
## 6 Pattern M… n_sq… SQL_o… SQL_p… 24 34 408. 1 1 ns
## 7 SELECT-FR… n_sq… SQL_o… SQL_p… 37 35 604. 0.623 1 ns
## 8 SELECT-FR… n_sq… SQL_o… SQL_p… 27 34 464 0.948 1 ns
Wilcox comparison of total SQL Knot attempts (not divided by topic) - significant differences!
sqlknot_total_attempts_wilcox_test <- sql_subset %>%
filter(applabel == "SQLKNOT") %>%
group_by(user, user_category) %>%
summarise(n_sqlknot = n(), .groups = "drop") %>%
wilcox_test(n_sqlknot ~ user_category) %>%
add_significance()
sqlknot_total_attempts_wilcox_test
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic p p.signif
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
## 1 n_sqlknot SQL_only SQL_plus_Python 37 35 304. 0.000114 ***
Wilcox comparison of SQL Knot attempts (not divided by topic, not counting multiple attempts in the same activity) - significant differences!
sqlknot_unique_activities_wilcox_test <- sql_subset %>%
filter(applabel == "SQLKNOT") %>%
group_by(user, user_category) %>%
summarise(n_sqlknot = n_distinct(activity_id), .groups = "drop") %>%
wilcox_test(n_sqlknot ~ user_category) %>%
add_significance()
sqlknot_unique_activities_wilcox_test
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic p p.signif
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
## 1 n_sqlknot SQL_only SQL_plus_Python 37 35 388. 0.00297 **
topic_effort_summary <- bare_min_sqlknot %>%
group_by(user, user_category, topicname, effort_category) %>%
summarise(n_unique_activities = n_sqlknot, .groups = "drop") %>%
group_by(user_category, topicname, effort_category) %>%
summarise(n_topics = sum(n_unique_activities), .groups = "drop") %>%
group_by(user_category, topicname) %>%
mutate(prop = n_topics / sum(n_topics)) %>%
ungroup()
topic_effort_summary
## # A tibble: 43 × 5
## user_category topicname effort_category n_topics prop
## <fct> <fct> <fct> <int> <dbl>
## 1 SQL_only Aggregate Functions High effort 70 0.805
## 2 SQL_only Aggregate Functions Slightly above minimum 3 0.0345
## 3 SQL_only Aggregate Functions Minimum 14 0.161
## 4 SQL_only GROUP-BY Minimum 36 0.9
## 5 SQL_only GROUP-BY Below minimum 4 0.1
## 6 SQL_only Join Minimum 30 1
## 7 SQL_only Left Join Minimum 26 0.812
## 8 SQL_only Left Join Below minimum 6 0.188
## 9 SQL_only ORDER-BY Slightly above minimum 62 0.827
## 10 SQL_only ORDER-BY Minimum 12 0.16
## # ℹ 33 more rows
ggplot(topic_effort_summary,
aes(x = topicname, y = prop, fill = effort_category)) +
geom_bar(stat = "identity") +
facet_wrap(~ user_category, ncol = 1, scales = "free_y") +
scale_fill_manual(
values = c(
"High effort" = "#1b9e77",
"Slightly above minimum" = "#d95f02",
"Minimum" = "#7570b3",
"Below minimum" = "#e7298a"
),
drop = FALSE
) +
scale_y_continuous(
labels = percent_format(),
expand = expansion(mult = c(0, 0.1)) # add 10% padding above bars
) +
labs(
x = "Topic",
y = "Proportion of Activities",
fill = "Effort Category",
title = "Topic-specific SQLKNOT Effort by User Category"
) +
coord_flip() + # horizontal bars
theme_minimal(base_size = 14) +
theme(
axis.text.y = element_text(size = 12),
strip.text = element_text(size = 12),
legend.position = "top"
)
SQL_only = 1, SQL_plus_Python = 2.
I could not find a pattern between the order and the data in the previous graph.
topic_order_table <- sql_subset %>%
distinct(topicorder, topicname) %>%
filter(topicorder != -1) %>%
arrange(topicorder)
kable(topic_order_table, caption = "Mapping of Topic Order to Topic Name")
topicorder | topicname |
---|---|
1 | SELECT-FROM |
2 | SELECT-FROM-WHERE |
3 | Pattern Matching |
4 | ORDER-BY |
5 | Aggregate Functions |
6 | GROUP-BY |
7 | Join |
8 | Left Join |
It is worth noting that some topics (e.g. Join) only had 2 SQL Knot activities, the minimum required for extra credit.
topic_activity_summary <- sql_subset %>%
group_by(topicname, applabel) %>%
summarise(n_activities = n_distinct(activity_id), .groups = "drop") %>%
pivot_wider(
names_from = applabel,
values_from = n_activities,
values_fill = 0
) %>%
mutate(Total = rowSums(across(where(is.numeric))))
kable(topic_activity_summary, caption = "Number of Distinct Activities per Topic by App Label")
topicname | DBQA | SQLKNOT | SQLTUTOR | WEBEX | Total |
---|---|---|---|---|---|
Aggregate Functions | 4 | 5 | 4 | 4 | 17 |
GROUP-BY | 1 | 2 | 2 | 2 | 7 |
Join | 5 | 2 | 4 | 0 | 11 |
Left Join | 4 | 2 | 0 | 1 | 7 |
ORDER-BY | 4 | 4 | 6 | 1 | 15 |
Pattern Matching | 4 | 5 | 2 | 1 | 12 |
SELECT-FROM | 4 | 6 | 7 | 4 | 21 |
SELECT-FROM-WHERE | 4 | 9 | 9 | 6 | 28 |
binging_summary <- sql_subset %>%
group_by(user, user_category, session, topicname) %>%
summarise(
n_activities_session = n_distinct(activity_id),
.groups = "drop"
) %>%
group_by(user, user_category) %>%
summarise(
max_activities_in_session = max(n_activities_session),
mean_activities_in_session = mean(n_activities_session),
.groups = "drop"
)
binging_summary
## # A tibble: 114 × 4
## user user_category max_activities_in_session mean_activities_in_session
## <chr> <fct> <int> <dbl>
## 1 tum0 SQL_only 4 3
## 2 tum1 SQL_plus_Python 28 6.94
## 3 tum100 SQL_plus_Python 13 7.45
## 4 tum101 SQL_only 17 7.33
## 5 tum102 SQL_plus_Python 14 3.52
## 6 tum103 SQL_plus_Python 19 7.73
## 7 tum104 SQL_only 19 6.71
## 8 tum106 SQL_plus_Python 23 5.33
## 9 tum107 SQL_only 9 4.33
## 10 tum109 SQL_only 10 4.09
## # ℹ 104 more rows
ggplot(binging_summary, aes(x = user_category, y = max_activities_in_session, fill = user_category)) +
geom_boxplot() +
labs(
x = "User Category",
y = "Max Activities per Session",
title = "Binge Behavior Across All Activities"
) +
theme_minimal() +
theme(legend.position = "none")
binging_per_session <- sql_subset %>%
group_by(user, user_category, session) %>%
summarise(n_activities_session = n_distinct(activity_id), .groups = "drop")
ggplot(binging_per_session, aes(x = n_activities_session, fill = user_category)) +
geom_density(alpha = 0.5) +
labs(
x = "Number of Activities per Session",
y = "Density",
title = "Distribution of Binge Behavior Across All Activities"
) +
theme_minimal()
## Activities per Day
sql_subset <- sql_subset %>%
mutate(date = as.Date(datestring))
binging_per_day <- sql_subset %>%
group_by(user, user_category, date, topicname) %>%
summarise(
n_activities_day = n_distinct(activity_id),
.groups = "drop"
) %>%
group_by(user, user_category) %>%
summarise(
max_activities_in_day = max(n_activities_day),
mean_activities_in_day = mean(n_activities_day),
.groups = "drop"
)
binging_per_day
## # A tibble: 114 × 4
## user user_category max_activities_in_day mean_activities_in_day
## <chr> <fct> <int> <dbl>
## 1 tum0 SQL_only 12 4.5
## 2 tum1 SQL_plus_Python 28 7.35
## 3 tum100 SQL_plus_Python 13 7.8
## 4 tum101 SQL_only 20 11
## 5 tum102 SQL_plus_Python 19 6.15
## 6 tum103 SQL_plus_Python 19 7.73
## 7 tum104 SQL_only 19 8.7
## 8 tum106 SQL_plus_Python 23 6.47
## 9 tum107 SQL_only 9 4.33
## 10 tum109 SQL_only 10 4.5
## # ℹ 104 more rows
ggplot(binging_per_day, aes(x = user_category, y = max_activities_in_day, fill = user_category)) +
geom_boxplot() +
labs(
x = "User Category",
y = "Max Activities per Day",
title = "Binge Behavior Across All Activities (Daily)"
) +
theme_minimal() +
theme(legend.position = "none")
ggplot(binging_per_day, aes(x = max_activities_in_day, fill = user_category)) +
geom_density(alpha = 0.5) +
labs(
x = "Number of Activities per Day",
y = "Density",
title = "Distribution of Daily Binge Behavior"
) +
theme_minimal()
ggplot(binging_per_day, aes(x = user_category, y = max_activities_in_day, fill = user_category)) +
geom_boxplot() +
labs(
x = "User Category",
y = "Max Activities per Day",
title = "Binge Behavior Across All Activities (Daily)"
) +
theme_minimal() +
theme(legend.position = "none")
ggplot(binging_per_day, aes(x = max_activities_in_day, fill = user_category)) +
geom_density(alpha = 0.5) +
labs(
x = "Number of Activities per Day",
y = "Density",
title = "Distribution of Daily Binge Behavior"
) +
theme_minimal()
daily_activity_summary <- sql_subset %>%
mutate(date = as.Date(datestring)) %>%
group_by(user_category, user, date) %>%
summarise(
n_activities_user_day = n_distinct(activity_id),
.groups = "drop"
) %>%
group_by(user_category, date) %>%
summarise(
mean_activities = mean(n_activities_user_day),
.groups = "drop"
)
ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category)) +
geom_line(size = 1) +
geom_point(size = 1.5, alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed") +
scale_x_date(date_labels = "%b %d") +
labs(
x = "Date",
y = "Mean Activities per User",
title = "Daily Mean Activities by User Category",
color = "User Category"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category, fill = user_category)) +
geom_smooth(method = "loess", se = TRUE, linetype = "dashed", alpha = 0.2) +
scale_x_date(date_labels = "%b %d") +
labs(
x = "Date",
y = "Mean Activities per User (±95% CI)",
title = "Daily Mean Activities with Confidence Interval",
color = "User Category",
fill = "User Category"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
daily_activity_summary <- sql_subset %>%
mutate(date = as.Date(datestring)) %>%
group_by(user_category, user, date) %>%
summarise(
n_activities_user_day = n_distinct(activity_id),
.groups = "drop"
) %>%
group_by(user_category, date) %>%
summarise(
mean_activities = mean(n_activities_user_day),
se_activities = sd(n_activities_user_day) / sqrt(n()),
.groups = "drop"
)
ggplot(daily_activity_summary, aes(x = date, y = mean_activities, color = user_category)) +
geom_line(size = 1) +
geom_point(size = 1.5, alpha = 0.6) +
geom_errorbar(aes(ymin = mean_activities - se_activities,
ymax = mean_activities + se_activities),
width = 0.3, alpha = 0.6) +
scale_x_date(date_labels = "%b %d") +
labs(
x = "Date",
y = "Mean Activities per User (±SE)",
title = "Daily Mean Activities with Standard Error Bars",
color = "User Category"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
daily_activity_summary <- sql_subset %>%
mutate(date = as.Date(datestring)) %>%
group_by(user_category, applabel, user, date) %>%
summarise(
n_activities_user_day = n_distinct(activity_id),
.groups = "drop"
) %>%
group_by(user_category, applabel, date) %>%
summarise(
mean_activities = mean(n_activities_user_day),
se_activities = sd(n_activities_user_day) / sqrt(n()),
.groups = "drop"
)
ggplot(daily_activity_summary, aes(x = date, y = mean_activities,
color = user_category, group = user_category)) +
geom_line(size = 1) +
geom_point(size = 1.2, alpha = 0.6) +
facet_wrap(~applabel, scales = "free_y") +
scale_x_date(date_labels = "%b %d") +
labs(
x = "Date",
y = "Mean Activities per User",
title = "Daily Mean Activities by User Category and Activity Type",
color = "User Category"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)
ggplot(daily_activity_summary, aes(x = date, y = mean_activities,
color = user_category)) +
geom_point(size = 1.2, alpha = 0.6) +
geom_smooth(method = "loess", se = TRUE, linetype = "solid") +
facet_wrap(~applabel, scales = "free_y") +
scale_x_date(date_labels = "%b %d") +
labs(
x = "Date",
y = "Mean Activities per User (±95% CI)",
title = "Daily Mean Activities by User Category and Activity Type",
color = "User Category"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top"
)