Extracting indicators from Canvas data

Preparing the data

The following describes the code for extracting learning analytics indicators from data obtained from the learning management system Canvas. The dataset can be retrieved from the Canvas API. This dataset consists of a total of > 100 tables. An overview of all the tables can be found at https://portal.inshosteddata.com/docs. In the following, we assume that the data are stored as .parquet-files. Accordingly, the package arrow is needed to load these data. To transform the data, two packages are used: tidyverse and lubridate.

library(tidyverse)
library(lubridate)
library(arrow)
options(scipen=999)

path <- "your_path_here"

Loading the requests table

For the most part, we will work with the requests-table. This table includes all application server requests, which can be seen as the raw clickstreams of students interacting with the Canvas learning management system. In this case, we will work with a requests table made from a subset of courses in the fourth quartile of the academic year 2021-2022, which ended at June 24th, 2022.

requests <- read_parquet(paste0(path, "requests0.snappy.parquet"),
                     col_types = cols(user_id = col_character(),
                                      course_id = col_character(), 
                                      discussion_id = col_character(),
                                      assignment_id = col_character(),
                                      quiz_id = col_character()),
                     as_tibble = TRUE) %>% 
  na_if("\\N")

quartile <- 4
year <- 21
endcourse <- "2021-06-24 23:59:99"

Filtering the requests table

Now, we will first filter the requests table based on the information that we need. Some possible filters include:

Removing non-clicks (server pings)
Filtering a subset of courses
Filtering a subset of users
Filtering clicks made in a specific time range (e.g., two weeks before the course start up to two weeks after the final exam)

For the latter, additional information is needed about the course structure in the university, to identify the lecture and exam weeks. The yearly week numbers need be to manually mapped to the week numbers as used in the university’s academic calendar.

Below, you can find an example function that identifies the week labels according to the academic calendar: calculate_weeklabel(). Here the week labels are defined as follows: w0 indicates the two weeks prior to the course, w1 indicates the first lecture week, w2 indicates the second lecture week (and so on), ew1 indicates the first exam week, and ew3 indicates the two weeks after the exam period. The suffix ‘a’ indicates a holiday week (e.g., Christmas holiday). For example, w3a, is a break after the 3rd lecture week.

# weeklabels checked for Q3/Q4 2019-2020 and Q1/Q2/Q3/Q4 2020-2021 and Q4 2021-2022
calculate_weeklabel <- function(weekno, quartile, year){
weeklabel =
  case_when(((weekno == "15" & quartile == "3") | 
             (weekno == "26" & quartile == "4") |
             (weekno == "44" & quartile == "1") |
             (weekno == "03" & quartile == "2")) ~ "ew1",
            ((weekno == "16" & quartile == "3") | 
             (weekno == "27" & quartile == "4") |
             (weekno == "45" & quartile == "1") |
             (weekno == "04" & quartile == "2")) ~ "ew2",
            ((weekno %in% c("17", "18") & quartile == "3") | 
             (weekno %in% c("28", "29") & quartile == "4") |
             (weekno %in% c("46", "47") & quartile == "1") |
             (weekno %in% c("05", "06") & quartile == "2")) ~ "ew3",
            ((weekno %in% c("04", "05") & quartile == "3") | 
             (weekno %in% c("15", "16") & quartile == "4") |
             (weekno %in% c("34", "35") & quartile == "1") |
             (weekno %in% c("44", "45") & quartile == "2")) ~ "w0",
            ((weekno == "06" & quartile == "3") | 
             (weekno == "17" & quartile == "4") |
             (weekno == "36" & quartile == "1") |
             (weekno == "46" & quartile == "2")) ~ "w1",
            ((weekno == "07" & quartile == "3") | 
             (weekno == "18" & quartile == "4") |
             (weekno == "37" & quartile == "1") |
             (weekno == "47" & quartile == "2")) ~ "w2",
            ((weekno == "08" & quartile == "3" & year == 19) |
             (weekno == "09" & quartile == "3" & year == 20) |
             (weekno == "19" & quartile == "4") |
             (weekno == "38" & quartile == "1") |
             (weekno == "48" & quartile == "2")) ~ "w3",
            ((weekno == "09" & quartile == "3" & year == 19)) ~ "w3a",
            ((weekno == "08" & quartile == "3" & year == 20)) ~ "w2a",
            ((weekno == "10" & quartile == "3") | 
             (weekno == "20" & quartile == "4") |
             (weekno == "39" & quartile == "1") |
             (weekno == "49" & quartile == "2")) ~ "w4",
            ((weekno == "11" & quartile == "3") | 
             (weekno == "21" & quartile == "4") |
             (weekno == "40" & quartile == "1") |
             (weekno == "50" & quartile == "2")) ~"w5",
            ((weekno == "12" & quartile == "3") | 
             (weekno == "22" & quartile == "4") |
             (weekno == "41" & quartile == "1") |
             (weekno == "51" & quartile == "2")) ~ "w6",
            ((weekno %in% c("52", "53") & quartile == "2")) ~ "w6a",
            ((weekno == "13" & quartile == "3") | 
             (weekno == "23" & quartile == "4") |
             (weekno == "42" & quartile == "1") |
             (weekno == "01" & quartile == "2")) ~ "w7",
            ((weekno == "14" & quartile == "3") | 
             (weekno == "24" & quartile == "4") |
             (weekno == "43" & quartile == "1") |
             (weekno == "02" & quartile == "2")) ~ "w8",
            ((weekno == "25" & quartile == "4")) ~ "w9")
}

Below, we filter the requests table in two ways:

Removing clicks outside course pages
Removing non-clicks (server pings)
Filtering clicks made in a specific time range (e.g., two weeks before the course start up to two weeks after the final exam)

requests1 <- requests %>%
  #remove clicks outside course pages
  filter(!is.na(course_id)) %>%
  group_by(course_id, user_id) %>%
  #remove non-clicks (server pings)
  filter(web_application_action != "ping") %>%
  mutate(
    timestamp = as.POSIXlt(timestamp),
    quartile = quartile,
   ) %>%
    arrange(timestamp, .by_group = TRUE) %>%
  mutate(
    weekno = strftime(timestamp, "%V"),
    weeklabel = calculate_weeklabel(weekno, quartile, year)) %>%
  # remove clicks outside timeframe
  filter(!is.na(weeklabel))

Extending the requests table

The next step is to extend the clickstream data by adding add information on the session in which a server request was made. Within the requests-table there is already an indicator for session (session_id). Here, a new session is created when the user is prompted to log in to the learning management system again. However, with single-sign-on, there are usually only a few sessions, which might stretch over a couple of weeks. Accordingly, we define a learning session as follows:

A sequence of activities from a single user within the learning management system, without the user being inactive for more than 30 minutes.

requests2 <- requests1 %>%
  group_by(course_id, user_id) %>%
  mutate(
    diff_time = timestamp - lag(timestamp),
    # 30 minutes of idle time starts a new session
    session_start = (diff_time >= 1800),
    session_start = ifelse(row_number() == 1, 1,
                           ifelse(is.na(session_start), FALSE,
                                  session_start)),
    session_no = cumsum(session_start),
    time = strftime(timestamp,"%H:%M:%S"))

In addition, we want to add information about the user who made the request. Here we specifically want to add whether the user was enrolled as a teacher or student in the course, to be able to distinguish between student and teacher clicks. The information about student enrollments can be found in the enrollment_dim-table.

# check who is enrolled as student
enrollment_dim <- read_parquet(paste0(path, "enrollment_dim.snappy.parquet"), 
                         as_tibble = TRUE, col_types = cols(.default = "c")) %>% 
  select(course_id, user_id, type) %>%
  distinct() %>%
  group_by(course_id, user_id) %>%
  summarize(enrollment_type = paste(type, collapse = ",")) %>%
  ungroup()

# add enrollment to requests table
requests3 <- requests2 %>% 
    left_join(enrollment_dim, by = c("user_id", "course_id") )

Write the final requests file

Finally, we remove redundant columns and write the filtered and extended requests table to a new file.

#remove redundant columns
requests4 <- requests3 %>%
  select(timestamp, user_id, course_id, quiz_id,
         discussion_id, assignment_id, url, 
         web_application_controller, 
         web_application_action, session_id, 
         quartile:enrollment_type)

write.csv(requests4, paste0(path, "requests_ext.csv"), row.names = FALSE)

The data is now fully prepared and we can focus on extracting the indicators from the Canvas data. The feature extraction can be done per category, depending on the type of learning activities available in the course.

Creating session indicators

In this part, we extract indicators to the general log in behavior in the learning management system. This includes several summary statistics for the different sessions the users had during the course. A session is defined in the extended log file (see #Preparing the data).

The function below extracts the following session indicators, and may be customized depending on the researcher’s needs:

More general course indicators:

time between course publication and first log in-session
time between course publication and first opening of study guide
time between course publication and first opening of schedule
time between course publication and first opening of course information page
time between last lecture day and last log in-session (last Friday of course (23.59). Note negative values indicate that the course was not accessed after the last lecture day.)

Session indicators:

The number of sessions
The number of clicks per session
The total session time in minutes
The average session time in seconds
The standard deviation of session times in seconds
The average time between two sessions in seconds
The standard deviation of the time between two sessions in seconds
The maximum time between two sessions in seconds
The average start time of session

All session indicators are calculated over the full course, per week, and for the first half vs. second half of the course.

getsession_info <- function(requests_df, quartile, year, outfile){
  
  # calculate summary statistics per session
  sessioninfo <- requests_df %>%
    group_by(course_id) %>%
    mutate(
      start_course = min(timestamp),
      end_course = endcourse) %>%
    group_by(course_id, user_id, session_no) %>%
    summarize(
      n_clicks = n(),
      start_course = first(start_course), 
      end_course = first(end_course), 
      firsttime = (hour(first(timestamp)) + minute(first(timestamp))/60) - 6,
      firsttime = ifelse(firsttime < 6, firsttime + 18, firsttime - 6),
      # change first time of the day to numerical, shifted to start at 6 am
      # (6:00 = 0, 6:30 = 0.50, 22:00 = 16, 02:00 = 20)
      starttime = min(timestamp),
      endttime = max(timestamp),
      totaltime = endttime - starttime,
      totaltime_min = ifelse(!is.na(first(totaltime)), first(totaltime)/60, NA),
      interval_time = first(diff_time),
      schedule_time = first(timestamp[grepl("schedule", url)]),
      courseinfo_time = first(timestamp[grepl("course-information", url)]),
      studyguide_time = first(timestamp[web_application_action == "syllabus"])) %>%
    ungroup() %>%
    # at least 2 clicks per session
    filter(n_clicks >= 2)
  
  # Summarize the session statistics over the full course
  session_sum <- sessioninfo %>%
    group_by(course_id, user_id) %>%
    summarize(
      time_to_first_login = as.numeric(first(starttime) - 
                                         first(start_course)),
      time_to_first_schedule = as.numeric(first(schedule_time) - 
                                            first(start_course)),
      time_to_first_courseinfo = as.numeric(first(courseinfo_time) - 
                                              first(start_course)),
      time_to_first_studyguide = as.numeric(first(studyguide_time) - 
                                              first(start_course)),
      time_to_last_login = as.numeric(last(starttime) - 
                                        first(as.POSIXct(end_course))),
      n_clicks = sum(n_clicks, na.rm = T),
      n_sessions = n(),
      totalsessiontime_min = sum(totaltime_min, na.rm = T),
      m_sessiontime = as.numeric(mean(totaltime, na.rm = T)),
      sd_sessiontime = sd(totaltime, na.rm = T),
      m_intervaltime = mean(interval_time, na.rm = T),
      sd_intervaltime = sd(interval_time, na.rm = T),
      max_intervaltime = max(interval_time, na.rm = T),
      # map all times to one day, to 
      m_starttime = mean(firsttime, na.rm = T)
    ) %>%
    ungroup()
  
   # calculate summary statistics per session per week
  session_weekinfo <- requests_df %>%
    group_by(course_id, user_id, weeklabel, session_no) %>%
    summarize(
      n_clicks = n(),
      firsttime = (hour(first(timestamp)) + minute(first(timestamp))/60) - 6,
      firsttime = ifelse(firsttime < 6, firsttime + 18, firsttime - 6),
      starttime = min(timestamp),
      endttime = max(timestamp),
      totaltime = endttime - starttime,
      totaltime_min = ifelse(!is.na(totaltime), totaltime/60, NA),
      interval_time = first(diff_time)) %>%
    ungroup() %>%
    # at least 2 clicks per session
    filter(n_clicks >= 2)
  
  # Summarize the weekly session statistics over the full course
  session_sumweek <- session_weekinfo %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_clicks = sum(n_clicks, na.rm = T),
      n_sessions = n(),
      totalsessiontime_min = sum(totaltime_min, na.rm = T),
      m_sessiontime = as.numeric(mean(totaltime, na.rm = T)),
      sd_sessiontime = sd(totaltime, na.rm = T),
      m_intervaltime = mean(interval_time, na.rm = T),
      sd_intervaltime = sd(interval_time, na.rm = T),
      max_intervaltime = max(interval_time, na.rm = T),
      m_starttime = mean(firsttime, na.rm = T)
    ) %>%
    ungroup()
  
  # sum per first vs. second half of the course
  session_sumweekhalf <- session_weekinfo %>%
    mutate(
      half = case_when(
       (weeklabel %in% c("w1", "w2", "w3", "w4", "w3a") |
                       (weeklabel %in% c("w5") & quartile == 4)) ~ "half1",
       (weeklabel %in% c("w5", "w6", "w7", "w8", "w9", "w6a")) ~ "half2")) %>%
    filter(!is.na(half)) %>%
        group_by(course_id, user_id, half) %>%
        summarize(
          sd_sessiontime = sd(totaltime, na.rm = T),
          sd_intervaltime = sd(interval_time, na.rm = T)
    ) %>%
    pivot_wider(id_col = c(course_id, user_id), 
                names_from = half,
                values_from = c(sd_sessiontime, sd_intervaltime))
  
  # convert to datawide and merge indicators
  session_weekwide <- session_sumweek %>%
    pivot_wider(id_col = c(course_id, user_id), names_from = weeklabel,
                values_from = c(n_clicks, n_sessions, totalsessiontime_min, 
                                m_sessiontime, sd_sessiontime, m_intervaltime, 
                                sd_intervaltime,
                                max_intervaltime, m_starttime)) %>% 
    left_join(session_sum) %>%
    left_join(session_sumweekhalf) %>%
    mutate_at(vars(starts_with("n_"), 
                   starts_with("m_sessiontime"),
                   starts_with("totalsessiontime_min")),
              ~replace_na(., 0))
  
  # save session indicators to a separate file
  write.csv(session_weekwide, paste0(path, outfile),
            row.names = FALSE)
  
  session_weekwide
}

Creating assignment indicators

In this part, we extract indicators which relate to the assignments in the learning management system. For this, information is used from a variety of tables which store information about assignments in Canvas. Specifically, we use the assignment_dim and assignment_fact tables, which contain information on assignment characteristics such as its name, deadline and date published. In addition, we use the assignment_submission_fact and the assignment_submission_dim tables, as these contain specific information on student submissions for a specific assignment.

The function below extracts the following assignment indicators, and may be customized depending on the researcher’s needs:

Number of clicks in assignments
Number of unique submitted assignments
Number of unique submitted assignments after deadline (submissions submitted after the due_date)

All assignment indicators are calculated over the full course and per week of the course.

getassignment_info <- function(requests_df, quartile, year, outfile){
  requests4 <- requests_df
  
  # load assignments 
  assign_dim <- read_parquet(paste0(path, "assignment_dim.snappy.parquet"), 
                           as_tibble = TRUE,
                           col_types = cols(.default = "c")) %>%
    na_if("\\N")
  assign_fact <- read_parquet(paste0(path, "assignment_fact.snappy.parquet"), 
                            as_tibble = TRUE, 
                            col_types = cols(.default = "c")) %>%
    na_if("\\N") 
  assign_submission_fact <- read_parquet(paste0(path, 
                                              "submission_fact.snappy.parquet"), 
                                       as_tibble = TRUE, 
                                       col_types = cols(.default = "c")) %>%
    na_if("\\N")
  
  assign_submission_fact2 <- assign_submission_fact %>%
    #remove quizzes & wiki submissions
    filter(!is.na(quiz_id) | !is.na(wiki_id)) %>%
    mutate_at(c("score"), as.numeric)
  assign_submission_dim <- read_parquet(paste0(path, 
                                 "submission_dim.snappy.parquet"), 
                                      as_tibble = TRUE, 
                                 col_types = cols(.default = "c")) %>%
    na_if("\\N")
  
  ####### transform & merge tables #############################################
  # merge assignment_fact and assignment_dim
  assign_fact_dim <- assign_dim %>%
    # remove unpublished assignments
    filter(workflow_state != "unpublished") %>%
    left_join(assign_fact,by = c("id" = "assignment_id", "course_id",
                               "points_possible")) %>%
    select(-updated_at, -created_at, -workflow_state)
  
  
  # filter only interaction with assignments from requests
  requests_assignment <- requests4 %>%
    filter(!is.na(assignment_id)) %>%
    mutate(assignment_id = as.numeric(assignment_id),
           # get canvas_id from URL in requests
           assignment_canvas_id = gsub(".*/assignments/","", url),
           assignment_canvas_id = as.numeric(gsub("/.*", "", 
                                                  assignment_canvas_id)))
  
  # merge assignment submission dim and fact
  assign_subm_fact_dim <- assign_submission_dim %>%
    # only submissions after 2019
    filter(!is.na(submitted_at), submitted_at > "2020") %>%
    left_join(assign_submission_fact2, by = c("assignment_id", 
                                            "id" = "submission_id",
                                           "user_id")) %>%
    rename("submission_id" = "id")
  
  # merge submissions with assignment fact dim table
  assignment_submission <- assign_subm_fact_dim %>%
    select(-canvas_id) %>%
    left_join(assign_fact_dim, by = c("assignment_id"= "id","course_id"))

  
  # extract courses and add weeklabels
  assignment_submission2 <- assignment_submission %>%
    mutate(date_submission = as.POSIXlt(submitted_at),
           quartile = quartile,
           weekno = strftime(date_submission,"%V"),
           weeklabel = calculate_weeklabel(weekno, quartile, year))
  
  
  # remove clicks outside timeframe & filter courses
  assignment_submission3 <- assignment_submission2 %>%
    filter(!is.na(weeklabel), course_id %in% requests4$course_id)
  
  
####### Summarize assignment info ##############################################
  ### assignment CLICKS ONLY (via requests)
  # sum over full course
  assignment_sum <- requests_assignment %>%
    group_by(course_id, user_id) %>%
    summarize(
      n_assignmentclicks = n()
    ) %>%
    ungroup()
  
  # sum per week
  assignment_sumweek <- requests_assignment %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_assignmentclicks = n()
    ) %>%
    ungroup()
  
 ### assignment SUBMISSION INFO
  # create one submission entry per (unique) assignment per user per course
  assignment_submission4 <- assignment_submission3 %>%
    group_by(course_id, assignment_id, user_id) %>%
    arrange(as.POSIXct(submitted_at)) %>%
    summarize(
      firstattempt_date = min(submitted_at),
      weeklabel = first(weeklabel), # weeklabel of first attempt
      lastattempt_date = last(submitted_at),
      late = ifelse(!is.na(due_at), firstattempt_date > first(due_at), 0)
    ) %>%
    ungroup()
  
  # summarize assignment indicators for full course
  assignment_submission4_sum <- assignment_submission4 %>%
    group_by(course_id, user_id) %>%
    summarize(
      n_assignment = n(),
      n_late_assignment = sum(late)
    ) %>% ungroup()
  
  # summarize assignment indicators per week
  assignment_submission4_sumweek <- assignment_submission4 %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_assignment = n(),
      n_late_assignment = sum(late)
    ) %>% ungroup()
  
  # convert to datawide
  assignment_weekwide <- assignment_submission4_sumweek %>%
    full_join(assignment_sumweek) %>%
    pivot_wider(id_col = c(course_id, user_id), names_from = weeklabel,
                values_from = c(n_assignment, n_late_assignment, 
                                n_assignmentclicks)) %>% 
    left_join(assignment_submission4_sum, by = c("course_id", "user_id")) %>%
    left_join(assignment_sum, by = c("course_id", "user_id")) %>%
    mutate_at(vars(starts_with("n_")), ~replace_na(., 0))
  
  write.csv(assignment_weekwide, paste0(path, outfile),
            row.names = FALSE)
  
  
  assignment_weekwide
}

Creating quiz indicators

In this part, we extract indicators which relate to the quizzes in the learning management system. For this, information is used from a variety of tables which store information about assignments in Canvas. Specifically, we use the quiz_dim and quiz_fact tables, which contain information on quiz characteristics such as its name, and points possible. In addition, we use the quiz_submission_dim and the quiz_submission_fact tables, as these contain specific information on student submissions for a specific quiz.

The function below extracts the following quiz indicators, and may be customized depending on the researcher’s needs:

number of clicks in a quiz
number of unique submitted practice quizzes (quiz_type = practice_quiz/survey OR time_limit = -1)
number of unique submitted exam-relevant quizzes (quiz_type = assignment/graded_survey)
number of unique quizzes attempted more than once (using quizzes that allowed more than one attempt)
mean percentage of time taken exam-relevant quiz (divided by total allowed time)
mean time taken practice quiz (min)
mean percentage of correct quiz answers (for most successful attempts; note that multiple points may be earned per question)

All quiz indicators are calculated over the full course and per week of the course.

getquiz_info <- function(requests_df, quartile, year, outfile){

  requests4 <- requests_df

# load quizzes 
quiz_dim <- read_parquet(paste0(path, "quiz_dim.snappy.parquet"), 
                         as_tibble = TRUE, 
                         col_types = cols(.default = "c")) %>%
    na_if("\\N")
quiz_fact <- read_parquet(paste0(path, "quiz_fact.snappy.parquet"), 
                         as_tibble = TRUE,
                         col_types = cols(.default = "c"))  %>%
    na_if("\\N")
quiz_submission_fact <- read_parquet(paste0(path, 
                         "quiz_submission_fact.snappy.parquet"), 
                          as_tibble = TRUE,
                         col_types = cols(.default = "c")) %>%
    na_if("\\N") %>%
  mutate_at(c("total_attempts", "score", "quiz_points_possible"), as.numeric) 
quiz_submission_dim <- read_parquet(paste0(path, 
                                          "quiz_submission_dim.snappy.parquet"), 
                                     as_tibble = TRUE,
                                    col_types = cols(.default = "c")) %>%
    na_if("\\N")


####### transform & merge tables ###############################################
# merge quiz_fact and quiz_dim
quiz_fact_dim <- quiz_dim %>%
  # remove unpublished quizzes
  filter(workflow_state != "unpublished") %>%
  left_join(quiz_fact,by = c("id" = "quiz_id", "course_id","assignment_id",
                             "points_possible")) %>%
  select(-updated_at, -created_at, -due_at, -workflow_state)


# filter only interaction with quizzes from requests
requests_quiz <- requests4 %>%
  filter(!is.na(quiz_id)) %>%
  mutate(quiz_id = as.numeric(quiz_id),
         # get canvas_id from URL in requests
         quiz_canvas_id = gsub(".*/quizzes/","", url),
         quiz_canvas_id = as.numeric(gsub("/.*", "", quiz_canvas_id)))

# merge quiz submission dim and fact
quiz_subm_fact_dim <- quiz_submission_fact %>%
  filter(!is.na(date)) %>%
  select(-enrollment_term_id, -course_account_id, -assignment_id,
         -enrollment_rollup_id, -fudge_points)  %>%
  right_join(quiz_submission_dim, by = c("quiz_id", 
                                         "quiz_submission_id" = "id",
                                         "user_id","submission_id"))

# merge submissions with quiz fact dim table
quiz_submission <- quiz_subm_fact_dim %>%
  select(-canvas_id) %>%
  left_join(quiz_fact_dim, by = c("quiz_id"= "id","course_id"))

# extract courses and add weeklabels
quiz_submission2 <- quiz_submission %>%
  mutate(date_submission = as.POSIXlt(date),
         quartile = quartile,
         weekno = strftime(date_submission,"%V"),
         weeklabel = calculate_weeklabel(weekno, quartile, year))
         

# remove clicks outside timeframe & filter courses
quiz_submission3 <- quiz_submission2 %>%
  filter(!is.na(weeklabel), course_id %in% requests4$course_id) 

####### Summarize quiz info ####################################################
### QUIZ CLICKS ONLY (via requests)
# sum over full course
quiz_sum <- requests_quiz %>%
  group_by(course_id, user_id) %>%
  summarize(
    n_quizclicks = n()
  ) %>%
  ungroup()

# sum per week
quiz_sumweek <- requests_quiz %>%
  group_by(course_id, user_id, weeklabel) %>%
  summarize(
    n_quizclicks = n()
  ) %>%
  ungroup()

### ALL other quiz indicators via quiz_submissions

### QUIZ SUBMISSION INFO
# get max score per quiz 
## Note: points_possible is often 0, while score > 0
## max score can be higher than max points possible (and vice versa)
## so we take the highest score of both
quiz_submission_sum <- quiz_submission3 %>%
  group_by(course_id, quiz_id) %>%
  summarize(
    max_quiz_score = max(quiz_points_possible, score)
  ) %>%
  ungroup()

# create one submission entry per (unique) quiz per user per course
quiz_submission4 <- quiz_submission3 %>%
  left_join(quiz_submission_sum) %>%
  group_by(course_id, quiz_id, user_id) %>%
  arrange(as.POSIXct(date)) %>%
  summarize(
    firstattempt_date = first(date),
    weeklabel = first(weeklabel), # weeklabel of first attempt
    lastattempt_date = max(date),
    late = ifelse(!is.na(due_at), firstattempt_date > first(due_at), 0),
    quiz_type = first(quiz_type),
    total_duration = sum(as.numeric(time_taken)),
    first_duration = sum(as.numeric(time_taken)),
    m_duration = mean(as.numeric(time_taken)/60),
    nof_questions = first(question_count),
    time_limit = first(as.numeric(time_limit)),
    perc_time_taken = ifelse(time_limit > 0, m_duration/time_limit, NA),
    allowed_attempts = first(allowed_attempts),
    total_attempts = first(total_attempts),
    max_score_perc = max(score)/first(max_quiz_score),
    mean_score_perc = mean(score)/first(max_quiz_score)
  ) %>%
  ungroup()

# summarize quiz indicators for full course
quiz_submission4_sum <- quiz_submission4 %>%
  group_by(course_id, user_id) %>%
  summarize(
    n_practicequiz = sum(quiz_type == "practice_quiz" | time_limit == -1),
    n_examquiz = sum(quiz_type %in% c("assignment", "graded_survey") 
                     & time_limit != -1),
    n_quiz_retry = sum(total_attempts > 1 & allowed_attempts > 1),
    m_examquiz_duration_perc = ifelse(n_examquiz > 0, mean(perc_time_taken, 
                                                           na.rm = T), NA),
    m_practicequiz_duration = mean(m_duration, na.rm = T),
    max_quizscore_perc = mean(max_score_perc, na.rm = T)
  ) %>% ungroup()

# summarize quiz indicators per week
quiz_submission4_sumweek <- quiz_submission4 %>%
  group_by(course_id, user_id, weeklabel) %>%
  summarize(
    n_practicequiz = sum(quiz_type == "practice_quiz" | time_limit == -1),
    n_examquiz = sum(quiz_type %in% c("assignment", "graded_survey") 
                     & time_limit != -1),
    n_quiz_retry = sum(total_attempts > 1 & allowed_attempts > 1),
    m_examquiz_duration_perc = ifelse(n_examquiz > 0, mean(perc_time_taken, 
                                                           na.rm = T),  NA),
    m_practicequiz_duration = mean(m_duration, na.rm = T),
    max_quizscore_perc = mean(max_score_perc, na.rm = T)
  )%>% ungroup()

# convert to datawide
quiz_weekwide <- quiz_submission4_sumweek %>%
  full_join(quiz_sumweek) %>%
  pivot_wider(id_col = c(course_id, user_id), names_from = weeklabel,
              values_from = c(n_practicequiz, n_examquiz, n_quiz_retry,
                              m_examquiz_duration_perc, m_practicequiz_duration,
                              max_quizscore_perc, 
                              n_quizclicks)) %>% 
  left_join(quiz_submission4_sum, by = c("course_id", "user_id")) %>%
  left_join(quiz_sum, by = c("course_id", "user_id")) %>%
  mutate_at(vars(starts_with("n_")), ~replace_na(., 0))

write.csv(quiz_weekwide, paste0(path, outfile),
          row.names = FALSE)

quiz_weekwide
}

Creating discussion forum indicators

In this part, we extract indicators which relate to the discussion forum in the learning management system. For this, information is used from a variety of tables which store information about discussions in Canvas. Specifically, we use the discussion_topic_dim and discussion_topic_fact tables, which contain information on the discussion topics. In addition, we use the discussion_entry_fact table, as this table contains specific information on student replies to a specific discussion topic.

The function below extracts the following discussion forum indicators, and may be customized depending on the researcher’s needs:

Number of forum clicks
Number of announcement clicks
Number of forum topics posts
Number of forum reply posts

All discussion forum indicators are calculated over the full course and per week of the course.

getforum_info <- function(requests_df, quartile, year, outfile){

  requests4 <- requests_df
  
  # load discussion tables 
  discussion_entry_fact <- read_parquet(paste0(path, 
                                      "discussion_entry_fact.snappy.parquet"), 
                           as_tibble = TRUE, 
                           col_types = cols(.default = "c")) %>%
    na_if("\\N") %>%
    filter(!is.na(course_id)) 
  discussion_topic_fact <- read_parquet(paste0(path, 
                                        "discussion_topic_fact.snappy.parquet"), 
                           as_tibble = TRUE, 
                           col_types = cols(.default = "c")) %>%
    na_if("\\N") %>%
    # remove discussion in groups
    filter(!is.na(course_id)) 
  discussion_topic_dim <- read_parquet(paste0(path, 
                                        "discussion_topic_dim.snappy.parquet"), 
                           as_tibble = TRUE, 
                           col_types = cols(.default = "c")) %>%
    na_if("\\N") %>%
    filter(!is.na(course_id)) 
  
  
  ####### transform & merge tables #############################################
  
  # merge discussion_topic dim &  fact  
  disc_topics <- discussion_topic_dim %>%
    left_join(discussion_topic_fact, by = c("id" = "discussion_topic_id",
                                            "course_id", "group_id")) 
  
  # filter courses and add weeklabels
  disc_topics2 <- disc_topics %>%
    filter(!is.na(posted_at)) %>%
    mutate(date_submission = as.POSIXlt(posted_at),
           quartile = quartile,
           weekno = strftime(date_submission,"%V"),
           weeklabel = calculate_weeklabel(weekno, quartile, year))
  
  # remove clicks outside timeframe & filter courses
  disc_topics3 <- disc_topics2 %>%
    filter(!is.na(weeklabel), course_id %in% requests4$course_id)
    
    
  #merge discussion entry fact
  disc_entries <- disc_topics %>%
    select(-user_id) %>%
    right_join(discussion_entry_fact, by = c("id" = "topic_id", "course_id")) 
  
  
  # filter courses and add weeklabels
  disc_entries2 <- disc_entries %>%
    filter(!is.na(posted_at)) %>%
    mutate(date_submission = as.POSIXlt(posted_at),
           quartile = quartile,
           weekno = strftime(date_submission,"%V"),
           weeklabel = calculate_weeklabel(weekno, quartile, year))
  
  # remove clicks outside timeframe & filter courses
  disc_entries3 <- disc_entries2 %>%
    filter(!is.na(weeklabel), course_id %in% requests4$course_id)
  

  # merge with discussion topic dim to get discussion type
  requests_discussions <- requests4 %>%
    filter(!is.na(discussion_id)) %>% 
    left_join(discussion_topic_dim, by = c("discussion_id" = "id", "course_id"))
  
  ####### Summarize discussion info ############################################
  ### Discussion CLICKS ONLY (via requests)
  # sum over full course
  disc_sum <- requests_discussions %>%
    group_by(course_id, user_id) %>%
    summarize(
      n_forumclicks = sum(is.na(type)),
      n_announcementclicks = sum(type == "Announcement")
    ) %>%
    ungroup()
  
  # sum per week
  disc_sumweek <- requests_discussions %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_forumclicks = sum(is.na(type)),
      n_announcementclicks = sum(type == "Announcement")
    ) %>%
    ungroup()
  
  ### ALL other discussion indicators via discussion_submissions
  
  # post in topics
  topic_sum <- disc_topics3 %>%
    group_by(course_id, user_id) %>%
    summarize(
      n_forumtopic_post = n_distinct(id)
    ) %>% ungroup()
  
  # summarize assignment indicators per week
  topic_sumweek <- disc_topics3 %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_forumtopic_post = n_distinct(id)
    ) %>% ungroup()
  
  # post in topics
  entry_sum <- disc_entries3 %>%
    group_by(course_id, user_id) %>%
    summarize(
      n_forumreply_post = n_distinct(id)
    ) %>% ungroup()
  
  # summarize assignment indicators per week
  entry_sumweek <- disc_entries3 %>%
    group_by(course_id, user_id, weeklabel) %>%
    summarize(
      n_forumreply_post = n_distinct(id)
    ) %>% ungroup()
  
  
  # convert to datawide
  disc_weekwide <- topic_sumweek %>%
    full_join(entry_sumweek) %>%
    full_join(disc_sumweek) %>%
    pivot_wider(id_col = c(course_id, user_id), names_from = weeklabel,
                values_from = c(n_forumreply_post, n_forumtopic_post, 
                                n_forumclicks, n_announcementclicks)) %>% 
    left_join(topic_sum, by = c("course_id", "user_id")) %>%
    left_join(entry_sum, by = c("course_id", "user_id")) %>%
    left_join(disc_sum, by = c("course_id", "user_id")) %>%
    mutate_at(vars(starts_with("n_")), ~replace_na(., 0)) %>%
    filter(!is.na(user_id))
  
  write.csv(disc_weekwide, paste0(path, outfile),
            row.names = FALSE)
  
  disc_weekwide
  
  
}

Creating file and video indicators

In this part, we extract indicators which relate to the files in the learning management system. For this, information is used from a a single table which includes specific information on files in Canvas: file_dim. Note that the code discusses all file types, including videos, pictures and presentations. In case one wants to exclude any file type for generating more specific indicators, other filter operations can be applied to the requests_files2 table specifying which file types to include or exclude.

The function below extracts the following file indicators, and may be customized depending on the researcher’s needs:

number of clicks on files
number of unique file accessed (including inline views & downloads)
number of unique file downloads (via ‘download’ in url requests)
number of clicks on video files (mediafiles directly uploaded on canvas with mp4 type of extension (file_type in file_dim))
number of unique accessed video files
number of clicks on livestreamed videos (media accessed via bigbluebutton, Canvas conferences, or panopto)
number of unique accessed livestreamed videos (canvas conferences)

All file indicators are calculated over the full course and per week of the course.

getfiles_info <- function(requests_df, quartile, year, outfile){

  requests4 <- requests_df
  
  
  # load files 
  file_dim <- read_parquet(paste0(path, "file_dim.snappy.parquet"), 
                             as_tibble = TRUE, 
                           col_types = cols(.default = "c")) %>%
    na_if("\\N")
  
####### transform & merge tables ###############################################

file_dim2 <- file_dim %>%
    mutate(course_id = as.numeric(gsub("^7542", "", course_id))) %>%
    filter(course_id %in% requests4$course_id) %>%
    mutate(file_type = gsub("\\/.*", "", content_type),
           file_type = ifelse(file_type == "binary"| is.na(file_type), 
                              "unknown",
                              file_type),
           file_id = as.numeric(gsub("^7542", "", id)))


requests_files <- requests4 %>%
  filter(grepl("files/", url)) %>%
  mutate(file_id = gsub(".*/files/", "", url),
         file_id = as.numeric(gsub("/.*|\\?.*", "", file_id))) %>%
  filter(!is.na(file_id))

#merge with file_dim for file_types
requests_files2 <- requests_files %>%
  left_join(select(file_dim2, file_id, file_type, display_name)) 

# roughly file types include: video (.mp4), text (.docx, csv), 
# application (pdf, pptx, zip), image (jpg, png), unknown (.sav, .R, .do, .dta)

# extract video views in canvas conferences
requests_conferences <- requests4 %>%
  filter(grepl("conferences/|panopto", url)) %>%
  mutate(conference_id = gsub(".*/conferences/", "", url),
         conference_id = as.numeric(gsub("/.*|\\?.*", "", conference_id)))

####### Summarize file info ####################################################

# get unique access
# create one entry per (unique) file per user per course
files2 <- requests_files2 %>%
  group_by(course_id, file_id, user_id) %>%
  arrange(as.POSIXct(timestamp)) %>%
  summarize(
    n_fileclicks = n(),
    weeklabel = first(weeklabel), # weeklabel of first access
    downloaded = sum(grepl("download", url)) > 0 ,
    video = first(file_type) == "video"
  ) %>%
  ungroup()

# summarize files indicators for full course
files_sum <- files2 %>%
  group_by(course_id, user_id) %>%
  summarize(
    n_fileclicks = sum(n_fileclicks, na.rm = T),
    n_fileaccess = n(),
    n_video = sum(video),
    n_videoclicks = sum(n_fileclicks[video], na.rm = T),
    n_filedownload = sum(downloaded)
  ) %>% ungroup()

# summarize assignment indicators per week
files_sumweek <- files2 %>%
  group_by(course_id, user_id, weeklabel) %>%
  summarize(
    n_fileclicks = sum(n_fileclicks, na.rm = T),
    n_fileaccess = n(),
    n_videofile = sum(video),
    n_videofileclicks = sum(n_fileclicks[video], na.rm = T),
    n_filedownload = sum(downloaded)
  ) %>% ungroup()

# get unique access
# create one entry per (unique) video per user per course
conferences2 <- requests_conferences %>%
  group_by(course_id, url, user_id) %>%
  arrange(as.POSIXct(timestamp)) %>%
  summarize(
    n_confclicks = n(),
    weeklabel = first(weeklabel), # weeklabel of first access
  ) %>%
  ungroup()


# summarize video indicators for full course
video_sum <- conferences2 %>%
  group_by(course_id, user_id) %>%
  summarize(
    n_videoclicks_total = sum(n_confclicks, na.rm = T),
    n_video_total = n()
  ) %>% ungroup()

# summarize video indicators per week
video_sumweek <- conferences2 %>%
  group_by(course_id, user_id, weeklabel) %>%
  summarize(
    n_videoclicks = sum(n_confclicks, na.rm = T),
    n_video = n()
  ) %>% ungroup()


# convert to datawide
files_weekwide <- files_sumweek %>%
  full_join(video_sumweek) %>%
  pivot_wider(id_col = c(course_id, user_id), names_from = weeklabel,
              values_from = c(n_fileclicks, n_fileaccess, 
                              n_filedownload, n_videofileclicks, n_videofile,
                              n_videoclicks, n_video)) %>% 
  full_join(files_sum) %>%
  full_join(video_sum) %>%
  mutate_at(vars(starts_with("n_")), ~replace_na(., 0))

write.csv(files_weekwide, paste0(path, outfile),
          row.names = FALSE)

files_weekwide

}

Merging all Canvas indicators

With the code below, you can run all separate functions described above to extract the Canvas indicators per category.

session_weekwide <- getsession_info(requests4, quartile, year, 
                                    "session_indicators.csv")
assignment_weekwide <- getassignment_info(requests4, quartile, year, 
                                          "assignment_indicators.csv")
quiz_weekwide <- getquiz_info(requests4, quartile, year, 
                              "quiz_indicators.csv")
disc_weekwide <- getforum_info(requests4, quartile, year, 
                               "discussion_indicators.csv")
files_weekwide <- getfiles_info(requests4, quartile, year, 
                                "file_indicators.csv")

Thereafter, you can merge all Canvas indicators into one large data-wide csv file. This file contains one row per student per course, for all indicators.

# merge all canvas indicators
canvasall <- session_weekwide %>%
  left_join(assignment_weekwide,  by = c("course_id", "user_id")) %>%
  left_join(quiz_weekwide,  by = c("course_id", "user_id")) %>%
  left_join(disc_weekwide,  by = c("course_id", "user_id")) %>%
  left_join(files_weekwide,  by = c("course_id", "user_id")) 

write.csv(canvasall, paste0(path, "CanvasIndicators.csv"))

This csv file can now be combined with additional data sources, such as grades or survey data and can be used for your further analyses.