Analysis of Student Engagement and Performance in Online Learning

Author

Siyu Long

Published

December 12, 2023

Overview

Introduction

This data analysis project explores student engagement and performance within the Open University’s Virtual Learning Environment (VLE). The onset of the COVID-19 pandemic has accelerated the adoption of online learning platforms, making it crucial to understand how students interact with digital learning materials and how these interactions affect their academic outcomes.

Research Questions

This data analysis is driven by the following questions:

  1. What is the relationship between students’ interactions with the VLE and their assessment scores?

  2. How do these interactions vary across different courses or modules?

  3. What demographic factors influence students’ academic outcomes in online courses?

Data Source

The analysis is based on data from the Open University Learning Analytics Dataset (OULAD). It includes information about courses, assessments, VLE interactions, student demographics, and registration details. These datasets collectively offer a comprehensive view of the student learning journey in an online setting.

Anticipated Significance

This analysis aims to provide valuable insights for online course designers and instructors. By understanding the dynamics of student engagement and performance, educational stakeholders can tailor their content delivery and support mechanisms to enhance learning outcomes.

Prepare

In this section of the analysis, I prepare the datasets for in-depth exploration and analysis. This involves understanding the scope of the data, the context in which it was collected, and the relevance of each variable to the research questions.

Purpose of the Data Product

The primary purpose of this data product is to explore student engagement and performance within the Open University’s Virtual Learning Environment (VLE) by answering key research questions that revolve around the impact of VLE interactions on student assessments.

Data Sources and Context

The analysis utilizes data from the Open University Learning Analytics Dataset (OULAD). Below, I load and take an initial look at each dataset.

---

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 206 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, assessment_type
dbl (3): id_assessment, date, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 22 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (1): module_presentation_length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 173912 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): id_assessment, id_student, date_submitted, is_banked, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 32593 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): code_module, code_presentation, gender, region, highest_education, ...
dbl (3): id_student, num_of_prev_attempts, studied_credits

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 32593 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (3): id_student, date_registration, date_unregistration

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10655280 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (4): id_student, id_site, date, sum_click

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 6364 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, activity_type
dbl (3): id_site, week_from, week_to

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
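
The loading calls themselves are not shown in the rendered output above. As a minimal sketch of how one table could be read without triggering the column specification message, the example below writes a tiny stand-in file and reads it with explicit column types. The temporary file and the name `courses_demo` are assumptions made for illustration and are not the author's actual paths or objects.

```r
library(readr)

# Hypothetical stand-in for the courses table: three columns, two rows
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("code_module,code_presentation,module_presentation_length",
    "AAA,2013J,268",
    "AAA,2014J,269"),
  demo_path
)

# Declaring the column types up front documents the expected schema
# and suppresses the "Column specification" message
courses_demo <- read_csv(
  demo_path,
  col_types = cols(
    code_module = col_character(),
    code_presentation = col_character(),
    module_presentation_length = col_double()
  )
)

courses_demo
```

The same pattern could be repeated for each of the seven tables, with the `cols()` specification copied from the messages printed above.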

Variables of Interest

To address the research questions, I will focus on several key variables across these datasets. Let’s briefly explore these variables.

Code
glimpse(assessments)
Rows: 206
Columns: 6
$ code_module       <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AA…
$ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "2013J"…
$ id_assessment     <dbl> 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760…
$ assessment_type   <chr> "TMA", "TMA", "TMA", "TMA", "TMA", "Exam", "TMA", "T…
$ date              <dbl> 19, 54, 117, 166, 215, NA, 19, 54, 117, 166, 215, NA…
$ weight            <dbl> 10, 20, 20, 20, 30, 100, 10, 20, 20, 20, 30, 100, 1,…
Code
glimpse(courses)
Rows: 22
Columns: 3
$ code_module                <chr> "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "…
$ code_presentation          <chr> "2013J", "2014J", "2013J", "2014J", "2013B"…
$ module_presentation_length <dbl> 268, 269, 268, 262, 240, 234, 269, 241, 261…
Code
glimpse(student_assessment)
Rows: 173,912
Columns: 5
$ id_assessment  <dbl> 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1…
$ id_student     <dbl> 11391, 28400, 31604, 32885, 38053, 45462, 45642, 52130,…
$ date_submitted <dbl> 18, 22, 17, 26, 19, 20, 18, 19, 9, 18, 19, 18, 17, 19, …
$ is_banked      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ score          <dbl> 78, 70, 72, 69, 79, 70, 72, 72, 71, 68, 73, 67, 73, 83,…
Code
glimpse(student_info)
Rows: 32,593
Columns: 12
$ code_module          <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", …
$ code_presentation    <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "201…
$ id_student           <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 45462, …
$ gender               <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F", "M",…
$ region               <chr> "East Anglian Region", "Scotland", "North Western…
$ highest_education    <chr> "HE Qualification", "HE Qualification", "A Level …
$ imd_band             <chr> "90-100%", "20-30%", "30-40%", "50-60%", "50-60%"…
$ age_band             <chr> "55<=", "35-55", "35-55", "35-55", "0-35", "35-55…
$ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ studied_credits      <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 60, 60,…
$ disability           <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N", "N",…
$ final_result         <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass", "Pas…
Code
glimpse(student_registration)
Rows: 32,593
Columns: 5
$ code_module         <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
$ code_presentation   <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "2013…
$ id_student          <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 45462, 4…
$ date_registration   <dbl> -159, -53, -92, -52, -176, -110, -67, -29, -33, -1…
$ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Code
glimpse(student_vle)
Rows: 10,655,280
Columns: 6
$ code_module       <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AA…
$ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "2013J"…
$ id_student        <dbl> 28400, 28400, 28400, 28400, 28400, 28400, 28400, 284…
$ id_site           <dbl> 546652, 546652, 546652, 546614, 546714, 546652, 5468…
$ date              <dbl> -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -1…
$ sum_click         <dbl> 4, 1, 1, 11, 1, 8, 2, 15, 17, 1, 1, 1, 3, 4, 3, 2, 3…
Code
glimpse(vle)
Rows: 6,364
Columns: 6
$ id_site           <dbl> 546943, 546712, 546998, 546888, 547035, 546614, 5468…
$ code_module       <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AA…
$ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "2013J"…
$ activity_type     <chr> "resource", "oucontent", "resource", "url", "resourc…
$ week_from         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ week_to           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Each of these datasets plays a role in the analysis:

  • Assessments: Types and weights of assessments, crucial for understanding assessment strategies.

  • Courses: Course details like module code and presentation length, important for analyzing course structure.

  • Student Assessment: Scores and submission dates, directly relevant to performance analysis.

  • Student Information: Demographic and educational background, vital for understanding student diversity and its impact.

  • Student Registration: Enrollment and dropout data, indicative of student commitment and course demand.

  • Student VLE: Detailed logs of interactions with online resources, key to measuring engagement.

  • VLE: Details of online learning resources, helps in correlating resource types with engagement.

Wrangle

In this section, I will focus on preparing the datasets for in-depth analysis. This involves a series of data preprocessing steps to ensure the quality and consistency of the data.

Data Pre-processing

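Data preprocessing is a critical step in the analysis pipeline. I compute each student’s mean assessment score, join it to the student information table, and drop students without a valid score, so that the data are in the right format for analysis.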

Code
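# Mean assessment score per student (missing scores are ignored)
student_assessment_summary <- student_assessment %>%
  group_by(id_student) %>%
  summarize(average_score = mean(score, na.rm = TRUE), .groups = 'drop')

# Attach the per-student mean to the demographic records; a student enrolled in
# several modules keeps one row per enrolment, each with the same mean
combined_data <- student_info %>%
  inner_join(student_assessment_summary, by = "id_student")

# Drop students whose scores are all missing (their mean is NaN)
combined_data <- combined_data %>%
  filter(!is.na(average_score))

# Note: despite its name, total_score holds the mean score, not a sum
combined_data <- rename(combined_data, total_score = average_score)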

Feature Engineering

To address the research questions effectively, I create a new variable, performance_category, that bins each student’s score into High (85 or above), Medium (60 to below 85), or Low (below 60).

Code
combined_data$performance_category <- ifelse(combined_data$total_score >= 85, "High",
                                             ifelse(combined_data$total_score >= 60, "Medium", "Low"))
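
As an alternative to nested `ifelse()` calls, the same binning can be sketched with `dplyr::case_when()`, which lists each threshold on its own line. The toy data frame `demo` below is hypothetical and only exercises the band boundaries; it is not the report's `combined_data`.

```r
library(dplyr)
library(tibble)

# Toy scores (made up) chosen to hit each band and both boundaries
demo <- tibble(total_score = c(92, 85, 84.9, 60, 59.9, 0))

demo <- demo %>%
  mutate(
    # Conditions are checked in order; a missing score stays NA
    performance_category = case_when(
      total_score >= 85 ~ "High",
      total_score >= 60 ~ "Medium",
      total_score < 60 ~ "Low"
    ),
    # An ordered factor keeps Low < Medium < High in tables and plots
    performance_category = factor(
      performance_category,
      levels = c("Low", "Medium", "High"),
      ordered = TRUE
    )
  )

demo
```

Storing the category as an ordered factor is a design choice rather than a requirement; it mainly keeps the bands in a sensible order when they are tabulated or plotted.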

Analyze

This section of the analysis explores the data with techniques suited to the dataset and the research questions. I will use descriptive statistics and visualizations to uncover patterns and insights.

Descriptive Statistics

Let’s start by examining basic statistics of the main variables to get an overview of the dataset.

Code
summary(combined_data)
 code_module        code_presentation    id_student         gender         
 Length:26727       Length:26727       Min.   :   6516   Length:26727      
 Class :character   Class :character   1st Qu.: 505815   Class :character  
 Mode  :character   Mode  :character   Median : 589326   Mode  :character  
                                       Mean   : 708659                     
                                       3rd Qu.: 642188                     
                                       Max.   :2698588                     
    region          highest_education    imd_band           age_band        
 Length:26727       Length:26727       Length:26727       Length:26727      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 num_of_prev_attempts studied_credits   disability        final_result      
 Min.   :0.00         Min.   : 30.00   Length:26727       Length:26727      
 1st Qu.:0.00         1st Qu.: 60.00   Class :character   Class :character  
 Median :0.00         Median : 60.00   Mode  :character   Mode  :character  
 Mean   :0.16         Mean   : 77.77                                        
 3rd Qu.:0.00         3rd Qu.: 90.00                                        
 Max.   :6.00         Max.   :630.00                                        
  total_score     performance_category
 Min.   :  0.00   Length:26727        
 1st Qu.: 64.80   Class :character    
 Median : 75.71   Mode  :character    
 Mean   : 72.83                       
 3rd Qu.: 84.06                       
 Max.   :100.00                       
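
For character columns, `summary()` reports only length, class, and mode, so the output above says little about the categorical variables. A minimal sketch of two ways to tabulate them follows; the data frame `demo` is a made-up stand-in, not the report's `combined_data`.

```r
library(dplyr)
library(tibble)

# Toy rows (made up) with one categorical and one numeric column
demo <- tibble(
  final_result = c("Pass", "Pass", "Withdrawn", "Pass"),
  total_score = c(78, 70, 72, 69)
)

# Option 1: a frequency table for a single categorical column
result_counts <- demo %>% count(final_result, sort = TRUE)
result_counts

# Option 2: convert character columns to factors so summary() tabulates them
demo_factors <- demo %>% mutate(across(where(is.character), as.factor))
summary(demo_factors)
```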

Data Visualization

Visualizations can provide intuitive insights into the patterns and relationships in the data. Here, I create visualizations that explore student performance, demographics, and interactions with the VLE.

RQ 1: Relationship Between VLE Interactions and Assessment Scores

I first explore the relationship between the amount of interaction students have with the VLE (as indicated by the total number of clicks) and their assessment scores.

Code
library(ggplot2)
library(dplyr)

# GGally is assumed to be installed already; if it is not, run
# install.packages("GGally") once in the console rather than on every render
Code
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Code
# gridExtra is assumed to be installed already; if it is not, run
# install.packages("gridExtra") once in the console rather than on every render
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
Code
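# Total clicks per student, summed over all modules and presentations
student_vle_summary <- student_vle %>%
  group_by(id_student) %>%
  summarise(total_clicks = sum(sum_click), .groups = 'drop')

# One row per submitted assessment, each carrying the student's overall click total
vle_assessment_data <- student_vle_summary %>%
  inner_join(student_assessment, by = "id_student")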

interaction_score_plot <- ggplot(vle_assessment_data, aes(x = total_clicks, y = score)) +
  geom_point(aes(color = score)) +
  labs(title = "Relationship Between VLE Interactions and Assessment Scores",
       x = "Total VLE Interactions (Clicks)",
       y = "Assessment Score") +
  theme_minimal() +
  theme(legend.position = "bottom")
interaction_score_plot
Warning: Removed 172 rows containing missing values (`geom_point()`).

The scatter plot shows a general trend in which more interactions within the Virtual Learning Environment are associated with higher assessment scores, suggesting that engagement with online resources is positively related to academic performance. However, some outliers show that, for a few students, high VLE engagement does not translate into high scores, pointing to the complexity of learning behaviors and the influence of other factors on academic success.
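
The report does not compute a correlation, so the trend is judged visually. One way it could be quantified is a rank correlation between clicks and scores. The sketch below uses simulated data only (the values are not from OULAD), and the column names `total_clicks` and `mean_score` are assumptions for illustration.

```r
# Simulated per-student data, NOT OULAD values
set.seed(1)
demo <- data.frame(total_clicks = rexp(200, rate = 1 / 1500))
demo$mean_score <- pmin(
  100,
  pmax(0, 55 + 6 * log1p(demo$total_clicks / 100) + rnorm(200, sd = 12))
)

# Spearman's rank correlation depends only on the ordering of the values,
# so a few students with very large click counts do not dominate the estimate
rho_test <- cor.test(
  demo$total_clicks,
  demo$mean_score,
  method = "spearman",
  exact = FALSE
)

rho_test$estimate  # correlation coefficient, between -1 and 1
rho_test$p.value
```

A rank-based measure is chosen here on the assumption that click counts are strongly skewed; Pearson's correlation on log-transformed clicks would be another reasonable option.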

RQ 2: VLE Interactions Across Different Courses

Next, I will analyze how student interactions with the VLE vary across different courses or modules.

Code
vle_course_interaction <- student_vle %>%
  group_by(code_module) %>%
  summarise(total_clicks = sum(sum_click), .groups = 'drop')

course_interaction_plot <- ggplot(vle_course_interaction, aes(x = code_module, y = total_clicks, fill = code_module)) +
  geom_bar(stat = "identity") +
  labs(title = "VLE Interactions Across Different Courses",
       x = "Course Module",
       y = "Total VLE Interactions (Clicks)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
course_interaction_plot

The bar chart displays striking variation in VLE interactions across courses, showing that student engagement with online resources is not uniform. Courses with higher interaction levels (in the analyzed OULAD data, the FFF module stands out) might have more engaging content or more demanding requirements, whereas lower interaction levels could indicate course materials or activities that resonate less with students. Because the chart shows total clicks, differences in enrollment between modules may also contribute. These insights can help educators identify which courses might benefit from enhanced interactive materials or alternative engagement strategies.
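
Total clicks depend on how many students a module enrolls as well as on how active each student is. A minimal sketch of normalizing by the number of distinct students is shown below. The table `demo_vle` is made up and only borrows the column names of `student_vle`.

```r
library(dplyr)
library(tibble)

# Toy interaction log (values are made up) with student_vle's column names
demo_vle <- tibble(
  code_module = c("AAA", "AAA", "AAA", "FFF", "FFF", "FFF", "FFF"),
  id_student  = c(1, 1, 2, 3, 4, 5, 5),
  sum_click   = c(4, 6, 10, 6, 6, 6, 6)
)

clicks_per_student <- demo_vle %>%
  group_by(code_module) %>%
  summarise(
    total_clicks = sum(sum_click),
    n_students = n_distinct(id_student),
    clicks_per_student = total_clicks / n_students,
    .groups = "drop"
  )

clicks_per_student
```

In this toy example FFF has the larger total (24 against 20) but the smaller per-student average (8 against 10), which is the kind of reversal that a per-student view would reveal if it were present in the real data.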

RQ 3: Demographic Factors and Academic Outcomes

Lastly, I will examine the influence of demographic factors, including gender, region, highest education level, age band, and disability status, on students’ academic outcomes.

Code
demographic_assessment_data <- student_info %>%
  inner_join(student_assessment, by = "id_student")
Warning in inner_join(., student_assessment, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 15 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
Code
region_performance_plot <- ggplot(demographic_assessment_data, aes(x = region, y = score, fill = region)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Region", x = "Region", y = "Score")
Code
education_performance_plot <- ggplot(demographic_assessment_data, aes(x = highest_education, y = score, fill = highest_education)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Education", x = "Education", y = "Score")
Code
grid.arrange(region_performance_plot, education_performance_plot, ncol = 2)
Warning: Removed 227 rows containing non-finite values (`stat_boxplot()`).
Removed 227 rows containing non-finite values (`stat_boxplot()`).
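
The many-to-many warning raised by the join on `id_student` arises because a student can appear in several enrolments, so each of that student's scores is paired with every enrolment. One possible remedy, sketched here on made-up tables, is to look up the module and presentation of each assessment first and then join on the full key. The names `info`, `assessments_demo`, and `scores` are hypothetical stand-ins for the report's tables.

```r
library(dplyr)
library(tibble)

# Toy tables mirroring the OULAD keys (all values are made up)
info <- tibble(
  code_module = c("AAA", "BBB"),
  code_presentation = c("2013J", "2014J"),
  id_student = c(1, 1),
  highest_education = c("HE Qualification", "HE Qualification")
)
assessments_demo <- tibble(
  id_assessment = c(10, 20),
  code_module = c("AAA", "BBB"),
  code_presentation = c("2013J", "2014J")
)
scores <- tibble(
  id_assessment = c(10, 20),
  id_student = c(1, 1),
  score = c(70, 80)
)

# Joining on id_student alone pairs every enrolment with every score: 2 x 2 = 4 rows
naive <- inner_join(info, scores, by = "id_student", relationship = "many-to-many")

# Recover the module and presentation of each score, then join on the full key
keyed <- scores %>%
  inner_join(assessments_demo, by = "id_assessment") %>%
  inner_join(info, by = c("code_module", "code_presentation", "id_student"))

nrow(naive)  # 4
nrow(keyed)  # 2
```

Applying this to the real data would change the row counts and possibly the boxplots, so the figures in this section should be read with the duplication in mind.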

Code
age_performance_plot <- ggplot(demographic_assessment_data, aes(x = age_band, y = score, fill = age_band)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Age Band", x = "Age Band", y = "Score")
Code
disability_performance_plot <- ggplot(demographic_assessment_data, aes(x = disability, y = score, fill = disability)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Disability", x = "Disability", y = "Score")
Code
grid.arrange(age_performance_plot, disability_performance_plot, ncol = 2)
Warning: Removed 227 rows containing non-finite values (`stat_boxplot()`).
Removed 227 rows containing non-finite values (`stat_boxplot()`).

The series of boxplots reveals how different demographic factors relate to academic outcomes in online courses. Notably, students with higher initial education levels and those in certain age groups tend to achieve higher scores, possibly reflecting the advantages of prior educational experiences and maturity.

Meanwhile, variations in scores based on region and disability status, while present, are less marked. These findings suggest that while the online learning environment offers a level of accessibility and flexibility, there are still underlying demographic influences that affect student performance. Understanding these can guide the development of more inclusive and adaptive learning strategies to support diverse student populations.
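
Differences between boxplots can be hard to judge by eye. A grouped summary of medians and interquartile ranges is one way to put numbers next to them. The sketch below uses a made-up data frame `demo`; only the `age_band` labels are taken from the dataset preview.

```r
library(dplyr)
library(tibble)

# Toy data (scores are made up) with the columns used in the age-band boxplot
demo <- tibble(
  age_band = c("0-35", "0-35", "0-35", "35-55", "35-55", "55<="),
  score = c(60, 70, NA, 75, 85, 90)
)

score_by_age <- demo %>%
  group_by(age_band) %>%
  summarise(
    n_scores = sum(!is.na(score)),            # non-missing scores per group
    median_score = median(score, na.rm = TRUE),
    iqr_score = IQR(score, na.rm = TRUE),
    .groups = "drop"
  )

score_by_age
```

The same pattern would apply to region, highest education, or disability by swapping the grouping column.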

Communication

This final section of the data analysis project discusses the key findings, implications, and broader context of the study. It aims to provide actionable insights for stakeholders in online education and to acknowledge the limitations and ethical considerations of this work.

Key Findings and Insights

The analysis has led to several important insights:

  1. VLE Interactions and Assessment Scores: There is a noticeable trend indicating that increased interactions within the VLE are generally associated with higher assessment scores. However, exceptions to this trend suggest that engagement quality may matter as much as quantity.

  2. Variations in VLE Interactions Across Courses: Student engagement with the VLE varies substantially across different courses, suggesting that course design and content complexity may play important roles in online learning engagement.

  3. Impact of Demographic Factors: Demographic factors such as education level and age influence academic outcomes. While some demographics like region and disability show less pronounced effects, they still contribute to the diversity of student performance.

Implications and Actions

Based on these findings, this project recommends the following actions for online education designers and instructors:

  • Enhance Interactive Elements: Develop targeted strategies to increase student interactions with the VLE while enhancing the quality and effectiveness of these interactions, especially for courses with lower interaction levels.

  • Targeted Support: Offer additional support and resources to student groups that might be at a disadvantage, as indicated by their demographic background. These indicators can also inform further at-risk prediction and monitoring of the learning process.

Limitations

This study is not without limitations. The data reflect only a snapshot of student behavior and performance, and long-term trends might differ. Additionally, the complexity of online learning behaviors means that there are likely other influential factors not captured in this analysis of OULAD. Finally, the analysis offers only a few descriptive insights into the OULAD dataset; it does not apply modeling techniques such as logistic regression or statistical analyses such as correlation.

Ethical Considerations

In conducting this analysis, the data, and especially the demographic information, were used responsibly. Students appear in the dataset only under anonymized identifiers, with no personally identifying details.