This data analysis project explores student engagement and performance within the Open University’s Virtual Learning Environment (VLE). The onset of the COVID-19 pandemic has accelerated the adoption of online learning platforms, making it crucial to understand how students interact with digital learning materials and how these interactions affect their academic outcomes.
Research Questions
This data analysis is driven by the following questions:
1. What is the relationship between students’ interactions with the VLE and their assessment scores?
2. How do these interactions vary across different courses or modules?
3. What demographic factors influence students’ academic outcomes in online courses?
Data Source
The analysis is based on data from the Open University Learning Analytics Dataset (OULAD). It includes information about courses, assessments, VLE interactions, student demographics, and registration details. These datasets collectively offer a comprehensive view of the student learning journey in an online setting.
Anticipated Significance
This analysis aims to provide valuable insights for online course designers and instructors. By understanding the dynamics of student engagement and performance, educational stakeholders can tailor their content delivery and support mechanisms to enhance learning outcomes.
Prepare
In this section of the analysis, I prepare the datasets for in-depth exploration and analysis. This involves understanding the scope of the data, the context in which it was collected, and the relevance of each variable to the research questions.
Purpose of the Data Product
The primary purpose of this data product is to explore student engagement and performance within the Open University’s Virtual Learning Environment (VLE) by answering key research questions that revolve around the impact of VLE interactions on student assessments.
Data Sources and Context
The analysis utilizes data from the Open University Learning Analytics Dataset (OULAD). Below, I load and take an initial look at each dataset.
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 206 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, assessment_type
dbl (3): id_assessment, date, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 22 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (1): module_presentation_length
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 173912 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): id_assessment, id_student, date_submitted, is_banked, score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 32593 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): code_module, code_presentation, gender, region, highest_education, ...
dbl (3): id_student, num_of_prev_attempts, studied_credits
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 32593 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (3): id_student, date_registration, date_unregistration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10655280 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): code_module, code_presentation
dbl (4): id_student, id_site, date, sum_click
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 6364 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, activity_type
dbl (3): id_site, week_from, week_to
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
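The column-specification messages above can be avoided by declaring the column types up front. The following is a minimal sketch on a throwaway two-row file, not the report's own loading code: the rows are made up for illustration, and only the column names are taken from the `courses.csv` specification printed above.

```r
library(readr)

# Write a tiny stand-in for courses.csv to a temporary file (made-up rows).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "code_module,code_presentation,module_presentation_length",
    "AAA,2013J,260",
    "BBB,2014B,240"
  ),
  tmp
)

# Supplying col_types documents the expected schema and means readr
# has nothing to guess, so no column-specification message is printed.
courses_toy <- read_csv(
  tmp,
  col_types = cols(
    code_module = col_character(),
    code_presentation = col_character(),
    module_presentation_length = col_double()
  )
)

print(courses_toy)
```

Declaring types this way also makes a load fail loudly if a column unexpectedly changes type, which is useful for the larger files such as `studentVle.csv`.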
Variables of Interest
To address the research questions, I will focus on several key variables across these datasets. Let’s briefly explore them.
Each of these datasets plays a role in the analysis:
Assessments: Types and weights of assessments, crucial for understanding assessment strategies.
Courses: Course details like module code and presentation length, important for analyzing course structure.
Student Assessment: Scores and submission dates, directly relevant to performance analysis.
Student Information: Demographic and educational background, vital for understanding student diversity and its impact.
Student Registration: Enrollment and dropout data, indicative of student commitment and course demand.
Student VLE: Detailed logs of interactions with online resources, key to measuring engagement.
VLE: Details of online learning resources, which help in correlating resource types with engagement.
Wrangle
In this section, I will focus on preparing the datasets for in-depth analysis. This involves a series of data preprocessing steps to ensure the quality and consistency of the data.
Data Pre-processing
Data preprocessing is a critical step in the analysis pipeline. I’ll clean and prepare the data, ensuring it’s in the right format for analysis.
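The source listing at the end of this report averages each student's assessment scores, joins the result to the student information, and drops students without a usable score. Below is a minimal sketch of those steps on made-up rows (the `_toy` data frames are hypothetical stand-ins for the real tables). Note that, despite its name, `total_score` holds a per-student mean.

```r
library(dplyr)

# Made-up stand-ins for student_assessment and student_info.
student_assessment_toy <- data.frame(
  id_student = c(1, 1, 2, 3),
  score = c(80, 90, NA, 55)
)
student_info_toy <- data.frame(
  id_student = c(1, 2, 3),
  gender = c("F", "M", "F")
)

# Mean score per student; a student whose scores are all missing gets NaN.
score_summary_toy <- student_assessment_toy %>%
  group_by(id_student) %>%
  summarise(total_score = mean(score, na.rm = TRUE), .groups = "drop")

# Attach the mean to the student record and drop students with no score.
# is.na() is TRUE for NaN as well as NA, so student 2 is removed here.
combined_toy <- student_info_toy %>%
  inner_join(score_summary_toy, by = "id_student") %>%
  filter(!is.na(total_score))

print(combined_toy)
```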
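Feature Engineering
To address the research questions effectively, I create a new variable, performance_category, that bins each student’s score into High, Medium, and Low.
Analyze
This section of the analysis focuses on exploring the data through analytical techniques suited to the dataset and research questions. I will use descriptive statistics and visualizations to uncover patterns and insights.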
Descriptive Statistics
Let’s start by examining basic statistics of the main variables to get an overview of the dataset.
Code
summary(combined_data)
code_module code_presentation id_student gender
Length:26727 Length:26727 Min. : 6516 Length:26727
Class :character Class :character 1st Qu.: 505815 Class :character
Mode :character Mode :character Median : 589326 Mode :character
Mean : 708659
3rd Qu.: 642188
Max. :2698588
region highest_education imd_band age_band
Length:26727 Length:26727 Length:26727 Length:26727
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
num_of_prev_attempts studied_credits disability final_result
Min. :0.00 Min. : 30.00 Length:26727 Length:26727
1st Qu.:0.00 1st Qu.: 60.00 Class :character Class :character
Median :0.00 Median : 60.00 Mode :character Mode :character
Mean :0.16 Mean : 77.77
3rd Qu.:0.00 3rd Qu.: 90.00
Max. :6.00 Max. :630.00
total_score performance_category
Min. : 0.00 Length:26727
1st Qu.: 64.80 Class :character
Median : 75.71 Mode :character
Mean : 72.83
3rd Qu.: 84.06
Max. :100.00
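The `performance_category` column in the summary above is derived from `total_score` with cut points at 60 and 85, according to the source listing, which uses nested `ifelse()` calls. As an alternative sketch, `cut()` expresses the same binning in one call; the scores below are made up to sit on and around the two cut points.

```r
# Made-up scores chosen to sit on and around the two cut points.
total_score <- c(59.9, 60, 84.9, 85, 100)

performance_category <- cut(
  total_score,
  breaks = c(-Inf, 60, 85, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE  # intervals are [lower, upper): 60 is Medium, 85 is High
)

print(data.frame(total_score, performance_category))
print(table(performance_category))
```

Unlike the character vector produced by `ifelse()`, `cut()` returns a factor, so the categories keep their Low, Medium, High order in tables and plots.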
Data Visualization
Visualizations can provide intuitive insights into the patterns and relationships in our data. Here, I create visualizations that explore student performance, demographics, and students’ interactions with the VLE.
RQ 1: Relationship Between VLE Interactions and Assessment Scores
I first explore the relationship between the amount of interaction students have with the VLE (as indicated by the total number of clicks) and their assessment scores.
The scatter plot illustrates a general trend where increased interactions within the Virtual Learning Environment are associated with higher assessment scores, which is consistent with engagement with online resources supporting academic performance. However, some outliers indicate that for a few students, high VLE engagement does not necessarily translate to high scores, pointing to the complexity of learning behaviors and the influence of other factors on academic success.
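One way to put a number on the trend seen in the scatter plot is a rank correlation. This was not part of the original analysis; the sketch below uses made-up per-student totals and a hypothetical `toy` data frame, and it assumes clicks and scores have already been aggregated to one row per student.

```r
# Made-up per-student values; in the report these would be derived from
# student_vle (clicks) and student_assessment (scores).
toy <- data.frame(
  total_clicks = c(10, 50, 120, 300, 800, 1500),
  mean_score = c(40, 55, 52, 70, 78, 90)
)

# Spearman's rho works on ranks, so a handful of very heavy clickers
# does not dominate the estimate the way it could with Pearson's r.
rho <- cor(toy$total_clicks, toy$mean_score, method = "spearman")
print(rho)
```

A rank-based measure is a reasonable default here because click counts are typically right-skewed, but it still describes association only and says nothing about causation.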
RQ 2: VLE Interactions Across Different Courses
Next, I will analyze how student interactions with the VLE vary across different courses or modules.
Code
vle_course_interaction <- student_vle %>%
  group_by(code_module) %>%
  summarise(total_clicks = sum(sum_click), .groups = "drop")

course_interaction_plot <- ggplot(vle_course_interaction, aes(x = code_module, y = total_clicks, fill = code_module)) +
  geom_bar(stat = "identity") +
  labs(title = "VLE Interactions Across Different Courses", x = "Course Module", y = "Total VLE Interactions (Clicks)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

course_interaction_plot
The bar chart displays a striking variation in total VLE interactions across courses, showing that student engagement with online resources is not uniform. Courses with higher interaction levels (in the analyzed OULAD dataset, the FFF module appears highest) might be those with more engaging content or more demanding requirements, whereas lower interaction levels could indicate areas where course materials or activities are not resonating as well with students. Because the chart shows total clicks, differences in enrollment between modules may also contribute. These insights can help educators identify which courses might benefit from enhanced interactive materials or alternative engagement strategies.
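Total clicks per module mix two things: how many students a module has and how much each of them clicks. A minimal sketch of one way to separate the two, using made-up rows in a hypothetical `student_vle_toy` table, is to divide each module's total by its number of distinct students.

```r
library(dplyr)

# Made-up interaction log: module BBB has more students but lighter use.
student_vle_toy <- data.frame(
  code_module = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"),
  id_student = c(1, 1, 2, 3, 4, 5, 5),
  sum_click = c(10, 20, 30, 20, 20, 20, 15)
)

clicks_by_module <- student_vle_toy %>%
  group_by(code_module) %>%
  summarise(
    total_clicks = sum(sum_click),
    n_students = n_distinct(id_student),
    .groups = "drop"
  ) %>%
  mutate(clicks_per_student = total_clicks / n_students)

print(clicks_by_module)
```

In this toy example BBB has the larger total while AAA has the higher per-student figure, which is the kind of reversal a totals-only chart cannot show.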
RQ 3: Demographic Factors and Academic Outcomes
Lastly, I will examine the influence of demographic factors, including gender, region, highest education level, age band, and disability status, on students’ academic outcomes.
Code
demographic_assessment_data <- student_info %>%
  inner_join(student_assessment, by = "id_student")
Warning in inner_join(., student_assessment, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 15 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
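The warning arises because `id_student` alone does not identify a row on either side: a student can appear once per module presentation in `student_info` and once per assessment in `student_assessment`. The sketch below, on made-up rows, contrasts that join with one that first looks up each assessment's module and presentation in `assessments` and then matches on all three keys. The `_toy` tables are hypothetical, and this is one possible fix rather than the approach used in the report.

```r
library(dplyr)

# Student 1 is registered on two module presentations.
student_info_toy <- data.frame(
  id_student = c(1, 1, 2),
  code_module = c("AAA", "BBB", "AAA"),
  code_presentation = c("2013J", "2014J", "2013J"),
  highest_education = c("HE Qualification", "HE Qualification", "A Level or Equivalent")
)
assessments_toy <- data.frame(
  id_assessment = c(101, 201),
  code_module = c("AAA", "BBB"),
  code_presentation = c("2013J", "2014J")
)
student_assessment_toy <- data.frame(
  id_assessment = c(101, 201, 101),
  id_student = c(1, 1, 2),
  score = c(70, 85, 60)
)

# Joining on id_student only pairs every registration of a student with
# every one of their scores, including scores from other modules.
naive <- inner_join(
  student_info_toy, student_assessment_toy,
  by = "id_student", relationship = "many-to-many"
)

# Attaching module and presentation to each score first lets the join
# match each score to the registration it belongs to.
keyed <- student_assessment_toy %>%
  inner_join(assessments_toy, by = "id_assessment") %>%
  inner_join(student_info_toy, by = c("id_student", "code_module", "code_presentation"))

print(nrow(naive))
print(nrow(keyed))
```

With these rows the first join yields five rows from three scores, while the keyed join yields exactly three, one per score.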
Code
region_performance_plot <- ggplot(demographic_assessment_data, aes(x = region, y = score, fill = region)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Region", x = "Region", y = "Score")
Code
education_performance_plot <- ggplot(demographic_assessment_data, aes(x = highest_education, y = score, fill = highest_education)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Education", x = "Education", y = "Score")

age_performance_plot <- ggplot(demographic_assessment_data, aes(x = age_band, y = score, fill = age_band)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Age Band", x = "Age Band", y = "Score")
Code
disability_performance_plot <- ggplot(demographic_assessment_data, aes(x = disability, y = score, fill = disability)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Scores by Disability", x = "Disability", y = "Score")
The series of boxplots shows how academic outcomes differ across demographic groups in online courses. Notably, students with higher initial education levels and those in certain age groups tend to achieve higher scores, possibly reflecting the advantages of prior educational experiences and maturity.
Meanwhile, variations in scores based on region and disability status, while present, are less marked. These findings suggest that while the online learning environment offers a level of accessibility and flexibility, there are still underlying demographic influences that affect student performance. Understanding these can guide the development of more inclusive and adaptive learning strategies to support diverse student populations.
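Differences that look "less marked" in a boxplot are easier to compare when the medians and spreads are tabulated. The sketch below shows one way to do that on made-up scores; `scores_toy` is a hypothetical stand-in for the joined demographic and assessment data, and the numbers are not results from OULAD.

```r
library(dplyr)

# Made-up scores for three education levels.
scores_toy <- data.frame(
  highest_education = rep(
    c("Lower Than A Level", "A Level or Equivalent", "HE Qualification"),
    each = 3
  ),
  score = c(50, 60, 70, 65, 75, 85, 70, 80, 90)
)

# Median and interquartile range per group: the same quantities a
# boxplot draws, but in a form that can be sorted and compared.
score_summary <- scores_toy %>%
  group_by(highest_education) %>%
  summarise(
    n = n(),
    median_score = median(score),
    iqr_score = IQR(score),
    .groups = "drop"
  ) %>%
  arrange(desc(median_score))

print(score_summary)
```

The same pattern applies to `region`, `age_band`, or `disability` by swapping the grouping column.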
Communication
This final section of the data analysis project discusses the key findings, their implications, and the broader context of the study. It aims to provide actionable insights for stakeholders in online education and to acknowledge the limitations and ethical considerations of this work.
Key Findings and Insights
The analysis has led to several important insights:
VLE Interactions and Assessment Scores: There is a noticeable trend indicating that increased interactions within the VLE are generally associated with higher assessment scores. However, exceptions to this trend suggest that engagement quality may matter as much as quantity.
Variations in VLE Interactions Across Courses: Student engagement with the VLE varies substantially across different courses, suggesting that course design and content complexity play important roles in online learning engagement.
Impact of Demographic Factors: Demographic factors such as education level and age are associated with academic outcomes. While some demographics like region and disability show less pronounced effects, they still contribute to the diversity of student performance.
Implications and Actions
Based on these findings, this project recommends the following actions for online education designers and instructors:
Enhance Interactive Elements: Develop targeted strategies to increase student interactions with the VLE while enhancing the quality and effectiveness of these interactions, especially for courses with lower interactions.
Targeted Support: Offer additional support and resources to student groups that might be at a disadvantage, as indicated by their demographic background. Such groups are also a natural focus for further at-risk prediction and learning process monitoring.
Limitations
This study is not without limitations. The data reflects only a snapshot of student behavior and performance, and long-term trends might differ. Additionally, the complexity of online learning behaviors means that there are likely other influential factors not captured in this analysis of OULAD. Finally, the analysis presents only a few insights into the OULAD dataset and does not apply modeling techniques such as logistic regression or statistical analyses such as correlation.
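As an illustration of the kind of modeling mentioned above, the sketch below fits a logistic regression of a pass indicator on total clicks. Everything in it is an assumption for demonstration: the rows are made up, `toy` is a hypothetical data frame, and treating "Pass" and "Distinction" as success is one possible coding of `final_result`, not a choice made in this report.

```r
# Made-up students: total VLE clicks and a final result label.
toy <- data.frame(
  total_clicks = c(50, 100, 150, 200, 300, 400, 600, 800, 1000, 1500),
  final_result = c(
    "Fail", "Withdrawn", "Pass", "Fail", "Pass",
    "Fail", "Pass", "Distinction", "Pass", "Distinction"
  )
)

# Assumed coding: Pass or Distinction counts as success (1), else 0.
toy$passed <- as.integer(toy$final_result %in% c("Pass", "Distinction"))

# log1p() compresses the long right tail that click counts usually have.
fit <- glm(passed ~ log1p(total_clicks), data = toy, family = binomial)

print(coef(summary(fit)))
```

A positive slope here only reflects how the toy rows were constructed; on the real data the model would also need the demographic covariates and a check of fit before any conclusion is drawn.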
Ethical Considerations
In conducting this analysis, all student identifiers were removed, and the data, especially the demographic information, was used responsibly.
Source Code
---title: "Analysis of Student Engagement and Performance in Online Learning"author: "Siyu Long"date: "December 12 2023"output: html_documentformat: html: code-fold: true code-tools: true toc: true toc_float: true toc_depth: 2---# Overview## IntroductionThis data analysis project explores student engagement and performance within the Open University's Virtual Learning Environment (VLE). The onset of the COVID-19 pandemic has accelerated the adoption of online learning platforms, making it crucial to understand how students interact with digital learning materials and how these interactions affect their academic outcomes.## Research QuestionsThis data analysis is driven by the following questions: 1. What is the relationship between students' interactions with the VLE and their assessment scores? 2. How do these interactions vary across different courses or modules? 3. What demographic factors influence students' academic outcomes in online courses?## Data SourceThe analysis is based on data from the Open University Learning Analytics Dataset (OULAD). It includes information about courses, assessments, VLE interactions, student demographics, and registration details. These datasets collectively offer a comprehensive view of the student learning journey in an online setting.## Anticipated SignificanceThis analysis aims to provide valuable insights for online course designers and instructors. By understanding the dynamics of student engagement and performance, educational stakeholders can tailor their content delivery and support mechanisms to enhance learning outcomes.# PrepareIn this section of the analysis, I prepare the datasets for in-depth exploration and analysis. 
This involves understanding the scope of the data, the context in which it was collected, and the relevance of each variable to the raised questions.## Purpose of the Data ProductThe primary purpose of this data product is to explore student engagement and performance within the Open University's Virtual Learning Environment (VLE) by answering key research questions that revolve around the impact of VLE interactions on student assessments.## Data Sources and ContextThe analysis utilizes data from the Open University Learning Analytics Dataset (OULAD). Below, I load and take an initial look at each dataset.\-\--```{r fold=TRUE}#| label: loadpackageslibrary(tidyverse)``````{r, echo = FALSE, fold=TRUE}#| label:load_dataassessments_path <-"assessments.csv"courses_path <-"courses.csv"student_assessment_path <-"studentAssessment.csv"student_info_path <-"studentInfo.csv"student_registration_path <-"studentRegistration.csv"student_vle_path <-"studentVle.csv"vle_path <-"vle.csv"assessments <-read_csv(assessments_path)courses <-read_csv(courses_path)student_assessment <-read_csv(student_assessment_path)student_info <-read_csv(student_info_path)student_registration <-read_csv(student_registration_path)student_vle <-read_csv(student_vle_path)vle <-read_csv(vle_path)```## Variables of InterestTo address our research questions, I will focus on several key variables across these datasets. 
Let's briefly explore these variables.```{r, fold=TRUE}glimpse(assessments)glimpse(courses)glimpse(student_assessment)glimpse(student_info)glimpse(student_registration)glimpse(student_vle)glimpse(vle)```Each of these variables plays a crucial role in our analysis:- Assessments: Types and weights of assessments, crucial for understanding assessment strategies.- Courses: Course details like module code and presentation length, important for analyzing course structure.- Student Assessment: Scores and submission dates, directly relevant to performance analysis.- Student Information: Demographic and educational background, vital for understanding student diversity and its impact.- Student Registration: Enrollment and dropout data, indicative of student commitment and course demand.- Student VLE: Detailed logs of interactions with online resources, key to measuring engagement.- VLE: Details of online learning resources, helps in correlating resource types with engagement.# WrangleIn this section, I will focus on preparing the datasets for in-depth analysis. This involves a series of data preprocessing steps to ensure the quality and consistency of the data.## Data Pre-processingData preprocessing is a critical step in the analysis pipeline. 
I'll clean and prepare our text data, ensuring it's in the right format for analysis.```{r data_preprocessing, echo=TRUE,fold=TRUE}student_assessment_summary <- student_assessment %>%group_by(id_student) %>%summarize(average_score =mean(score, na.rm =TRUE), .groups ='drop')combined_data <- student_info %>%inner_join(student_assessment_summary, by ="id_student")combined_data <- combined_data %>%filter(!is.na(average_score))combined_data <-rename(combined_data, total_score = average_score)```## Feature EngineeringIn order to address the research questions effectively, I may need to create new variables or transform existing ones.```{r, fold=TRUE}combined_data$performance_category <-ifelse(combined_data$total_score >=85, "High",ifelse(combined_data$total_score >=60, "Medium", "Low"))```# AnalyzeThis section of the analysis will focus on exploring the data through various analytical techniques that are more suitable to the dataset and research questions. I will use descriptive statistics and visualizations to uncover patterns and insights.## Descriptive StatisticsLet's start by examining basic statistics of the main variables to get an overview of the dataset.```{r descriptive_stats, echo=TRUE, fold=TRUE}summary(combined_data)```## Data VisualizationVisualizations can provide intuitive insights into the patterns and relationships in our data. 
Here, I might create visualizations that explore student performance, demographics, and their interactions with the VLE.### RQ 1: Relationship Between VLE Interactions and Assessment ScoresI first explore the relationship between the amount of interaction students have with the VLE (as indicated by the total number of clicks) and their assessment scores.```{r loadpackage, fold=TRUE}library(ggplot2)library(dplyr)options(repos =c(CRAN ="https://cran.rstudio.com"))install.packages("GGally")library(GGally)install.packages("gridExtra", repos ="https://cran.rstudio.com")library(gridExtra)``````{r vle_assessment_relationship, echo=TRUE, fold=TRUE}student_vle_summary <- student_vle %>%group_by(id_student) %>%summarise(total_clicks =sum(sum_click), .groups ='drop')vle_assessment_data <- student_vle_summary %>%inner_join(student_assessment, by ="id_student")interaction_score_plot <-ggplot(vle_assessment_data, aes(x = total_clicks, y = score)) +geom_point(aes(color = score)) +labs(title ="Relationship Between VLE Interactions and Assessment Scores",x ="Total VLE Interactions (Clicks)",y ="Assessment Score") +theme_minimal() +theme(legend.position ="bottom")interaction_score_plot```The scatter plot illustrates a general trend where increased interactions within the Virtual Learning Environment are associated with higher assessment scores, suggesting that engagement with online resources positively impacts academic performance. 
However, some outliers indicate that for a few students, high VLE engagement does not necessarily translate to high scores, pointing to the complexity of learning behaviors and the influence of other factors on academic success.### RQ 2: VLE Interactions Across Different CoursesNext, I will analyze how student interactions with the VLE vary across different courses or modules.```{r vleinteractions, echo=TRUE, fold=TRUE}vle_course_interaction <- student_vle %>%group_by(code_module) %>%summarise(total_clicks =sum(sum_click), .groups ='drop')course_interaction_plot <-ggplot(vle_course_interaction, aes(x = code_module, y = total_clicks, fill = code_module)) +geom_bar(stat ="identity") +labs(title ="VLE Interactions Across Different Courses",x ="Course Module",y ="Total VLE Interactions (Clicks)") +theme_minimal() +scale_fill_brewer(palette ="Set3")course_interaction_plot```The bar chart displays a striking variation in VLE interactions across different courses, highlighting how student engagement with online resources is not uniform. Courses with higher interaction levels ([in the analynized OULAD dataset, it appears to be the ***FFF*** course module]{.underline}) might be those with more engaging content or more demanding requirements, whereas lower interaction levels could indicate areas where course materials or activities are not resonating as well with students. 
These insights are crucial for educators to identify which courses might benefit from enhanced interactive materials or alternative engagement strategies### RQ 3: Demographic Factors and Academic OutcomesLastly, I will examine the influence of demographic factors on students' academic outcomes, including gender, region, highest education level, age band, and disability status on students' academic outcomes.```{r prepare_visualizationr, echo = TRUE, fold=TRUE}demographic_assessment_data <- student_info %>%inner_join(student_assessment, by ="id_student")``````{r byregion, echo = TRUE, fold=TRUE}region_performance_plot <-ggplot(demographic_assessment_data, aes(x = region, y = score, fill = region)) +geom_boxplot() +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Scores by Region", x ="Region", y ="Score")``````{r byhighesteducationlevel, echo = TRUE, fold=TRUE}education_performance_plot <-ggplot(demographic_assessment_data, aes(x = highest_education, y = score, fill = highest_education)) +geom_boxplot() +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Scores by Education", x ="Education", y ="Score")``````{r arrange_sidebyside, fold=TRUE}grid.arrange(region_performance_plot, education_performance_plot, ncol =2)``````{r byage, echo = TRUE, fold=TRUE}age_performance_plot <-ggplot(demographic_assessment_data, aes(x = age_band, y = score, fill = age_band)) +geom_boxplot() +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Scores by Age Band", x ="Age Band", y ="Score")``````{r bydisabilitystatus, echo = TRUE, fold=TRUE}disability_performance_plot <-ggplot(demographic_assessment_data, aes(x = disability, y = score, fill = disability)) +geom_boxplot() +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="Scores by Disability", x ="Disability", y ="Score")``````{r arrange_sidebyside2, fold=TRUE}grid.arrange(age_performance_plot, 
disability_performance_plot, ncol =2)```The series of boxplots reveal how different demographic factors impact academic outcomes in online courses. Notably, students with higher initial education levels and those in certain age groups tend to achieve higher scores, possibly reflecting the advantages of prior educational experiences and maturity.Meanwhile, variations in scores based on region and disability status, while present, are less marked. These findings suggest that while the online learning environment offers a level of accessibility and flexibility, there are still underlying demographic influences that affect student performance. Understanding these can guide the development of more inclusive and adaptive learning strategies to support diverse student populations.# CommunicationThis final section of the data analysis project is dedicated to discussing the key findings, implications, and the broader context of the study. This aims to provide actionable insights for stakeholders in online education and acknowledge the limitations and ethical considerations of this work.## Key Findings and InsightsThe analysis has led to several important insights:1. **VLE Interactions and Assessment Scores**: There is a noticeable trend indicating that increased interactions within the VLE are generally associated with higher assessment scores. However, exceptions to this trend suggest that engagement quality is as important as quantity.2. **Variations in VLE Interactions Across Courses**: Student engagement with the VLE varies significantly across different courses, indicating that course design and content complexity play crucial roles in online learning engagement.3. **Impact of Demographic Factors**: Demographic factors such as education level and age influence academic outcomes. 
While some demographics like region and disability show less pronounced effects, they still contribute to the diversity of student performance.## Implications and ActionsBased on these findings, This project recommends the following actions for online education designers and instructors:- **Enhance Interactive Elements:** Develop target strategies to increase student interactions with the VLE while enhancing the quality and effectiveness of these interactions, especially for courses with lower interactions.- **Targeted Support**: Offer additional support and resources to student groups that might be at a disadvantage, as indicated by their demographic background, which can be of significant to conduct further at-risk prediction and learning process monitoring.## LimitationsOur study is not without limitations. The data reflects only a snapshot of student behavior and performance, and long-term trends might differ. Additionally, the complexity of online learning behaviors means that there are likely other influential factors not captured in the data analysis of OULAD. Meanwhile, the data analysis presented only a few insights to the OULAD dataset, without applying modeling techniques like logistic regression or conducting statistical analysis like correlation.## Ethical ConsiderationsIn conducting this analysis, especially for the demographic information in the dataset, all student identifiers were removed and used responsibly.