I chose to do an exploratory data analysis of the OULAD data set with a focus on how gender and disability impact student success. The OULAD data set is large, and I had some difficulty working with such a large set in RStudio. Because of this problem, I chose to fit a logistic regression model on a reduced set of tables, along with several visuals that show the relationships within the data.
Wrangling
I was surprised to find the OU data set fairly clean, but I still needed to join and clean the individual sets I used. Because of the size of the OULAD data, I chose to focus on three of the seven sets available to me, studentAssessment.csv, studentInfo.csv, and studentVle.csv. These sets needed to be joined. I chose to use inner_join() using the column id_student.
I found this data set to be too large for my computer to run. RStudio had a difficult time running commands and began crashing. I moved to Posit Cloud, but encountered a similar issue where the site would reset before it finished running a command. After looking at the data sets, I decided to use only studentInfo.csv and studentAssessment.csv. While this was not ideal, it still allowed me to run a regression analysis and explore the relationships between disability and academic success.
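A possible way to keep studentVle.csv in the analysis without exhausting memory would be to collapse it to one row per student before joining. The sketch below only illustrates the idea on a toy data frame. The column name sum_click follows the published OULAD schema, while total_clicks and the toy values are made up for this example.

```r
library(dplyr)

# Toy stand-in for studentVle.csv (many rows per student)
student_vle <- data.frame(
  id_student = c(1, 1, 1, 2, 2),
  sum_click  = c(3, 5, 2, 10, 1)
)

# Collapse to one row per student before any join
# (total_clicks is a hypothetical column name)
vle_per_student <- student_vle %>%
  group_by(id_student) %>%
  summarise(total_clicks = sum(sum_click), .groups = "drop")

vle_per_student
```

The aggregated table has as many rows as there are students, so a later join on id_student adds one column instead of multiplying rows.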
install.packages("dplyr")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
student_assessment <- read.csv("studentAssessment.csv")
student_info <- read.csv("studentInfo.csv")
joined_data <- student_assessment %>%
  inner_join(student_info, by = "id_student")
Warning in inner_join(., student_info, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 15 of `x` matches multiple rows in `y`.
ℹ Row 226 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
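The many-to-many warning from inner_join() arises because id_student is not unique in either table: a student has many assessment rows and can also appear more than once in studentInfo.csv, once per module presentation. Below is a minimal sketch, on toy data, of how the duplicate keys could be checked and how joining against one row per student avoids duplicated rows. Keeping the first studentInfo row per student is a simplifying assumption for illustration and is not necessarily the right choice for this analysis.

```r
library(dplyr)

# Toy data: student 1 is registered on two module presentations
student_info <- data.frame(
  id_student  = c(1, 1, 2),
  code_module = c("AAA", "BBB", "AAA"),
  gender      = c("F", "F", "M")
)
student_assessment <- data.frame(
  id_student = c(1, 1, 2),
  score      = c(70, 80, 65)
)

# Keys that occur more than once in student_info
dup_keys <- student_info %>%
  count(id_student) %>%
  filter(n > 1)

# Assumption: keep only the first student_info row per student
info_unique <- student_info %>%
  distinct(id_student, .keep_all = TRUE)

# Each assessment row now matches exactly one info row
joined <- inner_join(student_assessment, info_unique, by = "id_student")
nrow(joined)
```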
After this, I used summary() to look at the dataset:
summary(cleaned_data)
id_assessment id_student date_submitted is_banked
Min. : 1752 Min. : 6516 Min. :-11.0 Min. :0.00000
1st Qu.:24283 1st Qu.: 505520 1st Qu.: 49.0 1st Qu.:0.00000
Median :25357 Median : 585154 Median :114.0 Median :0.00000
Mean :26631 Mean : 705820 Mean :114.3 Mean :0.01726
3rd Qu.:34881 3rd Qu.: 633303 3rd Qu.:172.0 3rd Qu.:0.00000
Max. :37443 Max. :2698588 Max. :608.0 Max. :1.00000
score code_module code_presentation gender
Min. : 0.00 Length:197778 Length:197778 Length:197778
1st Qu.: 65.00 Class :character Class :character Class :character
Median : 79.00 Mode :character Mode :character Mode :character
Mean : 75.23
3rd Qu.: 89.00
Max. :100.00
region highest_education imd_band age_band
Length:197778 Length:197778 Length:197778 Length:197778
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
num_of_prev_attempts studied_credits disability final_result
Min. :0.0000 Min. : 30.0 Length:197778 Length:197778
1st Qu.:0.0000 1st Qu.: 60.0 Class :character Class :character
Median :0.0000 Median : 60.0 Mode :character Mode :character
Mean :0.1582 Mean : 78.1
3rd Qu.:0.0000 3rd Qu.: 90.0
Max. :6.0000 Max. :630.0
Logistic Regression Analysis
To begin my data analysis, I chose to fit a logistic regression model, since the outcome is binary (pass or not). final_result was my dependent variable, and gender and disability were the predictors I wanted to look at.
  id_student score gender disability final_result
1      11391    78      M          N         Pass
2      28400    70      F          N         Pass
3      31604    72      F          N         Pass
4      32885    69      F          N         Pass
5      38053    79      M          N         Pass
6      45462    70      M          N         Pass
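# Numeric indicator for disability (the column is coded "Y"/"N")
new_dataframe$disability_numeric <- ifelse(new_dataframe$disability == "Y", 1, 0)
new_dataframe$gender <- as.factor(new_dataframe$gender)
# Binary outcome: 1 = Pass, 0 = any other final result
new_dataframe$final_result <- ifelse(new_dataframe$final_result == "Pass", 1, 0)
# Logistic regression of the binary outcome on gender and disability
model <- glm(final_result ~ gender + disability, data = new_dataframe, family = "binomial")
summary(model)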
Call:
glm(formula = final_result ~ gender + disability, family = "binomial",
data = new_dataframe)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.362712 0.007132 50.86 <2e-16 ***
genderM -0.101541 0.009188 -11.05 <2e-16 ***
disabilityY -0.300848 0.015422 -19.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 270437 on 197777 degrees of freedom
Residual deviance: 269953 on 197775 degrees of freedom
AIC: 269959
Number of Fisher Scoring iterations: 4
The coefficients for gender (genderM, -0.102) and disability (disabilityY, -0.301) are both negative, with small standard errors (about 0.009 and 0.015) and p-values below 2e-16, so both are statistically significant. On the odds scale, male students have roughly 0.90 times the odds of passing compared with female students, and students with a registered disability have roughly 0.74 times the odds of students without one. However, the residual deviance (269953) is only slightly lower than the null deviance (270437), a drop of 484 on 2 degrees of freedom, and the AIC is 269959. The two predictors therefore explain very little of the variation in the outcome.
The model identifies statistically significant effects for both gender and disability, but with 197,778 rows even small differences reach significance, so the practical effect of these variables on passing appears modest. Because each row is an assessment submission rather than a student, the observations are not independent, and the standard errors are likely understated.
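Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiation. Here is a minimal sketch using the estimates and deviances printed in the model summary, typed in by hand rather than extracted from the model object:

```r
# Estimates copied from the printed model summary (log-odds scale)
coefs <- c("(Intercept)" = 0.362712, genderM = -0.101541, disabilityY = -0.300848)

# Exponentiate to get odds ratios relative to the reference groups
odds_ratios <- exp(coefs[c("genderM", "disabilityY")])
round(odds_ratios, 2)

# Predicted probability of final_result == 1 for the reference group
# (female, no registered disability)
baseline_prob <- plogis(coefs[["(Intercept)"]])

# Share of deviance explained (McFadden's pseudo R-squared),
# from the printed residual and null deviances
pseudo_r2 <- 1 - 269953 / 270437
pseudo_r2
```

With the fitted object available, exp(coef(model)) gives the same odds ratios directly.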
Further Exploring the OULAD data
After completing the regression model, I chose to further explore the data and produce visuals to help show what relationships may exist between the variables.
First I looked at gender distribution:
install.packages("ggplot2")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)
This shows that the data set does not have an even number of male and female students. I wanted to look specifically at students with a disability, so I created a subset.
id_student score gender disability final_result disability_numeric
43 135400 72 F Y 0 1
44 135400 72 F Y 1 1
52 146188 52 F Y 0 1
56 148993 68 F Y 0 1
57 148993 68 F Y 1 1
87 205719 67 M Y 0 1
Then, I looked at gender distribution for students with a registered disability.
ggplot(subset_disability, aes(x = gender, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution for Students with a Disability",
       x = "Gender",
       y = "Count") +
  theme_minimal()
Next, I looked at a comparison of Final Results for male and female students with a registered disability:
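cleaned_data$final_result <- factor(cleaned_data$final_result)
# Keep only students with a registered disability, to match the plot title
disabled_only <- subset(cleaned_data, disability == "Y")
ggplot(disabled_only, aes(x = gender, fill = final_result)) +
  geom_bar(position = "stack", color = "white") +
  labs(title = "Comparison of Final Results for Male and Female Students with Disability",
       x = "Gender",
       y = "Count",  # geom_bar() plots row counts, not scores
       fill = "Final Result") +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "lightgreen")) +
  theme_minimal()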
Focusing on students with a registered disability, I looked at score distribution:
ggplot(subset_disability, aes(x = final_result, y = score, fill = final_result)) +
  geom_boxplot() +
  labs(title = "Score Distribution by Final Result for Students with a Disability",
       x = "Final Result",
       y = "Score",
       fill = "Final Result") +
  theme_minimal()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
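These warnings appear because final_result was recoded to a numeric 0/1 column, so ggplot2 treats the x axis as continuous and does not split the boxes by outcome. One way to address this is to convert the column to a factor before plotting. The sketch below uses a toy data frame, and the label names are assumptions made for illustration.

```r
library(ggplot2)

# Toy stand-in for subset_disability, with final_result already recoded to 0/1
toy <- data.frame(
  final_result = c(0, 0, 1, 1, 1, 0),
  score        = c(52, 60, 72, 68, 85, 47)
)

# Convert the numeric outcome to a factor so each level gets its own box
# (the labels are hypothetical)
toy$final_result <- factor(toy$final_result, levels = c(0, 1),
                           labels = c("Not passed", "Pass"))

p <- ggplot(toy, aes(x = final_result, y = score, fill = final_result)) +
  geom_boxplot() +
  labs(x = "Final Result", y = "Score", fill = "Final Result") +
  theme_minimal()
p
```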
Then, I looked at score distribution by gender and found very little variation:
ggplot(cleaned_data, aes(x = gender, y = score, fill = gender)) +
  geom_boxplot() +
  labs(title = "Score Distribution by Gender",
       x = "Gender",
       y = "Score",
       fill = "Gender") +
  theme_minimal()
Final result distribution:
ggplot(cleaned_data, aes(x = final_result, fill = final_result)) +
  geom_bar() +
  labs(title = "Final Result Distribution",
       x = "Final Result",
       y = "Count") +
  theme_minimal()
Final result by gender:
ggplot(cleaned_data, aes(x = final_result, fill = gender)) +
  geom_bar(position = "stack") +
  labs(title = "Final Result by Gender",
       x = "Final Result",
       y = "Count",
       fill = "Gender") +
  theme_minimal()
Final result by disability:
ggplot(cleaned_data, aes(x = final_result, fill = disability)) +
  geom_bar(position = "stack") +
  labs(title = "Final Result by Disability",
       x = "Final Result",
       y = "Count",
       fill = "Disability") +
  theme_minimal()
ggplot(subset_disability, aes(x = final_result, fill = final_result, group = final_result)) +
  geom_bar() +
  labs(title = "Final Result Distribution for Students with a Disability",
       x = "Final Result",
       y = "Count",
       fill = "Final Result") +
  theme_minimal()
ggplot(subset_disability, aes(x = final_result, y = score, fill = final_result, group = final_result)) +
  geom_violin() +
  labs(title = "Score Distribution by Final Result for Students with a Disability",
       x = "Final Result",
       y = "Score",
       fill = "Final Result") +
  theme_minimal()
subset_disability <- new_dataframe[new_dataframe$disability == "Y", ]
# Bar plot for pass/fail rates among students with a disability
ggplot(subset_disability, aes(x = final_result, fill = final_result, group = final_result)) +
  geom_bar() +
  labs(title = "Pass/Fail Rates for Students with a Disability",
       x = "Final Result",
       y = "Count",
       fill = "Final Result") +
  theme_minimal()
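Because the groups differ in size, raw counts make pass rates hard to compare across gender and disability. The following is a minimal sketch, on a toy data frame with made-up values, of computing the pass rate per group. It assumes final_result has already been recoded to 0/1, and pass_rate is a name introduced for this example.

```r
library(dplyr)

# Toy stand-in for new_dataframe (values are made up for illustration)
toy <- data.frame(
  gender       = c("F", "F", "M", "M", "F", "M"),
  disability   = c("N", "Y", "N", "Y", "N", "N"),
  final_result = c(1, 0, 1, 0, 1, 0)
)

# The mean of a 0/1 outcome is the proportion of rows coded 1
pass_rates <- toy %>%
  group_by(gender, disability) %>%
  summarise(n = n(), pass_rate = mean(final_result), .groups = "drop")

pass_rates
```

Reporting n next to each rate makes it clear when a rate rests on only a few rows.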
Findings
After exploring the data, I found that gender and disability account for very little of the variation in score and final result. The score distributions by gender show very little variation, and the regression model’s deviance barely improves on the null model (269953 versus 270437). The relationships modeled in this project appear to be in line with current research. It should be noted that disability as a variable is not specific: the registered disability could be physical, psychological, or neurological, and different disabilities impact students in a variety of ways. It’s important to consider that research findings can vary across different contexts, educational levels, and disability types. Additionally, the constant advancement of educational technology may lead to changes in pedagogical practices that impact success for disabled students.