Exploring the OU Data Set

Author

Ari Berntsen

I chose to do an exploratory data analysis of the OULAD data set, with a focus on how gender and disability relate to student success. The OULAD data set is large, and I had some difficulty working with a set of this size in RStudio. Because of this, I limited the analysis to a logistic regression model and several visuals that show relationships within the data.

Wrangling

I was surprised to find the OU data set fairly clean, but I still needed to join and clean the individual sets I used. Because of the size of the OULAD data, I chose to focus on three of the seven sets available to me: studentAssessment.csv, studentInfo.csv, and studentVle.csv. I joined them with inner_join() on the id_student column.

student_assessment <- read.csv("studentAssessment.csv")
student_info <- read.csv("studentInfo.csv")
student_vle <- read.csv("studentVle.csv")

joined_data <- student_assessment %>%
  inner_join(student_info, by = "id_student") %>%
  inner_join(student_vle, by = "id_student")

Once joined, I cleaned the data:

cleaned_data <- joined_data %>%
  filter_all(all_vars(!is.na(.))) %>%
  filter_all(all_vars(. != ""))

After this, I used summary() to look at the dataset:

summary(cleaned_data)

I found this joined data set to be too large for my computer to handle. RStudio had a difficult time running commands and began crashing. I moved to Posit Cloud, but encountered a similar issue where the session would reset before it finished running a command. After looking at the data sets, I decided to use only studentInfo.csv and studentAssessment.csv. While this was not ideal, it still allowed me to run a regression analysis and explore the relationships between disability and academic success.
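One way the size problem might be reduced is to summarise the VLE click log before joining it, so that the join works on one row per student and module presentation instead of one row per click record. The sketch below is not what was run in this project. It uses a small invented data frame as a stand-in for studentVle.csv, with column names taken from the published OULAD schema, and the names vle_toy, vle_summary, total_clicks, and active_days are made up for illustration.

```r
library(dplyr)

# Toy stand-in for studentVle.csv (values are invented for illustration).
vle_toy <- data.frame(
  code_module       = c("AAA", "AAA", "AAA", "BBB"),
  code_presentation = c("2013J", "2013J", "2013J", "2013J"),
  id_student        = c(1, 1, 2, 2),
  id_site           = c(10, 11, 10, 12),
  date              = c(-5, 3, 3, 7),
  sum_click         = c(4, 2, 7, 1)
)

# Collapse the log to one row per student and module presentation.
vle_summary <- vle_toy %>%
  group_by(id_student, code_module, code_presentation) %>%
  summarise(
    total_clicks = sum(sum_click),   # total clicks over the presentation
    active_days  = n_distinct(date), # number of distinct days with activity
    .groups = "drop"
  )

print(vle_summary)
```

The summarised table is far smaller than the raw log, so joining it to the other tables should need much less memory, at the cost of losing the day-by-day detail.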

install.packages("dplyr")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

student_assessment <- read.csv("studentAssessment.csv")
student_info <- read.csv("studentInfo.csv")

joined_data <- student_assessment %>%
  inner_join(student_info, by = "id_student")

Warning in inner_join(., student_info, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 15 of `x` matches multiple rows in `y`.
ℹ Row 226 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

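The many-to-many warning appears because a student can be enrolled in more than one module presentation, so id_student alone does not identify a single row in studentInfo.csv. The sketch below illustrates the effect on invented toy data and shows one possible alternative: looking up each assessment's module and presentation first, then joining on all three keys. It assumes the assessments.csv table from OULAD (which maps id_assessment to code_module and code_presentation). This is not the join used in this project, and all data frame names ending in _toy are hypothetical.

```r
library(dplyr)

# Toy stand-ins for assessments.csv, studentAssessment.csv and studentInfo.csv.
assessments_toy <- data.frame(
  id_assessment     = c(100, 200),
  code_module       = c("AAA", "BBB"),
  code_presentation = c("2013J", "2014B")
)
student_assessment_toy <- data.frame(
  id_assessment = c(100, 200),
  id_student    = c(1, 1),
  score         = c(72, 68)
)
student_info_toy <- data.frame(
  code_module       = c("AAA", "BBB"),
  code_presentation = c("2013J", "2014B"),
  id_student        = c(1, 1),
  final_result      = c("Fail", "Pass")
)

# Joining on id_student alone pairs every assessment with every enrolment.
loose <- inner_join(student_assessment_toy, student_info_toy,
                    by = "id_student", relationship = "many-to-many")

# Joining on student, module and presentation keeps each score with its own result.
strict <- student_assessment_toy %>%
  inner_join(assessments_toy, by = "id_assessment") %>%
  inner_join(student_info_toy,
             by = c("id_student", "code_module", "code_presentation"))

nrow(loose)   # 4 rows: each score appears once with "Fail" and once with "Pass"
nrow(strict)  # 2 rows: one per submitted assessment
```

In the loose join, the same score is attached to both final results for the student, which inflates the row count and mixes outcomes across modules.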
Once joined, I cleaned the data:

cleaned_data <- joined_data %>%
     filter_all(all_vars(!is.na(.))) %>%
     filter_all(all_vars(. != ""))
summary(cleaned_data)

 id_assessment     id_student      date_submitted    is_banked      
 Min.   : 1752   Min.   :   6516   Min.   :-11.0   Min.   :0.00000  
 1st Qu.:24283   1st Qu.: 505520   1st Qu.: 49.0   1st Qu.:0.00000  
 Median :25357   Median : 585154   Median :114.0   Median :0.00000  
 Mean   :26631   Mean   : 705820   Mean   :114.3   Mean   :0.01726  
 3rd Qu.:34881   3rd Qu.: 633303   3rd Qu.:172.0   3rd Qu.:0.00000  
 Max.   :37443   Max.   :2698588   Max.   :608.0   Max.   :1.00000  
     score        code_module        code_presentation     gender         
 Min.   :  0.00   Length:197778      Length:197778      Length:197778     
 1st Qu.: 65.00   Class :character   Class :character   Class :character  
 Median : 79.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 75.23                                                           
 3rd Qu.: 89.00                                                           
 Max.   :100.00                                                           
    region          highest_education    imd_band           age_band        
 Length:197778      Length:197778      Length:197778      Length:197778     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 num_of_prev_attempts studied_credits  disability        final_result      
 Min.   :0.0000       Min.   : 30.0   Length:197778      Length:197778     
 1st Qu.:0.0000       1st Qu.: 60.0   Class :character   Class :character  
 Median :0.0000       Median : 60.0   Mode  :character   Mode  :character  
 Mean   :0.1582       Mean   : 78.1                                        
 3rd Qu.:0.0000       3rd Qu.: 90.0                                        
 Max.   :6.0000       Max.   :630.0
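filter_all() and all_vars() still work, but they are superseded in current versions of dplyr. A roughly equivalent cleaning step can be sketched with filter() and if_all(), shown here on a small invented data frame (toy and cleaned_toy are hypothetical names, and if_all() needs dplyr 1.0.4 or later).

```r
library(dplyr)

# Toy data with one missing score and one empty string.
toy <- data.frame(
  id_student = c(1, 2, 3, 4),
  score      = c(78, NA, 70, 65),
  imd_band   = c("0-10%", "20-30%", "", "10-20%"),
  stringsAsFactors = FALSE
)

# Keep only rows where every column is non-missing and not an empty string.
cleaned_toy <- toy %>%
  filter(if_all(everything(), ~ !is.na(.x) & .x != ""))

print(cleaned_toy)
```

Only the rows for students 1 and 4 survive, because student 2 has a missing score and student 3 has an empty imd_band.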

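Logistic Regression Analysis

To begin my data analysis, I fit a regression model. Because the outcome is binary (pass or not), I used glm() with a binomial family, which is a logistic regression rather than a linear one. final_result was my dependent variable, and gender and disability were the predictors.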

selected_variables <- cleaned_data[c("id_student", "score", "gender", "disability", "final_result")]
new_dataframe <- data.frame(selected_variables)
head(new_dataframe)

  id_student score gender disability final_result
1      11391    78      M          N         Pass
2      28400    70      F          N         Pass
3      31604    72      F          N         Pass
4      32885    69      F          N         Pass
5      38053    79      M          N         Pass
6      45462    70      M          N         Pass

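# disability is coded "Y"/"N" in the data, so compare against "Y" (not "Yes")
new_dataframe$disability_numeric <- ifelse(new_dataframe$disability == "Y", 1, 0)
new_dataframe$gender <- as.factor(new_dataframe$gender)
# Binary outcome: 1 = "Pass", 0 = every other result (including "Distinction")
new_dataframe$final_result <- ifelse(new_dataframe$final_result == "Pass", 1, 0)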
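The recoding above assigns 0 to every final result other than "Pass". OULAD's final_result column also contains "Distinction", "Fail", and "Withdrawn", so students with a distinction are grouped with those who failed or withdrew. The sketch below compares that coding with one possible alternative that counts "Distinction" as a success. The model in this report was not refit with the alternative, and the variable names here are illustrative.

```r
# The four final_result values found in OULAD.
results <- c("Pass", "Distinction", "Fail", "Withdrawn")

# Coding used in this report: only "Pass" counts as 1.
pass_only <- ifelse(results == "Pass", 1, 0)

# Alternative coding (an assumption, not the author's choice):
# "Pass" and "Distinction" both count as 1.
pass_or_distinction <- as.integer(results %in% c("Pass", "Distinction"))

print(data.frame(results, pass_only, pass_or_distinction))
```

Which coding is appropriate depends on whether the question is about reaching exactly a pass or about completing the module successfully.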

model <- glm(final_result ~ gender + disability, data = new_dataframe, family = "binomial")

summary(model)


Call:
glm(formula = final_result ~ gender + disability, family = "binomial", 
    data = new_dataframe)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.362712   0.007132   50.86   <2e-16 ***
genderM     -0.101541   0.009188  -11.05   <2e-16 ***
disabilityY -0.300848   0.015422  -19.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 270437  on 197777  degrees of freedom
Residual deviance: 269953  on 197775  degrees of freedom
AIC: 269959

Number of Fisher Scoring iterations: 4

The coefficients for genderM (-0.10) and disabilityY (-0.30) are both negative, and their p-values are below 2e-16, so both are statistically significant at any conventional threshold. On the log-odds scale, this means that male students and students with a registered disability were less likely to have a final result of Pass than the reference groups (female students and students without a registered disability). The standard errors are small (roughly 0.009 and 0.015), which is expected with 197,778 rows. Because the joined data contains one row per submitted assessment rather than one row per student, the rows are not independent and the standard errors are probably understated. The residual deviance (269,953) is only about 484 lower than the null deviance (270,437), so gender and disability together explain very little of the variation in the outcome. The AIC is 269,959, which is only meaningful in comparison with other models fit to the same data.

In short, the model does identify statistically significant associations for both gender and disability, but the effects are modest and these two predictors alone do not capture much of the pattern in final results.
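Logistic regression coefficients are on the log-odds scale, which is hard to read directly. As a rough sketch, the values printed in the summary output above can be exponentiated to get odds ratios, and the deviances can be turned into a simple pseudo R-squared. The numbers are copied from the output rather than taken from the fitted model object, and the variable names are illustrative.

```r
# Estimates copied from the summary(model) output above.
coefs <- c("(Intercept)" = 0.362712, genderM = -0.101541, disabilityY = -0.300848)

# Odds ratios relative to the reference groups (female, no registered disability).
odds_ratios <- exp(coefs[c("genderM", "disabilityY")])

# Predicted probability of "Pass" for the reference group.
baseline_prob <- plogis(coefs[["(Intercept)"]])

# Share of the null deviance removed by the two predictors (McFadden-style).
null_deviance     <- 270437
residual_deviance <- 269953
pseudo_r2 <- 1 - residual_deviance / null_deviance

print(round(odds_ratios, 3))
print(round(baseline_prob, 3))
print(round(pseudo_r2, 4))
```

The odds ratios come out at about 0.90 for genderM and about 0.74 for disabilityY, and the pseudo R-squared is about 0.002, which matches the reading that the effects are detectable but small. With the fitted object available, exp(coef(model)) would give the same odds ratios directly.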

Further Exploring the OULAD data

After completing the regression model, I chose to further explore the data and produce visuals to help show what relationships may exist between the variables.

First I looked at gender distribution:

install.packages("ggplot2")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)

library(ggplot2)

ggplot(cleaned_data, aes(x = gender, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution",
       x = "Gender",
       y = "Count") +
  theme_minimal()

This shows that the data set does not have an even number of male and female students. I wanted to look specifically at students with a disability, so I created a subset.

cleaned_data <- new_dataframe %>%
  mutate(disability_numeric = ifelse(disability == "Y", 1, 0))
subset_disability <- cleaned_data[cleaned_data$disability_numeric == 1, ]
head(subset_disability)

   id_student score gender disability final_result disability_numeric
43     135400    72      F          Y            0                  1
44     135400    72      F          Y            1                  1
52     146188    52      F          Y            0                  1
56     148993    68      F          Y            0                  1
57     148993    68      F          Y            1                  1
87     205719    67      M          Y            0                  1

Then, I looked at gender distribution for students with a registered disability.

ggplot(subset_disability, aes(x = gender, fill = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution for Students with a Disability",
       x = "Gender",
       y = "Count") +
  theme_minimal()

Next, I looked at a comparison of Final Results for male and female students with a registered disability:

cleaned_data$final_result <- factor(cleaned_data$final_result)

# Use the disability subset so the plot matches its title; final_result is
# wrapped in factor() because the manual fill scale needs a discrete variable.
ggplot(subset_disability, aes(x = gender, fill = factor(final_result))) +
     geom_bar(position = "stack", color = "white") +
     labs(title = "Comparison of Final Results for Male and Female Students with Disability",
          x = "Gender",
          y = "Count",
          fill = "Final Result") +
     scale_fill_manual(values = c("0" = "lightblue", "1" = "lightgreen")) +
     theme_minimal()

Focusing on students with a registered disability, I looked at score distribution:

# final_result is numeric (0/1) in subset_disability, so convert it to a factor
# to get one box per outcome instead of a single box on a continuous axis.
ggplot(subset_disability, aes(x = factor(final_result), y = score, fill = factor(final_result))) +
  geom_boxplot() +
  labs(title = "Score Distribution by Final Result for Students with a Disability",
       x = "Final Result",
       y = "Score",
       fill = "Final Result") +
  theme_minimal()

Then, I looked at score distribution by gender and found very little variation:

ggplot(cleaned_data, aes(x = gender, y = score, fill = gender)) +
  geom_boxplot() +
  labs(title = "Score Distribution by Gender",
       x = "Gender",
       y = "Score",
       fill = "Gender") +
  theme_minimal()

Final result distribution:

ggplot(cleaned_data, aes(x = final_result, fill = final_result)) +
  geom_bar() +
  labs(title = "Final Result Distribution",
       x = "Final Result",
       y = "Count") +
  theme_minimal()

Final result by gender:

ggplot(cleaned_data, aes(x = final_result, fill = gender)) +
  geom_bar(position = "stack") +
  labs(title = "Final Result by Gender",
       x = "Final Result",
       y = "Count",
       fill = "Gender") +
  theme_minimal()

Final result by disability:

ggplot(cleaned_data, aes(x = final_result, fill = disability)) +
  geom_bar(position = "stack") +
  labs(title = "Final Result by Disability",
       x = "Final Result",
       y = "Count",
       fill = "Disability") +
  theme_minimal()

ggplot(subset_disability, aes(x = final_result, fill = final_result, group = final_result)) +
  geom_bar() +
  labs(title = "Final Result Distribution for Students with a Disability",
       x = "Final Result",
       y = "Count",
       fill = "Final Result") +
  theme_minimal()

ggplot(subset_disability, aes(x = final_result, y = score, fill = final_result, group = final_result)) +
  geom_violin() +
  labs(title = "Score Distribution by Final Result for Students with a Disability",
       x = "Final Result",
       y = "Score",
       fill = "Final Result") +
  theme_minimal()

subset_disability <- new_dataframe[new_dataframe$disability == "Y", ]

# Bar plot for pass/fail rates among students with a disability
ggplot(subset_disability, aes(x = final_result, fill = final_result, group = final_result)) +
  geom_bar() +
  labs(title = "Pass/Fail Rates for Students with a Disability",
       x = "Final Result",
       y = "Count",
       fill = "Final Result") +
  theme_minimal()
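The stacked bar charts above show counts, which makes groups of different sizes hard to compare. One complementary view is a table of pass rates by group. The sketch below uses a small invented data frame with the same column names as new_dataframe (final_result coded 0/1); the values are not from OULAD, and toy_results and pass_rates are hypothetical names.

```r
library(dplyr)

# Toy data shaped like new_dataframe (values invented for illustration).
toy_results <- data.frame(
  gender       = c("F", "F", "M", "M", "F", "M", "F", "M"),
  disability   = c("N", "N", "N", "N", "Y", "Y", "Y", "Y"),
  final_result = c(1, 0, 1, 1, 0, 0, 1, 0)
)

# The mean of a 0/1 outcome is the proportion of rows coded 1.
pass_rates <- toy_results %>%
  group_by(disability, gender) %>%
  summarise(n = n(), pass_rate = mean(final_result), .groups = "drop")

print(pass_rates)
```

Applied to the real data, the same group_by() and summarise() calls would give the proportions that the logistic regression is modelling.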

Findings

After exploring the data, I found that the regression model identifies small but statistically significant associations between gender, disability, and final result: male students and students with a registered disability were somewhat less likely to have a final result of Pass. Scores showed very little variation by gender. The relationships modeled in this project appear to be in line with current research. It should be noted that disability as a variable is not specific: the registered disability could be physical, psychological, or neurological, and different disabilities impact students in a variety of ways. It’s important to consider that research findings can vary across different contexts, educational levels, and disability types. Additionally, the constant advancement of educational technology may lead to changes in pedagogical practices that impact success for disabled students.