Part I: Reflect and Plan

Please read this article and then answer then response to the following three questions.

Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using Demographic Data as Predictor Variables: a Questionable Choice. Journal of Educational Data Mining, 15(2), 22-52.

Part A:

  1. In what ways could the inclusion of demographic variables as predictors potentially reinforce biases in predictive analytics within education?

The authors argues that the inclusion of demographic variables as predictors in predictive analytics within education can potentially reinforce biases. The authors argue that demographic variables, such as race, ethnicity, and gender, are often correlated with academic achievement, but this correlation is often due to factors such as discrimination and systemic inequality. As a result, using demographic variables as predictors in predictive analytics can lead to models that perpetuate these biases.

  1. The authors argue that demographic variables should be used to validate fairness rather than as predictors within models. What are the potential benefits and drawbacks of this approach?

Potential benefits:

Potential drawbacks:

  1. How could the limitations of categorization impact the use of demographic variables in predictive models?

Categorization can be inaccurate. The categories that we use to define demographic variables are often based on social constructs, which means that they are not always accurate or reliable. For example, the category of “race” is often based on physical characteristics, but these characteristics do not always accurately reflect a person’s ancestry or cultural background. As a result, predictive models that are based on demographic categories may be inaccurate for individuals who do not fit neatly into those categories.

Part B: Once again, use the institutional library (e.g. NCSU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes in feature engineering

  1. Provide an APA citation for your selected study.

    • Polyzou, A., & Karypis, G. (2018). Feature Extraction for Classifying Students Based on Their Academic Performance. International Educational Data Mining Society.
  2. What is the data source used in the study?

    • The data source used in the study “Feature Extraction for Classifying Students Based on Their Academic Performance” by Polyzou et al. (2018) was the University of Minnesota’s Learning Record Store (LRS). The LRS is a repository of student learning data, including grades, attendance, test scores, and extracurricular activities. The study used data from the LRS to train a machine learning model to predict student academic success.
  3. Why was feature engineering useful or even necessary in this study?

    • To improve the accuracy of the machine learning model. The raw data in the LRS was not directly suitable for use in a machine learning model. The features needed to be extracted and transformed in a way that would make them more informative and relevant for the model.

Part II: Data Product

For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, and then add these features to your model and evaluate the accuracy. How did the accuracy appreciably change–if at all? Please copy all of your code in one or more code chunks below.

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read dataset

interactions <- read_csv("data/oulad-interactions-filtered.csv")
## Rows: 5548858 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
student_assessments <- read_csv("data/oulad-students-and-assessments.csv")
## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#glimpse(interactions)
colnames(interactions)
##  [1] "code_module"                "code_presentation"         
##  [3] "id_student"                 "id_site"                   
##  [5] "date"                       "sum_click"                 
##  [7] "activity_type"              "week_from"                 
##  [9] "week_to"                    "module_presentation_length"
## [11] "quantile_cutoff_date"
colnames(student_assessments)
##  [1] "code_module"                "code_presentation"         
##  [3] "id_student"                 "gender"                    
##  [5] "region"                     "highest_education"         
##  [7] "imd_band"                   "age_band"                  
##  [9] "num_of_prev_attempts"       "studied_credits"           
## [11] "disability"                 "final_result"              
## [13] "module_presentation_length" "date_registration"         
## [15] "date_unregistration"        "pass"                      
## [17] "mean_weighted_score"

Feauture Engineering

# Filter the data for homepage activity
interactions_homepage <- interactions %>%
  filter(activity_type == "homepage")

# Change sum_click to homepage_click
interactions_homepage <- interactions_homepage %>%
  rename(homepage_click = sum_click)

glimpse(interactions_homepage)
## Rows: 832,424
## Columns: 11
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 28400, 30268, 31604, 32885, 38053, 45462, 5…
## $ id_site                    <dbl> 546614, 546614, 546614, 546614, 546614, 546…
## $ date                       <dbl> -10, -10, -10, -10, -10, -10, -10, -10, -10…
## $ homepage_click             <dbl> 11, 3, 11, 13, 13, 7, 18, 12, 8, 8, 3, 14, …
## $ activity_type              <chr> "homepage", "homepage", "homepage", "homepa…
## $ week_from                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ week_to                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ quantile_cutoff_date       <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54,…
# Join interactions_homepage with student_assessments by id_student
joined_data <- interactions_homepage %>%
  inner_join(student_assessments, by = c("id_student"))
## Warning in inner_join(., student_assessments, by = c("id_student")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 11 of `x` matches multiple rows in `y`.
## ℹ Row 338 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
joined_data$gender <- ifelse(joined_data$gender == "M", "Male", "Female")
glimpse(joined_data)
## Rows: 1,015,668
## Columns: 27
## $ code_module.x                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA",…
## $ code_presentation.x          <chr> "2013J", "2013J", "2013J", "2013J", "2013…
## $ id_student                   <dbl> 28400, 30268, 31604, 32885, 38053, 45462,…
## $ id_site                      <dbl> 546614, 546614, 546614, 546614, 546614, 5…
## $ date                         <dbl> -10, -10, -10, -10, -10, -10, -10, -10, -…
## $ homepage_click               <dbl> 11, 3, 11, 13, 13, 7, 18, 12, 8, 8, 3, 3,…
## $ activity_type                <chr> "homepage", "homepage", "homepage", "home…
## $ week_from                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ week_to                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ module_presentation_length.x <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 2…
## $ quantile_cutoff_date         <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5…
## $ code_module.y                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA",…
## $ code_presentation.y          <chr> "2013J", "2013J", "2013J", "2013J", "2013…
## $ gender                       <chr> "Female", "Female", "Female", "Female", "…
## $ region                       <chr> "Scotland", "North Western Region", "Sout…
## $ highest_education            <chr> "HE Qualification", "A Level or Equivalen…
## $ imd_band                     <dbl> 3, 4, 6, 6, 9, 4, 3, 7, 6, 5, 8, 8, 7, NA…
## $ age_band                     <chr> "35-55", "35-55", "35-55", "0-35", "35-55…
## $ num_of_prev_attempts         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ studied_credits              <dbl> 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6…
## $ disability                   <chr> "N", "Y", "N", "N", "N", "N", "N", "N", "…
## $ final_result                 <chr> "Pass", "Withdrawn", "Pass", "Pass", "Pas…
## $ module_presentation_length.y <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 2…
## $ date_registration            <dbl> -53, -92, -52, -176, -110, -67, -47, -59,…
## $ date_unregistration          <dbl> NA, 12, NA, NA, NA, NA, NA, NA, NA, NA, 9…
## $ pass                         <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ mean_weighted_score          <dbl> 700, NA, 720, 690, 790, 700, 730, 670, 73…
# Convert the character columns to factors
data <- joined_data %>% mutate_at(vars(gender, region, highest_education, disability), as.factor)

Train Test Split

# Split the data into a training set and a test set
set.seed(123)
split <- sample(nrow(data), 0.75*nrow(data))
train <- data[split,]
test <- data[-split,]

Model

# Fit a model to the training data
model <- lm(homepage_click ~ module_presentation_length.x + quantile_cutoff_date + gender + region + highest_education + disability, data=train)

Model Evaluation

# Evaluate the model on the test data
predictions <- predict(model, test)
mse <- mean((predictions - test$homepage_click)^2, na.rm = TRUE)
print(mse)
## [1] 38.94177

Model2: mse: 38.94177

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Receiving Your Machine Learning Badge

To receive credit for this assignment and earn your third ML badge, share the link to published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.

Once your instructor has checked your link, you will be provided a physical version of the badge below!