Please read this article and then answer then response to the following three questions.
Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using Demographic Data as Predictor Variables: a Questionable Choice. Journal of Educational Data Mining, 15(2), 22-52.
Part A:
The authors argues that the inclusion of demographic variables as predictors in predictive analytics within education can potentially reinforce biases. The authors argue that demographic variables, such as race, ethnicity, and gender, are often correlated with academic achievement, but this correlation is often due to factors such as discrimination and systemic inequality. As a result, using demographic variables as predictors in predictive analytics can lead to models that perpetuate these biases.
Potential benefits:
Potential drawbacks:
Categorization can be inaccurate. The categories that we use to define demographic variables are often based on social constructs, which means that they are not always accurate or reliable. For example, the category of “race” is often based on physical characteristics, but these characteristics do not always accurately reflect a person’s ancestry or cultural background. As a result, predictive models that are based on demographic categories may be inaccurate for individuals who do not fit neatly into those categories.
Part B: Once again, use the institutional library (e.g. NCSU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes in feature engineering
Provide an APA citation for your selected study.
What is the data source used in the study?
Why was feature engineering useful or even necessary in this study?
For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, and then add these features to your model and evaluate the accuracy. How did the accuracy appreciably change–if at all? Please copy all of your code in one or more code chunks below.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
interactions <- read_csv("data/oulad-interactions-filtered.csv")
## Rows: 5548858 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
student_assessments <- read_csv("data/oulad-students-and-assessments.csv")
## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#glimpse(interactions)
colnames(interactions)
## [1] "code_module" "code_presentation"
## [3] "id_student" "id_site"
## [5] "date" "sum_click"
## [7] "activity_type" "week_from"
## [9] "week_to" "module_presentation_length"
## [11] "quantile_cutoff_date"
colnames(student_assessments)
## [1] "code_module" "code_presentation"
## [3] "id_student" "gender"
## [5] "region" "highest_education"
## [7] "imd_band" "age_band"
## [9] "num_of_prev_attempts" "studied_credits"
## [11] "disability" "final_result"
## [13] "module_presentation_length" "date_registration"
## [15] "date_unregistration" "pass"
## [17] "mean_weighted_score"
# Filter the data for homepage activity
interactions_homepage <- interactions %>%
filter(activity_type == "homepage")
# Change sum_click to homepage_click
interactions_homepage <- interactions_homepage %>%
rename(homepage_click = sum_click)
glimpse(interactions_homepage)
## Rows: 832,424
## Columns: 11
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 28400, 30268, 31604, 32885, 38053, 45462, 5…
## $ id_site <dbl> 546614, 546614, 546614, 546614, 546614, 546…
## $ date <dbl> -10, -10, -10, -10, -10, -10, -10, -10, -10…
## $ homepage_click <dbl> 11, 3, 11, 13, 13, 7, 18, 12, 8, 8, 3, 14, …
## $ activity_type <chr> "homepage", "homepage", "homepage", "homepa…
## $ week_from <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ week_to <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ quantile_cutoff_date <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54,…
# Join interactions_homepage with student_assessments by id_student
joined_data <- interactions_homepage %>%
inner_join(student_assessments, by = c("id_student"))
## Warning in inner_join(., student_assessments, by = c("id_student")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 11 of `x` matches multiple rows in `y`.
## ℹ Row 338 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
joined_data$gender <- ifelse(joined_data$gender == "M", "Male", "Female")
glimpse(joined_data)
## Rows: 1,015,668
## Columns: 27
## $ code_module.x <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA",…
## $ code_presentation.x <chr> "2013J", "2013J", "2013J", "2013J", "2013…
## $ id_student <dbl> 28400, 30268, 31604, 32885, 38053, 45462,…
## $ id_site <dbl> 546614, 546614, 546614, 546614, 546614, 5…
## $ date <dbl> -10, -10, -10, -10, -10, -10, -10, -10, -…
## $ homepage_click <dbl> 11, 3, 11, 13, 13, 7, 18, 12, 8, 8, 3, 3,…
## $ activity_type <chr> "homepage", "homepage", "homepage", "home…
## $ week_from <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ week_to <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ module_presentation_length.x <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 2…
## $ quantile_cutoff_date <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5…
## $ code_module.y <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA",…
## $ code_presentation.y <chr> "2013J", "2013J", "2013J", "2013J", "2013…
## $ gender <chr> "Female", "Female", "Female", "Female", "…
## $ region <chr> "Scotland", "North Western Region", "Sout…
## $ highest_education <chr> "HE Qualification", "A Level or Equivalen…
## $ imd_band <dbl> 3, 4, 6, 6, 9, 4, 3, 7, 6, 5, 8, 8, 7, NA…
## $ age_band <chr> "35-55", "35-55", "35-55", "0-35", "35-55…
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ studied_credits <dbl> 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 6…
## $ disability <chr> "N", "Y", "N", "N", "N", "N", "N", "N", "…
## $ final_result <chr> "Pass", "Withdrawn", "Pass", "Pass", "Pas…
## $ module_presentation_length.y <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 2…
## $ date_registration <dbl> -53, -92, -52, -176, -110, -67, -47, -59,…
## $ date_unregistration <dbl> NA, 12, NA, NA, NA, NA, NA, NA, NA, NA, 9…
## $ pass <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ mean_weighted_score <dbl> 700, NA, 720, 690, 790, 700, 730, 670, 73…
# Convert the character columns to factors
data <- joined_data %>% mutate_at(vars(gender, region, highest_education, disability), as.factor)
# Split the data into a training set and a test set
set.seed(123)
split <- sample(nrow(data), 0.75*nrow(data))
train <- data[split,]
test <- data[-split,]
# Fit a model to the training data
model <- lm(homepage_click ~ module_presentation_length.x + quantile_cutoff_date + gender + region + highest_education + disability, data=train)
# Evaluate the model on the test data
predictions <- predict(model, test)
mse <- mean((predictions - test$homepage_click)^2, na.rm = TRUE)
print(mse)
## [1] 38.94177
Model2: mse: 38.94177
Complete the following steps to knit and publish your work:
First, change the name of the author: in the YAML
header at the very top of this document to your name. The YAML
header controls the style and feel for knitted document but doesn’t
actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your third ML badge, share the link to published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!