Please read this article and then answer then response to the following three questions.
Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using Demographic Data as Predictor Variables: a Questionable Choice. Journal of Educational Data Mining, 15(2), 22-52.
Part A:
Benefits: detecting and mitigating bias, improving model accuracy and equity, promoting ethical use of data, and increasing trust and acceptability.
Drawbacks: complexities in implementation, potential overlook of subtle biases, balancing different fairness metrics, risk of misinterpretation, and limited scope of fairness.
Assumes that there is commonality between individuals labeled by the demographic variable. - Limitations: oversimplification and arbitrary grouping, heterogeneity within groups, changing and politically-influenced definitions, exclusion of minor or emerging groups, risks of stereotyping and stigmatization, impact on model’s predictive power and fairness, ethical and privacy concerns, and confusion between proxy and mediating variables.
Part B: Once again, use the institutional library (e.g. NCSU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes in feature engineering
Provide an APA citation for your selected study.
NOTE: Difficulty finding one in mathematics education. Had to go more general in my search.
What is the data source used in the study?
Why was feature engineering useful or even necessary in this study?
For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, and then add these features to your model and evaluate the accuracy. How did the accuracy appreciably change–if at all? Please copy all of your code in one or more code chunks below.
#Setup
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(ranger)
library(dplyr)
#Data reading and exploration
interactions <- read_csv("data/oulad-interactions-filtered.csv.zip")
## Multiple files in zip: reading 'oulad-interactions-filtered.csv'
## Rows: 5548858 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
interactions %>%
count(activity_type)
## # A tibble: 19 × 2
## activity_type n
## <chr> <int>
## 1 dataplus 311
## 2 dualpane 7306
## 3 externalquiz 18171
## 4 forumng 1279917
## 5 glossary 9630
## 6 homepage 832424
## 7 htmlactivity 6562
## 8 oucollaborate 25861
## 9 oucontent 1065736
## 10 ouelluminate 13829
## 11 ouwiki 66413
## 12 page 33539
## 13 questionnaire 16528
## 14 quiz 398966
## 15 repeatactivity 6
## 16 resource 436704
## 17 sharedsubpage 103
## 18 subpage 1104279
## 19 url 232573
interactions %>%
count(activity_type) %>%
arrange(desc(n))
## # A tibble: 19 × 2
## activity_type n
## <chr> <int>
## 1 forumng 1279917
## 2 subpage 1104279
## 3 oucontent 1065736
## 4 homepage 832424
## 5 resource 436704
## 6 quiz 398966
## 7 url 232573
## 8 ouwiki 66413
## 9 page 33539
## 10 oucollaborate 25861
## 11 externalquiz 18171
## 12 questionnaire 16528
## 13 ouelluminate 13829
## 14 glossary 9630
## 15 dualpane 7306
## 16 htmlactivity 6562
## 17 dataplus 311
## 18 sharedsubpage 103
## 19 repeatactivity 6
interactions %>%
ggplot(aes(x = date)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
interactions %>%
ggplot(aes(x = sum_click)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
students_and_assessments <- read_csv("data/oulad-students-and-assessments.csv")
## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 206 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (3): id_assessment, date, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Data preparation for modeling
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
interactions_joined <- interactions %>%
left_join(code_module_dates)
## Joining with `by = join_by(code_module, code_presentation,
## quantile_cutoff_date)`
interactions_filtered <- interactions_joined %>%
filter(date < quantile_cutoff_date)
interactions_summarized <- interactions_filtered %>%
group_by(id_student, code_module, code_presentation) %>%
summarise(sum_clicks = sum(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_summarized %>%
ggplot(aes(x = sum_clicks)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
interactions_summarized <- interactions_filtered %>%
group_by(id_student, code_module, code_presentation) %>%
summarize(sum_clicks = sum(sum_click),
sd_clicks = sd(sum_click),
mean_clicks = mean(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_slopes <- interactions_filtered %>%
group_by(id_student, code_module, code_presentation) %>%
nest() %>%
mutate(model = map(data, ~lm(sum_click ~ 1 + date + I(date^2), data = .x) %>%
tidy)) %>%
unnest(model)
## Warning: There were 136 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `model = map(...)`.
## ℹ In group 1202: `id_student = 160720`, `code_module = "FFF"`,
## `code_presentation = "2013J"`.
## Caused by warning in `summary.lm()`:
## ! essentially perfect fit: summary may be unreliable
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 135 remaining warnings.
students_assessments_and_interactions <- left_join(students_and_assessments, interactions_summarized)
## Joining with `by = join_by(code_module, code_presentation, id_student)`
students_assessments_and_interactions <- students_assessments_and_interactions %>%
mutate(pass = as.factor(pass))
names(students_assessments_and_interactions)
## [1] "code_module" "code_presentation"
## [3] "id_student" "gender"
## [5] "region" "highest_education"
## [7] "imd_band" "age_band"
## [9] "num_of_prev_attempts" "studied_credits"
## [11] "disability" "final_result"
## [13] "module_presentation_length" "date_registration"
## [15] "date_unregistration" "pass"
## [17] "mean_weighted_score" "sum_clicks"
## [19] "sd_clicks" "mean_clicks"
set.seed(20231204)
train_test_split <- initial_split(students_assessments_and_interactions , prop = .33, strata = "pass")
data_train <- training(train_test_split)
vfcv <- vfold_cv(data_train)
kfcv <- vfold_cv(data_train, v = 20)
kfcv
## # 20-fold cross-validation
## # A tibble: 20 × 2
## splits id
## <list> <chr>
## 1 <split [10217/538]> Fold01
## 2 <split [10217/538]> Fold02
## 3 <split [10217/538]> Fold03
## 4 <split [10217/538]> Fold04
## 5 <split [10217/538]> Fold05
## 6 <split [10217/538]> Fold06
## 7 <split [10217/538]> Fold07
## 8 <split [10217/538]> Fold08
## 9 <split [10217/538]> Fold09
## 10 <split [10217/538]> Fold10
## 11 <split [10217/538]> Fold11
## 12 <split [10217/538]> Fold12
## 13 <split [10217/538]> Fold13
## 14 <split [10217/538]> Fold14
## 15 <split [10217/538]> Fold15
## 16 <split [10218/537]> Fold16
## 17 <split [10218/537]> Fold17
## 18 <split [10218/537]> Fold18
## 19 <split [10218/537]> Fold19
## 20 <split [10218/537]> Fold20
#Recipe definition for preprocessing
my_rec <- recipe( pass ~ disability +
date_registration +
gender +
code_module + sum_clicks,
data = data_train)
my_rec <- my_rec %>%
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module) %>%
step_impute_knn(sum_clicks) %>%
step_impute_knn(date_registration) %>%
step_normalize(all_numeric_predictors())
my_mod <-
rand_forest() %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
#Model training and evaluation
class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap)
fitted_model_resamples <- fit_resamples(my_wf, resamples = vfcv, metrics = class_metrics)
collect_metrics(fitted_model_resamples)
## # A tibble: 6 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.646 10 0.00661 Preprocessor1_Model1
## 2 kap binary 0.180 10 0.0130 Preprocessor1_Model1
## 3 npv binary 0.559 10 0.0155 Preprocessor1_Model1
## 4 ppv binary 0.670 10 0.00624 Preprocessor1_Model1
## 5 sensitivity binary 0.845 10 0.00748 Preprocessor1_Model1
## 6 specificity binary 0.319 10 0.00806 Preprocessor1_Model1
fitted_model <- fit(my_wf, data_train)
final_fit <- last_fit(fitted_model, train_test_split, metrics = class_metrics)
final_fit %>%
collect_metrics()
## # A tibble: 6 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.650 Preprocessor1_Model1
## 2 sensitivity binary 0.847 Preprocessor1_Model1
## 3 specificity binary 0.326 Preprocessor1_Model1
## 4 ppv binary 0.673 Preprocessor1_Model1
## 5 npv binary 0.566 Preprocessor1_Model1
## 6 kap binary 0.189 Preprocessor1_Model1
collect_predictions(final_fit) %>%
conf_mat(.pred_class, pass)
## Truth
## Prediction 0 1
## 0 11488 2068
## 1 5581 2701
final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
Complete the following steps to knit and publish your work:
First, change the name of the author: in the YAML
header at the very top of this document to your name. The YAML
header controls the style and feel for knitted document but doesn’t
actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your third ML badge, share the link to published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!