Machine Learning - Learning Lab 3 Badge

Part I: Reflect and Plan

Please read this article and then answer then response to the following three questions.

Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using Demographic Data as Predictor Variables: a Questionable Choice. Journal of Educational Data Mining, 15(2), 22-52.

Part A:

In what ways could the inclusion of demographic variables as predictors potentially reinforce biases in predictive analytics within education?

The inclusion of demographic variables as predictors in predictive analytics within education can reinforce biases in many ways. First, they could reinforce existing inequities because because if structural biases affect educational outcomes, then using these variables couldl perpetuate these disparities. Second, including demographic data may shift focus away from actional variables such as behaviors or academic performance. Third, predictive models are often trained on historical data, which may contain biases. Fourth, demographic data might act as a proxy for other variables that relate directly to educational outcomes, suggesting difficulty distinguishing between direct effects and indirect effects. Finally, demographic variables might not accurately represent indnividuals, particularly disadvantaging certain groups.

The authors argue that demographic variables should be used to validate fairness rather than as predictors within models. What are the potential benefits and drawbacks of this approach?

Benefits: detecting and mitigating bias, improving model accuracy and equity, promoting ethical use of data, and increasing trust and acceptability.
Drawbacks: complexities in implementation, potential overlook of subtle biases, balancing different fairness metrics, risk of misinterpretation, and limited scope of fairness.

How could the limitations of categorization impact the use of demographic variables in predictive models?

Assumes that there is commonality between individuals labeled by the demographic variable. - Limitations: oversimplification and arbitrary grouping, heterogeneity within groups, changing and politically-influenced definitions, exclusion of minor or emerging groups, risks of stereotyping and stigmatization, impact on model’s predictive power and fairness, ethical and privacy concerns, and confusion between proxy and mediating variables.

Part B: Once again, use the institutional library (e.g. NCSU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes in feature engineering

Provide an APA citation for your selected study.
- Arlitt, R., Khan, S., & Blessing, L. (2019). Feature engineering for design thinking assessment. Proceedings in the International Conference on Engineering Design, DELFT: THe Netherlands.
NOTE: Difficulty finding one in mathematics education. Had to go more general in my search.
What is the data source used in the study?
- Student responses to a variety of contextualized design questions, collected efore and after students engaged in design thinking training course.
Why was feature engineering useful or even necessary in this study?
- Feature engineering was useful to: (1) capture complex, qualitative aspects of design thinking. (2) The data was textual responses from stduents and feature engineering looked at relevant characteristics (e.g., word frequency, syntax, and semantics). (3) Use more accurate and reliable models for evaluating design thinking compentencies to enhance a model in machine learning and data analytics. (4) Interpret outcomes such as textual markers of skill aquisition. (5) scalable computational approach to assess design thinking, dealing with large volumes of student responses. (6) study could perform commparative analyses, such as pre- and post-training.

Part II: Data Product

For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, and then add these features to your model and evaluate the accuracy. How did the accuracy appreciably change–if at all? Please copy all of your code in one or more code chunks below.

#Setup
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.5     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(vip)

## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi

library(ranger)
library(dplyr)
#Data reading and exploration
interactions <- read_csv("data/oulad-interactions-filtered.csv.zip")

## Multiple files in zip: reading 'oulad-interactions-filtered.csv'
## Rows: 5548858 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

interactions %>% 
  count(activity_type)

## # A tibble: 19 × 2
##    activity_type        n
##    <chr>            <int>
##  1 dataplus           311
##  2 dualpane          7306
##  3 externalquiz     18171
##  4 forumng        1279917
##  5 glossary          9630
##  6 homepage        832424
##  7 htmlactivity      6562
##  8 oucollaborate    25861
##  9 oucontent      1065736
## 10 ouelluminate     13829
## 11 ouwiki           66413
## 12 page             33539
## 13 questionnaire    16528
## 14 quiz            398966
## 15 repeatactivity       6
## 16 resource        436704
## 17 sharedsubpage      103
## 18 subpage        1104279
## 19 url             232573

interactions %>% 
  count(activity_type) %>%
  arrange(desc(n))

## # A tibble: 19 × 2
##    activity_type        n
##    <chr>            <int>
##  1 forumng        1279917
##  2 subpage        1104279
##  3 oucontent      1065736
##  4 homepage        832424
##  5 resource        436704
##  6 quiz            398966
##  7 url             232573
##  8 ouwiki           66413
##  9 page             33539
## 10 oucollaborate    25861
## 11 externalquiz     18171
## 12 questionnaire    16528
## 13 ouelluminate     13829
## 14 glossary          9630
## 15 dualpane          7306
## 16 htmlactivity      6562
## 17 dataplus           311
## 18 sharedsubpage      103
## 19 repeatactivity       6

interactions %>% 
  ggplot(aes(x = date)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

interactions %>% 
  ggplot(aes(x = sum_click)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

students_and_assessments <- read_csv("data/oulad-students-and-assessments.csv")

## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

assessments <- read_csv("data/oulad-assessments.csv")

## Rows: 206 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (3): id_assessment, date, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Data preparation for modeling
code_module_dates <- assessments %>% 
    group_by(code_module, code_presentation) %>% 
    summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))

## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.

interactions_joined <- interactions %>% 
  left_join(code_module_dates)

## Joining with `by = join_by(code_module, code_presentation,
## quantile_cutoff_date)`

 interactions_filtered <- interactions_joined %>% 
  filter(date < quantile_cutoff_date)
interactions_summarized <- interactions_filtered %>% 
  group_by(id_student, code_module, code_presentation) %>% 
  summarise(sum_clicks = sum(sum_click))

## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.

interactions_summarized %>% 
    ggplot(aes(x = sum_clicks)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

interactions_summarized <- interactions_filtered %>% 
  group_by(id_student, code_module, code_presentation) %>% 
  summarize(sum_clicks = sum(sum_click),
            sd_clicks = sd(sum_click), 
            mean_clicks = mean(sum_click))

## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.

interactions_slopes <- interactions_filtered %>% 
  group_by(id_student, code_module, code_presentation) %>% 
  nest() %>% 
  mutate(model = map(data, ~lm(sum_click ~ 1 + date + I(date^2), data = .x) %>% 
                       tidy)) %>% 
  unnest(model)

## Warning: There were 136 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `model = map(...)`.
## ℹ In group 1202: `id_student = 160720`, `code_module = "FFF"`,
##   `code_presentation = "2013J"`.
## Caused by warning in `summary.lm()`:
## ! essentially perfect fit: summary may be unreliable
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 135 remaining warnings.

students_assessments_and_interactions <- left_join(students_and_assessments, interactions_summarized)

## Joining with `by = join_by(code_module, code_presentation, id_student)`

students_assessments_and_interactions <- students_assessments_and_interactions %>% 
  mutate(pass = as.factor(pass))
names(students_assessments_and_interactions)

##  [1] "code_module"                "code_presentation"         
##  [3] "id_student"                 "gender"                    
##  [5] "region"                     "highest_education"         
##  [7] "imd_band"                   "age_band"                  
##  [9] "num_of_prev_attempts"       "studied_credits"           
## [11] "disability"                 "final_result"              
## [13] "module_presentation_length" "date_registration"         
## [15] "date_unregistration"        "pass"                      
## [17] "mean_weighted_score"        "sum_clicks"                
## [19] "sd_clicks"                  "mean_clicks"

set.seed(20231204)

train_test_split <- initial_split(students_assessments_and_interactions , prop = .33, strata = "pass")
data_train <- training(train_test_split)
vfcv <- vfold_cv(data_train)
kfcv <- vfold_cv(data_train, v = 20) 
kfcv

## #  20-fold cross-validation 
## # A tibble: 20 × 2
##    splits              id    
##    <list>              <chr> 
##  1 <split [10217/538]> Fold01
##  2 <split [10217/538]> Fold02
##  3 <split [10217/538]> Fold03
##  4 <split [10217/538]> Fold04
##  5 <split [10217/538]> Fold05
##  6 <split [10217/538]> Fold06
##  7 <split [10217/538]> Fold07
##  8 <split [10217/538]> Fold08
##  9 <split [10217/538]> Fold09
## 10 <split [10217/538]> Fold10
## 11 <split [10217/538]> Fold11
## 12 <split [10217/538]> Fold12
## 13 <split [10217/538]> Fold13
## 14 <split [10217/538]> Fold14
## 15 <split [10217/538]> Fold15
## 16 <split [10218/537]> Fold16
## 17 <split [10218/537]> Fold17
## 18 <split [10218/537]> Fold18
## 19 <split [10218/537]> Fold19
## 20 <split [10218/537]> Fold20

#Recipe definition for preprocessing
my_rec <- recipe(  pass ~ disability +
                   date_registration + 
                   gender +
                   code_module + sum_clicks,
                   data = data_train) 
my_rec <- my_rec %>% 
  step_dummy(disability) %>% 
  step_dummy(gender) %>%  
  step_dummy(code_module) %>% 
  step_impute_knn(sum_clicks) %>% 
  step_impute_knn(date_registration) %>% 
  step_normalize(all_numeric_predictors())
my_mod <-
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

my_wf <-
  workflow() %>% 
  add_model(my_mod) %>% 
  add_recipe(my_rec)
#Model training and evaluation
class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap)
fitted_model_resamples <- fit_resamples(my_wf, resamples = vfcv, metrics = class_metrics)
collect_metrics(fitted_model_resamples)

## # A tibble: 6 × 6
##   .metric     .estimator  mean     n std_err .config             
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.646    10 0.00661 Preprocessor1_Model1
## 2 kap         binary     0.180    10 0.0130  Preprocessor1_Model1
## 3 npv         binary     0.559    10 0.0155  Preprocessor1_Model1
## 4 ppv         binary     0.670    10 0.00624 Preprocessor1_Model1
## 5 sensitivity binary     0.845    10 0.00748 Preprocessor1_Model1
## 6 specificity binary     0.319    10 0.00806 Preprocessor1_Model1

fitted_model <- fit(my_wf, data_train)
final_fit <- last_fit(fitted_model, train_test_split, metrics = class_metrics)
final_fit %>% 
  collect_metrics()

## # A tibble: 6 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.650 Preprocessor1_Model1
## 2 sensitivity binary         0.847 Preprocessor1_Model1
## 3 specificity binary         0.326 Preprocessor1_Model1
## 4 ppv         binary         0.673 Preprocessor1_Model1
## 5 npv         binary         0.566 Preprocessor1_Model1
## 6 kap         binary         0.189 Preprocessor1_Model1

collect_predictions(final_fit) %>% 
  conf_mat(.pred_class, pass)

##           Truth
## Prediction     0     1
##          0 11488  2068
##          1  5581  2701

final_fit %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Receiving Your Machine Learning Badge

To receive credit for this assignment and earn your third ML badge, share the link to published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Learning Lab 3 Badge

AJ Edson

December 4, 2023

Part I: Reflect and Plan

Part II: Data Product

Knit and Publish

Receiving Your Machine Learning Badge