Part I: Reflect and Plan

Please read this article and then respond to the following three questions.

Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using demographic data as predictor variables: A questionable choice. Journal of Educational Data Mining, 15(2), 22-52.

Part A:

  1. In what ways could the inclusion of demographic variables as predictors potentially reinforce biases in predictive analytics within education? It could produce predictions based primarily on demographic group membership rather than on more actionable variables, effectively baking existing group differences into the model's outputs.

  2. The authors argue that demographic variables should be used to validate fairness rather than as predictors within models. What are the potential benefits and drawbacks of this approach? (A sketch of such a fairness check appears after this list.) Benefits:

    1. It encourages the use of more actionable predictor variables;
    2. It reduces bias built into trained models.

    Drawbacks:

    1. Demographic variables can carry predictive signal, so excluding them may reduce accuracy;
    2. A strictly demographic-blind model risks systematizing “abstract liberalism,” treating all students identically while overlooking structural inequities.
  3. How could the limitations of categorization impact the use of demographic variables in predictive models? The authors maintain that using a demographic variable as a predictor assumes commonality among the individuals labeled with that value. The limitations of categorization (e.g., categories that are too coarse, unstable over time, or forced-choice) undermine this fundamental assumption.
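
To make the fairness check in question 2 concrete, here is a minimal sketch; it is not from the article, and the toy preds table and its columns are hypothetical stand-ins for real model output. It compares a model's accuracy across groups instead of using the demographic variable as a predictor.

library(dplyr)
library(yardstick)

# Hypothetical test-set predictions: a demographic column, the true label,
# and the model's predicted class
preds <- tibble(
  gender      = factor(c("F", "F", "M", "M")),
  pass        = factor(c(1, 0, 1, 0)),
  .pred_class = factor(c(1, 1, 1, 0))
)

# Audit the model: compute accuracy separately for each demographic group
preds %>%
  group_by(gender) %>%
  accuracy(truth = pass, estimate = .pred_class)

Large accuracy gaps between groups would flag a fairness problem even though gender never entered the model as a predictor.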

Part B: Once again, use an institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes feature engineering.

  1. Provide an APA citation for your selected study. Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students’ science inquiry skills using educational data mining. Journal of the Learning Sciences, 22(4), 521-563. https://doi.org/10.1080/10508406.2013.837391

  2. What is the data source used in the study? Log files generated by 148 eighth-grade students (ages 12-14) at a public middle school in central Massachusetts as they completed science inquiry tasks.

  3. Why was feature engineering useful or even necessary in this study? Feature engineering was needed to distill low-level log events into theoretically meaningful indicators of inquiry skill, improving the construct validity of the resulting assessment metrics. (A toy sketch of this kind of log-to-feature aggregation follows.)
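
As an illustration only (toy data, not taken from the study), this kind of feature engineering turns raw log events into per-student behavior counts:

library(dplyr)
library(tidyr)

# Toy log data: one row per low-level student action
logs <- tibble(
  id_student = c(1, 1, 1, 2, 2),
  action     = c("run_trial", "change_variable", "run_trial",
                 "run_trial", "make_hypothesis")
)

# Aggregate events into one row per student, one engineered feature per action type
logs %>%
  count(id_student, action) %>%
  pivot_wider(names_from = action, values_from = n, values_fill = 0)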

Part II: Data Product

For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, then add these features to your model and evaluate the accuracy. Did the accuracy appreciably change, and if so, how? Please copy all of your code in one or more code chunks below.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(vip) # a new package we're adding for variable importance measures
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(ranger) # this is needed for the random forest algorithm
interactions <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-interactions-filtered.csv")
## Rows: 1048575 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3. EXPLORE

interactions %>% 
  count(activity_type)
## # A tibble: 12 × 2
##    activity_type      n
##    <chr>          <int>
##  1 dataplus         271
##  2 forumng       325314
##  3 glossary        3153
##  4 homepage      178674
##  5 oucollaborate   3059
##  6 oucontent     140585
##  7 ouelluminate     667
##  8 quiz           90855
##  9 resource      122466
## 10 sharedsubpage    103
## 11 subpage       133975
## 12 url            49453
interactions %>% 
  count(activity_type) %>%
  arrange(desc(n))
## # A tibble: 12 × 2
##    activity_type      n
##    <chr>          <int>
##  1 forumng       325314
##  2 homepage      178674
##  3 oucontent     140585
##  4 subpage       133975
##  5 resource      122466
##  6 quiz           90855
##  7 url            49453
##  8 glossary        3153
##  9 oucollaborate   3059
## 10 ouelluminate     667
## 11 dataplus         271
## 12 sharedsubpage    103
interactions %>% 
  ggplot(aes(x = date)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

interactions %>% 
  ggplot(aes(x = sum_click)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

assessments <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
code_module_dates <- assessments %>% 
    group_by(code_module, code_presentation) %>% 
    summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
interactions_joined <- interactions %>% 
  left_join(code_module_dates) # join the data based on code_module and code_presentation
## Joining with `by = join_by(code_module, code_presentation,
## quantile_cutoff_date)`
interactions_filtered <- interactions_joined %>% 
  filter(date < quantile_cutoff_date) # keep only interactions that occurred before the cutoff date
interactions_summarized <- interactions_filtered %>% 
  group_by(id_student, code_module, code_presentation) %>% 
  summarise(sum_clicks = sum(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_summarized %>% 
    ggplot(aes(x = sum_clicks)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

interactions_summarized <- interactions_filtered %>% 
  group_by(id_student, code_module, code_presentation) %>% 
  summarize(sum_clicks = sum(sum_click),
            sd_clicks = sd(sum_click), 
            mean_clicks = mean(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_slopes <- interactions_filtered %>% # start with our interactions data
  group_by(id_student, code_module, code_presentation) %>% # group by student and course so a slope is calculated for each student by course combination 
  nest() %>% # this tells R that we want to create a smaller data frame, wherein each students' data is assigned its own cell; it makes the next step possible
  mutate(model = map(data, ~lm(sum_click ~ 1 + date + I(date^2), data = .x) %>% 
                       tidy)) %>% # now, given that we have a smaller data frame with each student by course combination in its own cell, we can use mutate to fit a linear model for each students by course data set; the linear model fits a linear and a quadratic term
  unnest(model) # this returns the data to its original (non-nested structure), retaining the mutated model output
## Warning: There were 106 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `model = map(...)`.
## ℹ In group 477: `id_student = 155298`, `code_module = "CCC"`,
##   `code_presentation = "2014J"`.
## Caused by warning in `summary.lm()`:
## ! essentially perfect fit: summary may be unreliable
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 105 remaining warnings.
all_features <- left_join(interactions_filtered, interactions_slopes,
                          by = join_by(id_student, code_module, code_presentation),
                          relationship = "many-to-many")
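
Because the tidy() output in interactions_slopes is long (one row per coefficient), the many-to-many join above duplicates interaction rows. A minimal alternative sketch, using only objects created above, keeps just the linear slope and reshapes to one row per student-course combination before joining:

# Keep only the linear growth term and rename it as a feature
slopes_wide <- interactions_slopes %>%
  ungroup() %>%
  filter(term == "date") %>%
  select(id_student, code_module, code_presentation, date_slope = estimate)

# Each student-course combination now contributes exactly one row
interactions_with_slopes <- interactions_summarized %>%
  left_join(slopes_wide, by = join_by(id_student, code_module, code_presentation))
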
students_and_assessments <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-students-and-assessments.csv")
## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students_assessments_and_interactions <- left_join(students_and_assessments, 
                                                   interactions_summarized)
## Joining with `by = join_by(code_module, code_presentation, id_student)`
students_assessments_and_interactions <- students_assessments_and_interactions %>% 
  mutate(pass = as.factor(pass))

# write_csv(students_assessments_and_interactions, "students-assessments-and-interactions.csv")
names(students_assessments_and_interactions)
##  [1] "code_module"                "code_presentation"         
##  [3] "id_student"                 "gender"                    
##  [5] "region"                     "highest_education"         
##  [7] "imd_band"                   "age_band"                  
##  [9] "num_of_prev_attempts"       "studied_credits"           
## [11] "disability"                 "final_result"              
## [13] "module_presentation_length" "date_registration"         
## [15] "date_unregistration"        "pass"                      
## [17] "mean_weighted_score"        "sum_clicks"                
## [19] "sd_clicks"                  "mean_clicks"
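
The prompt asks for features specific to activity types, so here is a minimal sketch (using the objects created above; the clicks_ column names are my own) that totals clicks per activity type for each student-course combination and joins them in. The resulting columns (e.g., clicks_quiz, clicks_forumng) could then be added to the recipe formula below.

# One engineered feature per activity type: total clicks per student-course
activity_features <- interactions_filtered %>%
  count(id_student, code_module, code_presentation,
        activity_type, wt = sum_click, name = "clicks") %>%
  pivot_wider(names_from = activity_type, values_from = clicks,
              values_fill = 0, names_prefix = "clicks_")

students_assessments_and_interactions <- students_assessments_and_interactions %>%
  left_join(activity_features,
            by = join_by(id_student, code_module, code_presentation))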

4. MODEL

set.seed(20230712)

train_test_split <- initial_split(students_assessments_and_interactions, prop = .33, strata = "pass")
data_train <- training(train_test_split)
vfcv <- vfold_cv(data_train) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
vfcv
## #  10-fold cross-validation 
## # A tibble: 10 × 2
##    splits              id    
##    <list>              <chr> 
##  1 <split [9679/1076]> Fold01
##  2 <split [9679/1076]> Fold02
##  3 <split [9679/1076]> Fold03
##  4 <split [9679/1076]> Fold04
##  5 <split [9679/1076]> Fold05
##  6 <split [9680/1075]> Fold06
##  7 <split [9680/1075]> Fold07
##  8 <split [9680/1075]> Fold08
##  9 <split [9680/1075]> Fold09
## 10 <split [9680/1075]> Fold10
kfcv <- vfold_cv(data_train, v = 20) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
kfcv
## #  20-fold cross-validation 
## # A tibble: 20 × 2
##    splits              id    
##    <list>              <chr> 
##  1 <split [10217/538]> Fold01
##  2 <split [10217/538]> Fold02
##  3 <split [10217/538]> Fold03
##  4 <split [10217/538]> Fold04
##  5 <split [10217/538]> Fold05
##  6 <split [10217/538]> Fold06
##  7 <split [10217/538]> Fold07
##  8 <split [10217/538]> Fold08
##  9 <split [10217/538]> Fold09
## 10 <split [10217/538]> Fold10
## 11 <split [10217/538]> Fold11
## 12 <split [10217/538]> Fold12
## 13 <split [10217/538]> Fold13
## 14 <split [10217/538]> Fold14
## 15 <split [10217/538]> Fold15
## 16 <split [10218/537]> Fold16
## 17 <split [10218/537]> Fold17
## 18 <split [10218/537]> Fold18
## 19 <split [10218/537]> Fold19
## 20 <split [10218/537]> Fold20
my_rec <- recipe(pass ~ disability +
                   date_registration + 
                   gender +
                   code_module +
                   sum_clicks, # candidate new features, commented out for now:
                   # mean_weighted_score +
                   # sd_clicks + mean_clicks +
                   # intercept + date + date_2,
                 data = data_train) 
my_rec <- my_rec %>% 
  step_dummy(disability) %>% # dummy-code the categorical predictors
  step_dummy(gender) %>%  
  step_dummy(code_module) %>% 
  step_impute_knn(sum_clicks) %>% # impute missing values with k-nearest neighbors
  step_impute_knn(date_registration) %>% 
  step_normalize(all_numeric_predictors()) # center and scale all numeric predictors
# specify model
my_mod <-
  rand_forest() %>% # specify the type of model we are fitting, a random forest
  set_engine("ranger", importance = "impurity") %>% # using a variable importance metric specific to this random forest engine
  set_mode("classification") # same as before

# specify workflow
my_wf <-
  workflow() %>% # create a workflow
  add_model(my_mod) %>% # add the model we wrote above
  add_recipe(my_rec) # add our recipe we wrote above
class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap) # specify the same metrics as earlier
fitted_model_resamples <- fit_resamples(my_wf, resamples = vfcv, metrics = class_metrics) # specify the workflow, resampled data, and our metrics
collect_metrics(fitted_model_resamples)
## # A tibble: 6 × 6
##   .metric     .estimator   mean     n std_err .config             
##   <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
## 1 accuracy    binary     0.623     10 0.00578 Preprocessor1_Model1
## 2 kap         binary     0.0433    10 0.00323 Preprocessor1_Model1
## 3 npv         binary     0.519     10 0.0129  Preprocessor1_Model1
## 4 ppv         binary     0.630     10 0.00631 Preprocessor1_Model1
## 5 sensitivity binary     0.951     10 0.00337 Preprocessor1_Model1
## 6 specificity binary     0.0850    10 0.00334 Preprocessor1_Model1
fitted_model <- fit(my_wf, data_train)
final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics) # last_fit() takes the unfitted workflow; it refits on the training set and evaluates on the held-out test set
final_fit %>% 
  collect_metrics()
## # A tibble: 6 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary        0.623  Preprocessor1_Model1
## 2 sensitivity binary        0.957  Preprocessor1_Model1
## 3 specificity binary        0.0773 Preprocessor1_Model1
## 4 ppv         binary        0.629  Preprocessor1_Model1
## 5 npv         binary        0.521  Preprocessor1_Model1
## 6 kap         binary        0.0405 Preprocessor1_Model1
collect_predictions(final_fit) %>% 
  conf_mat(pass, .pred_class) # conf_mat() expects truth first, then estimate
##           Truth
## Prediction     0     1
##          0 12967  7642
##          1   589   640

5. COMMUNICATE

final_fit %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer pane after you knit your document. See the screenshot below. (An example of the YAML header from step 1 follows this list.)
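
For step 1, the YAML header you are editing looks something like this (the field values here are illustrative, not the exact ones in this document):

---
title: "Machine Learning Lab 3 Badge"
author: "Your Name"
output: html_document
---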

Receiving Your Machine Learning Badge

To receive credit for this assignment and earn your third ML badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.

Once your instructor has checked your link, you will be provided a physical version of the badge below!