Please read this article and then respond to the following three questions.
Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using Demographic Data as Predictor Variables: a Questionable Choice. Journal of Educational Data Mining, 15(2), 22-52.
Part A:
In what ways could the inclusion of demographic variables as predictors potentially reinforce biases in predictive analytics within education? Using demographic variables as predictors can lead models to base predictions primarily on demographic group membership rather than on more actionable variables, such as students' behaviors, so the resulting predictions risk encoding and reinforcing existing inequities instead of pointing to factors an intervention could change.
The authors argue that demographic variables should be used to validate fairness rather than as predictors within models. What are the potential benefits and drawbacks of this approach?
Benefits:
How could the limitations of categorization impact the use of demographic variables in predictive models? The authors maintain that using demographic variables as predictors assumes commonality among the individuals labeled with that demographic category. Yet the limitations of categorization, such as categories that are too coarse or that erase within-group variation, undermine this fundamental assumption.
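To make the fairness-validation use concrete, here is a minimal sketch of checking a fitted model's accuracy separately by demographic group, rather than feeding the demographic variable into the model. The preds tibble and its column names are hypothetical, not from the article; yardstick metrics respect dplyr groups, which is what makes the group-wise check a one-liner.
library(tidyverse)
library(yardstick)
# Hypothetical held-out predictions: the true outcome, the model's predicted
# class, and a demographic label that was withheld from the model itself.
preds <- tibble(
  pass = factor(c(1, 1, 0, 0, 1, 0)),
  .pred_class = factor(c(1, 0, 0, 1, 1, 0)),
  gender = c("F", "F", "F", "M", "M", "M")
)
# One accuracy estimate per group; a large gap between groups would flag a
# potential fairness problem without gender ever entering the model.
preds %>%
  group_by(gender) %>%
  accuracy(truth = pass, estimate = .pred_class)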
Part B: Once again, use the institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, please find an article in your field that utilizes feature engineering.
Provide an APA citation for your selected study. Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students' science inquiry skills using educational data mining. Journal of the Learning Sciences, 22(4), 521-563. https://doi.org/10.1080/10508406.2013.837391
What is the data source used in the study? Log files of students' interactions with science inquiry microworlds. Participants were 148 eighth-grade students, ranging in age from 12 to 14 years, from a public middle school in central Massachusetts.
Why was feature engineering useful or even necessary in this study? Raw log files record low-level actions that are not directly interpretable as evidence of inquiry skill; engineering features from those logs was necessary to turn them into assessment metrics and thereby improve construct validity.
For this data product, please add a feature (or several features) specific to one or more of the activity types in the OULAD interactions data, and then add these features to your model and evaluate the accuracy. How, if at all, did the accuracy appreciably change? Please copy all of your code in one or more code chunks below.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(vip) # a new package we're adding for variable importance measures
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(ranger) # this is needed for the random forest algorithm
interactions <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-interactions-filtered.csv")
## Rows: 1048575 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, activity_type
## dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
interactions %>%
count(activity_type)
## # A tibble: 12 × 2
## activity_type n
## <chr> <int>
## 1 dataplus 271
## 2 forumng 325314
## 3 glossary 3153
## 4 homepage 178674
## 5 oucollaborate 3059
## 6 oucontent 140585
## 7 ouelluminate 667
## 8 quiz 90855
## 9 resource 122466
## 10 sharedsubpage 103
## 11 subpage 133975
## 12 url 49453
interactions %>%
count(activity_type) %>%
arrange(desc(n))
## # A tibble: 12 × 2
## activity_type n
## <chr> <int>
## 1 forumng 325314
## 2 homepage 178674
## 3 oucontent 140585
## 4 subpage 133975
## 5 resource 122466
## 6 quiz 90855
## 7 url 49453
## 8 glossary 3153
## 9 oucollaborate 3059
## 10 ouelluminate 667
## 11 dataplus 271
## 12 sharedsubpage 103
interactions %>%
ggplot(aes(x = date)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
interactions %>%
ggplot(aes(x = sum_click)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
assessments <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
interactions_joined <- interactions %>%
left_join(code_module_dates) # join the data based on code_module and code_presentation
## Joining with `by = join_by(code_module, code_presentation,
## quantile_cutoff_date)`
interactions_filtered <- interactions_joined %>%
filter(date < quantile_cutoff_date) # keep only interactions that occurred before the cutoff date
interactions_summarized <- interactions_filtered %>%
group_by(id_student, code_module, code_presentation) %>%
summarise(sum_clicks = sum(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_summarized %>%
ggplot(aes(x = sum_clicks)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
interactions_summarized <- interactions_filtered %>%
group_by(id_student, code_module, code_presentation) %>%
summarize(sum_clicks = sum(sum_click),
sd_clicks = sd(sum_click),
mean_clicks = mean(sum_click))
## `summarise()` has grouped output by 'id_student', 'code_module'. You can
## override using the `.groups` argument.
interactions_slopes <- interactions_filtered %>% # start with our interactions data
  group_by(id_student, code_module, code_presentation) %>% # group by student and course so a slope is calculated for each student-by-course combination
  nest() %>% # create a nested data frame in which each student's data is assigned to its own cell; this makes the next step possible
  mutate(model = map(data, ~lm(sum_click ~ 1 + date + I(date^2), data = .x) %>%
                       tidy)) %>% # fit a linear model with linear and quadratic terms for date to each student-by-course data set
  unnest(model) # return the data to its original (non-nested) structure, retaining the mutated model output
## Warning: There were 106 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `model = map(...)`.
## ℹ In group 477: `id_student = 155298`, `code_module = "CCC"`,
## `code_presentation = "2014J"`.
## Caused by warning in `summary.lm()`:
## ! essentially perfect fit: summary may be unreliable
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 105 remaining warnings.
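The tidy() output above is in long format (one row per model term per student-course combination), so it cannot be joined as one-column-per-feature without reshaping. A minimal sketch of pivoting it wide; the column names intercept, date, and date_2 are chosen to match the commented-out terms in the recipe further below:
# Keep the term names and estimates, rename the lm terms to feature-friendly
# column names, and pivot to one row per student-course combination.
interactions_slopes_wide <- interactions_slopes %>%
  ungroup() %>%
  select(id_student, code_module, code_presentation, term, estimate) %>%
  mutate(term = recode(term,
                       "(Intercept)" = "intercept",
                       "I(date^2)" = "date_2")) %>%
  pivot_wider(names_from = term, values_from = estimate)
This one-row-per-student version is also a safer join target than the many-to-many join on the next line, which duplicates rows.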
all_features <- left_join(interactions_filtered, interactions_slopes, by = join_by(id_student, code_module, code_presentation), relationship = "many-to-many") # note: joining the long-format slope output duplicates rows; all_features is not used below
students_and_assessments <- read_csv("C:/Users/hy5633/OneDrive - The University of Texas at Austin/Documents/LASER 2023/machine-learning-main/lab-3/data/oulad-students-and-assessments.csv")
## Rows: 32593 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): code_module, code_presentation, gender, region, highest_education, ...
## dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students_assessments_and_interactions <- left_join(students_and_assessments,
interactions_summarized)
## Joining with `by = join_by(code_module, code_presentation, id_student)`
students_assessments_and_interactions <- students_assessments_and_interactions %>%
mutate(pass = as.factor(pass))
# write_csv(students_assessments_and_interactions, "students-assessments-and-interactions.csv")
names(students_assessments_and_interactions)
## [1] "code_module" "code_presentation"
## [3] "id_student" "gender"
## [5] "region" "highest_education"
## [7] "imd_band" "age_band"
## [9] "num_of_prev_attempts" "studied_credits"
## [11] "disability" "final_result"
## [13] "module_presentation_length" "date_registration"
## [15] "date_unregistration" "pass"
## [17] "mean_weighted_score" "sum_clicks"
## [19] "sd_clicks" "mean_clicks"
set.seed(20230712)
train_test_split <- initial_split(students_assessments_and_interactions, prop = .33, strata = "pass")
data_train <- training(train_test_split)
vfcv <- vfold_cv(data_train) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
vfcv
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [9679/1076]> Fold01
## 2 <split [9679/1076]> Fold02
## 3 <split [9679/1076]> Fold03
## 4 <split [9679/1076]> Fold04
## 5 <split [9679/1076]> Fold05
## 6 <split [9680/1075]> Fold06
## 7 <split [9680/1075]> Fold07
## 8 <split [9680/1075]> Fold08
## 9 <split [9680/1075]> Fold09
## 10 <split [9680/1075]> Fold10
kfcv <- vfold_cv(data_train, v = 20) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
kfcv
## # 20-fold cross-validation
## # A tibble: 20 × 2
## splits id
## <list> <chr>
## 1 <split [10217/538]> Fold01
## 2 <split [10217/538]> Fold02
## 3 <split [10217/538]> Fold03
## 4 <split [10217/538]> Fold04
## 5 <split [10217/538]> Fold05
## 6 <split [10217/538]> Fold06
## 7 <split [10217/538]> Fold07
## 8 <split [10217/538]> Fold08
## 9 <split [10217/538]> Fold09
## 10 <split [10217/538]> Fold10
## 11 <split [10217/538]> Fold11
## 12 <split [10217/538]> Fold12
## 13 <split [10217/538]> Fold13
## 14 <split [10217/538]> Fold14
## 15 <split [10217/538]> Fold15
## 16 <split [10218/537]> Fold16
## 17 <split [10218/537]> Fold17
## 18 <split [10218/537]> Fold18
## 19 <split [10218/537]> Fold19
## 20 <split [10218/537]> Fold20
# step_normalize(minutes_of_videos_watched) # stray example from the lab instructions; this variable does not exist in the OULAD data, so it is commented out to keep the chunk runnable
my_rec <- recipe(pass ~ disability +
                   date_registration +
                   gender +
                   code_module + sum_clicks, # +
                 # mean_weighted_score +
                 # sd_clicks + mean_clicks + # new
                 # intercept + date + date_2, # new
                 data = data_train)
my_rec <- my_rec %>%
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module) %>%
step_impute_knn(sum_clicks) %>%
step_impute_knn(date_registration) %>%
step_normalize(all_numeric_predictors())
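As a quick sanity check (not part of the lab's required output), the recipe can be prepped and baked to inspect what the dummy coding, imputation, and normalization steps actually produce:
# prep() estimates the preprocessing parameters from the training data;
# bake(new_data = NULL) returns the processed training set itself.
my_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()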
# specify model
my_mod <-
rand_forest() %>% # specify the type of model we are fitting, a random forest
set_engine("ranger", importance = "impurity") %>% # using a variable importance metric specific to this random forest engine
set_mode("classification") # same as before
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap) # specify the same metrics as earlier
fitted_model_resamples <- fit_resamples(my_wf, resamples = vfcv, metrics = class_metrics) # specify the workflow, resampled data, and our metrics
collect_metrics(fitted_model_resamples)
## # A tibble: 6 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.623 10 0.00578 Preprocessor1_Model1
## 2 kap binary 0.0433 10 0.00323 Preprocessor1_Model1
## 3 npv binary 0.519 10 0.0129 Preprocessor1_Model1
## 4 ppv binary 0.630 10 0.00631 Preprocessor1_Model1
## 5 sensitivity binary 0.951 10 0.00337 Preprocessor1_Model1
## 6 specificity binary 0.0850 10 0.00334 Preprocessor1_Model1
fitted_model <- fit(my_wf, data_train) # fit the workflow once on the full training set
final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics) # last_fit() takes the untrained workflow; it refits on the training portion of the split and evaluates on the held-out test set
final_fit %>%
collect_metrics()
## # A tibble: 6 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.623 Preprocessor1_Model1
## 2 sensitivity binary 0.957 Preprocessor1_Model1
## 3 specificity binary 0.0773 Preprocessor1_Model1
## 4 ppv binary 0.629 Preprocessor1_Model1
## 5 npv binary 0.521 Preprocessor1_Model1
## 6 kap binary 0.0405 Preprocessor1_Model1
collect_predictions(final_fit) %>%
  conf_mat(truth = pass, estimate = .pred_class) # the truth column comes first in conf_mat()
## Truth
## Prediction 0 1
## 0 12967 589
## 1 7642 640
final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
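To answer the accuracy question in the prompt, the activity-type features sketched earlier would need to be joined in and the model re-estimated. The following is a sketch under the assumption that activity_type_features (from the earlier sketch) exists and includes columns named forumng and quiz; whether accuracy appreciably changes is read off from collect_metrics(), not asserted here.
# Join the per-activity-type click features onto the training data.
data_train_activity <- data_train %>%
  left_join(activity_type_features,
            by = c("id_student", "code_module", "code_presentation"))

# Extend the recipe with two of the new columns.
rec_activity <- recipe(pass ~ disability + date_registration + gender +
                         code_module + sum_clicks + forumng + quiz,
                       data = data_train_activity) %>%
  step_dummy(disability, gender, code_module) %>%
  step_impute_knn(sum_clicks, date_registration, forumng, quiz) %>%
  step_normalize(all_numeric_predictors())

# Swap the new recipe into the existing workflow, re-estimate with 10-fold
# cross-validation and the same metric set, then compare the accuracy row
# against the earlier collect_metrics() output.
my_wf %>%
  update_recipe(rec_activity) %>%
  fit_resamples(resamples = vfold_cv(data_train_activity),
                metrics = class_metrics) %>%
  collect_metrics()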
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name (a minimal example header appears after these steps). The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the knit button in the toolbar above to "knit" your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
[Screenshot: the Publish button in the Viewer pane]
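For reference, a minimal YAML header of the kind described in the first step; the field values are placeholders:
---
title: "Machine Learning Badge 3"
author: "Your Name"
output: html_document
---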
To receive credit for this assignment and earn your third ML badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided with a physical version of the badge below!