As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts for two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply an analytic technique introduced in this learning lab.
Part A:
Part B: Again, use an institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions and, ideally, one that also involved engineering features from data in some way.
Provide an APA citation for your selected study.
Jang, W., Francisco, J., Ranganathan, N., McCarroll, K. M., & Ryoo, K. (2020). Using Machine Learning to Understand Students' Learning Patterns in Simulations.
Note: this is not a study that used feature selection, but I could not find one in the field of Head Start education.
What research questions were the authors of this study trying to address and why did they consider these questions important?
What were the results of these analyses?
For the data product, you are asked to investigate and add to our recipe a feature engineering step we did not carry out.
Run the code below through the step in which you write down the recipe.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at /Users/lizfrechette/Desktop/machine-learning
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.0 ✔ rsample 1.0.0
## ✔ dials 1.0.0 ✔ tune 1.0.0
## ✔ infer 1.0.2 ✔ workflows 1.0.0
## ✔ modeldata 1.0.0 ✔ workflowsets 1.0.0
## ✔ parsnip 1.0.0 ✔ yardstick 1.0.0
## ✔ recipes 1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
d <- read_csv("data/online-sci-data-joined.csv")
## Rows: 10920 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): course_id, gender, enrollment_reason, enrollment_status, subject,...
## dbl (12): student_id, int, uv, percomp, tv, sum_discussion_posts, sum_n_wor...
## lgl (1): status
## time (1): last_access_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
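As the message above suggests, you can quiet this printout by specifying the column types up front or by setting `show_col_types = FALSE`. A minimal sketch, assuming the same file path:

d <- read_csv("data/online-sci-data-joined.csv", show_col_types = FALSE)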
#load a new data file so we can join each student's final grade onto the d data frame
data_with_final_grade <- read_csv("data/data-to-model-no-gradebook.csv")
## Rows: 546 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): course_id, gender, enrollment_reason, enrollment_status, subject, s...
## dbl (9): student_id, final_grade, time_spent, int, uv, percomp, tv, sum_disc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(data_with_final_grade)
data_with_final_grade <- data_with_final_grade %>%
select(student_id, course_id, final_grade)
d <- left_join(d, data_with_final_grade)
## Joining, by = c("student_id", "course_id")
#de-duplicate (rather than reshape) so we keep 1 row per student-course pair
d <- d %>% distinct(student_id, course_id, .keep_all = TRUE)
#view(d)
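# Optional sanity check (an assumption, not part of the original analysis): after
# distinct(), each student-course pair should appear once, so this should return zero rows
# d %>% count(student_id, course_id) %>% filter(n > 1)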
set.seed(20220712)
train_test_split <- initial_split(d, prop = .80)
data_train <- training(train_test_split)
kfcv <- vfold_cv(data_train, v = 20) # 20 folds rather than the default 10; this differs from what we did before
#view(data_train)
names(d)
## [1] "student_id" "course_id" "gender"
## [4] "enrollment_reason" "enrollment_status" "subject"
## [7] "semester" "section" "int"
## [10] "uv" "percomp" "tv"
## [13] "sum_discussion_posts" "sum_n_words" "passing_grade"
## [16] "gradebook_item" "item_position" "gradebook_type"
## [19] "gradebook_date" "grade_category" "status"
## [22] "points_earned" "points_attempted" "points_possible"
## [25] "last_access_date" "final_grade"
Here’s where you can add a new feature engineering step. For the sake of this badge, choose from among the step functions listed here: https://recipes.tidymodels.org/reference/index.html. You can learn more, if helpful, here: https://www.tmwr.org/recipes.html. One possible option is sketched after the recipe printout below.
my_rec <- recipe(final_grade ~ int + tv +
student_id + course_id +
sum_discussion_posts + sum_n_words +
percomp + passing_grade,
data = data_train) %>%
update_role(student_id, course_id, new_role = "ID variable") %>%
step_normalize(all_numeric_predictors()) %>% # standardizes numeric variables
step_nzv(all_predictors()) %>% # remove predictors with a "near-zero variance"
step_novel(all_nominal_predictors()) %>% # assign previously unseen factor levels to a new level
step_dummy(all_nominal_predictors()) %>% # dummy code all factor variables
step_impute_knn(all_predictors()) # impute missing data for all predictor variables
my_rec
## Recipe
##
## Inputs:
##
## role #variables
## ID variable 2
## outcome 1
## predictor 6
##
## Operations:
##
## Centering and scaling for all_numeric_predictors()
## Sparse, unbalanced variable filter on all_predictors()
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## K-nearest neighbor imputation for all_predictors()
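If you want to experiment with a different feature engineering step, one option from the recipes reference linked above is `step_corr()`, which drops numeric predictors that are highly correlated with one another; the correlation matrix later in this document shows that some variable pairs in these data are nearly collinear (e.g., points_earned and points_possible at .98). A minimal sketch, assuming the same formula and training data as my_rec (the alt_rec name is just for illustration):

alt_rec <- recipe(final_grade ~ int + tv +
student_id + course_id +
sum_discussion_posts + sum_n_words +
percomp + passing_grade,
data = data_train) %>%
update_role(student_id, course_id, new_role = "ID variable") %>%
step_normalize(all_numeric_predictors()) %>% # standardize first so correlations are on a common scale
step_corr(all_numeric_predictors(), threshold = .9) %>% # drop one of any pair of predictors correlated above .9
step_nzv(all_predictors()) # remove predictors with a "near-zero variance"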
Run the remaining steps.
my_mod <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
fitted_model_resamples <- fit_resamples(my_wf, resamples = kfcv,
control = control_grid(save_pred = TRUE)) # this allows us to inspect the predictions
fitted_model_resamples %>%
unnest(.metrics) %>%
filter(.metric == "rmse") # we also get another metric, R-squared (rsq); we focus just on RMSE for now
## # A tibble: 20 × 8
## splits id .metric .estimator .estimate .config .notes
## <list> <chr> <chr> <chr> <dbl> <chr> <list>
## 1 <split [414/22]> Fold01 rmse standard 17.5 Preprocessor1_… <tibble>
## 2 <split [414/22]> Fold02 rmse standard 11.5 Preprocessor1_… <tibble>
## 3 <split [414/22]> Fold03 rmse standard 13.2 Preprocessor1_… <tibble>
## 4 <split [414/22]> Fold04 rmse standard 13.3 Preprocessor1_… <tibble>
## 5 <split [414/22]> Fold05 rmse standard 10.3 Preprocessor1_… <tibble>
## 6 <split [414/22]> Fold06 rmse standard 11.7 Preprocessor1_… <tibble>
## 7 <split [414/22]> Fold07 rmse standard 9.49 Preprocessor1_… <tibble>
## 8 <split [414/22]> Fold08 rmse standard 11.4 Preprocessor1_… <tibble>
## 9 <split [414/22]> Fold09 rmse standard 13.0 Preprocessor1_… <tibble>
## 10 <split [414/22]> Fold10 rmse standard 15.0 Preprocessor1_… <tibble>
## 11 <split [414/22]> Fold11 rmse standard 10.7 Preprocessor1_… <tibble>
## 12 <split [414/22]> Fold12 rmse standard 9.51 Preprocessor1_… <tibble>
## 13 <split [414/22]> Fold13 rmse standard 15.4 Preprocessor1_… <tibble>
## 14 <split [414/22]> Fold14 rmse standard 14.1 Preprocessor1_… <tibble>
## 15 <split [414/22]> Fold15 rmse standard 10.9 Preprocessor1_… <tibble>
## 16 <split [414/22]> Fold16 rmse standard 9.60 Preprocessor1_… <tibble>
## 17 <split [415/21]> Fold17 rmse standard 7.22 Preprocessor1_… <tibble>
## 18 <split [415/21]> Fold18 rmse standard 15.3 Preprocessor1_… <tibble>
## 19 <split [415/21]> Fold19 rmse standard 12.5 Preprocessor1_… <tibble>
## 20 <split [415/21]> Fold20 rmse standard 15.3 Preprocessor1_… <tibble>
## # … with 1 more variable: .predictions <list>
#create a subset of data for correlation matrix
subd <- d %>%
mutate(section = as.numeric(section)) %>%
select(student_id, section, int, uv, percomp, tv,
sum_discussion_posts, sum_n_words, passing_grade,
item_position, points_earned, points_attempted,
points_possible, final_grade)
#names(subd)
#res <- cor(subd$final_grade, subd$points_earned)
#round(res, 2)
subd.cor <- cor(subd) # with the default use = "everything", any missing value yields NA
## Warning in cor(subd): the standard deviation is zero
subd.cor
## student_id section int uv percomp tv
## student_id 1.0000000000 0.01894348 NA 0.01747198 NA NA
## section 0.0189434843 1.00000000 NA 0.01381090 NA NA
## int NA NA 1 NA NA NA
## uv 0.0174719762 0.01381090 NA 1.00000000 NA NA
## percomp NA NA NA NA 1 NA
## tv NA NA NA NA NA 1
## sum_discussion_posts NA NA NA NA NA NA
## sum_n_words NA NA NA NA NA NA
## passing_grade NA NA NA NA NA NA
## item_position 0.0001073529 0.06679731 NA -0.28871494 NA NA
## points_earned NA NA NA NA NA NA
## points_attempted NA NA NA NA NA NA
## points_possible -0.0706718060 -0.12516345 NA 0.02565155 NA NA
## final_grade NA NA NA NA NA NA
## sum_discussion_posts sum_n_words passing_grade
## student_id NA NA NA
## section NA NA NA
## int NA NA NA
## uv NA NA NA
## percomp NA NA NA
## tv NA NA NA
## sum_discussion_posts 1 NA NA
## sum_n_words NA 1 NA
## passing_grade NA NA 1
## item_position NA NA NA
## points_earned NA NA NA
## points_attempted NA NA NA
## points_possible NA NA NA
## final_grade NA NA NA
## item_position points_earned points_attempted
## student_id 0.0001073529 NA NA
## section 0.0667973064 NA NA
## int NA NA NA
## uv -0.2887149429 NA NA
## percomp NA NA NA
## tv NA NA NA
## sum_discussion_posts NA NA NA
## sum_n_words NA NA NA
## passing_grade NA NA NA
## item_position 1.0000000000 NA NA
## points_earned NA 1 NA
## points_attempted NA NA 1
## points_possible -0.1922397517 NA NA
## final_grade NA NA NA
## points_possible final_grade
## student_id -0.07067181 NA
## section -0.12516345 NA
## int NA NA
## uv 0.02565155 NA
## percomp NA NA
## tv NA NA
## sum_discussion_posts NA NA
## sum_n_words NA NA
## passing_grade NA NA
## item_position -0.19223975 NA
## points_earned NA NA
## points_attempted NA NA
## points_possible 1.00000000 NA
## final_grade NA 1
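The NAs above appear because cor() defaults to use = "everything", so any variable with missing values produces NA, and the "standard deviation is zero" warning points to a column with no variance (likely points_attempted, which also comes back blank in the corrr output below). A sketch of one workaround is to compute correlations from pairwise complete observations, which is what corrr::correlate() does by default:

subd.cor <- cor(subd, use = "pairwise.complete.obs")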
#??corr
#install.packages("corrr")
library(corrr)
#create a correlation matrix; correlate() uses pairwise complete observations by default
cor.d <- correlate(subd) %>%
shave()
## Warning in stats::cor(x = x, y = y, use = use, method = method): the standard
## deviation is zero
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
fashion(cor.d)
## term student_id section int uv percomp tv
## 1 student_id
## 2 section .02
## 3 int -.02 .17
## 4 uv .02 .01 .54
## 5 percomp -.04 .06 .60 .50
## 6 tv -.00 .11 .88 .87 .62
## 7 sum_discussion_posts .05 -.06 .09 .11 .01 .13
## 8 sum_n_words -.00 .26 .24 .14 .13 .23
## 9 passing_grade -.17 -.07 .09 .04 .11 .08
## 10 item_position .00 .07 -.04 -.29 -.08 -.18
## 11 points_earned -.07 -.11 -.18 .04 -.05 -.07
## 12 points_attempted
## 13 points_possible -.07 -.13 -.20 .03 -.07 -.10
## 14 final_grade -.19 -.02 .14 .03 .09 .10
## sum_discussion_posts sum_n_words passing_grade item_position points_earned
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8 .47
## 9 .35 .21
## 10 -.12 .01 -.05
## 11 -.08 -.04 .13 -.24
## 12
## 13 -.15 -.08 .08 -.19 .98
## 14 .43 .32 .81 -.05 .12
## points_attempted points_possible final_grade
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14 .06
fitted_model_resamples %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 12.3 20 0.577 Preprocessor1_Model1
## 2 rsq standard 0.683 20 0.0251 Preprocessor1_Model1
Did that feature engineering make any difference compared to the mean predictive accuracy you found in the case study? Add a few notes below:
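To pull out just the mean RMSE for that comparison, a minimal sketch:

fitted_model_resamples %>%
collect_metrics() %>%
filter(.metric == "rmse")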
Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:
Change the name of the author: in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.
Commit your changes in GitHub Desktop and push them to your online GitHub repository.
Publish your HTML page to the web using one of the following publishing methods:
Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note: you will need to quickly create an RPubs account.
Publish on GitHub using either GitHub Pages or the HTML previewer.
Post a new discussion on GitHub to our ML badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.