As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts for two parts:

Part I: Reflect and Plan

Part A:

  1. As we considered after LL1, how good was the machine learning model we developed in the case study? Stepping back, how successful is it as a predictive model of students’ success in the class, using data collected through roughly the first month of the class? How might this model be used in practice?
  2. Would you be comfortable using this? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.
  3. How might the model be improved? Share any ideas you have at this time below:

Part B: Again, use the institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions and, ideally, one that involves engineering features from data in some way.

  1. Provide an APA citation for your selected study.

    • Jang, W., Francisco, J., Ranganathan, N., McCarroll, K. M., & Ryoo, K. (2020). Using Machine Learning to Understand Students' Learning Patterns in Simulations.

    • Note: this is not a study that used feature selection, but I could not find one in the field of Head Start education.

  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

    • This study explores how two techniques, the Finite Mixture Model (FMM) and Sequential Pattern Mining (SPM), can identify productive interaction patterns that can enhance eighth-grade students’ science learning within a simulation.

  3. What were the results of these analyses?

    • The findings of this study show that FMM can categorize students into different learning groups based on the frequency of actions in the simulation, as well as the improvement in their understanding of scientific phenomena. When supported by SPM, unique subsequential patterns of each group can even be detected. Such findings can inform how to design tailored scaffolding for engaging students in effective interactions using simulations.

Part II: Data Product

For the data product, you are asked to investigate and add to our recipe a feature engineering step we did not carry out.

Run the code below through the step in which you write down the recipe.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Users/lizfrechette/Desktop/machine-learning
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.0     ✔ rsample      1.0.0
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.2     ✔ workflows    1.0.0
## ✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.0     ✔ yardstick    1.0.0
## ✔ recipes      1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
d <- read_csv("data/online-sci-data-joined.csv")
## Rows: 10920 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (11): course_id, gender, enrollment_reason, enrollment_status, subject,...
## dbl  (12): student_id, int, uv, percomp, tv, sum_discussion_posts, sum_n_wor...
## lgl   (1): status
## time  (1): last_access_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load a second data file so that its final_grade column can be joined to the d data frame
data_with_final_grade <- read_csv("data/data-to-model-no-gradebook.csv")
## Rows: 546 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): course_id, gender, enrollment_reason, enrollment_status, subject, s...
## dbl (9): student_id, final_grade, time_spent, int, uv, percomp, tv, sum_disc...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(data_with_final_grade) 

data_with_final_grade <- data_with_final_grade %>% 
    select(student_id, course_id, final_grade)

d <- left_join(d, data_with_final_grade)
## Joining, by = c("student_id", "course_id")
# keep one row per student-course combination (this drops the repeated gradebook rows; it is not a long-to-wide reshape)

d <- d %>% distinct(student_id, course_id, .keep_all = TRUE)

#view(d)

set.seed(20220712)

train_test_split <- initial_split(d, prop = .80)

data_train <- training(train_test_split)

kfcv <- vfold_cv(data_train, v = 20) # 20-fold cross-validation on the training data; this differs from what we did before
#view(data_train)
names(d)
##  [1] "student_id"           "course_id"            "gender"              
##  [4] "enrollment_reason"    "enrollment_status"    "subject"             
##  [7] "semester"             "section"              "int"                 
## [10] "uv"                   "percomp"              "tv"                  
## [13] "sum_discussion_posts" "sum_n_words"          "passing_grade"       
## [16] "gradebook_item"       "item_position"        "gradebook_type"      
## [19] "gradebook_date"       "grade_category"       "status"              
## [22] "points_earned"        "points_attempted"     "points_possible"     
## [25] "last_access_date"     "final_grade"
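
As a rough check on what `vfold_cv(data_train, v = 20)` produces, the sketch below works through the fold-size arithmetic in base R. It assumes 436 training rows, which is the total implied by the split sizes shown in the fold-level output further down (414 + 22). It illustrates the arithmetic only and is not rsample's internal code.

```r
# Assumption: 436 training rows, inferred from the <split [414/22]> output
n_train <- 436
v <- 20

# Every fold gets floor(n / v) rows; the first (n %% v) folds get one extra
fold_sizes <- rep(n_train %/% v, v) + (seq_len(v) <= n_train %% v)

table(fold_sizes)          # how many folds hold 21 vs. 22 assessment rows
n_train - fold_sizes[1]    # analysis rows for a fold that holds 22 rows
```

Under that assumption, 16 folds hold out 22 rows and 4 folds hold out 21, which is consistent with the 414/22 and 415/21 splits reported for the resamples.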

Here’s where you can add a new feature engineering step. For the sake of this badge, choose from among those options here: https://recipes.tidymodels.org/reference/index.html. You can see more - if helpful - here: https://www.tmwr.org/recipes.html

my_rec <- recipe(final_grade ~ int + tv + 
                     student_id + course_id + 
                     sum_discussion_posts + sum_n_words +
                     percomp + passing_grade,
                 data = data_train) %>% 
    update_role(student_id, course_id, new_role = "ID variable") %>%
    step_normalize(all_numeric_predictors()) %>% # standardizes numeric variables
    step_nzv(all_predictors()) %>% # remove predictors with a "near-zero variance"
    step_novel(all_nominal_predictors()) %>% # assign factor levels not seen in training to a new level
    step_dummy(all_nominal_predictors()) %>%  # dummy code all factor variables
    step_impute_knn(all_predictors()) # impute missing data for all predictor variables

my_rec
## Recipe
## 
## Inputs:
## 
##         role #variables
##  ID variable          2
##      outcome          1
##    predictor          6
## 
## Operations:
## 
## Centering and scaling for all_numeric_predictors()
## Sparse, unbalanced variable filter on all_predictors()
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## K-nearest neighbor imputation for all_predictors()
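
One possible additional step is `step_corr()`, which drops predictors that are highly correlated with one another. The sketch below is a minimal illustration on simulated data, not the course data: the `toy` tibble, its values, and the 0.85 threshold are all assumptions chosen so that `int` and `tv` are nearly collinear (the correlation table later in this document reports .88 for those two variables in `d`).

```r
library(recipes)

# Simulated stand-in for data_train; none of these values come from the course data
set.seed(1)
toy <- tibble::tibble(
  int = rnorm(100),
  tv = int + rnorm(100, sd = 0.1),        # deliberately almost collinear with int
  percomp = rnorm(100),
  final_grade = 70 + 5 * int + rnorm(100)
)

toy_rec <- recipe(final_grade ~ int + tv + percomp, data = toy) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85)  # threshold is an assumed value

# prep() estimates the step on the toy data; bake(new_data = NULL) returns the processed training set
baked <- toy_rec %>% prep() %>% bake(new_data = NULL)
names(baked)
```

In this toy setup one of `int` or `tv` should be removed while `percomp` is kept. Whether the same step changes anything in the real recipe depends on the correlations within `data_train` and on the threshold chosen.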

Run the remaining steps.

my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>%
    set_mode("regression")

my_wf <-
    workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)


fitted_model_resamples <- fit_resamples(my_wf, resamples = kfcv,
                              control = control_resamples(save_pred = TRUE)) # save the out-of-fold predictions so we can inspect them

fitted_model_resamples %>% 
    unnest(.metrics) %>% 
    filter(.metric == "rmse") # for regression we also get R-squared (rsq); we look only at RMSE per fold for now
## # A tibble: 20 × 8
##    splits           id     .metric .estimator .estimate .config         .notes  
##    <list>           <chr>  <chr>   <chr>          <dbl> <chr>           <list>  
##  1 <split [414/22]> Fold01 rmse    standard       17.5  Preprocessor1_… <tibble>
##  2 <split [414/22]> Fold02 rmse    standard       11.5  Preprocessor1_… <tibble>
##  3 <split [414/22]> Fold03 rmse    standard       13.2  Preprocessor1_… <tibble>
##  4 <split [414/22]> Fold04 rmse    standard       13.3  Preprocessor1_… <tibble>
##  5 <split [414/22]> Fold05 rmse    standard       10.3  Preprocessor1_… <tibble>
##  6 <split [414/22]> Fold06 rmse    standard       11.7  Preprocessor1_… <tibble>
##  7 <split [414/22]> Fold07 rmse    standard        9.49 Preprocessor1_… <tibble>
##  8 <split [414/22]> Fold08 rmse    standard       11.4  Preprocessor1_… <tibble>
##  9 <split [414/22]> Fold09 rmse    standard       13.0  Preprocessor1_… <tibble>
## 10 <split [414/22]> Fold10 rmse    standard       15.0  Preprocessor1_… <tibble>
## 11 <split [414/22]> Fold11 rmse    standard       10.7  Preprocessor1_… <tibble>
## 12 <split [414/22]> Fold12 rmse    standard        9.51 Preprocessor1_… <tibble>
## 13 <split [414/22]> Fold13 rmse    standard       15.4  Preprocessor1_… <tibble>
## 14 <split [414/22]> Fold14 rmse    standard       14.1  Preprocessor1_… <tibble>
## 15 <split [414/22]> Fold15 rmse    standard       10.9  Preprocessor1_… <tibble>
## 16 <split [414/22]> Fold16 rmse    standard        9.60 Preprocessor1_… <tibble>
## 17 <split [415/21]> Fold17 rmse    standard        7.22 Preprocessor1_… <tibble>
## 18 <split [415/21]> Fold18 rmse    standard       15.3  Preprocessor1_… <tibble>
## 19 <split [415/21]> Fold19 rmse    standard       12.5  Preprocessor1_… <tibble>
## 20 <split [415/21]> Fold20 rmse    standard       15.3  Preprocessor1_… <tibble>
## # … with 1 more variable: .predictions <list>
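
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as `final_grade`. The following is a minimal base-R sketch; the `rmse_manual()` helper and the three example values are hypothetical and are not taken from the model output.

```r
# Hypothetical helper: root mean squared error of predictions
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted grades, for illustration only
observed <- c(80, 90, 70)
predicted <- c(78, 95, 70)

rmse_manual(observed, predicted)  # sqrt((4 + 25 + 0) / 3), about 3.11
```

Read this way, a fold-level RMSE of about 12 means predictions in that fold were typically off by roughly 12 grade points.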
#create a subset of data for correlation matrix
subd <- d %>% 
    mutate(section = as.numeric(section)) %>%
    select(student_id, section, int, uv, percomp, tv,
           sum_discussion_posts, sum_n_words, passing_grade, 
           item_position, points_earned, points_attempted, 
           points_possible, final_grade)

#names(subd)

#res <- cor(subd$final_grade, subd$points_earned)
#round(res, 2)

subd.cor <- cor(subd)
## Warning in cor(subd): the standard deviation is zero
subd.cor
##                         student_id     section int          uv percomp tv
## student_id            1.0000000000  0.01894348  NA  0.01747198      NA NA
## section               0.0189434843  1.00000000  NA  0.01381090      NA NA
## int                             NA          NA   1          NA      NA NA
## uv                    0.0174719762  0.01381090  NA  1.00000000      NA NA
## percomp                         NA          NA  NA          NA       1 NA
## tv                              NA          NA  NA          NA      NA  1
## sum_discussion_posts            NA          NA  NA          NA      NA NA
## sum_n_words                     NA          NA  NA          NA      NA NA
## passing_grade                   NA          NA  NA          NA      NA NA
## item_position         0.0001073529  0.06679731  NA -0.28871494      NA NA
## points_earned                   NA          NA  NA          NA      NA NA
## points_attempted                NA          NA  NA          NA      NA NA
## points_possible      -0.0706718060 -0.12516345  NA  0.02565155      NA NA
## final_grade                     NA          NA  NA          NA      NA NA
##                      sum_discussion_posts sum_n_words passing_grade
## student_id                             NA          NA            NA
## section                                NA          NA            NA
## int                                    NA          NA            NA
## uv                                     NA          NA            NA
## percomp                                NA          NA            NA
## tv                                     NA          NA            NA
## sum_discussion_posts                    1          NA            NA
## sum_n_words                            NA           1            NA
## passing_grade                          NA          NA             1
## item_position                          NA          NA            NA
## points_earned                          NA          NA            NA
## points_attempted                       NA          NA            NA
## points_possible                        NA          NA            NA
## final_grade                            NA          NA            NA
##                      item_position points_earned points_attempted
## student_id            0.0001073529            NA               NA
## section               0.0667973064            NA               NA
## int                             NA            NA               NA
## uv                   -0.2887149429            NA               NA
## percomp                         NA            NA               NA
## tv                              NA            NA               NA
## sum_discussion_posts            NA            NA               NA
## sum_n_words                     NA            NA               NA
## passing_grade                   NA            NA               NA
## item_position         1.0000000000            NA               NA
## points_earned                   NA             1               NA
## points_attempted                NA            NA                1
## points_possible      -0.1922397517            NA               NA
## final_grade                     NA            NA               NA
##                      points_possible final_grade
## student_id               -0.07067181          NA
## section                  -0.12516345          NA
## int                               NA          NA
## uv                        0.02565155          NA
## percomp                           NA          NA
## tv                                NA          NA
## sum_discussion_posts              NA          NA
## sum_n_words                       NA          NA
## passing_grade                     NA          NA
## item_position            -0.19223975          NA
## points_earned                     NA          NA
## points_attempted                  NA          NA
## points_possible           1.00000000          NA
## final_grade                       NA           1
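
Most entries in the matrix above are `NA` because `cor()` by default returns `NA` for any pair in which either column contains a missing value. The base-R sketch below shows the difference the `use` argument makes, using a small made-up data frame (`toy_na` and its values are assumptions for illustration).

```r
# Made-up data with a few missing values
toy_na <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, NA, 3, 2, 1)
)

cor_default <- cor(toy_na)                                  # pairs involving x or z are NA
cor_pairwise <- cor(toy_na, use = "pairwise.complete.obs")  # each pair uses its own complete rows

cor_default
cor_pairwise
```

The `correlate()` call that follows uses pairwise-complete observations, as its printed message states, which is why its table is filled in where this one is not.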
#??corr

#install.packages("corrr")
library(corrr)

#create correlation matrix
cor.d <- correlate(subd) %>%
    shave()
## Warning in stats::cor(x = x, y = y, use = use, method = method): the standard
## deviation is zero
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
fashion(cor.d)
##                    term student_id section  int   uv percomp   tv
## 1            student_id                                          
## 2               section        .02                               
## 3                   int       -.02     .17                       
## 4                    uv        .02     .01  .54                  
## 5               percomp       -.04     .06  .60  .50             
## 6                    tv       -.00     .11  .88  .87     .62     
## 7  sum_discussion_posts        .05    -.06  .09  .11     .01  .13
## 8           sum_n_words       -.00     .26  .24  .14     .13  .23
## 9         passing_grade       -.17    -.07  .09  .04     .11  .08
## 10        item_position        .00     .07 -.04 -.29    -.08 -.18
## 11        points_earned       -.07    -.11 -.18  .04    -.05 -.07
## 12     points_attempted                                          
## 13      points_possible       -.07    -.13 -.20  .03    -.07 -.10
## 14          final_grade       -.19    -.02  .14  .03     .09  .10
##    sum_discussion_posts sum_n_words passing_grade item_position points_earned
## 1                                                                            
## 2                                                                            
## 3                                                                            
## 4                                                                            
## 5                                                                            
## 6                                                                            
## 7                                                                            
## 8                   .47                                                      
## 9                   .35         .21                                          
## 10                 -.12         .01          -.05                            
## 11                 -.08        -.04           .13          -.24              
## 12                                                                           
## 13                 -.15        -.08           .08          -.19           .98
## 14                  .43         .32           .81          -.05           .12
##    points_attempted points_possible final_grade
## 1                                              
## 2                                              
## 3                                              
## 4                                              
## 5                                              
## 6                                              
## 7                                              
## 8                                              
## 9                                              
## 10                                             
## 11                                             
## 12                                             
## 13                                             
## 14                              .06
fitted_model_resamples %>% 
    collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   12.3      20  0.577  Preprocessor1_Model1
## 2 rsq     standard    0.683    20  0.0251 Preprocessor1_Model1
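
The `mean` and `std_err` for RMSE can be reproduced from the 20 fold-level values printed earlier. The sketch below copies those printed values by hand (they are rounded to three significant digits, so small discrepancies are expected) and assumes the standard error is the standard deviation across folds divided by the square root of the number of folds.

```r
# Fold-level RMSE values copied from the printed fit_resamples() output (rounded)
fold_rmse <- c(17.5, 11.5, 13.2, 13.3, 10.3, 11.7, 9.49, 11.4, 13.0, 15.0,
               10.7, 9.51, 15.4, 14.1, 10.9, 9.60, 7.22, 15.3, 12.5, 15.3)

mean_rmse <- mean(fold_rmse)
se_rmse <- sd(fold_rmse) / sqrt(length(fold_rmse))  # assumed formula for std_err

round(c(mean = mean_rmse, std_err = se_rmse), 3)
```

Both figures should land close to the 12.3 and 0.577 shown in the table, which gives a sense of how much the estimate varies from fold to fold when comparing against the case study result.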

Did that feature engineering make any difference compared to the mean predictive accuracy you found in the case study? Add a few notes below:

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the name in the author: field of the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will first need to create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.