As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts for two parts:

Part I: Reflect and Plan

Part A:

  1. As we considered after LL1, how good was the machine learning model we developed in the case study? Stepping back, how successful is this as a predictive model of students’ success in the class, using data collected through roughly the first month of the class? How might this model be used in practice?
  2. Would you be comfortable using this model? What if, as a reviewer of research, you read about someone using such a model? Please add your thoughts and reflections following the bullet point below.
  3. How might the model be improved? Share any ideas you have at this time below:

Part B: Again, use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions – and, ideally, one that in some way involved engineering features from data.

  1. Provide an APA citation for your selected study.

    • Wu, J. Y., Hsiao, Y. C., & Nian, M. W. (2020). Using supervised machine learning on large-scale online forums to classify course-related Facebook messages in predicting learning achievement within the personal learning environment. Interactive Learning Environments, 28(1), 65–80.
  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

    • RQ1: How well does each ML algorithm perform in classifying the two types of online forum posts?
    • RQ2: How well does each ML algorithm perform in classifying the Facebook statistics posts in comparison to human coding of relevance and cognitive level?
    • RQ3: What is the predictive validity of the ML classification on students’ final grade in the advanced statistics course?

    The availability of digital data brings unparalleled potential to examine people’s learning from different facets and gives rise to interest in the development and use of tools and techniques to support Learning Analytics (LA). Online discussion messages are typical digital data that may convey information for learning diagnosis (Lu & Jeng, 2006). Online communities and discussion boards premised on social media are widely applied in education as an extended platform for students’ seamless learning (Thoms & Eryilmaz, 2015).

  3. What were the results of these analyses?

    • The regression analysis in Model 1, with the frequency of unclassified messages, explained 38.80% (F(2, 20) = 6.33, p = .007, adjusted R² = 32.70%) of the variance in students’ final course grade, controlling for gender.
    • The regression model with the frequencies of machine-classified messages explained 49.60% (F(5, 18) = 3.55, p = .021, adjusted R² = 35.70%) of the variance in students’ final course grade.
    • Students with more messages endorsed by two or more ML algorithms as statistics-related had higher final course grades. Students who failed the course also had significantly fewer messages endorsed by all three ML algorithms than those who passed.

Part II: Data Product

For the data product, you are asked to investigate and add to our recipe a feature engineering step we did not carry out.

Run the code below through the step in which you write down the recipe.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Users/penghe/Documents/GitHub/machine-learning
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        1.0.0     ✔ rsample      1.0.0
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.2     ✔ workflows    1.0.0
## ✔ modeldata    1.0.0     ✔ workflowsets 0.2.1
## ✔ parsnip      1.0.0     ✔ yardstick    1.0.0
## ✔ recipes      1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
d <- read_csv("data/online-sci-data-joined.csv")
## Rows: 10920 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (11): course_id, gender, enrollment_reason, enrollment_status, subject,...
## dbl  (12): student_id, int, uv, percomp, tv, sum_discussion_posts, sum_n_wor...
## lgl   (1): status
## time  (1): last_access_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_with_final_grade <- read_csv("data/data-to-model-no-gradebook.csv")
## Rows: 546 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): course_id, gender, enrollment_reason, enrollment_status, subject, s...
## dbl (9): student_id, final_grade, time_spent, int, uv, percomp, tv, sum_disc...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_with_final_grade <- data_with_final_grade %>%
    select(student_id, course_id, final_grade)

d <- left_join(d, data_with_final_grade, by = c("student_id", "course_id"))

# if the join produced more rows than expected (duplicates), keep one row per student per course:
d <- d %>% distinct(student_id, course_id, .keep_all = TRUE)

set.seed(20220712)

train_test_split <- initial_split(d, prop = .80)

data_train <- training(train_test_split)

kfcv <- vfold_cv(data_train, v = 10) # k-fold cross-validation; this resampling step is what differs from what we did before
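
If you’d like to confirm what vfold_cv() created, an optional check (a sketch on our part, not a required step) is to pull out a single fold and inspect it with rsample’s analysis() and assessment() helpers:

first_fold <- kfcv$splits[[1]]  # the first of the ten folds
dim(analysis(first_fold))       # the rows used to fit the model within this fold
dim(assessment(first_fold))     # the held-out rows used to evaluate it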

Here’s where you can add a new feature engineering step. For the sake of this badge, choose from among the options here: https://recipes.tidymodels.org/reference/index.html. You can see more, if helpful, here: https://www.tmwr.org/recipes.html. An illustrative example of one possible step follows the recipe printout below.

my_rec <- recipe(final_grade ~ int + uv + tv +
                     student_id + course_id +
                     sum_discussion_posts + sum_n_words + subject + percomp + points_earned, 
                 data = data_train) %>% 
    update_role(student_id, course_id, new_role = "ID variables") %>% # this role name can be any string
    step_normalize(all_numeric_predictors()) %>% # standardize numeric variables
    step_nzv(all_predictors()) %>% # remove predictors with near-zero variance
    step_novel(all_nominal_predictors()) %>% # assign previously unseen factor levels to a new value
    step_dummy(all_nominal_predictors()) %>%  # dummy code all factor variables
    step_impute_knn(all_predictors()) # impute missing data for all predictor variables

my_rec
## Recipe
## 
## Inputs:
## 
##          role #variables
##  ID variables          2
##       outcome          1
##     predictor          8
## 
## Operations:
## 
## Centering and scaling for all_numeric_predictors()
## Sparse, unbalanced variable filter on all_predictors()
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## K-nearest neighbor imputation for all_predictors()
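
As one illustration of a step you could add (our illustrative choice, not the only valid one), step_interact() from recipes creates interaction terms between predictors. A minimal sketch, appended to a copy of the recipe so the printout above is unchanged; the pairing of sum_discussion_posts and sum_n_words is hypothetical:

my_rec_new <- my_rec %>% 
    step_interact(terms = ~ sum_discussion_posts:sum_n_words) # new feature: the product of two numeric predictors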

Run the remaining steps.

my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>%
    set_mode("regression")

my_wf <-
    workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)

fitted_model_resamples <- fit_resamples(my_wf, resamples = kfcv,
                              control = control_grid(save_pred = TRUE)) # this allows us to inspect the predictions
## ! Fold01: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold01: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold02: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold02: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold03: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold03: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold04: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold04: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold05: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold05: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold06: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold06: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold07: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold07: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold08: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold08: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold09: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold09: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold10: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold10: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
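
Those warnings indicate that, within some folds, at least one variable has zero or non-finite range when step_normalize() runs. If you want to find the variable responsible, one optional diagnostic (our suggestion, using standard recipes functions) is to prep and bake the recipe on the training data:

prepped <- prep(my_rec, training = data_train) # estimate the recipe steps from the training data
bake(prepped, new_data = NULL) %>%             # new_data = NULL returns the processed training set
    glimpse()                                  # scan for constant or all-NA columns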
fitted_model_resamples %>% 
    unnest(.metrics) %>% 
    filter(.metric == "rmse") # we also get another metric, R-squared (rsq); we focus just on RMSE for now
## # A tibble: 10 × 8
##    splits           id     .metric .estimator .estimate .config         .notes  
##    <list>           <chr>  <chr>   <chr>          <dbl> <chr>           <list>  
##  1 <split [392/44]> Fold01 rmse    standard        21.4 Preprocessor1_… <tibble>
##  2 <split [392/44]> Fold02 rmse    standard        20.9 Preprocessor1_… <tibble>
##  3 <split [392/44]> Fold03 rmse    standard        22.3 Preprocessor1_… <tibble>
##  4 <split [392/44]> Fold04 rmse    standard        19.0 Preprocessor1_… <tibble>
##  5 <split [392/44]> Fold05 rmse    standard        13.7 Preprocessor1_… <tibble>
##  6 <split [392/44]> Fold06 rmse    standard        17.7 Preprocessor1_… <tibble>
##  7 <split [393/43]> Fold07 rmse    standard        17.0 Preprocessor1_… <tibble>
##  8 <split [393/43]> Fold08 rmse    standard        23.6 Preprocessor1_… <tibble>
##  9 <split [393/43]> Fold09 rmse    standard        21.2 Preprocessor1_… <tibble>
## 10 <split [393/43]> Fold10 rmse    standard        28.5 Preprocessor1_… <tibble>
## # … with 1 more variable: .predictions <list>
fitted_model_resamples %>% 
    collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   20.5      10  1.28   Preprocessor1_Model1
## 2 rsq     standard    0.109    10  0.0204 Preprocessor1_Model1

Did that feature engineering make any difference compared to the mean predictive accuracy you found in the case study? Add a few notes below:
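
One optional way to put that RMSE in context before writing your notes (a sketch we are adding, not part of the case study) is to fit an intercept-only null model on the same folds and compare; null_model() comes from parsnip:

null_wf <- workflow() %>%
    add_model(null_model() %>% set_engine("parsnip") %>% set_mode("regression")) %>%
    add_recipe(my_rec)

fit_resamples(null_wf, resamples = kfcv) %>% 
    collect_metrics() # a baseline RMSE to compare against the value above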

collect_predictions(fitted_model_resamples) %>% 
    ggplot(aes(x = .pred, y = final_grade)) +
    geom_point()
## Warning: Removed 16 rows containing missing values (geom_point).
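
To make the predicted-versus-actual pattern easier to judge, one optional tweak (our addition) is a dashed reference line where predicted and actual grades are equal:

collect_predictions(fitted_model_resamples) %>% 
    ggplot(aes(x = .pred, y = final_grade)) +
    geom_point(alpha = 0.5) +                                  # lighter points reveal overplotting
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") # points on this line are perfect predictions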

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the author: field in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note: you will need to quickly create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.