As a reminder, to earn a badge for these learning labs, you will have to respond to a set of prompts in two parts.

Part I: Data Product

For the data product, you will interpret a different type of model: a model in regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this learning lab is different: how the model is interpreted. The confusion matrix we created to parse the predictive strength of our classification model does not pertain to regression machine learning models. Instead, regression models are evaluated with error metrics such as mean absolute error (MAE) and root mean squared error (RMSE). For this badge activity, you will specify and interpret a regression machine learning model.
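To make the regression metrics concrete, here is a minimal base R sketch of how MAE and RMSE are computed from observed and predicted values. The numbers are hypothetical and chosen only for illustration; they are not drawn from the OULAD data.

```r
# Hypothetical observed and predicted scores (illustrative values only)
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 35)

# Prediction errors: observed minus predicted
errors <- observed - predicted

# Mean absolute error: the average size of the errors, ignoring direction
mae <- mean(abs(errors))

# Root mean squared error: squares the errors before averaging,
# so large misses count more heavily than small ones
rmse <- sqrt(mean(errors^2))

mae  # 3
rmse # about 3.24
```

Both metrics are expressed in the units of the outcome, so whether a given value is large or small depends on the scale of the variable being predicted.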

The requirements are as follows:

Please use the code chunk below for your code:

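library(tidyverse)  # read_csv(), the dplyr verbs, and the pipe
library(tidymodels) # initial_split(), recipe(), workflow(), and the metric functions

assessments %>%
    count(assessment_type)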
## # A tibble: 3 × 2
##   assessment_type     n
##   <chr>           <int>
## 1 CMA             70527
## 2 Exam             4959
## 3 TMA             98426
assessments %>%  
    distinct(id_assessment) # this many unique assessments 
## # A tibble: 188 × 1
##    id_assessment
##            <dbl>
##  1          1752
##  2          1753
##  3          1754
##  4          1755
##  5          1756
##  6          1758
##  7          1759
##  8          1760
##  9          1761
## 10          1762
## # ℹ 178 more rows
assessments %>%  
    count(assessment_type, code_module, code_presentation) 
## # A tibble: 41 × 4
##    assessment_type code_module code_presentation     n
##    <chr>           <chr>       <chr>             <int>
##  1 CMA             BBB         2013B              5049
##  2 CMA             BBB         2013J              6416
##  3 CMA             BBB         2014B              4493
##  4 CMA             CCC         2014B              3920
##  5 CMA             CCC         2014J              5846
##  6 CMA             DDD         2013B              5252
##  7 CMA             FFF         2013B              6681
##  8 CMA             FFF         2013J              8847
##  9 CMA             FFF         2014B              5549
## 10 CMA             FFF         2014J              8915
## # ℹ 31 more rows
assessments %>%
    summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
              median_date = median(date, na.rm = TRUE), # find the median
              sd_date = sd(date, na.rm = TRUE), # find the sd
              min_date = min(date, na.rm = TRUE), # find the min
              max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
##   mean_date median_date sd_date min_date max_date
##       <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
## 1      131.         129    78.0       12      261
assessments %>%
    group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
    summarize(mean_date = mean(date, na.rm = TRUE),
              median_date = median(date, na.rm = TRUE),
              sd_date = sd(date, na.rm = TRUE),
              min_date = min(date, na.rm = TRUE),
              max_date = max(date, na.rm = TRUE),
              first_quantile = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups:   code_module [7]
##    code_module code_presentation mean_date median_date sd_date min_date max_date
##    <chr>       <chr>                 <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
##  1 AAA         2013J                 109.          117    71.3       19      215
##  2 AAA         2014J                 109.          117    71.5       19      215
##  3 BBB         2013B                 104.           89    55.5       19      187
##  4 BBB         2013J                 112.           96    61.6       19      208
##  5 BBB         2014B                  98.9          82    58.6       12      194
##  6 BBB         2014J                  99.1         110    65.2       19      201
##  7 CCC         2014B                  98.4         102    68.0       18      207
##  8 CCC         2014J                 104.          109    70.8       18      214
##  9 DDD         2013B                 104.           81    66.0       23      240
## 10 DDD         2013J                 118.           88    77.9       25      261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>

New objects begin below

code_module_dates <- assessments %>%
    group_by(code_module, code_presentation) %>%
    summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
code_module_dates 
## # A tibble: 22 × 3
## # Groups:   code_module [7]
##    code_module code_presentation quantile_cutoff_date
##    <chr>       <chr>                            <dbl>
##  1 AAA         2013J                               54
##  2 AAA         2014J                               54
##  3 BBB         2013B                               54
##  4 BBB         2013J                               54
##  5 BBB         2014B                               47
##  6 BBB         2014J                               54
##  7 CCC         2014B                               32
##  8 CCC         2014J                               32
##  9 DDD         2013B                               51
## 10 DDD         2013J                               53
## # ℹ 12 more rows
assessments_joined <- assessments %>%  
    left_join(code_module_dates) 
## Joining with `by = join_by(code_module, code_presentation)`
assessments_filtered <- assessments_joined %>%
    filter(date < quantile_cutoff_date)
assessments_summarized <- assessments_filtered %>%
    mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
    group_by(id_student) %>%
    summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- students %>%
    left_join(assessments_summarized) 
## Joining with `by = join_by(id_student)`
set.seed(20230712)

students <- read_csv("lab-2/data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read_csv("lab-2/data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

getting error

mean_weighted_score <- students_and_assessments %>% 
    filter(!is.na(mean_weighted_score)) # keep only students who have a value for the outcome
train_test_split <- initial_split(mean_weighted_score, prop = .50, strata = "pass")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
# the outcome (mean_weighted_score) belongs only on the left-hand side of the formula;
# a recipe cannot use the same variable as both outcome and predictor
my_rec <- recipe(mean_weighted_score ~ disability +
                     date_registration + 
                     gender +
                     code_module, 
                 data = data_train) %>% 
    step_dummy(disability) %>% 
    step_dummy(gender) %>%  
    step_dummy(code_module)
my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>% # linear model
    set_mode("regression")

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

fitted_model <- fit(my_wf, data = data_train)
reg_metrics <- metric_set(mae, rmse) # regression metrics: mean absolute error and root mean squared error
final_fit <- last_fit(fitted_model, train_test_split, metrics = reg_metrics)
collect_metrics(final_fit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 mae     standard        194. Preprocessor1_Model1
## 2 rmse    standard        254. Preprocessor1_Model1

Please add your interpretations here:

Part II: Reflect and Plan

  1. What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?
  2. What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?
  3. Look back to the study you identified for the first machine learning learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the name in the author: field of the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Receiving Your Machine Learning Badge

To receive credit for this assignment and earn your second ML Badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.

Once your instructor has checked your link, you will be provided a physical version of the badge below!