The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

Part I: Extending our model

In this part of the badge activity, please add another variable: the number of days before the start of the module that students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at all, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.
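
Even when no pre-processing is needed, a quick check that the new column really is numeric, and of how many values are missing, can be useful before it goes into the model. The following is a minimal base R sketch; `toy_students` and its values are made up for illustration and are not rows from the OULAD file.

```r
# Hypothetical stand-in for the students data; values are invented for illustration
toy_students <- data.frame(
  id_student = 1:6,
  date_registration = c(-159, -53, NA, -92, -30, -17)
)

# Confirm the column is numeric, so it can enter the model formula as-is
is_numeric_column <- is.numeric(toy_students$date_registration)

# Count missing values; by default glm() drops rows with a missing predictor when fitting
n_missing <- sum(is.na(toy_students$date_registration))

is_numeric_column
n_missing
summary(toy_students$date_registration)
```

The same three checks could be run on the real `students` data by swapping in that object's name.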

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

## This is the code from Lab 1 Case Study
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

library(tidyverse)
library(janitor)
library(tidymodels)

# Read CSV
students <- read_csv("data/oulad-students.csv")

## Inspect Data
glimpse(students)

# Mutate Variables

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass" with a dummy code of 1 if the value of final_result equals "Pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

## Creating New Independent (predictor) Variable

students <- students %>% 
    mutate(disability = as.factor(disability))

## View Data so far

View(students)

## Creating New Independent (predictor) Variable

students <- students %>%
    mutate(disability = ifelse(disability == "Y", 1, 0)) %>% # dummy code: 1 if disability equals "Y", 0 if not
    mutate(disability = as.factor(disability)) # makes the variable a factor, helping later steps


## Examine Variables

students %>% 
    count(id_student) # this many students

students %>% 
    count(code_module, code_presentation) # this many offerings

## Feature Engineering

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## Split Data

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

## Check Data split

data_train
data_test

## Create a Recipe

my_rec <- recipe(pass ~ disability + imd_band, data = data_train)

my_rec

## Specify Model

# specify model
my_mod <-
    logistic_reg()

## Finish Specifying Model

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Add Workflow

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

## Fit Model

fitted_model <- fit(my_wf, data = data_train)
## Check out Fitted Model
fitted_model

## Last Fit Function

##last_fit(my_wf,train_test_split)

## Final Fit - In the case study that I published, I used my_wf instead of fitted_model because I thought I was getting an error. After doing the Lab 1 Overview, I think I should have used fitted_model.

final_fit <- last_fit(fitted_model, train_test_split)

## Interpret Accuracy

# collect test split predictions
final_fit %>%
    collect_predictions()

## Summarize

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## Counting Values of Correct 62.7%

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

## How Accurate was the model?

students %>% 
    count(pass)

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)

## This is the code with the third predictor variable added in
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

library(tidyverse)
library(janitor)
library(tidymodels)

# Read CSV
students <- read_csv("data/oulad-students.csv")

## Inspect Data
glimpse(students)

# Mutate Variables

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass" with a dummy code of 1 if the value of final_result equals "Pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

## Creating New Independent (predictor) Variable

students <- students %>% 
    mutate(disability = as.factor(disability))

## View Data so far

View(students)

## Creating New Independent (predictor) Variable

students <- students %>%
    mutate(disability = ifelse(disability == "Y", 1, 0)) %>% # dummy code: 1 if disability equals "Y", 0 if not
    mutate(disability = as.factor(disability)) # makes the variable a factor, helping later steps


## Creating our new third independent (predictor) variable - it's already a number, so there is no need for this

##students <- students %>%
  ##  mutate(date_registration = as.factor(date_registration)) # makes the variable a factor, helping later steps
students
## Examine Variables

students %>% 
    count(id_student) # this many students

students %>% 
    count(code_module, code_presentation) # this many offerings

## Feature Engineering

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## Split Data

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

## Check Data split

data_train
data_test

## Create a Recipe

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

my_rec

## Specify Model

# specify model
my_mod <-
    logistic_reg()

## Finish Specifying Model

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Add Workflow

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

## Fit Model

fitted_model <- fit(my_wf, data = data_train)

## Check out Fitted Model
##fitted_model

## Last Fit Function

##last_fit(my_wf,train_test_split)

## Final Fit - In the case study that I published, I used my_wf instead of fitted_model because I thought I was getting an error. After doing the Lab 1 Overview, I think I should have used fitted_model.


final_fit <- last_fit(fitted_model, train_test_split)

final_fit
## Interpret Accuracy

# collect test split predictions
final_fit %>%
    collect_predictions()

## Summarize

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## Counting Values of Correct 62.7%

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

## How Accurate was the model?

students %>% 
    count(pass)

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)
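
The random-guess comparison above can be complemented with a majority-class baseline, which always predicts the most common outcome. The sketch below is a minimal base R illustration; the `pass` vector is hypothetical, not the real data, and the approach is an addition to the lab rather than part of it.

```r
# Hypothetical outcome vector (1 = pass, 0 = did not pass); not the real data
pass <- factor(c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0))

# Find the most common class
class_counts <- table(pass)
majority_class <- names(class_counts)[which.max(class_counts)]

# Accuracy obtained by always predicting the most common class
majority_accuracy <- mean(pass == majority_class)

majority_class
majority_accuracy
```

If a model's accuracy is close to this baseline, the model may be doing little more than predicting the most common class.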

Previous results: 62.7% accuracy

New results: 62.7%

How does the accuracy of this new model compare? Add a few reflections below:

The results are exactly the same with the addition of the third predictor variable, which suggests that the third variable did not add any predictive power to this model.
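
One possible reason two models can report identical accuracy is that accuracy depends only on the predicted class, which is set by whether each predicted probability falls above or below a threshold (0.5 is assumed here). The sketch below uses hypothetical probabilities, not output from the models above, to show that probabilities can shift without any prediction changing class.

```r
# Hypothetical predicted probabilities of passing for five students
prob_two_predictors   <- c(0.30, 0.62, 0.42, 0.28, 0.45)
prob_three_predictors <- c(0.27, 0.66, 0.46, 0.25, 0.49)

# Hypothetical observed outcomes (1 = pass, 0 = did not pass)
observed <- c(0, 1, 0, 0, 1)

# Convert probabilities to predicted classes using an assumed 0.5 threshold
to_class <- function(prob, threshold = 0.5) ifelse(prob >= threshold, 1, 0)

class_two   <- to_class(prob_two_predictors)
class_three <- to_class(prob_three_predictors)

# Every probability changed, but no prediction crossed the threshold
identical(class_two, class_three)

# So both sets of predictions yield the same accuracy
accuracy_two   <- mean(class_two == observed)
accuracy_three <- mean(class_three == observed)
c(two_predictors = accuracy_two, three_predictors = accuracy_three)
```

Comparing the predicted probabilities, or the fitted coefficients, rather than accuracy alone would show whether the new variable changed the model at all.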

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

  1. Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?
  2. How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?
  3. Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

Part B:

  1. How good was the machine learning model you developed in the badge activity? How would you react, as a reviewer of research, if you read about someone using such a model? Please add your thoughts and reflections following the bullet point below.
  2. How might the model be improved? Share any ideas you have at this time below:

Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.

  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

    • To estimate, at the earliest stage of the semester, each student's probability of passing the courses they are taking. The aim is to improve students' academic performance.
  3. What were the results of these analyses?

    • The authors used a 75% training / 25% testing split and a decision tree model with 6 independent variables. The model reached 76.19% accuracy, 83.33% precision, and 88.23% recall. The article stated that the authors interpret these results as good predictive ability.
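
Accuracy, precision, and recall all come from the same four confusion-matrix counts. The sketch below shows how each is computed in base R. The counts are hypothetical, chosen only to give percentages of roughly the size reported above; they are not taken from the study.

```r
# Hypothetical confusion-matrix counts, for illustration only
true_positive  <- 15  # predicted pass, actually passed
false_positive <- 3   # predicted pass, actually did not pass
false_negative <- 2   # predicted not pass, actually passed
true_negative  <- 1   # predicted not pass, actually did not pass

total <- true_positive + false_positive + false_negative + true_negative

# Share of all predictions that were correct
accuracy <- (true_positive + true_negative) / total

# Share of predicted passes that were actual passes
precision <- true_positive / (true_positive + false_positive)

# Share of actual passes that the model identified
recall <- true_positive / (true_positive + false_negative)

round(100 * c(accuracy = accuracy, precision = precision, recall = recall), 2)
```

With counts like these, recall and precision can look strong even though very few non-passing cases are identified correctly, which is one reason to look at all three metrics together.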

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the name after author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let your instructor know if you run into any issues with knitting.

  3. Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.

Once your instructor has checked your link, you will be provided a physical version of the badge below!