Machine Learning - Lab 1 Independent Assignment

The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable: the number of days before the start of the module that students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at all, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.
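Before relying on the "no pre-processing" note, it can help to confirm that the registration column really is numeric and to see whether it contains missing values. The sketch below is only an illustration on a made-up table: `toy_students` and its values are hypothetical stand-ins for the lab data, and negative values are assumed to mean days before the module start.

```r
library(dplyr)

# Hypothetical stand-in for the students data; values are invented for illustration.
toy_students <- tibble(
  id_student = 1:6,
  date_registration = c(-120, -45, NA, -10, 5, -60)
)

# Check the column type, count missing values, and look at the range.
reg_check <- toy_students %>%
  summarise(
    is_numeric = is.numeric(date_registration),
    n_missing  = sum(is.na(date_registration)),
    earliest   = min(date_registration, na.rm = TRUE),
    latest     = max(date_registration, na.rm = TRUE)
  )

reg_check
```

Running the same `summarise()` call on the real `students` data frame would show whether any rows need attention before the variable goes into the recipe.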

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

## This is the code from Lab 1 Case Study
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

library(tidyverse)
library(janitor)
library(tidymodels)

# Read CSV
students <- read_csv("data/oulad-students.csv")

## Inspect Data
glimpse(students)

# Mutate Variables

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", dummy coded 1 if final_result equals "Pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

## Creating New Independent (predictor) Variable

students <- students %>% 
    mutate(disability = as.factor(disability))

## View Data so far

View(students)

## Creating New Independent (predictor) Variable

students <- students %>%
    mutate(disability = ifelse(disability == "Y", 1, 0)) %>% # dummy code of 1 if disability equals "Y" and 0 if not
    mutate(disability = as.factor(disability)) # makes the variable a factor, helping later steps


## Examine Variables

students %>% 
    count(id_student) # this many students

students %>% 
    count(code_module, code_presentation) # this many offerings

## Feature Engineering

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## Split Data

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

## Check Data split

data_train
data_test

## Create a Recipe

my_rec <- recipe(pass ~ disability + imd_band, data = data_train)

my_rec

## Specify Model

# specify model
my_mod <-
    logistic_reg()

## Finish Specifying Model

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Add Workflow

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

## Fit Model

fitted_model <- fit(my_wf, data = data_train)
## Check out Fitted Model
fitted_model

## Last Fit Function

##last_fit(my_wf,train_test_split)

## Final Fit - In the case study that I published, I used my_wf instead of fitted_model because I thought I was getting an error. After doing the Lab 1 Overview, I think I should have used fitted_model.

final_fit <- last_fit(fitted_model, train_test_split)

## Interpret Accuracy

# collect test split predictions
final_fit %>%
    collect_predictions()

## Summarize

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## Counting Values of Correct: 62.7%

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

## How Accurate was the model?

students %>% 
    count(pass)

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)

## This is the code with the third predictor variable added in
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

library(tidyverse)
library(janitor)
library(tidymodels)

# Read CSV
students <- read_csv("data/oulad-students.csv")

## Inspect Data
glimpse(students)

# Mutate Variables

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", dummy coded 1 if final_result equals "Pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

## Creating New Independent (predictor) Variable

students <- students %>% 
    mutate(disability = as.factor(disability))

## View Data so far

View(students)

## Creating New Independent (predictor) Variable

students <- students %>%
    mutate(disability = ifelse(disability == "Y", 1, 0)) %>% # dummy code of 1 if disability equals "Y" and 0 if not
    mutate(disability = as.factor(disability)) # makes the variable a factor, helping later steps


## Creating our new third independent (predictor) variable - it's already a number, so there is no need for this step

##students <- students %>%
  ##  mutate(date_registration = as.factor(date_registration)) # makes the variable a factor, helping later steps
students
## Examine Variables

students %>% 
    count(id_student) # this many students

students %>% 
    count(code_module, code_presentation) # this many offerings

## Feature Engineering

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## Split Data

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

## Check Data split

data_train
data_test

## Create a Recipe

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

my_rec

## Specify Model

# specify model
my_mod <-
    logistic_reg()

## Finish Specifying Model

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Add Workflow

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

## Fit Model

fitted_model <- fit(my_wf, data = data_train)

## Check out Fitted Model
##fitted_model

## Last Fit Function

##last_fit(my_wf,train_test_split)

## Final Fit - In the case study that I published, I used my_wf instead of fitted_model because I thought I was getting an error. After doing the Lab 1 Overview, I think I should have used fitted_model.


final_fit <- last_fit(fitted_model, train_test_split)

final_fit
## Interpret Accuracy

# collect test split predictions
final_fit %>%
    collect_predictions()

## Summarize

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## Counting Values of Correct: 62.7%

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

## How Accurate was the model?

students %>% 
    count(pass)

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)

Previous results: 62.7% accuracy

New results: 62.7% accuracy

How does the accuracy of this new model compare? Add a few reflections below:

The results are exactly the same with the addition of the third predictor variable, which suggests that the third variable did not add any predictive power.
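One possible way to probe why two models score identically is to look past the single accuracy number at what the model actually predicts. The sketch below uses a made-up prediction table, not the lab's output: `toy_preds` is hypothetical and deliberately built so that every prediction is the same class. `accuracy()` and `conf_mat()` come from the yardstick package, which tidymodels loads.

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions in the same shape as collect_predictions() output.
# Every predicted class is 0, to illustrate one scenario worth ruling out.
toy_preds <- tibble(
  pass        = factor(c(0, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1)),
  .pred_class = factor(c(0, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
)

# Overall accuracy: share of rows where the prediction matches the truth.
toy_accuracy <- accuracy(toy_preds, truth = pass, estimate = .pred_class)

# Confusion matrix: shows which classes the model does and does not predict.
toy_conf_mat <- conf_mat(toy_preds, truth = pass, estimate = .pred_class)

# How many times each class was predicted.
pred_counts <- toy_preds %>% count(.pred_class)

toy_accuracy
toy_conf_mat
pred_counts
```

If a model predicts only one class for every student, adding a weak predictor would leave accuracy unchanged, so applying the same checks to `final_fit %>% collect_predictions()` could show whether that is what happened here.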

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

The primary difference between the two cultures of statistical modeling in Breiman’s paper lies in what each assumes about how the data were generated and in how each validates a model. The data modeling culture assumes the data come from a specified stochastic model (for example, linear or logistic regression) and validates it with yes/no goodness-of-fit tests and an examination of the residuals. The algorithmic modeling culture treats the mechanism that generated the data as an unknown black box and validates models by predictive accuracy.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

With the rise of big data, there is a greater need for analysis techniques that can handle vast amounts of data. Many machine learning models are capable of this, so there has been a big increase in the use of machine learning. The downside, however, is the lack of understanding of how machine learning models achieve their results.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

I agree that predictive accuracy is the main goal. However, understanding how a method works is also valuable. If there is a complete lack of understanding of a method, it could easily be applied incorrectly, resulting in bad or subpar results.

Part B:

How good was the machine learning model you developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The machine learning model from the badge activity predicted correctly 62.7% of the time. I would not consider this good for a dichotomous pass/fail prediction. Randomly guessing whether a student would pass or fail results in about 50% accuracy over time, so the model is only 12.7 percentage points better than guessing. That is better than not having the model, but not good in my opinion.
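A coin flip is one reference point, but a second common one is the majority-class baseline: always predicting the more frequent outcome. The sketch below shows how that baseline might be computed. It uses an invented outcome column (`toy_outcomes` is hypothetical, not the lab data), so the proportion it produces says nothing about the real students file.

```r
library(dplyr)

# Hypothetical outcomes: six students coded 0 (did not pass) and four coded 1.
toy_outcomes <- tibble(pass = factor(c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1)))

# Share of each class; the largest share is the accuracy achieved by
# always predicting that class.
baseline <- toy_outcomes %>%
  count(pass) %>%
  mutate(prop = n / sum(n)) %>%
  slice_max(prop, n = 1, with_ties = FALSE)

baseline
```

Running the same steps on `students` would give the figure a model needs to beat to be doing better than always guessing the most common result.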

How might the model be improved? Share any ideas you have at this time below:

The use of a larger data set with more variables would help. Our model also used only three predictor variables from the data set. Other variables in the data set, such as highest education, were not used and could help improve accuracy. Experimenting with other models to compare accuracy would also help.
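As a rough sketch of the "more variables" idea, a categorical predictor can be added to a recipe and converted to dummy variables with `step_dummy()`. Everything in the toy data below is an assumption for illustration: the column name `highest_education` and its level labels are guesses at how such a variable might look, and `toy_train` is not the lab's training set.

```r
library(recipes)

# Hypothetical training data with one numeric and one categorical predictor.
toy_train <- data.frame(
  pass = factor(c(0, 1, 0, 1, 1, 0)),
  imd_band = c(1, 5, 3, 8, 9, 2),
  highest_education = factor(c(
    "A Level", "HE Qualification", "Lower Than A Level",
    "A Level", "HE Qualification", "Lower Than A Level"
  ))
)

# Recipe with the extra predictor; nominal predictors become dummy columns.
extended_rec <- recipe(pass ~ imd_band + highest_education, data = toy_train) %>%
  step_dummy(all_nominal_predictors())

# Prep and bake on the training data to inspect the resulting columns.
baked <- extended_rec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

A recipe built this way could replace `my_rec` in the workflow, leaving the model specification and the fitting steps unchanged.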

Part C: Use the institutional library (e.g. NU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.
- Doctor, A. C. (2023, April 11). A Predictive Model using Machine Learning Algorithm in Identifying Students Probability on Passing Semestral Course. ArXiv.org. https://doi.org/10.25147/ijcsr.2017.001.1.135
What research questions were the authors of this study trying to address and why did they consider these questions important?
- To determine a student's probability of passing their courses at the earliest stage of the semester. The aim is to improve students' academic performance.
What were the results of these analyses?
- The authors used a decision tree model with 6 independent variables and a split of 75% training data and 25% testing data. The model achieved 76.19% accuracy, 83.33% precision, and 88.23% recall. The article stated that they interpreted the results as good predictive ability.
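For readers less familiar with the three metrics reported above, the sketch below shows how accuracy, precision, and recall are computed from the four cells of a confusion matrix. The counts are hypothetical and are not taken from the cited study.

```r
# Hypothetical confusion-matrix counts (not from the cited study).
tp <- 40  # predicted pass, actually passed
fp <- 10  # predicted pass, actually did not pass
fn <- 5   # predicted not pass, actually passed
tn <- 45  # predicted not pass, actually did not pass

# Accuracy: share of all predictions that were correct.
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Precision: of the students predicted to pass, the share who did pass.
precision <- tp / (tp + fp)

# Recall: of the students who passed, the share the model identified.
recall <- tp / (tp + fn)

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

Reporting precision and recall alongside accuracy, as the study does, gives a fuller picture than the single accuracy figure used in the badge activity.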

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

[Austin Hannold]

April 1, 2024
