Machine Learning - Lab 1 Independent Assignment

The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable – a variable for the number of days before the start of the module students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at al, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.1

## Warning: package 'ggplot2' was built under R version 4.3.1

## Warning: package 'tidyr' was built under R version 4.3.1

## Warning: package 'readr' was built under R version 4.3.1

## Warning: package 'purrr' was built under R version 4.3.1

## Warning: package 'dplyr' was built under R version 4.3.1

## Warning: package 'forcats' was built under R version 4.3.1

## Warning: package 'lubridate' was built under R version 4.3.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.1

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## Warning: package 'broom' was built under R version 4.3.1

## Warning: package 'dials' was built under R version 4.3.1

## Warning: package 'scales' was built under R version 4.3.1

## Warning: package 'infer' was built under R version 4.3.1

## Warning: package 'modeldata' was built under R version 4.3.1

## Warning: package 'parsnip' was built under R version 4.3.1

## Warning: package 'recipes' was built under R version 4.3.1

## Warning: package 'rsample' was built under R version 4.3.1

## Warning: package 'tune' was built under R version 4.3.1

## Warning: package 'workflows' was built under R version 4.3.1

## Warning: package 'workflowsets' was built under R version 4.3.1

## Warning: package 'yardstick' was built under R version 4.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(janitor)

## Warning: package 'janitor' was built under R version 4.3.1

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(readr)

####Reading file

students <- read_csv("C:\\Users\\tejat\\OneDrive\\Documents\\machine-learning\\lab-1\\data\\oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

####Inspect data

glimpse(students)

## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

####Data Wrangling

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass" and a dummy code of 1 if value of final_result equals "pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>% 
    mutate(disability = as.factor(disability))
view(students)

####Examine Variables

students %>% 
    count(id_student) # this many students

## # A tibble: 28,785 × 2
##    id_student     n
##         <dbl> <int>
##  1       3733     1
##  2       6516     1
##  3       8462     2
##  4      11391     1
##  5      23629     1
##  6      23632     1
##  7      23698     1
##  8      23798     1
##  9      24186     1
## 10      24213     2
## # ℹ 28,775 more rows

students %>% 
    count(code_module, code_presentation) # this many offerings

## # A tibble: 22 × 3
##    code_module code_presentation     n
##    <chr>       <chr>             <int>
##  1 AAA         2013J               383
##  2 AAA         2014J               365
##  3 BBB         2013B              1767
##  4 BBB         2013J              2237
##  5 BBB         2014B              1613
##  6 BBB         2014J              2292
##  7 CCC         2014B              1936
##  8 CCC         2014J              2498
##  9 DDD         2013B              1303
## 10 DDD         2013J              1938
## # ℹ 12 more rows

####Mutate Variable

students <- students %>% 
    mutate(multiple_depravity = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(multiple_depravity = as.integer(multiple_depravity)) # this changes the levels into integers based on the order of the factor levels

 

students

## # A tibble: 32,593 × 17
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## #   multiple_depravity <int>

####Splitting Data

set.seed(20230709)

 

train_test_split <- initial_split(students, prop = .80)

 

data_train <- training(train_test_split)

 

data_test  <- testing(train_test_split)
data_train

## # A tibble: 26,074 × 17
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 EEE         2014J                 648834 M      Scotland    HE Qualification 
##  2 DDD         2013J                 572089 M      Scotland    Lower Than A Lev…
##  3 BBB         2013J                 335136 F      Scotland    HE Qualification 
##  4 EEE         2014J                 636124 M      East Midla… Post Graduate Qu…
##  5 DDD         2014J                 637691 M      East Midla… A Level or Equiv…
##  6 FFF         2013J                 188524 F      West Midla… Lower Than A Lev…
##  7 FFF         2013B                2560595 M      East Angli… Lower Than A Lev…
##  8 CCC         2014J                 634776 M      East Midla… A Level or Equiv…
##  9 FFF         2014B                 618397 M      East Angli… HE Qualification 
## 10 GGG         2014J                 698019 F      Wales       Lower Than A Lev…
## # ℹ 26,064 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## #   multiple_depravity <int>

data_test

## # A tibble: 6,519 × 17
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  4 AAA         2013J                  62155 F      North West… HE Qualification 
##  5 AAA         2013J                 110175 M      East Angli… HE Qualification 
##  6 AAA         2013J                 114999 M      Yorkshire … HE Qualification 
##  7 AAA         2013J                 116541 M      Wales       HE Qualification 
##  8 AAA         2013J                 127582 F      East Midla… A Level or Equiv…
##  9 AAA         2013J                 141355 F      East Midla… A Level or Equiv…
## 10 AAA         2013J                 145130 M      South West… HE Qualification 
## # ℹ 6,509 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## #   multiple_depravity <int>

####Creating Recipe

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

 

my_rec

##

## ── Recipe ──────────────────────────────────────────────────────────────────────

##

## ── Inputs

## Number of variables by role

## outcome:   1
## predictor: 3

####specify model

my_mod <-
    logistic_reg()

####Set Engine

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

 

my_mod

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

####Workflow(Recipe + Engine)

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

####Fit model

fitted_model <- fit(my_wf, data = data_train)
fitted_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        disabilityY      imd_band10-20     imd_band20-30%  
##          -0.65487           -0.29256            0.11654            0.12186  
##    imd_band30-40%     imd_band40-50%     imd_band50-60%     imd_band60-70%  
##           0.31315            0.29855            0.39549            0.45231  
##    imd_band70-80%     imd_band80-90%    imd_band90-100%  date_registration  
##           0.39705            0.50083            0.54349            0.00169  
## 
## Degrees of Freedom: 25147 Total (i.e. Null);  25136 Residual
##   (926 observations deleted due to missingness)
## Null Deviance:       33350 
## Residual Deviance: 33080     AIC: 33100

final_fit <- last_fit(my_wf, split = train_test_split)
final_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>

####collect test split predictions

final_fit %>%
    collect_predictions()

## # A tibble: 6,519 × 7
##    id               .pred_0 .pred_1  .row .pred_class pass  .config             
##    <chr>              <dbl>   <dbl> <int> <fct>       <fct> <chr>               
##  1 train/test split   0.651   0.349     2 0           1     Preprocessor1_Model1
##  2 train/test split   0.586   0.414     4 0           1     Preprocessor1_Model1
##  3 train/test split   0.540   0.460     8 0           1     Preprocessor1_Model1
##  4 train/test split   0.592   0.408    14 0           1     Preprocessor1_Model1
##  5 train/test split   0.584   0.416    32 0           1     Preprocessor1_Model1
##  6 train/test split   0.579   0.421    36 0           1     Preprocessor1_Model1
##  7 train/test split   0.683   0.317    37 0           1     Preprocessor1_Model1
##  8 train/test split   0.601   0.399    41 0           1     Preprocessor1_Model1
##  9 train/test split   0.609   0.391    48 0           1     Preprocessor1_Model1
## 10 train/test split   0.565   0.435    52 0           1     Preprocessor1_Model1
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## # A tibble: 6,519 × 3
##    .pred_class pass  correct
##    <fct>       <fct> <lgl>  
##  1 0           1     FALSE  
##  2 0           1     FALSE  
##  3 0           1     FALSE  
##  4 0           1     FALSE  
##  5 0           1     FALSE  
##  6 0           1     FALSE  
##  7 0           1     FALSE  
##  8 0           1     FALSE  
##  9 0           1     FALSE  
## 10 0           1     FALSE  
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n    percent valid_percent
##    FALSE 2325 0.35664979     0.3696931
##     TRUE 3964 0.60806872     0.6303069
##       NA  230 0.03528148            NA

####Results and Evalutions

students %>% 
    count(pass)

## # A tibble: 2 × 2
##   pass      n
##   <fct> <int>
## 1 0     20232
## 2 1     12361

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)

##  correct     n   percent
##        0 16281 0.4995244
##        1 16312 0.5004756

Previous results:0.6308522 New results: 0.6303069

How does the accuracy of this new model compare? Add a few reflections below:

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

The “Data Modeling” Culture: This culture emphasizes the use of parametric statistical models, such as linear regression, generalized linear models, and decision trees and The “Algorithmic Modeling” Culture: This culture focuses on building predictive models using machine learning algorithms like random forests, support vector machines, neural networks, and boosting methods

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

Reinforcement of Algorithmic Modeling: The rise of big data and machine learning has reinforced the importance of the algorithmic modeling culture. With vast amounts of data becoming available, traditional parametric models often struggle to capture complex patterns and relationships.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

Real-world Application is for agrredcal applications, such as recommendation systems, image classification, and autonomous vehicles, the primary objective is to make accurate predictions. Disagreed Scientific Inquiry: In scientific research, understanding the underlying processes and mechanisms is often essential.

Part B:

How good was the machine learning model you developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

Model Performance: The effectiveness of a machine learning model as a reviewer would largely depend on its performance metrics, such as precision, recall, and F1-score. It’s crucial to evaluate the model’s ability to correctly assess the quality, relevance, and validity of research papers.

How might the model be improved? Share any ideas you have at this time below:

Incorporate More Data: Expanding the dataset used for training to include a broader range of research papers from diverse fields and publication sources can enhance the model’s generalization and ability to review a wider variety of papers. Feature Engineering: Designing better features or representations of research papers, including text, citations, and metadata, can help the model capture more nuanced information about the quality and relevance of papers.

Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.
- “The Impact of Climate Change on Global Crop Production: Risks and Adaptation” by Schlenker, W., & Roberts, M. J. (2009).
What research questions were the authors of this study trying to address and why did they consider these questions important?
- In the field of Environmental Science: What is the impact of deforestation on local biodiversity? How does climate change affect ocean currents and marine ecosystems? In the field of Psychology: What are the long-term effects of childhood trauma on adult mental health? How do different types of therapy interventions compare in treating anxiety disorders?
What were the results of these analyses?
- Species Diversity: As deforestation increased, the overall species diversity within the affected areas decreased. This was evident through a decrease in the number of species observed and a decline in species richness. Endangered Species: Many species that were classified as endangered or vulnerable by conservation organizations were disproportionately affected by deforestation. Their populations declined significantly in areas with extensive forest clearance

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on Rpubs by clicking the “Publish” button located in the Viewer Pane after you knit your document.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Lab 1 Independent Assignment

[Enter your name]

September 12, 2023

Part I: Extending our model

Part II: Reflect and Plan

Knit and Publish

Your First Machine Learning Badge