The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts for two parts:
In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In this part of the badge activity, please add another variable – a variable for the number of days before the start of the module students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at al, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.
In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'tidyr' was built under R version 4.3.1
## Warning: package 'readr' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
## Warning: package 'forcats' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.1
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## Warning: package 'broom' was built under R version 4.3.1
## Warning: package 'dials' was built under R version 4.3.1
## Warning: package 'scales' was built under R version 4.3.1
## Warning: package 'infer' was built under R version 4.3.1
## Warning: package 'modeldata' was built under R version 4.3.1
## Warning: package 'parsnip' was built under R version 4.3.1
## Warning: package 'recipes' was built under R version 4.3.1
## Warning: package 'rsample' was built under R version 4.3.1
## Warning: package 'tune' was built under R version 4.3.1
## Warning: package 'workflows' was built under R version 4.3.1
## Warning: package 'workflowsets' was built under R version 4.3.1
## Warning: package 'yardstick' was built under R version 4.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.1
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(readr)
students <- read_csv("C:\\Users\\abhil\\Documents\\machine-learning\\lab-1\\data\\oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(students)
## Rows: 32,593
## Columns: 15
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass" and a dummy code of 1 if value of final_result equals "pass" and 0 if not
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(disability = as.factor(disability))
view(students)
students %>%
count(id_student) # this many students
## # A tibble: 28,785 × 2
## id_student n
## <dbl> <int>
## 1 3733 1
## 2 6516 1
## 3 8462 2
## 4 11391 1
## 5 23629 1
## 6 23632 1
## 7 23698 1
## 8 23798 1
## 9 24186 1
## 10 24213 2
## # ℹ 28,775 more rows
students %>%
count(code_module, code_presentation) # this many offerings
## # A tibble: 22 × 3
## code_module code_presentation n
## <chr> <chr> <int>
## 1 AAA 2013J 383
## 2 AAA 2014J 365
## 3 BBB 2013B 1767
## 4 BBB 2013J 2237
## 5 BBB 2014B 1613
## 6 BBB 2014J 2292
## 7 CCC 2014B 1936
## 8 CCC 2014J 2498
## 9 DDD 2013B 1303
## 10 DDD 2013J 1938
## # ℹ 12 more rows
students <- students %>%
mutate(multiple_depravity = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(multiple_depravity = as.integer(multiple_depravity)) # this changes the levels into integers based on the order of the factor levels
students
## # A tibble: 32,593 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # multiple_depravity <int>
set.seed(20230709)
train_test_split <- initial_split(students, prop = .80)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
data_train
## # A tibble: 26,074 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 EEE 2014J 648834 M Scotland HE Qualification
## 2 DDD 2013J 572089 M Scotland Lower Than A Lev…
## 3 BBB 2013J 335136 F Scotland HE Qualification
## 4 EEE 2014J 636124 M East Midla… Post Graduate Qu…
## 5 DDD 2014J 637691 M East Midla… A Level or Equiv…
## 6 FFF 2013J 188524 F West Midla… Lower Than A Lev…
## 7 FFF 2013B 2560595 M East Angli… Lower Than A Lev…
## 8 CCC 2014J 634776 M East Midla… A Level or Equiv…
## 9 FFF 2014B 618397 M East Angli… HE Qualification
## 10 GGG 2014J 698019 F Wales Lower Than A Lev…
## # ℹ 26,064 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # multiple_depravity <int>
data_test
## # A tibble: 6,519 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 28400 F Scotland HE Qualification
## 2 AAA 2013J 31604 F South East… A Level or Equiv…
## 3 AAA 2013J 45642 F North West… A Level or Equiv…
## 4 AAA 2013J 62155 F North West… HE Qualification
## 5 AAA 2013J 110175 M East Angli… HE Qualification
## 6 AAA 2013J 114999 M Yorkshire … HE Qualification
## 7 AAA 2013J 116541 M Wales HE Qualification
## 8 AAA 2013J 127582 F East Midla… A Level or Equiv…
## 9 AAA 2013J 141355 F East Midla… A Level or Equiv…
## 10 AAA 2013J 145130 M South West… HE Qualification
## # ℹ 6,509 more rows
## # ℹ 11 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # multiple_depravity <int>
my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)
my_rec
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 3
my_mod <-
logistic_reg()
my_mod <-
logistic_reg() %>%
set_engine("glm") %>% # generalized linear model
set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
my_mod
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
fitted_model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) disabilityY imd_band10-20 imd_band20-30%
## -0.65487 -0.29256 0.11654 0.12186
## imd_band30-40% imd_band40-50% imd_band50-60% imd_band60-70%
## 0.31315 0.29855 0.39549 0.45231
## imd_band70-80% imd_band80-90% imd_band90-100% date_registration
## 0.39705 0.50083 0.54349 0.00169
##
## Degrees of Freedom: 25147 Total (i.e. Null); 25136 Residual
## (926 observations deleted due to missingness)
## Null Deviance: 33350
## Residual Deviance: 33080 AIC: 33100
final_fit <- last_fit(my_wf, split = train_test_split)
final_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble> <workflow>
final_fit %>%
collect_predictions()
## # A tibble: 6,519 × 7
## id .pred_0 .pred_1 .row .pred_class pass .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.651 0.349 2 0 1 Preprocessor1_Model1
## 2 train/test split 0.586 0.414 4 0 1 Preprocessor1_Model1
## 3 train/test split 0.540 0.460 8 0 1 Preprocessor1_Model1
## 4 train/test split 0.592 0.408 14 0 1 Preprocessor1_Model1
## 5 train/test split 0.584 0.416 32 0 1 Preprocessor1_Model1
## 6 train/test split 0.579 0.421 36 0 1 Preprocessor1_Model1
## 7 train/test split 0.683 0.317 37 0 1 Preprocessor1_Model1
## 8 train/test split 0.601 0.399 41 0 1 Preprocessor1_Model1
## 9 train/test split 0.609 0.391 48 0 1 Preprocessor1_Model1
## 10 train/test split 0.565 0.435 52 0 1 Preprocessor1_Model1
## # ℹ 6,509 more rows
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, pass) %>% # just to make the output easier to view
mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct
## # A tibble: 6,519 × 3
## .pred_class pass correct
## <fct> <fct> <lgl>
## 1 0 1 FALSE
## 2 0 1 FALSE
## 3 0 1 FALSE
## 4 0 1 FALSE
## 5 0 1 FALSE
## 6 0 1 FALSE
## 7 0 1 FALSE
## 8 0 1 FALSE
## 9 0 1 FALSE
## 10 0 1 FALSE
## # ℹ 6,509 more rows
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, pass) %>% # just to make the output easier to view
mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
tabyl(correct)
## correct n percent valid_percent
## FALSE 2325 0.35664979 0.3696931
## TRUE 3964 0.60806872 0.6303069
## NA 230 0.03528148 NA
students %>%
count(pass)
## # A tibble: 2 × 2
## pass n
## <fct> <int>
## 1 0 20232
## 2 1 12361
students %>%
mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>%
mutate(correct = if_else(prediction == 1 & pass == 1 |
prediction == 0 & pass == 0, 1, 0)) %>%
tabyl(correct)
## correct n percent
## 0 16281 0.4995244
## 1 16312 0.5004756
Previous results: 0.6308522
New results: 0.6303069
How does the accuracy of this new model compare? Add a few reflections below:
Part A: Please refer back to Breiman’s (2001) article for these three questions.
1.Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?
The emphasis in the data modeling culture is on constructing models that reflect the underlying processes that generate the data. This method often relies on parametric statistical models and data distribution assumptions.The fundamental focus in the algorithmic modeling culture is on developing predictive models that can generate good forecasts without necessarily disclosing the underlying data generating process.
Increased Algorithmic Modeling Relevance: With the proliferation of big data, there is a growing need for predictive models that can handle huge and complicated information effectively. Complexity Handling: Deep learning and ensemble approaches, for example, have proved their capacity to capture detailed patterns in data even when the underlying data production process is unknown. Practical Applications: Predictive accuracy is critical in many real-world applications, such as recommendation systems, fraud detection, and natural language processing.
3.Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?
The position on this subject is frequently determined by the unique circumstances and goals: Breiman and I agree: Predictive accuracy is the fundamental goal in many practical scenarios, particularly those involving complicated data and high-dimensional landscapes. If a model can make accurate predictions, understanding the underlying mechanisms may not be important as long as it serves its intended function well. Breiman and I disagree: Understanding why a method works is still important in other settings, particularly in scientific research and fields where interpretability is critical (e.g., healthcare). Understanding the causal links and mechanisms might help you make better decisions and get new insights.
Part B:
The machine learning model created for the badge activity looks to perform moderately, and its effectiveness is dependent on the environment in which it is used. It may be effective in some cases, but there is room for development through various modeling methodologies and modifications.
2.How might the model be improved? Share any ideas you have at this time below:
To improve the model’s accuracy, we may investigate adding more variables and testing with different combinations to find the set of variables that produces the best results. Furthermore, we can investigate feature engineering strategies to generate new variables from existing ones, which may provide useful insights and improve the model’s prediction performance.
Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.
Provide an APA citation for your selected study.
What research questions were the authors of this study trying to address and why did they consider these questions important?
A.Can a single modeling technique be designed to reliably predict student outcomes across multiple institutions?
B.Can that modeling technique also reliably predict student outcomes across a range of courses in the curriculum of a chosen discipline?
3.What were the results of these analyses?
Given the differences between institutions and courses, the models result in slightly different prediction accuracies. The results are highly encouraging as the models can predict student outcomes for each course reasonably well.
Complete the following steps to knit and publish your work:
First, change the name of the author: in the YAML
header at the very top of this document to your name. The YAML
header controls the style and feel for knitted document but doesn’t
actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on Rpubs by clicking the “Publish” button located in the Viewer Pane after you knit your document.
Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.
Once your instructor has checked your link, you will be provided a physical version of the badge below!