The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts for two parts:
In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In this part of the badge activity, please add another variable – a variable for the number of days before the start of the module students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at al, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.
In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'readr' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Install the tidymodels and janitor packages if they are not already installed
if (!requireNamespace("tidymodels", quietly = TRUE)) {
install.packages("tidymodels")
}
if (!requireNamespace("janitor", quietly = TRUE)) {
install.packages("janitor")
}
# Load the tidymodels and janitor packages
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.1
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## Warning: package 'broom' was built under R version 4.3.1
## Warning: package 'dials' was built under R version 4.3.1
## Warning: package 'modeldata' was built under R version 4.3.1
## Warning: package 'parsnip' was built under R version 4.3.1
## Warning: package 'recipes' was built under R version 4.3.1
## Warning: package 'rsample' was built under R version 4.3.1
## Warning: package 'tune' was built under R version 4.3.1
## Warning: package 'workflows' was built under R version 4.3.1
## Warning: package 'workflowsets' was built under R version 4.3.1
## Warning: package 'yardstick' was built under R version 4.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.1
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
As a tip, remember to use the library() function to load
these packages. After you’ve done that, click the green arrow to run the
code chunk. If you see a bunch of messages (not anything labeled as an
error), you are good to go! These messages mean the packages loaded
correctly.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al, 2018). In Part 2, we focus on the the following wrangling processes to:
Importing and Inspecting Data. In this section, we will “read” in our CSV data file and take a quick look at what our file contains.
Mutate Variables. We use the
mutate() function to create a dichotomous variable for
whether or not the student withdrew from the course.
For learning labs 1-3, we’ll be using a widely-used data set in the learning analytics field: the Open University Learning Analytics Dataset (OULAD). The OULAD was created by learning analytics researchers at the United Kingdom-based Open University. It includes data from post-secondary learners participation in one of several Massive Open Online Courses (called modules in the OULAD).
Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open university learning analytics dataset. Scientific Data, 4(1), 1-8. https://www.nature.com/articles/sdata2017171
Abstract
Learning Analytics focuses on the collection and analysis of learners’ data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behavior, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.
You don’t need to read the entire article yet, but please open this article, scan the sections, and write down two things you notice or wonder about the dataset.
The dataset allows researchers to investigate a wide range of educational analytics-related research issues, such as forecasting student success, understanding factors influencing student performance, and enhancing course design.
The dataset has been used in hackathons and research initiatives, demonstrating its usefulness in educational data analytics.
The data can be downloaded at the above link; however, for our
purposes, they are already downloaded to the data
sub-folder.
We’ll use the read_csv() function to load the files –
two in total, with data on students and assessments. Note: we have done
some minimal processing of these files to make getting us started
easier. If you’re interested in what we’ve done, check out the
oulad.R file in the lab-1 folder.
For now, please read in the oulad-students.csv file. Use
the read_csv() function to do this, paying attention to
where those files are located relative to this case study file.
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(students)
## Rows: 32,593
## Columns: 15
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
glimpse(students)
## Rows: 32,593
## Columns: 15
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass" and a dummy code of 1 if value of final_result equals "pass" and 0 if not
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(disability = as.factor(disability))
view(students)
students <- students %>%
mutate(disability = as.factor(disability))
students %>%
count(id_student) # this many students
## # A tibble: 28,785 × 2
## id_student n
## <dbl> <int>
## 1 3733 1
## 2 6516 1
## 3 8462 2
## 4 11391 1
## 5 23629 1
## 6 23632 1
## 7 23698 1
## 8 23798 1
## 9 24186 1
## 10 24213 2
## # ℹ 28,775 more rows
students %>%
count(id_student) # this many students
## # A tibble: 28,785 × 2
## id_student n
## <dbl> <int>
## 1 3733 1
## 2 6516 1
## 3 8462 2
## 4 11391 1
## 5 23629 1
## 6 23632 1
## 7 23698 1
## 8 23798 1
## 9 24186 1
## 10 24213 2
## # ℹ 28,775 more rows
students %>%
count(code_module, code_presentation) # this many offerings
## # A tibble: 22 × 3
## code_module code_presentation n
## <chr> <chr> <int>
## 1 AAA 2013J 383
## 2 AAA 2014J 365
## 3 BBB 2013B 1767
## 4 BBB 2013J 2237
## 5 BBB 2014B 1613
## 6 BBB 2014J 2292
## 7 CCC 2014B 1936
## 8 CCC 2014J 2498
## 9 DDD 2013B 1303
## 10 DDD 2013J 1938
## # ℹ 12 more rows
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students
## # A tibble: 32,593 × 16
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>
set.seed(20230712)
train_test_split <- initial_split(students, prop = .80)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
data_test
## # A tibble: 6,519 × 16
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 28400 F Scotland HE Qualification
## 2 AAA 2013J 31604 F South East… A Level or Equiv…
## 3 AAA 2013J 45462 M Scotland HE Qualification
## 4 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## 5 AAA 2013J 65002 F East Angli… A Level or Equiv…
## 6 AAA 2013J 71361 M Ireland HE Qualification
## 7 AAA 2013J 77367 M East Midla… A Level or Equiv…
## 8 AAA 2013J 98094 M Wales Lower Than A Lev…
## 9 AAA 2013J 111717 F East Angli… HE Qualification
## 10 AAA 2013J 114017 F North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>
data_train
## # A tibble: 26,074 × 16
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 FFF 2014B 595186 M South Regi… Lower Than A Lev…
## 2 BBB 2014J 504066 F East Midla… Lower Than A Lev…
## 3 BBB 2013J 585790 F South East… HE Qualification
## 4 CCC 2014J 278413 M London Reg… HE Qualification
## 5 GGG 2014B 634933 F South Regi… Lower Than A Lev…
## 6 CCC 2014J 608577 M North Regi… HE Qualification
## 7 BBB 2014B 612120 F East Midla… Lower Than A Lev…
## 8 FFF 2013J 530852 M Wales Lower Than A Lev…
## 9 CCC 2014J 2555596 M South Regi… A Level or Equiv…
## 10 DDD 2013B 556575 M North Regi… HE Qualification
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>
my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)
my_rec
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 3
my_mod <-
logistic_reg()
my_mod <-
logistic_reg() %>%
set_engine("glm") %>% # generalized linear model
set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
my_mod
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
fitted_model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) disabilityY imd_band date_registration
## -0.667029 -0.280013 0.059134 0.001643
##
## Degrees of Freedom: 22371 Total (i.e. Null); 22368 Residual
## (3702 observations deleted due to missingness)
## Null Deviance: 29800
## Residual Deviance: 29580 AIC: 29590
final_fit <- last_fit(my_mod, my_rec, split = train_test_split)
final_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble> <workflow>
final_fit %>%
collect_predictions()
## # A tibble: 6,519 × 7
## id .pred_0 .pred_1 .row .pred_class pass .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.640 0.360 2 0 1 Preprocessor1_Model1
## 2 train/test split 0.598 0.402 4 0 1 Preprocessor1_Model1
## 3 train/test split 0.632 0.368 7 0 1 Preprocessor1_Model1
## 4 train/test split NA NA 10 <NA> 1 Preprocessor1_Model1
## 5 train/test split 0.620 0.380 16 0 0 Preprocessor1_Model1
## 6 train/test split NA NA 18 <NA> 1 Preprocessor1_Model1
## 7 train/test split 0.617 0.383 21 0 1 Preprocessor1_Model1
## 8 train/test split 0.591 0.409 24 0 1 Preprocessor1_Model1
## 9 train/test split 0.537 0.463 33 0 1 Preprocessor1_Model1
## 10 train/test split NA NA 35 <NA> 1 Preprocessor1_Model1
## # ℹ 6,509 more rows
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, pass) %>% # just to make the output easier to view
mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct
## # A tibble: 6,519 × 3
## .pred_class pass correct
## <fct> <fct> <lgl>
## 1 0 1 FALSE
## 2 0 1 FALSE
## 3 0 1 FALSE
## 4 <NA> 1 NA
## 5 0 0 TRUE
## 6 <NA> 1 NA
## 7 0 1 FALSE
## 8 0 1 FALSE
## 9 0 1 FALSE
## 10 <NA> 1 NA
## # ℹ 6,509 more rows
final_fit %>%
collect_predictions() %>% # see test set predictions
select(.pred_class, pass) %>% # just to make the output easier to view
mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
tabyl(correct)
## correct n percent valid_percent
## FALSE 2071 0.3176868 0.3728844
## TRUE 3483 0.5342844 0.6271156
## NA 965 0.1480288 NA
students %>%
count(pass)
## # A tibble: 2 × 2
## pass n
## <fct> <int>
## 1 0 20232
## 2 1 12361
students %>%
mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>%
mutate(correct = if_else(prediction == 1 & pass == 1 |
prediction == 0 & pass == 0, 1, 0)) %>%
tabyl(correct)
## correct n percent
## 0 16328 0.5009665
## 1 16265 0.4990335
students %>%
count(pass)
## # A tibble: 2 × 2
## pass n
## <fct> <int>
## 1 0 20232
## 2 1 12361
Previous results: The previous model predicted 3488 data points correctly and 2071 data points incorrectly. comprising approximately 31.77% of the total cases.
The model’s predictions were right in 3488 cases, accounting for approximately 53.51% of all cases.
New results: When only valid examples are included (no NA values), the model’s accuracy rate is roughly 62.75%. Overall, tt had an accuracy percentage of about 53.43%, with 31.77% of the data predicted erroneously.
How does the accuracy of this new model compare? Add a few reflections below:
Part A: Please refer back to Breiman’s (2001) article for these three questions.
Part B:
Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.
Provide an APA citation for your selected study.
What research questions were the authors of this study trying to address and why did they consider these questions important?
Are you currently pursuing graduate studies, or is there a specific reason for asking about academic status?
What were the results of these analyses?
Predicting Student Academic Performance Using Machine Learning Algorithms,” the suggested model predicted students’ initial final grades with an accuracy range of 70-75%. This study adds to the early identification of children who are at high risk of academic difficulty and investigates the efficacy of several machine learning algorithms for this goal.
Complete the following steps to knit and publish your work:
First, change the name of the author: in the YAML
header at the very top of this document to your name. The YAML
header controls the style and feel for knitted document but doesn’t
actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on Rpubs by clicking the “Publish” button located in the Viewer Pane after you knit your document.
Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.
Once your instructor has checked your link, you will be provided a physical version of the badge below!