In this lab, you will respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: one in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this lab is different: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
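For reference, the regression metrics used in this lab are the mean absolute error (MAE), the mean squared error (MSE), and the root mean squared error (RMSE). Writing $y_i$ for an observed outcome value, $\hat{y}_i$ for the corresponding prediction, and $n$ for the number of observations in the testing set:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert y_i - \hat{y}_i \right\rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

Lower values indicate more accurate predictions. MAE and RMSE are on the same scale as the outcome, while MSE is on a squared scale.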
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and training and testing data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from glm to lm (a minimal sketch of this change appears after these requirements).
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
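For orientation only, here is a minimal, hedged sketch of what the mode and engine change might look like in tidymodels; the object names my_recipe and train_data are placeholders for whatever you named your recipe and training data in the case study.

```r
library(tidymodels)

# Placeholder names: my_recipe and train_data come from your own case study code.
lm_spec <- linear_reg() %>%      # regression model specification
  set_engine("lm") %>%           # engine changes from "glm" to "lm"
  set_mode("regression")         # mode changes from "classification" to "regression"

reg_workflow <- workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(lm_spec)

reg_fit <- fit(reg_workflow, data = train_data)
```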
Please use the code chunk below for your code:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.1
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## Warning: package 'dials' was built under R version 4.3.1
## Warning: package 'modeldata' was built under R version 4.3.1
## Warning: package 'parsnip' was built under R version 4.3.1
## Warning: package 'recipes' was built under R version 4.3.1
## Warning: package 'rsample' was built under R version 4.3.1
## Warning: package 'tune' was built under R version 4.3.1
## Warning: package 'workflows' was built under R version 4.3.1
## Warning: package 'workflowsets' was built under R version 4.3.1
## Warning: package 'yardstick' was built under R version 4.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.1
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Load the assessments file, which is named "oulad-assessments.csv". Please assign the name assessments to the loaded assessments file.
library(readr)
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Assuming you have two data frames named students and assessments
common_columns <- intersect(names(students), names(assessments))
common_columns
## [1] "code_module" "code_presentation" "id_student"
assessments %>%
distinct(code_module)
## # A tibble: 7 × 1
## code_module
## <chr>
## 1 AAA
## 2 BBB
## 3 CCC
## 4 DDD
## 5 EEE
## 6 FFF
## 7 GGG
assessments %>%
distinct(code_presentation)
## # A tibble: 4 × 1
## code_presentation
## <chr>
## 1 2013J
## 2 2014J
## 3 2013B
## 4 2014B
assessments %>%
distinct(assessment_type)
## # A tibble: 3 × 1
## assessment_type
## <chr>
## 1 TMA
## 2 CMA
## 3 Exam
students %>%
count(highest_education)
## # A tibble: 5 × 2
## highest_education n
## <chr> <int>
## 1 A Level or Equivalent 14045
## 2 HE Qualification 4730
## 3 Lower Than A Level 13158
## 4 No Formal quals 347
## 5 Post Graduate Qualification 313
assessments %>%
  count(assessment_type, code_module, code_presentation)
## # A tibble: 41 × 4
## assessment_type code_module code_presentation n
## <chr> <chr> <chr> <int>
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## # ℹ 31 more rows
# Load the dplyr package if you haven't already
library(dplyr)
# students and assessments are the two data frames being joined
common_cols <- intersect(names(students), names(assessments))
# Join students and assessments by id_student only; the other shared columns
# (code_module, code_presentation) are kept with .x/.y suffixes
merged_df <- inner_join(students, assessments, by = "id_student")
## Warning in inner_join(students, assessments, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 15 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
merged_df
## # A tibble: 207,319 × 24
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East A… HE Qualification
## 2 AAA 2013J 11391 M East A… HE Qualification
## 3 AAA 2013J 11391 M East A… HE Qualification
## 4 AAA 2013J 11391 M East A… HE Qualification
## 5 AAA 2013J 11391 M East A… HE Qualification
## 6 AAA 2013J 28400 F Scotla… HE Qualification
## 7 AAA 2013J 28400 F Scotla… HE Qualification
## 8 AAA 2013J 28400 F Scotla… HE Qualification
## 9 AAA 2013J 28400 F Scotla… HE Qualification
## 10 AAA 2013J 28400 F Scotla… HE Qualification
## # ℹ 207,309 more rows
## # ℹ 18 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …
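The many-to-many warning above occurs because students and assessments also share code_module and code_presentation, but the join only uses id_student. As a hedged alternative (not the approach used for the results below), joining on all three shared keys avoids the duplicated .x/.y columns and the warning:

```r
# Sketch only: join on all shared keys instead of id_student alone.
# This would change the downstream column names (no .x/.y suffixes).
merged_alt <- students %>%
  inner_join(assessments,
             by = c("code_module", "code_presentation", "id_student"))
```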
merged_df <- na.omit(merged_df)
# Summarize scores by module, presentation, and disability status
merged_df %>%
  group_by(code_module.x, code_presentation.x, disability) %>%
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    median_score = median(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'code_module.x', 'code_presentation.x'. You
## can override using the `.groups` argument.
## # A tibble: 44 × 8
## # Groups: code_module.x, code_presentation.x [22]
## code_module.x code_presentation.x disability mean_score median_score sd_score
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 AAA 2013J N 68.6 68 13.6
## 2 AAA 2013J Y 60.1 56.5 19.1
## 3 AAA 2014J N 64.7 67 14.9
## 4 AAA 2014J Y 80 80 NA
## 5 BBB 2013B N 73.3 74 20.8
## 6 BBB 2013B Y 72.4 72.5 21.8
## 7 BBB 2013J N 75.3 78 19.9
## 8 BBB 2013J Y 68.6 69 23.7
## 9 BBB 2014B N 74.4 77 21.7
## 10 BBB 2014B Y 65.0 63 22.0
## # ℹ 34 more rows
## # ℹ 2 more variables: min_score <dbl>, max_score <dbl>
merged_df
## # A tibble: 24,679 × 24
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 65002 F East A… A Level or Equiv…
## 2 AAA 2013J 65002 F East A… A Level or Equiv…
## 3 AAA 2013J 65002 F East A… A Level or Equiv…
## 4 AAA 2013J 65002 F East A… A Level or Equiv…
## 5 AAA 2013J 94961 M South … Lower Than A Lev…
## 6 AAA 2013J 94961 M South … Lower Than A Lev…
## 7 AAA 2013J 94961 M South … Lower Than A Lev…
## 8 AAA 2013J 94961 M South … Lower Than A Lev…
## 9 AAA 2013J 94961 M South … Lower Than A Lev…
## 10 AAA 2013J 94961 M South … Lower Than A Lev…
## # ℹ 24,669 more rows
## # ℹ 18 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …
# Create a new binary numeric column from the "disability" variable
# (coded "Y"/"N" in the data, so compare against "Y")
merged_df$disability_binary <- ifelse(merged_df$disability == "Y", 1, 0)
# View the transformed data frame
head(merged_df)
## # A tibble: 6 × 25
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 65002 F East An… A Level or Equiv…
## 2 AAA 2013J 65002 F East An… A Level or Equiv…
## 3 AAA 2013J 65002 F East An… A Level or Equiv…
## 4 AAA 2013J 65002 F East An… A Level or Equiv…
## 5 AAA 2013J 94961 M South R… Lower Than A Lev…
## 6 AAA 2013J 94961 M South R… Lower Than A Lev…
## # ℹ 19 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## # disability_binary <dbl>
merged_df <- merged_df %>%
  mutate(imd_band = factor(imd_band,
                           levels = c("0-10%", "10-20%", "20-30%", "30-40%",
                                      "40-50%", "50-60%", "60-70%", "70-80%",
                                      "80-90%", "90-100%"))) %>%
  mutate(imd_band = as.integer(imd_band))
# View the transformed data frame
head(merged_df)
## # A tibble: 6 × 25
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 65002 F East An… A Level or Equiv…
## 2 AAA 2013J 65002 F East An… A Level or Equiv…
## 3 AAA 2013J 65002 F East An… A Level or Equiv…
## 4 AAA 2013J 65002 F East An… A Level or Equiv…
## 5 AAA 2013J 94961 M South R… Lower Than A Lev…
## 6 AAA 2013J 94961 M South R… Lower Than A Lev…
## # ℹ 19 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## # disability_binary <dbl>
# Impute any remaining missing values in numeric columns with the column mean
merged_df <- merged_df %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
merged_df$gender_binary <- ifelse(merged_df$gender == "M", 1, 0)
# View the transformed data frame
head(merged_df)
## # A tibble: 6 × 26
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 65002 F East An… A Level or Equiv…
## 2 AAA 2013J 65002 F East An… A Level or Equiv…
## 3 AAA 2013J 65002 F East An… A Level or Equiv…
## 4 AAA 2013J 65002 F East An… A Level or Equiv…
## 5 AAA 2013J 94961 M South R… Lower Than A Lev…
## 6 AAA 2013J 94961 M South R… Lower Than A Lev…
## # ℹ 20 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## # disability_binary <dbl>, gender_binary <dbl>
# Load the caTools package for data splitting (if not already loaded)
if (!requireNamespace("caTools", quietly = TRUE)) {
install.packages("caTools")
}
# Load the caTools package if you haven't already
library(caTools)
## Warning: package 'caTools' was built under R version 4.3.1
# Set a seed for reproducibility
set.seed(20230712)
# Split the data into training (50%) and testing (50%) sets
split <- sample.split(merged_df$score, SplitRatio = 0.5)
# Create the training and testing datasets
training_data <- merged_df[split, ]
testing_data <- merged_df[!split, ]
training_data
## # A tibble: 12,345 × 26
## code_module.x code_presentation.x id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 65002 F East A… A Level or Equiv…
## 2 AAA 2013J 65002 F East A… A Level or Equiv…
## 3 AAA 2013J 94961 M South … Lower Than A Lev…
## 4 AAA 2013J 94961 M South … Lower Than A Lev…
## 5 AAA 2013J 94961 M South … Lower Than A Lev…
## 6 AAA 2013J 106247 M South … HE Qualification
## 7 AAA 2013J 106247 M South … HE Qualification
## 8 AAA 2013J 129955 M West M… A Level or Equiv…
## 9 AAA 2013J 129955 M West M… A Level or Equiv…
## 10 AAA 2013J 129955 M West M… A Level or Equiv…
## # ℹ 12,335 more rows
## # ℹ 20 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## # date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## # code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …
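For comparison only, the case study used the tidymodels splitting tools rather than caTools. A hedged sketch of an equivalent 50/50 split with rsample (using different object names so it does not overwrite the split used for the results below) might look like:

```r
# Sketch only: an rsample-based 50/50 split, analogous to the caTools split above.
library(rsample)
set.seed(20230712)
merged_split <- initial_split(merged_df, prop = 0.5)
train_alt <- training(merged_split)
test_alt  <- testing(merged_split)
```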
lm_model <- lm(score ~ imd_band + disability_binary + factor(highest_education) +
                 studied_credits + gender_binary + num_of_prev_attempts +
                 factor(age_band) + module_presentation_length,
               data = training_data)
predictions <- predict(lm_model, testing_data)
mae <- mean(abs(predictions - testing_data$score), na.rm = TRUE)
mse <- mean((predictions - testing_data$score)^2, na.rm = TRUE)
rmse <- sqrt(mse)
mae
## [1] 16.89844
mse
## [1] 461.079
rmse
## [1] 21.47275
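As an aside, the same metrics can be obtained with the yardstick package (part of tidymodels, loaded above). This is a hedged sketch only; it assumes the predictions are first bound to the testing data, and it recovers MSE as RMSE squared since MAE and RMSE are the metrics computed directly here:

```r
# Sketch only: the same metrics via yardstick.
results <- testing_data %>%
  mutate(.pred = predict(lm_model, newdata = testing_data))

# MAE and RMSE on the testing set
metric_set(mae, rmse)(results, truth = score, estimate = .pred)

# MSE is simply RMSE squared
rmse_vec(results$score, results$.pred, na_rm = TRUE)^2
```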
Please add your interpretations here:
MAE ≈ 16.9: on average, the model's predicted scores differ from students' actual scores by about 17 points on the 0-100 score scale.
MSE ≈ 461: the average squared prediction error. Because errors are squared, this metric penalizes large misses more heavily than MAE, but it is on a squared-points scale and so is harder to interpret directly.
RMSE ≈ 21.5: the square root of MSE, which puts the error back on the original score scale; a typical prediction misses the actual score by roughly 21 points. The fact that RMSE is noticeably larger than MAE suggests that some predictions are off by much more than the average, i.e., the errors have a long tail.
Ans: An example of an outcome related to my research interests that could be modeled using a classification machine learning model is predicting whether a patient has diabetes.
Classification Outcome: Identify whether a person has "Diabetes" or is "Non-Diabetic."
Features/Input Variables: Patient’s health data, such as glucose levels, BMI, family history, age, and lifestyle factors.
Algorithm: Utilize classification algorithms (e.g., Random Forest, Support Vector Machine) to classify individuals.
Evaluation: Assess model performance using metrics like accuracy, sensitivity, specificity, and F1-score to evaluate its effectiveness in diabetic prediction.
Ans: An example of an outcome related to research interests that could be modeled using a regression machine learning model is “Predicting Housing Prices.”
Regression Outcome: Estimate the sale price of residential properties based on various features.
Features/Input Variables: Property size, number of bedrooms, location, year built, proximity to amenities, and economic indicators.
Algorithm: Employ regression algorithms (e.g., Linear Regression, Random Forest Regression) to predict housing prices.
Evaluation: Evaluate the model’s performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared to measure how well it predicts actual housing prices.
Ans: The first case study I completed predicted whether a student would pass or fail based on a number of attributes. I converted the pass column into a factor so it could be treated as a binary (0/1) outcome.
With regard to interpretability and appropriate usage, the machine learning model used in the badge activity showed good predictive ability and is useful for research assessment. The model could be improved by experimenting with other algorithms and hyperparameters, or by feature engineering to capture more relevant predictors.

The article examined the misconceptions that 12 African high school teachers held about teaching machine learning. These misconceptions were grouped into five categories, and the study examined the correlations among them. Understanding such misconceptions is crucial for improving curriculum design, student learning outcomes, and the overall effectiveness of integrating machine learning into the educational system, and it may lead to more engaging and relevant learning opportunities that better prepare students for the requirements of the modern workforce.
Complete the following steps to knit and publish your work:
First, change the author: name in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the Knit button in the toolbar above to "knit" your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!