Machine Learning - Lab 2 Badge

In this lab you will respond to a set of prompts for two parts.

Part I: Data Product

For the data product, you will interpret a different type of model – a model in a regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this lab is different: their interpretation. The confusion matrix we created to parse the predictive strength of our classification model does not pertain to regression machine learning models. Different metrics are used. For this lab, you will specify and interpret a regression machine learning model.

The requirements are as follows:

Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.

Please use the code chunk below for your code:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.1

## Warning: package 'ggplot2' was built under R version 4.3.1

## Warning: package 'purrr' was built under R version 4.3.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.1

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## Warning: package 'dials' was built under R version 4.3.1

## Warning: package 'modeldata' was built under R version 4.3.1

## Warning: package 'parsnip' was built under R version 4.3.1

## Warning: package 'recipes' was built under R version 4.3.1

## Warning: package 'rsample' was built under R version 4.3.1

## Warning: package 'tune' was built under R version 4.3.1

## Warning: package 'workflows' was built under R version 4.3.1

## Warning: package 'workflowsets' was built under R version 4.3.1

## Warning: package 'yardstick' was built under R version 4.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(janitor)

## Warning: package 'janitor' was built under R version 4.3.1

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

We’ll load the students file together; you’ll write code to read the

assessments file, which is named “oulad-assessments.csv”. Please assign the name assessments to the loaded assessments file.

library(readr)
assessments <- read_csv("data/oulad-assessments.csv")

## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

lets load students table

library(readr)
students <- read_csv("data/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

lets find common columns from the both tables

# Assuming you have two data frames named students and assessments
common_columns <- intersect(names(students), names(assessments))
common_columns

## [1] "code_module"       "code_presentation" "id_student"

assessments %>% 
    distinct(code_module)

## # A tibble: 7 × 1
##   code_module
##   <chr>      
## 1 AAA        
## 2 BBB        
## 3 CCC        
## 4 DDD        
## 5 EEE        
## 6 FFF        
## 7 GGG

assessments %>% 
    distinct(code_presentation)

## # A tibble: 4 × 1
##   code_presentation
##   <chr>            
## 1 2013J            
## 2 2014J            
## 3 2013B            
## 4 2014B

assessments %>% 
    distinct(assessment_type)

## # A tibble: 3 × 1
##   assessment_type
##   <chr>          
## 1 TMA            
## 2 CMA            
## 3 Exam

students %>% 
    count(highest_education)

## # A tibble: 5 × 2
##   highest_education               n
##   <chr>                       <int>
## 1 A Level or Equivalent       14045
## 2 HE Qualification             4730
## 3 Lower Than A Level          13158
## 4 No Formal quals               347
## 5 Post Graduate Qualification   313

assessments %>% 
    count(assessment_type,code_module,code_presentation)

## # A tibble: 41 × 4
##    assessment_type code_module code_presentation     n
##    <chr>           <chr>       <chr>             <int>
##  1 CMA             BBB         2013B              5049
##  2 CMA             BBB         2013J              6416
##  3 CMA             BBB         2014B              4493
##  4 CMA             CCC         2014B              3920
##  5 CMA             CCC         2014J              5846
##  6 CMA             DDD         2013B              5252
##  7 CMA             FFF         2013B              6681
##  8 CMA             FFF         2013J              8847
##  9 CMA             FFF         2014B              5549
## 10 CMA             FFF         2014J              8915
## # ℹ 31 more rows

# Load the dplyr package if you haven't already
library(dplyr)

# Assuming df1 and df2 are your data frames
common_cols <- intersect(names(students), names(assessments))

# Merge students and assessments based on common columns and id_student
merged_df <- inner_join(students, assessments, by = "id_student")

## Warning in inner_join(students, assessments, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 15 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

merged_df

## # A tibble: 207,319 × 24
##    code_module.x code_presentation.x id_student gender region  highest_education
##    <chr>         <chr>                    <dbl> <chr>  <chr>   <chr>            
##  1 AAA           2013J                    11391 M      East A… HE Qualification 
##  2 AAA           2013J                    11391 M      East A… HE Qualification 
##  3 AAA           2013J                    11391 M      East A… HE Qualification 
##  4 AAA           2013J                    11391 M      East A… HE Qualification 
##  5 AAA           2013J                    11391 M      East A… HE Qualification 
##  6 AAA           2013J                    28400 F      Scotla… HE Qualification 
##  7 AAA           2013J                    28400 F      Scotla… HE Qualification 
##  8 AAA           2013J                    28400 F      Scotla… HE Qualification 
##  9 AAA           2013J                    28400 F      Scotla… HE Qualification 
## 10 AAA           2013J                    28400 F      Scotla… HE Qualification 
## # ℹ 207,309 more rows
## # ℹ 18 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …

merged_df <- na.omit(merged_df)

library(dplyr)

# Assuming you have a data frame called 'merged_df'
# Replace 'merged_df' with the actual name of your data frame

merged_df %>%
  group_by(code_module.x, code_presentation.x,disability) %>%
  summarize(
    mean_score = mean(score, na.rm = TRUE),
    median_score = median(score, na.rm = TRUE),
    sd_score = sd(score, na.rm = TRUE),
    min_score = min(score, na.rm = TRUE),
    max_score = max(score, na.rm = TRUE)
  )

## `summarise()` has grouped output by 'code_module.x', 'code_presentation.x'. You
## can override using the `.groups` argument.

## # A tibble: 44 × 8
## # Groups:   code_module.x, code_presentation.x [22]
##    code_module.x code_presentation.x disability mean_score median_score sd_score
##    <chr>         <chr>               <chr>           <dbl>        <dbl>    <dbl>
##  1 AAA           2013J               N                68.6         68       13.6
##  2 AAA           2013J               Y                60.1         56.5     19.1
##  3 AAA           2014J               N                64.7         67       14.9
##  4 AAA           2014J               Y                80           80       NA  
##  5 BBB           2013B               N                73.3         74       20.8
##  6 BBB           2013B               Y                72.4         72.5     21.8
##  7 BBB           2013J               N                75.3         78       19.9
##  8 BBB           2013J               Y                68.6         69       23.7
##  9 BBB           2014B               N                74.4         77       21.7
## 10 BBB           2014B               Y                65.0         63       22.0
## # ℹ 34 more rows
## # ℹ 2 more variables: min_score <dbl>, max_score <dbl>

merged_df

## # A tibble: 24,679 × 24
##    code_module.x code_presentation.x id_student gender region  highest_education
##    <chr>         <chr>                    <dbl> <chr>  <chr>   <chr>            
##  1 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  2 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  3 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  4 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  5 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  6 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  7 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  8 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  9 AAA           2013J                    94961 M      South … Lower Than A Lev…
## 10 AAA           2013J                    94961 M      South … Lower Than A Lev…
## # ℹ 24,669 more rows
## # ℹ 18 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …

# Create a new binary numeric column based on the "disability" factor variable
# Create a new binary numeric column based on the "disability" factor variable
merged_df$disability_binary <- ifelse(merged_df$disability == "yes", 1, 0)

# View the transformed data frame
head(merged_df)

## # A tibble: 6 × 25
##   code_module.x code_presentation.x id_student gender region   highest_education
##   <chr>         <chr>                    <dbl> <chr>  <chr>    <chr>            
## 1 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 2 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 3 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 4 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 5 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## 6 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## # ℹ 19 more variables: imd_band <chr>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## #   disability_binary <dbl>

merged_df <- merged_df %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% 
    mutate(imd_band = as.integer(imd_band))

# View the transformed data frame
head(merged_df)

## # A tibble: 6 × 25
##   code_module.x code_presentation.x id_student gender region   highest_education
##   <chr>         <chr>                    <dbl> <chr>  <chr>    <chr>            
## 1 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 2 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 3 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 4 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 5 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## 6 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## # ℹ 19 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## #   disability_binary <dbl>

merged_df[is.na(merged_df)] <- mean(merged_df, na.rm = TRUE)

## Warning in mean.default(merged_df, na.rm = TRUE): argument is not numeric or
## logical: returning NA

merged_df$gender_binary <- ifelse(merged_df$gender == "M", 1, 0)

# View the transformed data frame
head(merged_df)

## # A tibble: 6 × 26
##   code_module.x code_presentation.x id_student gender region   highest_education
##   <chr>         <chr>                    <dbl> <chr>  <chr>    <chr>            
## 1 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 2 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 3 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 4 AAA           2013J                    65002 F      East An… A Level or Equiv…
## 5 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## 6 AAA           2013J                    94961 M      South R… Lower Than A Lev…
## # ℹ 20 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, weight <dbl>,
## #   disability_binary <dbl>, gender_binary <dbl>

# Load the caTools package for data splitting (if not already loaded)
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools")
}

# Load the caTools package if you haven't already
library(caTools)

## Warning: package 'caTools' was built under R version 4.3.1

# Set a seed for reproducibility
set.seed(20230712)

# Split the data into training (70%) and testing (30%) sets
split <- sample.split(merged_df$score, SplitRatio = 0.5)

# Create the training and testing datasets
training_data <- merged_df[split, ]
testing_data <- merged_df[!split, ]

training_data

## # A tibble: 12,345 × 26
##    code_module.x code_presentation.x id_student gender region  highest_education
##    <chr>         <chr>                    <dbl> <chr>  <chr>   <chr>            
##  1 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  2 AAA           2013J                    65002 F      East A… A Level or Equiv…
##  3 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  4 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  5 AAA           2013J                    94961 M      South … Lower Than A Lev…
##  6 AAA           2013J                   106247 M      South … HE Qualification 
##  7 AAA           2013J                   106247 M      South … HE Qualification 
##  8 AAA           2013J                   129955 M      West M… A Level or Equiv…
##  9 AAA           2013J                   129955 M      West M… A Level or Equiv…
## 10 AAA           2013J                   129955 M      West M… A Level or Equiv…
## # ℹ 12,335 more rows
## # ℹ 20 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, id_assessment <dbl>,
## #   date_submitted <dbl>, is_banked <dbl>, score <dbl>, code_module.y <chr>,
## #   code_presentation.y <chr>, assessment_type <chr>, date <dbl>, …

lm_model <- lm(score ~ imd_band + disability_binary + factor(highest_education) + studied_credits + gender_binary + num_of_prev_attempts +factor(age_band)+ module_presentation_length, data = training_data)

predictions <- predict(lm_model, testing_data)

mae <- mean(abs(predictions - testing_data$score), na.rm = TRUE)
mse <- mean((predictions - testing_data$score)^2, na.rm = TRUE)
rmse <- sqrt(mse)
rmse

## [1] 21.47275

mae

## [1] 16.89844

mse

## [1] 461.079

rmse

## [1] 21.47275

```

Please add your interpretations here:

MAE:16.95091
MSE:464.4711
RMSE:21.55159

Part II: Reflect and Plan

What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?

Classification Outcome: Identify if a person has “Diabetes” or is “Non-Diabetic.”

Features/Input Variables: Patient’s health data, such as glucose levels, BMI, family history, age, and lifestyle factors.

Algorithm: Utilize classification algorithms (e.g., Random Forest, Support Vector Machine) to classify individuals.

Evaluation: Assess model performance using metrics like accuracy, sensitivity, specificity, and F1-score to evaluate its effectiveness in diabetic prediction.

What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?

Ans: An example of an outcome related to research interests that could be modeled using a regression machine learning model is “Predicting Housing Prices.”

Regression Outcome: Estimate the sale price of residential properties based on various features.

Features/Input Variables: Property size, number of bedrooms, location, year built, proximity to amenities, and economic indicators.

Algorithm: Employ regression algorithms (e.g., Linear Regression, Random Forest Regression) to predict housing prices.

Evaluation: Evaluate the model’s performance using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared to measure how well it predicts actual housing prices.

Look back to the study you identified for the first machine learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.

Ans:the 1st case study i did was wheather student will pass or fail based on lot of attributes. I created factor for pass column or attribute to make them as binary 0 or 1.

With regard for interpretability and appropriate usage, the machine learning model used in the badge activity showed good prediction ability and is beneficial for research assessment. The model may be improved by experimenting with other algorithms and hyperparameters for better performance or by feature engineering to capture more pertinent predictors. The misconceptions 12 African high school teachers held about teaching machine learning were examined in this article. These assumptions were broken down into five different categories and their correlations were looked at in the study.It is crucial for enhancing curriculum creation, student learning outcomes, and machine learning’s general efficacy when it comes to integrating into the educational system. It may result in more interesting and pertinent educational opportunities that better equip students for the requirements of the

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Have fun!