
In this lab, you will respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and the same training and testing data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from glm to lm.
Interpret your regression machine learning model in terms of three regression metrics: MAE, MSE, and RMSE. Read about these metrics here. As with the classification metrics, focus on the substantive meaning of these statistics; their standard definitions are sketched just below this list.
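For quick reference, the standard definitions of these metrics, where $y_i$ is the observed outcome, $\hat{y}_i$ the predicted value, and $n$ the number of test-set observations, are:

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|, \qquad
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\text{RMSE} = \sqrt{\text{MSE}}
$$

MAE and RMSE are expressed in the same units as the outcome, while MSE is in squared units; because of the squaring, MSE and RMSE weight large errors more heavily than MAE.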
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
students <- read_csv("oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read.csv("oulad-assessments.csv")
assessments %>%
count(id_assessment)
## id_assessment n
## 1 1752 359
## 2 1753 342
## 3 1754 331
## 4 1755 303
## 5 1756 298
## 6 1758 337
## 7 1759 317
## 8 1760 304
## 9 1761 280
## 10 1762 278
## 11 14984 1352
## 12 14985 1217
## 13 14986 1065
## 14 14987 931
## 15 14988 876
## 16 14989 766
## 17 14991 1189
## 18 14992 1070
## 19 14993 985
## 20 14994 935
## 21 14995 870
## 22 14996 1695
## 23 14997 1524
## 24 14998 1316
## 25 14999 1227
## 26 15000 1163
## 27 15001 1034
## 28 15003 1490
## 29 15004 1350
## 30 15005 1257
## 31 15006 1183
## 32 15007 1136
## 33 15008 1189
## 34 15009 1086
## 35 15010 928
## 36 15011 822
## 37 15012 791
## 38 15013 684
## 39 15015 1054
## 40 15016 973
## 41 15017 878
## 42 15018 819
## 43 15019 769
## 44 15020 1776
## 45 15021 1586
## 46 15022 1465
## 47 15023 1328
## 48 15024 1253
## 49 24282 937
## 50 24283 689
## 51 24284 613
## 52 24285 583
## 53 24286 1346
## 54 24287 1029
## 55 24288 839
## 56 24289 706
## 57 24290 747
## 58 24291 1428
## 59 24292 1128
## 60 24293 965
## 61 24294 916
## 62 24295 1917
## 63 24296 1534
## 64 24297 1301
## 65 24298 1094
## 66 24299 1168
## 67 25334 988
## 68 25335 883
## 69 25336 802
## 70 25337 725
## 71 25338 618
## 72 25339 503
## 73 25340 602
## 74 25341 1010
## 75 25342 906
## 76 25343 831
## 77 25344 753
## 78 25345 653
## 79 25346 581
## 80 25347 518
## 81 25348 1491
## 82 25349 1314
## 83 25350 1231
## 84 25351 1116
## 85 25352 971
## 86 25353 845
## 87 25354 968
## 88 25355 930
## 89 25356 782
## 90 25357 713
## 91 25358 617
## 92 25359 526
## 93 25360 450
## 94 25361 524
## 95 25362 1415
## 96 25363 1325
## 97 25364 1231
## 98 25365 1134
## 99 25366 1043
## 100 25367 915
## 101 25368 950
## 102 30709 825
## 103 30710 748
## 104 30711 682
## 105 30712 629
## 106 30714 533
## 107 30715 462
## 108 30716 413
## 109 30717 372
## 110 30719 930
## 111 30720 826
## 112 30721 770
## 113 30722 703
## 114 34860 1361
## 115 34861 1231
## 116 34862 1076
## 117 34863 975
## 118 34864 871
## 119 34865 1091
## 120 34866 1016
## 121 34867 952
## 122 34868 922
## 123 34869 899
## 124 34870 912
## 125 34871 889
## 126 34873 1859
## 127 34874 1661
## 128 34875 1402
## 129 34876 1313
## 130 34877 1158
## 131 34878 1470
## 132 34879 1352
## 133 34880 1252
## 134 34881 1224
## 135 34882 1193
## 136 34883 1196
## 137 34884 1160
## 138 34886 1192
## 139 34887 1055
## 140 34888 881
## 141 34889 804
## 142 34890 715
## 143 34891 935
## 144 34892 857
## 145 34893 785
## 146 34894 755
## 147 34895 743
## 148 34896 747
## 149 34897 727
## 150 34899 1826
## 151 34900 1601
## 152 34901 1398
## 153 34902 1307
## 154 34903 1137
## 155 34904 1454
## 156 34905 1363
## 157 34906 1254
## 158 34907 1234
## 159 34908 1213
## 160 34909 1213
## 161 34910 1184
## 162 37415 784
## 163 37416 741
## 164 37417 676
## 165 37418 724
## 166 37419 685
## 167 37420 652
## 168 37421 641
## 169 37422 576
## 170 37423 471
## 171 37425 672
## 172 37426 612
## 173 37427 549
## 174 37428 592
## 175 37429 559
## 176 37430 532
## 177 37431 519
## 178 37432 467
## 179 37433 394
## 180 37435 586
## 181 37436 540
## 182 37437 500
## 183 37438 531
## 184 37439 504
## 185 37440 479
## 186 37441 473
## 187 37442 416
## 188 37443 344
assessments %>%
distinct(id_assessment)
## id_assessment
## 1 1752
## 2 1753
## 3 1754
## 4 1755
## 5 1756
## 6 1758
## 7 1759
## 8 1760
## 9 1761
## 10 1762
## 11 14984
## 12 14985
## 13 14986
## 14 14987
## 15 14988
## 16 14989
## 17 14991
## 18 14992
## 19 14993
## 20 14994
## 21 14995
## 22 14996
## 23 14997
## 24 14998
## 25 14999
## 26 15000
## 27 15001
## 28 15003
## 29 15004
## 30 15005
## 31 15006
## 32 15007
## 33 15008
## 34 15009
## 35 15010
## 36 15011
## 37 15012
## 38 15013
## 39 15015
## 40 15016
## 41 15017
## 42 15018
## 43 15019
## 44 15020
## 45 15021
## 46 15022
## 47 15023
## 48 15024
## 49 24282
## 50 24283
## 51 24284
## 52 24285
## 53 24286
## 54 24287
## 55 24288
## 56 24289
## 57 24290
## 58 24291
## 59 24292
## 60 24293
## 61 24294
## 62 24295
## 63 24296
## 64 24297
## 65 24298
## 66 24299
## 67 25334
## 68 25335
## 69 25336
## 70 25337
## 71 25338
## 72 25339
## 73 25340
## 74 25341
## 75 25342
## 76 25343
## 77 25344
## 78 25345
## 79 25346
## 80 25347
## 81 25348
## 82 25349
## 83 25350
## 84 25351
## 85 25352
## 86 25353
## 87 25354
## 88 25355
## 89 25356
## 90 25357
## 91 25358
## 92 25359
## 93 25360
## 94 25361
## 95 25362
## 96 25363
## 97 25364
## 98 25365
## 99 25366
## 100 25367
## 101 25368
## 102 30709
## 103 30710
## 104 30711
## 105 30712
## 106 30714
## 107 30715
## 108 30716
## 109 30717
## 110 30719
## 111 30720
## 112 30721
## 113 30722
## 114 34860
## 115 34861
## 116 34862
## 117 34863
## 118 34864
## 119 34865
## 120 34866
## 121 34867
## 122 34868
## 123 34869
## 124 34870
## 125 34871
## 126 34873
## 127 34874
## 128 34875
## 129 34876
## 130 34877
## 131 34878
## 132 34879
## 133 34880
## 134 34881
## 135 34882
## 136 34883
## 137 34884
## 138 34886
## 139 34887
## 140 34888
## 141 34889
## 142 34890
## 143 34891
## 144 34892
## 145 34893
## 146 34894
## 147 34895
## 148 34896
## 149 34897
## 150 34899
## 151 34900
## 152 34901
## 153 34902
## 154 34903
## 155 34904
## 156 34905
## 157 34906
## 158 34907
## 159 34908
## 160 34909
## 161 34910
## 162 37415
## 163 37416
## 164 37417
## 165 37418
## 166 37419
## 167 37420
## 168 37421
## 169 37422
## 170 37423
## 171 37425
## 172 37426
## 173 37427
## 174 37428
## 175 37429
## 176 37430
## 177 37431
## 178 37432
## 179 37433
## 180 37435
## 181 37436
## 182 37437
## 183 37438
## 184 37439
## 185 37440
## 186 37441
## 187 37442
## 188 37443
assessments %>%
count(assessment_type, code_module, code_presentation)
## assessment_type code_module code_presentation n
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## 11 CMA GGG 2013J 3749
## 12 CMA GGG 2014B 3063
## 13 CMA GGG 2014J 2747
## 14 Exam CCC 2014B 747
## 15 Exam CCC 2014J 1168
## 16 Exam DDD 2013B 602
## 17 Exam DDD 2013J 968
## 18 Exam DDD 2014B 524
## 19 Exam DDD 2014J 950
## 20 TMA AAA 2013J 1633
## 21 TMA AAA 2014J 1516
## 22 TMA BBB 2013B 6207
## 23 TMA BBB 2013J 7959
## 24 TMA BBB 2014B 5500
## 25 TMA BBB 2014J 7408
## 26 TMA CCC 2014B 2822
## 27 TMA CCC 2014J 4437
## 28 TMA DDD 2013B 4519
## 29 TMA DDD 2013J 6968
## 30 TMA DDD 2014B 4018
## 31 TMA DDD 2014J 7063
## 32 TMA EEE 2013J 2884
## 33 TMA EEE 2014B 1780
## 34 TMA EEE 2014J 3229
## 35 TMA FFF 2013B 5514
## 36 TMA FFF 2013J 7393
## 37 TMA FFF 2014B 4647
## 38 TMA FFF 2014J 7269
## 39 TMA GGG 2013J 2201
## 40 TMA GGG 2014B 1833
## 41 TMA GGG 2014J 1626
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## mean_date median_date sd_date min_date max_date
## 1 130.6056 129 78.02517 12 261
assessments %>%
group_by(code_module, code_presentation, id_student) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## Warning: There were 2 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `min_date = min(date, na.rm = TRUE)`.
## ℹ In group 14279: `code_module = "DDD"`, `code_presentation = "2014J"`,
## `id_student = 607110`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## `summarise()` has grouped output by 'code_module', 'code_presentation'. You can
## override using the `.groups` argument.
## # A tibble: 25,843 × 9
## # Groups: code_module, code_presentation [22]
## code_module code_presentation id_student mean_date median_date sd_date
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AAA 2013J 11391 114. 117 79.9
## 2 AAA 2013J 28400 114. 117 79.9
## 3 AAA 2013J 31604 114. 117 79.9
## 4 AAA 2013J 32885 114. 117 79.9
## 5 AAA 2013J 38053 114. 117 79.9
## 6 AAA 2013J 45462 114. 117 79.9
## 7 AAA 2013J 45642 114. 117 79.9
## 8 AAA 2013J 52130 114. 117 79.9
## 9 AAA 2013J 53025 114. 117 79.9
## 10 AAA 2013J 57506 114. 117 79.9
## # ℹ 25,833 more rows
## # ℹ 3 more variables: min_date <dbl>, max_date <dbl>, first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE), .groups = 'drop')
code_module_dates
## # A tibble: 22 × 3
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
glimpse(assessments_joined)
## Rows: 173,912
## Columns: 11
## $ id_assessment <int> 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1…
## $ id_student <int> 11391, 28400, 31604, 32885, 38053, 45462, 45642, …
## $ date_submitted <int> 18, 22, 17, 26, 19, 20, 18, 19, 9, 18, 19, 18, 17…
## $ is_banked <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ score <int> 78, 70, 72, 69, 79, 70, 72, 72, 71, 68, 73, 67, 7…
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", …
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "201…
## $ assessment_type <chr> "TMA", "TMA", "TMA", "TMA", "TMA", "TMA", "TMA", …
## $ date <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1…
## $ weight <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ quantile_cutoff_date <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5…
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
glimpse(students_and_assessments)
## Rows: 32,593
## Columns: 17
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <int> 10, 3, 4, 6, 6, 9, 4, 10, 8, NA, 8, 3, 7, 6…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pass <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mean_weighted_score <dbl> 780, 700, NA, 720, 690, 790, 700, 720, 720,…
set.seed(127)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .80, strata = "mean_weighted_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
glimpse(data_train)
## Rows: 19,707
## Columns: 17
## $ code_module <chr> "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", "…
## $ code_presentation <chr> "2013J", "2013B", "2013B", "2013B", "2013B"…
## $ id_student <dbl> 2312620, 141823, 521081, 533512, 542562, 55…
## $ gender <chr> "F", "F", "F", "F", "F", "F", "M", "F", "F"…
## $ region <chr> "East Midlands Region", "East Midlands Regi…
## $ highest_education <chr> "HE Qualification", "A Level or Equivalent"…
## $ imd_band <int> 3, 1, NA, 5, 7, 3, 1, 7, 7, 5, 5, 6, 5, 1, …
## $ age_band <chr> "35-55", "0-35", "0-35", "0-35", "0-35", "3…
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 60, 120, 90, 120, 150, 60, 180, 60, 270, 12…
## $ disability <chr> "N", "N", "Y", "N", "Y", "N", "N", "N", "Y"…
## $ final_result <chr> "Withdrawn", "Withdrawn", "Fail", "Withdraw…
## $ module_presentation_length <dbl> 268, 240, 240, 240, 240, 240, 240, 268, 268…
## $ date_registration <dbl> -45, -47, -65, -95, -85, -46, -126, -89, -1…
## $ date_unregistration <dbl> 178, 86, NA, -2, -73, NA, 89, -50, 25, 10, …
## $ pass <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mean_weighted_score <dbl> 110.0, 145.0, 0.0, 112.0, 100.0, 0.0, 0.0, …
my_rec <- recipe(mean_weighted_score ~ disability + studied_credits + highest_education + age_band, data = data_train) %>%
step_dummy(disability) %>%
step_dummy(highest_education)
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # generalized linear model
set_mode("regression") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
# Drop rows with missing values in the outcome (and in imd_band and date_registration) before evaluating
data_train <- data_train %>% drop_na(mean_weighted_score, imd_band, date_registration)
data_test <- data_test %>% drop_na(mean_weighted_score, imd_band, date_registration)
# Make predictions on the test data
predictions <- predict(fitted_model, new_data = data_test)
# Calculate MAE (Mean Absolute Error)
mae_value <- mean(abs(data_test$mean_weighted_score - predictions$`.pred`))
# Calculate MSE (Mean Squared Error)
mse_value <- mean((data_test$mean_weighted_score - predictions$`.pred`)^2)
# Calculate RMSE (Root Mean Squared Error)
rmse_value <- sqrt(mse_value)
cat("MAE:", mae_value, "\n")
## MAE: 329.3553
cat("MSE:", mse_value, "\n")
## MSE: 135977.3
cat("RMSE:", rmse_value, "\n")
## RMSE: 368.751
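As an alternative to computing these metrics by hand, they could also be obtained with yardstick, which loads with tidymodels. This is a minimal sketch assuming the `fitted_model` and `data_test` objects created above; the intermediate object names here are illustrative:
# attach predictions to the test set and score them with yardstick
test_with_preds <- predict(fitted_model, new_data = data_test) %>%
bind_cols(data_test)
reg_metrics <- metric_set(mae, rmse, rsq) # MSE is not listed separately; it can be recovered as RMSE squared
test_with_preds %>%
reg_metrics(truth = mean_weighted_score, estimate = .pred)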
Please add your interpretations here:
MAE: The Mean Absolute Error (MAE) of about 329.36 indicates that, on average, the model’s predictions differ from students’ actual mean weighted scores by roughly 329 points. Because MAE averages the absolute size of the errors without squaring them, it gives an easily interpreted picture of the typical error and is less influenced by a few very large errors than MSE or RMSE.
MSE: The Mean Squared Error (MSE) of about 135,977 is much larger in magnitude because each error is squared before averaging, which places it in squared units of the outcome. Squaring gives greater weight to large errors, so MSE is sensitive to outliers; a high MSE signals that some predictions may be off by a substantial amount.
RMSE: The Root Mean Squared Error (RMSE) is the square root of the MSE, so it is expressed in the same units as the outcome, which makes it easier to interpret than MSE. Here, the RMSE of about 368.75 suggests that predictions are typically off by roughly 369 points, with larger errors counting more heavily than in the MAE.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output (an example header is shown after these steps).
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document. See the screenshot below.
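As an illustration, a typical YAML header looks something like the following; the title and output format shown are placeholders, so keep whatever your document already uses and change only the author: field:
---
title: "Your existing title"
author: "Your Name"
output: html_document
---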
Have fun!
