
In this lab, you will respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and the same training and testing data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from glm to lm.
Interpret your regression machine learning model in terms of three regression metrics: MAE, MSE, and RMSE. Read about these metrics here. As with the classification metrics, focus on the substantive meaning of these statistics; their standard definitions are sketched just below this list.
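For quick reference, the standard definitions of these metrics, where $y_i$ is the observed outcome, $\hat{y}_i$ the predicted value, and $n$ the number of test-set observations, are:

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|, \qquad
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\text{RMSE} = \sqrt{\text{MSE}}
$$

MAE and RMSE are expressed in the same units as the outcome, while MSE is in squared units; because of the squaring, MSE and RMSE weight large errors more heavily than MAE.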
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
students <- read_csv("oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read.csv("oulad-assessments.csv")
assessments %>%
count(id_assessment)
## id_assessment n
## 1 1752 359
## 2 1753 342
## 3 1754 331
## 4 1755 303
## 5 1756 298
## 6 1758 337
## 7 1759 317
## 8 1760 304
## 9 1761 280
## 10 1762 278
## 11 14984 1352
## 12 14985 1217
## 13 14986 1065
## 14 14987 931
## 15 14988 876
## 16 14989 766
## 17 14991 1189
## 18 14992 1070
## 19 14993 985
## 20 14994 935
## 21 14995 870
## 22 14996 1695
## 23 14997 1524
## 24 14998 1316
## 25 14999 1227
## 26 15000 1163
## 27 15001 1034
## 28 15003 1490
## 29 15004 1350
## 30 15005 1257
## 31 15006 1183
## 32 15007 1136
## 33 15008 1189
## 34 15009 1086
## 35 15010 928
## 36 15011 822
## 37 15012 791
## 38 15013 684
## 39 15015 1054
## 40 15016 973
## 41 15017 878
## 42 15018 819
## 43 15019 769
## 44 15020 1776
## 45 15021 1586
## 46 15022 1465
## 47 15023 1328
## 48 15024 1253
## 49 24282 937
## 50 24283 689
## 51 24284 613
## 52 24285 583
## 53 24286 1346
## 54 24287 1029
## 55 24288 839
## 56 24289 706
## 57 24290 747
## 58 24291 1428
## 59 24292 1128
## 60 24293 965
## 61 24294 916
## 62 24295 1917
## 63 24296 1534
## 64 24297 1301
## 65 24298 1094
## 66 24299 1168
## 67 25334 988
## 68 25335 883
## 69 25336 802
## 70 25337 725
## 71 25338 618
## 72 25339 503
## 73 25340 602
## 74 25341 1010
## 75 25342 906
## 76 25343 831
## 77 25344 753
## 78 25345 653
## 79 25346 581
## 80 25347 518
## 81 25348 1491
## 82 25349 1314
## 83 25350 1231
## 84 25351 1116
## 85 25352 971
## 86 25353 845
## 87 25354 968
## 88 25355 930
## 89 25356 782
## 90 25357 713
## 91 25358 617
## 92 25359 526
## 93 25360 450
## 94 25361 524
## 95 25362 1415
## 96 25363 1325
## 97 25364 1231
## 98 25365 1134
## 99 25366 1043
## 100 25367 915
## 101 25368 950
## 102 30709 825
## 103 30710 748
## 104 30711 682
## 105 30712 629
## 106 30714 533
## 107 30715 462
## 108 30716 413
## 109 30717 372
## 110 30719 930
## 111 30720 826
## 112 30721 770
## 113 30722 703
## 114 34860 1361
## 115 34861 1231
## 116 34862 1076
## 117 34863 975
## 118 34864 871
## 119 34865 1091
## 120 34866 1016
## 121 34867 952
## 122 34868 922
## 123 34869 899
## 124 34870 912
## 125 34871 889
## 126 34873 1859
## 127 34874 1661
## 128 34875 1402
## 129 34876 1313
## 130 34877 1158
## 131 34878 1470
## 132 34879 1352
## 133 34880 1252
## 134 34881 1224
## 135 34882 1193
## 136 34883 1196
## 137 34884 1160
## 138 34886 1192
## 139 34887 1055
## 140 34888 881
## 141 34889 804
## 142 34890 715
## 143 34891 935
## 144 34892 857
## 145 34893 785
## 146 34894 755
## 147 34895 743
## 148 34896 747
## 149 34897 727
## 150 34899 1826
## 151 34900 1601
## 152 34901 1398
## 153 34902 1307
## 154 34903 1137
## 155 34904 1454
## 156 34905 1363
## 157 34906 1254
## 158 34907 1234
## 159 34908 1213
## 160 34909 1213
## 161 34910 1184
## 162 37415 784
## 163 37416 741
## 164 37417 676
## 165 37418 724
## 166 37419 685
## 167 37420 652
## 168 37421 641
## 169 37422 576
## 170 37423 471
## 171 37425 672
## 172 37426 612
## 173 37427 549
## 174 37428 592
## 175 37429 559
## 176 37430 532
## 177 37431 519
## 178 37432 467
## 179 37433 394
## 180 37435 586
## 181 37436 540
## 182 37437 500
## 183 37438 531
## 184 37439 504
## 185 37440 479
## 186 37441 473
## 187 37442 416
## 188 37443 344
assessments %>%
distinct(id_assessment)
## id_assessment
## 1 1752
## 2 1753
## 3 1754
## 4 1755
## 5 1756
## 6 1758
## 7 1759
## 8 1760
## 9 1761
## 10 1762
## 11 14984
## 12 14985
## 13 14986
## 14 14987
## 15 14988
## 16 14989
## 17 14991
## 18 14992
## 19 14993
## 20 14994
## 21 14995
## 22 14996
## 23 14997
## 24 14998
## 25 14999
## 26 15000
## 27 15001
## 28 15003
## 29 15004
## 30 15005
## 31 15006
## 32 15007
## 33 15008
## 34 15009
## 35 15010
## 36 15011
## 37 15012
## 38 15013
## 39 15015
## 40 15016
## 41 15017
## 42 15018
## 43 15019
## 44 15020
## 45 15021
## 46 15022
## 47 15023
## 48 15024
## 49 24282
## 50 24283
## 51 24284
## 52 24285
## 53 24286
## 54 24287
## 55 24288
## 56 24289
## 57 24290
## 58 24291
## 59 24292
## 60 24293
## 61 24294
## 62 24295
## 63 24296
## 64 24297
## 65 24298
## 66 24299
## 67 25334
## 68 25335
## 69 25336
## 70 25337
## 71 25338
## 72 25339
## 73 25340
## 74 25341
## 75 25342
## 76 25343
## 77 25344
## 78 25345
## 79 25346
## 80 25347
## 81 25348
## 82 25349
## 83 25350
## 84 25351
## 85 25352
## 86 25353
## 87 25354
## 88 25355
## 89 25356
## 90 25357
## 91 25358
## 92 25359
## 93 25360
## 94 25361
## 95 25362
## 96 25363
## 97 25364
## 98 25365
## 99 25366
## 100 25367
## 101 25368
## 102 30709
## 103 30710
## 104 30711
## 105 30712
## 106 30714
## 107 30715
## 108 30716
## 109 30717
## 110 30719
## 111 30720
## 112 30721
## 113 30722
## 114 34860
## 115 34861
## 116 34862
## 117 34863
## 118 34864
## 119 34865
## 120 34866
## 121 34867
## 122 34868
## 123 34869
## 124 34870
## 125 34871
## 126 34873
## 127 34874
## 128 34875
## 129 34876
## 130 34877
## 131 34878
## 132 34879
## 133 34880
## 134 34881
## 135 34882
## 136 34883
## 137 34884
## 138 34886
## 139 34887
## 140 34888
## 141 34889
## 142 34890
## 143 34891
## 144 34892
## 145 34893
## 146 34894
## 147 34895
## 148 34896
## 149 34897
## 150 34899
## 151 34900
## 152 34901
## 153 34902
## 154 34903
## 155 34904
## 156 34905
## 157 34906
## 158 34907
## 159 34908
## 160 34909
## 161 34910
## 162 37415
## 163 37416
## 164 37417
## 165 37418
## 166 37419
## 167 37420
## 168 37421
## 169 37422
## 170 37423
## 171 37425
## 172 37426
## 173 37427
## 174 37428
## 175 37429
## 176 37430
## 177 37431
## 178 37432
## 179 37433
## 180 37435
## 181 37436
## 182 37437
## 183 37438
## 184 37439
## 185 37440
## 186 37441
## 187 37442
## 188 37443
assessments %>%
count(assessment_type, code_module, code_presentation)
## assessment_type code_module code_presentation n
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## 11 CMA GGG 2013J 3749
## 12 CMA GGG 2014B 3063
## 13 CMA GGG 2014J 2747
## 14 Exam CCC 2014B 747
## 15 Exam CCC 2014J 1168
## 16 Exam DDD 2013B 602
## 17 Exam DDD 2013J 968
## 18 Exam DDD 2014B 524
## 19 Exam DDD 2014J 950
## 20 TMA AAA 2013J 1633
## 21 TMA AAA 2014J 1516
## 22 TMA BBB 2013B 6207
## 23 TMA BBB 2013J 7959
## 24 TMA BBB 2014B 5500
## 25 TMA BBB 2014J 7408
## 26 TMA CCC 2014B 2822
## 27 TMA CCC 2014J 4437
## 28 TMA DDD 2013B 4519
## 29 TMA DDD 2013J 6968
## 30 TMA DDD 2014B 4018
## 31 TMA DDD 2014J 7063
## 32 TMA EEE 2013J 2884
## 33 TMA EEE 2014B 1780
## 34 TMA EEE 2014J 3229
## 35 TMA FFF 2013B 5514
## 36 TMA FFF 2013J 7393
## 37 TMA FFF 2014B 4647
## 38 TMA FFF 2014J 7269
## 39 TMA GGG 2013J 2201
## 40 TMA GGG 2014B 1833
## 41 TMA GGG 2014J 1626
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## mean_date median_date sd_date min_date max_date
## 1 130.6056 129 78.02517 12 261
assessments %>%
group_by(code_module, code_presentation, id_student) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## Warning: There were 2 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `min_date = min(date, na.rm = TRUE)`.
## ℹ In group 14279: `code_module = "DDD"`, `code_presentation = "2014J"`,
## `id_student = 607110`.
## Caused by warning in `min()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## `summarise()` has grouped output by 'code_module', 'code_presentation'. You can
## override using the `.groups` argument.
## # A tibble: 25,843 × 9
## # Groups: code_module, code_presentation [22]
## code_module code_presentation id_student mean_date median_date sd_date
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AAA 2013J 11391 114. 117 79.9
## 2 AAA 2013J 28400 114. 117 79.9
## 3 AAA 2013J 31604 114. 117 79.9
## 4 AAA 2013J 32885 114. 117 79.9
## 5 AAA 2013J 38053 114. 117 79.9
## 6 AAA 2013J 45462 114. 117 79.9
## 7 AAA 2013J 45642 114. 117 79.9
## 8 AAA 2013J 52130 114. 117 79.9
## 9 AAA 2013J 53025 114. 117 79.9
## 10 AAA 2013J 57506 114. 117 79.9
## # ℹ 25,833 more rows
## # ℹ 3 more variables: min_date <dbl>, max_date <dbl>, first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE), .groups = 'drop')
code_module_dates
## # A tibble: 22 × 3
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
glimpse(assessments_joined)
## Rows: 173,912
## Columns: 11
## $ id_assessment <int> 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1752, 1…
## $ id_student <int> 11391, 28400, 31604, 32885, 38053, 45462, 45642, …
## $ date_submitted <int> 18, 22, 17, 26, 19, 20, 18, 19, 9, 18, 19, 18, 17…
## $ is_banked <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ score <int> 78, 70, 72, 69, 79, 70, 72, 72, 71, 68, 73, 67, 7…
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", …
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J", "201…
## $ assessment_type <chr> "TMA", "TMA", "TMA", "TMA", "TMA", "TMA", "TMA", …
## $ date <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1…
## $ weight <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ quantile_cutoff_date <dbl> 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5…
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
glimpse(students_and_assessments)
## Rows: 32,593
## Columns: 17
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <int> 10, 3, 4, 6, 6, 9, 4, 10, 8, NA, 8, 3, 7, 6…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pass <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mean_weighted_score <dbl> 780, 700, NA, 720, 690, 790, 700, 720, 720,…
set.seed(127)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .80, strata = "mean_weighted_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
glimpse(data_train)
## Rows: 19,707
## Columns: 17
## $ code_module <chr> "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", "…
## $ code_presentation <chr> "2013J", "2013B", "2013B", "2013B", "2013B"…
## $ id_student <dbl> 2312620, 141823, 521081, 533512, 542562, 55…
## $ gender <chr> "F", "F", "F", "F", "F", "F", "M", "F", "F"…
## $ region <chr> "East Midlands Region", "East Midlands Regi…
## $ highest_education <chr> "HE Qualification", "A Level or Equivalent"…
## $ imd_band <int> 3, 1, NA, 5, 7, 3, 1, 7, 7, 5, 5, 6, 5, 1, …
## $ age_band <chr> "35-55", "0-35", "0-35", "0-35", "0-35", "3…
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 60, 120, 90, 120, 150, 60, 180, 60, 270, 12…
## $ disability <chr> "N", "N", "Y", "N", "Y", "N", "N", "N", "Y"…
## $ final_result <chr> "Withdrawn", "Withdrawn", "Fail", "Withdraw…
## $ module_presentation_length <dbl> 268, 240, 240, 240, 240, 240, 240, 268, 268…
## $ date_registration <dbl> -45, -47, -65, -95, -85, -46, -126, -89, -1…
## $ date_unregistration <dbl> 178, 86, NA, -2, -73, NA, 89, -50, 25, 10, …
## $ pass <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mean_weighted_score <dbl> 110.0, 145.0, 0.0, 112.0, 100.0, 0.0, 0.0, …
my_rec <- recipe(mean_weighted_score ~ disability + studied_credits + highest_education + age_band, data = data_train) %>%
step_dummy(disability) %>%
step_dummy(highest_education)
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # generalized linear model
set_mode("regression") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
# Drop rows with missing values in the outcome (and in imd_band and date_registration) before evaluating
data_train <- data_train %>% drop_na(mean_weighted_score, imd_band, date_registration)
data_test <- data_test %>% drop_na(mean_weighted_score, imd_band, date_registration)
# Make predictions on the test data
predictions <- predict(fitted_model, new_data = data_test)
# Calculate MAE (Mean Absolute Error)
mae_value <- mean(abs(data_test$mean_weighted_score - predictions$`.pred`))
# Calculate MSE (Mean Squared Error)
mse_value <- mean((data_test$mean_weighted_score - predictions$`.pred`)^2)
# Calculate RMSE (Root Mean Squared Error)
rmse_value <- sqrt(mse_value)
cat("MAE:", mae_value, "\n")
## MAE: 329.3553
cat("MSE:", mse_value, "\n")
## MSE: 135977.3
cat("RMSE:", rmse_value, "\n")
## RMSE: 368.751
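As an alternative to computing these metrics by hand, they could also be obtained with yardstick, which loads with tidymodels. This is a minimal sketch assuming the `fitted_model` and `data_test` objects created above; the intermediate object names here are illustrative:
# attach predictions to the test set and score them with yardstick
test_with_preds <- predict(fitted_model, new_data = data_test) %>%
bind_cols(data_test)
reg_metrics <- metric_set(mae, rmse, rsq) # MSE is not listed separately; it can be recovered as RMSE squared
test_with_preds %>%
reg_metrics(truth = mean_weighted_score, estimate = .pred)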
Please add your interpretations here:
MAE: The Mean Absolute Error (MAE) of about 329.36 indicates that, on average, the model’s predictions differ from students’ actual mean weighted scores by roughly 329 points. Because MAE averages the absolute size of the errors without squaring them, it gives an easily interpreted picture of the typical error and is less influenced by a few very large errors than MSE or RMSE.
MSE: The Mean Squared Error (MSE) of about 135,977 is much larger in magnitude because each error is squared before averaging, which places it in squared units of the outcome. Squaring gives greater weight to large errors, so MSE is sensitive to outliers; a high MSE signals that some predictions may be off by a substantial amount.
RMSE: The Root Mean Squared Error (RMSE) is the square root of the MSE, so it is expressed in the same units as the outcome, which makes it easier to interpret than MSE. Here, the RMSE of about 368.75 suggests that predictions are typically off by roughly 369 points, with larger errors counting more heavily than in the MAE.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output (an example header is shown after these steps).
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document. See the screenshot below.
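As an illustration, a typical YAML header looks something like the following; the title and output format shown are placeholders, so keep whatever your document already uses and change only the author: field:
---
title: "Your existing title"
author: "Your Name"
output: html_document
---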
Have fun!
