At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions here.

Use the group_by function to group your data into (at least) 3 different sets of groups, each summarizing different variables. For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical) columns.

Within each group_by data frame, calculate the expected probability for each group. Maybe assign the lowest probability group an “anomaly” tag, and then translate that back into your original data frame. Draw some conclusions about the numbers you’ve calculated. Try to draw a testable hypothesis for why some groups are rarer than others. (How might you test this hypothesis?) Think of different ways to visualize these groups.

Pick 2-3 categorical variables for which you know all possible combinations. Which combinations never show up? Why might that be? Which combinations are the most/least common, and why might that be? Try (i.e., no need if you can’t figure this one out) to find a way to visualize these combinations.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
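
The question about combinations that never show up can be approached by counting the observed combinations and then filling in the missing ones with zero. The following is a minimal sketch, not the method used in this report. The column names `Gender`, `Scholarship holder`, and `Debtor` come from the dataset, but the 0/1 coding and the five toy rows are assumptions made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows; only the column names mirror the real dataset.
toy <- tibble(
  Gender               = c(0, 0, 1, 1, 1),
  `Scholarship holder` = c(0, 1, 0, 0, 0),
  Debtor               = c(0, 0, 0, 1, 0)
)

# Count the observed combinations, then add every possible combination
# of the (assumed) 0/1 codes, giving unobserved ones a count of zero.
combo_counts <- toy |>
  count(Gender, `Scholarship holder`, Debtor, name = "Count") |>
  complete(
    Gender = c(0, 1),
    `Scholarship holder` = c(0, 1),
    Debtor = c(0, 1),
    fill = list(Count = 0L)
  )

# Combinations that never show up in the data
never_seen <- combo_counts |> filter(Count == 0)
```

On the real data, the same pipeline would start from `data` instead of `toy`, and `never_seen` would list the combinations worth explaining.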
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
# Load the dataset (semicolon-delimited CSV)
data <- read_delim("data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] "Marital status"
## [2] "Application mode"
## [3] "Application order"
## [4] "Course"
## [5] "Daytime/evening attendance"
## [6] "Previous qualification"
## [7] "Previous qualification (grade)"
## [8] "Nacionality"
## [9] "Mother's qualification"
## [10] "Father's qualification"
## [11] "Mother's occupation"
## [12] "Father's occupation"
## [13] "Admission grade"
## [14] "Displaced"
## [15] "Educational special needs"
## [16] "Debtor"
## [17] "Tuition fees up to date"
## [18] "Gender"
## [19] "Scholarship holder"
## [20] "Age at enrollment"
## [21] "International"
## [22] "Curricular units 1st sem (credited)"
## [23] "Curricular units 1st sem (enrolled)"
## [24] "Curricular units 1st sem (evaluations)"
## [25] "Curricular units 1st sem (approved)"
## [26] "Curricular units 1st sem (grade)"
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"
## [29] "Curricular units 2nd sem (enrolled)"
## [30] "Curricular units 2nd sem (evaluations)"
## [31] "Curricular units 2nd sem (approved)"
## [32] "Curricular units 2nd sem (grade)"
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"
## [35] "Inflation rate"
## [36] "GDP"
## [37] "Target"
## [1] "Probability by Marital Status"
## # A tibble: 6 × 5
## `Marital status` Marital_status_Description Count Probability
## <dbl> <chr> <int> <dbl>
## 1 1 Single 3919 88.6
## 2 2 Married 379 8.57
## 3 3 Widower 4 0.0904
## 4 4 Divorced 91 2.06
## 5 5 Facto Union 25 0.565
## 6 6 Legally Separated 6 0.136
## # ℹ 1 more variable: Probability_percentage <dbl>
| Marital status | Marital_status_Description | Count | Probability | Probability_percentage |
|---|---|---|---|---|
| 1 | Single | 3919 | 88.5849910 | 88.58 |
| 2 | Married | 379 | 8.5669078 | 8.57 |
| 3 | Widower | 4 | 0.0904159 | 0.09 |
| 4 | Divorced | 91 | 2.0569620 | 2.06 |
| 5 | Facto Union | 25 | 0.5650995 | 0.57 |
| 6 | Legally Separated | 6 | 0.1356239 | 0.14 |
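Most students are single: 3,919 of the 4,424 students, or about 88.6%. Married students are a distant second (379, about 8.6%), and very few students are widowed (4, about 0.09%) or legally separated (6, about 0.14%). Note that the Probability column is expressed as a percentage, not as a proportion between 0 and 1. Widowed students form the lowest-probability group, so they are the group that would receive the “anomaly” tag. The average admission grades of single and married students also differ very little, although the table above does not include those averages.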
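
As a sketch of how such a table and the “anomaly” tag could be produced, the example below rebuilds only the `Marital status` column from the counts reported above and then groups, summarises, and joins the tag back onto the rows. The names `students`, `marital_summary`, `students_tagged`, and `Tag` are hypothetical; the report's own code is not shown, so this is one possible approach rather than the original one.

```r
library(dplyr)

# Rebuild the `Marital status` column from the counts in the table above.
# The real analysis would use the full `data` frame read from data.csv.
students <- tibble(
  `Marital status` = rep(1:6, times = c(3919, 379, 4, 91, 25, 6))
)

marital_summary <- students |>
  group_by(`Marital status`) |>
  summarise(Count = n(), .groups = "drop") |>
  mutate(
    # Expressed as a percentage, matching the table above
    Probability = 100 * Count / sum(Count),
    Probability_percentage = round(Probability, 2),
    # Tag the rarest group as the "anomaly"
    Tag = if_else(Count == min(Count), "anomaly", "typical")
  )

# Translate the tag back onto the row-level data
students_tagged <- students |>
  left_join(
    select(marital_summary, `Marital status`, Tag),
    by = "Marital status"
  )
```

With these counts, only the four rows with `Marital status` 3 (Widower) carry the "anomaly" tag in `students_tagged`.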
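Next, the data is grouped by application mode, and the average application grade, the number of students, and the probability (as a percentage) are computed for each group.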
## [1] "Grouped by Application mode"
## # A tibble: 18 × 5
## `Application mode` Application_mode_description Average_application_…¹
## <dbl> <chr> <dbl>
## 1 1 1st phase - general contingent 128.
## 2 2 Ordinance No. 612/93 122.
## 3 5 1st phase - special contingent (Az… 129.
## 4 7 Holders of other higher courses 132.
## 5 10 Ordinance No. 854-B/99 148.
## 6 15 International student (bachelor) 126.
## 7 16 1st phase - special contingent (Ma… 131.
## 8 17 2nd phase - general contingent 125.
## 9 18 3rd phase - general contingent 123.
## 10 26 Ordinance No. 533-A/99, item b2) (… 122.
## 11 27 Ordinance No. 533-A/99, item b3 (O… 130
## 12 39 Over 23 years old 126.
## 13 42 Transfer 124.
## 14 43 Change of course 122.
## 15 44 Technological specialization diplo… 140.
## 16 51 Change of institution/course 121.
## 17 53 Short cycle diploma holders 138.
## 18 57 Change of institution/course (Inte… 100
## # ℹ abbreviated name: ¹Average_application_grade
## # ℹ 2 more variables: Total_Students <int>, Probability <dbl>
| Application mode | Application_mode_description | Average_application_grade | Total_Students | Probability |
|---|---|---|---|---|
| 1 | 1st phase - general contingent | 127.66 | 1708 | 38.61 |
| 2 | Ordinance No. 612/93 | 121.50 | 3 | 0.07 |
| 5 | 1st phase - special contingent (Azores Island) | 129.36 | 16 | 0.36 |
| 7 | Holders of other higher courses | 132.38 | 139 | 3.14 |
| 10 | Ordinance No. 854-B/99 | 148.41 | 10 | 0.23 |
| 15 | International student (bachelor) | 126.38 | 30 | 0.68 |
| 16 | 1st phase - special contingent (Madeira Island) | 131.21 | 38 | 0.86 |
| 17 | 2nd phase - general contingent | 124.66 | 872 | 19.71 |
| 18 | 3rd phase - general contingent | 122.85 | 124 | 2.80 |
| 26 | Ordinance No. 533-A/99, item b2) (Different Plan) | 121.50 | 1 | 0.02 |
| 27 | Ordinance No. 533-A/99, item b3 (Other Institution) | 130.00 | 1 | 0.02 |
| 39 | Over 23 years old | 125.92 | 785 | 17.74 |
| 42 | Transfer | 124.48 | 77 | 1.74 |
| 43 | Change of course | 121.93 | 312 | 7.05 |
| 44 | Technological specialization diploma holders | 140.21 | 213 | 4.81 |
| 51 | Change of institution/course | 121.11 | 59 | 1.33 |
| 53 | Short cycle diploma holders | 138.39 | 35 | 0.79 |
| 57 | Change of institution/course (International) | 100.00 | 1 | 0.02 |
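
Several application modes in the table above contain a single student (modes 26, 27, and 57), so their average grade is just that one student's grade. The sketch below shows one way to compute the same kind of summary while flagging such groups. It assumes that `Average_application_grade` is the mean of the `Admission grade` column. The toy rows, the `Small_group` column, and the threshold of 2 are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows; only the column names mirror the real dataset.
toy <- tibble(
  `Application mode` = c(1, 1, 1, 17, 17, 57),
  `Admission grade`  = c(120, 130, 140, 125, 135, 100)
)

mode_summary <- toy |>
  group_by(`Application mode`) |>
  summarise(
    Average_application_grade = round(mean(`Admission grade`), 2),
    Total_Students = n(),
    .groups = "drop"
  ) |>
  mutate(
    # Share of all students in each group, as a percentage
    Probability = round(100 * Total_Students / sum(Total_Students), 2),
    # A mean over a single student is not a stable estimate
    Small_group = Total_Students < 2
  )
```

Filtering on `Small_group` before comparing averages keeps one-student groups from being read as genuinely high or low performers.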
## [1] "Grouped by Nationality"
## # A tibble: 21 × 5
## Nacionality Nationality_Description Average_application_grade Total_Students
## <dbl> <chr> <dbl> <int>
## 1 1 Portuguese 127. 4314
## 2 2 German 136. 2
## 3 6 Spanish 129. 13
## 4 11 Italian 128. 3
## 5 13 Dutch 138. 1
## 6 14 English 151. 1
## 7 17 Lithuanian 118. 1
## 8 21 Angolan 109. 2
## 9 22 Cape Verdean 143. 13
## 10 24 Guinean 124. 5
## # ℹ 11 more rows
## # ℹ 1 more variable: Probability <dbl>
| Nacionality | Nationality_Description | Average_application_grade | Total_Students | Probability |
|---|---|---|---|---|
| 1 | Portuguese | 126.92 | 4314 | 97.51 |
| 2 | German | 136.30 | 2 | 0.05 |
| 6 | Spanish | 128.55 | 13 | 0.29 |
| 11 | Italian | 127.77 | 3 | 0.07 |
| 13 | Dutch | 137.60 | 1 | 0.02 |
| 14 | English | 150.80 | 1 | 0.02 |
| 17 | Lithuanian | 118.10 | 1 | 0.02 |
| 21 | Angolan | 108.95 | 2 | 0.05 |
| 22 | Cape Verdean | 143.49 | 13 | 0.29 |
| 24 | Guinean | 124.38 | 5 | 0.11 |
| 25 | Mozambican | 120.70 | 2 | 0.05 |
| 26 | Santomean | 132.81 | 14 | 0.32 |
| 32 | Turkish | 160.00 | 1 | 0.02 |
| 41 | Brazilian | 121.18 | 38 | 0.86 |
| 62 | Romanian | 125.15 | 2 | 0.05 |
| 100 | Moldova (Republic of) | 115.90 | 3 | 0.07 |
| 101 | Mexican | 139.25 | 2 | 0.05 |
| 103 | Ukrainian | 150.10 | 3 | 0.07 |
| 105 | Russian | 135.90 | 2 | 0.05 |
| 108 | Cuban | 190.00 | 1 | 0.02 |
| 109 | Colombia | 126.90 | 1 | 0.02 |
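
Most nationalities in the table above have only a handful of students, which makes both the averages and any plot hard to read. One option, sketched here as a suggestion rather than as part of the original analysis, is to lump rare nationalities into an "Other" level with `forcats::fct_lump_min()` before summarising. The counts are taken from the table above; the threshold of 10 students and the names `nationality_counts`, `students`, and `lumped` are assumptions.

```r
library(dplyr)
library(forcats)

# Student counts per nationality, copied from the table above
nationality_counts <- c(
  Portuguese = 4314, German = 2, Spanish = 13, Italian = 3, Dutch = 1,
  English = 1, Lithuanian = 1, Angolan = 2, `Cape Verdean` = 13,
  Guinean = 5, Mozambican = 2, Santomean = 14, Turkish = 1,
  Brazilian = 38, Romanian = 2, `Moldova (Republic of)` = 3,
  Mexican = 2, Ukrainian = 3, Russian = 2, Cuban = 1, Colombia = 1
)

# One row per student, carrying only the nationality description
students <- tibble(
  Nationality_Description = rep(names(nationality_counts),
                                times = nationality_counts)
)

lumped <- students |>
  # Nationalities with fewer than 10 students become "Other"
  mutate(Nationality_group = fct_lump_min(Nationality_Description, min = 10)) |>
  count(Nationality_group, name = "Total_Students") |>
  mutate(Probability = round(100 * Total_Students / sum(Total_Students), 2))
```

The resulting `lumped` table has six rows (five nationalities plus "Other"), which is a more workable input for a bar chart than the full 21-row table.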
```