Data Dive 2

At least 3 “group by” data frames, and an investigation into each. You’ll need to use categorical columns, or one of the cut_ functions. Use the group_by function to group your data into (at least) 3 different sets of groups, each summarizing different variables. For example, this could be as simple as three data frames which group your data based on three different categorical columns but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical) columns. Within each group_by data frame, calculate the expected probability for each group. Maybe assign the lowest probability group an “anomaly” tag, and then translate that back into your original data frame. Draw some conclusions about the numbers you’ve calculated.

Try to draw a testable hypothesis for why some groups are rarer than others. (How might you test this hypothesis?) Think of different ways to visualize these groups. Pick 2-3 categorical variables for which you know all possible combinations. Which combinations never show up? Why might that be? Which combinations are the most/least common, and why might that be? Try (i.e., no need if you can’t figure this one out) to find a way to visualize these combinations. For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
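One possible way to find combinations that never show up is sketched below. The column names come from the dataset used in this report, but the rows in `toy` are hypothetical values made up for illustration, and the approach (enumerate every combination, then keep the ones with no matching row) is an assumption rather than the method actually used here.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows; the values are invented for illustration only
toy <- tibble::tibble(
  Gender               = c(0, 0, 1, 1, 1),
  `Scholarship holder` = c(0, 1, 0, 0, 0),
  Debtor               = c(0, 0, 0, 1, 0)
)

# Combinations that actually occur, with how often each occurs
observed <- toy %>%
  count(Gender, `Scholarship holder`, Debtor, name = "Count")

# Every combination of the values seen in each column
all_combos <- toy %>%
  tidyr::expand(Gender, `Scholarship holder`, Debtor)

# Combinations that are possible but have no matching row
missing_combos <- all_combos %>%
  anti_join(observed, by = c("Gender", "Scholarship holder", "Debtor"))

missing_combos
```

Note that `tidyr::expand()` only uses values that appear in the data. If a level is known to exist but never occurs at all, `tidyr::crossing()` with the full list of levels would be the safer starting point.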

Importing all the libraries

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
#Loading the dataset
data <- read_delim("data.csv", delim = ";")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##  [1] "Marital status"                                
##  [2] "Application mode"                              
##  [3] "Application order"                             
##  [4] "Course"                                        
##  [5] "Daytime/evening attendance"                    
##  [6] "Previous qualification"                        
##  [7] "Previous qualification (grade)"                
##  [8] "Nacionality"                                   
##  [9] "Mother's qualification"                        
## [10] "Father's qualification"                        
## [11] "Mother's occupation"                           
## [12] "Father's occupation"                           
## [13] "Admission grade"                               
## [14] "Displaced"                                     
## [15] "Educational special needs"                     
## [16] "Debtor"                                        
## [17] "Tuition fees up to date"                       
## [18] "Gender"                                        
## [19] "Scholarship holder"                            
## [20] "Age at enrollment"                             
## [21] "International"                                 
## [22] "Curricular units 1st sem (credited)"           
## [23] "Curricular units 1st sem (enrolled)"           
## [24] "Curricular units 1st sem (evaluations)"        
## [25] "Curricular units 1st sem (approved)"           
## [26] "Curricular units 1st sem (grade)"              
## [27] "Curricular units 1st sem (without evaluations)"
## [28] "Curricular units 2nd sem (credited)"           
## [29] "Curricular units 2nd sem (enrolled)"           
## [30] "Curricular units 2nd sem (evaluations)"        
## [31] "Curricular units 2nd sem (approved)"           
## [32] "Curricular units 2nd sem (grade)"              
## [33] "Curricular units 2nd sem (without evaluations)"
## [34] "Unemployment rate"                             
## [35] "Inflation rate"                                
## [36] "GDP"                                           
## [37] "Target"

Grouping by Marital Status

## [1] "Probability by Marital Status"
## # A tibble: 6 × 5
##   `Marital status` Marital_status_Description Count Probability
##              <dbl> <chr>                      <int>       <dbl>
## 1                1 Single                      3919     88.6   
## 2                2 Married                      379      8.57  
## 3                3 Widower                        4      0.0904
## 4                4 Divorced                      91      2.06  
## 5                5 Facto Union                   25      0.565 
## 6                6 Legally Separated              6      0.136 
## # ℹ 1 more variable: Probability_percentage <dbl>
| Marital status | Marital_status_Description | Count | Probability | Probability_percentage |
|---|---|---|---|---|
| 1 | Single | 3919 | 88.5849910 | 88.58 |
| 2 | Married | 379 | 8.5669078 | 8.57 |
| 3 | Widower | 4 | 0.0904159 | 0.09 |
| 4 | Divorced | 91 | 2.0569620 | 2.06 |
| 5 | Facto Union | 25 | 0.5650995 | 0.57 |
| 6 | Legally Separated | 6 | 0.1356239 | 0.14 |

From the group-by investigation above, the average admission grades of single and married students differ very little. Most students are single (3,919 of 4,424, about 88.6%), while very few are widowed (4) or legally separated (6).

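The Probability column is each group's share of all 4,424 students, expressed as a percentage (Count / 4424 × 100). Probability_percentage is the same value rounded to two decimals.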
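A minimal sketch of how the marital status table could be computed is shown below. The code that produced the output above is not shown in this report, so this is an assumption about the approach rather than the original code. To keep the example self-contained, the `Marital status` column is rebuilt from the counts reported above instead of being read from data.csv; the names `marital_counts`, `marital_labels`, `students`, and `by_marital` are illustrative.

```r
library(dplyr)

# Counts per marital status code, copied from the table above
marital_counts <- c(`1` = 3919, `2` = 379, `3` = 4, `4` = 91, `5` = 25, `6` = 6)

# Labels per code, also copied from the table above
marital_labels <- c(`1` = "Single", `2` = "Married", `3` = "Widower",
                    `4` = "Divorced", `5` = "Facto Union",
                    `6` = "Legally Separated")

# One row per student, standing in for the real data frame
students <- tibble::tibble(
  `Marital status` = rep(as.numeric(names(marital_counts)), times = marital_counts)
)

by_marital <- students %>%
  group_by(`Marital status`) %>%
  summarise(Count = n()) %>%
  mutate(
    Marital_status_Description = unname(marital_labels[as.character(`Marital status`)]),
    # Share of all students, as a percentage
    Probability = Count / sum(Count) * 100,
    Probability_percentage = round(Probability, 2)
  )

by_marital
```

Because `summarise()` drops the single grouping level, `sum(Count)` inside `mutate()` is the total number of students, so the percentages add up to 100.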

Grouping by Application Mode

## [1] "Grouped by Application mode"
## # A tibble: 18 × 5
##    `Application mode` Application_mode_description        Average_application_…¹
##                 <dbl> <chr>                                                <dbl>
##  1                  1 1st phase - general contingent                        128.
##  2                  2 Ordinance No. 612/93                                  122.
##  3                  5 1st phase - special contingent (Az…                   129.
##  4                  7 Holders of other higher courses                       132.
##  5                 10 Ordinance No. 854-B/99                                148.
##  6                 15 International student (bachelor)                      126.
##  7                 16 1st phase - special contingent (Ma…                   131.
##  8                 17 2nd phase - general contingent                        125.
##  9                 18 3rd phase - general contingent                        123.
## 10                 26 Ordinance No. 533-A/99, item b2) (…                   122.
## 11                 27 Ordinance No. 533-A/99, item b3 (O…                   130 
## 12                 39 Over 23 years old                                     126.
## 13                 42 Transfer                                              124.
## 14                 43 Change of course                                      122.
## 15                 44 Technological specialization diplo…                   140.
## 16                 51 Change of institution/course                          121.
## 17                 53 Short cycle diploma holders                           138.
## 18                 57 Change of institution/course (Inte…                   100 
## # ℹ abbreviated name: ¹​Average_application_grade
## # ℹ 2 more variables: Total_Students <int>, Probability <dbl>
| Application mode | Application_mode_description | Average_application_grade | Total_Students | Probability |
|---|---|---|---|---|
| 1 | 1st phase - general contingent | 127.66 | 1708 | 38.61 |
| 2 | Ordinance No. 612/93 | 121.50 | 3 | 0.07 |
| 5 | 1st phase - special contingent (Azores Island) | 129.36 | 16 | 0.36 |
| 7 | Holders of other higher courses | 132.38 | 139 | 3.14 |
| 10 | Ordinance No. 854-B/99 | 148.41 | 10 | 0.23 |
| 15 | International student (bachelor) | 126.38 | 30 | 0.68 |
| 16 | 1st phase - special contingent (Madeira Island) | 131.21 | 38 | 0.86 |
| 17 | 2nd phase - general contingent | 124.66 | 872 | 19.71 |
| 18 | 3rd phase - general contingent | 122.85 | 124 | 2.80 |
| 26 | Ordinance No. 533-A/99, item b2) (Different Plan) | 121.50 | 1 | 0.02 |
| 27 | Ordinance No. 533-A/99, item b3 (Other Institution) | 130.00 | 1 | 0.02 |
| 39 | Over 23 years old | 125.92 | 785 | 17.74 |
| 42 | Transfer | 124.48 | 77 | 1.74 |
| 43 | Change of course | 121.93 | 312 | 7.05 |
| 44 | Technological specialization diploma holders | 140.21 | 213 | 4.81 |
| 51 | Change of institution/course | 121.11 | 59 | 1.33 |
| 53 | Short cycle diploma holders | 138.39 | 35 | 0.79 |
| 57 | Change of institution/course (International) | 100.00 | 1 | 0.02 |
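The assignment suggests tagging the lowest-probability group as an “anomaly” and carrying that tag back to the original data frame. The sketch below shows one way this might be done for application mode. It is an illustration under assumptions, not the code behind the table above: the `Application mode` column is rebuilt from the reported counts, and the names `mode_counts`, `students`, `by_mode`, `tagged`, and `Anomaly` are hypothetical.

```r
library(dplyr)

# Students per application mode code, copied from the table above
mode_counts <- c(`1` = 1708, `2` = 3, `5` = 16, `7` = 139, `10` = 10,
                 `15` = 30, `16` = 38, `17` = 872, `18` = 124, `26` = 1,
                 `27` = 1, `39` = 785, `42` = 77, `43` = 312, `44` = 213,
                 `51` = 59, `53` = 35, `57` = 1)

# One row per student, standing in for the real data frame
students <- tibble::tibble(
  `Application mode` = rep(as.numeric(names(mode_counts)), times = mode_counts)
)

by_mode <- students %>%
  count(`Application mode`, name = "Total_Students") %>%
  mutate(
    Probability = round(Total_Students / sum(Total_Students) * 100, 2),
    # Ties for the lowest probability are all tagged
    Anomaly = Probability == min(Probability)
  )

# Carry the tag back to the per-student rows
tagged <- students %>%
  left_join(select(by_mode, `Application mode`, Anomaly),
            by = "Application mode")

filter(tagged, Anomaly)
```

With the counts above, three modes tie for the lowest probability (codes 26, 27, and 57, one student each), so comparing against `min(Probability)` tags all three rather than picking one arbitrarily.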

Grouping by Nationality

## [1] "Grouped by Nationality"
## # A tibble: 21 × 5
##    Nacionality Nationality_Description Average_application_grade Total_Students
##          <dbl> <chr>                                       <dbl>          <int>
##  1           1 Portuguese                                   127.           4314
##  2           2 German                                       136.              2
##  3           6 Spanish                                      129.             13
##  4          11 Italian                                      128.              3
##  5          13 Dutch                                        138.              1
##  6          14 English                                      151.              1
##  7          17 Lithuanian                                   118.              1
##  8          21 Angolan                                      109.              2
##  9          22 Cape Verdean                                 143.             13
## 10          24 Guinean                                      124.              5
## # ℹ 11 more rows
## # ℹ 1 more variable: Probability <dbl>
| Nacionality | Nationality_Description | Average_application_grade | Total_Students | Probability |
|---|---|---|---|---|
| 1 | Portuguese | 126.92 | 4314 | 97.51 |
| 2 | German | 136.30 | 2 | 0.05 |
| 6 | Spanish | 128.55 | 13 | 0.29 |
| 11 | Italian | 127.77 | 3 | 0.07 |
| 13 | Dutch | 137.60 | 1 | 0.02 |
| 14 | English | 150.80 | 1 | 0.02 |
| 17 | Lithuanian | 118.10 | 1 | 0.02 |
| 21 | Angolan | 108.95 | 2 | 0.05 |
| 22 | Cape Verdean | 143.49 | 13 | 0.29 |
| 24 | Guinean | 124.38 | 5 | 0.11 |
| 25 | Mozambican | 120.70 | 2 | 0.05 |
| 26 | Santomean | 132.81 | 14 | 0.32 |
| 32 | Turkish | 160.00 | 1 | 0.02 |
| 41 | Brazilian | 121.18 | 38 | 0.86 |
| 62 | Romanian | 125.15 | 2 | 0.05 |
| 100 | Moldova (Republic of) | 115.90 | 3 | 0.07 |
| 101 | Mexican | 139.25 | 2 | 0.05 |
| 103 | Ukrainian | 150.10 | 3 | 0.07 |
| 105 | Russian | 135.90 | 2 | 0.05 |
| 108 | Cuban | 190.00 | 1 | 0.02 |
| 109 | Colombia | 126.90 | 1 | 0.02 |
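Since one group (Portuguese) accounts for 97.51% of students, a plain bar chart would hide every other nationality. The sketch below is one possible way to visualize the group sizes, using a dot plot with a log-scaled axis. The counts are copied from the table above; the object names `nationality` and `p` and the choice of plot are assumptions made for illustration.

```r
library(ggplot2)

# Group sizes copied from the nationality table above
nationality <- data.frame(
  Nationality_Description = c(
    "Portuguese", "German", "Spanish", "Italian", "Dutch", "English",
    "Lithuanian", "Angolan", "Cape Verdean", "Guinean", "Mozambican",
    "Santomean", "Turkish", "Brazilian", "Romanian",
    "Moldova (Republic of)", "Mexican", "Ukrainian", "Russian", "Cuban",
    "Colombia"
  ),
  Total_Students = c(4314, 2, 13, 3, 1, 1, 1, 2, 13, 5, 2, 14, 1, 38, 2,
                     3, 2, 3, 2, 1, 1)
)

# Dot plot ordered by group size; the log scale keeps the small groups visible
p <- ggplot(
  nationality,
  aes(x = Total_Students,
      y = reorder(Nationality_Description, Total_Students))
) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Total_Students (log scale)", y = "Nationality_Description")

p
```

Points are used instead of bars because bars on a log scale have no meaningful zero baseline, and groups with a single student would otherwise have zero height.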

```