Week3 Data Dive Group By and Probability

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dataset <-read_delim("C:/Users/MSKR/MASTER'S_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset

Marital status is a categorical attribute, per the data documentation. The values in the “Marital status” column map to the following categories:

1 – single
2 – married
3 – widower
4 – divorced
5 – facto union
6 – legally separated

[data documentation: https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success ]
Now creating a new column in the data frame that maps these numeric codes to their category labels.
```
# Map the numeric codes to labels; any code outside 1-6 is labelled "no".
dataset_1 <- mutate(dataset, marital_status = ifelse(`Marital status` == 1, "single",
                    ifelse(`Marital status` == 2, "married",
                    ifelse(`Marital status` == 3, "widower",
                    ifelse(`Marital status` == 4, "divorced",
                    ifelse(`Marital status` == 5, "facto union",
                    ifelse(`Marital status` == 6, "legally separated", "no")))))))
```
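The same code-to-label mapping can also be written with base R's `factor()`, which avoids the nested `ifelse()` calls. This is only a sketch on a made-up vector of codes, not the author's data; the labels are the ones listed in the data documentation above. One difference to keep in mind: codes outside 1 to 6 become `NA` here rather than "no".

```r
# Toy codes standing in for the `Marital status` column (made-up values).
codes <- c(1, 2, 6, 3, 1, 5, 4)

# Labels in code order 1..6, taken from the documentation mapping above.
status_labels <- c("single", "married", "widower", "divorced",
                   "facto union", "legally separated")

# factor() pairs each level with its label; unmatched codes become NA.
marital_status <- factor(codes, levels = 1:6, labels = status_labels)

as.character(marital_status)
table(marital_status)
```

A factor also keeps the categories in documentation order when they are used on a plot axis.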
Categorical columns (just for my sanity check!)
```
dataset_1[,c('Marital status','Target', 'marital_status')]
```

1. Let’s examine the relationship between a student’s marital status and the unemployment rate at the time the data was recorded.

df1 <- dataset_1 |>
  group_by(marital_status) |>
  summarise(avg_UER = mean(`Unemployment rate`),
            min_UER = min(`Unemployment rate`),
            max_UER = max(`Unemployment rate`),
            count = n())

df1

The average unemployment rate is lowest in the “widower” category, but the averages are very close across all marital statuses. The minimum and maximum rates also span nearly the same range in every category.
“Widower” has the fewest students, followed by “legally separated”, with just 4 and 6 students respectively. So the probability that a randomly chosen student is a widower is 4/4424 ≈ 0.0009.
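The probability above is just a category count divided by the total number of rows. A minimal sketch of that arithmetic follows: the first part uses only the figures quoted in the text (4, 6 and 4424), and the second part applies the same idea to every category of a small made-up vector (the `toy` tibble is hypothetical, not the real data).

```r
library(dplyr)

# Figures quoted in the text: 4 widower and 6 legally separated
# students out of 4424 rows.
n_total <- 4424
p_widower <- 4 / n_total
p_legally_separated <- 6 / n_total
round(c(widower = p_widower, legally_separated = p_legally_separated), 4)

# Same idea for all categories at once, on made-up toy data.
toy <- tibble(marital_status = c("single", "single", "single", "married", "widower"))

probs <- toy |>
  count(marital_status) |>      # one row per category with its count n
  mutate(prob = n / sum(n))     # empirical probability of each category
probs
```

The `prob` column always sums to 1, which is a quick check that no rows were dropped.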

Visualizing the above metrics using a box plot (a bar plot would normally suit summary values better, but since the values are very close to each other, a bar plot may not show the small differences well, so a box plot is used instead):

p <- dataset_1 |>
  ggplot() +
  geom_boxplot(mapping = aes(x = marital_status, y = `Unemployment rate`)) +
  labs(title = "Average Unemployment Rate") +  # plot title
  theme_minimal()

p

2. Now let’s find the pattern between Marital status of a student and their previous grades before enrolling into the course.

df2 <- dataset_1 |>
  group_by(marital_status) |>
  summarise(avg_PrvGrade = mean(`Previous qualification (grade)`),
            min_PrvGrade = min(`Previous qualification (grade)`),
            max_PrvGrade = max(`Previous qualification (grade)`),
            count = n()) |>
  arrange(desc(avg_PrvGrade))
df2

Both the mean and the minimum of the previous grades are highest for the “widower” category in our data. However, the number of students in this category is negligible compared to the total number of student records (4424), so this metric cannot serve as a strong argument or a weighted insight for decision making.
The majority of students are in the “single” category, whose average, minimum, and maximum values all fall in the middle of the range. We can explore this further only by grouping these students in combination with other categorical factors.
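One way to make the caution about the tiny “widower” group concrete is to report the standard error of each group mean next to the mean itself. The sketch below is an assumption of mine rather than part of the original analysis, and it runs on made-up grades (one group of 300 values, one group of 4), so the numbers say nothing about the real data.

```r
library(dplyr)

# Made-up grades: one large group and one very small group.
toy <- tibble(
  marital_status = c(rep("single", 300), rep("widower", 4)),
  grade = c(rep(c(120, 130, 140), 100), 120, 130, 140, 150)
)

se_by_group <- toy |>
  group_by(marital_status) |>
  summarise(
    count = n(),
    avg = mean(grade),
    se = sd(grade) / sqrt(count)  # standard error of the mean
  )
se_by_group
```

With so few observations the small group's standard error is many times larger, which is why its high average is weak evidence on its own.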

Visualizing the same through a box plot:
```
p2<-dataset_1|>
  ggplot() +
  geom_boxplot(mapping = aes(x = marital_status , y = `Previous qualification (grade)`)) +
 labs(title="Previous qualification (grade)") +
 theme_minimal()
p2
```

3. Let us examine the trend in day/evening attendance of students and their 1st semester results after enrolling in the course.

In “Daytime/evening attendance”, ‘1’ refers to daytime and ‘0’ refers to evening.

df3<-dataset_1|>
  group_by(`Daytime/evening attendance  `)|>
  summarise(avg_1st_sem = mean(`Curricular units 1st sem (grade)`),
            min_1st_sem = min(`Curricular units 1st sem (grade)`),
            max_1st_sem = max(`Curricular units 1st sem (grade)`),
            count = n()) |>
  arrange(desc(avg_1st_sem))

df3

The average 1st semester grade is about 1% higher for students who attended during the day than for those who attended in the evening.
This is a comparatively strong data point, because daytime attendees form the majority of the records and also have the higher grades.
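A percentage difference like the one quoted above can be computed directly from the two group means. The sketch below assumes the figure is a relative difference (day mean over evening mean) and uses a made-up summary table shaped like `df3`; the values 10.1 and 10.0 are purely illustrative and are not the actual results.

```r
library(dplyr)

# Made-up summary table shaped like df3 (values are illustrative only).
toy_summary <- tibble(
  attendance = c("day", "evening"),
  avg_1st_sem = c(10.1, 10.0)
)

day_avg <- toy_summary$avg_1st_sem[toy_summary$attendance == "day"]
eve_avg <- toy_summary$avg_1st_sem[toy_summary$attendance == "evening"]

# Relative difference of the day mean over the evening mean, in percent.
pct_diff <- (day_avg - eve_avg) / eve_avg * 100
pct_diff
```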

Visualizing the same through a box plot would result in the following:

(since the categorical column is numeric, first creating a string column in the same data frame)
```
dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance    ` == 1, "day","evening"))
```
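The attendance column name in the chunks above is written with trailing whitespace inside the backticks, and the amount of whitespace differs between the two chunks as rendered here. If that whitespace really is part of the file's header (an assumption; it may only be a rendering artifact), one option is to trim all column names once after loading. The sketch uses `rename_with()` from dplyr and `str_trim()` from stringr on a made-up tibble.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up tibble whose first column name ends in stray whitespace.
toy <- tibble(
  `Daytime/evening attendance  ` = c(1, 0, 1),
  `Curricular units 1st sem (grade)` = c(12.5, 11, 13)
)

# Trim leading/trailing whitespace from every column name.
clean <- toy |> rename_with(str_trim)
names(clean)

# The clean name can now be used without guessing the padding.
clean |>
  group_by(`Daytime/evening attendance`) |>
  summarise(avg_1st_sem = mean(`Curricular units 1st sem (grade)`))
```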
```
p3<-dataset_1 |>
  ggplot()+
  geom_boxplot(mapping= aes(x=day_eve_class, y= `Curricular units 1st sem (grade)`))+
  labs(title="1st Semester Grades") + theme_classic()

p3
```

Hypothesis: The low count (and hence the lowest probability) of students in the “widower” category might reflect a gap in how the survey data was provided, or students in this category may have chosen not to disclose this information or not to participate in the data collection survey.

Let us group the data by Gender and Marital status and understand the Admission grades of these combinations of students.
```
dataset_1<-mutate(dataset_1,gender=ifelse(Gender==1,"male","female")) 
```
```
df4<-dataset_1|>
  group_by(gender,marital_status)|>
  summarise(avg_AdmGrade=mean(`Admission grade`),min_AdmGrade=min(`Admission grade`),max_AdmGrade=max(`Admission grade`), count=n())|>
  arrange(desc(avg_AdmGrade))
```
```
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
```
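The message above is informational: after summarising by two grouping columns, dplyr keeps the result grouped by the first one. As a sketch on made-up data (the `toy` tibble is hypothetical), passing `.groups = "drop"` to `summarise()` returns an ungrouped result and silences the message.

```r
library(dplyr)

# Made-up rows standing in for the real data.
toy <- tibble(
  gender = c("male", "male", "female", "female"),
  marital_status = c("single", "married", "single", "single"),
  grade = c(120, 130, 125, 135)
)

res <- toy |>
  group_by(gender, marital_status) |>
  summarise(avg = mean(grade), count = n(), .groups = "drop")

is_grouped_df(res)  # FALSE: no grouping is left on the result
res
```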
```
df4
```
Most students fall in the female single and male single categories, followed by the married groups. From this combination and the earlier results, students in the “widower” category have the highest mean admission grade, and the “legally separated” category has the lowest.

To visualize better:
```
library(dplyr)
library(ggplot2) 
```
Ungrouping the grouped data frame before passing it to ggplot:
```
df5<-df4|>
  ungroup() |>
  as.data.frame()

df5<-df5|>
  select(gender, marital_status, avg_AdmGrade) 
```
```
df5
```
```
p4 <- df5 |>
  ggplot(aes(x = marital_status, y = avg_AdmGrade, fill = gender)) +
  geom_col(position = position_dodge()) +  # same as geom_bar(stat = "identity")
  labs(title = "Average Admission Grades",
       x = "marital_status",
       y = "Average Value") +
  theme_minimal()

p4
```
The bar plot above shows that the “widower” category has the highest average admission grade of all student categories.