Week 2 | Data Dive — Summaries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading my data set into R and storing it in a data frame named “dataset”.

dataset <-read_delim("C:/Users/MSKR/MASTER'S_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Finding:
1. Minimum and maximum values of Unemployment rate.
2. Mean and median of Unemployment rate.
3. Standard deviation of Unemployment rate.
```
min_ur<- min(dataset$`Unemployment rate`)
min_ur
```
```
## [1] 7.6
```

max_ur<- max(dataset$`Unemployment rate`)
max_ur

## [1] 16.2

mean_ur <- mean(dataset$`Unemployment rate`)

mean_ur

## [1] 11.56614

med_ur<-median(dataset$`Unemployment rate`)
med_ur

## [1] 11.1

sd_ur<-sd(dataset$`Unemployment rate`)
sd_ur

## [1] 2.66385
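
The five statistics above can also be collected in a single pass. The following is a minimal sketch using dplyr's `summarise()` on a small made-up tibble (`toy`, with invented values), not on the actual data set; only the column name is borrowed from it.

```r
library(dplyr)

# Hypothetical stand-in for the real data; the values are invented.
toy <- tibble(`Unemployment rate` = c(7.6, 9.4, 11.1, 12.7, 16.2))

# One summarise() call returns all five statistics as a one-row tibble.
ur_summary <- toy |>
  summarise(
    min_ur  = min(`Unemployment rate`),
    max_ur  = max(`Unemployment rate`),
    mean_ur = mean(`Unemployment rate`),
    med_ur  = median(`Unemployment rate`),
    sd_ur   = sd(`Unemployment rate`)
  )

ur_summary
```

Replacing `toy` with `dataset` should give the same numbers as the separate calls, assuming the column contains no missing values.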

The variable “Unemployment rate” has a minimum value of 7.6 percent and a maximum value of 16.2 percent.

It has a mean of 11.566 percent, which is close to the median of 11.1 percent, and a standard deviation of 2.66 percentage points. This fairly small standard deviation suggests that the values cluster reasonably closely around the mean.
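
Whether a standard deviation counts as small depends on the scale of the variable. One common yardstick, which was not part of the original analysis, is the coefficient of variation (standard deviation divided by mean). A quick sketch using the two values printed earlier:

```r
# Mean and standard deviation as reported in the output above
mean_ur <- 11.56614
sd_ur   <- 2.66385

# Coefficient of variation: standard deviation as a fraction of the mean
cv_ur <- sd_ur / mean_ur

# Expressed as a percentage of the mean (about 23)
round(100 * cv_ur, 1)
```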

Let’s find out which course has the highest number of enrollments in our data set.
```
course_counts <- dataset|>
  group_by(Course) |>
  summarise(enrollments = n()) |>
  arrange(desc(enrollments))

course_counts
```
From the above result, course “9500” (Nursing) has the highest number of enrollments, with 766 students.
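
The same tally can be written more compactly. As a sketch on invented rows (the `toy` tibble below is hypothetical and its counts do not reflect the real data), dplyr's `count()` combines the grouping, counting and sorting steps, and `slice_max()` pulls out the top row:

```r
library(dplyr)

# Hypothetical course codes; the number of rows per code is invented.
toy <- tibble(Course = c(9500, 9500, 9500, 9147, 9147, 171))

# count() groups, tallies, and (with sort = TRUE) orders by descending count.
course_counts <- toy |>
  count(Course, sort = TRUE, name = "enrollments")

# slice_max() keeps only the row(s) with the largest count.
top_course <- course_counts |>
  slice_max(enrollments, n = 1)

top_course
```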

[Link for column descriptions: https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success]

Related questions that came up during the initial exploration of the data set:

1. What is the distribution of “Previous qualification (grade)” values among the students in the data set?
2. Are there any patterns or trends in “Daytime/evening attendance” that affect the “Target” outcome?
3. How are “Application mode” and “Admission grade” related to each other?
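
As one possible starting point for the second question, a cross-tabulation with row-wise proportions shows how outcomes split within each attendance group. This is only a sketch on invented records: the `toy` data frame, the short column name `attendance`, and the coding 1 = daytime, 0 = evening are all assumptions made for illustration.

```r
# Hypothetical records; 1 = daytime, 0 = evening is an assumed coding.
toy <- data.frame(
  attendance = c(1, 1, 1, 1, 0, 0, 0, 1),
  Target = c("Graduate", "Dropout", "Graduate", "Enrolled",
             "Dropout", "Dropout", "Graduate", "Graduate")
)

# Counts of each Target outcome within each attendance group
tab <- table(toy$attendance, toy$Target)

# Row-wise proportions: share of each outcome per attendance group
props <- prop.table(tab, margin = 1)
round(props, 2)
```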

# Aggregation example: summary statistics of Admission grade by Application mode
aggregated_data <- dataset |>
  group_by(Application_mode = `Application mode`) |>
  summarise(
    mean_value = mean(`Admission grade`),
    min_value = min(`Admission grade`),
    max_value = max(`Admission grade`),
    median_value = median(`Admission grade`)
  )
print(aggregated_data)

## # A tibble: 18 × 5
##    Application_mode mean_value min_value max_value median_value
##               <dbl>      <dbl>     <dbl>     <dbl>        <dbl>
##  1                1       128.      95        184.         126.
##  2                2       122.     120        122.         122 
##  3                5       129.     110.       159.         124.
##  4                7       132.     100        190          130 
##  5               10       148.     116.       184.         150.
##  6               15       126.     100        190          130 
##  7               16       131.      95        167.         127.
##  8               17       125.      95        178          124.
##  9               18       123.     101        175.         122.
## 10               26       122.     122.       122.         122.
## 11               27       130      130        130          130 
## 12               39       126.      95        190          123 
## 13               42       124.     100        161          123 
## 14               43       122.      96        170          122.
## 15               44       140.     105.       180          140 
## 16               51       121.      99.7      150          120 
## 17               53       138.     112.       170          138.
## 18               57       100      100        100          100

From the above metrics, students admitted through application modes “1”, “17” and “43” appear to have similar ranges of admission grades (minimums of 95 to 96 and maximums of roughly 170 to 184) on entering the study program.
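
One rough way to check this reading is to compute the width of the range (maximum minus minimum) for each mode. The sketch below reuses the minimum and maximum values for the three modes as displayed in the printed tibble, so they are rounded; the `modes` data frame and the `range_width` column are names introduced here for illustration.

```r
# Minimum and maximum Admission grade for three modes, copied from the
# printed tibble above (values are rounded as displayed there).
modes <- data.frame(
  Application_mode = c(1, 17, 43),
  min_value = c(95, 95, 96),
  max_value = c(184, 178, 170)
)

# Width of the observed range for each mode
modes$range_width <- modes$max_value - modes$min_value
modes
```

With these rounded inputs the widths come out as 89, 83 and 74 grade points.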

Let us observe the spread of students’ previous grades with the help of a histogram.

# Histogram for "Previous qualification (grade)"
ggplot(dataset, aes(x = `Previous qualification (grade)`)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Previous Grades", x = "Previous qualification (grade)", y = "Frequency")

The histogram shows that most students have previous grades close to the median and mean of the observations. Visually, the distribution looks roughly bell-shaped, although a plot alone does not establish that it is normal.
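
A visual impression of normality can be checked more formally. As a hedged sketch on simulated values (`grades` is generated here, not read from the data set), a Q-Q plot and the Shapiro-Wilk test from base R are two common options; neither was part of the original analysis.

```r
# Simulated grades for illustration only; mean and sd are arbitrary choices.
set.seed(42)
grades <- rnorm(200, mean = 130, sd = 13)

# Q-Q plot: points close to the reference line are consistent with normality.
qqnorm(grades)
qqline(grades)

# Shapiro-Wilk test: a small p-value is evidence against normality.
# (shapiro.test() accepts between 3 and 5000 observations.)
result <- shapiro.test(grades)
result$p.value
```

With several thousand observations the test tends to flag even minor departures from normality, so the Q-Q plot is usually the more informative of the two.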

Let’s compare how previous grades are distributed across the three “Target” categories.
```
# Boxplot to show relation
dataset |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Target, y = `Previous qualification (grade)`)) +
  labs(title = "Previous qualification (grade)") +  # labels!
  theme_minimal()
```
The resulting box plot describes the three Target groups as follows:

Dropout: The median grade is lower compared to the other groups. The IQR is relatively narrow, indicating less variability in grades.

Enrolled: The median grade is higher than Dropout but lower than Graduate. The IQR is wider, suggesting more variability in grades.

Graduate: This group has the highest median grade. The IQR is also wide, indicating a broad range of grades.
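
The medians and IQRs read off the box plot can be confirmed numerically. The sketch below shows the pattern on invented values (the `toy` tibble and its `grade` column are hypothetical, and the numbers are not taken from the data set), using `median()` and `IQR()` inside a grouped `summarise()`:

```r
library(dplyr)

# Hypothetical grades, invented for illustration only.
toy <- tibble(
  Target = rep(c("Dropout", "Enrolled", "Graduate"), each = 4),
  grade  = c(120, 125, 130, 135,
             122, 128, 134, 140,
             125, 132, 138, 145)
)

# Median and interquartile range of grade within each Target group
group_stats <- toy |>
  group_by(Target) |>
  summarise(
    median_grade = median(grade),
    iqr_grade    = IQR(grade)
  )

group_stats
```

Applied to `dataset` with `Previous qualification (grade)` in place of `grade`, the same call would give the exact figures behind each box.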