library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTER'S_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Finding 1. Minimum, maximum, mean, median, and standard deviation of the unemployment rate.
min_ur <- min(dataset$`Unemployment rate`)
min_ur
## [1] 7.6
max_ur <- max(dataset$`Unemployment rate`)
max_ur
## [1] 16.2
mean_ur <- mean(dataset$`Unemployment rate`)
mean_ur
## [1] 11.56614
med_ur <- median(dataset$`Unemployment rate`)
med_ur
## [1] 11.1
sd_ur <- sd(dataset$`Unemployment rate`)
sd_ur
## [1] 2.66385
The variable “Unemployment rate” has a minimum of 7.6 percent and a maximum of 16.2 percent.
Its mean is 11.566 percent, which is close to the median of 11.1 percent, and its standard deviation is 2.66 percentage points. This standard deviation is small relative to the 8.6-point range of the data, which indicates that most values lie fairly close to the mean.
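One way to put a number on how tightly the values cluster is the coefficient of variation (standard deviation divided by the mean). The sketch below reuses the summary values printed above instead of re-reading the file; `cv_ur` and `gap_in_sd` are illustrative names that do not appear elsewhere in this analysis.

```r
# Summary values copied from the output above
mean_ur <- 11.56614
med_ur  <- 11.1
sd_ur   <- 2.66385

# Coefficient of variation: spread relative to the mean
cv_ur <- sd_ur / mean_ur

# Gap between mean and median, expressed in standard deviations
gap_in_sd <- (mean_ur - med_ur) / sd_ur

round(cv_ur, 3)      # about 0.23
round(gap_in_sd, 3)  # about 0.175
```

A coefficient of variation of roughly 0.23 means a typical deviation is a little under a quarter of the mean. The mean sits less than 0.2 standard deviations above the median, which is consistent with only a mild right skew.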
Let’s find out which course has the highest number of enrollments in our data set.
course_counts <- dataset |>
  group_by(Course) |>
  summarise(enrollments = n()) |>
  arrange(desc(enrollments))
course_counts
From the above result, course “9500” (Nursing) has the highest number of enrollments, with 766 students.
[Link for column descriptions: https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success]
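The same count can be written more compactly with `count()`, and `slice_max()` can pull out the top course directly. This is a minimal sketch on hypothetical toy data: only the code 9500 comes from the analysis, while the other course codes, the row counts, and the names `toy`, `course_counts_toy` and `top_course` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: codes 1001 and 1002 and all counts are illustrative
toy <- tibble(Course = c(9500, 9500, 9500, 1001, 1001, 1002))

# count() groups, tallies, and sorts in one step
course_counts_toy <- toy |>
  count(Course, sort = TRUE, name = "enrollments")

# Keep only the course with the most enrollments
top_course <- course_counts_toy |>
  slice_max(enrollments, n = 1)

top_course
```

On the real data, replacing `toy` with `dataset` should give the same ordering as the `group_by()` and `summarise()` version.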
A related question from the initial exploration is how the admission grade varies by application mode:
# Aggregation example: summary statistics of admission grade by application mode
aggregated_data <- dataset |>
  group_by(Application_mode = `Application mode`) |>  # refer to the column directly, not via dataset$
  summarise(
    mean_value = mean(`Admission grade`), min_value = min(`Admission grade`),
    max_value = max(`Admission grade`), median_value = median(`Admission grade`)
  )
print(aggregated_data)
## # A tibble: 18 × 5
## Application_mode mean_value min_value max_value median_value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 128. 95 184. 126.
## 2 2 122. 120 122. 122
## 3 5 129. 110. 159. 124.
## 4 7 132. 100 190 130
## 5 10 148. 116. 184. 150.
## 6 15 126. 100 190 130
## 7 16 131. 95 167. 127.
## 8 17 125. 95 178 124.
## 9 18 123. 101 175. 122.
## 10 26 122. 122. 122. 122.
## 11 27 130 130 130 130
## 12 39 126. 95 190 123
## 13 42 124. 100 161 123
## 14 43 122. 96 170 122.
## 15 44 140. 105. 180 140
## 16 51 121. 99.7 150 120
## 17 53 138. 112. 170 138.
## 18 57 100 100 100 100
From the above metrics, students admitted through application modes “1”, “17” and “43” have similar admission-grade distributions: minimums of 95 to 96, medians of roughly 122 to 126, and maximums between 170 and 184.
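To make the comparison of these three modes concrete, the width of each grade range can be computed from the table. The sketch below copies the three rows as printed above, so the values are rounded to the displayed precision; `modes` and `range_value` are illustrative names.

```r
# Rows for modes 1, 17 and 43, copied from the rounded output above
modes <- data.frame(
  Application_mode = c(1, 17, 43),
  min_value        = c(95, 95, 96),
  max_value        = c(184, 178, 170),
  median_value     = c(126, 124, 122)
)

# Width of the observed admission-grade range for each mode
modes$range_value <- modes$max_value - modes$min_value
modes
```

The ranges come out at roughly 89, 83 and 74 points, so the three modes overlap heavily without being identical.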
Let us observe the spread of students’ previous grades with a histogram.
# Histogram for "Previous qualification (grade)"
ggplot(dataset, aes(x = `Previous qualification (grade)`)) +  # column name only; no dataset$ inside aes()
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Previous Grades", x = "Previous qualification (grade)", y = "Frequency")
The histogram shows that most students have previous grades close to the median and mean of the observations. Visually, the distribution looks roughly bell-shaped, although this impression alone does not establish normality.
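A visual impression of normality can be checked with a formal test such as Shapiro-Wilk, available in base R as `shapiro.test()`. The sketch below runs on simulated data, not on the data set: the sample size, mean, standard deviation, and the names `grades_sim` and `sw` are all hypothetical.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Hypothetical stand-in for the grade column: 500 simulated values
grades_sim <- rnorm(500, mean = 130, sd = 13)

# Shapiro-Wilk test; the null hypothesis is that the data are normal.
# shapiro.test() accepts between 3 and 5000 non-missing values.
sw <- shapiro.test(grades_sim)
sw$p.value

# With the real data the call would be:
# shapiro.test(dataset$`Previous qualification (grade)`)
```

With several thousand rows, even small departures from normality tend to give small p-values, so a Q-Q plot drawn with `qqnorm()` and `qqline()` is a useful companion to the test.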
Let’s compare how previous grades are distributed across the three Target categories.
# Boxplot of previous grades by Target category
dataset |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Target, y = `Previous qualification (grade)`)) +
  labs(title = "Previous qualification (grade)") +  # plot title
  theme_minimal()
The resulting box plot shows the following for each Target group:
Dropout: The median grade is lower compared to the other groups. The IQR is relatively narrow, indicating less variability in grades.
Enrolled: The median grade is higher than Dropout but lower than Graduate. The IQR is wider, suggesting more variability in grades.
Graduate: This group has the highest median grade. The IQR is also wide, indicating a broad range of grades.
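The medians and IQRs read off the box plot can also be computed directly, which avoids judging them by eye. This is a minimal sketch on hypothetical toy data: the grades and the names `toy`, `grade` and `target_summary` are made up for illustration and are not taken from the data set.

```r
library(dplyr)

# Hypothetical toy data; grades are illustrative only
toy <- tibble(
  Target = rep(c("Dropout", "Enrolled", "Graduate"), each = 4),
  grade  = c(110, 120, 125, 130,
             115, 125, 135, 145,
             120, 130, 140, 150)
)

# Median, interquartile range, and group size per Target category
target_summary <- toy |>
  group_by(Target) |>
  summarise(
    median_grade = median(grade),
    iqr_grade    = IQR(grade),
    n            = n()
  )

target_summary
```

On the real data, `toy` would be replaced by `dataset` and `grade` by `` `Previous qualification (grade)` ``, giving one row per Target group to set against the plot.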