library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Recode the numeric marital-status codes into readable labels ("no" for any other code)
dataset_1 <- mutate(dataset, marital_status = ifelse(`Marital status` == 1, "single",
  ifelse(`Marital status` == 2, "married",
  ifelse(`Marital status` == 3, "widower",
  ifelse(`Marital status` == 4, "divorced",
  ifelse(`Marital status` == 5, "facto union",
  ifelse(`Marital status` == 6, "legally separated", "no")))))))
# Label attendance as "day" (code 1) or "evening" (any other code)
dataset_1 <- mutate(dataset_1, day_eve_class = ifelse(`Daytime/evening attendance ` == 1, "day", "evening"))
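The nested `ifelse()` recoding above can also be expressed as a lookup, which some readers find easier to extend. The following is only a sketch under stated assumptions: `marital_labels` and `codes` are hypothetical names, the `codes` vector is made-up toy input rather than the real column, and the labels follow the recoding above (with "separated" spelled conventionally).

```r
# Hypothetical lookup table: numeric code (as a string) -> label
marital_labels <- c(
  "1" = "single", "2" = "married", "3" = "widower",
  "4" = "divorced", "5" = "facto union", "6" = "legally separated"
)

# Toy stand-in for the `Marital status` column (not real data)
codes <- c(1, 2, 6, 4, 9, NA)

# Index the lookup by the code converted to a string; unmatched codes come back NA
labels <- unname(marital_labels[as.character(codes)])

# Mirror the original fallback: a code outside 1-6 becomes "no", a missing code stays NA
labels[is.na(labels) & !is.na(codes)] <- "no"

labels
```

Inside the pipeline this would correspond to something like `mutate(dataset, marital_status = ...)` with the same lookup applied to `` `Marital status` ``; that wiring is left out here because it depends on the real file.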
dataset_1[c('Curricular units 1st sem (approved)','Curricular units 1st sem (evaluations)')]
## # A tibble: 4,424 × 2
## `Curricular units 1st sem (approved)` Curricular units 1st sem (evaluations…¹
## <dbl> <dbl>
## 1 0 0
## 2 6 6
## 3 0 0
## 4 6 8
## 5 5 9
## 6 5 10
## 7 7 9
## 8 0 5
## 9 6 8
## 10 5 9
## # ℹ 4,414 more rows
## # ℹ abbreviated name: ¹`Curricular units 1st sem (evaluations)`
At first the columns ‘Curricular units 1st sem (approved)’ and ‘Curricular units 1st sem (evaluations)’ were confusing, because both refer to the courses a student takes in the 1st semester.
The attribute definitions clear this up: a student can enroll in a certain number of courses in a given semester, and may choose to take all of them or fewer.
‘Curricular units 1st sem (approved)’ denotes the number of courses approved by the university, i.e. the courses the student is registered for and will study in that semester.
‘Curricular units 1st sem (evaluations)’ denotes the number of evaluations used to grade the student in those courses at the end of the semester.
The column “Nacionality” holds integer values such as 1, 21, 65 and 108, which are not consecutive.
The initial assumption was that every country in the world is numbered in alphabetical order, but the data definitions show that the students’ countries are drawn randomly from Europe, South America, Africa and Asia.
These countries are numbered according to their location on the world map, i.e., by latitude (which does not affect our analysis of a student’s academic performance).
The attribute “Application mode” is a categorical column with integer values from 1 through 57.
At first glance this suggests, surprisingly, that a student could have entered the university through 57 different modes of admission.
However, the numbers are not sequential and there are not 57 different modes: they are simply the codes given to 18 categories, because the data covers only a few sets of courses.
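A quick way to tell "largest code" apart from "number of categories" is to count the distinct values directly. The sketch below uses a made-up vector, `mode_codes`, as a stand-in for the coded column; the variable names and values are illustrative assumptions, not taken from the data set.

```r
# Toy stand-in for a coded categorical column (not real data)
mode_codes <- c(1, 17, 39, 1, 57, 17, 43, 1)

# Distinct codes actually present, and how many there are
distinct_codes <- sort(unique(mode_codes))
n_categories <- length(distinct_codes)

# Smallest and largest code
code_range <- range(mode_codes)

# Codes are consecutive only if the distinct count fills the whole range
is_consecutive <- n_categories == diff(code_range) + 1

# Frequency of each code
table(mode_codes)
```

Here the largest code is 57 while only five distinct codes occur, so `is_consecutive` is `FALSE`. The same check on the real column would show how many categories sit behind the 1 to 57 range.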
Three further columns are “Unemployment rate”, “Inflation rate” and “GDP”.
These columns were assumed to represent metrics of the student’s country, but the values show a trend: all three are mutually dependent and highly correlated with one another, irrespective of country and student.
Logically, these metrics should not affect a student’s academic performance either, as they lie outside the scope of assessing a student in their course and estimating our Target.
# geom_col() stacks the individual row values within each Target class
p <- ggplot(dataset_1, aes(x = Target, y = `Unemployment rate`)) +
  geom_col()
p
p1 <- ggplot(dataset_1, aes(x = Target, y = `Inflation rate`)) +
  geom_col()
p1
p2 <- ggplot(dataset_1, aes(x = Target, y = GDP)) +
  geom_col()
p2
As the bar plots show, the distributions of all three attributes across Target are very similar, and the attributes are closely correlated with one another.
Therefore, both logically and statistically, these columns are unlikely to yield useful insights for estimating our Target, and they are ignored for the time being.
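The correlation claim can also be checked numerically, and per-class means are a useful complement to bars built with `geom_col()`, whose height grows with the number of rows in each class. The sketch below is an assumption-laden illustration: `toy` is a small made-up data frame, and the short column names `unemployment`, `inflation` and `gdp` are hypothetical stand-ins for the real columns.

```r
# Toy stand-in for Target and the three macroeconomic columns (not real data)
toy <- data.frame(
  Target = c("Dropout", "Graduate", "Enrolled", "Graduate", "Dropout", "Graduate"),
  unemployment = c(10.8, 13.9, 9.4, 16.2, 12.4, 7.6),
  inflation = c(1.4, -0.3, -0.8, 0.3, 0.5, 2.6),
  gdp = c(1.74, 0.79, -3.12, -0.92, 1.79, 0.32)
)

# Pairwise Pearson correlations among the three metrics
cor_matrix <- cor(toy[, c("unemployment", "inflation", "gdp")])

# Mean of each metric within each Target class (independent of class size)
group_means <- aggregate(
  cbind(unemployment, inflation, gdp) ~ Target,
  data = toy, FUN = mean
)

cor_matrix
group_means
```

On the real data, off-diagonal entries of `cor_matrix` close to 1 or -1 would support the correlation claim, and near-identical rows in `group_means` would support the claim that the metrics do not separate the Target classes.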