library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Importing data from CSV into “dataset”:
dataset <-read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset_1<-mutate(dataset, marital_status = ifelse(dataset$`Marital status` == 1, "single",
ifelse(`Marital status` == 2, "married",
ifelse(`Marital status` == 3, "widower",
ifelse(`Marital status` == 4, "divorced",
ifelse(`Marital status` == 5, "facto union",
ifelse(`Marital status` == 6, "legally seperated", "no")))))))
dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance ` == 1, "day","evening"))
We have two semesters’ results for each student in our data set, let us find how the results in the first two semesters is related to the Target.
The attribute “Target” in our data set is a categorical column with the values “Dropout”,“Enrolled” and “Graduate”.
Let us first create a new column which represents these classes as integers 0,1 and 2 respectively (this column will not be used in visualizations as integers may not make sense in reading a graph).
dataset_1<-mutate(dataset, target = ifelse(dataset$Target == "Graduate",2,
ifelse(Target == "Enrolled",1,
ifelse(Target == "Dropout", 0, "no"))))
dataset_1<-mutate(dataset_1, sem_results= rowMeans(select(dataset_1,`Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
In this case, our response variable is “target” and the explanatory variable is “sem_results”.
A scatter plot to visualize the response and explanatory pair would look like:
p1<- dataset_1|>
ggplot(aes(x = Target, y =sem_results)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "sem_resuts vs target", x = "Target", y = "sem_results")
p1
## `geom_smooth()` using formula = 'y ~ x'
It clearly shows that the average semester grades for first two semesters of Graduate students are more than compared to other two category students. Yet, there are few outliers in Graduate category which can be examined in further analysis.
For majority dropout students, 2nd semester grades are 0 because they either left the courses halfway through or entirely have not registered for any course in 2nd semester. Therefore, the trend is dependent mostly on 1st semester grades
To analyse the same using a box plot:
p2<- dataset_1|>
ggplot(aes(x = Target, y =`Curricular units 1st sem (grade)`)) +
geom_boxplot() +
#geom_smooth(method = "lm", se = FALSE) +
labs(title = "1st sem vs target", x = "Target", y = "1st Sem results")
p2
As expected, the average grades of Graduate students is higher than that of Dropout and Enrolled students.
So there might be considerable rate of co-relation between the 1st semester grades, sem_results and target. Let us calculate the co-relation values:
c1 <- cor(dataset_1$`Curricular units 1st sem (grade)`, as.numeric(dataset_1$target))
c_avg<-cor(dataset_1$sem_results, as.numeric(dataset_1$target))
print(c1)
## [1] 0.485213
print(c_avg)
## [1] 0.5503571
Analyzing in the same way to understand the relation between mean values of Previous Grade and Admission grades with 2nd semester results of a student to understand if the student is still continuing his/her activeness in academics or is it dropped.
dataset_1<-mutate(dataset_1, past_grade= rowMeans(select(dataset_1,`Previous qualification (grade)`,`Admission grade`)))
dataset_2<-filter(dataset_1,`Curricular units 2nd sem (grade)`!=0)
p3<- dataset_2|>
ggplot(aes(x = past_grade, y =`Curricular units 2nd sem (grade)`)) +
geom_point(size=1, shape=1) +
#geom_smooth(method = "lm", se = FALSE) +
labs(title = "1st sem vs target", x = "Past Grades", y = "2nd Sem results")
p3
If observed in the scatter plot, the majority of the students whose past grades are near to average values have average performance in their 2nd semester as well, whereas the students with higher past grades have a diversified spread of their performance which is not expected or assumed in our hypothesis.
Finding the cor-relation between past grades and 2nd semester grades of a student (spearman correlation is used as Target is an ordered variable and not a continuous variable) :
c2<-cor(dataset_2$`Curricular units 2nd sem (grade)`, as.numeric(dataset_2$past_grade), method="spearman")
print(c2)
## [1] 0.2809978
Building a confidence interval for both the response variables;
# Confidence interval for past grades
library(boot)
boot_mean <- function(dataset_1, indices) {
return(mean(dataset_1[indices]))
}
set.seed(123) # For reproducibility
results <- boot(dataset_1$past_grade, boot_mean, R = 1000)
# Confidence intervals
ci <- quantile(results$t, c(0.025, 0.975))
print(ci)
## 2.5% 97.5%
## 129.4382 130.1583
The interval represents a 95% confidence level.
The lowest value you would expect the true mean to be 129.4382 and the highest value to be expected is 130.1583.
This tells that that the average past grade in the population from which our sample was drawn is between approximately 129.4382 and 130.1583.
The spread is very narrow which implies more precision in our estimations.
# Confidence interval for sem results
library(boot)
boot_mean <- function(dataset_1, indices) {
return(mean(dataset_1[indices]))
}
set.seed(123) # For reproducibility
results <- boot(dataset_1$sem_results, boot_mean, R = 1000)
# Confidence intervals
ci <- quantile(results$t, c(0.025, 0.975))
print(ci)
## 2.5% 97.5%
## 10.29941 10.59161
The lowest value you would expect the true mean to be 10.3 and the highest value to be expected is 10.6.
Again the spread is very narrow which implies more precision in our estimations.