Week6_DataDive_Confidence Intervals

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data from CSV into “dataset”:

dataset <-read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")

## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset_1<-mutate(dataset, marital_status = ifelse(dataset$`Marital status` == 1, "single",
                    ifelse(`Marital status` == 2, "married",
                    ifelse(`Marital status` == 3, "widower",
                    ifelse(`Marital status` == 4, "divorced",
                    ifelse(`Marital status` == 5, "facto union",
                    ifelse(`Marital status` == 6, "legally seperated", "no")))))))

dataset_1<-mutate(dataset_1, day_eve_class= ifelse(dataset_1$`Daytime/evening attendance    ` == 1, "day","evening"))

We have two semesters’ results for each student in our data set, let us find how the results in the first two semesters is related to the Target.
- The attribute “Target” in our data set is a categorical column with the values “Dropout”,“Enrolled” and “Graduate”.
- Let us first create a new column which represents these classes as integers 0,1 and 2 respectively (this column will not be used in visualizations as integers may not make sense in reading a graph).
```
dataset_1<-mutate(dataset, target = ifelse(dataset$Target == "Graduate",2,
                    ifelse(Target == "Enrolled",1,
                    ifelse(Target == "Dropout", 0, "no"))))
```
- Now, since we have results from two semesters, we can use the mean of both the values and find a relation between semester results and Target classification.
```
dataset_1<-mutate(dataset_1, sem_results= rowMeans(select(dataset_1,`Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
```
- In this case, our response variable is “target” and the explanatory variable is “sem_results”.
- A scatter plot to visualize the response and explanatory pair would look like:
```
p1<- dataset_1|>
  ggplot(aes(x = Target, y =sem_results)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "sem_resuts vs target", x = "Target", y = "sem_results")
p1
```
```
## `geom_smooth()` using formula = 'y ~ x'
```
- It clearly shows that the average semester grades for first two semesters of Graduate students are more than compared to other two category students. Yet, there are few outliers in Graduate category which can be examined in further analysis.
- For majority dropout students, 2nd semester grades are 0 because they either left the courses halfway through or entirely have not registered for any course in 2nd semester. Therefore, the trend is dependent mostly on 1st semester grades
- To analyse the same using a box plot:
```
p2<- dataset_1|>
  ggplot(aes(x = Target, y =`Curricular units 1st sem (grade)`)) +
  geom_boxplot() +
  #geom_smooth(method = "lm", se = FALSE) +
  labs(title = "1st sem vs target", x = "Target", y = "1st Sem results")
p2
```
- As expected, the average grades of Graduate students is higher than that of Dropout and Enrolled students.
- So there might be considerable rate of co-relation between the 1st semester grades, sem_results and target. Let us calculate the co-relation values:
```
c1 <- cor(dataset_1$`Curricular units 1st sem (grade)`, as.numeric(dataset_1$target))

c_avg<-cor(dataset_1$sem_results, as.numeric(dataset_1$target))

print(c1)
```
```
## [1] 0.485213
```
```
print(c_avg)
```
```
## [1] 0.5503571
```
- So as observed, there is nearly 50% co-relation between the average semester results and target which is a good sign of positive trend which explains that ’as the grades increases, there is a high chance that the student is going to be a Graduate and have less chances of dropping out from the course work.
Analyzing in the same way to understand the relation between mean values of Previous Grade and Admission grades with 2nd semester results of a student to understand if the student is still continuing his/her activeness in academics or is it dropped.
- To achieve this, we shall mutate a new column with mean values of Previous Grade and Admission grades.
```
dataset_1<-mutate(dataset_1, past_grade= rowMeans(select(dataset_1,`Previous qualification (grade)`,`Admission grade`)))
```
- Filtering the data to look only into Enrolled and Graduate students by elimination dropout students:
```
dataset_2<-filter(dataset_1,`Curricular units 2nd sem (grade)`!=0)
```
```
p3<- dataset_2|>
  ggplot(aes(x = past_grade, y =`Curricular units 2nd sem (grade)`)) +
  geom_point(size=1, shape=1) +
  #geom_smooth(method = "lm", se = FALSE) +
  labs(title = "1st sem vs target", x = "Past Grades", y = "2nd Sem results")
p3
```
- If observed in the scatter plot, the majority of the students whose past grades are near to average values have average performance in their 2nd semester as well, whereas the students with higher past grades have a diversified spread of their performance which is not expected or assumed in our hypothesis.
- Finding the cor-relation between past grades and 2nd semester grades of a student (spearman correlation is used as Target is an ordered variable and not a continuous variable) :
```
c2<-cor(dataset_2$`Curricular units 2nd sem (grade)`, as.numeric(dataset_2$past_grade), method="spearman")
print(c2)
```
```
## [1] 0.2809978
```
- Hence, it shows very less co-relation (28%) as observed in the scatter plot which means either a student must have performed good enough during his admission test but not actually good in his academics or must have performed good in his previous grades but went down during his Admission test. Either can be possible.

Building a confidence interval for both the response variables;

# Confidence interval for past grades
library(boot)

boot_mean <- function(dataset_1, indices) {
  return(mean(dataset_1[indices]))
}

set.seed(123)  # For reproducibility
results <- boot(dataset_1$past_grade, boot_mean, R = 1000)

# Confidence intervals
ci <- quantile(results$t, c(0.025, 0.975))
print(ci)

##     2.5%    97.5% 
## 129.4382 130.1583

The interval represents a 95% confidence level.
The lowest value you would expect the true mean to be 129.4382 and the highest value to be expected is 130.1583.
This tells that that the average past grade in the population from which our sample was drawn is between approximately 129.4382 and 130.1583.
The spread is very narrow which implies more precision in our estimations.

# Confidence interval for sem results
library(boot)

boot_mean <- function(dataset_1, indices) {
  return(mean(dataset_1[indices]))
}

set.seed(123)  # For reproducibility
results <- boot(dataset_1$sem_results, boot_mean, R = 1000)

# Confidence intervals
ci <- quantile(results$t, c(0.025, 0.975))
print(ci)

##     2.5%    97.5% 
## 10.29941 10.59161

The lowest value you would expect the true mean to be 10.3 and the highest value to be expected is 10.6.
Again the spread is very narrow which implies more precision in our estimations.

Week6_DataDive_Confidence Intervals

Kiran

2024-10-08