Importing the dataset:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating a working copy of the table and recoding the necessary categorical columns:

# Work on a copy so the raw import stays untouched
dataset_1 <- dataset %>%
  mutate(
    # Recode the numeric marital status codes into labels
    marital_status = case_when(
      `Marital status` == 1 ~ "single",
      `Marital status` == 2 ~ "married",
      `Marital status` == 3 ~ "widower",
      `Marital status` == 4 ~ "divorced",
      `Marital status` == 5 ~ "facto union",
      `Marital status` == 6 ~ "legally separated",
      .default = "no"
    ),
    day_eve_class = ifelse(`Daytime/evening attendance    ` == 1, "day", "evening"),
    # Numeric outcome code; unmatched values become NA so the column stays numeric
    target = case_when(
      Target == "Graduate" ~ 2,
      Target == "Enrolled" ~ 1,
      Target == "Dropout" ~ 0
    ),
    # Mean of the 1st and 2nd semester grades
    sem_results = rowMeans(pick(`Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`))
  )
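
One thing to watch when recoding with nested `ifelse()`: mixing numeric results with a character fallback such as "no" turns the whole column into character. The base-R sketch below uses a made-up three-element vector (`target_raw` is hypothetical, not the real data) to show the difference between a character fallback and an `NA` fallback.

```r
# Toy vector standing in for the Target column (hypothetical, not the real data)
target_raw <- c("Graduate", "Enrolled", "Dropout")

# Character fallback: every recoded value is coerced to character
coded_chr <- ifelse(target_raw == "Graduate", 2,
             ifelse(target_raw == "Enrolled", 1,
             ifelse(target_raw == "Dropout", 0, "no")))
class(coded_chr)

# NA fallback: the recoded values stay numeric
coded_num <- ifelse(target_raw == "Graduate", 2,
             ifelse(target_raw == "Enrolled", 1,
             ifelse(target_raw == "Dropout", 0, NA)))
class(coded_num)
```

A numeric column matters if the coded outcome is later used in arithmetic or as a numeric predictor.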

ANOVA test:

1. Response Variable Selection

The response variable selected for this analysis is “sem_results”, the mean of each student's first- and second-semester grades.

2. Explanatory Variable Selection

The categorical variable chosen is “marital_status”, which classifies students as ‘single’, ‘married’, ‘widower’, ‘divorced’, ‘facto union’ or ‘legally separated’. We want to test whether mean semester results differ across these groups.

3. Null Hypothesis for ANOVA

The null hypothesis for the ANOVA test is:

H0: The mean semester result is the same for students in every marital status group.
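
In symbols, writing \mu_j for the population mean of sem_results in marital status group j (this notation is introduced here and is not part of the original analysis), the hypotheses can be sketched as:

```latex
H_0:\ \mu_{\text{single}} = \mu_{\text{married}} = \mu_{\text{widower}}
      = \mu_{\text{divorced}} = \mu_{\text{facto union}} = \mu_{\text{legally separated}}
\qquad
H_1:\ \text{at least one } \mu_j \text{ differs from the others}
```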

library(tidyverse)
# Conduct ANOVA
anova_result <- aov(sem_results ~ marital_status, data = dataset_1)
summary(anova_result)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## marital_status    5    690  138.04   5.979 1.61e-05 ***
## Residuals      4418 102006   23.09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Df (Degrees of Freedom):

  • marital_status: 5 degrees of freedom indicates that there are 6 groups based on the different marital statuses being compared (since degrees of freedom = number of groups - 1).

  • Residuals: 4418 degrees of freedom, the total number of observations minus the number of groups (4424 - 6).

Sum Sq (Sum of Squares):

  • marital_status: 690, which represents the variation explained by the different marital status groups (between-group variation).

  • Residuals: 102,006, representing the variation within the groups (within-group variation).

Mean Sq (Mean Squares):

  • marital_status: 138.04, calculated as the Sum of Squares for marital status divided by its degrees of freedom (about 690 / 5; the printed 690 is rounded).

  • Residuals: 23.09, calculated as the Sum of Squares for the residuals divided by its degrees of freedom (102,006 / 4418).

F value: 5.979

  • This is the ratio of the Mean Square for marital status to the Mean Square for the residuals (138.04 / 23.09). A higher F value indicates that the variability between the group means is larger relative to the variability within the groups.
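
The arithmetic behind the Mean Sq and F value entries can be retraced from the printed table. This is a small sketch using the rounded values shown in the output, so the result agrees with R's 5.979 only up to rounding; the variable names are chosen here for illustration.

```r
# Values copied from the printed ANOVA table (already rounded there)
ss_between <- 690
df_between <- 5
ss_within  <- 102006
df_within  <- 4418

# Mean squares: sum of squares divided by degrees of freedom
ms_between <- ss_between / df_between   # about 138
ms_within  <- ss_within / df_within     # about 23.09

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within       # close to the reported 5.979
f_value
```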

Pr(>F): 1.61e-05

  • This p-value is very small (< 0.001), so we reject the null hypothesis: the data give strong evidence that at least one marital status group has a different mean semester result.

  • The test shows an association between marital status and sem_results; it does not show which groups differ, nor that marital status causes the difference.
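
To see where the p-value comes from, it can be recomputed as the upper tail of the F distribution with 5 and 4418 degrees of freedom. The sketch below also computes eta squared, the share of total variation attributed to marital status. This effect size is not part of the original output and is added here only as a rough gauge of how large the difference is.

```r
# Upper-tail probability of the F distribution at the reported F value
p_value <- pf(5.979, df1 = 5, df2 = 4418, lower.tail = FALSE)
p_value   # should be close to the reported 1.61e-05

# Eta squared from the (rounded) sums of squares in the table:
# between-group variation divided by total variation
eta_sq <- 690 / (690 + 102006)
eta_sq
```

With the rounded table values, eta squared comes out below 0.01, so the group differences, while statistically detectable, account for under 1% of the variation in sem_results.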

Linear Regression

Next, we examine whether students' prior performance predicts their semester results. The predictor, prev_perf, is the mean of each student's “Previous qualification (grade)” and “Admission grade”.

# prev_perf: mean of the previous qualification grade and the admission grade
dataset_1 <- mutate(dataset_1, prev_perf = rowMeans(select(dataset_1, `Previous qualification (grade)`, `Admission grade`)))
# Linear regression model
lm_model <- lm(sem_results ~ prev_perf , data = dataset_1)
summary(lm_model)
## 
## Call:
## lm(formula = sem_results ~ prev_perf, data = dataset_1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0442   0.5619   1.8408   2.8242   7.8444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.522300   0.765659   8.519  < 2e-16 ***
## prev_perf   0.030150   0.005873   5.134 2.96e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.805 on 4422 degrees of freedom
## Multiple R-squared:  0.005925,   Adjusted R-squared:  0.0057 
## F-statistic: 26.36 on 1 and 4422 DF,  p-value: 2.96e-07
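
The printed coefficients can be read as the fitted line sem_results = 6.5223 + 0.03015 × prev_perf. The sketch below retraces the t value and F-statistic from the coefficient table and evaluates the line at a made-up prev_perf of 130 (a hypothetical input, chosen only for illustration).

```r
# Coefficients copied from the printed summary
intercept <- 6.522300
slope     <- 0.030150
slope_se  <- 0.005873

# t value is the estimate divided by its standard error
t_value <- slope / slope_se        # about 5.13

# With a single predictor, the overall F-statistic equals t squared
f_stat <- t_value^2                # about 26.4

# Fitted semester result for a hypothetical prev_perf of 130
pred_130 <- intercept + slope * 130
pred_130                           # about 10.44
```

The Multiple R-squared of 0.005925 means prev_perf accounts for roughly 0.6% of the variance in sem_results, so the relationship is statistically significant but weak.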