Importing dataset:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Creating a custom table by mutating the necessary categorical columns:
dataset_1 <- dataset
dataset_1 <- mutate(dataset_1, marital_status = ifelse(`Marital status` == 1, "single",
                                                ifelse(`Marital status` == 2, "married",
                                                ifelse(`Marital status` == 3, "widower",
                                                ifelse(`Marital status` == 4, "divorced",
                                                ifelse(`Marital status` == 5, "facto union",
                                                ifelse(`Marital status` == 6, "legally separated", "no")))))))
dataset_1 <- mutate(dataset_1, day_eve_class = ifelse(`Daytime/evening attendance ` == 1, "day", "evening"))
# Numeric coding of the outcome: Graduate = 2, Enrolled = 1, Dropout = 0 (NA keeps the column numeric)
dataset_1 <- mutate(dataset_1, target = ifelse(Target == "Graduate", 2,
                                        ifelse(Target == "Enrolled", 1,
                                        ifelse(Target == "Dropout", 0, NA))))
dataset_1 <- mutate(dataset_1, sem_results = rowMeans(select(dataset_1, `Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
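The same recodes can be written more compactly with dplyr::case_when(), which reads top to bottom and avoids deep nesting. A minimal equivalent sketch for the marital-status recode (same labels as above):
# Equivalent recode using case_when (same labels as the nested ifelse above)
dataset_1 <- mutate(dataset_1,
                    marital_status = case_when(
                      `Marital status` == 1 ~ "single",
                      `Marital status` == 2 ~ "married",
                      `Marital status` == 3 ~ "widower",
                      `Marital status` == 4 ~ "divorced",
                      `Marital status` == 5 ~ "facto union",
                      `Marital status` == 6 ~ "legally separated",
                      TRUE ~ "no"))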
The response variable selected for this analysis is “sem_results”, which represents the average academic performance of students across two semesters.
The categorical variable chosen is “marital_status”, which classifies students as ‘single’, ‘married’, ‘widower’, ‘divorced’, ‘facto union’ or ‘legally separated’. This variable is expected to influence students’ semester results.
The hypotheses for the ANOVA test are:
H0: There is no difference in mean semester results between students of different marital statuses.
H1: At least one marital-status group has a different mean semester result.
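Before running the ANOVA, it is useful to check the group sizes and group means; a minimal sketch using the dataset_1 built above:
# Sanity check: group sizes and mean semester results per marital status
dataset_1 %>%
  group_by(marital_status) %>%
  summarise(n = n(), mean_sem_results = mean(sem_results), .groups = "drop")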
library(tidyverse)
# Conduct ANOVA
anova_result <- aov(sem_results ~ marital_status, data = dataset_1)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## marital_status 5 690 138.04 5.979 1.61e-05 ***
## Residuals 4418 102006 23.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df (Degrees of Freedom):
marital_status: 5 degrees of freedom, indicating that 6 marital-status groups are being compared (degrees of freedom = number of groups - 1).
Residuals: 4418 degrees of freedom, representing the total number of observations minus the number of groups.
Sum Sq (Sum of Squares):
marital_status: 690, which represents the variation explained by the different marital status groups (between-group variation).
Residuals: 102,006, representing the variation within the groups (within-group variation).
Mean Sq (Mean Squares):
marital_status: 138.04, calculated as the Sum of Squares for marital status divided by its degrees of freedom (690 / 5).
Residuals: 23.09, calculated as the Sum of Squares for the residuals divided by its degrees of freedom (102,006 / 4418).
F value: 5.979
Pr(>F): 1.61e-05
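To tie these numbers together, the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with (5, 4418) degrees of freedom. A quick check using the (rounded) values from the table above:
# Recompute the F statistic and p-value from the ANOVA table values
ms_between <- 690 / 5         # Mean Sq for marital_status, ~138
ms_within  <- 102006 / 4418   # Mean Sq for residuals, ~23.09
f_value <- ms_between / ms_within                                 # ~5.98
p_value <- pf(f_value, df1 = 5, df2 = 4418, lower.tail = FALSE)   # ~1.6e-05, matching Pr(>F)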
This p-value is very small (< 0.001), indicating a statistically significant difference in mean sem_results among the marital-status groups, so we reject the null hypothesis.
In other words, there is strong evidence that a student’s marital status is associated with their semester results.
We can also examine how the average of students’ “Admission grade” and “Previous qualification (grade)” influences their semester results.
dataset_1 <- mutate(dataset_1, prev_perf = rowMeans(select(dataset_1, `Previous qualification (grade)`, `Admission grade`)))
# Linear regression model
lm_model <- lm(sem_results ~ prev_perf , data = dataset_1)
summary(lm_model)
##
## Call:
## lm(formula = sem_results ~ prev_perf, data = dataset_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0442 0.5619 1.8408 2.8242 7.8444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.522300 0.765659 8.519 < 2e-16 ***
## prev_perf 0.030150 0.005873 5.134 2.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.805 on 4422 degrees of freedom
## Multiple R-squared: 0.005925, Adjusted R-squared: 0.0057
## F-statistic: 26.36 on 1 and 4422 DF, p-value: 2.96e-07
The residuals represent the differences between the observed and predicted values of sem_results. A range of residuals from negative to positive suggests that the model both under- and over-predicts the actual values.
(Intercept):
Estimate: 6.522300. This is the predicted value of sem_results when prev_perf is 0.
t value: 8.519, with a p-value < 2e-16 (highly significant).
prev_perf:
Estimate: 0.030150. This indicates that for each unit increase in prev_perf, sem_results is expected to increase by approximately 0.03015, holding all else constant.
t value: 5.134, with a p-value of 2.96e-07 (also highly significant).
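Putting the two estimates together gives the fitted equation sem_results ≈ 6.5223 + 0.03015 × prev_perf. As an illustration (the prev_perf value of 130 below is hypothetical, chosen only for the example):
# Predicted sem_results for a hypothetical student with prev_perf = 130
predict(lm_model, newdata = data.frame(prev_perf = 130))
# = 6.5223 + 0.03015 * 130 ≈ 10.44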
F-statistic: 26.36 with a p-value of 2.96e-07. prev_perf significantly predicts the response variable: there is a statistically significant positive relationship between prev_perf and sem_results.
The R-squared value (Multiple R-squared: 0.0059) indicates that the model explains less than 1% of the variability in the response variable; this is expected, since a single predictor is rarely enough for a prediction model.
Given the low R-squared, additional predictors or interaction terms (such as day_eve_class, Inflation rate, or GDP) could be explored to improve model performance.
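A minimal sketch of such an extended model, assuming the macroeconomic columns are named `Inflation rate` and `GDP` as in the column listing printed at import (adjust the names if they differ):
# Sketch: extend the regression with attendance type and macroeconomic indicators
# (`Inflation rate` and `GDP` are assumed column names)
lm_multi <- lm(sem_results ~ prev_perf + day_eve_class + `Inflation rate` + GDP,
               data = dataset_1)
summary(lm_multi)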