Importing dataset:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Creating a custom table by mutating the necessary categorical columns:
dataset_1 <- dataset
dataset_1 <- mutate(dataset_1, marital_status = ifelse(`Marital status` == 1, "single",
                                                ifelse(`Marital status` == 2, "married",
                                                ifelse(`Marital status` == 3, "widower",
                                                ifelse(`Marital status` == 4, "divorced",
                                                ifelse(`Marital status` == 5, "facto union",
                                                ifelse(`Marital status` == 6, "legally separated", "no")))))))
dataset_1 <- mutate(dataset_1, day_eve_class = ifelse(`Daytime/evening attendance ` == 1, "day", "evening"))
# Numeric coding of the outcome: Graduate = 2, Enrolled = 1, Dropout = 0 (NA keeps the column numeric)
dataset_1 <- mutate(dataset_1, target = ifelse(Target == "Graduate", 2,
                                        ifelse(Target == "Enrolled", 1,
                                        ifelse(Target == "Dropout", 0, NA))))
dataset_1 <- mutate(dataset_1, sem_results = rowMeans(select(dataset_1, `Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
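The same recodes can be written more compactly with dplyr::case_when(), which reads top to bottom and avoids deep nesting. A minimal equivalent sketch for the marital-status recode (same labels as above):
# Equivalent recode using case_when (same labels as the nested ifelse above)
dataset_1 <- mutate(dataset_1,
                    marital_status = case_when(
                      `Marital status` == 1 ~ "single",
                      `Marital status` == 2 ~ "married",
                      `Marital status` == 3 ~ "widower",
                      `Marital status` == 4 ~ "divorced",
                      `Marital status` == 5 ~ "facto union",
                      `Marital status` == 6 ~ "legally separated",
                      TRUE ~ "no"))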
The response variable selected for this analysis is “sem_results”, which represents the average academic performance of students across two semesters.
The categorical variable chosen is “marital_status”, which classifies students as ‘single’, ‘married’, ‘widower’, ‘divorced’, ‘facto union’ or ‘legally separated’. This variable is expected to influence students’ semester results.
The hypotheses for the ANOVA test are:
H0: There is no difference in mean semester results between students of different marital statuses.
H1: At least one marital-status group has a different mean semester result.
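Before running the ANOVA, it is useful to check the group sizes and group means; a minimal sketch using the dataset_1 built above:
# Sanity check: group sizes and mean semester results per marital status
dataset_1 %>%
  group_by(marital_status) %>%
  summarise(n = n(), mean_sem_results = mean(sem_results), .groups = "drop")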
library(tidyverse)
# Conduct ANOVA
anova_result <- aov(sem_results ~ marital_status, data = dataset_1)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## marital_status 5 690 138.04 5.979 1.61e-05 ***
## Residuals 4418 102006 23.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df (Degrees of Freedom):
marital_status: 5 degrees of freedom, indicating that 6 marital-status groups are being compared (degrees of freedom = number of groups - 1).
Residuals: 4418 degrees of freedom, representing the total number of observations minus the number of groups.
Sum Sq (Sum of Squares):
marital_status: 690, which represents the variation explained by the different marital status groups (between-group variation).
Residuals: 102,006, representing the variation within the groups (within-group variation).
Mean Sq (Mean Squares):
marital_status: 138.04, calculated as the Sum of Squares for marital status divided by its degrees of freedom (690 / 5).
Residuals: 23.09, calculated as the Sum of Squares for the residuals divided by its degrees of freedom (102,006 / 4418).
F value: 5.979
Pr(>F): 1.61e-05
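To tie these numbers together, the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with (5, 4418) degrees of freedom. A quick check using the (rounded) values from the table above:
# Recompute the F statistic and p-value from the ANOVA table values
ms_between <- 690 / 5         # Mean Sq for marital_status, ~138
ms_within  <- 102006 / 4418   # Mean Sq for residuals, ~23.09
f_value <- ms_between / ms_within                                 # ~5.98
p_value <- pf(f_value, df1 = 5, df2 = 4418, lower.tail = FALSE)   # ~1.6e-05, matching Pr(>F)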
This p-value is very small (< 0.001), indicating a statistically significant difference in mean sem_results among the marital-status groups, so we reject the null hypothesis.
In other words, there is strong evidence that a student’s marital status is associated with their semester results.
We can also examine how the average of students’ “Admission grade” and “Previous qualification (grade)” influences their semester results.
dataset_1 <- mutate(dataset_1, prev_perf = rowMeans(select(dataset_1, `Previous qualification (grade)`, `Admission grade`)))
# Linear regression model
lm_model <- lm(sem_results ~ prev_perf , data = dataset_1)
summary(lm_model)
##
## Call:
## lm(formula = sem_results ~ prev_perf, data = dataset_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0442 0.5619 1.8408 2.8242 7.8444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.522300 0.765659 8.519 < 2e-16 ***
## prev_perf 0.030150 0.005873 5.134 2.96e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.805 on 4422 degrees of freedom
## Multiple R-squared: 0.005925, Adjusted R-squared: 0.0057
## F-statistic: 26.36 on 1 and 4422 DF, p-value: 2.96e-07
The residuals represent the differences between the observed and predicted values of sem_results. A range of residuals from negative to positive suggests that the model both under- and over-predicts the actual values.
(Intercept):
Estimate: 6.522300. This is the predicted value of sem_results when prev_perf is 0.
t value: 8.519, with a p-value < 2e-16 (highly significant).
prev_perf:
Estimate: 0.030150. This indicates that for each unit increase in prev_perf, sem_results is expected to increase by approximately 0.03015, holding all else constant.
t value: 5.134, with a p-value of 2.96e-07 (also highly significant).
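Putting the two estimates together gives the fitted equation sem_results ≈ 6.5223 + 0.03015 × prev_perf. As an illustration (the prev_perf value of 130 below is hypothetical, chosen only for the example):
# Predicted sem_results for a hypothetical student with prev_perf = 130
predict(lm_model, newdata = data.frame(prev_perf = 130))
# = 6.5223 + 0.03015 * 130 ≈ 10.44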
F-statistic: 26.36 with a p-value of 2.96e-07. prev_perf significantly predicts the response variable: there is a statistically significant positive relationship between prev_perf and sem_results.
The R-squared value (Multiple R-squared: 0.0059) indicates that the model explains less than 1% of the variability in the response variable; this is expected, since a single predictor is rarely enough for a prediction model.
Given the low R-squared, additional predictors or interaction terms (such as day_eve_class, Inflation rate, or GDP) could be explored to improve model performance.
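A minimal sketch of such an extended model, assuming the macroeconomic columns are named `Inflation rate` and `GDP` as in the column listing printed at import (adjust the names if they differ):
# Sketch: extend the regression with attendance type and macroeconomic indicators
# (`Inflation rate` and `GDP` are assumed column names)
lm_multi <- lm(sem_results ~ prev_perf + day_eve_class + `Inflation rate` + GDP,
               data = dataset_1)
summary(lm_multi)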