Introduction

Students’ academic achievement is influenced by a confluence of intellectual, social, and personal elements. Gaining an understanding of these elements is essential to creating tactics that help students achieve better results. The goal of this project is to analyze a dataset of student academic records in order to pinpoint important performance factors and offer useful insights. The project intends to provide a data-driven strategy for tackling educational issues and enhancing student achievement by utilizing data analytics.

Objectives

Clean and pre-process the dataset to ensure it is suitable for analysis.
Identify key patterns and relationships in the data through exploratory data analysis (EDA).
Build predictive models to forecast student grades and evaluate the factors that most influence academic performance.

Details About the Dataset

This document explores a dataset of student grades, performs data pre-processing, exploratory data analysis (EDA), and investigates potential causality between predictors and final grades.

Title: Student Academic Performance Dataset

Year: The exact year of data collection is not mentioned.

Purpose: To analyze and predict students’ academic performance using features such as family, health, and demographic factors.

Dimension: The dataset contains 395 rows (students) and 33 columns (features).

Content (Feature):

Demographic factors: Age, gender, and address.
Family background: Parent’s education, job, and family relationships.
Health and lifestyle: Alcohol consumption, health status, absences, and leisure activities.
Academic performance: G1 (1st-period grade), G2 (2nd-period grade), and G3 (final grade).
Binary Data: Access to internet, school support, and participation in extracurricular activities.

Structure:

Target variable: Final grade (G3).
Data types: Categorical and numerical data types.
Data cleaning: Some features may require cleaning, such as removing redundant or missing values.

Summary:

Statistical overview: Mean and standard deviation of grades.
Categorical distribution: Proportions of categorical variables (e.g., M/F for gender).

library(knitr)

# Corrected data frame with equal rows in all columns
data <- data.frame(
  "Main_Features" = c(
    "school", "sex", "age", "address", "famsize", "Pstatus",
    "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian",
    "Study environment", "Study environment", "Study environment", "Study environment",
    "Academic support", "Academic support", "Academic support",
    "Activities and Social Factors", "Activities and Social Factors", "Activities and Social Factors",
    "Activities and Social Factors", "Activities and Social Factors", "Activities and Social Factors", "Activities and Social Factors",
    "Health and lifestyles", "Health and lifestyles", "Health and lifestyles", "Health and lifestyles",
    "Grades", "Grades", "Grades"
  ),
  
  "Sub_Features" = c(
    "-", "-", "-", "-", "-", "-",
    "-", "-", "-", "-", "-", "-",
    "traveltime", "studytime", "failures", "internet",
    "schoolsup", "famsup", "paid",
    "activities", "nursery", "higher", "romantic", "famrel", "freetime", "goout",
    "Dalc", "Walc", "health", "absences",
    "G1", "G2", "G3<br>**(target variable)**"
  ),
  
  "Remarks" = c(
    "School name, GP/MS", "Gender (female or male)", "", "Urban or rural",
    "Less or equal to 3 (LE3) or greater than 3 (GT3)",
    "Together (T), apart (A); parent cohabitation status",
    "Education levels of mother", "Education levels of father", "Mother's job", "Father’s job",
    "Reason for choosing the school", "Mother, father or others",
    "Travel time from home to school", "Weekly study time", "Number of past class failures", "Internet access at home",
    "Extra educational support", "Family educational support", "Extra paid classes",
    "Participation in extracurricular activities", "Attended nursery school",
    "Aspiration for higher education", "In a romantic relationship",
    "Quality of family relationships", "Free time after school", "Frequency of going out with friends",
    "Workday alcohol consumption", "Weekend alcohol consumption",
    "Current health status", "Number of school absences",
    "First period grade", "Second period grade", "Final grade<br>(Can reflect overall academic performance at the end of the school year, making it the most comprehensive indicator for prediction.)"
  )
)

# Render the table
kable(data, caption = "Main Features and Their Descriptions")

Main Features and Their Descriptions
Main_Features	Sub_Features	Remarks
school	-	School name, GP/MS
sex	-	Gender (female or male)
age	-
address	-	Urban or rural
famsize	-	Less or equal to 3 (LE3) or greater than 3 (GT3)
Pstatus	-	Together (T), apart (A); parent cohabitation status
Medu	-	Education levels of mother
Fedu	-	Education levels of father
Mjob	-	Mother’s job
Fjob	-	Father’s job
reason	-	Reason for choosing the school
guardian	-	Mother, father or others
Study environment	traveltime	Travel time from home to school
Study environment	studytime	Weekly study time
Study environment	failures	Number of past class failures
Study environment	internet	Internet access at home
Academic support	schoolsup	Extra educational support
Academic support	famsup	Family educational support
Academic support	paid	Extra paid classes
Activities and Social Factors	activities	Participation in extracurricular activities
Activities and Social Factors	nursery	Attended nursery school
Activities and Social Factors	higher	Aspiration for higher education
Activities and Social Factors	romantic	In a romantic relationship
Activities and Social Factors	famrel	Quality of family relationships
Activities and Social Factors	freetime	Free time after school
Activities and Social Factors	goout	Frequency of going out with friends
Health and lifestyles	Dalc	Workday alcohol consumption
Health and lifestyles	Walc	Weekend alcohol consumption
Health and lifestyles	health	Current health status
Health and lifestyles	absences	Number of school absences
Grades	G1	First period grade
Grades	G2	Second period grade
Grades	G3 (target variable)	Final grade (Can reflect overall academic performance at the end of the school year, making it the most comprehensive indicator for prediction.)

Full Code for Analysis

# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(scatterplot3d)

# Load the dataset
df <- read.csv("C:\\Users\\WY Ng\\Desktop\\WY\\Masters\\CLASSES\\2024, 2025 Sem 1\\Programming for Data Science\\Assignment\\student_data.csv")

# View the structure and summary of the dataset
glimpse(df)

## Rows: 395
## Columns: 33
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…

# Part 1: Data Pre-Processing

## 1.1: Check Missing Values and Duplicates
colSums(is.na(df))  # Check for missing values

##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0

nrow(df) - nrow(unique(df))  # Count duplicate rows

## [1] 0

df <- unique(df)  # Remove duplicate rows

## 1.2: Check and Handle Outliers
# Boxplot for absences
boxplot(df$absences, main = "Boxplot of Absences", ylab = "Absences")

# Calculate IQR and cap outliers
Q1 <- quantile(df$absences, 0.25, na.rm = TRUE)
Q3 <- quantile(df$absences, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
df$absences[df$absences < lower_bound] <- lower_bound
df$absences[df$absences > upper_bound] <- upper_bound

# Boxplot after capping outliers
boxplot(df$absences, main = "Boxplot After Capping Outliers", ylab = "Absences")

## 1.3: Encoding and Transformation
# Binary encoding
df$schoolsup <- ifelse(df$schoolsup == "yes", 1, 0)
df$famsup <- ifelse(df$famsup == "yes", 1, 0)
df$paid <- ifelse(df$paid == "yes", 1, 0)
df$activities <- ifelse(df$activities == "yes", 1, 0)
df$nursery <- ifelse(df$nursery == "yes", 1, 0)
df$higher <- ifelse(df$higher == "yes", 1, 0)
df$internet <- ifelse(df$internet == "yes", 1, 0)
df$romantic <- ifelse(df$romantic == "yes", 1, 0)

# One-hot encoding for nominal variables
dummy_vars <- dummyVars("~ school + sex + address + famsize + Pstatus", data = df)
encoded_data <- predict(dummy_vars, newdata = df)
df <- cbind(df, as.data.frame(encoded_data))

# Create binary Pass/Fail column
df$Pass <- ifelse(df$G3 >= 10, "Pass", "Fail")
df$Pass <- as.factor(df$Pass)
df <- df %>% rename(Outcome = Pass)

# Standardize numeric columns
df$absences <- scale(df$absences)

## 1.4: Save Pre-Processed Data
write.csv(df, "cleaned_dataset.csv", row.names = FALSE)

Part 2: Exploratory Data Analysis (EDA)

In this section, we will explore the dataset through Exploratory Data Analysis (EDA) to uncover patterns and relationships that influence student performance. We’ll begin with Univariate Analysis to understand individual variables, proceed to Bivariate and Multivariate Analyses to examine relationships between variables, and finally assess causal relationships and underlying mechanisms. This comprehensive analysis will help identify key factors affecting academic outcomes, such as grades, absences, and school environments.

2.1: Univariate Analysis

Distribution of Absences (Histogram)

The histogram shows most students have low or negative scaled absences, indicating fewer absences overall. As absences increase, the frequency drops sharply, suggesting only a few students have high absenteeism, which may affect performance.

Outcome Distribution (Pass/Fail) (Bar Plot)

The bar plot reveals that more students passed than failed. This imbalance highlights that while most students succeed, a notable group still struggles, indicating areas for potential improvement.

## 2.1: Univariate Analysis
# Distribution of absences
ggplot(df, aes(x = absences)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Absences", x = "Absences", y = "Frequency")

# Outcome distribution
ggplot(df, aes(x = Outcome)) +
  geom_bar(fill = "green", color = "black") +
  labs(title = "Outcome Distribution (Pass/Fail)", x = "Outcome", y = "Count")

2.2: Bivariate Analysis

Scatterplot of G1 vs G3

This scatterplot shows a strong positive relationship between G1 (First Period Grade) and G3 (Final Grade). The red regression line indicates that higher early grades generally lead to higher final grades, suggesting consistent academic performance.

Boxplot of Absences by Outcome

The boxplot compares absences between students who passed and failed. Students who failed tend to have more absences, with a wider range and more extreme outliers. This suggests that higher absenteeism may negatively impact performance.

## 2.2: Bivariate Analysis
# G1 vs G3
ggplot(df, aes(x = G1, y = G3)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Scatterplot of G1 vs G3", x = "G1", y = "G3")

## `geom_smooth()` using formula = 'y ~ x'

# Absences vs Outcome (Compare average absences for Pass and Fail groups)
df %>%
  group_by(Outcome) %>%
  summarise(mean_absences = mean(absences, na.rm = TRUE)) %>%
  ggplot(aes(x = Outcome, y = mean_absences, fill = Outcome)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Absences by Outcome", x = "Outcome", y = "Mean Absences")

2.3: Multivariate Analysis

3D Scatterplot of G1, G2, and G3 (Pass/Fail Colors)

This 3D plot visualizes the relationship between grades across all periods. Green points represent students who passed, and red points indicate those who failed. Higher grades in G1 and G2 are associated with passing, highlighting the importance of consistent performance.

Interaction Effect of School on G1 vs G3

This plot shows how the relationship between G1 and G3 varies by school. Both schools (GP and MS) show a positive trend, but slight differences in slope suggest that school environment may influence how early grades impact final results.

## 2.3: Multivariate Analysis
# 3D Scatterplot
scatterplot3d(df$G1, df$G2, df$G3, 
              pch = 16, 
              color = ifelse(df$Outcome == "Pass", "green", "red"), 
              xlab = "G1 (First Period Grade)", 
              ylab = "G2 (Second Period Grade)", 
              zlab = "G3 (Final Grade)",
              main = "3D Scatterplot: G1, G2, and G3 with Pass/Fail Colors")

# Interaction effects of school on G1 and G3
ggplot(df, aes(x = G1, y = G3, color = school)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Interaction Effect of School on G1 vs G3", x = "G1", y = "G3")

## `geom_smooth()` using formula = 'y ~ x'

Part 2.4: Comparisons

Average Absences by Outcome

This stacked bar chart shows the relationship between family size (famsize) and student outcomes (Outcome). It compares how many students from small families (LE3) and large families (GT3) passed or failed.

# Part 2.4: Comparisons
# Family Size vs. Student Outcome
ggplot(df, aes(x = famsize, fill = Outcome)) +
  geom_bar(position = "stack") +
  labs(title = "Family Size vs. Student Outcome",
       x = "Family Size",
       y = "Number of Students") +
  scale_fill_manual(values = c("Pass" = "skyblue", "Fail" = "salmon")) +
  theme_minimal()

Part 2.5: Causality

Correlation Matrix Heatmap

Dark blue: Strong positive correlations (e.g., G1, G2, G3). Dark red: Strong negative correlations. Lighter colors: Weak correlations. Insight: Early grades strongly predict final grades, while absences show weaker negative correlations.

Top Correlated Feature Pairs

G3–G2 (0.905): Final and second-period grades are highly related. G2–G1 (0.852): Early grades predict later grades. G3–G1 (0.801): First and final grades are closely linked. Walc–Dalc (0.648): Alcohol use during weekdays and weekends is related. Fedu–Medu (0.623): Parents’ education levels are correlated.

# Part 2.5: Causality
# Correlation matrix for numeric variables
library(corrplot)

## corrplot 0.95 loaded

numeric_vars <- df %>% select_if(is.numeric)
corr_matrix <- cor(numeric_vars, use = "complete.obs")
# Visualize the correlation matrix
corrplot(corr_matrix, method = "color", type = "upper", tl.cex = 0.8)

# Convert the correlation matrix to a data frame
corr_df <- as.data.frame(as.table(corr_matrix))
# Remove duplicate pairs by sorting variable names
filtered_corr <- corr_df %>%
  filter(abs(Freq) < 0.9999) %>%
  rowwise() %>%
  mutate(pair = paste(sort(c(Var1, Var2)), collapse = "-")) %>%
  distinct(pair, .keep_all = TRUE) %>%
  arrange(desc(abs(Freq)))
# Display the top 5 results
head(filtered_corr, 5)

## # A tibble: 5 × 4
## # Rowwise: 
##   Var1  Var2   Freq pair     
##   <fct> <fct> <dbl> <chr>    
## 1 G3    G2    0.905 G2-G3    
## 2 G2    G1    0.852 G1-G2    
## 3 G3    G1    0.801 G1-G3    
## 4 Walc  Dalc  0.648 Dalc-Walc
## 5 Fedu  Medu  0.623 Medu-Fedu

Part 2.6 Mechanism

Interaction Effect of School on G1 vs G3

The violin plot shows the distribution of weekly study time for students who passed (blue) and failed (red). Both groups mostly study around 2 hours, but students who passed tend to study slightly more. This suggests that higher study time may improve performance, though other factors could also play a role.

# Part 2.6 Mechanism
# Study Time vs Outcome
ggplot(df, aes(x = Outcome, y = studytime, fill = Outcome)) +
  geom_violin(trim = FALSE) +
  labs(title = "Study Time vs. Student Outcome",
       x = "Student Outcome",
       y = "Weekly Study Time (Hours)") +
  scale_fill_manual(values = c("Pass" = "skyblue", "Fail" = "salmon")) +
  theme_minimal()

Insights

The Exploratory Data Analysis (EDA) revealed key factors influencing student performance. Early grades (G1 and G2) strongly predict final outcomes (G3), highlighting the importance of consistent academic effort. Higher absenteeism is linked to failing grades, emphasizing the negative impact of poor attendance. Additionally, school environment and parental education levels also contribute to academic success. While study time slightly differs between passing and failing students, other factors like attendance and support systems play a more significant role. These insights can guide targeted interventions to improve student outcomes.

Regression Question

The project aims to predict students’ final grades (G3) based on factors such as study habits, family background, health, and behavior. The key goal is to identify which factors most influence academic performance and understand their relationship with final grades.

Regression Model

# Part 3: Regression Model with More Features

## 3.1: Load required libraries
library(caret)
library(ggplot2)

## 3.2: Prepare data
# Select all relevant columns for prediction
df <- df %>% select(traveltime, studytime, failures, schoolsup, famsup, paid, activities, internet, romantic,
                    famrel, freetime, goout, Dalc, Walc, health, absences, higher, G3, Outcome)

# Convert categorical variables to binary if needed (e.g., famsup, schoolsup, paid, internet, romantic, higher)
df$famsup <- ifelse(df$famsup == "yes", 1, 0)
df$schoolsup <- ifelse(df$schoolsup == "yes", 1, 0)
df$paid <- ifelse(df$paid == "yes", 1, 0)
df$activities <- ifelse(df$activities == "yes", 1, 0)
df$internet <- ifelse(df$internet == "yes", 1, 0)
df$romantic <- ifelse(df$romantic == "yes", 1, 0)
df$higher <- ifelse(df$higher == "yes", 1, 0)

## 3.3: Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(df$G3, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

## 3.4: Build Linear Regression model using all features
lm_model <- lm(G3 ~ ., data = train_data)

# Summarize the model
summary(lm_model)

## 
## Call:
## lm(formula = G3 ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6667 -2.2254 -0.1285  2.4457  6.7535 
## 
## Coefficients: (7 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.89309    1.19629   4.926 1.38e-06 ***
## traveltime  -0.39302    0.23792  -1.652 0.099582 .  
## studytime    0.07228    0.19878   0.364 0.716401    
## failures    -0.65098    0.24672  -2.639 0.008753 ** 
## schoolsup         NA         NA      NA       NA    
## famsup            NA         NA      NA       NA    
## paid              NA         NA      NA       NA    
## activities        NA         NA      NA       NA    
## internet          NA         NA      NA       NA    
## romantic          NA         NA      NA       NA    
## famrel       0.02722    0.19104   0.142 0.886786    
## freetime     0.10723    0.17635   0.608 0.543607    
## goout        0.10614    0.17203   0.617 0.537717    
## Dalc         0.03812    0.23433   0.163 0.870882    
## Walc        -0.08123    0.18319  -0.443 0.657776    
## health      -0.12835    0.11916  -1.077 0.282264    
## absences     0.62918    0.16814   3.742 0.000218 ***
## higher            NA         NA      NA       NA    
## OutcomePass  7.35273    0.37701  19.503  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 306 degrees of freedom
## Multiple R-squared:  0.6215, Adjusted R-squared:  0.6079 
## F-statistic: 45.67 on 11 and 306 DF,  p-value: < 2.2e-16

## 3.5: Evaluate model
# Predict on the test data
lm_predictions <- predict(lm_model, newdata = test_data)

# Calculate evaluation metrics
lm_r2 <- cor(test_data$G3, lm_predictions)^2  # R²
lm_mae <- mean(abs(test_data$G3 - lm_predictions))  # Mean Absolute Error
lm_mse <- mean((test_data$G3 - lm_predictions)^2)  # Mean Squared Error

# Print the results
cat("R²:", lm_r2, "\nMAE:", lm_mae, "\nMSE:", lm_mse)

## R²: 0.6492638 
## MAE: 2.234316 
## MSE: 6.847222

## 3.6: compare the coefficients and significance of predictors
# Extract the coefficients and p-values
coef_summary <- summary(lm_model)$coefficients

# Display coefficients and p-values
coef_summary <- coef_summary[, c("Estimate", "Pr(>|t|)")]
coef_summary

##                Estimate     Pr(>|t|)
## (Intercept)  5.89309128 1.375270e-06
## traveltime  -0.39301681 9.958171e-02
## studytime    0.07227843 7.164006e-01
## failures    -0.65097819 8.753024e-03
## famrel       0.02722084 8.867861e-01
## freetime     0.10722969 5.436070e-01
## goout        0.10613919 5.377173e-01
## Dalc         0.03811923 8.708815e-01
## Walc        -0.08123131 6.577759e-01
## health      -0.12835351 2.822642e-01
## absences     0.62918296 2.179018e-04
## OutcomePass  7.35272503 1.283243e-55

Significant Predictors (p-value < 0.05)

failures:
Estimate: -2.1524
P-value: 2.53e-09
Strong negative impact on the final grade (G3). Higher failures lead to lower grades.
goout:
Estimate: -0.5042
P-value: 0.0471
Moderate negative impact on the final grade (G3). More time spent going out (socializing) is associated with lower grades.
absences:
Estimate: 0.5162
P-value: 0.0408
Positive impact on final grade (G3). More absences are surprisingly linked to higher grades, though this requires further interpretation (could be a data anomaly or context-specific).

Non-Significant Predictors (p-value > 0.05)

traveltime, studytime, famrel, freetime, Dalc, Walc, and health have high p-values, indicating that they don’t significantly impact the final grade based on this model.

Findings

Most Influential Factors:
- failures seems to have the largest effect on final grades, with a large negative coefficient and a very significant p-value.
- goout also plays a significant role, but with a moderate effect on the grade.
- absences shows a surprising positive relationship with grades, but more investigation is needed to understand this result.

## 3.7: Visualize the regression results
# Create a data frame for actual vs predicted values
results <- data.frame(
  Actual = test_data$G3,
  Predicted = lm_predictions
)

# Plot actual vs predicted values
ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue") + # Scatter plot
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") + # Ideal fit line (y=x)
  labs(
    title = "Linear Regression: Actual vs Predicted Grades",
    x = "Actual Grades",
    y = "Predicted Grades"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

Observations

Most of the predicted grades (blue points) align relatively well with the red dashed line ( 𝑦 = 𝑥 y=x), especially for mid-range grades (5–15). There is some spread at lower grades, indicating that the model struggles to predict very low final grades accurately.

Insights

The model performs reasonably well but could likely benefit from further fine-tuning or the inclusion of additional features to reduce errors at the extremes. Outliers or underrepresented data points at the lower grade spectrum might skew predictions.\

Classification Question

The project aims to predict whether students will pass or fail their final exams and to develop a model that can accurately classify students as pass or fail based on related factors.

Classification Model

# Part 4: Classification Model (Decision Tree)

## 4.1: Load required libraries
library(rpart)
library(rpart.plot)
library(pROC)

## 4.2 Prepare data
df$Outcome<- as.factor(df$Outcome)

## 4.3 Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(df$Outcome, p = 0.5, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]

## 4.4 Train the Decision Tree model
treeModel <- rpart(Outcome ~ . - G3 - Outcome, data = trainData, method = "class")
rpart.plot(treeModel, type = 3, extra = 102, fallen.leaves = TRUE, main = "Decision Tree")

## 4.5 Make predictions on the test set
predictedClasses <- predict(treeModel, newdata = testData, type = "class")
predictedProbs <- predict(treeModel, newdata = testData, type = "prob")[, "Pass"]

## 4.6 Evaluate the model
confMatrix <- confusionMatrix(predictedClasses, testData$Outcome)

cat("Accuracy:", confMatrix$overall["Accuracy"], " Precision:", confMatrix$byClass["Precision"], " Recall:", confMatrix$byClass["Recall"], " F1 Score:", confMatrix$byClass["F1"], "\n")

## Accuracy: 0.7106599  Precision: 0.5909091  Recall: 0.4  F1 Score: 0.4770642

## 4.7 Visualize the Confusion Matrix
confusionMatrixTable <- as.table(confMatrix$table)
fourfoldplot(confusionMatrixTable, color = c("red", "green"), main = "Confusion Matrix")

## 4.8 Visualize the ROC Curve
curve <- roc(testData$Outcome, predictedProbs, levels = rev(levels(testData$Outcome)))
plot(curve, col = "darkgreen", lwd = 2, main = "ROC Curve for Decision Tree")
abline(a = 0, b = 1, col = "red", lty = 2)

auc <- auc(curve)
cat("AUC:", auc, "\n")

## AUC: 0.6417832

Findings

The model has an accuracy of 71.07%, which indicates that it correctly predicted the outcome for about 71% of the students’ Pass or Fail cases. This is may be due to the class imbalance of the dataset in which it has more ‘Pass’ students than ‘Fail’ students. Based on the Confusion Matrix Analysis, 114 students and 26 students were correctly predicted to pass and to fail, respectively.

Limitations

Class Imbalance: The dataset exhibits an imbalance between “Pass” and “Fail” outcomes, which may affect the performance of classification models.
Data Quality: Some attributes in the dataset may lack granularity or consistency, potentially limiting the robustness of the analysis.
Simplistic Models: The analysis uses linear regression and decision tree models, which may not capture complex relationships within the data as effectively as advanced machine learning algorithms.
Generalizability: The dataset reflects a specific group of students, limiting the ability to generalize findings to broader populations without additional data.
Assumptions in Preprocessing: The decisions made during data preprocessing (e.g., outlier capping, encoding) might introduce biases or oversimplify the data.

Future Work

Address Class Imbalance: Implement techniques like oversampling, undersampling, or synthetic data generation to handle the class imbalance issue and improve model performance.
Explore Advanced Models: Investigate more sophisticated machine learning techniques, such as random forests, gradient boosting, or neural networks, to enhance predictive accuracy.
Feature Engineering: Develop new features that could better capture underlying patterns, such as combining study time with parental support metrics.
Longitudinal Analysis: Incorporate data from multiple time points to explore trends and changes in student performance over time.
Intervention Simulation: Use findings to simulate potential interventions (e.g., increasing study time) and their impact on student outcomes.

Conclusion

Using a collection of student records, this project effectively investigated the variables affecting students’ academic achievement. Patterns and trends were established through exploratory analysis and data pre-processing, which served as the basis for developing predictive models. The results of the study showed that factors including study time, failure rate, and absenteeism have a big impact on students’ grades and chances of passing or failing. The investigation showed how dataset features, including class imbalance, affect model performance, even when the regression and classification models produced plausible predictions.

All things considered, this research shows how data-driven methods may be used to comprehend and enhance educational outcomes. Future analyses can offer even more accurate insights and promote evidence-based decision-making in education by improving the methodologies and addressing the limitations.

Student Data Analysis

Group 6

2025-01-12

Introduction

Objectives

Details About the Dataset

Full Code for Analysis

Part 2: Exploratory Data Analysis (EDA)

2.1: Univariate Analysis

Distribution of Absences (Histogram)

Outcome Distribution (Pass/Fail) (Bar Plot)

2.2: Bivariate Analysis

Scatterplot of G1 vs G3

Boxplot of Absences by Outcome

2.3: Multivariate Analysis

3D Scatterplot of G1, G2, and G3 (Pass/Fail Colors)

Interaction Effect of School on G1 vs G3

Part 2.4: Comparisons

Average Absences by Outcome

Part 2.5: Causality

Correlation Matrix Heatmap

Top Correlated Feature Pairs

Part 2.6 Mechanism

Interaction Effect of School on G1 vs G3

Insights

Regression Question

Regression Model

Significant Predictors (p-value < 0.05)

Non-Significant Predictors (p-value > 0.05)

Findings

Observations

Insights

Classification Question

Classification Model

Findings

Limitations

Future Work

Conclusion