Academic performance is a crucial factor in shaping a student’s future opportunities in education and career paths. For educational institutions aiming to improve student outcomes, understanding what influences academic success is essential (Cortez and Silva, 2008). Mathematics is particularly important, as it forms the foundation for other subjects, and students who struggle in mathematics may face challenges across their studies. Research has shown that various factors, such as a student’s background, social environment, and study habits, can all impact their academic performance, underscoring the need to identify these influences to design effective educational support (Cortez and Silva, 2008).
This study investigates the factors affecting the mathematics performance of secondary school students from two Portuguese schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS). The data for this project, obtained from the UCI Machine Learning Repository (Cortez and Silva, 2008), contains valuable information that can reveal patterns and insights into student success.
The project has two main goals: analyzing data to understand the relationships between different factors and applying prediction methods to estimate student performance. The first goal involves statistical tests to explore how various demographic, social, and academic factors are related to student grades. The second goal focuses on using prediction models. By combining these approaches, this study aims to give educators a clearer understanding of the factors influencing student performance, guiding them in developing more targeted and effective support strategies.
We utilized the Secondary Education Dataset from the UCI Machine Learning Repository, originally gathered by Cortez and Silva, which provides information on student academic performance in Mathematics. The data was collected during the 2005-2006 academic year through school reports and supplementary questionnaires filled out by students.
In preparing the dataset for statistical and machine learning analyses, several categorical variables were recoded as numeric values (binary indicators or integer codes) to facilitate interpretation. The recoding is as follows:
school: Gabriel Pereira "GP" (0), Mousinho da Silveira "MS" (1)
sex: Male (0), Female (1)
address: Rural (0), Urban (1)
famsize: Three or fewer members (0), More than three members (1)
Pstatus: Parents together (1), Apart (0)
schoolsup, famsup, paid, activities, nursery, higher, internet, romantic: "yes" (1), "no" (0)
reason (reason for choosing the school): course (0), other (1), home (2), reputation (3)
guardian (student's guardian): mother (0), other (1), father (2)
Mjob, Fjob (parents' occupations): at_home (0), other (1), health (2), services (3), teacher (4)
This transformation ensures that the variables are appropriately formatted for both statistical tests and machine learning models.
Fig 1. Transformed Sample Data
The data collection consisted of two primary sources:
School Reports:
These included official records such as period grades (G1, G2, and G3, where G3 is the final grade) and the number of school absences. The dataset includes the first-period grade (G1) and the second-period grade (G2), both numeric values ranging from 0 to 20, along with the final grade (G3) on the same scale. Student outcomes are classified as a pass if G3 ≥ 10 and a fail if G3 < 10. Grades are further categorized as follows:
\(16-20\): excellent
\(14-15\): good
\(12-13\): satisfactory
\(10-11\): sufficient
\(0-9\): fail
Fig 2. The distribution of Final grades in Mathematics
Questionnaires:
Designed to capture a broader range of variables, these questionnaires gathered data on students' demographic background, social context, and additional school-related factors. Questions were closed-ended to ensure consistency in responses and ease of data analysis. The variables captured included parental education and occupation, weekly study time, alcohol consumption, and family support.
A few potential sources of bias were inherent to the data collection process and dataset composition:
Self-Reporting Bias: The questionnaire data relied on self-reported information, which may introduce bias, particularly for sensitive variables such as alcohol consumption and parental income. Students may underreport or overreport certain behaviors due to social desirability bias.
Selection Bias: Data was collected from two public schools in a specific Portuguese region. This limitation may restrict the generalizability of findings to other regions or private educational settings, as the socioeconomic context could differ.
Privacy Concerns: Some variables, like family income, were omitted from the analysis due to incomplete responses, likely stemming from privacy concerns. This exclusion could reduce the dataset’s comprehensiveness in examining socioeconomic influences on performance.
To account for these biases:
Data Imputation was applied minimally, only filling missing values where they were considered ignorable and would not introduce additional bias.
Normalization of Numeric Variables: Variables like grades and absences were normalized to reduce the impact of outliers and differences in scale (see the sketch after this list).
Use of Cross-Validation in machine learning modeling helped control for selection bias, ensuring that model training and testing were less affected by any specific subset of the data.
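As an illustrative sketch of the normalization step (the appendix code models grades and absences on their original scales, so treat this as one possible approach rather than the exact pipeline), z-score standardization in R could look like:
# Illustrative sketch: z-score standardization of numeric variables.
# Assumes std_mat has been loaded and recoded as in the appendix;
# note the appendix models actually use the original 0-20 grade scale.
std_mat_norm <- std_mat
num_vars <- c("G1", "G2", "G3", "absences")
std_mat_norm[num_vars] <- scale(std_mat_norm[num_vars])  # center to mean 0, sd 1
summary(std_mat_norm$G3)  # check: mean is now approximately 0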
The statistical analysis in this study examines relationships between demographic, social, and academic factors and student performance.
Gender and Academic Performance:
An independent \(t\text{-test}\) will be conducted to determine whether there is a statistically significant difference in final grades (G3) between male and female students. This analysis aims to identify any potential gender-related performance disparities in Mathematics. The \(t\text{-test}\) statistic is given by: \[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \] where \(\bar{X}_1\) and \(\bar{X}_2\) are the sample means, \(s_1^2\) and \(s_2^2\) are the sample variances, and \(n_1\) and \(n_2\) are the sample sizes of the two groups.
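As a minimal sketch (using the recoded std_mat from the appendix, where sex is 0 for male and 1 for female), the statistic can be computed directly from this formula and checked against R's built-in Welch test:
# Sketch: Welch t-statistic computed by hand from the formula above
g_male   <- std_mat$G3[std_mat$sex == 0]   # group 1: male students
g_female <- std_mat$G3[std_mat$sex == 1]   # group 2: female students
t_manual <- (mean(g_male) - mean(g_female)) /
  sqrt(var(g_male) / length(g_male) + var(g_female) / length(g_female))
t_manual  # should match the t reported by t.test(G3 ~ sex, data = std_mat)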
Impact of Romantic Relationships:
To assess whether romantic relationships affect academic performance, a \(t\text{-test}\) will compare final grades (G3) between students who are in a romantic relationship and those who are not. This analysis initially considers a two-sided alternative hypothesis, followed by a one-sided \(t\text{-test}\) under the hypothesis that students without a relationship perform better on average. The one-sided \(t\text{-test}\) is conducted similarly to the two-sided test, but with the alternative hypothesis: \[ H_1: \mu_1 > \mu_2 \] where \(\mu_1\) is the mean final grade of students not in a relationship and \(\mu_2\) that of students in one.
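In R, this one-sided test can be sketched as follows (assuming the recoded std_mat from the appendix, where romantic is 0 for "no" and 1 for "yes"; with the formula interface, alternative = "greater" tests whether group 0 has the higher mean):
# Sketch: one-sided Welch t-test, H1: students not in a relationship score higher
t.test(G3 ~ romantic, data = std_mat, alternative = "greater")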
Gender and Relationship Status on Final Grades:
A \(\text{two-way ANOVA}\) will examine whether there is a significant interaction between gender and relationship status in predicting final grades (G3). The analysis will test for differences among four groups: female students not in a relationship, female students in a relationship, male students not in a relationship, and male students in a relationship. The \(\text{two-way ANOVA}\) model is: \[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \] where \(Y_{ijk}\) is the final grade, \(\mu\) is the overall mean, \(\alpha_i\) is the effect of gender, \(\beta_j\) is the effect of relationship status, \((\alpha\beta)_{ij}\) is the interaction effect, and \(\epsilon_{ijk}\) is the error term.
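A minimal sketch of this model in R (assuming the recoded std_mat from the appendix; sex and romantic are wrapped in factor() so aov treats them as categorical):
# Sketch: two-way ANOVA with interaction, then Tukey's HSD for pairwise comparisons
fit_2way <- aov(G3 ~ factor(sex) * factor(romantic), data = std_mat)
summary(fit_2way)    # main effects of gender and relationship status, plus interaction
TukeyHSD(fit_2way)   # includes pairwise differences among the four gender x relationship cells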
Study Time and Grades:
To determine if different study times impact final grades, a \(\text{one-way ANOVA}\) will compare grades (G3) across students with different levels of weekly study time, categorized as follows:
less than 2 hours
2-5 hours
5-10 hours
more than 10 hours.
The \(\text{one-way ANOVA}\) model is: \[ Y_{ij} = \mu + \tau_i + \epsilon_{ij} \] where \(Y_{ij}\) is the final grade of the \(j\)-th student in the \(i\)-th study time category, \(\mu\) is the overall mean, \(\tau_i\) is the effect of the \(i\)-th study time category, and \(\epsilon_{ij}\) is the error term.
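In R, this test can be sketched as follows (assuming std_mat from the appendix, where studytime is coded 1-4 for the four categories above):
# Sketch: one-way ANOVA of final grade across the four study time categories
fit_study <- aov(G3 ~ factor(studytime), data = std_mat)
summary(fit_study)    # overall F-test (cf. Table 2 in the results)
TukeyHSD(fit_study)   # pairwise comparisons between study time groups (cf. Table 3)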
Parental Education and Student Success:
A \(\text{Chi-square}\) test of independence will assess whether there is an association between parental education levels (Medu and Fedu) and student success, defined by whether a student passes or fails (with G3 ≥ 10 indicating a pass). Parental education levels are grouped into "High" (where both parents have an education level of 3 or above) and "Low" (where either or both parents have an education level below 3). The \(\text{Chi-square}\) test statistic is: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency under the null hypothesis of independence.
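A minimal sketch of this test in R, assuming std_mat from the appendix:
# Sketch: Chi-square test of independence for parental education vs. pass/fail
par_edu <- ifelse(std_mat$Medu >= 3 & std_mat$Fedu >= 3, "High", "Low")
success <- ifelse(std_mat$G3 >= 10, "Pass", "Fail")
table(par_edu, success)             # observed contingency table
chisq.test(table(par_edu, success)) # test statistic and p-value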
Absences and Academic Performance:
A Pearson correlation test will be performed to investigate the relationship between the number of absences and final grades (G3). This will assess whether frequent absences are associated with lower academic performance. The Pearson correlation coefficient is: \[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \] where \(X_i\) and \(Y_i\) are the individual data points, and \(\bar{X}\) and \(\bar{Y}\) are the sample means of the variables.
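In R, this reduces to a single call (a sketch assuming std_mat from the appendix):
# Sketch: Pearson correlation between absences and final grade, with significance test
cor.test(std_mat$absences, std_mat$G3, method = "pearson")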
This section incorporates machine learning techniques to enhance our
analysis by focusing on feature selection and predictive modeling to
forecast final grades (G3) based on key academic and social
characteristics. The machine learning approach allows us to identify
patterns and relationships that may not be as evident in traditional
statistical methods.
To determine the most impactful predictors of academic performance, we applied the regsubsets function from the leaps package, which evaluates subsets of variables to identify the best predictors for final grades (G3). This approach allowed us to rank variables in terms of their predictive strength and provided a broad overview of candidate models with different combinations of variables.
Fig 3. Feature Selection Summary
Information Criteria and Selection Process
The regsubsets function provided a selection process guided by multiple information criteria - namely, \(\text{R-squared}\), \(\text{adjusted R-squared}\), \(\text{Mallows' Cp}\), and \(\text{BIC}\). Each criterion offers unique insights into model quality:
- \(\text{R-squared}\) and \(\text{adjusted R-squared}\) measure the proportion of variance explained by the model, with adjusted R-squared penalizing the addition of non-informative predictors to avoid overfitting.
- \(\text{Mallows' Cp}\) assesses model bias and complexity, aiming for values near the number of predictors, indicating a well-fit model without excessive variables.
- \(\text{BIC}\) (Bayesian Information Criterion) prioritizes models with fewer predictors, imposing a penalty on complex models, and is particularly valuable for models expected to generalize well to new data.
We carefully reviewed each of these criteria and selected models balancing simplicity and predictive accuracy. Compared to a basic stepwise approach, regsubsets provides a holistic view of possible models and avoids the risk of narrow selections inherent in purely sequential methods. Using these criteria, we aimed for models with the most explanatory power that remained simple, thus enhancing interpretability and preventing overfitting.
Fig 4. Subset Selection
By exploring subsets and visualizing metrics such as \(\text{R-squared}\), \(\text{adjusted R-squared}\), \(\text{Mallows' Cp}\), and \(\text{BIC}\), we determined the model complexity that best balances predictive accuracy with parsimony.
Fig 5. The Best Subset
Figure 5 shows how the \(\text{RSS}\), \(\text{adjusted R-squared}\), and \(\text{Cp}\) values guided the selection of variables with the highest predictive utility, allowing us to proceed with a well-defined subset for regression modeling.
Linear Regression Model
Using the subset of selected predictors, we developed a linear regression model to estimate final grades. This model quantified the contributions of variables such as age, prior academic performance (G1, G2), and family dynamics (famrel) in predicting G3. Cross-validation (10-fold) was used to assess model stability and prevent overfitting. The model achieved a multiple \(\text{R-squared}\) of \(0.8374\), suggesting that it explains a substantial portion of the variability in student grades. Comparison of models through cross-validation confirmed that including social variables such as romantic relationships and activities had only minor effects on prediction accuracy.
Decision Tree Model
A decision tree model was also implemented to classify students into performance categories (high-performing vs. low-performing) based on academic and social factors. This model provided an intuitive, non-linear approach, mapping out decision paths that highlight influential variables and specific thresholds (e.g., number of absences and G2 score) critical for predicting performance outcomes.
Fig 6. The Decision Tree Model
We now present the results of the statistical analyses of the relationships between demographic, social, and academic factors and student performance.
To evaluate whether there is a significant difference in academic performance between male and female students, a two-sample \(t\text{-test}\) was conducted on the final grades (G3). The test revealed a statistically significant difference between the two groups (\(t = 2.0651\), \(p\text{-value} = 0.03958\), \(95\% \text{ CI} = [0.04545, 1.8507]\)). This indicates that the observed difference is unlikely to be due to random variation. The mean final grade for male students was found to be \(10.914\), while the mean for female students was \(9.966\). The \(95\%\) confidence interval for the difference in means ranges from \(0.04545\) to \(1.8507\), suggesting that, on average, male students score between \(0.04545\) and \(1.8507\) points higher than female students.
The \(p\)-value of 0.03958 means that, if there were truly no gender difference, a difference at least this large would occur by chance only 3.96% of the time; this is the risk of a Type I error, that is, of incorrectly concluding that a gender-based difference in academic performance exists when none does. To better understand this risk, we can visualize the critical region and where the observed \(t\)-value falls:
Fig 7. The plot shows the critical region. Since the observed t-value exceeds the critical value, the null hypothesis is rejected with minimal risk of a Type I error.
Given the significant \(p\text{-value}\), we can reject the null hypothesis and conclude that there is indeed a statistically significant difference in performance between male and female students in Mathematics. This \(t\text{-test}\) is appropriate for comparing the means of two independent groups, and the statistically significant result suggests that gender may play a role in academic performance in Mathematics, potentially related to social or psychological factors that merit further exploration.
In addition to gender, the study also examined whether being in a romantic relationship influences academic performance. Initially, a two-tailed \(t\text{-test}\) was performed to assess whether students in a relationship (group 1) perform differently from those who are not (group 0). However, given the research interest in determining whether students in a relationship have lower grades, a one-sided \(t\text{-test}\) was more appropriate.
The one-sided \(t\text{-test}\) yielded a \(t\text{-value}\) of \(2.5122\) and a \(p\text{-value}\) of \(0.006328\), significant at the \(1\%\) level (the corresponding two-sided test gives \(p = 0.01266\), \(95\% \text{ CI} = [0.2722, 2.2493]\)). This result indicates a statistically significant difference, supporting the hypothesis that students not in a relationship tend to have higher grades. The \(95\%\) confidence interval for the difference in means ranges from \(0.272\) to \(2.249\), indicating that, on average, students not in a romantic relationship score between \(0.272\) and \(2.249\) points higher than those in a relationship.
Fig 8. Visualization of the null t-distribution for a one-sided test, showing the Type I error region (red) beyond the critical value
The findings indicate that students not involved in a romantic relationship have significantly better academic outcomes. These results align with the hypothesis that romantic involvement could distract students from their studies. We therefore reject the null hypothesis and conclude that being in a romantic relationship is associated with lower academic achievement in this context.
We conducted an Analysis of Variance (\(\text{ANOVA}\)) to evaluate whether there are significant differences in final grades (G3) among four groups of students based on their gender and relationship status. The four groups considered were:
Female students not in a romantic relationship
Female students in a romantic relationship
Male students not in a romantic relationship
Male students in a romantic relationship
The result shows a statistically significant difference in G3 scores among the four groups at the \(5\%\) significance level \((F = 3.609, p = 0.0135)\). To further investigate the differences between specific pairs of groups, a post-hoc Tukey's HSD test was conducted. The key comparisons and their results are shown in Table 1.
Table 1: Pairwise Comparisons for Gender and Romantic Relationship on Final Grade (G3)
Comparison                              Difference   Lower CI   Upper CI   Adjusted p-value
Female_Romance - Female_No_Romance         -0.5616    -2.0053     0.8822        0.7474
Male_No_Romance - Female_No_Romance        -0.6968    -2.5962     1.2025        0.7797
Male_Romance - Female_No_Romance           -2.0993    -3.7596    -0.4390        0.0066
Male_No_Romance - Female_Romance           -0.1353    -2.0450     1.7744        0.9978
Male_Romance - Female_Romance              -1.5377    -3.2099     0.1345        0.0841
Male_Romance - Male_No_Romance             -1.4024    -3.4807     0.6759        0.3038
Male students in a romantic relationship vs. Female students not in a romantic relationship:
This comparison reveals a statistically significant difference \((p = 0.0066, 95\% \text{ CI } = [-3.76, -0.44])\), which indicates that male students in a romantic relationship perform significantly worse than female students not in a relationship.
Other Comparisons (Non-Significant Results):
The results indicate no statistically significant differences for the other comparisons, including differences between female students in and out of relationships, and between male and female students not in relationships. This suggests that, aside from the significant difference observed for male students in a romantic relationship, other pairwise groupings do not show meaningful disparities in academic performance.
These results highlight that the most significant impact is observed when comparing male students in a romantic relationship to female students not in a relationship, where the former group performs significantly worse. For other group comparisons, the differences are either non-significant or marginally significant, indicating that relationship status may interact differently with gender in influencing academic success.
For the analysis regarding study time, the one-way ANOVA results indicate that there is no statistically significant difference in final grades (G3) among students with varying study times \((F = 1.728, p = 0.161)\). This suggests that the amount of time students dedicate to studying, whether categorized as less than \(2\) hours, \(2-5\) hours, \(5-10\) hours, or more than \(10\) hours per week, does not by itself significantly determine final grades.
Table 2: ANOVA Results for Study Time on Final Grade (G3)
Source       Df   Sum of Squares   Mean Square   F-value   Pr(>F)
Study Time    3        108             36.07      1.728     0.161
Residuals   391       8162             20.87
To further analyze any potential differences between specific pairs of study time groups, pairwise comparisons were conducted. As Table 3 demonstrates, none of the differences between study time groups were statistically significant after adjusting for multiple comparisons.
Table 3: Pairwise Comparisons for Study Time on Final Grade (G3)
Comparison                                   Difference   Lower CI   Upper CI   Adjusted p-value
2-5 hours - Less than 2 hours                   0.1241    -1.2990     1.5472        0.9960
5-10 hours - Less than 2 hours                  1.3524    -0.5081     3.2128        0.2403
Greater than 10 hours - Less than 2 hours       1.2116    -1.3320     3.7553        0.6088
5-10 hours - 2-5 hours                          1.2283    -0.4568     2.9134        0.2380
Greater than 10 hours - 2-5 hours               1.0875    -1.3308     3.5059        0.6523
Greater than 10 hours - 5-10 hours             -0.1407    -2.8397     2.5582        0.9991
This lack of significant differences suggests that study time, at least as reported and categorized in this study, does not appear to be a strong determinant of academic success. Possible reasons include:
Quality Over Quantity: It is possible that the effectiveness of study time varies among students, and the quality or method of studying may be more impactful than the sheer quantity of hours studied. For instance, students with strong study skills and efficient techniques might achieve better outcomes even with fewer study hours, compared to students who spend more time but use less effective strategies.
Influence of Other Factors: Academic performance is influenced by multiple factors beyond study time alone, such as prior knowledge, engagement in class, and individual learning styles. This finding might suggest that other factors in the dataset (e.g., prior grades, family support) play a more prominent role in determining academic outcomes than time spent studying.
Potential for Measurement Error: The study time variable is self-reported, which could lead to inaccuracies due to overestimation or underestimation by students. Furthermore, the categories may be too broad to capture meaningful differences in study habits accurately.
In conclusion, while the \(\text{one-way ANOVA}\) test did not reveal a significant relationship between study time and academic performance in this sample, this result does not necessarily imply that study time is unimportant. Instead, it points to the complex nature of academic success, which likely depends on a combination of effective study strategies and other influential factors.
We conducted a \(\text{Chi-square}\) test to determine whether there is a significant association between parents' education levels (Medu and Fedu) and student success (measured by whether students pass or fail, with G3 \(\geq 10\) considered a pass). The parental education levels were categorized into two groups: "High" (where both parents have an education level of \(3\) or above) and "Low" (where either or both parents have an education level below \(3\)). The outcome variable was student success, where "Pass" indicates a final grade of \(10\) or above and "Fail" indicates a grade below \(10\).
Fig 9. Parental Education Level
The test was not statistically significant \((p = 0.2655)\), so there is insufficient evidence to conclude that parents' education levels are associated with the likelihood of a student passing or failing. In other words, based on this dataset, the education levels of parents do not appear to have a significant impact on whether a student passes or fails. This result may point to other factors beyond parental education, such as student effort, teaching quality, or social influences, playing a more dominant role in determining academic success.
We used a \(\text{Pearson correlation}\) test to investigate whether there is a significant relationship between the number of absences (absences) and final grades (G3). The test was not statistically significant \((p = 0.4973, t = 0.6793, \text{correlation} = 0.034)\), so there is insufficient evidence to conclude that the number of absences is associated with final grades. In other words, based on this dataset, absenteeism does not appear to have a significant linear relationship with academic performance. This result may seem surprising, as one might expect that higher rates of absenteeism would negatively impact academic achievement due to lost instructional time and reduced engagement with course material. Several factors may help explain this unexpected outcome:
Alternative Learning Methods: Some students may effectively compensate for missed classes through alternative learning methods, such as self-study or tutoring. In such cases, absences may not significantly impact performance, particularly if students have access to resources or support outside of regular classroom hours.
Variation in Absence Reasons: The dataset does not specify the reasons for students’ absences, which could vary widely. For example, absences due to school-related activities or minor illnesses may not have the same impact as prolonged absences due to more serious issues.
Potential Underreporting or Inaccurate Recording of Absences: It is also possible that students or schools underreported absences or that records do not reflect the true number of classes missed. If absences are inaccurately recorded, this would weaken any observed correlation between attendance and performance.
Future research might benefit from categorizing absences based on their nature (excused vs. unexcused, short-term vs. long-term); understanding how absences interact with other factors, such as parental involvement and study habits, could also provide a more comprehensive view of their effect on academic outcomes. In summary, while the \(\text{Pearson correlation}\) test does not show a significant association between absences and final grades in this dataset, this finding highlights the complexity of the factors influencing academic success. Absences alone may not fully capture the challenges students face, suggesting that academic outcomes are influenced by a more intricate interplay of personal, educational, and social factors.
In this study, we used two machine learning models to predict student performance in mathematics based on academic and social features. The models explored how well various characteristics, such as prior grades, family relationships, and extracurricular activities, predict the final grade (G3). The two models used are a linear regression model and a decision tree classifier.
The linear regression model was developed to predict students’ final
grades (G3) based on key predictors: age, first-period
grades (G1), second-period grades (G2), family
relationships, absences, weekend alcohol consumption
(Walc), and romantic relationships. These predictors were
selected for their potential impact on academic performance.
Model Fit and Performance
The linear regression model achieved a multiple R-squared value of \(R^2 = 0.8374\), indicating that \(83.74\%\) of the variance in students' final grades (G3) is explained by the predictors. This high \(R^2\), coupled with an adjusted R-squared of \(0.834\), suggests that the model fits well while guarding against overfitting. The linear model is expressed as: \[ G3 = \beta_0 + \beta_1 (\text{age}) + \beta_2 (G1) + \beta_3 (G2) + \beta_4 (\text{activities}) + \beta_5 (\text{famrel}) + \beta_6 (\text{absences}) + \beta_7 (\text{Walc}) + \beta_8 (\text{romantic}) + \epsilon \]
where each \(\beta\) coefficient represents the estimated effect of each predictor on the final grade, and \(\epsilon\) is the error term.
Model Significance
The \(\text{F-statistic}\) of the model is: \[ F = \frac{\text{MSR}}{\text{MSE}} = 248.4 \] where \(\text{MSR}\) (mean square due to regression) represents the variation explained by the model and \(\text{MSE}\) (mean squared error) reflects the variation left unexplained; the F-statistic thus measures the ratio of model-explained variance to unexplained variance. With a \(p\text{-value} < 2.2 \times 10^{-16}\), the model is highly statistically significant, meaning that the included predictors collectively contribute to predicting students' final grades.
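For reference, both quantities can be recovered from the fitted model object (a sketch assuming model1 from the appendix):
# Sketch: extracting the F-statistic and its p-value from the fitted lm object
fstat <- summary(model1)$fstatistic                  # named vector: value, numdf, dendf
fstat["value"]                                       # F = 248.4 in our fit
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # p-value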
Table 4: Linear Regression Results for Predicting Final Grades (G3)
Residuals:
Statistic    Value
Min         -8.9571
1Q          -0.4748
Median       0.2849
3Q           1.0339
Max          4.3246
Predictor     Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     0.0019      1.3853      0.001     0.9989
age            -0.2146      0.0777     -2.763     0.0060 **
G1              0.1801      0.0552      3.261     0.0012 **
G2              0.9622      0.0491     19.597     < 2e-16 ***
activities     -0.3424      0.1895     -1.807     0.0716 .
famrel          0.3725      0.1065      3.496     0.0005 ***
absences        0.0440      0.0121      3.629     0.0003 ***
Walc            0.1233      0.0753      1.639     0.1021
romantic       -0.3204      0.2064     -1.552     0.1214
Model Summary:
\(\text{Residual standard error: } 1.867 \text{ on } 386 \text{ degrees of freedom}\)
\(\text{Multiple R-squared: } 0.8374\)
\(\text{Adjusted R-squared: } 0.834\)
\(\text{F-statistic: } 248.4 \text{ on } 8 \text{ and } 386 \text{ DF, } p\text{-value} < 2.2 \times 10^{-16}\)
Key Predictors and Insights
Second-Period Grades (G2): With an estimated coefficient \(\beta_3 = 0.962\) (\(p < 2 \times 10^{-16}\)), G2 had the largest positive effect on G3, indicating that students' performance in the second term strongly influences their final outcome.
First-Period Grades (G1): The coefficient for G1 was \(\beta_2 = 0.180\) (\(p = 0.0012\)), suggesting that initial term performance is also significant, though less impactful than G2.
Absences: Absences had a small but statistically significant association with G3 (\(\beta_6 = 0.044\), \(p = 0.00032\)). Notably, the estimated coefficient is positive: once prior grades and the other predictors are controlled for, additional recorded absences are associated with marginally higher final grades in this sample, a counterintuitive result that warrants caution in interpretation.
Family Relationships (famrel): Family relationship quality had a positive effect on G3 (\(\beta_5 = 0.372\), \(p = 0.00053\)), indicating that strong family support correlates with higher grades.
Age (age): The coefficient for age was \(\beta_1 = -0.215\) (\(p = 0.006\)), suggesting that older students in this sample may perform slightly worse on average.
Extracurricular Activities (activities) and Romantic Relationships (romantic): These variables were not statistically significant (activities: \(\beta_4 = -0.342\), \(p = 0.072\); romantic: \(\beta_8 = -0.320\), \(p = 0.121\)), indicating minimal impact on G3 in this dataset.
Summary of Linear Model Findings:
The linear regression model provided valuable insights into the main factors influencing student performance, especially the role of consistent academic effort and family support. These findings highlight potential intervention points for educators, such as monitoring early-term grades and fostering supportive environments.
A decision tree model was used to classify students as high-performing (G3 \(> 13\)) or below this threshold. The tree was grown on the full set of predictors (excluding G3 itself) and was pruned to improve its generalizability.
Model Performance:
The initial, unpruned decision tree achieved an accuracy of \(95.6\%\) in classifying students' performance on the test data. We then pruned the tree to its optimal size using cross-validation, which improved accuracy slightly to \(96.2\%\) while enhancing interpretability, showing that a simplified model can maintain predictive performance and potentially reduce overfitting.
Fig 10. Pruned Decision Tree Structure
Key Predictors and Interpretation:
- Second-Period Grade (G2): As with the linear model, G2 emerged as the top predictor in the decision tree. It was the initial split point, distinguishing high- and low-performing students.
- Weekend Alcohol Consumption (Walc): For students with G2 between \(12.5\) and \(13.5\), weekend alcohol consumption further separated those likely to achieve high scores. The tree split Walc at a threshold of \(1.5\), with students reporting lower alcohol consumption (Walc \(< 1.5\)) being more likely to perform well.
The pruned tree's structure offers clear insights. Students with G2 \(< 12.5\) were typically not high-performing, while those with G2 above this threshold were more likely to achieve a high score, particularly if they had low weekend alcohol consumption.
Summary of Decision Tree Findings:
The pruned decision tree provided an intuitive, non-linear representation of how specific factors impact student performance. It highlighted key thresholds within predictors like G2 and Walc, giving educators actionable cut-points for identifying at-risk students. This model's simplicity and clarity make it a practical tool for educational interventions, helping teachers prioritize support for students based on second-term grades and lifestyle factors such as alcohol consumption.
This study offers insights into the factors affecting academic performance in Mathematics among secondary school students. However, several limitations and broader considerations warrant discussion.
One limitation involves the data itself, which is derived from two Portuguese schools and may not fully generalize to other cultural or educational settings. Differences in societal structures, teaching methods, and parental involvement across cultures could influence the applicability of our findings. Research by Wang and Li (2024) emphasizes that parental involvement and its impact on academic outcomes vary significantly between collectivist and individualist societies. The current study grouped parental education as a binary factor without exploring the nuances of how active parental support and engagement impact students’ learning experiences. Future research could investigate these aspects more deeply, especially how parental engagement in collectivist settings like Portugal may differ from that in more individualist contexts.
Another consideration is the reliance on self-reported social factors, such as students’ romantic relationships and study time, which may introduce response bias. Additionally, while our results suggest that attendance is positively related to academic performance, the influence of other school environmental factors, such as classroom resources and outdoor spaces, was not directly examined. Prior work by Kweon et al. (2017) has highlighted the impact of school environments on students’ ability to perform academically, suggesting that factors like access to green spaces and overall school atmosphere may affect academic outcomes but were not included in this study.
The choice of machine learning models (linear regression and decision tree) allowed us to explore predictors and classify performance groups. However, these models do not capture all complex interactions, especially non-linear relationships. More complex models, such as neural networks or ensemble methods, could yield greater predictive accuracy, though they would require larger datasets for meaningful results.
Furthermore, Ali et al. (2013) indicate that demographic factors, including socioeconomic status and available resources, can significantly impact academic performance. While this study included parental education as a measure, other socioeconomic variables like family income and access to educational resources were not captured and may play an influential role. Addressing these factors could provide a more comprehensive view of the various influences on academic success.
Future research should consider a more diversified dataset and include additional socio-economic and school environmental factors to enrich our understanding. Incorporating longitudinal data might also allow us to track students’ academic performance over time, providing insight into how factors like study habits or parental support evolve and influence outcomes across different educational stages.
This study investigated the factors influencing academic performance in Mathematics among secondary school students from two Portuguese schools, using both statistical analysis and machine learning approaches. The findings highlight the significance of demographic, social, and academic variables in shaping student outcomes, with specific attention to the impacts of gender, romantic relationships, attendance, and prior grades.
The statistical analyses revealed that male students, on average, scored slightly higher in Mathematics than female students, while students not in romantic relationships also tended to perform better academically. Notably, absences showed no significant simple correlation with final grades, although they were a statistically significant predictor in the regression model. Variables like study time and parental education level likewise had limited or no statistically significant impact on academic performance in this dataset. These insights suggest that while certain demographic and social factors contribute to academic success, other variables, like prior performance and stable family environments, may hold more importance in educational interventions.
The machine learning models—linear regression and decision trees—reinforced the significance of prior grades, especially the second-period grade, in predicting final outcomes. The linear regression model’s high explanatory power demonstrated that factors such as family support and academic consistency positively influence grades, while the decision tree’s intuitive structure identified critical thresholds in student performance. Together, these models not only enhance predictive accuracy but also provide actionable insights for educators, who can use these findings to identify and support students at risk of underperformance.
In conclusion, this study underscores the value of integrating statistical and machine learning methods to better understand and predict student performance. By identifying key predictors and their relationships to academic outcomes, educators and policymakers can design targeted interventions that support student success, ultimately contributing to a more effective and inclusive educational environment. Future research could expand on these findings by incorporating a broader range of variables and exploring the role of additional personal and social factors in student performance.
Ali, Shoukat, et al. "Factors Contributing to the Students Academic Performance: A Case Study of Islamia University Sub-Campus." American Journal of Educational Research, vol. 1, no. 8, 20 Aug. 2013, pp. 283–289, https://doi.org/10.12691/education-1-8-3.
Cortez, Paulo, and Alice Silva. "Using Data Mining to Predict Secondary School Student Performance." Proceedings of the 5th Annual Future Business Technology Conference (FUBUTEC 2008), 2008.
Kweon, Byoung-Suk, et al. “The link between school environments and
student academic performance.” Urban Forestry & Urban
Greening, vol. 23, Apr. 2017, pp. 35–43, https://doi.org/10.1016/j.ufug.2017.02.002.
Wang, Yiheng, and Liman Man Li. “Relationships between parental
involvement in homework and learning outcomes among elementary school
students: The moderating role of societal collectivism–individualism.”
British Journal of Educational Psychology, vol. 94, no. 3, 15
May 2024, pp. 881–896, https://doi.org/10.1111/bjep.12692.
# Load necessary libraries for data manipulation, visualization, and modeling
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR2)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:ISLR2':
##
## Boston
##
## The following object is masked from 'package:dplyr':
##
## select
library(leaps)
library(boot)
library(tree)
# Load the dataset
std_mat <- read.csv("student+performance/student-mat.csv", sep = ";", header = TRUE)
head(std_mat)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
# Data cleaning and transformation
# Remove rows with missing values and convert categorical variables to binary/numeric format
std_mat <- read.csv("student+performance/student-mat.csv", sep = ";", header = TRUE) %>%
na.omit() %>%
mutate(
school = ifelse(school == "GP", 0, 1), # Convert 'GP' to 0 and 'MS' to 1 for school
sex = ifelse(sex == "M", 0, 1), # Convert Male to 0 and Female to 1
address = ifelse(address == "R", 0, 1), # Rural as 0, Urban as 1
famsize = ifelse(famsize == "LE3", 0, 1), # Family size ≤3 as 0, >3 as 1
Pstatus = ifelse(Pstatus == "T", 1, 0), # Together as 1, Apart as 0
# Convert 'yes'/'no' responses to 1/0
schoolsup = ifelse(schoolsup == "yes", 1, 0),
famsup = ifelse(famsup == "yes", 1, 0),
paid = ifelse(paid == "yes", 1, 0),
activities = ifelse(activities == "yes", 1, 0),
nursery = ifelse(nursery == "yes", 1, 0),
higher = ifelse(higher == "yes", 1, 0),
internet = ifelse(internet == "yes", 1, 0),
romantic = ifelse(romantic == "yes", 1, 0),
# Assign numerical codes for categorical variables 'reason', 'guardian', 'Mjob', and 'Fjob'
reason = case_when(
reason == "course" ~ 0,
reason == "other" ~ 1,
reason == "home" ~ 2,
reason == "reputation" ~ 3
),
guardian = case_when(
guardian == "mother" ~ 0,
guardian == "other" ~ 1,
guardian == "father" ~ 2
),
Mjob = case_when(
Mjob == "at_home" ~ 0,
Mjob == "other" ~ 1,
Mjob == "health" ~ 2,
Mjob == "services" ~ 3,
Mjob == "teacher" ~ 4
),
Fjob = case_when(
Fjob == "at_home" ~ 0,
Fjob == "other" ~ 1,
Fjob == "health" ~ 2,
Fjob == "services" ~ 3,
Fjob == "teacher" ~ 4
)
)
head(std_mat) # View the transformed data
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 1 0 1 18 1 1 0 4 4 0 4 0 0
## 2 0 1 17 1 1 1 1 1 0 1 0 2
## 3 0 1 15 1 0 1 1 1 0 1 1 0
## 4 0 1 15 1 1 1 4 2 2 3 2 0
## 5 0 1 16 1 1 1 3 3 1 1 2 2
## 6 0 0 16 1 0 1 4 3 3 1 3 0
## traveltime studytime failures schoolsup famsup paid activities nursery higher
## 1 2 2 0 1 0 0 0 1 1
## 2 1 2 0 0 1 0 0 0 1
## 3 1 2 3 1 0 1 0 1 1
## 4 1 3 0 0 1 1 1 1 1
## 5 1 2 0 0 1 1 0 1 1
## 6 1 2 0 0 1 1 1 1 1
## internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
## 1 0 0 4 3 4 1 1 3 6 5 6 6
## 2 1 0 5 3 3 1 1 3 4 5 5 6
## 3 1 0 4 3 2 2 3 3 10 7 8 10
## 4 1 1 3 2 2 1 1 5 2 15 14 15
## 5 0 0 4 3 2 1 2 5 4 6 10 10
## 6 1 0 5 4 2 1 2 5 10 15 15 15
# View the structure of the dataset to confirm variable types
glimpse(std_mat)
## Rows: 395
## Columns: 33
## $ school <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sex <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ famsize <dbl> 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ Pstatus <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <dbl> 0, 0, 0, 2, 1, 3, 1, 1, 3, 1, 4, 3, 2, 4, 1, 2, 3, 1, 3, 2,…
## $ Fjob <dbl> 4, 1, 1, 3, 1, 1, 1, 4, 1, 1, 2, 1, 3, 1, 1, 1, 3, 1, 3, 1,…
## $ reason <dbl> 0, 0, 1, 2, 2, 3, 2, 2, 2, 2, 3, 3, 0, 0, 2, 2, 3, 3, 0, 2,…
## $ guardian <dbl> 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2,…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ famsup <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ paid <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ activities <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1,…
## $ nursery <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ higher <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ internet <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ romantic <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
# View summary statistics to understand the distributions of each variable
summary(std_mat)
## school sex age address
## Min. :0.0000 Min. :0.0000 Min. :15.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:16.0 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median :17.0 Median :1.0000
## Mean :0.1165 Mean :0.5266 Mean :16.7 Mean :0.7772
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:18.0 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :22.0 Max. :1.0000
## famsize Pstatus Medu Fedu
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :1.0000 Median :3.000 Median :2.000
## Mean :0.7114 Mean :0.8962 Mean :2.749 Mean :2.522
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.899 Mean :1.777 Mean :1.441 Mean :0.5367
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :3.000 Max. :2.0000
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.000 Median :2.000 Median :0.0000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342 Mean :0.1291
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000 Max. :1.0000
## famsup paid activities nursery
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.6127 Mean :0.4582 Mean :0.5089 Mean :0.7949
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## higher internet romantic famrel
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:4.000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :4.000
## Mean :0.9494 Mean :0.8329 Mean :0.3342 Mean :3.944
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:5.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
# Attach the dataset for easier reference
attach(std_mat)
# Categorize final grade (G3) into performance levels and visualize the distribution
math <- std_mat %>%
mutate(
performance = case_when(
G3 >= 16 & G3 <= 20 ~ "Excellent",
G3 >= 14 & G3 <= 15 ~ "Good",
G3 >= 12 & G3 <= 13 ~ "Satisfactory",
G3 >= 10 & G3 <= 11 ~ "Sufficient",
TRUE ~ "Fail"
)
)
# Plot histogram of G3 categorized by performance levels
ggplot(math, aes(x = G3, fill = performance)) +
geom_histogram(binwidth = 1, color = "black") +
scale_fill_manual(values = c("Excellent" = "darkgreen", "Good" = "lightgreen",
"Satisfactory" = "yellow", "Sufficient" = "orange", "Fail" = "red")) +
labs(title = "Distribution of Final Grades (G3) in Mathematics",
x = "Final Grade (G3)",
y = "Frequency",
fill = "Performance Level") +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text = element_text(size = 14),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12)
)
# Perform subset selection for linear regression model using regsubsets
regfit.full <- regsubsets(G3 ~ ., data = std_mat)
reg.summary <- summary(regfit.full)
names(reg.summary) # Examine available metrics in summary
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
# Plot model selection metrics (R2, adjusted R2, BIC, Cp) for each subset size
par(mfrow = c(2, 2))
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "bic")
plot(regfit.full, scale = "Cp")
# Examine model fit using various criteria (RSS, adjusted R2, Cp, BIC)
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS")
which.min(reg.summary$rss) # Identify model with minimum RSS
## [1] 8
points(8, reg.summary$rss[8], col = "red", cex = 2, pch = 19)
# Adjusted R^2 to select optimal number of predictors
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq")
which.max(reg.summary$adjr2)
## [1] 8
points(8, reg.summary$adjr2[8], col = 'red', cex = 2, pch = 19)
# Cp criterion
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
which.min(reg.summary$cp)
## [1] 8
points(8, reg.summary$cp[8], col = "red", cex = 2, pch = 19)
# Bayesian Information Criterion (BIC)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC")
which.min(reg.summary$bic)
## [1] 5
points(5, reg.summary$bic[5], col = "red", cex = 2, pch = 19)
# Construct linear regression models with selected variables
set.seed(1)
model1 <- lm(G3 ~ age + G1 + G2 + activities + famrel + absences + Walc + romantic, data = std_mat)
train <- sample(nrow(std_mat), nrow(std_mat) * 0.6) # Split dataset for cross-validation
# Calculate MSE for model1
mean((G3 - predict(model1, std_mat))[-train]^2)
## [1] 3.841983
summary(model1)
##
## Call:
## lm(formula = G3 ~ age + G1 + G2 + activities + famrel + absences +
## Walc + romantic, data = std_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9571 -0.4748 0.2849 1.0339 4.3246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001922 1.385330 0.001 0.998894
## age -0.214582 0.077672 -2.763 0.006007 **
## G1 0.180141 0.055236 3.261 0.001208 **
## G2 0.962210 0.049100 19.597 < 2e-16 ***
## activities -0.342371 0.189499 -1.807 0.071586 .
## famrel 0.372452 0.106523 3.496 0.000526 ***
## absences 0.044042 0.012137 3.629 0.000323 ***
## Walc 0.123346 0.075265 1.639 0.102067
## romantic -0.320448 0.206438 -1.552 0.121416
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.867 on 386 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.834
## F-statistic: 248.4 on 8 and 386 DF, p-value: < 2.2e-16
# 10-fold cross-validation for model performance
model1.glm <- glm(G3 ~ age + G1 + G2 + activities + famrel + absences + Walc + romantic, data = std_mat)
cv_result <- cv.glm(std_mat, model1.glm, K = 10)
cv_result$delta[1] # Cross-validated MSE for model1
## [1] 3.599968
# 10-fold cross-validation for the alternative model (model2)
model2.glm <- glm(G3 ~ G1 + G2 + famrel + absences + age, data = std_mat)
cv_result.2 <- cv.glm(std_mat, model2.glm, K = 10)
cv_result.2$delta[1] # Cross-validated MSE for model2
## [1] 3.570969
The cross-validated MSE indicates that Model 2 performs slightly better than Model 1 (3.571 versus 3.600). This suggests that adding the social variables activities, Walc (weekend alcohol consumption), and romantic relationships does not improve predictive accuracy, and that the simpler model generalizes at least as well.
# Next, we apply a decision tree to classify students based on whether they are high-performing (G3 > 13) or not.
# Create a binary variable indicating high-performing students and build a decision tree model
high_score <- factor(ifelse(G3 > 13, "yes", "no"))
tree.math <- tree(high_score ~ . - G3, data = std_mat)
summary(tree.math) # Show summary of the decision tree model
##
## Classification tree:
## tree(formula = high_score ~ . - G3, data = std_mat)
## Variables actually used in tree construction:
## [1] "G2" "reason" "Walc" "sex" "paid" "Mjob"
## Number of terminal nodes: 10
## Residual mean deviance: 0.1171 = 45.1 / 385
## Misclassification error rate: 0.02785 = 11 / 395
plot(tree.math)
text(tree.math, pretty = 0)
# We split the data to evaluate the performance of the decision tree and apply cross-validation to prune it for optimal performance
set.seed(1)
score.test <- std_mat[-train,]
high_score.test <- high_score[-train]
tree.math <- tree(high_score ~ . - G3, std_mat, subset = train)
tree.pred <- predict(tree.math, score.test, type = "class")
table(tree.pred, high_score.test) # Confusion matrix
## high_score.test
## tree.pred no yes
## no 121 4
## yes 3 30
mean(tree.pred == high_score.test)
## [1] 0.9556962
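Accuracy alone can be flattering when the classes are imbalanced (124 of the 158 test students are "no"). As a supplementary check, precision and recall for the "yes" class can be derived from the same confusion matrix; a minimal sketch reusing the objects above:
cm <- table(tree.pred, high_score.test)
precision <- cm["yes", "yes"] / sum(cm["yes", ])  # of predicted high scorers, the share truly high
recall <- cm["yes", "yes"] / sum(cm[, "yes"])     # of true high scorers, the share identified
c(precision = precision, recall = recall)         # here 30/33 ~ 0.91 and 30/34 ~ 0.88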
# Cross-Validation and Pruning
# To avoid overfitting, we use cross-validation to prune the decision tree and determine the optimal complexity.
cv.math <- cv.tree(tree.math, FUN = prune.misclass)
names(cv.math)
## [1] "size" "dev" "k" "method"
cv.math
## $size
## [1] 7 5 4 2 1
##
## $dev
## [1] 13 13 13 13 66
##
## $k
## [1] -Inf 0 1 2 56
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.math)
# Plot the misclassification error against tree size and pruning parameter k
par(mfrow = c(1, 2))
plot(cv.math$size, cv.math$dev, type = 'b', main = "Tree Size vs Misclassification Error")
plot(cv.math$k, cv.math$dev, type = 'b', main = "k vs Misclassification Error")
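The optimal size can also be extracted programmatically rather than read off the plots; a small sketch using the cv.math object above. Note that several sizes tie at the minimum deviance of 13, and the pruning below uses size 4 from this set:
# Tree sizes attaining the minimum cross-validated deviance
cv.math$size[cv.math$dev == min(cv.math$dev)]  # returns 7 5 4 2 here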
# Prune the decision tree to the optimal size and plot the pruned tree
prune.math <- prune.misclass(tree.math, best = 4)
plot(prune.math)
text(prune.math, pretty = 0)
# Predict on test data using the pruned tree and evaluate accuracy
tree.pred1 <- predict(prune.math, score.test, type = 'class')
table(tree.pred1, high_score.test)
## high_score.test
## tree.pred1 no yes
## no 121 3
## yes 3 31
# Calculate accuracy of the pruned decision tree on test data
mean(tree.pred1 == high_score.test)
## [1] 0.9620253
# T-test to compare the final grades (G3) between male and female students
t_test_result <- t.test(G3 ~ sex, data = std_mat)
t_test_result
##
## Welch Two Sample t-test
##
## data: G3 by sex
## t = 2.0651, df = 390.57, p-value = 0.03958
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.04545244 1.85073226
## sample estimates:
## mean in group 0 mean in group 1
## 10.914439 9.966346
The p-value of 0.03958 is below the 5% significance level, so the observed difference in means is unlikely to be due to chance alone. The 95% confidence interval for the difference in means (male minus female, given the coding 0 = male, 1 = female) ranges from 0.045 to 1.851 points.
The mean final grade (G3) is 10.914 for male students and 9.966 for female students. We therefore reject the null hypothesis and conclude that there is a statistically significant difference between male and female students' performance in Mathematics, with male students scoring slightly higher on average.
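To complement the significance test with an effect size, Cohen's d (the pooled-SD standardized mean difference) can be computed directly; a minimal sketch, assuming std_mat with sex coded 0 = male and 1 = female:
g_m <- std_mat$G3[std_mat$sex == 0]  # male students
g_f <- std_mat$G3[std_mat$sex == 1]  # female students
s_p <- sqrt(((length(g_m) - 1) * var(g_m) + (length(g_f) - 1) * var(g_f)) /
            (length(g_m) + length(g_f) - 2))  # pooled standard deviation
(mean(g_m) - mean(g_f)) / s_p                 # Cohen's d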
alpha <- 0.05
t_observed <- 2.0651
df <- 390
t_crit <- qt(1 - alpha/2, df)
x <- seq(-4, 4, length = 1000)
null_dist <- dt(x, df)
plot(x, null_dist, type = "l", lwd = 2, col = "black",
ylab = "Density", xlab = "t", main = "Type I Error: Two-Sided Test of G3 by Gender")
polygon(c(x[x > t_crit], max(x), t_crit),
c(null_dist[x > t_crit], 0, 0), col = "blue", border = NA)
polygon(c(x[x < -t_crit], -t_crit, min(x)),
c(null_dist[x < -t_crit], 0, 0), col = "blue", border = NA)
abline(v = c(-t_crit, t_crit), col = "darkblue", lwd = 2, lty = 2)
# Label each shaded rejection region with its tail area alpha/2
text(-t_crit - 1, 0.1, expression(alpha/2), col = "darkblue", cex = 0.9, pos = 4)
text(t_crit + 0.5, 0.1, expression(alpha/2), col = "darkblue", cex = 0.9, pos = 2)
abline(v = c(-t_observed,t_observed), col = "green", lwd = 2, lty = 2)
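As a sanity check, the reported p-value can be recovered from the same t distribution used in the plot:
2 * pt(-abs(t_observed), df)  # two-sided p-value, ~0.0397, matching the Welch output up to the rounded df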
# T-test to compare the final grades (G3) between students who have a romance and who don't
t_test_result <- t.test(G3 ~ romantic,
alternative = "two.sided",
paired = FALSE,
data = std_mat)
t_test_result
##
## Welch Two Sample t-test
##
## data: G3 by romantic
## t = 2.5122, df = 240.07, p-value = 0.01266
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.2721553 2.2493333
## sample estimates:
## mean in group 0 mean in group 1
## 10.836502 9.575758
# Conduct a one-sided t-test to test if students without romantic relationships score higher
t_test_romantic_one_sided <- t.test(G3 ~ romantic, data = std_mat,
alternative = "greater",
paired = FALSE)
t_test_romantic_one_sided
##
## Welch Two Sample t-test
##
## data: G3 by romantic
## t = 2.5122, df = 240.07, p-value = 0.006328
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
## 0.432079 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 10.836502 9.575758
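For a positive observed statistic, the one-sided p-value is simply half the two-sided one; a quick check against the Welch output above:
pt(2.5122, df = 240.07, lower.tail = FALSE)  # ~0.00633, i.e. 0.01266 / 2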
alpha <- 0.05
t_observed <- 2.5122
df <- 240
t_crit <- qt(1 - alpha, df)
x <- seq(-4, 4, length = 1000)
null_dist <- dt(x, df)
plot(x, null_dist, type = "l", lwd = 2, col = "black",
ylab = "Density", xlab = "t", main = "Type I Error: One-Sided Test")
polygon(c(x[x > t_crit], max(x), t_crit),
c(null_dist[x > t_crit], 0, 0), col = "red", border = NA)
abline(v = t_crit, col = "darkblue", lwd = 2, lty = 2)
abline(v = t_observed, col = "green", lwd = 2, lty = 2)
text(t_crit + 0.5, 0.1, expression(alpha), col = "darkblue", cex = 0.9, pos = 2)  # shaded tail area alpha
# Create a combined factor for gender and romantic relationship status.
# interaction() orders levels with the first factor varying fastest, so with
# sex coded 0 = male / 1 = female and romantic coded 0 = no / 1 = yes the
# level order is 0_0, 1_0, 0_1, 1_1 and the labels must follow that order.
std_mat$group <- with(std_mat, interaction(sex, romantic, sep = "_"))
std_mat$group <- factor(std_mat$group,
                        labels = c("Male_No_Romance", "Female_No_Romance",
                                   "Male_Romance", "Female_Romance"))
# Perform a one-way ANOVA to compare mean grades across the four groups
anova_result <- aov(G3 ~ group, data = std_mat)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 223 74.27 3.609 0.0135 *
## Residuals 391 8047 20.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = G3 ~ group, data = std_mat)
##
## $group
## diff lwr upr p adj
## Female_No_Romance-Male_No_Romance -0.5615527 -2.005331 0.8822260 0.7474349
## Male_Romance-Male_No_Romance -0.6968460 -2.596176 1.2024843 0.7796630
## Female_Romance-Male_No_Romance -2.0992821 -3.759610 -0.4389542 0.0065760
## Male_Romance-Female_No_Romance -0.1352933 -2.045027 1.7744409 0.9978299
## Female_Romance-Female_No_Romance -1.5377294 -3.209949 0.1344901 0.0841047
## Female_Romance-Male_Romance -1.4024361 -3.480723 0.6758508 0.3038478
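The only pairwise comparison significant at the 5% level is Female_Romance versus Male_No_Romance (a difference of about -2.10 points, adjusted p = 0.0066), which is consistent with the two t-tests above: male students and students without a romantic relationship each score higher on average. The family-wise confidence intervals can also be inspected graphically; a small sketch using base R's plot method for TukeyHSD objects:
par(mar = c(5, 12, 4, 2))  # widen the left margin for the long group labels
plot(tukey_result, las = 1)
par(mar = c(5, 4, 4, 2) + 0.1)  # restore the default margins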
# Convert studytime to a factor with descriptive levels
std_mat$studytime <- factor(std_mat$studytime,
labels = c("less than 2 hours", "2-5 hours", "5-10 hours", "greater than 10 hours"))
anova_studytime <- aov(G3 ~ studytime, data = std_mat)
summary(anova_studytime)
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 3 108 36.07 1.728 0.161
## Residuals 391 8162 20.87
TukeyHSD(anova_studytime)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = G3 ~ studytime, data = std_mat)
##
## $studytime
## diff lwr upr p adj
## 2-5 hours-less than 2 hours 0.1240981 -1.299000 1.547197 0.9959788
## 5-10 hours-less than 2 hours 1.3523810 -0.508052 3.212814 0.2402776
## greater than 10 hours-less than 2 hours 1.2116402 -1.331975 3.755255 0.6087548
## 5-10 hours-2-5 hours 1.2282828 -0.456832 2.913398 0.2380392
## greater than 10 hours-2-5 hours 1.0875421 -1.330800 3.505884 0.6522702
## greater than 10 hours-5-10 hours -0.1407407 -2.839700 2.558218 0.9991295
Consistent with the non-significant ANOVA result (p = 0.161), none of the pairwise comparisons between study-time groups is significant after Tukey adjustment, so there is no evidence here that reported weekly study time alone separates mean final grades.
# Create a binary variable for pass/fail based on G3
std_mat$pass_fail <- ifelse(G3 >= 10, "Pass", "Fail")
# Combine Medu and Fedu into a single categorical variable
std_mat$parental_edu <- ifelse(Medu >= 3 & Fedu >= 3, "High", "Low")
table(std_mat$pass_fail, std_mat$parental_edu)
##
## High Low
## Fail 49 81
## Pass 117 148
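The counts are easier to compare as proportions; a small sketch computing the pass rate within each parental-education group (roughly 70% for the High group versus roughly 65% for Low, based on the table above):
# Row proportions: pass/fail rate within each parental-education group
round(prop.table(table(std_mat$parental_edu, std_mat$pass_fail), margin = 1), 3)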
# Visualization of pass/fail status by parental education level
ggplot(std_mat, aes(x = parental_edu, fill = pass_fail)) +
geom_bar(position = "dodge") +
labs(title = "Student Pass/Fail Rate by Parental Education Level",
x = "Parental Education Level",
y = "Number of Students",
fill = "Pass/Fail") +
scale_fill_manual(values = c("Pass" = "forestgreen", "Fail" = "firebrick")) +
theme(
plot.title = element_text(hjust = 0.5, size = 16),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)
# Chi-square test to determine association between parental education and pass/fail status
chisq_test <- chisq.test(table(std_mat$parental_edu, std_mat$pass_fail))
chisq_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(std_mat$parental_edu, std_mat$pass_fail)
## X-squared = 1.2399, df = 1, p-value = 0.2655
With a p-value of 0.2655, we fail to reject the null hypothesis at the 5% level: there is no statistically significant association between the combined parental-education level and students' pass/fail status in this sample.
# Calculating the correlation between absences and final grade (G3)
cor_test <- cor.test(absences, G3, method = "pearson")
cor_test
##
## Pearson's product-moment correlation
##
## data: absences and G3
## t = 0.67933, df = 393, p-value = 0.4973
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06464215 0.13247070
## sample estimates:
## cor
## 0.03424732
cor(absences, G3)
## [1] 0.03424732
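The marginal correlation between absences and the final grade is essentially zero (r = 0.034, p = 0.497), even though absences was a significant predictor in the multiple regression above; the two findings are compatible, because the regression coefficient measures the association after adjusting for G1, G2 and the other covariates. A quick visual check of the marginal relationship:
# Scatterplot of absences against G3 with the (nearly flat) simple regression line
plot(std_mat$absences, std_mat$G3, xlab = "Absences", ylab = "Final grade (G3)")
abline(lm(G3 ~ absences, data = std_mat), col = "red", lwd = 2)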