Academic performance is a crucial factor in shaping a student’s future opportunities in education and career paths. For educational institutions aiming to improve student outcomes, understanding what influences academic success is essential (Cortez and Silva, 2008). Mathematics is particularly important, as it forms the foundation for other subjects, and students who struggle in mathematics may face challenges across their studies. Research has shown that various factors, such as a student’s background, social environment, and study habits, can all impact their academic performance, underscoring the need to identify these influences to design effective educational support (Cortez and Silva, 2008).
This study investigates the factors affecting the mathematics performance of secondary school students from two Portuguese schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS). The data for this project, obtained from the UCI Machine Learning Repository (Cortez and Silva, 2008), contains valuable information that can reveal patterns and insights into student success.
The project has two main goals: analyzing data to understand the relationships between different factors and applying prediction methods to estimate student performance. The first goal involves statistical tests to explore how various demographic, social, and academic factors are related to student grades. The second goal focuses on using prediction models. By combining these approaches, this study aims to give educators a clearer understanding of the factors influencing student performance, guiding them in developing more targeted and effective support strategies.
We utilized the Secondary Education Dataset from the UCI Machine Learning Repository, originally gathered by Cortez and Silva, which provides information on student academic performance in Mathematics. The data was collected during the 2005-2006 academic year through school reports and supplementary questionnaires filled out by students.
In preparing the dataset for statistical and machine learning analyses, several categorical variables were recoded as numeric values (binary indicators or integer codes) to facilitate interpretation. The recoding is as follows:
school: Gabriel Pereira "GP" (0), Mousinho da Silveira "MS" (1)
sex: Male (0), Female (1)
address: Rural (0), Urban (1)
famsize: Three or fewer members (0), More than three members (1)
Pstatus: Parents together (1), Apart (0)
schoolsup, famsup, paid, activities, nursery, higher, internet, romantic: "yes" (1), "no" (0)
reason (reason for choosing the school): course (0), other (1), home (2), reputation (3)
guardian (student's guardian): mother (0), other (1), father (2)
Mjob, Fjob (parents' occupations): at_home (0), other (1), health (2), services (3), teacher (4)
This transformation ensures that the variables are appropriately formatted for both statistical tests and machine learning models.
Fig 1. Transformed Sample Data
The data collection consisted of two primary sources:
School Reports:
These included official records such as period grades (G1, G2, and G3, where G3 is the final grade) and the number of school absences. The dataset includes the first-period grade (G1) and the second-period grade (G2), both numeric values ranging from 0 to 20, along with the final grade (G3) on the same scale. Student outcomes are classified as a pass if G3 ≥ 10 and a fail if G3 < 10. Grades are further categorized as follows:
\(16-20\): excellent
\(14-15\): good
\(12-13\): satisfactory
\(10-11\): sufficient
\(0-9\): fail
Fig 2. The distribution of Final grades in Mathematics
Questionnaires:
Designed to capture a broader range of variables, these questionnaires gathered data on students' demographic background, social context, and additional school-related factors. Questions were closed-ended to ensure consistency in responses and ease of data analysis. The variables captured included parental education and occupation, weekly study time, alcohol consumption, and family support.
A few potential sources of bias were inherent to the data collection process and dataset composition:
Self-Reporting Bias: The questionnaire data relied on self-reported information, which may introduce bias, particularly for sensitive variables such as alcohol consumption and parental income. Students may underreport or overreport certain behaviors due to social desirability bias.
Selection Bias: Data was collected from two public schools in a specific Portuguese region. This limitation may restrict the generalizability of findings to other regions or private educational settings, as the socioeconomic context could differ.
Privacy Concerns: Some variables, like family income, were omitted from the analysis due to incomplete responses, likely stemming from privacy concerns. This exclusion could reduce the dataset’s comprehensiveness in examining socioeconomic influences on performance.
To account for these biases:
Data Imputation was applied minimally, only filling missing values where they were considered ignorable and would not introduce additional bias.
Normalization of Numeric Variables: Variables like grades and absences were normalized to reduce the impact of outliers and differences in scale (see the sketch after this list).
Use of Cross-Validation in machine learning modeling helped control for selection bias, ensuring that model training and testing were less affected by any specific subset of the data.
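As an illustrative sketch of the normalization step (the appendix code models grades and absences on their original scales, so treat this as one possible approach rather than the exact pipeline), z-score standardization in R could look like:
# Illustrative sketch: z-score standardization of numeric variables.
# Assumes std_mat has been loaded and recoded as in the appendix;
# note the appendix models actually use the original 0-20 grade scale.
std_mat_norm <- std_mat
num_vars <- c("G1", "G2", "G3", "absences")
std_mat_norm[num_vars] <- scale(std_mat_norm[num_vars])  # center to mean 0, sd 1
summary(std_mat_norm$G3)  # check: mean is now approximately 0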
The statistical analysis in this study examines relationships between demographic, social, and academic factors and student performance.
Gender and Academic Performance:
An independent \(t\text{-test}\) will be conducted to determine whether there is a statistically significant difference in final grades (G3) between male and female students. This analysis aims to identify any potential gender-related performance disparities in Mathematics. The \(t\text{-test}\) statistic is given by: \[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \] where \(\bar{X}_1\) and \(\bar{X}_2\) are the sample means, \(s_1^2\) and \(s_2^2\) are the sample variances, and \(n_1\) and \(n_2\) are the sample sizes of the two groups.
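As a minimal sketch (using the recoded std_mat from the appendix, where sex is 0 for male and 1 for female), the statistic can be computed directly from this formula and checked against R's built-in Welch test:
# Sketch: Welch t-statistic computed by hand from the formula above
g_male   <- std_mat$G3[std_mat$sex == 0]   # group 1: male students
g_female <- std_mat$G3[std_mat$sex == 1]   # group 2: female students
t_manual <- (mean(g_male) - mean(g_female)) /
  sqrt(var(g_male) / length(g_male) + var(g_female) / length(g_female))
t_manual  # should match the t reported by t.test(G3 ~ sex, data = std_mat)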
Impact of Romantic Relationships:
To assess whether romantic relationships affect academic performance, a \(t\text{-test}\) will compare final grades (G3) between students who are in a romantic relationship and those who are not. This analysis initially considers a two-sided alternative hypothesis, followed by a one-sided \(t\text{-test}\) under the hypothesis that students without a relationship perform better on average. The one-sided \(t\text{-test}\) is conducted similarly to the two-sided test, but with the alternative hypothesis: \[ H_1: \mu_1 > \mu_2 \] where \(\mu_1\) is the mean final grade of students not in a relationship and \(\mu_2\) that of students in one.
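In R, this one-sided test can be sketched as follows (assuming the recoded std_mat from the appendix, where romantic is 0 for "no" and 1 for "yes"; with the formula interface, alternative = "greater" tests whether group 0 has the higher mean):
# Sketch: one-sided Welch t-test, H1: students not in a relationship score higher
t.test(G3 ~ romantic, data = std_mat, alternative = "greater")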
Gender and Relationship Status on Final Grades:
A \(\text{two-way ANOVA}\) will examine whether there is a significant interaction between gender and relationship status in predicting final grades (G3). The analysis will test for differences among four groups: female students not in a relationship, female students in a relationship, male students not in a relationship, and male students in a relationship. The \(\text{two-way ANOVA}\) model is: \[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \] where \(Y_{ijk}\) is the final grade, \(\mu\) is the overall mean, \(\alpha_i\) is the effect of gender, \(\beta_j\) is the effect of relationship status, \((\alpha\beta)_{ij}\) is the interaction effect, and \(\epsilon_{ijk}\) is the error term.
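A minimal sketch of this model in R (assuming the recoded std_mat from the appendix; sex and romantic are wrapped in factor() so aov treats them as categorical):
# Sketch: two-way ANOVA with interaction, then Tukey's HSD for pairwise comparisons
fit_2way <- aov(G3 ~ factor(sex) * factor(romantic), data = std_mat)
summary(fit_2way)    # main effects of gender and relationship status, plus interaction
TukeyHSD(fit_2way)   # includes pairwise differences among the four gender x relationship cells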
Study Time and Grades:
To determine if different study times impact final grades, a \(\text{one-way ANOVA}\) will compare grades (G3) across students with different levels of weekly study time, categorized as follows:
less than 2 hours
2-5 hours
5-10 hours
more than 10 hours.
The \(\text{one-way ANOVA}\) model is: \[ Y_{ij} = \mu + \tau_i + \epsilon_{ij} \] where \(Y_{ij}\) is the final grade of the \(j\)-th student in the \(i\)-th study time category, \(\mu\) is the overall mean, \(\tau_i\) is the effect of the \(i\)-th study time category, and \(\epsilon_{ij}\) is the error term.
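In R, this test can be sketched as follows (assuming std_mat from the appendix, where studytime is coded 1-4 for the four categories above):
# Sketch: one-way ANOVA of final grade across the four study time categories
fit_study <- aov(G3 ~ factor(studytime), data = std_mat)
summary(fit_study)    # overall F-test (cf. Table 2 in the results)
TukeyHSD(fit_study)   # pairwise comparisons between study time groups (cf. Table 3)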
Parental Education and Student Success:
A \(\text{Chi-square}\) test of independence will assess whether there is an association between parental education levels (Medu and Fedu) and student success, defined by whether a student passes or fails (with G3 ≥ 10 indicating a pass). Parental education levels are grouped into "High" (where both parents have an education level of 3 or above) and "Low" (where either or both parents have an education level below 3). The \(\text{Chi-square}\) test statistic is: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \(O_i\) is the observed frequency and \(E_i\) is the expected frequency under the null hypothesis of independence.
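A minimal sketch of this test in R, assuming std_mat from the appendix:
# Sketch: Chi-square test of independence for parental education vs. pass/fail
par_edu <- ifelse(std_mat$Medu >= 3 & std_mat$Fedu >= 3, "High", "Low")
success <- ifelse(std_mat$G3 >= 10, "Pass", "Fail")
table(par_edu, success)             # observed contingency table
chisq.test(table(par_edu, success)) # test statistic and p-value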
Absences and Academic Performance:
A Pearson correlation test will be performed to investigate the relationship between the number of absences and final grades (G3). This will assess whether frequent absences are associated with lower academic performance. The Pearson correlation coefficient is: \[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \] where \(X_i\) and \(Y_i\) are the individual data points, and \(\bar{X}\) and \(\bar{Y}\) are the sample means of the variables.
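In R, this reduces to a single call (a sketch assuming std_mat from the appendix):
# Sketch: Pearson correlation between absences and final grade, with significance test
cor.test(std_mat$absences, std_mat$G3, method = "pearson")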
This section incorporates machine learning techniques to enhance our
analysis by focusing on feature selection and predictive modeling to
forecast final grades (G3) based on key academic and social
characteristics. The machine learning approach allows us to identify
patterns and relationships that may not be as evident in traditional
statistical methods.
To determine the most impactful predictors of academic performance, we applied the regsubsets function from the leaps package, which evaluates subsets of variables to identify the best predictors for final grades (G3). This approach allowed us to rank variables in terms of their predictive strength and provided a broad overview of candidate models with different combinations of variables.
Fig 3. Feature Selection Summary
Information Criteria and Selection Process
The regsubsets function provided a selection process guided by multiple information criteria - namely, \(\text{R-squared}\), \(\text{adjusted R-squared}\), \(\text{Mallows' Cp}\), and \(\text{BIC}\). Each criterion offers unique insights into model quality:
- \(\text{R-squared}\) and \(\text{adjusted R-squared}\) measure the proportion of variance explained by the model, with adjusted R-squared penalizing the addition of non-informative predictors to avoid overfitting.
- \(\text{Mallows' Cp}\) assesses model bias and complexity, aiming for values near the number of predictors, indicating a well-fit model without excessive variables.
- \(\text{BIC}\) (Bayesian Information Criterion) prioritizes models with fewer predictors, imposing a penalty on complex models, and is particularly valuable for models expected to generalize well to new data.
We carefully reviewed each of these criteria and selected models balancing simplicity and predictive accuracy. Compared to a basic stepwise approach, regsubsets provides a holistic view of possible models and avoids the risk of narrow selections inherent in purely sequential methods. Using these criteria, we aimed for models with the most explanatory power that remained simple, thus enhancing interpretability and preventing overfitting.
Fig 4. Subset Selection
By exploring subsets and visualizing metrics such as \(\text{R-squared}\), \(\text{adjusted R-squared}\), \(\text{Mallows' Cp}\), and \(\text{BIC}\), we determined the model complexity that best balances predictive accuracy with parsimony.
Fig 5. The Best Subset
Figure 5 shows how the \(\text{RSS}\), \(\text{adjusted R-squared}\), and \(\text{Cp}\) values guided the selection of variables with the highest predictive utility, allowing us to proceed with a well-defined subset for regression modeling.
Linear Regression Model
Using the subset of selected predictors, we developed a linear regression model to estimate final grades. This model quantified the contributions of variables such as age, prior academic performance (G1, G2), and family dynamics (famrel) in predicting G3. Cross-validation (10-fold) was used to assess model stability and prevent overfitting. The model achieved a multiple \(\text{R-squared}\) of \(0.8374\), suggesting that it explains a substantial portion of the variability in student grades. Comparison of models through cross-validation confirmed that including social variables such as romantic relationships and activities had only minor effects on prediction accuracy.
Decision Tree Model
A decision tree model was also implemented to classify students into performance categories (high-performing vs. low-performing) based on academic and social factors. This model provided an intuitive, non-linear approach, mapping out decision paths that highlight influential variables and specific thresholds (e.g., number of absences and G2 score) critical for predicting performance outcomes.
Fig 6. The Decision Tree Model
We now present the results of the statistical analyses of the relationships between demographic, social, and academic factors and student performance.
To evaluate whether there is a significant difference in academic performance between male and female students, a two-sample \(t\text{-test}\) was conducted on the final grades (G3). The test revealed a statistically significant difference between the two groups (\(t = 2.0651\), \(p\text{-value} = 0.03958\), \(95\% \text{ CI} = [0.04545, 1.8507]\)). This indicates that the observed difference is unlikely to be due to random variation. The mean final grade for male students was found to be \(10.914\), while the mean for female students was \(9.966\). The \(95\%\) confidence interval for the difference in means ranges from \(0.04545\) to \(1.8507\), suggesting that, on average, male students score between \(0.04545\) and \(1.8507\) points higher than female students.
The \(p\)-value of 0.03958 means that, if there were truly no gender difference, a difference at least this large would occur by chance only 3.96% of the time; this is the risk of a Type I error, that is, of incorrectly concluding that a gender-based difference in academic performance exists when none does. To better understand this risk, we can visualize the critical region and where the observed \(t\)-value falls:
Fig 7. The plot shows the critical region. Since the observed t-value exceeds the critical value, the null hypothesis is rejected with minimal risk of a Type I error.
Given the significant \(p\text{-value}\), we can reject the null hypothesis and conclude that there is indeed a statistically significant difference in performance between male and female students in Mathematics. This \(t\text{-test}\) is appropriate for comparing the means of two independent groups, and the statistically significant result suggests that gender may play a role in academic performance in Mathematics, potentially related to social or psychological factors that merit further exploration.
In addition to gender, the study also examined whether being in a romantic relationship influences academic performance. Initially, a two-tailed \(t\text{-test}\) was performed to assess whether students in a relationship (group 1) perform differently from those who are not (group 0). However, given the research interest in determining whether students in a relationship have lower grades, a one-sided \(t\text{-test}\) was more appropriate.
The one-sided \(t\text{-test}\) yielded a \(t\text{-value}\) of \(2.5122\) and a \(p\text{-value}\) of \(0.006328\), significant at the \(1\%\) level (the corresponding two-sided test gives \(p = 0.01266\), \(95\% \text{ CI} = [0.2722, 2.2493]\)). This result indicates a statistically significant difference, supporting the hypothesis that students not in a relationship tend to have higher grades. The \(95\%\) confidence interval for the difference in means ranges from \(0.272\) to \(2.249\), indicating that, on average, students not in a romantic relationship score between \(0.272\) and \(2.249\) points higher than those in a relationship.
Fig 8. Visualization of the null t-distribution for a one-sided test, showing the Type I error region (red) beyond the critical value
The findings indicate that students not involved in a romantic relationship have significantly better academic outcomes. These results align with the hypothesis that romantic involvement could distract students from their studies. We therefore reject the null hypothesis and conclude that being in a romantic relationship is associated with lower academic achievement in this context.
We conducted an Analysis of Variance (\(\text{ANOVA}\)) to evaluate whether there are significant differences in final grades (G3) among four groups of students based on their gender and relationship status. The four groups considered were:
Female students not in a romantic relationship
Female students in a romantic relationship
Male students not in a romantic relationship
Male students in a romantic relationship
The result shows a statistically significant difference in G3 scores among the four groups at the \(5\%\) significance level \((F = 3.609, p = 0.0135)\). To further investigate the differences between specific pairs of groups, a post-hoc Tukey's HSD test was conducted. The key comparisons and their results are shown in Table 1.
Table 1: Pairwise Comparisons for Gender and Romantic Relationship on Final Grade (G3)
Comparison                              Difference   Lower CI   Upper CI   Adjusted p-value
Female_Romance - Female_No_Romance         -0.5616    -2.0053     0.8822        0.7474
Male_No_Romance - Female_No_Romance        -0.6968    -2.5962     1.2025        0.7797
Male_Romance - Female_No_Romance           -2.0993    -3.7596    -0.4390        0.0066
Male_No_Romance - Female_Romance           -0.1353    -2.0450     1.7744        0.9978
Male_Romance - Female_Romance              -1.5377    -3.2099     0.1345        0.0841
Male_Romance - Male_No_Romance             -1.4024    -3.4807     0.6759        0.3038
Male students in a romantic relationship vs. Female students not in a romantic relationship:
This comparison reveals a statistically significant difference \((p = 0.0066, 95\% \text{ CI } = [-3.76, -0.44])\), which indicates that male students in a romantic relationship perform significantly worse than female students not in a relationship.
Other Comparisons (Non-Significant Results):
The results indicate no statistically significant differences for the other comparisons, including differences between female students in and out of relationships, and between male and female students not in relationships. This suggests that, aside from the significant difference observed for male students in a romantic relationship, other pairwise groupings do not show meaningful disparities in academic performance.
These results highlight that the most significant impact is observed when comparing male students in a romantic relationship to female students not in a relationship, where the former group performs significantly worse. For other group comparisons, the differences are either non-significant or marginally significant, indicating that relationship status may interact differently with gender in influencing academic success.
For the analysis regarding study time, the one-way ANOVA results indicate that there is no statistically significant difference in final grades (G3) among students with varying study times \((F = 1.728, p = 0.161)\). This suggests that the amount of time students dedicate to studying, whether categorized as less than \(2\) hours, \(2-5\) hours, \(5-10\) hours, or more than \(10\) hours per week, does not by itself significantly determine final grades.
Table 2: ANOVA Results for Study Time on Final Grade (G3)
Source       Df   Sum of Squares   Mean Square   F-value   Pr(>F)
Study Time    3        108             36.07      1.728     0.161
Residuals   391       8162             20.87
To further analyze any potential differences between specific pairs of study time groups, pairwise comparisons were conducted. As Table 3 demonstrates, none of the differences between study time groups were statistically significant after adjusting for multiple comparisons.
Table 3: Pairwise Comparisons for Study Time on Final Grade (G3)
Comparison                                   Difference   Lower CI   Upper CI   Adjusted p-value
2-5 hours - Less than 2 hours                   0.1241    -1.2990     1.5472        0.9960
5-10 hours - Less than 2 hours                  1.3524    -0.5081     3.2128        0.2403
Greater than 10 hours - Less than 2 hours       1.2116    -1.3320     3.7553        0.6088
5-10 hours - 2-5 hours                          1.2283    -0.4568     2.9134        0.2380
Greater than 10 hours - 2-5 hours               1.0875    -1.3308     3.5059        0.6523
Greater than 10 hours - 5-10 hours             -0.1407    -2.8397     2.5582        0.9991
This lack of significant differences suggests that study time, at least as reported and categorized in this study, does not appear to be a strong determinant of academic success. Possible reasons include:
Quality Over Quantity: It is possible that the effectiveness of study time varies among students, and the quality or method of studying may be more impactful than the sheer quantity of hours studied. For instance, students with strong study skills and efficient techniques might achieve better outcomes even with fewer study hours, compared to students who spend more time but use less effective strategies.
Influence of Other Factors: Academic performance is influenced by multiple factors beyond study time alone, such as prior knowledge, engagement in class, and individual learning styles. This finding might suggest that other factors in the dataset (e.g., prior grades, family support) play a more prominent role in determining academic outcomes than time spent studying.
Potential for Measurement Error: The study time variable is self-reported, which could lead to inaccuracies due to overestimation or underestimation by students. Furthermore, the categories may be too broad to capture meaningful differences in study habits accurately.
In conclusion, while the \(\text{one-way ANOVA}\) test did not reveal a significant relationship between study time and academic performance in this sample, this result does not necessarily imply that study time is unimportant. Instead, it points to the complex nature of academic success, which likely depends on a combination of effective study strategies and other influential factors.
We conducted a \(\text{Chi-square}\) test to determine whether there is a significant association between parents' education levels (Medu and Fedu) and student success (measured by whether students pass or fail, with G3 \(\geq 10\) considered a pass). The parental education levels were categorized into two groups: "High" (where both parents have an education level of \(3\) or above) and "Low" (where either or both parents have an education level below \(3\)). The outcome variable was student success, where "Pass" indicates a final grade of \(10\) or above and "Fail" indicates a grade below \(10\).
Fig 9. Parental Education Level
The test was not statistically significant \((p = 0.2655)\), so there is insufficient evidence to conclude that parents' education levels are associated with the likelihood of a student passing or failing. In other words, based on this dataset, the education levels of parents do not appear to have a significant impact on whether a student passes or fails. This result may point to other factors beyond parental education, such as student effort, teaching quality, or social influences, playing a more dominant role in determining academic success.
We used a \(\text{Pearson correlation}\) test to investigate whether there is a significant relationship between the number of absences (absences) and final grades (G3). The test was not statistically significant \((p = 0.4973, t = 0.6793, \text{correlation} = 0.034)\), so there is insufficient evidence to conclude that the number of absences is associated with final grades. In other words, based on this dataset, absenteeism does not appear to have a significant linear relationship with academic performance. This result may seem surprising, as one might expect that higher rates of absenteeism would negatively impact academic achievement due to lost instructional time and reduced engagement with course material. Several factors may help explain this unexpected outcome:
Alternative Learning Methods: Some students may effectively compensate for missed classes through alternative learning methods, such as self-study or tutoring. In such cases, absences may not significantly impact performance, particularly if students have access to resources or support outside of regular classroom hours.
Variation in Absence Reasons: The dataset does not specify the reasons for students’ absences, which could vary widely. For example, absences due to school-related activities or minor illnesses may not have the same impact as prolonged absences due to more serious issues.
Potential Underreporting or Inaccurate Recording of Absences: It is also possible that students or schools underreported absences or that records do not reflect the true number of classes missed. If absences are inaccurately recorded, this would weaken any observed correlation between attendance and performance.
Future research might benefit from categorizing absences based on their nature (excused vs. unexcused, short-term vs. long-term); understanding how absences interact with other factors, such as parental involvement and study habits, could also provide a more comprehensive view of their effect on academic outcomes. In summary, while the \(\text{Pearson correlation}\) test does not show a significant association between absences and final grades in this dataset, this finding highlights the complexity of the factors influencing academic success. Absences alone may not fully capture the challenges students face, suggesting that academic outcomes are influenced by a more intricate interplay of personal, educational, and social factors.
In this study, we used two machine learning models to predict student performance in mathematics based on academic and social features. The models explored how well various characteristics, such as prior grades, family relationships, and extracurricular activities, predict the final grade (G3). The two models used are a linear regression model and a decision tree classifier.
The linear regression model was developed to predict students’ final
grades (G3) based on key predictors: age, first-period
grades (G1), second-period grades (G2), family
relationships, absences, weekend alcohol consumption
(Walc), and romantic relationships. These predictors were
selected for their potential impact on academic performance.
Model Fit and Performance
The linear regression model achieved a multiple R-squared value of \(R^2 = 0.8374\), indicating that \(83.74\%\) of the variance in students' final grades (G3) is explained by the predictors. This high \(R^2\), coupled with an adjusted R-squared of \(0.834\), suggests that the model fits well while guarding against overfitting. The linear model is expressed as: \[ G3 = \beta_0 + \beta_1 (\text{age}) + \beta_2 (G1) + \beta_3 (G2) + \beta_4 (\text{activities}) + \beta_5 (\text{famrel}) + \beta_6 (\text{absences}) + \beta_7 (\text{Walc}) + \beta_8 (\text{romantic}) + \epsilon \]
where each \(\beta\) coefficient represents the estimated effect of each predictor on the final grade, and \(\epsilon\) is the error term.
Model Significance
The \(\text{F-statistic}\) of the model is: \[ F = \frac{\text{MSR}}{\text{MSE}} = 248.4 \] where \(\text{MSR}\) (mean square due to regression) represents the variation explained by the model and \(\text{MSE}\) (mean squared error) reflects the variation left unexplained; the F-statistic thus measures the ratio of model-explained variance to unexplained variance. With a \(p\text{-value} < 2.2 \times 10^{-16}\), the model is highly statistically significant, meaning that the included predictors collectively contribute to predicting students' final grades.
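For reference, both quantities can be recovered from the fitted model object (a sketch assuming model1 from the appendix):
# Sketch: extracting the F-statistic and its p-value from the fitted lm object
fstat <- summary(model1)$fstatistic                  # named vector: value, numdf, dendf
fstat["value"]                                       # F = 248.4 in our fit
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # p-value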
Table 4: Linear Regression Results for Predicting Final Grades (G3)
Residuals:
Statistic    Value
Min         -8.9571
1Q          -0.4748
Median       0.2849
3Q           1.0339
Max          4.3246
Predictor     Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     0.0019      1.3853      0.001     0.9989
age            -0.2146      0.0777     -2.763     0.0060 **
G1              0.1801      0.0552      3.261     0.0012 **
G2              0.9622      0.0491     19.597     < 2e-16 ***
activities     -0.3424      0.1895     -1.807     0.0716 .
famrel          0.3725      0.1065      3.496     0.0005 ***
absences        0.0440      0.0121      3.629     0.0003 ***
Walc            0.1233      0.0753      1.639     0.1021
romantic       -0.3204      0.2064     -1.552     0.1214
Model Summary:
\(\text{Residual standard error: } 1.867 \text{ on } 386 \text{ degrees of freedom}\)
\(\text{Multiple R-squared: } 0.8374\)
\(\text{Adjusted R-squared: } 0.834\)
\(\text{F-statistic: } 248.4 \text{ on } 8 \text{ and } 386 \text{ DF, } p\text{-value} < 2.2 \times 10^{-16}\)
Key Predictors and Insights
Second-Period Grades (G2): With an estimated coefficient \(\beta_3 = 0.962\) (\(p < 2 \times 10^{-16}\)), G2 had the largest positive effect on G3, indicating that students' performance in the second term strongly influences their final outcome.
First-Period Grades (G1): The coefficient for G1 was \(\beta_2 = 0.180\) (\(p = 0.0012\)), suggesting that initial term performance is also significant, though less impactful than G2.
Absences: Absences had a small but statistically significant association with G3 (\(\beta_6 = 0.044\), \(p = 0.00032\)). Notably, the estimated coefficient is positive: once prior grades and the other predictors are controlled for, additional recorded absences are associated with marginally higher final grades in this sample, a counterintuitive result that warrants caution in interpretation.
Family Relationships (famrel): Family relationship quality had a positive effect on G3 (\(\beta_5 = 0.372\), \(p = 0.00053\)), indicating that strong family support correlates with higher grades.
Age (age): The coefficient for age was \(\beta_1 = -0.215\) (\(p = 0.006\)), suggesting that older students in this sample may perform slightly worse on average.
Extracurricular Activities (activities) and Romantic Relationships (romantic): These variables were not statistically significant (activities: \(\beta_4 = -0.342\), \(p = 0.072\); romantic: \(\beta_8 = -0.320\), \(p = 0.121\)), indicating minimal impact on G3 in this dataset.
Summary of Linear Model Findings:
The linear regression model provided valuable insights into the main factors influencing student performance, especially the role of consistent academic effort and family support. These findings highlight potential intervention points for educators, such as monitoring early-term grades and fostering supportive environments.
A decision tree model was used to classify students as high-performing (G3 \(> 13\)) or below this threshold. The tree was grown on the full set of predictors (excluding G3 itself) and was pruned to improve its generalizability.
Model Performance:
The initial, unpruned decision tree achieved an accuracy of \(95.6\%\) in classifying students' performance on the test data. We then pruned the tree to its optimal size using cross-validation, which improved accuracy slightly to \(96.2\%\) while enhancing interpretability, showing that a simplified model can maintain predictive performance and potentially reduce overfitting.
Fig 10. Pruned Decision Tree Structure
Key Predictors and Interpretation:
- Second-Period Grade (G2): As with the linear model, G2 emerged as the top predictor in the decision tree. It was the initial split point, distinguishing high- and low-performing students.
- Weekend Alcohol Consumption (Walc): For students with G2 between \(12.5\) and \(13.5\), weekend alcohol consumption further separated those likely to achieve high scores. The tree split Walc at a threshold of \(1.5\), with students reporting lower alcohol consumption (Walc \(< 1.5\)) being more likely to perform well.
The pruned tree's structure offers clear insights. Students with G2 \(< 12.5\) were typically not high-performing, while those with G2 above this threshold were more likely to achieve a high score, particularly if they had low weekend alcohol consumption.
Summary of Decision Tree Findings:
The pruned decision tree provided an intuitive, non-linear representation of how specific factors impact student performance. It highlighted key thresholds within predictors like G2 and Walc, giving educators actionable cut-points for identifying at-risk students. This model's simplicity and clarity make it a practical tool for educational interventions, helping teachers prioritize support for students based on second-term grades and lifestyle factors such as alcohol consumption.
This study offers insights into the factors affecting academic performance in Mathematics among secondary school students. However, several limitations and broader considerations warrant discussion.
One limitation involves the data itself, which is derived from two Portuguese schools and may not fully generalize to other cultural or educational settings. Differences in societal structures, teaching methods, and parental involvement across cultures could influence the applicability of our findings. Research by Wang and Li (2024) emphasizes that parental involvement and its impact on academic outcomes vary significantly between collectivist and individualist societies. The current study grouped parental education as a binary factor without exploring the nuances of how active parental support and engagement impact students’ learning experiences. Future research could investigate these aspects more deeply, especially how parental engagement in collectivist settings like Portugal may differ from that in more individualist contexts.
Another consideration is the reliance on self-reported social factors, such as students’ romantic relationships and study time, which may introduce response bias. Additionally, while our results suggest that attendance is positively related to academic performance, the influence of other school environmental factors, such as classroom resources and outdoor spaces, was not directly examined. Prior work by Kweon et al. (2017) has highlighted the impact of school environments on students’ ability to perform academically, suggesting that factors like access to green spaces and overall school atmosphere may affect academic outcomes but were not included in this study.
The choice of machine learning models (linear regression and decision tree) allowed us to explore predictors and classify performance groups. However, these models do not capture all complex interactions, especially non-linear relationships. More complex models, such as neural networks or ensemble methods, could yield greater predictive accuracy, though they would require larger datasets for meaningful results.
Furthermore, Ali et al. (2013) indicate that demographic factors, including socioeconomic status and available resources, can significantly impact academic performance. While this study included parental education as a measure, other socioeconomic variables like family income and access to educational resources were not captured and may play an influential role. Addressing these factors could provide a more comprehensive view of the various influences on academic success.
Future research should consider a more diversified dataset and include additional socio-economic and school environmental factors to enrich our understanding. Incorporating longitudinal data might also allow us to track students’ academic performance over time, providing insight into how factors like study habits or parental support evolve and influence outcomes across different educational stages.
This study investigated the factors influencing academic performance in Mathematics among secondary school students from two Portuguese schools, using both statistical analysis and machine learning approaches. The findings highlight the significance of demographic, social, and academic variables in shaping student outcomes, with specific attention to the impacts of gender, romantic relationships, attendance, and prior grades.
The statistical analyses revealed that male students, on average, scored slightly higher in Mathematics than female students, while students not in romantic relationships also tended to perform better academically. Notably, absences showed no significant simple correlation with final grades, although they were a statistically significant predictor in the regression model. Variables like study time and parental education level likewise had limited or no statistically significant impact on academic performance in this dataset. These insights suggest that while certain demographic and social factors contribute to academic success, other variables, like prior performance and stable family environments, may hold more importance in educational interventions.
The machine learning models—linear regression and decision trees—reinforced the significance of prior grades, especially the second-period grade, in predicting final outcomes. The linear regression model’s high explanatory power demonstrated that factors such as family support and academic consistency positively influence grades, while the decision tree’s intuitive structure identified critical thresholds in student performance. Together, these models not only enhance predictive accuracy but also provide actionable insights for educators, who can use these findings to identify and support students at risk of underperformance.
In conclusion, this study underscores the value of integrating statistical and machine learning methods to better understand and predict student performance. By identifying key predictors and their relationships to academic outcomes, educators and policymakers can design targeted interventions that support student success, ultimately contributing to a more effective and inclusive educational environment. Future research could expand on these findings by incorporating a broader range of variables and exploring the role of additional personal and social factors in student performance.
Ali, Shoukat, et al. "Factors Contributing to the Students Academic Performance: A Case Study of Islamia University Sub-Campus." American Journal of Educational Research, vol. 1, no. 8, 20 Aug. 2013, pp. 283–289, https://doi.org/10.12691/education-1-8-3.
Cortez, Paulo, and Alice Silva. "Using Data Mining to Predict Secondary School Student Performance." Proceedings of the 5th Annual Future Business Technology Conference (FUBUTEC 2008), 2008.
Kweon, Byoung-Suk, et al. “The link between school environments and
student academic performance.” Urban Forestry & Urban
Greening, vol. 23, Apr. 2017, pp. 35–43, https://doi.org/10.1016/j.ufug.2017.02.002.
Wang, Yiheng, and Liman Man Li. “Relationships between parental
involvement in homework and learning outcomes among elementary school
students: The moderating role of societal collectivism–individualism.”
British Journal of Educational Psychology, vol. 94, no. 3, 15
May 2024, pp. 881–896, https://doi.org/10.1111/bjep.12692.
# Load necessary libraries for data manipulation, visualization, and modeling
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR2)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:ISLR2':
##
## Boston
##
## The following object is masked from 'package:dplyr':
##
## select
library(leaps)
library(boot)
library(tree)
# Load the dataset
std_mat <- read.csv("student+performance/student-mat.csv", sep = ";", header = TRUE)
head(std_mat)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
# Data cleaning and transformation
# Remove rows with missing values and convert categorical variables to binary/numeric format
std_mat <- read.csv("student+performance/student-mat.csv", sep = ";", header = TRUE) %>%
na.omit() %>%
mutate(
school = ifelse(school == "GP", 0, 1), # Convert 'GP' to 0 and 'MS' to 1 for school
sex = ifelse(sex == "M", 0, 1), # Convert Male to 0 and Female to 1
address = ifelse(address == "R", 0, 1), # Rural as 0, Urban as 1
famsize = ifelse(famsize == "LE3", 0, 1), # Family size ≤3 as 0, >3 as 1
Pstatus = ifelse(Pstatus == "T", 1, 0), # Together as 1, Apart as 0
# Convert 'yes'/'no' responses to 1/0
schoolsup = ifelse(schoolsup == "yes", 1, 0),
famsup = ifelse(famsup == "yes", 1, 0),
paid = ifelse(paid == "yes", 1, 0),
activities = ifelse(activities == "yes", 1, 0),
nursery = ifelse(nursery == "yes", 1, 0),
higher = ifelse(higher == "yes", 1, 0),
internet = ifelse(internet == "yes", 1, 0),
romantic = ifelse(romantic == "yes", 1, 0),
# Assign numerical codes for categorical variables 'reason', 'guardian', 'Mjob', and 'Fjob'
reason = case_when(
reason == "course" ~ 0,
reason == "other" ~ 1,
reason == "home" ~ 2,
reason == "reputation" ~ 3
),
guardian = case_when(
guardian == "mother" ~ 0,
guardian == "other" ~ 1,
guardian == "father" ~ 2
),
Mjob = case_when(
Mjob == "at_home" ~ 0,
Mjob == "other" ~ 1,
Mjob == "health" ~ 2,
Mjob == "services" ~ 3,
Mjob == "teacher" ~ 4
),
Fjob = case_when(
Fjob == "at_home" ~ 0,
Fjob == "other" ~ 1,
Fjob == "health" ~ 2,
Fjob == "services" ~ 3,
Fjob == "teacher" ~ 4
)
)
head(std_mat) # View the transformed data
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 1 0 1 18 1 1 0 4 4 0 4 0 0
## 2 0 1 17 1 1 1 1 1 0 1 0 2
## 3 0 1 15 1 0 1 1 1 0 1 1 0
## 4 0 1 15 1 1 1 4 2 2 3 2 0
## 5 0 1 16 1 1 1 3 3 1 1 2 2
## 6 0 0 16 1 0 1 4 3 3 1 3 0
## traveltime studytime failures schoolsup famsup paid activities nursery higher
## 1 2 2 0 1 0 0 0 1 1
## 2 1 2 0 0 1 0 0 0 1
## 3 1 2 3 1 0 1 0 1 1
## 4 1 3 0 0 1 1 1 1 1
## 5 1 2 0 0 1 1 0 1 1
## 6 1 2 0 0 1 1 1 1 1
## internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
## 1 0 0 4 3 4 1 1 3 6 5 6 6
## 2 1 0 5 3 3 1 1 3 4 5 5 6
## 3 1 0 4 3 2 2 3 3 10 7 8 10
## 4 1 1 3 2 2 1 1 5 2 15 14 15
## 5 0 0 4 3 2 1 2 5 4 6 10 10
## 6 1 0 5 4 2 1 2 5 10 15 15 15
# View the structure of the dataset to confirm variable types
glimpse(std_mat)
## Rows: 395
## Columns: 33
## $ school <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sex <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ famsize <dbl> 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ Pstatus <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <dbl> 0, 0, 0, 2, 1, 3, 1, 1, 3, 1, 4, 3, 2, 4, 1, 2, 3, 1, 3, 2,…
## $ Fjob <dbl> 4, 1, 1, 3, 1, 1, 1, 4, 1, 1, 2, 1, 3, 1, 1, 1, 3, 1, 3, 1,…
## $ reason <dbl> 0, 0, 1, 2, 2, 3, 2, 2, 2, 2, 3, 3, 0, 0, 2, 2, 3, 3, 0, 2,…
## $ guardian <dbl> 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 2,…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ famsup <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ paid <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ activities <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1,…
## $ nursery <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ higher <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ internet <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ romantic <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
# View summary statistics to understand the distributions of each variable
summary(std_mat)
## school sex age address
## Min. :0.0000 Min. :0.0000 Min. :15.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:16.0 1st Qu.:1.0000
## Median :0.0000 Median :1.0000 Median :17.0 Median :1.0000
## Mean :0.1165 Mean :0.5266 Mean :16.7 Mean :0.7772
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:18.0 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :22.0 Max. :1.0000
## famsize Pstatus Medu Fedu
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :1.0000 Median :3.000 Median :2.000
## Mean :0.7114 Mean :0.8962 Mean :2.749 Mean :2.522
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.899 Mean :1.777 Mean :1.441 Mean :0.5367
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :3.000 Max. :2.0000
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.000 Median :2.000 Median :0.0000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342 Mean :0.1291
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000 Max. :1.0000
## famsup paid activities nursery
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.6127 Mean :0.4582 Mean :0.5089 Mean :0.7949
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## higher internet romantic famrel
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:4.000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :4.000
## Mean :0.9494 Mean :0.8329 Mean :0.3342 Mean :3.944
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:5.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
# Attach the dataset for easier reference
attach(std_mat)
# Categorize final grade (G3) into performance levels and visualize the distribution
math <- std_mat %>%
mutate(
performance = case_when(
G3 >= 16 & G3 <= 20 ~ "Excellent",
G3 >= 14 & G3 <= 15 ~ "Good",
G3 >= 12 & G3 <= 13 ~ "Satisfactory",
G3 >= 10 & G3 <= 11 ~ "Sufficient",
TRUE ~ "Fail"
)
)
# Plot histogram of G3 categorized by performance levels
ggplot(math, aes(x = G3, fill = performance)) +
geom_histogram(binwidth = 1, color = "black") +
scale_fill_manual(values = c("Excellent" = "darkgreen", "Good" = "lightgreen",
"Satisfactory" = "yellow", "Sufficient" = "orange", "Fail" = "red")) +
labs(title = "Distribution of Final Grades (G3) in Mathematics",
x = "Final Grade (G3)",
y = "Frequency",
fill = "Performance Level") +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text = element_text(size = 14),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12)
)
# Perform subset selection for linear regression model using regsubsets
regfit.full <- regsubsets(G3 ~ ., data = std_mat)
reg.summary <- summary(regfit.full)
names(reg.summary) # Examine available metrics in summary
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
# Plot model selection metrics (R2, adjusted R2, BIC, Cp) for each subset size
par(mfrow = c(2, 2))
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "bic")
plot(regfit.full, scale = "Cp")
# Examine model fit using various criteria (RSS, adjusted R2, Cp, BIC)
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS")
which.min(reg.summary$rss) # Identify model with minimum RSS
## [1] 8
points(8, reg.summary$rss[8], col = "red", cex = 2, pch = 19)
# Adjusted R^2 to select optimal number of predictors
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq")
which.max(reg.summary$adjr2)
## [1] 8
points(8, reg.summary$adjr2[8], col = 'red', cex = 2, pch = 19)
# Cp criterion
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
which.min(reg.summary$cp)
## [1] 8
points(8, reg.summary$cp[8], col = "red", cex = 2, pch = 19)
# Bayesian Information Criterion (BIC)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC")
which.min(reg.summary$bic)
## [1] 5
points(5, reg.summary$bic[5], col = "red", cex = 2, pch = 19)
# Construct linear regression models with selected variables
set.seed(1)
model1 <- lm(G3 ~ age + G1 + G2 + activities + famrel + absences + Walc + romantic, data = std_mat)
train <- sample(nrow(std_mat), nrow(std_mat) * 0.6) # Split dataset for cross-validation
# Calculate MSE for model1
mean((G3 - predict(model1, std_mat))[-train]^2)
## [1] 3.841983
summary(model1)
##
## Call:
## lm(formula = G3 ~ age + G1 + G2 + activities + famrel + absences +
## Walc + romantic, data = std_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9571 -0.4748 0.2849 1.0339 4.3246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001922 1.385330 0.001 0.998894
## age -0.214582 0.077672 -2.763 0.006007 **
## G1 0.180141 0.055236 3.261 0.001208 **
## G2 0.962210 0.049100 19.597 < 2e-16 ***
## activities -0.342371 0.189499 -1.807 0.071586 .
## famrel 0.372452 0.106523 3.496 0.000526 ***
## absences 0.044042 0.012137 3.629 0.000323 ***
## Walc 0.123346 0.075265 1.639 0.102067
## romantic -0.320448 0.206438 -1.552 0.121416
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.867 on 386 degrees of freedom
## Multiple R-squared: 0.8374, Adjusted R-squared: 0.834
## F-statistic: 248.4 on 8 and 386 DF, p-value: < 2.2e-16
# 10-fold cross-validation for model performance
model1.glm <- glm(G3 ~ age + G1 + G2 + activities + famrel + absences + Walc + romantic, data = std_mat)
cv_result <- cv.glm(std_mat, model1.glm, K = 10)
cv_result$delta[1] # Cross-validated MSE for model1
## [1] 3.599968
# 10-fold cross-validation for the alternative model (model2)
model2.glm <- glm(G3 ~ G1 + G2 + famrel + absences + age, data = std_mat)
cv_result.2 <- cv.glm(std_mat, model2.glm, K = 10)
cv_result.2$delta[1] # Cross-validated MSE for model2
## [1] 3.570969
The cross-validated MSE indicates that Model 2 performs slightly better than Model 1 (3.571 versus 3.600). This suggests that adding the social variables activities, Walc (weekend alcohol consumption), and romantic relationships does not improve predictive accuracy, and that the simpler model generalizes at least as well.
# Next, we apply a decision tree to classify students based on whether they are high-performing (G3 > 13) or not.
# Create a binary variable indicating high-performing students and build a decision tree model
high_score <- factor(ifelse(G3 > 13, "yes", "no"))
tree.math <- tree(high_score ~ . - G3, data = std_mat)
summary(tree.math) # Show summary of the decision tree model
##
## Classification tree:
## tree(formula = high_score ~ . - G3, data = std_mat)
## Variables actually used in tree construction:
## [1] "G2" "reason" "Walc" "sex" "paid" "Mjob"
## Number of terminal nodes: 10
## Residual mean deviance: 0.1171 = 45.1 / 385
## Misclassification error rate: 0.02785 = 11 / 395
plot(tree.math)
text(tree.math, pretty = 0)
# We split the data to evaluate the performance of the decision tree and apply cross-validation to prune it for optimal performance
set.seed(1)
score.test <- std_mat[-train,]
high_score.test <- high_score[-train]
tree.math <- tree(high_score ~ . - G3, std_mat, subset = train)
tree.pred <- predict(tree.math, score.test, type = "class")
table(tree.pred, high_score.test) # Confusion matrix
## high_score.test
## tree.pred no yes
## no 121 4
## yes 3 30
mean(tree.pred == high_score.test)
## [1] 0.9556962
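Accuracy alone can be flattering when the classes are imbalanced (124 of the 158 test students are "no"). As a supplementary check, precision and recall for the "yes" class can be derived from the same confusion matrix; a minimal sketch reusing the objects above:
cm <- table(tree.pred, high_score.test)
precision <- cm["yes", "yes"] / sum(cm["yes", ])  # of predicted high scorers, the share truly high
recall <- cm["yes", "yes"] / sum(cm[, "yes"])     # of true high scorers, the share identified
c(precision = precision, recall = recall)         # here 30/33 ~ 0.91 and 30/34 ~ 0.88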
# Cross-Validation and Pruning
# To avoid overfitting, we use cross-validation to prune the decision tree and determine the optimal complexity.
cv.math <- cv.tree(tree.math, FUN = prune.misclass)
names(cv.math)
## [1] "size" "dev" "k" "method"
cv.math
## $size
## [1] 7 5 4 2 1
##
## $dev
## [1] 13 13 13 13 66
##
## $k
## [1] -Inf 0 1 2 56
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.math)
# Plot the misclassification error against tree size and pruning parameter k
par(mfrow = c(1, 2))
plot(cv.math$size, cv.math$dev, type = 'b', main = "Tree Size vs Misclassification Error")
plot(cv.math$k, cv.math$dev, type = 'b', main = "k vs Misclassification Error")
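The optimal size can also be extracted programmatically rather than read off the plots; a small sketch using the cv.math object above. Note that several sizes tie at the minimum deviance of 13, and the pruning below uses size 4 from this set:
# Tree sizes attaining the minimum cross-validated deviance
cv.math$size[cv.math$dev == min(cv.math$dev)]  # returns 7 5 4 2 here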
# Prune the decision tree to the optimal size and plot the pruned tree
prune.math <- prune.misclass(tree.math, best = 4)
plot(prune.math)
text(prune.math, pretty = 0)
# Predict on test data using the pruned tree and evaluate accuracy
tree.pred1 <- predict(prune.math, score.test, type = 'class')
table(tree.pred1, high_score.test)
## high_score.test
## tree.pred1 no yes
## no 121 3
## yes 3 31
# Calculate accuracy of the pruned decision tree on test data
mean(tree.pred1 == high_score.test)
## [1] 0.9620253
# T-test to compare the final grades (G3) between male and female students
t_test_result <- t.test(G3 ~ sex, data = std_mat)
t_test_result
##
## Welch Two Sample t-test
##
## data: G3 by sex
## t = 2.0651, df = 390.57, p-value = 0.03958
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.04545244 1.85073226
## sample estimates:
## mean in group 0 mean in group 1
## 10.914439 9.966346
The p-value of 0.03958 is below the 5% significance level, so the observed difference in means is unlikely to be due to chance alone. The 95% confidence interval for the difference in means (male minus female, given the coding 0 = male, 1 = female) ranges from 0.045 to 1.851 points.
The mean final grade (G3) is 10.914 for male students and 9.966 for female students. We therefore reject the null hypothesis and conclude that there is a statistically significant difference between male and female students' performance in Mathematics, with male students scoring slightly higher on average.
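To complement the significance test with an effect size, Cohen's d (the pooled-SD standardized mean difference) can be computed directly; a minimal sketch, assuming std_mat with sex coded 0 = male and 1 = female:
g_m <- std_mat$G3[std_mat$sex == 0]  # male students
g_f <- std_mat$G3[std_mat$sex == 1]  # female students
s_p <- sqrt(((length(g_m) - 1) * var(g_m) + (length(g_f) - 1) * var(g_f)) /
            (length(g_m) + length(g_f) - 2))  # pooled standard deviation
(mean(g_m) - mean(g_f)) / s_p                 # Cohen's d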
alpha <- 0.05
t_observed <- 2.0651
df <- 390
t_crit <- qt(1 - alpha/2, df)
x <- seq(-4, 4, length = 1000)
null_dist <- dt(x, df)
plot(x, null_dist, type = "l", lwd = 2, col = "black",
ylab = "Density", xlab = "t", main = "Type I Error: Two-Sided Test of G3 by Gender")
polygon(c(x[x > t_crit], max(x), t_crit),
c(null_dist[x > t_crit], 0, 0), col = "blue", border = NA)
polygon(c(x[x < -t_crit], -t_crit, min(x)),
c(null_dist[x < -t_crit], 0, 0), col = "blue", border = NA)
abline(v = c(-t_crit, t_crit), col = "darkblue", lwd = 2, lty = 2)
# Label each shaded rejection region with its tail area alpha/2
text(-t_crit - 1, 0.1, expression(alpha/2), col = "darkblue", cex = 0.9, pos = 4)
text(t_crit + 0.5, 0.1, expression(alpha/2), col = "darkblue", cex = 0.9, pos = 2)
abline(v = c(-t_observed,t_observed), col = "green", lwd = 2, lty = 2)
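As a sanity check, the reported p-value can be recovered from the same t distribution used in the plot:
2 * pt(-abs(t_observed), df)  # two-sided p-value, ~0.0397, matching the Welch output up to the rounded df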
# T-test to compare the final grades (G3) between students who have a romance and who don't
t_test_result <- t.test(G3 ~ romantic,
alternative = "two.sided",
paired = FALSE,
data = std_mat)
t_test_result
##
## Welch Two Sample t-test
##
## data: G3 by romantic
## t = 2.5122, df = 240.07, p-value = 0.01266
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.2721553 2.2493333
## sample estimates:
## mean in group 0 mean in group 1
## 10.836502 9.575758
# Conduct a one-sided t-test to test if students without romantic relationships score higher
t_test_romantic_one_sided <- t.test(G3 ~ romantic, data = std_mat,
alternative = "greater",
paired = FALSE)
t_test_romantic_one_sided
##
## Welch Two Sample t-test
##
## data: G3 by romantic
## t = 2.5122, df = 240.07, p-value = 0.006328
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
## 0.432079 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 10.836502 9.575758
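For a positive observed statistic, the one-sided p-value is simply half the two-sided one; a quick check against the Welch output above:
pt(2.5122, df = 240.07, lower.tail = FALSE)  # ~0.00633, i.e. 0.01266 / 2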
alpha <- 0.05
t_observed <- 2.5122
df <- 240
t_crit <- qt(1 - alpha, df)
x <- seq(-4, 4, length = 1000)
null_dist <- dt(x, df)
plot(x, null_dist, type = "l", lwd = 2, col = "black",
ylab = "Density", xlab = "t", main = "Type I Error: One-Sided Test")
polygon(c(x[x > t_crit], max(x), t_crit),
c(null_dist[x > t_crit], 0, 0), col = "red", border = NA)
abline(v = t_crit, col = "darkblue", lwd = 2, lty = 2)
abline(v = t_observed, col = "green", lwd = 2, lty = 2)
text(t_crit + 0.5, 0.1, expression(alpha), col = "darkblue", cex = 0.9, pos = 2)  # shaded tail area alpha
# Create a combined factor for gender and romantic relationship status.
# interaction() orders levels with the first factor varying fastest, so with
# sex coded 0 = male / 1 = female and romantic coded 0 = no / 1 = yes the
# level order is 0_0, 1_0, 0_1, 1_1 and the labels must follow that order.
std_mat$group <- with(std_mat, interaction(sex, romantic, sep = "_"))
std_mat$group <- factor(std_mat$group,
                        labels = c("Male_No_Romance", "Female_No_Romance",
                                   "Male_Romance", "Female_Romance"))
# Perform a one-way ANOVA to compare mean grades across the four groups
anova_result <- aov(G3 ~ group, data = std_mat)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 223 74.27 3.609 0.0135 *
## Residuals 391 8047 20.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = G3 ~ group, data = std_mat)
##
## $group
## diff lwr upr p adj
## Female_No_Romance-Male_No_Romance -0.5615527 -2.005331 0.8822260 0.7474349
## Male_Romance-Male_No_Romance -0.6968460 -2.596176 1.2024843 0.7796630
## Female_Romance-Male_No_Romance -2.0992821 -3.759610 -0.4389542 0.0065760
## Male_Romance-Female_No_Romance -0.1352933 -2.045027 1.7744409 0.9978299
## Female_Romance-Female_No_Romance -1.5377294 -3.209949 0.1344901 0.0841047
## Female_Romance-Male_Romance -1.4024361 -3.480723 0.6758508 0.3038478
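The only pairwise comparison significant at the 5% level is Female_Romance versus Male_No_Romance (a difference of about -2.10 points, adjusted p = 0.0066), which is consistent with the two t-tests above: male students and students without a romantic relationship each score higher on average. The family-wise confidence intervals can also be inspected graphically; a small sketch using base R's plot method for TukeyHSD objects:
par(mar = c(5, 12, 4, 2))  # widen the left margin for the long group labels
plot(tukey_result, las = 1)
par(mar = c(5, 4, 4, 2) + 0.1)  # restore the default margins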
# Convert studytime to a factor with descriptive levels
std_mat$studytime <- factor(std_mat$studytime,
labels = c("less than 2 hours", "2-5 hours", "5-10 hours", "greater than 10 hours"))
anova_studytime <- aov(G3 ~ studytime, data = std_mat)
summary(anova_studytime)
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 3 108 36.07 1.728 0.161
## Residuals 391 8162 20.87
TukeyHSD(anova_studytime)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = G3 ~ studytime, data = std_mat)
##
## $studytime
## diff lwr upr p adj
## 2-5 hours-less than 2 hours 0.1240981 -1.299000 1.547197 0.9959788
## 5-10 hours-less than 2 hours 1.3523810 -0.508052 3.212814 0.2402776
## greater than 10 hours-less than 2 hours 1.2116402 -1.331975 3.755255 0.6087548
## 5-10 hours-2-5 hours 1.2282828 -0.456832 2.913398 0.2380392
## greater than 10 hours-2-5 hours 1.0875421 -1.330800 3.505884 0.6522702
## greater than 10 hours-5-10 hours -0.1407407 -2.839700 2.558218 0.9991295
Consistent with the non-significant ANOVA result (p = 0.161), none of the pairwise comparisons between study-time groups is significant after Tukey adjustment, so there is no evidence here that reported weekly study time alone separates mean final grades.
# Create a binary variable for pass/fail based on G3
std_mat$pass_fail <- ifelse(G3 >= 10, "Pass", "Fail")
# Combine Medu and Fedu into a single categorical variable
std_mat$parental_edu <- ifelse(Medu >= 3 & Fedu >= 3, "High", "Low")
table(std_mat$pass_fail, std_mat$parental_edu)
##
## High Low
## Fail 49 81
## Pass 117 148
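The counts are easier to compare as proportions; a small sketch computing the pass rate within each parental-education group (roughly 70% for the High group versus roughly 65% for Low, based on the table above):
# Row proportions: pass/fail rate within each parental-education group
round(prop.table(table(std_mat$parental_edu, std_mat$pass_fail), margin = 1), 3)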
# Visualization of pass/fail status by parental education level
ggplot(std_mat, aes(x = parental_edu, fill = pass_fail)) +
geom_bar(position = "dodge") +
labs(title = "Student Pass/Fail Rate by Parental Education Level",
x = "Parental Education Level",
y = "Number of Students",
fill = "Pass/Fail") +
scale_fill_manual(values = c("Pass" = "forestgreen", "Fail" = "firebrick")) +
theme(
plot.title = element_text(hjust = 0.5, size = 16),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)
# Chi-square test to determine association between parental education and pass/fail status
chisq_test <- chisq.test(table(std_mat$parental_edu, std_mat$pass_fail))
chisq_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(std_mat$parental_edu, std_mat$pass_fail)
## X-squared = 1.2399, df = 1, p-value = 0.2655
With a p-value of 0.2655, we fail to reject the null hypothesis at the 5% level: there is no statistically significant association between the combined parental-education level and students' pass/fail status in this sample.
# Calculating the correlation between absences and final grade (G3)
cor_test <- cor.test(absences, G3, method = "pearson")
cor_test
##
## Pearson's product-moment correlation
##
## data: absences and G3
## t = 0.67933, df = 393, p-value = 0.4973
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06464215 0.13247070
## sample estimates:
## cor
## 0.03424732
cor(absences, G3)
## [1] 0.03424732
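The marginal correlation between absences and the final grade is essentially zero (r = 0.034, p = 0.497), even though absences was a significant predictor in the multiple regression above; the two findings are compatible, because the regression coefficient measures the association after adjusting for G1, G2 and the other covariates. A quick visual check of the marginal relationship:
# Scatterplot of absences against G3 with the (nearly flat) simple regression line
plot(std_mat$absences, std_mat$G3, xlab = "Absences", ylab = "Final grade (G3)")
abline(lm(G3 ~ absences, data = std_mat), col = "red", lwd = 2)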