This project investigates how various social, familial, and behavioral factors are correlated with academic performance and average free time among Portuguese high school students. Using the Student Performance dataset compiled by Cortez and Silva in 2008, we explore three specific research questions: the first two concerning the influence of parental education and alcohol consumption on students’ final Portuguese grades (G3), and the last one focusing on how one’s relationship status influences the amount of free time students report. From the results, we found that higher levels of father’s education are strongly correlated with better academic performance, while weekday alcohol consumption is negatively correlated with grades. However, weekend drinking shows no significant effect. Finally, students’ relationship status does not appear to influence how much free time they report. These findings provide insight into which factors may meaningfully impact Portuguese high school students’ academic outcomes and time management, and suggest potential areas for school level intervention and student support.
This project focuses on how different factors influence school performance in Portuguese secondary school students. The dataset our team used is the Student Performance dataset compiled by Cortez and Silva in 2008. The dataset contains 649 observations and 30 features collected through questionnaires from two Portuguese high schools, with features covering aspects of family background, social environment, academic records, and types of lifestyles. The primary objective of our analysis is to examine how different social and behavioral factors relate to the final Portuguese grade (G3), which is measured on a scale from 0 to 20. By answering the three focused research questions below, we aim to understand how various background and behavioral variables correlate with or predict academic performance. The results from this analysis could help inform educational strategies and potentially help students with their stress and mental health.
After statistically controlling for weekly study time (on a 1–4 numeric scale where 1 indicates less than 2 hours and 4 indicates more than 10 hours) and internet access (coded as “yes” or “no”), how is the father’s highest education level (on a 0–4 numeric scale where 0 indicates no formal education and 4 indicates higher education) associated with Portuguese high school students’ final grades in Portuguese class (G3), which are measured on a 0–20 numeric scale?
Hypothesis:
\(H_0\): After controlling for weekly study time and internet access, the father’s highest education level (0 = no formal education to 4 = higher education) has no significant association with students’ final Portuguese grade (G3, measured on a 0–20 scale).
\(H_1\): After controlling for weekly study time and internet access, the father’s highest education level (0 = no formal education to 4 = higher education) is significantly positively associated with students’ final Portuguese grade (G3, 0–20 scale).
Based on the number of alcoholic beverages consumed by Portuguese high school students per weekday and weekend (on a 1-5 numeric scale where 1 indicates very low and 5 indicating very high), how does it affect Portuguese high school students’ final grades in Portuguese class (G3), which are measured on a 0–20 numeric scale?
Hypothesis:
\(H_0\): There is no significant association between students’ alcohol consumption levels during weekdays and weekends (1 = very low to 5 = very high) and their final Portuguese grade (G3, 0–20 scale).
\(H_1\): Students’ alcohol consumption levels during weekdays and weekends (1 = very low to 5 = very high) are significantly associated with their final Portuguese grade (G3, 0–20 scale).
Is there a statistically significant difference in reported average free time (on a 1-5 categorical scale where 1 indicates very low and 5 indicates very high) between Portuguese high school students who are in romantic relationships (coded as 1), and those who are not in romantic relationships (coded as 0)?
Hypothesis:
\(H_0\): There is no significant difference in average free time between students in a romantic relationship and those who are not.
\(H_1\): There is a significant difference in average free time (1 = very low to 5 = very high) between students in a romantic relationship and students who are not in romantic relationships.
Data Loading
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(knitr)
library(MASS) # For AIC/BIC
library(car) # For vifpor_df <- read_delim("C:/Users/jaych/Downloads/PSTAT100-Final-Project-main/PSTAT100-Final-Project-main/data/student-por.csv", delim = ";")
head(por_df)por_df <- por_df %>%
mutate(
school = as.factor(school),
sex = as.factor(sex),
address = as.factor(address),
famsize = as.factor(famsize),
Pstatus = as.factor(Pstatus),
Mjob = as.factor(Mjob),
Fjob = as.factor(Fjob),
reason = as.factor(reason),
guardian = as.factor(guardian),
schoolsup = as.factor(schoolsup),
famsup = as.factor(famsup),
paid = as.factor(paid),
activities = as.factor(activities),
nursery = as.factor(nursery),
higher = as.factor(higher),
internet = as.factor(internet),
romantic = as.factor(romantic),
# ordered factor from 1-5 for very high, low etc.
Dalc = factor(Dalc, levels = 1:5, ordered = TRUE),
Walc = factor(Walc, levels = 1:5, ordered = TRUE)
)Data Processing and Cleaning
The Student Performance dataset does not contain any missing values or fields with unclear meanings. Therefore, no data cleaning or preprocessing was necessary for this project.
The aim is to observe if, after controlling for study time and
internet access, father’s education significantly influenced the
student’s final Portuguese grade. The question at hand calls for a
partial significance test, which works by having one one the models be a
subset to another. Therefore, if there is a significant difference, it
is due to the predictors added in between.
The models were both linear and used the following predictors:
Model 1: Study time and internet access.
Model 2: Model 1’s predictors as well as the education level of the student’s father.
The two linear models were compared using an ANOVA test. Since the only predictor added in model 2 was the father’s education, if there was a significant decrease in residual variance–as indicated by the resulting p-value–it would be due to this predictor.
anova_kable <- function(model1, model2) {
aov_out <- anova(model1, model2)
Model <- rownames(aov_out)
Res.Df <- aov_out[, "Res.Df"]
RSS <- signif(aov_out[, "RSS"], 5)
Df <- ifelse(is.na(aov_out[, "Df"]), "", aov_out[, "Df"])
Sum_of_Sq <- ifelse(is.na(aov_out[, "Sum of Sq"]), "", signif(aov_out[, "Sum of Sq"], 5))
F <- ifelse(is.na(aov_out[, "F"]), "", signif(aov_out[, "F"], 4))
# Extract p-values as numeric, then format with signif( ,2)
pvals_raw <- aov_out$`Pr(>F)`
pvals_signif <- ifelse(
is.na(pvals_raw),
"",
formatC(signif(pvals_raw, 2), format = "e", digits = 2)
)
Signif <- cut(as.numeric(pvals_raw),
breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
labels = c("***", "**", "*", ".", ""),
right = TRUE)
Signif[is.na(Signif)] <- ""
df <- data.frame(
Model = Model,
Res.Df = Res.Df,
RSS = RSS,
Df = Df,
`Sum of Sq` = Sum_of_Sq,
F = F,
`Pr(>F)` = pvals_signif,
Signif = Signif,
stringsAsFactors = FALSE
)
knitr::kable(df, align = "c",
caption = "ANOVA Comparison Table")
}
lm_table <- function(model, digits = 2, pval_sci_threshold = 0.01, caption = "Linear Model Summary") {
df <- broom::tidy(model) %>%
dplyr::select(term, estimate, std.error, statistic, p.value) %>%
dplyr::rename(
Term = term,
Estimate = estimate,
`Std. Error` = std.error,
`t value` = statistic,
`Pr(>|t|)` = p.value
)
# Format values
df$Estimate <- signif(df$Estimate, digits)
df$`Std. Error` <- signif(df$`Std. Error`, digits)
df$`t value` <- signif(df$`t value`, digits)
# Format p-values with smart thresholding
df$`Pr(>|t|)` <- ifelse(
is.na(df$`Pr(>|t|)`),
"",
ifelse(
df$`Pr(>|t|)` < pval_sci_threshold,
formatC(signif(df$`Pr(>|t|)`, digits), format = "e", digits = digits),
formatC(round(df$`Pr(>|t|)`, digits), format = "f", digits = digits)
)
)
# Add significance stars
pval_raw <- broom::tidy(model)$p.value
df$Signif <- cut(pval_raw,
breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
labels = c("***", "**", "*", ".", ""),
right = TRUE)
df$Signif[is.na(df$Signif)] <- ""
# Return styled kable
knitr::kable(df, align = "c", caption = caption)
}Here were the results:
f1 <- G3 ~ studytime + internet
f2 <- G3 ~ studytime + internet + Fedu
model1 <- lm(data=por_df, formula = f1)
model2 <- lm(data=por_df, formula = f2)
anova_parsig <- anova(model1, model2)
anova_kable(model1, model2)| Model | Res.Df | RSS | Df | Sum.of.Sq | F | Pr..F. | Signif |
|---|---|---|---|---|---|---|---|
| 1 | 646 | 6207.3 | |||||
| 2 | 645 | 5995.9 | 1 | 211.42 | 22.74 | 2.30e-06 | *** |
With a p-value of approximately 2.30 * \(10^{-6}\), there is sufficient evidence to
reject the null hypothesis,
\(H_0\): that after controlling for
weekly study time and internet access, the father’s highest education
level has no significant association with students’ final Portuguese
grade. Therefore, opting instead for the alternative hypothesis, the
father’s highest education level does play a role in the students’ final
Portuguese grade.
Model checking was performed to ensure this result from the previous section was accurate. To assess whether this model is appropriate, we examined Model 2, which contains the same predictors as Model 1 in addition to father’s education. For the partial significance test to be valid, Model 2 must satisfy the standard linear regression assumptions: independence, no structure in residuals, normality of residuals, and homoscedasticity (equal variance across fitted values).
Independence is satisfied because the dataset includes the full population of students from each school during that year. Additionally, because this is an observational study with no temporal or treatment structure, residual independence is expected and visually confirmed. It only remains to be seen whether the remaining conditions—normality and equal variance—hold true.
shapiro_pval <- function(resids, digits=3){
shap <- shapiro.test(resids)
signif(shap$p.value, digits)
}
qqplott <- function(residuals){
ggplot(data.frame(residuals), aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = "Q-Q Plot of Residuals", x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()
}
hist_resids <- function(resids, bins = 30) {
df <- data.frame(residuals = resids)
ggplot(df, aes(x = residuals)) +
geom_histogram(bins = bins, fill = "steelblue", color = "black", alpha = 0.7) +
labs(title = "Student Grade Residuals", x = "Residuals", y = "Count") +
theme_minimal()
}
residplot_fitval <- function(model) {
df <- data.frame(Fitted = fitted(model), Residuals = residuals(model))
ggplot(df, aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
theme_minimal()
}shap_test <- shapiro.test(resids)
shap_pval <- signif(shap_test$p.value, 3)
off_values_percent <- round(100 * (sum(resids < -7)/nrow(por_df)), 2)The Q-Q plot indicates a deviation from normality in the residuals,
and likewise, a Shapiro-wilk test result of 2.83^{-15}(p-value <
0.05) corroborates this deviation. Nevertheless, as evident through the
histogram, the deviation is primarily due to a relatively small group of
students who scored significantly lower than predicted, causing a mild
left skew. These values represent approximately 2.62% of the residuals.
Aside from this small subset, the residuals are normally distributed and
centered around zero.
Given the large sample size and that the deviation stems from a small
portion of the data, the normality assumption is considered reasonably
satisfied for inference purposes.
The residuals appear to have approximately constant variance across the range of fitted values, with no clear funneling or systematic spread. Thus, there is no strong evidence to reject the homoscedasticity assumption.
Hence, all assumptions for the partial significance test performed were reasonably satisfied. Therefore, the previously found p-value of approximately 2.30 * \(10^{-6}\) can be trusted.
After controlling for study time and internet access, the father’s education level remains a significant positive predictor of students’ final Portuguese grades. This likely reflects the academic support that more educated parents are able to provide at home; eg, helping with homework and explaining difficult topics. To support those lacking this advantage, schools should consider expanding tutoring and afterschool programs to foster a more equitable learning environment. Additionally, making greater use of school libraries—through activities like book clubs or guest author visits—could help motivate students to read more and strengthen their Portuguese language skills.
To investigate the association between alcohol consumption and
students’ final Portuguese grades (G3), we employed a
two-way ANOVA. This method allows us to examine the main effects of
weekday alcohol consumption (Dalc) and weekend alcohol consumption
(Walc) on G3, as well as their interaction
effect. The model is specified as G3 ~ Dalc * Walc, which
includes both main effects and the interaction term. Both Dalc and Walc
are treated as ordered factors on a 1-5 scale, representing consumption
levels from very low to very high.
The ANOVA test was performed to assess the significance of weekday and weekend alcohol consumption, and their interaction, on the final grade. The results are summarized in the table below.
model_q2 <- aov(G3 ~ Dalc * Walc, data = por_df)
anova_kable_q2 <- function(aov_model, caption = "Two-Way ANOVA Summary") {
aov_summary <- summary(aov_model)[[1]]
df <- data.frame(
Term = rownames(aov_summary),
Df = aov_summary$Df,
`Sum Sq` = signif(aov_summary$`Sum Sq`, 5),
`Mean Sq` = signif(aov_summary$`Mean Sq`, 4),
`F value` = signif(aov_summary$`F value`, 4),
`Pr(>F)` = formatC(aov_summary$`Pr(>F)`, format = "f", digits = 4),
stringsAsFactors = FALSE
)
p_values <- aov_summary$`Pr(>F)`
df$Signif <- cut(p_values,
breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
labels = c("***", "**", "*", ".", ""),
right = TRUE)
df$Signif[is.na(df$Signif)] <- ""
knitr::kable(df, align = "c", caption = caption)
}
#ANOVA table for the Q2 model
anova_kable_q2(model_q2, caption = "ANOVA for Alcohol Consumption's Effect on Final Grades")| Term | Df | Sum.Sq | Mean.Sq | F.value | Pr..F. | Signif |
|---|---|---|---|---|---|---|
| Dalc | 4 | 327.51 | 81.880 | 8.282 | 0.0000 | *** |
| Walc | 4 | 40.11 | 10.030 | 1.014 | 0.3992 | |
| Dalc:Walc | 14 | 206.78 | 14.770 | 1.494 | 0.1077 | |
| Residuals | 626 | 6188.90 | 9.886 | NA | NA |
The results from the two-way ANOVA indicate that weekday alcohol
consumption (Dalc) has a statistically significant main
effect on students’ final grades (G3), with a p-value of
0.0000. This suggests that higher levels of weekday drinking are
significantly associated with differences in academic performance.
In contrast, weekend alcohol consumption (Walc) does not
have a statistically significant effect on grades, with a p-value of
0.3992, which is well above the conventional 0.05 threshold. Therefore,
weekend drinking alone does not show a significant association with
final grades.
The interaction term Dalc:Walc has a p-value of 0.1077,
which is also not statistically significant. This indicates that the
effect of weekday alcohol consumption on grades does not significantly
differ across levels of weekend alcohol consumption, and vice versa.
To better understand the relationship between alcohol consumption and
academic performance, an interaction plot is generated. This plot helps
to visualize the main effects of Dalc and Walc
and the lack of a significant interaction between them:
interaction.plot(x.factor = por_df$Dalc,
trace.factor = por_df$Walc,
response = por_df$G3,
fun = mean,
type = "b",
xlab = "Weekday Alcohol Consumption (Dalc)",
ylab = "Mean Final Grade (G3)",
col = c("blue", "red", "green", "purple", "orange"),
pch = c(19, 17, 15, 12, 8),
trace.label = "Weekend Alcohol Consumption (Walc)",
main = "Interaction Plot of Alcohol Consumption and Final Grades")The interaction plot displays mean final grades by levels of weekday
and weekend alcohol consumption. For all levels of weekend alcohol use,
there is a general downward trend in academic performance as weekday
alcohol use increases. The lines representing different levels of
weekend alcohol use are mostly parallel, suggesting no strong
interaction between weekday and weekend drinking. One exception is the
sharp increase for Walc = 1 at Dalc = 3, which
may indicate variability or a small sample at that point. Overall, the
plot suggests that weekday drinking is more consistently associated with
lower grades than weekend drinking, and that the two do not strongly
interact.
To check the assumptions of the ANOVA model, we analyze the residuals:
resids_q2 <- residuals(model_q2)
# Histogram of residuals
hist(resids_q2,
main = "Histogram of Residuals for Q2 Model",
xlab = "Residuals",
col = "lightblue",
border = "black")The histogram of the residuals shows a distribution that is approximately normal as it is roughly symmetric and bell-shaped, centered around at 0. The Q–Q plot supports this conclusion for the most part, as the residuals closely follow the theoretical quantile line in the center of the distribution. However, there is some noticeable deviation in the lower tail, suggesting potential skewness or the presence of outliers. Despite these tail deviations, the residuals largely conform to the normality assumption required for ANOVA. Overall, the assumption of normality appears reasonably satisfied.
While the Shapiro-Wilk test returned a highly significant p-value (\(p \approx 4.94 \times 10^{-15}\)), indicating a deviation from normality in the residuals, visual assessments such as the histogram and Q–Q plot suggest that the deviation is relatively minor and occurs primarily in the tails.
Additionally, the residuals appear to have approximately constant variance across the range of fitted values, with no clear funneling or systematic patterns—supporting the assumption of homoscedasticity. The presence of vertically aligned points is expected in an ANOVA model due to discrete factor levels, and the vertical spread within those columns is reasonably consistent.
Given the large sample size, ANOVA is generally robust to mild departures from normality. Therefore, despite the significant p-value from the Shapiro-Wilk test, there is no strong evidence of a meaningful violation of model assumptions. The assumptions of normality and homoscedasticity are considered reasonably satisfied, supporting the trustworthiness of the ANOVA results.
Our two-way ANOVA analysis examined the relationship between Portuguese high school students’ alcohol consumption and their final grades in Portuguese class. The results revealed a significant negative association between weekday alcohol consumption and final grades, suggesting that higher levels of drinking during the school week are associated with lower academic performance.
In contrast, weekend alcohol consumption did not show a statistically significant effect on grades. The interaction between weekday and weekend alcohol use was not significant, indicating that the effect of weekday drinking on academic outcomes does not vary meaningfully across levels of weekend drinking.
These findings suggest that weekday drinking may be disruptive to students’ academic success, possibly due to its interference with school responsibilities, sleep quality, and cognitive performance. While the study does not establish causality, the strong association points to a behavioral pattern that may warrant intervention. Future research or policy efforts may consider addressing weekday alcohol consumption to support better academic outcomes.
When looking at Research Question 3, it was evident that I would be comparing the average amount of free time between two groups. Those groups being students in relationships versus students who are not in relationships. The first thing I did was create side by side density histograms showing the distributions of average free time between the two groups. I made sure it was a density histogram because there were more students not in relationships (412) than those in relationships (239). This ensured that the distributions could be compared 1 to 1 without being affected by unequal weighting. The next plot I created was an overlayed density plot of both distributions which allows the viewer to see the similarities and differences in distribution in both plots. I created these plots to get some preliminary insights on how the distributions looked before I used a t-test to see whether I could reject or fail to reject the null hypothesis that there is no significant difference in average free time between students in a romantic relationship and those who are not.
ggplot(por_df, aes(x = freetime)) +
geom_histogram(aes(y = ..density.., fill = as.factor(romantic)),
alpha = 0.7, binwidth = 1, color = "black") +
facet_wrap(~ romantic) +
labs(title = "Density Plot of Freetime Based on Relationship Status",
x = "Freetime",
y = "Density",
fill = "In a Relationship") +
theme_minimal()## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(por_df, aes(x = freetime, fill = as.factor(romantic))) +
geom_density(alpha = 0.4, bw = 0.35) +
labs(title = "Density Estimate of Freetime (In Relationship vs Not in Relationship)",
x = "Freetime",
y = "Density",
color = "Year")| Statistic | Value |
|---|---|
| Mean (No) | 3.159 |
| Mean (Yes) | 3.218 |
| Variance (No) | 1.122 |
| Variance (Yes) | 1.079 |
| t-statistic | -0.693 |
| df | 505.830 |
| p-value | 0.488 |
| 95% CI Lower | -0.226 |
| 95% CI Upper | 0.108 |
We use a Welch’s two-sided t-test because the inputted data is independent, the sample sizes between the groups differ, and the variances are unequal. Based on the t-test, we get a t statistic of -0.67501, and a p value of 0.5. Because this is a two tailed test with a = 0.05, our alpha endpoints are approximately -1.96 and 1.96. The t statistic of -0.67501 falls between this interval, so we fail to reject the null hypothesis.
Based on the two-sided Welch’s t-test, we fail to reject the null hypothesis. This indicates that there is no significant difference in average free time between students in relationships and students not in relationships. This finding agrees with the density histograms and density plot because when looking at the models, the two groups appear to follow similar distributions. Running the t-test confirms the observation that there is no statistically significant difference in average free time.
The analysis for our third research question explored whether being in a romantic relationship affects the amount of average free time reported by Portuguese high school students. After comparing distributions and using Welch’s two sided t-test, the study found no statistically significant difference in free time between students who are in romantic relationships and those who are not (p = 0.488). Both the visualizations (density plots) and summary statistics (means and variances) supported this result, indicating similar distributions across groups.
Based on our findings, we can recommend that different services such as student support services and wellness programs can focus more broadly on time management strategies, balanced lifestyles, and stress management that apply to all students rather than tailoring efforts based on relationship status.
Although our findings give us actionable insights into average free time vs. relationship status, apparent limitations or bias can be the result of self-reported data and simplistic categorization, with our findings overlooking nuances such as quality, duration, or intensity of the relationship that could influence time availability.
To possibly improve or add to future insights and analysis, we could incorporate interviews or surveys that provide more detailed context around students’ relationships and time use, or utilize time-series data to help assess how changes in relationship status affect students’ behaviors and time availability across different periods.
Our project aimed to explore how various social, familial, and behavioral factors influence the academic performance and lifestyle of Portuguese high school students, using the Student Performance dataset compiled by Cortez and Silva (2008). Specifically, we examined the effects of parental education, alcohol consumption, and relationship status on students’ final grades in Portuguese and their reported free time. After our in depth analysis, we can address key insights, recommendations, limitations, and future directions.
Parental Education: We found strong evidence that a student’s father’s education level is positively associated with their final Portuguese grade, even after controlling for weekly study time and internet access. This suggests that higher parental education may provide academic advantages, likely through increased support, expectations, or access to educational resources.
Alcohol Consumption: Weekday alcohol consumption was significantly negatively associated with academic performance, whereas weekend alcohol consumption had no significant effect. This highlights the particular academic risk associated with drinking during school days, likely due to its impact on focus, sleep, and class participation.
Relationship Status and Free Time: No significant difference was found in average free time between students who are in romantic relationships and those who are not. This implies that relationship status alone may not meaningfully affect how much free time students perceive themselves to have.
Expand Academic Support: To help close the gap for students whose parents have lower education levels, schools should consider investing in academic mentoring, tutoring programs, and parent engagement initiatives that provide families with tools to support learning at home.
Target Weekday Alcohol Use: Educational campaigns should specifically address the risks of weekday alcohol consumption and its negative academic consequences. Collaborations with parents, counselors, and local health authorities could reinforce this messaging and offer preventive resources.
Universal Time Management Programs: Given that free time did not significantly vary with relationship status, time management resources, such as programs, apps, or advisory sessions, should be offered broadly to all students rather than tailored to specific demographic groups.
While the findings offer meaningful insights, several limitations must be acknowledged:
The data is mostly observational, limiting our ability to imply causality.
Many variables, such as free time, alcohol use, and relationship status, rely on self-reported data, which may introduce bias or measurement error.
The dataset represents students from two Portuguese schools during a specific time period, which is a relative generalization to other regions or educational systems.
Qualitative research might provide deeper context, especially around family support structures and relationship dynamics.
Future analyses could also explore interaction between multiple variables, such as whether the impact of alcohol use varies by parental education or relationship status.
Expanding this type of analysis to other subjects or countries could validate the findings and help create more elaborate and valid student support strategies.
In summary, this project sheds light on several critical factors affecting student outcomes and well-being. While further research can be beneficial, the results provide a great foundation for insights aimed at improving academic performance and supporting student development in meaningful ways.