2025-02-04

Intro to data set

This data set contains attributes related to students' academic performance. It comes from Portuguese secondary schools and was collected from school reports and questionnaires. It includes student grades as well as demographic, social, and school-related features. It covers performance in two subjects: mathematics and Portuguese language.

Source: UC Irvine Machine Learning Repository

https://archive.ics.uci.edu/dataset/320/student+performance

Summary of the regression model

Call:
lm(formula = G3 ~ studytime, data = students)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4643  -1.8623   0.5357   3.0697   9.1377 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.3283     0.6033  15.463   <2e-16 ***
studytime     0.5340     0.2741   1.949   0.0521 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.565 on 393 degrees of freedom
Multiple R-squared:  0.009569,  Adjusted R-squared:  0.007049 
F-statistic: 3.797 on 1 and 393 DF,  p-value: 0.05206

More about summary

The intercept β0 is 9.3283. This is the predicted final grade G3 when studytime is zero. Because studytime is coded from 1 to 4 in this data set, zero lies outside the observed range, so the intercept has no direct interpretation; it is still needed to construct the regression equation.

The slope β1 is 0.5340: each additional unit of studytime is associated with an increase of about 0.534 points in the predicted final grade G3.

The p-value for studytime is 0.05206. This is just above the conventional significance level of 0.05, so the relationship between studytime and final grade G3 is at best marginally significant.

Since the p-value is above 0.05, we cannot reject the null hypothesis that studytime has no effect on the final grade.

The result hints at a weak positive trend, but that trend is not statistically significant at the 5% level.
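As a check on the reasoning above, the t value and p-value can be recomputed from the estimate and standard error printed in the summary. This is a minimal sketch that uses only those printed (rounded) numbers, so small differences from the summary in the last digit are expected. The 95% confidence interval is an addition that the summary does not print.

```r
# Values copied from the coefficient table in the summary output.
estimate  <- 0.5340   # slope for studytime
std_error <- 0.2741   # standard error of the slope
df_resid  <- 393      # residual degrees of freedom

# t statistic: the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope (not shown in the summary).
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t = t_value, p = p_value), 4))
print(round(ci, 3))
```

The interval should start slightly below zero, which is another way of seeing why the null hypothesis cannot be rejected at the 5% level.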

R-squared is 0.009569, so studytime explains less than 1% of the variance in G3. This is a weak model, indicating that study time alone is not a strong predictor of final grades.
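The other fit statistics in the summary follow from R-squared and the sample size. The sketch below assumes n = 395 observations, inferred from the 393 residual degrees of freedom plus the two estimated coefficients, and uses the standard formulas for adjusted R-squared and the F-statistic.

```r
# Values taken from the summary output.
r_squared <- 0.009569
n <- 395   # assumption: 393 residual df + 2 estimated coefficients
p <- 1     # number of predictors

# Magnitude of the correlation between studytime and G3.
r <- sqrt(r_squared)

# Adjusted R-squared penalises for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F-statistic for the overall model.
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(round(c(r = r, adj_r_squared = adj_r_squared, F = f_stat), 4))
```

The results should agree with the reported adjusted R-squared of 0.007049 and F-statistic of 3.797, up to rounding.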

Simple Linear Regression

Simple linear regression is used to model the relationship between two variables.

It is represented by the formula: \(Y = \beta_0 + \beta_1 X + \epsilon\)

Where:

Y represents the dependent variable

X represents the independent variable

β0 represents the intercept

β1 represents the slope

ϵ represents the error term
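To show how β0 and β1 are estimated, here is a minimal sketch using the closed-form least-squares formulas, compared against `lm()`. The vectors `x` and `y` are toy values made up for illustration; they are not taken from the student data.

```r
# Toy values chosen only for illustration; NOT from the student data set.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form least-squares estimates of slope and intercept.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates obtained with lm().
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both approaches should give the same intercept and slope, since `lm()` fits the line by ordinary least squares.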

Scatter Plot of Study time VS Final Grade

The following slide looks at the linear regression line in more detail.

In the graph, the x-axis represents study time, while the y-axis represents the final grade (G3).

The red regression line is the best-fitting straight line for predicting the final grade G3 from study time. Our goal with this graph is to see whether there is a general trend of higher study time leading to higher grades.

Scatter Plot of Study time VS Final Grade with Linear Regression Line
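The code behind this plot is not shown in the slides. One way to draw such a plot is sketched below, assuming the ggplot2 package and a data frame with columns `studytime` and `G3`. The helper name `plot_studytime_vs_g3` is hypothetical, and the styling choices are guesses rather than the original settings.

```r
library(ggplot2)

# Hypothetical helper; `students` must contain the columns studytime and G3.
plot_studytime_vs_g3 <- function(students) {
  ggplot(students, aes(x = studytime, y = G3)) +
    # Semi-transparent points, since many students share the same values.
    geom_point(alpha = 0.5) +
    # Least-squares line of G3 on studytime, drawn in red.
    geom_smooth(method = "lm", formula = y ~ x, color = "red") +
    labs(title = "Study Time vs Final Grade",
         x = "Study Time",
         y = "Final Grade (G3)") +
    theme_minimal()
}

# Usage, assuming `students` is already loaded:
# plot_studytime_vs_g3(students)
```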


Plotly

Regression equation

The regression line in the graph corresponds to the equation

\(\widehat{G3} = \beta_0 + \beta_1 \cdot \text{studytime}\)

Where:

\(\widehat{G3}\) represents the predicted final grade

β0 represents the intercept

β1 represents the slope (how much the predicted final grade changes with each unit of study time)
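Plugging in the estimates from the summary (β0 = 9.3283, β1 = 0.5340) gives a predicted grade for each study time category. The sketch below assumes studytime takes the values 1 to 4, as in the UCI variable description.

```r
# Coefficients copied from the summary output.
b0 <- 9.3283   # intercept
b1 <- 0.5340   # slope for studytime

# Assumption: studytime is coded 1 to 4 in this data set.
studytime <- 1:4

# Predicted final grade for each category.
predicted_g3 <- b0 + b1 * studytime

print(data.frame(studytime, predicted_g3))
```

Across the whole range, the predicted grade rises by only 3 × 0.534 ≈ 1.6 points on a 0 to 20 grading scale, which illustrates how weak the effect is.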

Box plot

# Boxplot to show the distribution of final grades (G3) by studytime
# (assumes the `students` data frame used for the model is already loaded)
library(ggplot2)

ggplot(students, aes(x = factor(studytime), y = G3)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Final Grade (G3) by Study Time Categories",
       x = "Study Time",
       y = "Final Grade (G3)") +
  theme_minimal()

Conclusion

The relationship between studytime and final grade G3 is weak. While there may be a slight tendency for students who study more to achieve higher grades, studytime does not appear to be a strong predictor of G3.

However, the target attribute G3 (final year grade) is strongly correlated with G1 (first period grade) and G2 (second period grade), as all three correspond to different periods of the same academic year. Predicting G3 without G1 and G2 is more challenging, but such predictions are more valuable, since G3 represents the final outcome of the academic year.

Given that G3 is strongly related to G1 and G2, it is useful to fit a multiple linear regression with G1, G2, and studytime as independent variables.
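A minimal sketch of that comparison follows. The helper name `compare_grade_models` is hypothetical, and the usage lines assume the mathematics file from the UCI archive (`student-mat.csv`, semicolon-separated) is in the working directory. No results are shown here because this model was not fitted in the slides.

```r
# Hypothetical helper: fits the simple and the extended model on the same data
# and returns their adjusted R-squared values side by side.
compare_grade_models <- function(students) {
  simple <- lm(G3 ~ studytime, data = students)
  full   <- lm(G3 ~ G1 + G2 + studytime, data = students)
  data.frame(
    model = c("G3 ~ studytime", "G3 ~ G1 + G2 + studytime"),
    adj_r_squared = c(summary(simple)$adj.r.squared,
                      summary(full)$adj.r.squared)
  )
}

# Usage (assumed file name and separator):
# students <- read.csv("student-mat.csv", sep = ";")
# compare_grade_models(students)
```

Adjusted R-squared is used for the comparison because plain R-squared never decreases when predictors are added.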

Because the correlation between study time and final grade G3 is weak, we should not rely solely on study time to predict final grades. Other factors, such as the previous period grades (G1 and G2), are likely to influence the final outcome.