The dataset is downloaded from https://www.kaggle.com/datasets/lainguyn123/student-performance-factors/data.
# Check the names of the variables of the dataset
names(mydata)
## [1] "Hours_Studied" "Attendance"
## [3] "Parental_Involvement" "Access_to_Resources"
## [5] "Extracurricular_Activities" "Sleep_Hours"
## [7] "Previous_Scores" "Motivation_Level"
## [9] "Internet_Access" "Tutoring_Sessions"
## [11] "Family_Income" "Teacher_Quality"
## [13] "School_Type" "Peer_Influence"
## [15] "Physical_Activity" "Learning_Disabilities"
## [17] "Parental_Education_Level" "Distance_from_Home"
## [19] "Gender" "Exam_Score"
Three variables are chosen from this dataset, and the description of the variables is as follows:
Exam_Score: Final exam score.
Hours_Studied: Number of hours spent studying per week.
Sleep_Hours: Average number of hours of sleep per night.
In this project, the effect of the number of hours spent studying and the number of hours of sleep on final exam score is investigated.
Next, the least squares model is estimated for the selected variables, and the estimation results are shown.
# Estimate a model
model <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
model
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
##
## Coefficients:
## (Intercept) Hours_Studied Sleep_Hours
## 61.86205 0.28945 -0.05807
# Check the model
summary(model)
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.592 -2.276 -0.139 2.011 33.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.862047 0.252389 245.11 <2e-16 ***
## Hours_Studied 0.289447 0.007153 40.47 <2e-16 ***
## Sleep_Hours -0.058071 0.029188 -1.99 0.0467 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.483 on 6604 degrees of freedom
## Multiple R-squared: 0.1989, Adjusted R-squared: 0.1987
## F-statistic: 819.9 on 2 and 6604 DF, p-value: < 2.2e-16
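The estimates show that Hours_Studied has a positive and highly significant effect on the final exam score: each additional hour of study per week is associated with an increase of about 0.29 points, holding Sleep_Hours constant. Sleep_Hours has a small negative coefficient (about -0.058 points per additional hour of sleep) that is significant only at the 5% level (p = 0.0467). The model explains about 20% of the variation in Exam_Score (R-squared = 0.1989).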
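The coefficients that lm() reports are the least squares solution of the normal equations X′Xb = X′y. The following is a minimal sketch of that computation on simulated data; the variables x1, x2, and y are hypothetical stand-ins (not the Kaggle dataset), chosen only so that the example runs on its own.

```r
# Simulated stand-in data (hypothetical; not the student performance dataset)
set.seed(11)
n  <- 120
x1 <- rnorm(n, mean = 20, sd = 6)
x2 <- rnorm(n, mean = 7, sd = 1.5)
y  <- 60 + 0.3 * x1 - 0.05 * x2 + rnorm(n, sd = 3)

# Design matrix with a column of ones for the intercept
X <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)

# Solve the normal equations X'X b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x1 + x2)
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

Note that lm() itself uses a QR decomposition rather than solving the normal equations directly, so the two results agree only up to numerical tolerance.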
# Calculate residuals and fitted values
residuals_1 <- model$residuals
fitted_1 <- model$fitted.values
res_1 <- data.frame(resid = residuals_1, fit = fitted_1)
# Get the design matrix (without the intercept)
matrix_model <- model.matrix(model)[, -1]
# Check orthogonality
orthogonality_check <- crossprod(matrix_model, residuals_1)
orthogonality_check
## [,1]
## Hours_Studied 3.241207e-11
## Sleep_Hours 7.394307e-12
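The cross products are zero up to floating-point error (on the order of 1e-11), so each regressor is orthogonal to the LS residuals. Because the residuals also have mean zero when the model includes an intercept, the regressors are uncorrelated with the LS residuals (i.e., cov(x,e) = 0).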
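The orthogonality result follows directly from the normal equations. A short derivation is sketched below; the notation (X for the full design matrix including the column of ones, β̂ for the LS estimate, x_j for the j-th regressor) is assumed here and is not defined in the code above.

```latex
\begin{aligned}
e &= y - X\hat{\beta}, \qquad X'X\hat{\beta} = X'y \\
X'e &= X'y - X'X\hat{\beta} = 0 \\
\widehat{\mathrm{cov}}(x_j, e)
  &= \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(e_i-\bar{e})
   = \frac{1}{n-1}\Big(x_j'e - \bar{x}_j\sum_{i=1}^{n} e_i\Big) = 0
\end{aligned}
```

The last equality uses both x_j′e = 0 and the fact that the residuals sum to zero when an intercept is included.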
# Check if the sum of the residuals is zero
sum_residuals <- sum(residuals_1)
sum_residuals
## [1] 4.720668e-13
The sum is zero up to floating-point error. This follows from X′e = 0: the first column of X is the column of ones for the intercept (dropped from matrix_model above), so the first element of X′e is the sum of the residuals. The result illustrates the property that the LS residuals always sum to zero when the model includes an intercept.
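The link between X′e = 0 and the zero residual sum can be seen by keeping the intercept column in the design matrix. The sketch below uses simulated data (x1, x2, and y are hypothetical, not the Kaggle variables) and also fits a model without an intercept to show that the residual sum is then generally not zero.

```r
# Simulated stand-in data (hypothetical)
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Model with an intercept: keep the full design matrix, including the ones column
fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
e   <- resid(fit)

# X'e: the "(Intercept)" row is the sum of the residuals
Xte <- crossprod(X, e)
Xte
sum(e)

# Model without an intercept: the residuals need not sum to zero
fit_no_const <- lm(y ~ x1 + x2 - 1)
sum(resid(fit_no_const))
```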
To confirm that the regression line goes through the sample mean of the data, we check whether the predicted value at the means of the independent variables equals the mean of the dependent variable.
# Calculate means of the response and predictor variables
mean_exam_score <- mean(mydata$Exam_Score)
mean_hours_studied <- mean(mydata$Hours_Studied)
mean_sleep_hours <- mean(mydata$Sleep_Hours)
# Extract the coefficients from the model
coefficients <- coef(model)
intercept <- coefficients[1]
beta_hours_studied <- coefficients[2]
beta_sleep_hours <- coefficients[3]
# Compute the predicted Exam_Score at the means
predicted_mean_exam_score <- intercept + beta_hours_studied * mean_hours_studied + beta_sleep_hours * mean_sleep_hours
# Print results
mean_exam_score
## [1] 67.23566
predicted_mean_exam_score
## (Intercept)
## 67.23566
Because mean_exam_score and predicted_mean_exam_score are equal, the fitted regression passes through the sample means of the data.
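The same check can be written more compactly with predict(), which avoids extracting the coefficients by hand. This is a minimal sketch on simulated data; the data frame d and its columns x1, x2, and y are hypothetical stand-ins for the dataset used above.

```r
# Simulated stand-in data (hypothetical)
set.seed(7)
n <- 150
d <- data.frame(x1 = rnorm(n, mean = 20, sd = 6),
                x2 = rnorm(n, mean = 7, sd = 1.5))
d$y <- 60 + 0.3 * d$x1 - 0.05 * d$x2 + rnorm(n, sd = 3)

fit <- lm(y ~ x1 + x2, data = d)

# One-row data frame holding the sample means of the regressors
at_means <- data.frame(x1 = mean(d$x1), x2 = mean(d$x2))

# Prediction at the means should equal the mean of the response
pred_at_means <- unname(predict(fit, newdata = at_means))
all.equal(pred_at_means, mean(d$y))
```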
“Invariance” means that the fitted values and residuals are unchanged when the X variables are replaced by a non-singular linear transformation of themselves; only the coefficients change.
# Fit the original model
model_original <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
# Extract the original fitted values and residuals
fitted_values_original <- fitted(model_original)
residuals_original <- resid(model_original)
# Apply a non-singular linear transformation
# Example transformation: X1' = 2 * Hours_Studied + 5; X2' = 3 * Sleep_Hours + 4
mydata$Transformed_Hours_Studied <- 2 * mydata$Hours_Studied + 5
mydata$Transformed_Sleep_Hours <- 3 * mydata$Sleep_Hours + 4
# Fit the model with transformed variables
model_transformed <- lm(Exam_Score ~ Transformed_Hours_Studied + Transformed_Sleep_Hours, data = mydata)
# Extract the fitted values and residuals from the transformed model
fitted_values_transformed <- fitted(model_transformed)
residuals_transformed <- resid(model_transformed)
# Compare the fitted values and residuals
all.equal(fitted_values_original, fitted_values_transformed)
## [1] TRUE
all.equal(residuals_original, residuals_transformed)
## [1] TRUE
The all.equal() function compares the fitted values and residuals of the original model with those of the model with transformed variables, up to a small numerical tolerance. Both comparisons return TRUE, which confirms that the fitted values and residuals are invariant to a non-singular linear transformation of the regressors.
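The invariance property also holds for transformations that mix the regressors, not only for rescaling and shifting each one separately. The sketch below, on simulated data, multiplies the full design matrix X by a hypothetical invertible matrix A and checks that the fitted values are unchanged while the coefficients are related by b = A g, where g are the coefficients of the transformed model. All names here (x1, x2, y, A, Z) are assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical)
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 20, sd = 6)
x2 <- rnorm(n, mean = 7, sd = 1.5)
y  <- 60 + 0.3 * x1 - 0.05 * x2 + rnorm(n, sd = 3)

# Design matrix with an intercept column
X <- cbind(1, x1, x2)

# Hypothetical invertible (upper triangular) matrix; matrix() fills by column,
# so the columns of Z are: 1, 5 + 2*x1, 4 + x1 + 3*x2
A <- matrix(c(1, 0, 0,
              5, 2, 0,
              4, 1, 3), nrow = 3)
Z <- X %*% A

# Least squares fits on the original and the transformed design matrix
fit_x <- lm.fit(X, y)
fit_z <- lm.fit(Z, y)

# Fitted values and residuals are identical
all.equal(as.vector(fit_x$fitted.values), as.vector(fit_z$fitted.values))
all.equal(as.vector(fit_x$residuals), as.vector(fit_z$residuals))

# Coefficients are related through A: b = A %*% g
all.equal(as.vector(fit_x$coefficients), as.vector(A %*% fit_z$coefficients))
```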
# Regress Sleep_Hours on Hours_Studied to get the residuals
model_x2x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2x1 <- resid(model_x2x1) # Residuals of Sleep_Hours ~ Hours_Studied
# Regress Exam_Score on Hours_Studied to get the residuals
model_yx1 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_yx1 <- resid(model_yx1) # Residuals of Exam_Score ~ Hours_Studied
# Regress the Exam_Score residuals on the Sleep_Hours residuals
model_fwl <- lm(residuals_yx1 ~ residuals_x2x1)
# The coefficient of residuals_x2x1 is the LS estimate of β2
beta_2_fwl <- summary(model_fwl)$coefficients[2, 1]
# Print the estimated β2
beta_2_fwl
## [1] -0.05807087
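The β2 estimate (-0.05807087) is identical to the Sleep_Hours coefficient in the original multiple regression. The method shows that β2 measures the partial effect of Sleep_Hours on Exam_Score after the linear influence of Hours_Studied has been removed from both variables.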
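This result is the Frisch-Waugh-Lovell (FWL) theorem. It can be stated compactly as follows; the notation is assumed for illustration, with X_1 denoting the matrix that holds the column of ones and Hours_Studied, x_2 denoting Sleep_Hours, and M_1 denoting the matrix that produces residuals from a regression on X_1.

```latex
\begin{aligned}
y &= X_1\hat{\beta}_1 + x_2\hat{\beta}_2 + e, \qquad
M_1 = I - X_1(X_1'X_1)^{-1}X_1' \\
\hat{\beta}_2 &= (x_2'M_1x_2)^{-1}\,x_2'M_1y
  = \big[(M_1x_2)'(M_1x_2)\big]^{-1}(M_1x_2)'(M_1y)
\end{aligned}
```

Here M_1 y and M_1 x_2 are the residuals from regressing Exam_Score and Sleep_Hours on X_1, and the second equality uses the fact that M_1 is symmetric and idempotent.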
# Fit the model regressing Exam_Score on Hours_Studied
model_y_on_x1 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_y <- model_y_on_x1$residuals # Residuals of Exam_Score ~ Hours_Studied
# Fit the model regressing Sleep_Hours on Hours_Studied
model_x2_on_x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2 <- model_x2_on_x1$residuals # Residuals of Sleep_Hours ~ Hours_Studied
# Regress the residuals of Exam_Score on the residuals of Sleep_Hours
model_final <- lm(residuals_y ~ residuals_x2)
# Get the estimate for beta_2 (coefficient of the residuals of Sleep_Hours)
beta_2_estimate <- summary(model_final)$coefficients[2, 1]
# Print the estimated β2
beta_2_estimate
## [1] -0.05807087
This method yields the same β2 estimate as the original regression model (lm(Exam_Score ~ Hours_Studied + Sleep_Hours)); it applies the FWL theorem to partial out the effect of the other independent variable (Hours_Studied).
# Fit the model regressing Exam_Score on Hours_Studied
model_y_on_x1 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_y <- model_y_on_x1$residuals # Residuals of Exam_Score ~ Hours_Studied
# Fit the model regressing Sleep_Hours on Hours_Studied
model_x2_on_x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2 <- model_x2_on_x1$residuals # Residuals of Sleep_Hours ~ Hours_Studied
# Regress the residuals of Exam_Score on the residuals of Sleep_Hours without a constant
model_final_no_const <- lm(residuals_y ~ residuals_x2 - 1) # The '- 1' removes the intercept
# Get the estimate for beta_2 (coefficient of the residuals of Sleep_Hours)
beta_2_estimate_no_const <- summary(model_final_no_const)$coefficients[1, 1] # Coefficient of residuals_x2
# Print the estimate of beta_2
beta_2_estimate_no_const
## [1] -0.05807087
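This method also yields the same estimate of β2 by isolating the contribution of one independent variable (Sleep_Hours) to the dependent variable (Exam_Score) after controlling for the effect of the other independent variable (Hours_Studied). The intercept can be omitted from the final regression because both residual series come from regressions that include an intercept and therefore have mean zero.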
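One caveat when using the residual regression is that, although the coefficient matches the full model, the standard error reported by summary() does not, because the residual regression does not account for the degrees of freedom used in the first-stage regressions. The sketch below illustrates this on simulated data; x1, x2, and y are hypothetical stand-ins, and the correction factor assumes a full model with three estimated coefficients and a residual regression without a constant.

```r
# Simulated stand-in data (hypothetical)
set.seed(42)
n  <- 300
x1 <- rnorm(n, mean = 20, sd = 6)
x2 <- 7 + 0.05 * x1 + rnorm(n)
y  <- 60 + 0.3 * x1 - 0.05 * x2 + rnorm(n, sd = 3)

# Full multiple regression
full <- lm(y ~ x1 + x2)

# FWL: partial x1 out of both y and x2, then regress residuals on residuals
ry  <- resid(lm(y ~ x1))
rx2 <- resid(lm(x2 ~ x1))
fwl <- lm(ry ~ rx2 - 1)

# The coefficients agree
b_full <- unname(coef(full)["x2"])
b_fwl  <- unname(coef(fwl)["rx2"])
all.equal(b_full, b_fwl)

# The standard errors differ by a degrees-of-freedom factor:
# the full model has n - 3 residual df, the residual regression n - 1
se_full <- coef(summary(full))["x2", "Std. Error"]
se_fwl  <- coef(summary(fwl))["rx2", "Std. Error"]
all.equal(se_full, se_fwl * sqrt((n - 1) / (n - 3)))
```

With a sample as large as the one used in this project, the difference between the two standard errors is negligible, but it matters in small samples.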