1. Download any data set that contains at least 3 variables.

The dataset is downloaded from https://www.kaggle.com/datasets/lainguyn123/student-performance-factors/data.

# Check the names of the variables of the dataset
names(mydata)
##  [1] "Hours_Studied"              "Attendance"                
##  [3] "Parental_Involvement"       "Access_to_Resources"       
##  [5] "Extracurricular_Activities" "Sleep_Hours"               
##  [7] "Previous_Scores"            "Motivation_Level"          
##  [9] "Internet_Access"            "Tutoring_Sessions"         
## [11] "Family_Income"              "Teacher_Quality"           
## [13] "School_Type"                "Peer_Influence"            
## [15] "Physical_Activity"          "Learning_Disabilities"     
## [17] "Parental_Education_Level"   "Distance_from_Home"        
## [19] "Gender"                     "Exam_Score"

Three variables are chosen from this dataset, and the description of the variables is as follows:

Exam_Score: Final exam score.

Hours_Studied: Number of hours spent studying per week.

Sleep_Hours: Average number of hours of sleep per night.

In this project, the effect of the number of hours spent studying and the number of hours of sleep on final exam score is investigated.

2. Use LS to estimate a population model.

Next, a least squares model is fitted to the selected variables, and the estimation results are shown.

# Estimate a model
model <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
model
## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
## 
## Coefficients:
##   (Intercept)  Hours_Studied    Sleep_Hours  
##      61.86205        0.28945       -0.05807
# Check the model
summary(model)
## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.592 -2.276 -0.139  2.011 33.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   61.862047   0.252389  245.11   <2e-16 ***
## Hours_Studied  0.289447   0.007153   40.47   <2e-16 ***
## Sleep_Hours   -0.058071   0.029188   -1.99   0.0467 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.483 on 6604 degrees of freedom
## Multiple R-squared:  0.1989, Adjusted R-squared:  0.1987 
## F-statistic: 819.9 on 2 and 6604 DF,  p-value: < 2.2e-16

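The model shows that Hours_Studied has a significant positive effect on the final exam score: each additional weekly study hour is associated with an increase of about 0.29 points. Sleep_Hours has a small negative coefficient (-0.058) that is only marginally significant at the 5% level (p = 0.0467). The two variables together explain about 20% of the variation in Exam_Score (R-squared = 0.1989).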
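As a cross-check on what lm() computes, the LS estimate can also be obtained from the normal equations (X′X)b = X′y. The sketch below is illustrative only: it assumes the built-in mtcars data set as a stand-in for mydata, with mpg, wt and hp playing the roles of Exam_Score, Hours_Studied and Sleep_Hours.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption, not the report's data)
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix with a constant
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the coefficients reported by lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```

lm() itself uses a QR decomposition rather than the normal equations, which is numerically more stable; the two approaches agree here up to floating-point error.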

3. Verify that the x variables are orthogonal to the LS residuals.

# Calculate residuals and fitted values
residuals_1 <- model$residuals
fitted_1 <- model$fitted.values
res_1 <- data.frame(resid = residuals_1, fit = fitted_1)

# Get the design matrix (without the intercept)
matrix_model <- model.matrix(model)[, -1]

# Check orthogonality
orthogonality_check <- crossprod(matrix_model, residuals_1)
orthogonality_check
##                       [,1]
## Hours_Studied 3.241207e-11
## Sleep_Hours   7.394307e-12

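The cross products are zero up to floating-point error (on the order of 1e-11), so both regressors are orthogonal to the LS residuals. Because the model includes an intercept, the residuals also have mean zero, so the regressors are uncorrelated with the residuals as well (i.e., cov(x,e) = 0).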
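Whether a value such as 3e-11 counts as "zero" depends on the scale of the data. One possible way to make the check scale-free is to divide each cross product by its Cauchy-Schwarz upper bound, the product of the norms of x and e. The sketch below assumes the built-in mtcars data set as a stand-in for mydata, and the threshold 1e-10 is an arbitrary illustrative choice.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)   # full design matrix, including the intercept column
e <- resid(fit)

# Raw cross products X'e, one entry per column of X
xe <- drop(crossprod(X, e))

# Upper bound on |x_j'e| from Cauchy-Schwarz: ||x_j|| * ||e||
bound <- sqrt(colSums(X^2)) * sqrt(sum(e^2))

# Relative size of each cross product; values near machine precision indicate orthogonality
rel <- abs(xe) / bound
all(rel < 1e-10)
```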

4. Verify that the LS residuals from your estimated model sum to zero.

# Check if the sum of the residuals is zero
sum_residuals <- sum(residuals_1)
sum_residuals
## [1] 4.720668e-13

Since X′e = 0 (orthogonality) and the first column of X is the constant (a column of ones), the first element of X′e is the sum of the residuals, which must therefore be zero. The computed value, 4.72e-13, is zero up to floating-point error. This confirms the property that LS residuals always sum to zero when an intercept is included in the model.
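The role of the intercept can be illustrated by fitting the same regression with and without a constant. This is a sketch under the assumption that the built-in mtcars data set stands in for mydata; the object names are hypothetical.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption)
fit_with_const <- lm(mpg ~ wt + hp, data = mtcars)      # intercept included
fit_no_const   <- lm(mpg ~ wt + hp - 1, data = mtcars)  # '- 1' drops the intercept

# With an intercept, the residuals sum to zero up to floating-point error
sum_with_const <- sum(resid(fit_with_const))

# Without an intercept, nothing forces the residuals to sum to zero
sum_no_const <- sum(resid(fit_no_const))

# The intercept row of X'e is exactly the residual sum
first_row <- crossprod(model.matrix(fit_with_const), resid(fit_with_const))[1, 1]

c(with_const = sum_with_const, no_const = sum_no_const, first_row = first_row)
```

In the model without a constant, the residuals are still orthogonal to wt and hp, but the column of ones is no longer among the regressors, so the zero-sum property is lost.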

5. Verify that the regression line (it is actually a 2-dimensional “plane”) passes through the sample mean of the data.

To confirm that the regression plane passes through the sample mean of the data, we check whether the predicted value at the means of the independent variables equals the mean of the dependent variable.

# Calculate means of the response and predictor variables
mean_exam_score <- mean(mydata$Exam_Score)
mean_hours_studied <- mean(mydata$Hours_Studied)
mean_sleep_hours <- mean(mydata$Sleep_Hours)

# Extract the coefficients from the model
coefficients <- coef(model)
intercept <- coefficients[1]
beta_hours_studied <- coefficients[2]
beta_sleep_hours <- coefficients[3]

# Compute the predicted Exam_Score at the means
predicted_mean_exam_score <- intercept + beta_hours_studied * mean_hours_studied + beta_sleep_hours * mean_sleep_hours

# Print results
mean_exam_score
## [1] 67.23566
predicted_mean_exam_score
## (Intercept) 
##    67.23566

Since mean_exam_score and predicted_mean_exam_score are equal (67.23566), the regression plane passes through the sample mean of the data.
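The same check can be written more compactly with predict(), which avoids extracting the coefficients by hand. The sketch below assumes the built-in mtcars data set as a stand-in for mydata.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One-row data frame holding the sample means of the regressors
x_means <- data.frame(wt = mean(mtcars$wt), hp = mean(mtcars$hp))

# Prediction at the regressor means versus the mean of the response
pred_at_means <- unname(predict(fit, newdata = x_means))
all.equal(pred_at_means, mean(mtcars$mpg))
```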

6. Verify that the fitted values and residuals are invariant to a non-singular linear transformation.

“Invariance” means that the fitted values and residuals are identical when the X variables undergo a non-singular linear transformation.

# Fit the original model
model_original <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = mydata)

# Extract the original fitted values and residuals
fitted_values_original <- fitted(model_original)
residuals_original <- resid(model_original)

# Apply a non-singular linear transformation
# Example transformation: X1' = 2 * Hours_Studied + 5; X2' = 3 * Sleep_Hours + 4
mydata$Transformed_Hours_Studied <- 2 * mydata$Hours_Studied + 5
mydata$Transformed_Sleep_Hours <- 3 * mydata$Sleep_Hours + 4

# Fit the model with transformed variables
model_transformed <- lm(Exam_Score ~ Transformed_Hours_Studied + Transformed_Sleep_Hours, data = mydata)

# Extract the fitted values and residuals from the transformed model
fitted_values_transformed <- fitted(model_transformed)
residuals_transformed <- resid(model_transformed)

# Compare the fitted values and residuals
all.equal(fitted_values_original, fitted_values_transformed)
## [1] TRUE
all.equal(residuals_original, residuals_transformed)
## [1] TRUE

The “all.equal()” function in the above code compares the fitted values and residuals of the original model with those of the model with transformed variables, up to a small numerical tolerance. Both comparisons return TRUE, confirming that the fitted values and residuals are invariant to a non-singular linear transformation.
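The transformation above rescales and shifts each regressor separately. The invariance also holds for a general transformation Z = XA that mixes the columns, as long as A is non-singular; the coefficients then transform as A⁻¹β. The sketch below assumes the built-in mtcars data set as a stand-in for mydata, and the matrix A is an arbitrary illustrative choice.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)  # columns: intercept, wt, hp

# Arbitrary non-singular 3x3 matrix (filled column by column); det(A) = 2
A <- matrix(c( 1, 0, 0,
               2, 3, 1,
              -1, 4, 2), nrow = 3)
stopifnot(abs(det(A)) > 1e-8)

# Transformed regressors: every column of Z is a linear combination of the columns of X
Z <- X %*% A
fit_z <- lm(mtcars$mpg ~ Z - 1)  # Z already spans the constant, so no extra intercept

# Fitted values and residuals are unchanged
all.equal(unname(fitted(fit)), unname(fitted(fit_z)))
all.equal(unname(resid(fit)), unname(resid(fit_z)))

# Coefficients are not invariant: they equal solve(A) %*% beta
all.equal(as.numeric(coef(fit_z)), as.numeric(solve(A, coef(fit))))
```

The reason is that X and XA span the same column space, so the projection of y onto that space, and hence the fitted values and residuals, cannot change.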

7. Use the Frisch-Waugh-Lovell theorem and partial regression to get the LS estimate for just one of the β.

  i. Regressing x2 on x1 and the constant, saving the residuals.
# Regress Sleep_Hours on Hours_Studied to get the residuals
model_x2x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2x1 <- resid(model_x2x1)  # Residuals of Sleep_Hours ~ Hours_Studied

# Regress Exam_Score on Hours_Studied to get the residuals
model_yx2 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_yx2 <- resid(model_yx2)  # Residuals of Exam_Score ~ Hours_Studied

# Regress the Exam_Score residuals on the Sleep_Hours residuals
model_fwl <- lm(residuals_yx2 ~ residuals_x2x1)

# The coefficient of residuals_x2x1 is the LS estimate of β2
beta_2_fwl <- summary(model_fwl)$coefficients[2, 1]  

# Print the estimated β2
beta_2_fwl
## [1] -0.05807087

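The β2 estimate (-0.05807087) is identical to the Sleep_Hours coefficient in the original multiple regression. The procedure makes explicit that this coefficient is the partial effect of Sleep_Hours after the influence of Hours_Studied has been removed from both Sleep_Hours and Exam_Score.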

  ii. Regressing y on x1 and the constant, saving the residuals.
# Fit the model regressing Exam_Score on Hours_Studied
model_y_on_x1 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_y <- model_y_on_x1$residuals # Residuals of Exam_Score ~ Hours_Studied

# Fit the model regressing Sleep_Hours on Hours_Studied
model_x2_on_x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2 <- model_x2_on_x1$residuals # Residuals of Sleep_Hours ~ Hours_Studied

# Regress the residuals of Exam_Score on the residuals of Sleep_Hours
model_final <- lm(residuals_y ~ residuals_x2)

# Get the estimate for beta_2 (coefficient of the residuals of Sleep_Hours)
beta_2_estimate <- summary(model_final)$coefficients[2, 1]  

# Print the estimated β2
beta_2_estimate
## [1] -0.05807087

This method yields the same β2 estimate as the original regression model (lm(Exam_Score ~ Hours_Studied + Sleep_Hours)), but applies the FWL theorem to partial out the effect of the other independent variable (Hours_Studied).

  iii. Regressing the residuals from (ii) onto (i), without a constant.
# Fit the model regressing Exam_Score on Hours_Studied
model_y_on_x1 <- lm(Exam_Score ~ Hours_Studied, data = mydata)
residuals_y <- model_y_on_x1$residuals # Residuals of Exam_Score ~ Hours_Studied

# Fit the model regressing Sleep_Hours on Hours_Studied
model_x2_on_x1 <- lm(Sleep_Hours ~ Hours_Studied, data = mydata)
residuals_x2 <- model_x2_on_x1$residuals # Residuals of Sleep_Hours ~ Hours_Studied

# Regress the residuals of Exam_Score on the residuals of Sleep_Hours without a constant
model_final_no_const <- lm(residuals_y ~ residuals_x2 - 1)  # The '- 1' removes the intercept

# Get the estimate for beta_2 (coefficient of the residuals of Sleep_Hours)
beta_2_estimate_no_const <- summary(model_final_no_const)$coefficients[1, 1]  # Coefficient of residuals_x2

# Print the estimate of beta_2
beta_2_estimate_no_const
## [1] -0.05807087

This method gives the same β2 estimate (-0.05807087). It isolates the contribution of one independent variable (Sleep_Hours) to the dependent variable (Exam_Score) after controlling for the other independent variable (Hours_Studied). No constant is needed in the final regression because both sets of residuals already have mean zero.
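The FWL theorem makes a second claim that is easy to verify: the residuals of the final partial regression coincide with the residuals of the full multiple regression. The sketch below checks both claims, computing the slope directly as a ratio of sums instead of calling lm(). It assumes the built-in mtcars data set as a stand-in for mydata, with mpg, wt and hp in the roles of y, x1 and x2.

```r
# Illustrative stand-in data: mtcars replaces mydata (an assumption)
fit_full <- lm(mpg ~ wt + hp, data = mtcars)

# Partial out x1 (wt) and the constant from both y (mpg) and x2 (hp)
e_y  <- resid(lm(mpg ~ wt, data = mtcars))
e_x2 <- resid(lm(hp ~ wt, data = mtcars))

# Slope of a no-constant regression of e_y on e_x2: sum(e_x2 * e_y) / sum(e_x2^2)
beta_2 <- sum(e_x2 * e_y) / sum(e_x2^2)
all.equal(beta_2, unname(coef(fit_full)["hp"]))

# The residuals of the partial regression match those of the full model
fit_fwl <- lm(e_y ~ e_x2 - 1)
all.equal(unname(resid(fit_fwl)), unname(resid(fit_full)))
```

The standard errors reported by summary() for the partial regression differ slightly from those of the full model, because the partial regression does not account for the degrees of freedom used in the first-stage regressions.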