Multiple linear regression is one of the statistical inference methods used to model the linear relationship between one response (dependent) variable and two or more predictor (independent) variables. This method is widely used in various fields, including economics, to explain how an economic variable is influenced by several factors simultaneously, as well as to perform prediction.
In general, the multiple linear regression model with \(p\) predictor variables is written as:
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i, \quad i = 1, 2, \dots, n\]
where:
In matrix notation, the model above can be written as:
\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where \(\mathbf{Y}\) has dimension \(n \times 1\), \(\mathbf{X}\) has dimension \(n \times (p+1)\) (the design matrix containing a column of 1’s for the intercept and \(p\) predictor variables), \(\boldsymbol{\beta}\) has dimension \((p+1) \times 1\), and \(\boldsymbol{\varepsilon}\) has dimension \(n \times 1\).
In this analysis, the parameter \(\boldsymbol{\beta}\) will be estimated
manually using basic matrix operations in R (without
using lm() or additional packages such as car,
lmtest, or MASS), using two approaches:
Ordinary Least Squares (OLS) and Maximum
Likelihood Estimation (MLE). All functions used are base R
functions such as solve(), t(),
%*%, optim(), and functions from the built-in
stats package.
The classical linear regression model requires several assumptions for the resulting estimators to be BLUE (Best Linear Unbiased Estimator):
The principle of OLS is to minimize the sum of squared errors (SSE):
\[SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}^T \boldsymbol{\varepsilon} = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\]
By differentiating \(SSE\) with respect to \(\boldsymbol{\beta}\) and setting it equal to zero, we obtain the normal equations:
\[\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{Y}\]
so that the OLS estimator is:
\[\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\]
The estimated error variance:
\[\hat{\sigma}^2_{OLS} = \frac{SSE}{n-p-1} = \frac{(\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}})}{n-p-1}\]
with the variance-covariance matrix of the coefficient estimates:
\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}\]
The MLE approach assumes the error is normally distributed, \(\boldsymbol{\varepsilon} \sim N(0, \sigma^2 \mathbf{I})\), so that \(\mathbf{Y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})\). The likelihood function is:
\[L(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{Y}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2}{2\sigma^2}\right)\]
Its log-likelihood function is:
\[\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})\]
Maximizing \(\ell\) with respect to \(\boldsymbol{\beta}\) is equivalent to minimizing \((\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})\), so that:
\[\hat{\boldsymbol{\beta}}_{MLE} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \hat{\boldsymbol{\beta}}_{OLS}\]
However, the MLE estimator of the error variance differs from OLS, since it is derived from \(\partial \ell / \partial \sigma^2 = 0\):
\[\hat{\sigma}^2_{MLE} = \frac{SSE}{n} \neq \hat{\sigma}^2_{OLS} = \frac{SSE}{n-p-1}\]
The estimator \(\hat{\sigma}^2_{MLE}\) is biased (it
underestimates the true variance) because it does not account for the
degrees of freedom lost due to estimating \(\hat{\boldsymbol{\beta}}\), whereas \(\hat{\sigma}^2_{OLS}\) is unbiased. In the
practical section below, \(\hat{\boldsymbol{\beta}}_{MLE}\) will be
obtained in two ways: (1) analytically through the closed-form matrix
formula above, and (2) numerically by optimizing the log-likelihood
function using optim(), as proof that the two are
consistent.
The data used is simulated data set in a household economics context, consisting of one response variable and three predictor variables:
| Variable | Description | Unit |
|---|---|---|
| \(Y\) | Monthly Household Consumption Expenditure | Million IDR |
| \(X_1\) | Monthly Household Income | Million IDR |
| \(X_2\) | Consumer Lending Interest Rate | Percent (%) |
| \(X_3\) | Inflation Rate (Consumer Price Index, YoY) | Percent (%) |
Based on economic theory, income (\(X_1\)) is expected to have a positive effect on consumption, while the interest rate (\(X_2\)) and inflation (\(X_3\)) are expected to have a negative effect on real household consumption.
set.seed(2024)
n <- 100
# Predictor variables
X1 <- round(rnorm(n, mean = 8, sd = 2), 2) # Income (million IDR), mean 8 million
X2 <- round(runif(n, min = 4, max = 12), 2) # Lending interest rate (%), 4%-12%
X3 <- round(runif(n, min = 1, max = 8), 2) # Inflation rate (%), 1%-8%
# Population (true) parameters set by the researcher
beta0_true <- 1.5
beta1_true <- 0.65 # income up by 1 million -> consumption up by 0.65 million
beta2_true <- -0.18 # interest rate up by 1% -> consumption down by 0.18 million
beta3_true <- -0.12 # inflation up by 1% -> consumption down by 0.12 million
# Random error, normally distributed
epsilon <- rnorm(n, mean = 0, sd = 0.6)
# Response variable
Y <- beta0_true + beta1_true * X1 + beta2_true * X2 + beta3_true * X3 + epsilon
Y <- round(Y, 2)
data <- data.frame(Consumption = Y,
Income = X1,
InterestRate = X2,
Inflation = X3)
head(data, 10)## Consumption Income InterestRate Inflation
## 1 5.61 9.96 11.89 1.89
## 2 6.18 8.94 4.24 5.42
## 3 3.88 7.78 9.36 4.38
## 4 4.94 7.57 4.96 5.92
## 5 5.92 10.32 10.81 7.94
## 6 6.80 10.58 4.30 1.64
## 7 4.56 9.07 8.40 7.25
## 8 4.69 7.75 7.25 5.21
## 9 4.46 5.55 4.82 2.55
## 10 2.75 5.76 10.70 6.86
## Consumption Income InterestRate Inflation
## Min. :0.410 Min. : 1.450 Min. : 4.030 Min. :1.010
## 1st Qu.:3.732 1st Qu.: 6.560 1st Qu.: 5.492 1st Qu.:3.393
## Median :4.655 Median : 7.965 Median : 7.705 Median :5.260
## Mean :4.632 Mean : 7.830 Mean : 7.857 Mean :4.928
## 3rd Qu.:5.612 3rd Qu.: 9.348 3rd Qu.:10.193 3rd Qu.:6.763
## Max. :8.070 Max. :11.950 Max. :11.900 Max. :7.950
## Consumption Income InterestRate Inflation
## 1.479771 2.044882 2.475760 2.051868
Multicollinearity occurs when predictor variables are highly correlated with each other, which can make coefficient estimates unstable. One commonly used indicator is the Variance Inflation Factor (VIF):
\[VIF_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is the coefficient of determination from regressing \(X_j\) on all the other predictors. As a general rule of thumb, \(VIF_j > 10\) (or in some references \(> 5\)) indicates serious multicollinearity.
## Income InterestRate Inflation
## Income 1.00000000 -0.07310745 0.07372015
## InterestRate -0.07310745 1.00000000 0.04284133
## Inflation 0.07372015 0.04284133 1.00000000
# Manual VIF function, no package required
calculate_VIF <- function(X) {
p <- ncol(X)
vif_values <- numeric(p)
names(vif_values) <- colnames(X)
for (j in 1:p) {
y_aux <- X[, j]
X_aux <- cbind(1, X[, -j]) # other predictors + intercept
beta_aux <- solve(t(X_aux) %*% X_aux) %*% t(X_aux) %*% y_aux
y_fit <- X_aux %*% beta_aux
SSR <- sum((y_fit - mean(y_aux))^2)
SST <- sum((y_aux - mean(y_aux))^2)
R2_aux <- SSR / SST
vif_values[j] <- 1 / (1 - R2_aux)
}
return(vif_values)
}
predictor_matrix <- as.matrix(data[, c("Income", "InterestRate", "Inflation")])
VIF_result <- calculate_VIF(predictor_matrix)
print(round(VIF_result, 3))## Income InterestRate Inflation
## 1.011 1.008 1.008
Interpretation: If all VIF values are below 10 (ideally close to 1), it can be concluded that there is no indication of serious multicollinearity among the predictor variables, so the regression coefficient estimates can be considered stable.
The linearity test is performed to ensure that the relationship between each predictor variable and the response variable is linear, which is a key requirement for using a linear regression model. The check is done visually through scatter plots equipped with a simple regression line obtained manually.
# Manual simple regression function for the trend line in the scatter plots
simple_regression <- function(x, y) {
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
return(c(intercept = b0, slope = b1))
}par(mfrow = c(1, 3))
for (var in c("Income", "InterestRate", "Inflation")) {
x <- data[[var]]
y <- data$Consumption
coef <- simple_regression(x, y)
r <- cor(x, y)
plot(x, y, pch = 19, col = "steelblue4",
xlab = var, ylab = "Consumption",
main = paste0(var, " vs Consumption\n(r = ", round(r, 3), ")"))
abline(a = coef["intercept"], b = coef["slope"], col = "firebrick", lwd = 2)
}Interpretation: A pattern of points in each scatter plot that follows a straight line (either with a positive or negative slope) indicates that the assumption of a linear relationship between the predictor variable and the response variable is satisfied.
# OLS beta estimator: (X'X)^-1 X'Y
XtX <- t(X_mat) %*% X_mat
XtY <- t(X_mat) %*% Y_vec
beta_OLS <- solve(XtX) %*% XtY
# Fitted values and residuals
Y_hat <- X_mat %*% beta_OLS
e_OLS <- Y_vec - Y_hat
# Unbiased estimate of error variance (OLS)
SSE <- sum(e_OLS^2)
sigma2_OLS <- SSE / (n_obs - p_par - 1)
# Variance-covariance matrix of beta and standard errors
varcov_beta <- as.numeric(sigma2_OLS) * solve(XtX)
se_beta <- sqrt(diag(varcov_beta))
# t-statistic and p-value for each coefficient
t_stat <- beta_OLS / se_beta
p_value <- 2 * pt(abs(t_stat), df = n_obs - p_par - 1, lower.tail = FALSE)
table_OLS <- data.frame(
Coefficient = round(beta_OLS, 5),
SE = round(se_beta, 5),
t_value = round(t_stat, 4),
p_value = round(p_value, 5)
)
print(table_OLS)## Coefficient SE t_value p_value
## Intercept 1.54192 0.30394 5.0731 0.0000
## Income 0.62602 0.02664 23.4979 0.0000
## InterestRate -0.17235 0.02197 -7.8467 0.0000
## Inflation -0.09288 0.02650 -3.5044 0.0007
SST <- sum((Y_vec - mean(Y_vec))^2)
SSR <- SST - SSE
R2 <- SSR / SST
R2_adj <- 1 - (1 - R2) * (n_obs - 1) / (n_obs - p_par - 1)
cat("SSE :", round(SSE, 4), "\n")## SSE : 27.8895
## R-squared : 0.87135
## Adjusted R-squared : 0.86733
## Sigma^2 (OLS) : 0.29052
According to the theoretical derivation, \(\hat{\boldsymbol{\beta}}_{MLE}\) is identical to \(\hat{\boldsymbol{\beta}}_{OLS}\), while \(\hat{\sigma}^2_{MLE} = SSE/n\).
beta_MLE_analytic <- beta_OLS # algebraically identical
sigma2_MLE_analytic <- SSE / n_obs # divisor is n, not n-p-1
cat("Sigma^2 (MLE, analytic):", round(sigma2_MLE_analytic, 5), "\n")## Sigma^2 (MLE, analytic): 0.27889
As verification, direct optimization of the negative log-likelihood
function is performed using optim() (a base R function),
without using the closed-form OLS formula.
neg_log_likelihood <- function(par, X, Y) {
k <- ncol(X)
beta <- par[1:k]
sigma2 <- par[k + 1]
if (sigma2 <= 0) return(1e10) # constraint to keep variance positive
n <- length(Y)
resid <- Y - X %*% beta
ll <- -(n/2) * log(2 * pi) - (n/2) * log(sigma2) - (1 / (2 * sigma2)) * sum(resid^2)
return(-ll) # multiplied by -1 because optim() performs minimization
}
# Starting values
start_par <- c(rep(0, ncol(X_mat)), var(Y_vec))
optim_result <- optim(par = start_par,
fn = neg_log_likelihood,
X = X_mat, Y = Y_vec,
method = "BFGS",
hessian = TRUE)
beta_MLE_numeric <- optim_result$par[1:ncol(X_mat)]
sigma2_MLE_numeric <- optim_result$par[ncol(X_mat) + 1]
names(beta_MLE_numeric) <- colnames(X_mat)
cat("Optim convergence (0 = successful):", optim_result$convergence, "\n\n")## Optim convergence (0 = successful): 0
## Beta coefficients (MLE - numeric):
## Intercept Income InterestRate Inflation
## 1.54192 0.62602 -0.17235 -0.09288
##
## Sigma^2 (MLE - numeric): 0.2789
comparison_table <- data.frame(
Parameter = colnames(X_mat),
OLS = round(as.numeric(beta_OLS), 5),
MLE_Analytic = round(as.numeric(beta_MLE_analytic), 5),
MLE_Numeric = round(as.numeric(beta_MLE_numeric), 5)
)
print(comparison_table)## Parameter OLS MLE_Analytic MLE_Numeric
## 1 Intercept 1.54192 1.54192 1.54192
## 2 Income 0.62602 0.62602 0.62602
## 3 InterestRate -0.17235 -0.17235 -0.17235
## 4 Inflation -0.09288 -0.09288 -0.09288
##
## Comparison of error variance estimates:
## Sigma^2 OLS (divisor n-p-1): 0.29052
## Sigma^2 MLE (divisor n) : 0.27889
Interpretation: All three approaches produce (nearly) identical estimates of the regression coefficients \(\boldsymbol{\beta}\), which empirically confirms that the OLS and MLE estimators for the regression coefficients are equivalent. The only difference occurs in the estimated error variance \(\hat\sigma^2\), where MLE produces a slightly smaller (biased) value compared to OLS, due to the difference in the degrees-of-freedom divisor.
The residuals used in all of the following assumption tests are the OLS model residuals (\(\hat{e} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{OLS}\)).
The normality assumption of the residuals is required for valid inference (t-test, F-test) on the regression coefficients. The check is conducted through:
\[JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right) \sim \chi^2_{(2)}\]
shapiro.test() function from the (built-in)
stats package, as a comparison.par(mfrow = c(1, 2))
hist(residuals_vec, breaks = 12, col = "lightblue", border = "white",
main = "Histogram of Residuals", xlab = "Residual", freq = FALSE)
curve(dnorm(x, mean(residuals_vec), sd(residuals_vec)), add = TRUE, col = "firebrick", lwd = 2)
qqnorm(residuals_vec, main = "QQ-Plot of Residuals", pch = 19, col = "steelblue4")
qqline(residuals_vec, col = "firebrick", lwd = 2)jarque_bera_test <- function(e) {
n <- length(e)
m2 <- mean((e - mean(e))^2)
m3 <- mean((e - mean(e))^3)
m4 <- mean((e - mean(e))^4)
S <- m3 / (m2^(3/2)) # skewness
K <- m4 / (m2^2) # kurtosis
JB <- (n / 6) * (S^2 + ((K - 3)^2) / 4)
p_val <- pchisq(JB, df = 2, lower.tail = FALSE)
return(list(skewness = S, kurtosis = K, JB_stat = JB, p_value = p_val))
}
JB_result <- jarque_bera_test(residuals_vec)
cat("Skewness :", round(JB_result$skewness, 4), "\n")## Skewness : -0.1951
## Kurtosis : 3.7901
## JB statistic : 3.2357
## p-value : 0.1983
##
## Shapiro-Wilk normality test
##
## data: residuals_vec
## W = 0.98921, p-value = 0.6011
Hypotheses: \(H_0\): the errors are normally distributed vs \(H_1\): the errors are not normally distributed. If p-value > \(\alpha\) (commonly 0.05), \(H_0\) fails to be rejected, so the normality assumption of the residuals is satisfied.
The homoscedasticity assumption requires the error variance to be constant across all values of the predictors. Violation of this assumption (heteroscedasticity) causes the OLS estimator to remain unbiased but no longer efficient.
plot(fitted_values, residuals_vec, pch = 19, col = "steelblue4",
xlab = "Fitted Value", ylab = "Residual",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "firebrick", lwd = 2, lty = 2)Breusch-Pagan Test (manual). This is performed by regressing the squared residuals on all predictor variables, then computing the statistic:
\[LM = n \times R^2_{aux} \sim \chi^2_{(p)}\]
where \(R^2_{aux}\) is the coefficient of determination from the auxiliary regression of \(\hat{e}^2\) on \(\mathbf{X}\).
breusch_pagan_test <- function(X, e) {
n <- nrow(X)
e2 <- e^2
beta_aux <- solve(t(X) %*% X) %*% t(X) %*% e2
e2_hat <- X %*% beta_aux
SSR_aux <- sum((e2_hat - mean(e2))^2)
SST_aux <- sum((e2 - mean(e2))^2)
R2_aux <- SSR_aux / SST_aux
LM <- n * R2_aux
df_bp <- ncol(X) - 1 # number of predictors (excluding intercept)
p_val <- pchisq(LM, df = df_bp, lower.tail = FALSE)
return(list(LM_stat = LM, df = df_bp, p_value = p_val))
}
BP_result <- breusch_pagan_test(X_mat, residuals_vec)
cat("LM statistic (Breusch-Pagan):", round(BP_result$LM_stat, 4), "\n")## LM statistic (Breusch-Pagan): 5.3909
## Degrees of freedom : 3
## p-value : 0.1453
Hypotheses: \(H_0\): the error variance is homogeneous (homoscedastic) vs \(H_1\): the error variance is not homogeneous (heteroscedastic). If p-value > \(\alpha\), \(H_0\) fails to be rejected, so the homoscedasticity assumption is satisfied.
The no-autocorrelation assumption is especially important for data with an inherent ordering structure (e.g., time series data). A commonly used test is the Durbin-Watson test:
\[DW = \frac{\sum_{i=2}^{n}(\hat{e}_i - \hat{e}_{i-1})^2}{\sum_{i=1}^{n}\hat{e}_i^2}\]
The \(DW\) statistic ranges between 0 and 4, with a value close to 2 indicating no autocorrelation; a value close to 0 indicating positive autocorrelation; and a value close to 4 indicating negative autocorrelation.
durbin_watson_test <- function(e) {
n <- length(e)
diff_e <- diff(e) # e_t - e_(t-1)
DW <- sum(diff_e^2) / sum(e^2)
return(DW)
}
DW_stat <- durbin_watson_test(residuals_vec)
cat("Durbin-Watson statistic:", round(DW_stat, 4), "\n")## Durbin-Watson statistic: 1.9662
Interpretation: A \(DW\) value close to 2 (generally in the range of 1.5 - 2.5) indicates no meaningful autocorrelation in the residuals, so the assumption of error independence is satisfied. More precise critical bounds \(d_L\) and \(d_U\) can be referenced from the Durbin-Watson table according to \(n\) and the number of predictors.
summary_table <- data.frame(
Parameter = colnames(X_mat),
Coefficient_OLS = round(as.numeric(beta_OLS), 4),
SE = round(se_beta, 4),
t_value = round(as.numeric(t_stat), 4),
p_value = round(as.numeric(p_value), 4)
)
print(summary_table)## Parameter Coefficient_OLS SE t_value p_value
## Intercept Intercept 1.5419 0.3039 5.0731 0.0000
## Income Income 0.6260 0.0266 23.4979 0.0000
## InterestRate InterestRate -0.1724 0.0220 -7.8467 0.0000
## Inflation Inflation -0.0929 0.0265 -3.5044 0.0007
##
## R-squared : 0.8713
##
## Adjusted R-squared : 0.8673
##
## Sigma^2 (OLS) : 0.2905
##
## Durbin-Watson statistic: 1.9662
The resulting multiple linear regression model is:
\[\hat{Y} = \hat\beta_0 + \hat\beta_1 \, X_1 + \hat\beta_2 \, X_2 + \hat\beta_3 \, X_3\]
where \(\hat\beta_0, \hat\beta_1, \hat\beta_2, \hat\beta_3\) are as listed in the summary table above (Income, Interest Rate, and Inflation as predictors of Household Consumption).