1) I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data Download type of economic data, and the variable definitions). You can use any in-built R dataset too - just type “data()” in R to see the available options ( Section 3Links to an external site. may be helpful), or can use AER packageLinks to an external site. to find more relevant Economics datasets
Type of Data: The USArrests dataset is a built-in dataset in R, consisting of statistics on violent crime rates in various states of the United States in 1973.
Variables:
Murder: Numeric. Murder arrests (per 100,000 residents). Assault: Numeric. Assault arrests (per 100,000 residents). UrbanPop: Numeric. Urban population percentage. Rape: Numeric. Rape arrests (per 100,000 residents). Usage: This dataset is commonly used in statistics and data analysis courses for demonstrating various analytical techniques, such as clustering, regression analysis, and data visualization.
Example Analysis: You can explore and analyze this dataset using various statistical methods and R functions to understand relationships between variables or to draw insights into crime patterns across different states in the United States.
# Load the USArrests dataset
data(USArrests)
2) Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable). So for example, income may be my dependent variable, while age, experience, … may be my independent variables. TYPE OUT THE FULL ESTIMATING EQUATION
Let:
Y = Murder rate (per 100,000 residents)
𝑋1= Assault rate (per 100,000 residents)
𝑋2= Urban population percentage
The multivariate regression model can be expressed a
Yi = β0 + β1X1i + β2X2i + ϵi
Yi is the Murder rate for the i-th state, X1i is the Assualt rate for the i-th state, X2i is the Urban population percentage for the i-th state, β0 is the intercept (constant term), β1 is the coefficient for Assault rate, β2 is the coefficient for Urban population percentage, ϵi is the error term for the i-th observation.
Estimation Equation
Murderi = β0 + β1.Assualt i+ β2urbanPopi +ϵi
3) Now, first estimate the multivariate regression (linear relationship) above with the “lm” command
# Fit the multivariate regression model
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)
summary(model)
##
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5530 -1.7093 -0.3677 1.2284 7.5985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.207153 1.740790 1.842 0.0717 .
## Assault 0.043910 0.004579 9.590 1.22e-12 ***
## UrbanPop -0.044510 0.026363 -1.688 0.0980 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6491
## F-statistic: 46.32 on 2 and 47 DF, p-value: 7.704e-12
4) Second, confirm if you get the same point estimates/coefficients/betas with the matrix algebra formula that we discussed in class - inv(X’X) X’y You should get the same answer. If not, check if you specified the intercept term in your matrix X by adding a unit vector or not
# Dependent variable
y <- as.vector(USArrests$Murder)
# Matrix of feature variables from USArrests
X <- as.matrix(USArrests[, -1]) # Exclude the first column (State names)
# Vector of ones with the same length as rows in USArrests
int <- rep(1, times = nrow(USArrests))
# Add intercept column to X
X <- cbind(int, X)
rm(int) # Remove the temporary variable int
# Implement closed-form solution for beta estimates
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas <- round(betas, digits = 2) # Round for easier viewing
print("Estimated Betas using Matrix Algebra:")
## [1] "Estimated Betas using Matrix Algebra:"
print(betas)
## [,1]
## int 3.28
## Assault 0.04
## UrbanPop -0.05
## Rape 0.06
# Alternatively, compute betas more efficiently
betas_check <- solve(crossprod(X), crossprod(X, y))
betas_check <- round(betas_check, digits = 2)
# Check if both methods produce the same results
betas_equal <- all.equal(betas, betas_check)
print("Are betas equal between methods?")
## [1] "Are betas equal between methods?"
print(betas_equal)
## [1] TRUE
# Fit linear regression model using lm() function
lm.mod <- lm(Murder ~ ., data = USArrests)
lm.betas <- round(lm.mod$coefficients, digits = 2)
# Display results from both methods
results <- data.frame(our_results = betas, lm_results = lm.betas)
print("Comparison of Results:")
## [1] "Comparison of Results:"
print(results)
## our_results lm_results
## int 3.28 3.28
## Assault 0.04 0.04
## UrbanPop -0.05 -0.05
## Rape 0.06 0.06
# Print out summary of lm model
print("Summary of Linear Model:")
## [1] "Summary of Linear Model:"
summary(lm.mod)
##
## Call:
## lm(formula = Murder ~ ., data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3990 -1.9127 -0.3444 1.2557 7.4279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.276639 1.737997 1.885 0.0657 .
## Assault 0.039777 0.005912 6.729 2.33e-08 ***
## UrbanPop -0.054694 0.027880 -1.962 0.0559 .
## Rape 0.061399 0.055740 1.102 0.2764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.574 on 46 degrees of freedom
## Multiple R-squared: 0.6721, Adjusted R-squared: 0.6507
## F-statistic: 31.42 on 3 and 46 DF, p-value: 3.322e-11
5) Can you also estimate the standard error by hand as you did for point estimates using matrix algebra? Explain your logic as well
# Calculate correlation matrix
corr_matrix <- cor(X[, -1], use = "complete.obs", method = "pearson")
# Calculate residual variance and related statistics
N <- nrow(X) # Number of observations
k <- ncol(X) - 1 # Number of predictor variables (excluding intercept)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # Estimated regression coefficients
# Residuals and residual variance
residuals <- y - X %*% beta_hat
residual_variance <- sum(residuals^2) / (N - k - 1)
residual_std_error <- sqrt(residual_variance)
# Variance and standard errors of estimated coefficients
var_betaHat <- residual_variance * solve(t(X) %*% X)
coeff_std_errors <- sqrt(diag(var_betaHat))
# t-values and p-values of estimated coefficients
t_values <- beta_hat / coeff_std_errors
p_values_tstat <- 2 * pt(q = abs(t_values), df = N - k - 1, lower.tail = FALSE)
# Assigning significance codes to p-values
signif_codes_match <- function(x) {
ifelse(x <= 0.001, "***",
ifelse(x <= 0.01, "**",
ifelse(x < 0.05, "*", ".")
)
)
}
signif_codes <- sapply(p_values_tstat, signif_codes_match)
# Calculate R-squared and Adjusted R-squared
R_sq <- 1 - (residual_variance * (N - 1)) / sum((y - mean(y))^2)
R_sq_adj <- 1 - (residual_variance * (N - 1)) / ((N - k - 1) * mean((y - mean(y))^2))
# Residual sum of squares (RSS) for the full model
RSS <- sum(residuals^2)
# Total sum of squares (TSS) for the full model
TSS <- sum((y - mean(y))^2)
# F-statistic and p-value of the F-statistic
F_stat <- ( (TSS - RSS) / k ) / ( RSS / (N - k - 1) )
p_value_F_stat <- pf(q = F_stat, df1 = k, df2 = N - k - 1, lower.tail = FALSE)
# Combine results into a dataframe
lm_results <- data.frame(
Estimate = beta_hat,
Std_Error = coeff_std_errors,
t_value = t_values,
Pr_t = p_values_tstat,
Signif = signif_codes
)
rownames(lm_results) <- colnames(X)
print("Linear Model Results:")
## [1] "Linear Model Results:"
print(lm_results)
## Estimate Std_Error t_value Pr_t Signif
## int 3.27663918 1.737997161 1.885296 6.571517e-02 .
## Assault 0.03977717 0.005911667 6.728587 2.328851e-08 ***
## UrbanPop -0.05469363 0.027880242 -1.961734 5.586128e-02 .
## Rape 0.06139942 0.055740249 1.101528 2.763975e-01 .
print("R-squared:")
## [1] "R-squared:"
print(R_sq)
## [1] 0.6506786
print("Adjusted R-squared:")
## [1] "Adjusted R-squared:"
print(R_sq_adj)
## [1] 0.6203028
print("F-statistic:")
## [1] "F-statistic:"
print(F_stat)
## [1] 31.42399
print("P-value of F-statistic:")
## [1] "P-value of F-statistic:"
print(p_value_F_stat)
## [1] 3.322423e-11
Certainly! The code provided performs several calculations and
outputs related to a linear regression model fitted on the
USArrests dataset in R. Let’s break down each part and
explain the logic:
N <- nrow(X): Calculates the number of observations
(N) from the rows of matrix X.k <- ncol(X) - 1: Calculates the number of predictor
variables (k) excluding the intercept.beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y:
Computes the estimated regression coefficients using matrix
algebra.residuals <- y - X %*% beta_hat: Computes the
residuals (difference between observed and predicted values).residual_variance <- sum(residuals^2) / (N - k - 1):
Calculates the residual variance, which estimates the variance of the
error term in the regression model.var_betaHat <- residual_variance * solve(t(X) %*% X):
Calculates the variance-covariance matrix of the estimated
coefficients.coeff_std_errors <- sqrt(diag(var_betaHat)):
Computes the standard errors of the estimated coefficients by taking the
square root of the diagonal elements of var_betaHat.t_values <- beta_hat / coeff_std_errors: Computes
the t-values for each coefficient estimate.p_values_tstat <- 2 * pt(q = abs(t_values), df = N - k - 1, lower.tail = FALSE):
Calculates the two-sided p-values for each coefficient using the
t-distribution.signif_codes_match function assigns significance codes
(“”, ””, ””, “.”) based on the p-values.signif_codes <- sapply(p_values_tstat, signif_codes_match):
Applies the signif_codes_match function to each p-value to
determine significance codes.R_sq <- 1 - (residual_variance * (N - 1)) / sum((y - mean(y))^2):
Calculates the coefficient of determination (R-squared).R_sq_adj <- 1 - (residual_variance * (N - 1)) / ((N - k - 1) * mean((y - mean(y))^2)):
Computes the adjusted R-squared, which adjusts for the number of
predictors in the model.RSS <- sum(residuals^2): Computes the residual sum
of squares (RSS).TSS <- sum((y - mean(y))^2): Calculates the total
sum of squares (TSS).F_stat <- ( (TSS - RSS) / k ) / ( RSS / (N - k - 1) ):
Computes the F-statistic, which tests the overall significance of the
regression model.p_value_F_stat <- pf(q = F_stat, df1 = k, df2 = N - k - 1, lower.tail = FALSE):
Calculates the p-value associated with the F-statistic using the
F-distribution.lm_results: Combines the estimated coefficients,
standard errors, t-values, p-values, and significance codes into a
dataframe.lm_results, R-squared,
Adjusted R-squared, F-statistic, and its p-value) to summarize the
regression model’s results.solve function) to compute the regression
coefficients, variance-covariance matrix, and subsequently, standard
errors.This approach provides a comprehensive overview of the linear regression model’s results, including point estimates, standard errors, significance tests for coefficients, and overall model fit statistics, ensuring a thorough understanding of the model’s performance.