Week_2

1) I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data Download type of economic data, and the variable definitions). You can use any in-built R dataset too - just type “data()” in R to see the available options ( Section 3Links to an external site. may be helpful), or can use AER packageLinks to an external site. to find more relevant Economics datasets

Type of Data: The USArrests dataset is a built-in dataset in R, consisting of statistics on violent crime rates in various states of the United States in 1973.

Variables:

Murder: Numeric. Murder arrests (per 100,000 residents). Assault: Numeric. Assault arrests (per 100,000 residents). UrbanPop: Numeric. Urban population percentage. Rape: Numeric. Rape arrests (per 100,000 residents). Usage: This dataset is commonly used in statistics and data analysis courses for demonstrating various analytical techniques, such as clustering, regression analysis, and data visualization.

Example Analysis: You can explore and analyze this dataset using various statistical methods and R functions to understand relationships between variables or to draw insights into crime patterns across different states in the United States.

# Load the USArrests dataset
data(USArrests)

2) Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable). So for example, income may be my dependent variable, while age, experience, … may be my independent variables. TYPE OUT THE FULL ESTIMATING EQUATION

Let:

Y = Murder rate (per 100,000 residents)

𝑋1= Assault rate (per 100,000 residents)

𝑋2= Urban population percentage

The multivariate regression model can be expressed a

Yi = β0 + β1X1i + β2X2i + ϵi

Yi is the Murder rate for the i-th state, X1i is the Assualt rate for the i-th state, X2i is the Urban population percentage for the i-th state, β0 is the intercept (constant term), β1 is the coefficient for Assault rate, β2 is the coefficient for Urban population percentage, ϵi is the error term for the i-th observation.

Estimation Equation

Murderi = β0 + β1.Assualt i+ β2urbanPopi +ϵi

3) Now, first estimate the multivariate regression (linear relationship) above with the “lm” command

# Fit the multivariate regression model
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)
summary(model)

## 
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5530 -1.7093 -0.3677  1.2284  7.5985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.207153   1.740790   1.842   0.0717 .  
## Assault      0.043910   0.004579   9.590 1.22e-12 ***
## UrbanPop    -0.044510   0.026363  -1.688   0.0980 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6491 
## F-statistic: 46.32 on 2 and 47 DF,  p-value: 7.704e-12

4) Second, confirm if you get the same point estimates/coefficients/betas with the matrix algebra formula that we discussed in class - inv(X’X) X’y You should get the same answer. If not, check if you specified the intercept term in your matrix X by adding a unit vector or not

# Dependent variable
y <- as.vector(USArrests$Murder)

# Matrix of feature variables from USArrests
X <- as.matrix(USArrests[, -1])  # Exclude the first column (State names)

# Vector of ones with the same length as rows in USArrests
int <- rep(1, times = nrow(USArrests))

# Add intercept column to X
X <- cbind(int, X)
rm(int)  # Remove the temporary variable int

# Implement closed-form solution for beta estimates
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas <- round(betas, digits = 2)  # Round for easier viewing
print("Estimated Betas using Matrix Algebra:")

## [1] "Estimated Betas using Matrix Algebra:"

print(betas)

##           [,1]
## int       3.28
## Assault   0.04
## UrbanPop -0.05
## Rape      0.06

# Alternatively, compute betas more efficiently
betas_check <- solve(crossprod(X), crossprod(X, y))
betas_check <- round(betas_check, digits = 2)

# Check if both methods produce the same results
betas_equal <- all.equal(betas, betas_check)
print("Are betas equal between methods?")

## [1] "Are betas equal between methods?"

print(betas_equal)

## [1] TRUE

# Fit linear regression model using lm() function
lm.mod <- lm(Murder ~ ., data = USArrests)
lm.betas <- round(lm.mod$coefficients, digits = 2)

# Display results from both methods
results <- data.frame(our_results = betas, lm_results = lm.betas)
print("Comparison of Results:")

## [1] "Comparison of Results:"

print(results)

##          our_results lm_results
## int             3.28       3.28
## Assault         0.04       0.04
## UrbanPop       -0.05      -0.05
## Rape            0.06       0.06

# Print out summary of lm model
print("Summary of Linear Model:")

## [1] "Summary of Linear Model:"

summary(lm.mod)

## 
## Call:
## lm(formula = Murder ~ ., data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3990 -1.9127 -0.3444  1.2557  7.4279 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.276639   1.737997   1.885   0.0657 .  
## Assault      0.039777   0.005912   6.729 2.33e-08 ***
## UrbanPop    -0.054694   0.027880  -1.962   0.0559 .  
## Rape         0.061399   0.055740   1.102   0.2764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.574 on 46 degrees of freedom
## Multiple R-squared:  0.6721, Adjusted R-squared:  0.6507 
## F-statistic: 31.42 on 3 and 46 DF,  p-value: 3.322e-11

5) Can you also estimate the standard error by hand as you did for point estimates using matrix algebra? Explain your logic as well

# Calculate correlation matrix
corr_matrix <- cor(X[, -1], use = "complete.obs", method = "pearson")

# Calculate residual variance and related statistics
N <- nrow(X)  # Number of observations
k <- ncol(X) - 1  # Number of predictor variables (excluding intercept)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # Estimated regression coefficients

# Residuals and residual variance
residuals <- y - X %*% beta_hat
residual_variance <- sum(residuals^2) / (N - k - 1)
residual_std_error <- sqrt(residual_variance)

# Variance and standard errors of estimated coefficients
var_betaHat <- residual_variance * solve(t(X) %*% X)
coeff_std_errors <- sqrt(diag(var_betaHat))

# t-values and p-values of estimated coefficients
t_values <- beta_hat / coeff_std_errors
p_values_tstat <- 2 * pt(q = abs(t_values), df = N - k - 1, lower.tail = FALSE)

# Assigning significance codes to p-values
signif_codes_match <- function(x) {
  ifelse(x <= 0.001, "***",
         ifelse(x <= 0.01, "**",
                ifelse(x < 0.05, "*", ".")
         )
  )
}
signif_codes <- sapply(p_values_tstat, signif_codes_match)

# Calculate R-squared and Adjusted R-squared
R_sq <- 1 - (residual_variance * (N - 1)) / sum((y - mean(y))^2)
R_sq_adj <- 1 - (residual_variance * (N - 1)) / ((N - k - 1) * mean((y - mean(y))^2))

# Residual sum of squares (RSS) for the full model
RSS <- sum(residuals^2)

# Total sum of squares (TSS) for the full model
TSS <- sum((y - mean(y))^2)

# F-statistic and p-value of the F-statistic
F_stat <- ( (TSS - RSS) / k ) / ( RSS / (N - k - 1) )
p_value_F_stat <- pf(q = F_stat, df1 = k, df2 = N - k - 1, lower.tail = FALSE)

# Combine results into a dataframe
lm_results <- data.frame(
  Estimate = beta_hat,
  Std_Error = coeff_std_errors,
  t_value = t_values,
  Pr_t = p_values_tstat,
  Signif = signif_codes
)
rownames(lm_results) <- colnames(X)
print("Linear Model Results:")

## [1] "Linear Model Results:"

print(lm_results)

##             Estimate   Std_Error   t_value         Pr_t Signif
## int       3.27663918 1.737997161  1.885296 6.571517e-02      .
## Assault   0.03977717 0.005911667  6.728587 2.328851e-08    ***
## UrbanPop -0.05469363 0.027880242 -1.961734 5.586128e-02      .
## Rape      0.06139942 0.055740249  1.101528 2.763975e-01      .

print("R-squared:")

## [1] "R-squared:"

print(R_sq)

## [1] 0.6506786

print("Adjusted R-squared:")

## [1] "Adjusted R-squared:"

print(R_sq_adj)

## [1] 0.6203028

print("F-statistic:")

## [1] "F-statistic:"

print(F_stat)

## [1] 31.42399

print("P-value of F-statistic:")

## [1] "P-value of F-statistic:"

print(p_value_F_stat)

## [1] 3.322423e-11

Certainly! The code provided performs several calculations and outputs related to a linear regression model fitted on the USArrests dataset in R. Let’s break down each part and explain the logic:

Calculations and Output Explanation:

Initial Setup:
- N <- nrow(X): Calculates the number of observations (N) from the rows of matrix X.
- k <- ncol(X) - 1: Calculates the number of predictor variables (k) excluding the intercept.
Estimated Regression Coefficients:
- beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y: Computes the estimated regression coefficients using matrix algebra.
Residuals and Residual Variance:
- residuals <- y - X %*% beta_hat: Computes the residuals (difference between observed and predicted values).
- residual_variance <- sum(residuals^2) / (N - k - 1): Calculates the residual variance, which estimates the variance of the error term in the regression model.
Variance and Standard Errors of Estimated Coefficients:
- var_betaHat <- residual_variance * solve(t(X) %*% X): Calculates the variance-covariance matrix of the estimated coefficients.
- coeff_std_errors <- sqrt(diag(var_betaHat)): Computes the standard errors of the estimated coefficients by taking the square root of the diagonal elements of var_betaHat.
t-values and p-values of Estimated Coefficients:
- t_values <- beta_hat / coeff_std_errors: Computes the t-values for each coefficient estimate.
- p_values_tstat <- 2 * pt(q = abs(t_values), df = N - k - 1, lower.tail = FALSE): Calculates the two-sided p-values for each coefficient using the t-distribution.
Assigning Significance Codes to p-values:
- signif_codes_match function assigns significance codes (“”, ””, ””, “.”) based on the p-values.
- signif_codes <- sapply(p_values_tstat, signif_codes_match): Applies the signif_codes_match function to each p-value to determine significance codes.
R-squared and Adjusted R-squared:
- R_sq <- 1 - (residual_variance * (N - 1)) / sum((y - mean(y))^2): Calculates the coefficient of determination (R-squared).
- R_sq_adj <- 1 - (residual_variance * (N - 1)) / ((N - k - 1) * mean((y - mean(y))^2)): Computes the adjusted R-squared, which adjusts for the number of predictors in the model.
F-statistic and p-value of the F-statistic:
- RSS <- sum(residuals^2): Computes the residual sum of squares (RSS).
- TSS <- sum((y - mean(y))^2): Calculates the total sum of squares (TSS).
- F_stat <- ( (TSS - RSS) / k ) / ( RSS / (N - k - 1) ): Computes the F-statistic, which tests the overall significance of the regression model.
- p_value_F_stat <- pf(q = F_stat, df1 = k, df2 = N - k - 1, lower.tail = FALSE): Calculates the p-value associated with the F-statistic using the F-distribution.
Output:
- lm_results: Combines the estimated coefficients, standard errors, t-values, p-values, and significance codes into a dataframe.
- Printing various statistics (lm_results, R-squared, Adjusted R-squared, F-statistic, and its p-value) to summarize the regression model’s results.

Logic:

Matrix Algebra Approach: The code utilizes matrix algebra (solve function) to compute the regression coefficients, variance-covariance matrix, and subsequently, standard errors.
Statistical Tests: Calculates t-values and associated p-values for each coefficient to assess their individual significance.
Model Fit: Computes R-squared and Adjusted R-squared to evaluate how well the model explains the variability in the dependent variable.
Overall Model Significance: Uses the F-statistic and its associated p-value to test the overall significance of the regression model.

This approach provides a comprehensive overview of the linear regression model’s results, including point estimates, standard errors, significance tests for coefficients, and overall model fit statistics, ensuring a thorough understanding of the model’s performance.

Week_2

Ganesh Kumar

2024-07-14

Calculations and Output Explanation:

Logic: