week2_Discussion

I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data Download type of economic data, and the variable definitions).

# Load dataset
data("USArrests")

# View the first few rows of the dataset
head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

# Summary statistics
summary(USArrests)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

# Scatterplot matrix
pairs(USArrests)

# Correlation matrix
cor(USArrests)

##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000

Describe: The USArrests dataset contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973.

Variables: Murder: Murder arrests per 100,000 residents. Assault: Assault arrests per 100,000 residents. UrbanPop: Urban population percentage in each state. Rape: Rape arrests per 100,000 residents.

Dataset type: This dataset is a panel dataset, containing cross-sectional data (states) over a single time period (1973). It is often used to analyze crime rates across different states and potentially correlate them with other socioeconomic factors.

The above code gives its summary statistics, a scatterplot matrix showing relationships between variables, and a correlation matrix indicating the pairwise correlations between the variables (Murder, Assault, UrbanPop, and Rape).

2)Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable.

Setting up a multivariate regression model; For this particular dataset, let’s use Murder as our dependent variable (y variable), and Assault and UrbanPop as our independent variables (X variables).

Estimating Equation: The multivariate regression equation we’ll estimate is:

Murder=𝛽0+𝛽1⋅Assault+𝛽2⋅UrbanPop+𝜖

Murder is the number of murder arrests per 100,000 residents (dependent variable).

Assault is the number of assault arrests per 100,000 residents (independent variable).

UrbanPop is the percentage of the urban population in the state (independent variable).

𝛽0 is the intercept (constant term).

𝛽1 is the coefficient for Assault, representing the effect of Assault on Murder when UrbanPop is held constant.

𝛽2is the coefficient for UrbanPop, representing the effect of UrbanPop on Murder when Assault is held constant.

ϵ is the error term.

Now, first estimate the multivariate regression (linear relationship) above with the “lm” command

# Load dataset
data("USArrests")

# multivariate regression model
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

# View summary of the model
summary(model)

## 
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5530 -1.7093 -0.3677  1.2284  7.5985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.207153   1.740790   1.842   0.0717 .  
## Assault      0.043910   0.004579   9.590 1.22e-12 ***
## UrbanPop    -0.044510   0.026363  -1.688   0.0980 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6491 
## F-statistic: 46.32 on 2 and 47 DF,  p-value: 7.704e-12

The above output will assess the statistical significance and direction of the relationships between Murder, Assault, and UrbanPop in the dataset.

Coefficients: The Estimate column shows the estimated coefficients (beta values). For example, Assault has an estimated coefficient of 0.043910, indicating that for every one-unit increase in Assault arrests per 100,000 residents, Murder arrests per 100,000 residents increase by approximately 0.043910, holding UrbanPop constant.

P-values: The p-values associated with each coefficient. They indicate the statistical significance of each predictor variable in explaining the variation in the dependent variable (Murder).

Multiple R-squared: The Multiple R-squared value (0.6634) indicates that the model explains approximately 66.34% of the variance in Murder arrests across the states.

F-statistic: The F-statistic tests the overall significance of the model. In this case, the model is statistically significant (p-value is very small), suggesting that at least one of the predictors (Assault or UrbanPop) is useful in predicting Murder.

Second, confirm if you get the same point estimates/coefficients/betas with the matrix algebra formula that we discussed in class - inv(X’X) X’y You should get the same answer. If not, check if you specified the intercept term in your matrix X by adding a unit vector or not.
LMR Section 2.6 can be useful if you get stuck (see the textbook folder in the class Dropbox link). Or see the hint / OLS_matrixVSlm.pdf file, and you can recode your dataset using OLS_matrixVSlm.Rmd file in W2 folder in class Dropbox folder.

# Load dataset
data("USArrests")

# Extract relevant variables
y <- USArrests$Murder
X <- as.matrix(cbind(1, USArrests$Assault, USArrests$UrbanPop))  # Design matrix with intercept

# Compute beta hat using matrix algebra
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Display the coefficients
beta_hat

##             [,1]
## [1,]  3.20715340
## [2,]  0.04390995
## [3,] -0.04451047

# the model with lm to get coefficients
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

# Extract coefficients from lm model
coefficients(model)

## (Intercept)     Assault    UrbanPop 
##  3.20715340  0.04390995 -0.04451047

Can you also estimate the standard error by hand as you did for point estimates using matrix algebra? Explain your logic as well.

# Load dataset
data(USArrests)

# Create a matrix X with Intercept and Assault
X <- cbind(1, USArrests$Assault)

# Response vector Y
Y <- USArrests$Murder

# Estimate beta
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y

# Compute residuals
residuals <- Y - X %*% beta_hat

# Estimate residual variance
sigma_hat_sq <- sum(residuals^2) / (nrow(USArrests) - ncol(X))

# Compute (X'X)^(-1)
XtX_inv <- solve(t(X) %*% X)

# Calculate standard errors of coefficients
se_beta <- sqrt(diag(sigma_hat_sq * XtX_inv))

# Display standard error of the coefficient for Assault
se_Assault <- se_beta[2]
se_Assault

## [1] 0.00450724

Step 1: Load dataset

Step 2: Create the design matrix X Here, X is constructed using cbind() to create a matrix: The first column is a column of ones (1), which represents the intercept term in the linear regression model. The second column is USArrests$Assault, which represents the predictor variable (independent variable) used to predict Murder rates

Step3:Define the response vector Y

Y is set to USArrests$Murder, which contains the response variable (dependent variable) of interest in the regression model. In this case, it represents the murder rates.

Step4: Estimate beta using matrix algebra Matrix Algebra Explanation: XT X(X transpose of X) is denoted by t(X). X transpose of X is X is calculated as t(X) %*% X.

solve(t(X) %*% X) computes the inverse of𝑋transpose X, denoted as (𝑋⊤𝑋)^−1

X⊤Y is t(X) %*% Y.

Thus, 𝛽=(𝑋⊤𝑋)^−1 X⊤Y , computes the coefficient estimates for the intercept and Assault variable in the linear regression model.

Step5: Computing Residuals The residuals (residuals) are calculated as Y−Xβ These represent the differences between the observed values (Y) and the predicted values based on the regression model (X,hat,beta)

Step6: Estimating Residual Variance σ^2

σ^2 (residual variance) is estimated as RSS/n-p-1 where RSS (residual sum of squares) is computed as sum(residuals^2), nrow(USArrests) is the number of observations, and ncol(X) is the number of predictors (including the intercept).

Step7: Calculate XtX_inv calculates the inverse of X⊤X, which is needed to compute the variance-covariance matrix of β and subsequently the standard errors of the coefficients.

Step8: Standard Error Calculation: diag(sigma_hat_sq * XtX_inv) extracts the diagonal elements of 𝜎^2 .(X⊤X)^-1 sqrt() computes the square root of each diagonal element to obtain the standard errors SE(βj) for each coefficient βj

Step9: Display the standard error for assault se_beta[2] retrieves the standard error associated with the Assault variable from se_beta. This value represents the uncertainty in the estimated coefficient of Assault in predicting Murder rates.

week2_Discussion

Reuben

2024-07-14