I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data Download type of economic data, and the variable definitions).
# Load dataset
data("USArrests")
# View the first few rows of the dataset
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
# Summary statistics
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
# Scatterplot matrix
pairs(USArrests)
# Correlation matrix
cor(USArrests)
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
Describe: The USArrests dataset contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973.
Variables: Murder: Murder arrests per 100,000 residents. Assault: Assault arrests per 100,000 residents. UrbanPop: Urban population percentage in each state. Rape: Rape arrests per 100,000 residents.
Dataset type: This dataset is a panel dataset, containing cross-sectional data (states) over a single time period (1973). It is often used to analyze crime rates across different states and potentially correlate them with other socioeconomic factors.
The above code gives its summary statistics, a scatterplot matrix showing relationships between variables, and a correlation matrix indicating the pairwise correlations between the variables (Murder, Assault, UrbanPop, and Rape).
2)Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable.
Setting up a multivariate regression model; For this particular dataset, let’s use Murder as our dependent variable (y variable), and Assault and UrbanPop as our independent variables (X variables).
Estimating Equation: The multivariate regression equation we’ll estimate is:
Murder=𝛽0+𝛽1⋅Assault+𝛽2⋅UrbanPop+𝜖
Murder is the number of murder arrests per 100,000 residents (dependent variable).
Assault is the number of assault arrests per 100,000 residents (independent variable).
UrbanPop is the percentage of the urban population in the state (independent variable).
𝛽0 is the intercept (constant term).
𝛽1 is the coefficient for Assault, representing the effect of Assault on Murder when UrbanPop is held constant.
𝛽2is the coefficient for UrbanPop, representing the effect of UrbanPop on Murder when Assault is held constant.
ϵ is the error term.
Now, first estimate the multivariate regression (linear relationship) above with the “lm” command
# Load dataset
data("USArrests")
# multivariate regression model
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)
# View summary of the model
summary(model)
##
## Call:
## lm(formula = Murder ~ Assault + UrbanPop, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5530 -1.7093 -0.3677 1.2284 7.5985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.207153 1.740790 1.842 0.0717 .
## Assault 0.043910 0.004579 9.590 1.22e-12 ***
## UrbanPop -0.044510 0.026363 -1.688 0.0980 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6491
## F-statistic: 46.32 on 2 and 47 DF, p-value: 7.704e-12
The above output will assess the statistical significance and direction of the relationships between Murder, Assault, and UrbanPop in the dataset.
Coefficients: The Estimate column shows the estimated coefficients (beta values). For example, Assault has an estimated coefficient of 0.043910, indicating that for every one-unit increase in Assault arrests per 100,000 residents, Murder arrests per 100,000 residents increase by approximately 0.043910, holding UrbanPop constant.
P-values: The p-values associated with each coefficient. They indicate the statistical significance of each predictor variable in explaining the variation in the dependent variable (Murder).
Multiple R-squared: The Multiple R-squared value (0.6634) indicates that the model explains approximately 66.34% of the variance in Murder arrests across the states.
F-statistic: The F-statistic tests the overall significance of the model. In this case, the model is statistically significant (p-value is very small), suggesting that at least one of the predictors (Assault or UrbanPop) is useful in predicting Murder.
Second, confirm if you get the same point
estimates/coefficients/betas with the matrix algebra formula that we
discussed in class - inv(X’X) X’y You should get the same answer. If
not, check if you specified the intercept term in your matrix X by
adding a unit vector or not.
LMR Section 2.6 can be useful if you get stuck (see the textbook folder
in the class Dropbox link). Or see the hint / OLS_matrixVSlm.pdf file,
and you can recode your dataset using OLS_matrixVSlm.Rmd file in W2
folder in class Dropbox folder.
# Load dataset
data("USArrests")
# Extract relevant variables
y <- USArrests$Murder
X <- as.matrix(cbind(1, USArrests$Assault, USArrests$UrbanPop)) # Design matrix with intercept
# Compute beta hat using matrix algebra
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
# Display the coefficients
beta_hat
## [,1]
## [1,] 3.20715340
## [2,] 0.04390995
## [3,] -0.04451047
# the model with lm to get coefficients
model <- lm(Murder ~ Assault + UrbanPop, data = USArrests)
# Extract coefficients from lm model
coefficients(model)
## (Intercept) Assault UrbanPop
## 3.20715340 0.04390995 -0.04451047
Can you also estimate the standard error by hand as you did for point estimates using matrix algebra? Explain your logic as well.
# Load dataset
data(USArrests)
# Create a matrix X with Intercept and Assault
X <- cbind(1, USArrests$Assault)
# Response vector Y
Y <- USArrests$Murder
# Estimate beta
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
# Compute residuals
residuals <- Y - X %*% beta_hat
# Estimate residual variance
sigma_hat_sq <- sum(residuals^2) / (nrow(USArrests) - ncol(X))
# Compute (X'X)^(-1)
XtX_inv <- solve(t(X) %*% X)
# Calculate standard errors of coefficients
se_beta <- sqrt(diag(sigma_hat_sq * XtX_inv))
# Display standard error of the coefficient for Assault
se_Assault <- se_beta[2]
se_Assault
## [1] 0.00450724
Step 1: Load dataset
Step 2: Create the design matrix X Here, X is constructed using cbind() to create a matrix: The first column is a column of ones (1), which represents the intercept term in the linear regression model. The second column is USArrests$Assault, which represents the predictor variable (independent variable) used to predict Murder rates
Step3:Define the response vector Y
Y is set to USArrests$Murder, which contains the response variable (dependent variable) of interest in the regression model. In this case, it represents the murder rates.
Step4: Estimate beta using matrix algebra Matrix Algebra Explanation: XT X(X transpose of X) is denoted by t(X). X transpose of X is X is calculated as t(X) %*% X.
solve(t(X) %*% X) computes the inverse of𝑋transpose X, denoted as (𝑋⊤𝑋)^−1
X⊤Y is t(X) %*% Y.
Thus, 𝛽=(𝑋⊤𝑋)^−1 X⊤Y , computes the coefficient estimates for the intercept and Assault variable in the linear regression model.
Step5: Computing Residuals The residuals (residuals) are calculated as Y−Xβ These represent the differences between the observed values (Y) and the predicted values based on the regression model (X,hat,beta)
Step6: Estimating Residual Variance σ^2
σ^2 (residual variance) is estimated as RSS/n-p-1 where RSS (residual sum of squares) is computed as sum(residuals^2), nrow(USArrests) is the number of observations, and ncol(X) is the number of predictors (including the intercept).
Step7: Calculate XtX_inv calculates the inverse of X⊤X, which is needed to compute the variance-covariance matrix of β and subsequently the standard errors of the coefficients.
Step8: Standard Error Calculation: diag(sigma_hat_sq * XtX_inv) extracts the diagonal elements of 𝜎^2 .(X⊤X)^-1 sqrt() computes the square root of each diagonal element to obtain the standard errors SE(βj) for each coefficient βj
Step9: Display the standard error for assault se_beta[2] retrieves the standard error associated with the Assault variable from se_beta. This value represents the uncertainty in the estimated coefficient of Assault in predicting Murder rates.