── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ dplyr::select() masks MASS::select()
✖ purrr::some() masks car::some()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(psych)
Warning: package 'psych' was built under R version 4.3.3
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
The following object is masked from 'package:car':
logit
# Fit the full linear regression model
model <- lm(Time ~ ., data = UScrime)
model1 <- lm(expenditure ~ ., data = CASchools)
# Get the summary of the model to check the p-values
summary(model)
Call:
lm(formula = Time ~ ., data = UScrime)
Residuals:
Min 1Q Median 3Q Max
-8.9233 -2.8814 -0.6517 2.5929 8.6936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.049e+01 4.705e+01 1.498 0.144151
M 1.103e-01 1.096e-01 1.007 0.321758
So -2.718e+00 3.682e+00 -0.738 0.466013
Ed 1.109e-02 1.765e-01 0.063 0.950304
Po1 4.853e-01 2.647e-01 1.833 0.076390 .
Po2 -5.970e-01 2.774e-01 -2.152 0.039295 *
LF -1.704e-02 3.669e-02 -0.464 0.645703
M.F -4.534e-02 5.077e-02 -0.893 0.378779
Pop 4.127e-02 3.151e-02 1.310 0.199861
NW 2.743e-02 1.553e-02 1.766 0.087200 .
U1 -9.156e-02 1.071e-01 -0.855 0.399064
U2 1.554e-01 2.172e-01 0.716 0.479592
GDP 1.849e-02 2.603e-02 0.710 0.482805
Ineq 8.428e-03 6.496e-02 0.130 0.897608
Prob -1.883e+02 5.050e+01 -3.729 0.000772 ***
y -2.169e-03 4.468e-03 -0.486 0.630708
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.221 on 31 degrees of freedom
Multiple R-squared: 0.6343, Adjusted R-squared: 0.4573
F-statistic: 3.584 on 15 and 31 DF, p-value: 0.001295
#summary(model1)
- Replicate the “beta” coefficient that the lm() command generates in R for a bivariate regression (one independent variable only) using the cov-var formula.
- Replicate the “beta” coefficients that the lm() command generates in R for a multivariate regression (multiple independent variables) using the matrix algebra formula.
PART A. Please refer to lecture notes - ADEC 7320 Week 2 Lecture Notes_Annotated.pdf (pages 21-30). We ran a simple linear regression in class to find the slope / beta1 coefficient (different from the intercept term). Furthermore, we saw that the beta1 coefficient in a simple linear regression can be written as
\[
\frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)}.
\]
# Bivariate regression using the cov-var formula in R
# Use the 'Time' variable as the independent variable and 'y' (crime rate) as the dependent variable
# Time <- UScrime$Time
# y <- UScrime$y
# Step 1: Calculate slope (beta1)
# cov_Time_y <- cov(Time, y)       # Covariance between Time and y
# var_Time <- var(Time)            # Variance of Time
# beta1 <- cov_Time_y / var_Time   # Slope
# Step 2: Calculate intercept (beta0)
# mean_Time <- mean(Time)
# mean_y <- mean(y)
# beta0 <- mean_y - beta1 * mean_Time  # Intercept
# Print the beta coefficients
# beta0
# beta1
# Compare with the lm() function output
# lm_model <- lm(y ~ Time, data = UScrime)
# summary(lm_model)
Take this chance to deepen your intuition on the slope formula. Why does it work? Thus, please answer -
In your own words, what is covariance? What is variance? Online stats blogs may be more helpful.
Covariance is a measure of how two variables move together. If both variables tend to increase together, their covariance is positive, while if one increases as the other decreases, the covariance is negative.
Variance measures how much a single variable varies from its mean. It is the square of the standard deviation and gives insight into the spread of a dataset.
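To make the two definitions concrete, here is a minimal sketch (reusing the UScrime variables already loaded above, with the hypothetical names time_x and crime_y used only for this illustration) that computes both quantities from their formulas and checks them against R's built-ins, which use the sample (n - 1) denominator:

# Sample covariance and variance computed directly from their definitions
time_x  <- UScrime$Time
crime_y <- UScrime$y
n <- length(time_x)
cov_by_hand <- sum((time_x - mean(time_x)) * (crime_y - mean(crime_y))) / (n - 1)
var_by_hand <- sum((time_x - mean(time_x))^2) / (n - 1)
# Compare with R's built-in sample estimators
c(cov_by_hand, cov(time_x, crime_y))
c(var_by_hand, var(time_x))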
Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?
The reason dividing the covariance of y and x by the variance of x gives the slope (β_1) in a simple linear regression is that covariance shows how y and x move together, while variance normalizes this movement relative to how much x itself varies. This gives the “rate of change” in y for a unit change in x, which is the definition of the slope.
The slope (\(\beta_1\)) in a simple linear regression can be written as:
\[
\beta_1 = \frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)}.
\]
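One sketch of why the formula works, starting from the least-squares first-order conditions (the same normal equations used in the matrix version further down):
\[
\min_{\beta_0,\beta_1}\sum_{i}\left(y_i-\beta_0-\beta_1 x_i\right)^2
\;\Rightarrow\;
\hat\beta_1=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}
=\frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)},
\]
since dividing the numerator and denominator by \(n-1\) turns the sums into the sample covariance and variance.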
Please show this in R. You just need to match the two outputs (one from regression, one from the formula) like we did in class, but for a dataset of your choice.
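A runnable version of the commented-out chunk above (a minimal sketch using y and Time from UScrime), so the two outputs can be matched directly:

# Bivariate regression of crime rate (y) on Time: cov/var formula vs lm()
beta1_formula <- cov(UScrime$y, UScrime$Time) / var(UScrime$Time)      # slope from the formula
beta0_formula <- mean(UScrime$y) - beta1_formula * mean(UScrime$Time)  # implied intercept
lm_bivariate <- lm(y ~ Time, data = UScrime)
# The two sets of estimates should agree up to floating-point error
c(formula_intercept = beta0_formula, formula_slope = beta1_formula)
coef(lm_bivariate)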
# Dependent variable
y <- as.vector(UScrime$y)
# Matrix of feature variables from UScrime (excluding the response variable)
X <- as.matrix(UScrime[-which(names(UScrime) == "y")])
# rep() replicates the values in x; build a vector of ones with the same length as rows in UScrime
int <- rep(1, length(y))
# cbind() combines vectors, matrices or data frames by columns; add the intercept column to X
X <- cbind(int, X)
# Compute beta coefficients manually
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas_check <- solve(crossprod(X), crossprod(X, y))
# Check if the manually computed betas match those from the efficient method
betas == round(betas_check, 2)
[,1]
int FALSE
M FALSE
So FALSE
Ed FALSE
Po1 FALSE
Po2 FALSE
LF FALSE
M.F FALSE
Pop FALSE
NW FALSE
U1 FALSE
U2 FALSE
GDP FALSE
Ineq FALSE
Prob FALSE
Time FALSE
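The all-FALSE column above only reflects that the code compares the unrounded betas to the rounded betas_check; compared at matching precision (or with all.equal()), the two methods agree:

# Compare the two solution methods at the same precision
all.equal(as.vector(betas), as.vector(betas_check))
round(betas, 2) == round(betas_check, 2)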
# Fit the linear model
lm.mod <- lm(y ~ ., data = UScrime)
# Round the coefficients for easier viewing
lm.betas <- round(lm.mod$coefficients, 2)
# Create data.frame of results (see ?data.frame)
results <- data.frame(our.results = betas, lm.results = lm.betas)
print(results)
our.results lm.results
int -5984.2876045 -5984.29
M 8.7830173 8.78
So -3.8034503 -3.80
Ed 18.8324315 18.83
Po1 19.2804338 19.28
Po2 -10.9421925 -10.94
LF -0.6638261 -0.66
M.F 1.7406856 1.74
Pop -0.7330081 -0.73
NW 0.4204461 0.42
U1 -5.8271027 -5.83
U2 16.7799672 16.78
GDP 0.9616624 0.96
Ineq 7.0672099 7.07
Prob -4855.2658155 -4855.27
Time -3.4790178 -3.48
We now try the same exercise with income as the dependent variable in the CASchools dataset.
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(where(is.numeric))
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled[-which(names(CASchools_scaled) == "income")])
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Check for multicollinearity
vif_values1 <- vif(lm(income ~ ., data = CASchools_scaled))
print(vif_values1)
students teachers calworks lunch computer expenditure
193.785829 216.182451 2.630874 6.977956 8.875152 1.243668
english read math
2.600597 12.890113 7.521308
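students and teachers (and to a lesser extent read) sit well above the common VIF threshold of 10, which makes sense because district size drives both counts. Purely as an illustration (not part of the model used below), one could drop one of the collinear pair and recheck:

# Refit without teachers and recheck the variance inflation factors
vif(lm(income ~ . - teachers, data = CASchools_scaled))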
# If VIF values are high, consider removing highly correlated predictors
# Compute beta coefficients manually
betas1 <- solve(t(b) %*% b) %*% t(b) %*% a
betas_check1 <- solve(crossprod(b), crossprod(b, a))
# Check if the manually computed betas match those from the efficient method
# (as above, comparing unrounded to rounded values returns FALSE even when they agree)
betas1 == round(betas_check1, 2)
[,1]
int1 FALSE
students FALSE
teachers FALSE
calworks FALSE
lunch FALSE
computer FALSE
expenditure FALSE
english FALSE
read FALSE
math FALSE
# Fit the linear model
lm.mod1 <- lm(income ~ ., data = CASchools_scaled)
# Round the coefficients for easier viewing
lm.betas1 <- round(lm.mod1$coefficients, 2)
# Create data.frame of results
results1 <- data.frame(our.results1 = betas1, lm.results1 = lm.betas1)
print(results1)
Check out a few .Rmd files (R Markdown extension) in our class Dropbox folder (link in Course Introduction and Materials), and feel free to adapt any class (or online) code for this discussion. Make sure you understand and adapt the code for your own purposes.
UScrime. The UScrime dataset from the MASS package in R is a cross-sectional dataset. It provides aggregate data on crime rates and related socio-economic variables for 47 US states, collected at a single point in time (1960). This type of data allows for analysis of relationships between variables within that specific timeframe, but it does not track changes over time.
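A quick way to pull up the variable definitions and structure in R (the help page documents the 16 variables, aggregated over the 47 states in 1960):

library(MASS)     # UScrime ships with MASS
help("UScrime")   # variable definitions
str(UScrime)      # 47 observations of 16 numeric variables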
It seems police expenditure in 1959 is not strongly related to the outcome, so we can examine every p-value to see which predictors belong in our final model:
# ??UScrime  # search the help system for the dataset documentation
# Fit the full linear regression model with y (crime rate) as the dependent variable
model <- lm(y ~ ., data = UScrime)
# Get the summary of the model to check p-values
model_summary <- summary(model)
# Extract coefficients and p-values from the model summary
coefficients_df <- as.data.frame(model_summary$coefficients)
# Extract p-values
p_values <- coefficients_df[, "Pr(>|t|)"]
# Display all p-values
print("P-values of all factors:")
print(p_values)
# Keep the factors significant at the 5% level (excluding the intercept) for the plots below
significant_factors <- rownames(coefficients_df)[p_values < 0.05 & rownames(coefficients_df) != "(Intercept)"]
# Plot regression for the significant factors and add the regression equation to the plot
for (factor in significant_factors) {
  # Fit a bivariate regression model for the specific factor
  factor_model <- lm(as.formula(paste("y ~", factor)), data = UScrime)
  # Extract the intercept and slope from the model
  intercept <- round(coef(factor_model)[1], 2)
  slope <- round(coef(factor_model)[2], 2)
  # Create the regression equation as a string
  reg_eq <- paste("y = ", intercept, " + ", slope, "*", factor, sep = "")
  # Create scatter plot with regression line
  # (aes_string() is deprecated, hence the warning below; aes(.data[[factor]]) is the modern idiom)
  p <- ggplot(UScrime, aes_string(x = factor, y = "y")) +
    geom_point() +                           # Points for each observation
    geom_smooth(method = "lm", se = TRUE) +  # Add regression line with standard error bands
    labs(title = paste("Crime Rate vs", factor),
         x = factor,
         y = "Crime Rate") +
    annotate("text", x = Inf, y = Inf, label = reg_eq,
             hjust = 1.1, vjust = 1.1, size = 5, color = "red") +
    theme_minimal()
  print(p)
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
The CASchools dataset from the AER package in R is a cross-sectional dataset. It provides data on educational and demographic variables for 420 school districts in California, collected at a single point in time (the 1998-1999 academic year). This type of data allows for analysis of relationships between variables within that specific timeframe, but it does not track changes over time.
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(where(is.numeric))
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled[-which(names(CASchools_scaled) == "income")])
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Fit the full linear regression model with income as the dependent variable
model1 <- lm(income ~ ., data = CASchools_scaled)
# Get the summary of the model to check p-values
model_summary1 <- summary(model1)
# Extract coefficients and p-values from the model summary
coefficients_df1 <- as.data.frame(model_summary1$coefficients)
# Extract p-values
p_values1 <- coefficients_df1[, "Pr(>|t|)"]
# Display all p-values
print("P-values of all factors:")
print(p_values1)
# Keep the factors significant at the 5% level (excluding the intercept) for the plots below
significant_factors1 <- rownames(coefficients_df1)[p_values1 < 0.05 & rownames(coefficients_df1) != "(Intercept)"]
# Plot regression for the significant factors and add the regression equation to the plot
for (factor in significant_factors1) {
  # Fit a bivariate regression model for the specific factor
  factor_model1 <- lm(as.formula(paste("income ~", factor)), data = CASchools_scaled)
  # Extract the intercept and slope from the model
  intercept1 <- round(coef(factor_model1)[1], 2)
  slope1 <- round(coef(factor_model1)[2], 2)
  # Create the regression equation as a string
  reg_eq1 <- paste("income = ", intercept1, " + ", slope1, "*", factor, sep = "")
  # Create scatter plot with regression line
  p1 <- ggplot(CASchools_scaled, aes_string(x = factor, y = "income")) +
    geom_point() +                           # Points for each observation
    geom_smooth(method = "lm", se = TRUE) +  # Add regression line with standard error bands
    labs(title = paste("Income vs", factor),
         x = factor,
         y = "Income") +
    annotate("text", x = Inf, y = Inf, label = reg_eq1,
             hjust = 1.1, vjust = 1.1, size = 5, color = "red") +
    theme_minimal()
  print(p1)
}
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
HINT: In particular, you will be able to replicate OLS_matrixVSlm.html output from the OLS_matrixVSlm.qmd file there, which may be relevant for this discussion.
I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data, and the variable definitions).
Note: here I will choose UScrime.
You can use any in-built R dataset too - just type "data()" in R to see the available options (Section 3 may be helpful), or you can use the AER package to find more relevant Economics datasets.
Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable). So for example, income may be my dependent variable, while age, experience, … may be my independent variables. TYPE OUT THE FULL ESTIMATING EQUATION.
As the code above suggested, for UScrime I'm going to use "M", "Ed", "Ineq", and "Prob" as my independent variables. The estimating equation is:
$$
y = \beta_0 + \beta_1 M + \beta_2 Ed + \beta_3 Ineq + \beta_4 Prob + \epsilon
$$
# Fit the multivariate regression model
lm.mod <- lm(y ~ M + Ed + Ineq + Prob, data = UScrime)
# Extract the coefficients for easier viewing
lm.betas <- round(x = lm.mod$coefficients, digits = 2)
# Print the model summary
summary(lm.mod)
Call:
lm(formula = y ~ M + Ed + Ineq + Prob, data = UScrime)
Residuals:
Min 1Q Median 3Q Max
-532.97 -254.03 -55.72 137.80 960.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1339.346 1247.011 -1.074 0.28893
M 3.597 5.339 0.674 0.50417
Ed 14.861 7.192 2.066 0.04499 *
Ineq 2.687 2.277 1.180 0.24458
Prob -7331.915 2560.270 -2.864 0.00651 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 347.5 on 42 degrees of freedom
Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077
# Create the design matrix X (including the intercept)
X <- model.matrix(~ M + Ed + Ineq + Prob, data = UScrime)
# Extract the dependent variable y
y <- UScrime$y
# Estimate betas using the formula beta_hat = inv(X'X) X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
# Print beta_hat to compare with lm results
round(beta_hat, digits = 2)
[,1]
(Intercept) -1339.35
M 3.60
Ed 14.86
Ineq 2.69
Prob -7331.92
# Compute residuals
residuals <- y - X %*% beta_hat
# Compute variance of residuals (sigma^2)
N <- nrow(X)       # Number of observations
k <- ncol(X) - 1   # Number of predictors
sigma_sq <- sum(residuals^2) / (N - k - 1)
# Compute the variance-covariance matrix of the betas
var_beta_hat <- sigma_sq * solve(t(X) %*% X)
# Extract the standard errors (square root of diagonal elements)
std_errors <- sqrt(diag(var_beta_hat))
# Print the standard errors
round(std_errors, digits = 2)
(Intercept) M Ed Ineq Prob
1247.01 5.34 7.19 2.28 2560.27
# Compare results from lm and manual matrix algebra
results <- data.frame(lm_results = lm.betas, matrix_results = round(beta_hat, 2))
print(results)
lm_results matrix_results
(Intercept) -1339.35 -1339.35
M 3.60 3.60
Ed 14.86 14.86
Ineq 2.69 2.69
Prob -7331.92 -7331.92
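The manually computed standard errors can be checked against lm() in the same way; a minimal sketch:

# Manual standard errors vs those reported by summary(lm.mod)
data.frame(manual_se = round(std_errors, 2),
           lm_se = round(summary(lm.mod)$coefficients[, "Std. Error"], 2))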
As the code above suggested, for CASchools I'm going to use "students", "teachers", "lunch", "expenditure", "english", and "math" as my independent variables. The estimating equation is:
$$
income = \beta_0 + \beta_1 students + \beta_2 teachers + \beta_3 lunch + \beta_4 expenditure + \beta_5 english + \beta_6 math + \epsilon
$$
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(students, teachers, lunch, expenditure, english, math, income)
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled %>% select(-income))
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Fit the multivariate regression model
lm.mod1 <- lm(income ~ students + teachers + lunch + expenditure + english + math, data = CASchools_scaled)
# Extract the coefficients for easier viewing
lm.betas1 <- round(x = lm.mod1$coefficients, digits = 2)
# Print the model summary
summary(lm.mod1)
Call:
lm(formula = income ~ students + teachers + lunch + expenditure +
english + math, data = CASchools_scaled)
Residuals:
Min 1Q Median 3Q Max
-9.5752 -2.6100 -0.5509 1.9055 26.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.3166 0.2145 71.409 < 2e-16 ***
students -6.8499 2.9195 -2.346 0.0194 *
teachers 7.4466 2.9130 2.556 0.0109 *
lunch -3.8894 0.4159 -9.351 < 2e-16 ***
expenditure 1.6994 0.2250 7.555 2.73e-13 ***
english 1.7130 0.3049 5.619 3.54e-08 ***
math 2.5657 0.3867 6.635 1.02e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.396 on 413 degrees of freedom
Multiple R-squared: 0.6352, Adjusted R-squared: 0.6299
F-statistic: 119.9 on 6 and 413 DF, p-value: < 2.2e-16
# Create the design matrix b (including the intercept)
b <- model.matrix(~ students + teachers + lunch + expenditure + english + math, data = CASchools_scaled)
# Estimate betas using the formula beta_hat = inv(b'b) b'a
beta_hat1 <- solve(t(b) %*% b) %*% t(b) %*% a
# Print beta_hat to compare with lm results
round(beta_hat1, digits = 2)
[,1]
(Intercept) 15.32
students -6.85
teachers 7.45
lunch -3.89
expenditure 1.70
english 1.71
math 2.57
# Compute residuals
residuals1 <- a - b %*% beta_hat1
# Compute variance of residuals (sigma^2)
N1 <- nrow(b)       # Number of observations
k1 <- ncol(b) - 1   # Number of predictors
sigma_sq1 <- sum(residuals1^2) / (N1 - k1 - 1)
# Compute the variance-covariance matrix of the betas
var_beta_hat1 <- sigma_sq1 * solve(t(b) %*% b)
# Extract the standard errors (square root of diagonal elements)
std_errors1 <- sqrt(diag(var_beta_hat1))
# Print the standard errors
round(std_errors1, digits = 2)
(Intercept) students teachers lunch expenditure english
0.21 2.92 2.91 0.42 0.22 0.30
math
0.39
# Compare results from lm and manual matrix algebra
results1 <- data.frame(lm_results = lm.betas1, matrix_results = round(beta_hat1, 2))
print(results1)
lm_results matrix_results
(Intercept) 15.32 15.32
students -6.85 -6.85
teachers 7.45 7.45
lunch -3.89 -3.89
expenditure 1.70 1.70
english 1.71 1.71
math 2.57 2.57
To learn how to type out the equation, read here. Make sure it is well aligned - see the .Rmd file corresponding to OLS_matrixVSlm.pdf if needed.
Now, first estimate the multivariate regression (linear relationship) above with the "lm" command. The syntax is very simple, and you will find many blogs on this - just find the relevant sections, as regression is a large topic. (E.g. the subsection Multiple regression: biking, smoking, and heart disease may be helpful too). Do not worry too much about causality or interpretation of beta coefficients for this discussion.
Second, confirm if you get the same point estimates/coefficients/betas with the matrix algebra formula that we discussed in class - inv(X’X) X’y
You should get the same answer. If not, check if you specified the intercept term in your matrix X by adding a unit vector or not.
LMR Section 2.6 can be useful if you get stuck (see the textbook folder in the class Dropbox link). Or see the hint / OLS_matrixVSlm.pdf file, and you can recode your dataset using OLS_matrixVSlm.Rmd file in W2 folder in class Dropbox folder.
Food for thought: Can you also estimate the standard error by hand, as you did for the point estimates using matrix algebra? Explain your logic as well, i.e. what is the intuition?
Standard errors can be calculated by first estimating the residual variance, followed by calculating the variance-covariance matrix of the coefficients. The square roots of the diagonal of this matrix give the standard errors.
The intuition behind this is that the residual variance measures the error in the model’s predictions, and the variance-covariance matrix captures how much the estimated coefficients vary based on the model’s structure. Standard errors quantify the uncertainty in each coefficient.
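In formula form, matching the manual computation in the chunks above:
\[
\hat\sigma^2=\frac{\hat e'\hat e}{N-k-1},\qquad
\widehat{\mathrm{Var}}(\hat\beta)=\hat\sigma^2\,(X'X)^{-1},\qquad
\mathrm{SE}(\hat\beta_j)=\sqrt{\left[\widehat{\mathrm{Var}}(\hat\beta)\right]_{jj}},
\]
where \(\hat e = y - X\hat\beta\) are the residuals, \(N\) is the number of observations, and \(k\) the number of predictors excluding the intercept.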