── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ dplyr::select() masks MASS::select()
✖ purrr::some() masks car::some()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(psych)
Warning: package 'psych' was built under R version 4.3.3
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
The following object is masked from 'package:car':
logit
# Fit the full linear regression model
model <- lm(Time ~ ., data = UScrime)
model1 <- lm(expenditure ~ ., data = CASchools)
# Get the summary of the model to check the p-values
summary(model)
Call:
lm(formula = Time ~ ., data = UScrime)
Residuals:
Min 1Q Median 3Q Max
-8.9233 -2.8814 -0.6517 2.5929 8.6936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.049e+01 4.705e+01 1.498 0.144151
M 1.103e-01 1.096e-01 1.007 0.321758
So -2.718e+00 3.682e+00 -0.738 0.466013
Ed 1.109e-02 1.765e-01 0.063 0.950304
Po1 4.853e-01 2.647e-01 1.833 0.076390 .
Po2 -5.970e-01 2.774e-01 -2.152 0.039295 *
LF -1.704e-02 3.669e-02 -0.464 0.645703
M.F -4.534e-02 5.077e-02 -0.893 0.378779
Pop 4.127e-02 3.151e-02 1.310 0.199861
NW 2.743e-02 1.553e-02 1.766 0.087200 .
U1 -9.156e-02 1.071e-01 -0.855 0.399064
U2 1.554e-01 2.172e-01 0.716 0.479592
GDP 1.849e-02 2.603e-02 0.710 0.482805
Ineq 8.428e-03 6.496e-02 0.130 0.897608
Prob -1.883e+02 5.050e+01 -3.729 0.000772 ***
y -2.169e-03 4.468e-03 -0.486 0.630708
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.221 on 31 degrees of freedom
Multiple R-squared: 0.6343, Adjusted R-squared: 0.4573
F-statistic: 3.584 on 15 and 31 DF, p-value: 0.001295
#summary(model1)
- Replicate the “beta” coefficient that the lm() command generates in R for a bivariate regression (one independent variable only) using the cov-var formula.
- Replicate the “beta” coefficients that the lm() command generates in R for a multivariate regression (multiple independent variables) using the matrix algebra formula.
PART A. Please refer to lecture notes - ADEC 7320 Week 2 Lecture Notes_Annotated.pdf (pages 21-30). We ran a simple linear regression in class to find the slope / beta1 coefficient (different from the intercept term). Furthermore, we saw that the beta1 coefficient in a simple linear regression can be written as
\[
\frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)}.
\]
# Bivariate regression using the cov-var formula in R
# Use the 'Time' variable as the independent variable and 'y' (crime rate) as the dependent variable
# Time <- UScrime$Time
# y <- UScrime$y
# Step 1: Calculate slope (beta1)
# cov_Time_y <- cov(Time, y)       # Covariance between Time and y
# var_Time <- var(Time)            # Variance of Time
# beta1 <- cov_Time_y / var_Time   # Slope
# Step 2: Calculate intercept (beta0)
# mean_Time <- mean(Time)
# mean_y <- mean(y)
# beta0 <- mean_y - beta1 * mean_Time  # Intercept
# Print the beta coefficients
# beta0
# beta1
# Compare with the lm() function output
# lm_model <- lm(y ~ Time, data = UScrime)
# summary(lm_model)
Take this chance to deepen your intuition on the slope formula. Why does it work? Thus, please answer -
In your own words, what is covariance? What is variance? Online stats blogs may be more helpful.
Covariance is a measure of how two variables move together. If both variables tend to increase together, their covariance is positive, while if one increases as the other decreases, the covariance is negative.
Variance measures how much a single variable varies from its mean. It is the square of the standard deviation and gives insight into the spread of a dataset.
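To make the two definitions concrete, here is a minimal sketch (reusing the UScrime variables already loaded above, with the hypothetical names time_x and crime_y used only for this illustration) that computes both quantities from their formulas and checks them against R's built-ins, which use the sample (n - 1) denominator:

# Sample covariance and variance computed directly from their definitions
time_x  <- UScrime$Time
crime_y <- UScrime$y
n <- length(time_x)
cov_by_hand <- sum((time_x - mean(time_x)) * (crime_y - mean(crime_y))) / (n - 1)
var_by_hand <- sum((time_x - mean(time_x))^2) / (n - 1)
# Compare with R's built-in sample estimators
c(cov_by_hand, cov(time_x, crime_y))
c(var_by_hand, var(time_x))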
Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?
The reason dividing the covariance of y and x by the variance of x gives the slope (β_1) in a simple linear regression is that covariance shows how y and x move together, while variance normalizes this movement relative to how much x itself varies. This gives the “rate of change” in y for a unit change in x, which is the definition of the slope.
The slope (\(\beta_1\)) in a simple linear regression can be written as:
\[
\beta_1 = \frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)}.
\]
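One sketch of why the formula works, starting from the least-squares first-order conditions (the same normal equations used in the matrix version further down):
\[
\min_{\beta_0,\beta_1}\sum_{i}\left(y_i-\beta_0-\beta_1 x_i\right)^2
\;\Rightarrow\;
\hat\beta_1=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}
=\frac{\mathrm{Cov}(y,x)}{\mathrm{Var}(x)},
\]
since dividing the numerator and denominator by \(n-1\) turns the sums into the sample covariance and variance.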
Please show this in R. You just need to match the two outputs (one from regression, one from the formula) like we did in class, but for a dataset of your choice.
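A runnable version of the commented-out chunk above (a minimal sketch using y and Time from UScrime), so the two outputs can be matched directly:

# Bivariate regression of crime rate (y) on Time: cov/var formula vs lm()
beta1_formula <- cov(UScrime$y, UScrime$Time) / var(UScrime$Time)      # slope from the formula
beta0_formula <- mean(UScrime$y) - beta1_formula * mean(UScrime$Time)  # implied intercept
lm_bivariate <- lm(y ~ Time, data = UScrime)
# The two sets of estimates should agree up to floating-point error
c(formula_intercept = beta0_formula, formula_slope = beta1_formula)
coef(lm_bivariate)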
# Dependent variable
y <- as.vector(UScrime$y)
# Matrix of feature variables from UScrime (excluding the response variable)
X <- as.matrix(UScrime[-which(names(UScrime) == "y")])
# rep() replicates the values in x; build a vector of ones with the same length as rows in UScrime
int <- rep(1, length(y))
# cbind() combines vectors, matrices or data frames by columns; add the intercept column to X
X <- cbind(int, X)
# Compute beta coefficients manually
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas_check <- solve(crossprod(X), crossprod(X, y))
# Check if the manually computed betas match those from the efficient method
betas == round(betas_check, 2)
[,1]
int FALSE
M FALSE
So FALSE
Ed FALSE
Po1 FALSE
Po2 FALSE
LF FALSE
M.F FALSE
Pop FALSE
NW FALSE
U1 FALSE
U2 FALSE
GDP FALSE
Ineq FALSE
Prob FALSE
Time FALSE
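The all-FALSE column above only reflects that the code compares the unrounded betas to the rounded betas_check; compared at matching precision (or with all.equal()), the two methods agree:

# Compare the two solution methods at the same precision
all.equal(as.vector(betas), as.vector(betas_check))
round(betas, 2) == round(betas_check, 2)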
# Fit the linear model
lm.mod <- lm(y ~ ., data = UScrime)
# Round the coefficients for easier viewing
lm.betas <- round(lm.mod$coefficients, 2)
# Create data.frame of results (see ?data.frame)
results <- data.frame(our.results = betas, lm.results = lm.betas)
print(results)
our.results lm.results
int -5984.2876045 -5984.29
M 8.7830173 8.78
So -3.8034503 -3.80
Ed 18.8324315 18.83
Po1 19.2804338 19.28
Po2 -10.9421925 -10.94
LF -0.6638261 -0.66
M.F 1.7406856 1.74
Pop -0.7330081 -0.73
NW 0.4204461 0.42
U1 -5.8271027 -5.83
U2 16.7799672 16.78
GDP 0.9616624 0.96
Ineq 7.0672099 7.07
Prob -4855.2658155 -4855.27
Time -3.4790178 -3.48
We now try the same exercise with income as the dependent variable in the CASchools dataset.
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(where(is.numeric))
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled[-which(names(CASchools_scaled) == "income")])
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Check for multicollinearity
vif_values1 <- vif(lm(income ~ ., data = CASchools_scaled))
print(vif_values1)
students teachers calworks lunch computer expenditure
193.785829 216.182451 2.630874 6.977956 8.875152 1.243668
english read math
2.600597 12.890113 7.521308
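students and teachers (and to a lesser extent read) sit well above the common VIF threshold of 10, which makes sense because district size drives both counts. Purely as an illustration (not part of the model used below), one could drop one of the collinear pair and recheck:

# Refit without teachers and recheck the variance inflation factors
vif(lm(income ~ . - teachers, data = CASchools_scaled))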
# If VIF values are high, consider removing highly correlated predictors
# Compute beta coefficients manually
betas1 <- solve(t(b) %*% b) %*% t(b) %*% a
betas_check1 <- solve(crossprod(b), crossprod(b, a))
# Check if the manually computed betas match those from the efficient method
# (as above, comparing unrounded to rounded values returns FALSE even when they agree)
betas1 == round(betas_check1, 2)
[,1]
int1 FALSE
students FALSE
teachers FALSE
calworks FALSE
lunch FALSE
computer FALSE
expenditure FALSE
english FALSE
read FALSE
math FALSE
# Fit the linear model
lm.mod1 <- lm(income ~ ., data = CASchools_scaled)
# Round the coefficients for easier viewing
lm.betas1 <- round(lm.mod1$coefficients, 2)
# Create data.frame of results
results1 <- data.frame(our.results1 = betas1, lm.results1 = lm.betas1)
print(results1)
Check out a few .Rmd files (R Markdown extension) in our class Dropbox folder (link in Course Introduction and Materials), and feel free to adapt any class (or online) code for this discussion. Make sure you understand and adapt the code for your own purposes.
UScrime. The UScrime dataset from the MASS package in R is a cross-sectional dataset. It provides aggregate data on crime rates and related socio-economic variables for 47 US states, collected at a single point in time (1960). This type of data allows for analysis of relationships between variables within that specific timeframe, but it does not track changes over time.
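A quick way to pull up the variable definitions and structure in R (the help page documents the 16 variables, aggregated over the 47 states in 1960):

library(MASS)     # UScrime ships with MASS
help("UScrime")   # variable definitions
str(UScrime)      # 47 observations of 16 numeric variables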
It seems police expenditure in 1959 is not strongly related to the outcome, so we can examine every p-value to see which predictors belong in our final model:
# ??UScrime  # search the help system for the dataset documentation
# Fit the full linear regression model with y (crime rate) as the dependent variable
model <- lm(y ~ ., data = UScrime)
# Get the summary of the model to check p-values
model_summary <- summary(model)
# Extract coefficients and p-values from the model summary
coefficients_df <- as.data.frame(model_summary$coefficients)
# Extract p-values
p_values <- coefficients_df[, "Pr(>|t|)"]
# Display all p-values
print("P-values of all factors:")
print(p_values)
# Keep the factors significant at the 5% level (excluding the intercept) for the plots below
significant_factors <- rownames(coefficients_df)[p_values < 0.05 & rownames(coefficients_df) != "(Intercept)"]
# Plot regression for the significant factors and add the regression equation to the plot
for (factor in significant_factors) {
  # Fit a bivariate regression model for the specific factor
  factor_model <- lm(as.formula(paste("y ~", factor)), data = UScrime)
  # Extract the intercept and slope from the model
  intercept <- round(coef(factor_model)[1], 2)
  slope <- round(coef(factor_model)[2], 2)
  # Create the regression equation as a string
  reg_eq <- paste("y = ", intercept, " + ", slope, "*", factor, sep = "")
  # Create scatter plot with regression line
  # (aes_string() is deprecated, hence the warning below; aes(.data[[factor]]) is the modern idiom)
  p <- ggplot(UScrime, aes_string(x = factor, y = "y")) +
    geom_point() +                           # Points for each observation
    geom_smooth(method = "lm", se = TRUE) +  # Add regression line with standard error bands
    labs(title = paste("Crime Rate vs", factor),
         x = factor,
         y = "Crime Rate") +
    annotate("text", x = Inf, y = Inf, label = reg_eq,
             hjust = 1.1, vjust = 1.1, size = 5, color = "red") +
    theme_minimal()
  print(p)
}
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
The CASchools dataset from the AER package in R is a cross-sectional dataset. It provides data on educational and demographic variables for 420 school districts in California, collected at a single point in time (the 1998-1999 academic year). This type of data allows for analysis of relationships between variables within that specific timeframe, but it does not track changes over time.
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(where(is.numeric))
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled[-which(names(CASchools_scaled) == "income")])
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Fit the full linear regression model with income as the dependent variable
model1 <- lm(income ~ ., data = CASchools_scaled)
# Get the summary of the model to check p-values
model_summary1 <- summary(model1)
# Extract coefficients and p-values from the model summary
coefficients_df1 <- as.data.frame(model_summary1$coefficients)
# Extract p-values
p_values1 <- coefficients_df1[, "Pr(>|t|)"]
# Display all p-values
print("P-values of all factors:")
print(p_values1)
# Keep the factors significant at the 5% level (excluding the intercept) for the plots below
significant_factors1 <- rownames(coefficients_df1)[p_values1 < 0.05 & rownames(coefficients_df1) != "(Intercept)"]
# Plot regression for the significant factors and add the regression equation to the plot
for (factor in significant_factors1) {
  # Fit a bivariate regression model for the specific factor
  factor_model1 <- lm(as.formula(paste("income ~", factor)), data = CASchools_scaled)
  # Extract the intercept and slope from the model
  intercept1 <- round(coef(factor_model1)[1], 2)
  slope1 <- round(coef(factor_model1)[2], 2)
  # Create the regression equation as a string
  reg_eq1 <- paste("income = ", intercept1, " + ", slope1, "*", factor, sep = "")
  # Create scatter plot with regression line
  p1 <- ggplot(CASchools_scaled, aes_string(x = factor, y = "income")) +
    geom_point() +                           # Points for each observation
    geom_smooth(method = "lm", se = TRUE) +  # Add regression line with standard error bands
    labs(title = paste("Income vs", factor),
         x = factor,
         y = "Income") +
    annotate("text", x = Inf, y = Inf, label = reg_eq1,
             hjust = 1.1, vjust = 1.1, size = 5, color = "red") +
    theme_minimal()
  print(p1)
}
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
HINT: In particular, you will be able to replicate OLS_matrixVSlm.html output from the OLS_matrixVSlm.qmd file there, which may be relevant for this discussion.
I want you to choose and import any dataset, and briefly describe it (tell us the type of economic data, and the variable definitions).
Note: here I will choose UScrime.
You can use any in-built R dataset too - just type "data()" in R to see the available options (Section 3 may be helpful), or you can use the AER package to find more relevant Economics datasets.
Once you have the dataset, decide what multivariate regression you want to run i.e. what is your y variable and what are your X variables? Since this is a multivariate regression, you need at least 2 independent variables (else you are running a bi-variate regression - has only one independent variable). So for example, income may be my dependent variable, while age, experience, … may be my independent variables. TYPE OUT THE FULL ESTIMATING EQUATION.
As the code above suggested, for UScrime I'm going to use "M", "Ed", "Ineq", and "Prob" as my independent variables. The estimating equation is:
$$
y = \beta_0 + \beta_1 M + \beta_2 Ed + \beta_3 Ineq + \beta_4 Prob + \epsilon
$$
# Fit the multivariate regression model
lm.mod <- lm(y ~ M + Ed + Ineq + Prob, data = UScrime)
# Extract the coefficients for easier viewing
lm.betas <- round(x = lm.mod$coefficients, digits = 2)
# Print the model summary
summary(lm.mod)
Call:
lm(formula = y ~ M + Ed + Ineq + Prob, data = UScrime)
Residuals:
Min 1Q Median 3Q Max
-532.97 -254.03 -55.72 137.80 960.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1339.346 1247.011 -1.074 0.28893
M 3.597 5.339 0.674 0.50417
Ed 14.861 7.192 2.066 0.04499 *
Ineq 2.687 2.277 1.180 0.24458
Prob -7331.915 2560.270 -2.864 0.00651 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 347.5 on 42 degrees of freedom
Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077
# Create the design matrix X (including the intercept)
X <- model.matrix(~ M + Ed + Ineq + Prob, data = UScrime)
# Extract the dependent variable y
y <- UScrime$y
# Estimate betas using the formula beta_hat = inv(X'X) X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
# Print beta_hat to compare with lm results
round(beta_hat, digits = 2)
[,1]
(Intercept) -1339.35
M 3.60
Ed 14.86
Ineq 2.69
Prob -7331.92
# Compute residuals
residuals <- y - X %*% beta_hat
# Compute variance of residuals (sigma^2)
N <- nrow(X)       # Number of observations
k <- ncol(X) - 1   # Number of predictors
sigma_sq <- sum(residuals^2) / (N - k - 1)
# Compute the variance-covariance matrix of the betas
var_beta_hat <- sigma_sq * solve(t(X) %*% X)
# Extract the standard errors (square root of diagonal elements)
std_errors <- sqrt(diag(var_beta_hat))
# Print the standard errors
round(std_errors, digits = 2)
(Intercept) M Ed Ineq Prob
1247.01 5.34 7.19 2.28 2560.27
# Compare results from lm and manual matrix algebra
results <- data.frame(lm_results = lm.betas, matrix_results = round(beta_hat, 2))
print(results)
lm_results matrix_results
(Intercept) -1339.35 -1339.35
M 3.60 3.60
Ed 14.86 14.86
Ineq 2.69 2.69
Prob -7331.92 -7331.92
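The manually computed standard errors can be checked against lm() in the same way; a minimal sketch:

# Manual standard errors vs those reported by summary(lm.mod)
data.frame(manual_se = round(std_errors, 2),
           lm_se = round(summary(lm.mod)$coefficients[, "Std. Error"], 2))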
As the code above suggested, for CASchools I'm going to use "students", "teachers", "lunch", "expenditure", "english", and "math" as my independent variables. The estimating equation is:
$$
income = \beta_0 + \beta_1 students + \beta_2 teachers + \beta_3 lunch + \beta_4 expenditure + \beta_5 english + \beta_6 math + \epsilon
$$
# Check for missing values
sum(is.na(CASchools))
[1] 0
# If there are missing values, you can handle them by removing or imputing
CASchools <- na.omit(CASchools)
# Select only numeric columns
CASchools_numeric <- CASchools %>% select(students, teachers, lunch, expenditure, english, math, income)
# Standardize the data (excluding the response variable)
CASchools_scaled <- CASchools_numeric %>% mutate(across(-income, scale))
# Dependent variable
a <- as.vector(CASchools_scaled$income)
# Matrix of feature variables from CASchools (excluding the response variable)
b <- as.matrix(CASchools_scaled %>% select(-income))
# Vector of ones with the same length as rows in CASchools
int1 <- rep(1, length(a))
# Add intercept column to b
b <- cbind(int1, b)
# Fit the multivariate regression model
lm.mod1 <- lm(income ~ students + teachers + lunch + expenditure + english + math, data = CASchools_scaled)
# Extract the coefficients for easier viewing
lm.betas1 <- round(x = lm.mod1$coefficients, digits = 2)
# Print the model summary
summary(lm.mod1)
Call:
lm(formula = income ~ students + teachers + lunch + expenditure +
english + math, data = CASchools_scaled)
Residuals:
Min 1Q Median 3Q Max
-9.5752 -2.6100 -0.5509 1.9055 26.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.3166 0.2145 71.409 < 2e-16 ***
students -6.8499 2.9195 -2.346 0.0194 *
teachers 7.4466 2.9130 2.556 0.0109 *
lunch -3.8894 0.4159 -9.351 < 2e-16 ***
expenditure 1.6994 0.2250 7.555 2.73e-13 ***
english 1.7130 0.3049 5.619 3.54e-08 ***
math 2.5657 0.3867 6.635 1.02e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.396 on 413 degrees of freedom
Multiple R-squared: 0.6352, Adjusted R-squared: 0.6299
F-statistic: 119.9 on 6 and 413 DF, p-value: < 2.2e-16
# Create the design matrix b (including the intercept)
b <- model.matrix(~ students + teachers + lunch + expenditure + english + math, data = CASchools_scaled)
# Estimate betas using the formula beta_hat = inv(b'b) b'a
beta_hat1 <- solve(t(b) %*% b) %*% t(b) %*% a
# Print beta_hat to compare with lm results
round(beta_hat1, digits = 2)
[,1]
(Intercept) 15.32
students -6.85
teachers 7.45
lunch -3.89
expenditure 1.70
english 1.71
math 2.57
# Compute residuals
residuals1 <- a - b %*% beta_hat1
# Compute variance of residuals (sigma^2)
N1 <- nrow(b)       # Number of observations
k1 <- ncol(b) - 1   # Number of predictors
sigma_sq1 <- sum(residuals1^2) / (N1 - k1 - 1)
# Compute the variance-covariance matrix of the betas
var_beta_hat1 <- sigma_sq1 * solve(t(b) %*% b)
# Extract the standard errors (square root of diagonal elements)
std_errors1 <- sqrt(diag(var_beta_hat1))
# Print the standard errors
round(std_errors1, digits = 2)
(Intercept) students teachers lunch expenditure english
0.21 2.92 2.91 0.42 0.22 0.30
math
0.39
# Compare results from lm and manual matrix algebra
results1 <- data.frame(lm_results = lm.betas1, matrix_results = round(beta_hat1, 2))
print(results1)
lm_results matrix_results
(Intercept) 15.32 15.32
students -6.85 -6.85
teachers 7.45 7.45
lunch -3.89 -3.89
expenditure 1.70 1.70
english 1.71 1.71
math 2.57 2.57
To learn how to type out the equation, read here. Make sure it is well aligned - see the .Rmd file corresponding to OLS_matrixVSlm.pdf if needed.
Now, first estimate the multivariate regression (linear relationship) above with the "lm" command. The syntax is very simple, and you will find many blogs on this - just find the relevant sections, as regression is a large topic. (E.g. the subsection Multiple regression: biking, smoking, and heart disease may be helpful too). Do not worry too much about causality or interpretation of beta coefficients for this discussion.
Second, confirm if you get the same point estimates/coefficients/betas with the matrix algebra formula that we discussed in class - inv(X’X) X’y
You should get the same answer. If not, check if you specified the intercept term in your matrix X by adding a unit vector or not.
LMR Section 2.6 can be useful if you get stuck (see the textbook folder in the class Dropbox link). Or see the hint / OLS_matrixVSlm.pdf file, and you can recode your dataset using OLS_matrixVSlm.Rmd file in W2 folder in class Dropbox folder.
Food for thought: Can you also estimate the standard error by hand, as you did for the point estimates using matrix algebra? Explain your logic as well, i.e. what is the intuition?
Standard errors can be calculated by first estimating the residual variance, followed by calculating the variance-covariance matrix of the coefficients. The square roots of the diagonal of this matrix give the standard errors.
The intuition behind this is that the residual variance measures the error in the model’s predictions, and the variance-covariance matrix captures how much the estimated coefficients vary based on the model’s structure. Standard errors quantify the uncertainty in each coefficient.
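In formula form, matching the manual computation in the chunks above:
\[
\hat\sigma^2=\frac{\hat e'\hat e}{N-k-1},\qquad
\widehat{\mathrm{Var}}(\hat\beta)=\hat\sigma^2\,(X'X)^{-1},\qquad
\mathrm{SE}(\hat\beta_j)=\sqrt{\left[\widehat{\mathrm{Var}}(\hat\beta)\right]_{jj}},
\]
where \(\hat e = y - X\hat\beta\) are the residuals, \(N\) is the number of observations, and \(k\) the number of predictors excluding the intercept.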