library(tinytex)

1.Data

For this discussion, I will use two datasets Swiss and NaturalGas.

1.1 Dataset1 Swiss

Swiss contains standardized fertility measure and socioeconomic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.

1.Fertility ‘common standardized fertility measure’

2.Agriculture % of males involved in agriculture as occupation

3.Examination % draftees receiving highest mark on army examination

4.Education % education beyond primary school for draftees

5.Catholic % ‘catholic’(as opposed to ‘protestant’)

6.Infant.Mortality live births who live less than 1 year

1.2 Dataset2 NaturalGas

NaturalGas is a panel data originating from 6 US states over the period 1967-1989.But for ease of calculation, I have selected only four variables to study in this discussion.

1.price Price of natural gas

2.eprice Price of electricity

3.oprice Price of distillate fuel oil

4.lprice Price of liquefied petroleum gas

library(datasets)
data("swiss")
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("NaturalGas")

2. PART A Bivariate Regression and Cov-Var Formula

2.1Dataset1 swiss

\[ Fertility_i = \beta_0 + \beta_1 Education_i + \epsilon_i \]

2.1.1 Estimating Equation

Regression resultCov / Var

beta1 = cov(swiss$Fertility,swiss$Education)/var(swiss$Education) 
beta1
## [1] -0.8623503

2.1.2 Regression Result

reg1 <- lm(data=swiss,
           formula = Fertility~ Education)

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(reg1, type = "text" ) 
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                              Fertility         
## -----------------------------------------------
## Education                    -0.862***         
##                               (0.145)          
##                                                
## Constant                     79.610***         
##                               (2.104)          
##                                                
## -----------------------------------------------
## Observations                    47             
## R2                             0.441           
## Adjusted R2                    0.428           
## Residual Std. Error       9.446 (df = 45)      
## F Statistic           35.446*** (df = 1; 45)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The two methods yield the same β1

2.2 Dataset2 NaturalGas

2.2.1 Estimating Equation

\[ Price_i = \beta_0 + \beta_1 eprice_i + \epsilon_i \]

2.2.2 Cov / Var

beta1 = cov(NaturalGas$price,NaturalGas$eprice)/var(NaturalGas$eprice) 
beta1
## [1] 0.787173

2.2.3 Regression Result

reg2 <- lm(data=NaturalGas,
           formula = price~ eprice)

library(stargazer)
stargazer(reg2, type = "text" ) 
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                price           
## -----------------------------------------------
## eprice                       0.787***          
##                               (0.026)          
##                                                
## Constant                     -0.556***         
##                               (0.145)          
##                                                
## -----------------------------------------------
## Observations                    138            
## R2                             0.875           
## Adjusted R2                    0.874           
## Residual Std. Error      0.769 (df = 136)      
## F Statistic          952.794*** (df = 1; 136)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The two methods yield the same β1

2.3 Question

Variance measures the dispersion of individual data points in a data set from the mean value.

Covariance measures the relationship between x and y. If the covariance is positive, x and y will increase and decrease simultaneously. But if the covariance is negative, x and y will change in opposite directions.

When we use covariance to divide by the variance of x, it means removing the volatility of x from the covariance to quantify the effect between x and y better.

3.PART B Multivariate Regression and Matrix Algebra Formula

3.1 Dataset1 swiss

str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(swiss)
## Rows: 47
## Columns: 6
## $ Fertility        <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4,…
## $ Agriculture      <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8, 53.3,…
## $ Examination      <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, 14, 19, …
## $ Education        <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12, 5, 2, …
## $ Catholic         <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16,…
## $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24.9, 21.0,…

3.1.1 Estimating Equation

\[ Fertility_i = \beta_0 + \beta_1 Agriculture_i +\beta_2 Examination_i + \beta_3 Education_i + \beta_4 Catholic_i + \beta_5 Infant.Mortality_i + \epsilon_i \]

3.1.2Data Processing

y <- as.vector(swiss$Fertility)

x <- as.matrix(subset(swiss,select=-Fertility))

int <- rep(x = 1, 
           times =  length(y))

x <- cbind(int, x)

remove(int)
betas <- solve(t(x) %*% x) %*% t(x) %*% y

betas <- round(x = betas, 
               digits = 2)

betas
##                   [,1]
## int              66.92
## Agriculture      -0.17
## Examination      -0.26
## Education        -0.87
## Catholic          0.10
## Infant.Mortality  1.08
betas_check <- solve(crossprod(x), crossprod(x,y))

betas == round(x = betas_check, 
               digits = 2
               )
##                  [,1]
## int              TRUE
## Agriculture      TRUE
## Examination      TRUE
## Education        TRUE
## Catholic         TRUE
## Infant.Mortality TRUE
lm.mod <- lm(Fertility ~ ., data=swiss)

lm.betas <- round(x = lm.mod$coefficients, 
                  digits = 2)

?data.frame
## starting httpd help server ... done
results <- data.frame(our.results=betas, lm.results=lm.betas)

print(results)
##                  our.results lm.results
## int                    66.92      66.92
## Agriculture            -0.17      -0.17
## Examination            -0.26      -0.26
## Education              -0.87      -0.87
## Catholic                0.10       0.10
## Infant.Mortality        1.08       1.08

The two methods yield the same β1, β2, β3, β4 andβ5

3.2 Dataset2 NaturalGas

str(NaturalGas)
## 'data.frame':    138 obs. of  10 variables:
##  $ state      : Factor w/ 6 levels "CA","FL","MI",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ statecode  : Factor w/ 6 levels "5","10","23",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ year       : Factor w/ 23 levels "1967","1968",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ consumption: int  313656 319282 331326 346533 352085 363412 342608 341032 327384 339949 ...
##  $ price      : num  1.42 1.38 1.37 1.4 1.5 1.62 1.74 2 2.54 2.87 ...
##  $ eprice     : num  2.98 2.91 2.84 2.87 3.07 3.26 3.51 4.66 5.13 5.37 ...
##  $ oprice     : num  7.4 7.77 7.96 8.33 8.8 ...
##  $ lprice     : num  1.47 1.42 1.38 1.37 1.4 1.5 1.62 1.74 2 2.54 ...
##  $ heating    : int  6262 6125 6040 6085 5907 6248 5450 5858 5583 6238 ...
##  $ income     : num  10904 11370 11579 11587 11657 ...
library(tidyverse)
glimpse(NaturalGas)
## Rows: 138
## Columns: 10
## $ state       <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY…
## $ statecode   <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35…
## $ year        <fct> 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976…
## $ consumption <int> 313656, 319282, 331326, 346533, 352085, 363412, 342608, 34…
## $ price       <dbl> 1.42, 1.38, 1.37, 1.40, 1.50, 1.62, 1.74, 2.00, 2.54, 2.87…
## $ eprice      <dbl> 2.98, 2.91, 2.84, 2.87, 3.07, 3.26, 3.51, 4.66, 5.13, 5.37…
## $ oprice      <dbl> 7.40, 7.77, 7.96, 8.33, 8.80, 8.85, 10.08, 15.49, 16.37, 1…
## $ lprice      <dbl> 1.47, 1.42, 1.38, 1.37, 1.40, 1.50, 1.62, 1.74, 2.00, 2.54…
## $ heating     <int> 6262, 6125, 6040, 6085, 5907, 6248, 5450, 5858, 5583, 6238…
## $ income      <dbl> 10903.75, 11370.02, 11578.68, 11586.77, 11657.42, 11860.80…

3.2.1 Estimating Equation

\[ price_i = \beta_0 + \beta_1 eprice_i +\beta_2 oprice_i + \beta_3 lprice_i + \epsilon_i \]

3.2.2 Data Processing

y2 <- as.vector(NaturalGas$price)

library(dplyr)

x2 <- as.matrix(NaturalGas[,c("eprice","oprice","lprice")])

int2 <- rep(1, times = length(y2))

x2 <- cbind(int2, x2)

remove(int2)
betass <- solve(t(x2) %*% x2) %*% t(x2) %*% y2

betass <- round(betass,digits = 2)

betass
##        [,1]
## int2   0.03
## eprice 0.02
## oprice 0.02
## lprice 0.85
betass_check <- solve(crossprod(x2), crossprod(x2,y2))

betass == round(betass_check, 
               digits = 2
               )
##        [,1]
## int2   TRUE
## eprice TRUE
## oprice TRUE
## lprice TRUE
lm.mod2 <- lm(price ~ eprice + oprice + lprice, data = NaturalGas)

lm.betass <- round(lm.mod2$coefficients, 
                  digits = 2)

resultss <- data.frame(our.resultss = betass, lm.resultss = lm.betass)


print(resultss)
##        our.resultss lm.resultss
## int2           0.03        0.03
## eprice         0.02        0.02
## oprice         0.02        0.02
## lprice         0.85        0.85

The two methods yield the same β1, β2 andβ3

3.3 Question

Assume that Y = Xβ + ϵ, x is the matrix of variables

First step: we can compute β = (X’ X^-1)X’Y

Second step: e = Y- hat Y

Third step: calculate the variance of e

Fourth step: calculate the variance of β = Var(e)* (X’X)^-1

and then we can calculate the standard deviation of β