library(tinytex)
For this discussion, I will use two datasets Swiss and NaturalGas.
Swiss contains standardized fertility measure and socioeconomic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
1.Fertility ‘common standardized fertility measure’
2.Agriculture % of males involved in agriculture as occupation
3.Examination % draftees receiving highest mark on army examination
4.Education % education beyond primary school for draftees
5.Catholic % ‘catholic’(as opposed to ‘protestant’)
6.Infant.Mortality live births who live less than 1 year
NaturalGas is a panel data originating from 6 US states over the period 1967-1989.But for ease of calculation, I have selected only four variables to study in this discussion.
1.price Price of natural gas
2.eprice Price of electricity
3.oprice Price of distillate fuel oil
4.lprice Price of liquefied petroleum gas
library(datasets)
data("swiss")
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("NaturalGas")
\[ Fertility_i = \beta_0 + \beta_1 Education_i + \epsilon_i \]
Regression resultCov / Var
beta1 = cov(swiss$Fertility,swiss$Education)/var(swiss$Education)
beta1
## [1] -0.8623503
reg1 <- lm(data=swiss,
formula = Fertility~ Education)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(reg1, type = "text" )
##
## ===============================================
## Dependent variable:
## ---------------------------
## Fertility
## -----------------------------------------------
## Education -0.862***
## (0.145)
##
## Constant 79.610***
## (2.104)
##
## -----------------------------------------------
## Observations 47
## R2 0.441
## Adjusted R2 0.428
## Residual Std. Error 9.446 (df = 45)
## F Statistic 35.446*** (df = 1; 45)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The two methods yield the same β1
\[ Price_i = \beta_0 + \beta_1 eprice_i + \epsilon_i \]
beta1 = cov(NaturalGas$price,NaturalGas$eprice)/var(NaturalGas$eprice)
beta1
## [1] 0.787173
reg2 <- lm(data=NaturalGas,
formula = price~ eprice)
library(stargazer)
stargazer(reg2, type = "text" )
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## eprice 0.787***
## (0.026)
##
## Constant -0.556***
## (0.145)
##
## -----------------------------------------------
## Observations 138
## R2 0.875
## Adjusted R2 0.874
## Residual Std. Error 0.769 (df = 136)
## F Statistic 952.794*** (df = 1; 136)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The two methods yield the same β1
Variance measures the dispersion of individual data points in a data set from the mean value.
Covariance measures the relationship between x and y. If the covariance is positive, x and y will increase and decrease simultaneously. But if the covariance is negative, x and y will change in opposite directions.
When we use covariance to divide by the variance of x, it means removing the volatility of x from the covariance to quantify the effect between x and y better.
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(swiss)
## Rows: 47
## Columns: 6
## $ Fertility <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92.4, 82.4,…
## $ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67.8, 53.3,…
## $ Examination <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, 14, 19, …
## $ Education <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12, 5, 2, …
## $ Catholic <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85, 97.16,…
## $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24.9, 21.0,…
\[ Fertility_i = \beta_0 + \beta_1 Agriculture_i +\beta_2 Examination_i + \beta_3 Education_i + \beta_4 Catholic_i + \beta_5 Infant.Mortality_i + \epsilon_i \]
y <- as.vector(swiss$Fertility)
x <- as.matrix(subset(swiss,select=-Fertility))
int <- rep(x = 1,
times = length(y))
x <- cbind(int, x)
remove(int)
betas <- solve(t(x) %*% x) %*% t(x) %*% y
betas <- round(x = betas,
digits = 2)
betas
## [,1]
## int 66.92
## Agriculture -0.17
## Examination -0.26
## Education -0.87
## Catholic 0.10
## Infant.Mortality 1.08
betas_check <- solve(crossprod(x), crossprod(x,y))
betas == round(x = betas_check,
digits = 2
)
## [,1]
## int TRUE
## Agriculture TRUE
## Examination TRUE
## Education TRUE
## Catholic TRUE
## Infant.Mortality TRUE
lm.mod <- lm(Fertility ~ ., data=swiss)
lm.betas <- round(x = lm.mod$coefficients,
digits = 2)
?data.frame
## starting httpd help server ... done
results <- data.frame(our.results=betas, lm.results=lm.betas)
print(results)
## our.results lm.results
## int 66.92 66.92
## Agriculture -0.17 -0.17
## Examination -0.26 -0.26
## Education -0.87 -0.87
## Catholic 0.10 0.10
## Infant.Mortality 1.08 1.08
The two methods yield the same β1, β2, β3, β4 andβ5
str(NaturalGas)
## 'data.frame': 138 obs. of 10 variables:
## $ state : Factor w/ 6 levels "CA","FL","MI",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ statecode : Factor w/ 6 levels "5","10","23",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ year : Factor w/ 23 levels "1967","1968",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ consumption: int 313656 319282 331326 346533 352085 363412 342608 341032 327384 339949 ...
## $ price : num 1.42 1.38 1.37 1.4 1.5 1.62 1.74 2 2.54 2.87 ...
## $ eprice : num 2.98 2.91 2.84 2.87 3.07 3.26 3.51 4.66 5.13 5.37 ...
## $ oprice : num 7.4 7.77 7.96 8.33 8.8 ...
## $ lprice : num 1.47 1.42 1.38 1.37 1.4 1.5 1.62 1.74 2 2.54 ...
## $ heating : int 6262 6125 6040 6085 5907 6248 5450 5858 5583 6238 ...
## $ income : num 10904 11370 11579 11587 11657 ...
library(tidyverse)
glimpse(NaturalGas)
## Rows: 138
## Columns: 10
## $ state <fct> NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY, NY…
## $ statecode <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35…
## $ year <fct> 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976…
## $ consumption <int> 313656, 319282, 331326, 346533, 352085, 363412, 342608, 34…
## $ price <dbl> 1.42, 1.38, 1.37, 1.40, 1.50, 1.62, 1.74, 2.00, 2.54, 2.87…
## $ eprice <dbl> 2.98, 2.91, 2.84, 2.87, 3.07, 3.26, 3.51, 4.66, 5.13, 5.37…
## $ oprice <dbl> 7.40, 7.77, 7.96, 8.33, 8.80, 8.85, 10.08, 15.49, 16.37, 1…
## $ lprice <dbl> 1.47, 1.42, 1.38, 1.37, 1.40, 1.50, 1.62, 1.74, 2.00, 2.54…
## $ heating <int> 6262, 6125, 6040, 6085, 5907, 6248, 5450, 5858, 5583, 6238…
## $ income <dbl> 10903.75, 11370.02, 11578.68, 11586.77, 11657.42, 11860.80…
\[ price_i = \beta_0 + \beta_1 eprice_i +\beta_2 oprice_i + \beta_3 lprice_i + \epsilon_i \]
y2 <- as.vector(NaturalGas$price)
library(dplyr)
x2 <- as.matrix(NaturalGas[,c("eprice","oprice","lprice")])
int2 <- rep(1, times = length(y2))
x2 <- cbind(int2, x2)
remove(int2)
betass <- solve(t(x2) %*% x2) %*% t(x2) %*% y2
betass <- round(betass,digits = 2)
betass
## [,1]
## int2 0.03
## eprice 0.02
## oprice 0.02
## lprice 0.85
betass_check <- solve(crossprod(x2), crossprod(x2,y2))
betass == round(betass_check,
digits = 2
)
## [,1]
## int2 TRUE
## eprice TRUE
## oprice TRUE
## lprice TRUE
lm.mod2 <- lm(price ~ eprice + oprice + lprice, data = NaturalGas)
lm.betass <- round(lm.mod2$coefficients,
digits = 2)
resultss <- data.frame(our.resultss = betass, lm.resultss = lm.betass)
print(resultss)
## our.resultss lm.resultss
## int2 0.03 0.03
## eprice 0.02 0.02
## oprice 0.02 0.02
## lprice 0.85 0.85
The two methods yield the same β1, β2 andβ3
Assume that Y = Xβ + ϵ, x is the matrix of variables
First step: we can compute β = (X’ X^-1)X’Y
Second step: e = Y- hat Y
Third step: calculate the variance of e
Fourth step: calculate the variance of β = Var(e)* (X’X)^-1
and then we can calculate the standard deviation of β