Linear regression quantifies the relationship between two (or more) variables. We will cover it more thoroughly in the last week of the course.
We choose an outcome variable and write it as a function of other variables, called “covariates”. We then estimate the “best fit” line, which tells us whether the outcome is positively or negatively related to each covariate.
For example, the miles per gallon a car gets is a function of the horsepower and the weight of the car.
It does not matter if you instead model miles per gallon as a function of the weight of the car and the horsepower.
The order of the covariates does not matter! It is the same relationship.
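Written as an equation, the example looks roughly like the following. The symbols are assumptions introduced here for illustration and are not defined in these notes: $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are the slopes, $\varepsilon_i$ is the error term, and $i$ indexes cars.

$$
\text{mpg}_i = \beta_0 + \beta_1\,\text{hp}_i + \beta_2\,\text{wt}_i + \varepsilon_i
$$

Because addition is commutative, writing $\beta_2\,\text{wt}_i$ before $\beta_1\,\text{hp}_i$ describes exactly the same model.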
?datasets
library(help = "datasets")
df <- mtcars
?mtcars
A data frame with 32 observations on 11 (numeric) variables.
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lbs)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(df, type = "text")
##
## ============================================
## Statistic N   Mean   St. Dev.  Min     Max
## --------------------------------------------
## mpg       32 20.091   6.027   10.400 33.900
## cyl       32  6.188   1.786     4       8
## disp      32 230.722 123.939  71.100 472.000
## hp        32 146.688  68.563    52     335
## drat      32  3.597   0.535   2.760   4.930
## wt        32  3.217   0.978   1.513   5.424
## qsec      32 17.849   1.787   14.500 22.900
## vs        32  0.438   0.504     0       1
## am        32  0.406   0.499     0       1
## gear      32  3.688   0.738     3       5
## carb      32  2.812   1.615     1       8
## --------------------------------------------
plot(x = df$mpg, y = df$hp)
plot(x = df$mpg, y = df$wt)
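As a numeric companion to the scatter plots above, the sketch below computes the Pearson correlation of `mpg` with each of the two covariates. It uses the built-in `mtcars` data directly; the names `cor_hp` and `cor_wt` are hypothetical and chosen only for this illustration.

```r
# Pearson correlation of mpg with each covariate used in these notes
# (cor_hp and cor_wt are illustrative names, not from the original notes)
cor_hp <- cor(mtcars$mpg, mtcars$hp)
cor_wt <- cor(mtcars$mpg, mtcars$wt)

# sign() returns -1 for a negative and +1 for a positive correlation
sign(c(hp = cor_hp, wt = cor_wt))
```

Both correlations are negative: cars with more horsepower or more weight tend to get fewer miles per gallon.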
reg1 <- lm(data = df,
           formula = mpg ~ hp + wt)
reg2 <- lm(data = df,
           formula = mpg ~ wt + hp)
stargazer(reg1, reg2, type = "text")
##
## ==========================================================
##                                   Dependent variable:
##                               ----------------------------
##                                           mpg
##                                    (1)            (2)
## ----------------------------------------------------------
## hp                              -0.032***      -0.032***
##                                  (0.009)        (0.009)
##
## wt                              -3.878***      -3.878***
##                                  (0.633)        (0.633)
##
## Constant                        37.227***      37.227***
##                                  (1.599)        (1.599)
##
## ----------------------------------------------------------
## Observations                       32             32
## R2                                0.827          0.827
## Adjusted R2                       0.815          0.815
## Residual Std. Error (df = 29)     2.593          2.593
## F Statistic (df = 2; 29)        69.211***      69.211***
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01
Exactly the same outcome! Both models have identical coefficients, standard errors, and fit statistics.
We only chose 2 of the 11 variables in the example above. In the real world, we have many more variables, and people may not know how many different specifications exist.
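The same conclusion can be checked programmatically rather than by reading the table. This is a minimal sketch that refits both models on the built-in `mtcars` data; `fit_a`, `fit_b`, and `same_coefs` are hypothetical names used only here.

```r
# Fit the same model with the covariates written in both orders
# (fit_a and fit_b are illustrative names)
fit_a <- lm(mpg ~ hp + wt, data = mtcars)
fit_b <- lm(mpg ~ wt + hp, data = mtcars)

# Reorder fit_b's coefficients to match fit_a's names, then compare
# up to numerical tolerance
same_coefs <- all.equal(coef(fit_a), coef(fit_b)[names(coef(fit_a))])
same_coefs
```

`all.equal()` is used instead of `==` because the two fits may differ by tiny floating-point rounding even though they describe the same model.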
For example:
`mpg` as a function of `cyl` and `disp` is one specification.
`mpg` as a function of `cyl` and `hp` is another specification.
`mpg` as a function of `drat` and `wt` is yet another specification, and so on.
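To make the idea of a "specification" concrete, the sketch below lists every two-covariate model for `mpg` that can be built from the other columns of `mtcars`. The names `covariates` and `specs` are hypothetical, and treating all ten remaining columns as candidate covariates is an assumption made for illustration.

```r
# Candidate covariates: every mtcars column except the outcome mpg
covariates <- setdiff(names(mtcars), "mpg")

# Build one formula per unordered pair of covariates, e.g. mpg ~ cyl + disp.
# simplify = FALSE keeps the result as a list of formula objects.
specs <- combn(covariates, 2,
               FUN = function(v) reformulate(v, response = "mpg"),
               simplify = FALSE)

length(specs)  # number of two-covariate specifications
specs[[1]]     # the first specification in the list
```

Each element of `specs` can be passed to `lm()` as its `formula` argument, for example `lm(specs[[1]], data = mtcars)`.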
If we have 160 variables, how many different regression specifications are possible if we choose 1, 2, 3, or 4 of them as covariates? Because the order of the covariates does not matter, each count is a binomial coefficient ("160 choose k").
160 * 159 / factorial(2)
## [1] 12720
choose(n = 160, k = 2)
## [1] 12720
160 * 159 * 158 / factorial(3)
## [1] 669920
choose(n = 160, k = 3)
## [1] 669920
choose(n = 160, k = 4)
## [1] 26294360
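The counts for each model size can also be computed in one step and added up. This sketch relies on `choose()` being vectorised over `k`; the names `k`, `counts`, and `total` are hypothetical.

```r
# Number of specifications with 1, 2, 3 and 4 covariates out of 160
k <- 1:4
counts <- choose(160, k)
names(counts) <- paste0("k=", k)
counts

# Total number of specifications with at most 4 covariates
total <- sum(counts)
total
```

With a single covariate there are simply 160 specifications; the total is dominated by the four-covariate case.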
mad
## function (x, center = median(x), constant = 1.4826, na.rm = FALSE,
## low = FALSE, high = FALSE)
## {
## if (na.rm)
## x <- x[!is.na(x)]
## n <- length(x)
## constant * if ((low || high) && n%%2 == 0) {
## if (low && high)
## stop("'low' and 'high' cannot be both TRUE")
## n2 <- n%/%2 + as.integer(high)
## sort(abs(x - center), partial = n2)[n2]
## }
## else median(abs(x - center))
## }
## <bytecode: 0x1121884f0>
## <environment: namespace:stats>
sd
## function (x, na.rm = FALSE)
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
## na.rm = na.rm))
## <bytecode: 0x13432bf20>
## <environment: namespace:stats>
ggplot
# library
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.2.3
# Build dataset with different distributions
data <- data.frame(
  type = c(rep("variable 1", 1000), rep("variable 2", 1000)),
  value = c(rnorm(1000), rnorm(1000, mean = 4))
)
# Represent it
p <- data %>%
  ggplot(aes(x = value, fill = type)) +
  geom_histogram(color = "#e9ecef", alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  theme_ipsum() +
  labs(fill = "")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.