1 Linear Regression

Linear regression quantifies the relationship between two (or more) variables. We will cover it more thoroughly in the last week of the course.

We choose an outcome variable, write it as a function of other variables called “covariates”, and estimate the “best fit” line. The sign of each estimated slope tells us whether the outcome is positively or negatively associated with that covariate.

For example, the miles per gallon a car gets is a function of the horsepower and the weight of the car.
It makes no difference if you instead model miles per gallon as a function of the weight of the car and its horsepower.

The order of the covariates does not matter! It is the same relationship!

1.1 Implementation (not required to know for midterm)

# Browse the datasets that ship with R
?datasets
library(help = "datasets")

# Store mtcars under a shorter name and open its help page
df <- mtcars
?mtcars

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

1.2 Summary Statistics

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(df, type = "text")

## 
## ============================================
## Statistic N   Mean   St. Dev.  Min     Max  
## --------------------------------------------
## mpg       32 20.091   6.027   10.400 33.900 
## cyl       32  6.188   1.786     4       8   
## disp      32 230.722 123.939  71.100 472.000
## hp        32 146.688  68.563    52     335  
## drat      32  3.597   0.535   2.760   4.930 
## wt        32  3.217   0.978   1.513   5.424 
## qsec      32 17.849   1.787   14.500 22.900 
## vs        32  0.438   0.504     0       1   
## am        32  0.406   0.499     0       1   
## gear      32  3.688   0.738     3       5   
## carb      32  2.812   1.615     1       8   
## --------------------------------------------

1.3 Linear Regression

# Scatter plots of mpg against each of the two covariates
plot(x = df$mpg, y = df$hp)

plot(x = df$mpg, y = df$wt)

reg1 <- lm(data = df, 
   formula = mpg ~ hp + wt)

reg2 <- lm(data = df, 
   formula = mpg ~ wt + hp)

stargazer(reg1, reg2, type = "text")

## 
## ==========================================================
##                                   Dependent variable:     
##                               ----------------------------
##                                           mpg             
##                                    (1)            (2)     
## ----------------------------------------------------------
## hp                              -0.032***      -0.032***  
##                                  (0.009)        (0.009)   
##                                                           
## wt                              -3.878***      -3.878***  
##                                  (0.633)        (0.633)   
##                                                           
## Constant                        37.227***      37.227***  
##                                  (1.599)        (1.599)   
##                                                           
## ----------------------------------------------------------
## Observations                        32            32      
## R2                                0.827          0.827    
## Adjusted R2                       0.815          0.815    
## Residual Std. Error (df = 29)     2.593          2.593    
## F Statistic (df = 2; 29)        69.211***      69.211***  
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

Exactly the same outcome!
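The same point can be checked in code rather than by reading the table. This is a minimal sketch using the built-in mtcars data; the names fit_a and fit_b are illustrative and do not appear in the notes.

```r
# Fit the same model with the covariates listed in both orders
fit_a <- lm(mpg ~ hp + wt, data = mtcars)
fit_b <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients come back in formula order, so align them by name before comparing
same_coefs <- all.equal(coef(fit_a), coef(fit_b)[names(coef(fit_a))])

# The fitted values do not depend on the order either
same_fit <- all.equal(fitted(fit_a), fitted(fit_b))

print(same_coefs)
print(same_fit)
```

Only the row order of the coefficient table changes between the two fits, which is why the comparison matches coefficients by name.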

We chose only 2 covariates from the 11 variables in the example above. In the real world, we have many more variables, and people may not know how many different specifications exist.

mpg as a function of cyl and disp is one specification.
mpg as a function of cyl and hp is another specification.
mpg as a function of drat and wt is yet another specification, and so on.
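The list of specifications can also be generated instead of written out by hand. This is a rough sketch that assumes mpg is always the outcome and that each specification uses exactly two of the other ten columns; covariates, cov_pairs and specs are illustrative names.

```r
# Candidate covariates: every mtcars column except the outcome mpg
covariates <- setdiff(names(mtcars), "mpg")

# Each column of cov_pairs is one unordered pair of covariates
cov_pairs <- combn(covariates, 2)

# Turn each pair into a formula string such as "mpg ~ cyl + disp"
specs <- apply(cov_pairs, 2, function(p) paste("mpg ~", paste(p, collapse = " + ")))

length(specs)    # number of two-covariate specifications
head(specs, 3)   # the first few of them
```

With 10 candidate covariates this gives choose(10, 2) = 45 specifications. Each string can be passed to lm() through as.formula() if you want to fit them all.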

2 Real World Example

QUESTION:

If we have 160 variables, how many different regression specifications are possible if we choose 1 variable, 2 variables, 3 variables and 4 variables?


2.1 Choosing 2 variables out of 160 variables

160 * 159 / factorial(2)

## [1] 12720

choose(n = 160, k = 2)

## [1] 12720

2.2 Choosing 3 variables out of 160 variables

160 * 159 * 158 / factorial(3)

## [1] 669920

choose(n = 160, k = 3)

## [1] 669920

2.3 Choosing 4 variables out of 160 variables

choose(n = 160, k = 4)

## [1] 26294360
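Section 2.3 shows only the choose() call. Following the pattern of 2.1 and 2.2, the same count can be written out by hand, and the four counts can be added up to answer the question as posed. This sketch assumes a "specification" is an unordered set of covariates, as argued in 1.3; by_hand and counts are illustrative names.

```r
# Four variables by hand: ordered picks divided by the 4! orderings of each set
by_hand <- 160 * 159 * 158 * 157 / factorial(4)
all.equal(by_hand, choose(n = 160, k = 4))

# Counts for 1, 2, 3 and 4 variables, then their total
counts <- choose(n = 160, k = 1:4)
counts
sum(counts)
```

The total is 160 + 12720 + 669920 + 26294360 = 26977160 possible specifications.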

3 Opening up Functions

mad

## function (x, center = median(x), constant = 1.4826, na.rm = FALSE, 
##     low = FALSE, high = FALSE) 
## {
##     if (na.rm) 
##         x <- x[!is.na(x)]
##     n <- length(x)
##     constant * if ((low || high) && n%%2 == 0) {
##         if (low && high) 
##             stop("'low' and 'high' cannot be both TRUE")
##         n2 <- n%/%2 + as.integer(high)
##         sort(abs(x - center), partial = n2)[n2]
##     }
##     else median(abs(x - center))
## }
## <bytecode: 0x1121884f0>
## <environment: namespace:stats>

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x13432bf20>
## <environment: namespace:stats>
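Reading the source makes it possible to reproduce these functions by hand. This is a rough sketch covering the default arguments only (no NA handling, no low or high options); my_sd and my_mad are hypothetical names used for illustration.

```r
# Sample standard deviation: square root of the variance with an n - 1 denominator
my_sd <- function(x) sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Median absolute deviation, scaled by the default constant 1.4826 seen in mad()
my_mad <- function(x) 1.4826 * median(abs(x - median(x)))

# Compare against the built-in versions on one mtcars column
x <- mtcars$mpg
all.equal(my_sd(x), sd(x))
all.equal(my_mad(x), mad(x))
```

Both comparisons should return TRUE, since the hand-written versions follow the same formulas as the printed source for the default arguments.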

4 Playing with `ggplot`

# library
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(hrbrthemes)

## Warning: package 'hrbrthemes' was built under R version 4.2.3

# Build dataset with different distributions
data <- data.frame(
  type = c( rep("variable 1", 1000), rep("variable 2", 1000) ),
  value = c( rnorm(1000), rnorm(1000, mean=4) )
)

# Represent it
p <- data %>%
  ggplot( aes(x=value, fill=type)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
    scale_fill_manual(values=c("#69b3a2", "#404080")) +
    theme_ipsum() +
    labs(fill="")
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
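The message above appears because geom_histogram() was left at its default of 30 bins. One way to address it is sketched here. The binwidth of 0.25 and the seed are arbitrary assumptions, not values from the notes, and the hrbrthemes theme is dropped so that only ggplot2 is needed.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the simulated draws are reproducible
sim <- data.frame(
  type  = c(rep("variable 1", 1000), rep("variable 2", 1000)),
  value = c(rnorm(1000), rnorm(1000, mean = 4))
)

# Setting binwidth explicitly replaces the default of 30 bins
p2 <- ggplot(sim, aes(x = value, fill = type)) +
  geom_histogram(binwidth = 0.25, color = "#e9ecef", alpha = 0.6,
                 position = "identity") +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  labs(fill = "")
p2
```

Try a few binwidth values: a bin that is too wide hides the two peaks, while one that is too narrow makes the histogram noisy.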

Choosing Variables for Regression

Arvind Sharma

2024-03-20

