Regression Exercises

In this exercise, we will reproduce the regressions and graphics from the class slides and then work through a couple of examples. The purpose is to demonstrate how to run regressions, interpret output, and generate visualizations.

General housekeeping items

Let’s begin by loading libraries and clearing the environment:

library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(janitor)
library(scales)

rm(list=ls())

Set your working directory:

setwd('C:/YOURWD')

Generate ‘simulated’ data for correlation and regression

Let’s generate some simulated data to illustrate the concepts of correlation and regression. We will be using ‘randomness’ in this simulation, so let’s first set a seed so that our procedure is reproducible:

set.seed(42)

Next, let’s create a series of 4 datasets, each containing 100 observations on two correlated variables, ‘x’ and ‘y’. The first 3 datasets illustrate a positive correlation based on the equation y = x + noise, where the amount of ‘random’ noise increases across the datasets (‘little_noise’, ‘some_noise’, and ‘tons_noise’). The final dataset illustrates a negative correlation based on the equation y = 100 - x + noise.

#Create a dataset with 100 observations where 'x' is a randomly drawn number from 1-100.
pos_little_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>%
#Create a variable 'y' equal to 'x' plus 'noise', where 'noise' is a randomly drawn number between -5 and 5.
  mutate(y = x + sample(-5:5, size = 100, replace = TRUE)) %>% 
#Create a variable that labels this dataset.
  mutate(label = 'Sim 1: Positive - Little noise')

#Repeat for some_noise and tons_noise datasets (with increased noise)
pos_some_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>% 
  mutate(y = x + sample(-25:25, size = 100, replace = TRUE)) %>% 
  mutate(label = 'Sim 2: Positive - Some noise')

pos_tons_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>% 
  mutate(y = x + sample(-50:50, size = 100, replace = TRUE)) %>% 
  mutate(label = 'Sim 3: Positive - Tons noise')

neg_some_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>% 
  mutate(y = 100 - x + sample(-25:25, size = 100, replace = TRUE)) %>% 
  mutate(label = 'Sim 4: Negative - Some noise')

#Bind simulations together in one dataset
sim_data <- bind_rows(pos_little_noise, pos_some_noise, pos_tons_noise, neg_some_noise)

Plot datasets, determine correlations, and add trendlines

Plot ‘y’ on ‘x’ separately for each simulated dataset:

ggplot(sim_data, aes(x = x, y = y)) +
  geom_point(color = 'dodgerblue', alpha = 0.5) +
  facet_wrap(~label) 

Calculate correlations between ‘x’ and ‘y’ for each simulated dataset:

sim_data %>%
  group_by(label) %>%
  summarize(cor = cor(x,y)) 
# A tibble: 4 × 2
  label                             cor
  <chr>                           <dbl>
1 Sim 1: Positive - Little noise  0.994
2 Sim 2: Positive - Some noise    0.882
3 Sim 3: Positive - Tons noise    0.734
4 Sim 4: Negative - Some noise   -0.887

Let’s try this on real data. Plot price on carat weight using the diamonds dataset:

diamonds %>%
  filter(carat < 1.5) %>%
  ggplot(aes(y = price, x = carat)) +
  geom_point(color = 'dodgerblue', alpha = 0.3) +
  labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price') +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

Calculate correlations between price and carat:

diamonds %>%
  filter(carat < 1.5) %>%
  summarize(cor = cor(price,carat)) 
# A tibble: 1 × 1
    cor
  <dbl>
1 0.883

Recreate the plots above, adding manual trend lines:

ggplot(sim_data, aes(x = x, y = y)) +
  geom_point(color = 'dodgerblue', alpha = 0.5) +
  geom_abline(intercept = 100, slope = -1, color = 'red') +
  geom_abline(intercept = 0, slope = 1, color = 'red') +
  geom_abline(intercept = 50, slope = 0, color = 'red') +
  facet_wrap(~label)

diamonds %>%
  filter(carat < 1.5) %>%
  ggplot(aes(y = price, x = carat)) +
  geom_point(color = 'dodgerblue', alpha = 0.3) +
  geom_abline(intercept = 5000, slope = 0, color = 'red') +
  geom_abline(intercept = 0, slope = 5000, color = 'red') +
  geom_abline(intercept = -5000, slope = 10000, color = 'red') +
  labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price') +  
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

Determine the ‘best fit’ with regression

Instead of drawing trend lines by eye, we can ‘fit’ a model to our data. Below, we use a regression technique called ‘ordinary least squares’ (OLS) to determine the coefficient(s) and intercept that minimize the sum of the squared errors. Let’s add an OLS regression line to our plots:

ggplot(sim_data, aes(x = x, y = y)) +
  geom_point(color = 'dodgerblue', alpha = 0.5) +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  facet_wrap(~label)

diamonds %>%
  filter(carat < 1.5) %>%
  ggplot(aes(y = price, x = carat)) +
  geom_point(color = 'dodgerblue', alpha = 0.3) +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price') +  
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)
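
Before turning to R’s built-in tools, it helps to see what ‘least squares’ produces in the simple one-predictor case: the slope equals the covariance of x and y divided by the variance of x, and the intercept equals the mean of y minus the slope times the mean of x. Here is a minimal sketch that computes these by hand for the first simulated dataset (they should match the lm() output below):

#Hand-computed OLS estimates for y ~ x in pos_little_noise
slope_hat <- with(pos_little_noise, cov(x, y) / var(x))
intercept_hat <- with(pos_little_noise, mean(y) - slope_hat * mean(x))
c(intercept = intercept_hat, slope = slope_hat)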

In R we use lm() to estimate a regression model using OLS. Here is an illustration for each of our simulated datasets:

pln_reg <- lm(y ~ x, data = pos_little_noise)
summary(pln_reg)

Call:
lm(formula = y ~ x, data = pos_little_noise)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4178 -3.8101  0.1569  3.1569  5.7580 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.79762    0.65058  -1.226    0.223    
x            1.01321    0.01105  91.727   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.477 on 98 degrees of freedom
Multiple R-squared:  0.9885,    Adjusted R-squared:  0.9884 
F-statistic:  8414 on 1 and 98 DF,  p-value: < 2.2e-16

psn_reg <- lm(y ~ x, data = pos_some_noise)
summary(psn_reg)

Call:
lm(formula = y ~ x, data = pos_some_noise)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.6891 -14.5044  -0.4017  14.8686  26.6204 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.15390    3.42857  -0.628    0.531    
x            1.04850    0.05645  18.574   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.58 on 98 degrees of freedom
Multiple R-squared:  0.7788,    Adjusted R-squared:  0.7765 
F-statistic:   345 on 1 and 98 DF,  p-value: < 2.2e-16

ptn_reg <- lm(y ~ x, data = pos_tons_noise)
summary(ptn_reg)

Call:
lm(formula = y ~ x, data = pos_tons_noise)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.033 -27.751  -1.635  26.755  53.975 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -4.5267     5.9289  -0.764    0.447    
x             1.0788     0.1009  10.686   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.49 on 98 degrees of freedom
Multiple R-squared:  0.5382,    Adjusted R-squared:  0.5335 
F-statistic: 114.2 on 1 and 98 DF,  p-value: < 2.2e-16

nsn_reg <- lm(y ~ x, data = neg_some_noise)
summary(nsn_reg)

Call:
lm(formula = y ~ x, data = neg_some_noise)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.5151 -12.4843   0.0163  11.4702  24.5557 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.38760    3.03245   33.10   <2e-16 ***
x            -0.99798    0.05259  -18.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.56 on 98 degrees of freedom
Multiple R-squared:  0.7861,    Adjusted R-squared:  0.7839 
F-statistic: 360.2 on 1 and 98 DF,  p-value: < 2.2e-16
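
The summary() output above prints to the console as text. Because the broom package is already loaded, we can also collect the same information as tidy data frames, which is convenient for further wrangling; a quick sketch using the first model:

#tidy() returns one row per coefficient; glance() returns one row of model-level statistics (e.g., R-squared)
tidy(pln_reg)
glance(pln_reg)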

Stargazer is a great package for aggregating regression output and creating publication-quality tables:

stargazer(pln_reg, psn_reg, ptn_reg, nsn_reg, 
          type = 'html',
          out = 'noise_models.htm', 
          omit.stat=c('f', 'ser'), 
          title='Simulation Regression Results', 
          column.labels = c('Little','Some','Tons','Negative'),
          column.sep.width = '20pt')

Simulation Regression Results
==========================================================
                            Dependent variable: y
               -------------------------------------------
                Little       Some        Tons     Negative
                  (1)         (2)         (3)        (4)
----------------------------------------------------------
x              1.013***    1.049***    1.079***   -0.998***
               (0.011)     (0.056)     (0.101)    (0.053)
Constant        -0.798      -2.154      -4.527    100.388***
               (0.651)     (3.429)     (5.929)    (3.032)
----------------------------------------------------------
Observations      100         100         100        100
R2               0.988       0.779       0.538      0.786
Adjusted R2      0.988       0.777       0.533      0.784
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Same for the diamonds regression:

diamonds_filter <- diamonds %>%
  filter(carat < 1.5)  

diamonds_reg <- lm(price ~ carat, data = diamonds_filter)

stargazer(diamonds_reg, 
          type = 'html',
          out = 'diamond_model.htm', 
          omit.stat=c('f', 'ser'), 
          title='Diamond Regression Results')

Diamond Regression Results
===================================
           Dependent variable:
                  price
-----------------------------------
carat          6,858.921***
                 (16.690)
Constant      -1,758.280***
                 (12.408)
-----------------------------------
Observations      47,705
R2                 0.780
Adjusted R2        0.780
===================================
Note: *p<0.1; **p<0.05; ***p<0.01

Note that we can also determine the predicted or ‘fitted’ values for y using our regression estimates. For example, in the simple regression of price on carat above, let’s estimate the price of a 1 carat diamond.

diamonds_reg <- lm(price ~ carat, data = diamonds_filter)

augment(diamonds_reg, newdata = data.frame(carat = 1))
# A tibble: 1 × 2
  carat .fitted
  <dbl>   <dbl>
1     1   5101.
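
The same fitted value can also be obtained with base R’s predict() function:

#Base-R equivalent of the augment() call above
predict(diamonds_reg, newdata = data.frame(carat = 1))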

Regression illustration with Anscombe’s quartet

Anscombe’s quartet is a famous illustration of how regression estimates can oversimplify a trend and overweight outliers. Let’s begin by importing the ansc_quart.xlsx file from the course website:

ansc_quart <- read_excel('ansc_quart.xlsx')
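
If you cannot locate the Excel file, R also ships with a built-in anscombe dataset in wide format. The sketch below reshapes it into a similar long layout; the column names x, y, and dataset (with labels like ‘Dataset I’) are assumptions based on the code that follows:

#Optional fallback: reshape the built-in wide-format anscombe data (x1..x4, y1..y4)
#into long format with columns dataset, x, and y
ansc_quart_alt <- datasets::anscombe %>%
  pivot_longer(everything(),
               names_to = c('.value', 'dataset'),
               names_pattern = '(.)(.)') %>%
  mutate(dataset = paste('Dataset', c('I', 'II', 'III', 'IV')[as.numeric(dataset)]))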

This file contains 4 example datasets. As above, let’s plot and regress (note the extra features used for presentation purposes):

ggplot(ansc_quart, aes(x = x, y = y, color = dataset)) +
  geom_point(size = 3, alpha = .8) +
  facet_wrap(~ dataset) +
  stat_smooth(method = 'lm', se = FALSE, color = 'red', fullrange = TRUE) +
  theme(legend.position = 'none') 

Let’s look at the regression output in tabular form:

data_1_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset I',])
data_2_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset II',])
data_3_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset III',])
data_4_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset IV',])

stargazer(data_1_reg, data_2_reg, data_3_reg, data_4_reg, 
          type = 'html',
          out = 'ansc_models.htm', 
          omit.stat=c('f', 'ser'), 
          title='Anscombes Regression Results', 
          column.labels = c('Data I','Data II','Data III','Data IV'))

Anscombes Regression Results
==========================================================
                            Dependent variable: y
               -------------------------------------------
                Data I      Data II    Data III    Data IV
                  (1)         (2)         (3)        (4)
----------------------------------------------------------
x              0.500***    0.500***    0.500***   0.500***
               (0.118)     (0.118)     (0.118)    (0.118)
Constant       3.000**     3.001**     3.002**    3.002**
               (1.125)     (1.125)     (1.124)    (1.124)
----------------------------------------------------------
Observations      11          11          11         11
R2               0.667       0.666       0.666      0.666
Adjusted R2      0.629       0.629       0.629      0.629
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Regression with non-linear variables

Regressions are great at fitting lines to our data. While the model is ‘linear’ in its coefficients, the variables need not be. That is, a regression model can accommodate non-linear transformations of variables (e.g., squared or log terms). Returning to Dataset II above, let’s estimate a quadratic model that includes a squared term:

ggplot(ansc_quart[ansc_quart$dataset == 'Dataset II',], aes(x = x, y = y)) +
  geom_point(size = 3, color = 'dodgerblue', alpha = .8) +
  stat_smooth(method = 'lm', se = FALSE, formula = y ~ x + I(x^2), color = 'red', fullrange = TRUE) +
  labs(title = 'Anscombes Quartet - Dataset II - Quadratic')

data_2_reg_quad <- lm(y ~ x + I(x^2), data = ansc_quart[ansc_quart$dataset == 'Dataset II',])
summary(data_2_reg_quad)

Call:
lm(formula = y ~ x + I(x^2), data = ansc_quart[ansc_quart$dataset == 
    "Dataset II", ])

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.9957343  0.0043299   -1385   <2e-16 ***
x            2.7808392  0.0010401    2674   <2e-16 ***
I(x^2)      -0.1267133  0.0000571   -2219   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16

Log transformations of variables are frequently useful in improving model fit and interpretability (among other benefits). Recall that the relation between price and carat in the diamonds dataset was log-linear. Let’s re-estimate our regression above using the natural log of both price and carat.

diamonds %>%
  filter(carat < 1.5) %>%
  ggplot(aes(y = log(price), x = log(carat))) +
  geom_point(color = 'dodgerblue', alpha = 0.3) +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
    labs(title = 'Scatter Plot of Price versus Carat', x = 'Natural Log of Carat', y = 'Natural Log of Price') +  
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma)

diamonds_reg <- lm(log(price) ~ log(carat), data = diamonds_filter)

stargazer(diamonds_reg, 
          type = 'html',
          out = 'log_diamond_model.htm', 
          omit.stat=c('f', 'ser'), 
          title='Diamond Regression Results')

Diamond Regression Results
===================================
           Dependent variable:
               log(price)
-----------------------------------
log(carat)       1.692***
                 (0.002)
Constant         8.458***
                 (0.002)
-----------------------------------
Observations      47,705
R2                 0.914
Adjusted R2        0.914
===================================
Note: *p<0.1; **p<0.05; ***p<0.01

As above, we can also generate predictions for new data. Because the model is estimated in logs, we exponentiate the fitted value to convert it back into dollars. Let’s estimate the price of a 1 carat diamond.

diamonds_reg <- lm(log(price) ~ log(carat), data = diamonds_filter)

augment(diamonds_reg, newdata = data.frame(carat = 1)) %>% 
  mutate(pred_price = exp(.fitted))
# A tibble: 1 × 3
  carat .fitted pred_price
  <dbl>   <dbl>      <dbl>
1     1    8.46      4715.

Multiple regression

Regressions can have more than one ‘x’ or predictor variable. In fact, we can expand our regression models to estimate the effects of many variables. Using the diamonds dataset, let’s increase the number of variables in our model one step at a time:

diamonds_short1 <- lm(log(price) ~ log(carat), data = diamonds_filter)
diamonds_short2 <- lm(log(price) ~ as.numeric(clarity), data = diamonds_filter)
diamonds_long1 <- lm(log(price) ~ log(carat) + as.numeric(clarity), data = diamonds_filter)
diamonds_long2 <- lm(log(price) ~ log(carat) + as.numeric(clarity) + as.numeric(cut) + as.numeric(color), data = diamonds_filter)

stargazer(diamonds_short1, diamonds_short2, diamonds_long1, diamonds_long2, 
          type = 'html',
          out = 'multi_diamond_models.htm', 
          omit.stat=c('f', 'ser'), 
          column.labels = c('Short1','Short2','Long1','Long2'),
          title='Diamond Regression Results')

Diamond Regression Results
===========================================================================
                                   Dependent variable: log(price)
                     ------------------------------------------------------
                       Short1       Short2        Long1        Long2
                         (1)          (2)          (3)          (4)
---------------------------------------------------------------------------
log(carat)            1.692***                   1.827***     1.886***
                      (0.002)                    (0.002)      (0.001)
as.numeric(clarity)               -0.082***      0.114***     0.121***
                                   (0.002)       (0.001)      (0.0004)
as.numeric(cut)                                               0.031***
                                                              (0.001)
as.numeric(color)                                             -0.076***
                                                              (0.0004)
Constant              8.458***     7.922***      8.053***     8.195***
                      (0.002)      (0.011)       (0.002)      (0.003)
---------------------------------------------------------------------------
Observations           47,705       47,705        47,705       47,705
R2                      0.914        0.024         0.954        0.974
Adjusted R2             0.914        0.024         0.954        0.974
===========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
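
As an aside, the fit statistics in the table can also be pulled out programmatically with broom’s glance(); here is a sketch comparing the four specifications (the model labels are just for readability):

#One row of model-level fit statistics per specification
bind_rows(glance(diamonds_short1) %>% mutate(model = 'Short1'),
          glance(diamonds_short2) %>% mutate(model = 'Short2'),
          glance(diamonds_long1) %>% mutate(model = 'Long1'),
          glance(diamonds_long2) %>% mutate(model = 'Long2')) %>%
  select(model, r.squared, adj.r.squared, nobs)
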
Note

In this exercise, we treated all variables as numeric (even ordinal categorical/factor variables like clarity). Alternatively, we can estimate the effect of each level of a categorical variable separately by including an indicator variable for each level. See the example below.

#Convert the ordered factors to unordered factors so that lm() estimates a separate indicator for each level
diamonds_filter <- diamonds_filter %>%
  mutate(across(c('clarity', 'color', 'cut'), ~ factor(.x, ordered = FALSE)))

diamonds_long3 <- lm(log(price) ~ log(carat) + clarity + cut + color, data = diamonds_filter)

stargazer(diamonds_long3, 
          type = 'html',
          out = 'long_diamond_models.htm', 
          omit.stat=c('f', 'ser'), 
          column.labels = c('Long3'),
          title='Diamond Regression Results')

Diamond Regression Results
===================================
           Dependent variable:
               log(price)
                  Long3
-----------------------------------
log(carat)       1.889***
                 (0.001)
claritySI2       0.362***
                 (0.006)
claritySI1       0.521***
                 (0.006)
clarityVS2       0.673***
                 (0.006)
clarityVS1       0.744***
                 (0.006)
clarityVVS2      0.884***
                 (0.006)
clarityVVS1      0.955***
                 (0.006)
clarityIF        1.053***
                 (0.007)
cutGood          0.075***
                 (0.004)
cutVery Good     0.110***
                 (0.004)
cutPremium       0.139***
                 (0.004)
cutIdeal         0.156***
                 (0.004)
colorE          -0.057***
                 (0.002)
colorF          -0.100***
                 (0.002)
colorG          -0.165***
                 (0.002)
colorH          -0.250***
                 (0.002)
colorI          -0.377***
                 (0.003)
colorJ          -0.499***
                 (0.003)
Constant         7.934***
                 (0.007)
-----------------------------------
Observations      47,705
R2                 0.978
Adjusted R2        0.978
===================================
Note: *p<0.1; **p<0.05; ***p<0.01
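
Each clarity, cut, and color coefficient above is measured relative to an omitted baseline level (the first level of the factor, e.g., clarity I1, cut Fair, and color D). If you want to inspect or change the baseline, here is a minimal sketch; the new data frame name and the choice of 'IF' as reference are purely illustrative:

#Check which level is the omitted baseline (the first level of each factor)
levels(diamonds_filter$clarity)

#relevel() changes the baseline of an unordered factor; 'IF' is used here only as an illustration
diamonds_releveled <- diamonds_filter %>%
  mutate(clarity = relevel(clarity, ref = 'IF'))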

Visualizing the ‘error term’

The error term, or residual, represents the difference between the actual and predicted values of the dependent variable. In a simple model with just two variables (y and x) the concept can be illustrated graphically. Below is a plot of Anscombe’s Dataset I:

ansc_quart_I <- ansc_quart[ansc_quart$dataset == 'Dataset I',]

ansc_quart_fit <- lm(y ~ x, data = ansc_quart_I)

prediction <- augment(ansc_quart_fit) %>% 
  clean_names()

ggplot(prediction, aes(x = x, y = y)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  geom_segment(aes(xend = x, yend = fitted), alpha = .25) +
  geom_point(size = 3, alpha = .5, color = 'dodgerblue') + 
  geom_point(aes(y = fitted), shape = 1, size = 2) +
  labs(title = 'Anscombes Quartet - Dataset I - Residuals/Errors')

Next, let’s do the same thing, but for a sample of the diamonds dataset (variables in log form). Notice the extra features added for presentation.

diamonds_sample <- diamonds %>%
  filter(carat < 1.5) %>%
  slice_sample(n = 100)

diamonds_fit <- lm(log(price) ~ log(carat), data = diamonds_sample)

augment(diamonds_fit) %>% 
  clean_names() %>% 
  mutate(resid = log_price - fitted) %>% 
  ggplot(aes(x = log_carat, y = log_price)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'lightgrey') +
  geom_segment(aes(xend = log_carat, yend = fitted), alpha = .25) +
  geom_point(aes(color = abs(resid), size = abs(resid))) + 
  scale_color_continuous(low = 'blue', high = 'red') +
  geom_point(aes(y = fitted), shape = 1) +
  labs(title = 'Sample of Diamonds Data - Residuals/Errors', color = 'Residual', size = 'Residual', x = 'Natural Log of Carat', y = 'Natural Log of Price')

Practice interpreting output with the mtcars dataset

Let’s predict fuel economy with characteristics of cars using the mtcars dataset. Try to interpret the output.

mpg_reg <- lm(mpg ~ wt + hp + cyl, data = mtcars)

stargazer(mpg_reg, 
          type = 'html',
          out = 'mpg_model.htm', 
          omit.stat=c('f', 'ser'), 
          title='MPG Regression Results')

MPG Regression Results
===================================
           Dependent variable:
                   mpg
-----------------------------------
wt               -3.167***
                 (0.741)
hp                -0.018
                 (0.012)
cyl               -0.942*
                 (0.551)
Constant         38.752***
                 (1.787)
-----------------------------------
Observations        32
R2                 0.843
Adjusted R2        0.826
===================================
Note: *p<0.1; **p<0.05; ***p<0.01
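
To make the coefficients concrete, we can again use augment() to predict fuel economy for a hypothetical car; the wt, hp, and cyl values below are made up for illustration (wt is measured in thousands of pounds):

#Predicted mpg for a hypothetical 3,000 lb, 150 hp, 6-cylinder car
augment(mpg_reg, newdata = data.frame(wt = 3, hp = 150, cyl = 6))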

Extra practice with regression - CPUs

On the course site, you will find a dataset containing prices of CPUs. The data was collected from PassMark in September 2022. Make sure this file is saved in your working directory. Let’s practice regression and visualization using that dataset:

cpu_data <- read_csv('cpu_data.csv')

Nearly all of the CPUs in the dataset are manufactured by Intel or AMD. Let’s use some wrangling functions to create a variable that indicates the manufacturer and keep only the Intel and AMD CPUs:

cpu_data <- cpu_data %>% 
  mutate(brand = factor(case_when(str_detect(str_to_lower(cpu),'amd') ~ 'AMD',
                           str_detect(str_to_lower(cpu), 'intel') ~ 'Intel',
                           TRUE ~ 'Other'))) %>% 
  filter(brand != 'Other')

Let’s plot CPU price on CPU performance (variables in log form):
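
Here is a minimal sketch, assuming the price_usd and cpu_mark column names used in the regressions below:

cpu_data %>%
  ggplot(aes(y = log(price_usd), x = log(cpu_mark))) +
  geom_point(color = 'dodgerblue', alpha = 0.3) +
  labs(title = 'Scatter Plot of CPU Price versus Performance',
       x = 'Natural Log of CPU Mark', y = 'Natural Log of Price (USD)')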

Assign color to brand:
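
One way to do this, building on the sketch above:

cpu_data %>%
  ggplot(aes(y = log(price_usd), x = log(cpu_mark), color = brand)) +
  geom_point(alpha = 0.3) +
  labs(title = 'Scatter Plot of CPU Price versus Performance by Brand',
       x = 'Natural Log of CPU Mark', y = 'Natural Log of Price (USD)', color = 'Brand')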

Estimate regressions and present in tabular form:

cpu_reg_short1 <- lm(log(price_usd) ~ log(cpu_mark), data = cpu_data)
cpu_reg_short2 <- lm(log(price_usd) ~ factor(brand), data = cpu_data)
cpu_reg_long <- lm(log(price_usd) ~ log(cpu_mark) + factor(brand), data = cpu_data)

stargazer(cpu_reg_short1, cpu_reg_short2, cpu_reg_long,
          type = 'html',
          out = 'cpu_models.htm', 
          omit.stat=c('f', 'ser'), 
          title='CPU Regression Results')

CPU Regression Results
===============================================================
                          Dependent variable: log(price_usd)
                      -----------------------------------------
                          (1)           (2)           (3)
---------------------------------------------------------------
log(cpu_mark)           0.712***                    0.695***
                        (0.017)                     (0.017)
factor(brand)Intel                    0.655***      0.229***
                                      (0.064)       (0.049)
Constant               -0.805***      4.642***     -0.829***
                        (0.140)       (0.055)       (0.139)
---------------------------------------------------------------
Observations             2,026         2,026         2,026
R2                       0.475         0.049         0.481
Adjusted R2              0.475         0.048         0.480
===============================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Collect predictions from the long regression:

cpu_prediction <- augment(cpu_reg_long) %>% 
  clean_names()

Other practice

Apply what you have learned above to other datasets!

  • On the mid-term exam, you worked with audit fees and industry data for S&P500 audits. Save those datasets to your working directory and import them.

  • Example housing data can be found on Kaggle.

A special case: Logistic regression

We are commonly tasked with predicting a binary outcome variable (e.g., win/loss, bankrupt/not bankrupt). In this case, we are estimating the probability of some occurrence from one or more predictor variables. OLS methods will still work here (in this context they are called ‘linear probability models’, or LPMs). However, special regression techniques, like logistic regression, are sometimes better suited to this setting. Let’s illustrate with some simulated data:

set.seed(42)

logit_data <- data.frame('predictor' = rnorm(50, mean = 1, sd = 1), 'import_outcome' = 1) %>% 
  bind_rows(data.frame('predictor' = rnorm(50, mean = -1, sd = 1), 'import_outcome' = 0))

Next, let’s plot this dataset and illustrate regression lines:

ggplot(logit_data, aes(x = predictor, y = import_outcome)) + 
  geom_point(alpha = 0.3) + 
  scale_y_continuous(breaks = seq(0,1)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'blue') +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'), se = FALSE, color = 'red') +
  labs(title = 'Prediction of Important Outcome')

Let’s repeat this using the mtcars dataset, predicting engine type (the vs variable) from fuel economy.

ggplot(mtcars, aes(x = mpg, y = vs)) + 
  geom_point(alpha = 0.3) + 
  scale_y_continuous(breaks = seq(0,1)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'blue') +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'), se = FALSE, color = 'red') +
  labs(title = 'Prediction of Engine Type')

Finally, here is an illustration of how to estimate a logistic regression and how to store the predictions:

import_logit <- glm(import_outcome ~ predictor, data = logit_data, family = 'binomial') 
summary(import_logit)

Call:
glm(formula = import_outcome ~ predictor, family = "binomial", 
    data = logit_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.002295   0.280832   0.008    0.993    
predictor   1.821336   0.358475   5.081 3.76e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629  on 99  degrees of freedom
Residual deviance:  79.217  on 98  degrees of freedom
AIC: 83.217

Number of Fisher Scoring iterations: 5

logit_prediction <- augment(import_logit, type.predict = 'response')
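
The .fitted column in the result holds each observation’s predicted probability that import_outcome equals 1; a quick way to peek at the stored predictions:

#Compare predicted probabilities to the actual outcomes for the first few observations
logit_prediction %>%
  select(import_outcome, predictor, .fitted) %>%
  head()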