Regression Exercises
In this exercise, we will reproduce the regressions and graphics from the class slides and then work through a couple of examples. The purpose is to demonstrate how to run regressions, interpret output, and generate visualizations.
General housekeeping items
Let’s begin by opening libraries and clearing the environment:
library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(janitor)
library(scales)
rm(list=ls())
Set your working directory:
setwd('C:/YOURWD')
Generate ‘simulated’ data for correlation and regression
Let’s generate some simulated data to illustrate the concepts of correlation and regression. We will be using ‘randomness’ in this simulation so let’s first set a seed so that our procedure is reproducible:
set.seed(42)
Next, let’s create a series of 4 datasets containing two correlated variables ‘x’ and ‘y’. Each dataset will have 100 observations. The first 3 datasets illustrate a positive correlation based on the equation y = x + noise, where the noise is ‘random’ and varies across the datasets: ‘little_noise’, ‘some_noise’, and ‘tons_noise’. The final dataset illustrates a negative correlation based on the equation y = 100 - x + noise.
#Create a dataset with 100 observations where 'x' is a randomly drawn number from 1-100.
pos_little_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>%
#Create a variable, 'y' equal to 'x' plus 'noise.' 'Noise' is a randomly drawn number from -/+5.
mutate(y = x + sample(-5:5, size = 100, replace = TRUE)) %>%
#Create a variable, that labels this dataset.
mutate(label = 'Sim 1: Positive - Little noise')
#Repeat for some_noise and tons_noise datasets (with increased noise)
pos_some_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>%
mutate(y = x + sample(-25:25, size = 100, replace = TRUE)) %>%
mutate(label = 'Sim 2: Positive - Some noise')
pos_tons_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>%
mutate(y = x + sample(-50:50, size = 100, replace = TRUE)) %>%
mutate(label = 'Sim 3: Positive - Tons noise')
neg_some_noise <- data.frame('x' = sample(1:100, size = 100, replace = TRUE)) %>%
mutate(y = 100 - x + sample(-25:25, size = 100, replace = TRUE)) %>%
mutate(label = 'Sim 4: Negative - Some noise')
#Bind simulations together in one dataset
sim_data <- bind_rows(pos_little_noise, pos_some_noise, pos_tons_noise, neg_some_noise)
Plot datasets, determine correlations, and add trendlines
Plot ‘y’ on ‘x’ separately for each simulated dataset:
ggplot(sim_data, aes(x = x, y = y)) +
geom_point(color = 'dodgerblue', alpha = 0.5) +
facet_wrap(~label)
Calculate correlations between ‘x’ and ‘y’ for each simulated dataset:
sim_data %>%
group_by(label) %>%
summarize(cor = cor(x,y))
# A tibble: 4 × 2
label cor
<chr> <dbl>
1 Sim 1: Positive - Little noise 0.994
2 Sim 2: Positive - Some noise 0.882
3 Sim 3: Positive - Tons noise 0.734
4 Sim 4: Negative - Some noise -0.887
Let’s try on real data. Plot price on carat weight using the diamonds dataset:
diamonds %>%
filter(carat < 1.5) %>%
ggplot(aes(y = price, x = carat)) +
geom_point(color = 'dodgerblue', alpha = 0.3) +
labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price') +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma)
Calculate correlations between price and carat:
diamonds %>%
filter(carat < 1.5) %>%
summarize(cor = cor(price,carat))
# A tibble: 1 × 1
cor
<dbl>
1 0.883
Recreate the plots above, adding manual trend lines:
ggplot(sim_data, aes(x = x, y = y)) +
geom_point(color = 'dodgerblue', alpha = 0.5) +
geom_abline(intercept = 100, slope = -1, color = 'red') +
geom_abline(intercept = 0, slope = 1, color = 'red') +
geom_abline(intercept = 50, slope = 0, color = 'red') +
facet_wrap(~label)
diamonds %>%
filter(carat < 1.5) %>%
ggplot(aes(y = price, x = carat)) +
geom_point(color = 'dodgerblue', alpha = 0.3) +
geom_abline(intercept = 5000, slope = 0, color = 'red') +
geom_abline(intercept = 0, slope = 5000, color = 'red') +
geom_abline(intercept = -5000, slope = 10000, color = 'red') +
labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price') +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma)
Determine the ‘best fit’ with regression
Instead of trying to visualize trends ourselves, we can ‘fit’ a model to our data. Below, we are going to use a regression technique called ‘ordinary least squares’ (OLS) to determine the coefficient(s) and intercept that minimize the “sum of the squared errors.” Let’s add an OLS regression line to our plots:
ggplot(sim_data, aes(x = x, y = y)) +
geom_point(color = 'dodgerblue', alpha = 0.5) +
geom_smooth(method = 'lm', se = FALSE, color = 'red') +
facet_wrap(~label)
diamonds %>%
filter(carat < 1.5) %>%
ggplot(aes(y = price, x = carat)) +
geom_point(color = 'dodgerblue', alpha = 0.3) +
geom_smooth(method = 'lm', se = FALSE, color = 'red') +
labs(title = 'Scatter Plot of Price versus Carat', x = 'Carat', y = 'Price', color = 'Clarity') +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma)
In R we use lm() to estimate a regression model using OLS. Here is an illustration for each of our simulated datasets:
pln_reg <- lm(y ~ x, data = pos_little_noise)
summary(pln_reg)
Call:
lm(formula = y ~ x, data = pos_little_noise)
Residuals:
Min 1Q Median 3Q Max
-5.4178 -3.8101 0.1569 3.1569 5.7580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.79762 0.65058 -1.226 0.223
x 1.01321 0.01105 91.727 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.477 on 98 degrees of freedom
Multiple R-squared: 0.9885, Adjusted R-squared: 0.9884
F-statistic: 8414 on 1 and 98 DF, p-value: < 2.2e-16
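As a quick check on what ‘least squares’ means, here is a minimal sketch (using the pos_little_noise data and pln_reg model from above) comparing lm()’s estimates with the closed-form OLS formulas for a single predictor:
#Closed-form OLS: slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
ols_slope <- with(pos_little_noise, cov(x, y) / var(x))
ols_intercept <- with(pos_little_noise, mean(y) - ols_slope * mean(x))
c(intercept = ols_intercept, slope = ols_slope)
#These should match the lm() coefficients up to rounding
coef(pln_reg)
The same approach applies to the remaining simulated datasets: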
psn_reg <- lm(y ~ x, data = pos_some_noise)
summary(psn_reg)
Call:
lm(formula = y ~ x, data = pos_some_noise)
Residuals:
Min 1Q Median 3Q Max
-24.6891 -14.5044 -0.4017 14.8686 26.6204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.15390 3.42857 -0.628 0.531
x 1.04850 0.05645 18.574 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.58 on 98 degrees of freedom
Multiple R-squared: 0.7788, Adjusted R-squared: 0.7765
F-statistic: 345 on 1 and 98 DF, p-value: < 2.2e-16
ptn_reg <- lm(y ~ x, data = pos_tons_noise)
summary(ptn_reg)
Call:
lm(formula = y ~ x, data = pos_tons_noise)
Residuals:
Min 1Q Median 3Q Max
-47.033 -27.751 -1.635 26.755 53.975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.5267 5.9289 -0.764 0.447
x 1.0788 0.1009 10.686 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.49 on 98 degrees of freedom
Multiple R-squared: 0.5382, Adjusted R-squared: 0.5335
F-statistic: 114.2 on 1 and 98 DF, p-value: < 2.2e-16
nsn_reg <- lm(y ~ x, data = neg_some_noise)
summary(nsn_reg)
Call:
lm(formula = y ~ x, data = neg_some_noise)
Residuals:
Min 1Q Median 3Q Max
-25.5151 -12.4843 0.0163 11.4702 24.5557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 100.38760 3.03245 33.10 <2e-16 ***
x -0.99798 0.05259 -18.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.56 on 98 degrees of freedom
Multiple R-squared: 0.7861, Adjusted R-squared: 0.7839
F-statistic: 360.2 on 1 and 98 DF, p-value: < 2.2e-16
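Before tabulating, notice that the estimated slopes are close to the +1 (or -1) used to generate the data and the intercepts are close to 0 (or 100); as the noise grows, the standard errors and residual variation increase while R-squared falls.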
Stargazer is a great package for aggregating regression output and creating publication-quality tables:
stargazer(pln_reg, psn_reg, ptn_reg, nsn_reg,
type = 'html',
out = 'noise_models.htm',
omit.stat=c('f', 'ser'),
title='Simulation Regression Results',
column.labels = c('Little','Some','Tons','Negative'),
column.sep.width = '20pt')
| Dependent variable: y | (1) Little | (2) Some | (3) Tons | (4) Negative |
|---|---|---|---|---|
| x | 1.013*** | 1.049*** | 1.079*** | -0.998*** |
|   | (0.011) | (0.056) | (0.101) | (0.053) |
| Constant | -0.798 | -2.154 | -4.527 | 100.388*** |
|   | (0.651) | (3.429) | (5.929) | (3.032) |
| Observations | 100 | 100 | 100 | 100 |
| R2 | 0.988 | 0.779 | 0.538 | 0.786 |
| Adjusted R2 | 0.988 | 0.777 | 0.533 | 0.784 |
Note: *p<0.1; **p<0.05; ***p<0.01
Same for the diamonds regression:
diamonds_filter <- diamonds %>%
filter(carat < 1.5)
diamonds_reg <- lm(price ~ carat, data = diamonds_filter)
stargazer(diamonds_reg,
type = 'html',
out = 'diamond_model.htm',
omit.stat=c('f', 'ser'),
title='Diamond Regression Results')
| Dependent variable: price |  |
|---|---|
| carat | 6,858.921*** |
|   | (16.690) |
| Constant | -1,758.280*** |
|   | (12.408) |
| Observations | 47,705 |
| R2 | 0.780 |
| Adjusted R2 | 0.780 |
Note: *p<0.1; **p<0.05; ***p<0.01
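To interpret the slope: among diamonds under 1.5 carats, each additional carat is associated with a price that is roughly $6,859 higher, on average.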
Note that we can also determine the predicted or ‘fitted’ values for y using our regression estimates. For example, in the simple regression of price on carat above, let’s estimate the price of a 1 carat diamond.
diamonds_reg <- lm(price ~ carat, data = diamonds_filter)
augment(diamonds_reg, newdata = data.frame(carat = 1))
# A tibble: 1 × 2
carat .fitted
<dbl> <dbl>
1 1 5101.
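This matches the manual calculation from the coefficients above: -1,758.280 + 6,858.921 × 1 ≈ 5,101.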
Regression illustration with Anscombe’s quartet
Anscombe’s quartet is a famous illustration of how regression estimates can oversimplify a trend and give too much weight to outliers. Let’s begin by importing the ansc_quart.xlsx file from the course website:
ansc_quart <- read_excel('ansc_quart.xlsx')
This file is made up of 4 example datasets. As above, let’s plot and regress (note the extra features used for presentation purposes):
ggplot(ansc_quart, aes(x = x, y = y, color = dataset)) +
geom_point(size = 3, alpha = .8) +
facet_wrap(~ dataset) +
stat_smooth(method = 'lm', se = FALSE, color = 'red', fullrange = TRUE) +
theme(legend.position = 'none')
Let’s look at the regression output in tabular form:
data_1_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset I',])
data_2_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset II',])
data_3_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset III',])
data_4_reg <- lm(y ~ x, data = ansc_quart[ansc_quart$dataset == 'Dataset IV',])
stargazer(data_1_reg, data_2_reg, data_3_reg, data_4_reg,
type = 'html',
out = 'ansc_models.htm',
omit.stat=c('f', 'ser'),
title='Anscombes Regression Results',
column.labels = c('Data I','Data II','Data III','Data IV'))
| Dependent variable: y | (1) Data I | (2) Data II | (3) Data III | (4) Data IV |
|---|---|---|---|---|
| x | 0.500*** | 0.500*** | 0.500*** | 0.500*** |
|   | (0.118) | (0.118) | (0.118) | (0.118) |
| Constant | 3.000** | 3.001** | 3.002** | 3.002** |
|   | (1.125) | (1.125) | (1.124) | (1.124) |
| Observations | 11 | 11 | 11 | 11 |
| R2 | 0.667 | 0.666 | 0.666 | 0.666 |
| Adjusted R2 | 0.629 | 0.629 | 0.629 | 0.629 |
Note: *p<0.1; **p<0.05; ***p<0.01
Regression with non-linear variables
Regressions are great at fitting lines to our data. While the estimates are ‘linear’, the variables need not be. That is, a regression model can accommodate non-linear transformations of variables (e.g., squared or log terms). Following Dataset II above, let’s estimate a quadratic model that includes a squared term:
ggplot(ansc_quart[ansc_quart$dataset == 'Dataset II',], aes(x = x, y = y)) +
geom_point(size = 3, color = 'dodgerblue', alpha = .8) +
stat_smooth(method = 'lm', se = FALSE, formula = y ~ x + I(x^2), color = 'red', fullrange = TRUE) +
labs(title = 'Anscombes Quartet - Dataset II - Quadratic')
data_2_reg_quad <- lm(y ~ x + I(x^2), data = ansc_quart[ansc_quart$dataset == 'Dataset II',])
summary(data_2_reg_quad)
Call:
lm(formula = y ~ x + I(x^2), data = ansc_quart[ansc_quart$dataset ==
"Dataset II", ])
Residuals:
Min 1Q Median 3Q Max
-0.0013287 -0.0011888 -0.0006294 0.0008741 0.0023776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.9957343 0.0043299 -1385 <2e-16 ***
x 2.7808392 0.0010401 2674 <2e-16 ***
I(x^2) -0.1267133 0.0000571 -2219 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001672 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 7.378e+06 on 2 and 8 DF, p-value: < 2.2e-16
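The essentially perfect fit (R-squared of 1 and tiny residuals) indicates that Dataset II follows an exact quadratic relationship that the straight-line fit in the earlier table could not capture.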
Log transformations of variables are frequently useful for improving model fit and interpretability (among other benefits). Recall that the relation between price and carat in the diamonds dataset was roughly log-linear. Let’s rerun our regression above using the natural logs of price and carat.
diamonds %>%
filter(carat < 1.5) %>%
ggplot(aes(y = log(price), x = log(carat))) +
geom_point(color = 'dodgerblue', alpha = 0.3) +
geom_smooth(method = 'lm', se = FALSE, color = 'red') +
labs(title = 'Scatter Plot of Price versus Carat', x = 'Natural Log of Carat', y = 'Natural Log of Price') +
scale_y_continuous(labels = comma) +
scale_x_continuous(labels = comma)
diamonds_reg <- lm(log(price) ~ log(carat), data = diamonds_filter)
stargazer(diamonds_reg,
type = 'html',
out = 'log_diamond_model.htm',
omit.stat=c('f', 'ser'),
title='Diamond Regression Results')
| Dependent variable: log(price) |  |
|---|---|
| log(carat) | 1.692*** |
|   | (0.002) |
| Constant | 8.458*** |
|   | (0.002) |
| Observations | 47,705 |
| R2 | 0.914 |
| Adjusted R2 | 0.914 |
Note: *p<0.1; **p<0.05; ***p<0.01
As above, we can also generate predictions from this model. Let’s again estimate the price of a 1 carat diamond.
diamonds_reg <- lm(log(price) ~ log(carat), data = diamonds_filter)
augment(diamonds_reg, newdata = data.frame(carat = 1)) %>%
mutate(pred_price = exp(.fitted))
# A tibble: 1 × 3
carat .fitted pred_price
<dbl> <dbl> <dbl>
1 1 8.46 4715.
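Because log(1) = 0, the fitted value is simply the intercept (about 8.46), and exponentiating it yields a predicted price of roughly $4,715.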
Multiple regression
Regressions can have more than one ‘x’ or predictor variable. In fact, we can expand our regression models to estimate the effects of many variables. Using the diamonds dataset, let’s increase the number of variables in our model one step at a time:
diamonds_short1 <- lm(log(price) ~ log(carat), data = diamonds_filter)
diamonds_short2 <- lm(log(price) ~ as.numeric(clarity), data = diamonds_filter)
diamonds_long1 <- lm(log(price) ~ log(carat) + as.numeric(clarity), data = diamonds_filter)
diamonds_long2 <- lm(log(price) ~ log(carat) + as.numeric(clarity) + as.numeric(cut) + as.numeric(color), data = diamonds_filter)
stargazer(diamonds_short1, diamonds_short2, diamonds_long1, diamonds_long2,
type = 'html',
out = 'multi_diamond_models.htm',
omit.stat=c('f', 'ser'),
column.labels = c('Short1','Short2','Long1','Long2'),
title='Diamond Regression Results')
| Dependent variable: log(price) | (1) Short1 | (2) Short2 | (3) Long1 | (4) Long2 |
|---|---|---|---|---|
| log(carat) | 1.692*** |  | 1.827*** | 1.886*** |
|   | (0.002) |  | (0.002) | (0.001) |
| as.numeric(clarity) |  | -0.082*** | 0.114*** | 0.121*** |
|   |  | (0.002) | (0.001) | (0.0004) |
| as.numeric(cut) |  |  |  | 0.031*** |
|   |  |  |  | (0.001) |
| as.numeric(color) |  |  |  | -0.076*** |
|   |  |  |  | (0.0004) |
| Constant | 8.458*** | 7.922*** | 8.053*** | 8.195*** |
|   | (0.002) | (0.011) | (0.002) | (0.003) |
| Observations | 47,705 | 47,705 | 47,705 | 47,705 |
| R2 | 0.914 | 0.024 | 0.954 | 0.974 |
| Adjusted R2 | 0.914 | 0.024 | 0.954 | 0.974 |
Note: *p<0.1; **p<0.05; ***p<0.01
In this exercise, we treated all variables as numeric (even ordinal categorical/factor variables like clarity). Alternatively, we can estimate the effect of each level of a categorical variable separately by including an indicator variable for each level. See the example below.
diamonds_filter <- diamonds_filter %>%
mutate(across(c('clarity', 'color', 'cut'), ~ factor(.x, ordered = FALSE)))
diamonds_long3 <- lm(log(price) ~ log(carat) + clarity + cut + color, data = diamonds_filter)
stargazer(diamonds_long3,
type = 'html',
out = 'long_diamond_models.htm',
omit.stat=c('f', 'ser'),
column.labels = c('Long3'),
title='Diamond Regression Results')
| Dependent variable: log(price) | Long3 |
|---|---|
| log(carat) | 1.889*** |
|   | (0.001) |
| claritySI2 | 0.362*** |
|   | (0.006) |
| claritySI1 | 0.521*** |
|   | (0.006) |
| clarityVS2 | 0.673*** |
|   | (0.006) |
| clarityVS1 | 0.744*** |
|   | (0.006) |
| clarityVVS2 | 0.884*** |
|   | (0.006) |
| clarityVVS1 | 0.955*** |
|   | (0.006) |
| clarityIF | 1.053*** |
|   | (0.007) |
| cutGood | 0.075*** |
|   | (0.004) |
| cutVery Good | 0.110*** |
|   | (0.004) |
| cutPremium | 0.139*** |
|   | (0.004) |
| cutIdeal | 0.156*** |
|   | (0.004) |
| colorE | -0.057*** |
|   | (0.002) |
| colorF | -0.100*** |
|   | (0.002) |
| colorG | -0.165*** |
|   | (0.002) |
| colorH | -0.250*** |
|   | (0.002) |
| colorI | -0.377*** |
|   | (0.003) |
| colorJ | -0.499*** |
|   | (0.003) |
| Constant | 7.934*** |
|   | (0.007) |
| Observations | 47,705 |
| R2 | 0.978 |
| Adjusted R2 | 0.978 |
Note: *p<0.1; **p<0.05; ***p<0.01
Visualizing the ‘error term’
The error term, or residual, represents the difference between the actual values of the dependent variable and the values predicted by the model. In a simple model with just two variables (y and x), the concept can be illustrated graphically. Below is a plot of Anscombe’s Dataset I:
ansc_quart_I <- ansc_quart[ansc_quart$dataset == 'Dataset I',]
ansc_quart_fit <- lm(y ~ x, data = ansc_quart_I)
prediction <- augment(ansc_quart_fit) %>%
clean_names()
ggplot(prediction, aes(x = x, y = y)) +
geom_smooth(method = 'lm', se = FALSE, color = 'red') +
geom_segment(aes(xend = x, yend = fitted), alpha = .25) +
geom_point(size = 3, alpha = .5, color = 'dodgerblue') +
geom_point(aes(y = fitted), shape = 1, size = 2) +
labs(title = 'Anscombes Quartet - Dataset I - Residuals/Errors')
Next, let’s do the same thing, but for a sample of the diamonds dataset (variables in log form). Notice the extra features added for presentation.
diamonds_sample <- diamonds %>%
filter(carat < 1.5) %>%
slice_sample(n = 100)
diamonds_fit <- lm(log(price) ~ log(carat), data = diamonds_sample)
augment(diamonds_fit) %>%
clean_names() %>%
mutate(resid = log_price - fitted) %>%
ggplot(aes(x = log_carat, y = log_price)) +
geom_smooth(method = 'lm', se = FALSE, color = 'lightgrey') +
geom_segment(aes(xend = log_carat, yend = fitted), alpha = .25) +
geom_point(aes(color = abs(resid), size = abs(resid))) +
scale_color_continuous(low = 'blue', high = 'red') +
geom_point(aes(y = fitted), shape = 1) +
labs(title = 'Sample of Diamonds Data - Residuals/Errors', color = 'Residual', size = 'Residual', x = 'Natural Log of Carat', y = 'Natural Log of Price')
Practice interpreting output with the mtcars dataset
Let’s predict fuel economy with characteristics of cars using the mtcars dataset. Try to interpret the output.
mpg_reg <- lm(mpg ~ wt + hp + cyl, data = mtcars)
stargazer(mpg_reg,
type = 'html',
out = 'mpg_model.htm',
omit.stat=c('f', 'ser'),
title='MPG Regression Results')
| Dependent variable: mpg |  |
|---|---|
| wt | -3.167*** |
|   | (0.741) |
| hp | -0.018 |
|   | (0.012) |
| cyl | -0.942* |
|   | (0.551) |
| Constant | 38.752*** |
|   | (1.787) |
| Observations | 32 |
| R2 | 0.843 |
| Adjusted R2 | 0.826 |
Note: *p<0.1; **p<0.05; ***p<0.01
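If you would rather work with the estimates directly than read them off a table, one option is to pull them into a data frame with broom (a minimal sketch; conf.int is optional and adds confidence intervals):
#Coefficient estimates, standard errors, and confidence intervals as a tibble
tidy(mpg_reg, conf.int = TRUE)
#Model-level statistics such as R-squared
glance(mpg_reg)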
Extra practice with regression - CPUs
On the course site, you will find a dataset containing prices of CPUs. The data was collected from PassMark in September 2022. Make sure this file is saved in your working directory. Let’s practice regression and visualization using that dataset:
cpu_data <- read_csv('cpu_data.csv')
Nearly all of the CPUs in the dataset are manufactured by Intel or AMD. Let’s use some wrangling functions to create a variable that indicates the manufacturer and to retain only Intel and AMD CPUs:
cpu_data <- cpu_data %>%
mutate(brand = factor(case_when(str_detect(str_to_lower(cpu),'amd') ~ 'AMD',
str_detect(str_to_lower(cpu), 'intel') ~ 'Intel',
TRUE ~ 'Other'))) %>%
filter(brand != 'Other')
Let’s plot CPU price on CPU performance (variables in log form):
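A minimal sketch of this plot (assuming the price_usd and cpu_mark columns used in the regressions below):
ggplot(cpu_data, aes(y = log(price_usd), x = log(cpu_mark))) +
geom_point(color = 'dodgerblue', alpha = 0.3) +
geom_smooth(method = 'lm', se = FALSE, color = 'red') +
labs(title = 'Scatter Plot of CPU Price versus Performance', x = 'Natural Log of CPU Mark', y = 'Natural Log of Price')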
Assign color to brand:
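One way to color the points by brand (again a sketch; styling choices are up to you):
ggplot(cpu_data, aes(y = log(price_usd), x = log(cpu_mark), color = brand)) +
geom_point(alpha = 0.3) +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = 'Scatter Plot of CPU Price versus Performance by Brand', x = 'Natural Log of CPU Mark', y = 'Natural Log of Price', color = 'Brand')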
Estimate regressions and present in tabular form:
cpu_reg_short1 <- lm(log(price_usd) ~ log(cpu_mark), data = cpu_data)
cpu_reg_short2 <- lm(log(price_usd) ~ factor(brand), data = cpu_data)
cpu_reg_long <- lm(log(price_usd) ~ log(cpu_mark) + factor(brand), data = cpu_data)
stargazer(cpu_reg_short1, cpu_reg_short2, cpu_reg_long,
type = 'html',
out = 'cpu_models.htm',
omit.stat=c('f', 'ser'),
title='CPU Regression Results')
| Dependent variable: log(price_usd) | (1) | (2) | (3) |
|---|---|---|---|
| log(cpu_mark) | 0.712*** |  | 0.695*** |
|   | (0.017) |  | (0.017) |
| factor(brand)Intel |  | 0.655*** | 0.229*** |
|   |  | (0.064) | (0.049) |
| Constant | -0.805*** | 4.642*** | -0.829*** |
|   | (0.140) | (0.055) | (0.139) |
| Observations | 2,026 | 2,026 | 2,026 |
| R2 | 0.475 | 0.049 | 0.481 |
| Adjusted R2 | 0.475 | 0.048 | 0.480 |
Note: *p<0.1; **p<0.05; ***p<0.01
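Because both price and performance are in logs, the coefficient of roughly 0.7 on log(cpu_mark) can be read as an elasticity: a 1% higher benchmark score is associated with roughly a 0.7% higher price. The positive Intel indicator suggests that, holding performance constant, Intel CPUs sell at a premium relative to AMD in this sample.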
Collect predictions from the long regression:
cpu_prediction <- augment(cpu_reg_long) %>%
clean_names()
Apply what you have learned above to other datasets!
On the mid-term exam, you worked with audit fees and industry data for S&P500 audits. Save those datasets to your working directory and import them.
Example housing data can be found on Kaggle.
A special case: Logistic regression
We are commonly tasked with predicting a binary outcome variable (e.g., win/loss, bankrupt/not bankrupt). In this case, we are estimating the probability of an occurrence based on one or more predictor variables. OLS methods will still work here (they are then referred to as ‘linear probability models’, or LPMs). However, special regression techniques, like logistic regression, are sometimes better suited to this setting. Let’s illustrate with some simulated data:
set.seed(42)
logit_data <- data.frame('predictor' = rnorm(50, mean = 1, sd = 1), 'import_outcome' = 1) %>%
bind_rows(data.frame('predictor' = rnorm(50, mean = -1, sd = 1), 'import_outcome' = 0))
Next, let’s plot this dataset and overlay both the OLS (blue) and logistic (red) fits:
ggplot(logit_data, aes(x = predictor, y = import_outcome)) +
geom_point(alpha = 0.3) +
scale_y_continuous(breaks = seq(0,1)) +
geom_smooth(method = 'lm', se = FALSE, color = 'blue') +
geom_smooth(method = 'glm', method.args = list(family = 'binomial'), se = FALSE, color = 'red') +
labs(title = 'Prediction of Important Outcome')
Repeat with a prediction of engine type using the mtcars dataset:
ggplot(mtcars, aes(x = mpg, y = vs)) +
geom_point(alpha = 0.3) +
scale_y_continuous(breaks = seq(0,1)) +
geom_smooth(method = 'lm', se = FALSE, color = 'blue') +
geom_smooth(method = 'glm', method.args = list(family = 'binomial'), se = FALSE, color = 'red') +
labs(title = 'Prediction of Engine Type')
Finally, here is an illustration of how to estimate a logistic regression and how to store the predictions:
import_logit <- glm(import_outcome ~ predictor, data = logit_data, family = 'binomial')
summary(import_logit)
Call:
glm(formula = import_outcome ~ predictor, family = "binomial",
data = logit_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.002295 0.280832 0.008 0.993
predictor 1.821336 0.358475 5.081 3.76e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 79.217 on 98 degrees of freedom
AIC: 83.217
Number of Fisher Scoring iterations: 5
logit_prediction <- augment(import_logit, type.predict = 'response')
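As with lm(), we can also generate a predicted probability at a chosen value of the predictor. A minimal sketch, using a hypothetical predictor value of 1:
#Predicted probability that import_outcome equals 1 when predictor = 1
augment(import_logit, newdata = data.frame(predictor = 1), type.predict = 'response')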