Chapter 1. Introduction to Regression

  • The ultimate goal of regression is to predict a numerical outcome from a set of inputs.
  • The numerical outcome is called the response variable, and the inputs are called the explanatory variables.

(1) Research Questions

  • How many units will we sell?
  • At what price will customers buy our products?
  • Which marketing inputs lead customers to buy our products?
  • But watch out here: a yes/no question such as “Will the customer buy from us?” is a classification problem, not a regression problem (see the sketch below).
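  • For that kind of yes/no question, logistic regression is the usual tool instead. A minimal sketch (the data frame customers and its columns bought and leaflets are hypothetical, not part of this chapter’s data):

# logistic regression for a yes/no outcome (hypothetical data)
# bought is coded 0/1; glm() with family = binomial fits a classifier
buy_model <- glm(bought ~ leaflets, data = customers, family = binomial)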

(2) Linear Regression Example

  • We will develop a simple linear regression example here with the sales data.
  • The research question is: how many products will customers buy for a given number of leaflets?

(A) Data import

# step 1. data import
juice_sales <- read.csv("data_files/Lemonade2016.csv")
head(juice_sales, n = 3)
##         Date Location Lemon Orange Temperature Temperature.1 Leaflets
## 1 2016-07-01     Park    97     67          70          21.1       90
## 2 2016-07-02     Park    98     67          72          22.2       90
## 3 2016-07-03     Park   110     77          71          21.7      104
##   Price Sales    Revenue
## 1   500   164  82,000.0 
## 2   500   165  82,500.0 
## 3   500   187  93,500.0
  • In this data, we need only two columns, Leaflets and Sales. So, let’s do some data handling.

(B) Data Handling

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
sales_df <- juice_sales %>% select(Leaflets, Sales)
str(sales_df)
## 'data.frame':    31 obs. of  2 variables:
##  $ Leaflets: int  90 90 104 98 135 90 135 113 126 131 ...
##  $ Sales   : int  164 165 187 233 277 172 244 209 229 238 ...
  • Now we have the two columns we need to develop the regression model.

(C) Define a formula

lmla <- Sales ~ Leaflets
lmla
## Sales ~ Leaflets
  • The outcome (on the left of the ~) is Sales; the input (on the right) is Leaflets.
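  • By the way, a formula is an ordinary R object, so it can also be built programmatically. A minimal sketch using base R’s reformulate(), just an alternative to typing the formula by hand:

# build the same formula from character strings
lmla_alt <- reformulate(termlabels = "Leaflets", response = "Sales")
# lmla_alt means the same as Sales ~ Leaflets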

(D) Develop Model: sales_model

sales_model <- lm(lmla, data = sales_df)
sales_model
## 
## Call:
## lm(formula = lmla, data = sales_df)
## 
## Coefficients:
## (Intercept)     Leaflets  
##     -27.212        2.053

(E) Interpretation of sales_model

  • When the result from sales_model appears, look first at the Leaflets coefficient. It is the most important value here.
  • The Leaflets coefficient is positive (2.053), so Sales increases as Leaflets does: each additional leaflet is associated with roughly two more sales.
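  • The printed model shows only the coefficients. To judge how reliable the Leaflets coefficient is, inspect the full summary; a minimal sketch (output omitted here):

# standard errors, p-values, and R-squared for the fit
summary(sales_model)

# or pull out just the slope on Leaflets
coef(sales_model)["Leaflets"]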

(F) Prediction from sales_model

  • We are going to predict on new data. Imagine that the owner simply wants to know how many customers will buy our products for a given number of leaflets.
# create new input data to predict on
newleaflets <- data.frame(Leaflets = c(80, 90, 100))

# add in-sample predictions to sales_df (used for the plot in section (G))
sales_df$prediction <- predict(sales_model)

pred <- predict(sales_model, newdata = newleaflets)
pred
##        1        2        3 
## 137.0568 157.5904 178.1240
  • If you are the owner, you can now read off the expected sales for each new leaflet count (80, 90, 100).
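  • Note that predict() simply evaluates the fitted line Sales = -27.212 + 2.053 * Leaflets. A quick check by hand for the first new value, using the rounded printed coefficients:

# prediction for Leaflets = 80, computed by hand
-27.212 + 2.053 * 80
## [1] 137.028
# close to predict()'s 137.0568; the small gap comes from coefficient rounding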

(G) Scatterplot between prediction and actual outcome from dataset

  • Let’s make a plot to compare the predictions to the actual outcomes below.
# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(sales_df, aes(x = prediction, y = Sales)) + 
  geom_point() +
  geom_abline(color = "blue")
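  • A numeric companion to the plot is the average prediction error. A minimal sketch computing the root mean squared error (RMSE) of the in-sample predictions:

# how far off are the predictions, on average?
res <- sales_df$Sales - sales_df$prediction
rmse <- sqrt(mean(res^2))
rmse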

(3) Multivariate Linear Regression

  • If you have mastered one-variable linear regression, then it’s time to create a multivariate linear regression.
  • As you saw in the original dataset juice_sales, the data contains many variables.
  • However, don’t worry: it’s not difficult to build the regression model.
  • Everything is the same; the one difference is how you create the formula.
  • Let’s run through all the steps in code at once.
# step 1. overview of juice_sales
str(juice_sales)
## 'data.frame':    31 obs. of  10 variables:
##  $ Date         : Factor w/ 31 levels "2016-07-01","2016-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Location     : Factor w/ 2 levels "Beach","Park": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Lemon        : int  97 98 110 134 159 103 143 123 134 140 ...
##  $ Orange       : int  67 67 77 99 118 69 101 86 95 98 ...
##  $ Temperature  : int  70 72 71 76 78 82 81 82 80 82 ...
##  $ Temperature.1: num  21.1 22.2 21.7 24.4 25.6 27.8 27.2 27.8 26.7 27.8 ...
##  $ Leaflets     : int  90 90 104 98 135 90 135 113 126 131 ...
##  $ Price        : int  500 500 500 500 500 500 500 500 500 500 ...
##  $ Sales        : int  164 165 187 233 277 172 244 209 229 238 ...
##  $ Revenue      : Factor w/ 28 levels " 101,000.0 ",..: 22 23 27 8 12 25 10 4 7 9 ...
# step 2. creating formula
juice_la <- Sales ~ Temperature.1 + Leaflets

# step 3. build model
sales_multi_model <- lm(juice_la, data = juice_sales)

# step 4. print
sales_multi_model
## 
## Call:
## lm(formula = juice_la, data = juice_sales)
## 
## Coefficients:
##   (Intercept)  Temperature.1       Leaflets  
##      -142.988          5.155          1.884
# step 5. add in-sample predictions to the data
juice_sales$prediction <- predict(sales_multi_model)

# step 6. plot the results
ggplot(juice_sales, aes(x = prediction, y = Sales)) + 
    geom_point() +
    geom_abline(color = "red")

# step 7. predict with new data
new.temp.leaflets <- data.frame(Temperature.1 = c(28, 28, 29, 29, 30, 30), 
                                Leaflets = c(130, 140, 130, 140, 130, 140))

# step 8. get predictable outcomes
pred.outcome <- predict(sales_multi_model, newdata = new.temp.leaflets)
pred.outcome
##        1        2        3        4        5        6 
## 246.2725 265.1122 251.4277 270.2674 256.5828 275.4225
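  • As in the one-variable case, predict() just evaluates the fitted equation Sales = -142.988 + 5.155 * Temperature.1 + 1.884 * Leaflets. Checking the first new row by hand with the rounded coefficients:

# first row: Temperature.1 = 28, Leaflets = 130
-142.988 + 5.155 * 28 + 1.884 * 130
## [1] 246.272
# matches pred.outcome[1] (246.2725) up to coefficient rounding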

(4) Linear Regression Pros and Cons

(A) Advantages

  • Regression is easy to fit and apply.
  • It’s very concise: the whole model is just an intercept plus one coefficient per input.

(B) Disadvantages

  • Can only express linear and additive relationships
  • Collinearity - when variables are partially correlated (a quick check is sketched below)
    • Coefficients might change sign
  • High collinearity:
    • Coefficients (or their standard errors) look too large
    • The model may be unstable
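  • A quick way to screen for collinearity is the correlation between the inputs; variance inflation factors are the more formal diagnostic. A minimal sketch (the vif() line assumes the car package is installed):

# correlation between the two inputs of sales_multi_model
cor(juice_sales$Temperature.1, juice_sales$Leaflets)

# variance inflation factors (a common rule of thumb flags values above 5)
# car::vif(sales_multi_model)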