Chapter 1. Introduction to Regression

  • The ultimate goal of regression is to predict a numerical outcome from a set of inputs.
  • The numerical outcome is called the response variable, and the inputs are called the explanatory variables.

(1) Research Questions

  • How many units will we sell?
  • At what price will customers buy our products?
  • Which marketing inputs lead customers to buy our products?
  • But watch out here: a yes/no question such as “Will the customer buy from us?” is a classification problem, not a regression problem (see the sketch below).
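  • For that kind of yes/no question, logistic regression is the usual tool instead. A minimal sketch (the data frame customers and its columns bought and leaflets are hypothetical, not part of this chapter’s data):

# logistic regression for a yes/no outcome (hypothetical data)
# bought is coded 0/1; glm() with family = binomial fits a classifier
buy_model <- glm(bought ~ leaflets, data = customers, family = binomial)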

(2) Linear Regression Example

  • We will develop a simple linear regression example here with the sales data.
  • The research question is: how many products will customers buy for a given number of leaflets?

(A) Data import

# step 1. data import
juice_sales <- read.csv("data_files/Lemonade2016.csv")
head(juice_sales, n = 3)
##         Date Location Lemon Orange Temperature Temperature.1 Leaflets
## 1 2016-07-01     Park    97     67          70          21.1       90
## 2 2016-07-02     Park    98     67          72          22.2       90
## 3 2016-07-03     Park   110     77          71          21.7      104
##   Price Sales    Revenue
## 1   500   164  82,000.0 
## 2   500   165  82,500.0 
## 3   500   187  93,500.0
  • In this data, we need only two columns, Leaflets and Sales. So, let’s do some data handling.

(B) Data Handling

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
sales_df <- juice_sales %>% select(Leaflets, Sales)
str(sales_df)
## 'data.frame':    31 obs. of  2 variables:
##  $ Leaflets: int  90 90 104 98 135 90 135 113 126 131 ...
##  $ Sales   : int  164 165 187 233 277 172 244 209 229 238 ...
  • Now we have the two columns we need to develop the regression model.

(C) Define a formula

lmla <- Sales ~ Leaflets
lmla
## Sales ~ Leaflets
  • The outcome (on the left of the ~) is Sales; the input (on the right) is Leaflets.
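  • By the way, a formula is an ordinary R object, so it can also be built programmatically. A minimal sketch using base R’s reformulate(), just an alternative to typing the formula by hand:

# build the same formula from character strings
lmla_alt <- reformulate(termlabels = "Leaflets", response = "Sales")
# lmla_alt means the same as Sales ~ Leaflets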

(D) Develop Model: sales_model

sales_model <- lm(lmla, data = sales_df)
sales_model
## 
## Call:
## lm(formula = lmla, data = sales_df)
## 
## Coefficients:
## (Intercept)     Leaflets  
##     -27.212        2.053

(E) Interpretation of sales_model

  • When the result from sales_model appears, look first at the Leaflets coefficient. It is the most important value here.
  • The Leaflets coefficient is positive (2.053), so Sales increases as Leaflets does: each additional leaflet is associated with roughly two more sales.
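  • The printed model shows only the coefficients. To judge how reliable the Leaflets coefficient is, inspect the full summary; a minimal sketch (output omitted here):

# standard errors, p-values, and R-squared for the fit
summary(sales_model)

# or pull out just the slope on Leaflets
coef(sales_model)["Leaflets"]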

(F) Prediction from sales_model

  • We are going to predict on new data. Imagine that the owner simply wants to know how many customers will buy our products for a given number of leaflets.
# create new input data to predict on
newleaflets <- data.frame(Leaflets = c(80, 90, 100))

# add in-sample predictions to sales_df (used for the plot in section (G))
sales_df$prediction <- predict(sales_model)

pred <- predict(sales_model, newdata = newleaflets)
pred
##        1        2        3 
## 137.0568 157.5904 178.1240
  • If you are the owner, you can now read off the expected sales for each new leaflet count (80, 90, 100).
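  • Note that predict() simply evaluates the fitted line Sales = -27.212 + 2.053 * Leaflets. A quick check by hand for the first new value, using the rounded printed coefficients:

# prediction for Leaflets = 80, computed by hand
-27.212 + 2.053 * 80
## [1] 137.028
# close to predict()'s 137.0568; the small gap comes from coefficient rounding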

(G) Scatterplot between prediction and actual outcome from dataset

  • Let’s make a plot to compare the predictions to the actual outcomes below.
# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(sales_df, aes(x = prediction, y = Sales)) + 
  geom_point() +
  geom_abline(color = "blue")
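  • A numeric companion to the plot is the average prediction error. A minimal sketch computing the root mean squared error (RMSE) of the in-sample predictions:

# how far off are the predictions, on average?
res <- sales_df$Sales - sales_df$prediction
rmse <- sqrt(mean(res^2))
rmse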

(3) Multivariate Linear Regression

  • If you have mastered one-variable linear regression, then it’s time to create a multivariate linear regression.
  • As you saw in the original dataset juice_sales, the data contains many variables.
  • However, don’t worry: it’s not difficult to build the regression model.
  • Everything is the same; the one difference is how you create the formula.
  • Let’s run through all the steps in code at once.
# step 1. overview of juice_sales
str(juice_sales)
## 'data.frame':    31 obs. of  10 variables:
##  $ Date         : Factor w/ 31 levels "2016-07-01","2016-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Location     : Factor w/ 2 levels "Beach","Park": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Lemon        : int  97 98 110 134 159 103 143 123 134 140 ...
##  $ Orange       : int  67 67 77 99 118 69 101 86 95 98 ...
##  $ Temperature  : int  70 72 71 76 78 82 81 82 80 82 ...
##  $ Temperature.1: num  21.1 22.2 21.7 24.4 25.6 27.8 27.2 27.8 26.7 27.8 ...
##  $ Leaflets     : int  90 90 104 98 135 90 135 113 126 131 ...
##  $ Price        : int  500 500 500 500 500 500 500 500 500 500 ...
##  $ Sales        : int  164 165 187 233 277 172 244 209 229 238 ...
##  $ Revenue      : Factor w/ 28 levels " 101,000.0 ",..: 22 23 27 8 12 25 10 4 7 9 ...
# step 2. creating formula
juice_la <- Sales ~ Temperature.1 + Leaflets

# step 3. build model
sales_multi_model <- lm(juice_la, data = juice_sales)

# step 4. print
sales_multi_model
## 
## Call:
## lm(formula = juice_la, data = juice_sales)
## 
## Coefficients:
##   (Intercept)  Temperature.1       Leaflets  
##      -142.988          5.155          1.884
# step 5. add in-sample predictions to the data
juice_sales$prediction <- predict(sales_multi_model)

# step 6. plot the results
ggplot(juice_sales, aes(x = prediction, y = Sales)) + 
    geom_point() +
    geom_abline(color = "red")

# step 7. predict with new data
new.temp.leaflets <- data.frame(Temperature.1 = c(28, 28, 29, 29, 30, 30), 
                                Leaflets = c(130, 140, 130, 140, 130, 140))

# step 8. get predictable outcomes
pred.outcome <- predict(sales_multi_model, newdata = new.temp.leaflets)
pred.outcome
##        1        2        3        4        5        6 
## 246.2725 265.1122 251.4277 270.2674 256.5828 275.4225
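  • As in the one-variable case, predict() just evaluates the fitted equation Sales = -142.988 + 5.155 * Temperature.1 + 1.884 * Leaflets. Checking the first new row by hand with the rounded coefficients:

# first row: Temperature.1 = 28, Leaflets = 130
-142.988 + 5.155 * 28 + 1.884 * 130
## [1] 246.272
# matches pred.outcome[1] (246.2725) up to coefficient rounding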

(4) Linear Regression Pros and Cons

(A) Advantages

  • Regression is easy to fit and apply.
  • It’s very concise: the whole model is just an intercept plus one coefficient per input.

(B) Disadvantages

  • Can only express linear and additive relationships
  • Collinearity - when variables are partially correlated (a quick check is sketched below)
    • Coefficients might change sign
  • High collinearity:
    • Coefficients (or their standard errors) look too large
    • The model may be unstable
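  • A quick way to screen for collinearity is the correlation between the inputs; variance inflation factors are the more formal diagnostic. A minimal sketch (the vif() line assumes the car package is installed):

# correlation between the two inputs of sales_multi_model
cor(juice_sales$Temperature.1, juice_sales$Leaflets)

# variance inflation factors (a common rule of thumb flags values above 5)
# car::vif(sales_multi_model)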