(2) Linear Regression Example
- We will develop a simple linear regression example here using the sales data.
- The research question: how many products will customers buy for a given number of leaflets?
(A) Data import
# step 1. data import
juice_sales <- read.csv("data_files/Lemonade2016.csv")
head(juice_sales, n = 3)
## Date Location Lemon Orange Temperature Temperature.1 Leaflets
## 1 2016-07-01 Park 97 67 70 21.1 90
## 2 2016-07-02 Park 98 67 72 22.2 90
## 3 2016-07-03 Park 110 77 71 21.7 104
## Price Sales Revenue
## 1 500 164 82,000.0
## 2 500 165 82,500.0
## 3 500 187 93,500.0
- From this data we need only two columns, Leaflets and Sales, so let’s select them.
(B) Data Handling
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
sales_df <- juice_sales %>% select(Leaflets, Sales)
str(sales_df)
## 'data.frame': 31 obs. of 2 variables:
## $ Leaflets: int 90 90 104 98 135 90 135 113 126 131 ...
## $ Sales : int 164 165 187 233 277 172 244 209 229 238 ...
- Now we have the two columns needed to build the regression model.
(D) Develop Model: sales_model
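# lmla is not defined in the text above; the coefficient table below implies Sales ~ Leaflets
lmla <- Sales ~ Leaflets
sales_model <- lm(lmla, data = sales_df)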
sales_model
##
## Call:
## lm(formula = lmla, data = sales_df)
##
## Coefficients:
## (Intercept) Leaflets
## -27.212 2.053
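(E) Interpretation of sales_model
- When the model output appears, look first at the Leaflets coefficient. It is the slope, and it is the most important value for our research question.
- The Leaflets coefficient is positive (2.053), so Sales increases as Leaflets increases: each additional leaflet is associated with about 2 more sales.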
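- The slope can also be read programmatically with coef(). The following is a minimal sketch, not the model above: it assumes only the first ten Leaflets/Sales values shown in the str() output, so its estimates will differ from the full 31-row fit. The names demo_df and demo_model are illustrative.

```r
# First ten Leaflets/Sales values copied from the str() output above.
# Assumption: this is a subset, so the estimates differ from sales_model.
demo_df <- data.frame(
  Leaflets = c(90, 90, 104, 98, 135, 90, 135, 113, 126, 131),
  Sales    = c(164, 165, 187, 233, 277, 172, 244, 209, 229, 238)
)
demo_model <- lm(Sales ~ Leaflets, data = demo_df)

# coef() returns a named numeric vector: "(Intercept)" and "Leaflets"
slope <- coef(demo_model)[["Leaflets"]]
slope > 0   # TRUE means Sales rises with Leaflets in this subset

# summary() adds standard errors, t values and R-squared to the printout
summary(demo_model)
```

- Checking the sign and size of the slope this way is handy when the model is fitted inside a script rather than inspected by eye.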
(F) Prediction from sales_model
- We are going to add new data. Imagine that the owner simply wants to know how many sales to expect for a given number of leaflets.
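# new Leaflets values to predict for; the column name must match the model's predictor
newleaflets <- data.frame(Leaflets = c(80, 90, 100))
# fitted values for the original rows (used in the scatterplot in section G)
sales_df$prediction <- predict(sales_model)
# predictions for the new Leaflets values
pred <- predict(sales_model, newdata = newleaflets)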
pred
## 1 2 3
## 137.0568 157.5904 178.1240
- If you are the owner, you can now estimate the expected sales for the new leaflet counts (80, 90, 100): roughly 137, 158, and 178.
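- To see what predict() is doing, the same numbers can be reproduced by hand from the printed coefficients. This is a small sketch that assumes the rounded values from the sales_model output (-27.212 and 2.053); the variable names are illustrative.

```r
# Rounded coefficients as printed by sales_model above
intercept <- -27.212
slope     <- 2.053

new_leaflets <- c(80, 90, 100)

# Linear model prediction: Sales = intercept + slope * Leaflets
manual_pred <- intercept + slope * new_leaflets
manual_pred
# Close to predict()'s 137.0568 157.5904 178.1240; the small gap comes
# from using coefficients rounded to three decimals.
```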
(G) Scatterplot of predictions against actual outcomes
- Let’s plot the predictions against the actual outcomes. With no slope or intercept given, geom_abline() draws the line y = x, so points close to the line are well predicted.
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(sales_df, aes(x = prediction, y = Sales)) +
  geom_point() +
  geom_abline(color = "blue")
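- A number can complement the plot. The sketch below computes residuals (actual minus predicted) and their root mean squared error (RMSE). It is an addition to the original example and assumes only the first ten rows from the str() output, so the value will not match the full dataset; demo_df and demo_model are illustrative names.

```r
# First ten Leaflets/Sales values copied from the str() output above
# (assumption: a subset of the 31 rows).
demo_df <- data.frame(
  Leaflets = c(90, 90, 104, 98, 135, 90, 135, 113, 126, 131),
  Sales    = c(164, 165, 187, 233, 277, 172, 244, 209, 229, 238)
)
demo_model <- lm(Sales ~ Leaflets, data = demo_df)

# Fitted values for the training rows, as in the plot above
demo_df$prediction <- predict(demo_model)

# Residual = vertical distance of each point from the y = x line
demo_df$residual <- demo_df$Sales - demo_df$prediction

# RMSE: typical size of a prediction error, in units of Sales
rmse <- sqrt(mean(demo_df$residual^2))
rmse
```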

(3) Multivariate Linear Regression
- If you have mastered linear regression with one variable, it’s time to build a regression with several predictors (strictly speaking, this is multiple linear regression).
- As you saw in the original dataset juice_sales, the data contains many variables.
- Don’t worry, though: building the model is not difficult.
- Everything is the same except for one thing, the formula, which now lists more than one predictor.
- Let’s do it all in one code chunk.
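- To illustrate that only the formula changes, here is a minimal sketch showing that a formula stored in a variable and a formula written inline give the same fit. It assumes the first ten rows from the str() output only; demo_df, demo_formula, model_stored and model_inline are illustrative names.

```r
# First ten rows copied from the str() output (assumption: a subset)
demo_df <- data.frame(
  Temperature.1 = c(21.1, 22.2, 21.7, 24.4, 25.6, 27.8, 27.2, 27.8, 26.7, 27.8),
  Leaflets      = c(90, 90, 104, 98, 135, 90, 135, 113, 126, 131),
  Sales         = c(164, 165, 187, 233, 277, 172, 244, 209, 229, 238)
)

# Response on the left of ~, predictors joined with + on the right
demo_formula <- Sales ~ Temperature.1 + Leaflets
class(demo_formula)     # "formula"
all.vars(demo_formula)  # variables used by the formula

# Stored formula and inline formula produce the same coefficients
model_stored <- lm(demo_formula, data = demo_df)
model_inline <- lm(Sales ~ Temperature.1 + Leaflets, data = demo_df)
all.equal(coef(model_stored), coef(model_inline))
```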
# step 1. overview of juice_sales
str(juice_sales)
## 'data.frame': 31 obs. of 10 variables:
## $ Date : Factor w/ 31 levels "2016-07-01","2016-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Location : Factor w/ 2 levels "Beach","Park": 2 2 2 1 1 1 1 1 1 1 ...
## $ Lemon : int 97 98 110 134 159 103 143 123 134 140 ...
## $ Orange : int 67 67 77 99 118 69 101 86 95 98 ...
## $ Temperature : int 70 72 71 76 78 82 81 82 80 82 ...
## $ Temperature.1: num 21.1 22.2 21.7 24.4 25.6 27.8 27.2 27.8 26.7 27.8 ...
## $ Leaflets : int 90 90 104 98 135 90 135 113 126 131 ...
## $ Price : int 500 500 500 500 500 500 500 500 500 500 ...
## $ Sales : int 164 165 187 233 277 172 244 209 229 238 ...
## $ Revenue : Factor w/ 28 levels " 101,000.0 ",..: 22 23 27 8 12 25 10 4 7 9 ...
# step 2. creating formula
juice_la <- Sales ~ Temperature.1 + Leaflets
# step 3. build model
sales_multi_model <- lm(juice_la, data = juice_sales)
# step 4. print
sales_multi_model
##
## Call:
## lm(formula = juice_la, data = juice_sales)
##
## Coefficients:
## (Intercept) Temperature.1 Leaflets
## -142.988 5.155 1.884
# step 5. add predicted values
juice_sales$prediction <- predict(sales_multi_model)
# step 6. plot the results
ggplot(juice_sales, aes(x = prediction, y = Sales)) +
  geom_point() +
  geom_abline(color = "red")

# step 7. predict with new data
new.temp.leafltes <- data.frame(Temperature.1 = c(28, 28, 29, 29, 30, 30),
                                Leaflets = c(130, 140, 130, 140, 130, 140)) # one row per temperature/leaflet combination
# step 8. get predicted outcomes
pred.outcome <- predict(sales_multi_model, newdata = new.temp.leafltes)
pred.outcome
## 1 2 3 4 5 6
## 246.2725 265.1122 251.4277 270.2674 256.5828 275.4225
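- As with the one-variable model, these predictions can be checked by hand. The sketch below assumes the rounded coefficients printed by sales_multi_model (-142.988, 5.155, 1.884); the variable names are illustrative.

```r
# Rounded coefficients as printed by sales_multi_model above
intercept  <- -142.988
b_temp     <- 5.155   # Temperature.1
b_leaflets <- 1.884   # Leaflets

# Same six temperature/leaflet combinations as in step 7
new_data <- data.frame(
  Temperature.1 = c(28, 28, 29, 29, 30, 30),
  Leaflets      = c(130, 140, 130, 140, 130, 140)
)

# Sales = intercept + b_temp * Temperature.1 + b_leaflets * Leaflets
manual_pred <- intercept +
  b_temp * new_data$Temperature.1 +
  b_leaflets * new_data$Leaflets
manual_pred
# Matches the predict() output above to about two decimals; the
# remaining gap comes from the rounded coefficients.
```

- Reading the rows in pairs shows the effect of each predictor: 10 more leaflets add about 18.8 sales, and 1 more degree adds about 5.2.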