Learning By Building (LBB) : Regression Model

In this LBB, I will use a Supermarket Sales Report dataset from Kaggle.com. The purpose of using this data is to determine whether there is a relationship or correlation between target variable and predictor variable, described by using Multiple Linear Regression model.

spm <- read.csv("supermarket_sales - Sheet1.csv")
str(spm)

## 'data.frame':    1000 obs. of  17 variables:
##  $ Invoice.ID             : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
##  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
##  $ City                   : Factor w/ 3 levels "Mandalay","Naypyitaw",..: 3 2 3 3 3 2 3 2 3 1 ...
##  $ Customer.type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
##  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
##  $ Product.line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
##  $ Unit.price             : num  74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax.5.                 : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num  549 80.2 340.5 489 634.4 ...
##  $ Date                   : Factor w/ 89 levels "1/1/2019","1/10/2019",..: 27 88 82 20 58 77 49 48 2 44 ...
##  $ Time                   : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
##  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
##  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
##  $ gross.margin.percentage: num  4.76 4.76 4.76 4.76 4.76 ...
##  $ gross.income           : num  26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...

Attribute information :

Invoice id: Computer generated sales slip invoice identification number

Branch: Branch of supercenter (3 branches are available identified by A, B and C)

City: Location of supercenters

Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card

Gender: Gender type of customer

Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel

Unit price: Price of each product in $

Quantity: Number of products purchased by customer

Tax: 5% tax fee for customer buying

Total: Total price including tax

Date: Date of purchase (Record available from January 2019 to March 2019)

Time: Purchase time (10am to 9pm)

Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)

COGS: Cost of goods sold

Gross margin percentage: Gross margin percentage

Gross income: Gross income

Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

From existing 17 variables, I want to create two new variables with the following names :

UPT : Unit Per Transaction, to measure the average number of items that customers are purchasing in any given transaction.
ATV : Average Transaction Value, to measure the average amount or cash that customers spend in single transaction.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

spm$Invoice.ID <-  as.character(spm$Invoice.ID)
spm <- spm %>% group_by(Invoice.ID) %>% 
 mutate(Trx = n()) %>% 
    ungroup() %>% 
  mutate( UPT = Quantity/Trx, ATV = Total/Quantity ) 

range(spm$UPT)

## [1]  1 10

head(spm)

## # A tibble: 6 x 20
##   Invoice.ID Branch City  Customer.type Gender Product.line Unit.price
##   <chr>      <fct>  <fct> <fct>         <fct>  <fct>             <dbl>
## 1 750-67-84~ A      Yang~ Member        Female Health and ~       74.7
## 2 226-31-30~ C      Nayp~ Normal        Female Electronic ~       15.3
## 3 631-41-31~ A      Yang~ Normal        Male   Home and li~       46.3
## 4 123-19-11~ A      Yang~ Member        Male   Health and ~       58.2
## 5 373-73-79~ A      Yang~ Normal        Male   Sports and ~       86.3
## 6 699-14-30~ C      Nayp~ Normal        Male   Electronic ~       85.4
## # ... with 13 more variables: Quantity <int>, Tax.5. <dbl>, Total <dbl>,
## #   Date <fct>, Time <fct>, Payment <fct>, cogs <dbl>,
## #   gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## #   Trx <int>, UPT <dbl>, ATV <dbl>

To look the correlation matrix by using ggcorr function.

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

ggcorr(spm, label = T, hjust = 1, label_size = 2)

## Warning in ggcorr(spm, label = T, hjust = 1, label_size = 2): data in
## column(s) 'Invoice.ID', 'Branch', 'City', 'Customer.type', 'Gender',
## 'Product.line', 'Date', 'Time', 'Payment' are not numeric and were ignored

## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero

Multiple Linear Regression

A linear regression model that contains more than one predictor variable is called a multiple linear regression model.

I expect that Total Sales as target variable is affected by two predictor variables, UPT and ATV.

y = Total , x1 = UPT , x2 = ATV

Cook’s distance

model3 <- lm(Total~ UPT+ATV, spm)
summary(model3)

## 
## Call:
## lm(formula = Total ~ UPT + ATV, data = spm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -211.831  -47.400    0.434   45.925  217.304 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -324.5222     7.6933  -42.18   <2e-16 ***
## UPT           58.7715     0.8831   66.56   <2e-16 ***
## ATV            5.5368     0.0928   59.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.59 on 997 degrees of freedom
## Multiple R-squared:  0.8901, Adjusted R-squared:  0.8899 
## F-statistic:  4038 on 2 and 997 DF,  p-value: < 2.2e-16

From the summary above, I had the model of Total :

-324.5222 + 58.7715UPT + 5.5368ATV

with the interpretation :

- Each increase of 1 on UPT value, it will increase Total value as much 58.7715, with the condition that the value of other variables are constant

- Each increase of 1 on ATV value, it will increase Total value as much 5.5368, with the condition that the value of other variables are constant

- The predictor variables of UPT and ATV are significant to model because both of their p-values are 0.00000, ( p-value < 0.05 )

- R Square is 89%, it indicates that 89% of UPT and ATV variable can explain Total variable, and 11% is explained by other variables are not included into model

As I knew the model, if I have UPT value, 10 and 11, with ATV value both 50.

I could predict total value :

predict(model3, data.frame(UPT=c(10, 11), ATV = c(50, 50)))

##        1        2 
## 540.0333 598.8048

from the result, I got the difference between Total value 1 and Total value 2 is 58.771. So, it is proven that each increase of 1 on UPT value, it will increase Total value as much 58.771, with the condition that both ATV value are constant.

Assumption & Limitation

Linearity Test , x variable and y variable are significant correlation.

ggcorr(spm, label = T)

## Warning in ggcorr(spm, label = T): data in column(s) 'Invoice.ID',
## 'Branch', 'City', 'Customer.type', 'Gender', 'Product.line', 'Date',
## 'Time', 'Payment' are not numeric and were ignored

## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero

Correlation test is test for association or correlation between paired sample. pvalue < 0.05

Ho = unsignificant correlation ( cor = 0) H1 = significant correltion ( cor =/ 0) pvalue < 0.05

cor.test(spm$Total, spm$UPT)

## 
##  Pearson's product-moment correlation
## 
## data:  spm$Total and spm$UPT
## t = 31.449, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6729497 0.7353418
## sample estimates:
##       cor 
## 0.7055102

cor.test(spm$Total, spm$ATV)

## 
##  Pearson's product-moment correlation
## 
## data:  spm$Total and spm$ATV
## t = 25.897, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5953677 0.6696376
## sample estimates:
##       cor 
## 0.6339621

As the result above, I could intrepret that x variable and y variable are significant correlation with p-value is 0.0000 or p-value < 0.05

Multicollinearity Test : refers to predictor variable that are correlated to other predictor variable or to look whether there is a relationship among the predictor variable. By using VIF that value must be less than 10.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(model3)

##      UPT      ATV 
## 1.000116 1.000116

From the output, I could intrepret that there is no relationship or no correlated between predictors variable, with vif value is 1, or less than 10.

Normality of Residual

plot(density(model3$residuals))

plot(model3)

Saphiro test

shapiro.test(model3$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.98787, p-value = 2.339e-07

From the output, the p-value < 0.05 implying that the distribution of the data are not normal. In other words, I couldn’t assume the normality. ( NOT PASSED NORMALITY TEST )

Homoskedasticity

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(model3)

## 
##  studentized Breusch-Pagan test
## 
## data:  model3
## BP = 0.092184, df = 2, p-value = 0.955

Learning By Building (LBB) : Regression Model

Wahyudi Tri

8/8/2019

Multiple Linear Regression

Assumption & Limitation