In this LBB, I will use a Supermarket Sales Report dataset from Kaggle.com. The purpose of using this data is to determine whether there is a relationship or correlation between target variable and predictor variable, described by using Multiple Linear Regression model.
spm <- read.csv("supermarket_sales - Sheet1.csv")
str(spm)
## 'data.frame': 1000 obs. of 17 variables:
## $ Invoice.ID : Factor w/ 1000 levels "101-17-6199",..: 815 143 654 19 340 734 316 265 703 727 ...
## $ Branch : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
## $ City : Factor w/ 3 levels "Mandalay","Naypyitaw",..: 3 2 3 3 3 2 3 2 3 1 ...
## $ Customer.type : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
## $ Product.line : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
## $ Unit.price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax.5. : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## $ Date : Factor w/ 89 levels "1/1/2019","1/10/2019",..: 27 88 82 20 58 77 49 48 2 44 ...
## $ Time : Factor w/ 506 levels "10:00","10:01",..: 147 24 156 486 30 394 215 78 342 160 ...
## $ Payment : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross.margin.percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross.income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
Attribute information :
Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C)
City: Location of supercenters
Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card
Gender: Gender type of customer
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in $
Quantity: Number of products purchased by customer
Tax: 5% tax fee for customer buying
Total: Total price including tax
Date: Date of purchase (Record available from January 2019 to March 2019)
Time: Purchase time (10am to 9pm)
Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
COGS: Cost of goods sold
Gross margin percentage: Gross margin percentage
Gross income: Gross income
Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)
From existing 17 variables, I want to create two new variables with the following names :
UPT : Unit Per Transaction, to measure the average number of items that customers are purchasing in any given transaction.
ATV : Average Transaction Value, to measure the average amount or cash that customers spend in single transaction.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
spm$Invoice.ID <- as.character(spm$Invoice.ID)
spm <- spm %>% group_by(Invoice.ID) %>%
mutate(Trx = n()) %>%
ungroup() %>%
mutate( UPT = Quantity/Trx, ATV = Total/Quantity )
range(spm$UPT)
## [1] 1 10
head(spm)
## # A tibble: 6 x 20
## Invoice.ID Branch City Customer.type Gender Product.line Unit.price
## <chr> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 750-67-84~ A Yang~ Member Female Health and ~ 74.7
## 2 226-31-30~ C Nayp~ Normal Female Electronic ~ 15.3
## 3 631-41-31~ A Yang~ Normal Male Home and li~ 46.3
## 4 123-19-11~ A Yang~ Member Male Health and ~ 58.2
## 5 373-73-79~ A Yang~ Normal Male Sports and ~ 86.3
## 6 699-14-30~ C Nayp~ Normal Male Electronic ~ 85.4
## # ... with 13 more variables: Quantity <int>, Tax.5. <dbl>, Total <dbl>,
## # Date <fct>, Time <fct>, Payment <fct>, cogs <dbl>,
## # gross.margin.percentage <dbl>, gross.income <dbl>, Rating <dbl>,
## # Trx <int>, UPT <dbl>, ATV <dbl>
To look the correlation matrix by using ggcorr function.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
ggcorr(spm, label = T, hjust = 1, label_size = 2)
## Warning in ggcorr(spm, label = T, hjust = 1, label_size = 2): data in
## column(s) 'Invoice.ID', 'Branch', 'City', 'Customer.type', 'Gender',
## 'Product.line', 'Date', 'Time', 'Payment' are not numeric and were ignored
## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero
A linear regression model that contains more than one predictor variable is called a multiple linear regression model.
I expect that Total Sales as target variable is affected by two predictor variables, UPT and ATV.
y = Total , x1 = UPT , x2 = ATV
Cook’s distance
model3 <- lm(Total~ UPT+ATV, spm)
summary(model3)
##
## Call:
## lm(formula = Total ~ UPT + ATV, data = spm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.831 -47.400 0.434 45.925 217.304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -324.5222 7.6933 -42.18 <2e-16 ***
## UPT 58.7715 0.8831 66.56 <2e-16 ***
## ATV 5.5368 0.0928 59.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.59 on 997 degrees of freedom
## Multiple R-squared: 0.8901, Adjusted R-squared: 0.8899
## F-statistic: 4038 on 2 and 997 DF, p-value: < 2.2e-16
From the summary above, I had the model of Total :
-324.5222 + 58.7715UPT + 5.5368ATV
with the interpretation :
- Each increase of 1 on UPT value, it will increase Total value as much 58.7715, with the condition that the value of other variables are constant
- Each increase of 1 on ATV value, it will increase Total value as much 5.5368, with the condition that the value of other variables are constant
- The predictor variables of UPT and ATV are significant to model because both of their p-values are 0.00000, ( p-value < 0.05 )
- R Square is 89%, it indicates that 89% of UPT and ATV variable can explain Total variable, and 11% is explained by other variables are not included into model
As I knew the model, if I have UPT value, 10 and 11, with ATV value both 50.
I could predict total value :
predict(model3, data.frame(UPT=c(10, 11), ATV = c(50, 50)))
## 1 2
## 540.0333 598.8048
from the result, I got the difference between Total value 1 and Total value 2 is 58.771. So, it is proven that each increase of 1 on UPT value, it will increase Total value as much 58.771, with the condition that both ATV value are constant.
ggcorr(spm, label = T)
## Warning in ggcorr(spm, label = T): data in column(s) 'Invoice.ID',
## 'Branch', 'City', 'Customer.type', 'Gender', 'Product.line', 'Date',
## 'Time', 'Payment' are not numeric and were ignored
## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero
Correlation test is test for association or correlation between paired sample. pvalue < 0.05
Ho = unsignificant correlation ( cor = 0) H1 = significant correltion ( cor =/ 0) pvalue < 0.05
cor.test(spm$Total, spm$UPT)
##
## Pearson's product-moment correlation
##
## data: spm$Total and spm$UPT
## t = 31.449, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6729497 0.7353418
## sample estimates:
## cor
## 0.7055102
cor.test(spm$Total, spm$ATV)
##
## Pearson's product-moment correlation
##
## data: spm$Total and spm$ATV
## t = 25.897, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5953677 0.6696376
## sample estimates:
## cor
## 0.6339621
As the result above, I could intrepret that x variable and y variable are significant correlation with p-value is 0.0000 or p-value < 0.05
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model3)
## UPT ATV
## 1.000116 1.000116
From the output, I could intrepret that there is no relationship or no correlated between predictors variable, with vif value is 1, or less than 10.
plot(density(model3$residuals))
plot(model3)
Saphiro test
shapiro.test(model3$residuals)
##
## Shapiro-Wilk normality test
##
## data: model3$residuals
## W = 0.98787, p-value = 2.339e-07
From the output, the p-value < 0.05 implying that the distribution of the data are not normal. In other words, I couldn’t assume the normality. ( NOT PASSED NORMALITY TEST )
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model3)
##
## studentized Breusch-Pagan test
##
## data: model3
## BP = 0.092184, df = 2, p-value = 0.955