LBB Regression

Gaos Tipki Alpandi

2022-06-02

Data Explanation

Objective

The data used in this modeling consists of variables describing the public's perception of a company: Loyalty, Trust, Quality, Customer Satisfaction, Negative Publicity, Community Outreach, and Price. There are 1711 observations, all of numeric type. The dataset comes from Kaggle.com.

Why?

These variables strongly influence a company's performance, so a company should provide the best possible services and products to keep consumers loyal in their purchases and transactions. A regression analysis lets us find out which variables have the greatest influence on consumer loyalty, and knowing the most influential variables helps company management devise the right strategy to increase revenue and profit.

Data Preparation

data <- read.csv("loyalty.csv")
data <- data[,-1] # Remove the 'CustomerID' column because it carries no information for this analysis.
rmarkdown::paged_table(data)
str(data)
## 'data.frame':    1711 obs. of  7 variables:
##  $ Loyalty             : num  6.14 6.03 6.53 6.83 6.64 ...
##  $ Price               : num  10 10 10 10 10 10 7.48 7.43 7.54 10 ...
##  $ Quality             : num  0.87 0.93 0.86 0.92 0.85 0.92 0.75 0.64 0.68 0.9 ...
##  $ Community.Outreach  : num  -0.07 0.14 -0.02 0.29 0.05 0.03 -0.02 -0.01 0.03 -0.07 ...
##  $ Trust               : num  7.45 7.62 7.48 7.39 7.42 7.71 5.86 5.74 5.93 7.68 ...
##  $ Customer.satifaction: num  0.78 0.9 0.85 0.87 0.66 0.94 0.89 0.84 0.91 0.92 ...
##  $ Negative.publicity  : num  0.04 0.05 0.06 0.06 0.07 0.07 0.08 0.08 0.09 0.1 ...
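
Before modeling, a quick completeness check can be added; this is a small sketch, not part of the original preparation:

# Check for missing values and duplicated rows before modeling
colSums(is.na(data))
sum(duplicated(data))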

Exploratory Data Analysis

library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(data, label = T, label_size = 5,hjust=1, layout.exp = 2)

The graphic shows that Loyalty correlates strongly (correlation > 0.5) with Price, Quality, Trust, and Customer.satifaction.
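
The same correlations can be read off numerically as a complement to the plot; a minimal check using base R:

# Correlation of every variable with Loyalty (Loyalty itself is 1 by definition)
round(sort(cor(data)[, "Loyalty"], decreasing = TRUE), 2)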

Modelling

Based on the correlation graphic, we can model Loyalty as the response variable and the others as predictor variables.

modelall_data <- lm(Loyalty~., data)
summary(modelall_data)
## 
## Call:
## lm(formula = Loyalty ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87321 -0.35232  0.02212  0.36966  1.97059 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.03509    0.18391 -11.065  < 2e-16 ***
## Price                 0.32907    0.03129  10.518  < 2e-16 ***
## Quality               2.58202    0.16735  15.429  < 2e-16 ***
## Community.Outreach    0.75944    0.09595   7.915 4.42e-15 ***
## Trust                 0.37415    0.03591  10.420  < 2e-16 ***
## Customer.satifaction  0.99510    0.12452   7.991 2.44e-15 ***
## Negative.publicity   -0.95285    0.08987 -10.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5786 on 1704 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7404 
## F-statistic:   814 on 6 and 1704 DF,  p-value: < 2.2e-16

Result: all predictor variables have significant p-values (less than 0.05), so we do not need step-wise regression to eliminate insignificant predictors. In addition, the adjusted R-squared is 0.7404, which is quite good for a model.
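
As a sanity check (a sketch, not part of the original workflow), backward elimination with step() can confirm that no predictor would be dropped:

# Backward elimination by AIC; since every predictor is significant, we expect the full model to be kept
model_step <- step(modelall_data, direction = "backward", trace = 0)
summary(model_step)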

Also

If we want to model a single predictor against the response (target) variable, we should choose Trust as the predictor and Loyalty as the response, because the two variables have a strong correlation of 0.8.

trust_to_loyalty <- lm(Loyalty~Trust, data)
summary(trust_to_loyalty)
## 
## Call:
## lm(formula = Loyalty ~ Trust, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25569 -0.52986  0.01248  0.56120  2.34380 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.60308    0.14940  -10.73   <2e-16 ***
## Trust        1.11529    0.02347   47.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7455 on 1709 degrees of freedom
## Multiple R-squared:  0.5693, Adjusted R-squared:  0.5691 
## F-statistic:  2259 on 1 and 1709 DF,  p-value: < 2.2e-16

Result: even though Trust is significant, the adjusted R-squared of this model is lower than that of the modelall_data model (0.5691 < 0.7404).
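
Because trust_to_loyalty is nested inside modelall_data, a partial F-test offers a formal comparison of the two models; a small sketch, not in the original analysis:

# Partial F-test: do the remaining predictors significantly improve on Trust alone?
anova(trust_to_loyalty, modelall_data)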

Evaluating Model

library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.1.3
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
RMSE(modelall_data$fitted.values, data$Loyalty)
## [1] 0.5773965
RMSE(trust_to_loyalty$fitted.values, data$Loyalty)
## [1] 0.7450636
summary(data$Loyalty)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.375   4.595   5.364   5.446   6.254   7.971

We can conclude that the modelall_data model has a lower RMSE than the trust_to_loyalty model (0.577 vs 0.745); both errors are fairly small relative to the range of Loyalty (roughly 2.4 to 8.0).
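
For reference, RMSE is simply the square root of the mean squared residual, so the same values can be reproduced without MLmetrics:

# RMSE computed by hand: square root of the mean squared difference between fitted and actual values
sqrt(mean((modelall_data$fitted.values - data$Loyalty)^2))
sqrt(mean((trust_to_loyalty$fitted.values - data$Loyalty)^2))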

Classical Assumption Test

When performing regression analysis, several assumptions need to be fulfilled, namely the classical assumptions consisting of the normality test, multicollinearity test, heteroscedasticity test, and autocorrelation test (Mulyono, 2019).

Normality Test

hist(modelall_data$residuals)

shapiro.test(modelall_data$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelall_data$residuals
## W = 0.9908, p-value = 6.539e-09

Result: the p-value of the Shapiro-Wilk test is below 0.05 (6.539e-09 < 0.05), so we conclude that the residuals of the modelall_data model are not normally distributed. A common reason for non-normal residuals is the presence of outliers in the data.

boxplot(data)

We can see that several variables, such as Quality, Trust, Community.Outreach, Customer.satifaction, and Negative.publicity, have some outliers.
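
One way to quantify what the boxplot shows (a sketch, not part of the original analysis) is to count, per variable, the values that fall outside the 1.5 * IQR whiskers:

# Number of observations outside the 1.5 * IQR whiskers for each variable (the same rule boxplot() uses)
sapply(data, function(x) length(boxplot.stats(x)$out))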

Linearity Test

library(dplyr) # provides the %>% pipe used below
resact <- data.frame(residual = modelall_data$residuals, fitted = modelall_data$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Result: there is a pattern in the residuals: they become more negative as the fitted values increase, before rising again. This pattern indicates that our model may not be linear enough.
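
To inspect linearity per predictor, component-plus-residual plots from the car package (which is also used below) are one option; this is a sketch, not part of the original analysis:

library(car)
# Component + residual (partial residual) plots; visible curvature suggests a non-linear relationship
crPlots(modelall_data)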

Autocorrelation Test

library(car)
## Loading required package: carData
durbinWatsonTest(modelall_data)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.04474191      1.908275   0.048
##  Alternative hypothesis: rho != 0

Result: the p-value of the Durbin-Watson test is bootstrap-based, so it varies slightly between runs; in the output above it is 0.048, which is borderline at the 0.05 level. However, the D-W statistic (1.91) is close to 2 and the estimated lag-1 autocorrelation (0.045) is very small, so there is little evidence of meaningful autocorrelation in the residuals.
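
Because durbinWatsonTest() bootstraps its p-value, lmtest::dwtest() can serve as a cross-check with an analytically computed p-value; a sketch (lmtest is also loaded in the next section):

library(lmtest)
# Durbin-Watson test with an analytic (non-bootstrap) p-value
dwtest(modelall_data)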

Homoscedasticity Test

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.1.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(modelall_data)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelall_data
## BP = 80.419, df = 6, p-value = 2.927e-15

Result: the p-value of the Breusch-Pagan test is below 0.05 (2.927e-15 < 0.05), so we conclude that the residuals of the modelall_data model do not have constant variance across fitted values; in other words, heteroscedasticity is present.

resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

The plot shows that the spread of the residuals changes as the fitted values increase rather than staying constant, which is why heteroscedasticity is detected.
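
A common remedy when heteroscedasticity is detected (not applied in the original analysis; just a sketch assuming the sandwich package is installed) is to report heteroscedasticity-consistent standard errors:

library(sandwich)
# Coefficient tests with heteroscedasticity-consistent (HC3) standard errors
coeftest(modelall_data, vcov. = vcovHC(modelall_data, type = "HC3"))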

Multicollinearity Test

library(car)
vif(modelall_data)
##                Price              Quality   Community.Outreach 
##             4.429890             2.110246             1.236937 
##                Trust Customer.satifaction   Negative.publicity 
##             3.887198             1.616260             1.426135

Result: since the VIF of each predictor is below 10, we can conclude that there is no severe multicollinearity among the predictor variables.
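
For intuition, the VIF of a predictor equals 1 / (1 - R^2) from regressing that predictor on the other predictors; a sketch reproducing the value for Price:

# Manual VIF for Price: regress Price on the other predictors (the response Loyalty is excluded)
r2_price <- summary(lm(Price ~ . - Loyalty, data = data))$r.squared
1 / (1 - r2_price) # should reproduce the vif() value for Price (about 4.43)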

Conclusion

1. The best regression model for the data is \[ \begin{aligned} \hat{y} = {} & -2.03509 + 0.32907 \cdot Price + 2.58202 \cdot Quality + 0.75944 \cdot Community.Outreach \\ & + 0.37415 \cdot Trust + 0.99510 \cdot Customer.satifaction - 0.95285 \cdot Negative.publicity \end{aligned} \]

2. Variables that are useful to describe the Loyalty of customers are Price, Quality, Community Outreach, Trust, Customer Satisfaction, and Negative Publicity. Based on the model, Quality is the most influential variable in describing customer Loyalty. However, we cannot say the model is fully representative, because it only fulfills two of the classical assumptions (the multicollinearity and autocorrelation tests). The adjusted R-squared of the model is quite good at 74.04%.

3. The interpretation of each variable is as follows: (a) Loyalty increases by 0.33 when Price increases by one unit and the other variables are held constant; (b) Loyalty increases by 2.58 when Quality increases by one unit and the other variables are held constant; (c) Loyalty increases by 0.76 when Community Outreach increases by one unit and the other variables are held constant; (d) Loyalty increases by 0.37 when Trust increases by one unit and the other variables are held constant; (e) Loyalty increases by 0.99 when Customer Satisfaction increases by one unit and the other variables are held constant; (f) Loyalty decreases by 0.95 when Negative Publicity increases by one unit and the other variables are held constant.
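
As an illustration of how the model would be used in practice, we can predict Loyalty for a new customer profile; the values below are hypothetical, not taken from the data:

# Predict Loyalty for a hypothetical customer profile (illustrative values only)
new_customer <- data.frame(
  Price = 8,
  Quality = 0.85,
  Community.Outreach = 0.05,
  Trust = 7,
  Customer.satifaction = 0.85,
  Negative.publicity = 0.05
)
predict(modelall_data, newdata = new_customer)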

Reference