Business Question

A company wants to know which aspects affect the profit of its company. Most of the company’s data is only in the form of numbers, therefore we will find out what aspects affect its profit by using a linear regression model.

let’s Start!

Import Library

library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(data.table)

Data Preparation & EDA

Read Data

online <- read.csv("online.csv", stringsAsFactors = T)
online
##    Marketing.Spend Administration Transport    Area    Profit
## 1        114523.61      136897.80 471784.10   Dhaka 192261.83
## 2        162597.70      151377.59 443898.53     Ctg 191792.06
## 3        153441.51      101145.55 407934.54 Rangpur 191050.39
## 4        144372.41      118671.85 383199.62   Dhaka 182901.99
## 5        142107.34       91391.77 366168.42 Rangpur 166187.94
## 6        131876.90       99814.71 362861.36   Dhaka 156991.12
## 7        134615.46      147198.87 127716.82     Ctg 156122.51
## 8        130298.13      145530.06 323876.68 Rangpur 155752.60
## 9        120542.52      148718.95 311613.29   Dhaka 152211.77
## 10       123334.88      108679.17 304981.62     Ctg 149759.96
## 11       101913.08      110594.11 229160.95 Rangpur 146121.95
## 12       100671.96       91790.61 249744.55     Ctg 144259.40
## 13        93863.75      127320.38 249839.44 Rangpur 141585.52
## 14        91992.39      135495.07 252664.93     Ctg 134307.35
## 15       119943.24      156547.42 256512.92 Rangpur 132602.65
## 16       165349.20      122616.84 261776.23   Dhaka 129917.04
## 17        78013.11      121597.55 264346.06     Ctg 126992.93
## 18        94657.16      145077.58 282574.31   Dhaka 125370.37
## 19        91749.16      114175.79 294919.57 Rangpur 124266.90
## 20        86419.70      153514.11      0.00   Dhaka 122776.86
## 21        76253.86      113867.30 298664.47     Ctg 118474.03
## 22        78389.47      153773.43 299737.29   Dhaka 111313.02
## 23        73994.56      122782.75 303319.26 Rangpur 110352.25
## 24        67532.53      105751.03 304768.73 Rangpur 108733.99
## 25        77044.01       99281.34 140574.81   Dhaka 108552.04
## 26        64664.71      139553.16 137962.62     Ctg 107404.34
## 27        75328.87      144135.98 134050.07 Rangpur 105733.54
## 28        72107.60      127864.55 353183.81   Dhaka 105008.31
## 29        66051.52      182645.56 118148.20 Rangpur 103282.38
## 30        65605.48      153032.06 107138.38   Dhaka 101004.64
## 31        61994.48      115641.28  91131.24 Rangpur  99937.59
## 32        61136.38      152701.92  88218.23   Dhaka  97483.56
## 33        63408.86      129219.61  46085.25     Ctg  97427.84
## 34        55493.95      103057.49 214634.81 Rangpur  96778.92
## 35        46426.07      157693.92 210797.67     Ctg  96712.80
## 36        46014.02       85047.44 205517.64   Dhaka  96479.51
## 37        28663.76      127056.21 201126.82 Rangpur  90708.19
## 38        44069.95       51283.14 197029.42     Ctg  89949.14
## 39        20229.59       65947.93 185265.10   Dhaka  81229.06
## 40        38558.51       82982.09 174999.30     Ctg  81005.76
## 41        28754.33      118546.05 172795.67     Ctg  78239.91
## 42        27892.92       84710.77 164470.71 Rangpur  77798.83
## 43        23640.93       96189.63 148001.11     Ctg  71498.49
## 44        15505.73      127382.30  35534.17   Dhaka  69758.98
## 45        22177.74      154806.14  28334.72     Ctg  65200.33
## 46         1000.23      124153.04   1903.93   Dhaka  64926.08
## 47         1315.46      115816.21 297114.46 Rangpur  49490.75
## 48            0.00      135426.92      0.00     Ctg  42559.73
## 49          542.05       51743.15      0.00   Dhaka  35673.41
## 50            0.00      116983.80  45173.06     Ctg  14681.40
unique(online$Area)
## [1] Dhaka   Ctg     Rangpur
## Levels: Ctg Dhaka Rangpur

Data Type

glimpse(online)
## Rows: 50
## Columns: 5
## $ Marketing.Spend <dbl> 114523.61, 162597.70, 153441.51, 144372.41, 142107....
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.7...
## $ Transport       <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3...
## $ Area            <fct> Dhaka, Ctg, Rangpur, Dhaka, Rangpur, Dhaka, Ctg, Ra...
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1...

As we can see Area variable is factor.

Checking Missing value

After that, let’s check is there any missing values or NA

colSums(is.na(online))
## Marketing.Spend  Administration       Transport            Area          Profit 
##               0               0               0               0               0

From the data above we can see that there are no missing value

Checking Distribution

ggplot(gather(online %>% select_if(is.numeric)), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Let’s check the distribution using scatterplot to see the outliers possibility

plot(online %>% select_if(is.numeric))

Checking Correlation

In this part, we will try to observe and explore our variables to see if there any pattern on them. We ’ll use ggcorr argument

ggcorr(online, label = T, hjust= 1, layout.exp = 3)
## Warning in ggcorr(online, label = T, hjust = 1, layout.exp = 3): data in
## column(s) 'Area' are not numeric and were ignored

Each variable has a strong enough correlation to the target variable. Then it is suitable to do regression analysis using a linear model.

Modelling

Before creating models, let’s splitting data into two part, data train and data test. We use 80% data for train and 20% for test

set.seed(123)
index <- sample (nrow(online), nrow(online)*0.8)
data_train<- online[index, ]
data_test <- online[-index, ]

Creating Model

Baseline Model

For Baseline model, we create using all predictors

model_all <- lm(formula = Profit~., data=data_train)
summary(model_all)
## 
## Call:
## lm(formula = Profit ~ ., data = data_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31825  -5622     12   6509  17692 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.064e+04  8.319e+03   6.088 6.63e-07 ***
## Marketing.Spend  8.137e-01  5.908e-02  13.773 1.81e-15 ***
## Administration  -4.753e-02  6.356e-02  -0.748    0.460    
## Transport        3.148e-02  2.131e-02   1.477    0.149    
## AreaDhaka        1.617e+03  3.818e+03   0.423    0.675    
## AreaRangpur      1.407e+03  4.003e+03   0.352    0.727    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9880 on 34 degrees of freedom
## Multiple R-squared:  0.9425, Adjusted R-squared:  0.934 
## F-statistic: 111.4 on 5 and 34 DF,  p-value: < 2.2e-16

From the summary above we can say: 1.Marketing.Spend are highly siginificant to Food Demand 2 Adjusted r squared value is quite high 0.934, this means this model is good enough to be used.

Model Tuning (Step-wise Regression)

From the model made, there are still several variables that do not have a significant effect on the target variable. let’s do a stepwise regression to choose the best variable.

firstly, we make model without using any variable that will be used in forward regression.

model_none <- lm(formula = Profit~1, data = data_train)

Backward

model_backward <- step(object = model_all, direction = "backward", trace = 1)
## Start:  AIC=741.36
## Profit ~ Marketing.Spend + Administration + Transport + Area
## 
##                   Df  Sum of Sq        RSS    AIC
## - Area             2 2.0285e+07 3.3391e+09 737.60
## - Administration   1 5.4583e+07 3.3734e+09 740.01
## <none>                          3.3188e+09 741.36
## - Transport        1 2.1299e+08 3.5318e+09 741.85
## - Marketing.Spend  1 1.8516e+10 2.1835e+10 814.72
## 
## Step:  AIC=737.6
## Profit ~ Marketing.Spend + Administration + Transport
## 
##                   Df  Sum of Sq        RSS    AIC
## - Administration   1 5.3267e+07 3.3923e+09 736.24
## <none>                          3.3391e+09 737.60
## - Transport        1 2.2913e+08 3.5682e+09 738.26
## - Marketing.Spend  1 1.8748e+10 2.2087e+10 811.18
## 
## Step:  AIC=736.24
## Profit ~ Marketing.Spend + Transport
## 
##                   Df  Sum of Sq        RSS    AIC
## <none>                          3.3923e+09 736.24
## - Transport        1 3.8181e+08 3.7741e+09 738.50
## - Marketing.Spend  1 2.2130e+10 2.5523e+10 814.96

Forward

model_forward <- step(object = model_none, direction = "forward", 
                      scope = list(lower=model_none, upper=model_all))
## Start:  AIC=845.59
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS    AIC
## + Marketing.Spend  1 5.3930e+10 3.7741e+09 738.50
## + Transport        1 3.2182e+10 2.5523e+10 814.96
## <none>                          5.7704e+10 845.59
## + Area             2 5.3160e+09 5.2388e+10 845.72
## + Administration   1 1.6647e+09 5.6040e+10 846.42
## 
## Step:  AIC=738.5
## Profit ~ Marketing.Spend
## 
##                  Df Sum of Sq        RSS    AIC
## + Transport       1 381807938 3392330395 736.24
## + Administration  1 205947710 3568190624 738.26
## <none>                        3774138334 738.50
## + Area            2  40320654 3733817680 742.07
## 
## Step:  AIC=736.24
## Profit ~ Marketing.Spend + Transport
## 
##                  Df Sum of Sq        RSS    AIC
## <none>                        3392330395 736.24
## + Administration  1  53267402 3339062993 737.60
## + Area            2  18969245 3373361150 740.01

Both

model_both1 <- step(object = model_all, direction = "both", 
                    scope = list(lower=model_none, upper = model_all))
## Start:  AIC=741.36
## Profit ~ Marketing.Spend + Administration + Transport + Area
## 
##                   Df  Sum of Sq        RSS    AIC
## - Area             2 2.0285e+07 3.3391e+09 737.60
## - Administration   1 5.4583e+07 3.3734e+09 740.01
## <none>                          3.3188e+09 741.36
## - Transport        1 2.1299e+08 3.5318e+09 741.85
## - Marketing.Spend  1 1.8516e+10 2.1835e+10 814.72
## 
## Step:  AIC=737.6
## Profit ~ Marketing.Spend + Administration + Transport
## 
##                   Df  Sum of Sq        RSS    AIC
## - Administration   1 5.3267e+07 3.3923e+09 736.24
## <none>                          3.3391e+09 737.60
## - Transport        1 2.2913e+08 3.5682e+09 738.26
## + Area             2 2.0285e+07 3.3188e+09 741.36
## - Marketing.Spend  1 1.8748e+10 2.2087e+10 811.18
## 
## Step:  AIC=736.24
## Profit ~ Marketing.Spend + Transport
## 
##                   Df  Sum of Sq        RSS    AIC
## <none>                          3.3923e+09 736.24
## + Administration   1 5.3267e+07 3.3391e+09 737.60
## - Transport        1 3.8181e+08 3.7741e+09 738.50
## + Area             2 1.8969e+07 3.3734e+09 740.01
## - Marketing.Spend  1 2.2130e+10 2.5523e+10 814.96
model_both2 <- step(object = model_none, direction = "both", 
                    scope = list(lower=model_none, upper = model_all))
## Start:  AIC=845.59
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS    AIC
## + Marketing.Spend  1 5.3930e+10 3.7741e+09 738.50
## + Transport        1 3.2182e+10 2.5523e+10 814.96
## <none>                          5.7704e+10 845.59
## + Area             2 5.3160e+09 5.2388e+10 845.72
## + Administration   1 1.6647e+09 5.6040e+10 846.42
## 
## Step:  AIC=738.5
## Profit ~ Marketing.Spend
## 
##                   Df  Sum of Sq        RSS    AIC
## + Transport        1 3.8181e+08 3.3923e+09 736.24
## + Administration   1 2.0595e+08 3.5682e+09 738.26
## <none>                          3.7741e+09 738.50
## + Area             2 4.0321e+07 3.7338e+09 742.07
## - Marketing.Spend  1 5.3930e+10 5.7704e+10 845.59
## 
## Step:  AIC=736.24
## Profit ~ Marketing.Spend + Transport
## 
##                   Df  Sum of Sq        RSS    AIC
## <none>                          3.3923e+09 736.24
## + Administration   1 5.3267e+07 3.3391e+09 737.60
## - Transport        1 3.8181e+08 3.7741e+09 738.50
## + Area             2 1.8969e+07 3.3734e+09 740.01
## - Marketing.Spend  1 2.2130e+10 2.5523e+10 814.96

Goodnes of Fit Comparison

summary(model_all)$adj.r.squared
## [1] 0.9340287
summary(model_backward)$adj.r.squared
## [1] 0.9380342
summary(model_forward)$adj.r.squared
## [1] 0.9380342
summary(model_both1)$adj.r.squared
## [1] 0.9380342
summary(model_both2)$adj.r.squared
## [1] 0.9380342

The adjusted r-squared model that has gone through the stepwise describes more information from the taget variable

Model Evaluation

library(performance)
compare_performance(model_all,model_backward,model_forward,model_both1,model_both2)
## # Comparison of Model Performance Indices
## 
## Model          | Type |    AIC |    BIC |   R2 | R2 (adj.) |    RMSE |   Sigma |     BF |     p
## -----------------------------------------------------------------------------------------------
## model_all      |   lm | 856.87 | 868.70 | 0.94 |      0.93 | 9108.76 | 9879.83 |   1.00 |      
## model_backward |   lm | 851.75 | 858.51 | 0.94 |      0.94 | 9209.14 | 9575.21 | 163.19 | 0.861
## model_forward  |   lm | 851.75 | 858.51 | 0.94 |      0.94 | 9209.14 | 9575.21 | 163.19 |      
## model_both1    |   lm | 851.75 | 858.51 | 0.94 |      0.94 | 9209.14 | 9575.21 | 163.19 |      
## model_both2    |   lm | 851.75 | 858.51 | 0.94 |      0.94 | 9209.14 | 9575.21 | 163.19 |

the results of the error evaluation model with all predictors and the combined stages of forward and backward or both have a higher error than the model carried out by the forward or backward stepwise only. Judging from the value of adj rsquared, backward and forward stepwise models are better than both models and models with all variables. If there is a case like this, we must choose to be concerned with how the accuracy of the model predicts new data (choose the lowest error evaluation) or the model that describes the best target variable information (choose the highest adj.rsquared).

Evaluation to Test Data

# predict ke unseen data
all_pred <- predict(model_all,data_test)
backward_pred <- predict(model_backward,data_test)
forward_pred <- predict(model_forward,data_test)
both_pred1 <- predict(model_both1,data_test)
both_pred2 <- predict(model_both2,data_test)

data.frame(RMSE = c(RMSE(all_pred,data_test$Profit),
                    RMSE(backward_pred,data_test$Profit),
                    RMSE(forward_pred,data_test$Profit),
                    RMSE(both_pred1,data_test$Profit),
                    RMSE(both_pred2,data_test$Profit)),
           model = c("All","Backward","Forward","Both 1","Both 2"))
##       RMSE    model
## 1 23150.61      All
## 2 22378.48 Backward
## 3 22378.48  Forward
## 4 22378.48   Both 1
## 5 22378.48   Both 2

From the RMSE table above we can see that Backward and Forward stepwise model has lowest RMSE, so I will choose one of them which is Backward Model to do the Assumption Checking.

Assumption Checking

Normality

  1. Create a visualization of the resulting error (histogram)
hist(model_backward$residuals,breaks = 10)

2.Perform a statistical test using the shapiro.test () function

Shapiro-Wilk hypothesis:

  • H0: error/residual normally distributed
  • H1: error/residual isn’t normally distributed
shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.94292, p-value = 0.04341

From the Shapiro test above, we can see that the p-value is smaller than alpha, so we can conclude that the error / residual is not normally distributed.

Homoscedasticity

  1. Make a visualization between the prediction results and the error (scatter plot)
plot(model_backward$fitted.values, model_backward$residuals)
abline(h=0, col = "red")

From scatter plot above we can see that the residuals didn’t have pattern. So, the plot above shows that the residual meets homoscedasticity .

  1. Performs a Breusch-Pagan test using the bptest () function from the lmtest library

Breusch-Pagan hypothesis :

  • H0: Homoscedasticity
  • H1: Heteroscedasticity
library(lmtest)
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 3.1273, df = 2, p-value = 0.2094

From the Breusch-Pagan test above, we can see that p-value higher than alpha, so we can’t reject the H0 hypothesis and it’s mean that the residuals is homoscedastic.

No-multicolinearity

One of the statistical tools to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF is a way to measure the effect of multicollinearity among the predictors in our model. VIF < 10 means, no multicollinearity occurred among predictors.

library(car)
vif(model_backward)
## Marketing.Spend       Transport 
##         2.03719         2.03719

VIF value is less than 10, the assumption of non-multicollinearity is fulfilled.

Conclusion

From the Assumption Checking, we can see that the model passes the No-multicolinearity and Homoscedasticity test but failed in residual normality test. The model with lowest RMSE are forward and backward model at 9209.14, from the model with lowest RMSE has adjusted r2 0.94 which means the predictor that choosen in the models can interpret 94% of information and 6% of information could be interpreted by the other vairables.

So, the variables that affecting Profit is Marketing.Spend and Transport.