A company wants to know which aspects affect the profit of its company. Most of the company’s data is only in the form of numbers, therefore we will find out what aspects affect its profit by using a linear regression model.
let’s Start!
library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)
library(data.table)
online <- read.csv("online.csv", stringsAsFactors = T)
online
## Marketing.Spend Administration Transport Area Profit
## 1 114523.61 136897.80 471784.10 Dhaka 192261.83
## 2 162597.70 151377.59 443898.53 Ctg 191792.06
## 3 153441.51 101145.55 407934.54 Rangpur 191050.39
## 4 144372.41 118671.85 383199.62 Dhaka 182901.99
## 5 142107.34 91391.77 366168.42 Rangpur 166187.94
## 6 131876.90 99814.71 362861.36 Dhaka 156991.12
## 7 134615.46 147198.87 127716.82 Ctg 156122.51
## 8 130298.13 145530.06 323876.68 Rangpur 155752.60
## 9 120542.52 148718.95 311613.29 Dhaka 152211.77
## 10 123334.88 108679.17 304981.62 Ctg 149759.96
## 11 101913.08 110594.11 229160.95 Rangpur 146121.95
## 12 100671.96 91790.61 249744.55 Ctg 144259.40
## 13 93863.75 127320.38 249839.44 Rangpur 141585.52
## 14 91992.39 135495.07 252664.93 Ctg 134307.35
## 15 119943.24 156547.42 256512.92 Rangpur 132602.65
## 16 165349.20 122616.84 261776.23 Dhaka 129917.04
## 17 78013.11 121597.55 264346.06 Ctg 126992.93
## 18 94657.16 145077.58 282574.31 Dhaka 125370.37
## 19 91749.16 114175.79 294919.57 Rangpur 124266.90
## 20 86419.70 153514.11 0.00 Dhaka 122776.86
## 21 76253.86 113867.30 298664.47 Ctg 118474.03
## 22 78389.47 153773.43 299737.29 Dhaka 111313.02
## 23 73994.56 122782.75 303319.26 Rangpur 110352.25
## 24 67532.53 105751.03 304768.73 Rangpur 108733.99
## 25 77044.01 99281.34 140574.81 Dhaka 108552.04
## 26 64664.71 139553.16 137962.62 Ctg 107404.34
## 27 75328.87 144135.98 134050.07 Rangpur 105733.54
## 28 72107.60 127864.55 353183.81 Dhaka 105008.31
## 29 66051.52 182645.56 118148.20 Rangpur 103282.38
## 30 65605.48 153032.06 107138.38 Dhaka 101004.64
## 31 61994.48 115641.28 91131.24 Rangpur 99937.59
## 32 61136.38 152701.92 88218.23 Dhaka 97483.56
## 33 63408.86 129219.61 46085.25 Ctg 97427.84
## 34 55493.95 103057.49 214634.81 Rangpur 96778.92
## 35 46426.07 157693.92 210797.67 Ctg 96712.80
## 36 46014.02 85047.44 205517.64 Dhaka 96479.51
## 37 28663.76 127056.21 201126.82 Rangpur 90708.19
## 38 44069.95 51283.14 197029.42 Ctg 89949.14
## 39 20229.59 65947.93 185265.10 Dhaka 81229.06
## 40 38558.51 82982.09 174999.30 Ctg 81005.76
## 41 28754.33 118546.05 172795.67 Ctg 78239.91
## 42 27892.92 84710.77 164470.71 Rangpur 77798.83
## 43 23640.93 96189.63 148001.11 Ctg 71498.49
## 44 15505.73 127382.30 35534.17 Dhaka 69758.98
## 45 22177.74 154806.14 28334.72 Ctg 65200.33
## 46 1000.23 124153.04 1903.93 Dhaka 64926.08
## 47 1315.46 115816.21 297114.46 Rangpur 49490.75
## 48 0.00 135426.92 0.00 Ctg 42559.73
## 49 542.05 51743.15 0.00 Dhaka 35673.41
## 50 0.00 116983.80 45173.06 Ctg 14681.40
unique(online$Area)
## [1] Dhaka Ctg Rangpur
## Levels: Ctg Dhaka Rangpur
glimpse(online)
## Rows: 50
## Columns: 5
## $ Marketing.Spend <dbl> 114523.61, 162597.70, 153441.51, 144372.41, 142107....
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.7...
## $ Transport <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3...
## $ Area <fct> Dhaka, Ctg, Rangpur, Dhaka, Rangpur, Dhaka, Ctg, Ra...
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1...
As we can see Area
variable is factor.
After that, let’s check is there any missing values or NA
colSums(is.na(online))
## Marketing.Spend Administration Transport Area Profit
## 0 0 0 0 0
From the data above we can see that there are no missing value
ggplot(gather(online %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
Let’s check the distribution using scatterplot to see the outliers possibility
plot(online %>% select_if(is.numeric))
In this part, we will try to observe and explore our variables to see if there any pattern on them. We ’ll use ggcorr argument
ggcorr(online, label = T, hjust= 1, layout.exp = 3)
## Warning in ggcorr(online, label = T, hjust = 1, layout.exp = 3): data in
## column(s) 'Area' are not numeric and were ignored
Each variable has a strong enough correlation to the target variable. Then it is suitable to do regression analysis using a linear model.
Before creating models, let’s splitting data into two part, data train and data test. We use 80% data for train and 20% for test
set.seed(123)
index <- sample (nrow(online), nrow(online)*0.8)
data_train<- online[index, ]
data_test <- online[-index, ]
For Baseline model, we create using all predictors
model_all <- lm(formula = Profit~., data=data_train)
summary(model_all)
##
## Call:
## lm(formula = Profit ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31825 -5622 12 6509 17692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.064e+04 8.319e+03 6.088 6.63e-07 ***
## Marketing.Spend 8.137e-01 5.908e-02 13.773 1.81e-15 ***
## Administration -4.753e-02 6.356e-02 -0.748 0.460
## Transport 3.148e-02 2.131e-02 1.477 0.149
## AreaDhaka 1.617e+03 3.818e+03 0.423 0.675
## AreaRangpur 1.407e+03 4.003e+03 0.352 0.727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9880 on 34 degrees of freedom
## Multiple R-squared: 0.9425, Adjusted R-squared: 0.934
## F-statistic: 111.4 on 5 and 34 DF, p-value: < 2.2e-16
From the summary above we can say: 1.Marketing.Spend
are highly siginificant to Food Demand
2 Adjusted r squared value is quite high 0.934, this means this model is good enough to be used.
From the model made, there are still several variables that do not have a significant effect on the target variable. let’s do a stepwise regression
to choose the best variable.
firstly, we make model without using any variable that will be used in forward regression.
model_none <- lm(formula = Profit~1, data = data_train)
model_backward <- step(object = model_all, direction = "backward", trace = 1)
## Start: AIC=741.36
## Profit ~ Marketing.Spend + Administration + Transport + Area
##
## Df Sum of Sq RSS AIC
## - Area 2 2.0285e+07 3.3391e+09 737.60
## - Administration 1 5.4583e+07 3.3734e+09 740.01
## <none> 3.3188e+09 741.36
## - Transport 1 2.1299e+08 3.5318e+09 741.85
## - Marketing.Spend 1 1.8516e+10 2.1835e+10 814.72
##
## Step: AIC=737.6
## Profit ~ Marketing.Spend + Administration + Transport
##
## Df Sum of Sq RSS AIC
## - Administration 1 5.3267e+07 3.3923e+09 736.24
## <none> 3.3391e+09 737.60
## - Transport 1 2.2913e+08 3.5682e+09 738.26
## - Marketing.Spend 1 1.8748e+10 2.2087e+10 811.18
##
## Step: AIC=736.24
## Profit ~ Marketing.Spend + Transport
##
## Df Sum of Sq RSS AIC
## <none> 3.3923e+09 736.24
## - Transport 1 3.8181e+08 3.7741e+09 738.50
## - Marketing.Spend 1 2.2130e+10 2.5523e+10 814.96
model_forward <- step(object = model_none, direction = "forward",
scope = list(lower=model_none, upper=model_all))
## Start: AIC=845.59
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 5.3930e+10 3.7741e+09 738.50
## + Transport 1 3.2182e+10 2.5523e+10 814.96
## <none> 5.7704e+10 845.59
## + Area 2 5.3160e+09 5.2388e+10 845.72
## + Administration 1 1.6647e+09 5.6040e+10 846.42
##
## Step: AIC=738.5
## Profit ~ Marketing.Spend
##
## Df Sum of Sq RSS AIC
## + Transport 1 381807938 3392330395 736.24
## + Administration 1 205947710 3568190624 738.26
## <none> 3774138334 738.50
## + Area 2 40320654 3733817680 742.07
##
## Step: AIC=736.24
## Profit ~ Marketing.Spend + Transport
##
## Df Sum of Sq RSS AIC
## <none> 3392330395 736.24
## + Administration 1 53267402 3339062993 737.60
## + Area 2 18969245 3373361150 740.01
model_both1 <- step(object = model_all, direction = "both",
scope = list(lower=model_none, upper = model_all))
## Start: AIC=741.36
## Profit ~ Marketing.Spend + Administration + Transport + Area
##
## Df Sum of Sq RSS AIC
## - Area 2 2.0285e+07 3.3391e+09 737.60
## - Administration 1 5.4583e+07 3.3734e+09 740.01
## <none> 3.3188e+09 741.36
## - Transport 1 2.1299e+08 3.5318e+09 741.85
## - Marketing.Spend 1 1.8516e+10 2.1835e+10 814.72
##
## Step: AIC=737.6
## Profit ~ Marketing.Spend + Administration + Transport
##
## Df Sum of Sq RSS AIC
## - Administration 1 5.3267e+07 3.3923e+09 736.24
## <none> 3.3391e+09 737.60
## - Transport 1 2.2913e+08 3.5682e+09 738.26
## + Area 2 2.0285e+07 3.3188e+09 741.36
## - Marketing.Spend 1 1.8748e+10 2.2087e+10 811.18
##
## Step: AIC=736.24
## Profit ~ Marketing.Spend + Transport
##
## Df Sum of Sq RSS AIC
## <none> 3.3923e+09 736.24
## + Administration 1 5.3267e+07 3.3391e+09 737.60
## - Transport 1 3.8181e+08 3.7741e+09 738.50
## + Area 2 1.8969e+07 3.3734e+09 740.01
## - Marketing.Spend 1 2.2130e+10 2.5523e+10 814.96
model_both2 <- step(object = model_none, direction = "both",
scope = list(lower=model_none, upper = model_all))
## Start: AIC=845.59
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 5.3930e+10 3.7741e+09 738.50
## + Transport 1 3.2182e+10 2.5523e+10 814.96
## <none> 5.7704e+10 845.59
## + Area 2 5.3160e+09 5.2388e+10 845.72
## + Administration 1 1.6647e+09 5.6040e+10 846.42
##
## Step: AIC=738.5
## Profit ~ Marketing.Spend
##
## Df Sum of Sq RSS AIC
## + Transport 1 3.8181e+08 3.3923e+09 736.24
## + Administration 1 2.0595e+08 3.5682e+09 738.26
## <none> 3.7741e+09 738.50
## + Area 2 4.0321e+07 3.7338e+09 742.07
## - Marketing.Spend 1 5.3930e+10 5.7704e+10 845.59
##
## Step: AIC=736.24
## Profit ~ Marketing.Spend + Transport
##
## Df Sum of Sq RSS AIC
## <none> 3.3923e+09 736.24
## + Administration 1 5.3267e+07 3.3391e+09 737.60
## - Transport 1 3.8181e+08 3.7741e+09 738.50
## + Area 2 1.8969e+07 3.3734e+09 740.01
## - Marketing.Spend 1 2.2130e+10 2.5523e+10 814.96
Goodnes of Fit Comparison
summary(model_all)$adj.r.squared
## [1] 0.9340287
summary(model_backward)$adj.r.squared
## [1] 0.9380342
summary(model_forward)$adj.r.squared
## [1] 0.9380342
summary(model_both1)$adj.r.squared
## [1] 0.9380342
summary(model_both2)$adj.r.squared
## [1] 0.9380342
The adjusted r-squared model that has gone through the stepwise describes more information from the taget variable
library(performance)
compare_performance(model_all,model_backward,model_forward,model_both1,model_both2)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma | BF | p
## -----------------------------------------------------------------------------------------------
## model_all | lm | 856.87 | 868.70 | 0.94 | 0.93 | 9108.76 | 9879.83 | 1.00 |
## model_backward | lm | 851.75 | 858.51 | 0.94 | 0.94 | 9209.14 | 9575.21 | 163.19 | 0.861
## model_forward | lm | 851.75 | 858.51 | 0.94 | 0.94 | 9209.14 | 9575.21 | 163.19 |
## model_both1 | lm | 851.75 | 858.51 | 0.94 | 0.94 | 9209.14 | 9575.21 | 163.19 |
## model_both2 | lm | 851.75 | 858.51 | 0.94 | 0.94 | 9209.14 | 9575.21 | 163.19 |
the results of the error evaluation model with all predictors and the combined stages of forward and backward or both have a higher error than the model carried out by the forward or backward stepwise only. Judging from the value of adj rsquared, backward and forward stepwise models are better than both models and models with all variables. If there is a case like this, we must choose to be concerned with how the accuracy of the model predicts new data (choose the lowest error evaluation) or the model that describes the best target variable information (choose the highest adj.rsquared).
# predict ke unseen data
all_pred <- predict(model_all,data_test)
backward_pred <- predict(model_backward,data_test)
forward_pred <- predict(model_forward,data_test)
both_pred1 <- predict(model_both1,data_test)
both_pred2 <- predict(model_both2,data_test)
data.frame(RMSE = c(RMSE(all_pred,data_test$Profit),
RMSE(backward_pred,data_test$Profit),
RMSE(forward_pred,data_test$Profit),
RMSE(both_pred1,data_test$Profit),
RMSE(both_pred2,data_test$Profit)),
model = c("All","Backward","Forward","Both 1","Both 2"))
## RMSE model
## 1 23150.61 All
## 2 22378.48 Backward
## 3 22378.48 Forward
## 4 22378.48 Both 1
## 5 22378.48 Both 2
From the RMSE table above we can see that Backward and Forward stepwise model has lowest RMSE, so I will choose one of them which is Backward Model to do the Assumption Checking.
hist(model_backward$residuals,breaks = 10)
2.Perform a statistical test using the
shapiro.test ()
function
Shapiro-Wilk hypothesis:
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.94292, p-value = 0.04341
From the Shapiro test above, we can see that the p-value is smaller than alpha, so we can conclude that the error / residual is not normally distributed.
plot(model_backward$fitted.values, model_backward$residuals)
abline(h=0, col = "red")
From scatter plot above we can see that the residuals didn’t have pattern. So, the plot above shows that the residual meets homoscedasticity .
bptest ()
function from the lmtest
libraryBreusch-Pagan hypothesis :
library(lmtest)
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 3.1273, df = 2, p-value = 0.2094
From the Breusch-Pagan test above, we can see that p-value higher than alpha, so we can’t reject the H0 hypothesis and it’s mean that the residuals is homoscedastic.
One of the statistical tools to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF is a way to measure the effect of multicollinearity among the predictors in our model. VIF < 10 means, no multicollinearity occurred among predictors.
library(car)
vif(model_backward)
## Marketing.Spend Transport
## 2.03719 2.03719
VIF value is less than 10, the assumption of non-multicollinearity is fulfilled.
From the Assumption Checking, we can see that the model passes the No-multicolinearity
and Homoscedasticity
test but failed in residual normality
test. The model with lowest RMSE are forward
and backward
model at 9209.14, from the model with lowest RMSE has adjusted r2 0.94 which means the predictor that choosen in the models can interpret 94% of information and 6% of information could be interpreted by the other vairables.
So, the variables that affecting Profit
is Marketing.Spend
and Transport
.