3. Problem Statement: All Greens Franchise

Explain the importance of X2, X3, X4, X5, and X6 on annual net sales, X1. The data (X1, X2, X3, X4, X5, X6) are recorded for each franchise store:

X1 = annual net sales / $1000
X2 = number of square feet / 1000
X3 = inventory / $1000
X4 = amount spent on advertising / $1000
X5 = size of sales district / 1000 families
X6 = number of competing stores in the district

library(readxl)
# Read the franchise data from the Excel workbook
Dataset_All_Greens_Franchise <- read_excel("E:/Great Learning/Advance stats/group Assignment/Dataset_All Greens Franchise.xls")
dim(Dataset_All_Greens_Franchise)
## [1] 27  6
str(Dataset_All_Greens_Franchise)
## Classes 'tbl_df', 'tbl' and 'data.frame':    27 obs. of  6 variables:
##  $ X1: num  231 156 10 519 437 487 299 195 20 68 ...
##  $ X2: num  3 2.2 0.5 5.5 4.4 ...
##  $ X3: num  294 232 149 600 567 571 512 347 212 102 ...
##  $ X4: num  8.2 6.9 3 12 10.6 ...
##  $ X5: num  8.2 4.1 4.3 16.1 14.1 ...
##  $ X6: num  11 12 15 1 5 4 10 12 15 8 ...

The output shows that the dataset has 27 observations on 6 variables, all of them numeric.
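
As an additional sanity check (not shown in the original output), summary statistics for each variable could also be examined:

summary(Dataset_All_Greens_Franchise)

Next, let us verify that there are no missing values: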

cbind(colSums(is.na(Dataset_All_Greens_Franchise)))
##    [,1]
## X1    0
## X2    0
## X3    0
## X4    0
## X5    0
## X6    0
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
corel<-cor(Dataset_All_Greens_Franchise)
corrplot(corel,method ="number")

The correlation plot above shows that X1 is very highly correlated with each of the independent variables X2 through X6. The independent variables are also highly correlated with one another, so there may be a multicollinearity problem.
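
For reference, the numeric correlations of X1 with each predictor can be pulled straight out of the corel matrix computed above; a small sketch of this check (our addition):

round(corel["X1", -1], 2)

With these strong correlations in mind, let us run the regression on the complete dataset: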

reg1 <- lm(X1 ~ ., data= Dataset_All_Greens_Franchise)
summary(reg1)
## 
## Call:
## lm(formula = X1 ~ ., data = Dataset_All_Greens_Franchise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.338  -9.699  -4.496   4.040  41.139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.85941   30.15023  -0.626 0.538372    
## X2           16.20157    3.54444   4.571 0.000166 ***
## X3            0.17464    0.05761   3.032 0.006347 ** 
## X4           11.52627    2.53210   4.552 0.000174 ***
## X5           13.58031    1.77046   7.671 1.61e-07 ***
## X6           -5.31097    1.70543  -3.114 0.005249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916 
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16

From the above results we can see that every predictor is statistically significant and that both R-squared and adjusted R-squared exceed 0.99.
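
Before relying on this model, the standard residual diagnostic plots are also worth a quick look; a short sketch (these plots were not part of the original analysis):

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(reg1)            # residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))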

Now let us check for multicollinearity by computing the variance inflation factors (VIFs):

#install.packages("car")
library(car)
vif(reg1)
##        X2        X3        X4        X5        X6 
##  4.240914 10.122480  7.624391  6.912318  5.818768
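
A VIF can also be computed by hand as VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. A sketch of this cross-check for X3 (our addition, not part of the original analysis):

# Regress X3 on the other predictors and convert the R-squared to a VIF
r2_x3 <- summary(lm(X3 ~ X2 + X4 + X5 + X6,
                    data = Dataset_All_Greens_Franchise))$r.squared
1 / (1 - r2_x3)  # should match vif(reg1)["X3"], about 10.12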

The VIF for X3 is greater than 10, which is unacceptably high, so let us drop this variable from the model and fit another one.

reg2 <- lm(X1 ~ .-X3, data= Dataset_All_Greens_Franchise)
summary(reg2)
## 
## Call:
## lm(formula = X1 ~ . - X3, data = Dataset_All_Greens_Franchise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.422 -12.858  -6.477  16.160  45.255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -39.460     34.411  -1.147   0.2638    
## X2            20.444      3.815   5.359 2.22e-05 ***
## X4            16.966      2.093   8.107 4.73e-08 ***
## X5            15.673      1.910   8.206 3.86e-08 ***
## X6            -4.043      1.937  -2.088   0.0486 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.68 on 22 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.9884 
## F-statistic: 555.4 on 4 and 22 DF,  p-value: < 2.2e-16

From the above results we can see that all the variables remain significant (X6 now only at the 5% level), R-squared is above 0.99, and adjusted R-squared is above 0.98. Now let us recompute the VIFs:

vif(reg2)
##       X2       X4       X5       X6 
## 3.579850 3.795323 5.861520 5.468943

Since none of the VIFs now exceed 10, this model is acceptable.
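
As an illustration of how this model could be used, here is a prediction for a hypothetical new store; the predictor values below are invented for the example:

# Hypothetical store: 4,000 sq. ft., $10,000 of advertising,
# 12,000 families in the district, and 6 competing stores
new_store <- data.frame(X2 = 4, X4 = 10, X5 = 12, X6 = 6)
predict(reg2, newdata = new_store, interval = "prediction")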

As an alternative way to resolve the multicollinearity, let us use principal component analysis (PCA):

pca1 <- prcomp(Dataset_All_Greens_Franchise[,-1], scale. = T)
summary(pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5
## Standard deviation     2.0769 0.5277 0.47972 0.35207 0.23260
## Proportion of Variance 0.8627 0.0557 0.04603 0.02479 0.01082
## Cumulative Proportion  0.8627 0.9184 0.96439 0.98918 1.00000
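
The sharp drop after the first component can also be visualized with a scree plot (not shown in the original):

screeplot(pca1, type = "lines", main = "Variance explained by component")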

The proportion-of-variance row shows that the first component, PC1, alone explains about 86% of the variance, while the second component explains less than 6%. PC1 is therefore sufficient, so let us use the first component to build the regression model.

Loadings of the principal components: note that loadings() returns NULL below because it is designed for princomp objects, whereas prcomp stores its loadings in the rotation element.

loadings(pca1)
## NULL
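
To actually inspect the loadings of a prcomp fit, print its rotation matrix instead (output omitted here because it was not part of the original run):

pca1$rotation

Now let us regress X1 on the first component: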
data1 <- data.frame(Dataset_All_Greens_Franchise[,1], pca1$x[,1])
colnames(data1) <- c("X1", "PC1")
reg3 <- lm(X1 ~ ., data=data1)
summary(reg3)
## 
## Call:
## lm(formula = X1 ~ ., data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.168 -15.059  -3.809  11.915  47.944 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  286.574      3.774   75.94   <2e-16 ***
## PC1           92.013      1.852   49.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.61 on 25 degrees of freedom
## Multiple R-squared:   0.99,  Adjusted R-squared:  0.9896 
## F-statistic:  2469 on 1 and 25 DF,  p-value: < 2.2e-16
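
Although reg3 is fit on PC1 alone, its PC1 coefficient can be mapped back onto the standardized original predictors through the loading vector, which indicates how each variable contributes to the fitted sales; a sketch of this back-transformation (our addition):

# PC1 is a linear combination of the scaled predictors, so the implied
# coefficient of each standardized predictor is loading * PC1 coefficient
beta_std <- coef(reg3)["PC1"] * pca1$rotation[, 1]
beta_std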

Thus, by using techniques such as PCA and linear regression, we were able to reduce the multicollinearity in the data and to explain the influence of each independent variable on X1, the annual net sales.