## Warning: package 'ggplot2' was built under R version 3.4.1
## Warning: package 'ggthemes' was built under R version 3.4.1
## Warning: package 'scales' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'mice' was built under R version 3.4.2
## Loading required package: lattice
## Warning: package 'randomForest' was built under R version 3.4.1
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Warning: package 'rpart' was built under R version 3.4.2
## Warning: package 'ROCR' was built under R version 3.4.1
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.1
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## Warning: package 'rpart.plot' was built under R version 3.4.2
## Warning: package 'corrr' was built under R version 3.4.1
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
## Warning: package 'glue' was built under R version 3.4.2
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
## Warning: package 'caTools' was built under R version 3.4.1
## Warning: package 'data.table' was built under R version 3.4.2
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 3.4.2
## Loading required package: geosphere
## Warning: package 'geosphere' was built under R version 3.4.2
## Loading required package: gmapsdistance
## Warning: package 'gmapsdistance' was built under R version 3.4.2
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.4.2
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
##
## complete
## Warning: package 'car' was built under R version 3.4.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Warning: package 'caret' was built under R version 3.4.1
## Warning: package 'gclus' was built under R version 3.4.1
## Loading required package: cluster
## Warning: package 'visdat' was built under R version 3.4.1
## Warning: package 'psych' was built under R version 3.4.2
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Warning: package 'leaflet' was built under R version 3.4.1
## Warning: package 'leaflet.extras' was built under R version 3.4.1
## Warning: package 'GPArotation' was built under R version 3.4.1
## Warning: package 'MVN' was built under R version 3.4.2
## sROC 0.1-2 loaded
##
## Attaching package: 'MVN'
## The following object is masked from 'package:psych':
##
## mardia
## Warning: package 'MASS' was built under R version 3.4.1
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Warning: package 'psy' was built under R version 3.4.1
##
## Attaching package: 'psy'
## The following object is masked from 'package:psych':
##
## wkappa
## Warning: package 'corpcor' was built under R version 3.4.1
## Warning: package 'fastmatch' was built under R version 3.4.1
##
## Attaching package: 'fastmatch'
## The following object is masked from 'package:dplyr':
##
## coalesce
## Warning: package 'plyr' was built under R version 3.4.1
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Warning: package 'fmsb' was built under R version 3.4.1
## Warning: package 'QuantPsyc' was built under R version 3.4.2
## Loading required package: boot
## Warning: package 'boot' was built under R version 3.4.1
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:lattice':
##
## melanoma
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
Load the file into a data frame and compute basic statistics to understand the variables and their five-number summaries.
my_sales_data <- read.csv('Dataset_All Greens Franchise.csv')
str(my_sales_data)
## 'data.frame': 27 obs. of 6 variables:
## $ X1: num 231 156 10 519 437 487 299 195 20 68 ...
## $ X2: num 3 2.2 0.5 5.5 4.4 ...
## $ X3: int 294 232 149 600 567 571 512 347 212 102 ...
## $ X4: num 8.2 6.9 3 12 10.6 ...
## $ X5: num 8.2 4.1 4.3 16.1 14.1 ...
## $ X6: int 11 12 15 1 5 4 10 12 15 8 ...
summary(my_sales_data)
## X1 X2 X3 X4
## Min. : 0.5 Min. :0.500 Min. :102.0 Min. : 2.50
## 1st Qu.: 98.5 1st Qu.:1.400 1st Qu.:204.0 1st Qu.: 4.80
## Median :341.0 Median :3.500 Median :382.0 Median : 8.10
## Mean :286.6 Mean :3.326 Mean :387.5 Mean : 8.10
## 3rd Qu.:450.5 3rd Qu.:4.750 3rd Qu.:551.0 3rd Qu.:10.95
## Max. :570.0 Max. :8.600 Max. :788.0 Max. :17.40
## X5 X6
## Min. : 1.600 Min. : 0.000
## 1st Qu.: 4.500 1st Qu.: 4.000
## Median :11.300 Median : 8.000
## Mean : 9.693 Mean : 7.741
## 3rd Qu.:14.050 3rd Qu.:12.000
## Max. :16.300 Max. :15.000
Compute the correlation matrix to see whether any strong correlations exist. The correlation plot shows that X1 is strongly correlated with every other variable, and that X1 and X5 have the highest correlation coefficient:
–X1 and X2 are positively correlated, with a correlation coefficient of 0.89
–X1 and X3 are positively correlated, with a correlation coefficient of 0.95
–X1 and X4 are positively correlated, with a correlation coefficient of 0.91
–X1 and X5 are positively correlated, with a correlation coefficient of 0.95
–X1 and X6 are negatively correlated, with a correlation coefficient of -0.91
But the correlation plot also shows strong correlations among the other variables, for example:
–X3 and X2 have a correlation coefficient of 0.84
Many of the other independent variables are correlated with each other, so this is a clear example of multicollinearity. This means we can't rely on the correlation plot alone to understand the effect of X2, X3, X4, X5 and X6 on X1.
my_sales_data.cormatrix <- cor(my_sales_data)
my_sales_data.cormatrix.rounded<- round(my_sales_data.cormatrix , digits=3)
print(my_sales_data.cormatrix.rounded)
## X1 X2 X3 X4 X5 X6
## X1 1.000 0.894 0.946 0.914 0.954 -0.912
## X2 0.894 1.000 0.844 0.749 0.838 -0.766
## X3 0.946 0.844 1.000 0.906 0.864 -0.807
## X4 0.914 0.749 0.906 1.000 0.795 -0.841
## X5 0.954 0.838 0.864 0.795 1.000 -0.870
## X6 -0.912 -0.766 -0.807 -0.841 -0.870 1.000
corrplot(my_sales_data.cormatrix.rounded, method="shade", type="full", addCoef.col = "blue", order ="AOE", bg ='grey')
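With six variables there are fifteen distinct pairs, so it can help to list the pairs in order of strength rather than read them off the plot. The following is a minimal sketch in base R. The `demo` data frame is synthetic (made-up values, not the franchise file), and the name `pair_tbl` is only illustrative.

```r
# Synthetic stand-in for the franchise data (hypothetical values, not the real file)
set.seed(42)
n <- 27
shared <- rnorm(n)
demo <- data.frame(
  X2 = shared + rnorm(n, sd = 0.5),
  X3 = shared + rnorm(n, sd = 0.4),
  X4 = shared + rnorm(n, sd = 0.6),
  X5 = shared + rnorm(n, sd = 0.5),
  X6 = -shared + rnorm(n, sd = 0.5)
)
demo$X1 <- with(demo, 2 * X2 + X3 + X4 + 3 * X5 - X6 + rnorm(n, sd = 0.3))

cm <- cor(demo)

# Keep each pair once by taking only the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
print(head(pair_tbl, 5), row.names = FALSE)
```

Applied to `my_sales_data`, the same steps would put the predictor-to-predictor pairs next to the predictor-to-X1 pairs, which makes the multicollinearity easier to see.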
Now let's run the regression and look at the regression coefficients. Observations from the regression analysis:
–The model explains 99.3% of the variance in sales (X1), which is suspiciously high. Is the model perhaps overfitting, given only 27 observations and the high correlation among the independent variables?
–All five predictors are statistically significant at the 0.01 level. X2, X4 and X5 are significant at the 0.001 level, and X5 has the smallest p-value.
–X2 has the largest regression coefficient (16.20), but can we then assume that X2 has the largest impact on X1? No, because the inputs X2, X3, X4, X5 and X6 are not on the same scale.
We can't judge which independent variables matter most from the raw regression coefficients when the variables are not on the same scale.
my_sales_data.lm <- lm(my_sales_data$X1 ~ my_sales_data$X2 + my_sales_data$X3 + my_sales_data$X4 + my_sales_data$X5 + my_sales_data$X6 , data = my_sales_data)
summary(my_sales_data.lm)
##
## Call:
## lm(formula = my_sales_data$X1 ~ my_sales_data$X2 + my_sales_data$X3 +
## my_sales_data$X4 + my_sales_data$X5 + my_sales_data$X6, data = my_sales_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.338 -9.699 -4.496 4.040 41.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.85941 30.15023 -0.626 0.538372
## my_sales_data$X2 16.20157 3.54444 4.571 0.000166 ***
## my_sales_data$X3 0.17464 0.05761 3.032 0.006347 **
## my_sales_data$X4 11.52627 2.53210 4.552 0.000174 ***
## my_sales_data$X5 13.58031 1.77046 7.671 1.61e-07 ***
## my_sales_data$X6 -5.31097 1.70543 -3.114 0.005249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916
## F-statistic: 611.6 on 5 and 21 DF, p-value: < 2.2e-16
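The point about scale can be checked directly: changing the units of a predictor changes its coefficient by the same factor while leaving the fit untouched. The sketch below uses synthetic data, and all variable names in it are hypothetical, chosen only to illustrate the idea.

```r
# Synthetic data (hypothetical values) to show that coefficient size depends on units
set.seed(1)
n <- 27
sqft_thousands <- runif(n, 0.5, 9)   # a predictor measured in thousands of square feet
families       <- runif(n, 1, 17)    # a second predictor on a different scale
sales <- 15 * sqft_thousands + 13 * families + rnorm(n, sd = 10)

fit_a <- lm(sales ~ sqft_thousands + families)

# Same information, now expressed in raw square feet
sqft  <- sqft_thousands * 1000
fit_b <- lm(sales ~ sqft + families)

# The coefficient shrinks by the unit factor (ratio is approximately 1000) ...
coef(fit_a)["sqft_thousands"] / coef(fit_b)["sqft"]

# ... while fitted values and R-squared are unchanged
all.equal(fitted(fit_a), fitted(fit_b))
all.equal(summary(fit_a)$r.squared, summary(fit_b)$r.squared)
```

Because the raw coefficient can be made arbitrarily large or small by a change of units, it says nothing by itself about relative importance.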
Calculate the VIF (variance inflation factor) to check for multicollinearity. The rule of thumb is that a VIF above 10 strongly suggests multicollinearity. The VIF output shows that X3 (VIF 10.12) is strongly correlated with X2, X4 and the other predictors, which the correlation plot above also revealed.
VIF(lm(X2 ~ X3+X4+X5+X6, data=my_sales_data))
## [1] 4.240914
VIF(lm(X3 ~ X2+X4+X5+X6, data=my_sales_data))
## [1] 10.12248
VIF(lm(X4 ~ X3+X2+X5+X6, data=my_sales_data))
## [1] 7.624391
VIF(lm(X5 ~ X3+X4+X2+X6, data=my_sales_data))
## [1] 6.912318
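# X6 is regressed on all four other predictors. The value printed below came from a call that listed X2 twice and omitted X5, so re-run to refresh it.
VIF(lm(X6 ~ X2+X3+X4+X5, data=my_sales_data))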
## [1] 4.001994
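The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A rough sketch of computing it for every predictor in one loop, using base R only, is shown here. The `demo` data frame is synthetic (not the franchise data), and `vif_by_hand` and `vif_from_cor` are illustrative names. Building each formula programmatically makes it harder to list a predictor twice or leave one out.

```r
# Synthetic predictors (hypothetical values) that share a common component
set.seed(7)
n <- 27
shared <- rnorm(n)
demo <- data.frame(
  X2 = shared + rnorm(n, sd = 0.6),
  X3 = shared + rnorm(n, sd = 0.3),
  X4 = shared + rnorm(n, sd = 0.5),
  X5 = shared + rnorm(n, sd = 0.5),
  X6 = -shared + rnorm(n, sd = 0.6)
)
predictors <- names(demo)

# For each predictor, regress it on all the others and convert R-squared to a VIF
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = demo)
  1 / (1 - summary(aux)$r.squared)
})

# Cross-check: the VIFs equal the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(demo)))

round(cbind(vif_by_hand, vif_from_cor), 3)
```

The two columns should agree, which gives a quick consistency check on any single VIF computed by hand.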
Now standardise the variables and then run the regression.
After standardising, we see that:
–the regression coefficient for X5 is the largest, so we conclude that variation in X5 has the largest impact on variation in sales
–the regression coefficient for X4 is the second largest, so we conclude that variation in X4 has the second-largest impact on variation in sales
–X6 is negatively related to sales
So our conclusions are:
–Least: X2 = number of square feet of floor display
–Most: X5 = size of the sales district (1000 families)
–Negatively related: X6 = number of competing stores in district
Note that summary() applied to the lm.beta result, as done below, prints only the distribution of the five standardised coefficients (from -0.1354 to 0.3634). Printing the object itself lists the coefficient for each variable.
my_sales_data.lm.standardise <- lm.beta(my_sales_data.lm)
summary(my_sales_data.lm.standardise)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1354 0.1696 0.1738 0.1596 0.2265 0.3634
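A standardised coefficient can also be computed by hand as the raw coefficient multiplied by sd(x) / sd(y), which is equivalent to refitting the model on z-scored variables. The following sketch shows both routes in base R and prints the coefficients per variable, ranked by absolute size. The data are synthetic (hypothetical values, and only three predictors for brevity), so the numbers do not correspond to the franchise results.

```r
# Synthetic data (hypothetical values, not the franchise file)
set.seed(3)
n <- 27
demo <- data.frame(
  X2 = runif(n, 0.5, 9),
  X5 = runif(n, 1.5, 16),
  X6 = sample(0:15, n, replace = TRUE)
)
demo$X1 <- with(demo, 16 * X2 + 13 * X5 - 5 * X6 + rnorm(n, sd = 15))

fit <- lm(X1 ~ X2 + X5 + X6, data = demo)

# Route 1: rescale the raw slopes by sd(x) / sd(y)
b        <- coef(fit)[-1]                 # drop the intercept
sx       <- sapply(demo[names(b)], sd)    # sd of each predictor
beta_std <- b * sx / sd(demo$X1)

# Route 2: z-score every column, then refit
z     <- as.data.frame(scale(demo))
fit_z <- lm(X1 ~ X2 + X5 + X6, data = z)

# Both routes give the same standardised coefficients
all.equal(beta_std, coef(fit_z)[-1])

# Rank predictors by absolute standardised coefficient, keeping the sign visible
beta_std[order(-abs(beta_std))]
```

Ranking by absolute value matters when comparing importance, because a negative coefficient can still be larger in magnitude than a positive one.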