All_Green_Variable

Load the file into a data frame and compute basic statistics to understand the variables and their five-number summaries.

my_sales_data <- read.csv('Dataset_All Greens Franchise.csv')
str(my_sales_data)

## 'data.frame':    27 obs. of  6 variables:
##  $ X1: num  231 156 10 519 437 487 299 195 20 68 ...
##  $ X2: num  3 2.2 0.5 5.5 4.4 ...
##  $ X3: int  294 232 149 600 567 571 512 347 212 102 ...
##  $ X4: num  8.2 6.9 3 12 10.6 ...
##  $ X5: num  8.2 4.1 4.3 16.1 14.1 ...
##  $ X6: int  11 12 15 1 5 4 10 12 15 8 ...

summary(my_sales_data)

##        X1              X2              X3              X4       
##  Min.   :  0.5   Min.   :0.500   Min.   :102.0   Min.   : 2.50  
##  1st Qu.: 98.5   1st Qu.:1.400   1st Qu.:204.0   1st Qu.: 4.80  
##  Median :341.0   Median :3.500   Median :382.0   Median : 8.10  
##  Mean   :286.6   Mean   :3.326   Mean   :387.5   Mean   : 8.10  
##  3rd Qu.:450.5   3rd Qu.:4.750   3rd Qu.:551.0   3rd Qu.:10.95  
##  Max.   :570.0   Max.   :8.600   Max.   :788.0   Max.   :17.40  
##        X5               X6        
##  Min.   : 1.600   Min.   : 0.000  
##  1st Qu.: 4.500   1st Qu.: 4.000  
##  Median :11.300   Median : 8.000  
##  Mean   : 9.693   Mean   : 7.741  
##  3rd Qu.:14.050   3rd Qu.:12.000  
##  Max.   :16.300   Max.   :15.000

Compute the correlations to see whether any significant relationships exist. The correlation matrix shows that X1 is strongly correlated with every other variable, and that X1 and X5 have the highest correlation coefficient:

– X1 and X2 are positively correlated, with a correlation coefficient of 0.894
– X1 and X3 are positively correlated, with a correlation coefficient of 0.946
– X1 and X4 are positively correlated, with a correlation coefficient of 0.914
– X1 and X5 are positively correlated, with a correlation coefficient of 0.954
– X1 and X6 are negatively correlated, with a correlation coefficient of -0.912

But the correlation diagram also shows strong correlations among the other variables, for example:

– X3 and X2 have a correlation coefficient of 0.844

Many of the independent variables are correlated among themselves, so this is a clear example of multicollinearity. This means we can't rely on the correlation diagram alone to understand the effect of X2, X3, X4, X5 and X6 on X1.

my_sales_data.cormatrix <- cor(my_sales_data)
my_sales_data.cormatrix.rounded<- round(my_sales_data.cormatrix , digits=3)
print(my_sales_data.cormatrix.rounded)

##        X1     X2     X3     X4     X5     X6
## X1  1.000  0.894  0.946  0.914  0.954 -0.912
## X2  0.894  1.000  0.844  0.749  0.838 -0.766
## X3  0.946  0.844  1.000  0.906  0.864 -0.807
## X4  0.914  0.749  0.906  1.000  0.795 -0.841
## X5  0.954  0.838  0.864  0.795  1.000 -0.870
## X6 -0.912 -0.766 -0.807 -0.841 -0.870  1.000

corrplot(my_sales_data.cormatrix.rounded, method="shade", type="full", addCoef.col = "blue", order ="AOE", bg ='grey')
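To make the multicollinearity claim concrete, the following sketch lists the predictor pairs whose absolute correlation exceeds a cut-off. The matrix values are copied from the rounded output above; the 0.8 threshold and the names `pm`, `idx` and `high_pairs` are assumptions chosen for illustration, not part of the original analysis.

```r
# Correlation matrix copied from the rounded output printed above.
vars <- paste0("X", 1:6)
cm <- matrix(c(
   1.000,  0.894,  0.946,  0.914,  0.954, -0.912,
   0.894,  1.000,  0.844,  0.749,  0.838, -0.766,
   0.946,  0.844,  1.000,  0.906,  0.864, -0.807,
   0.914,  0.749,  0.906,  1.000,  0.795, -0.841,
   0.954,  0.838,  0.864,  0.795,  1.000, -0.870,
  -0.912, -0.766, -0.807, -0.841, -0.870,  1.000),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep only predictor-predictor correlations (drop the response X1)
# and visit each pair once through the upper triangle.
preds <- vars[-1]
pm  <- cm[preds, preds]
idx <- which(upper.tri(pm) & abs(pm) > 0.8, arr.ind = TRUE)  # 0.8 is an assumed cut-off

high_pairs <- data.frame(
  var1 = rownames(pm)[idx[, "row"]],
  var2 = colnames(pm)[idx[, "col"]],
  r    = pm[idx]
)

# Strongest pairs first.
print(high_pairs[order(-abs(high_pairs$r)), ], row.names = FALSE)
```

Lowering or raising the cut-off changes how many pairs are flagged, so the threshold should be treated as a screening aid rather than a formal test.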

Now let's run the regression and see how the regression coefficients turn out. Observations from the regression analysis:

– The model explains 99.3% of the variance in sales, which is quite suspicious. Could the model be overfitting because of the high correlation among the independent variables?
– The model shows that all five predictors are significant at the 0.01 level, with X2, X4 and X5 the most significant (p < 0.001).
– X2 has the largest regression coefficient, but can we then assume that the variance of X2 has the highest impact on the variance of X1? No, because the input variables X2, X3, X4, X5 and X6 are not on the same scale.

We can't judge which independent variables matter most in a regression from the size of their coefficients if the data are not on the same scale.

my_sales_data.lm <- lm(my_sales_data$X1 ~ my_sales_data$X2 + my_sales_data$X3 + my_sales_data$X4 + my_sales_data$X5 + my_sales_data$X6 , data = my_sales_data)
summary(my_sales_data.lm)

## 
## Call:
## lm(formula = my_sales_data$X1 ~ my_sales_data$X2 + my_sales_data$X3 + 
##     my_sales_data$X4 + my_sales_data$X5 + my_sales_data$X6, data = my_sales_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.338  -9.699  -4.496   4.040  41.139 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -18.85941   30.15023  -0.626 0.538372    
## my_sales_data$X2  16.20157    3.54444   4.571 0.000166 ***
## my_sales_data$X3   0.17464    0.05761   3.032 0.006347 ** 
## my_sales_data$X4  11.52627    2.53210   4.552 0.000174 ***
## my_sales_data$X5  13.58031    1.77046   7.671 1.61e-07 ***
## my_sales_data$X6  -5.31097    1.70543  -3.114 0.005249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916 
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16
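The point about scale can be checked with a small experiment. The sketch below uses simulated data (not the franchise data; the variables `x`, `y` and `x_thousand` are hypothetical) to show that re-expressing a predictor in different units changes its raw coefficient without changing the fit.

```r
set.seed(1)  # simulated data, for illustration only (not the franchise data)
n <- 50
x <- rnorm(n, mean = 5, sd = 2)
y <- 3 + 4 * x + rnorm(n)

fit_orig <- lm(y ~ x)

# The same quantity expressed in units 1000 times smaller.
x_thousand   <- x * 1000
fit_rescaled <- lm(y ~ x_thousand)

coef(fit_orig)["x"]                 # slope per original unit
coef(fit_rescaled)["x_thousand"]    # slope is 1000 times smaller

# The quality of the fit does not depend on the units.
c(summary(fit_orig)$r.squared, summary(fit_rescaled)$r.squared)
```

Because the raw coefficient depends on the measurement units, comparing raw coefficients across predictors measured in different units says little about their relative importance.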

Calculate the VIF (variance inflation factor) to check for multicollinearity. The rule of thumb is that a VIF above 10 strongly suggests multicollinearity. The VIF output shows that X3 (VIF 10.12) is correlated with X2, X4 and the other predictors, which the correlation diagram above also revealed.

VIF(lm(X2 ~ X3+X4+X5+X6, data=my_sales_data))

## [1] 4.240914

VIF(lm(X3 ~ X2+X4+X5+X6, data=my_sales_data))

## [1] 10.12248

VIF(lm(X4 ~ X3+X2+X5+X6, data=my_sales_data))

## [1] 7.624391

VIF(lm(X5 ~ X3+X4+X2+X6, data=my_sales_data))

## [1] 6.912318

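# Note: this formula repeats X2 and omits X5, so the value below is for X6 regressed on X2, X3 and X4 only.
# The full check for X6 would be VIF(lm(X6 ~ X3+X4+X2+X5, data=my_sales_data)).
VIF(lm(X6 ~ X3+X4+X2+X2, data=my_sales_data))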

## [1] 4.001994
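The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. It also equals the j-th diagonal element of the inverse of the predictors' correlation matrix, so all five values can be obtained in one step. The sketch below is one way to do that, using the rounded predictor correlations printed earlier; because of the rounding, its results should only approximately match the values from the fitted models. The names `pm`, `vif_all` and `r2_all` are illustrative.

```r
# Predictor-only correlation matrix (X2..X6), copied from the rounded output above.
vars <- paste0("X", 2:6)
pm <- matrix(c(
   1.000,  0.844,  0.749,  0.838, -0.766,
   0.844,  1.000,  0.906,  0.864, -0.807,
   0.749,  0.906,  1.000,  0.795, -0.841,
   0.838,  0.864,  0.795,  1.000, -0.870,
  -0.766, -0.807, -0.841, -0.870,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse correlation matrix.
vif_all <- diag(solve(pm))

# Implied R^2 of each predictor regressed on the remaining four.
r2_all <- 1 - 1 / vif_all

print(round(cbind(VIF = vif_all, R2 = r2_all), 2))
```

Computing every VIF from a single matrix uses the full set of predictors for each one, with no need to retype an auxiliary formula per variable.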

Now standardise the variables and rerun the regression.

After standardising, we see that:

– The regression coefficient for X5 is the highest, so we conclude that variance in X5 has the highest impact on sales variance.
– The regression coefficient for X4 is the second highest, so we conclude that variance in X4 has the second-highest impact on sales variance.
– Variance in X6 has a negative impact on sales variance.

So our conclusions are:

– Least: X2 = number of square feet of floor display
– Most: X5 = size of the sales district (1000 families)
– Negatively related: X6 = number of competing stores in district

my_sales_data.lm.standardise <- lm.beta(my_sales_data.lm)
summary(my_sales_data.lm.standardise)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1354  0.1696  0.1738  0.1596  0.2265  0.3634
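The `summary()` output above reports only the distribution of the standardised coefficients, not which predictor each value belongs to. A standardised coefficient is the raw coefficient multiplied by sd(x_j) / sd(y), and it matches the coefficient obtained by refitting the model on z-scored variables. The sketch below illustrates both routes on simulated data (not the franchise data; all variable names are hypothetical) and prints the result as a named vector.

```r
set.seed(42)  # simulated data, for illustration only (not the franchise data)
n  <- 100
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 2,  sd = 0.5)
y  <- 10 + 0.8 * x1 + 12 * x2 + rnorm(n, sd = 5)
d  <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = d)

# Route 1: rescale the raw coefficients, b_j * sd(x_j) / sd(y).
b        <- coef(fit)[-1]                       # drop the intercept
beta_man <- b * sapply(d[names(b)], sd) / sd(d$y)

# Route 2: refit on z-scored variables and read the slopes directly.
fit_z  <- lm(y ~ x1 + x2, data = as.data.frame(scale(d)))
beta_z <- coef(fit_z)[-1]

# Printing the named vector shows one standardised coefficient per predictor.
print(round(beta_man, 4))
print(round(beta_z, 4))
```

Printing the vector itself, rather than its summary, keeps the predictor names attached, which is what is needed to rank the predictors.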

All_Green_Variable_Relation

Amit Kayal

November 22, 2017