This project asks how online actions influence the number of offline store visits. Two datasets are provided, “online data” and “offline data”, which record consumers' online and offline behavior for 2013 and for the first 11 weeks of 2014. A log-linear regression, used here as an approximation to a Poisson model for count data, is fitted to describe the relationship between Store Traffic and online behaviors, and the results are used to suggest business decisions.
library(rJava)
library(xlsxjars)
library(xlsx)
library(GGally)
library(ggplot2)
library(dplyr)
library(influence.ME)
Load the two consumer behavior datasets.
The online data is stored in the variable ‘onlinedata’.
The offline data is stored in the variable ‘offlinedata’.
setwd("E:/ResumeApplication/Teleflora")
onlinedata<-read.xlsx("Data Modeling Exercise.xlsx",sheetIndex=2,header=TRUE)
offlinedata<-read.xlsx("Data Modeling Exercise.xlsx",sheetIndex=3,header=TRUE)
The online data consists of 222 observations of 31 variables.
Any online action whose name starts with “m” was performed on a smartphone.
Any online action whose name starts with “pp” was performed on the product page.
str(onlinedata)
## 'data.frame': 222 obs. of 31 variables:
## $ Week : num 1 1 1 1 1 1 2 2 2 2 ...
## $ Year : num 2014 2014 2014 2014 2014 ...
## $ Region : Factor w/ 6 levels "Bentonville",..: 5 2 3 4 1 6 5 2 3 4 ...
## $ chat.form : num 7 5 14 0 1 3 14 11 14 0 ...
## $ contact.us : num 1 3 4 0 1 1 0 1 1 0 ...
## $ create.account : num 2 3 3 0 2 0 1 0 3 0 ...
## $ make.appt : num 2 1 2 0 1 2 3 2 5 0 ...
## $ mchat : num 3 1 6 1 2 2 5 3 17 0 ...
## $ mfind.store.phone.call : num 12 5 21 0 2 4 4 10 23 0 ...
## $ mmain.phone.call : num 24 16 55 7 17 10 29 12 57 2 ...
## $ msend.appt.request.ty : num 2 2 3 0 0 5 1 0 2 0 ...
## $ calculator : num 105 107 251 14 39 33 96 80 257 13 ...
## $ find.store : num 81 111 212 7 48 49 84 106 203 6 ...
## $ get.directions : num 8 11 9 0 4 3 11 14 28 0 ...
## $ visit.store : num 172 132 194 11 93 48 162 110 213 9 ...
## $ mFinancing : num 19 32 57 2 6 8 12 22 54 10 ...
## $ mFind.a.Store : num 101 92 283 12 50 82 90 104 261 8 ...
## $ mMap : num 1 1 5 0 1 1 6 4 6 0 ...
## $ Print.Coupon : num 0 0 0 0 0 0 0 0 1 0 ...
## $ mVisit.Store : num 36 46 93 4 25 33 35 22 82 2 ...
## $ mBrowse.products : num 1 2 6 0 2 0 2 0 6 0 ...
## $ mBrowse.Product.Selection: num 1290 1017 2923 109 589 ...
## $ mDesign.Style : num 244 223 652 21 119 98 202 207 601 20 ...
## $ mProduct.Type : num 73 71 174 5 33 25 56 72 166 9 ...
## $ mSequential.Step : num 173 136 447 25 77 69 154 151 386 14 ...
## $ ppBudget : num 130 119 245 11 44 31 114 108 256 15 ...
## $ ppDesign.Style : num 91 93 236 11 37 37 91 98 225 16 ...
## $ ppEmail.a.Friend : num 9 8 19 2 6 4 12 9 14 0 ...
## $ ppProduct.Type : num 1490 1231 2973 123 661 ...
## $ ppSecond.Step : num 344 296 615 40 169 124 327 342 640 58 ...
## $ ppSocial.ShareProduct : num 7 5 7 1 1 1 3 5 12 2 ...
The offline data consists of 282 observations of 4 variables: “Week”, “Year”, “Region” and “Store.Traffic”.
str(offlinedata)
## 'data.frame': 282 obs. of 4 variables:
## $ Week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : num 2013 2013 2013 2013 2013 ...
## $ Region : Factor w/ 6 levels "Bentonville",..: 5 2 3 4 1 6 5 2 3 4 ...
## $ Store.Traffic: num 313 294 407 62 164 0 337 296 418 55 ...
head(offlinedata)
## Week Year Region Store.Traffic
## 1 1 2013 Sacramento 313
## 2 1 2013 Brooklyn 294
## 3 1 2013 Denver 407
## 4 1 2013 Phoenix 62
## 5 1 2013 Bentonville 164
## 6 1 2013 Tampa Bay 0
To merge the two datasets, first find the column names common to the online and offline data. They are “Week”, “Year” and “Region”.
intersect(names(onlinedata),names(offlinedata))
## [1] "Week" "Year" "Region"
The tables below show the “Year” and “Week” coverage of the online data and the offline data respectively. For weeks 1 to 11 of 2013 only offline data was recorded, with no matching online data. Online and offline data overlap from week 27 of 2013 to week 11 of 2014, so only that period can be used to determine which online actions are correlated with store visits.
table(onlinedata$Year,onlinedata$Week)
##
## 1 2 3 4 5 6 7 8 9 10 11 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 2013 0 0 0 0 0 0 0 0 0 0 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 2014 6 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 41 42 43 44 45 46 47 48 49 50 51 52
## 2013 6 6 6 6 6 6 6 6 6 6 6 6
## 2014 0 0 0 0 0 0 0 0 0 0 0 0
table(offlinedata$Year,offlinedata$Week)
##
## 1 2 3 4 5 6 7 8 9 10 11 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 2013 6 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 2014 6 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 41 42 43 44 45 46 47 48 49 50 51 52
## 2013 6 6 6 6 6 6 6 6 6 6 6 6
## 2014 0 0 0 0 0 0 0 0 0 0 0 0
Merge the online and offline data on all common column names, keeping every row from both datasets.
mergedData<-merge(onlinedata,offlinedata,all=TRUE)
str(mergedData)
## 'data.frame': 282 obs. of 32 variables:
## $ Week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : num 2013 2013 2013 2013 2013 ...
## $ Region : Factor w/ 6 levels "Bentonville",..: 1 2 3 4 5 6 1 2 3 4 ...
## $ chat.form : num NA NA NA NA NA NA 1 5 14 0 ...
## $ contact.us : num NA NA NA NA NA NA 1 3 4 0 ...
## $ create.account : num NA NA NA NA NA NA 2 3 3 0 ...
## $ make.appt : num NA NA NA NA NA NA 1 1 2 0 ...
## $ mchat : num NA NA NA NA NA NA 2 1 6 1 ...
## $ mfind.store.phone.call : num NA NA NA NA NA NA 2 5 21 0 ...
## $ mmain.phone.call : num NA NA NA NA NA NA 17 16 55 7 ...
## $ msend.appt.request.ty : num NA NA NA NA NA NA 0 2 3 0 ...
## $ calculator : num NA NA NA NA NA NA 39 107 251 14 ...
## $ find.store : num NA NA NA NA NA NA 48 111 212 7 ...
## $ get.directions : num NA NA NA NA NA NA 4 11 9 0 ...
## $ visit.store : num NA NA NA NA NA NA 93 132 194 11 ...
## $ mFinancing : num NA NA NA NA NA NA 6 32 57 2 ...
## $ mFind.a.Store : num NA NA NA NA NA NA 50 92 283 12 ...
## $ mMap : num NA NA NA NA NA NA 1 1 5 0 ...
## $ Print.Coupon : num NA NA NA NA NA NA 0 0 0 0 ...
## $ mVisit.Store : num NA NA NA NA NA NA 25 46 93 4 ...
## $ mBrowse.products : num NA NA NA NA NA NA 2 2 6 0 ...
## $ mBrowse.Product.Selection: num NA NA NA NA NA ...
## $ mDesign.Style : num NA NA NA NA NA NA 119 223 652 21 ...
## $ mProduct.Type : num NA NA NA NA NA NA 33 71 174 5 ...
## $ mSequential.Step : num NA NA NA NA NA NA 77 136 447 25 ...
## $ ppBudget : num NA NA NA NA NA NA 44 119 245 11 ...
## $ ppDesign.Style : num NA NA NA NA NA NA 37 93 236 11 ...
## $ ppEmail.a.Friend : num NA NA NA NA NA NA 6 8 19 2 ...
## $ ppProduct.Type : num NA NA NA NA NA ...
## $ ppSecond.Step : num NA NA NA NA NA NA 169 296 615 40 ...
## $ ppSocial.ShareProduct : num NA NA NA NA NA NA 1 5 7 1 ...
## $ Store.Traffic : num 164 294 407 62 313 0 133 296 418 55 ...
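The NA values in the output above come from offline rows that have no online counterpart. A minimal sketch of the same kind of merge on toy data, with the key columns named explicitly; the data frames `online_toy` and `offline_toy` and all their values are made up for illustration and are not taken from the workbook:

```r
# Toy stand-ins for the two sheets (hypothetical values).
online_toy <- data.frame(
  Week = c(1, 1),
  Year = c(2014, 2014),
  Region = c("Denver", "Phoenix"),
  chat.form = c(14, 0)
)
offline_toy <- data.frame(
  Week = c(1, 1, 1),
  Year = c(2013, 2014, 2014),
  Region = c("Denver", "Denver", "Phoenix"),
  Store.Traffic = c(407, 390, 60)
)

# Naming the keys makes the join condition explicit.
keys <- c("Week", "Year", "Region")

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
merged_toy <- merge(online_toy, offline_toy, by = keys, all = TRUE)
merged_toy
```

The 2013 Denver row has no online match, so its `chat.form` is NA, which mirrors the leading NA rows in `mergedData`.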
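Remove the rows with missing values and store the result in ‘mergedData2’. The merged data is sorted by Week, so the rows from the first Week 27 row (row 127) onward are selected. This keeps weeks 27 to 52 of 2013 (156 rows); the 2014 rows for weeks 1 to 11 sort before row 127 and are therefore left out as well.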
which(mergedData$Week==27)
## [1] 127 128 129 130 131 132
mergedData2<-mergedData[127:282,]
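Selecting rows by position depends on how merge() happened to sort the result. As a hedged alternative, the sketch below drops incomplete rows with `complete.cases()`, which does not depend on row order. The data frame `toy` and its values are hypothetical. Applied to the real data, this approach would keep every complete row, so it may not reproduce exactly the same subset as the positional selection.

```r
# Hypothetical merged data: the first row lacks online measurements.
toy <- data.frame(
  Week = c(1, 27, 27),
  chat.form = c(NA, 4, 3),
  Store.Traffic = c(164, 207, 319)
)

# Keep only rows with no NA in any column, regardless of position.
toy_complete <- toy[complete.cases(toy), ]
toy_complete
```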
To simplify modeling, convert the factor variable ‘Region’ to numeric codes.
Bentonville=1
Brooklyn=2
Denver=3
Phoenix=4
Sacramento=5
Tampa Bay=6
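The mapping listed above follows from how R stores factors: levels are sorted alphabetically by default, and `as.numeric()` returns each level's position. A small sketch, using the six region names from the data in an arbitrary order (the vector `regions` is built here only for illustration):

```r
# Region names as they appear in the data, in arbitrary order.
regions <- factor(c("Sacramento", "Brooklyn", "Denver",
                    "Phoenix", "Bentonville", "Tampa Bay"))

# as.numeric() on a factor returns the index of each value's level.
codes <- setNames(as.numeric(regions), as.character(regions))
codes[order(codes)]
```

Because the codes only reflect alphabetical order, a model that uses them as a number assumes an ordered, evenly spaced effect across regions. Keeping ‘Region’ as a factor would instead estimate a separate effect for each region.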
As the output below shows, the subset consists of 156 observations of 32 variables. The goal is to model “Store.Traffic” as the outcome, with the other 31 variables as candidate predictors.
mergedData2$Region<-as.numeric(mergedData2$Region)
str(mergedData2)
## 'data.frame': 156 obs. of 32 variables:
## $ Week : num 27 27 27 27 27 27 28 28 28 28 ...
## $ Year : num 2013 2013 2013 2013 2013 ...
## $ Region : num 1 2 3 4 5 6 1 2 3 4 ...
## $ chat.form : num 4 3 7 0 2 0 2 4 10 2 ...
## $ contact.us : num 1 1 1 0 0 1 1 1 5 0 ...
## $ create.account : num 1 1 1 0 0 0 0 2 3 0 ...
## $ make.appt : num 0 0 0 0 0 0 0 0 2 0 ...
## $ mchat : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mfind.store.phone.call : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mmain.phone.call : num 0 0 0 0 0 0 0 0 0 0 ...
## $ msend.appt.request.ty : num 0 0 0 0 0 0 0 0 0 0 ...
## $ calculator : num 0 0 0 0 0 0 0 0 0 0 ...
## $ find.store : num 0 0 0 0 0 0 0 0 0 0 ...
## $ get.directions : num 3 7 16 0 4 2 5 6 15 2 ...
## $ visit.store : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mFinancing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mFind.a.Store : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mMap : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Print.Coupon : num 0 0 0 0 0 0 0 0 1 0 ...
## $ mVisit.Store : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mBrowse.products : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mBrowse.Product.Selection: num 0 0 0 0 0 0 0 0 0 0 ...
## $ mDesign.Style : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mProduct.Type : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mSequential.Step : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ppBudget : num 4 20 48 5 19 4 21 74 146 13 ...
## $ ppDesign.Style : num 8 27 56 2 26 3 20 73 127 33 ...
## $ ppEmail.a.Friend : num 1 3 9 0 7 1 1 4 9 0 ...
## $ ppProduct.Type : num 313 832 1658 97 724 ...
## $ ppSecond.Step : num 54 140 281 23 148 34 118 268 608 76 ...
## $ ppSocial.ShareProduct : num 0 3 6 0 5 0 2 1 5 1 ...
## $ Store.Traffic : num 207 319 521 74 293 96 172 257 514 77 ...
The correlations between Store.Traffic and the other variables are shown below. ppSecond.Step, ppProduct.Type and ppDesign.Style are the three variables most strongly correlated with the outcome.
cor<-sort(cor(mergedData2,method="pearson")[32,],decreasing=TRUE)
cor
## Store.Traffic ppSecond.Step
## 1.00000000 0.93264116
## ppProduct.Type ppDesign.Style
## 0.93155491 0.90053394
## ppBudget get.directions
## 0.88983663 0.85822139
## chat.form ppEmail.a.Friend
## 0.84286271 0.82587836
## mSequential.Step mBrowse.Product.Selection
## 0.81154264 0.81095333
## mDesign.Style mProduct.Type
## 0.80436577 0.79719714
## mFind.a.Store find.store
## 0.79676185 0.76484383
## mfind.store.phone.call mMap
## 0.76417260 0.75989117
## mFinancing mmain.phone.call
## 0.75501863 0.74612196
## mVisit.Store ppSocial.ShareProduct
## 0.74263542 0.72786247
## calculator visit.store
## 0.71221332 0.65975371
## msend.appt.request.ty make.appt
## 0.61212320 0.59558021
## create.account mBrowse.products
## 0.57806719 0.56731943
## mchat contact.us
## 0.55419289 0.43839955
## Print.Coupon Week
## 0.09536204 0.07381276
## Region
## -0.19972890
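The ranking above picks the outcome by its column position (32). A minimal sketch of the same ranking that selects the outcome by name instead, on simulated data; the data frame `toy` and its columns `a`, `b`, `c` and `outcome` are hypothetical:

```r
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)            # closely tracks a
toy$c <- rnorm(50)                              # unrelated noise
toy$outcome <- 2 * toy$a + rnorm(50, sd = 0.5)  # driven by a

# Pearson correlation of every column with the outcome, selected by name.
cors <- cor(toy, method = "pearson")[, "outcome"]

# Drop the outcome's correlation with itself and sort the rest.
ranked <- sort(cors[names(cors) != "outcome"], decreasing = TRUE)
ranked
```

Selecting by name keeps the ranking correct if columns are later added or reordered.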
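For count data with no upper bound, a Poisson model is often used to describe the relationship between predictors and outcome. Here it is approximated by a linear regression on log(Store.Traffic + 1), starting with the single most correlated predictor, ppSecond.Step.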
lm1<-lm(I(log(Store.Traffic+1))~ppSecond.Step,data=mergedData2)
summary(lm1)
##
## Call:
## lm(formula = I(log(Store.Traffic + 1)) ~ ppSecond.Step, data = mergedData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77368 -0.27688 0.03827 0.24399 0.94607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4498699 0.0452306 98.38 <2e-16 ***
## ppSecond.Step 0.0030667 0.0001353 22.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3353 on 154 degrees of freedom
## Multiple R-squared: 0.7693, Adjusted R-squared: 0.7678
## F-statistic: 513.5 on 1 and 154 DF, p-value: < 2.2e-16
As shown above, the predictor “ppSecond.Step” is significant at the 0.001 level.
The residual standard error is 0.3353.
The adjusted R-squared is 0.7678.
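The fit above is ordinary least squares on log(Store.Traffic + 1), not a Poisson GLM. For comparison, the sketch below fits both to simulated counts so the two sets of coefficients can be placed side by side. The data are generated here with assumed true coefficients (4.4 and 0.003) and have nothing to do with the store data.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 600)

# Simulated counts whose log-mean is linear in x (assumed coefficients).
y <- rpois(n, lambda = exp(4.4 + 0.003 * x))

# Log-linear least squares, as used in this report.
ols_fit <- lm(log(y + 1) ~ x)

# Poisson regression with a log link.
pois_fit <- glm(y ~ x, family = poisson(link = "log"))

round(rbind(ols = coef(ols_fit), poisson = coef(pois_fit)), 4)
```

Both models estimate effects on the log scale, so in each case exp(coefficient) can be read as a multiplicative change per unit of the predictor.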
round(exp(coef(lm1)),5)
## (Intercept) ppSecond.Step
## 85.61580 1.00307
The exponentiated coefficients of the model are shown above.
85.61580 is the estimated geometric mean of Store Traffic (plus one, because of the log(Store.Traffic + 1) transform) when ‘ppSecond.Step’ is 0.
Each one-unit increase in ‘ppSecond.Step’ is associated with a 0.307% increase in Store Traffic.
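The percentage quoted above is (exp(b) - 1) × 100 for a coefficient b on the log scale. A short worked sketch using the two coefficients printed in the summary(lm1) output; the variable names are introduced here for illustration:

```r
# Coefficients copied from the summary(lm1) output.
b0 <- 4.4498699   # intercept
b1 <- 0.0030667   # ppSecond.Step

# Back-transform the intercept to the original scale.
baseline <- exp(b0)

# Percentage change for a one-unit and a 100-unit increase.
pct_per_unit <- (exp(b1) - 1) * 100
pct_per_100 <- (exp(100 * b1) - 1) * 100

round(c(baseline = baseline,
        pct_per_unit = pct_per_unit,
        pct_per_100 = pct_per_100), 3)
```

The effect compounds multiplicatively, so the change for 100 units is larger than 100 times the one-unit change.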
plot(lm1,which=1)
The call above produces the Residuals vs Fitted plot of the model. For an ideal model, the residuals should scatter randomly around 0, with a mean of 0 across the range of fitted values.
In this plot, the residuals are mostly negative when the fitted value is small, positive in the middle range, and negative again when the fitted value is large. The mean residual changes with the fitted value, so some structure remains unexplained.
Next, take “Store.Traffic” as the outcome and all other variables as predictors in the same log-linear (Poisson-style) regression.
lm2<-lm(I(log(Store.Traffic+1))~.,data=mergedData2)
summary(lm2)
##
## Call:
## lm(formula = I(log(Store.Traffic + 1)) ~ ., data = mergedData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57856 -0.17813 0.00953 0.15930 0.62791
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7450314 0.2006506 23.648 < 2e-16 ***
## Week -0.0023425 0.0047326 -0.495 0.62149
## Year NA NA NA NA
## Region -0.0744006 0.0154158 -4.826 3.98e-06 ***
## chat.form -0.0042367 0.0111490 -0.380 0.70459
## contact.us -0.0390670 0.0129983 -3.006 0.00320 **
## create.account 0.0066206 0.0203568 0.325 0.74555
## make.appt -0.0122821 0.0221770 -0.554 0.58069
## mchat -0.0009696 0.0152434 -0.064 0.94939
## mfind.store.phone.call -0.0050440 0.0103714 -0.486 0.62758
## mmain.phone.call -0.0094214 0.0055857 -1.687 0.09416 .
## msend.appt.request.ty -0.0333352 0.0212407 -1.569 0.11908
## calculator -0.0024568 0.0014008 -1.754 0.08191 .
## find.store -0.0035532 0.0015435 -2.302 0.02299 *
## get.directions 0.0162739 0.0084554 1.925 0.05654 .
## visit.store 0.0031461 0.0010113 3.111 0.00231 **
## mFinancing 0.0040125 0.0045709 0.878 0.38171
## mFind.a.Store 0.0044726 0.0018634 2.400 0.01786 *
## mMap 0.0002982 0.0154004 0.019 0.98458
## Print.Coupon -0.0307311 0.0563032 -0.546 0.58617
## mVisit.Store -0.0026630 0.0028820 -0.924 0.35725
## mBrowse.products 0.0308502 0.0386870 0.797 0.42671
## mBrowse.Product.Selection 0.0007498 0.0004013 1.869 0.06401 .
## mDesign.Style -0.0028788 0.0021333 -1.349 0.17963
## mProduct.Type -0.0041307 0.0038377 -1.076 0.28384
## mSequential.Step -0.0005790 0.0024279 -0.238 0.81192
## ppBudget 0.0001179 0.0021444 0.055 0.95625
## ppDesign.Style -0.0004033 0.0025634 -0.157 0.87524
## ppEmail.a.Friend -0.0006761 0.0096192 -0.070 0.94408
## ppProduct.Type 0.0009074 0.0002499 3.631 0.00041 ***
## ppSecond.Step -0.0001579 0.0009570 -0.165 0.86922
## ppSocial.ShareProduct 0.0137292 0.0162220 0.846 0.39898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2693 on 125 degrees of freedom
## Multiple R-squared: 0.8792, Adjusted R-squared: 0.8502
## F-statistic: 30.33 on 30 and 125 DF, p-value: < 2.2e-16
As shown above, not all variables are significant. “Year” has no estimate because it takes a single value in this subset and is reported as a singularity. The variables significant at the 0.05 level are listed below.
“Region”
“contact.us”
“find.store”
“visit.store”
“mFind.a.Store”
“ppProduct.Type”
The residual standard error is 0.2693.
The adjusted R-squared is 0.8502.
plot(lm2,which=1)
The Residuals vs Fitted plot is improved: the mean residual is close to 0 across the range of fitted values.
Next, refit the model using only the predictors that were significant in the full model, again with “Store.Traffic” as the outcome.
lm3<-lm(I(log(Store.Traffic+1))~Region+contact.us+find.store+visit.store+mFind.a.Store+ppProduct.Type,data=mergedData2)
summary(lm3)
##
## Call:
## lm(formula = I(log(Store.Traffic + 1)) ~ Region + contact.us +
## find.store + visit.store + mFind.a.Store + ppProduct.Type,
## data = mergedData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6972 -0.1860 0.0410 0.1941 0.7163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.662e+00 6.785e-02 68.707 < 2e-16 ***
## Region -5.857e-02 1.420e-02 -4.126 6.13e-05 ***
## contact.us -3.064e-02 1.085e-02 -2.824 0.00539 **
## find.store -7.261e-03 1.174e-03 -6.188 5.62e-09 ***
## visit.store 3.313e-03 7.956e-04 4.164 5.28e-05 ***
## mFind.a.Store 2.311e-03 8.190e-04 2.822 0.00542 **
## ppProduct.Type 8.263e-04 5.932e-05 13.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2928 on 149 degrees of freedom
## Multiple R-squared: 0.8297, Adjusted R-squared: 0.8229
## F-statistic: 121 on 6 and 149 DF, p-value: < 2.2e-16
As shown above, all variables are significant at the 0.01 level.
The residual standard error is 0.2928.
The adjusted R-squared is 0.8229.
Relative to the full model, the RSE has increased and the adjusted R-squared has decreased.
round(exp(coef(lm3)),5)
## (Intercept) Region contact.us find.store visit.store
## 105.84120 0.94311 0.96982 0.99277 1.00332
## mFind.a.Store ppProduct.Type
## 1.00231 1.00083
As shown above, 105.84120 is the estimated geometric mean of Store Traffic when all other variables are held at 0.
“Region”, “contact.us” and “find.store” are negatively associated with “Store.Traffic”.
“visit.store”, “mFind.a.Store” and “ppProduct.Type” are positively associated with “Store.Traffic”.
plot(lm3,which=1)
The Residuals vs Fitted plot shows no major change relative to the full model.
To find a better model, the Bayesian Information Criterion (BIC) is used for model selection.
The model with the lowest BIC is preferred.
The search is stepwise in both directions: starting from the full model, each step considers dropping a variable (backward elimination) or re-adding one (forward selection).
At each step the move with the largest BIC improvement is taken, and the search stops when no move lowers the BIC further. Setting k = log(n) in step() makes it use the BIC penalty instead of the default AIC penalty.
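The selection procedure described above can be sketched on simulated data before it is applied to the real model. Everything in this block is hypothetical: the data frame `sim`, its predictors `x1` to `x3`, and the true coefficients are chosen only so that one predictor (`x3`) is pure noise.

```r
set.seed(123)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# The outcome depends on x1 and x2 only; x3 is noise.
sim$y <- 1 + 2 * sim$x1 - 1 * sim$x2 + rnorm(n)

full <- lm(y ~ ., data = sim)

# k = log(n) turns step()'s default AIC penalty into the BIC penalty.
selected <- step(full, direction = "both", k = log(n), trace = 0)

formula(selected)
c(full = BIC(full), selected = BIC(selected))
```

Because the search starts from the full model and only accepts moves that lower the criterion, the selected model's BIC can never be higher than the full model's.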
lm4<-step(lm2,direction="both",k=log(nrow(mergedData2)))
summary(lm4)
##
## Call:
## lm(formula = I(log(Store.Traffic + 1)) ~ Region + contact.us +
## calculator + find.store + get.directions + visit.store +
## mFind.a.Store + mDesign.Style + ppProduct.Type, data = mergedData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63073 -0.18537 0.01773 0.17270 0.69194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.623e+00 6.265e-02 73.799 < 2e-16 ***
## Region -6.470e-02 1.312e-02 -4.932 2.19e-06 ***
## contact.us -3.440e-02 1.004e-02 -3.425 0.000799 ***
## calculator -2.463e-03 7.083e-04 -3.477 0.000668 ***
## find.store -5.147e-03 1.210e-03 -4.255 3.72e-05 ***
## get.directions 1.972e-02 7.774e-03 2.537 0.012240 *
## visit.store 3.378e-03 7.482e-04 4.515 1.30e-05 ***
## mFind.a.Store 3.611e-03 1.067e-03 3.385 0.000914 ***
## mDesign.Style -1.183e-03 5.214e-04 -2.270 0.024670 *
## ppProduct.Type 8.464e-04 8.482e-05 9.979 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2678 on 146 degrees of freedom
## Multiple R-squared: 0.8604, Adjusted R-squared: 0.8518
## F-statistic: 100 on 9 and 146 DF, p-value: < 2.2e-16
All variables in the stepwise model are significant at the 0.05 level. They are listed below.
“Region”
“contact.us”
“calculator”
“find.store”
“get.directions”
“visit.store”
“mFind.a.Store”
“mDesign.Style”
“ppProduct.Type”
The residual standard error is 0.2678.
The adjusted R-squared is 0.8518.
Relative to the modified model, the RSE has decreased and the adjusted R-squared has improved.
round(exp(coef(lm4)),5)
## (Intercept) Region contact.us calculator find.store
## 101.84680 0.93735 0.96618 0.99754 0.99487
## get.directions visit.store mFind.a.Store mDesign.Style ppProduct.Type
## 1.01992 1.00338 1.00362 0.99882 1.00085
As shown above, 101.84680 is the estimated geometric mean of Store Traffic when all other variables are held at 0.
“Region”, “contact.us”, “calculator”, “find.store” and “mDesign.Style” are negatively associated with “Store Traffic”.
“get.directions”, “visit.store”, “mFind.a.Store” and “ppProduct.Type” are positively associated with “Store Traffic”.
plot(lm4,which=1)
plot(lm4,which=3)
In the Residuals vs Fitted plot and the Scale-Location plot, the points are scattered randomly, which suggests that the model describes the relationship well.
Points 131, 178 and 214 are flagged as outliers in these plots.
plot(lm4,which=4)
In the Cook’s distance plot, the three most influential points are 147, 141 and 261.
influential<-cooks.distance(lm4)
head(sort(influential,decreasing=TRUE),3)
## 147 141 261
## 0.56351625 0.10392203 0.09385565
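The same diagnostic can be tried on simulated data with one deliberately planted influential point. The sketch below is illustrative only: the data frame `toy` and the row name "planted" are hypothetical. It looks up the top point by row name instead of by position before refitting.

```r
set.seed(7)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30)

# Add one high-leverage point far from the trend and give it a name.
toy <- rbind(toy, data.frame(x = 60, y = 0))
rownames(toy)[31] <- "planted"

fit <- lm(y ~ x, data = toy)

# Cook's distance per observation, largest first.
cd <- cooks.distance(fit)
top <- head(sort(cd, decreasing = TRUE), 3)
top

# Drop the most influential row by name and refit.
toy_clean <- toy[!(rownames(toy) %in% names(top)[1]), ]
refit <- lm(y ~ x, data = toy_clean)
coef(refit)
```

Dropping by row name stays correct even if the data frame was subset or reordered earlier.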
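# Drop the three flagged outliers (131, 178, 214) and the most influential point (147), then refit.
mergedData<-mergedData[-c(131,178,214,147),]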
which(mergedData$Week==27)
## [1] 127 128 129 130 131
mergedData3<-mergedData[127:278,]
mergedData3$Region<-as.numeric(mergedData3$Region)
lm5<-lm(I(log(Store.Traffic+1))~.,data=mergedData3)
lm6<-step(lm5,direction="both",k=log(nrow(mergedData3)))
summary(lm6)
##
## Call:
## lm(formula = I(log(Store.Traffic + 1)) ~ Region + calculator +
## find.store + get.directions + visit.store + mFind.a.Store +
## mProduct.Type + ppProduct.Type, data = mergedData3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52718 -0.18290 0.01793 0.15998 0.48443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.633e+00 5.877e-02 78.825 < 2e-16 ***
## Region -6.838e-02 1.213e-02 -5.640 8.80e-08 ***
## calculator -2.267e-03 6.546e-04 -3.463 0.000705 ***
## find.store -5.935e-03 1.185e-03 -5.007 1.61e-06 ***
## get.directions 2.252e-02 7.120e-03 3.163 0.001904 **
## visit.store 3.018e-03 6.960e-04 4.336 2.72e-05 ***
## mFind.a.Store 4.875e-03 1.080e-03 4.516 1.31e-05 ***
## mProduct.Type -5.135e-03 1.647e-03 -3.118 0.002204 **
## ppProduct.Type 7.986e-04 7.523e-05 10.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2488 on 143 degrees of freedom
## Multiple R-squared: 0.8743, Adjusted R-squared: 0.8672
## F-statistic: 124.3 on 8 and 143 DF, p-value: < 2.2e-16
As we can see, all variables in the model are significant at the 0.01 level. They are listed below.
“Region”
“calculator”
“find.store”
“get.directions”
“visit.store”
“mFind.a.Store”
“mProduct.Type”
“ppProduct.Type”
The residual standard error is 0.2488.
The adjusted R-squared is 0.8672.
Relative to the stepwise model, both measures are slightly improved.
This is the best model obtained.
round(exp(coef(lm6)),5)
## (Intercept) Region calculator find.store get.directions
## 102.80374 0.93390 0.99774 0.99408 1.02278
## visit.store mFind.a.Store mProduct.Type ppProduct.Type
## 1.00302 1.00489 0.99488 1.00080
As shown above, 102.80374 is the estimated geometric mean of Store Traffic when all other variables are held at 0.
“Region”, “calculator”, “mProduct.Type” and “find.store” are negatively associated with “Store Traffic”.
“get.directions”, “visit.store”, “mFind.a.Store” and “ppProduct.Type” are positively associated with “Store Traffic”.
plot(lm6,which=1)
plot(lm6,which=3)
In the Residuals vs Fitted plot and the Scale-Location plot, the residuals are scattered randomly, which indicates that the model describes the relationship well.
plot(lm6,which=4)
The Cook’s distances of all points are below 0.12, so no single point greatly influences the model.
In the optimized model, 102.80374 is the estimated geometric mean of Store Traffic when all other variables are held at 0.
“Region”, “calculator”, “mProduct.Type” and “find.store” are negatively associated with “Store Traffic”, so their influence should be minimized as far as possible.
“get.directions”, “visit.store”, “mFind.a.Store” and “ppProduct.Type” are positively associated with “Store Traffic”, so improvements in these areas should help increase Store Traffic. The table below gives the estimated percentage change in Store Traffic per one-unit increase in each variable.
results<-data.frame(behaviors=c("Region","calculator","mProduct.Type","find.store","get.directions","visit.store","mFind.a.Store","ppProduct.Type"),percentage=c("-6.61%","-0.226%","-0.512%","-0.592%","2.278%","0.302%","0.489%","0.080%"))
results
## behaviors percentage
## 1 Region -6.61%
## 2 calculator -0.226%
## 3 mProduct.Type -0.512%
## 4 find.store -0.592%
## 5 get.directions 2.278%
## 6 visit.store 0.302%
## 7 mFind.a.Store 0.489%
## 8 ppProduct.Type 0.080%
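The percentages in the table above were typed in by hand. As a sketch, they can instead be computed as (exp(b) - 1) × 100 from the coefficients. The vector `b` below is copied from the printed summary(lm6) output, so it carries only the printed precision; with the fitted model available, `coef(lm6)[-1]` could be used in its place. The names `b`, `pct` and `results_num` are introduced here for illustration.

```r
# Coefficients copied from the summary(lm6) output (intercept omitted).
b <- c(Region = -6.838e-02, calculator = -2.267e-03,
       find.store = -5.935e-03, get.directions = 2.252e-02,
       visit.store = 3.018e-03, mFind.a.Store = 4.875e-03,
       mProduct.Type = -5.135e-03, ppProduct.Type = 7.986e-04)

# Percentage change in Store Traffic per one-unit increase.
pct <- (exp(b) - 1) * 100

results_num <- data.frame(behaviors = names(pct),
                          percentage = round(unname(pct), 3))

# Sort from the most negative to the most positive association.
results_num[order(results_num$percentage), ]
```

Keeping the percentages numeric instead of as text allows the table to be sorted and reused in later calculations.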