List of attributes included in the Analysis & their data types
## 'data.frame': 165 obs. of 7 variables:
## $ prod_id : chr "621998000" "623010000" "651693000" "621992000" ...
## $ Unit_sales : int 3302 2260 1993 1708 806 736 638 632 554 493 ...
## $ L_unitsales : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Review_all : int 284 0 1110 131 470 219 169 90 47 6 ...
## $ search : int 261650 11140 326130 123160 1880 49790 3020 269020 55960 8940 ...
## $ SocialSignal_all: int 247 0 1059 1 272 392 167 226 16 17 ...
## $ Backlinks_all : int 70 0 64 23 9 75 20 57 10 10 ...
Summary (descriptive statistics) of all attributes
##                  vars   n     mean       sd median  trimmed     mad min
## Unit_sales          1 165   131.52   383.71     20    49.68   28.17   0
## Review_all          2 165    90.62   229.92      7    32.72   10.38   0
## search              3 165 36772.67 77785.65   5710 17739.32 8465.65   0
## SocialSignal_all    4 165    87.22   251.07      5    27.33    7.41   0
## Backlinks_all       5 165    17.22    35.92      4     9.43    5.93   0
##                     max  range skew kurtosis      se
## Unit_sales         3302   3302 5.65    36.22   29.87
## Review_all         1570   1570 3.97    17.37   17.90
## search           574200 574200 3.71    17.02 6055.60
## SocialSignal_all   1861   1861 4.59    23.04   19.55
## Backlinks_all       286    286 4.73    28.87    2.80
Correlation matrix of all attributes
##                  Unit_sales Review_all    search SocialSignal_all
## Unit_sales        1.0000000  0.2447388 0.3402762        0.2018502
## Review_all        0.2447388  1.0000000 0.6012752        0.5684504
## search            0.3402762  0.6012752 1.0000000        0.7117992
## SocialSignal_all  0.2018502  0.5684504 0.7117992        1.0000000
## Backlinks_all     0.1671937  0.6237374 0.7055494        0.6185073
##                  Backlinks_all
## Unit_sales           0.1671937
## Review_all           0.6237374
## search               0.7055494
## SocialSignal_all     0.6185073
## Backlinks_all        1.0000000
Descriptives of Unit_sales without the four highest-selling products (n = 161)
##            vars   n  mean     sd median trimmed  mad min max range skew
## Unit_sales    1 161 77.25 140.08     18    43.5 25.2   0 806   806 2.95
##            kurtosis    se
## Unit_sales     9.59 11.04
Descriptives of Unit_sales for the four highest-selling products (n = 4)
##            vars n    mean     sd median trimmed   mad  min  max range skew
## Unit_sales    1 4 2315.75 695.06 2126.5 2315.75 409.2 1708 3302  1594 0.52
##            kurtosis     se
## Unit_sales    -1.83 347.53
## [1] "93.75%"
Interpretation: By Chebyshev's inequality, at least 93.75% (1 - 1/4^2) of the observations lie within 4 standard deviations of the mean, whatever the shape of the distribution.
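The 93.75% figure matches the Chebyshev bound for k = 4 standard deviations. The following is a minimal sketch of that arithmetic, assuming this is how the value was obtained; the helper name `chebyshev_bound` is hypothetical and does not come from the original analysis.

```r
# Chebyshev's inequality: for any distribution, at least 1 - 1/k^2 of the
# observations lie within k standard deviations of the mean (k > 1).
chebyshev_bound <- function(k) {
  stopifnot(k > 1)
  1 - 1 / k^2
}

bound <- chebyshev_bound(4)
paste0(100 * bound, "%")   # "93.75%"
```

The bound is a guaranteed minimum, so the actual share of products inside the 4-sd band can be higher.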
Logistic Regression with all cases : Baseline Model
##
## Call:
## glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all +
## Backlinks_all, family = binomial(link = "logit"), data = SII_TRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8781 -0.7949 -0.7790 1.3648 1.6356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.045e+00 2.412e-01 -4.333 1.47e-05 ***
## Review_all 8.459e-04 1.058e-03 0.800 0.424
## search 3.114e-06 4.570e-06 0.681 0.496
## SocialSignal_all -1.233e-03 2.030e-03 -0.607 0.544
## Backlinks_all 6.139e-03 8.799e-03 0.698 0.485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 143.70 on 115 degrees of freedom
## Residual deviance: 139.31 on 111 degrees of freedom
## AIC: 149.31
##
## Number of Fisher Scoring iterations: 4
Stepwise Logistic Regression
##
## Call: glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all +
## Backlinks_all, family = binomial(link = "logit"), data = SII_TRAIN)
##
## Coefficients:
## (Intercept) Review_all search SocialSignal_all
## -1.045e+00 8.459e-04 3.114e-06 -1.233e-03
## Backlinks_all
## 6.139e-03
##
## Degrees of Freedom: 115 Total (i.e. Null); 111 Residual
## Null Deviance: 143.7
## Residual Deviance: 139.3 AIC: 149.3
##
## Call:
## glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all +
## Backlinks_all, family = binomial(link = "logit"), data = SII_TRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8781 -0.7949 -0.7790 1.3648 1.6356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.045e+00 2.412e-01 -4.333 1.47e-05 ***
## Review_all 8.459e-04 1.058e-03 0.800 0.424
## search 3.114e-06 4.570e-06 0.681 0.496
## SocialSignal_all -1.233e-03 2.030e-03 -0.607 0.544
## Backlinks_all 6.139e-03 8.799e-03 0.698 0.485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 143.70 on 115 degrees of freedom
## Residual deviance: 139.31 on 111 degrees of freedom
## AIC: 149.31
##
## Number of Fisher Scoring iterations: 4
##
## Call: glm(formula = L_unitsales ~ Backlinks_all, family = binomial(link = "logit"),
## data = SII_TRAIN)
##
## Coefficients:
## (Intercept) Backlinks_all
## -0.980827 0.009089
##
## Degrees of Freedom: 115 Total (i.e. Null); 114 Residual
## Null Deviance: 143.7
## Residual Deviance: 140.4 AIC: 144.4
##
## Call:
## glm(formula = L_unitsales ~ Backlinks_all, family = binomial(link = "logit"),
## data = SII_TRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8971 -0.8137 -0.7981 1.3832 1.6120
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.980827 0.230297 -4.259 2.05e-05 ***
## Backlinks_all 0.009089 0.005552 1.637 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 143.70 on 115 degrees of freedom
## Residual deviance: 140.41 on 114 degrees of freedom
## AIC: 144.41
##
## Number of Fisher Scoring iterations: 4
## L_unitsales ~ Backlinks_all
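The output above moves from the four-predictor model to `L_unitsales ~ Backlinks_all`, which is what backward elimination by AIC would produce. Below is a rough sketch of that procedure on simulated data. The data frame `toy` and its generating parameters are invented for illustration and stand in for `SII_TRAIN`, which is not reproduced here; the exact `step()` arguments used in the original analysis are not shown in the report, so `direction = "backward"` is an assumption.

```r
# Simulated stand-in for SII_TRAIN (illustrative values only).
set.seed(1)
n <- 120
toy <- data.frame(
  Review_all       = rpois(n, 90),
  search           = rpois(n, 3000),
  SocialSignal_all = rpois(n, 80),
  Backlinks_all    = rpois(n, 17)
)
toy$L_unitsales <- rbinom(n, 1, plogis(-4 + 0.2 * toy$Backlinks_all))

# Full logistic model with all four predictors.
full <- glm(L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all,
            family = binomial(link = "logit"), data = toy)

# Backward elimination: at each step drop the term whose removal lowers
# AIC the most, and stop when no removal lowers AIC any further.
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

Setting `trace = 0` suppresses the step-by-step table; leaving it at the default prints a trace in the same layout as the `Start:` / `Step:` blocks in this report.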
Loglikelihood Value
## 'log Lik.' -69.65422 (df=5)
## Start: AIC=149.31
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## [1] 163.0764
## Start: AIC=149.31
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
##
## Df Deviance AIC
## - SocialSignal_all 1 139.68 147.68
## - search 1 139.77 147.77
## - Backlinks_all 1 139.82 147.82
## - Review_all 1 139.94 147.94
## <none> 139.31 149.31
##
## Step: AIC=147.68
## L_unitsales ~ Review_all + search + Backlinks_all
##
## Df Deviance AIC
## - search 1 139.82 145.82
## - Backlinks_all 1 139.97 145.97
## - Review_all 1 140.11 146.11
## <none> 139.68 147.68
##
## Step: AIC=145.82
## L_unitsales ~ Review_all + Backlinks_all
##
## Df Deviance AIC
## - Review_all 1 140.41 144.41
## - Backlinks_all 1 140.66 144.66
## <none> 139.82 145.82
##
## Step: AIC=144.41
## L_unitsales ~ Backlinks_all
##
## Df Deviance AIC
## <none> 140.41 144.41
## - Backlinks_all 1 143.69 145.69
## [1] 149.9183
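The AIC and BIC values printed in this section follow directly from the log-likelihood. As a quick check, the sketch below recomputes them for the four-predictor model from the figures reported above. The sample size of 116 is inferred from the 115 null degrees of freedom, and the variable names are illustrative.

```r
loglik <- -69.65422   # log-likelihood of the four-predictor model
k      <- 5           # estimated parameters: intercept + 4 slopes
n      <- 116         # training cases (115 null degrees of freedom + 1)

aic <- -2 * loglik + 2 * k        # AIC penalises each parameter by 2
bic <- -2 * loglik + k * log(n)   # BIC penalises each parameter by log(n)

round(c(AIC = aic, BIC = bic), 2)
```

The results agree with the reported AIC of 149.31 and the first BIC value of 163.0764. Because log(116) is larger than 2, BIC penalises the extra predictors more heavily than AIC does.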
Overall Variance Explained & Significance
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 3.7, df = 4, P(> X2) = 0.44
Prediction & Odds Ratio
## Waiting for profiling to be done...
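The odds ratios themselves are not printed in this report. As an illustration only, they can be obtained by exponentiating the baseline coefficients listed earlier. The sketch below copies those estimates by hand, and the 100-unit rescaling is an added convenience, not part of the original output.

```r
# Baseline-model coefficient estimates, copied from the summary above.
coefs <- c(Review_all       =  8.459e-04,
           search           =  3.114e-06,
           SocialSignal_all = -1.233e-03,
           Backlinks_all    =  6.139e-03)

# Odds ratio for a one-unit increase in each predictor.
odds_ratio <- exp(coefs)

# Odds ratio for a 100-unit increase, easier to read for count predictors.
odds_ratio_100 <- exp(100 * coefs)

round(cbind(per_unit = odds_ratio, per_100 = odds_ratio_100), 4)
```

An odds ratio above 1 means the odds of being a top seller rise with the predictor, and a value below 1 (here `SocialSignal_all`) means they fall. None of these baseline coefficients is statistically significant, so the ratios should be read with that in mind.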
Confusion matrix
## Predict
## Actual 0 1
## 0 35 0
## 1 12 2
## 0 1 Per_Correct
## 0 35 0 71.428571
## 1 12 2 4.081633
## [1] 75.5102
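The `Per_Correct` column and the overall figure can be reproduced from the four cell counts. The sketch below rebuilds the matrix by hand; the object names are illustrative, and sensitivity is an extra measure that the report does not print.

```r
# Baseline confusion matrix on the 49 test cases (rows = actual, cols = predicted).
cm <- matrix(c(35, 0,
               12, 2),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("0", "1"), Predict = c("0", "1")))

# Correctly classified cases in each class, as a percentage of all cases.
per_correct <- 100 * diag(cm) / sum(cm)

# Overall accuracy is the sum of the two percentages.
accuracy <- sum(per_correct)

# Share of actual top sellers (class 1) that the model catches.
sensitivity <- cm["1", "1"] / sum(cm["1", ])

round(c(per_correct, accuracy = accuracy, sensitivity = 100 * sensitivity), 2)
```

Most of the 75.51% accuracy comes from the majority class: only 2 of the 14 actual top sellers are predicted correctly.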
Identifying & handling Outliers
## Unit_sales Review_all search SocialSignal_all
## Unit_sales 1.0000000 0.3388066 0.3577576 0.2417655
## Review_all 0.3388066 1.0000000 0.7244664 0.8053331
## search 0.3577576 0.7244664 1.0000000 0.8501081
## SocialSignal_all 0.2417655 0.8053331 0.8501081 1.0000000
## Backlinks_all 0.1872276 0.7076531 0.7617333 0.7944275
## Backlinks_all
## Unit_sales 0.1872276
## Review_all 0.7076531
## search 0.7617333
## SocialSignal_all 0.7944275
## Backlinks_all 1.0000000
Revised model: Removing outliers based on standardised residuals & influential cases (Cook’s distance)
##
## Call:
## glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all +
## Backlinks_all, family = binomial(link = "logit"), data = SII_SR_CD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6217 -0.7150 -0.6860 0.9805 1.7987
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.329e+00 2.760e-01 -4.817 1.46e-06 ***
## Review_all 6.904e-03 3.089e-03 2.235 0.0254 *
## search 3.705e-07 5.028e-06 0.074 0.9413
## SocialSignal_all -6.060e-03 3.608e-03 -1.680 0.0930 .
## Backlinks_all 2.348e-02 1.351e-02 1.738 0.0822 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 142.19 on 113 degrees of freedom
## Residual deviance: 124.25 on 109 degrees of freedom
## AIC: 134.25
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
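The revised model is fitted on 114 cases instead of 116 (113 versus 115 null degrees of freedom), so two cases were dropped. The report does not state the cut-offs used. The sketch below shows one common way to screen cases, on simulated data: the data frame `toy`, the threshold of 2 for standardised residuals, and the 4/n rule for Cook's distance are all assumptions made for illustration.

```r
# Simulated stand-in for the training data (illustrative values only).
set.seed(42)
n <- 116
toy <- data.frame(Backlinks_all = rpois(n, 17))
toy$L_unitsales <- rbinom(n, 1, plogis(-3 + 0.12 * toy$Backlinks_all))

fit <- glm(L_unitsales ~ Backlinks_all,
           family = binomial(link = "logit"), data = toy)

std_res <- rstandard(fit)        # standardised deviance residuals
cooks   <- cooks.distance(fit)   # influence of each case on the fitted model

# Assumed cut-offs: |standardised residual| > 2 or Cook's distance > 4/n.
keep    <- abs(std_res) <= 2 & cooks <= 4 / n
trimmed <- toy[keep, , drop = FALSE]

# Refit the same model on the screened data.
refit <- glm(formula(fit), family = binomial(link = "logit"), data = trimmed)
c(original = nrow(toy), retained = nrow(trimmed))
```

Whatever cut-offs are chosen, it is worth recording which cases were removed and why, since dropping them changes both the coefficients and the fit statistics.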
Loglikelihood & BIC
## 'log Lik.' -62.1243 (df=5)
## Start: AIC=134.25
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## [1] 147.9296
## Start: AIC=134.25
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
##
## Df Deviance AIC
## - search 1 124.25 132.25
## <none> 124.25 134.25
## - Backlinks_all 1 127.36 135.35
## - SocialSignal_all 1 128.52 136.52
## - Review_all 1 132.73 140.73
##
## Step: AIC=132.25
## L_unitsales ~ Review_all + SocialSignal_all + Backlinks_all
##
## Df Deviance AIC
## <none> 124.25 132.25
## - Backlinks_all 1 127.52 133.51
## - SocialSignal_all 1 129.84 135.84
## - Review_all 1 132.80 138.80
## [1] 143.1988
Overall Variance Explained : Revised Model
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 11.3, df = 4, P(> X2) = 0.023
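The p-values in both Wald tests are upper-tail chi-squared probabilities. As a rough check, the sketch below recomputes them from the rounded statistics printed in this report. The helper `wald_p` is hypothetical, and small differences from the printed p-values are expected because the inputs are rounded.

```r
# Upper-tail chi-squared probability for a Wald statistic.
wald_p <- function(x2, df) pchisq(x2, df = df, lower.tail = FALSE)

p_baseline <- wald_p(3.7,  df = 4)   # baseline model, all four predictors
p_revised  <- wald_p(11.3, df = 4)   # revised model, all four predictors

round(c(baseline = p_baseline, revised = p_revised), 3)
```

At the 5% level, the four predictors are jointly non-significant in the baseline model and jointly significant in the revised model.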
Revised Model Prediction & Odds
Revised Model confusion matrix
## Predict
## Actual 0 1
## 0 35 0
## 1 11 3
## 0 1 Per_Correct
## 0 35 0 71.428571
## 1 11 3 6.122449
## [1] 77.55102
Conclusion: The logistic regression fitted after removing outliers and influential cases classifies 77.55% of the 49 test cases correctly, against 75.51% for the regression fitted on all cases. That is a gain of about 2 percentage points (one additional case), so we conclude that the revised model is better than the baseline model.
Prediction & Odds for complete data.
## Waiting for profiling to be done...
Classification Accuracy/Confusion matrix for complete data with 30% & 50% cut-off
## Predict
## Actual 0 1
## 0 102 13
## 1 21 29
## 0 1 Per_Correct
## 0 102 13 61.81818
## 1 21 29 17.57576
## [1] 79.39394
## Predict
## Actual 0 1
## 0 109 6
## 1 35 15
## 0 1 Per_Correct
## 0 109 6 66.060606
## 1 35 15 9.090909
## [1] 75.15152
Of the 50 actual top products, the model correctly identifies 29 at a probability cut-off of 30% and 15 at a cut-off of 50%. Overall accuracy is 79.39% and 75.15% respectively.
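The trade-off between the two cut-offs is easier to see when accuracy is split into sensitivity and specificity. The sketch below recomputes these from the two confusion matrices above; the helper names `make_cm` and `metrics` are illustrative and not part of the original analysis.

```r
# Build a 2 x 2 confusion matrix (rows = actual, cols = predicted).
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("0", "1"), Predict = c("0", "1")))
}

# Accuracy, sensitivity (actual 1s found) and specificity (actual 0s found).
metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

cut30 <- make_cm(c(102, 13, 21, 29))   # 30% probability cut-off
cut50 <- make_cm(c(109,  6, 35, 15))   # 50% probability cut-off

res <- round(100 * rbind(cut30 = metrics(cut30), cut50 = metrics(cut50)), 2)
res
```

Lowering the cut-off from 50% to 30% raises sensitivity from 30% to 58% of the top products, at the cost of 7 more false positives among the 115 other products.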
## prod_id Top_Prod_30 Top_Prod_50
## 1 621998000 Yes Yes
## 3 651693000 Yes Yes
## 4 621992000 Yes Yes
## 5 648235000 Yes Yes
## 12 651692000 Yes Yes
## 19 664156000 Yes Yes
## 20 659573000 Yes Yes
## 22 604341000 Yes Yes
## 24 610649000 Yes Yes
## 27 661246000 Yes Yes
## 28 610659000 Yes Yes
## 29 624836000 Yes Yes
## 33 661243000 Yes Yes
## 40 632705000 Yes Yes
## 45 604358000 Yes Yes
## 6 621990000 Yes No
## 7 623865000 Yes No
## 8 664150000 Yes No
## 9 621095000 Yes No
## 13 661959000 Yes No
## 31 602135000 Yes No
## 35 621901000 Yes No
## 36 659574000 Yes No
## 38 648159000 Yes No
## 39 651983000 Yes No
## 42 632710000 Yes No
## 44 602069000 Yes No
## 49 661073000 Yes No
## 50 624707000 Yes No
## 2 623010000 No No
## 10 602072000 No No
## 11 621260000 No No
## 14 621902000 No No
## 15 624838000 No No
## 16 648077000 No No
## 17 661961000 No No
## 18 679666000 No No
## 21 632635000 No No
## 23 660772000 No No
## 25 684253000 No No
## 26 684246000 No No
## 30 688352000 No No
## 32 602133000 No No
## 34 661178000 No No
## 37 664165000 No No
## 41 648076000 No No
## 43 604342000 No No
## 46 624715000 No No
## 47 669659000 No No
## 48 624706000 No No
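The Yes/No flags in the table above come from comparing each product's predicted probability with the two cut-offs. The sketch below illustrates this with made-up values: the product labels and probabilities are hypothetical, and whether the original analysis used `>=` or `>` at the boundary is not stated, so `>=` is an assumption.

```r
# Hypothetical predicted probabilities for three products.
probs <- c(A = 0.62, B = 0.41, C = 0.18)

flags <- data.frame(
  prod_id     = names(probs),
  Top_Prod_30 = unname(ifelse(probs >= 0.30, "Yes", "No")),  # 30% cut-off
  Top_Prod_50 = unname(ifelse(probs >= 0.50, "Yes", "No")),  # 50% cut-off
  row.names   = NULL
)
flags
```

A product flagged at the 50% cut-off is always flagged at 30% as well, which is why the table contains no "No / Yes" rows.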