Attributes included in the analysis & their data types

## 'data.frame':    165 obs. of  7 variables:
##  $ prod_id         : chr  "621998000" "623010000" "651693000" "621992000" ...
##  $ Unit_sales      : int  3302 2260 1993 1708 806 736 638 632 554 493 ...
##  $ L_unitsales     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Review_all      : int  284 0 1110 131 470 219 169 90 47 6 ...
##  $ search          : int  261650 11140 326130 123160 1880 49790 3020 269020 55960 8940 ...
##  $ SocialSignal_all: int  247 0 1059 1 272 392 167 226 16 17 ...
##  $ Backlinks_all   : int  70 0 64 23 9 75 20 57 10 10 ...
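
The output above does not say how the binary response `L_unitsales` was built. One reading, consistent with the 50 class-1 products reported further down, is that it flags the top products by `Unit_sales`. A minimal sketch under that assumption, using the ten `Unit_sales` values printed above and a hypothetical cut at the top 3 (`flag_top` is an illustrative helper, not a function from the analysis):

```r
# Assumption: the response is 1 for the top-ranked products by unit sales.
flag_top <- function(unit_sales, top_n) {
  # Rank in descending order of sales; ties are broken by position.
  as.integer(rank(-unit_sales, ties.method = "first") <= top_n)
}

# First ten Unit_sales values shown in the str() output.
unit_sales <- c(3302, 2260, 1993, 1708, 806, 736, 638, 632, 554, 493)

l_unitsales <- flag_top(unit_sales, top_n = 3)
print(l_unitsales)
```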

Descriptive statistics for all numeric attributes

##                  vars   n     mean       sd median  trimmed     mad min
## Unit_sales          1 165   131.52   383.71     20    49.68   28.17   0
## Review_all          2 165    90.62   229.92      7    32.72   10.38   0
## search              3 165 36772.67 77785.65   5710 17739.32 8465.65   0
## SocialSignal_all    4 165    87.22   251.07      5    27.33    7.41   0
## Backlinks_all       5 165    17.22    35.92      4     9.43    5.93   0
##                     max  range skew kurtosis      se
## Unit_sales         3302   3302 5.65    36.22   29.87
## Review_all         1570   1570 3.97    17.37   17.90
## search           574200 574200 3.71    17.02 6055.60
## SocialSignal_all   1861   1861 4.59    23.04   19.55
## Backlinks_all       286    286 4.73    28.87    2.80
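
The `se` column is the standard error of the mean, `sd / sqrt(n)`, so it can be checked from the printed `sd` and `n`. A small sketch using the values in the table (the vector names are illustrative):

```r
# Standard error of the mean from the printed sd and n = 165.
n <- 165
sds <- c(Unit_sales = 383.71, Review_all = 229.92, search = 77785.65,
         SocialSignal_all = 251.07, Backlinks_all = 35.92)

se <- sds / sqrt(n)
print(round(se, 2))
```

The results should agree with the printed `se` column up to rounding of the inputs.
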

Correlation matrix of all attributes

##                  Unit_sales Review_all    search SocialSignal_all
## Unit_sales        1.0000000  0.2447388 0.3402762        0.2018502
## Review_all        0.2447388  1.0000000 0.6012752        0.5684504
## search            0.3402762  0.6012752 1.0000000        0.7117992
## SocialSignal_all  0.2018502  0.5684504 0.7117992        1.0000000
## Backlinks_all     0.1671937  0.6237374 0.7055494        0.6185073
##                  Backlinks_all
## Unit_sales           0.1671937
## Review_all           0.6237374
## search               0.7055494
## SocialSignal_all     0.6185073
## Backlinks_all        1.0000000
##            vars   n  mean     sd median trimmed  mad min max range skew
## Unit_sales    1 161 77.25 140.08     18    43.5 25.2   0 806   806 2.95
##            kurtosis    se
## Unit_sales     9.59 11.04
##            vars n    mean     sd median trimmed   mad  min  max range skew
## Unit_sales    1 4 2315.75 695.06 2126.5 2315.75 409.2 1708 3302  1594 0.52
##            kurtosis     se
## Unit_sales    -1.83 347.53
## [1] "93.75%"

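Interpretation : By Chebyshev's inequality, at least 93.75% (1 - 1/4^2) of observations lie within 4 standard deviations of the mean, whatever the shape of the distribution. Here that cut-off is 131.52 + 4 * 383.71 = 1666.36 units: 161 of the 165 products (about 97.6%, maximum 806 units) fall below it, and the remaining 4 products (1708 to 3302 units) lie above it.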
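
The 93.75% figure matches the Chebyshev bound `1 - 1/k^2` for `k = 4`. The sketch below assumes that is how it was obtained, and uses the mean and standard deviation of `Unit_sales` from the descriptives to locate the corresponding cut-off (variable names are illustrative):

```r
# Chebyshev bound: at least 1 - 1/k^2 of any distribution lies
# within k standard deviations of its mean.
k <- 4
chebyshev_bound <- 1 - 1 / k^2

# Mean and sd of Unit_sales for all 165 products (from the descriptives).
mean_sales <- 131.52
sd_sales <- 383.71
upper_cutoff <- mean_sales + k * sd_sales

# Share of products in the n = 161 group reported above.
share_within <- 161 / 165

cat(sprintf("bound: %.2f%%\n", 100 * chebyshev_bound))
cat(sprintf("upper cut-off: %.2f units\n", upper_cutoff))
cat(sprintf("observed share within cut-off: %.1f%%\n", 100 * share_within))
```

The lower cut-off (`mean_sales - k * sd_sales`) is negative, so only the upper one matters for sales counts.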

Logistic Regression with all cases : Baseline Model

## 
## Call:
## glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all + 
##     Backlinks_all, family = binomial(link = "logit"), data = SII_TRAIN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8781  -0.7949  -0.7790   1.3648   1.6356  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.045e+00  2.412e-01  -4.333 1.47e-05 ***
## Review_all        8.459e-04  1.058e-03   0.800    0.424    
## search            3.114e-06  4.570e-06   0.681    0.496    
## SocialSignal_all -1.233e-03  2.030e-03  -0.607    0.544    
## Backlinks_all     6.139e-03  8.799e-03   0.698    0.485    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 143.70  on 115  degrees of freedom
## Residual deviance: 139.31  on 111  degrees of freedom
## AIC: 149.31
## 
## Number of Fisher Scoring iterations: 4
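
The deviance lines in this summary support two quick checks. For ungrouped binary data the residual deviance equals -2 times the log-likelihood, so AIC is the residual deviance plus twice the number of coefficients. The drop from null to residual deviance can also be compared with a chi-squared distribution, which is a likelihood-ratio test; that test is not part of the original output and is added here only as an illustration.

```r
# Values copied from the baseline model summary.
null_dev  <- 143.70; null_df  <- 115
resid_dev <- 139.31; resid_df <- 111
k <- 5  # intercept + four predictors

# AIC = -2 * logLik + 2 * k, with deviance = -2 * logLik here.
aic <- resid_dev + 2 * k

# Likelihood-ratio test of the fitted model against the intercept-only model.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat(sprintf("AIC = %.2f\n", aic))
cat(sprintf("LR chi-squared = %.2f on %d df, p = %.3f\n", lr_stat, lr_df, lr_p))
```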

Stepwise Logistic Regression

## 
## Call:  glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all + 
##     Backlinks_all, family = binomial(link = "logit"), data = SII_TRAIN)
## 
## Coefficients:
##      (Intercept)        Review_all            search  SocialSignal_all  
##       -1.045e+00         8.459e-04         3.114e-06        -1.233e-03  
##    Backlinks_all  
##        6.139e-03  
## 
## Degrees of Freedom: 115 Total (i.e. Null);  111 Residual
## Null Deviance:       143.7 
## Residual Deviance: 139.3     AIC: 149.3
## 
## Call:  glm(formula = L_unitsales ~ Backlinks_all, family = binomial(link = "logit"), 
##     data = SII_TRAIN)
## 
## Coefficients:
##   (Intercept)  Backlinks_all  
##     -0.980827       0.009089  
## 
## Degrees of Freedom: 115 Total (i.e. Null);  114 Residual
## Null Deviance:       143.7 
## Residual Deviance: 140.4     AIC: 144.4
## 
## Call:
## glm(formula = L_unitsales ~ Backlinks_all, family = binomial(link = "logit"), 
##     data = SII_TRAIN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8971  -0.8137  -0.7981   1.3832   1.6120  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.980827   0.230297  -4.259 2.05e-05 ***
## Backlinks_all  0.009089   0.005552   1.637    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 143.70  on 115  degrees of freedom
## Residual deviance: 140.41  on 114  degrees of freedom
## AIC: 144.41
## 
## Number of Fisher Scoring iterations: 4
## L_unitsales ~ Backlinks_all

Loglikelihood & BIC

## 'log Lik.' -69.65422 (df=5)
## Start:  AIC=149.31
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## [1] 163.0764
## Start:  AIC=149.31
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## 
##                    Df Deviance    AIC
## - SocialSignal_all  1   139.68 147.68
## - search            1   139.77 147.77
## - Backlinks_all     1   139.82 147.82
## - Review_all        1   139.94 147.94
## <none>                  139.31 149.31
## 
## Step:  AIC=147.68
## L_unitsales ~ Review_all + search + Backlinks_all
## 
##                 Df Deviance    AIC
## - search         1   139.82 145.82
## - Backlinks_all  1   139.97 145.97
## - Review_all     1   140.11 146.11
## <none>               139.68 147.68
## 
## Step:  AIC=145.82
## L_unitsales ~ Review_all + Backlinks_all
## 
##                 Df Deviance    AIC
## - Review_all     1   140.41 144.41
## - Backlinks_all  1   140.66 144.66
## <none>               139.82 145.82
## 
## Step:  AIC=144.41
## L_unitsales ~ Backlinks_all
## 
##                 Df Deviance    AIC
## <none>               140.41 144.41
## - Backlinks_all  1   143.69 145.69
## [1] 149.9183
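
The two unlabeled `[1]` values in this output are consistent with BIC for the full and the reduced model, and the AIC column of the stepwise trace is the deviance plus twice the number of remaining coefficients. A sketch of both calculations, assuming `n = 116` training cases (null degrees of freedom plus one):

```r
n_train <- 116  # assumption: 115 null degrees of freedom + 1

# Full model: BIC = -2 * logLik + k * log(n).
log_lik_full <- -69.65422
k_full <- 5
bic_full <- -2 * log_lik_full + k_full * log(n_train)

# Reduced model (intercept + Backlinks_all), from its residual deviance.
dev_reduced <- 140.41
k_reduced <- 2
bic_reduced <- dev_reduced + k_reduced * log(n_train)

# First row of the stepwise trace: dropping SocialSignal_all leaves 4 coefficients.
aic_drop_social <- 139.68 + 2 * 4

cat(sprintf("BIC full = %.4f, BIC reduced = %.3f\n", bic_full, bic_reduced))
cat(sprintf("AIC after dropping SocialSignal_all = %.2f\n", aic_drop_social))
```

The reduced-model BIC is computed from a rounded deviance, so it may differ from the printed value in the third decimal place.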

Overall Variance Explained & Significance

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 3.7, df = 4, P(> X2) = 0.44
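
The p-value of the Wald test is the upper tail of a chi-squared distribution at the reported statistic. A quick check with base R, using the rounded statistic printed above:

```r
# Upper-tail chi-squared probability for the reported Wald statistic.
wald_stat <- 3.7
wald_df <- 4
p_value <- pchisq(wald_stat, df = wald_df, lower.tail = FALSE)

cat(sprintf("P(> X2) = %.3f\n", p_value))
```

A small difference from the printed 0.44 is expected, because 3.7 is itself a rounded value.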

Prediction & Odds Ratio

## Waiting for profiling to be done...
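
The odds ratios themselves are not printed. A minimal sketch of how they follow from the baseline coefficients: exponentiate each estimate, and exponentiate `estimate ± z * SE` for an interval. Note that this gives Wald intervals, a simpler stand-in for the profile-likelihood intervals that the "profiling" message suggests were computed.

```r
# Estimates and standard errors copied from the baseline model summary.
est <- c(Review_all = 8.459e-04, search = 3.114e-06,
         SocialSignal_all = -1.233e-03, Backlinks_all = 6.139e-03)
se <- c(Review_all = 1.058e-03, search = 4.570e-06,
        SocialSignal_all = 2.030e-03, Backlinks_all = 8.799e-03)

z <- qnorm(0.975)  # two-sided 95% interval

odds <- data.frame(
  odds_ratio = exp(est),
  lower = exp(est - z * se),  # Wald limits, not profile-likelihood limits
  upper = exp(est + z * se)
)
print(signif(odds, 5))
```

Every interval contains 1, which agrees with the non-significant z tests in the baseline summary.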

Confusion matrix

##       Predict
## Actual  0  1
##      0 35  0
##      1 12  2
##    0 1 Per_Correct
## 0 35 0   71.428571
## 1 12 2    4.081633
## [1] 75.5102
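
The `Per_Correct` column and the overall accuracy can be reproduced from the four counts. The sketch below also derives sensitivity and specificity, which are not in the original output but make the imbalance between the two classes visible:

```r
# Confusion matrix copied from the output above (rows = actual, cols = predicted).
cm <- matrix(c(35, 0,
               12, 2),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual = c("0", "1"), Predict = c("0", "1")))

# Correctly classified cases per class, as a percentage of all cases.
per_correct <- 100 * diag(cm) / sum(cm)
accuracy <- sum(per_correct)

sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of actual 1s found
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of actual 0s found

# Accuracy of always predicting class 0, for comparison.
baseline_rate <- 100 * sum(cm["0", ]) / sum(cm)

print(round(per_correct, 6))
cat(sprintf("accuracy = %.4f%%, always-0 baseline = %.2f%%\n", accuracy, baseline_rate))
cat(sprintf("sensitivity = %.3f, specificity = %.3f\n", sensitivity, specificity))
```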

Identifying & handling Outliers

##                  Unit_sales Review_all    search SocialSignal_all
## Unit_sales        1.0000000  0.3388066 0.3577576        0.2417655
## Review_all        0.3388066  1.0000000 0.7244664        0.8053331
## search            0.3577576  0.7244664 1.0000000        0.8501081
## SocialSignal_all  0.2417655  0.8053331 0.8501081        1.0000000
## Backlinks_all     0.1872276  0.7076531 0.7617333        0.7944275
##                  Backlinks_all
## Unit_sales           0.1872276
## Review_all           0.7076531
## search               0.7617333
## SocialSignal_all     0.7944275
## Backlinks_all        1.0000000
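
Cases can be screened with standardized residuals and Cook's distance using base R. The sketch below runs on synthetic data, not on the data of this report, and the thresholds (|standardized residual| > 2, Cook's distance > 4/n) are common rules of thumb assumed for illustration; the report does not state which thresholds were used.

```r
# Synthetic data: NOT the SII_TRAIN data from the report.
set.seed(42)
n <- 120
toy <- data.frame(x = rexp(n, rate = 1 / 20))
toy$y <- rbinom(n, size = 1, prob = plogis(-1 + 0.03 * toy$x))

fit <- glm(y ~ x, family = binomial(link = "logit"), data = toy)

std_res <- rstandard(fit)       # standardized deviance residuals
cook_d  <- cooks.distance(fit)  # influence of each case on the fit

# Assumed rules of thumb; adjust to the thresholds actually used.
keep <- abs(std_res) <= 2 & cook_d <= 4 / n
toy_clean <- toy[keep, , drop = FALSE]

# Refit on the retained cases only.
refit <- glm(y ~ x, family = binomial(link = "logit"), data = toy_clean)
cat("cases removed:", sum(!keep), "of", n, "\n")
```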

Revised model : Removing outliers based on standardised residuals & influential cases (Cook’s distance)

## 
## Call:
## glm(formula = L_unitsales ~ Review_all + search + SocialSignal_all + 
##     Backlinks_all, family = binomial(link = "logit"), data = SII_SR_CD)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6217  -0.7150  -0.6860   0.9805   1.7987  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.329e+00  2.760e-01  -4.817 1.46e-06 ***
## Review_all        6.904e-03  3.089e-03   2.235   0.0254 *  
## search            3.705e-07  5.028e-06   0.074   0.9413    
## SocialSignal_all -6.060e-03  3.608e-03  -1.680   0.0930 .  
## Backlinks_all     2.348e-02  1.351e-02   1.738   0.0822 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 142.19  on 113  degrees of freedom
## Residual deviance: 124.25  on 109  degrees of freedom
## AIC: 134.25
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...

Loglikelihood & BIC

## 'log Lik.' -62.1243 (df=5)
## Start:  AIC=134.25
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## [1] 147.9296
## Start:  AIC=134.25
## L_unitsales ~ Review_all + search + SocialSignal_all + Backlinks_all
## 
##                    Df Deviance    AIC
## - search            1   124.25 132.25
## <none>                  124.25 134.25
## - Backlinks_all     1   127.36 135.35
## - SocialSignal_all  1   128.52 136.52
## - Review_all        1   132.73 140.73
## 
## Step:  AIC=132.25
## L_unitsales ~ Review_all + SocialSignal_all + Backlinks_all
## 
##                    Df Deviance    AIC
## <none>                  124.25 132.25
## - Backlinks_all     1   127.52 133.51
## - SocialSignal_all  1   129.84 135.84
## - Review_all        1   132.80 138.80
## [1] 143.1988

Overall Variance Explained : Revised Model

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 11.3, df = 4, P(> X2) = 0.023
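
The Wald test addresses overall significance rather than the share of variation explained. One common stand-in for the latter in logistic regression is McFadden's pseudo R-squared, which for ungrouped binary data reduces to `1 - residual deviance / null deviance`. It is not part of the original output; the sketch below computes it from the deviances reported for both models (`mcfadden_r2` is an illustrative helper).

```r
# McFadden's pseudo R-squared from the reported deviances.
mcfadden_r2 <- function(resid_dev, null_dev) {
  1 - resid_dev / null_dev
}

r2_baseline <- mcfadden_r2(resid_dev = 139.31, null_dev = 143.70)
r2_revised  <- mcfadden_r2(resid_dev = 124.25, null_dev = 142.19)

cat(sprintf("baseline: %.3f, revised: %.3f\n", r2_baseline, r2_revised))
```

The two models were fitted on different sets of cases, so the comparison is indicative only.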

Revised Model Prediction & Odds

Revised Model confusion matrix

##       Predict
## Actual  0  1
##      0 35  0
##      1 11  3
##    0 1 Per_Correct
## 0 35 0   71.428571
## 1 11 3    6.122449
## [1] 77.55102

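Conclusion : With outliers and influential cases removed, the logistic regression classifies 77.55% of the 49 cases in the confusion matrix correctly, against 75.51% for the model fitted on all cases. This is a gain of about 2 percentage points (one additional product classified correctly), so we conclude that the revised model is better than the baseline model.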

Prediction & Odds for complete data.

## Waiting for profiling to be done...

Classification Accuracy/Confusion matrix for complete data with 30% & 50% cut-offs

##       Predict
## Actual   0   1
##      0 102  13
##      1  21  29
##     0  1 Per_Correct
## 0 102 13    61.81818
## 1  21 29    17.57576
## [1] 79.39394
##       Predict
## Actual   0   1
##      0 109   6
##      1  35  15
##     0  1 Per_Correct
## 0 109  6   66.060606
## 1  35 15    9.090909
## [1] 75.15152

Of the 50 products with actual class 1, the model correctly classifies 29 at a probability cut-off of 30% and 15 at a cut-off of 50%. Overall accuracy is 79.39% and 75.15% respectively.
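
The effect of the cut-off can be sketched as follows. The labels and probabilities in the first part are hypothetical, and the rule `prob >= cutoff` is an assumption about how the classes were assigned; the second part only restates the counts printed above as sensitivity and specificity (`confusion_at` is an illustrative helper).

```r
# Cross-tabulate actual classes against classes predicted at a given cut-off.
confusion_at <- function(actual, prob, cutoff) {
  predicted <- factor(as.integer(prob >= cutoff), levels = c(0, 1))
  table(Actual = factor(actual, levels = c(0, 1)), Predict = predicted)
}

# Hypothetical labels and fitted probabilities, for illustration only.
actual <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.80, 0.45, 0.20, 0.35, 0.10, 0.05, 0.55, 0.60)

cm30 <- confusion_at(actual, prob, cutoff = 0.30)
cm50 <- confusion_at(actual, prob, cutoff = 0.50)
print(cm30)
print(cm50)

# Rates implied by the counts printed in the report.
sens <- c(cutoff_30 = 29 / 50, cutoff_50 = 15 / 50)
spec <- c(cutoff_30 = 102 / 115, cutoff_50 = 109 / 115)
print(round(rbind(sensitivity = sens, specificity = spec), 3))
```

Lowering the cut-off trades specificity for sensitivity: more top products are found, at the cost of more false positives.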

##      prod_id Top_Prod_30 Top_Prod_50
## 1  621998000         Yes         Yes
## 3  651693000         Yes         Yes
## 4  621992000         Yes         Yes
## 5  648235000         Yes         Yes
## 12 651692000         Yes         Yes
## 19 664156000         Yes         Yes
## 20 659573000         Yes         Yes
## 22 604341000         Yes         Yes
## 24 610649000         Yes         Yes
## 27 661246000         Yes         Yes
## 28 610659000         Yes         Yes
## 29 624836000         Yes         Yes
## 33 661243000         Yes         Yes
## 40 632705000         Yes         Yes
## 45 604358000         Yes         Yes
## 6  621990000         Yes          No
## 7  623865000         Yes          No
## 8  664150000         Yes          No
## 9  621095000         Yes          No
## 13 661959000         Yes          No
## 31 602135000         Yes          No
## 35 621901000         Yes          No
## 36 659574000         Yes          No
## 38 648159000         Yes          No
## 39 651983000         Yes          No
## 42 632710000         Yes          No
## 44 602069000         Yes          No
## 49 661073000         Yes          No
## 50 624707000         Yes          No
## 2  623010000          No          No
## 10 602072000          No          No
## 11 621260000          No          No
## 14 621902000          No          No
## 15 624838000          No          No
## 16 648077000          No          No
## 17 661961000          No          No
## 18 679666000          No          No
## 21 632635000          No          No
## 23 660772000          No          No
## 25 684253000          No          No
## 26 684246000          No          No
## 30 688352000          No          No
## 32 602133000          No          No
## 34 661178000          No          No
## 37 664165000          No          No
## 41 648076000          No          No
## 43 604342000          No          No
## 46 624715000          No          No
## 47 669659000          No          No
## 48 624706000          No          No