Feature Analysis for EOC RE

The end-of-content (EOC) RE considers the relevance, effectiveness and engagement of candidate content when calculating the recommendation score. These measures are derived from content usage parameters obtained from the content_popularity_summary_fact and content_usage_summary tables. An understanding of these parameters is required to choose the weight and ‘importance’ of each feature in the recommendation score.

Feature distributions:

Content features available:

avg_rating
downloads
side_loads
avg_interactions_min
avg_sess_device
avg_ts_session
last_gen_date
last_sync_date
publish_date
total_devices
total_interactions
total_sessions
total_ts

#print("Features being Used:")
#print( colnames(content_info.df))
summary(content_info.df)

##    avg_rating      downloads         side_loads     avg_interactions_min
##  Min.   :0.000   Min.   :   0.00   Min.   : 0.000   Min.   :  0.00      
##  1st Qu.:0.000   1st Qu.:   4.00   1st Qu.: 0.000   1st Qu.:  9.73      
##  Median :3.130   Median :  17.00   Median : 0.000   Median : 18.46      
##  Mean   :2.348   Mean   :  37.48   Mean   : 3.978   Mean   : 26.07      
##  3rd Qu.:4.605   3rd Qu.:  49.00   3rd Qu.: 3.000   3rd Qu.: 31.78      
##  Max.   :5.000   Max.   :2673.00   Max.   :80.000   Max.   :436.36      
##                  NA's   :1         NA's   :1                            
##  avg_sess_device  avg_ts_session   last_gen_date       last_sync_date     
##  Min.   : 1.000   Min.   :  1.31   Min.   :6.747e+04   Min.   :1.461e+09  
##  1st Qu.: 2.000   1st Qu.: 37.62   1st Qu.:1.483e+09   1st Qu.:1.484e+09  
##  Median : 3.000   Median : 75.63   Median :1.486e+09   Median :1.486e+09  
##  Mean   : 4.529   Mean   :104.40   Mean   :1.481e+09   Mean   :1.484e+09  
##  3rd Qu.: 5.180   3rd Qu.:138.91   3rd Qu.:1.487e+09   3rd Qu.:1.487e+09  
##  Max.   :66.220   Max.   :840.54   Max.   :1.487e+09   Max.   :1.487e+09  
##                                                                           
##   publish_date        total_devices     total_interactions
##  Min.   :    -18595   Min.   :   1.00   Min.   :       0  
##  1st Qu.:1467425675   1st Qu.:   2.00   1st Qu.:      71  
##  Median :1476929758   Median :   7.00   Median :     522  
##  Mean   :1384814302   Mean   :  30.26   Mean   :   21772  
##  3rd Qu.:1482886261   3rd Qu.:  24.00   3rd Qu.:    2234  
##  Max.   :1487141008   Max.   :6582.00   Max.   :16481935  
##                                                           
##  total_sessions        total_ts       
##  Min.   :     1.0   Min.   :       2  
##  1st Qu.:     5.0   1st Qu.:     268  
##  Median :    22.0   Median :    1848  
##  Mean   :   576.6   Mean   :   98605  
##  3rd Qu.:   100.5   3rd Qu.:    8204  
##  Max.   :435860.0   Max.   :81295274  
##

# Outlier treatment: for each feature, flag outliers with the boxplot (1.5 * IQR) rule,
# report the number of distinct outlier values, and plot a histogram without them.
# feature.names and remove_outliers() are not shown in this report.
for (feat_num in seq_len(ncol(content_info.df))) {
  outlier_values <- boxplot.stats(content_info.df[, feat_num])$out
  print(paste0("Number of outliers in feature '", feature.names[feat_num], "': ",
               length(unique(outlier_values))))
  content_info_outlierrm <- remove_outliers(content_info.df[, feat_num], outlier_values)
  hist(content_info_outlierrm,
       main = paste0("Histogram of ", feature.names[feat_num], "/content"),
       xlab = feature.names[feat_num])
}

## [1] "Number of outliers in feature 'avg_rating': 0"

## [1] "Number of outliers in feature 'downloads': 36"

## [1] "Number of outliers in feature 'side_loads': 42"

## [1] "Number of outliers in feature 'avg_interactions_min': 70"

## [1] "Number of outliers in feature 'avg_sess_device': 89"

## [1] "Number of outliers in feature 'avg_ts_session': 57"

## [1] "Number of outliers in feature 'last_gen_date': 146"

## [1] "Number of outliers in feature 'last_sync_date': 130"

## [1] "Number of outliers in feature 'publish_date': 212"

## [1] "Number of outliers in feature 'total_devices': 62"

## [1] "Number of outliers in feature 'total_interactions': 173"

## [1] "Number of outliers in feature 'total_sessions': 126"

## [1] "Number of outliers in feature 'total_ts': 154"
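
The remove_outliers() helper called in the loop above is not shown in this report. As a rough illustration only, a stand-in with the same call shape could simply drop the flagged values. The name remove_outliers_sketch and the toy vector are assumptions, not the author's code.

```r
# Hypothetical stand-in for the report's remove_outliers() helper (not the author's code):
# drop every value that appears in outlier_values, and drop NAs as well.
remove_outliers_sketch <- function(x, outlier_values) {
  x[!is.na(x) & !(x %in% outlier_values)]
}

# Toy data: one value far outside the bulk of the distribution
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges
outlier_values <- boxplot.stats(x)$out
cleaned <- remove_outliers_sketch(x, outlier_values)

print(outlier_values)
print(cleaned)
```

The histogram call would then receive only the values that survive the filter.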

The distributions of the outlier-treated features suggest that a transformation is needed before these features can be used. Based on the results from the usage-based device RE, we use quantile binning and one-hot encoding of the variables.
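
The one-hot step can be sketched as follows, assuming each feature has already been reduced to an integer bin label. The helper name one_hot and the toy data are assumptions. The column naming (feature name, underscore, bin number) is chosen to resemble names such as total_devices_4 that appear later in this report.

```r
# Toy binned feature: integer bin labels 1..4 (assumed output of the binning step)
binned <- data.frame(avg_ts_session = c(1, 4, 2, 3, 4))

# Hypothetical helper: build one 0/1 indicator column per bin, named <feature>_<bin>
one_hot <- function(df, feature, n_bins) {
  out <- sapply(seq_len(n_bins), function(b) as.integer(df[[feature]] == b))
  colnames(out) <- paste(feature, seq_len(n_bins), sep = "_")
  as.data.frame(out)
}

onehot.df <- one_hot(binned, "avg_ts_session", 4)
print(onehot.df)
```

Each row has exactly one indicator set to 1, so the indicator columns of a feature always sum to 1.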

Binning:

Each feature is divided into 4 quantile bins.

For the following features, some quantile values coincide, so bins share boundaries:

avg_rating
downloads
side_loads
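
The effect of shared boundaries can be illustrated with a small sketch. The report's own binning code is not shown and may treat ties differently. The helper quantile_bin and both toy vectors are assumptions made for illustration.

```r
# Hypothetical helper: assign each value to a quantile bin, merging bins whose
# boundaries coincide so that cut() receives strictly increasing breaks.
quantile_bin <- function(x, n_bins = 4) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

# Well-spread toy feature: all four quantile boundaries are distinct
spread <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Toy feature with many zeros: the minimum and the first quartile are both 0,
# so two boundaries coincide and fewer than four bins remain
zero_heavy <- c(0, 0, 0, 0, 3, 4, 4.5, 5)

spread_bins <- quantile_bin(spread)
zero_heavy_bins <- quantile_bin(zero_heavy)

print(table(spread_bins))
print(table(zero_heavy_bins))
```

In the summary statistics above, avg_rating and side_loads both have a first quartile equal to their minimum, which is the situation the second toy vector imitates.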

Note: the current design of EOC uses avg_rating and the number of downloads as measures.

With avg_interactions_min as a proxy for engagement and avg_rating as a proxy for popularity, the number of content items in the high-popularity, high-engagement bin is 94.

Bin 1 is the lowest-value bin; the highest-value bin is 4 for avg_interactions_min and 2 for avg_rating:

#length(which(which(content_info_binned.df$avg_interactions_min==4)%in% which(content_info_binned.df$avg_rating==2)))
#content_info_binned.df$count<-rep(1,dim(content_info_binned.df)[1])
#EOC_param_gp<-group_by(content_info_binned.df,c(avg_interactions_min,avg_rating))%>%summarise(sum(count))
EOC_param_gp<-as.data.frame(table(content_info_binned.df$avg_interactions_min,content_info_binned.df$avg_rating))
colnames(EOC_param_gp)<-c("avg_interactions_min","avg_rating","count")
print(EOC_param_gp)

##   avg_interactions_min avg_rating count
## 1                    1          1   145
## 2                    2          1   117
## 3                    3          1   102
## 4                    4          1   176
## 5                    1          2   125
## 6                    2          2   154
## 7                    3          2   166
## 8                    4          2    94

Variable importance

To assess the importance of each binned feature, a decision tree is fitted with total_ts as the target variable.

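library(rpart)
# Regression tree with total_ts as the target and all one-hot features as predictors
content_rpart <- rpart(total_ts ~ ., data = content_info_onehot.df)
plot(content_rpart, uniform = TRUE, compress = TRUE, margin = 0.2)
text(content_rpart, use.n = TRUE, cex = 0.8)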

# Normalise the variable importance so that the values sum to 1
x<-as.data.frame(content_rpart$variable.importance)
x$`content_rpart$variable.importance`<- x$`content_rpart$variable.importance`/sum(x$`content_rpart$variable.importance`)
x

##                        content_rpart$variable.importance
## total_interactions_4                        0.2699381450
## total_devices_4                             0.1392116715
## total_interactions_3                        0.1381416524
## total_sessions_4                            0.1278694935
## total_sessions_3                            0.0478177092
## avg_rating_1                                0.0365972288
## avg_rating_2                                0.0365972288
## total_devices_3                             0.0349162970
## total_interactions_1                        0.0278283515
## avg_interactions_min_4                      0.0276620086
## publish_date_1                              0.0239945018
## avg_sess_device_4                           0.0199954181
## total_sessions_1                            0.0128404129
## total_devices_1                             0.0110714035
## avg_ts_session_1                            0.0109252047
## avg_sess_device_1                           0.0106411270
## total_devices_2                             0.0080393016
## avg_interactions_min_1                      0.0074663543
## avg_interactions_min_3                      0.0065636580
## avg_ts_session_4                            0.0012088383
## publish_date_3                              0.0006739935
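
Because the importances above are reported per bin, one possible way to get a per-feature view when choosing weights is to sum the bins of each feature. This is only a sketch of that idea, not a step from the original analysis. It uses the top seven rows of the output above, and the variable names are assumptions.

```r
# Normalised per-bin importances copied from the output above
# (top seven rows only, to keep the sketch short)
importance <- c(
  total_interactions_4 = 0.2699381450,
  total_devices_4      = 0.1392116715,
  total_interactions_3 = 0.1381416524,
  total_sessions_4     = 0.1278694935,
  total_sessions_3     = 0.0478177092,
  avg_rating_1         = 0.0365972288,
  avg_rating_2         = 0.0365972288
)

# Strip the trailing "_<bin>" suffix to recover the original feature name
feature <- sub("_[0-9]+$", "", names(importance))

# Sum the importance of all bins that belong to the same feature
by_feature <- sort(tapply(importance, feature, sum), decreasing = TRUE)
print(by_feature)
```

With the full table the same grouping would cover every feature that the tree used.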

Fitting a linear regression model:

lm.model = lm(content_info_onehot.df$total_ts ~., data=content_info_onehot.df, na.action = na.exclude) 
summary(lm.model)

## 
## Call:
## lm(formula = content_info_onehot.df$total_ts ~ ., data = content_info_onehot.df, 
##     na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02618 -0.27071  0.00294  0.24371  1.17660 
## 
## Coefficients: (10 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.103455   0.062818  65.323  < 2e-16 ***
## avg_rating_11           -0.019983   0.027701  -0.721 0.470841    
## avg_rating_21                  NA         NA      NA       NA    
## avg_interactions_min_11  0.278623   0.038930   7.157 1.55e-12 ***
## avg_interactions_min_21  0.181247   0.034985   5.181 2.65e-07 ***
## avg_interactions_min_31  0.063372   0.032562   1.946 0.051896 .  
## avg_interactions_min_41        NA         NA      NA       NA    
## avg_sess_device_11      -0.344254   0.051232  -6.720 2.98e-11 ***
## avg_sess_device_21      -0.189200   0.041149  -4.598 4.79e-06 ***
## avg_sess_device_31      -0.081900   0.034420  -2.379 0.017517 *  
## avg_sess_device_41             NA         NA      NA       NA    
## avg_ts_session_11       -0.727024   0.044271 -16.422  < 2e-16 ***
## avg_ts_session_21       -0.436329   0.035112 -12.427  < 2e-16 ***
## avg_ts_session_31       -0.235003   0.031224  -7.526 1.12e-13 ***
## avg_ts_session_41              NA         NA      NA       NA    
## last_gen_date_11         0.019872   0.058057   0.342 0.732210    
## last_gen_date_21         0.020956   0.050702   0.413 0.679455    
## last_gen_date_31         0.022902   0.047141   0.486 0.627203    
## last_gen_date_41               NA         NA      NA       NA    
## last_sync_date_11       -0.007248   0.060285  -0.120 0.904323    
## last_sync_date_21       -0.009016   0.051921  -0.174 0.862177    
## last_sync_date_31       -0.019564   0.045013  -0.435 0.663920    
## last_sync_date_41              NA         NA      NA       NA    
## publish_date_11         -0.013993   0.049520  -0.283 0.777554    
## publish_date_21          0.002590   0.041374   0.063 0.950098    
## publish_date_31          0.061287   0.038060   1.610 0.107633    
## publish_date_41                NA         NA      NA       NA    
## total_devices_11        -0.547640   0.089153  -6.143 1.15e-09 ***
## total_devices_21        -0.416254   0.073799  -5.640 2.18e-08 ***
## total_devices_31        -0.182717   0.051071  -3.578 0.000362 ***
## total_devices_41               NA         NA      NA       NA    
## total_interactions_11   -0.973537   0.079864 -12.190  < 2e-16 ***
## total_interactions_21   -0.716039   0.060428 -11.849  < 2e-16 ***
## total_interactions_31   -0.328477   0.041792  -7.860 9.48e-15 ***
## total_interactions_41          NA         NA      NA       NA    
## total_sessions_11       -0.748435   0.104268  -7.178 1.34e-12 ***
## total_sessions_21       -0.619461   0.078910  -7.850 1.02e-14 ***
## total_sessions_31       -0.334650   0.055620  -6.017 2.46e-09 ***
## total_sessions_41              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3422 on 1050 degrees of freedom
## Multiple R-squared:  0.9089, Adjusted R-squared:  0.9065 
## F-statistic: 374.2 on 28 and 1050 DF,  p-value: < 2.2e-16
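
The NA rows in the output above ("10 not defined because of singularities") are what lm() reports when a predictor is an exact linear combination of others. With a full set of one-hot columns, the indicators of one feature sum to 1 and therefore duplicate the intercept, so one indicator per feature cannot be estimated. A minimal sketch on toy data (all names and values here are assumptions):

```r
set.seed(1)

# Toy data: a single feature with 4 bins, 10 items per bin
bin <- rep(1:4, each = 10)
toy <- data.frame(
  y     = bin + rnorm(40, sd = 0.1),  # target rises with the bin number
  bin_1 = as.integer(bin == 1),
  bin_2 = as.integer(bin == 2),
  bin_3 = as.integer(bin == 3),
  bin_4 = as.integer(bin == 4)        # full one-hot: the four columns sum to 1
)

# The last indicator is aliased with the intercept, so lm() leaves it as NA
fit <- lm(y ~ ., data = toy)
coefs <- coef(fit)
print(coefs)
```

In the toy fit the bin_4 coefficient is NA and the remaining coefficients are offsets from the bin 4 mean. The negative estimates for bins 1 to 3 in the report can be read the same way, assuming its indicator columns were built similarly.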

Features with significant p-values (p < 0.001):

avg_interactions_min_1
avg_interactions_min_2
avg_sess_device_1
avg_sess_device_2
avg_ts_session_1
avg_ts_session_2
avg_ts_session_3
total_devices_1
total_devices_2
total_devices_3
total_interactions_1
total_interactions_2
total_interactions_3
total_sessions_1
total_sessions_2
total_sessions_3

Adarsa

16 February 2017
