The End of content (EOC) RE takes into account the relevance, effectiveness, and engagement of content when calculating its recommendation score. These measures are derived from content usage parameters obtained from the content_popularity_summary_fact and content_usage_summary tables. An understanding of these parameters is required to choose the weight and ‘importance’ of each feature in the recommendation score.

Feature distributions:

Content features available:

  1. avg_rating
  2. downloads
  3. side_loads
  4. avg_interactions_min
  5. avg_sess_device
  6. avg_ts_session
  7. last_gen_date
  8. last_sync_date
  9. publish_date
  10. total_devices
  11. total_interactions
  12. total_sessions
  13. total_ts
#print("Features being Used:")
#print( colnames(content_info.df))
summary(content_info.df)
##    avg_rating      downloads         side_loads     avg_interactions_min
##  Min.   :0.000   Min.   :   0.00   Min.   : 0.000   Min.   :  0.00      
##  1st Qu.:0.000   1st Qu.:   4.00   1st Qu.: 0.000   1st Qu.:  9.73      
##  Median :3.130   Median :  17.00   Median : 0.000   Median : 18.46      
##  Mean   :2.348   Mean   :  37.48   Mean   : 3.978   Mean   : 26.07      
##  3rd Qu.:4.605   3rd Qu.:  49.00   3rd Qu.: 3.000   3rd Qu.: 31.78      
##  Max.   :5.000   Max.   :2673.00   Max.   :80.000   Max.   :436.36      
##                  NA's   :1         NA's   :1                            
##  avg_sess_device  avg_ts_session   last_gen_date       last_sync_date     
##  Min.   : 1.000   Min.   :  1.31   Min.   :6.747e+04   Min.   :1.461e+09  
##  1st Qu.: 2.000   1st Qu.: 37.62   1st Qu.:1.483e+09   1st Qu.:1.484e+09  
##  Median : 3.000   Median : 75.63   Median :1.486e+09   Median :1.486e+09  
##  Mean   : 4.529   Mean   :104.40   Mean   :1.481e+09   Mean   :1.484e+09  
##  3rd Qu.: 5.180   3rd Qu.:138.91   3rd Qu.:1.487e+09   3rd Qu.:1.487e+09  
##  Max.   :66.220   Max.   :840.54   Max.   :1.487e+09   Max.   :1.487e+09  
##                                                                           
##   publish_date        total_devices     total_interactions
##  Min.   :    -18595   Min.   :   1.00   Min.   :       0  
##  1st Qu.:1467425675   1st Qu.:   2.00   1st Qu.:      71  
##  Median :1476929758   Median :   7.00   Median :     522  
##  Mean   :1384814302   Mean   :  30.26   Mean   :   21772  
##  3rd Qu.:1482886261   3rd Qu.:  24.00   3rd Qu.:    2234  
##  Max.   :1487141008   Max.   :6582.00   Max.   :16481935  
##                                                           
##  total_sessions        total_ts       
##  Min.   :     1.0   Min.   :       2  
##  1st Qu.:     5.0   1st Qu.:     268  
##  Median :    22.0   Median :    1848  
##  Mean   :   576.6   Mean   :   98605  
##  3rd Qu.:   100.5   3rd Qu.:    8204  
##  Max.   :435860.0   Max.   :81295274  
## 
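# Outlier treatment: for each feature, flag outliers with boxplot.stats()
# (values beyond 1.5 * IQR from the hinges), report the number of distinct
# outlier values, and plot a histogram of the feature without them.
# remove_outliers() and feature.names are defined elsewhere in the analysis.
for (feat_num in seq_len(ncol(content_info.df))) {
  outlier_values <- boxplot.stats(content_info.df[, feat_num])$out
  print(paste("Number of outliers in feature '", feature.names[feat_num], "': ", length(unique(outlier_values)), sep = ""))
  content_info_outlierrm <- remove_outliers(content_info.df[, feat_num], outlier_values)
  hist(content_info_outlierrm, main = paste("Histogram of ", feature.names[feat_num], "/content", sep = ""), xlab = feature.names[feat_num])
}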
## [1] "Number of outliers in feature 'avg_rating': 0"

## [1] "Number of outliers in feature 'downloads': 36"

## [1] "Number of outliers in feature 'side_loads': 42"

## [1] "Number of outliers in feature 'avg_interactions_min': 70"

## [1] "Number of outliers in feature 'avg_sess_device': 89"

## [1] "Number of outliers in feature 'avg_ts_session': 57"

## [1] "Number of outliers in feature 'last_gen_date': 146"

## [1] "Number of outliers in feature 'last_sync_date': 130"

## [1] "Number of outliers in feature 'publish_date': 212"

## [1] "Number of outliers in feature 'total_devices': 62"

## [1] "Number of outliers in feature 'total_interactions': 173"

## [1] "Number of outliers in feature 'total_sessions': 126"

## [1] "Number of outliers in feature 'total_ts': 154"
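
The `remove_outliers` helper is not shown in this report. A minimal sketch of what such a helper might look like, assuming it simply drops missing values and the values flagged by `boxplot.stats`; the function body and the toy vector are assumptions, not the original implementation:

```r
# Hypothetical stand-in for the remove_outliers() helper used above:
# drop NAs and every value that boxplot.stats() flagged as an outlier.
remove_outliers <- function(x, outlier_values) {
  x <- x[!is.na(x)]
  x[!(x %in% outlier_values)]
}

# Toy data (made up): 100 lies far beyond 1.5 * IQR from the upper hinge.
toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 100, NA)
toy_out <- boxplot.stats(toy)$out
toy_clean <- remove_outliers(toy, toy_out)

print(toy_out)
print(range(toy_clean))
```

Because the histograms are drawn on the trimmed vector only, the underlying data frame is left unchanged by this step.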

The distributions of the outlier-treated features suggest that the features need to be transformed before they are used. Considering the results from the usage-based device RE, we use quantile binning and one-hot encoding of the variables.

Binning:

Each feature is divided into 4 bins.
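
The binning code itself is not shown in this report. The following is a minimal sketch of quantile binning followed by one-hot encoding, assuming quartile cut points and column names of the form `feature_bin`. The helpers `bin_by_quantile` and `one_hot` and the toy data are hypothetical, and the original analysis may handle tied quantiles differently:

```r
# Quantile binning: cut a numeric vector at its quantiles. Duplicate
# quantile values are collapsed with unique(), so a feature with many
# tied values ends up with fewer than n_bins bins.
bin_by_quantile <- function(x, n_bins = 4) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# One-hot encoding: one 0/1 indicator column per bin of a feature.
one_hot <- function(bins, name) {
  f <- factor(bins)
  m <- model.matrix(~ f - 1)
  colnames(m) <- paste(name, levels(f), sep = "_")
  as.data.frame(m)
}

# Toy data (made up): one spread-out feature, one with many tied zeros.
toy <- data.frame(spread = 1:100, tied = c(rep(0, 60), 1:40))

spread_bin <- bin_by_quantile(toy$spread)
tied_bin <- bin_by_quantile(toy$tied)
toy_onehot <- cbind(one_hot(spread_bin, "spread"), one_hot(tied_bin, "tied"))

print(table(spread_bin))
print(table(tied_bin))
print(colnames(toy_onehot))
```

In this sketch the tied feature yields only two bins, because its lower quantiles all equal zero and the duplicate cut points are merged.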

For the following features, some quantiles share bin boundaries:

Note: the current design of EOC uses avg_rating and the number of downloads as measures.

With avg_interactions_min as a proxy for engagement and avg_rating as a proxy for popularity, the number of content items in the high-popularity, high-engagement bin is 94.

Bin 1 is the lowest-value bin; the highest-value bin is 4 for avg_interactions_min and 2 for avg_rating:

#length(which(which(content_info_binned.df$avg_interactions_min==4)%in% which(content_info_binned.df$avg_rating==2)))
#content_info_binned.df$count<-rep(1,dim(content_info_binned.df)[1])
#EOC_param_gp<-group_by(content_info_binned.df,c(avg_interactions_min,avg_rating))%>%summarise(sum(count))
EOC_param_gp<-as.data.frame(table(content_info_binned.df$avg_interactions_min,content_info_binned.df$avg_rating))
colnames(EOC_param_gp)<-c("avg_interactions_min","avg_rating","count")
print(EOC_param_gp)
##   avg_interactions_min avg_rating count
## 1                    1          1   145
## 2                    2          1   117
## 3                    3          1   102
## 4                    4          1   176
## 5                    1          2   125
## 6                    2          2   154
## 7                    3          2   166
## 8                    4          2    94

Variable importance

To assess the significance of each binned feature, a decision tree is fitted with total_ts as the target variable.

library(rpart)  # provides rpart() for fitting the regression tree
content_rpart=rpart(total_ts ~., data = content_info_onehot.df)
plot(content_rpart, uniform = TRUE, compress=TRUE,margin=0.2)
text(content_rpart, use.n = TRUE, cex = 0.8)

x<-as.data.frame(content_rpart$variable.importance)
x$`content_rpart$variable.importance`<- x$`content_rpart$variable.importance`/sum(x$`content_rpart$variable.importance`)
x
##                        content_rpart$variable.importance
## total_interactions_4                        0.2699381450
## total_devices_4                             0.1392116715
## total_interactions_3                        0.1381416524
## total_sessions_4                            0.1278694935
## total_sessions_3                            0.0478177092
## avg_rating_1                                0.0365972288
## avg_rating_2                                0.0365972288
## total_devices_3                             0.0349162970
## total_interactions_1                        0.0278283515
## avg_interactions_min_4                      0.0276620086
## publish_date_1                              0.0239945018
## avg_sess_device_4                           0.0199954181
## total_sessions_1                            0.0128404129
## total_devices_1                             0.0110714035
## avg_ts_session_1                            0.0109252047
## avg_sess_device_1                           0.0106411270
## total_devices_2                             0.0080393016
## avg_interactions_min_1                      0.0074663543
## avg_interactions_min_3                      0.0065636580
## avg_ts_session_4                            0.0012088383
## publish_date_3                              0.0006739935
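
The importances above are reported per one-hot column. If a single weight per original feature is wanted, one option is to sum the normalized importances over the bins of each feature. This is a sketch of that idea only, not a step taken in the original analysis; the importance values and the `feat_*` names below are made up for illustration:

```r
# Hypothetical normalized importances for a few one-hot columns
# (names follow the feature_bin pattern; values are illustrative only).
imp <- c(feat_a_4 = 0.40, feat_b_4 = 0.25, feat_a_3 = 0.20,
         feat_b_1 = 0.10, feat_c_2 = 0.05)

# Strip the trailing "_<bin>" suffix to recover the base feature name,
# then sum the importances of all bins belonging to the same feature.
base_feature <- sub("_[0-9]+$", "", names(imp))
feature_imp <- sort(tapply(imp, base_feature, sum), decreasing = TRUE)

print(feature_imp)
```

The same two lines could be applied to the named vector in `content_rpart$variable.importance` after normalizing it.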

Fitting a linear regression model:

lm.model = lm(content_info_onehot.df$total_ts ~., data=content_info_onehot.df, na.action = na.exclude) 
summary(lm.model)
## 
## Call:
## lm(formula = content_info_onehot.df$total_ts ~ ., data = content_info_onehot.df, 
##     na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02618 -0.27071  0.00294  0.24371  1.17660 
## 
## Coefficients: (10 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.103455   0.062818  65.323  < 2e-16 ***
## avg_rating_11           -0.019983   0.027701  -0.721 0.470841    
## avg_rating_21                  NA         NA      NA       NA    
## avg_interactions_min_11  0.278623   0.038930   7.157 1.55e-12 ***
## avg_interactions_min_21  0.181247   0.034985   5.181 2.65e-07 ***
## avg_interactions_min_31  0.063372   0.032562   1.946 0.051896 .  
## avg_interactions_min_41        NA         NA      NA       NA    
## avg_sess_device_11      -0.344254   0.051232  -6.720 2.98e-11 ***
## avg_sess_device_21      -0.189200   0.041149  -4.598 4.79e-06 ***
## avg_sess_device_31      -0.081900   0.034420  -2.379 0.017517 *  
## avg_sess_device_41             NA         NA      NA       NA    
## avg_ts_session_11       -0.727024   0.044271 -16.422  < 2e-16 ***
## avg_ts_session_21       -0.436329   0.035112 -12.427  < 2e-16 ***
## avg_ts_session_31       -0.235003   0.031224  -7.526 1.12e-13 ***
## avg_ts_session_41              NA         NA      NA       NA    
## last_gen_date_11         0.019872   0.058057   0.342 0.732210    
## last_gen_date_21         0.020956   0.050702   0.413 0.679455    
## last_gen_date_31         0.022902   0.047141   0.486 0.627203    
## last_gen_date_41               NA         NA      NA       NA    
## last_sync_date_11       -0.007248   0.060285  -0.120 0.904323    
## last_sync_date_21       -0.009016   0.051921  -0.174 0.862177    
## last_sync_date_31       -0.019564   0.045013  -0.435 0.663920    
## last_sync_date_41              NA         NA      NA       NA    
## publish_date_11         -0.013993   0.049520  -0.283 0.777554    
## publish_date_21          0.002590   0.041374   0.063 0.950098    
## publish_date_31          0.061287   0.038060   1.610 0.107633    
## publish_date_41                NA         NA      NA       NA    
## total_devices_11        -0.547640   0.089153  -6.143 1.15e-09 ***
## total_devices_21        -0.416254   0.073799  -5.640 2.18e-08 ***
## total_devices_31        -0.182717   0.051071  -3.578 0.000362 ***
## total_devices_41               NA         NA      NA       NA    
## total_interactions_11   -0.973537   0.079864 -12.190  < 2e-16 ***
## total_interactions_21   -0.716039   0.060428 -11.849  < 2e-16 ***
## total_interactions_31   -0.328477   0.041792  -7.860 9.48e-15 ***
## total_interactions_41          NA         NA      NA       NA    
## total_sessions_11       -0.748435   0.104268  -7.178 1.34e-12 ***
## total_sessions_21       -0.619461   0.078910  -7.850 1.02e-14 ***
## total_sessions_31       -0.334650   0.055620  -6.017 2.46e-09 ***
## total_sessions_41              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3422 on 1050 degrees of freedom
## Multiple R-squared:  0.9089, Adjusted R-squared:  0.9065 
## F-statistic: 374.2 on 28 and 1050 DF,  p-value: < 2.2e-16
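
The `NA` rows and the note "10 not defined because of singularities" arise because the indicator columns of each feature sum to one: together with the intercept, the last indicator of every feature is a linear combination of the others, so `lm` cannot estimate it and treats that bin as the reference level. A minimal sketch of the effect on made-up data (the variable names and coefficients below are assumptions chosen for illustration):

```r
# Toy data: one feature with 4 bins, fully one-hot encoded (bin1..bin4).
set.seed(42)
bin <- factor(rep(1:4, times = 50))
ind <- as.data.frame(model.matrix(~ bin - 1))
ind$y <- 1 + 0.5 * ind$bin1 - 0.3 * ind$bin2 + rnorm(200, sd = 0.1)

# With an intercept and all four indicators, the last one is aliased
# and its coefficient is reported as NA.
fit_full <- lm(y ~ ., data = ind)
print(coef(fit_full))

# Dropping one indicator per feature removes the singularity; the fitted
# values are unchanged and the remaining coefficients are relative to bin 4.
fit_ref <- lm(y ~ . - bin4, data = ind)
print(coef(fit_ref))
print(all.equal(fitted(fit_full), fitted(fit_ref)))
```

Read this way, the negative estimates for the lower bins in the summary above are differences from the highest bin of the same feature.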

Features with a significant p-value (p < 0.001) in the summary above:

  1. avg_interactions_min_1
  2. avg_interactions_min_2
  3. avg_sess_device_1
  4. avg_sess_device_2
  5. avg_ts_session_1
  6. avg_ts_session_2
  7. avg_ts_session_3
  8. total_devices_1
  9. total_devices_2
  10. total_devices_3
  11. total_interactions_1
  12. total_interactions_2
  13. total_interactions_3
  14. total_sessions_1
  15. total_sessions_2
  16. total_sessions_3
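
A list like the one above can also be read off the fitted model programmatically. The sketch below shows one way to do so on made-up data; the cutoff of 0.001 (the `***` level in the summary output) is an assumption, and the toy variables are hypothetical:

```r
# Toy regression (made up): x1 and x2 drive y, noise does not.
set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ ., data = toy)

# summary()$coefficients is a matrix with one row per estimated term;
# aliased (NA) coefficients are not included in it.
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]

alpha <- 0.001  # assumed significance cutoff
significant <- setdiff(names(p_values)[p_values < alpha], "(Intercept)")
print(significant)
```

Applied to `lm.model`, the same lines would return the coefficient names as they appear in the summary table, including the trailing factor-level digit.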