The end-of-content (EOC) RE takes the relevance, effectiveness and engagement of candidate content into account when calculating the recommendation score. These measures are derived from content usage parameters obtained from the content_popularity_summary_fact and content_usage_summary tables. An understanding of these parameters is required to choose the weight and ‘importance’ of each feature in the recommendation score.
Content features available:
#print("Features being Used:")
#print( colnames(content_info.df))
summary(content_info.df)
## avg_rating downloads side_loads avg_interactions_min
## Min. :0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.000 1st Qu.: 4.00 1st Qu.: 0.000 1st Qu.: 9.73
## Median :3.130 Median : 17.00 Median : 0.000 Median : 18.46
## Mean :2.348 Mean : 37.48 Mean : 3.978 Mean : 26.07
## 3rd Qu.:4.605 3rd Qu.: 49.00 3rd Qu.: 3.000 3rd Qu.: 31.78
## Max. :5.000 Max. :2673.00 Max. :80.000 Max. :436.36
## NA's :1 NA's :1
## avg_sess_device avg_ts_session last_gen_date last_sync_date
## Min. : 1.000 Min. : 1.31 Min. :6.747e+04 Min. :1.461e+09
## 1st Qu.: 2.000 1st Qu.: 37.62 1st Qu.:1.483e+09 1st Qu.:1.484e+09
## Median : 3.000 Median : 75.63 Median :1.486e+09 Median :1.486e+09
## Mean : 4.529 Mean :104.40 Mean :1.481e+09 Mean :1.484e+09
## 3rd Qu.: 5.180 3rd Qu.:138.91 3rd Qu.:1.487e+09 3rd Qu.:1.487e+09
## Max. :66.220 Max. :840.54 Max. :1.487e+09 Max. :1.487e+09
##
## publish_date total_devices total_interactions
## Min. : -18595 Min. : 1.00 Min. : 0
## 1st Qu.:1467425675 1st Qu.: 2.00 1st Qu.: 71
## Median :1476929758 Median : 7.00 Median : 522
## Mean :1384814302 Mean : 30.26 Mean : 21772
## 3rd Qu.:1482886261 3rd Qu.: 24.00 3rd Qu.: 2234
## Max. :1487141008 Max. :6582.00 Max. :16481935
##
## total_sessions total_ts
## Min. : 1.0 Min. : 2
## 1st Qu.: 5.0 1st Qu.: 268
## Median : 22.0 Median : 1848
## Mean : 576.6 Mean : 98605
## 3rd Qu.: 100.5 3rd Qu.: 8204
## Max. :435860.0 Max. :81295274
##
# Outlier treatment: flag outliers per feature with boxplot.stats (1.5 * IQR rule),
# drop them, and plot a histogram of the remaining values.
for (feat_num in seq_len(ncol(content_info.df))) {
  outlier_values <- boxplot.stats(content_info.df[, feat_num])$out
  # Note: the printed count is the number of distinct outlier values
  print(paste0("Number of outliers in feature '", feature.names[feat_num], "': ", length(unique(outlier_values))))
  content_info_outlierrm <- remove_outliers(content_info.df[, feat_num], outlier_values)
  hist(content_info_outlierrm, main = paste0("Histogram of ", feature.names[feat_num], "/content"), xlab = feature.names[feat_num])
}
## [1] "Number of outliers in feature 'avg_rating': 0"
## [1] "Number of outliers in feature 'downloads': 36"
## [1] "Number of outliers in feature 'side_loads': 42"
## [1] "Number of outliers in feature 'avg_interactions_min': 70"
## [1] "Number of outliers in feature 'avg_sess_device': 89"
## [1] "Number of outliers in feature 'avg_ts_session': 57"
## [1] "Number of outliers in feature 'last_gen_date': 146"
## [1] "Number of outliers in feature 'last_sync_date': 130"
## [1] "Number of outliers in feature 'publish_date': 212"
## [1] "Number of outliers in feature 'total_devices': 62"
## [1] "Number of outliers in feature 'total_interactions': 173"
## [1] "Number of outliers in feature 'total_sessions': 126"
## [1] "Number of outliers in feature 'total_ts': 154"
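The helper `remove_outliers` is not defined in this section. A minimal sketch of what such a helper might look like, assuming it simply drops the values flagged by `boxplot.stats` (the name `remove_outliers_sketch` and the toy vector are hypothetical, not taken from the original code):

```r
# Hypothetical stand-in for remove_outliers(); the original helper is not shown.
# Drops every element of x whose value appears in outlier_values.
remove_outliers_sketch <- function(x, outlier_values) {
  x[!(x %in% outlier_values)]
}

# Toy data: 100 lies far outside the 1.5 * IQR whiskers
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
outlier_values <- boxplot.stats(x)$out
x_clean <- remove_outliers_sketch(x, outlier_values)
print(x_clean)
```

Under this assumption, missing values are kept, because `NA` never matches a flagged outlier value.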
The distributions of the outlier-treated features suggest that a transformation is needed before these features are used. Considering the results from the usage-based device RE, we use quantile binning and one-hot encoding of the variables.
Each feature is divided into 4 bins.
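A minimal sketch of quantile binning followed by one-hot encoding, using base R only. The helper names `quantile_bin` and `one_hot` and the toy vectors are hypothetical, and the handling of tied quantiles (collapsing duplicate boundaries) is an assumption rather than the method used in this report:

```r
# Hypothetical helper: assign each value to a quantile bin (1 = lowest).
quantile_bin <- function(x, n_bins = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE)
  # Tied quantiles give duplicate boundaries; keep each boundary once,
  # which leaves fewer than n_bins bins for that feature.
  breaks <- unique(breaks)
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

# Hypothetical helper: one 0/1 indicator column per bin, named <feature>_<bin>.
one_hot <- function(bins, feature_name) {
  levels_present <- sort(unique(bins))
  out <- sapply(levels_present, function(b) as.integer(bins == b))
  colnames(out) <- paste(feature_name, levels_present, sep = "_")
  as.data.frame(out)
}

downloads <- c(0, 4, 17, 49, 2673, 10, 30, 60)  # toy values, not real data
bins <- quantile_bin(downloads)
print(bins)
print(one_hot(bins, "downloads"))

# A feature with many tied values ends up with fewer bins (2 here)
ratings <- c(0, 0, 0, 0, 0, 3, 4, 5)            # toy values, not real data
print(quantile_bin(ratings))
```

Each row of the one-hot output has exactly one indicator set to 1, which matters later when the indicators are used together with an intercept in a regression.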
For the following features, some quantiles share bin boundaries:
Note: the current design of EOC takes avg_rating and the number of downloads as measures.
With avg_interactions_min as a proxy for engagement and avg_rating as a proxy for popularity, the number of content items in the high-popularity, high-engagement bin is 94.
Bin 1 is the lowest-value bin; the highest-value bin is 4 for avg_interactions_min and 2 for avg_rating.
#length(which(which(content_info_binned.df$avg_interactions_min==4)%in% which(content_info_binned.df$avg_rating==2)))
#content_info_binned.df$count<-rep(1,dim(content_info_binned.df)[1])
#EOC_param_gp<-group_by(content_info_binned.df,c(avg_interactions_min,avg_rating))%>%summarise(sum(count))
EOC_param_gp<-as.data.frame(table(content_info_binned.df$avg_interactions_min,content_info_binned.df$avg_rating))
colnames(EOC_param_gp)<-c("avg_interactions_min","avg_rating","count")
print(EOC_param_gp)
## avg_interactions_min avg_rating count
## 1 1 1 145
## 2 2 1 117
## 3 3 1 102
## 4 4 1 176
## 5 1 2 125
## 6 2 2 154
## 7 3 2 166
## 8 4 2 94
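As a small illustration of how the high-popularity, high-engagement count can be read from this table, the sketch below rebuilds the printed counts by hand and looks up the cell. The data frame is reconstructed from the output above, with plain integers instead of the factor columns that `table()` produces; the names `high_high` and `grid` are hypothetical:

```r
# Counts copied from the printed table above
EOC_param_gp <- data.frame(
  avg_interactions_min = rep(1:4, times = 2),
  avg_rating           = rep(1:2, each = 4),
  count                = c(145, 117, 102, 176, 125, 154, 166, 94)
)

# Content in the top engagement bin (4) and the top rating bin (2)
high_high <- with(EOC_param_gp, count[avg_interactions_min == 4 & avg_rating == 2])
print(high_high)

# The same counts laid out as a 4 x 2 grid (engagement bin by rating bin)
grid <- xtabs(count ~ avg_interactions_min + avg_rating, data = EOC_param_gp)
print(grid)
```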
To assess the significance of each binned feature, a decision tree is fitted with total_ts as the target variable.
library(rpart)
# Fit a regression tree predicting total_ts from the one-hot encoded bins
content_rpart <- rpart(total_ts ~ ., data = content_info_onehot.df)
plot(content_rpart, uniform = TRUE, compress = TRUE, margin = 0.2)
text(content_rpart, use.n = TRUE, cex = 0.8)
# Normalise the variable importances so that they sum to 1
x <- as.data.frame(content_rpart$variable.importance)
x$`content_rpart$variable.importance` <- x$`content_rpart$variable.importance` / sum(x$`content_rpart$variable.importance`)
x
## content_rpart$variable.importance
## total_interactions_4 0.2699381450
## total_devices_4 0.1392116715
## total_interactions_3 0.1381416524
## total_sessions_4 0.1278694935
## total_sessions_3 0.0478177092
## avg_rating_1 0.0365972288
## avg_rating_2 0.0365972288
## total_devices_3 0.0349162970
## total_interactions_1 0.0278283515
## avg_interactions_min_4 0.0276620086
## publish_date_1 0.0239945018
## avg_sess_device_4 0.0199954181
## total_sessions_1 0.0128404129
## total_devices_1 0.0110714035
## avg_ts_session_1 0.0109252047
## avg_sess_device_1 0.0106411270
## total_devices_2 0.0080393016
## avg_interactions_min_1 0.0074663543
## avg_interactions_min_3 0.0065636580
## avg_ts_session_4 0.0012088383
## publish_date_3 0.0006739935
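Because each feature is spread over several one-hot columns, it can help to sum the importances per original feature. A minimal sketch, assuming the column names follow the `<feature>_<bin>` pattern seen above; only the first five rows of the listing are copied here, and the names `imp` and `imp_by_feature` are hypothetical:

```r
# Normalised importances copied from the first rows of the listing above
imp <- c(total_interactions_4 = 0.2699381450,
         total_devices_4      = 0.1392116715,
         total_interactions_3 = 0.1381416524,
         total_sessions_4     = 0.1278694935,
         total_sessions_3     = 0.0478177092)

# Strip the trailing "_<bin>" to recover the original feature name
feature <- sub("_[0-9]+$", "", names(imp))

# Sum the importance over all bins of the same feature, largest first
imp_by_feature <- sort(tapply(imp, feature, sum), decreasing = TRUE)
print(imp_by_feature)
```

With the full importance vector, `names(content_rpart$variable.importance)` could be used in place of the hand-copied names.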
Fitting a linear regression model:
lm.model = lm(content_info_onehot.df$total_ts ~., data=content_info_onehot.df, na.action = na.exclude)
summary(lm.model)
##
## Call:
## lm(formula = content_info_onehot.df$total_ts ~ ., data = content_info_onehot.df,
## na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02618 -0.27071 0.00294 0.24371 1.17660
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.103455 0.062818 65.323 < 2e-16 ***
## avg_rating_11 -0.019983 0.027701 -0.721 0.470841
## avg_rating_21 NA NA NA NA
## avg_interactions_min_11 0.278623 0.038930 7.157 1.55e-12 ***
## avg_interactions_min_21 0.181247 0.034985 5.181 2.65e-07 ***
## avg_interactions_min_31 0.063372 0.032562 1.946 0.051896 .
## avg_interactions_min_41 NA NA NA NA
## avg_sess_device_11 -0.344254 0.051232 -6.720 2.98e-11 ***
## avg_sess_device_21 -0.189200 0.041149 -4.598 4.79e-06 ***
## avg_sess_device_31 -0.081900 0.034420 -2.379 0.017517 *
## avg_sess_device_41 NA NA NA NA
## avg_ts_session_11 -0.727024 0.044271 -16.422 < 2e-16 ***
## avg_ts_session_21 -0.436329 0.035112 -12.427 < 2e-16 ***
## avg_ts_session_31 -0.235003 0.031224 -7.526 1.12e-13 ***
## avg_ts_session_41 NA NA NA NA
## last_gen_date_11 0.019872 0.058057 0.342 0.732210
## last_gen_date_21 0.020956 0.050702 0.413 0.679455
## last_gen_date_31 0.022902 0.047141 0.486 0.627203
## last_gen_date_41 NA NA NA NA
## last_sync_date_11 -0.007248 0.060285 -0.120 0.904323
## last_sync_date_21 -0.009016 0.051921 -0.174 0.862177
## last_sync_date_31 -0.019564 0.045013 -0.435 0.663920
## last_sync_date_41 NA NA NA NA
## publish_date_11 -0.013993 0.049520 -0.283 0.777554
## publish_date_21 0.002590 0.041374 0.063 0.950098
## publish_date_31 0.061287 0.038060 1.610 0.107633
## publish_date_41 NA NA NA NA
## total_devices_11 -0.547640 0.089153 -6.143 1.15e-09 ***
## total_devices_21 -0.416254 0.073799 -5.640 2.18e-08 ***
## total_devices_31 -0.182717 0.051071 -3.578 0.000362 ***
## total_devices_41 NA NA NA NA
## total_interactions_11 -0.973537 0.079864 -12.190 < 2e-16 ***
## total_interactions_21 -0.716039 0.060428 -11.849 < 2e-16 ***
## total_interactions_31 -0.328477 0.041792 -7.860 9.48e-15 ***
## total_interactions_41 NA NA NA NA
## total_sessions_11 -0.748435 0.104268 -7.178 1.34e-12 ***
## total_sessions_21 -0.619461 0.078910 -7.850 1.02e-14 ***
## total_sessions_31 -0.334650 0.055620 -6.017 2.46e-09 ***
## total_sessions_41 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3422 on 1050 degrees of freedom
## Multiple R-squared: 0.9089, Adjusted R-squared: 0.9065
## F-statistic: 374.2 on 28 and 1050 DF, p-value: < 2.2e-16
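The NA rows in the summary above ("not defined because of singularities") are what lm reports when a predictor is an exact linear combination of the others. With full one-hot encoding, the indicator columns of a feature sum to 1 and are therefore collinear with the intercept, so one bin per feature cannot be estimated and acts as the reference level. A minimal sketch on synthetic data that reproduces this effect and extracts the coefficients with p < 0.05; the data frame `toy`, its columns and the 0.05 threshold are assumptions for illustration:

```r
set.seed(1)
n <- 200
bin <- sample(1:4, n, replace = TRUE)

# Synthetic data: y rises with the bin; all four indicators are included
toy <- data.frame(
  y     = 2 + 0.5 * bin + rnorm(n, sd = 0.3),
  bin_1 = as.integer(bin == 1),
  bin_2 = as.integer(bin == 2),
  bin_3 = as.integer(bin == 3),
  bin_4 = as.integer(bin == 4),
  noise = rnorm(n)
)

fit <- lm(y ~ ., data = toy)

# Coefficients that lm could not estimate because of collinearity
aliased <- names(which(is.na(coef(fit))))
print(aliased)

# coef(summary(fit)) omits the aliased rows; keep predictors with p < 0.05
coefs <- coef(summary(fit))
significant <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")
print(significant)
```

The remaining bin coefficients are then read relative to the dropped bin, which is why the lower bins in the summary above carry negative estimates relative to the top bin.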
Features with significant p-value: