In this report I perform exploratory data analysis (EDA) on the two datasets provided to me, then run several models to predict sales volume by day. The first part of the EDA explores sales.csv, which contains our target “sales” column. Sales are our downloads; please note that throughout the writeup I use “sales” and “downloads” interchangeably. I found that only 5 apps in this csv file contain complete information (download totals for all 90 days). To conduct time series analysis, complete or near-complete information is needed. However, incomplete sales data isn’t an issue for most of our modeling process.
The second csv file, “ranks.csv”, gives us hourly data for over 3500 apps. For the hourly ranking data (r0-r23), I imputed missing values with 2278, the average rank of the unranked apps (please see the validate rank and impute section under the EDA Ranks CSV for further explanation). Next, I moved on to feature building, creating several aggregated ratings and rankings statistics (please see the feature building section under the EDA Ranks CSV). Continuing the data processing, I dropped the hourly ranking columns and joined the two csv files on Date and app_id.
My dataset is now ready for modeling. I filtered for app_ids that were present in the sales.csv file and used that as my modeling dataframe; all other data will be used for predictions. I took the most downloaded app (app_id 2891) and ran a time series analysis on it. I wanted to forecast future downloads for the app, as I thought this could be valuable to clients. I then set up a 5-fold cross validation to compare several linear, non-linear, generalized linear, gradient-boosted, and random forest models to predict daily downloads. I showed the variable importance from the models and discussed how RMSE was minimized by the RF and XGBoost models.
The models indicate that the 3 most important features for daily prediction are the number of five star reviews (s5), the average daily rank, and the statistic tracking the highest rank the app achieved that day. Two of these three were custom-built features. In fact, looking at the top 8 features by importance for RF and XGBoost, 7 and 5 of them respectively were custom built.
I made a conscious decision not to impute the missing sales.csv data with 0’s. The sales csv had only a single zero present. If I had built out the missing dates with 0’s, our model would predict a significantly large number of 0’s. While realistic, I was unsure of the validity of imputing with 0’s given the information at hand (please see the Sales Data tab under EDA Sales Data CSV). Because of this, our minimum prediction is 246; the maximum is 79546. App 2981 is predicted to have the three highest download totals in the dataset, on three consecutive days from 2/19 to 2/21. App 119 is predicted to have downloads over 50k on 72 of the 90 days, and apps 1808 and 308 are expected to top 50k downloads more than 50 times.
If you are exploring this notebook via the Rmd file, please disregard all blocks set to eval=FALSE. If you are using the RPubs version, please note that the sections are tabbed twice.
Please see tabs below
Code for this section has been appended to the bottom
##
## 146 186 222 237 906 1812 1840 1989 1997 2191 2640 1498 2821 3550 1264
## 1 1 1 1 1 1 1 1 1 1 2 3 4 4 5
## 82 3082 2373 2490 2942 1874 3333 3373 907 3428 2333 3308 1461 676 2398
## 8 8 12 13 13 14 14 17 19 19 20 23 25 38 73
## 2667 1490 406 2346 320 459 722 1234 2891
## 74 75 80 82 90 90 90 90 90
Above I print a count for the sales.csv dataset. The table shows, out of the 90 days, how many days’ worth of download data we have for each application. Overall, I want to treat the applications in this dataset as our training dataset. The plan is to join this sales data by Date and app_id with our other dataset. Imputation of missing values isn’t needed for the models I will be testing, and even apps with only 1 day’s worth of downloads will hopefully add some value to the eventual model.
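For reference, table_counts is computed under the hood (the code is appended at the bottom); a minimal sketch consistent with the printed output, assuming sales.csv has already been read into sales, would be:
## days of sales data per app_id, sorted ascending
table_counts <- sort(table(sales$app_id))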
I need to address the missing sales data in sales.csv. One could assume that if sales data isn’t present in an otherwise clean dataset, it is because there were no sales on those dates. I contemplated imputing missing sales as 0’s, but decided against it after looking into apps like app_id 1874 below. I can’t justify imputing 0’s for an application that received upwards of 23k downloads on the limited days for which I have data. Therefore, when we join datasets, we will not have a complete dataset where every app’s sales are recorded for every day. Instead, our final modeling dataset will contain only observations for which we have sales data, and therefore it will not predict 0 sales.
## Date app_id sales
## 1 2016-03-02 1874 250
## 2 2016-03-06 1874 96
## 3 2016-03-09 1874 9727
## 4 2016-03-10 1874 19010
## 5 2016-03-11 1874 22265
## 6 2016-03-12 1874 16656
## 7 2016-03-13 1874 4107
## 8 2016-03-14 1874 1761
## 9 2016-03-15 1874 472
## 10 2016-03-16 1874 48
## 11 2016-03-17 1874 0
## 12 2016-03-29 1874 1982
## 13 2016-03-30 1874 1364
## 14 2016-03-31 1874 237
# display counts
table_counts
# display application 1874 sales
app_1874 <- sales %>%
filter(app_id == 1874)
app_1874
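For reference, the zero-imputation alternative I rejected would have looked something like the sketch below (hypothetical, not run), which crosses every observed Date with every app_id and fills the missing sales with 0:
## rejected alternative: complete the panel and fill missing sales with 0
library(tidyr)
sales_zero_imputed <- complete(sales, Date, app_id, fill = list(sales = 0))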
Below I create an interactive Plotly display of the sales.csv data. Before we break down the data, let’s center and scale the sales of each app and then explore the overall time series plot. The product sales are on different scales; by centering and scaling each product individually, we can explore the trends in the overall data over time.
Below you can see that the beginning of January has above-average sales, with some clear outliers in the first week, followed by a large overall drop. February has about average sales. March also has about average sales, with a clear uptick towards the end; this trend may be driven by the clearly visible outliers, some of which sit more than 5 SD away.
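A quick way to pull those 5-SD points (a sketch; it assumes the centered-and-scaled frame data created in the code just below):
## flag observations more than 5 SD from their app's mean
subset(data, abs(sales.norm) > 5)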
I also plot the overall sales volume by day with a 7-day rolling average. This may be useful for figuring out the demands on our overall network.
## Scale and center each app by app_id
data <- transform(sales, sales.norm = ave(sales, app_id, FUN = scale))
data$app_id <- as.factor(data$app_id)
data$Date <- as.Date(data$Date )
## Monthly ggplot
monthly_ggplotly <- ggplot(aes(Date, sales.norm),data=data)+
geom_point()+
geom_smooth()+
xlab("")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_ggplotly)
## monthly ggplot broken down by product id
monthly_prod_id_ggplotly <- ggplot(aes(Date, sales.norm,col=app_id),data=data)+
geom_point()+
xlab("")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_prod_id_ggplotly)
total_sales <- sales%>%
group_by(Date) %>%
summarize(value = sum(sales))
## plotting total sales by day with a 7-day rolling average
ggplot(total_sales, aes(x = as.POSIXct(Date), y = value)) +
geom_bar(stat = "identity") +
theme_bw() +
labs(x = "Date", y = "Total Daily Downloads")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
geom_line(aes(y=rollmean(value,7, na.pad=TRUE)))+
scale_x_datetime(date_breaks = "2 days", labels = date_format("%b %d"))
## convert the Date column to Date class
sales$Date <- as.Date(sales$Date)
Code for this section has been appended to the bottom. For clarity of the report, while the code explores all the app_ids, I have only displayed app_ids with 90 sales observations.
Apps with 90 days of sales data
Explore outliers in app 722 and 1234
## [1] 1 2 3 4 5 6 7
Outliers app_1234
| Date | app_id | sales |
|---|---|---|
| 2016-03-05 | 1234 | 3348 |
| 2016-03-30 | 1234 | 2628 |
Outliers app_722
| Date | app_id | sales |
|---|---|---|
| 2016-03-22 | 722 | 12717 |
| 2016-03-23 | 722 | 7450 |
Recommendation for imputation of outliers
## $index
## [1] 79 80 81 82
##
## $replacements
## [1] 1973.8 2877.6 3781.4 4685.2
## $index
## [1] 64 89
##
## $replacements
## [1] 1553 1399
## get correct app_id
complete_sales_indx <-as.integer(names(which(table_counts>89)))
## filter for app_id
complete_sales <- sales %>%
filter(app_id %in%complete_sales_indx)
## scatterplot data of apps
ggplot(complete_sales) + geom_point(aes(x=Date, y=sales))+
facet_wrap(~ app_id,scales = "free")
## spread df into multiple columns
spread_df <- complete_sales %>%
spread(.,app_id,sales)
## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})
## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)
## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- matrix(dfs,ncol=5)
rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df
## describe statistics by different app_ids (commented out to make report clearer)
kable(dfs[2:14,],digits =2,'markdown',booktabs =T)
## visualize with boxplots
boxplot(sales ~ app_id, data=complete_sales, main="boxplot")
outliers = boxplot(spread_df$`2891`, plot=FALSE)
which(spread_df$`2891` < outliers$stats[1] | spread_df$`2891` > outliers$stats[5])
## Look at outliers 722 and 1234
complete_sales %>%
filter(app_id== 1234& sales>2200) %>%
kable(.,digits =2,'markdown',booktabs =T)
complete_sales %>%
filter(app_id== 722& sales>6700) %>%
kable(.,digits =2,'markdown',booktabs =T)
## create time series data; see tsoutliers recommendations
ts_complete_data <-lapply(spread_df,ts)
forecast::tsoutliers(ts_complete_data$`722`)
tsoutliers(ts_complete_data$`1234`)
## look at sales data for nearly complete apps
almost_complete_sales_indx <-as.integer(names(which(table_counts>70 &table_counts<90 )))
almost_complete_sales <- sales %>%
filter(app_id %in%almost_complete_sales_indx)
plot_3 <- ggplot(almost_complete_sales) +
geom_point(aes(x=Date, y=sales))+
xlab("")+facet_wrap(~ app_id,scales = "free")
ggplotly(plot_3)
## spread df into multiple columns
spread_df <- almost_complete_sales %>%
spread(.,app_id,sales)
## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(almost_complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})
## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)
## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- as.data.frame(matrix(dfs,ncol=5))
rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df
## Print df
kable(dfs,digits =2,'markdown',booktabs =T)
boxplot(sales ~ app_id, data=almost_complete_sales, main="boxplot")
## look at sales data for apps with more than 5 but fewer than 70 days
incomplete_sales_indx <-as.integer(names(which(table_counts>5 &table_counts<70)))
incomplete_sales <- sales %>%
filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
facet_wrap(~ app_id,scales = "free",nrow = 3)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Apps with 5 or fewer days of sales data
incomplete_sales_indx <-as.integer(names(which(table_counts>0 &table_counts<6)))
incomplete_sales <- sales %>%
filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
facet_wrap(~ app_id,scales = "free",nrow = 3)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
incomplete_sales_indx
Please see tabs below
The ranks data contains 3557 apps, while the sales data contains only 39 apps. Our sales data is daily, so we want to aggregate the hourly ranking data into an average daily rank. The cumulative (cs) columns appear to be cumulative review totals for the apps from before the dataset began; these should be altered to reflect the new reviews within this dataset. NA values must be imputed, and the NA’s are isolated to the hourly ranking data. This is because ranks only run from 1-1000 while there are over 3k apps in the dataset, so each daily hourly ranking column has 2k+ NA’s per day. The validate ranking section below attempts to show that these NA values stand in for ranks from 1001-3557. How I imputed this data is addressed in the impute section below.
Overall data manipulation steps: validate the data, impute missing values, create features for potential models, join both datasets, and filter the joined dataset for app_ids present in the sales data. This means we will be running cross-validated models on a dataset of 39 apps, then using that model to predict on the remaining ~3500 apps. This may not be a strong enough dataset, considering that a quarter of these apps have only 1 day of sales data, but it seems like the best approach.
App 2891 appears to be the best-selling app in the dataset. In the modeling stage I am going to explore this app with a time series analysis to see if we can predict future downloads. As such, I explore a df containing only app 2891 in the corrplots section.
Code for this section has been appended to the bottom
The table below shows that every daily ranking column ranges from 1-1000; no values fall outside the expected 1-1000 ranks. If the columns were complete, each should contain 90k values:
\(1000 \text{ ranks} \times 90 \text{ days} = 90000\)
As the table below shows, there aren’t quite 90k values (n) in each column. However, most columns come close, and even the worst column still contains 99.98% of the ranks that should exist. This may seem inconsequential, but it matters because my imputation plan treats every NA as an app ranked somewhere outside the top 1000, so it will also convert the roughly 0.02% of values that should have been ranked into the imputed value. Since the NA values represent some roughly (though not fully) random rank between 1001 and 3557 (the total number of apps), I decided to impute with the average of the unranked range.
Value to be imputed for NA:
\[(3557 + 1000)/2 \approx 2278\]
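As a quick sanity check (a sketch; run it before the imputation loop below, once ranks.csv has been read into ranks), the NA count in each hourly column should be close to \((3557 - 1000) \times 90 = 230130\), since the data holds one row per app per day:
## NA counts per hourly rank column r0-r23; expect values near 230130
colSums(is.na(ranks[, 3:26]))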
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r0 | 1 | 89992 | 500.46 | 288.65 | 500 | 500.46 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r1 | 2 | 89953 | 500.51 | 288.55 | 501 | 500.52 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r2 | 3 | 88974 | 500.63 | 288.62 | 501 | 500.64 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.97 |
| r3 | 4 | 89999 | 500.49 | 288.67 | 500 | 500.49 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r4 | 5 | 89997 | 500.48 | 288.67 | 500 | 500.48 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r5 | 6 | 89886 | 500.87 | 288.39 | 501 | 500.92 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r6 | 7 | 89888 | 500.88 | 288.40 | 501 | 500.93 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r7 | 8 | 89991 | 500.45 | 288.65 | 500 | 500.45 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r8 | 9 | 89994 | 500.47 | 288.66 | 500 | 500.47 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r9 | 10 | 89995 | 500.47 | 288.66 | 500 | 500.47 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r10 | 11 | 89977 | 500.37 | 288.60 | 500 | 500.37 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r11 | 12 | 89997 | 500.48 | 288.67 | 500 | 500.48 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r12 | 13 | 89958 | 500.61 | 288.61 | 501 | 500.62 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r13 | 14 | 89931 | 500.12 | 288.46 | 500 | 500.12 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r14 | 15 | 89990 | 500.44 | 288.64 | 500 | 500.44 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r15 | 16 | 89998 | 500.49 | 288.67 | 500 | 500.49 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r16 | 17 | 89960 | 500.28 | 288.55 | 500 | 500.28 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r17 | 18 | 89984 | 500.41 | 288.63 | 500 | 500.41 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r18 | 19 | 89991 | 500.45 | 288.65 | 500 | 500.45 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r19 | 20 | 89973 | 500.35 | 288.59 | 500 | 500.35 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r20 | 21 | 89993 | 500.46 | 288.65 | 500 | 500.46 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r21 | 22 | 89999 | 500.49 | 288.67 | 500 | 500.49 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r22 | 23 | 89995 | 500.47 | 288.66 | 500 | 500.47 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
| r23 | 24 | 89981 | 500.39 | 288.62 | 500 | 500.39 | 370.65 | 1 | 1000 | 999 | 0 | -1.2 | 0.96 |
ranks <- read.csv('ranks.csv')
ranks$Date <- as.Date(ranks$Date)
ranks$app_id <- as.factor(ranks$app_id)
kable(describe(ranks[,3:26]),digits =2,'markdown',booktabs =T)
## impute hourly missing ranking data
for(i in 3:26){
ranks[is.na(ranks[,i]), i] <- 2278
}
## Impute missing ratings data with 0's
for(i in 27:36){
ranks[is.na(ranks[,i]), i] <- 0
}
Code for this section has been appended to the bottom
What kind of features can we develop from what we have? Time is an important variable in most time series data, but these models can’t easily be trained on raw dates. Because of this, I created an is_weekend column to capture any weekend effect, and a month column to attempt to capture any monthly effects. The hourly data can be aggregated and then safely dropped: I create daily_rank_average, daily_variance, and daily_rank_highest statistics to capture the behavior of the ranking columns, after which we can safely drop the 24 hourly columns.
For the ratings features, I create a daily rating score by assigning values to the 1-5 star ratings: for 1, 2, 3, 4, 5 stars I assign (-2, -1, 0, 1, 2). For example, a day with one 1-star, two 4-star, and three 5-star reviews scores (1 × -2) + (2 × 1) + (3 × 2) = 6. The cs columns in our dataset represent the total number of reviews from before our dataset starts through its end. At first I wanted to create an old-reviews column from this data, but I opted to leave it as is and instead create several new cumulative columns. I create columns csum1-csum5, a total_reviews column which tracks total new reviews throughout the dataset, and cumulative_rating_score_daily, which tracks overall rating score changes in the app by date. I believe these features will do a good job of capturing some of the time series elements we would otherwise miss in the modeling stage.
csum1-csum5 as well as all the hourly ranking data are dropped at the end of feature building. After I join my datasets, I filter the model’s training data for app_ids present in the sales.csv data; the rest of the data is held out for predictions. Overall, we have a training dataset with 1110 rows and a prediction dataset with 97608 rows. This section is all done under the hood, so I print out the first rows of each df.
Features
Modeling Dataframe
| Date | app_id | sales | s1 | s2 | s3 | s4 | s5 | cs1 | cs2 | cs3 | cs4 | cs5 | daily_rank_average | daily_variance | daily_rank_highest | total_reviews | net_rating_score_daily | cumulative_rating_score_daily | is_weekend | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 | 1234 | 1553 | 0 | 0 | 0 | 0 | 0 | 332 | 163 | 269 | 395 | 1320 | 778.33 | 2502.41 | 678 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 1490 | 2152 | 2 | 0 | 0 | 0 | 2 | 248 | 91 | 167 | 290 | 877 | 518.67 | 80.93 | 506 | 4 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 2398 | 1501 | 1 | 0 | 0 | 0 | 0 | 40 | 13 | 17 | 24 | 120 | 1897.58 | 366764.51 | 950 | 1 | -2 | -2 | 0 | 01 |
| 2016-01-01 | 2891 | 89289 | 8 | 4 | 8 | 16 | 106 | 73 | 41 | 71 | 191 | 1036 | 5.75 | 0.20 | 5 | 142 | 208 | 208 | 0 | 01 |
| 2016-01-01 | 320 | 2412 | 1 | 0 | 2 | 0 | 0 | 746 | 284 | 290 | 568 | 1736 | 470.50 | 213.48 | 438 | 3 | -2 | -2 | 0 | 01 |
| 2016-01-01 | 406 | 1308 | 0 | 0 | 0 | 0 | 0 | 1376 | 1008 | 5674 | 20010 | 82635 | 717.75 | 539.33 | 681 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 459 | 2037 | 0 | 0 | 0 | 0 | 1 | 10 | 2 | 14 | 32 | 264 | 603.42 | 52.17 | 594 | 1 | 2 | 2 | 0 | 01 |
| 2016-01-01 | 722 | 2052 | 1 | 0 | 2 | 8 | 1 | 4025 | 1129 | 1488 | 2007 | 5477 | 566.42 | 614.95 | 530 | 12 | 8 | 8 | 0 | 01 |
| 2016-01-02 | 1234 | 1011 | 0 | 0 | 0 | 0 | 0 | 332 | 163 | 269 | 395 | 1320 | 725.62 | 938.94 | 675 | 0 | 0 | 0 | 1 | 01 |
| 2016-01-02 | 1490 | 1722 | 0 | 0 | 2 | 0 | 0 | 248 | 91 | 169 | 290 | 877 | 545.79 | 126.78 | 532 | 6 | 0 | 0 | 1 | 01 |
Prediction Dataframe
| Date | app_id | sales | s1 | s2 | s3 | s4 | s5 | cs1 | cs2 | cs3 | cs4 | cs5 | daily_rank_average | daily_variance | daily_rank_highest | total_reviews | net_rating_score_daily | cumulative_rating_score_daily | is_weekend | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 | 1 | NA | 0 | 0 | 0 | 0 | 0 | 47 | 8 | 6 | 12 | 30 | 591.96 | 740.22 | 548 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 100 | NA | 4 | 0 | 0 | 0 | 0 | 4682 | 2187 | 6226 | 28460 | 315885 | 69.54 | 2.26 | 68 | 4 | -8 | -8 | 0 | 01 |
| 2016-01-01 | 1006 | NA | 0 | 0 | 0 | 0 | 0 | 33 | 13 | 11 | 12 | 18 | 177.46 | 233.91 | 158 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 1007 | NA | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 16 | 1615.42 | 458464.78 | 920 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 1009 | NA | 4 | 0 | 2 | 2 | 12 | 11267 | 3178 | 2907 | 4622 | 17393 | 109.33 | 3.01 | 107 | 20 | 18 | 18 | 0 | 01 |
| 2016-01-01 | 1014 | NA | 2 | 0 | 0 | 0 | 0 | 719 | 216 | 428 | 1549 | 6833 | 537.04 | 79.00 | 522 | 2 | -4 | -4 | 0 | 01 |
| 2016-01-01 | 1015 | NA | 0 | 0 | 0 | 0 | 2 | 121 | 95 | 186 | 370 | 2883 | 445.29 | 84.04 | 428 | 2 | 4 | 4 | 0 | 01 |
| 2016-01-01 | 1017 | NA | 0 | 0 | 0 | 0 | 0 | 228 | 95 | 147 | 451 | 5223 | 286.79 | 24.78 | 278 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 1018 | NA | 0 | 0 | 0 | 0 | 0 | 31 | 27 | 59 | 410 | 1269 | 1331.58 | 385273.82 | 906 | 0 | 0 | 0 | 0 | 01 |
| 2016-01-01 | 102 | NA | 0 | 0 | 0 | 0 | 0 | 1071 | 733 | 1632 | 3331 | 10022 | 901.54 | 1535.48 | 831 | 0 | 0 | 0 | 0 | 01 |
## create daily average rank, daily variance, daily highest rank
rank_average <- rowSums(ranks[3:26])/24
my_vars <- rowVars(as.matrix(ranks[,3:26]))
ranks$daily_rank_average <- rank_average
ranks$daily_variance <- my_vars
ranks$daily_rank_highest<- apply(as.matrix(ranks[,3:26]),1,min)
## create cumulative review sums, a daily rating score, and a cumulative rating score
ranks <- ranks %>%
arrange(app_id) %>%
group_by(app_id) %>%
mutate(csum1 = cumsum(s1)) %>%
mutate(csum2 = cumsum(s2)) %>%
mutate(csum3 = cumsum(s3)) %>%
mutate(csum4 = cumsum(s4)) %>%
mutate(csum5 = cumsum(s5)) %>%
mutate(total_reviews = csum1+csum2+csum3+csum4+csum5) %>%
mutate(net_rating_score_daily = (s1*-2)+(s2*-1)+(s3*0)+(s4*1)+(s5*2)) %>%
mutate(cumulative_rating_score_daily =(csum1*-2) + (csum2*-1) + (csum3*0) +(csum4*1)+(csum5*2))
## Create weekday/weekend
ranks$is_weekend <- chron::is.weekend(ranks$Date)
ranks$is_weekend <- as.integer(ranks$is_weekend )
## Create month category
months <- strsplit(as.character(ranks$Date),"-" )
months <- sapply(months, "[[", 2)
ranks$month <- months
## Merge with sales data
df<- merge(sales,ranks,by=c('Date',"app_id"),all=TRUE)
df$app_id <- as.factor(as.character(df$app_id))
## validate merge worked correctly (commented out for report)
#dim(df)[1]-sum(is.na(df$sales))==dim(sales)[1]
## drop all hourly (r0-r23) and csum columns, keeping the engineered features
df <- df[,c(1:3,28:40,46:50)]
## filter for apps only present in sales data to create our model training data
apps_with_sales_list <- as.integer(levels(as.factor(sales$app_id)))
modeling_df <- df %>%
filter(app_id%in%apps_with_sales_list )
prediction_df <- df %>%
filter( !app_id%in%apps_with_sales_list )
Code for this section has been appended to the bottom
## [1] Date app_id
## [3] sales s1
## [5] s2 s3
## [7] s4 s5
## [9] cs1 cs2
## [11] cs3 cs4
## [13] cs5 daily_rank_average
## [15] daily_variance daily_rank_highest
## [17] total_reviews net_rating_score_daily
## [19] cumulative_rating_score_daily is_weekend
## [21] month
## <0 rows> (or 0-length row.names)
## reorder factor levels
modeling_df$app_id <- as.factor(as.character(modeling_df$app_id))
## Split df by app
list_dfs <- split(modeling_df,modeling_df$app_id)
## Correct NA values in sales (5 rows which existed in ranks.csv didn't have sales values in sales.csv; impute with 0)
modeling_df[which(is.na(modeling_df$sales)),3] <-0
modeling_df[which(is.na(modeling_df$sales)),]
## Turn list into individual df for 2891
app_2891 <- as.data.frame(list_dfs$`2891`)
app_2891$app_id <- as.factor(as.character(app_2891$app_id))
## corr plot entire DF
corrplot::corrplot(cor(modeling_df[,3:20]))
mtext("WHOLE DF", at=-2.5, line=-0.5, cex=2)
## corr plot app 2891
corrplot::corrplot(cor(app_2891[,3:20]))
mtext("App 2891", at=-2.5, line=-0.5, cex=2)
Please see tabs below
The instruction questions seemed open-ended in terms of what we want to model. I decided to build a time series model for the highest-selling product, app 2891. I ran a simple train/test split as well as a CV test using RMSE to compare several different time series models on the app. The analysis here could easily have gone further: earlier in EDA, the forecast package recommended imputation of several values in this time series, but as the data is merely simulated, I opted to run the models and show the process without imputation. An expectation of future downloads can be a very valuable tool for our clients in adjusting their advertising as well as their IT requirements. As a warning, forecasts in general don’t behave well as the prediction horizon grows.
Next, I tested several CV models and compared their RMSE scores. My hope is that by extracting variable importance we can identify which features matter most to increasing sales. I recognize that this is count data, but I believe our models can capture it adequately without resorting to Poisson modeling techniques. Lastly, caret doesn’t handle time series data well, which is why I created categories like is_weekend and month to attempt to capture time series effects in the modeling.
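For reference, the RMSE used in all the comparisons below is the usual
\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}\]
where \(y_i\) are the observed daily downloads and \(\hat{y}_i\) are the model’s predictions.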
Below I run a time series analysis of app 2891. First, I create a train/test split and plot predictions from several models. The random walk model performs best here, and the ARIMA model performs terribly. Given more time, these models could be retested with outlier imputation. Next, I conduct a CV analysis with several models from the forecast package. ARIMA and ETS performed best in the CV scoring, and I plot both models with their predictions. I would feel more comfortable using the ETS model, considering how poor the ARIMA model’s performance was in the train/test split. The forecast package handles imputations easily, which could lead to better predictions. I produce the forecast graphs below for all the train/test and CV models. The x axis represents weeks from the start of the dataset (1/1/2016). The train/test split predictions map March forecasts against the data we already have, while the CV models, scored on RMSE, show our expected future predictions. As you can see, the intervals for these forecasts are very wide.
App 2891 Time Series Plots:
App 2891 Naive Fit Prediction
App 2891 ETS Prediction
App 2891 Random Walk Prediction
App 2891 ARIMA Prediction
App 2891 Train/Test split RMSE comparison
| ets_ana_rmse | random_walk_rmse | naive_fit_rmse | arima_fit_rmse |
|---|---|---|---|
| 3684.53 | 2751.96 | 23742.29 | 10564.91 |
App 2891 Cross-validated RMSE
| X | naive_fit_cv_rmse | rwf_fit_rmse | arima_fit_rmse | ets_fit_rmse |
|---|---|---|---|---|
| 1 | 6734.94 | 6734.94 | 6581.14 | 6302.28 |
CV ARIMA model prediction over the next month for app 2891
CV ETS model prediction over the next month for app 2891
## CV functions
farima <- function(x, h) {
forecast(auto.arima(x,allowdrift=TRUE), h=h)
}
fets <- function(x, h) {
forecast(forecast::ets(x), h = h)
}
naives <- function(x, h) {
forecast(forecast::naive(x), h = h)
}
## build xts for display
xts_pipe1 <- xts::xts(app_2891$sales, order.by= app_2891$Date)
attr(xts_pipe1, 'frequency') <- 7
## display xts time plot and histogram
par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(xts_pipe1, main="App 2891 daily downloads")
# histogram of the frequency at which sales values occur
hist(xts_pipe1, xlab="", main="",col="red")
# ## build acf/pacf plots for pipe1 and diff (commented out for report)
# a <- ggAcf(xts_pipe1,lag=200)
# b <- ggPacf(xts_pipe1,lag=200)
# ## Example of overdifferencing below
# c <- ggAcf(diff(xts_pipe1),lag=10)
# d <- ggPacf(diff(xts_pipe1),lag=200)
# plot_grid(a,b,c,d)
app_2891_ts <- ts(app_2891$sales,frequency=365,start = c(2016,1))
## Create train test split
myts.train <- app_2891_ts[1:70]
test_results <- app_2891_ts[71:90]
ts_test_results <- ts(test_results)
myts.train <- ts(myts.train,frequency=7)
## Build naive fit and plot
naive_fit <- naive(myts.train, h=29)
autoplot(naive_fit)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")
## Build ETS and plot
fit <- forecast::ets(myts.train, allow.multiplicative.trend=TRUE)
fc <- forecast(fit,h=29)
autoplot(fc)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")
## Build random walk (STL + naive) and plot
fcast <- stlf(myts.train,method='naive',h=29)
autoplot(fcast)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")
## Build ARIMA and plot
arima_fit <-forecast::auto.arima(myts.train)
arima_fit <- forecast(arima_fit,h=29)
autoplot(forecast(arima_fit,h=29))+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")
## get Train/Test RMSE
random_walk_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fcast$mean)
ets_ana_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fc$mean)
naive_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),naive_fit$mean)
arima_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),arima_fit$mean)
## train split results
results <- as.data.frame(cbind(ets_ana_rmse,random_walk_rmse,naive_fit_rmse,arima_fit_rmse))
kable(results,digits =2,'markdown',booktabs =T)
## Cross validation below took long to run, so I saved the results to github and load the dataframe here
x <- getURL("https://raw.githubusercontent.com/justinherman42/appfigures/master/cv_scores_app2891.csv")
y <- as.data.frame(read.csv(text = x))
kable(y,digits =2,'markdown',booktabs =T)
# CV runs commented out; results imported from the csv above
# naive_fit_cv <- forecast::tsCV(ts(app_2891_ts,frequency = 7),naives, h=1)
# rwf_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),rwf, drift=TRUE, h=1)
# ets_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),fets, h=1)
# arima_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),farima, h=1)
# naive_fit_cv_rmse <- sqrt(mean(naive_fit_cv^2, na.rm=TRUE))
# rwf_fit_rmse <- sqrt(mean(rwf_fit^2, na.rm=TRUE))
# ets_fit_rmse <- sqrt(mean(ets_fit^2, na.rm=TRUE))
# arima_fit_rmse <- sqrt(mean(arima_fit^2, na.rm=TRUE))
#cv_scores <- as.data.frame(cbind(naive_fit_cv_rmse,rwf_fit_rmse,arima_fit_rmse,ets_fit_rmse))
#write.csv(cv_scores,file = "cv_scores_app2891.csv")
## actual models
auto_arima_model <- farima(ts(app_2891_ts,frequency = 7),365/7)
auto_arima_model$model
checkresiduals(auto_arima_model)
ets_model <- fets(ts(app_2891_ts,frequency = 7),365/7)
ets_model$model
checkresiduals(ets_model)
arima_for_plot <- autoplot(forecast::forecast(auto_arima_model,h=31),series='ARIMA')
ets_forecast_plot <- autoplot(forecast::forecast(ets_model),series="ETS")
This is the bread and butter of our modeling stage. We set up a 5-fold cross-validated training control. Caret can’t handle the date series, so we must exclude dates from the modeling; hopefully our features like is_weekend, month, and the cumulative ranks capture the time element well.
I run several linear, nonlinear, and tree-based models and compare their results based on RMSE. I display the variable importance for all the models. It must be noted that some of our features may be highly collinear, and therefore the variable importance rankings may not be fully reliable.
daily_rank_highest is considered the most important feature in almost all our models, and the number of 5-star reviews (s5) is selected as extremely important as well. The value these variables bring to our clients depends on the causal direction of the ranking data. Are these ranks published and viewed somewhere in a way that drives downloads, the way Rotten Tomatoes ratings drive traffic to movies? Or is the ranking data solely a reflection of genuine interest in the apps? If it’s the former, we can use ranks to drive traffic towards applications (which can be substantial). If it’s the latter, we can use the ranks data to gauge profitability for apps soon after launch.
I used RMSE to pick a model, and in the end I picked the random forest. With more time I would want to combine the time series analysis with some of the analysis from our model. I would also want more data: our model attempts to predict on over 3500 apps with only 90 days of data on fewer than 40 apps. Overall, it seems like a very interesting topic to investigate.
Train parameters
Linear Models
## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 2 on full training set
## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat, method = "oscorespls")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.3, lambda = 0 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 1 on full training set
Non-Linear Models
## Aggregating results
## Selecting tuning parameters
## Fitting k = 5 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.125, C = 8 on full training set
Tree Models
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 18 on full training set
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 500, max_depth = 4, eta = 0.01, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.75 on full training set
Ranked feature importance by model
| Rank | pls | rf | xgbTree | enet | knn | svmRadial | glmnet |
|---|---|---|---|---|---|---|---|
| 1 | s5 | daily_rank_highest | daily_rank_highest | daily_rank_highest | daily_rank_highest | daily_rank_highest | cumulative_rating_score_daily |
| 2 | net_rating_score_daily | s5 | daily_rank_average | daily_rank_average | daily_rank_average | daily_rank_average | s5 |
| 3 | s4 | daily_rank_average | net_rating_score_daily | s5 | s5 | s5 | net_rating_score_daily |
| 4 | cumulative_rating_score_daily | net_rating_score_daily | s5 | net_rating_score_daily | net_rating_score_daily | net_rating_score_daily | daily_rank_highest |
| 5 | s3 | daily_variance | cs1 | s4 | s4 | s4 | total_reviews |
| 6 | daily_rank_highest | cumulative_rating_score_daily | daily_variance | cumulative_rating_score_daily | cumulative_rating_score_daily | cumulative_rating_score_daily | s4 |
| 7 | total_reviews | total_reviews | cumulative_rating_score_daily | s3 | s3 | s3 | daily_variance |
| 8 | daily_rank_average | is_weekend | s4 | total_reviews | total_reviews | total_reviews | cs1 |
| 9 | s2 | cs4 | total_reviews | s2 | s2 | s2 | s1 |
| 10 | s1 | cs2 | s3 | s1 | s1 | s1 | cs5 |
| 11 | cs3 | s1 | cs2 | daily_variance | daily_variance | daily_variance | cs3 |
| 12 | cs4 | cs5 | s1 | cs4 | cs4 | cs4 | is_weekend |
| 13 | cs2 | cs1 | cs3 | cs3 | cs3 | cs3 | s3 |
| 14 | cs1 | s4 | cs4 | cs2 | cs2 | cs2 | cs2 |
| 15 | cs5 | cs3 | is_weekend | month | month | month | daily_rank_average |
| 16 | month | month | s2 | is_weekend | is_weekend | is_weekend | month |
| 17 | daily_variance | s2 | cs5 | cs1 | cs1 | cs1 | s2 |
| 18 | is_weekend | s3 | month | cs5 | cs5 | cs5 | cs4 |
##
## Call:
## summary.resamples(object = .)
##
## Models: PLS, ENet, KNN, SVM, RF, XGB
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 2130.3708 2192.196 2240.331 2305.242 2280.599 2682.715 0
## ENet 1942.6416 2044.858 2057.371 2188.658 2350.154 2548.265 0
## KNN 1512.5788 1641.362 1652.588 1626.805 1660.362 1667.132 0
## SVM 2245.2264 2283.089 2469.891 2463.704 2654.788 2665.525 0
## RF 987.7445 1109.535 1162.016 1166.172 1242.008 1329.555 0
## XGB 981.5728 1128.565 1201.954 1185.319 1264.634 1349.867 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 3877.702 3914.482 3941.092 4139.684 3974.065 4991.079 0
## ENet 3752.864 3860.134 3923.325 4240.657 4548.382 5118.578 0
## KNN 3954.345 4159.861 4567.975 4409.517 4594.733 4770.669 0
## SVM 5384.505 5385.286 6208.286 6085.735 6342.220 7108.376 0
## RF 2435.471 2931.129 3006.524 3232.680 3717.091 4073.185 0
## XGB 2376.476 2988.896 3042.397 3242.596 3738.381 4066.831 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## PLS 0.8132833 0.8135496 0.8166363 0.8247710 0.8399609 0.8404251 0
## ENet 0.7262742 0.8098684 0.8244899 0.8041526 0.8294611 0.8306693 0
## KNN 0.7607204 0.7825883 0.8230092 0.8067721 0.8314895 0.8360532 0
## SVM 0.5449780 0.5745114 0.6060230 0.6095424 0.6605867 0.6616126 0
## RF 0.8342030 0.8607409 0.9008469 0.8855466 0.9022635 0.9296787 0
## XGB 0.8426103 0.8660799 0.8951299 0.8879660 0.9026636 0.9333460 0
Count of days with predicted sales over 50k, by app
##
## 2142 246 2694 2756 3005 3407 387 909 2841 3288 3391 3336 3395 2873 2281
## 1 1 1 1 1 1 1 1 2 2 2 3 3 4 6
## 2765 3534 2977 3234 3268 499 3299 3178 3533 3102 3530 2457 2500 886 2563
## 6 6 7 8 8 9 10 11 12 16 17 20 21 23 24
## 1733 2516 3072 2981 2941 1808 308 119
## 25 26 30 32 36 51 62 75
## drop columns for Caret
modeling_df$month <- as.integer(modeling_df$month)
set.seed(1)
xTrain <- modeling_df[,-c(1,2,3)]
yTrain <- as.integer(modeling_df[,3])
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
method='cv',
number=5,
index=cvFolds)
# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
## Build Linear Models
plsFit <- train(x=xTrain,
y=yTrain,
method='pls',
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
plot(plsFit)
plsFit$finalModel
enetFit <- train(x=xTrain,
y=yTrain,
method='enet',
tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1),
.lambda = seq(0, 1, by=0.1)),
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
plot(enetFit)
enetFit$finalModel
## Build Non-linear models
knnFit <- train(x=xTrain,
y=yTrain,
method='knn',
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
plot(knnFit)
knnFit$finalModel
svmFit <- train(x=xTrain,
y=yTrain,
method="svmRadial",
tuneLength=20,
trControl=trControl,
preProc=c('knnImpute', 'center', 'scale'))
## plot svm model
plot(svmFit)
## Build rf
rf <- train(x=xTrain,
y=yTrain,
method='rf',
tuneLength=10,
trControl=trControl,
importance=T)
plot(rf)
## Build XGBoost model
xgbGrid <- expand.grid(
.nrounds=c(100, 500), # boosting iterations (trees)
.max_depth=c(4, 6), # max tree depth
.eta=c(0.001, 0.01), # learning rate
.gamma=c(0),# minimum loss reduction
.colsample_bytree=c(0.4, 0.6), # subsample ratio of columns
.min_child_weight=c(1, 5), # minimum sum of instance weight
.subsample=c(0.5, 0.75)) # subsample ratio of rows
xgb1 <- train(x = xTrain,
y = yTrain,
method = 'xgbTree',
tuneGrid = xgbGrid,
trControl = trControl)
## shutdown multicore
stopCluster(cl)
registerDoSEQ()
## rank feature importance across models
getRank <- function(trainObjects){
temp <- c()
methods <- c()
for(object in trainObjects){
methods <- c(methods, object$method)
varimp <- varImp(object)[[1]]
varimp$variables <- row.names(varimp)
rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
temp <- cbind(temp, rank)
}
temp <- as.data.frame(temp)
names(temp) <- methods
temp$Rank <- c(1:dim(temp)[1])
temp <- select(temp, Rank, everything())
return(temp)
}
kable(getRank(list(plsFit, rf, xgb1, enetFit, knnFit, svmFit)),digits = 2,'markdown', booktabs =T)
## plot variable importance
plot(varImp(plsFit), main='Variable Importance based on PLS')
plot(varImp(rf), main='Variable Importance based on Random Forest')
plot(varImp(xgb1), main='Variable Importance based on XGBoost')
plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')
## summary of model scores
resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb1)) %>% summary()
## boxplots of CV scores
par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb1$resample$RMSE, main='CV RMSE for XGB')
## predictions
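## pred_dates and pred_appids are saved copies of the prediction set's Date and
## app_id (built under the hood earlier); a minimal sketch of that step:
pred_dates <- prediction_df$Date
pred_appids <- prediction_df$app_id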
pred <- predict(rf, newdata=prediction_df)
prediction_df$sales_predictions <- pred
prediction_df$Date <- pred_dates
prediction_df$app_id <- pred_appids
## select export columns by name rather than position
df_for_export <- prediction_df[,c("Date","app_id","sales_predictions")]
df_for_export$app_id <- as.integer(as.character(df_for_export$app_id))
final_df <- df_for_export %>%
arrange(app_id)
write.csv(final_df,'Herman_estimates.csv')
indexes <- which(prediction_df$sales_predictions>50000)
new_df <- prediction_df[indexes,c("app_id","sales_predictions")]
new_df$app_id <- as.factor(as.character(new_df$app_id))
sort(table(new_df$app_id))