Predicting App Sales Volume

In this report I perform some EDA on the two datasets provided to me, and then run several models to attempt to predict sales volume by day. The first part of the EDA explores sales.csv, which contains our target “sales” column. Sales are downloads; please note that throughout the writeup I use the words “sales” and “downloads” interchangeably. I found that only 5 apps in this csv file contain complete information (download totals for all 90 days). To conduct time series analysis, complete or near-complete information is needed. However, incomplete sales data isn’t an issue for most of our modeling process.

The second csv file, “Ranks.csv”, gives us hourly ranking data for over 3500 apps. For the hourly ranking columns (r0–r23), I imputed missing values with the average of the unranked data, rank = 2278 (please see the validate rank and impute section under the EDA Ranks CSV for further explanation). Next, I moved on to feature building. I created several aggregated ratings and rankings statistics (please see the feature building section under the EDA Ranks CSV). Continuing the data processing, I dropped the hourly ranking columns and joined the two csv files on Date and app_id.

My dataset is now ready for modeling. I filtered the data for app_ids that are present in the sales.csv file and used that as my modeling dataframe; all other data will be used for predictions. I took the most downloaded app (app_id 2891) and ran a time series analysis on it. I wanted to forecast future downloads for the app, as I thought this could be valuable to clients. I then set up a 5-fold cross-validation to compare several linear, non-linear, generalized linear, gradient-boosted, and random forest models to predict daily downloads. I show the variable importance from the models and discuss how RMSE was minimized by the RF and XGBoost models.

The models indicate that the 3 most important features for daily prediction are the number of five-star reviews, the average daily rank, and the statistic tracking the highest rank the app achieved that day. Two of these features were custom-built. In fact, of the top 8 features by importance for RF and XGBoost, 7 and 5 respectively were custom-built features.

I made a conscious decision not to impute the missing sales.csv data with 0’s. The sales csv had only one zero present. If I had built out the missing dates with 0’s, our model would predict a large number of 0’s. While realistic, I was unsure of the validity of imputing with 0’s given the information at hand (please see the Sales Data tab under EDA Sales Data CSV). Because of this, the minimum prediction is 246. The maximum prediction is 79546. App 2981 is predicted to have the three highest download totals in the data set, on three consecutive days from 2/19–2/21. App 119 is predicted to have downloads over 50k on 72 of the 90 days. Apps 1808 and 308 are expected to have downloads over 50k on more than 50 days.

If you are exploring this notebook via the rmd file, please disregard all code blocks set to eval=FALSE. If you are using the RPubs version, please note that the sections are tabbed twice.

EDA Sales Data CSV

Please see tabs below

Sales Data

Code for this section has been appended to the bottom

## 
##  146  186  222  237  906 1812 1840 1989 1997 2191 2640 1498 2821 3550 1264 
##    1    1    1    1    1    1    1    1    1    1    2    3    4    4    5 
##   82 3082 2373 2490 2942 1874 3333 3373  907 3428 2333 3308 1461  676 2398 
##    8    8   12   13   13   14   14   17   19   19   20   23   25   38   73 
## 2667 1490  406 2346  320  459  722 1234 2891 
##   74   75   80   82   90   90   90   90   90

Above I print a count for the sales.csv dataset. This table shows, out of the 90 days, how many days’ worth of download data we have for each application. Overall, I want to treat the applications in this dataset as our training dataset. The plan is to join this sales data by Date and app_id with our other dataset. Imputation of missing values isn’t needed for the models I will be testing. Even data with only 1 day’s worth of downloads will hopefully have some value towards my eventual model.
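The computation of table_counts isn’t shown in the appended code; a minimal sketch of how such a per-app day count could be produced, assuming the sales.csv data has already been read into sales, is:

## hypothetical reconstruction of table_counts: days of download data per app, sorted ascending
table_counts <- sort(table(sales$app_id))
table_counts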

I feel I need to address the missing sales data in sales.csv. One could assume that if sales data isn’t present in what is otherwise a clean dataset, it is because there were no sales on those dates. I contemplated imputing missing sales data as 0’s, but I decided against it after looking into app sales like those of app_id 1874 below. I just can’t justify imputing an application with 0’s given that it received over 22k downloads on a single day in the limited window for which I have data. Therefore, when we join datasets, we will not have a complete dataset where every app's sales are recorded for every day. Instead, our final dataset for modeling will simply contain the observations for which we have sales data, and therefore it will not predict 0 sales.
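For reference, the zero-fill alternative I decided against would look roughly like the sketch below (it assumes sales$Date has already been converted to Date class and that dplyr/tidyr are loaded); it expands every app to all 90 dates and fills missing sales with 0.

## hypothetical zero-fill (NOT used in this report): one row per app_id per day
zero_filled <- sales %>%
    tidyr::complete(Date = tidyr::full_seq(Date, period = 1), app_id,
                    fill = list(sales = 0))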

##          Date app_id sales
## 1  2016-03-02   1874   250
## 2  2016-03-06   1874    96
## 3  2016-03-09   1874  9727
## 4  2016-03-10   1874 19010
## 5  2016-03-11   1874 22265
## 6  2016-03-12   1874 16656
## 7  2016-03-13   1874  4107
## 8  2016-03-14   1874  1761
## 9  2016-03-15   1874   472
## 10 2016-03-16   1874    48
## 11 2016-03-17   1874     0
## 12 2016-03-29   1874  1982
## 13 2016-03-30   1874  1364
## 14 2016-03-31   1874   237
# display counts
table_counts

# display application 1874 sales
app_1874 <- sales %>% 
    filter(app_id == 1874)
app_1874

Explore Overall Monthly Trend

Below I created an interactive Plotly display of the sales.csv data. Before we begin to break down the data, let’s center and scale the sales of each app, and then explore the overall time series plot. The product sales are on different scales; by centering and scaling each individual product, we can explore the trends in the overall data over time.

Below you can see that the beginning of January has above-average sales. There appears to be a large overall drop in sales after the first week, and the first week of January has some clear outliers. February has about average sales. March seems to have about average sales with a clear uptick towards the end; that trend may be driven by the outliers that are clearly visible. You can see there are outliers more than 5 SD away in March.
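A quick way to list those extreme points is a simple filter on the standardized sales. This is a sketch only, assuming the scaled dataframe data created in the code below this section:

## hypothetical check: observations more than 5 SD from their app's mean
data %>%
    filter(abs(sales.norm) > 5) %>%
    arrange(Date)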

I also plot the overall sales volume by day with a 7-day rolling average. This may be useful for figuring out the demands on our overall network.

## Scale and center each app by app_id
data <- transform(sales, sales.norm = ave(sales, app_id, FUN = scale))
data$app_id <- as.factor(data$app_id) 
data$Date <- as.Date(data$Date )

## Monthly ggplot 
monthly_ggplotly <- ggplot(aes(Date, sales.norm),data=data)+
    geom_point()+
    geom_smooth()+
    xlab("")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_ggplotly)

## monthly ggplot broken down by product id
monthly_prod_id_ggplotly <- ggplot(aes(Date, sales.norm,col=app_id),data=data)+
    geom_point()+
    xlab("")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_prod_id_ggplotly)
total_sales <- sales%>%
    group_by(Date) %>%
    summarize(value = sum(sales))


##plotting total sales by day
ggplot(total_sales, aes(x = as.POSIXct(Date), y = value)) + 
  geom_bar(stat = "identity") +
  theme_bw() +
  labs(x = "Date", y = "Total daily sales") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_line(aes(y = rollmean(value, 7, na.pad = TRUE))) +
  scale_x_datetime(date_breaks = "2 days", labels = date_format("%b %d"))

## Make sales into date column
sales$Date <- as.Date(sales$Date)

Look at app sales

Code for this section has been appended to the bottom. For clarity of the report, while the code explores all the app_ids, I have only displayed the app_ids with 90 days of sales observations.

Apps with 90 days of sales data

  • filter the df for only apps with 90 days of sales data
  • plot and explore those apps’ time series and scatterplots

Explore outliers in apps 722 and 1234

  • The boxplot below shows outliers are present
  • the outliers in app_2891 are in the first 7 days of launch
    • might need to be addressed in the model, although they are likely legitimate datapoints (likely a product launch or promotion)
  • using mean + 3 SD as the outlier rule, I display the outliers in apps 722 and 1234 (a per-app sketch of this rule appears below)
    • for app 722: 2208 + 3 × 1542 ≈ 6834; the code filters at sales > 6700
    • for app 1234: 1102 + 3 × 361 ≈ 2185; the code filters at sales > 2200
  • option 2: use the forecast package to build a univariate time series and identify time series outliers
    • I print the recommended outlier replacements below
    • I don’t believe these outliers should be imputed for now; tsoutliers suggests substantial changes to the values, and I believe they are real datapoints. They may need to be addressed before modeling

## [1] 1 2 3 4 5 6 7
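A compact, per-app version of the mean + 3 SD rule (a sketch, assuming the complete_sales dataframe built in the appended code; the thresholds here are computed rather than hard-coded at 2200 and 6700):

## hypothetical per-app 3-SD outlier flagging
complete_sales %>%
    group_by(app_id) %>%
    filter(sales > mean(sales) + 3 * sd(sales)) %>%
    arrange(app_id, Date)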

Outliers app_1234

Date app_id sales
2016-03-05 1234 3348
2016-03-30 1234 2628

Outliers app_722

Date app_id sales
2016-03-22 722 12717
2016-03-23 722 7450

Recommendation for imputation of outliers

## $index
## [1] 79 80 81 82
## 
## $replacements
## [1] 1973.8 2877.6 3781.4 4685.2
## $index
## [1] 64 89
## 
## $replacements
## [1] 1553 1399
## get correct app_id
complete_sales_indx <-as.integer(names(which(table_counts>89)))
## filter for app_id
complete_sales <- sales %>% 
    filter(app_id %in%complete_sales_indx)
## scatterplot data of apps
ggplot(complete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free") 


## spread df into multiple columns (one per app_id)
spread_df <- complete_sales %>%
    spread(.,app_id,sales)


## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})

## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)

## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- matrix(dfs,ncol=5)
rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df

## Describe statistics by different app_ids
kable(dfs[2:14,],digits =2,'markdown',booktabs =T)

## visualize with boxplots
boxplot(sales ~ app_id, data=complete_sales, main="boxplot")

outliers = boxplot(spread_df$`2891`, plot=FALSE)

which(spread_df$`2891` < outliers$stats[1] | spread_df$`2891` > outliers$stats[5])

## Look at outliers 722 and 1234 

complete_sales %>% 
    filter(app_id== 1234& sales>2200) %>% 
    kable(.,digits =2,'markdown',booktabs =T)

complete_sales %>% 
    filter(app_id== 722& sales>6700) %>% 
    kable(.,digits =2,'markdown',booktabs =T)

## create time series data and see tsoutliers recommendations
ts_complete_data <-lapply(spread_df,ts)
forecast::tsoutliers(ts_complete_data$`722`)
tsoutliers(ts_complete_data$`1234`)


## look at sales data for nearly complete apps
almost_complete_sales_indx <-as.integer(names(which(table_counts>70 &table_counts<90 )))
almost_complete_sales <- sales %>% 
    filter(app_id %in%almost_complete_sales_indx)
plot_3 <- ggplot(almost_complete_sales) + 
    geom_point(aes(x=Date, y=sales))+
    xlab("")+facet_wrap(~ app_id,scales = "free")
ggplotly(plot_3)
## spread df into multiple columns (one per app_id)
spread_df <- almost_complete_sales %>%
    spread(.,app_id,sales)

## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(almost_complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})

## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)

## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- as.data.frame(matrix(dfs,ncol=5))

rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df
## Print df
kable(dfs,digits =2,'markdown',booktabs =T)
boxplot(sales ~ app_id, data=almost_complete_sales, main="boxplot")

## look at sales data for apps with more than 5 but fewer than 70 days
incomplete_sales_indx <-as.integer(names(which(table_counts>5 &table_counts<70)))
incomplete_sales <- sales %>% 
    filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free",nrow = 3)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))


## Apps with 5 or fewer days of sales data
incomplete_sales_indx <-as.integer(names(which(table_counts>0 &table_counts<6)))
incomplete_sales <- sales %>% 
    filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free",nrow = 3)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
incomplete_sales_indx

EDA Ranks CSV

Please see tabs below

The ranks data contains 3557 apps, while the sales data contains only 39 apps. Our sales data is daily, so we want to aggregate the hourly ranking data into an average daily rank. The cumulative (cs) columns seem to be cumulative review counts for the apps from before the dataset began; these should be altered to reflect the new reviews within this dataset. NA values must be imputed. The NA’s seem to be isolated to the hourly ranking data: there are only ranks from 1–1000 and over 3k apps in the dataset, so each day's hourly ranking columns have 2k+ NA’s. The validate ranking section below attempts to show that these NA values correspond to ranks from 1000–3557. How I imputed this data is addressed in the impute section below.

Overall data manipulation steps: validate the data, impute missing values, create features for potential models, join both datasets, and filter the joined dataset for the app_ids which are present in the sales data. This means we will be running cross-validated models on a dataset of 39 apps, and that model will be used to predict on the remaining ~3500 apps. This may not be a strong enough dataset, considering that many of these apps have only a few days with any sales data, but it seems like the best approach.

App 2891 appears to be the best-selling app in the dataset. In the modeling stage I am going to explore this app with a time series analysis to see if we can predict future downloads. As such, I also explore a df containing only app 2891 in the corrplots section.

Validate Ranking Data and Impute

Code for this section has been appended to the bottom

The table below shows that the range of every hourly ranking column (r0–r23) is 1–1000, meaning there are no values in these columns outside the expected 1–1000 ranks. If the columns were complete, each would contain 90k values in total.

\(1000 \text{ ranks} \times 90 \text{ days} = 90000\)

As the table below shows, there aren’t 90k values (n) in each column. However, most columns contain over 99.9% of the ranks that should exist, and even the worst column still contains nearly 99% of them. This may seem inconsequential, but it matters because my imputation plan assumes that any NA value corresponds to an app ranked somewhere outside the top 1000. My imputation method will therefore turn the small fraction of data that actually should be ranked (well under 1%) into the imputed value. Since an NA represents some essentially (though not fully) random rank between 1000 and 3557 (the total number of apps), I decided to impute with the average of the unranked range.

  • (Total apps + min_unranked_value) / 2

Value to be imputed for NA:

\[(3557+1000)/2 \approx 2278\]
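As a quick check on how much genuinely ranked data this overwrites, the share of the 90k expected rank slots actually present in each hourly column can be computed directly. This is a sketch, assuming ranks.csv has been read into ranks (as in the code below) before imputation, and that the hourly columns are named r0-r23:

## hypothetical check: share of the 1000 * 90 expected rank slots present per hourly column
rank_cols <- paste0('r', 0:23)
round(colSums(!is.na(ranks[, rank_cols])) / (1000 * 90), 4)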

vars n mean sd median trimmed mad min max range skew kurtosis se
r0 1 89992 500.46 288.65 500 500.46 370.65 1 1000 999 0 -1.2 0.96
r1 2 89953 500.51 288.55 501 500.52 370.65 1 1000 999 0 -1.2 0.96
r2 3 88974 500.63 288.62 501 500.64 370.65 1 1000 999 0 -1.2 0.97
r3 4 89999 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r4 5 89997 500.48 288.67 500 500.48 370.65 1 1000 999 0 -1.2 0.96
r5 6 89886 500.87 288.39 501 500.92 370.65 1 1000 999 0 -1.2 0.96
r6 7 89888 500.88 288.40 501 500.93 370.65 1 1000 999 0 -1.2 0.96
r7 8 89991 500.45 288.65 500 500.45 370.65 1 1000 999 0 -1.2 0.96
r8 9 89994 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r9 10 89995 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r10 11 89977 500.37 288.60 500 500.37 370.65 1 1000 999 0 -1.2 0.96
r11 12 89997 500.48 288.67 500 500.48 370.65 1 1000 999 0 -1.2 0.96
r12 13 89958 500.61 288.61 501 500.62 370.65 1 1000 999 0 -1.2 0.96
r13 14 89931 500.12 288.46 500 500.12 370.65 1 1000 999 0 -1.2 0.96
r14 15 89990 500.44 288.64 500 500.44 370.65 1 1000 999 0 -1.2 0.96
r15 16 89998 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r16 17 89960 500.28 288.55 500 500.28 370.65 1 1000 999 0 -1.2 0.96
r17 18 89984 500.41 288.63 500 500.41 370.65 1 1000 999 0 -1.2 0.96
r18 19 89991 500.45 288.65 500 500.45 370.65 1 1000 999 0 -1.2 0.96
r19 20 89973 500.35 288.59 500 500.35 370.65 1 1000 999 0 -1.2 0.96
r20 21 89993 500.46 288.65 500 500.46 370.65 1 1000 999 0 -1.2 0.96
r21 22 89999 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r22 23 89995 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r23 24 89981 500.39 288.62 500 500.39 370.65 1 1000 999 0 -1.2 0.96
ranks <- read.csv('ranks.csv')
ranks$Date <- as.Date(ranks$Date)
ranks$app_id <- as.factor(ranks$app_id)
kable(describe(ranks[,3:26]),digits =2,'markdown',booktabs =T)

## impute missing hourly ranking data with the unranked-average value (2278)
for(i in 3:26){
  ranks[is.na(ranks[,i]), i] <- 2278
}
## impute missing ratings data with 0's
for(i in 27:36){
  ranks[is.na(ranks[,i]), i] <- 0
}

Feature Building

Code for this section has been appended to the bottom

What kind of features can we develop from what we have? Time is an important element in most time series data; however, most of the models I will be testing can’t use the date directly. Because of this, I decided to create an is_weekend column to capture any weekend effect, and a month column to attempt to capture any monthly effects. Our hourly ranking data can be aggregated and then safely dropped: I create daily_rank_average, daily_rank_highest, and daily_variance statistics to capture the statistical signal in the ranking columns, after which the 24 hourly columns can be dropped.

For the ratings features, I create a daily rating score by assigning values to the 1–5 star ratings: for 1, 2, 3, 4, and 5 stars I assign the values (-2, -1, 0, 1, 2). The cs columns in the dataset represent the cumulative number of reviews, including reviews from before our dataset starts. At first I wanted to create an old-reviews column from this data, but I opted to leave it as is and instead create several new cumulative columns. I create csum1–csum5 (cumulative new reviews by star level within the dataset), total_reviews (total new reviews throughout the dataset), net_rating_score_daily, and cumulative_rating_score_daily, which tracks the overall rating score change of the app by date. I believe these features will do a good job of capturing some of the time series elements we would otherwise be missing in the modeling stage.

Csum1–csum5 as well as all the hourly ranking columns are dropped at the end of feature building. After I join the datasets, I filter the model's training data for app_ids present in the sales.csv data; the rest of the data will be used for predictions. Overall, we have a training dataset with 1110 rows and a prediction dataset with 97608 rows. This section is all done under the hood, so I print the first rows of each df below.

Features

  • Time-Based features
    • month
    • is_weekend
  • new ranking stats
    • daily_rank_average
    • daily_rank_highest
    • daily_variance
  • New ratings data
    • csum1-csum5
    • total_reviews
    • net_rating_score_daily
    • cumulative_rating_score_daily

Modeling Dataframe

Date app_id sales s1 s2 s3 s4 s5 cs1 cs2 cs3 cs4 cs5 daily_rank_average daily_variance daily_rank_highest total_reviews net_rating_score_daily cumulative_rating_score_daily is_weekend month
2016-01-01 1234 1553 0 0 0 0 0 332 163 269 395 1320 778.33 2502.41 678 0 0 0 0 01
2016-01-01 1490 2152 2 0 0 0 2 248 91 167 290 877 518.67 80.93 506 4 0 0 0 01
2016-01-01 2398 1501 1 0 0 0 0 40 13 17 24 120 1897.58 366764.51 950 1 -2 -2 0 01
2016-01-01 2891 89289 8 4 8 16 106 73 41 71 191 1036 5.75 0.20 5 142 208 208 0 01
2016-01-01 320 2412 1 0 2 0 0 746 284 290 568 1736 470.50 213.48 438 3 -2 -2 0 01
2016-01-01 406 1308 0 0 0 0 0 1376 1008 5674 20010 82635 717.75 539.33 681 0 0 0 0 01
2016-01-01 459 2037 0 0 0 0 1 10 2 14 32 264 603.42 52.17 594 1 2 2 0 01
2016-01-01 722 2052 1 0 2 8 1 4025 1129 1488 2007 5477 566.42 614.95 530 12 8 8 0 01
2016-01-02 1234 1011 0 0 0 0 0 332 163 269 395 1320 725.62 938.94 675 0 0 0 1 01
2016-01-02 1490 1722 0 0 2 0 0 248 91 169 290 877 545.79 126.78 532 6 0 0 1 01

Prediction Dataframe

kable((prediction_df[1:10,]),digits =2,'markdown',booktabs =T)
Date app_id sales s1 s2 s3 s4 s5 cs1 cs2 cs3 cs4 cs5 daily_rank_average daily_variance daily_rank_highest total_reviews net_rating_score_daily cumulative_rating_score_daily is_weekend month
2016-01-01 1 NA 0 0 0 0 0 47 8 6 12 30 591.96 740.22 548 0 0 0 0 01
2016-01-01 100 NA 4 0 0 0 0 4682 2187 6226 28460 315885 69.54 2.26 68 4 -8 -8 0 01
2016-01-01 1006 NA 0 0 0 0 0 33 13 11 12 18 177.46 233.91 158 0 0 0 0 01
2016-01-01 1007 NA 0 0 0 0 0 1 0 0 3 16 1615.42 458464.78 920 0 0 0 0 01
2016-01-01 1009 NA 4 0 2 2 12 11267 3178 2907 4622 17393 109.33 3.01 107 20 18 18 0 01
2016-01-01 1014 NA 2 0 0 0 0 719 216 428 1549 6833 537.04 79.00 522 2 -4 -4 0 01
2016-01-01 1015 NA 0 0 0 0 2 121 95 186 370 2883 445.29 84.04 428 2 4 4 0 01
2016-01-01 1017 NA 0 0 0 0 0 228 95 147 451 5223 286.79 24.78 278 0 0 0 0 01
2016-01-01 1018 NA 0 0 0 0 0 31 27 59 410 1269 1331.58 385273.82 906 0 0 0 0 01
2016-01-01 102 NA 0 0 0 0 0 1071 733 1632 3331 10022 901.54 1535.48 831 0 0 0 0 01
## create daily average rank, avg daily variance, daily highest rank
rank_average <- rowSums(ranks[3:26])/24
my_vars <- rowVars(as.matrix(ranks[,3:26]))
ranks$daily_rank_average <- rank_average
ranks$daily_variance <- my_vars
ranks$daily_rank_highest<- apply(as.matrix(ranks[,3:26]),1,min)

## create cumulative review counts (csum1-csum5), total reviews, daily rating score, and cumulative rating score
ranks <- ranks %>% 
    arrange(app_id) %>% 
    group_by(app_id) %>%
    mutate(csum1 = cumsum(s1)) %>% 
    mutate(csum2 = cumsum(s2)) %>% 
    mutate(csum3 = cumsum(s3)) %>% 
    mutate(csum4 = cumsum(s4)) %>% 
    mutate(csum5 = cumsum(s5)) %>% 
    mutate(total_reviews = csum1+csum2+csum3+csum4+csum5) %>% 
    mutate(net_rating_score_daily = (s1*-2)+(s2*-1)+(s3*0)+(s4*1)+(s5*2)) %>% 
    mutate(cumulative_rating_score_daily =(csum1*-2) + (csum2*-1) + (csum3*0) +(csum4*1)+(csum5*2))



## Create weekday/weekend
ranks$is_weekend <- chron::is.weekend(ranks$Date)
ranks$is_weekend <- as.integer(ranks$is_weekend )

## Create month category
months <- strsplit(as.character(ranks$Date),"-" )
months <- sapply(months, "[[", 2)
ranks$month <- months




## Merge with sales data
df<- merge(sales,ranks,by=c('Date',"app_id"),all=TRUE)
df$app_id <-as.factor(as.character(as.integer(df$app_id)))


## validate merge worked correctly(commented out for report)
#dim(df)[1]-sum(is.na(df$sales))==dim(sales)[1]

## drop all hourly and cumsum columns
df <- df[,c(1,2,3,28:32,38:44)]

## filter for apps only present in sales data to create our model training data
apps_with_sales_list <- as.integer(levels(as.factor(sales$app_id)))
modeling_df <- df %>% 
    filter(app_id%in%apps_with_sales_list )
    
prediction_df <- df %>% 
    filter( !app_id%in%apps_with_sales_list )

Corrplots

Code for this section has been appended to the bottom

  • heavy collinearity, which makes sense as several of our columns are aggregates of each other; many are also heavily zero-inflated. They may still have plenty of individual value in our models
##  [1] Date                          app_id                       
##  [3] sales                         s1                           
##  [5] s2                            s3                           
##  [7] s4                            s5                           
##  [9] cs1                           cs2                          
## [11] cs3                           cs4                          
## [13] cs5                           daily_rank_average           
## [15] daily_variance                daily_rank_highest           
## [17] total_reviews                 net_rating_score_daily       
## [19] cumulative_rating_score_daily is_weekend                   
## [21] month                        
## <0 rows> (or 0-length row.names)

## reorder factor levels
modeling_df$app_id <- as.factor(as.character(modeling_df$app_id))

## Split df by app
list_dfs <- split(modeling_df,modeling_df$app_id)

## Correct NA values in sales (5 rows present in ranks.csv had no sales values in sales.csv; impute with 0)
modeling_df[which(is.na(modeling_df$sales)), 3] <- 0
modeling_df[which(is.na(modeling_df$sales)), ]


## Turn list into individual df for app 2891
app_2891 <- as.data.frame(list_dfs$`2891`)
app_2891$app_id <- as.factor(as.character(app_2891$app_id))

##corr plot entire DF
corrplot::corrplot(cor(modeling_df[,3:20]))
mtext("WHOLE DF", at=-2.5, line=-0.5, cex=2)

##corr plot app 2891
corrplot::corrplot(cor(app_2891[,3:20]))
mtext("App 2891", at=-2.5, line=-0.5, cex=2)

Modeling

Please see tabs below

The instruction questions seemed open-ended in terms of what we want to model. I decided to build a time series model for the highest-selling product, app 2891. I ran a simple train/test split as well as a CV test using RMSE to compare several different time series models on the app. The analysis here could easily have gone further; earlier in the EDA the forecast package recommended imputation of several values in this time series. As the data is merely simulated, I opted to just run the models and show the process without imputation. Future expectations of downloads can be a very valuable tool for our clients to adjust their advertising as well as their IT requirements. As a caveat, forecasts in general don’t behave well as the prediction horizon becomes large.

Next, I tested several cross-validated models and compared their RMSE scores. My hope is that by extracting variable importance we can identify which features are most important for increasing sales. I recognize that this is count data, but I believe our models can capture it reasonably well without resorting to Poisson modeling techniques. Lastly, caret doesn’t handle time series data well, which is why I created categories like is_weekend and month to attempt to capture time series effects in the modeling.
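If we did want to respect the count nature of the target, a Poisson GLM could be slotted into the same caret workflow with minimal changes. The sketch below is illustrative only and is not one of the models compared in this report; it assumes the xTrain, yTrain, and trControl objects defined in the modeling code further down.

## hypothetical Poisson alternative (not run in this report)
poisFit <- train(x = xTrain,
                 y = yTrain,
                 method = 'glm',
                 family = poisson(link = 'log'),
                 trControl = trControl,
                 preProc = c('center', 'scale'))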

Time series analysis of 2891

Below I run a time series analysis of app 2891. First, I create a train/test split and plot predictions from several models. The random walk model seems to perform best here, and the ARIMA model performs terribly. Given more time, these models could be retested with imputed outliers. Next, I conduct a CV analysis of several models with the forecast package; ARIMA and ETS performed best in the CV scoring, and I plot both models with predictions. I would feel more comfortable using the ETS model, considering how poor the ARIMA model's performance was in the train/test split. The forecast package handles imputations easily, and that could lead to better predictions. I produce the forecast graphs below for all the train/test and CV models. The x-axis represents weeks from the start of the data set (1/1/2016). The train/test-split predictions map forecasts for March against the data we already have. The CV models, which were compared on cross-validated RMSE, show our expected future predictions. As you can see, the intervals for these forecasts are very wide.

  • The ARIMA model seems to think app sales will go down over the next month, while ETS sees no trend and what looks like a weekly seasonality effect

App 2891 Time Series Plots:

Train/Test splits

  • The predictions below are based on splitting off the last 20 days of data and predicting on them

App 2891 Naive Fit Prediction

App 2891 ETS Prediction

App 2891 Random Walk Prediction

App 2891 ARIMA Prediction

App 2891 Train/Test split RMSE comparison

ets_ana_rmse random_walk_rmse naive_fit_rmse arima_fit_rmse
3684.53 2751.96 23742.29 10564.91

App 2891 Cross-validated RMSE

X naive_fit_cv_rmse rwf_fit_rmse arima_fit_rmse ets_fit_rmse
1 6734.94 6734.94 6581.14 6302.28

CV ARIMA model prediction over the next month for app 2891

CV ETS model prediction over the next month for app 2891

## CV functions
farima <- function(x, h) {
  forecast(auto.arima(x,allowdrift=TRUE), h=h)
}
fets <- function(x, h) {
  forecast(forecast::ets(x), h = h)
}
naives <- function(x, h) {
  forecast(forecast::naive(x), h = h)
}
## build xts for display
xts_pipe1 <- xts::xts(app_2891$sales, order.by= app_2891$Date)
attr(xts_pipe1, 'frequency') <- 7

## display xts time plot and histogram
par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(xts_pipe1, main="App 2891 daily sales")
# display frequency at which daily sales values occur
hist(xts_pipe1, xlab="", main="", col="red")

# ## build acf/pacf plots for pipe1 and diff (commented out for report)
# a <- ggAcf(xts_pipe1,lag=200)
# b <- ggPacf(xts_pipe1,lag=200)
# ## Example of overdifferencing below
# c <- ggAcf(diff(xts_pipe1),lag=10)
# d <- ggPacf(diff(xts_pipe1),lag=200)
# plot_grid(a,b,c,d)



app_2891_ts <- ts(app_2891$sales,frequency=365,start = c(2016,1))

## Create train test split
myts.train <- app_2891_ts[1:70]
test_results <- app_2891_ts[71:90]
ts_test_results <- ts(test_results)
myts.train <- ts(myts.train,frequency=7)







## Build naive fit and plot
naive_fit <- naive(myts.train)
autoplot(naive_fit,h=29)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build ETS and plot
fit <- forecast::ets(myts.train, allow.multiplicative.trend=TRUE)
fc <- forecast(fit,h=29)
autoplot(fc)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build random walk (STL + naive) and plot
fcast <- stlf(myts.train,method='naive',h=29)
autoplot(fcast)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build ARIMA and plot
arima_fit <-forecast::auto.arima(myts.train)
arima_fit <- forecast(arima_fit,h=29)
autoplot(arima_fit)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## get Train/Test RMSE
random_walk_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fcast$mean)
ets_ana_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fc$mean)
naive_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),naive_fit$mean)
arima_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),arima_fit$mean)
## train split results 
results <- as.data.frame(cbind(ets_ana_rmse,random_walk_rmse,naive_fit_rmse,arima_fit_rmse))
kable(results,digits =2,'markdown',booktabs =T)


## Cross-validation below took a long time to run, so I saved the results to GitHub and load them in as a dataframe
x <- getURL("https://raw.githubusercontent.com/justinherman42/appfigures/master/cv_scores_app2891.csv")
y <- as.data.frame(read.csv(text = x))
kable(y,digits =2,'markdown',booktabs =T)


# Commented out; results imported from the csv above
# naive_fit_cv <- forecast::tsCV(ts(app_2891_ts,frequency = 7),naives, h=1)
# rwf_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),rwf, drift=TRUE, h=1)
# ets_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),fets, h=1)
# arima_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),farima, h=1)
# naive_fit_cv_rmse <- sqrt(mean(naive_fit_cv^2, na.rm=TRUE))
# rwf_fit_rmse <- sqrt(mean(rwf_fit^2, na.rm=TRUE))
# ets_fit_rmse <- sqrt(mean(ets_fit^2, na.rm=TRUE))
# arima_fit_rmse <- sqrt(mean(arima_fit^2, na.rm=TRUE))
#cv_scores <- as.data.frame(cbind(naive_fit_cv_rmse,rwf_fit_rmse,arima_fit_rmse,ets_fit_rmse))
#write.csv(cv_scores,file = "cv_scores_app2891.csv")


## actual models 
auto_arima_model <- farima(ts(app_2891_ts,frequency = 7),365/7)
auto_arima_model$model
checkresiduals(auto_arima_model)
ets_model <- fets(ts(app_2891_ts,frequency = 7),365/7)
ets_model$model
checkresiduals(ets_model)


arima_for_plot <- autoplot(forecast::forecast(auto_arima_model,h=31),series='ARIMA')
ets_forecast_plot <- autoplot(forecast::forecast(ets_model),series="ETS")

Run models with Caret

The bread and butter of our modeling stage. We set up a 5-fold cross-validated training control. Caret can’t handle the date series, so we must exclude dates from the modeling; hopefully our features like is_weekend, month, and the cumulative ranking/rating statistics capture the time element well.
I run several linear, nonlinear, and tree-based models and compare their results based on RMSE. I display the variable importance for all the models. It must be noted that some of our features may be highly collinear, and therefore the variable importance may not be accurately estimated.
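One quick way to gauge that collinearity before reading too much into the importance rankings is caret's findCorrelation filter. This is a sketch only, assuming the xTrain dataframe built in the code below:

## hypothetical check: flag predictors with pairwise correlation above 0.9
high_corr <- caret::findCorrelation(cor(xTrain), cutoff = 0.9, names = TRUE)
high_corr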

Daily rank highest is considered the most important feature in almost all of our models, and the number of 5-star reviews is selected as extremely important as well. The value these variables bring to our clients really depends on the causal direction of the ranking data. Are these ranks published somewhere and viewed, so that they have an impact on driving downloads (think Rotten Tomatoes ratings driving traffic to the movies)? Or is the ranking data solely a reflection of genuine interest in the apps? If it’s the former, we can use ranks to drive traffic towards applications (which can be substantial). If it’s the latter, we can use the ranks data to gauge profitability for apps soon after launch.

I used RMSE to pick a model and in the end chose the random forest. With more time, I would want to combine the time series analysis with some of the analysis from our model, and I would also want more data: our model attempts to predict for over 3500 apps with only 90 days of data on fewer than 40 apps. Overall, it seems like a very interesting topic to investigate.

Train parameters

Linear Models

## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 2 on full training set

## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat,     method = "oscorespls")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.3, lambda = 0 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 1 on full training set

Non Linear models

## Aggregating results
## Selecting tuning parameters
## Fitting k = 5 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.125, C = 8 on full training set

Tree Models

## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 18 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 500, max_depth = 4, eta = 0.01, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.75 on full training set

Rank Models importance features

Rank pls rf xgbTree enet knn svmRadial glmnet
1 s5 daily_rank_highest daily_rank_highest daily_rank_highest daily_rank_highest daily_rank_highest cumulative_rating_score_daily
2 net_rating_score_daily s5 daily_rank_average daily_rank_average daily_rank_average daily_rank_average s5
3 s4 daily_rank_average net_rating_score_daily s5 s5 s5 net_rating_score_daily
4 cumulative_rating_score_daily net_rating_score_daily s5 net_rating_score_daily net_rating_score_daily net_rating_score_daily daily_rank_highest
5 s3 daily_variance cs1 s4 s4 s4 total_reviews
6 daily_rank_highest cumulative_rating_score_daily daily_variance cumulative_rating_score_daily cumulative_rating_score_daily cumulative_rating_score_daily s4
7 total_reviews total_reviews cumulative_rating_score_daily s3 s3 s3 daily_variance
8 daily_rank_average is_weekend s4 total_reviews total_reviews total_reviews cs1
9 s2 cs4 total_reviews s2 s2 s2 s1
10 s1 cs2 s3 s1 s1 s1 cs5
11 cs3 s1 cs2 daily_variance daily_variance daily_variance cs3
12 cs4 cs5 s1 cs4 cs4 cs4 is_weekend
13 cs2 cs1 cs3 cs3 cs3 cs3 s3
14 cs1 s4 cs4 cs2 cs2 cs2 cs2
15 cs5 cs3 is_weekend month month month daily_rank_average
16 month month s2 is_weekend is_weekend is_weekend month
17 daily_variance s2 cs5 cs1 cs1 cs1 s2
18 is_weekend s3 month cs5 cs5 cs5 cs4

## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS, ENet, KNN, SVM, RF, XGB 
## Number of resamples: 5 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  2130.3708 2192.196 2240.331 2305.242 2280.599 2682.715    0
## ENet 1942.6416 2044.858 2057.371 2188.658 2350.154 2548.265    0
## KNN  1512.5788 1641.362 1652.588 1626.805 1660.362 1667.132    0
## SVM  2245.2264 2283.089 2469.891 2463.704 2654.788 2665.525    0
## RF    987.7445 1109.535 1162.016 1166.172 1242.008 1329.555    0
## XGB   981.5728 1128.565 1201.954 1185.319 1264.634 1349.867    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  3877.702 3914.482 3941.092 4139.684 3974.065 4991.079    0
## ENet 3752.864 3860.134 3923.325 4240.657 4548.382 5118.578    0
## KNN  3954.345 4159.861 4567.975 4409.517 4594.733 4770.669    0
## SVM  5384.505 5385.286 6208.286 6085.735 6342.220 7108.376    0
## RF   2435.471 2931.129 3006.524 3232.680 3717.091 4073.185    0
## XGB  2376.476 2988.896 3042.397 3242.596 3738.381 4066.831    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.8132833 0.8135496 0.8166363 0.8247710 0.8399609 0.8404251    0
## ENet 0.7262742 0.8098684 0.8244899 0.8041526 0.8294611 0.8306693    0
## KNN  0.7607204 0.7825883 0.8230092 0.8067721 0.8314895 0.8360532    0
## SVM  0.5449780 0.5745114 0.6060230 0.6095424 0.6605867 0.6616126    0
## RF   0.8342030 0.8607409 0.9008469 0.8855466 0.9022635 0.9296787    0
## XGB  0.8426103 0.8660799 0.8951299 0.8879660 0.9026636 0.9333460    0

Predicted Sales Over 50k: Count of Days per App

## 
## 2142  246 2694 2756 3005 3407  387  909 2841 3288 3391 3336 3395 2873 2281 
##    1    1    1    1    1    1    1    1    2    2    2    3    3    4    6 
## 2765 3534 2977 3234 3268  499 3299 3178 3533 3102 3530 2457 2500  886 2563 
##    6    6    7    8    8    9   10   11   12   16   17   20   21   23   24 
## 1733 2516 3072 2981 2941 1808  308  119 
##   25   26   30   32   36   51   62   75
## prep data for caret: convert month to integer, drop Date/app_id/sales columns
modeling_df$month <- as.integer(modeling_df$month)
set.seed(1)
xTrain <- modeling_df[,-c(1,2,3)]
yTrain <- as.integer(modeling_df[,3])
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
                          method='cv', 
                          number=5,
                          index=cvFolds)

# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Build Linear Models
plsFit <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('knnImpute', 'center', 'scale'))
plot(plsFit)
plsFit$finalModel

enetFit <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('knnImpute', 'center', 'scale'))
plot(enetFit)
enetFit$finalModel


## Build Non-linear models
knnFit <- train(x=xTrain,
                y=yTrain,
                method='knn',
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
plot(knnFit)
knnFit$finalModel

svmFit <- train(x=xTrain,
                y=yTrain,
                method="svmRadial",
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))

## plot svm model
plot(svmFit)

## Build rf 
rf <- train(x=xTrain, 
            y=yTrain, 
            method='rf',
            tuneLength=10,
            trControl=trControl, 
            importance=T)
plot(rf)


## Build XGBoost model
xgbGrid <- expand.grid(
   .nrounds=c(100, 500), # boosting iterations (trees)
   .max_depth=c(4, 6), # max tree depth
   .eta=c(0.001, 0.01), # learning rate
   .gamma=c(0),# minimum loss reduction
   .colsample_bytree=c(0.4, 0.6), # subsample ratio of columns
   .min_child_weight=c(1, 5), # minimum sum of instance weight
   .subsample=c(0.5, 0.75))  # subsample ratio of rows
xgb1 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl)

## shutdown multicore
stopCluster(cl)
registerDoSEQ()


## Rank Models importance features 
getRank <- function(trainObjects){
  temp <- c()
  methods <- c()
  for(object in trainObjects){
    methods <- c(methods, object$method)
    varimp <- varImp(object)[[1]]
    varimp$variables <- row.names(varimp)
    rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
    temp <- cbind(temp, rank)
    
  }
  temp <- as.data.frame(temp)
  names(temp) <- methods
  temp$Rank <- c(1:dim(temp)[1])
  temp <- select(temp, Rank, everything())
  return(temp)
}

kable(getRank(list(plsFit, rf, xgb1, enetFit, knnFit, svmFit)),digits = 2,'markdown', booktabs =T)

## plot variable importance
plot(varImp(plsFit), main='Variable Importance based on PLS')
plot(varImp(rf), main='Variable Importance based on Random Forest')
plot(varImp(xgb1), main='Variable Importance based on XGBoost')
plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')

## summary of model scores 
resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb1)) %>% summary()


##boxplots of cv scores
par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb1$resample$RMSE, main='CV RMSE for XGB')


## predictions 
pred <- predict(rf, newdata=prediction_df)
prediction_df$sales_predictions <- pred
prediction_df$Date <- pred_dates
prediction_df$app_id <- pred_appids
df_for_export <- prediction_df[,c(20,21,19)]
df_for_export$app_id <- as.integer(as.character(df_for_export$app_id))
final_df <- df_for_export %>% 
    arrange(app_id)

write.csv(final_df,'Herman_estimates.csv')


indexes <- which(prediction_df$sales_predictions>50000)
new_df <- prediction_df[indexes,c("app_id","sales_predictions")]
new_df$app_id <- as.factor(as.character(new_df$app_id))
sort(table(new_df$app_id))