Predicting App Sales Volume

In this report I perform some EDA on the two datasets provided to me, and then run several models to attempt to predict sales volume by day. The first part of the EDA explores sales.csv, which contains our target “sales” column. Sales are downloads; please note that throughout the writeup I use the words “sales” and “downloads” interchangeably. I found that only 5 apps in this csv file contain complete information (download totals for all 90 days). To conduct time series analysis, complete or near-complete information is needed. However, incomplete sales data isn’t an issue for most of our modeling process.

The second csv file, “Ranks.csv”, gives us hourly ranking data for over 3500 apps. For the hourly ranking columns (r0–r23), I imputed missing values with the average of the unranked data, rank = 2278 (please see the validate rank and impute section under the EDA Ranks CSV for further explanation). Next, I moved on to feature building. I created several aggregated ratings and rankings statistics (please see the feature building section under the EDA Ranks CSV). Continuing the data processing, I dropped the hourly ranking columns and joined the two csv files on Date and app_id.

My dataset is now ready for modeling. I filtered the data for app_ids that are present in the sales.csv file and used that as my modeling dataframe; all other data will be used for predictions. I took the most downloaded app (app_id 2891) and ran a time series analysis on it. I wanted to forecast future downloads for the app, as I thought this could be valuable to clients. I then set up a 5-fold cross-validation to compare several linear, non-linear, generalized linear, gradient-boosted, and random forest models to predict daily downloads. I show the variable importance from the models and discuss how RMSE was minimized by the RF and XGBoost models.

The models indicate that the 3 most important features for daily prediction are the number of five-star reviews, the average daily rank, and the statistic tracking the highest rank the app achieved that day. Two of these features were custom-built. In fact, of the top 8 features by importance for RF and XGBoost, 7 and 5 respectively were custom-built features.

I made a conscious decision not to impute the missing sales.csv data with 0’s. The sales csv had only one zero present. If I had built out the missing dates with 0’s, our model would predict a large number of 0’s. While realistic, I was unsure of the validity of imputing with 0’s given the information at hand (please see the Sales Data tab under EDA Sales Data CSV). Because of this, the minimum prediction is 246. The maximum prediction is 79546. App 2981 is predicted to have the three highest download totals in the data set, on three consecutive days from 2/19–2/21. App 119 is predicted to have downloads over 50k on 72 of the 90 days. Apps 1808 and 308 are expected to have downloads over 50k on more than 50 days.

If you are exploring this notebook via the rmd file, please disregard all code blocks set to eval=FALSE. If you are using the RPubs version, please note that the sections are tabbed twice.

EDA Sales Data CSV

Please see tabs below

Sales Data

Code for this section has been appended to the bottom

## 
##  146  186  222  237  906 1812 1840 1989 1997 2191 2640 1498 2821 3550 1264 
##    1    1    1    1    1    1    1    1    1    1    2    3    4    4    5 
##   82 3082 2373 2490 2942 1874 3333 3373  907 3428 2333 3308 1461  676 2398 
##    8    8   12   13   13   14   14   17   19   19   20   23   25   38   73 
## 2667 1490  406 2346  320  459  722 1234 2891 
##   74   75   80   82   90   90   90   90   90

Above I print a count for the sales.csv dataset. This table shows, out of the 90 days, how many days’ worth of download data we have for each application. Overall, I want to treat the applications in this dataset as our training dataset. The plan is to join this sales data by Date and app_id with our other dataset. Imputation of missing values isn’t needed for the models I will be testing. Even data with only 1 day’s worth of downloads will hopefully have some value towards my eventual model.
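The computation of table_counts isn’t shown in the appended code; a minimal sketch of how such a per-app day count could be produced, assuming the sales.csv data has already been read into sales, is:

## hypothetical reconstruction of table_counts: days of download data per app, sorted ascending
table_counts <- sort(table(sales$app_id))
table_counts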

I feel I need to address the missing sales data in sales.csv. One could assume that if sales data isn’t present in what is otherwise a clean dataset, it is because there were no sales on those dates. I contemplated imputing missing sales data as 0’s, but I decided against it after looking into app sales like those of app_id 1874 below. I just can’t justify imputing an application with 0’s given that it received over 22k downloads on a single day in the limited window for which I have data. Therefore, when we join datasets, we will not have a complete dataset where every app's sales are recorded for every day. Instead, our final dataset for modeling will simply contain the observations for which we have sales data, and therefore it will not predict 0 sales.
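For reference, the zero-fill alternative I decided against would look roughly like the sketch below (it assumes sales$Date has already been converted to Date class and that dplyr/tidyr are loaded); it expands every app to all 90 dates and fills missing sales with 0.

## hypothetical zero-fill (NOT used in this report): one row per app_id per day
zero_filled <- sales %>%
    tidyr::complete(Date = tidyr::full_seq(Date, period = 1), app_id,
                    fill = list(sales = 0))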

##          Date app_id sales
## 1  2016-03-02   1874   250
## 2  2016-03-06   1874    96
## 3  2016-03-09   1874  9727
## 4  2016-03-10   1874 19010
## 5  2016-03-11   1874 22265
## 6  2016-03-12   1874 16656
## 7  2016-03-13   1874  4107
## 8  2016-03-14   1874  1761
## 9  2016-03-15   1874   472
## 10 2016-03-16   1874    48
## 11 2016-03-17   1874     0
## 12 2016-03-29   1874  1982
## 13 2016-03-30   1874  1364
## 14 2016-03-31   1874   237
# display counts
table_counts

# display application 1874 sales
app_1874 <- sales %>% 
    filter(app_id == 1874)
app_1874

Explore Overall Monthly Trend

Below I created an interactive Plotly display of the sales.csv data. Before we begin to break down the data, let’s center and scale the sales of each app, and then explore the overall time series plot. The product sales are on different scales; by centering and scaling each individual product, we can explore the trends in the overall data over time.

Below you can see that the beginning of January has above-average sales. There appears to be a large overall drop in sales after the first week, and the first week of January has some clear outliers. February has about average sales. March seems to have about average sales with a clear uptick towards the end; that trend may be driven by the outliers that are clearly visible. You can see there are outliers more than 5 SD away in March.
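A quick way to list those extreme points is a simple filter on the standardized sales. This is a sketch only, assuming the scaled dataframe data created in the code below this section:

## hypothetical check: observations more than 5 SD from their app's mean
data %>%
    filter(abs(sales.norm) > 5) %>%
    arrange(Date)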

I also plot the overall sales volume by day with a 7-day rolling average. This may be useful for figuring out the demands on our overall network.

## Scale and center each app by app_id
data <- transform(sales, sales.norm = ave(sales, app_id, FUN = scale))
data$app_id <- as.factor(data$app_id) 
data$Date <- as.Date(data$Date )

## Monthly ggplot 
monthly_ggplotly <- ggplot(aes(Date, sales.norm),data=data)+
    geom_point()+
    geom_smooth()+
    xlab("")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_ggplotly)

## monthly ggplot broken down by product id
monthly_prod_id_ggplotly <- ggplot(aes(Date, sales.norm,col=app_id),data=data)+
    geom_point()+
    xlab("")+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(monthly_prod_id_ggplotly)
total_sales <- sales%>%
    group_by(Date) %>%
    summarize(value = sum(sales))


##plotting total sales by day
ggplot(total_sales, aes(x = as.POSIXct(Date), y = value)) + 
  geom_bar(stat = "identity") +
  theme_bw() +
  labs(x = "Date", y = "Total daily sales") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_line(aes(y = rollmean(value, 7, na.pad = TRUE))) +
  scale_x_datetime(date_breaks = "2 days", labels = date_format("%b %d"))

## Make sales into date column
sales$Date <- as.Date(sales$Date)

Look at app sales

Code for this section has been appended to the bottom. For clarity of the report, while the code explores all the app_ids, I have only displayed the app_ids with 90 days of sales observations.

Apps with 90 days of sales data

  • filter the df for only apps with 90 days of sales data
  • plot and explore those apps’ time series and scatterplots

Explore outliers in apps 722 and 1234

  • The boxplot below shows outliers are present
  • the outliers in app_2891 are in the first 7 days of launch
    • might need to be addressed in the model, although they are likely legitimate datapoints (likely a product launch or promotion)
  • using mean + 3 SD as the outlier rule, I display the outliers in apps 722 and 1234 (a per-app sketch of this rule appears below)
    • for app 722: 2208 + 3 × 1542 ≈ 6834; the code filters at sales > 6700
    • for app 1234: 1102 + 3 × 361 ≈ 2185; the code filters at sales > 2200
  • option 2: use the forecast package to build a univariate time series and identify time series outliers
    • I print the recommended outlier replacements below
    • I don’t believe these outliers should be imputed for now; tsoutliers suggests substantial changes to the values, and I believe they are real datapoints. They may need to be addressed before modeling

## [1] 1 2 3 4 5 6 7
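A compact, per-app version of the mean + 3 SD rule (a sketch, assuming the complete_sales dataframe built in the appended code; the thresholds here are computed rather than hard-coded at 2200 and 6700):

## hypothetical per-app 3-SD outlier flagging
complete_sales %>%
    group_by(app_id) %>%
    filter(sales > mean(sales) + 3 * sd(sales)) %>%
    arrange(app_id, Date)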

Outliers app_1234

Date app_id sales
2016-03-05 1234 3348
2016-03-30 1234 2628

Outliers app_722

Date app_id sales
2016-03-22 722 12717
2016-03-23 722 7450

Recommendation for imputation of outliers

## $index
## [1] 79 80 81 82
## 
## $replacements
## [1] 1973.8 2877.6 3781.4 4685.2
## $index
## [1] 64 89
## 
## $replacements
## [1] 1553 1399
## get correct app_id
complete_sales_indx <-as.integer(names(which(table_counts>89)))
## filter for app_id
complete_sales <- sales %>% 
    filter(app_id %in%complete_sales_indx)
## scatterplot data of apps
ggplot(complete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free") 


## spread df into multiple columns (one per app_id)
spread_df <- complete_sales %>%
    spread(.,app_id,sales)


## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})

## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)

## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- matrix(dfs,ncol=5)
rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df

## Describe statistics by different app_ids
kable(dfs[2:14,],digits =2,'markdown',booktabs =T)

## visualize with boxplots
boxplot(sales ~ app_id, data=complete_sales, main="boxplot")

outliers = boxplot(spread_df$`2891`, plot=FALSE)

which(spread_df$`2891` < outliers$stats[1] | spread_df$`2891` > outliers$stats[5])

## Look at outliers 722 and 1234 

complete_sales %>% 
    filter(app_id== 1234& sales>2200) %>% 
    kable(.,digits =2,'markdown',booktabs =T)

complete_sales %>% 
    filter(app_id== 722& sales>6700) %>% 
    kable(.,digits =2,'markdown',booktabs =T)

## create time series data and see tsoutliers recommendations
ts_complete_data <-lapply(spread_df,ts)
forecast::tsoutliers(ts_complete_data$`722`)
tsoutliers(ts_complete_data$`1234`)


## look at sales data for nearly complete apps
almost_complete_sales_indx <-as.integer(names(which(table_counts>70 &table_counts<90 )))
almost_complete_sales <- sales %>% 
    filter(app_id %in%almost_complete_sales_indx)
plot_3 <- ggplot(almost_complete_sales) + 
    geom_point(aes(x=Date, y=sales))+
    xlab("")+facet_wrap(~ app_id,scales = "free")
ggplotly(plot_3)
## spread df into multiple columns (one per app_id)
spread_df <- almost_complete_sales %>%
    spread(.,app_id,sales)

## apply describe over spread_df
dfs <- sapply(spread_df[,c(as.character(almost_complete_sales_indx))],function(x){psych::describe(x,IQR=TRUE)})

## store names for df rebuild
row_names_df <- rownames(dfs)
colnames_df <- colnames(dfs)

## apply round and reshape dataframe
dfs <- sapply(dfs,function(x){round(x,2)})
dfs <- as.data.frame(matrix(dfs,ncol=5))

rownames(dfs) <- row_names_df
colnames(dfs) <- colnames_df
## Print df
kable(dfs,digits =2,'markdown',booktabs =T)
boxplot(sales ~ app_id, data=almost_complete_sales, main="boxplot")

## look at sales data for apps with more than 5 but fewer than 70 days
incomplete_sales_indx <-as.integer(names(which(table_counts>5 &table_counts<70)))
incomplete_sales <- sales %>% 
    filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free",nrow = 3)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))


## Apps with 5 or fewer days of sales data
incomplete_sales_indx <-as.integer(names(which(table_counts>0 &table_counts<6)))
incomplete_sales <- sales %>% 
    filter(app_id %in%incomplete_sales_indx)
ggplot(incomplete_sales) + geom_point(aes(x=Date, y=sales))+
  facet_wrap(~ app_id,scales = "free",nrow = 3)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
incomplete_sales_indx

EDA Ranks CSV

Please see tabs below

The ranks data contains 3557 apps, while the sales data contains only 39 apps. Our sales data is daily, so we want to aggregate the hourly ranking data into an average daily rank. The cumulative (cs) columns seem to be cumulative review counts for the apps from before the dataset began; these should be altered to reflect the new reviews within this dataset. NA values must be imputed. The NA’s seem to be isolated to the hourly ranking data: there are only ranks from 1–1000 and over 3k apps in the dataset, so each day's hourly ranking columns have 2k+ NA’s. The validate ranking section below attempts to show that these NA values correspond to ranks from 1000–3557. How I imputed this data is addressed in the impute section below.

Overall data manipulation steps: validate the data, impute missing values, create features for potential models, join both datasets, and filter the joined dataset for the app_ids which are present in the sales data. This means we will be running cross-validated models on a dataset of 39 apps, and that model will be used to predict on the remaining ~3500 apps. This may not be a strong enough dataset, considering that many of these apps have only a few days with any sales data, but it seems like the best approach.

App 2891 appears to be the best-selling app in the dataset. In the modeling stage I am going to explore this app with a time series analysis to see if we can predict future downloads. As such, I also explore a df containing only app 2891 in the corrplots section.

Validate Ranking Data and Impute

Code for this section has been appended to the bottom

The table below shows that the range of every hourly ranking column (r0–r23) is 1–1000, meaning there are no values in these columns outside the expected 1–1000 ranks. If the columns were complete, each would contain 90k values in total.

\(1000 \text{ ranks} \times 90 \text{ days} = 90000\)

As the table below shows, there aren’t 90k values (n) in each column. However, most columns contain over 99.9% of the ranks that should exist, and even the worst column still contains nearly 99% of them. This may seem inconsequential, but it matters because my imputation plan assumes that any NA value corresponds to an app ranked somewhere outside the top 1000. My imputation method will therefore turn the small fraction of data that actually should be ranked (well under 1%) into the imputed value. Since an NA represents some essentially (though not fully) random rank between 1000 and 3557 (the total number of apps), I decided to impute with the average of the unranked range.

  • (Total apps + min_unranked_value) / 2

Value to be imputed for NA:

\[(3557+1000)/2 \approx 2278\]
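As a quick check on how much genuinely ranked data this overwrites, the share of the 90k expected rank slots actually present in each hourly column can be computed directly. This is a sketch, assuming ranks.csv has been read into ranks (as in the code below) before imputation, and that the hourly columns are named r0-r23:

## hypothetical check: share of the 1000 * 90 expected rank slots present per hourly column
rank_cols <- paste0('r', 0:23)
round(colSums(!is.na(ranks[, rank_cols])) / (1000 * 90), 4)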

vars n mean sd median trimmed mad min max range skew kurtosis se
r0 1 89992 500.46 288.65 500 500.46 370.65 1 1000 999 0 -1.2 0.96
r1 2 89953 500.51 288.55 501 500.52 370.65 1 1000 999 0 -1.2 0.96
r2 3 88974 500.63 288.62 501 500.64 370.65 1 1000 999 0 -1.2 0.97
r3 4 89999 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r4 5 89997 500.48 288.67 500 500.48 370.65 1 1000 999 0 -1.2 0.96
r5 6 89886 500.87 288.39 501 500.92 370.65 1 1000 999 0 -1.2 0.96
r6 7 89888 500.88 288.40 501 500.93 370.65 1 1000 999 0 -1.2 0.96
r7 8 89991 500.45 288.65 500 500.45 370.65 1 1000 999 0 -1.2 0.96
r8 9 89994 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r9 10 89995 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r10 11 89977 500.37 288.60 500 500.37 370.65 1 1000 999 0 -1.2 0.96
r11 12 89997 500.48 288.67 500 500.48 370.65 1 1000 999 0 -1.2 0.96
r12 13 89958 500.61 288.61 501 500.62 370.65 1 1000 999 0 -1.2 0.96
r13 14 89931 500.12 288.46 500 500.12 370.65 1 1000 999 0 -1.2 0.96
r14 15 89990 500.44 288.64 500 500.44 370.65 1 1000 999 0 -1.2 0.96
r15 16 89998 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r16 17 89960 500.28 288.55 500 500.28 370.65 1 1000 999 0 -1.2 0.96
r17 18 89984 500.41 288.63 500 500.41 370.65 1 1000 999 0 -1.2 0.96
r18 19 89991 500.45 288.65 500 500.45 370.65 1 1000 999 0 -1.2 0.96
r19 20 89973 500.35 288.59 500 500.35 370.65 1 1000 999 0 -1.2 0.96
r20 21 89993 500.46 288.65 500 500.46 370.65 1 1000 999 0 -1.2 0.96
r21 22 89999 500.49 288.67 500 500.49 370.65 1 1000 999 0 -1.2 0.96
r22 23 89995 500.47 288.66 500 500.47 370.65 1 1000 999 0 -1.2 0.96
r23 24 89981 500.39 288.62 500 500.39 370.65 1 1000 999 0 -1.2 0.96
ranks <- read.csv('ranks.csv')
ranks$Date <- as.Date(ranks$Date)
ranks$app_id <- as.factor(ranks$app_id)
kable(describe(ranks[,3:26]),digits =2,'markdown',booktabs =T)

## impute missing hourly ranking data with the unranked-average value (2278)
for(i in 3:26){
  ranks[is.na(ranks[,i]), i] <- 2278
}
## impute missing ratings data with 0's
for(i in 27:36){
  ranks[is.na(ranks[,i]), i] <- 0
}

Feature Building

Code for this section has been appended to the bottom

What kind of features can we develop from what we have? Time is an important element in most time series data; however, most of the models I will be testing can’t use the date directly. Because of this, I decided to create an is_weekend column to capture any weekend effect, and a month column to attempt to capture any monthly effects. Our hourly ranking data can be aggregated and then safely dropped: I create daily_rank_average, daily_rank_highest, and daily_variance statistics to capture the statistical signal in the ranking columns, after which the 24 hourly columns can be dropped.

For the ratings features, I create a daily rating score by assigning values to the 1–5 star ratings: for 1, 2, 3, 4, and 5 stars I assign the values (-2, -1, 0, 1, 2). The cs columns in the dataset represent the cumulative number of reviews, including reviews from before our dataset starts. At first I wanted to create an old-reviews column from this data, but I opted to leave it as is and instead create several new cumulative columns. I create csum1–csum5 (cumulative new reviews by star level within the dataset), total_reviews (total new reviews throughout the dataset), net_rating_score_daily, and cumulative_rating_score_daily, which tracks the overall rating score change of the app by date. I believe these features will do a good job of capturing some of the time series elements we would otherwise be missing in the modeling stage.

Csum1–csum5 as well as all the hourly ranking columns are dropped at the end of feature building. After I join the datasets, I filter the model's training data for app_ids present in the sales.csv data; the rest of the data will be used for predictions. Overall, we have a training dataset with 1110 rows and a prediction dataset with 97608 rows. This section is all done under the hood, so I print the first rows of each df below.

Features

  • Time-Based features
    • month
    • is_weekend
  • new ranking stats
    • daily_rank_average
    • daily_rank_highest
    • daily_variance
  • New ratings data
    • csum1-csum5
    • total_reviews
    • net_rating_score_daily
    • cumulative_rating_score_daily

Modeling Dataframe

Date app_id sales s1 s2 s3 s4 s5 cs1 cs2 cs3 cs4 cs5 daily_rank_average daily_variance daily_rank_highest total_reviews net_rating_score_daily cumulative_rating_score_daily is_weekend month
2016-01-01 1234 1553 0 0 0 0 0 332 163 269 395 1320 778.33 2502.41 678 0 0 0 0 01
2016-01-01 1490 2152 2 0 0 0 2 248 91 167 290 877 518.67 80.93 506 4 0 0 0 01
2016-01-01 2398 1501 1 0 0 0 0 40 13 17 24 120 1897.58 366764.51 950 1 -2 -2 0 01
2016-01-01 2891 89289 8 4 8 16 106 73 41 71 191 1036 5.75 0.20 5 142 208 208 0 01
2016-01-01 320 2412 1 0 2 0 0 746 284 290 568 1736 470.50 213.48 438 3 -2 -2 0 01
2016-01-01 406 1308 0 0 0 0 0 1376 1008 5674 20010 82635 717.75 539.33 681 0 0 0 0 01
2016-01-01 459 2037 0 0 0 0 1 10 2 14 32 264 603.42 52.17 594 1 2 2 0 01
2016-01-01 722 2052 1 0 2 8 1 4025 1129 1488 2007 5477 566.42 614.95 530 12 8 8 0 01
2016-01-02 1234 1011 0 0 0 0 0 332 163 269 395 1320 725.62 938.94 675 0 0 0 1 01
2016-01-02 1490 1722 0 0 2 0 0 248 91 169 290 877 545.79 126.78 532 6 0 0 1 01

Prediction Dataframe

kable((prediction_df[1:10,]),digits =2,'markdown',booktabs =T)
Date app_id sales s1 s2 s3 s4 s5 cs1 cs2 cs3 cs4 cs5 daily_rank_average daily_variance daily_rank_highest total_reviews net_rating_score_daily cumulative_rating_score_daily is_weekend month
2016-01-01 1 NA 0 0 0 0 0 47 8 6 12 30 591.96 740.22 548 0 0 0 0 01
2016-01-01 100 NA 4 0 0 0 0 4682 2187 6226 28460 315885 69.54 2.26 68 4 -8 -8 0 01
2016-01-01 1006 NA 0 0 0 0 0 33 13 11 12 18 177.46 233.91 158 0 0 0 0 01
2016-01-01 1007 NA 0 0 0 0 0 1 0 0 3 16 1615.42 458464.78 920 0 0 0 0 01
2016-01-01 1009 NA 4 0 2 2 12 11267 3178 2907 4622 17393 109.33 3.01 107 20 18 18 0 01
2016-01-01 1014 NA 2 0 0 0 0 719 216 428 1549 6833 537.04 79.00 522 2 -4 -4 0 01
2016-01-01 1015 NA 0 0 0 0 2 121 95 186 370 2883 445.29 84.04 428 2 4 4 0 01
2016-01-01 1017 NA 0 0 0 0 0 228 95 147 451 5223 286.79 24.78 278 0 0 0 0 01
2016-01-01 1018 NA 0 0 0 0 0 31 27 59 410 1269 1331.58 385273.82 906 0 0 0 0 01
2016-01-01 102 NA 0 0 0 0 0 1071 733 1632 3331 10022 901.54 1535.48 831 0 0 0 0 01
## create daily average rank, avg daily variance, daily highest rank
rank_average <- rowSums(ranks[3:26])/24
my_vars <- rowVars(as.matrix(ranks[,3:26]))
ranks$daily_rank_average <- rank_average
ranks$daily_variance <- my_vars
ranks$daily_rank_highest<- apply(as.matrix(ranks[,3:26]),1,min)

## create cumulative review counts (csum1-csum5), total reviews, daily rating score, and cumulative rating score
ranks <- ranks %>% 
    arrange(app_id) %>% 
    group_by(app_id) %>%
    mutate(csum1 = cumsum(s1)) %>% 
    mutate(csum2 = cumsum(s2)) %>% 
    mutate(csum3 = cumsum(s3)) %>% 
    mutate(csum4 = cumsum(s4)) %>% 
    mutate(csum5 = cumsum(s5)) %>% 
    mutate(total_reviews = csum1+csum2+csum3+csum4+csum5) %>% 
    mutate(net_rating_score_daily = (s1*-2)+(s2*-1)+(s3*0)+(s4*1)+(s5*2)) %>% 
    mutate(cumulative_rating_score_daily =(csum1*-2) + (csum2*-1) + (csum3*0) +(csum4*1)+(csum5*2))



## Create weekday/weekend
ranks$is_weekend <- chron::is.weekend(ranks$Date)
ranks$is_weekend <- as.integer(ranks$is_weekend )

## Create month category
months <- strsplit(as.character(ranks$Date),"-" )
months <- sapply(months, "[[", 2)
ranks$month <- months




## Merge with sales data
df<- merge(sales,ranks,by=c('Date',"app_id"),all=TRUE)
df$app_id <-as.factor(as.character(as.integer(df$app_id)))


## validate merge worked correctly(commented out for report)
#dim(df)[1]-sum(is.na(df$sales))==dim(sales)[1]

## drop all hourly and cumsum columns
df <- df[,c(1,2,3,28:32,38:44)]

## filter for apps only present in sales data to create our model training data
apps_with_sales_list <- as.integer(levels(as.factor(sales$app_id)))
modeling_df <- df %>% 
    filter(app_id%in%apps_with_sales_list )
    
prediction_df <- df %>% 
    filter( !app_id%in%apps_with_sales_list )

Corrplots

Code for this section has been appended to the bottom

  • heavy collinearity, which makes sense as several of our columns are aggregates of each other; many are also heavily zero-inflated. They may still have plenty of individual value in our models
##  [1] Date                          app_id                       
##  [3] sales                         s1                           
##  [5] s2                            s3                           
##  [7] s4                            s5                           
##  [9] cs1                           cs2                          
## [11] cs3                           cs4                          
## [13] cs5                           daily_rank_average           
## [15] daily_variance                daily_rank_highest           
## [17] total_reviews                 net_rating_score_daily       
## [19] cumulative_rating_score_daily is_weekend                   
## [21] month                        
## <0 rows> (or 0-length row.names)

## reorder factor levels
modeling_df$app_id <- as.factor(as.character(modeling_df$app_id))

## Split df by app
list_dfs <- split(modeling_df,modeling_df$app_id)

## Correct NA values in sales (5 rows present in ranks.csv had no sales values in sales.csv; impute with 0)
modeling_df[which(is.na(modeling_df$sales)), 3] <- 0
modeling_df[which(is.na(modeling_df$sales)), ]


## Turn list into individual df for app 2891
app_2891 <- as.data.frame(list_dfs$`2891`)
app_2891$app_id <- as.factor(as.character(app_2891$app_id))

##corr plot entire DF
corrplot::corrplot(cor(modeling_df[,3:20]))
mtext("WHOLE DF", at=-2.5, line=-0.5, cex=2)

##corr plot app 2891
corrplot::corrplot(cor(app_2891[,3:20]))
mtext("App 2891", at=-2.5, line=-0.5, cex=2)

Modeling

Please see tabs below

The instruction questions seemed open-ended in terms of what we want to model. I decided to build a time series model for the highest-selling product, app 2891. I ran a simple train/test split as well as a CV test using RMSE to compare several different time series models on the app. The analysis here could easily have gone further; earlier in the EDA the forecast package recommended imputation of several values in this time series. As the data is merely simulated, I opted to just run the models and show the process without imputation. Future expectations of downloads can be a very valuable tool for our clients to adjust their advertising as well as their IT requirements. As a caveat, forecasts in general don’t behave well as the prediction horizon becomes large.

Next, I tested several cross-validated models and compared their RMSE scores. My hope is that by extracting variable importance we can identify which features are most important for increasing sales. I recognize that this is count data, but I believe our models can capture it reasonably well without resorting to Poisson modeling techniques. Lastly, caret doesn’t handle time series data well, which is why I created categories like is_weekend and month to attempt to capture time series effects in the modeling.
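If we did want to respect the count nature of the target, a Poisson GLM could be slotted into the same caret workflow with minimal changes. The sketch below is illustrative only and is not one of the models compared in this report; it assumes the xTrain, yTrain, and trControl objects defined in the modeling code further down.

## hypothetical Poisson alternative (not run in this report)
poisFit <- train(x = xTrain,
                 y = yTrain,
                 method = 'glm',
                 family = poisson(link = 'log'),
                 trControl = trControl,
                 preProc = c('center', 'scale'))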

Time series analysis of 2891

Below I run a time series analysis of app 2891. First, I create a train/test split and plot predictions from several models. The random walk model seems to perform best here, and the ARIMA model performs terribly. Given more time, these models could be retested with imputed outliers. Next, I conduct a CV analysis of several models with the forecast package; ARIMA and ETS performed best in the CV scoring, and I plot both models with predictions. I would feel more comfortable using the ETS model, considering how poor the ARIMA model's performance was in the train/test split. The forecast package handles imputations easily, and that could lead to better predictions. I produce the forecast graphs below for all the train/test and CV models. The x-axis represents weeks from the start of the data set (1/1/2016). The train/test-split predictions map forecasts for March against the data we already have. The CV models, which were compared on cross-validated RMSE, show our expected future predictions. As you can see, the intervals for these forecasts are very wide.

  • The ARIMA model seems to think app sales will go down over the next month, while ETS sees no trend and what looks like a weekly seasonality effect

App 2891 Time Series Plots:

Train/Test splits

  • The predictions below are based on splitting off the last 20 days of data and predicting on them

App 2891 Naive Fit Prediction

App 2891 ETS Prediction

App 2891 Random Walk Prediction

App 2891 ARIMA Prediction

App 2891 Train/Test split RMSE comparison

ets_ana_rmse random_walk_rmse naive_fit_rmse arima_fit_rmse
3684.53 2751.96 23742.29 10564.91

App 2891 Cross-validated RMSE

X naive_fit_cv_rmse rwf_fit_rmse arima_fit_rmse ets_fit_rmse
1 6734.94 6734.94 6581.14 6302.28

CV ARIMA model prediction over the next month for app 2891

CV ETS model prediction over the next month for app 2891

## CV functions
farima <- function(x, h) {
  forecast(auto.arima(x,allowdrift=TRUE), h=h)
}
fets <- function(x, h) {
  forecast(forecast::ets(x), h = h)
}
naives <- function(x, h) {
  forecast(forecast::naive(x), h = h)
}
## build xts for display
xts_pipe1 <- xts::xts(app_2891$sales, order.by= app_2891$Date)
attr(xts_pipe1, 'frequency') <- 7

## display xts time plot and histogram
par(mfrow=c(2, 1), mar = c(3, 5, 0, 0), oma = c(0, 0, 0.5, 0.5))
plot(xts_pipe1, main="App 2891 daily sales")
# display frequency at which daily sales values occur
hist(xts_pipe1, xlab="", main="", col="red")

# ## build acf/pacf plots for pipe1 and diff (commented out for report)
# a <- ggAcf(xts_pipe1,lag=200)
# b <- ggPacf(xts_pipe1,lag=200)
# ## Example of overdifferencing below
# c <- ggAcf(diff(xts_pipe1),lag=10)
# d <- ggPacf(diff(xts_pipe1),lag=200)
# plot_grid(a,b,c,d)



app_2891_ts <- ts(app_2891$sales,frequency=365,start = c(2016,1))

## Create train test split
myts.train <- app_2891_ts[1:70]
test_results <- app_2891_ts[71:90]
ts_test_results <- ts(test_results)
myts.train <- ts(myts.train,frequency=7)







## Build naive fit and plot
naive_fit <- naive(myts.train)
autoplot(naive_fit,h=29)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build ETS and plot
fit <- forecast::ets(myts.train, allow.multiplicative.trend=TRUE)
fc <- forecast(fit,h=29)
autoplot(fc)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build random walk (STL + naive) and plot
fcast <- stlf(myts.train,method='naive',h=29)
autoplot(fcast)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## Build ARIMA and plot
arima_fit <-forecast::auto.arima(myts.train)
arima_fit <- forecast(arima_fit,h=29)
autoplot(arima_fit)+
autolayer(ts(app_2891_ts,frequency = 7),series="actual")

## get Train/Test RMSE
random_walk_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fcast$mean)
ets_ana_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),fc$mean)
naive_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),naive_fit$mean)
arima_fit_rmse <- rmse(ts(app_2891_ts[71:90],frequency = 7),arima_fit$mean)
## train split results 
results <- as.data.frame(cbind(ets_ana_rmse,random_walk_rmse,naive_fit_rmse,arima_fit_rmse))
kable(results,digits =2,'markdown',booktabs =T)


## Cross-validation below took a long time to run, so I saved the results to GitHub and load them in as a dataframe
x <- getURL("https://raw.githubusercontent.com/justinherman42/appfigures/master/cv_scores_app2891.csv")
y <- as.data.frame(read.csv(text = x))
kable(y,digits =2,'markdown',booktabs =T)


# Commented out; results imported from the csv above
# naive_fit_cv <- forecast::tsCV(ts(app_2891_ts,frequency = 7),naives, h=1)
# rwf_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),rwf, drift=TRUE, h=1)
# ets_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),fets, h=1)
# arima_fit <- forecast::tsCV(ts(app_2891_ts,frequency = 7),farima, h=1)
# naive_fit_cv_rmse <- sqrt(mean(naive_fit_cv^2, na.rm=TRUE))
# rwf_fit_rmse <- sqrt(mean(rwf_fit^2, na.rm=TRUE))
# ets_fit_rmse <- sqrt(mean(ets_fit^2, na.rm=TRUE))
# arima_fit_rmse <- sqrt(mean(arima_fit^2, na.rm=TRUE))
#cv_scores <- as.data.frame(cbind(naive_fit_cv_rmse,rwf_fit_rmse,arima_fit_rmse,ets_fit_rmse))
#write.csv(cv_scores,file = "cv_scores_app2891.csv")


## actual models 
auto_arima_model <- farima(ts(app_2891_ts,frequency = 7),365/7)
auto_arima_model$model
checkresiduals(auto_arima_model)
ets_model <- fets(ts(app_2891_ts,frequency = 7),365/7)
ets_model$model
checkresiduals(ets_model)


arima_for_plot <- autoplot(forecast::forecast(auto_arima_model,h=31),series='ARIMA')
ets_forecast_plot <- autoplot(forecast::forecast(ets_model),series="ETS")

Run models with Caret

The bread and butter of our modeling stage. We set up a 5-fold cross-validated training control. Caret can’t handle the date series, so we must exclude dates from the modeling; hopefully our features like is_weekend, month, and the cumulative ranking/rating statistics capture the time element well.
I run several linear, nonlinear, and tree-based models and compare their results based on RMSE. I display the variable importance for all the models. It must be noted that some of our features may be highly collinear, and therefore the variable importance may not be accurately estimated.
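One quick way to gauge that collinearity before reading too much into the importance rankings is caret's findCorrelation filter. This is a sketch only, assuming the xTrain dataframe built in the code below:

## hypothetical check: flag predictors with pairwise correlation above 0.9
high_corr <- caret::findCorrelation(cor(xTrain), cutoff = 0.9, names = TRUE)
high_corr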

Daily rank highest is considered the most important feature in almost all of our models, and the number of 5-star reviews is selected as extremely important as well. The value these variables bring to our clients really depends on the causal direction of the ranking data. Are these ranks published somewhere and viewed, so that they have an impact on driving downloads (think Rotten Tomatoes ratings driving traffic to the movies)? Or is the ranking data solely a reflection of genuine interest in the apps? If it’s the former, we can use ranks to drive traffic towards applications (which can be substantial). If it’s the latter, we can use the ranks data to gauge profitability for apps soon after launch.

I used RMSE to pick a model and in the end chose the random forest. With more time, I would want to combine the time series analysis with some of the analysis from our model, and I would also want more data: our model attempts to predict for over 3500 apps with only 90 days of data on fewer than 40 apps. Overall, it seems like a very interesting topic to investigate.

Train parameters

Linear Models

## Aggregating results
## Selecting tuning parameters
## Fitting ncomp = 2 on full training set

## Partial least squares regression , fitted with the orthogonal scores algorithm.
## Call:
## plsr(formula = .outcome ~ ., ncomp = param$ncomp, data = dat,     method = "oscorespls")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting fraction = 0.3, lambda = 0 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 1 on full training set

Non Linear models

## Aggregating results
## Selecting tuning parameters
## Fitting k = 5 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.125, C = 8 on full training set

Tree Models

## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 18 on full training set

## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 500, max_depth = 4, eta = 0.01, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.75 on full training set

Rank Models importance features

Rank pls rf xgbTree enet knn svmRadial glmnet
1 s5 daily_rank_highest daily_rank_highest daily_rank_highest daily_rank_highest daily_rank_highest cumulative_rating_score_daily
2 net_rating_score_daily s5 daily_rank_average daily_rank_average daily_rank_average daily_rank_average s5
3 s4 daily_rank_average net_rating_score_daily s5 s5 s5 net_rating_score_daily
4 cumulative_rating_score_daily net_rating_score_daily s5 net_rating_score_daily net_rating_score_daily net_rating_score_daily daily_rank_highest
5 s3 daily_variance cs1 s4 s4 s4 total_reviews
6 daily_rank_highest cumulative_rating_score_daily daily_variance cumulative_rating_score_daily cumulative_rating_score_daily cumulative_rating_score_daily s4
7 total_reviews total_reviews cumulative_rating_score_daily s3 s3 s3 daily_variance
8 daily_rank_average is_weekend s4 total_reviews total_reviews total_reviews cs1
9 s2 cs4 total_reviews s2 s2 s2 s1
10 s1 cs2 s3 s1 s1 s1 cs5
11 cs3 s1 cs2 daily_variance daily_variance daily_variance cs3
12 cs4 cs5 s1 cs4 cs4 cs4 is_weekend
13 cs2 cs1 cs3 cs3 cs3 cs3 s3
14 cs1 s4 cs4 cs2 cs2 cs2 cs2
15 cs5 cs3 is_weekend month month month daily_rank_average
16 month month s2 is_weekend is_weekend is_weekend month
17 daily_variance s2 cs5 cs1 cs1 cs1 s2
18 is_weekend s3 month cs5 cs5 cs5 cs4

## 
## Call:
## summary.resamples(object = .)
## 
## Models: PLS, ENet, KNN, SVM, RF, XGB 
## Number of resamples: 5 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  2130.3708 2192.196 2240.331 2305.242 2280.599 2682.715    0
## ENet 1942.6416 2044.858 2057.371 2188.658 2350.154 2548.265    0
## KNN  1512.5788 1641.362 1652.588 1626.805 1660.362 1667.132    0
## SVM  2245.2264 2283.089 2469.891 2463.704 2654.788 2665.525    0
## RF    987.7445 1109.535 1162.016 1166.172 1242.008 1329.555    0
## XGB   981.5728 1128.565 1201.954 1185.319 1264.634 1349.867    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## PLS  3877.702 3914.482 3941.092 4139.684 3974.065 4991.079    0
## ENet 3752.864 3860.134 3923.325 4240.657 4548.382 5118.578    0
## KNN  3954.345 4159.861 4567.975 4409.517 4594.733 4770.669    0
## SVM  5384.505 5385.286 6208.286 6085.735 6342.220 7108.376    0
## RF   2435.471 2931.129 3006.524 3232.680 3717.091 4073.185    0
## XGB  2376.476 2988.896 3042.397 3242.596 3738.381 4066.831    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## PLS  0.8132833 0.8135496 0.8166363 0.8247710 0.8399609 0.8404251    0
## ENet 0.7262742 0.8098684 0.8244899 0.8041526 0.8294611 0.8306693    0
## KNN  0.7607204 0.7825883 0.8230092 0.8067721 0.8314895 0.8360532    0
## SVM  0.5449780 0.5745114 0.6060230 0.6095424 0.6605867 0.6616126    0
## RF   0.8342030 0.8607409 0.9008469 0.8855466 0.9022635 0.9296787    0
## XGB  0.8426103 0.8660799 0.8951299 0.8879660 0.9026636 0.9333460    0

Predicted Sales Over 50k: Count of Days per App

## 
## 2142  246 2694 2756 3005 3407  387  909 2841 3288 3391 3336 3395 2873 2281 
##    1    1    1    1    1    1    1    1    2    2    2    3    3    4    6 
## 2765 3534 2977 3234 3268  499 3299 3178 3533 3102 3530 2457 2500  886 2563 
##    6    6    7    8    8    9   10   11   12   16   17   20   21   23   24 
## 1733 2516 3072 2981 2941 1808  308  119 
##   25   26   30   32   36   51   62   75
## prep data for caret: convert month to integer, drop Date/app_id/sales columns
modeling_df$month <- as.integer(modeling_df$month)
set.seed(1)
xTrain <- modeling_df[,-c(1,2,3)]
yTrain <- as.integer(modeling_df[,3])
cvFolds <- createFolds(yTrain, k=5)
trControl <- trainControl(verboseIter=T,
                          method='cv', 
                          number=5,
                          index=cvFolds)

# Set up and start multi-core processing
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Build Linear Models
plsFit <- train(x=xTrain,
                 y=yTrain, 
                 method='pls',
                 tuneLength=20,
                 trControl=trControl,
                 preProc=c('knnImpute', 'center', 'scale'))
plot(plsFit)
plsFit$finalModel

enetFit <- train(x=xTrain,
                  y=yTrain,
                  method='enet',
                  tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                       .lambda = seq(0, 1, by=0.1)),
                  trControl=trControl,
                  preProc=c('knnImpute', 'center', 'scale'))
plot(enetFit)
enetFit$finalModel


## Build Non-linear models
knnFit <- train(x=xTrain,
                y=yTrain,
                method='knn',
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))
plot(knnFit)
knnFit$finalModel

svmFit <- train(x=xTrain,
                y=yTrain,
                method="svmRadial",
                tuneLength=20,
                trControl=trControl,
                preProc=c('knnImpute', 'center', 'scale'))

## plot svm model
plot(svmFit)

## Build rf 
rf <- train(x=xTrain, 
            y=yTrain, 
            method='rf',
            tuneLength=10,
            trControl=trControl, 
            importance=T)
plot(rf)


## Build XGBoost model
xgbGrid <- expand.grid(
   .nrounds=c(100, 500), # boosting iterations (trees)
   .max_depth=c(4, 6), # max tree depth
   .eta=c(0.001, 0.01), # learning rate
   .gamma=c(0),# minimum loss reduction
   .colsample_bytree=c(0.4, 0.6), # subsample ratio of columns
   .min_child_weight=c(1, 5), # minimum sum of instance weight
   .subsample=c(0.5, 0.75))  # subsample ratio of rows
xgb1 <- train(x = xTrain,
              y = yTrain,
              method = 'xgbTree',
              tuneGrid = xgbGrid,
              trControl = trControl)

## shutdown multicore
stopCluster(cl)
registerDoSEQ()


## Rank Models importance features 
getRank <- function(trainObjects){
  temp <- c()
  methods <- c()
  for(object in trainObjects){
    methods <- c(methods, object$method)
    varimp <- varImp(object)[[1]]
    varimp$variables <- row.names(varimp)
    rank <- varimp[order(varimp$Overall, decreasing = T),] %>% row.names()
    temp <- cbind(temp, rank)
    
  }
  temp <- as.data.frame(temp)
  names(temp) <- methods
  temp$Rank <- c(1:dim(temp)[1])
  temp <- select(temp, Rank, everything())
  return(temp)
}

kable(getRank(list(plsFit, rf, xgb1, enetFit, knnFit, svmFit)),digits = 2,'markdown', booktabs =T)

## plot variable importance
plot(varImp(plsFit), main='Variable Importance based on PLS')
plot(varImp(rf), main='Variable Importance based on Random Forest')
plot(varImp(xgb1), main='Variable Importance based on XGBoost')
plot(varImp(svmFit), main='Variable Importance based on Loess R-Squares')

## summary of model scores 
resamples(list(PLS=plsFit, ENet=enetFit, KNN=knnFit, SVM=svmFit, RF=rf, XGB=xgb1)) %>% summary()


##boxplots of cv scores
par(mfrow=c(2,3))
boxplot(plsFit$resample$RMSE, main='CV RMSE for PLS')
boxplot(enetFit$resample$RMSE, main='CV RMSE for ENet')
boxplot(knnFit$resample$RMSE, main='CV RMSE for KNN')
boxplot(svmFit$resample$RMSE, main='CV RMSE for SVM')
boxplot(rf$resample$RMSE, main='CV RMSE for RF')
boxplot(xgb1$resample$RMSE, main='CV RMSE for XGB')


## predictions 
pred <- predict(rf, newdata=prediction_df)
prediction_df$sales_predictions <- pred
prediction_df$Date <- pred_dates
prediction_df$app_id <- pred_appids
df_for_export <- prediction_df[,c(20,21,19)]
df_for_export$app_id <- as.integer(as.character(df_for_export$app_id))
final_df <- df_for_export %>% 
    arrange(app_id)

write.csv(final_df,'Herman_estimates.csv')


indexes <- which(prediction_df$sales_predictions>50000)
new_df <- prediction_df[indexes,c("app_id","sales_predictions")]
new_df$app_id <- as.factor(as.character(new_df$app_id))
sort(table(new_df$app_id))