The Anonymous Inc. dataset is a subset of customer data from the Anonymous Inc. platform, with a sample size of ~14,650. The data records the number of sign-ins, searches, and quotes retrieved on the platform between October 24th and November 22nd, 2015. To find the best predictor for future marketing efforts, it made sense to start with two metrics and their relationship:
Reading the data directly into R and summarizing the various datasets yields several important observations:
My first test case was to check whether a high number of sign-ins was indicative of any other information. Using an aggregation function from base R, I found that 157 IDs can be identified as high-frequency users (top 5%), and that among these users the titles below are the most common:
However, when these users were compared with other users holding the same title, sign-in frequency was an unreliable predictor:
Consequently, I began to look at item quotations as the next logical indicator.
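The sign-in check described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data frame `signins` and the column `NumSignins` are hypothetical names (the original sign-in columns are not shown), and the values are randomly generated toy data, not the Anonymous Inc. data.

```r
# Toy stand-in for the sign-in data; names and values are hypothetical.
set.seed(1)
signins <- data.frame(
  PeopleId   = rep(1:200, each = 3),
  Title      = rep(sample(c("Buyer", "Engineer", "Manager"), 200, replace = TRUE),
                   each = 3),
  NumSignins = rpois(600, lambda = 5)
)

# Total sign-ins per person (base R aggregation).
per_user <- aggregate(NumSignins ~ PeopleId + Title, data = signins, FUN = sum)

# 95th-percentile cutoff; users strictly above it form the top ~5%.
cutoff <- quantile(per_user$NumSignins, probs = 0.95)
high_freq <- per_user[per_user$NumSignins > cutoff, ]

# Most common titles among the high-frequency users.
title_counts <- sort(table(high_freq$Title), decreasing = TRUE)
print(title_counts)
```

Because of ties at the cutoff, the selected group can be somewhat smaller than exactly 5% of users.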
After computing the percentiles of the aggregated item quotes, I selected the top 5% with sqldf, in a similar manner to the high-frequency users.
library(sqldf)
highitems <- sqldf("select * from itemquote where NumItemsQuotes > 5489")  # top 5% cutoff
The selection returned the titles of interest below; I then sampled each title in a similar way to determine whether item-quote patterns would be the same for these titles.
As with the high-frequency users, the titles produced by this selection had little predictive power for other users, which led me to conclude that no pattern can be deduced from nominal positions.
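One way to make the "little predictive power" observation concrete is to compare, for each title, the share of its users who land in the top 5% against the overall share. The sketch below is an assumption about how such a check could look, not the procedure actually used; it reuses the `itemquote`, `Title`, and `NumItemsQuotes` names from the query above, but fills them with made-up toy values.

```r
# Toy stand-in for the aggregated item-quote table; values are made up.
set.seed(2)
itemquote <- data.frame(
  PeopleId       = 1:400,
  Title          = sample(c("Buyer", "Engineer", "Manager", "Analyst"),
                          400, replace = TRUE),
  NumItemsQuotes = round(rexp(400, rate = 1 / 1000))
)

# Flag users above the 95th percentile of item quotes.
cutoff <- quantile(itemquote$NumItemsQuotes, probs = 0.95)
itemquote$High <- itemquote$NumItemsQuotes > cutoff

# Share of each title's users who fall in the top 5%.
# If a title were predictive, its share would sit well above the overall rate.
share_by_title <- tapply(itemquote$High, itemquote$Title, mean)
overall_share  <- mean(itemquote$High)

print(round(share_by_title, 3))
print(round(overall_share, 3))
```

In the toy data the titles are assigned at random, so the per-title shares should hover around the overall rate, which is the pattern described above for the real titles.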
Given my statistics background, I couldn’t help but run a standard regression model on this data. It quickly becomes apparent that there is a significant amount of clustering, as well as some significant outliers (as previously mentioned) that cause heavy skew. This could be controlled by resampling with replacement (bootstrapping), or possibly by using a smoothing model and eliminating the outliers from the experimental set. However, that would be better determined on a more complete dataset than this small selection with its limited set of variables.
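# Linear regression of item quotes on user ID and title (top-5% subset only)
fit <- lm(NumItemsQuotes ~ PeopleId + Title, data = highitems)
# Standard diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
plot(fit)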
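The bootstrapping idea mentioned above can be sketched as follows. This is a minimal illustration, not the analysis that was run: the numeric predictor `NumSearches` and the data frame `toy` are hypothetical, and the data is simulated with right-skewed noise to mimic the skew described.

```r
# Simulated, right-skewed toy data; not the real dataset.
set.seed(42)
n <- 150
toy <- data.frame(NumSearches = rpois(n, lambda = 20))
toy$NumItemsQuotes <- 50 * toy$NumSearches + rexp(n, rate = 1 / 500)

# Refit the regression on rows resampled with replacement,
# keeping the slope from each refit.
boot_slope <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)
  coef(lm(NumItemsQuotes ~ NumSearches, data = data[idx, ]))[["NumSearches"]]
}
slopes <- replicate(1000, boot_slope(toy))

# Percentile bootstrap interval for the slope.
ci <- quantile(slopes, probs = c(0.025, 0.975))
print(ci)
```

The percentile interval does not rely on normally distributed residuals, which is why resampling is attractive when a few extreme users dominate the fit.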
The profile group containing all the previously listed titles was aggregated for a data summary, but as the table shows (see Figure 1 of the Appendix), even before any further analysis, this is not the correct approach to profiling the best customer base. There appears to be no relationship with searches whatsoever, a minimal but plausible correlation between title and sign-ins, and no plausible explanation for item quotes.
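Relationships of this kind can be quantified in a few lines of base R. The sketch below is one possible approach rather than the method behind the table in the Appendix: the data frame `profile` and the columns `NumSignins` and `NumSearches` are hypothetical names, and all values are simulated.

```r
# Simulated profile table; column names and values are hypothetical.
set.seed(7)
n <- 300
profile <- data.frame(
  Title          = sample(c("Buyer", "Engineer", "Manager"), n, replace = TRUE),
  NumSignins     = rpois(n, lambda = 10),
  NumSearches    = rpois(n, lambda = 30),
  NumItemsQuotes = round(rexp(n, rate = 1 / 1000))
)

# Rank correlation is less sensitive to skew and outliers than Pearson's.
rho <- cor(profile$NumSearches, profile$NumItemsQuotes, method = "spearman")

# Share of sign-in variance explained by title alone
# (R-squared of a one-way linear model).
r2_title <- summary(lm(NumSignins ~ Title, data = profile))$r.squared

print(c(spearman_searches_quotes = rho, r2_title_signins = r2_title))
```

Values near zero for both quantities would match the description of searches as unrelated and title as only weakly related to sign-ins.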
However, I had noticed that selecting a subset by date also returned something that merits more attention: the people who sign in, search, and pull quotes on the weekend.
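A weekend subset like this can be selected from a date column in base R. The sketch below assumes a hypothetical activity log named `activity` with one row per event and a `Date` column; the rows are invented for illustration.

```r
# Toy activity log; column names and values are hypothetical.
activity <- data.frame(
  PeopleId = c(1, 1, 2, 3, 3, 4),
  Date     = as.Date(c("2015-10-24", "2015-10-26", "2015-10-27",
                       "2015-11-01", "2015-11-02", "2015-11-21"))
)

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which avoids locale-dependent day names.
activity$Weekend <- format(activity$Date, "%u") %in% c("6", "7")

# Unique IDs with at least one weekend event.
weekend_users <- unique(activity$PeopleId[activity$Weekend])
print(weekend_users)
```

Using the numeric weekday rather than `weekdays()` keeps the filter independent of the system language setting.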
I then compiled the subset of titled individuals active on the weekend (872 unique IDs) and found that:
After a full selection that disproved all of my initial hypotheses, I stumbled on the conclusion that, if in doubt, one should market to those who work weekends. Further investigation is well merited, especially with richer data and more factors, so that the reasoning can be substantiated. I would expect significant outliers from internal testers of the platform. Still, I would begin with the working hypothesis that when a person accesses the platform is the best indicator of their future use.