In this post I describe how to predict user conversion using R. I perform exploratory analysis of website user data using the R packages ggplot2 and plotly for interactive plots, and build and compare several machine learning models. At the end of the post I summarize the insights that can be drawn from the data about user behavior and describe the optimal actions for the company.

I am working with data on 316 thousand users of a website based in four countries: US, UK, Germany and China. The data include 6 user characteristics: the country, age, whether the user is new, the source through which the user accessed the site (Ads, search, or direct link), the number of pages visited, and whether the user converted, that is, made a purchase.

Data Preparation and Exploration

I start examining the data by looking at the summary statistics table. The table would show any missing data or unusual values.

data <- read.csv('conversion_data.csv')
summary(data)
##     country            age            new_user         source      
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420  
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040  
##  US     :178092   Mean   : 30.57   Mean   :0.6855                  
##                   3rd Qu.: 36.00   3rd Qu.:1.0000                  
##                   Max.   :123.00   Max.   :1.0000                  
##  total_pages_visited   converted      
##  Min.   : 1.000      Min.   :0.00000  
##  1st Qu.: 2.000      1st Qu.:0.00000  
##  Median : 4.000      Median :0.00000  
##  Mean   : 4.873      Mean   :0.03226  
##  3rd Qu.: 7.000      3rd Qu.:0.00000  
##  Max.   :29.000      Max.   :1.00000

The maximum of age is suspiciously high. There are 2 users with ages above 110. I remove these two observations. The remaining users are under 80 years of age. There are no missing values, and the summary statistics look reasonable.

data <- subset(data,age<100)
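The checks described above (no missing values, a couple of implausible ages) can be made explicit before filtering. The snippet below is a minimal sketch on a small made-up data frame, not the real data; the `toy` values and variable names are assumptions for illustration only.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(
  age       = c(25, 31, 123, 42, 111, 19),
  converted = c(0, 1, 0, 0, 1, 0)
)

# Count missing values per column (all zeros means nothing is missing).
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Inspect the rows that the age filter would drop before dropping them.
implausible <- toy[toy$age >= 100, ]
print(implausible)

# Apply the same filter as in the post and confirm the new maximum age.
cleaned <- subset(toy, age < 100)
print(max(cleaned$age))
```

Looking at the dropped rows first makes it easier to judge whether they are data entry errors or something that deserves a closer look.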

To get a first feel for the relationships between the variables I plot a correlogram. It shows correlations for the numeric variables (4 out of 6 in the data). Blue color means that the variables are positively correlated, red, that they are negatively correlated, and the shading and pie charts illustrate the amount of correlation.

library(corrgram)
corrgram(data, upper.panel=panel.pie, main='Correlations Among Numeric Variables')

I can see that the variable of interest, converted, is highly and positively correlated with the number of pages visited - a proxy for the time the users spent on the site and how engaged they were. Other numeric variables are only modestly correlated with the variable ‘converted’.

I will now examine the relationship between conversion and pages visited, and between conversion and age, in greater detail.

Pages Visited and Conversion

I build an interactive graph of the relationship between pages visited and conversion.

library(ggplot2)
library(dplyr)
library(plotly)
grouped_data <- data %>% group_by(total_pages_visited) %>% summarise(conversion_rate=mean(converted))
p <- ggplot(grouped_data,aes(total_pages_visited,conversion_rate))+geom_bar(stat='identity',fill='dark blue')
ggplotly(p)

There is a striking, roughly exponential relationship between pages visited and conversion up to about 15 pages visited. Of the users who visited 15 pages, 74 percent converted. More pages visited are associated with higher conversion rates, and all users who visited more than 20 pages converted.

I suspect that most users do not visit many pages. The histogram below shows how many users visited a particular number of pages.

p <- ggplot(data, aes(total_pages_visited))+geom_histogram(color='black',fill='light blue',binwidth=1)
ggplotly(p)

Most users visit 4 pages or fewer. The conversion rate for users that visit under 5 pages is 0.03 %. However, these plots do not indicate that the company should increase the number of pages that users visit in order to increase conversion. More pages visited is an indicator of the user’s interest in the product and their intent to buy. The company should focus on building that interest and intent. I do not recommend that the company try to influence the number of pages that users visit by, for instance, requiring users to go through more pages to find what they need. The data do not show that pages visited cause conversion, only that the two are highly correlated because both depend on the user’s interest in the product.

User Age and Conversion

Age typically has a nonlinear effect on user purchasing. I plot conversion rate by age to see if I can see nonlinearity in the data.

grouped_data <- data %>% group_by(age) %>% summarise(conversion_rate=mean(converted))
p <- ggplot(grouped_data,aes(age,conversion_rate))+geom_bar(stat='identity',fill='dark blue')
ggplotly(p)

Conversion rate declines with age until age 55. The decline is roughly linear, but it is somewhat more rapid before age 35. I will include nonlinearity in age in my machine learning model if it improves predictive power.

After age 55 the conversion rate fluctuates between 0 and values as high as the rate at age 34. Only 0.5 % of users are 55 or older, and the small number of observations leads to high volatility. I show the histogram of user age below.

p <- ggplot(data, aes(age))+geom_histogram(color='black',fill='light blue',binwidth=2)
ggplotly(p)

The product is most popular with users under 40.
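The volatility of the conversion rate at older ages follows from the small number of users in those groups. As a rough sketch, the standard error of an observed rate p based on n users is sqrt(p(1 - p)/n). The group sizes and the rate below are hypothetical and chosen only to illustrate the effect.

```r
# Standard error of an observed conversion rate p based on n users.
rate_se <- function(p, n) sqrt(p * (1 - p) / n)

p       <- 0.03    # assumed conversion rate, close to the overall 3.2 %
n_large <- 10000   # hypothetical number of users at a common age
n_small <- 50      # hypothetical number of users at a rare age

se_large <- rate_se(p, n_large)
se_small <- rate_se(p, n_small)
print(c(large_group = se_large, small_group = se_small))
```

With these assumed numbers the standard error in the small group is of the same order as the rate itself, which is why the bars for rare ages jump around so much.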

User Country, Sources of Website Traffic, and Conversion

I would now like to see how the remaining user characteristics, ‘country’ and ‘source’, are related to conversion. I display a jitter plot of users by country that converted and did not convert, colored by source of accessing the website.

ggplot(data, aes(country,converted))+geom_jitter(aes(color=source),size=0.5)

It appears that very few users from China converted. The US has the highest number of users that converted, followed by the UK and Germany. I do not see any striking difference in the sources that brought the users from different countries to the site, or any clear relationship between the sources and conversion. There are a lot of blue dots on the graph, indicating that many users came to the site through search.

Let us compare conversion rates by country.

grouped_data <- data %>% group_by(country) %>% summarise(conversion_rate=mean(converted))
grouped_data %>% arrange(conversion_rate)
## # A tibble: 4 x 2
##   country conversion_rate
##    <fctr>           <dbl>
## 1   China     0.001331558
## 2      US     0.037800687
## 3      UK     0.052612025
## 4 Germany     0.062428188

The intuition from the jitter plot was right: Chinese users have an extremely low conversion rate of 0.13 %, compared with 3.8 % for the U.S., 5.3 % for the U.K., and 6.2 % for Germany.
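As a quick check of the figures, the rates printed in the table can be converted to percentages. The vector below simply copies the values from the output above.

```r
# Conversion rates by country, copied from the table above.
rates <- c(China = 0.001331558, US = 0.037800687,
           UK = 0.052612025, Germany = 0.062428188)

# Express the rates as percentages, rounded to two decimals.
percent <- round(100 * rates, 2)
print(percent)
#   China      US      UK Germany
#    0.13    3.78    5.26    6.24
```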

The table below displays conversion rates by source.

grouped_data <- data %>% group_by(source) %>% summarise(conversion_rate=mean(converted))
grouped_data %>% arrange(conversion_rate)
## # A tibble: 3 x 2
##   source conversion_rate
##   <fctr>           <dbl>
## 1 Direct      0.02816901
## 2    Seo      0.03288850
## 3    Ads      0.03447188

Advertising has the highest conversion rate. It may provide a slightly better payoff in terms of conversion, compared to improving visibility in the search results. I now have initial ideas about the data and can begin training the model.

Market Penetration

I was interested in comparing the numbers of users in each country with the country’s population to gauge market penetration. I obtained the current population estimates from indexmundi.com. The numbers are in thousands of population as of 2014. The market penetration rate is the number of users in each country divided by the thousands of population in that country.

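# Populations in thousands (2014), in the order of the factor levels: China, Germany, UK, US
populations <- c(1355693,80997,63743,318892)
# Users per thousand of population; named 'penetration' to avoid masking base::sum
penetration <- summary(data$country)/populations
cat('Users Per Thousand of Population\n')
print(penetration)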
## Users Per Thousand of Population
##      China    Germany         UK         US 
## 0.05650394 0.16117881 0.76006777 0.55847121

China has the fewest users per thousand of population. This is another indication that the Chinese market is under-performing. Germany has significantly fewer users per thousand of population than the UK and the US. It also has the highest conversion rate, as shown in the previous section. The company should attract new German users as well as invite existing German users back to the site.
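To put the gap between Germany and the other markets in numbers, the penetration figures printed above can be expressed relative to Germany. The vector below copies the values from the output.

```r
# Users per thousand of population, copied from the output above.
penetration <- c(China = 0.05650394, Germany = 0.16117881,
                 UK = 0.76006777, US = 0.55847121)

# Penetration of each market relative to Germany.
ratio_to_germany <- penetration / penetration["Germany"]
print(round(ratio_to_germany, 2))
```

On these figures the U.S. has roughly 3.5 times and the U.K. roughly 4.7 times the German penetration.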

Building the Model

Train and Test Split

I split the data into training and testing subsets using a split ratio of 70 %.

library(caTools)
data$converted <- as.factor(data$converted)
split <- sample.split(data$converted,SplitRatio = 0.7)
train <- filter(data,split)
test <- filter(data,!split)
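With only about 3 % of users converting, it is worth confirming that the share of converted users is similar in both subsets. The sketch below does a stratified 70/30 split by hand in base R on simulated data. It illustrates the idea and is not the internals of `sample.split`; the seed, the `toy` data frame, and the variable names are assumptions.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Simulated outcome with roughly 3 % positives (not the real data).
toy <- data.frame(id = 1:10000,
                  converted = factor(rbinom(10000, 1, 0.03)))

# Draw 70 % of the row indices within each class, so the split is stratified.
rows_by_class <- split(seq_len(nrow(toy)), toy$converted)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), floor(0.7 * length(idx)))]
}))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The share of converted users should be nearly identical in both subsets.
share_train <- mean(toy_train$converted == "1")
share_test  <- mean(toy_test$converted == "1")
print(c(train = share_train, test = share_test))
```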

Logistic Regression

I start with a logistic regression classifier. This model has good predictive power in situations when the explanatory variables affect the log odds of converting approximately linearly, and nonlinear effects and interaction terms are relatively unimportant. I experiment with adding age squared explicitly, because I expect age to have a nonlinear effect on conversion; the specification shown below uses age linearly. The model takes almost a minute to converge on 220 thousand observations, so 10-fold cross validation may take up to 10 minutes. The data set is large and does not necessarily require cross validation.

logit <- glm(converted~country+age+new_user+source+total_pages_visited+total_pages_visited,family = binomial,data=train)
##summary(logit)

I evaluate the predictive power of the logistic regression on the test set.

# Classify a user as converted when the predicted probability is at least 0.5
predicted.converted <- ifelse(predict(logit,newdata=test,type='response')<0.5,0,1)
print(misclassError <- mean(predicted.converted!=test$converted))
## [1] 0.0142949

I repeat the train and test split, re-estimate the model, and recompute misclassError. The error varies between 0.013 and 0.014. The benchmark for the error is the frequency of converting, 0.032: a model that predicts that no user converts would misclassify 3.2 % of users.
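The procedure of repeating the split and comparing against the naive benchmark can be sketched as follows. The data are simulated, so the numbers it prints are not the ones reported in this post; the data-generating coefficients, the seed, and the helper `one_round` are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility of the sketch

# Simulated data (not the real data): conversion depends on pages visited.
n <- 20000
pages <- rpois(n, 4) + 1
converted <- rbinom(n, 1, plogis(-10 + 1.0 * pages))
sim <- data.frame(pages = pages, converted = converted)

# One round: random 70/30 split, fit the model, return the test error.
one_round <- function(d) {
  in_train <- runif(nrow(d)) < 0.7
  fit <- glm(converted ~ pages, family = binomial, data = d[in_train, ])
  prob <- predict(fit, newdata = d[!in_train, ], type = "response")
  mean((prob >= 0.5) != d$converted[!in_train])
}

# Repeat the split several times to see how much the error varies.
errors <- replicate(5, one_round(sim))
print(range(errors))

# Baseline: always predict "not converted"; its error equals the conversion rate.
baseline_error <- mean(sim$converted)
print(baseline_error)
```

A model is only useful for classification if its test error is clearly below this baseline.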

Random Forest

The random forest algorithm does not assume a linear relationship between the explanatory variables and the log odds of converting. It is more flexible than logistic regression, but is typically more computationally intensive. The algorithm implemented below uses 500 trees.

library(randomForest)
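# Fit a random forest on all predictors; importance=TRUE stores the variable importance measures
rf <- randomForest(converted~., data=train, importance=TRUE)
predicted.converted <- predict(rf,newdata=test,type='response')
print(misclassError <- mean(predicted.converted!=test$converted))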
## [1] 0.01445303

The random forest model takes 5 minutes to train. Its predictive accuracy is slightly worse than that of the logistic regression. I display the importance table below to interpret the results.

rf$importance
##                                 0            1 MeanDecreaseAccuracy
## country              3.078265e-05 0.1036343288         3.372249e-03
## age                  7.860878e-05 0.0306408028         1.064503e-03
## new_user             1.701976e-04 0.0851227223         2.910626e-03
## source              -2.956793e-06 0.0005943617         1.632369e-05
## total_pages_visited  1.415606e-02 0.6277337766         3.394683e-02
##                     MeanDecreaseGini
## country                    385.50683
## age                        397.52422
## new_user                   399.32366
## source                      82.81326
## total_pages_visited       7963.78249

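The number of pages visited is by far the most important variable in predicting conversion on both measures. By mean decrease in Gini it is followed by whether the user is new, age, and country, which are close to one another, and then source. By mean decrease in accuracy the order is country, whether the user is new, age, and source.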
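The two importance measures rank the variables differently, which is easy to see by sorting each column. The matrix below copies the last two columns of the table above; with the fitted model available, `rf$importance` could be used in its place.

```r
# Importance values copied from the table above (last two columns).
importance <- matrix(
  c(3.372249e-03,  385.50683,
    1.064503e-03,  397.52422,
    2.910626e-03,  399.32366,
    1.632369e-05,   82.81326,
    3.394683e-02, 7963.78249),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("country", "age", "new_user", "source", "total_pages_visited"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini"))
)

# Variable names ordered from most to least important on a given measure.
rank_by <- function(measure) {
  names(sort(importance[, measure], decreasing = TRUE))
}

print(rank_by("MeanDecreaseAccuracy"))
print(rank_by("MeanDecreaseGini"))
```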

Support Vector Machine

I build an SVM model with a radial basis kernel. This is another flexible model that does not require linearity. However, SVMs do not scale well with the size of the data, and the estimation is computationally intensive. An SVM can achieve low out-of-sample error with tuning, but tuning on a data set as large as this one would take a long time, perhaps several hours.

library(e1071)
SVM <- svm(converted~.,data=train)

The algorithm took 10 minutes to converge. Let us assess its performance.

# predict() on an svm classification model returns the predicted class labels
predicted.converted <- predict(SVM, newdata = test)
print(misclassError <- mean(predicted.converted != test$converted))
## [1] 0.01453737

The SVM with default cost and gamma parameters performs slightly worse than the other methods. I will not invest in tuning the SVM at this point.

Interpreting the Results

I choose the logistic regression model, because it is fast and has the best predictive power of the three. I find that excluding age squared gives a slightly lower out-of-sample error, so the final specification uses age linearly.

logit <- glm(converted~country+age+new_user+source+total_pages_visited+total_pages_visited,family = binomial,data=train)
summary(logit)
## 
## Call:
## glm(formula = converted ~ country + age + new_user + source + 
##     total_pages_visited + total_pages_visited, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0954  -0.0619  -0.0234  -0.0093   4.4368  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -10.399846   0.181773 -57.213  < 2e-16 ***
## countryGermany        3.910935   0.159078  24.585  < 2e-16 ***
## countryUK             3.642921   0.145544  25.030  < 2e-16 ***
## countryUS             3.281130   0.141269  23.226  < 2e-16 ***
## age                  -0.075525   0.002856 -26.449  < 2e-16 ***
## new_user             -1.763422   0.042839 -41.164  < 2e-16 ***
## sourceDirect         -0.170547   0.058628  -2.909  0.00363 ** 
## sourceSeo            -0.041116   0.047910  -0.858  0.39078    
## total_pages_visited   0.766070   0.007542 101.572  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63078  on 221338  degrees of freedom
## Residual deviance: 17773  on 221330  degrees of freedom
## AIC: 17791
## 
## Number of Fisher Scoring iterations: 10

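The total pages visited has the highest z value, indicating that it makes the greatest contribution to predicting conversion, followed by new_user and age. All variables are highly significant in predicting conversion except source: the Direct indicator is significant at the 1 % level, and the Seo indicator is not statistically significant (both are measured relative to the baseline, Ads).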
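Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The vector below copies the estimates from the summary above; with the fitted model available, `exp(coef(logit))` would give the same quantities.

```r
# Coefficient estimates copied from the summary above (intercept omitted).
coefs <- c(countryGermany = 3.910935, countryUK = 3.642921,
           countryUS = 3.281130, age = -0.075525, new_user = -1.763422,
           sourceDirect = -0.170547, sourceSeo = -0.041116,
           total_pages_visited = 0.766070)

# Odds ratio: multiplicative change in the odds of converting
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, each additional page visited multiplies the odds of converting by about 2.15, while being a new user multiplies them by about 0.17, holding the other variables fixed.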

The variable ‘total pages visited’ reflects the user interest in the product, which also causes conversion. I re-estimate the logistic regression model without this variable. I do not present the results here, but all relationships remain the same.

Additional Data Exploration: Chinese Users

The Chinese users have a disproportionately low conversion rate. Before I give the final conclusions, I would like to compare the observed characteristics of Chinese users with the other users, and perhaps obtain an insight into the situation.

library(gridExtra)
p1 <- ggplot(filter(data,country!='China'), aes(age, total_pages_visited))+geom_point(aes(color=as.factor(new_user)))+ggtitle('U.S. and Europe')
p2 <- ggplot(filter(data, country=='China'), aes(age, total_pages_visited))+geom_point(aes(color=as.factor(new_user)))+ggtitle('China')
grid.arrange(p2, p1, nrow=2)

The company was likely active in the U.S. and European markets longer than in the Chinese market. Most Chinese users are new. New users do not visit as many pages and are less likely to convert.

In China very few users visit more than 17-18 pages. This may indicate issues with the Chinese website. Young Chinese users are not substantially more engaged than older users. The company should focus its marketing efforts on young Chinese users.

I will examine whether Chinese users come to the site through Ads, search, or directly, compared to other users.

p1 <- ggplot(filter(data,country!='China'), aes(source))+geom_bar(color='black', aes(fill=as.factor(new_user)))+ggtitle('U.S. and Europe')
p2 <- ggplot(filter(data, country=='China'), aes(source))+geom_bar(color='black',aes(fill=as.factor(new_user)))+ggtitle('China')
grid.arrange(p2, p1, nrow=2)

The sources of Chinese users are similar to those of other users.
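The bar charts compare raw counts, and the two groups differ a lot in size, so row-wise proportions make the comparison more direct. The sketch below uses simulated data rather than the real data; the seed, the `toy` data frame, and the `region` column are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed; the data below are simulated, not the real data

n <- 5000
toy <- data.frame(
  country  = sample(c("China", "Germany", "UK", "US"), n, replace = TRUE),
  source   = sample(c("Ads", "Direct", "Seo"), n, replace = TRUE),
  new_user = rbinom(n, 1, 0.7)
)
toy$region <- ifelse(toy$country == "China", "China", "U.S. and Europe")

# Row-wise shares: each row sums to 1, so regions of different size are comparable.
source_share <- prop.table(table(toy$region, toy$source), margin = 1)
print(round(source_share, 3))

# Share of new users in each region.
new_share <- tapply(toy$new_user, toy$region, mean)
print(round(new_share, 3))
```

Applied to the real data, the first table would quantify how similar the traffic sources are, and the second would show how much larger the share of new users is in China.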

Conclusions:

  1. I build a logistic regression model for predicting conversion. It has high predictive power and is not as computationally intensive as the more flexible machine learning models, such as random forest and SVM.

  2. Total pages visited is the greatest predictor of conversion. However, the company should not try to influence the number of pages that users visit directly, but encourage user interest in the product, which affects both the total pages visited and conversion. Users that visited many pages but did not convert likely had high intent to buy. The company can target these users with reminders or promotions.

  3. Users with existing accounts convert more. The company should focus on retaining existing users and encouraging them to come back to the site.

  4. Conversion rate declines rapidly with age. Marketing should be focused on attracting and retaining young users. In addition, the company may want to invest in studying why older users convert less. Is the product less attractive for them, or is the website inconvenient for them?

  5. Chinese users have a conversion rate of 0.13 %, compared with 4-6 % for users from the US and Europe. This difference is not explained by the characteristics of the Chinese users that we know from the data, such as the users’ age, or whether they are new. The company should look at the Chinese website and the behavior of Chinese users closely. This is a top priority because China has 77 thousand users, more than Germany and the UK combined. Furthermore, China has very low market penetration compared to the other countries. If the product for Chinese users could be improved, the opportunities for growth are great.

  6. German users have the highest conversion rate, 6.2 %, compared with 5.3 % for UK users and 3.8 % for US users. The market penetration in Germany is roughly 3.5 to 4.7 times lower than that of the U.S. and the U.K. The company should grow the user base in Germany and market to existing German users. The U.K. market has the highest market penetration among all countries, and may be approaching saturation. The company would benefit from a more explicit analysis of market saturation to study whether the U.K. market is in fact saturated. The U.S. market penetration is not as high as that of the U.K., but its conversion rate is the lowest of the three. Therefore an investment in marketing in the U.S. may not be cost effective.