Synopsis

This analysis looks at Company X’s events database for time period Jan 1, 2015 through May 30, 2014. Event data is captured when user view a providers’ page on the website, click out to providers website and submit a review for a provider. The goal of this analysis is to gain insights into the AH provider population and performance.

Data Processing

Event data is read from csv file extracted from Company X’s database.

## Set the working directory and read in the raw data
setwd("C:/Users/jlee/Documents/Personal/AH/")
data <- read.csv("ALL Provider Stats May29 2015.csv")
suppressMessages(library(dplyr))

Here is a summary of the dataset

summary(data)
##        id                   city           state      
##  Min.   :    1   New York     :  861   CA     : 7803  
##  1st Qu.:15531   Los Angeles  :  590   NY     : 3122  
##  Median :26052   San Francisco:  569   TX     : 2907  
##  Mean   :28038   San Jose     :  413   PA     : 2189  
##  3rd Qu.:44459   Austin       :  388   MA     : 1364  
##  Max.   :56869   Houston      :  357   IL     : 1353  
##                  (Other)      :36786   (Other):21226  
##                          name         listing_type   premium_listing_level
##  Huntington Learning Center:  160   diamond :  130   Min.   :-10.00       
##  Just For Kix              :  128   free    :39440   1st Qu.:  0.00       
##  Color Me Mine             :   80   gold    :  227   Median :  0.00       
##  LearningRx                :   45   platinum:  167   Mean   :  0.22       
##  YMCA                      :   34                    3rd Qu.:  0.00       
##  Sylvan Learning           :   23                    Max.   : 20.00       
##  (Other)                   :39494                                         
##    pageviews            clicks         X2015.sessions   
##  Min.   :    0.00   Min.   :   0.000   Min.   :  0.000  
##  1st Qu.:    0.00   1st Qu.:   0.000   1st Qu.:  0.000  
##  Median :    0.00   Median :   0.000   Median :  0.000  
##  Mean   :   12.28   Mean   :   1.136   Mean   :  0.776  
##  3rd Qu.:    1.00   3rd Qu.:   0.000   3rd Qu.:  0.000  
##  Max.   :17006.00   Max.   :1693.000   Max.   :256.000  
##                                        NA's   :6217     
##  reviews_count      has_class has_camp  has_birthday_party
##  Min.   : 0.00000    : 2081    : 4045    : 5409           
##  1st Qu.: 0.00000   f:32491   f:24081   f:33617           
##  Median : 0.00000   t: 5392   t:11838   t:  938           
##  Mean   : 0.02144                                         
##  3rd Qu.: 0.00000                                         
##  Max.   :38.00000                                         
##                                                           
##  has_activity_photo has_business_photo    owner_id       updated_by   
##  f:38276            f:11702            Min.   :   13   Min.   :    1  
##  t: 1688            t:28262            1st Qu.: 5711   1st Qu.: 7834  
##                                        Median : 7295   Median :22191  
##                                        Mean   :16803   Mean   :24276  
##                                        3rd Qu.:26873   3rd Qu.:43196  
##                                        Max.   :61531   Max.   :61493  
##                                        NA's   :34078   NA's   :37090  
##       created_at        updated_at    created.updated
##  12/8/2014 :10853   2/26/2015: 3409   Mode :logical  
##  12/9/2014 : 7001   9/27/2014: 2628   FALSE:39516    
##  12/27/2014: 4789   3/13/2015: 1372   TRUE :448      
##  12/5/2014 : 2856   2/12/2015: 1183   NA's :0        
##  12/15/2014: 2177   4/12/2015: 1068                  
##  3/13/2015 : 1671   4/11/2015: 1062                  
##  (Other)   :10617   (Other)  :29242

Notice that for pageviews and clicks, the max is much larger than the mean. This indicates that there are a few providers with very large pageviews and clicks.

summary(data$pageviews)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    12.28     1.00 17010.00
summary(data$clicks)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    1.136    0.000 1693.000

For purposes of analysis, we will exclude the top 43 providers with pageviews > 1000.

data <- data[data$pageviews <= 1000,]

Analysis & Insights

About 96% (38,418 out of 39,921) of remaining providers fall within the first bin of the histogram below, each receiving less than 50 pageviews in 2015 through May 30th.

library(ggplot2)
## Histogram of camps by clicks.
hist(data$pageviews, breaks=20,main = "Histogram of Providers by Pageviews",col="steelblue",labels=TRUE,xlab="Pageviews",ylab="# of Providers")

About 99% (39,640 out of 39,921) of remaining providers fall within the first bin of the histogram below, each receiving less than 25 clicks in 2015 through May 30th.

## Histogram of camps by clicks.
hist(data$clicks,main = "Histogram of Providers by Clicks",col="steelblue",labels=TRUE,xlab="Clicks",ylab="# of Providers")

The following treemap shows pageviews and clicks by State. The size of the rectangle are based on pageviews whereas the color of the rectangle is based on clicks.

library(treemap)
treemap(data,index="state",vSize="pageviews",vColor="clicks",type="value")

The following treemap shows pageviews and clicks by Listing Type. The size of the rectangle are based on pageviews whereas the color of the rectangle is based on clicks.

treemap(data,index="listing_type",vSize="pageviews",vColor="clicks",type="value")

The scatterplots inidicate that providers in Gold and Platinum categories have better click through rates versus providers in the Free category, as evidenced by the steeper slope of the best fit line in the scatterplots below. However, Diamond underperforms the Free category.

## Scatterplots by listing type and premimum listing levels
g <- ggplot(data,aes(pageviews,clicks))
g <- g + geom_point(color="steelblue",size=3,alpha=1/2) + geom_smooth(method="lm")+theme_bw()
g <- g + facet_grid(.~listing_type)
g

We see some modest correlation between pageviews and the number of reviews for a provider in selected top states.

## scatterplot of pageviews versus # of reviews for selected states
data <- data[data$state %in% c("CA","PA","TX","NY","GA","VA","WA","IL","NJ","OH","MD","FL","MA","AZ"),]
p <- ggplot(data,aes(reviews_count,pageviews))
p <- p + geom_point(aes(color=listing_type),alpha=1/2) + geom_smooth(method="lm")+theme_bw()
p <- p + facet_wrap(~state,ncol=3)
p

What are the average pageviews, clicks and CTRs per provider, by state and listing type in selected top states?

library(dplyr)
data %>% 
  group_by(state,listing_type) %>%
  select(pageviews,clicks) %>%
  summarize(NumProviders=n(),AvgPVs=mean(pageviews),AvgClicks=mean(clicks),AvgCTR=AvgClicks/AvgPVs) %>%
  print.data.frame()
##    state listing_type NumProviders     AvgPVs  AvgClicks     AvgCTR
## 1     AZ      diamond            1   3.000000  1.0000000 0.33333333
## 2     AZ         free          562   4.631673  0.3220641 0.06953515
## 3     AZ         gold            4  72.000000  5.2500000 0.07291667
## 4     CA      diamond           88 343.556818 26.9772727 0.07852347
## 5     CA         free         7438  16.623958  1.3881420 0.08350249
## 6     CA         gold          112 183.723214 23.9375000 0.13029110
## 7     CA     platinum          131 218.946565 23.5572519 0.10759361
## 8     FL         free         1284   2.541277  0.3161994 0.12442538
## 9     FL         gold            3 125.000000 11.6666667 0.09333333
## 10    FL     platinum            1  35.000000  1.0000000 0.02857143
## 11    GA      diamond            1 373.000000 23.0000000 0.06166220
## 12    GA         free         1240   4.095161  0.5233871 0.12780622
## 13    GA         gold            2 116.000000  8.5000000 0.07327586
## 14    GA     platinum            2 143.000000 14.5000000 0.10139860
## 15    IL      diamond            1  30.000000  0.0000000 0.00000000
## 16    IL         free         1341   5.492916  0.2706935 0.04928048
## 17    IL         gold            9  88.333333  5.5555556 0.06289308
## 18    IL     platinum            2  95.500000  4.0000000 0.04188482
## 19    MA      diamond            1  75.000000 12.0000000 0.16000000
## 20    MA         free         1361   2.081558  0.1719324 0.08259795
## 21    MA         gold            1  43.000000  2.0000000 0.04651163
## 22    MD      diamond            2 114.000000 22.5000000 0.19736842
## 23    MD         free          792   5.193182  0.4116162 0.07926088
## 24    MD         gold            4  85.500000 10.5000000 0.12280702
## 25    MD     platinum            2  97.500000 10.0000000 0.10256410
## 26    NJ      diamond            1  66.000000 10.0000000 0.15151515
## 27    NJ         free         1053   5.946819  0.3722697 0.06259981
## 28    NJ         gold           11  82.181818  4.9090909 0.05973451
## 29    NY      diamond            3 137.000000  8.6666667 0.06326034
## 30    NY         free         3110   4.433119  0.1842444 0.04156089
## 31    NY         gold            5 137.000000 10.8000000 0.07883212
## 32    NY     platinum            4 179.750000 15.5000000 0.08623088
## 33    OH         free         1229   5.070789  0.4092758 0.08071245
## 34    OH         gold            5  71.400000  7.0000000 0.09803922
## 35    PA      diamond            4 274.500000 27.0000000 0.09836066
## 36    PA         free         2161  11.584452  0.9435447 0.08144923
## 37    PA         gold           13 127.076923 10.7692308 0.08474576
## 38    PA     platinum            8 292.000000 30.6250000 0.10488014
## 39    TX      diamond            2 150.500000 15.5000000 0.10299003
## 40    TX         free         2891   7.351090  0.7253545 0.09867307
## 41    TX         gold            9  72.000000  7.0000000 0.09722222
## 42    TX     platinum            3 151.666667  2.3333333 0.01538462
## 43    VA      diamond            2 181.500000 29.0000000 0.15977961
## 44    VA         free          903   5.011074  0.3787375 0.07558011
## 45    VA         gold            2  85.500000  6.0000000 0.07017544
## 46    VA     platinum            4 207.250000 25.2500000 0.12183353
## 47    WA      diamond            2 357.000000 31.5000000 0.08823529
## 48    WA         free         1015   3.336946  0.2591133 0.07764984
## 49    WA         gold            9 107.444444 11.7777778 0.10961737
## 50    WA     platinum            2 135.000000 22.0000000 0.16296296

What are the average pageviews, clicks and CTRs per provider, by city for selected top states? Write data to excel.

suppressMessages(library(xlsx))
c <- data %>% 
  group_by(state,city,listing_type) %>%
  select(pageviews,clicks) %>%
  summarize(NumProviders=n(),AvgPVs=mean(pageviews),AvgClicks=mean(clicks),AvgCTR=AvgClicks/AvgPVs)
write.xlsx(c,file="AHCity.xlsx")