This analysis looks at Company X’s events database for time period Jan 1, 2015 through May 30, 2014. Event data is captured when user view a providers’ page on the website, click out to providers website and submit a review for a provider. The goal of this analysis is to gain insights into the AH provider population and performance.
Event data is read from csv file extracted from Company X’s database.
## Set the working directory and read in the raw data
setwd("C:/Users/jlee/Documents/Personal/AH/")
data <- read.csv("ALL Provider Stats May29 2015.csv")
suppressMessages(library(dplyr))
Here is a summary of the dataset
summary(data)
## id city state
## Min. : 1 New York : 861 CA : 7803
## 1st Qu.:15531 Los Angeles : 590 NY : 3122
## Median :26052 San Francisco: 569 TX : 2907
## Mean :28038 San Jose : 413 PA : 2189
## 3rd Qu.:44459 Austin : 388 MA : 1364
## Max. :56869 Houston : 357 IL : 1353
## (Other) :36786 (Other):21226
## name listing_type premium_listing_level
## Huntington Learning Center: 160 diamond : 130 Min. :-10.00
## Just For Kix : 128 free :39440 1st Qu.: 0.00
## Color Me Mine : 80 gold : 227 Median : 0.00
## LearningRx : 45 platinum: 167 Mean : 0.22
## YMCA : 34 3rd Qu.: 0.00
## Sylvan Learning : 23 Max. : 20.00
## (Other) :39494
## pageviews clicks X2015.sessions
## Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.00 Median : 0.000 Median : 0.000
## Mean : 12.28 Mean : 1.136 Mean : 0.776
## 3rd Qu.: 1.00 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :17006.00 Max. :1693.000 Max. :256.000
## NA's :6217
## reviews_count has_class has_camp has_birthday_party
## Min. : 0.00000 : 2081 : 4045 : 5409
## 1st Qu.: 0.00000 f:32491 f:24081 f:33617
## Median : 0.00000 t: 5392 t:11838 t: 938
## Mean : 0.02144
## 3rd Qu.: 0.00000
## Max. :38.00000
##
## has_activity_photo has_business_photo owner_id updated_by
## f:38276 f:11702 Min. : 13 Min. : 1
## t: 1688 t:28262 1st Qu.: 5711 1st Qu.: 7834
## Median : 7295 Median :22191
## Mean :16803 Mean :24276
## 3rd Qu.:26873 3rd Qu.:43196
## Max. :61531 Max. :61493
## NA's :34078 NA's :37090
## created_at updated_at created.updated
## 12/8/2014 :10853 2/26/2015: 3409 Mode :logical
## 12/9/2014 : 7001 9/27/2014: 2628 FALSE:39516
## 12/27/2014: 4789 3/13/2015: 1372 TRUE :448
## 12/5/2014 : 2856 2/12/2015: 1183 NA's :0
## 12/15/2014: 2177 4/12/2015: 1068
## 3/13/2015 : 1671 4/11/2015: 1062
## (Other) :10617 (Other) :29242
Notice that for pageviews and clicks, the max is much larger than the mean. This indicates that there are a few providers with very large pageviews and clicks.
summary(data$pageviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.28 1.00 17010.00
summary(data$clicks)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.136 0.000 1693.000
For purposes of analysis, we will exclude the top 43 providers with pageviews > 1000.
data <- data[data$pageviews <= 1000,]
About 96% (38,418 out of 39,921) of remaining providers fall within the first bin of the histogram below, each receiving less than 50 pageviews in 2015 through May 30th.
library(ggplot2)
## Histogram of camps by clicks.
hist(data$pageviews, breaks=20,main = "Histogram of Providers by Pageviews",col="steelblue",labels=TRUE,xlab="Pageviews",ylab="# of Providers")
About 99% (39,640 out of 39,921) of remaining providers fall within the first bin of the histogram below, each receiving less than 25 clicks in 2015 through May 30th.
## Histogram of camps by clicks.
hist(data$clicks,main = "Histogram of Providers by Clicks",col="steelblue",labels=TRUE,xlab="Clicks",ylab="# of Providers")
The following treemap shows pageviews and clicks by State. The size of the rectangle are based on pageviews whereas the color of the rectangle is based on clicks.
library(treemap)
treemap(data,index="state",vSize="pageviews",vColor="clicks",type="value")
The following treemap shows pageviews and clicks by Listing Type. The size of the rectangle are based on pageviews whereas the color of the rectangle is based on clicks.
treemap(data,index="listing_type",vSize="pageviews",vColor="clicks",type="value")
The scatterplots inidicate that providers in Gold and Platinum categories have better click through rates versus providers in the Free category, as evidenced by the steeper slope of the best fit line in the scatterplots below. However, Diamond underperforms the Free category.
## Scatterplots by listing type and premimum listing levels
g <- ggplot(data,aes(pageviews,clicks))
g <- g + geom_point(color="steelblue",size=3,alpha=1/2) + geom_smooth(method="lm")+theme_bw()
g <- g + facet_grid(.~listing_type)
g
We see some modest correlation between pageviews and the number of reviews for a provider in selected top states.
## scatterplot of pageviews versus # of reviews for selected states
data <- data[data$state %in% c("CA","PA","TX","NY","GA","VA","WA","IL","NJ","OH","MD","FL","MA","AZ"),]
p <- ggplot(data,aes(reviews_count,pageviews))
p <- p + geom_point(aes(color=listing_type),alpha=1/2) + geom_smooth(method="lm")+theme_bw()
p <- p + facet_wrap(~state,ncol=3)
p
What are the average pageviews, clicks and CTRs per provider, by state and listing type in selected top states?
library(dplyr)
data %>%
group_by(state,listing_type) %>%
select(pageviews,clicks) %>%
summarize(NumProviders=n(),AvgPVs=mean(pageviews),AvgClicks=mean(clicks),AvgCTR=AvgClicks/AvgPVs) %>%
print.data.frame()
## state listing_type NumProviders AvgPVs AvgClicks AvgCTR
## 1 AZ diamond 1 3.000000 1.0000000 0.33333333
## 2 AZ free 562 4.631673 0.3220641 0.06953515
## 3 AZ gold 4 72.000000 5.2500000 0.07291667
## 4 CA diamond 88 343.556818 26.9772727 0.07852347
## 5 CA free 7438 16.623958 1.3881420 0.08350249
## 6 CA gold 112 183.723214 23.9375000 0.13029110
## 7 CA platinum 131 218.946565 23.5572519 0.10759361
## 8 FL free 1284 2.541277 0.3161994 0.12442538
## 9 FL gold 3 125.000000 11.6666667 0.09333333
## 10 FL platinum 1 35.000000 1.0000000 0.02857143
## 11 GA diamond 1 373.000000 23.0000000 0.06166220
## 12 GA free 1240 4.095161 0.5233871 0.12780622
## 13 GA gold 2 116.000000 8.5000000 0.07327586
## 14 GA platinum 2 143.000000 14.5000000 0.10139860
## 15 IL diamond 1 30.000000 0.0000000 0.00000000
## 16 IL free 1341 5.492916 0.2706935 0.04928048
## 17 IL gold 9 88.333333 5.5555556 0.06289308
## 18 IL platinum 2 95.500000 4.0000000 0.04188482
## 19 MA diamond 1 75.000000 12.0000000 0.16000000
## 20 MA free 1361 2.081558 0.1719324 0.08259795
## 21 MA gold 1 43.000000 2.0000000 0.04651163
## 22 MD diamond 2 114.000000 22.5000000 0.19736842
## 23 MD free 792 5.193182 0.4116162 0.07926088
## 24 MD gold 4 85.500000 10.5000000 0.12280702
## 25 MD platinum 2 97.500000 10.0000000 0.10256410
## 26 NJ diamond 1 66.000000 10.0000000 0.15151515
## 27 NJ free 1053 5.946819 0.3722697 0.06259981
## 28 NJ gold 11 82.181818 4.9090909 0.05973451
## 29 NY diamond 3 137.000000 8.6666667 0.06326034
## 30 NY free 3110 4.433119 0.1842444 0.04156089
## 31 NY gold 5 137.000000 10.8000000 0.07883212
## 32 NY platinum 4 179.750000 15.5000000 0.08623088
## 33 OH free 1229 5.070789 0.4092758 0.08071245
## 34 OH gold 5 71.400000 7.0000000 0.09803922
## 35 PA diamond 4 274.500000 27.0000000 0.09836066
## 36 PA free 2161 11.584452 0.9435447 0.08144923
## 37 PA gold 13 127.076923 10.7692308 0.08474576
## 38 PA platinum 8 292.000000 30.6250000 0.10488014
## 39 TX diamond 2 150.500000 15.5000000 0.10299003
## 40 TX free 2891 7.351090 0.7253545 0.09867307
## 41 TX gold 9 72.000000 7.0000000 0.09722222
## 42 TX platinum 3 151.666667 2.3333333 0.01538462
## 43 VA diamond 2 181.500000 29.0000000 0.15977961
## 44 VA free 903 5.011074 0.3787375 0.07558011
## 45 VA gold 2 85.500000 6.0000000 0.07017544
## 46 VA platinum 4 207.250000 25.2500000 0.12183353
## 47 WA diamond 2 357.000000 31.5000000 0.08823529
## 48 WA free 1015 3.336946 0.2591133 0.07764984
## 49 WA gold 9 107.444444 11.7777778 0.10961737
## 50 WA platinum 2 135.000000 22.0000000 0.16296296
What are the average pageviews, clicks and CTRs per provider, by city for selected top states? Write data to excel.
suppressMessages(library(xlsx))
c <- data %>%
group_by(state,city,listing_type) %>%
select(pageviews,clicks) %>%
summarize(NumProviders=n(),AvgPVs=mean(pageviews),AvgClicks=mean(clicks),AvgCTR=AvgClicks/AvgPVs)
write.xlsx(c,file="AHCity.xlsx")