First, we will want to load the libraries necessary for data manipulation and visualization:
library(ggplot2) #excellent viz library
library(dplyr) #data manipulation
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate) #makes handling timestamps a breeze!
library(scales) #for pretty printing percent
This assumes the working directory contains the TSV file:
search.df <- read.csv("search_dataset.tsv", header=TRUE, sep="\t", encoding = "UTF-8")
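If that assumption fails, read.csv() throws a "cannot open file" error; a small guard makes the failure clearer (a sketch, with an illustrative message):
if (!file.exists("search_dataset.tsv")) {
  # Fail early and loudly instead of letting read.csv() error cryptically
  stop("search_dataset.tsv not found in working directory: ", getwd())
}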
Let’s take a look at a few lines of the data:
head(search.df)
## timestamp event_action event_timeToDisplayResults
## 1 2.015032e+13 start NA
## 2 2.015033e+13 results 474
## 3 2.015021e+13 results 316
## 4 2.015041e+13 start NA
## 5 2.015023e+13 results 470
## 6 2.015043e+13 results 474
Yikes. That timestamp is a nightmare! Let’s convert it to something more human-readable:
search.df <- search.df %>%
  mutate(timestamp = ymd_hms(timestamp, tz = "US/Pacific"))
## Warning: 10 failed to parse.
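Those 10 failures become NA timestamps, which we will meet again as NA's in summary() below. To isolate them for inspection, a one-liner sketch:
search.df %>% filter(is.na(timestamp))  # the 10 rows whose timestamps failed to parse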
Let’s verify the successful conversion:
head(search.df)
## timestamp event_action event_timeToDisplayResults
## 1 2015-03-21 08:33:06 start NA
## 2 2015-03-31 22:39:51 results 474
## 3 2015-02-12 20:16:52 results 316
## 4 2015-04-12 10:05:15 start NA
## 5 2015-02-28 09:55:17 results 470
## 6 2015-04-25 22:18:13 results 474
Awesome! Let’s forge ahead.
summary(search.df)
## timestamp event_action event_timeToDisplayResults
## Min. :2014-11-25 21:17:51 click :142334 Min. : -17501.0
## 1st Qu.:2015-01-19 15:42:53 results:693723 1st Qu.: 333.0
## Median :2015-02-28 05:40:47 start :163943 Median : 446.0
## Mean :2015-02-27 07:30:35 Mean : 755.7
## 3rd Qu.:2015-04-07 21:26:57 3rd Qu.: 620.0
## Max. :2015-05-15 11:34:39 Max. :1715767.0
## NA's :10 NA's :306277
We see the timestamp's median and mean are similar, perhaps indicating a roughly normal distribution; see the histogram below for more.
The counts are not exactly 1:1 between start and results, which indicates users are served several results pages before clicking through to a result. Clicks, on the other hand, are much closer to 1:1 with starts. Let's see how close:
sum_start <- sum(search.df$event_action == "start")
sum_results <- sum(search.df$event_action == "results")
sum_click <- sum(search.df$event_action == "click")
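As a sanity check, these sums should reproduce the event_action counts in the summary above (163,943 starts, 693,723 results pages, 142,334 clicks); an equivalent one-liner:
table(search.df$event_action)  # counts of each event type in one call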
In marketing there's the concept of the click-through rate, a measure of the effectiveness of an ad. Using results as the denominator would tank this particular KPI, so I'm going with the ratio of clicks to starts, assuming that the main value to users is obtaining a result from a search:
clickthru_rate <- percent(sum_click/sum_start)
clickthru_rate
## [1] "86.8%"
Aggregate 86.8%. Not bad. As a KPI, it would be interesting to see how this trends over time. See the violin plot below for more. Next, let’s get a measure of how often people see a results page after starting the search session.
result_rate <- percent(sum_results/sum_start)
result_rate
## [1] "423%"
This suggests that, on average, people are presented with a results page more than four times per search started; evidently the searches do not return exactly what users seek at first.
So how many of these results pages lead to clicking on a result?
click_from_search_rate <- percent(sum_click/sum_results)
click_from_search_rate
## [1] "20.5%"
It hurts us, precious :(
Turning to event_timeToDisplayResults: the maximum is a HUGE outlier relative to the interquartile range (IQR, the span from the 25th to the 75th percentile), which is why the mean is shifted higher than the median. Possible causes include intermittent server issues or a spotty client connection (bad mobile signal, WiFi on a plane, etc.). There are also quite a few NA's. What's that all about? Let's see more below…
search.df %>%
  group_by(event_action) %>%
  summarise(total = sum(event_timeToDisplayResults),
            min_time = min(event_timeToDisplayResults),
            max_time = max(event_timeToDisplayResults),
            mean_time = mean(event_timeToDisplayResults),
            median_time = median(event_timeToDisplayResults),
            stdev = sd(event_timeToDisplayResults)) %>%
  arrange(total, min_time, max_time, mean_time, median_time, stdev)
## Source: local data frame [3 x 7]
##
## event_action total min_time max_time mean_time median_time stdev
## 1 results 524259951 -17501 1715767 755.7194 446 5836.527
## 2 click NA NA NA NA NA NA
## 3 start NA NA NA NA NA NA
Only results events record a display time; the other events are all NA. The standard deviation is also an order of magnitude larger than the mean and median, further supporting the existence of outliers. Visual support for this is in the scatter plot below.
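To put a number on "outlier", a common convention flags anything beyond 1.5 * IQR from the quartiles; a quick sketch using the quartiles reported in summary() above:
# Flag outliers with the classic 1.5 * IQR rule (results rows only,
# since only they carry a display time)
display_times <- search.df$event_timeToDisplayResults[search.df$event_action == "results"]
q1 <- unname(quantile(display_times, 0.25, na.rm = TRUE))  # 333 per the summary
q3 <- unname(quantile(display_times, 0.75, na.rm = TRUE))  # 620 per the summary
upper_fence <- q3 + 1.5 * (q3 - q1)  # 620 + 1.5 * 287 = 1050.5 ms
mean(display_times > upper_fence, na.rm = TRUE)  # proportion flagged as high outliers
With those quartiles, the 1,715,767 ms maximum sits absurdly far beyond the ~1050 ms fence.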
(chart.usage <- ggplot(data = search.df,
                       aes(x = event_action,
                           y = timestamp)) +
   geom_violin(trim = FALSE) +
   labs(title = "Scaled frequency distribution of events"))
## Warning: Removed 10 rows containing non-finite values (stat_ydensity).
What we see is that the distributions of all three actions track one another over time. With this, we could assume the roughly 87% click-through rate is maintained throughout.
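Rather than eyeballing the violins, we could check that claim directly by bucketing by month; a minimal sketch (not run here):
# Click-through rate by month: does the aggregate ~87% hold over time?
search.df %>%
  mutate(month = floor_date(timestamp, "month")) %>%
  group_by(month) %>%
  summarise(ctr = sum(event_action == "click") / sum(event_action == "start"))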
We also see a tapering effect at the beginning of the data set. This could indicate that the data coincides with a feature rollout, that the holiday season dampened use, or that students used the system less over winter break.
A slight dip in overall use is noted in mid-March. I can only assume St. Patrick's Day or spring break impedes the search for knowledge. This is shown more quantitatively in the histogram below.
Interestingly, in the April-to-May time frame, bimodality is observed across all events, but most strongly in results. I'd wager that the end of school semesters/quarters drives extra searching before yielding a click-through. I certainly used Wikipedia during those times in my life!
Let's see if there's a time effect on the time to display results:
(chart.boxplot <- ggplot(data = search.df,
                         aes(x = event_action,
                             y = event_timeToDisplayResults)) +
   geom_boxplot() +
   geom_jitter(aes(color = timestamp)) +
   scale_y_log10() +
   labs(title = "Distribution of time to display results"))
## Warning in scale$trans$trans(x): NaNs produced
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 306294 rows containing non-finite values (stat_boxplot).
## Warning: Removed 306294 rows containing missing values (geom_point).
It appears the outliers on either end (black dots) come from around the same timestamps; otherwise, the colors are distributed throughout with no obvious clustering. A log scale is used to account for the outliers.
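We could verify that "same timestamps" impression by pulling out the extremes directly; a sketch (the upper cutoff here is an arbitrary choice, not from the original analysis):
# Inspect the extreme observations: negative display times are nonsensical,
# and the upper tail reaches 1,715,767 ms (~29 minutes)
search.df %>%
  filter(event_timeToDisplayResults < 0 |
           event_timeToDisplayResults > 1e5) %>%  # 100 s cutoff is arbitrary
  arrange(timestamp) %>%
  head(20)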
How does the system’s response time vary over time?
(chart.results_time <- ggplot(data = search.df,
                              aes(x = timestamp,
                                  y = event_timeToDisplayResults)) +
   geom_point(aes(color = event_action)) +
   geom_smooth() +
   scale_y_log10() +
   labs(title = "Time required to display results over time"))
## Warning in scale$trans$trans(x): NaNs produced
## Warning in scale$trans$trans(x): NaNs produced
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 306299 rows containing missing values (stat_smooth).
## Warning: Removed 306299 rows containing missing values (geom_point).
Over time, the smoothed fit (a GAM, per the message above) never rises above one second. As for the clustering observed in the boxplot, there does not appear to be strong evidence of a time effect. Again, a log scale is used to account for the outliers.
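A table-based cross-check on the "no time effect" reading; a sketch computing the monthly median, which is robust to the outliers without any log transform:
# Median display time per month for results events
search.df %>%
  filter(event_action == "results") %>%
  mutate(month = floor_date(timestamp, "month")) %>%
  group_by(month) %>%
  summarise(median_time = median(event_timeToDisplayResults, na.rm = TRUE))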
Finally, let’s take a look at the histogram of events:
(chart.hist <- ggplot(data = search.df,
                      aes(x = timestamp)) +
   geom_freqpoly(binwidth = 20000,
                 aes(color = event_action)) +
   labs(title = "Histogram of frequency distributions over time"))
It is interesting to note that the peaks in frequency occur several times over the course of a month, perhaps corresponding to a weekly usage cycle. Also interesting is the abrupt drop in usage in mid-March and twice in May. These could correspond to St. Patrick's Day and Cinco de Mayo, or spring break and the end of the semester/quarter, or all of the above. Note that the bimodality between April and May in the violin plots shows up here as frequency peaks, most prominent in results.
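That weekly-cycle guess is easy to test with lubridate's wday(); a minimal sketch (not run here):
# Tabulate events by day of week; a weekly cycle should show up
# as systematic weekday/weekend differences
search.df %>%
  mutate(weekday = wday(timestamp, label = TRUE)) %>%
  count(weekday, event_action)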
From the GitHub repo:
Using this data set, eke out any insights you can about the users of search, the way they use the system, and the way the system responds to them, looking particularly at the seasonality and temporal patterns of user behaviour.