Introduction

First, we will want to load the libraries necessary for data manipulation and visualization:

library(ggplot2) #excellent viz library
library(dplyr) #data manipulation
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate) #makes handling timestamps a breeze!
library(scales) #for pretty printing percent

Extract, Transform, Load (ETL)

read.csv

This assumes the TSV file is in the working directory:

search.df <- read.csv("search_dataset.tsv", header=TRUE, sep="\t", encoding = "UTF-8")

dplyr::mutate & lubridate

Yikes. That timestamp is a nightmare! Let’s convert it to something more human-readable:

search.df <- search.df %>%
  mutate(timestamp = ymd_hms(timestamp, tz = "Pacific"))
## Warning: 10 failed to parse.
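The warning says ten values failed to parse, but not which ones. One way to find them is to keep the raw strings and look at the rows where the parsed value is NA. Below is a minimal sketch under some assumptions: the raw values and the format string are made up (the real file's timestamp format is not shown here), and base R's as.POSIXct stands in for ymd_hms so the snippet runs without lubridate. It also checks the time zone name against OlsonNames(), since a bare name such as "Pacific" may not be recognised on every system; "America/Los_Angeles" is the usual identifier for US Pacific time.

```r
# Hypothetical raw values; the real file's timestamp format is not shown above.
raw <- c("2015-03-21 08:33:06", "not a timestamp", "2015-04-12 10:05:15")

# Check that the time zone name is one R recognises before using it.
tz_name <- "America/Los_Angeles"
stopifnot(tz_name %in% OlsonNames())

# Base-R stand-in for ymd_hms(): unparseable strings come back as NA.
parsed <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = tz_name)

# Values that were present in the raw data but failed to parse.
failed <- raw[is.na(parsed) & !is.na(raw)]
print(failed)
```

Looking at the failed strings usually makes it obvious whether they are truncated, malformed, or simply in a second format.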

head()

Let’s verify the successful conversion:

head(search.df)
##             timestamp event_action event_timeToDisplayResults
## 1 2015-03-21 08:33:06        start                         NA
## 2 2015-03-31 22:39:51      results                        474
## 3 2015-02-12 20:16:52      results                        316
## 4 2015-04-12 10:05:15        start                         NA
## 5 2015-02-28 09:55:17      results                        470
## 6 2015-04-25 22:18:13      results                        474

Awesome! Let’s forge ahead.

Data analysis

Summary of dataset:

summary(search.df)
##    timestamp                    event_action    event_timeToDisplayResults
##  Min.   :2014-11-25 21:17:51   click  :142334   Min.   : -17501.0         
##  1st Qu.:2015-01-19 15:42:53   results:693723   1st Qu.:    333.0         
##  Median :2015-02-28 05:40:47   start  :163943   Median :    446.0         
##  Mean   :2015-02-27 07:30:35                    Mean   :    755.7         
##  3rd Qu.:2015-04-07 21:26:57                    3rd Qu.:    620.0         
##  Max.   :2015-05-15 11:34:39                    Max.   :1715767.0         
##  NA's   :10                                     NA's   :306277

Variable 1: Timestamp:

We see that the timestamp’s median and mean are similar, suggesting a roughly symmetric distribution (though not necessarily a normal one); see the histogram below for more.
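A mean that matches the median only hints at symmetry, so it is worth looking at the shape before calling anything normal. Here is a small sketch with made-up numbers (not the search data) where the two agree exactly and the distribution is plainly bimodal:

```r
# Made-up values in two clusters: clearly bimodal, not bell-shaped.
x <- c(rep(0, 50), rep(10, 50))

# Mean and median agree exactly...
centre <- c(mean = mean(x), median = median(x))
print(centre)

# ...but the counts show two separate clusters and nothing in between.
counts <- table(x)
print(counts)
```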

Variable 2: Event action - sum, subsetting, and variable assignment:

Start and results are not exactly 1:1. This indicates that users search a lot before clicking through to a result. Start and click are somewhat closer to 1:1. Let’s see how close:

sum_start <- sum(search.df$event_action == "start")
sum_results <- sum(search.df$event_action == "results")  
sum_click <- sum(search.df$event_action == "click")  

In marketing, there’s the concept of the click-through rate, a measure of the effectiveness of an ad. Using results in this metric would tank this particular KPI, so I’m going with the ratio of clicks to starts, assuming that the main value to users is obtaining a result from a search:

clickthru_rate <- percent(sum_click/sum_start)
clickthru_rate
## [1] "86.8%"

Aggregate 86.8%. Not bad. As a KPI, it would be interesting to see how this trends over time. See the violin plot below for more. Next, let’s get a measure of how often people see a results page after starting the search session.

result_rate <- percent(sum_results/sum_start)
result_rate
## [1] "423%"

This suggests that, on average, people are presented with a results page more than four times per search session, so the searches do not return exactly what users seek at first.

So how many of these results pages lead to clicking on a result?

click_from_search_rate <- percent(sum_click/sum_results)
click_from_search_rate
## [1] "20.5%"

It hurts us, precious :(
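Since percent() returns a character string, it can be handy to keep the ratios numeric and only format them for display. A quick sketch that recomputes the three rates from the counts printed by summary(search.df); the variable names here are mine, not part of the analysis above:

```r
# Counts copied from the summary(search.df) output above.
n_start   <- 163943
n_results <- 693723
n_click   <- 142334

# Keep the ratios numeric so they can be compared or plotted later.
rates <- c(
  click_per_start   = n_click / n_start,
  results_per_start = n_results / n_start,
  click_per_results = n_click / n_results
)

# Format only at the point of display.
print(sprintf("%s: %.1f%%", names(rates), 100 * rates))
```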

Variable 3: event_timeToDisplayResults:

There is a HUGE outlier compared to the IQR (interquartile range, 25% to 75%). As a result, the mean is shifted higher than the median. Possible causes may be intermittent server issues or a spotty client connection (bad mobile signal, WiFi on a plane, etc.). There are also several NAs. What’s that all about? Let’s see more below…
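To put a number on "huge", one common convention is to flag anything more than 1.5 × IQR beyond the quartiles (the same rule boxplots use for their whiskers). A sketch using the quartiles from the summary above; the 1.5 multiplier is a convention I am assuming, not something fixed by the data:

```r
# Quartiles copied from the summary(search.df) output above (milliseconds).
q1 <- 333
q3 <- 620
iqr <- q3 - q1                   # 287

# Conventional 1.5 * IQR fences.
lower_fence <- q1 - 1.5 * iqr    # -97.5
upper_fence <- q3 + 1.5 * iqr    # 1050.5

# Observed extremes from the same summary.
observed <- c(min = -17501, max = 1715767)
outside <- observed < lower_fence | observed > upper_fence
print(outside)
```

Both extremes fall far outside the fences, and the negative minimum is suspicious in its own right for a duration.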

dplyr: summary of total different actions

search.df %>%
  group_by(event_action) %>%
  summarise(total = sum(event_timeToDisplayResults),
            min_time = min(event_timeToDisplayResults),
            max_time = max(event_timeToDisplayResults),
            mean_time = mean(event_timeToDisplayResults),
            median_time = median(event_timeToDisplayResults),
            stdev = sd(event_timeToDisplayResults)) %>%
  arrange(total, min_time, max_time, mean_time, median_time, stdev)
## Source: local data frame [3 x 7]
## 
##   event_action     total min_time max_time mean_time median_time    stdev
## 1      results 524259951   -17501  1715767  755.7194         446 5836.527
## 2        click        NA       NA       NA        NA          NA       NA
## 3        start        NA       NA       NA        NA          NA       NA

Only results events record a time; the other events are all NAs. The standard deviation is also an order of magnitude larger than the median and mean, further supporting the existence of outliers. The scatter plot below gives visual support for this.
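Two things are going on in that table: any NA in a group turns the whole summary into NA unless na.rm = TRUE is passed, and the mean and standard deviation are dragged around by a few extreme values. A small sketch with a made-up sample (a few typical values seen above, the maximum from the summary, and one NA) shows both effects and how the median and IQR hold up:

```r
# Made-up sample: typical values, the maximum from summary(), and a missing value.
times <- c(316, 446, 470, 474, 1715767, NA)

# Without na.rm the result is NA, which is what happened for click and start.
naive_mean <- mean(times)

# With na.rm = TRUE, compare outlier-sensitive and robust statistics.
stats <- c(
  mean   = mean(times, na.rm = TRUE),
  sd     = sd(times, na.rm = TRUE),
  median = median(times, na.rm = TRUE),
  iqr    = IQR(times, na.rm = TRUE)
)
print(naive_mean)
print(stats)
```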

Data visualization

Violin plot: web usage

(chart.usage <- ggplot(data = search.df,
                 aes(x = event_action,
                     y = timestamp)) +
  geom_violin(trim = FALSE) +
  labs(title = "Scaled frequency distribution of events"))
## Warning: Removed 10 rows containing non-finite values (stat_ydensity).

What we see is that the distributions for all three actions tend to scale accordingly. With this, we could assume the roughly 87% click-through-rate is maintained throughout.
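The assumption that the click-through rate holds throughout could also be checked directly by computing it per month rather than reading it off the violin shapes. A minimal sketch on a made-up data frame (toy stands in for search.df, and the column names are copied from it):

```r
# Made-up events; toy stands in for search.df.
toy <- data.frame(
  timestamp = as.POSIXct(c("2015-01-05 10:00:00", "2015-01-06 11:00:00",
                           "2015-01-07 12:00:00", "2015-02-03 09:00:00",
                           "2015-02-04 09:30:00"), tz = "UTC"),
  event_action = c("start", "click", "start", "start", "click")
)

# Bucket by calendar month and count each action per bucket.
month <- format(toy$timestamp, "%Y-%m")
counts <- table(month, toy$event_action)

# Clicks per start, one value per month.
ctr <- counts[, "click"] / counts[, "start"]
print(ctr)
```

On the real data, a flat line across months would back up the claim; a drift would be worth a closer look.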

We also see a tapering effect at the beginning of the data set. This may indicate that the data coincides with a feature rollout, that the holiday season reduced use, or that the system was not used as heavily by students over winter break.

A slight dip in overall use is noted in mid-March. I can only assume St. Patrick’s Day or spring break impedes the search for knowledge. This is shown more quantitatively in the histogram below.

Interestingly, in the April to May time frame, bimodality is observed across all events, but most strongly in results. I’d wager to say that perhaps the end of school semesters/quarters may impact those results before yielding a click through. I certainly used Wikipedia during those times in my life!

Boxplot

Let’s see if there’s a time effect on the time to display results:

(chart.boxplot <- ggplot(data = search.df,
                        aes(x = event_action,
                            y = event_timeToDisplayResults)) +
  geom_boxplot() +
  geom_jitter(aes(color = timestamp)) +
  scale_y_log10() +
  labs(title = "Distribution of time to display results"))
## Warning in scale$trans$trans(x): NaNs produced
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 306294 rows containing non-finite values (stat_boxplot).
## Warning: Removed 306294 rows containing missing values (geom_point).

The outliers on either end (black dots) appear to come from around the same timestamp. Otherwise, the colors are distributed throughout with no clustering. A log scale is used to account for outliers.
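One side effect of the log scale is visible in the warnings: log10 of a negative number is NaN and log10 of zero is -Inf, so those rows are dropped along with the NAs (306294 rows removed versus 306277 NAs, a difference of 17). A sketch with a made-up sample showing what gets lost, and one way to filter explicitly before plotting:

```r
# Made-up sample: the minimum from summary(), a zero, a typical value, a missing one.
times <- c(-17501, 0, 446, NA)

# What a log10 axis does to each value: NaN, -Inf, about 2.65, NA.
logged <- suppressWarnings(log10(times))
print(logged)

# Keeping only strictly positive values makes the dropped rows a deliberate choice.
plottable <- times[!is.na(times) & times > 0]
print(plottable)
```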

Scatterplot: Results as a function of time:

How does the system’s response time vary over time?

(chart.results_time <- ggplot(data = search.df,
                             aes(x = timestamp,
                                 y = event_timeToDisplayResults)) +
  geom_point(aes(color = event_action)) +
  geom_smooth() +
  scale_y_log10() +
  labs(title = "Time required to display results over time"))
## Warning in scale$trans$trans(x): NaNs produced
## Warning in scale$trans$trans(x): NaNs produced
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 306299 rows containing missing values (stat_smooth).
## Warning: Removed 306299 rows containing missing values (geom_point).

Over time, the smoothed fit (a GAM, per the geom_smooth message above) does not rise above one second. With regard to the clustering observed in the boxplot, there does not appear to be strong evidence for a time effect. Again, a log scale is used to account for outliers.

Histogram

Finally, let’s take a look at the histogram of events:

(chart.hist <- ggplot(data = search.df,
                     aes(x = timestamp)) +
  geom_freqpoly(binwidth = 20000,
                aes(color = event_action)) +
  labs(title = "Histogram of frequency distributions over time"))

It is interesting to note that peaks in frequency occur several times over the course of a month, perhaps corresponding to a weekly usage cycle. Also interesting is the abrupt drop in usage in mid-March and twice in May. This could possibly correspond to St. Patrick’s Day and Cinco de Mayo, or Spring Break and the end of the semester/quarter, or all of the above. Note that the bimodality between April and May seen in the violin plots shows up here as frequency peaks, most prominent in results.
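If the peaks really follow the week, counting events by day of week should show it more directly than eyeballing the frequency polygon. A minimal sketch on made-up timestamps (not the search data), using only base R:

```r
# Made-up timestamps; 2015-03-02 was a Monday.
ts <- as.POSIXct(c("2015-03-02 09:00:00", "2015-03-02 15:30:00",
                   "2015-03-03 11:00:00", "2015-03-07 20:00:00",
                   "2015-03-09 08:45:00"), tz = "UTC")

# POSIXlt's wday runs from 0 (Sunday) to 6 (Saturday).
wday <- as.POSIXlt(ts)$wday
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Count events per weekday, keeping days with zero events in the table.
by_day <- table(factor(day_names[wday + 1], levels = day_names))
print(by_day)
```

Note also that binwidth for a timestamp axis is in seconds, so the 20000 used above is roughly five and a half hours per bin; 86400 would give daily bins.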

Conclusions

From the GitHub repo:

Using this data set, eke out any insights you can about the users of search, the way they use the system, and the way the system responds to them, looking particularly at the seasonality and temporal patterns of user behaviour.

The Users

  • Users tend to click through to results about 87% of the time.
  • Users tend to go through about four results pages per search session. This may indicate the need for a more effective search algorithm and may explain why there’s roughly a 13% abandonment rate.
  • Users only end up clicking on about one in five results pages.
  • Over time, these trends tend to hold. A few notable exceptions are:
  1. a ramp in use over the end of the year holiday season
  2. a dip in mid-March, perhaps due to St. Patrick’s Day or Spring Break
  3. two spikes in use between April and May, perhaps due to end of semester/quarter related searches.

The System

  • This data was gathered from 11/25/2014 to 05/15/2015
  • There is no perceivable trend in system response as a function of time
  • Response time is only recorded for results
  • The system returns, on average, four results pages for every search, perhaps suggesting a need to review the search algorithm
  • Over time, the average response time does not vary greatly
  • The mean response time is 755.7ms, or just over three quarters of a second. The median response time is 446ms, or a little under half a second. The gap between the two indicates that the system occasionally experiences much longer response times, but the causes do not appear to be systematic. Indeed, there are some large outliers far outside the IQR, perhaps due to server-side issues or user internet reception issues (poor WiFi, bad mobile signal, airplane WiFi).