1 Goal


The goal of this task is to conduct an Exploratory Data Analysis (EDA) on raw session data that contains information about individual visits to XXXXXX’s website.


2 Universal Libraries


These libraries are used throughout the code. Libraries needed only by a specific piece of code are loaded alongside that code.

library(ggplot2)   # library for visualizations
library(dplyr)     # library for data manipulation
library(lubridate) # library for managing time-based data

3 Data Import


mydata <-read.csv(file="session_data.csv", header=TRUE, sep=";")
summary(mydata)
##     session          session_start_text session_end_text     clickouts    
##  Min.   :2.017e+13   Length:10000       Length:10000       Min.   :0.000  
##  1st Qu.:2.017e+13   Class :character   Class :character   1st Qu.:2.000  
##  Median :2.017e+13   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.017e+13                                         Mean   :2.485  
##  3rd Qu.:2.017e+13                                         3rd Qu.:3.000  
##  Max.   :2.017e+13                                         Max.   :8.000  
##     booking      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.0967  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

3 Data Wrangling


Assigning appropriate classes to the columns.
session_start_text : A time-based value, converted to the POSIXct class.
session_end_text : A time-based value, converted to the POSIXct class.
logtime : session_end_text - session_start_text, in seconds.
Only the time of day is recorded, so both columns are parsed onto the same date. A session that starts before midnight and ends after it therefore gets a negative logtime.
The times function corrects these values by adding 86400 seconds (one day), and the result is stored as numeric seconds.

mydata$session_start_text <- as.POSIXct(mydata$session_start_text, format='%H:%M:%S')
mydata$session_end_text <- as.POSIXct(mydata$session_end_text,  format='%H:%M:%S')
mydata$logtime <- difftime(mydata$session_end_text, mydata$session_start_text, units = "sec")
times <- function(x){
  if (x < 0) {
    y <- x + 86400
    return(y)
  } 
  else{ return(x)}
 }
mydata$logtime <- as.numeric(lapply(mydata$logtime, times))
summary(mydata)
##     session          session_start_text           
##  Min.   :2.017e+13   Min.   :2020-11-25 00:00:00  
##  1st Qu.:2.017e+13   1st Qu.:2020-11-25 06:03:09  
##  Median :2.017e+13   Median :2020-11-25 11:59:57  
##  Mean   :2.017e+13   Mean   :2020-11-25 12:00:30  
##  3rd Qu.:2.017e+13   3rd Qu.:2020-11-25 18:02:12  
##  Max.   :2.017e+13   Max.   :2020-11-25 23:59:56  
##  session_end_text                clickouts        booking          logtime     
##  Min.   :2020-11-25 00:00:12   Min.   :0.000   Min.   :0.0000   Min.   :  0.0  
##  1st Qu.:2020-11-25 06:04:38   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:139.0  
##  Median :2020-11-25 12:00:38   Median :2.000   Median :0.0000   Median :182.0  
##  Mean   :2020-11-25 12:01:39   Mean   :2.485   Mean   :0.0967   Mean   :181.3  
##  3rd Qu.:2020-11-25 18:03:51   3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:223.0  
##  Max.   :2020-11-25 23:59:58   Max.   :8.000   Max.   :1.0000   Max.   :378.0
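
The midnight correction above can also be sketched in vectorized form, without a helper function. This is a minimal sketch, not the method used in this report: the vectors start_txt and end_txt are hypothetical example values, and the fixed tz = "UTC" is an added assumption so that daylight-saving shifts cannot distort the difference.

```r
# Hypothetical example values; the real data come from session_data.csv.
start_txt <- c("23:59:30", "10:00:00", "00:00:05")
end_txt   <- c("00:01:10", "10:03:02", "00:00:05")

# Assumption: parse in a fixed time zone (UTC) so DST changes play no role.
start <- as.POSIXct(start_txt, format = "%H:%M:%S", tz = "UTC")
end   <- as.POSIXct(end_txt,   format = "%H:%M:%S", tz = "UTC")

# Raw difference in seconds; negative when the session crosses midnight.
raw <- as.numeric(difftime(end, start, units = "secs"))

# Modulo one day (86400 s) maps a negative difference to the true duration
# and leaves non-negative differences unchanged.
logtime <- raw %% 86400
print(logtime)
```

In R, %% returns a result with the sign of the divisor, which is why a negative difference wraps to the correct positive duration.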

From the summary we observe that the dataset has no NAs.
The average number of clickouts is 2.485.
The average logtime is 181 seconds, i.e. about 3 minutes.


4 Visualization 1


Hypothesis : More bookings might happen at particular times of the day.

mydata2 <- mydata
mydata2$booking <- factor(mydata2$booking, levels = c(0,1), labels = c("Not-Booked", "Booked"))
ggplot(mydata2, aes(x = session_start_text, y = booking)) +
             geom_jitter(size = .8, na.rm = TRUE, aes(color = booking, shape = booking)) +
             ggtitle("Bookings Vs Time_of_Day") +
             theme(plot.title = element_text(hjust = 0.5))

Initial Answer : In this visualization the bookings seem to happen throughout the day.
Reformulating the hypothesis and changing the visualization.

Question : In which hours is a booking least and most likely to happen?
Aggregating the observations by their mean into 24 hourly blocks.

mydata2hour <- mydata %>% aggregate(by = list(hour(mydata2$session_start_text)), FUN = mean)
ggplot(data = mydata2hour, aes(x = session_start_text, y = booking, fill = logtime)) +
           geom_bar(stat = "identity") +
           ggtitle("Bookings aggregated by hour") +
           theme(plot.title = element_text(hjust = 0.5))

The Y axis of the graph gives the probability of a booking occurring during that hour.
The shade of the fill color gives the average time spent on the website. The darker the shade, the less time spent.
The highest booking rate occurs between 1am and 2am, with about a 12% probability of a booking. This is followed by 2am to 3am, with about an 11.5% probability.
The quickest sessions happen between 8am and 9am. People spend the least time on the website during this hour.
The lowest booking rate occurs between 2pm and 3pm, with only about a 6% probability of a booking.
People spend the most time on the website between 10pm and 11pm. One possible explanation is that they are looking for discounts and offers.
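
The reason the Y axis can be read as a probability is that the mean of a 0/1 booking column is the share of sessions that ended in a booking. A minimal sketch of this on a hypothetical five-row data frame (demo is made up for illustration and is not part of session_data.csv; base R is used so the example is self-contained):

```r
# Hypothetical mini data set standing in for mydata.
demo <- data.frame(
  session_start_text = as.POSIXct(
    c("01:10:00", "01:40:00", "14:05:00", "14:30:00", "14:55:00"),
    format = "%H:%M:%S", tz = "UTC"),
  booking = c(1, 0, 0, 0, 1),
  logtime = c(200, 180, 150, 170, 160)
)

# Hour of day (0-23) in which each session started.
demo$hour <- as.integer(format(demo$session_start_text, "%H"))

# Mean of the 0/1 booking column = share of sessions with a booking;
# mean of logtime = average session length in that hour.
hourly <- aggregate(cbind(booking, logtime) ~ hour, data = demo, FUN = mean)
print(hourly)
```

Here hour 1 has one booking in two sessions (rate 0.5) and hour 14 has one booking in three sessions (rate of about 0.33).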


5 Visualization 2


Question : Is there a relation between Bookings and Logtime?

ggplot(mydata2, aes(x = logtime, y = booking)) +
         geom_jitter(size = 0.7, na.rm = TRUE, aes(color = booking))+
         ggtitle("Bookings Vs Logtimes")+
         theme(plot.title = element_text(hjust = 0.5))

Some bookings are made within 0-1 seconds of the session start, which suggests bot traffic.
Generally, bookings take between 100 and 300 seconds, with most bookings happening at about 200 seconds.
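
To inspect the very short booking sessions directly instead of reading them off the plot, they can be filtered out of the data. A minimal sketch on hypothetical rows; the data frame demo and the 1-second threshold max_seconds are assumptions chosen to mirror the 0-1 second observation above:

```r
# Hypothetical rows standing in for mydata.
demo <- data.frame(
  session = 1:5,
  logtime = c(0, 1, 150, 200, 0),
  booking = c(1, 1, 0, 1, 0)
)

# Assumed threshold: sessions of at most 1 second count as suspiciously short.
max_seconds <- 1

# Keep only sessions that booked and lasted no longer than the threshold.
suspects <- subset(demo, booking == 1 & logtime <= max_seconds)
print(suspects)
```

A short duration alone does not prove a session is a bot, so the result is best treated as a list of candidates for further checks.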

Question : Is there a relation between Bookings and the number of clickouts?

ggplot(mydata2, aes(x = clickouts, y = booking)) + geom_jitter(size=0.5, na.rm=TRUE, aes(color=booking)) +
      ggtitle("Clickouts Vs Bookings") + theme(plot.title = element_text(hjust = 0.5))

There is an anomaly here: some bookings happen at 0 clickouts. These might come through a third-party server, or they might be an anomaly in the data.
Generally, bookings happen within 3-4 clickouts.
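
The relation between clickouts and bookings can also be summarized numerically. A minimal sketch on a hypothetical data frame (demo is made up for illustration) that tabulates the counts, computes the booking rate per clickout count, and counts the bookings made with zero clickouts:

```r
# Hypothetical rows standing in for mydata.
demo <- data.frame(
  clickouts = c(0, 0, 2, 2, 3, 3, 4),
  booking   = c(1, 0, 0, 0, 1, 0, 1)
)

# Cross-tabulate clickout count against booking outcome.
counts <- table(clickouts = demo$clickouts, booking = demo$booking)
print(counts)

# Share of sessions with a booking at each clickout count.
rate <- tapply(demo$booking, demo$clickouts, mean)
print(rate)

# Number of sessions that booked without a single clickout.
zero_click_bookings <- sum(demo$booking == 1 & demo$clickouts == 0)
print(zero_click_bookings)
```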


6 Conclusions


From the given log data, a lot can be learned about customer behaviour.
Given more data and labels, a Machine Learning model can be trained to predict the probability of a session or a clickout ending in a booking.
The given log data can also facilitate the detection of anomalies and problems.
Special packages and discounts can be offered to customers at the times when they spend the most time on the website.
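
As one illustration of the model mentioned above, a logistic regression is a simple way to estimate a booking probability from session features. This is a sketch under stated assumptions, not a model fitted in this analysis: the data are simulated, the coefficients used to simulate the labels are arbitrary, and the choice of clickouts and logtime as the only predictors is an assumption.

```r
set.seed(1)

# Simulated sessions (assumption: not drawn from session_data.csv).
n <- 500
demo <- data.frame(
  clickouts = rpois(n, 2.5),
  logtime   = round(runif(n, 0, 378))
)

# Simulated labels with an arbitrary dependence on clickouts.
demo$booking <- rbinom(n, 1, plogis(-3 + 0.3 * demo$clickouts))

# Logistic regression: booking probability as a function of the two features.
fit <- glm(booking ~ clickouts + logtime, data = demo, family = binomial)

# Predicted booking probability for two hypothetical new sessions.
new_sessions <- data.frame(clickouts = c(1, 4), logtime = c(120, 240))
p <- predict(fit, newdata = new_sessions, type = "response")
print(p)
```

On the real data, such a model would need a held-out test set and more informative features before its predictions could be trusted.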