The goal of this task is to conduct an Exploratory Data Analysis (EDA) on raw session data that contains information about individual visits to XXXXXX’s website.
These libraries are used throughout the code. Libraries specific to a single piece of code are loaded alongside that code.
library(ggplot2) # library for visualizations
library(dplyr) # library for data manipulation
library(lubridate) # library for managing time-based data
mydata <- read.csv(file = "session_data.csv", header = TRUE, sep = ";")
summary(mydata)
## session session_start_text session_end_text clickouts
## Min. :2.017e+13 Length:10000 Length:10000 Min. :0.000
## 1st Qu.:2.017e+13 Class :character Class :character 1st Qu.:2.000
## Median :2.017e+13 Mode :character Mode :character Median :2.000
## Mean :2.017e+13 Mean :2.485
## 3rd Qu.:2.017e+13 3rd Qu.:3.000
## Max. :2.017e+13 Max. :8.000
## booking
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0967
## 3rd Qu.:0.0000
## Max. :1.0000
3 Data Wrangling
Assigning appropriate classes to the columns according to their data types.
session_start_text : A time-based value. It is converted to the POSIXct class.
session_end_text : A time-based value. It is converted to the POSIXct class.
logtime : session_end_text - session_start_text, in seconds.
Because only the time of day is recorded, logtime comes out negative for a session that starts before midnight and ends after it.
The times function corrects these values by adding 86400 seconds (one day); logtime is stored as a numeric number of seconds.
# Parse the time-of-day strings; as.POSIXct fills in the current date
mydata$session_start_text <- as.POSIXct(mydata$session_start_text, format = "%H:%M:%S")
mydata$session_end_text <- as.POSIXct(mydata$session_end_text, format = "%H:%M:%S")
# Session duration in seconds, as a plain numeric vector
mydata$logtime <- as.numeric(difftime(mydata$session_end_text, mydata$session_start_text, units = "secs"))
# Add one day (86400 s) to negative durations, i.e. sessions that cross midnight
times <- function(x) {
  ifelse(x < 0, x + 86400, x)
}
mydata$logtime <- times(mydata$logtime)
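The midnight correction can be checked on two hand-made sessions. This is a minimal sketch: the helper name `wrap_midnight` and the sample timestamps are hypothetical and not taken from session_data.csv, and it assumes no session lasts longer than 24 hours.

```r
# Hypothetical helper: add one day (86400 s) to negative durations only.
wrap_midnight <- function(secs) {
  ifelse(secs < 0, secs + 86400, secs)
}

# Both timestamps carry the same arbitrary date, as happens when only
# "%H:%M:%S" is parsed. The first session crosses midnight.
start <- as.POSIXct(c("2020-11-25 23:59:30", "2020-11-25 10:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2020-11-25 00:01:00", "2020-11-25 10:03:00"), tz = "UTC")

raw   <- as.numeric(difftime(end, start, units = "secs"))
fixed <- wrap_midnight(raw)

print(raw)   # -86310    180
print(fixed) #     90    180
```

The session that crosses midnight goes from -86310 seconds to the expected 90 seconds, while the ordinary session is left unchanged.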
summary(mydata)
## session session_start_text
## Min. :2.017e+13 Min. :2020-11-25 00:00:00
## 1st Qu.:2.017e+13 1st Qu.:2020-11-25 06:03:09
## Median :2.017e+13 Median :2020-11-25 11:59:57
## Mean :2.017e+13 Mean :2020-11-25 12:00:30
## 3rd Qu.:2.017e+13 3rd Qu.:2020-11-25 18:02:12
## Max. :2.017e+13 Max. :2020-11-25 23:59:56
## session_end_text clickouts booking logtime
## Min. :2020-11-25 00:00:12 Min. :0.000 Min. :0.0000 Min. : 0.0
## 1st Qu.:2020-11-25 06:04:38 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:139.0
## Median :2020-11-25 12:00:38 Median :2.000 Median :0.0000 Median :182.0
## Mean :2020-11-25 12:01:39 Mean :2.485 Mean :0.0967 Mean :181.3
## 3rd Qu.:2020-11-25 18:03:51 3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:223.0
## Max. :2020-11-25 23:59:58 Max. :8.000 Max. :1.0000 Max. :378.0
From the summary we observe that the dataset has no NA’s.
The average number of clickouts is 2.485.
The average logtime is 181 seconds, i.e. about 3 minutes.
Hypothesis : There might be more bookings at particular times of the day.
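The absence of NA’s can also be confirmed directly instead of being read off the summary. The sketch below uses a made-up data frame named `toy` in place of mydata, with two missing values added on purpose so the check has something to find.

```r
# Toy data standing in for mydata; the values are invented for illustration.
toy <- data.frame(
  clickouts = c(2, 3, NA, 1),
  booking   = c(0, 1, 0, 0),
  logtime   = c(150, 210, 95, NA)
)

# Missing values per column; all zeros would mean there are no NA's.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Column means, ignoring missing values.
col_means <- colMeans(toy, na.rm = TRUE)
print(col_means)
```

Run against the real data, `colSums(is.na(mydata))` should return zero for every column if the observation above holds.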
mydata2 <- mydata
mydata2$booking <- factor(mydata2$booking, levels = c(0,1), labels = c("Not-Booked", "Booked"))
ggplot(mydata2, aes(x = session_start_text, y = booking)) +
geom_jitter(size = .8, na.rm = TRUE, aes(color = booking, shape = booking)) +
ggtitle("Bookings Vs Time_of_Day") +
theme(plot.title = element_text(hjust = 0.5))
Initial Answer : The bookings seem to happen throughout the day in this visualization.
Refining the hypothesis and changing the visualization.
Question : Which hours have the lowest and the highest chance of a booking?
Aggregating the observations by their mean into 24 hourly blocks.
mydata2hour <- mydata %>% aggregate(by = list(hour(mydata2$session_start_text)), FUN = mean)
ggplot(data = mydata2hour, aes(x = session_start_text, y = booking, fill = logtime)) +
geom_bar(stat = "identity") +
ggtitle("Bookings aggregated by hour") +
theme(plot.title = element_text(hjust = 0.5))
The y-axis of the graph gives the share of sessions started in that hour that ended in a booking, i.e. the estimated probability of a booking during that hour.
The shade of the bars gives the average time spent on the website. The darker the shade, the less time spent.
The highest booking rate occurs between 1am and 2am, with about a 12% probability of a booking. This is followed by 2am - 3am, with about an 11.5% probability.
The quickest sessions happen between 8am and 9am, when people spend the least time on the website.
The lowest booking rate occurs between 2pm and 3pm, with only a 6% probability of a booking.
People spend the most time on the website between 10pm and 11pm. They may be looking for discounts and offers, although the data cannot confirm this.
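The best and worst hours can also be picked out of the hourly table programmatically instead of being read off the chart. The following is a minimal base-R sketch on invented sessions; the data frame `sessions` and its values are assumptions, not the real data.

```r
# Made-up sessions; start times, booking flags and logtimes are illustrative.
sessions <- data.frame(
  start   = as.POSIXct(c("2020-11-25 01:10:00", "2020-11-25 01:40:00",
                         "2020-11-25 14:05:00", "2020-11-25 14:30:00",
                         "2020-11-25 14:55:00", "2020-11-25 22:15:00"),
                       tz = "UTC"),
  booking = c(1, 0, 0, 0, 0, 1),
  logtime = c(200, 180, 150, 160, 170, 260)
)

# Hour of day (0-23) extracted from the start time.
sessions$hour <- as.integer(format(sessions$start, "%H"))

# The mean of the 0/1 booking flag is the share of sessions that booked.
hourly <- aggregate(cbind(booking, logtime) ~ hour, data = sessions, FUN = mean)

# Sessions per hour, to judge how reliable each rate is.
hourly$n <- as.vector(table(sessions$hour))
print(hourly)

best_hour  <- hourly$hour[which.max(hourly$booking)]
worst_hour <- hourly$hour[which.min(hourly$booking)]
print(c(best = best_hour, worst = worst_hour))
```

Keeping the session count `n` next to each rate helps, because an hourly rate based on only a few sessions can be high or low purely by chance.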
Question : Is there a relation between Bookings and Logtime?
ggplot(mydata2, aes(x = logtime, y = booking)) +
geom_jitter(size = 0.7, na.rm = TRUE, aes(color = booking))+
ggtitle("Bookings Vs Logtimes")+
theme(plot.title = element_text(hjust = 0.5))
Some bookings are made within 0-1 seconds, which suggests that they come from bots.
Generally, bookings take from 100 seconds up to 300 seconds, with most bookings happening at about 200 seconds.
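These two observations can be backed with numbers by comparing session lengths across booked and non-booked sessions and by listing the suspiciously fast bookings. The sketch below uses invented values, and the 1-second cut-off `fast_threshold` is an assumption chosen to match the 0-1 second range mentioned above.

```r
# Toy data; booking is 0/1 and logtime is in seconds. Values are made up.
toy <- data.frame(
  booking = c(1, 1, 0, 0, 1, 0),
  logtime = c(0, 210, 150, 95, 190, 1)
)

# Median session length for non-booked ("0") and booked ("1") sessions.
median_by_booking <- tapply(toy$logtime, toy$booking, median)
print(median_by_booking)

# Bookings completed within the assumed threshold of 1 second.
fast_threshold <- 1
fast_bookings <- toy[toy$booking == 1 & toy$logtime <= fast_threshold, ]
print(fast_bookings)
```

A very short booked session is only a hint of automated traffic; confirming it would need fields such as a user agent, which this dataset does not contain.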
Question : Is there a relation between Bookings and the number of clickouts?
ggplot(mydata2, aes(x = clickouts, y = booking)) + geom_jitter(size=0.5, na.rm=TRUE, aes(color=booking)) +
ggtitle("Clickouts Vs Bookings") + theme(plot.title = element_text(hjust = 0.5))
There is an anomaly here: some bookings happen at 0 clickouts. These might come through a third-party server, or they may be an anomaly in the data.
Generally the bookings happen within 3-4 clickouts.
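The relation between clickouts and bookings can be summarised as a booking rate per clickout count, together with a count of the zero-clickout bookings. This is a sketch on made-up values, not on the real data.

```r
# Toy data; values are invented for illustration.
toy <- data.frame(
  clickouts = c(0, 0, 2, 2, 3, 3, 3, 4),
  booking   = c(1, 0, 0, 0, 1, 0, 1, 1)
)

# Share of sessions that booked, for each number of clickouts.
rate_by_clickouts <- tapply(toy$booking, toy$clickouts, mean)
print(rate_by_clickouts)

# Number of bookings recorded with no clickout at all.
zero_clickout_bookings <- sum(toy$booking == 1 & toy$clickouts == 0)
print(zero_clickout_bookings)
```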
From the given log data, a lot can be learned about customer behaviour.
Given more data and labels, a machine learning model can be trained to predict the probability of a session or a clickout ending in a booking.
The given log data can also facilitate the detection of anomalies and problems.
Special packages and discounts can be offered to customers at the times when they stay on the website the longest.
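As an illustration of the modelling idea, the sketch below fits a logistic regression with base R's `glm`. Logistic regression is only one possible choice and is not something the analysis above commits to. The data are simulated: the data frame `sim`, the coefficients used to generate the labels, and the two example sessions are all assumptions.

```r
set.seed(42)

# Simulated sessions; distributions are chosen only to resemble the summaries.
n <- 500
sim <- data.frame(
  clickouts = rpois(n, lambda = 2.5),
  logtime   = round(runif(n, min = 0, max = 380))
)

# Assumed relationship, used only to generate synthetic booking labels.
p_true <- plogis(-4 + 0.5 * sim$clickouts + 0.005 * sim$logtime)
sim$booking <- rbinom(n, size = 1, prob = p_true)

# Logistic regression: booking probability from clickouts and logtime.
model <- glm(booking ~ clickouts + logtime, data = sim, family = binomial)

# Predicted booking probabilities for two hypothetical sessions.
new_sessions <- data.frame(clickouts = c(1, 4), logtime = c(60, 240))
pred <- predict(model, newdata = new_sessions, type = "response")
print(pred)
```

On real data the model would have to be evaluated on held-out sessions before its probabilities are trusted.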