The objective of this assignment is to explore the clickstream data and use it to understand a company’s performance. Following are some of the areas which have been taken into account in this project
Data provided had multiple json files with nested dictionaries. Python was used to normalize and structure the data to a dataframe. This data was then summarized in python to get the following(sample) table:
| Request Ip | URl | Number of clicks | Time Spent | Date |
|---|---|---|---|---|
| 31.13.71.36 | www.facebook.com | 16 | 26 | 1-Sep-16 |
| 10.1.192.48 | plus.google.com | 3 | 18 | 25-Aug-16 |
This summarized data was then imported in R for further analysis and producing this rmarkdown report
How would you validate this data?
How would you normalize the data and make it representative?
Data products that gives view into a company’s performance- For the company’s executives it will be important to track the following metrics to take important decisions
Identify Performance driver from the clickstream data -
Any other ideas on what could be done with this dataset?
Visits by different websites
library(tidyverse)
library(dplyr)
library(stringr)
library(ggplot2)
Aggregated_data <- read.csv("Aggregated click stream data.csv",stringsAsFactors = FALSE)
Visits_by_website <- Aggregated_data %>%
group_by(url) %>%
summarise(count_visits = length(X)) %>%
arrange(desc(count_visits))
head(Visits_by_website,10)Analysis of different Google products
Aggregated_data %>%
group_by(url) %>%
summarise(count_visits = length(X)) %>%
arrange(desc(count_visits)) %>%
filter(str_detect(url, "google")) %>%
top_n(10, count_visits) %>%
ggplot() +
geom_bar(mapping=aes(x=reorder(url,count_visits),y=count_visits,col="red"),stat = 'identity')+
coord_flip()+
ylab("Visit Counts")+
xlab("Different google websites")Comparison of user profiles of Amazon, Microsoft and Netflix
Aggregated_data %>%
group_by(url) %>%
summarise(count_visits = length(X),
clicks_per_session = mean(clicks),
time_spent_per_session = mean(time_spent)) %>%
arrange(desc(count_visits)) %>%
filter(str_detect(url, c("amazon","netflix","microsoft"))) %>%
mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
website = ifelse(str_detect(url, "netflix"),"Netflix", website),
website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>%
ggplot()+
geom_boxplot(aes(x=website, y=count_visits,col=website)) Although netflix seems to have more number of visits, the sample size is too less for making a comarison.
Aggregated_data %>%
filter(str_detect(url, c("amazon","netflix","microsoft"))) %>%
mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
website = ifelse(str_detect(url, "netflix"),"Netflix", website),
website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>%
ggplot()+
geom_boxplot(aes(x=website, y=clicks,col=website)) Aggregated_data %>%
filter(str_detect(url, c("amazon","netflix","microsoft"))) %>%
mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
website = ifelse(str_detect(url, "netflix"),"Netflix", website),
website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>%
ggplot()+
geom_boxplot(aes(x=website, y=time_spent,col=website)) We see that Amazon has very high variance inthe data points corresponding to differnt amazon domains so just considering the mean value may be misleading.
Comparison of visits to different websites over time
Aggregated_data %>%
filter(url=="plus.google.com"| url=="www.facebook.com" |url=="hangouts.google.com"| url=="www.youtube.com") %>%
group_by(url,Date) %>%
summarise(visits = length(url),
avg_time_spent = mean(time_spent))%>%
ggplot() +
geom_point(mapping=aes(x=Date, y=visits, col=url,group=4))+
geom_line(mapping=aes(x=Date, y=visits, col = url,group=4)) It can be seen that there are spikes in number of visits on some days which may correspond to a specific event on that day.
# import csv, json, sys, pandas, numpy
#def flattenjson( b, delim ):
# val = {}
# for i in b.keys():
# if isinstance( b[i], dict ):
# get = flattenjson( b[i], delim )
# for j in get.keys():
# val[ i + delim + j ] = get[j]
# else:
# val[i] = b[i]
#
# return val
#
#
#Data_final = pandas.DataFrame()
#Data_final
#Dat = pandas.DataFrame()
#Dat
#def partfunction(file):
# input = open(file)
# tweets = []
# tweets1 = []
# for line in input:
# tweets.append(json.loads(line))
# for tweet in tweets:
# f1=flattenjson(tweet, "_")
# tweets1.append(f1)
# Dat = pandas.DataFrame(tweets1)
# Dat = Dat.loc[:,['request_ip',
# 'request_responseHeaders_date',
# 'request_tabId',
# 'request_timeStamp',
# 'request_url',
# 'request_windowName',
# 'request_windowTitle',
# 'server_request_accept_language',
# 'server_request_country_code',
# 'request_responseHeaders_Content-Language',
# 'request_responseHeaders_Content-Length',
# 'request_requestHeaders_Referer',
# 'request_frameId',
# 'request_requestId',
# 'request_documentReferer',
# 'request_method' ]]
# # get timestamp and date mapping
# date_file = Dat[['request_ip', 'request_responseHeaders_date']].copy()
# date_file['Date'] = date_file['request_responseHeaders_date'].str[5:16]
# date_file['Day'] = date_file['request_responseHeaders_date'].str[:3]
# # date_file = date_file[date_file.loc[:,['request_responseHeaders_date']].dropna()]
# # date_file = date_file[numpy.isfinite(date_file['request_responseHeaders_date'])]
# #date_file = date_file.dropna(date_file.loc[:,['request_responseHeaders_date']])
# date_file = date_file.drop('request_responseHeaders_date', axis=1)
# date_file = date_file.sort(['Date'], ascending=False).drop_duplicates('request_ip')
# Dat['url'] = Dat['request_url'].str.split("/").str[2]
# aggregation = {'url': {'clicks': 'count'}, 'request_timeStamp': {'max_tmstmp': 'max', 'min_tmstmp': 'min'}}
# Dat = Dat.groupby(['request_ip', 'url']).agg(aggregation).reset_index()
# Dat.columns = ['request_ip', 'url', 'clicks', 'max_tmstmp', 'min_tmstmp']
# Dat['time_spent'] = (Dat['max_tmstmp'] - Dat['min_tmstmp'])
# all_fields = pandas.merge(Dat, date_file, how='left', left_on='request_ip', right_on='request_ip')
# global Data_final
# Data_final = Data_final.append(all_fields, ignore_index=True)
# return(Data_final)
#partfunction("part-00002-00008")
#partfunction("part-00002-00010")
#partfunction("part-00001-00012")
# Other files