Assignment Report

Synopsis:

The objective of this assignment is to explore the clickstream data and use it to understand a company’s performance. Following are some of the areas which have been taken into account in this project

Import the Json data in a structured manner to be able to produce analysis
From the given data find the number of visits to different websites
How does customer behavior on different websites like Amazon, Microsoft and Netflix differ
Explore how the number of visits on different websites have changed over time

The data

Data provided had multiple json files with nested dictionaries. Python was used to normalize and structure the data to a dataframe. This data was then summarized in python to get the following(sample) table:

Request Ip	URl	Number of clicks	Time Spent	Date
31.13.71.36	www.facebook.com	16	26	1-Sep-16
10.1.192.48	plus.google.com	3	18	25-Aug-16

This summarized data was then imported in R for further analysis and producing this rmarkdown report

Questions asked

How would you validate this data?

Check the scope of analysis - Country is US from the country-code field and time frame matches the specified time frame
Check the level of data - Given data is at the hit level where each and evry resuest from the server to local client and vise versa is recorded

How would you normalize the data and make it representative?

Sampling of the data should be proper for example stratified samples for different locations so theat the results are not biased by a particular region characteristics
Calculating proper statistical tests such as confidence intervals of mean/statistic being used. This will give an idea of the bias due to sample size and sample standard deviation. For example the mean time duration spent on Amazon.com could be gteater than that of facebook.com but if the sample of Amazon is very small, there is more uncertainty in quoting this value of time spent
Assuming that the sample taken is a random sample, the results of the website traffic data can be normalized by taking percentage of users in data visiting the website instead of the absolute number of users

Data products that gives view into a company’s performance- For the company’s executives it will be important to track the following metrics to take important decisions

Number of new and repeat visitors visiting the website over time
Percentage of visits that funnel down to a purchase on website
How well is the website doing compared to the competitors?
Percentage of users that are diverted to an advertisement on the website

Identify Performance driver from the clickstream data -

Identifying different customer segments visiting the website and tracking them over time Active engagers, surfers, users visiting just for getting contact information etc.
Number of users contacting the customer care after visiting the website – if data available
Number of returning users
Ease of navigation through the website

Any other ideas on what could be done with this dataset?

What percentage of traffic is driven by different referring domains websites. This information can then be used to allocate advertising budgets on different websites
Top drivers of dropping from the website – Looking at the exit page, and web path that leads to exit
Findability - number of clicks required to access the required content on website
Time spent on the content pages of the website w.r.t to the total time spent on the website
Understand the behavior of the customers who are likely to exit and target them with advertisements or a proactive chat option
Logistic regression models to predict in real time the likelihood of users quitting within a given time
Understand user behavior that leads to final purchase of a product. Common websites where they compare products before deciding
How do different language versions of the website differ?
Number of visits it takes for a customer before purchasing the product

Other Analysis Completed

Visits by different websites

library(tidyverse)
library(dplyr)
library(stringr)
library(ggplot2)

Aggregated_data <- read.csv("Aggregated click stream data.csv",stringsAsFactors = FALSE)

Visits_by_website <- Aggregated_data %>% 
  group_by(url) %>% 
  summarise(count_visits = length(X)) %>% 
  arrange(desc(count_visits))

head(Visits_by_website,10)

Analysis of different Google products

Aggregated_data %>% 
  group_by(url) %>% 
  summarise(count_visits = length(X)) %>% 
  arrange(desc(count_visits)) %>% 
  filter(str_detect(url, "google")) %>% 
  top_n(10, count_visits) %>% 
  ggplot() +
  geom_bar(mapping=aes(x=reorder(url,count_visits),y=count_visits,col="red"),stat = 'identity')+
  coord_flip()+
  ylab("Visit Counts")+
  xlab("Different google websites")

Comparison of user profiles of Amazon, Microsoft and Netflix

 Aggregated_data %>% 
  group_by(url) %>% 
  summarise(count_visits = length(X),
            clicks_per_session = mean(clicks),
            time_spent_per_session = mean(time_spent)) %>% 
  arrange(desc(count_visits)) %>% 
  filter(str_detect(url, c("amazon","netflix","microsoft"))) %>% 
  mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
         website = ifelse(str_detect(url, "netflix"),"Netflix", website),
         website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>% 
  ggplot()+
  geom_boxplot(aes(x=website, y=count_visits,col=website))

Although netflix seems to have more number of visits, the sample size is too less for making a comarison.

  Aggregated_data %>% 
   filter(str_detect(url, c("amazon","netflix","microsoft"))) %>% 
   mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
          website = ifelse(str_detect(url, "netflix"),"Netflix", website),
          website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>% 
   ggplot()+
   geom_boxplot(aes(x=website, y=clicks,col=website))

 Aggregated_data %>% 
   filter(str_detect(url, c("amazon","netflix","microsoft"))) %>% 
   mutate(website = ifelse(str_detect(url, "amazon"),"Amazon", "NA"),
          website = ifelse(str_detect(url, "netflix"),"Netflix", website),
          website = ifelse(str_detect(url, "microsoft"),"Microsoft",website)) %>% 
   ggplot()+
   geom_boxplot(aes(x=website, y=time_spent,col=website))

We see that Amazon has very high variance inthe data points corresponding to differnt amazon domains so just considering the mean value may be misleading.

Comparison of visits to different websites over time

 Aggregated_data %>% 
   filter(url=="plus.google.com"| url=="www.facebook.com" |url=="hangouts.google.com"| url=="www.youtube.com") %>% 
   group_by(url,Date) %>% 
   summarise(visits = length(url),
             avg_time_spent = mean(time_spent))%>% 
   ggplot() +
   geom_point(mapping=aes(x=Date, y=visits, col=url,group=4))+
   geom_line(mapping=aes(x=Date, y=visits, col = url,group=4))

It can be seen that there are spikes in number of visits on some days which may correspond to a specific event on that day.

Assumptions

One IP address is assumed to correspond to a single session of a online visitor and it is assumed that this IP address does not change during the visit
Number of visits are calculated as unique visits (IP address) on a website for any given day
Next day even if the IP address is same the session is considered as a new visit. This may cause few anomalous cases where at midnight the day changes but the visit remains same.
All the server requests for a givin website and IP address are considered as clicks
Time spent is calculated by difference in maximum timestamp during the vist and minimum timestamp during the vist and is assumed to be in mili seconds

Python Code

# import csv, json, sys, pandas, numpy

#def flattenjson( b, delim ):
#    val = {}
#    for i in b.keys():
 #       if isinstance( b[i], dict ):
  #          get = flattenjson( b[i], delim )
#            for j in get.keys():
#                val[ i + delim + j ] = get[j]
#        else:
#            val[i] = b[i]
#
#    return val
#
#
#Data_final = pandas.DataFrame()
#Data_final
#Dat = pandas.DataFrame()
#Dat
#def partfunction(file):
#    input = open(file)
#    tweets = []
#    tweets1 = []
#    for line in input:
#        tweets.append(json.loads(line))
#    for tweet in tweets:
#        f1=flattenjson(tweet, "_")
#        tweets1.append(f1)
#    Dat = pandas.DataFrame(tweets1)
#    Dat = Dat.loc[:,['request_ip',
#                     'request_responseHeaders_date',
#                     'request_tabId',
#                     'request_timeStamp',
#                     'request_url',
#                     'request_windowName',
 #                    'request_windowTitle',
  #                   'server_request_accept_language',
#                     'server_request_country_code',
#                     'request_responseHeaders_Content-Language',
#                    'request_responseHeaders_Content-Length',
#                     'request_requestHeaders_Referer',
#                     'request_frameId',
#                     'request_requestId',
#                     'request_documentReferer',
#                     'request_method' ]]
#    # get timestamp and date mapping
#    date_file = Dat[['request_ip', 'request_responseHeaders_date']].copy()
#    date_file['Date'] = date_file['request_responseHeaders_date'].str[5:16]
#    date_file['Day'] = date_file['request_responseHeaders_date'].str[:3]
#    # date_file = date_file[date_file.loc[:,['request_responseHeaders_date']].dropna()]
#    # date_file = date_file[numpy.isfinite(date_file['request_responseHeaders_date'])]
#    #date_file = date_file.dropna(date_file.loc[:,['request_responseHeaders_date']])
#    date_file = date_file.drop('request_responseHeaders_date', axis=1)
#    date_file = date_file.sort(['Date'], ascending=False).drop_duplicates('request_ip')


#    Dat['url'] = Dat['request_url'].str.split("/").str[2]
#    aggregation = {'url': {'clicks': 'count'}, 'request_timeStamp': {'max_tmstmp': 'max', 'min_tmstmp': 'min'}}
#    Dat = Dat.groupby(['request_ip', 'url']).agg(aggregation).reset_index()
#    Dat.columns = ['request_ip', 'url', 'clicks', 'max_tmstmp', 'min_tmstmp']
#    Dat['time_spent'] = (Dat['max_tmstmp'] - Dat['min_tmstmp'])

#    all_fields = pandas.merge(Dat, date_file, how='left', left_on='request_ip', right_on='request_ip')

#    global Data_final
#    Data_final = Data_final.append(all_fields, ignore_index=True)
#    return(Data_final)


#partfunction("part-00002-00008")
#partfunction("part-00002-00010")
#partfunction("part-00001-00012")
# Other files