Necessary packages

library(gsheet)
library(dplyr)
library(lubridate)

Downloading data

data<-gsheet2tbl("https://docs.google.com/spreadsheets/d/1dYKJFZkZ-qTcQOrbf80u85Zl1ugrRly3p_AJaQixgcw/edit#gid=1910446069", sheetid="Q3 data")

1) Consider only the rows with country_id = “BDV” (there are 844 such rows). For each site_id, we can compute the number of unique user_id’s found in these 844 rows. Which site_id has the largest number of unique users? And what’s the number?

data1<-data%>%filter(country_id=="BDV")
data1%>%select(site_id,user_id)%>%distinct()%>%group_by(site_id)%>%tally()%>%arrange(desc(n))
## # A tibble: 3 x 2
##   site_id     n
##   <chr>   <int>
## 1 5NPAU     544
## 2 N0OTG      90
## 3 3POLC       2

First, I chose the rows where the country_id=“BDV” (data1). Then, I selected the site_id and user_id columns to dipose any duplicate values (ts allow multiple user_id’s for each site_id’s). Then, I used the tally function to count the number of unique user_id by the site_id. As shown above, site 5NPAU had the largest number of unique users with 544 unique users.

2) Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59, there are four users who visited a certain site more than 10 times. Find these four users & which sites they (each) visited more than 10 times. (Simply provides four triples in the form (user_id, site_id, number of visits) in the box below.)

str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 3553 obs. of  4 variables:
##  $ ts        : POSIXct, format: "2019-02-01 00:01:24" "2019-02-01 00:10:19" ...
##  $ user_id   : chr  "LC36FC" "LC39B6" "LC3500" "LC374F" ...
##  $ country_id: chr  "TL6" "TL6" "TL6" "TL6" ...
##  $ site_id   : chr  "N0OTG" "N0OTG" "N0OTG" "N0OTG" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ts = col_datetime(format = ""),
##   ..   user_id = col_character(),
##   ..   country_id = col_character(),
##   ..   site_id = col_character()
##   .. )
tz(data$ts)
## [1] "UTC"
data2<-data%>%filter(between(ts,as.POSIXct("2019-02-03 00:00:00", tz="UTC"),as.POSIXct("2019-02-04 23:59:59", tz="UTC")))
data2%>%group_by(user_id,site_id)%>%tally()%>%filter(n>10)%>%arrange(desc(n))%>%rename("number of visits"="n")
## # A tibble: 4 x 3
## # Groups:   user_id [4]
##   user_id site_id `number of visits`
##   <chr>   <chr>                <int>
## 1 LC3A59  N0OTG                   26
## 2 LC06C3  N0OTG                   25
## 3 LC3C9D  N0OTG                   17
## 4 LC3C7E  3POLC                   15

Looking at the structure of the data (str function), one can observe that the class of the ts column is POSIXct. This mean the time is in the UTC time zone (as observed in tz(data$ts)). Now, I can filter the dataset as the given time frame (between 2019-02-03 00:00:00 and 2019-02-04 23:59:59). From here, I used the tally function to count how many times one particular user visited one particular site, and erase all entries where the number of visitation is less or equal to 10. Then, I got the result as shown above.

3) For each site, compute the unique number of users whose last visit (found in the original data set) was to that site. For instance, user “LC3561”’s last visit is to “N0OTG” based on timestamp data. Based on this measure, what are top three sites? (hint: site “3POLC” is ranked at 5th with 28 users whose last visit in the data set was to 3POLC; simply provide three pairs in the form (site_id, number of users).)

user<-unique(data$user_id)
data3<-data.frame(user_id=character(), last_site=character())
for (i in 1:length(user)){
  temp<-data%>%filter(user_id==user[i])%>%filter(ts==max(ts))
  data3<-rbind(data3, data.frame(user_id=user[i],last_site=temp$site_id))
}
data3%>%group_by(last_site)%>%tally()%>%arrange(desc(n))%>%rename("number of users"="n")
## # A tibble: 8 x 2
##   last_site `number of users`
##   <fct>                 <int>
## 1 5NPAU                   992
## 2 N0OTG                   561
## 3 QGO3G                   289
## 4 GVOFK                    42
## 5 3POLC                    28
## 6 RT9Z6                     2
## 7 JSUUP                     1
## 8 EUZ/Q                     1

First, I set aside a string of unique user_id and created an empty dataset (data3) with columns user_id and last_site. Then I used the for-loop to find the last visited site by the unique user_id and appended that row to the data3. Then, I used the tally function on data3 to count the number of user_id by the last_site and got the result as shown above.

4) For each user, determine the first site he/she visited and the last site he/she visited based on the timestamp data(Please include users who visited the site only once.). Compute the number of users whose first/last visits are to the same website. What is the number?

user<-unique(data$user_id)
data4<-data.frame(user_id=character(),first_site=character(),last_site=character())
for (i in 1:length(user)){
  temp<-data%>%filter(user_id==user[i])
  data4<-rbind(data4, data.frame(user_id=user[i], first_site=temp$site_id[which.min(temp$ts)], last_site=temp$site_id[which.max(temp$ts)]))
}
nrow(data4[which(as.character(data4$first_site)==as.character(data4$last_site)),])
## [1] 1670

As the previous question, I first created the string of unique user_id and created an empty dataset (data4) to append during the for-loop. In the loop, I filtered for each user_id and appended the first and the last site visited by that user_id to data4. Then, I counted the number of rows of the data4 where the first_site column equaled to the last_site column. This resulted in 1670 user_id’s where the first and the last visited site were equal to each other.

5) For each site, count the following numbers: (A) the number of unique users who have visited at least two different countries (B) the number of all unique users. (For example, user “LC3450” has visited 3 countries, “BDV”, “QLT”, and “TL6” with “5NPAU”, “5NPAU”, and “N0OTG”, respectively. When counting (A), both “5NPAU” and “N0OTG” need to consider the user.) Please calculate the ratio B/A for each site and list top three sites and the corresponding ratio.

id<-data%>%group_by(user_id)%>%summarize(count=length(unique(country_id)))%>%filter(count>1)
A<-data%>%select(user_id,country_id,site_id)%>%distinct()%>%filter(user_id %in% id$user_id)%>%group_by(site_id)%>%tally()%>%rename("A"=n)
B<-data%>%select(site_id,user_id)%>%distinct()%>%group_by(site_id)%>%tally()%>%rename("B"="n")
left_join(A,B,by="site_id")%>%mutate(ratio=B/A)%>%arrange(desc(ratio))
## # A tibble: 8 x 4
##   site_id     A     B ratio
##   <chr>   <int> <int> <dbl>
## 1 QGO3G      79   353  4.47
## 2 5NPAU     332  1104  3.33
## 3 N0OTG     201   658  3.27
## 4 3POLC      15    36  2.4 
## 5 GVOFK      26    59  2.27
## 6 JSUUP       1     2  2   
## 7 RT9Z6       1     2  2   
## 8 EUZ/Q       1     1  1

First, I found which user_id have visited at least two different countries. Separately, I selected columns user_id, country_id, and site_id from the original data and disposed the duplicated value (duplicate created by the ts column). On this seperate dataset, I selected the user_id who visited at least two different countries. From here, I used the tally function yet again and counted the number of unique user_id by the site_id (A).

To find B, I selected columns user, country and site id’s and erased all duplicates. Then, I found the number of all unique users that visited each site_id’s (B).

I merged the two datasets (A,B) calculated for the ratio B/A and get the result as shown above.

On the side note, It is interesting to notice that all the users of the lowest traffic site have visitied at least two countries. Let’s see how these entries look like in the whole scheme of things. We want to focus on site_id=JSUUP,RT9Z6, EUZ/Q

test<-data%>%filter(site_id %in% c("JSUUP","RT9Z6","EUZ/Q"))
data%>%filter(user_id %in% unique(test$user_id))%>%arrange(user_id)%>%select(user_id,country_id,site_id)%>%distinct()
## # A tibble: 9 x 3
##   user_id country_id site_id
##   <chr>   <chr>      <chr>  
## 1 LC31D8  TL6        JSUUP  
## 2 LC3568  TL6        RT9Z6  
## 3 LC3837  QLT        5NPAU  
## 4 LC3837  K1R        N0OTG  
## 5 LC3837  TL6        RT9Z6  
## 6 LC3B44  QLT        5NPAU  
## 7 LC3B44  TL6        EUZ/Q  
## 8 LC3BDF  TL6        JSUUP  
## 9 LC3BDF  HVQ        GVOFK

First, I have found the user_id that has accessed the site_id’s as mentioned above. Then, I found all the site_id and country_id for those user_id. Here, we can see that user_id={LC3837,LC3B44,LC3BDF} visited two countries and accessed the lowest traffic sites.