Link to project on GitHUB
Link to project on RPub

Content

Introduction
ATO Entry Pages
ATO Browser Data
ATO Top 100 Keywords

Introduction

In this report we will work with some publicly available data from the Australian Government.

The Australian Taxation Office (ATO) provided some of its web log data for GovHack 2014. It is available on the Australian Government’s data sharing web site and specifically within the ATO Web Analytics section.

ATO Entry Pages

Loading necessary libraries:

library(rattle) # normVarNames().
library(stringr) # String manpiulation.
library(tidyr) # Tidy the dataset.
library(dplyr) # Data manipulation.
library(ggplot2) # Visualise data.
library(scales) # Include commas in numbers.
library(directlabels) # Dodging labels for ggplot2.   

Loading dataset:

ds <- read.csv('Entry pages by month and traffic source - July 2013 to April 2014.csv')        

Variable names simplifying:

names(ds) <- normVarNames(names(ds))      
names(ds)   
## [1] "entry_page" "month"      "source"     "views"      "visits"

We can now explore some of the characteristics of the dataset:

dim(ds) # Datasets dimensions   
## [1] 207118      5
summary(ds[-1]) # Datasets summary    
##      month            source           views             visits         
##  14-Apr :23324   External:144400   Min.   :      1   Min.   :      1.0  
##  13-Jul :23067   Internal: 62718   1st Qu.:      4   1st Qu.:      1.0  
##  13-Aug :22849                     Median :     18   Median :      4.0  
##  13-Oct :22736                     Mean   :   1019   Mean   :    169.2  
##  13-Nov :22478                     3rd Qu.:     88   3rd Qu.:     20.0  
##  13-Sep :21084                     Max.   :9106745   Max.   :1349083.0  
##  (Other):71580

The variable month appears to report the month and year. The source looks to record, presumably, whether the person browsing is external to the ATO or internal to the ATO. For an entry point the views looks to report the number of views for that month (and broken down between internal and external views). Similarly for visits.

Quantity of different entry points:

length(levels(ds$entry_page))    
## [1] 33260

So, out of the 207,118, there are 33,260 different entry points.

The total number of views/visits to the ATO website:

format(sum(ds$views), big.mark=",") # The total number of views
## [1] "211,110,271"
format(sum(ds$visits), big.mark=",") # The total number of visits
## [1] "35,050,249"

Explore the views/visits per month:

ds %>%
  group_by(month) %>%
  summarise(views=sum(views), visits=sum(visits)) %>%
  gather(type, count, -month) %>%
  ggplot(aes(x=month, y=count, fill=type)) +
  geom_bar(stat="identity", position="dodge") +
  scale_y_continuous(labels=comma) +
  labs(fill="Type", x="Month", y="Count") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

We can see an interesting pattern of views versus visits in that there’s a reasonably at number of visits over the period, however the number of views (and also we would suggest views per visit) is dramatically increased for July. We would really need to analyse the relative change in views/visit over time to confirm that observation, but we’ll stay with the visual for now. The July spike may well correspond to the Australian financial year ending in June and starting in July. We might also observe the holiday season around December when there must be less interest in taxation topics.

The breakdown between External and Internal:

ds %>%
group_by(month, source) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
gather(type, count, -c(month, source)) %>%
ggplot(aes(x=month, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
labs(fill="Type", x="Month", y="Count") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
facet_wrap(~source)

We can see that relatively speaking there are very few internal views/visits. This should not be surprising, as the ATO has only about 20,000 staff, compared to the population of Australia at over 23 million.

ATO Browser Data

Loading dataset:

ds <- read.csv('Browser by month and traffic source - July 2013 to April 2014.csv')      

Variable names simplifying:

names(ds) <- normVarNames(names(ds))
names(ds)[3] <- "source"
names(ds)
## [1] "browser" "month"   "source"  "views"   "visits"

We can now explore some of the characteristics of the dataset:

dim(ds) # Datasets dimensions   
## [1] 1357    5
summary(ds) # Datasets summary    
##                             browser         month          source    
##  Chrome                         :  20   Mar-14 :163   External:1304  
##  Firefox                        :  20   Oct-13 :143   Internal:  53  
##  Microsoft Internet Explorer 6.x:  20   Apr-14 :141                  
##  Microsoft Internet Explorer 7.x:  20   Nov-13 :137                  
##  Microsoft Internet Explorer 8.x:  20   Feb-14 :136                  
##  External                       :  11   Jan-14 :132                  
##  (Other)                        :1246   (Other):505                  
##      views             visits       
##  Min.   :      1   Min.   :      1  
##  1st Qu.:      2   1st Qu.:      1  
##  Median :      5   Median :      2  
##  Mean   : 155571   Mean   :  25829  
##  3rd Qu.:     35   3rd Qu.:     11  
##  Max.   :7765921   Max.   :1268942  
## 

We can see that the data records aggregated monthly observations of the browsers connecting to the ATO web site. The connection may be internal to the orgranisation or external. As we can see most connections are external.

The average number of views per visit:

ds$ratio <- ds$views/ds$visits    
summary(ds$ratio)    
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   3.333   4.000 151.100

So, as we can see the average number of views per visit is 3.33.

Explore Internal Usage

Let’s get some understanding of the internal versus external pro les of browser usage. Firstly, how many internal versus external visits?

freq <- ds %>%   
group_by(source) %>%
summarise(total=sum(visits))
freq    
## Source: local data frame [2 x 2]
## 
##     source    total
##     (fctr)    (int)
## 1 External 33946553
## 2 Internal  1103712

We see that internal visits account for just 3% of all visits.

For the Internal users we now check which browsers are being used:

ib <- ds %>%    
filter(source == "Internal") %>%
group_by(browser) %>%
summarise(total=sum(visits)) %>%
arrange(desc(total))
ib    
## Source: local data frame [8 x 2]
## 
##                           browser   total
##                            (fctr)   (int)
## 1 Microsoft Internet Explorer 8.x 1032002
## 2 Microsoft Internet Explorer 7.x   56681
## 3                         Firefox   12793
## 4 Microsoft Internet Explorer 6.x    1440
## 5                          Chrome     792
## 6                        External       2
## 7 Microsoft Internet Explorer 9.x       1
## 8                          Safari       1

The ATO apparently deploys Microsoft Internet Explorer 8 as part of its standard operating environment. There’s a few other browsers, but they are relatively rarely used.

Internal Usage Over Time

Explore the browser usage profile over time.

ds %>%    
filter(source=="Internal", visits > 1000) %>%
ggplot(aes(month, visits, fill=browser)) +
geom_bar(stat="identity") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1))    

It is interesting, if also puzzling, to see quite an increase in visits over the 10 months. We might ask if the data is actually complete as a 4-fold increase in internal visits between September 2013 and April 2014 sounds rather sudden.

External Visits

Compare the internal browser usage to the external browser usage:

eb <- ds %>%
filter(source == "External") %>%
group_by(browser) %>%
summarise(total=sum(visits)) %>%
arrange(desc(total))
head(eb, 10)    
## Source: local data frame [10 x 2]
## 
##                             browser   total
##                              (fctr)   (int)
## 1                            Chrome 7508320
## 2  Microsoft Internet Explorer 10.x 5258583
## 3                     Mobile Safari 5016803
## 4                           Firefox 4296359
## 5   Microsoft Internet Explorer 8.x 3736043
## 6   Microsoft Internet Explorer 9.x 3458002
## 7                            Safari 2282737
## 8   Microsoft Internet Explorer 7.x 1498382
## 9                           Mozilla  634484
## 10  Microsoft Internet Explorer 6.x  151233

Looking at all dataset, we saw quite a spread of single hit browsers. In fact, out of the 406 there are 142 browsers with a single visit, 297 with less than 10 visits, 372 with less than 100 visits, 391 with less than 1000 visits, and 394with less than 20,000 visits.

ds %>%
filter(source == "External", visits > 20000) %>%
ggplot(aes(month, visits, fill=browser)) +
geom_bar(stat="identity") +
facet_wrap(~browser) +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="none")    

ATO Top 100 Keywords

Loading the ATO top 100 keywords for analysis:

ds <- read.csv('Local keywords (top 100) by month and traffic source - July 2013 to April 2014.csv')      

Preparing the dataset for analysis, and as usual we normalise and simplify the column names, order the months chronologically:

names(ds) <- normVarNames(names(ds))
ds$month <- factor(ds$month, levels=months)
names(ds)[1] <- "keyword"
names(ds)[3] <- "source"
names(ds)
## [1] "keyword" "month"   "source"  "views"   "visits"

We can now explore some of the characteristics of the dataset:

dim(ds) # Datasets dimensions   
## [1] 1951    5
summary(ds) # Datasets summary    
##               keyword         month          source         views       
##  1005             :  20   Jul-13 :199   External:1000   Min.   :   1.0  
##  abn              :  20   Apr-14 :199   Internal: 951   1st Qu.:  12.0  
##  calculator       :  20   Aug-13 :198                   Median :  83.0  
##  calculators      :  20   Feb-14 :198                   Mean   : 209.9  
##  capital gains tax:  20   Jan-14 :197                   3rd Qu.: 282.5  
##  car allowance    :  20   Dec-13 :196                   Max.   :4519.0  
##  (Other)          :1831   (Other):764                                   
##      visits      
##  Min.   :   1.0  
##  1st Qu.:   7.0  
##  Median :  56.0  
##  Mean   : 130.7  
##  3rd Qu.: 169.5  
##  Max.   :3792.0  
## 

Plot Top 40:

ds %>%
group_by(keyword) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
arrange(desc(views)) %>%
head(40) %>%
gather(type, count, -keyword) %>%
ggplot(aes(x=keyword, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="top") +
labs(x="") +
coord_flip()    

Plot internal Only:

ds %>%
subset(source=="Internal") %>%
group_by(keyword) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
arrange(desc(views)) %>%
head(40) %>%
gather(type, count, -keyword) %>%
ggplot(aes(x=keyword, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="top") +
coord_flip()