Content
Introduction
ATO Entry Pages
ATO Browser Data
ATO Top 100 Keywords
Link to project on GitHUB
Link to project on RPub
In this report we will work with some publicly available data from the Australian Government.
The Australian Taxation Office (ATO) provided some of its web log data for GovHack 2014. It is available on the Australian Government’s data sharing web site and specifically within the ATO Web Analytics section.
Loading necessary libraries:
library(rattle) # normVarNames().
library(stringr) # String manpiulation.
library(tidyr) # Tidy the dataset.
library(dplyr) # Data manipulation.
library(ggplot2) # Visualise data.
library(scales) # Include commas in numbers.
library(directlabels) # Dodging labels for ggplot2.
Loading dataset:
ds <- read.csv('Entry pages by month and traffic source - July 2013 to April 2014.csv')
Variable names simplifying:
names(ds) <- normVarNames(names(ds))
names(ds)
## [1] "entry_page" "month" "source" "views" "visits"
We can now explore some of the characteristics of the dataset:
dim(ds) # Datasets dimensions
## [1] 207118 5
summary(ds[-1]) # Datasets summary
## month source views visits
## 14-Apr :23324 External:144400 Min. : 1 Min. : 1.0
## 13-Jul :23067 Internal: 62718 1st Qu.: 4 1st Qu.: 1.0
## 13-Aug :22849 Median : 18 Median : 4.0
## 13-Oct :22736 Mean : 1019 Mean : 169.2
## 13-Nov :22478 3rd Qu.: 88 3rd Qu.: 20.0
## 13-Sep :21084 Max. :9106745 Max. :1349083.0
## (Other):71580
The variable month appears to report the month and year. The source looks to record, presumably, whether the person browsing is external to the ATO or internal to the ATO. For an entry point the views looks to report the number of views for that month (and broken down between internal and external views). Similarly for visits.
Quantity of different entry points:
length(levels(ds$entry_page))
## [1] 33260
So, out of the 207,118, there are 33,260 different entry points.
The total number of views/visits to the ATO website:
format(sum(ds$views), big.mark=",") # The total number of views
## [1] "211,110,271"
format(sum(ds$visits), big.mark=",") # The total number of visits
## [1] "35,050,249"
Explore the views/visits per month:
ds %>%
group_by(month) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
gather(type, count, -month) %>%
ggplot(aes(x=month, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
labs(fill="Type", x="Month", y="Count") +
theme(axis.text.x=element_text(angle=45, hjust=1))
We can see an interesting pattern of views versus visits in that there’s a reasonably at number of visits over the period, however the number of views (and also we would suggest views per visit) is dramatically increased for July. We would really need to analyse the relative change in views/visit over time to confirm that observation, but we’ll stay with the visual for now. The July spike may well correspond to the Australian financial year ending in June and starting in July. We might also observe the holiday season around December when there must be less interest in taxation topics.
The breakdown between External and Internal:
ds %>%
group_by(month, source) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
gather(type, count, -c(month, source)) %>%
ggplot(aes(x=month, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
labs(fill="Type", x="Month", y="Count") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
facet_wrap(~source)
We can see that relatively speaking there are very few internal views/visits. This should not be surprising, as the ATO has only about 20,000 staff, compared to the population of Australia at over 23 million.
Loading dataset:
ds <- read.csv('Browser by month and traffic source - July 2013 to April 2014.csv')
Variable names simplifying:
names(ds) <- normVarNames(names(ds))
names(ds)[3] <- "source"
names(ds)
## [1] "browser" "month" "source" "views" "visits"
We can now explore some of the characteristics of the dataset:
dim(ds) # Datasets dimensions
## [1] 1357 5
summary(ds) # Datasets summary
## browser month source
## Chrome : 20 Mar-14 :163 External:1304
## Firefox : 20 Oct-13 :143 Internal: 53
## Microsoft Internet Explorer 6.x: 20 Apr-14 :141
## Microsoft Internet Explorer 7.x: 20 Nov-13 :137
## Microsoft Internet Explorer 8.x: 20 Feb-14 :136
## External : 11 Jan-14 :132
## (Other) :1246 (Other):505
## views visits
## Min. : 1 Min. : 1
## 1st Qu.: 2 1st Qu.: 1
## Median : 5 Median : 2
## Mean : 155571 Mean : 25829
## 3rd Qu.: 35 3rd Qu.: 11
## Max. :7765921 Max. :1268942
##
We can see that the data records aggregated monthly observations of the browsers connecting to the ATO web site. The connection may be internal to the orgranisation or external. As we can see most connections are external.
The average number of views per visit:
ds$ratio <- ds$views/ds$visits
summary(ds$ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.333 4.000 151.100
So, as we can see the average number of views per visit is 3.33.
Let’s get some understanding of the internal versus external proles of browser usage. Firstly, how many internal versus external visits?
freq <- ds %>%
group_by(source) %>%
summarise(total=sum(visits))
freq
## Source: local data frame [2 x 2]
##
## source total
## (fctr) (int)
## 1 External 33946553
## 2 Internal 1103712
We see that internal visits account for just 3% of all visits.
For the Internal users we now check which browsers are being used:
ib <- ds %>%
filter(source == "Internal") %>%
group_by(browser) %>%
summarise(total=sum(visits)) %>%
arrange(desc(total))
ib
## Source: local data frame [8 x 2]
##
## browser total
## (fctr) (int)
## 1 Microsoft Internet Explorer 8.x 1032002
## 2 Microsoft Internet Explorer 7.x 56681
## 3 Firefox 12793
## 4 Microsoft Internet Explorer 6.x 1440
## 5 Chrome 792
## 6 External 2
## 7 Microsoft Internet Explorer 9.x 1
## 8 Safari 1
The ATO apparently deploys Microsoft Internet Explorer 8 as part of its standard operating environment. There’s a few other browsers, but they are relatively rarely used.
Explore the browser usage profile over time.
ds %>%
filter(source=="Internal", visits > 1000) %>%
ggplot(aes(month, visits, fill=browser)) +
geom_bar(stat="identity") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1))
It is interesting, if also puzzling, to see quite an increase in visits over the 10 months. We might ask if the data is actually complete as a 4-fold increase in internal visits between September 2013 and April 2014 sounds rather sudden.
Compare the internal browser usage to the external browser usage:
eb <- ds %>%
filter(source == "External") %>%
group_by(browser) %>%
summarise(total=sum(visits)) %>%
arrange(desc(total))
head(eb, 10)
## Source: local data frame [10 x 2]
##
## browser total
## (fctr) (int)
## 1 Chrome 7508320
## 2 Microsoft Internet Explorer 10.x 5258583
## 3 Mobile Safari 5016803
## 4 Firefox 4296359
## 5 Microsoft Internet Explorer 8.x 3736043
## 6 Microsoft Internet Explorer 9.x 3458002
## 7 Safari 2282737
## 8 Microsoft Internet Explorer 7.x 1498382
## 9 Mozilla 634484
## 10 Microsoft Internet Explorer 6.x 151233
Looking at all dataset, we saw quite a spread of single hit browsers. In fact, out of the 406 there are 142 browsers with a single visit, 297 with less than 10 visits, 372 with less than 100 visits, 391 with less than 1000 visits, and 394with less than 20,000 visits.
ds %>%
filter(source == "External", visits > 20000) %>%
ggplot(aes(month, visits, fill=browser)) +
geom_bar(stat="identity") +
facet_wrap(~browser) +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="none")
Loading the ATO top 100 keywords for analysis:
ds <- read.csv('Local keywords (top 100) by month and traffic source - July 2013 to April 2014.csv')
Preparing the dataset for analysis, and as usual we normalise and simplify the column names, order the months chronologically:
names(ds) <- normVarNames(names(ds))
ds$month <- factor(ds$month, levels=months)
names(ds)[1] <- "keyword"
names(ds)[3] <- "source"
names(ds)
## [1] "keyword" "month" "source" "views" "visits"
We can now explore some of the characteristics of the dataset:
dim(ds) # Datasets dimensions
## [1] 1951 5
summary(ds) # Datasets summary
## keyword month source views
## 1005 : 20 Jul-13 :199 External:1000 Min. : 1.0
## abn : 20 Apr-14 :199 Internal: 951 1st Qu.: 12.0
## calculator : 20 Aug-13 :198 Median : 83.0
## calculators : 20 Feb-14 :198 Mean : 209.9
## capital gains tax: 20 Jan-14 :197 3rd Qu.: 282.5
## car allowance : 20 Dec-13 :196 Max. :4519.0
## (Other) :1831 (Other):764
## visits
## Min. : 1.0
## 1st Qu.: 7.0
## Median : 56.0
## Mean : 130.7
## 3rd Qu.: 169.5
## Max. :3792.0
##
ds %>%
group_by(keyword) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
arrange(desc(views)) %>%
head(40) %>%
gather(type, count, -keyword) %>%
ggplot(aes(x=keyword, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="top") +
labs(x="") +
coord_flip()
ds %>%
subset(source=="Internal") %>%
group_by(keyword) %>%
summarise(views=sum(views), visits=sum(visits)) %>%
arrange(desc(views)) %>%
head(40) %>%
gather(type, count, -keyword) %>%
ggplot(aes(x=keyword, y=count, fill=type)) +
geom_bar(stat="identity", position="dodge") +
scale_y_continuous(labels=comma) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="top") +
coord_flip()