The purpose of this document is to describe the preliminary analysis of the Ottawa Web site visits data set which is to be discussed in the Nov 5 ‘Open Data Ottawa’ meetup (http://www.meetup.com/Open-Data-Ottawa/).
The data set is available at http://data.ottawa.ca/dataset/ottawa-ca-web
So far, only data from August, 2014 has been analyzed.
The file was downloaded, converted to .csv, and massaged a little by hand to ensure that the first row of the .csv file was the header row (i.e. no blank rows), and to convert the “Avg Time on Page” column to seconds (TtlSecs).
library(dplyr)
library(ggplot2)
dat <- read.csv("C:/Users/prabinovitch/Downloads/Aug2014.csv", stringsAsFactors=FALSE)
The web page describes the fields in the data as follows:
“Pageview: is the total number of pages viewed. Repeated views of a single page are counted.”
– This likely includes page refreshes, but one would expect that to be a small distortion
“Unique Pageview: is the number of visits during which the specified page was viewed at least once. A unique pageview is counted for each page URL + page Title combination”
– This is clear
“Avg. Time on Page: is the average amount of time visitors spent viewing a specified page or set of pages.”
– This is problematic. Since it is not know when a page is left, only when the next one is entered, there are scenarios that are going to be have questionable values for this field.
— The last page visited can not have a time on page associated to it. Perhaps some fixed value is assigned, or these these page visits are not included in the average. In either case, it introduces distortion.
— A browser sitting on a page while the user east dinner and then continues browsingwill have the dinner time is included in the time for the page.
In summary the “Avg. Time on Page” is questionable and so will not be analyzed at this point.
“Entrances: is the number of times visitors entered your site through a specified page or set of pages.”
– Definition is clear.
“Bounce Rate: is the percentage of single-page visits (i.e. visits in which the person left your site from the entrance page without interacting with the page).”
“% Exit: is the percentage of site exits that occurred from a specified page or set of pages.” – this definition must be based on some timeout, but what it is is not explained, and may distort results.
One would assume that a page’s Exit% should always be equal to or greater than the page’s Bounce rate, because the exit % counts all exits, but the bounce rate just the ones that are one page visits.
But when plotted…..
dat %>%
ggplot(aes(X..Exit,Bounce.Rate)) +
geom_point() +
geom_abline(slope=1,intercept=0,colour='blue',size=2) +
xlab('Exit Rate') +
ylab('Bounce Rate')
Based on the logic described above, all data points should be below the blue line. Thus, it is likely that the definitions are not quite clear enough.
Out of interest, the Exit rate from each page is plotted against its number of entrances.
dat %>%
ggplot(aes(Entrances,X..Exit)) +
geom_point() +
xlab('Entrances') +
ylab('Exit Rate')
The only obviously interesting features is the point to the far right, which is likely the home page.
However, when zoomed in to the pages that had less than 100 visits, there is an interesting curve between 0 and 20 entrances, and exit rates of 80 to 100%.
dat %>%
ggplot(aes(Entrances,X..Exit)) +
geom_point() +
xlab('Entrances') +
ylab('Exit Rate') +
xlim(0,100) +
ylim(0,1)
That is interesting - and likely just an illusion, as Entrances is highly correlated with Pageviews:
dat %>%
ggplot(aes(Pageviews,Entrances)) +
geom_point() +
xlab('Pageviews') +
ylab('Entrances') +
xlim(0,5000) +
ylim(0,2000)
and so they are ‘noisy’ surrogates for each other, and
dat %>%
filter(Pageviews<=10) %>%
ggplot(aes(Pageviews,X..Exit)) +
geom_point() +
xlab('Pageviews') +
ylab('Exit Rate') +
xlim(0,10)
where you can clearly see the banding structure is just due to the small numbers involved.
The overall goal is to correlate these page views with the 311 data set analyzed last month.
One approach, yet to be done, is to take all navigation levels and convert them to a bag of words for each page. Then for each word, sum the total numbers of page views across all pages that use that word. Then look at the words and match them (as best as possible) to the categories in the 311 data set, and compare monthly totals by word category.
To get started…
dat %>%
group_by(Navigation.Level.1) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [10 x 2]
##
## Navigation.Level.1 count
## 1 en 7135
## 2 fr 2215
## 3 sites 532
## 4 cgi-bin 68
## 5 election 12
## 6 403.html?page= 6
## 7 calendar 6
## 8 web 4
## 9 election-all 2
## 10 residents 2
dat %>%
group_by(Navigation.Level.2) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [10 x 2]
##
## Navigation.Level.2 count
## 1 residents 3564
## 2 city-hall 2298
## 3 hotel-de-ville 578
## 4 default 456
## 5 news 378
## 6 official-plan-0 227
## 7 business 213
## 8 serviceottawa 148
## 9 nouvelles 95
## 10 ott-city-online 79
dat %>%
group_by(Navigation.Level.3) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [10 x 2]
##
## Navigation.Level.3 count
## 1 planning-and-development 829
## 2 789
## 3 files 530
## 4 public-consultations 455
## 5 laws-licenses-and-permits 351
## 6 public-health 325
## 7 arts-culture-and-community 321
## 8 parks-and-recreation 315
## 9 water-and-environment 314
## 10 accountability-and-transparency 240
dat %>%
group_by(Navigation.Level.4) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [10 x 2]
##
## Navigation.Level.4 count
## 1 2086
## 2 styles 451
## 3 community-plans-and-design-guidelines 418
## 4 laws 309
## 5 transforming-ottawa 196
## 6 museums-and-heritage 188
## 7 planning-and-infrastructure 148
## 8 healthy-living 118
## 9 accountability-framework 109
## 10 arts-theatre-music 103
dat %>%
group_by(Navigation.Level.5) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [10 x 2]
##
## Navigation.Level.5 count
## 1 3603
## 2 large 451
## 3 city-ottawa-zoning-law 183
## 4 recreation-centre-locations 65
## 5 centres-recreatifs-emplacements 61
## 6 images 51
## 7 billings-family-virtual-exhibit-0 43
## 8 en 41
## 9 index_en.html 40
## 10 le-reglement-de-zonage 34
dat %>%
group_by(Navigation.Level.6) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [12 x 2]
##
## Navigation.Level.6 count
## 1 8492
## 2 public 453
## 3 index_en.html 40
## 4 residents 27
## 5 ray-friel-9 14
## 6 city-hall 6
## 7 wards 6
## 8 1-greenspace 5
## 9 2-identifying 5
## 10 4 5
## 11 en 5
## 12 stage-five-candidate 5
dat %>%
group_by(Navigation.Level.7) %>%
summarise(
count = n()
)%>%
arrange(desc(count)) %>%
top_n(10)
## Source: local data frame [18 x 2]
##
## Navigation.Level.7 count
## 1 9351
## 2 room_photos 406
## 3 index_en.html 27
## 4 parks-and-recreation 13
## 5 city-hall 4
## 6 24 3
## 7 index_en.shtml 3
## 8 ray 3
## 9 32 2
## 10 brewer_en.html 2
## 11 en 2
## 12 garbage-and-recycling 2
## 13 laws-licenses-and-permits 2
## 14 preschool.pdf 2
## 15 social-services 2
## 16 transportation-and-parking 2
## 17 water-and-environment 2
## 18 your-city-government 2
We see that the ‘Navigation.Level’ does not really matter conceptually, and this is why it would be preferable to merge all the levels into one bag of words for each page.
When looking at the individual words in the lists, it is seen that many are artifacts of the web site structure, and have little content, such as: ‘default’, ‘sites’, ‘cgi-bin’, ‘403’, ‘web’, etc. These stop-words will also be eliminated from the bags of words. Note though that ‘en’ and ‘fr’ may be useful in further analysis.
Some words may have to be split. For example, ‘waterquality’ would be better viewed as ‘water’ and ‘quality’.
When the data analysis is complete, one would expect that outside events (like the sinkhole) would trigger an increase in both 311 phone calls, and web site visits, and that over time, more interaction is performed via the web site. Unfortunately the web site data is monthly totals, and so likely does not have enough resolution to see these effects. It will probably show an aggregate of both effects.
All in all, much remains to be done for the next meetup.