Introduction

The purpose of this document is to describe the preliminary analysis of the Ottawa Web site visits data set which is to be discussed in the Nov 5 ‘Open Data Ottawa’ meetup (http://www.meetup.com/Open-Data-Ottawa/).

Data

The data set is available at http://data.ottawa.ca/dataset/ottawa-ca-web

So far, only data from August, 2014 has been analyzed.

The file was downloaded, converted to .csv, and massaged a little by hand to ensure that the first row of the .csv file was the header row (i.e. no blank rows), and to convert the “Avg Time on Page” column to seconds (TtlSecs).

library(dplyr)
library(ggplot2)

dat <- read.csv("C:/Users/prabinovitch/Downloads/Aug2014.csv", stringsAsFactors=FALSE)

Definitions

The web page describes the fields in the data as follows:

In summary the “Avg. Time on Page” is questionable and so will not be analyzed at this point.

One would assume that a page’s Exit% should always be equal to or greater than the page’s Bounce rate, because the exit % counts all exits, but the bounce rate just the ones that are one page visits.

But when plotted…..

dat %>%
  ggplot(aes(X..Exit,Bounce.Rate)) +
  geom_point() +
  geom_abline(slope=1,intercept=0,colour='blue',size=2) +
  xlab('Exit Rate') +
  ylab('Bounce Rate')

plot of chunk unnamed-chunk-2

Based on the logic described above, all data points should be below the blue line. Thus, it is likely that the definitions are not quite clear enough.

Out of interest, the Exit rate from each page is plotted against its number of entrances.

dat %>%
  ggplot(aes(Entrances,X..Exit)) +
  geom_point() +
  xlab('Entrances') +
  ylab('Exit Rate')

plot of chunk unnamed-chunk-3

The only obviously interesting features is the point to the far right, which is likely the home page.

However, when zoomed in to the pages that had less than 100 visits, there is an interesting curve between 0 and 20 entrances, and exit rates of 80 to 100%.

dat %>%
  ggplot(aes(Entrances,X..Exit)) +
  geom_point() +
  xlab('Entrances') +
  ylab('Exit Rate') +
  xlim(0,100) +
  ylim(0,1)

plot of chunk unnamed-chunk-4

That is interesting - and likely just an illusion, as Entrances is highly correlated with Pageviews:

  dat %>%
  ggplot(aes(Pageviews,Entrances)) +
  geom_point() +
  xlab('Pageviews') +
  ylab('Entrances') +
  xlim(0,5000) +
  ylim(0,2000)

plot of chunk unnamed-chunk-5

and so they are ‘noisy’ surrogates for each other, and

dat %>%
  filter(Pageviews<=10) %>%
  ggplot(aes(Pageviews,X..Exit)) +
  geom_point() +
  xlab('Pageviews') +
  ylab('Exit Rate') +
  xlim(0,10)

plot of chunk unnamed-chunk-6

where you can clearly see the banding structure is just due to the small numbers involved.

Next Steps

The overall goal is to correlate these page views with the 311 data set analyzed last month.

One approach, yet to be done, is to take all navigation levels and convert them to a bag of words for each page. Then for each word, sum the total numbers of page views across all pages that use that word. Then look at the words and match them (as best as possible) to the categories in the 311 data set, and compare monthly totals by word category.

To get started…

dat %>%
  group_by(Navigation.Level.1) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [10 x 2]
## 
##    Navigation.Level.1 count
## 1                  en  7135
## 2                  fr  2215
## 3               sites   532
## 4             cgi-bin    68
## 5            election    12
## 6      403.html?page=     6
## 7            calendar     6
## 8                 web     4
## 9        election-all     2
## 10          residents     2
dat %>%
  group_by(Navigation.Level.2) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [10 x 2]
## 
##    Navigation.Level.2 count
## 1           residents  3564
## 2           city-hall  2298
## 3      hotel-de-ville   578
## 4             default   456
## 5                news   378
## 6     official-plan-0   227
## 7            business   213
## 8       serviceottawa   148
## 9           nouvelles    95
## 10    ott-city-online    79
dat %>%
  group_by(Navigation.Level.3) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [10 x 2]
## 
##                 Navigation.Level.3 count
## 1         planning-and-development   829
## 2                                    789
## 3                            files   530
## 4             public-consultations   455
## 5        laws-licenses-and-permits   351
## 6                    public-health   325
## 7       arts-culture-and-community   321
## 8             parks-and-recreation   315
## 9            water-and-environment   314
## 10 accountability-and-transparency   240
dat %>%
  group_by(Navigation.Level.4) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [10 x 2]
## 
##                       Navigation.Level.4 count
## 1                                         2086
## 2                                 styles   451
## 3  community-plans-and-design-guidelines   418
## 4                                   laws   309
## 5                    transforming-ottawa   196
## 6                   museums-and-heritage   188
## 7            planning-and-infrastructure   148
## 8                         healthy-living   118
## 9               accountability-framework   109
## 10                    arts-theatre-music   103
dat %>%
  group_by(Navigation.Level.5) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [10 x 2]
## 
##                   Navigation.Level.5 count
## 1                                     3603
## 2                              large   451
## 3             city-ottawa-zoning-law   183
## 4        recreation-centre-locations    65
## 5    centres-recreatifs-emplacements    61
## 6                             images    51
## 7  billings-family-virtual-exhibit-0    43
## 8                                 en    41
## 9                      index_en.html    40
## 10            le-reglement-de-zonage    34
dat %>%
  group_by(Navigation.Level.6) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [12 x 2]
## 
##      Navigation.Level.6 count
## 1                        8492
## 2                public   453
## 3         index_en.html    40
## 4             residents    27
## 5           ray-friel-9    14
## 6             city-hall     6
## 7                 wards     6
## 8          1-greenspace     5
## 9         2-identifying     5
## 10                    4     5
## 11                   en     5
## 12 stage-five-candidate     5
dat %>%
  group_by(Navigation.Level.7) %>%
  summarise(
    count = n()
  )%>%
  arrange(desc(count)) %>%
  top_n(10)
## Source: local data frame [18 x 2]
## 
##            Navigation.Level.7 count
## 1                              9351
## 2                 room_photos   406
## 3               index_en.html    27
## 4        parks-and-recreation    13
## 5                   city-hall     4
## 6                          24     3
## 7              index_en.shtml     3
## 8                         ray     3
## 9                          32     2
## 10             brewer_en.html     2
## 11                         en     2
## 12      garbage-and-recycling     2
## 13  laws-licenses-and-permits     2
## 14              preschool.pdf     2
## 15            social-services     2
## 16 transportation-and-parking     2
## 17      water-and-environment     2
## 18       your-city-government     2

We see that the ‘Navigation.Level’ does not really matter conceptually, and this is why it would be preferable to merge all the levels into one bag of words for each page.

When looking at the individual words in the lists, it is seen that many are artifacts of the web site structure, and have little content, such as: ‘default’, ‘sites’, ‘cgi-bin’, ‘403’, ‘web’, etc. These stop-words will also be eliminated from the bags of words. Note though that ‘en’ and ‘fr’ may be useful in further analysis.

Some words may have to be split. For example, ‘waterquality’ would be better viewed as ‘water’ and ‘quality’.

Conclusions

When the data analysis is complete, one would expect that outside events (like the sinkhole) would trigger an increase in both 311 phone calls, and web site visits, and that over time, more interaction is performed via the web site. Unfortunately the web site data is monthly totals, and so likely does not have enough resolution to see these effects. It will probably show an aggregate of both effects.

All in all, much remains to be done for the next meetup.