Exploring the data

This is an experiment, mainly for me to learn R better. I want to extract access data from Google Analytics and combine it with content data from my WordPress blog. Code and pre-munged data

setwd("~/src/ganalytics")
library(plyr)
library(ggplot2)
library(stringr)
source("./utility.R")
load("data.Rda")
load("wp_posts.Rda")
load("wp_stats.Rda")
load("posts.Rda")

Check if the data is correct

Let's check the total hits for one given page and compare them with Google Analytics.

We'll check the hits for '407 Indonesian textbooks openly available'; Google Analytics shows 334 page views.

pagetitle = "407 Indonesian textbooks openly available"

# we first select the matching rows, and then sum the visits
sum(tbl[tbl$pageTitle == pagetitle, ]$visits)
## [1] 335

Not sure why it's one more. Let's check for a specific day.

On January 19, 2013, Google Analytics shows 23 visits in total.

sum(tbl[tbl$date == makedate("2013-01-19"), ]$visits)
## [1] 23
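
To narrow down where the extra visit on the textbook page comes from, one option would be to total that page's visits per day and compare the list against the daily figures in Google Analytics. A minimal sketch using a small made-up data frame: the values are invented, only the column names `pageTitle`, `date` and `visits` follow `tbl` above, and `as.Date` stands in for `makedate` from utility.R, which is not shown here.

```r
# Toy stand-in for tbl; the numbers are invented for illustration only
toy <- data.frame(
  pageTitle = c("A", "A", "B", "A", "B"),
  date      = as.Date(c("2013-01-18", "2013-01-19", "2013-01-19",
                        "2013-01-19", "2013-01-20")),
  visits    = c(3, 2, 5, 1, 4)
)

# Keep only the rows for one page, then total the visits per day
page_rows <- toy[toy$pageTitle == "A", ]
daily <- aggregate(visits ~ date, data = page_rows, FUN = sum)
daily

# Days with more than one row for the same page are worth a closer look
rows_per_day <- table(page_rows$date)
multi <- names(rows_per_day)[rows_per_day > 1]
multi
```

Any day where the per-day total differs from Google Analytics, or where the same page shows up in several rows, would be the place to start looking.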

Let me try to plot the access to the Indonesian textbook page.

ind <- tbl[tbl$pageTitle == pagetitle, ]
ggplot(ind, aes(x = date, y = visits)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Plot: visits by date for the Indonesian textbooks page, with loess smoother]

Is there any correlation between length and reading level?

ggplot(wp_stats[wp_stats$flesch > 0, ], aes(x = flesch, y = length)) + geom_point() + 
    geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

[Plot: post length against Flesch score, with loess smoother]

It doesn't seem so. (For some reason, the readability score doesn't work well on a few posts, such as foreign-language songs, and gives a negative level, so I exclude these.)
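
A correlation coefficient would put a number on this impression. A sketch with invented values standing in for `wp_stats` (only the column names `flesch` and `length` are taken from the code above). Spearman's rank correlation is my choice here because it does not assume a linear relationship.

```r
# Toy stand-in for wp_stats; all values are invented
toy_stats <- data.frame(
  flesch = c(62, -5, 48, NA, 71, 55, 39),
  length = c(1200, 300, 800, 450, 950, 2100, 640)
)

# which() drops both the negative scores and the missing one; plain
# logical indexing would keep an all-NA row for the missing score
ok <- toy_stats[which(toy_stats$flesch > 0), ]

# Rank correlation between reading level and post length
rho <- cor(ok$flesch, ok$length, method = "spearman")
rho
```

A value near zero would agree with the plot. The all-NA row that plain logical indexing keeps is possibly what the "Removed 1 rows" warnings above refer to, though I have not checked this against the real data.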

Flesch-Kincaid level over log(total visits): no correlation?

ggplot(posts[posts$flesch > 0, ], aes(x = log(totvisits), y = flesch)) + geom_point() + 
    geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Plot: Flesch score against log(total visits), with loess smoother]

Links over log(total visits)

ggplot(posts, aes(x = log(totvisits), y = links)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Plot: links against log(total visits), with loess smoother]

Comments over log(total visits)

ggplot(posts, aes(x = log(totvisits), y = num_comments)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Plot: number of comments against log(total visits), with loess smoother]

Length over log(total visits)

ggplot(posts, aes(x = log(totvisits), y = length)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

[Plot: length against log(total visits), with loess smoother]
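
The last four plots could also be summarised numerically by correlating each post attribute with the log of total visits. A sketch with invented data in place of `posts` (the column names follow the plotting code above). Using `log1p` instead of `log` is my substitution, so that a post with zero visits gives 0 rather than `-Inf`.

```r
# Toy stand-in for posts; all values are invented
toy_posts <- data.frame(
  totvisits    = c(0, 12, 340, 58, 1500, 7),
  flesch       = c(55, 61, 47, 70, 52, 66),
  links        = c(0, 2, 9, 1, 14, 0),
  num_comments = c(0, 1, 6, 0, 11, 2),
  length       = c(400, 900, 2500, 700, 3100, 350)
)

# log(1 + x) stays finite when a post has no visits
log_visits <- log1p(toy_posts$totvisits)

# Pearson correlation of each attribute with log visits
attrs <- c("flesch", "links", "num_comments", "length")
rhos <- sapply(attrs, function(a) cor(log_visits, toy_posts[[a]]))
round(rhos, 2)
```

On the real data, one coefficient per attribute would make it easier to compare the plots side by side than judging each smoother by eye.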