Loading the data

The data was loaded from an SQL database originally created by Kirk Harland:

# geoT <- SpatialPointsDataFrame(coords= matrix(c(db$Lon, db$Lat), ncol=2), data=db)
load("../tweet_store/.RData")

Analysis of the full dataset

The full Twitter dataset contains 2.8 million tweets. The timestamps shown in the output below span at least September 2011 to October 2012. The bounding box of the tweets encapsulates Leeds and Bradford, with the highest density of tweets focussed on Leeds city centre. A range of variables were collected, but the subsequent analysis will focus on Text, the written content of each tweet. The variables and some summary statistics are shown below:

names(geoT) # the variables available for analysis

## Loading required package: sp

##  [1] "Tweet_ID"    "User_ID"     "Screen_Name" "Full_Name"   "Lat"        
##  [6] "Lon"         "Loc_Txt"     "Time"        "Text"        "Easting"    
## [11] "Northing"    "Year"        "Month"       "Day"         "Hour"       
## [16] "Minute"      "Second"

nrow(geoT) / 1000000 # million tweets

## [1] 2.812

range(geoT$Time) # Time is stored as text, so this range is alphabetical, not chronological

## [1] "Fri Apr 06 00:01:08 +0000 2012" "Wed Sep 28 23:56:21 +0000 2011"
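Because Time is held as character strings, the range above is sorted alphabetically ("Fri" comes before "Wed") and does not give the first and last tweet. A minimal sketch of a chronological range follows, using three timestamps copied from this page. The helper `parse_tweet_time` is hypothetical (not part of the original analysis) and assumes every timestamp carries a `+0000` offset, as all those shown here do.

```r
# Hypothetical helper: parse Twitter-style timestamps such as
# "Fri Apr 06 00:01:08 +0000 2012" without relying on the system locale.
# Assumption: the offset is always +0000, so it is ignored and UTC is used.
parse_tweet_time <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)
  iso <- vapply(parts, function(p) {
    # p = weekday, month, day, time, offset, year
    sprintf("%s-%02d-%s %s", p[6], match(p[2], month.abb), p[3], p[4])
  }, character(1))
  as.POSIXct(iso, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

times <- c("Fri Apr 06 00:01:08 +0000 2012",
           "Wed Sep 28 23:56:21 +0000 2011",
           "Sun Oct 28 17:37:11 +0000 2012")

alphabetical <- range(times)                     # what range() does on text
chronological <- range(parse_tweet_time(times))  # earliest and latest tweet
format(chronological, "%Y-%m-%d")
```

On the real data the same idea would be `range(parse_tweet_time(geoT$Time))`, although parsing 2.8 million strings this way may be slow.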

summary(nchar(geoT$Text))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    38.0    60.0    66.6    91.0   521.0

geoT[ which.max(nchar(geoT$Text)), ]

##             coordinates           Tweet_ID   User_ID     Screen_Name
## 1731217 (-1.705, 53.71) 262608965366603776 253747929 CharlotteSwithy
##         Full_Name   Lat    Lon                             Loc_Txt
## 1731217  E✖ample™ 53.71 -1.705 Kirklees_ Kirklees - United Kingdom
##                                   Time
## 1731217 Sun Oct 28 17:37:11 +0000 2012
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Text
## 1731217 Phone's dead &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;
##         Easting Northing Year Month Day Hour Minute Second
## 1731217  419487   423367 2012    10  28   17     37     11
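The maximum of 521 characters is far above the 140-character limit that applied to tweets at the time. The longest tweet suggests why: each `<` appears to be stored as the four-character HTML entity `&lt;`, and 13 + 127 × 4 = 521 would be consistent with a 140-character original. A minimal sketch of decoding the most common entities before counting characters follows; `decode_entities` is a hypothetical helper and handles only three entities.

```r
# Hypothetical helper: decode a few common HTML entities in tweet text.
# Only &lt;, &gt; and &amp; are handled; other entities are left as they are.
decode_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  gsub("&amp;", "&", x, fixed = TRUE)  # decoded last to avoid double decoding
}

# Shortened stand-in for the longest tweet (3 entities rather than 127)
example <- paste0("Phone's dead ", strrep("&lt;", 3))

raw_length <- nchar(example)                       # counts each entity as 4 characters
decoded_length <- nchar(decode_entities(example))  # counts each "<" once
c(raw = raw_length, decoded = decoded_length)
```

Applied to the full dataset, `summary(nchar(decode_entities(geoT$Text)))` should give a distribution closer to what users actually typed.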

# plotting the spatial distribution of the tweets
library(ggmap)

## 
## Attaching package: 'ggmap'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     crime

ggmap(get_map(bbox(geoT))) + geom_point(data = geoT@data[sample(1:nrow(geoT), size = nrow(geoT) / 10), ],
                                        aes(x = Lon, y = Lat), alpha = 0.1)

## Warning: bounding box given to google - spatial extent only approximate.

## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=53.822421,-1.54535&zoom=11&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms

## Warning: Removed 25871 rows containing missing values (geom_point).
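The map is drawn from a random 10% of the tweets, and because the sample is not seeded a different subset is plotted on every run. A minimal sketch of making the draw repeatable follows. It uses made-up coordinates rather than the real data, and the function `draw_tenth` and the seed values are assumptions for illustration only.

```r
# Hypothetical stand-in for geoT@data: 1000 random points around Leeds
set.seed(2012)  # arbitrary seed, used only to build the toy data
tweets <- data.frame(Lon = runif(1000, -2.0, -1.2),
                     Lat = runif(1000, 53.6, 54.0))

# Draw a reproducible 10% sample of the rows of a data frame
draw_tenth <- function(df, seed) {
  set.seed(seed)
  df[sample(nrow(df), size = floor(nrow(df) / 10)), ]
}

first <- draw_tenth(tweets, seed = 1)
second <- draw_tenth(tweets, seed = 1)
nrow(first)               # 100 rows, a tenth of the toy data
identical(first, second)  # TRUE: the same seed gives the same rows
```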

Subsetting ‘Pennine Way’

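Tweets related to the Pennine Way can be found by searching Text for the route’s name. Only one tweet contains the exact phrase ‘pennine way’, whereas 199 contain the word ‘pennine’, many of which refer to buildings or to the hills in general rather than to the trail: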

length(grep("pennine way", geoT$Text, ignore.case = TRUE))

## [1] 1

head(geoT$Text[grep("pennine way", geoT$Text, ignore.case = TRUE)])

## [1] "Can't wait to walk the next section on the Pennine Way #hiking #backpacking"

length(grep("pennine", geoT$Text, ignore.case = TRUE))

## [1] 199

head(geoT$Text[grep("pennine", geoT$Text, ignore.case = TRUE)])

## [1] "I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford) http://4sq.com/li9C50"                                                  
## [2] "I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford) http://4sq.com/jIevab"                                                  
## [3] "is across the Pennines for a day's examining in Leeds."                                                                                   
## [4] "@MD_Wainwright @bumblecricket This is an ideal occasion for bumble to watch some cricket on the right side of the pennines for a change!!"
## [5] "Dramatic skies as we head over the Pennines into God's Own County. (@ Starbucks) http://4sq.com/nPgvE8"                                   
## [6] "@georgeyboy From what I'm reading_ the fans didnt want to cross the pennines & go to the Galpharm_ so RFL listened to fans."

pTweets <- geoT[grep("pennine", geoT$Text, ignore.case = TRUE), ]
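The two searches trade precision for recall: ‘pennine way’ finds a single tweet, while ‘pennine’ also picks up check-ins at Pennine House. A minimal sketch of comparing the two patterns, plus a slightly stricter one, follows. It runs on three tweets copied from the output above; the third pattern and its exclusion rule are illustrative assumptions, not part of the original analysis.

```r
# Three tweets copied from the output above
txt <- c(
  "Can't wait to walk the next section on the Pennine Way #hiking #backpacking",
  "I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford) http://4sq.com/li9C50",
  "is across the Pennines for a day's examining in Leeds."
)

# Broad search: any mention of "pennine"
broad <- grepl("pennine", txt, ignore.case = TRUE)

# Strict search: the exact phrase only
strict <- grepl("pennine way", txt, ignore.case = TRUE)

# Assumed middle ground: the whole word "pennine(s)", minus a known false positive
hills <- grepl("\\bpennines?\\b", txt, ignore.case = TRUE, perl = TRUE) &
  !grepl("pennine house", txt, ignore.case = TRUE)

data.frame(broad, strict, hills)
```

A list of exclusions like this would need extending by hand after reading a sample of the matched tweets.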

Buffer around the Pennine Way

The coordinates of the Pennine Way can be downloaded directly from the internet, from hiking.waymarkedtrails.org/en/:

download.file(url = "http://hiking.waymarkedtrails.org/en/routebrowser/63872/gpx", "pennine.gpx")
library(rgdal)

## rgdal: version: 0.8-16, (SVN revision 498)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.9.2, released 2012/10/08
## Path to GDAL shared files: /usr/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)

ogrListLayers("pennine.gpx")

## [1] "waypoints"    "routes"       "tracks"       "route_points"
## [5] "track_points"

pw <- readOGR("pennine.gpx", layer = "tracks")

## OGR data source with driver: GPX 
## Source: "pennine.gpx", layer: "tracks"
## with 9 features and 12 fields
## Feature type: wkbMultiLineString with 2 dimensions

pw <- spTransform(pw, CRS("+init=epsg:27700")) # transform CRS to OSGB
library(rgeos)

## rgeos version: 0.3-5, (SVN revision 447)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Polygon checking: TRUE

pwBuf <- gBuffer(pw, width = 15000) # create 15 km buffer
plot(pwBuf) # plot to test dimensions make sense

Figure: the 15 km buffer around the Pennine Way.

pwBuf <- spTransform(pwBuf, CRS("+init=epsg:4326"))
proj4string(geoT) <- CRS("+init=epsg:4326")
PennineTweets <- geoT[pwBuf, ]
nrow(PennineTweets)

## [1] 52106

plot(pwBuf)
points(PennineTweets)

Figure: the buffer with the tweets that fall inside it.

Unfortunately, the 15 km buffer around the Pennine Way only just intersects the edge of the Twitter dataset, capturing just over 50,000 tweets. All of these are more than 10 km from the route, so they cannot be classed as Pennine Tweets.
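The claim that every captured tweet is more than 10 km from the route can be checked by computing each point's distance to the line in a projected coordinate system such as OSGB, where units are metres. A minimal base R sketch follows, with made-up coordinates: `dist_to_segment` and `dist_to_line` are hypothetical helpers, and the route and points are toy values, not the real Pennine Way or tweets. With the objects used in this analysis, a function such as `gDistance` from rgeos could serve the same purpose.

```r
# Distance (metres) from points to one straight segment, planar coordinates
dist_to_segment <- function(px, py, x1, y1, x2, y2) {
  dx <- x2 - x1
  dy <- y2 - y1
  len2 <- dx^2 + dy^2
  # Position of the closest point along the segment, clamped to [0, 1]
  t <- if (len2 == 0) 0 else ((px - x1) * dx + (py - y1) * dy) / len2
  t <- pmin(pmax(t, 0), 1)
  sqrt((px - (x1 + t * dx))^2 + (py - (y1 + t * dy))^2)
}

# Distance from each point to the nearest segment of a polyline
# `line` is a two-column matrix of x and y vertices
dist_to_line <- function(px, py, line) {
  segs <- seq_len(nrow(line) - 1)
  d <- vapply(segs, function(i) {
    dist_to_segment(px, py, line[i, 1], line[i, 2], line[i + 1, 1], line[i + 1, 2])
  }, numeric(length(px)))
  d <- matrix(d, nrow = length(px))  # one row per point, one column per segment
  apply(d, 1, min)
}

# Toy route running 20 km north, and three toy points
route <- rbind(c(0, 0), c(0, 20000))
px <- c(12000, 5000, 16000)
py <- c(10000, 25000, 10000)

d <- dist_to_line(px, py, route)
in_buffer <- d <= 15000   # inside a 15 km buffer
beyond_10km <- d > 10000  # further than 10 km from the route
data.frame(distance_m = round(d), in_buffer, beyond_10km)
```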

Downloading the ‘Pennine Tweets’

The Twitter dataset we’ve looked at is unlikely to be of much use as it does not actually intersect with the Pennine Way. However, the method for extracting ‘Pennine Tweets’ has been demonstrated, and the 199 example tweets containing ‘pennine’ have been filtered from the original dataset of 2.8 million tweets.

These were saved to my Dropbox folder with the following command:

write.csv(pTweets, "~/Dropbox/Public/tmp/ptweets.csv")

The data can be downloaded from here.

Further work

There is an ongoing project to download all geotagged tweets nationwide, which can be found on the tweepy repository. This may provide better Twitter data on the Pennine Way, but hopes should not be high: ramblers, except for RamblersGB, are not known for their tweeting ways!