The data was loaded from an SQL database originally created by Kirk Harland:
# geoT <- SpatialPointsDataFrame(coords= matrix(c(db$Lon, db$Lat), ncol=2), data=db)
load("../tweet_store/.RData")
The full Twitter dataset contains 2.8 million tweets collected between September 2011 and April 2012. The bounding box of the tweets encompasses Leeds and Bradford, with the highest density of tweets in Leeds city centre. A range of variables was collected, but the subsequent analysis focuses on Text, the written content of each tweet. The available variables are shown below:
names(geoT) # the variables available for analysis
## Loading required package: sp
## [1] "Tweet_ID" "User_ID" "Screen_Name" "Full_Name" "Lat"
## [6] "Lon" "Loc_Txt" "Time" "Text" "Easting"
## [11] "Northing" "Year" "Month" "Day" "Hour"
## [16] "Minute" "Second"
nrow(geoT) / 1000000 # million tweets
## [1] 2.812
range(geoT$Time)
## [1] "Fri Apr 06 00:01:08 +0000 2012" "Wed Sep 28 23:56:21 +0000 2011"
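Note that the two values printed above are in alphabetical order, not chronological order: Time is stored as text, so "Fri ..." sorts before "Wed ...". A minimal sketch of parsing the timestamps first, assuming every value follows the format of the two examples shown (the names `times` and `parsed` are illustrative, not part of the original session):

```r
# Parse day and month names as English regardless of the session's language
invisible(Sys.setlocale("LC_TIME", "C"))

# The two values printed above, copied from the output of range(geoT$Time)
times <- c("Fri Apr 06 00:01:08 +0000 2012",
           "Wed Sep 28 23:56:21 +0000 2011")

# Convert text to date-times; the format string is an assumption
# based on these two examples
parsed <- as.POSIXct(times, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

range(parsed) # earliest and latest timestamp, in chronological order
```

The same `as.POSIXct()` call could be applied to the whole `geoT$Time` column, provided none of the values deviates from this format (any that do would become `NA`).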
summary(nchar(geoT$Text))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 38.0 60.0 66.6 91.0 521.0
geoT[ which.max(nchar(geoT$Text)), ]
## coordinates Tweet_ID User_ID Screen_Name
## 1731217 (-1.705, 53.71) 262608965366603776 253747929 CharlotteSwithy
## Full_Name Lat Lon Loc_Txt
## 1731217 E✖ample™ 53.71 -1.705 Kirklees_ Kirklees - United Kingdom
## Time
## 1731217 Sun Oct 28 17:37:11 +0000 2012
## Text
## 1731217 Phone's dead <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
## Easting Northing Year Month Day Hour Minute Second
## 1731217 419487 423367 2012 10 28 17 37 11
# plotting the spatial distribution of the tweets
library(ggmap)
##
## Attaching package: 'ggmap'
##
## The following object is masked _by_ '.GlobalEnv':
##
## crime
ggmap(get_map(bbox(geoT))) +
  geom_point(data = geoT@data[sample(1:nrow(geoT), size = nrow(geoT) / 10), ],
             aes(x = Lon, y = Lat), alpha = 0.1) # plot a random 10% sample of the tweets
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=53.822421,-1.54535&zoom=11&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Warning: Removed 25871 rows containing missing values (geom_point).
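Below we search the tweet text for mentions of the Pennine Way. Only one tweet contains the full phrase, whereas 199 contain the word "pennine", many of them referring to places such as Pennine House or to the Pennines in general: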
length(grep("pennine way", geoT$Text, ignore.case = T))
## [1] 1
head(geoT$Text[grep("pennine way", geoT$Text, ignore.case = T)])
## [1] "Can't wait to walk the next section on the Pennine Way #hiking #backpacking"
length(grep("pennine", geoT$Text, ignore.case = T))
## [1] 199
head(geoT$Text[grep("pennine", geoT$Text, ignore.case = T)])
## [1] "I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford) http://4sq.com/li9C50"
## [2] "I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford) http://4sq.com/jIevab"
## [3] "is across the Pennines for a day's examining in Leeds."
## [4] "@MD_Wainwright @bumblecricket This is an ideal occasion for bumble to watch some cricket on the right side of the pennines for a change!!"
## [5] "Dramatic skies as we head over the Pennines into God's Own County. (@ Starbucks) http://4sq.com/nPgvE8"
## [6] "@georgeyboy From what I'm reading_ the fans didnt want to cross the pennines & go to the Galpharm_ so RFL listened to fans."
pTweets <- geoT[ grep("pennine", geoT$Text, ignore.case = T) , ]
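Because a plain search for "pennine" mixes mentions of the trail, the hills and unrelated place names, it can help to separate these cases. The sketch below is one possible approach, not part of the original analysis: it uses three tweets copied (and in one case shortened) from the output above, and the object names are illustrative.

```r
# Three example tweets taken from the output above (the first is shortened)
texts <- c("I'm at The Pulse & Pulse 2 (Pennine House_ Well Street_ Bradford)",
           "is across the Pennines for a day's examining in Leeds.",
           "Can't wait to walk the next section on the Pennine Way #hiking #backpacking")

# Tweets naming the trail itself
mentions_way <- grepl("pennine way", texts, ignore.case = TRUE)

# Tweets naming the hills: \\b marks a word boundary, so "Pennine House" is excluded
mentions_range <- grepl("\\bpennines\\b", texts, ignore.case = TRUE, perl = TRUE)

# Everything else that merely contains "pennine" (e.g. building names)
mentions_other <- grepl("pennine", texts, ignore.case = TRUE) &
  !mentions_way & !mentions_range

data.frame(mentions_way, mentions_range, mentions_other)
```

The same logical vectors could be computed on `geoT$Text` and used to subset the full dataset in place of the single `grep()` call.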
The coordinates of the Pennine Way can be downloaded directly from hiking.waymarkedtrails.org/en/:
download.file(url = "http://hiking.waymarkedtrails.org/en/routebrowser/63872/gpx", "pennine.gpx")
library(rgdal)
## rgdal: version: 0.8-16, (SVN revision 498)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.9.2, released 2012/10/08
## Path to GDAL shared files: /usr/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
ogrListLayers("pennine.gpx")
## [1] "waypoints" "routes" "tracks" "route_points"
## [5] "track_points"
pw <- readOGR("pennine.gpx", layer = "tracks")
## OGR data source with driver: GPX
## Source: "pennine.gpx", layer: "tracks"
## with 9 features and 12 fields
## Feature type: wkbMultiLineString with 2 dimensions
pw <- spTransform(pw, CRS("+init=epsg:27700")) # transform CRS to OSGB
library(rgeos)
## rgeos version: 0.3-5, (SVN revision 447)
## GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921
## Polygon checking: TRUE
pwBuf <- gBuffer(pw, width = 15000) # create 15 km buffer
plot(pwBuf) # plot to test dimensions make sense
pwBuf <- spTransform(pwBuf, CRS("+init=epsg:4326"))
proj4string(geoT) <- CRS("+init=epsg:4326")
PennineTweets <- geoT[pwBuf, ]
nrow(PennineTweets)
## [1] 52106
plot(pwBuf)
points(PennineTweets)
Unfortunately, the 15 km buffer around the Pennine Way only just intersects the edge of the Twitter dataset, yielding just over 50,000 tweets. All of these are more than 10 km from the route itself, so they cannot be classed as Pennine Tweets.
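The distance of each tweet from the route can be checked directly. In the session above, `gDistance()` from rgeos applied to the projected (OSGB) objects would be one way to do this; the base R sketch below only illustrates the underlying calculation. The route and point coordinates are made up for the example, in metres, and the function names are hypothetical.

```r
# Distance from point p to the line segment a-b (projected coordinates, metres)
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  len2 <- sum(ab^2)
  if (len2 == 0) return(sqrt(sum((p - a)^2))) # degenerate segment
  t <- sum((p - a) * ab) / len2               # position of the projection along a-b
  t <- max(0, min(1, t))                      # clamp to the segment's end points
  sqrt(sum((p - (a + t * ab))^2))
}

# Distance from point p to a polyline given as a two-column matrix of vertices
dist_to_line <- function(p, line) {
  segs <- seq_len(nrow(line) - 1)
  min(vapply(segs, function(i) dist_to_segment(p, line[i, ], line[i + 1, ]),
             numeric(1)))
}

# Hypothetical route and two hypothetical tweet locations
route <- rbind(c(0, 0), c(0, 10000), c(5000, 20000))
pts   <- rbind(c(3000, 5000), c(20000, 5000))

d <- apply(pts, 1, dist_to_line, line = route)
d          # 3000 20000
d <= 15000 # TRUE FALSE: only the first point lies within a 15 km buffer
```

A point is inside the 15 km buffer exactly when its distance to the route is at most 15,000 m, so the same distances also show how far inside or outside the buffer each tweet lies.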
The Twitter dataset examined here is unlikely to be of much use because its extent does not overlap the Pennine Way itself. However, the method for extracting ‘Pennine Tweets’ has been demonstrated, and some example tweets have been filtered from the original dataset of 2.8 million tweets.
The keyword-filtered tweets (pTweets) were saved to my Dropbox folder with the following command:
write.csv(pTweets, "~/Dropbox/Public/tmp/ptweets.csv")
The data can be downloaded from here.
There is an ongoing project to download all geotagged tweets nationwide, which can be found on the tweepy repository. This may provide better Twitter data on the Pennine Way, but hopes should not be high: ramblers, except for RamblersGB, are not known for their tweeting ways!