Trajectory Analysis in R - Final Post

This project aimed to develop an R package named trajectories that is tailored to trajectory analysis and builds on existing R packages such as sp and spacetime. Since the beginning of the project, progress has been made in three main areas. First, several functions have been developed to allow coercion between common spatial classes for trajectories. Second, a set of functions was written to analyze trajectory data. Third, a plot() function was created specifically to visualize trajectory data. In addition, two sample trajectory datasets were integrated into the package for demonstration as well as for user testing. The package is managed at r-forge.org and can be viewed and downloaded at this link.

For the rest of this post, I will first briefly introduce the sample datasets, and then discuss the trajectories package from the three aspects mentioned above.

To demonstrate the functionality of the package, two sample trajectory datasets are included in it. The first dataset is named traj_sample. It was created from public GPS trajectories uploaded by user alex18 at OpenStreetMap.org. Five trajectories are available in this sample as five STI objects stored in a single list object in R. The second dataset, named geolife_sample, was extracted from the GeoLife dataset released by Microsoft. It includes trajectories collected from mobile phone users via GPS. A total of 15 trajectories from three users are stored in the dataset as a single STTDF object. The traj_sample dataset was mostly used for testing while developing the methods because of its relatively small size. In contrast, the geolife_sample dataset is relatively richer in the amount of trajectory data. Hence, it will be used later in this post to demonstrate the package. The size of the full GeoLife dataset is 298.66MB. Users may choose to download the entire dataset for testing from the link on the GeoLife webpage.

The first step is to load the sample dataset geolife_sample from the trajectories package. The code for loading the dataset is shown below:

# Load the package. You will see some compatibility-related warning
# messages. Just ignore them at this point.
library("trajectories")
## Loading required package: sp Loading required package: spacetime Loading
## required package: rgeos rgeos version: 0.2-19, (SVN revision 394) GEOS
## runtime version: 3.3.3-CAPI-1.7.4 Polygon checking: TRUE
## Warning: the specification for class "im" in package 'maptools' seems
## equivalent to one from package 'sp' and is not turning on duplicate class
## definitions for this class Warning: the specification for class "owin" in
## package 'maptools' seems equivalent to one from package 'sp' and is not
## turning on duplicate class definitions for this class Warning: the
## specification for class "ppp" in package 'maptools' seems equivalent to
## one from package 'sp' and is not turning on duplicate class definitions
## for this class Warning: the specification for class "psp" in package
## 'maptools' seems equivalent to one from package 'sp' and is not turning on
## duplicate class definitions for this class Warning: replacing previous
## import '.__C__im' when loading 'maptools' Warning: replacing previous
## import '.__C__owin' when loading 'maptools' Warning: replacing previous
## import '.__C__ppp' when loading 'maptools' Warning: replacing previous
## import '.__C__psp' when loading 'maptools'
## Attaching package: 'trajectories'
## 
## The following object is masked from 'package:stats':
## 
## aggregate

# Load the geolife sample dataset
data(geolife_sample)

# Make the name shorter
geolife <- geolife_sample
class(geolife)
## [1] "STTDF"
## attr(,"package")
## [1] "spacetime"
slotNames(geolife)
## [1] "data"    "traj"    "sp"      "time"    "endTime"

As the slotNames() output above shows, the object geolife has five slots: data, traj, sp, time, and endTime.

Next, three functions that allow coercion between common spatial classes for trajectories are listed in the table below:

| Name | Description |
| --- | --- |
| STItoSTTDF() | Coerces a list of STI objects into an STTDF object and computes trajectory attributes such as distance, time, average speed, turning angle, elevation change, etc. |
| STItoSpatialLines() | Coerces an STI object into a SpatialLines object. The time slot of the STI object is discarded. |
| STTDFtoSpatialLines() | Coerces an STTDF object into a SpatialLines object. The time and data slots of the STTDF object are discarded. |

The usage of these three functions is shown below:

sttdf <- STItoSTTDF(geolife@traj)
class(sttdf)
## [1] "STTDF"
## attr(,"package")
## [1] "spacetime"

sl <- STItoSpatialLines(geolife@traj[[1]])
class(sl)
## [1] "SpatialLines"
## attr(,"package")
## [1] "sp"

sl2 <- STTDFtoSpatialLines(geolife)
class(sl2)
## [1] "SpatialLines"
## attr(,"package")
## [1] "sp"
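
The table above notes that STItoSTTDF() derives trajectory attributes such as distance, time, and average speed. As a rough illustration of what such attributes involve, the sketch below computes them in base R for a few made-up GPS fixes. The helper haversine_km() and the sample coordinates are hypothetical, and the great-circle formula on a spherical Earth is an assumption; it is not necessarily how the trajectories package computes distance.

```r
# Hypothetical helper: great-circle distance in km between pairs of fixes.
# Assumes a spherical Earth with a radius of 6371 km.
haversine_km <- function(long1, lat1, long2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up fixes, roughly in the area covered by geolife_sample
pts <- data.frame(
  long = c(116.300, 116.305, 116.310, 116.310),
  lat  = c(39.980, 39.980, 39.985, 39.990),
  time = as.POSIXct("2008-10-23 02:53:10", tz = "UTC") + c(0, 60, 120, 180)
)

# Per-segment distance (km) and time lapsed (seconds) between consecutive fixes
n <- nrow(pts)
seg_dist <- haversine_km(pts$long[-n], pts$lat[-n], pts$long[-1], pts$lat[-1])
seg_time <- as.numeric(diff(pts$time), units = "secs")

total_dist  <- sum(seg_dist)                    # km
time_lapsed <- sum(seg_time)                    # seconds
ave_speed   <- total_dist / time_lapsed * 3600  # km/h
```

Working per segment (between consecutive fixes) keeps the totals easy to re-aggregate later, for example by hour or by day.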

One of the major goals of trajectory analysis is to extract useful information from trajectory data. To this end, three functions were developed to manipulate and summarize the trajectory data.

The summary() function summarizes the basic properties and statistics of the trajectory data, stored either as an STI object or as an STTDF object.

sti <- geolife@traj[[1]]
summary(sti)
## $class
## [1] "STI"
## attr(,"package")
## [1] "spacetime"
## 
## $bbox
##         min    max
## long 116.29 116.32
## lat   39.98  40.01
## 
## $is.projected
## [1] FALSE
## 
## $proj4string
## [1] "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
## 
## $npoints
## [1] 907
## 
## $starting_time
## [1] "2008-10-23 02:53:10 PDT"
## 
## $ending_time
## [1] "2008-10-23 11:11:12 PDT"
## 
## $duration
## [1] 8.3
## 
## attr(,"class")
## [1] "summary.STI"

summary(geolife)
## $class
## [1] "STTDF"
## attr(,"package")
## [1] "spacetime"
## 
## $bbox
##      min    max
## x 116.15 116.39
## y  39.89  40.08
## 
## $is.projected
## [1] FALSE
## 
## $proj4string
## [1] "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
## 
## $ntraj
## [1] 15
## 
## $starting_time
## [1] "2008-10-23 02:53:10 PDT"
## 
## $ending_time
## [1] "2008-10-28 05:03:42 PDT"
## 
## $duration
## [1] 122.2
## 
## $total_dist
## [1] 364.3
## 
## $time_lapsed
## [1] 37404
## 
## $ave_speed
## [1] 0.0097
## 
## $ave_elevation
## [1] 334.8
## 
## attr(,"class")
## [1] "summary.STTDF"
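
The units of the summary fields are not stated in the output, but the printed values can be cross-checked with a little arithmetic. The sketch below assumes that duration is the ending time minus the starting time in hours, that total_dist is in kilometres, and that time_lapsed is in seconds; these are inferences from the numbers above, not documented behavior of the package.

```r
# Timestamps copied from the summary() output above. The time zone label is
# dropped because only the difference between the two times matters here.
start_sti <- as.POSIXct("2008-10-23 02:53:10", tz = "UTC")
end_sti   <- as.POSIXct("2008-10-23 11:11:12", tz = "UTC")
duration_sti <- as.numeric(difftime(end_sti, start_sti, units = "hours"))
round(duration_sti, 1)
## [1] 8.3

start_all <- as.POSIXct("2008-10-23 02:53:10", tz = "UTC")
end_all   <- as.POSIXct("2008-10-28 05:03:42", tz = "UTC")
duration_all <- as.numeric(difftime(end_all, start_all, units = "hours"))
round(duration_all, 1)
## [1] 122.2

# Assumption: total_dist is in km and time_lapsed is in seconds
total_dist  <- 364.3
time_lapsed <- 37404
ave_speed <- total_dist / time_lapsed  # km per second
round(ave_speed, 4)
## [1] 0.0097
```

Under these assumptions the three derived values agree with the printed $duration and $ave_speed fields.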

The aggregate() function aggregates the trajectory data over various temporal scales such as hour, day of month, or month of year, and returns a data frame containing the summary. In the data frame, the first column indicates the temporal unit (e.g., hour, day, month). The second column contains the total distance in that unit. The third column contains the total time lapsed (in seconds) in that unit. The fourth column contains the average elevation in that unit. The fifth column contains the total number of points in that unit. The sixth column contains the average speed (km/h) in that unit.

aggregate(geolife, "hour")
##       dist timeLapsed    elev   np speed
## 1   0.0000          0       0    0   NaN
## 2   0.0000          0       0    0   NaN
## 3   0.0000          0       0    0   NaN
## 4   0.0000          0       0    0   NaN
## 5   0.0000          0       0    0   NaN
## 6   0.0000          0       0    0   NaN
## 7   0.0000          0       0    0   NaN
## 8   0.0000          0       0    0   NaN
## 9   0.0000          0       0    0   NaN
## 10 36.4337       3108  703709 3110 42.20
## 11 26.4016       2537  622561 2537 37.46
## 12 14.8010        956  476643  959 55.74
## 13  6.7556       1397  716136 1397 17.41
## 14  6.0347       2038 1334814 2038 10.66
## 15  5.3388        852 1110787  852 22.56
## 16 12.1635       1065 1388155 1066 41.12
## 17  0.5935        119  181018  119 17.95
## 18  0.0000          0       0    0   NaN
## 19  0.0000          0       0    0   NaN
## 20  0.0000          0       0    0   NaN
## 21  0.0000          0       0    0   NaN
## 22  0.0002          1     521    1  0.72
## 23  6.8711       1094  477434 1094 22.61
## 24  0.0000          0       0    0   NaN

In the example above, the hourly statistics of the trajectories are listed in a data frame, with the statistics for each hour forming one row. Similarly, the daily and monthly statistics can be obtained by executing aggregate(geolife, "day") or aggregate(geolife, "month").
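
To make the aggregation idea concrete, here is a minimal base-R sketch of an hourly summary. The data frame segs, its columns, and the helper sum_by_hour() are hypothetical, and mapping rows 1 to 24 onto hours 0 to 23 is a convention of this sketch only. The speed formula (distance in km divided by seconds, times 3600) is an assumption, although it does reproduce the printed values, e.g. 36.4337 / 3108 * 3600 gives about 42.20.

```r
# Hypothetical per-segment records: start time, length (km), time lapsed (s)
segs <- data.frame(
  time = as.POSIXct(c("2008-10-23 09:10:00", "2008-10-23 09:40:00",
                      "2008-10-23 10:05:00", "2008-10-24 09:15:00"),
                    tz = "UTC"),
  dist = c(0.20, 0.30, 0.10, 0.50),
  timeLapsed = c(60, 60, 30, 120)
)

# Hour of day as a factor with all 24 levels, so that hours without data
# still get a row (zero totals, NaN speed)
hour <- factor(as.integer(format(segs$time, "%H")), levels = 0:23)

sum_by_hour <- function(x) {
  s <- tapply(x, hour, sum)
  s[is.na(s)] <- 0  # empty hours come back as NA from tapply()
  as.vector(s)
}

hourly <- data.frame(
  dist = sum_by_hour(segs$dist),
  timeLapsed = sum_by_hour(segs$timeLapsed),
  np = as.vector(table(hour))
)
hourly$speed <- hourly$dist / hourly$timeLapsed * 3600  # km/h; 0/0 gives NaN

# Rows 9 to 11 correspond to hours 8, 9 and 10 in this sketch
hourly[9:11, ]
```

Fixing the factor levels up front is what yields the rows of zeros and NaN speeds for hours with no data, similar to the output shown earlier.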

The crop() function was designed to spatially select trajectories that overlap a certain area. This is particularly useful when users want to spatially subset the trajectory data. Currently, the crop() function is able to spatially subset trajectory data using a SpatialPolygons object. At this point, only trajectories that fall entirely within the polygon are selected.

# Create a SpatialPolygons object
lat_min <- min(geolife@traj[[1]]@sp$lat)
lat_max <- max(geolife@traj[[1]]@sp$lat)
long_min <- min(geolife@traj[[1]]@sp$long)
long_max <- max(geolife@traj[[1]]@sp$long)

xpol <- c(long_min, long_max, long_max, long_min, long_min)
ypol <- c(lat_min, lat_min, lat_max, lat_max, lat_min)

pol <- SpatialPolygons(list(Polygons(list(Polygon(cbind(xpol, ypol))), ID = "x1")))
pol@proj4string <- CRS("+proj=longlat +datum=WGS84")

# Crop the geolife dataset using the polygon created
geolife_cropped <- crop(geolife, pol)
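
The "entirely within" rule can be sketched in base R for the special case of a rectangular area like the one built above. The names trajs, box, and inside_box() are hypothetical, and this is only an illustration of the selection rule, not the implementation used by crop(); a general polygon would need a proper point-in-polygon test.

```r
# Hypothetical trajectories: a named list of data frames with long/lat columns
trajs <- list(
  a = data.frame(long = c(116.30, 116.31), lat = c(39.99, 40.00)),
  b = data.frame(long = c(116.30, 116.35), lat = c(39.99, 40.00)),  # leaves the box
  c = data.frame(long = c(116.29, 116.32), lat = c(39.98, 40.01))   # on the boundary
)

# Rectangle taken from the bounding box printed by summary(sti) earlier
box <- list(long_min = 116.29, long_max = 116.32,
            lat_min = 39.98, lat_max = 40.01)

# TRUE only when every point of the trajectory lies inside (or on) the box
inside_box <- function(traj, box) {
  all(traj$long >= box$long_min & traj$long <= box$long_max &
      traj$lat  >= box$lat_min  & traj$lat  <= box$lat_max)
}

keep <- vapply(trajs, inside_box, logical(1), box = box)
cropped <- trajs[keep]
names(cropped)
## [1] "a" "c"
```

Trajectory b is dropped because one of its points lies outside the rectangle, even though it starts inside it.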

Last but not least, the plot() function visualizes trajectory data, including the results from the previous example. It coerces either an STI object or an STTDF object into a SpatialLines object, and then plots the SpatialLines object on the current graphics device in R. By default, plot() draws a single STI object in black and draws an STTDF object using a distinct color for each trajectory.

# Plotting an STI object
plot(geolife@traj[[1]])

[Figure: plot of the single STI object geolife@traj[[1]]]

# Plotting an STTDF object
plot(geolife)
plot(pol, add = TRUE)

[Figure: plot of the STTDF object geolife with the polygon pol overlaid]

# Plotting the trajectories cropped by the polygon
plot(geolife_cropped)
plot(pol, add = TRUE)

[Figure: plot of geolife_cropped with the polygon pol overlaid]
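
For readers without the package at hand, the default behavior described above (one distinct color per trajectory, with an area drawn on top) can be approximated with base graphics. This is a sketch only: the list trajs and its coordinates are made up, and rainbow() is just one possible palette, not necessarily the one plot() uses for STTDF objects.

```r
# Hypothetical trajectories: a list of data frames with long/lat columns
trajs <- list(
  data.frame(long = c(116.30, 116.31, 116.32), lat = c(39.98, 39.99, 40.00)),
  data.frame(long = c(116.29, 116.30, 116.30), lat = c(40.01, 40.00, 39.99)),
  data.frame(long = c(116.31, 116.32, 116.32), lat = c(39.98, 39.98, 39.99))
)

# One distinct color per trajectory
cols <- rainbow(length(trajs))

# Set up an empty plot that spans all trajectories, then add each as a line
all_pts <- do.call(rbind, trajs)
plot(all_pts$long, all_pts$lat, type = "n", xlab = "long", ylab = "lat")
for (i in seq_along(trajs)) {
  lines(trajs[[i]]$long, trajs[[i]]$lat, col = cols[i])
}

# Overlay a rectangle, analogous to plot(pol, add = TRUE)
rect(116.29, 39.98, 116.32, 40.01, border = "black")
```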

The following aspects can be further developed/improved in future:

It has been a great pleasure to work on an R package over the summer. Before this project, I had been an R user for a year, and it was exciting to make the transition to R developer. At the very beginning of the project, I did not expect a big difference between being an R user and being an R developer because, essentially, both write code in R. It turned out I was dead wrong. Fortunately, I had the opportunity to work with leading R developers, who showed me how to make this transition. I spent extra time adapting to the role of an R developer and learning the tools needed to build R packages.

It was also exciting to work with big real-world data while developing the package. Not only did it let me see how the functions I built can transform data into information and even knowledge, but it also made me think about the limitations imposed by the size of the data.

I would definitely recommend GSoC to students who are interested in R, as it is a great opportunity to work closely with experts in the field, make connections in the community, and get paid through generous support from Google.