Let’s say that your exploring a dataset of phsyical activity data from multiple participants. One of the things you may be interested in exploring is how often individuals meet their phsyical activity goals. This project is meant to provide an example of a few different data exploration methods.

The Data

Data was collected as part of my doctoral dissertation. Physical activity data was gathered from Fitbit activity trackers, which the participants owned. Participant goals were collected as part of in-depth interviews.

AllParticipantsValidDays <- read.csv("AllParticipantsValidDays.csv")

Let’s get rid of some of the extraneious variables.

ExcludeVars <- names(AllParticipantsValidDays) %in% c("X.1", "X", "ActivityDay", "MET90ValidDay")
AllParticipantsValidDaysClean <- AllParticipantsValidDays[!ExcludeVars]
str(AllParticipantsValidDaysClean)
## 'data.frame':    15953 obs. of  6 variables:
##  $ X.2          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ StepTotal    : int  9261 12084 6250 9182 8139 4827 5720 5235 4850 7024 ...
##  $ MET90ValidMin: int  1440 1440 1440 1440 1315 1440 1293 1328 1366 1440 ...
##  $ id           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StepCategory : Factor w/ 5 levels "0-4999","10000-12499",..: 5 2 4 5 5 1 4 4 1 4 ...
##  $ Date         : Factor w/ 2054 levels "2009-12-27","2009-12-28",..: 740 741 742 743 744 745 746 747 748 749 ...

What do the columns reflect?

  1. id = ID number for the participants
  2. StepToal = total number of steps
  3. MET90ValidMin = number of valid minutes (calcuated using Accelerometry package)
  4. StepCategory = Activity level categorization based on step total
  5. Date = date

Hmmm…“Date” isn’t a date. Let’s fix that.

AllParticipantsValidDaysClean$Date <- as.Date(AllParticipantsValidDaysClean$Date, format("%Y-%m-%d"))

We might want to describe the data at some point so let’s create a summary table based on this data before we go any further. I’m going to use the plyr package and summarise to create the table.

library(plyr)
GoalSummaryBYid <- ddply(AllParticipantsValidDaysClean, c("id"), summarise,
      N    = length(StepTotal),
      mean = mean(StepTotal),
      sd   = sd(StepTotal),
      se   = sd / sqrt(N),
      min  = min(StepTotal),
      max  = max(StepTotal),
      median = median(StepTotal)
)

head(GoalSummaryBYid)
##   id    N      mean       sd       se   min   max  median
## 1  1 1096  5581.911 3742.404 113.0435   109 26378  5042.0
## 2  2 1062  8480.120 3766.051 115.5644   278 31516  7525.0
## 3  3  686  9723.647 4501.774 171.8785   818 29741  8818.0
## 4  4  198 19175.106 3346.568 237.8303 10761 36110 18706.0
## 5  5  472  8543.055 3562.972 163.9991  2080 27093  7887.0
## 6  6  542 13571.360 8651.764 371.6251   519 68565 11790.5

Looks good!

Let’s move on to workign with goals. We’ll first load the goal data we gathered from the interviews and add it complete data file.

StepGoals <- read.csv("stepgoals.csv")
AllParticipantsValidDaysGoalsClean <- merge(AllParticipantsValidDaysClean, StepGoals, by="id")

Let’s calculate the Goals vs. Actual Steps discrepancy and label which days goals are met.

AllParticipantsValidDaysGoalsClean$StepGoalDiff <- AllParticipantsValidDaysGoalsClean$StepTotal - AllParticipantsValidDaysGoalsClean$stepgoal
AllParticipantsValidDaysGoalsClean$GoalMeet <- ifelse(AllParticipantsValidDaysGoalsClean$StepGoalDiff > 1, 1, 0)

I think it will be handy to have a categorical variable to describe the whether or not the particianpt met their goal. Let’s create a factor column.

AllParticipantsValidDaysGoalsClean$GoalMeetText <- ifelse(AllParticipantsValidDaysGoalsClean$StepGoalDiff > 1, "Goal Met", "Goal Not Met")

So we have a large data file with 15,953 observations and we want to understand it a bit more. Let’s get started!

Goals Analysis - Plots

I like to get a look at the data before I really dive into analysis so let’s try to plot it. We’ll start with a bar plot that relfects each day of data we have and whether or not the participant met their goal.

library(ggplot2)
library(scales)


#Generate a bar plot of P01's data. This is a bar plot of all the daily data, colored by whether or not the set goal has been met.
#Added a horizontal line 
p01GoalBarChart <- ggplot(aes(x=Date, y=StepTotal, fill=GoalMeetText), data=subset(AllParticipantsValidDaysGoalsClean, id == 1)) + 
  geom_bar(stat = 'identity') + 
  scale_x_date(date_breaks = '2 month', labels = date_format("%m-%y")) +
  theme_bw()

p01GoalBarChart

I like this plot, but let’s clean it up a bit and add a horizontal line for the participant’s goal.

p01GoalBarChart + guides(fill=guide_legend(title=NULL)) +
geom_hline(aes(yintercept = stepgoal), data=subset(AllParticipantsValidDaysGoalsClean, id == 1)) +
 xlab("Date (Month-Year)") + 
  ylab("Steps per Day") + 
  ggtitle("Steps per Valid Day Categorized by Goal Meeting")

Let’s combine those plotting methods and explore more than one participant.

#Create a subset of the data and the goals data that contains the first five participants
FirstFive <- AllParticipantsValidDaysGoalsClean[AllParticipantsValidDaysGoalsClean$id %in% c(1:5), ]
FirstFiveGoals <- StepGoals[StepGoals$id %in% c(1:5), ]

GoalBarChart <- ggplot(FirstFive, aes(x=Date, y=StepTotal, fill=GoalMeetText)) + 
  geom_bar(stat = 'identity') + 
  facet_grid(id ~.) +
  scale_x_date(date_breaks = '3 month', labels = date_format("%m-%y")) +
  guides(fill=guide_legend(title=NULL)) +
  geom_hline(aes(yintercept = stepgoal), data=FirstFiveGoals) +
  xlab("Date (Month-Year)") + 
  ylab("Steps per Day") + 
  ggtitle("Steps per Valid Day Categorized by Goal Meeting") +
  theme_bw()

GoalBarChart

Goal Analysis - Frequency

One of the interesting things you can do with longitudinal data is to explore how frequently people meet their goals.

First, let’s get some descriptive data about how frequently people meet and don’t meet their goals, and the associated step data for each.

library(plyr)
GoalSummaryBYidANDgoals <- ddply(AllParticipantsValidDaysGoalsClean, c("id", "GoalMeet"), summarise,
      N    = length(StepTotal),
      mean = mean(StepTotal),
      sd   = sd(StepTotal),
      se   = sd / sqrt(N),
      min  = min(StepTotal),
      max  = max(StepTotal),
      median = median(StepTotal)
)
head(GoalSummaryBYidANDgoals) 
##   id GoalMeet   N      mean       sd        se   min   max median
## 1  1        0 791  3739.263 2007.337  71.37268   109  7499   3585
## 2  1        1 305 10360.711 2868.790 164.26645  7516 26378   9357
## 3  2        0 769  6670.182 1701.349  61.35221   278  9987   6621
## 4  2        1 293 13230.433 3556.991 207.80162 10025 31516  12154
## 5  3        0 286  5777.339 1420.504  83.99617   818  7985   5981
## 6  3        1 400 12545.257 3769.199 188.45994  8016 29741  11623

This is great, but to create frequencies we need the total number of days in the dataset for each participant. Good thing we created the GoalSummaryBYid object earlier! We can use the variable, “N”, to generate frequecy percentages.

First we need to create a data frame that has the total number of days per participants.

id <- (GoalSummaryBYid$id)
ValidDays <- (GoalSummaryBYid$N)
ValidDaysPerParticipant <- data.frame(id, ValidDays)

Now we can merge those together and create our frequency variable.

GoalSummaryBYidANDgoals <- merge(GoalSummaryBYidANDgoals, ValidDaysPerParticipant, by="id")
GoalSummaryBYidANDgoals$Freq <- (GoalSummaryBYidANDgoals$N / GoalSummaryBYidANDgoals$ValidDays)

head(GoalSummaryBYidANDgoals)
##   id GoalMeet   N      mean       sd        se   min   max median
## 1  1        0 791  3739.263 2007.337  71.37268   109  7499   3585
## 2  1        1 305 10360.711 2868.790 164.26645  7516 26378   9357
## 3  2        0 769  6670.182 1701.349  61.35221   278  9987   6621
## 4  2        1 293 13230.433 3556.991 207.80162 10025 31516  12154
## 5  3        0 286  5777.339 1420.504  83.99617   818  7985   5981
## 6  3        1 400 12545.257 3769.199 188.45994  8016 29741  11623
##   ValidDays      Freq
## 1      1096 0.7217153
## 2      1096 0.2782847
## 3      1062 0.7241055
## 4      1062 0.2758945
## 5       686 0.4169096
## 6       686 0.5830904

Goals - Streaks

I’m really interested in teasing out differences among the participants related to their physical activity. One way I’m exploring that is to look at their consistency around meeting their goals. My first pass at this is to explore streaks (runs of consecutive days) of meeting/not meeting theri goals.

To understand streaks I’m using a process called Run Length Encoding (RLE). You can learn more about Rhere, but basically RLE looks for consecutive instances of the same value along a vector. Since we already coded the variable GoalMeet as either 0 or 1 (not meeting, meeting) we can use that vector for RLE.

Now, remember that we have all participants in one data object and we want to understand streaks at the per-participant level. To do that we first need to create a list of objects that is split by participant id.

ParticipantValidDaysList <- split(AllParticipantsValidDaysGoalsClean, AllParticipantsValidDaysGoalsClean$id)

I’m using the rle2 function from the Accelerometery package. Supposedly it handles larger data file faster than rle in base R.

In order to properly append the ID from our initial data frame to the new data frame we’re creating we’ll use a function, my.rle, that adds the ID value to a new variable, id.

library(accelerometry)

#Function for appending the ID from initial data frame to the new data frame created through rle2.
my.rle <- function(x) {
  final <- as.data.frame(rle2(x$GoalMeet))
  final$id <- unique(x$id)
  
  return(final)
}

#Applying the new function to the list to create a new list of of streaks data.
AllStreaksList<- lapply(ParticipantValidDaysList,  my.rle)

#Bind all the list objects into one data frame. 
AllParticipantsStreaks <- do.call(rbind, AllStreaksList)

head(AllParticipantsStreaks)
##     values lengths id
## 1.1      1       2  1
## 1.2      0       1  1
## 1.3      1       2  1
## 1.4      0       9  1
## 1.5      1       4  1
## 1.6      0       3  1

Let’s create a table of summary statistics based on the streaks data we just created.

library(plyr)
StreaksSummaryBYidANDgoals <- ddply(AllParticipantsStreaks, c("id", "values"), summarise,
      N    = length(lengths),
      mean = mean(lengths),
      sd   = sd(lengths),
      se   = sd / sqrt(N),
      min  = min(lengths),
      max  = max(lengths),
      median = median(lengths)
)

head(StreaksSummaryBYidANDgoals)
##   id values   N     mean        sd         se min max median
## 1  1      0 183 4.322404  4.278735 0.31629318   1  29      3
## 2  1      1 183 1.666667  1.178252 0.08709891   1   8      1
## 3  2      0 145 5.303448  8.616299 0.71554471   1  57      3
## 4  2      1 145 2.020690  2.063132 0.17133378   1  14      1
## 5  3      0 102 2.803922  2.714243 0.26875007   1  16      2
## 6  3      1 103 3.883495 10.917966 1.07577915   1 107      2