Let’s say that your exploring a dataset of phsyical activity data from multiple participants. One of the things you may be interested in exploring is how often individuals meet their phsyical activity goals. This project is meant to provide an example of a few different data exploration methods.
Data was collected as part of my doctoral dissertation. Physical activity data was gathered from Fitbit activity trackers, which the participants owned. Participant goals were collected as part of in-depth interviews.
AllParticipantsValidDays <- read.csv("AllParticipantsValidDays.csv")
Let’s get rid of some of the extraneious variables.
ExcludeVars <- names(AllParticipantsValidDays) %in% c("X.1", "X", "ActivityDay", "MET90ValidDay")
AllParticipantsValidDaysClean <- AllParticipantsValidDays[!ExcludeVars]
str(AllParticipantsValidDaysClean)
## 'data.frame': 15953 obs. of 6 variables:
## $ X.2 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ StepTotal : int 9261 12084 6250 9182 8139 4827 5720 5235 4850 7024 ...
## $ MET90ValidMin: int 1440 1440 1440 1440 1315 1440 1293 1328 1366 1440 ...
## $ id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StepCategory : Factor w/ 5 levels "0-4999","10000-12499",..: 5 2 4 5 5 1 4 4 1 4 ...
## $ Date : Factor w/ 2054 levels "2009-12-27","2009-12-28",..: 740 741 742 743 744 745 746 747 748 749 ...
What do the columns reflect?
Hmmm…“Date” isn’t a date. Let’s fix that.
AllParticipantsValidDaysClean$Date <- as.Date(AllParticipantsValidDaysClean$Date, format("%Y-%m-%d"))
We might want to describe the data at some point so let’s create a summary table based on this data before we go any further. I’m going to use the plyr package and summarise to create the table.
library(plyr)
GoalSummaryBYid <- ddply(AllParticipantsValidDaysClean, c("id"), summarise,
N = length(StepTotal),
mean = mean(StepTotal),
sd = sd(StepTotal),
se = sd / sqrt(N),
min = min(StepTotal),
max = max(StepTotal),
median = median(StepTotal)
)
head(GoalSummaryBYid)
## id N mean sd se min max median
## 1 1 1096 5581.911 3742.404 113.0435 109 26378 5042.0
## 2 2 1062 8480.120 3766.051 115.5644 278 31516 7525.0
## 3 3 686 9723.647 4501.774 171.8785 818 29741 8818.0
## 4 4 198 19175.106 3346.568 237.8303 10761 36110 18706.0
## 5 5 472 8543.055 3562.972 163.9991 2080 27093 7887.0
## 6 6 542 13571.360 8651.764 371.6251 519 68565 11790.5
Looks good!
Let’s move on to workign with goals. We’ll first load the goal data we gathered from the interviews and add it complete data file.
StepGoals <- read.csv("stepgoals.csv")
AllParticipantsValidDaysGoalsClean <- merge(AllParticipantsValidDaysClean, StepGoals, by="id")
Let’s calculate the Goals vs. Actual Steps discrepancy and label which days goals are met.
AllParticipantsValidDaysGoalsClean$StepGoalDiff <- AllParticipantsValidDaysGoalsClean$StepTotal - AllParticipantsValidDaysGoalsClean$stepgoal
AllParticipantsValidDaysGoalsClean$GoalMeet <- ifelse(AllParticipantsValidDaysGoalsClean$StepGoalDiff > 1, 1, 0)
I think it will be handy to have a categorical variable to describe the whether or not the particianpt met their goal. Let’s create a factor column.
AllParticipantsValidDaysGoalsClean$GoalMeetText <- ifelse(AllParticipantsValidDaysGoalsClean$StepGoalDiff > 1, "Goal Met", "Goal Not Met")
So we have a large data file with 15,953 observations and we want to understand it a bit more. Let’s get started!
I like to get a look at the data before I really dive into analysis so let’s try to plot it. We’ll start with a bar plot that relfects each day of data we have and whether or not the participant met their goal.
library(ggplot2)
library(scales)
#Generate a bar plot of P01's data. This is a bar plot of all the daily data, colored by whether or not the set goal has been met.
#Added a horizontal line
p01GoalBarChart <- ggplot(aes(x=Date, y=StepTotal, fill=GoalMeetText), data=subset(AllParticipantsValidDaysGoalsClean, id == 1)) +
geom_bar(stat = 'identity') +
scale_x_date(date_breaks = '2 month', labels = date_format("%m-%y")) +
theme_bw()
p01GoalBarChart
I like this plot, but let’s clean it up a bit and add a horizontal line for the participant’s goal.
p01GoalBarChart + guides(fill=guide_legend(title=NULL)) +
geom_hline(aes(yintercept = stepgoal), data=subset(AllParticipantsValidDaysGoalsClean, id == 1)) +
xlab("Date (Month-Year)") +
ylab("Steps per Day") +
ggtitle("Steps per Valid Day Categorized by Goal Meeting")
Let’s combine those plotting methods and explore more than one participant.
#Create a subset of the data and the goals data that contains the first five participants
FirstFive <- AllParticipantsValidDaysGoalsClean[AllParticipantsValidDaysGoalsClean$id %in% c(1:5), ]
FirstFiveGoals <- StepGoals[StepGoals$id %in% c(1:5), ]
GoalBarChart <- ggplot(FirstFive, aes(x=Date, y=StepTotal, fill=GoalMeetText)) +
geom_bar(stat = 'identity') +
facet_grid(id ~.) +
scale_x_date(date_breaks = '3 month', labels = date_format("%m-%y")) +
guides(fill=guide_legend(title=NULL)) +
geom_hline(aes(yintercept = stepgoal), data=FirstFiveGoals) +
xlab("Date (Month-Year)") +
ylab("Steps per Day") +
ggtitle("Steps per Valid Day Categorized by Goal Meeting") +
theme_bw()
GoalBarChart
One of the interesting things you can do with longitudinal data is to explore how frequently people meet their goals.
First, let’s get some descriptive data about how frequently people meet and don’t meet their goals, and the associated step data for each.
library(plyr)
GoalSummaryBYidANDgoals <- ddply(AllParticipantsValidDaysGoalsClean, c("id", "GoalMeet"), summarise,
N = length(StepTotal),
mean = mean(StepTotal),
sd = sd(StepTotal),
se = sd / sqrt(N),
min = min(StepTotal),
max = max(StepTotal),
median = median(StepTotal)
)
head(GoalSummaryBYidANDgoals)
## id GoalMeet N mean sd se min max median
## 1 1 0 791 3739.263 2007.337 71.37268 109 7499 3585
## 2 1 1 305 10360.711 2868.790 164.26645 7516 26378 9357
## 3 2 0 769 6670.182 1701.349 61.35221 278 9987 6621
## 4 2 1 293 13230.433 3556.991 207.80162 10025 31516 12154
## 5 3 0 286 5777.339 1420.504 83.99617 818 7985 5981
## 6 3 1 400 12545.257 3769.199 188.45994 8016 29741 11623
This is great, but to create frequencies we need the total number of days in the dataset for each participant. Good thing we created the GoalSummaryBYid object earlier! We can use the variable, “N”, to generate frequecy percentages.
First we need to create a data frame that has the total number of days per participants.
id <- (GoalSummaryBYid$id)
ValidDays <- (GoalSummaryBYid$N)
ValidDaysPerParticipant <- data.frame(id, ValidDays)
Now we can merge those together and create our frequency variable.
GoalSummaryBYidANDgoals <- merge(GoalSummaryBYidANDgoals, ValidDaysPerParticipant, by="id")
GoalSummaryBYidANDgoals$Freq <- (GoalSummaryBYidANDgoals$N / GoalSummaryBYidANDgoals$ValidDays)
head(GoalSummaryBYidANDgoals)
## id GoalMeet N mean sd se min max median
## 1 1 0 791 3739.263 2007.337 71.37268 109 7499 3585
## 2 1 1 305 10360.711 2868.790 164.26645 7516 26378 9357
## 3 2 0 769 6670.182 1701.349 61.35221 278 9987 6621
## 4 2 1 293 13230.433 3556.991 207.80162 10025 31516 12154
## 5 3 0 286 5777.339 1420.504 83.99617 818 7985 5981
## 6 3 1 400 12545.257 3769.199 188.45994 8016 29741 11623
## ValidDays Freq
## 1 1096 0.7217153
## 2 1096 0.2782847
## 3 1062 0.7241055
## 4 1062 0.2758945
## 5 686 0.4169096
## 6 686 0.5830904
I’m really interested in teasing out differences among the participants related to their physical activity. One way I’m exploring that is to look at their consistency around meeting their goals. My first pass at this is to explore streaks (runs of consecutive days) of meeting/not meeting theri goals.
To understand streaks I’m using a process called Run Length Encoding (RLE). You can learn more about Rhere, but basically RLE looks for consecutive instances of the same value along a vector. Since we already coded the variable GoalMeet as either 0 or 1 (not meeting, meeting) we can use that vector for RLE.
Now, remember that we have all participants in one data object and we want to understand streaks at the per-participant level. To do that we first need to create a list of objects that is split by participant id.
ParticipantValidDaysList <- split(AllParticipantsValidDaysGoalsClean, AllParticipantsValidDaysGoalsClean$id)
I’m using the rle2 function from the Accelerometery package. Supposedly it handles larger data file faster than rle in base R.
In order to properly append the ID from our initial data frame to the new data frame we’re creating we’ll use a function, my.rle, that adds the ID value to a new variable, id.
library(accelerometry)
#Function for appending the ID from initial data frame to the new data frame created through rle2.
my.rle <- function(x) {
final <- as.data.frame(rle2(x$GoalMeet))
final$id <- unique(x$id)
return(final)
}
#Applying the new function to the list to create a new list of of streaks data.
AllStreaksList<- lapply(ParticipantValidDaysList, my.rle)
#Bind all the list objects into one data frame.
AllParticipantsStreaks <- do.call(rbind, AllStreaksList)
head(AllParticipantsStreaks)
## values lengths id
## 1.1 1 2 1
## 1.2 0 1 1
## 1.3 1 2 1
## 1.4 0 9 1
## 1.5 1 4 1
## 1.6 0 3 1
Let’s create a table of summary statistics based on the streaks data we just created.
library(plyr)
StreaksSummaryBYidANDgoals <- ddply(AllParticipantsStreaks, c("id", "values"), summarise,
N = length(lengths),
mean = mean(lengths),
sd = sd(lengths),
se = sd / sqrt(N),
min = min(lengths),
max = max(lengths),
median = median(lengths)
)
head(StreaksSummaryBYidANDgoals)
## id values N mean sd se min max median
## 1 1 0 183 4.322404 4.278735 0.31629318 1 29 3
## 2 1 1 183 1.666667 1.178252 0.08709891 1 8 1
## 3 2 0 145 5.303448 8.616299 0.71554471 1 57 3
## 4 2 1 145 2.020690 2.063132 0.17133378 1 14 1
## 5 3 0 102 2.803922 2.714243 0.26875007 1 16 2
## 6 3 1 103 3.883495 10.917966 1.07577915 1 107 2