Test for Fitbit Measurement Bias

Whelp Activity Patterns

Test for Fitbit Measurement Bias - DRAFT

The Fitbit is a very small wearable digital accelerometer, worn unobtrusively by the pups on their collars. It is designed for humans as a measure of physical activity and is promoted as a fitness aid. Here we adapt it for an animal study, so what it records as “steps” of human movement may not translate directly into puppy steps. We have nevertheless observed the counts to be a good measure of relative activity for comparing puppies within a sample population.

[Images: Fitbit on pup; Fitbit in hand]

Step count data from each Fitbit device are uploaded to the Fitbit website. The Fitbit company has generously provisioned us with a research-only Web account and Web Application Programming Interface (Web API) for retrieving the time series data collected by the devices at the finest granularity available, which is steps per minute. Our Java program fetches those data through the Web API and loads them into a relational database.

We are using two Fitbit devices: Fitbit1 and Fitbit2. The data are collected on two pups at a time, siblings from the same litter. We attach Fitbit1 to the first pup and Fitbit2 to the second. Early in the data gathering process we noticed that in each of the first four pairs of samples, Fitbit1 had a higher overall step count than Fitbit2. This prompted us to suspect a possible measurement bias.
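
As a rough check on how surprising that early pattern is, the following sketch applies an exact sign test to the four pairs. It is not part of the original analysis. It assumes that, with no device bias, either device is equally likely to record the higher total in a pair, and it ignores real activity differences between siblings. The variable names are illustrative.

```r
# Illustrative only: 4 sibling pairs, Fitbit1 higher in all 4 of them
n_pairs <- 4
n_fitbit1_higher <- 4

# Two-sided exact binomial (sign) test against a 50/50 chance
sign_test <- binom.test(n_fitbit1_higher, n_pairs, p = 0.5)
sign_test$p.value
```

Under these assumptions the two-sided p-value is 2 × 0.5^4 = 0.125, so four pairs alone are suggestive rather than conclusive, which is why a dedicated control test is useful.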

Get Summarized data

Connect to the Fitbit DB:

library("RODBC")

## Warning: package 'RODBC' was built under R version 2.15.3

conn <- odbcConnect("fitbit")
sql1 <- "select dd.DateStr, dd.FBDevice, dd.PupName, SUM(s.Steps)"
sql1 <- paste(sql1, "from DayDevice dd, Steps s ")
sql1 <- paste(sql1, "where dd.DateStr = s.DateStr and dd.FBDevice = s.FBDevice ")
sql1 <- paste(sql1, "and dd.PupName = s.PupName and dd.FullDayData = 1 ")
sql1 <- paste(sql1, "and dd.PupName NOT LIKE 'CONTROL%' ")  # exclude FITBIT measurement CONTROL tests
sql1 <- paste(sql1, "and dd.DateStr IN")  # include only 1st full day of data
sql1 <- paste(sql1, "  (select MIN(DateStr) from DayDevice ")
sql1 <- paste(sql1, "  where FullDayData=1 group by PupName)")
sql1 <- paste(sql1, "group by dd.DateStr, dd.FBDevice, dd.PupName")
sql1 <- paste(sql1, "order by 1,2,3")
sum <- sqlQuery(conn, paste(sql1))
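
For readers without access to the database, the selection logic of the query (keep full days only, take each pup's first full day, then total the per-minute steps) can be sketched in base R on toy data. The table contents below are made up, and the sketch matches the first day per pup explicitly, which is slightly stricter than the uncorrelated IN clause in the query.

```r
# Toy stand-ins for the DayDevice and Steps tables (all values are made up)
day_device <- data.frame(
  DateStr     = c("2013-03-01", "2013-03-02", "2013-03-01", "2013-03-02"),
  FBDevice    = c("Fitbit1", "Fitbit1", "Fitbit2", "Fitbit2"),
  PupName     = c("PupA", "PupA", "PupB", "PupB"),
  FullDayData = c(0, 1, 1, 1),
  stringsAsFactors = FALSE
)
steps <- data.frame(
  DateStr  = rep(day_device$DateStr, each = 2),
  FBDevice = rep(day_device$FBDevice, each = 2),
  PupName  = rep(day_device$PupName, each = 2),
  Steps    = c(5, 7, 10, 20, 3, 4, 8, 9),
  stringsAsFactors = FALSE
)

# Keep full days only, then find each pup's earliest full day
full_days <- day_device[day_device$FullDayData == 1, ]
first_day <- tapply(full_days$DateStr, full_days$PupName, min)
is_first  <- as.vector(full_days$DateStr == first_day[full_days$PupName])
keep      <- full_days[is_first, ]

# Join to the per-minute steps and total them per day, device and pup
joined <- merge(keep, steps, by = c("DateStr", "FBDevice", "PupName"))
daily_totals <- aggregate(joined["Steps"],
                          by = joined[c("DateStr", "FBDevice", "PupName")],
                          FUN = sum)
daily_totals
```

In this toy data PupA's first day is incomplete, so its total comes from the second day instead.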

Testing for Measurement Bias

Here are the raw step data from our current collection of 12 whelp/days of activity.

plot(sum[, 4], type = "b", ylim = c(0, 20000), ylab = "Fitbit 'steps'", xlab = "", 
    xaxt = "n")
mtext("Fitbit Device Number", side = 1, line = 2.5)
mtext("A measurement bias by Device?", side = 3, line = 1)
axis(1, at = c(1:length(sum[, 2])), labels = substring(as.character(sum[, 2]), 
    7, 7))

[Plot: Fitbit 'steps' per whelp/day, with points labelled by Fitbit device number]

Note the pattern in the first eight samples, showing the higher pairwise steps measured by Fitbit1 over Fitbit2.

To explore this, we set up a comparative test to detect measurement bias between the two Fitbit devices. The test was simply to record two full days of activity during which both devices were worn by the same subject, a pup called Prim. The data for those 48 hours are labelled as PupName='CONTROL3' and PupName='CONTROL4'. In a world with zero measurement bias, the devices would have precisely the same minute-by-minute step count. However, we did indeed find differences in measurement, as illustrated here:

A Measurement Contingency Table

sqlCtrl <- "select dd.DateStr, dd.FBDevice, dd.PupName, s.TimeStr, s.Steps"
sqlCtrl <- paste(sqlCtrl, "from DayDevice dd, Steps s ")
sqlCtrl <- paste(sqlCtrl, "where dd.DateStr = s.DateStr and dd.FBDevice = s.FBDevice ")
sqlCtrl <- paste(sqlCtrl, "and dd.PupName = s.PupName and dd.FullDayData = 1 ")
sqlCtrl <- paste(sqlCtrl, "and dd.PupName IN ('CONTROL3','CONTROL4') ")  # FITBIT CONTROL3/CONTROL4 tests ONLY
sqlCtrl <- paste(sqlCtrl, "order by dd.DateStr, dd.FBDevice, dd.PupName, s.TimeStr")
dtlCtrl <- sqlQuery(conn, paste(sqlCtrl))
# Sum steps and cross-tab by day+minute:
adsCtrl <- as.matrix(xtabs(Steps ~ PupName + paste(DateStr, TimeStr), dtlCtrl))
library(xtable)

## Warning: package 'xtable' was built under R version 2.15.3

contable <- as.data.frame(ftable(c(adsCtrl[1, ] > 0), adsCtrl[2, ] > 0))
contable$pct = round(100 * contable[, 3]/length(adsCtrl[1, ]), digits = 2)
colnames(contable) <- c("Fitbit 1", "Fitbit 2", "  Minutes (total 2880)", "  Percent")
# Row names must be unique, so pad the shared label with spaces and dots
rownames(contable) <- c("  Steps recorded > 0? ", " .Steps recorded > 0? ", 
    ". Steps recorded > 0? ", "..Steps recorded > 0? ")
print(xtable(contable), type = "html", html.table.attributes = "border=1, bgcolor='lightblue'")

	Fitbit 1	Fitbit 2	Minutes (total 2880)	Percent
Steps recorded > 0?	FALSE	FALSE	2075	72.05
.Steps recorded > 0?	TRUE	FALSE	74	2.57
. Steps recorded > 0?	FALSE	TRUE	76	2.64
..Steps recorded > 0?	TRUE	TRUE	655	22.74

Browsing the contingency table of activity (minutes in which steps recorded > 0) versus no activity (minutes in which steps recorded = 0), we see that each device had some minutes in which it measured activity while the other measured none: 74 minutes for Fitbit1 alone and 76 for Fitbit2 alone. One reassuring point is that the two devices generally agree on periods of zero activity (both showing FALSE, as when the pup is at rest), which account for 2075 of the 2880 minutes.
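
The same kind of cross-tabulation can be sketched on a handful of made-up minutes, together with a simple agreement rate. The vectors `fitbit1` and `fitbit2` are hypothetical and are not taken from the control data.

```r
# Made-up per-minute step counts for two devices over the same six minutes
fitbit1 <- c(0, 0, 12, 30, 0, 5)
fitbit2 <- c(0, 3, 10, 28, 0, 0)

# Cross-tabulate "any steps recorded?" for each minute
activity <- table(Fitbit1 = fitbit1 > 0, Fitbit2 = fitbit2 > 0)
activity

# Share of minutes on which the devices agree (both active or both idle)
agreement <- sum(diag(activity)) / sum(activity)
agreement
```

Applied to the control table above, the same calculation gives (2075 + 655) / 2880, or about 94.8% of minutes in agreement.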

Boxplots of Fitbit1 vs Fitbit2 measurement

The boxplots show some variation in the distribution of measurements between the devices. The notches around the medians of the two boxes overlap, which indicates that the medians do not differ significantly at roughly the 5% level, following the notch calculation used by boxplot (Chambers et al., 1983, p. 62).

par(mfrow = c(1, 2))
# Get distributions of minutes in which at least one device recorded steps
# > 0
ctrl1 <- adsCtrl[1, adsCtrl[1, ] > 0 | adsCtrl[2, ] > 0]
ctrl2 <- adsCtrl[2, adsCtrl[1, ] > 0 | adsCtrl[2, ] > 0]
boxplot(ctrl1, ctrl2, width = c(rep(5, 2)), notch = T, names = c("Fitbit1", 
    "Fitbit2"))

[Plot: notched boxplots of Fitbit1 and Fitbit2 steps per minute]
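
The notch comparison can also be made numerically rather than by eye. The sketch below uses `boxplot.stats`, whose `conf` element holds the notch limits (median ± 1.58 × IQR / sqrt(n)). It runs on simulated Poisson counts, which are only a stand-in for `ctrl1` and `ctrl2`.

```r
set.seed(1)
# Simulated per-minute counts standing in for the two control samples
sim1 <- rpois(500, lambda = 20)
sim2 <- rpois(500, lambda = 20)

# Lower and upper notch limits around each median
notch1 <- boxplot.stats(sim1)$conf
notch2 <- boxplot.stats(sim2)$conf

# Two intervals overlap when each one starts below the other's end
notches_overlap <- notch1[1] <= notch2[2] && notch2[1] <= notch1[2]
notches_overlap
```

Replacing `sim1` and `sim2` with `ctrl1` and `ctrl2` would give the limits drawn in the boxplots above.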

Another visual to compare the variation in measurement

In this visual we see how the differences in measurement between Fitbit1 and Fitbit2 are spread across the histogram bars of measured 'steps' per minute. Across the spectrum, sometimes Fitbit1 registered more steps, and about as often Fitbit2 registered more steps. This adds intuitive confidence that the measurement variance between the two devices is due to chance rather than to a systematic measurement bias, such as a difference in sensitivity of the accelerometers inside the two devices.

par(mfrow = c(1, 1))
colors = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5), rgb(0.5, 0, 1, 0.8))
# hist() puts the measured values on the x axis and the count of minutes on the y axis
hist(ctrl1, col = colors[1], main = "Fitbit1 vs Fitbit2 Measurement Variation", 
    xlab = "Fitbit 'steps' per minute", ylab = "Minutes in which at least one device recorded steps > 0")
hist(ctrl2, col = colors[2], add = T)
legend("topright", c("Fitbit1", "Fitbit2", "Overlap"), col = colors, pch = 15, 
    cex = 1.2, inset = 0.02)

[Plot: overlaid histograms of Fitbit1 and Fitbit2 steps per minute]

Adjust Raw Data for Fitbit Measurement Bias?

SHOULD we adjust raw data for measurement variance? Is there a significant bias?

Test for normality

Although the visuals already suggest it, we run the Shapiro-Wilk test on both Fitbit control data sets. With p-values below 2.2e-16 for each, we can reject the null hypothesis of a normal distribution with very high confidence.

shapiro.test(ctrl1)

## 
##  Shapiro-Wilk normality test
## 
## data:  ctrl1 
## W = 0.9463, p-value < 2.2e-16

shapiro.test(ctrl2)

## 
##  Shapiro-Wilk normality test
## 
## data:  ctrl2 
## W = 0.9464, p-value < 2.2e-16

Wilcoxon Rank Sum test

We treat the two samples as independent because they come from two separate devices, although both devices were worn by the same pup at the same time. We also treat each per-minute measurement as independent rather than as part of a time series, on the assumption that a device's count in one minute does not depend on its counts in earlier minutes.

Given these characteristics of the samples, we run the Wilcoxon Rank Sum Test for unpaired samples to test whether the distributions differ significantly at the 95% confidence level.

wil = wilcox.test(ctrl1, ctrl2, conf.int = T, conf.level = 0.95)
wil

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ctrl1 and ctrl2 
## W = 321047, p-value = 0.7504
## alternative hypothesis: true location shift is not equal to 0 
## 95 percent confidence interval:
##  -2  1 
## sample estimates:
## difference in location 
##             -3.778e-05

With a p-value of 0.7504, we cannot reject the null hypothesis that the true Fitbit1 and Fitbit2 distributions are the same. While there is variability between the two devices, it does not appear to introduce a systematic measurement bias.
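
Because both devices were on the same pup during the control days, the per-minute counts could also be viewed as matched pairs. The sketch below is not part of the analysis above and reports no result for the real data. It shows how a Wilcoxon signed-rank test on minute-by-minute differences might be run, using simulated counts in which the two devices share the same underlying activity and add independent noise. All names and parameters are illustrative assumptions.

```r
set.seed(42)
# Simulated stand-in: one pup, two devices, 200 active minutes
true_steps <- rpois(200, lambda = 25)
dev1 <- true_steps + rpois(200, lambda = 3)  # device-specific noise
dev2 <- true_steps + rpois(200, lambda = 3)

# Wilcoxon signed-rank test on the paired per-minute counts;
# exact = FALSE requests the normal approximation, since ties are expected
paired_result <- wilcox.test(dev1, dev2, paired = TRUE, exact = FALSE)
paired_result$p.value
```

With the control data, the equivalent call would pass the two rows of `adsCtrl` restricted to the same minutes, so that each Fitbit1 value is compared with the Fitbit2 value from that minute.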

What should we do?

Attempt to adjust the raw data? If so, how?
Keep and use the measurement as-is, accepting the variability?
Perform more measurement bias testing?