Part 1 - Introduction

I’m interested in the number of steps I take per day and how that may relate to the weather outside. I think this information is important because leading a sedentary lifestyle can be detrimental to various aspects of health including cardiovascular, pulmonary, and mental health. Thus, it is important to identify factors that may lead to taking fewer steps per day so that I can intervene and purposefully take more steps on days where I’m already expected to take fewer. I hope that this serves as a framework for others to conduct the same investigation.

Part 2 - Data

An explanation of the data

The data are collected automatically by any iPhone running iOS 8 or higher. As we’ll see, in the source data format, cases are “time chunks,” as I call them: each one spans the time from when my iPhone began recording steps to when it stopped, and its value is the number of steps taken during that stretch of walking. So a time chunk can be quite short (taking a few steps) or quite long (going for a long walk). This means that one of my first transformations will have to be to organize the data by day rather than by “time chunk,” since a single chunk isn’t really meaningful on its own.

The dependent/response variable in my project is steps taken per day, a discrete numeric variable. I want to relate this to the temperature on a given day, making temperature the independent/explanatory variable (also numeric, but continuous in this case).

These data are purely observational: nothing is randomized and no experiment was conducted. This is just observing two factors from my life and seeing if they relate.

The population in question is my steps writ large (those outside of my sample) and since this is an observational study, it is difficult to generalize to the population. Similarly, if we wanted to think about the population as all other people, it would be impossible to generalize since this study only includes one individual (myself). Nonetheless, because this is a massive sample size of days counted for steps (as we’ll see), I am moderately comfortable with generalizing the findings of this study to my life and steps writ large.

These data cannot be used to establish causal links because this is not an experimental design in which conditions are randomly assigned and other factors are kept constant. It is instead an observational study in which any number of additional variables may have confounded the result. The sample is also not a simple random sample: it is highly biased in that it only comes from one period of my life, which further limits how far the findings can be generalized.

Load the data: Steps Data

Apple provides its StepData in XML format

library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(XML)

XML_StepsFile <- "/Users/zachdravis/Downloads/apple_health_export/export.xml"
XML_StepsData <- xmlParse(XML_StepsFile)
StepsData <- XML:::xmlAttrsToDataFrame(XML_StepsData["//Record"])

head(as_tibble(StepsData))  # as_tibble() replaces the deprecated tbl_df()

## # A tibble: 6 x 9
## Columns: type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value
## (Row values are truncated in the source; the first row begins: HKQuan…, iPhone, 11.2.1)
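The `XML:::` prefix above suggests that `xmlAttrsToDataFrame` is a non-exported helper, which could change between package versions. As a hedged alternative, here is a minimal sketch that builds the same kind of data frame using only exported functions from the XML package. The XML snippet and its attribute values are made up for illustration; the real export has more attributes per record.

```r
library(XML)

# A tiny stand-in for export.xml; these attribute values are invented.
toy_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          endDate="2018-01-15 08:05:00 -0500" value="42"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count"
          endDate="2018-01-15 12:30:00 -0500" value="310"/>
</HealthData>'

doc   <- xmlParse(toy_xml, asText = TRUE)
nodes <- getNodeSet(doc, "//Record")

# xmlAttrs() returns a named character vector for each node;
# stacking them row-wise gives one row per record.
toy_steps <- as.data.frame(do.call(rbind, lapply(nodes, xmlAttrs)),
                           stringsAsFactors = FALSE)

# Every attribute arrives as text, so the step value needs converting.
toy_steps$value <- as.numeric(toy_steps$value)
toy_steps
```

This assumes every record carries the same set of attributes; if some records are missing attributes, the row-wise stacking would need extra handling.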

Data Manipulations: Steps Data

  • Make our key value numeric
  • Standardize date format
  • Create additional variables for month, year, date, and Day of the Week
  • Now filter out the “mile” measurement since we are interested in number of steps only
  • Collapse across “time chunks” to get number of steps per day
StepsData$value <- as.numeric(as.character(StepsData$value))

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
StepsData$endDate <-ymd_hms(StepsData$endDate,tz="America/New_York")
## Date in ISO8601 format; converting timezone from UTC to "America/New_York".
StepsData$month<-format(StepsData$endDate,"%m")  # %m is the month; %M would return minutes
StepsData$year<-format(StepsData$endDate,"%Y")
StepsData$date<-format(StepsData$endDate,"%Y-%m-%d")
StepsData$dayofweek <-wday(StepsData$endDate, label=TRUE)

CountsByDate <- StepsData %>%
  filter(unit == "count") %>%
  group_by(date) %>%
  summarise(StepCount = sum(value))
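To make the collapse from “time chunks” to days concrete, here is a small sketch on invented data. The timestamps and step values are made up; only the column names mirror the ones used above. It also shows the difference between the `%m` (month) and `%M` (minute) format codes, which are easy to mix up.

```r
library(dplyr)

# Toy "time chunks"; timestamps and step values are invented.
chunks <- data.frame(
  endDate = as.POSIXct(c("2018-01-15 08:05:00", "2018-01-15 12:30:00",
                         "2018-01-16 09:45:00"), tz = "America/New_York"),
  unit  = c("count", "count", "count"),
  value = c(42, 310, 1200)
)

# %m is the two-digit month, %M is the minute of the hour.
format(chunks$endDate[1], "%m")  # "01"
format(chunks$endDate[1], "%M")  # "05"

# Two chunks on Jan 15 collapse into one daily total.
daily <- chunks %>%
  filter(unit == "count") %>%
  mutate(date = format(endDate, "%Y-%m-%d")) %>%
  group_by(date) %>%
  summarise(StepCount = sum(value))
daily
```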

Load the data: Temperature Data

TempData <- read.csv("/Users/zachdravis/Desktop/TempData2.csv")
TempData$NAME <- as.character(TempData$NAME)
TempDataPhilly <- TempData[grepl("Philadelphia international airport", TempData$NAME, ignore.case = TRUE),]
TempDataPhilly$DATE <- as.Date(TempDataPhilly$DATE)
CountsByDate$date <- as.Date(CountsByDate$date)

Merge the data for temp and steps

TempSteps <- TempDataPhilly %>% 
  select(DATE, TMAX) %>%
  left_join(CountsByDate, ., by = c("date" = "DATE"))
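Because this is a left join onto the steps table, any day with steps but no matching temperature reading would come through with a missing `TMAX`. A quick check like the following sketch can confirm how many such days exist. The toy tables here are invented and only mirror the column names used above.

```r
library(dplyr)

# Invented toy tables: three days of steps, two days of temperatures.
steps <- data.frame(date = as.Date(c("2018-01-15", "2018-01-16", "2018-01-17")),
                    StepCount = c(352, 1200, 8000))
temps <- data.frame(DATE = as.Date(c("2018-01-15", "2018-01-16")),
                    TMAX = c(38, 41))

joined <- left_join(steps, temps, by = c("date" = "DATE"))

# Days with steps but no temperature reading show up as NA in TMAX.
n_unmatched <- sum(is.na(joined$TMAX))
n_unmatched  # 1

# Keep only complete rows before computing correlations or fitting models.
complete_days <- joined[!is.na(joined$TMAX), ]
```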

Part 3 - Exploratory data analysis

EDA of Steps alone

Let’s just look at how my number of steps taken per day changes over time.

plot(CountsByDate$StepCount ~ CountsByDate$date,
     main = "Scatterplot of Steps by Date",
     xlab = "Date",
     ylab = "Step Count",
     xaxt = "n")
abline(lm(StepCount~date, data = CountsByDate), col="red")
axis.Date(1, at=seq(min(CountsByDate$date), max(CountsByDate$date), by="6 mon"), format="%m-%Y")

x <- lm(StepCount~date, data = CountsByDate)

plot(x$residuals ~ CountsByDate$date,
    main = "Residuals of a Linear Model",
    xlab = "Date",
    ylab = "Residual Value")
abline(h = 0, lty = 2, col = "red", lwd = 2) #Here we see the residuals are unevenly distributed. RECREATE THIS WITH SPC

These plots tell me that there is not a clear linear relationship with date, and the wavy pattern in the data suggests seasonality: it looks like I’m walking a lot less in the winter. This is great because it indicates steps may relate to the temperature!
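One way to look at seasonality more directly is to average steps by calendar month. The sketch below uses simulated data, since the real data are not reproduced here: the start date, the seasonal amplitude, and the noise level are all assumptions chosen only to mimic a summer peak.

```r
set.seed(1)

# Simulated daily steps with a summer peak; all numbers are invented.
dates <- seq(as.Date("2016-07-01"), by = "day", length.out = 573)
doy   <- as.numeric(format(dates, "%j"))  # day of year, 1 to 366
steps <- 8000 + 2500 * cos(2 * pi * (doy - 200) / 365) + rnorm(573, sd = 1500)
sim   <- data.frame(date = dates, StepCount = steps)

# Average steps per calendar month, pooling across years.
sim$month <- format(sim$date, "%m")
monthly   <- aggregate(StepCount ~ month, data = sim, FUN = mean)

barplot(monthly$StepCount, names.arg = monthly$month,
        main = "Mean Steps by Month (simulated)",
        xlab = "Month", ylab = "Mean Step Count")
```

With the real data, the same `aggregate()` call could be applied to `CountsByDate` after adding a month column.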

EDA of Steps and Temp

plot(TempSteps$StepCount ~ TempSteps$TMAX,
     main = "Scatterplot of Steps by Max Temperature",
     xlab = "Max Temp (Fahrenheit)",
     ylab = "Step Count")

I see the possibility for a linear relationship.

cor(TempSteps$StepCount, TempSteps$TMAX)
## [1] 0.3897621

Ok, at about 0.39 this is a positive but fairly weak correlation.

Part 4 - Inference

StepsModel <- lm(StepCount ~ TMAX, data = TempSteps)

Check conditions

  1. Linearity (yes, checked above)
  2. Nearly normal residuals
hist(StepsModel$residuals)

Yes, nearly normal!

  3. Constant variability

plot(TempSteps$StepCount ~ TempSteps$TMAX,
     main = "Scatterplot of Steps by Max Temperature",
     xlab = "Max Temp (Fahrenheit)",
     ylab = "Step Count")
abline(StepsModel, col = "red")

It seems with the exception of a few outliers, the variability is pretty constant.
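Two additional plots are commonly used for these conditions: a normal Q-Q plot for the residuals and a residuals-versus-fitted plot for constant variability. The sketch below runs on simulated data, where the coefficients and noise level are assumptions loosely resembling the scale of this analysis; with the real data, `fit` would be replaced by `StepsModel`.

```r
set.seed(42)

# Simulated stand-in for the real data; all values are invented.
tmax  <- runif(200, min = 20, max = 95)
steps <- 3400 + 87 * tmax + rnorm(200, sd = 3600)
fit   <- lm(steps ~ tmax)

# Nearly normal residuals: points should fall close to the line.
qqnorm(resid(fit), main = "Normal Q-Q Plot of Residuals (simulated)")
qqline(resid(fit), col = "red")

# Constant variability: the vertical spread should look similar
# across the range of fitted values, with no funnel shape.
plot(fitted(fit), resid(fit),
     main = "Residuals vs Fitted (simulated)",
     xlab = "Fitted Step Count", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
```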

Investigate model

Our Null Hypothesis is that there is no relationship between max temperature and steps I took per day. The alternative hypothesis is that there is a relationship.

summary(StepsModel)
## 
## Call:
## lm(formula = StepCount ~ TMAX, data = TempSteps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7814.4 -2483.7  -211.6  2103.6 16557.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3369.155    550.514    6.12 1.74e-09 ***
## TMAX          86.893      8.592   10.11  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3658 on 571 degrees of freedom
## Multiple R-squared:  0.1519, Adjusted R-squared:  0.1504 
## F-statistic: 102.3 on 1 and 571 DF,  p-value: < 2.2e-16

Clearly, due to the highly significant p-value (p < .001), we reject the null hypothesis and conclude that, based on these data, there is a relationship between steps taken per day and the max temperature.
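As a sanity check, the test statistics in the summary can be reproduced by hand from the reported estimate, standard error, and degrees of freedom. This is only a worked check of the printed numbers, not a new analysis.

```r
# Values copied from the summary(StepsModel) output.
estimate <- 86.893
std_err  <- 8.592
df_resid <- 571

# t statistic for the slope: estimate divided by its standard error.
t_value <- estimate / std_err                     # about 10.11

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)   # far below 0.001

# With a single predictor, the F statistic equals the squared t statistic.
f_value <- t_value^2                              # about 102.3
```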

The equation of the linear model:

Expected Steps = 3369.155 + 86.893 * Max Temperature (in degrees)

This means that each additional degree of max temperature adds an expected 86.893 steps to a baseline of 3369.155 (the intercept, i.e. the expected steps on a day with a max temperature of 0 degrees).
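To see what the equation implies on specific days, the fitted coefficients can be plugged in directly. The helper name `predict_steps` is made up for this illustration, and the two temperatures are arbitrary examples.

```r
# Hypothetical helper that applies the fitted equation by hand.
predict_steps <- function(tmax) {
  3369.155 + 86.893 * tmax
}

# Expected steps on a 40-degree day and an 80-degree day.
cold_day <- predict_steps(40)   # 6844.875
warm_day <- predict_steps(80)   # 10320.595

# A 40-degree difference corresponds to 40 * 86.893 expected steps.
warm_day - cold_day             # 3475.72
```

With the model object available, `predict(StepsModel, newdata = data.frame(TMAX = c(40, 80)))` should give the same point estimates.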

Although this model is significant, it is important to note that the linear relationship is weak and that the max temperature of a given day does not explain much of the variance in steps taken per day: the R-squared value is .15, meaning that max temperature explains only 15% of the variance in steps taken per day.
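In a regression with a single predictor, R-squared is simply the square of the correlation coefficient, so the two numbers reported in this analysis can be checked against each other:

```r
# Correlation reported in the exploratory analysis.
r <- 0.3897621

# Squaring it should reproduce the Multiple R-squared from the summary.
r_squared <- r^2
round(r_squared, 4)  # 0.1519
```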

Part 5 - Conclusion

This analysis looked at the steps taken per day on my iPhone over the past 573 days and how they relate to the weather. I found that despite a weak linear relationship, the max temperature of the day is a significant predictor of steps taken per day. Nonetheless, max temperature only explains ~15% of the variance in steps per day, leading me to think about future variables that could be used.

Future analyses should incorporate a number of additional weather-related variables (e.g. average temp, rainfall, wind speeds) to assess their predictive power for steps taken per day.
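A rough sketch of what such a follow-up could look like is below. It runs on simulated data: the column names `PRCP` (rainfall) and `AWND` (wind speed) are assumptions about what a weather file might contain, and every coefficient used to generate the data is invented. The point is only to show how a multiple regression can be compared against the single-predictor model.

```r
set.seed(7)
n <- 300

# Simulated weather; column names and all values are assumptions.
weather <- data.frame(
  TMAX = runif(n, min = 20, max = 95),   # max temperature
  PRCP = rexp(n, rate = 4),              # rainfall
  AWND = runif(n, min = 2, max = 20)     # average wind speed
)
weather$StepCount <- 3400 + 87 * weather$TMAX - 1500 * weather$PRCP -
  40 * weather$AWND + rnorm(n, sd = 3600)

simple <- lm(StepCount ~ TMAX, data = weather)
full   <- lm(StepCount ~ TMAX + PRCP + AWND, data = weather)

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison.
summary(simple)$adj.r.squared
summary(full)$adj.r.squared

# F-test for whether the added predictors improve on the simple model.
anova(simple, full)
```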

References

https://developer.apple.com/researchkit/ (Background info on Apple Steps Data)

https://www.r-bloggers.com/taking-steps-in-xml/ (Help to load Apple Steps Data into R)

https://www.ncdc.noaa.gov/cdo-web/search (Source of weather data)