I. Data Description

There are 4410 users: 2205 males and 2205 females. Each user has 50-60 records, dated from 2015-02-01 to 2015-04-30. The BMI of users ranges from 17.10 to 59.20, and the resting heart rate (RHR) ranges from 32.05 to 115.97. Note that I replaced BMI=99 and resting_heart_rate=0 with NA, so these ranges exclude those placeholder values.
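
As a rough illustration of the NA replacement described above, here is a minimal Python sketch (it is not the R code used for this analysis). The toy values and the helper names `clean_value` and `value_range` are hypothetical; only the two placeholder codes (99 for BMI, 0 for resting heart rate) come from the text.

```python
# Placeholder codes named in the text: BMI == 99 and resting_heart_rate == 0.
BMI_SENTINEL = 99
RHR_SENTINEL = 0


def clean_value(value, sentinel):
    """Return None (standing in for NA) when the value equals the sentinel."""
    return None if value == sentinel else value


def value_range(values):
    """Minimum and maximum over the non-missing values only."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


# Hypothetical toy values, not the real data.
bmi_raw = [17.10, 99, 26.4, 59.20]
rhr_raw = [0, 32.05, 70.3, 115.97]

bmi = [clean_value(v, BMI_SENTINEL) for v in bmi_raw]
rhr = [clean_value(v, RHR_SENTINEL) for v in rhr_raw]

print(value_range(bmi), value_range(rhr))
```

Without this step, the placeholder codes would show up as the maximum BMI and the minimum resting heart rate.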

The following graph shows that males’ BMI has a higher mean and lower variance than females’. A Welch two-sample t-test confirms that the average BMI of males (27.73) is significantly higher than that of females (26.55).

## 
##  Welch Two Sample t-test
## 
## data:  user1$bmi[user1$gender == "female"] and user1$bmi[user1$gender == "male"]
## t = -7.6528, df = 4217.9, p-value = 2.422e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4811525 -0.8770243
## sample estimates:
## mean of x mean of y 
##  26.55076  27.72985
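
For readers who want to see what the Welch test computes, the following Python sketch (separate from the R code behind the output above) derives the t statistic and the Welch-Satterthwaite degrees of freedom from two samples. The function name `welch_t` and the toy samples are hypothetical; the real BMI vectors are not reproduced here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t statistic and approximate degrees of freedom.

    Unlike the pooled t-test, each group keeps its own variance.
    """
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical toy samples with unequal variances.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, df)
```

In the output above, x is the female group and y is the male group, which is why the t statistic and the confidence interval are negative.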

The following two graphs show that in March the proportion of users who exercised increased sharply, while the mean resting heart rate dropped sharply.

II. Relationship between exercise frequency and RHR

First, I aggregated the exercise table and the resting heart rate table by userID, calculated the exercise frequency and the average resting heart rate for each person, and then merged the results with the user table. The scatter plot shows that RHR is negatively correlated with exercise frequency. BMI is also positively correlated with RHR, which indicates that we should take confounding variables into account when modeling the relationship between exercise frequency and RHR.
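
The aggregation step can be sketched in plain Python as follows (again, not the R code actually used). The column names `userID`, `exercised_today`, `resting_heart_rate`, `freq` and `avg_rhr` appear in the model output in this report; the table layout (lists of tuples and a dict of users), the function name `summarise_users` and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_users(exercise_rows, rhr_rows, users):
    """Aggregate daily records per user and merge them with the user table.

    exercise_rows: iterable of (userID, exercised_today) with 0/1 flags
    rhr_rows:      iterable of (userID, resting_heart_rate or None)
    users:         dict mapping userID to a dict with bmi and gender
    """
    exercised = defaultdict(list)
    for user_id, flag in exercise_rows:
        exercised[user_id].append(flag)

    rhr = defaultdict(list)
    for user_id, value in rhr_rows:
        if value is not None:  # skip missing (NA) measurements
            rhr[user_id].append(value)

    merged = {}
    for user_id, info in users.items():
        merged[user_id] = {
            **info,
            # share of recorded days on which the user exercised
            "freq": mean(exercised[user_id]) if exercised[user_id] else None,
            "avg_rhr": mean(rhr[user_id]) if rhr[user_id] else None,
        }
    return merged


# Hypothetical toy rows, not the real data.
exercise_rows = [("u1", 1), ("u1", 0), ("u1", 1), ("u1", 1), ("u2", 0), ("u2", 0)]
rhr_rows = [("u1", 60.0), ("u1", 62.0), ("u2", None), ("u2", 70.0)]
users = {"u1": {"bmi": 24.0, "gender": 0}, "u2": {"bmi": 30.0, "gender": 1}}

data1 = summarise_users(exercise_rows, rhr_rows, users)
print(data1["u1"])
```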

Let’s start with a linear regression model. I included exercise frequency, BMI, gender, and their interactions as predictors, and then used stepwise selection by AIC to choose the features. The model summary shows that 1) average RHR increases as BMI increases; 2) females have a higher RHR than males; 3) for males, every 10-percentage-point increase in exercise frequency is associated with a 0.44 decrease in RHR on average, while for females it is associated with a 0.75 decrease on average. The diagnostic plots did not show any violation of the model assumptions.

## Start:  AIC=17905.01
## avg_rhr ~ freq * bmi * gender
## 
##                   Df Sum of Sq    RSS   AIC
## - freq:bmi:gender  1    1.0871 258703 17903
## <none>                         258702 17905
## 
## Step:  AIC=17903.03
## avg_rhr ~ freq + bmi + gender + freq:bmi + freq:gender + bmi:gender
## 
##               Df Sum of Sq    RSS   AIC
## - freq:bmi     1     1.082 258704 17901
## - bmi:gender   1     2.296 258705 17901
## <none>                     258703 17903
## - freq:gender  1   257.219 258960 17905
## 
## Step:  AIC=17901.05
## avg_rhr ~ freq + bmi + gender + freq:gender + bmi:gender
## 
##               Df Sum of Sq    RSS   AIC
## - bmi:gender   1     3.443 258708 17899
## <none>                     258704 17901
## - freq:gender  1   257.921 258962 17903
## 
## Step:  AIC=17899.11
## avg_rhr ~ freq + bmi + gender + freq:gender
## 
##               Df Sum of Sq    RSS   AIC
## <none>                     258708 17899
## - freq:gender  1     260.3 258968 17902
## - bmi          1   18198.6 276906 18195
## 
## Call:
## lm(formula = avg_rhr ~ freq + bmi + gender + freq:gender, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.530  -5.194  -0.345   4.843  39.557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.85692    0.76119  72.067  < 2e-16 ***
## freq        -4.41406    0.93402  -4.726 2.36e-06 ***
## bmi          0.39961    0.02276  17.559  < 2e-16 ***
## gender       4.00365    0.59193   6.764 1.52e-11 ***
## freq:gender -3.06892    1.46150  -2.100   0.0358 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.683 on 4383 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.1091, Adjusted R-squared:  0.1083 
## F-statistic: 134.2 on 4 and 4383 DF,  p-value: < 2.2e-16
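
The per-gender effects quoted in the text follow from the `freq` and `freq:gender` coefficients in the summary above. The short Python sketch below reproduces that arithmetic. It assumes that `gender` is coded 1 for females and 0 for males (inferred from the positive `gender` coefficient and the statement that females have the higher RHR) and that `freq` is a proportion between 0 and 1; the helper name `rhr_change` is hypothetical.

```python
# Coefficients copied from the lm summary above.
COEF = {"freq": -4.41406, "freq:gender": -3.06892}


def rhr_change(delta_freq, gender):
    """Expected change in average RHR for a change in exercise frequency.

    gender: 0 for male, 1 for female (assumed coding, see the lead-in).
    """
    slope = COEF["freq"] + COEF["freq:gender"] * gender
    return slope * delta_freq


male_change = rhr_change(0.10, gender=0)    # about -0.44
female_change = rhr_change(0.10, gender=1)  # about -0.75
print(round(male_change, 2), round(female_change, 2))
```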

Next, I used all of the records to model the relationship. Since every user has multiple measurements, I chose a mixed model to account for the hierarchy. I kept the fixed effects the same as in the linear regression and added a per-user random intercept and a per-user random slope for exercised_today:

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: 
## resting_heart_rate ~ exercised_today + bmi + gender + exercised_today:gender +  
##     (1 + exercised_today | userID)
##    Data: data1_1
## 
##       AIC       BIC    logLik  deviance  df.resid 
## 1399972.5 1400066.5 -699977.3 1399954.5    253875 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.8290 -0.6496  0.0058  0.6526  5.7621 
## 
## Random effects:
##  Groups   Name            Variance Std.Dev. Corr 
##  userID   (Intercept)     59.82    7.734         
##           exercised_today  1.02    1.010    -0.09
##  Residual                 13.05    3.612         
## Number of obs: 253884, groups:  userID, 4388
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)            53.21883    0.65475   81.28
## exercised_today        -0.76230    0.03098  -24.60
## bmi                     0.40408    0.02284   17.69
## gender                  3.30098    0.23583   14.00
## exercised_today:gender -0.10659    0.04411   -2.42
## 
## Correlation of Fixed Effects:
##             (Intr) exrcs_ bmi    gender
## exercsd_tdy -0.027                     
## bmi         -0.967  0.000              
## gender      -0.288  0.074  0.114       
## exrcsd_tdy:  0.018 -0.702  0.000 -0.101

The mixed model shows that for males, exercising is associated with a 0.76 decrease in RHR on average; for females, the decrease is 0.87 on average (the 0.76 main effect plus the 0.11 interaction term). The following likelihood ratio test shows that the fixed effects of exercise are significant.

## Data: data1_1
## Models:
## mix.null: resting_heart_rate ~ bmi + gender + (1 + exercised_today | userID)
## mix.model: resting_heart_rate ~ exercised_today + bmi + gender + exercised_today:gender + 
## mix.model:     (1 + exercised_today | userID)
##           Df     AIC     BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mix.null   7 1401142 1401215 -700564  1401128                             
## mix.model  9 1399973 1400067 -699977  1399955 1173.2      2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
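
The likelihood ratio statistic in the table above is twice the difference in log-likelihood between the two models, compared against a chi-square distribution whose degrees of freedom equal the number of extra parameters. The Python sketch below recomputes it from the printed log-likelihoods. The function name `lrt` is hypothetical, and the p-value uses closed forms that hold only for 1 or 2 degrees of freedom.

```python
from math import erfc, exp, sqrt


def lrt(loglik_null, loglik_full, extra_params):
    """Likelihood ratio statistic and chi-square p-value for nested models."""
    chisq = 2.0 * (loglik_full - loglik_null)
    if extra_params == 1:
        p = erfc(sqrt(chisq / 2.0))  # chi-square survival function, df = 1
    elif extra_params == 2:
        p = exp(-chisq / 2.0)        # chi-square survival function, df = 2
    else:
        raise ValueError("closed form implemented only for 1 or 2 parameters")
    return chisq, p


# Log-likelihoods as printed above for mix.null and mix.model.
chisq, p_value = lrt(-700564.0, -699977.3, extra_params=2)
print(chisq, p_value)
```

The recomputed statistic differs slightly from the reported 1173.2 because the table prints the null model's log-likelihood rounded to the nearest integer.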

III. Exercise Frequency Increase and RHR

First, we need to find out which users increased their exercise frequency. I calculated the 10-day moving average series for every user and then estimated the linear trend by OLS (see here). I found that 2444 users saw an increase in exercise frequency at the 1% significance level.
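
A minimal Python sketch of this procedure for a single user follows; it is not the R code used here. It assumes the trend is fitted against the index of the moving-average series, that the test is one-sided, and that a normal approximation to the t distribution is acceptable. The function names and the toy flags are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def moving_average(values, window=10):
    """Simple moving average over consecutive windows of the given length."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def linear_trend(series):
    """OLS slope of the series against its index, with its standard error."""
    n = len(series)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(series)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, series))
    return slope, sqrt(rss / (n - 2) / sxx)


def increased(series, alpha=0.01):
    """True if the slope is significantly positive (normal approximation)."""
    slope, se = linear_trend(series)
    if se == 0:  # perfectly linear series: the sign of the slope decides
        return slope > 0
    p_one_sided = 1.0 - NormalDist().cdf(slope / se)
    return p_one_sided < alpha


# Hypothetical user: exercises every other day for 20 days, then daily.
flags = [0, 1] * 10 + [1] * 20
print(increased(moving_average(flags)))
```

The same functions can be applied to the RHR series by testing for a significantly negative slope instead.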

From the previous models, we expect the moving average of RHR to decrease as exercise frequency increases. Applying the same computation to RHR (with missing RHR values replaced by the corresponding user’s average RHR), I found that 2772 users saw a decrease in RHR at the 1% significance level, and 1994 of them also increased their exercise frequency. The following graph shows the moving averages for three randomly selected users. The RHR series are centered for plotting.
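
The two preprocessing steps mentioned above, filling missing RHR values with the user's own mean and centering the series for plotting, can be sketched in Python as follows. The function names and the toy series are hypothetical, and a user with no observed RHR at all would need separate handling.

```python
from statistics import mean


def impute_with_user_mean(rhr):
    """Replace missing values (None) with the mean of the user's observed RHR."""
    observed = [v for v in rhr if v is not None]
    user_mean = mean(observed)  # raises StatisticsError if nothing was observed
    return [user_mean if v is None else v for v in rhr]


def center(series):
    """Subtract the series mean so that different users share one plot scale."""
    m = mean(series)
    return [v - m for v in series]


# Hypothetical toy series for one user.
filled = impute_with_user_mean([60.0, None, 64.0])
centered = center(filled)
print(filled, centered)
```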

Next, I built a mixed model for the 2444 users who saw an increase in exercise frequency (the output below reports 2429 userID groups in the fit). After controlling for BMI and gender, these users see a daily decrease in RHR of 0.08 on average. A likelihood ratio test shows that the fixed effect of days is significant.

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: resting_heart_rate ~ days + bmi + gender + (1 + days | userID)
##    Data: data1_2
## Control: lmerControl(check.conv.grad = .makeCC("warning", tol = 0.0022,  
##     relTol = NULL))
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  766535.3  766614.1 -383259.6  766519.3    140546 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3680 -0.6472 -0.0025  0.6480  5.8448 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  userID   (Intercept) 65.481829 8.09208       
##           days         0.002392 0.04891  -0.30
##  Residual             12.098914 3.47835       
## Number of obs: 140554, groups:  userID, 2429
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 55.886975   0.859299   65.04
## days        -0.081741   0.001138  -71.85
## bmi          0.366716   0.029803   12.30
## gender       3.506681   0.317341   11.05
## 
## Correlation of Fixed Effects:
##        (Intr) days   bmi   
## days   -0.059              
## bmi    -0.961  0.000       
## gender -0.297  0.000  0.102
## convergence code: 0
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

## Data: data1_2
## Models:
## mix.null2: resting_heart_rate ~ bmi + gender + (1 + days | userID)
## mix.model2: resting_heart_rate ~ days + bmi + gender + (1 + days | userID)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## mix.null2   7 769302 769370 -384644   769288                             
## mix.model2  8 766535 766614 -383260   766519 2768.3      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can find all the R code I used on my Gist.