The data set contains 4410 users: 2205 males and 2205 females. Each user has 50-60 records, dated from 2015-02-01 to 2015-04-30. BMI ranges from 17.10 to 59.20, and resting heart rate ranges from 32.05 to 115.97. Note that I replaced BMI=99 and resting_heart_rate=0 with NA before computing these ranges.
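As a minimal sketch of this cleaning step, assuming the two codes mark missing measurements: the data frames `users` and `rhr` and their values below are made up for illustration, and only the column names `bmi` and `resting_heart_rate` come from the analysis itself.

```r
# Hypothetical stand-ins for the user table and the daily heart-rate table.
users <- data.frame(userID = 1:4, bmi = c(22.4, 99, 31.0, 27.5))
rhr <- data.frame(userID = c(1, 1, 2, 3),
                  resting_heart_rate = c(61.2, 0, 58.9, 70.3))

# Treat the codes 99 and 0 as missing rather than as real measurements.
# which() skips rows that are already NA.
users$bmi[which(users$bmi == 99)] <- NA
rhr$resting_heart_rate[which(rhr$resting_heart_rate == 0)] <- NA

# Ranges are then computed on the observed values only.
range(users$bmi, na.rm = TRUE)
range(rhr$resting_heart_rate, na.rm = TRUE)
```

Recoding before summarizing matters here because a BMI of 99 or a heart rate of 0 would otherwise stretch the reported ranges and bias the means.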
The following graph shows that male users' BMI has a higher mean and a lower variance than female users'. A Welch two-sample t-test confirms that the average male BMI (27.73) is significantly higher than the average female BMI (26.55).
##
## Welch Two Sample t-test
##
## data: user1$bmi[user1$gender == "female"] and user1$bmi[user1$gender == "male"]
## t = -7.6528, df = 4217.9, p-value = 2.422e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4811525 -0.8770243
## sample estimates:
## mean of x mean of y
## 26.55076 27.72985
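The comparison above can be sketched as follows. The BMI values are synthetic (drawn with illustrative means and spreads), so the numbers will not match the output above; only the shape of the call to `t.test` mirrors the `data:` line in that output.

```r
set.seed(1)
# Synthetic BMI values; the means and standard deviations are illustrative only.
user1 <- data.frame(
  gender = rep(c("female", "male"), each = 500),
  bmi = c(rnorm(500, mean = 26.5, sd = 5.5), rnorm(500, mean = 27.7, sd = 4.5))
)

# Compare the spread of the two groups before testing the means.
tapply(user1$bmi, user1$gender, var)

# var.equal = FALSE (the default) requests the Welch test, which does not
# assume equal variances in the two groups.
res <- t.test(user1$bmi[user1$gender == "female"],
              user1$bmi[user1$gender == "male"],
              var.equal = FALSE)
res$estimate   # group means (x = female, y = male)
res$conf.int   # 95% confidence interval for mean(x) - mean(y)
res$p.value
```

The Welch variant is a reasonable default when the graph already suggests unequal variances, since it adjusts the degrees of freedom instead of pooling the variance.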
The following two graphs show that in March the proportion of users who exercised increased sharply, while the mean resting heart rate dropped sharply.
First, I aggregated the exercise table and the resting heart rate table by userID, calculating the exercise frequency and average resting heart rate (RHR) for each person, and then merged the results with the user table. The scatter plot shows that RHR is negatively correlated with exercise frequency. BMI is also positively correlated with RHR, which indicates that we should take confounding variables such as BMI into account when modeling the relationship between exercise frequency and RHR.
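A minimal sketch of the aggregation and merge, under the assumption that exercise frequency is the share of recorded days on which a user exercised. The table names `exercise`, `rhr` and `users` and all values are hypothetical; the column names `freq`, `avg_rhr`, `bmi`, `gender`, `exercised_today` and `resting_heart_rate` follow the model output later in this post.

```r
# Hypothetical daily tables and user table.
exercise <- data.frame(
  userID = c(1, 1, 1, 2, 2, 2),
  exercised_today = c(1, 0, 1, 0, 0, 1)
)
rhr <- data.frame(
  userID = c(1, 1, 1, 2, 2, 2),
  resting_heart_rate = c(60, 62, NA, 71, 70, 72)
)
users <- data.frame(userID = c(1, 2), bmi = c(23.1, 29.4), gender = c(0, 1))

# Per-user exercise frequency: mean of a 0/1 indicator = share of active days.
freq <- aggregate(exercised_today ~ userID, data = exercise, FUN = mean)
names(freq)[2] <- "freq"

# Per-user mean RHR; the formula interface drops rows with NA by default.
avg_rhr <- aggregate(resting_heart_rate ~ userID, data = rhr, FUN = mean)
names(avg_rhr)[2] <- "avg_rhr"

# One row per user, combining the user attributes with both summaries.
data1 <- merge(merge(users, freq, by = "userID"), avg_rhr, by = "userID")
data1
```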
Let’s start with a linear regression model. I included exercise frequency, BMI, gender, and their interactions as predictors, and then used stepwise selection by AIC to choose the features. The model summary shows that 1) average RHR increases as BMI increases; 2) females have a higher RHR than males; 3) for males, every 10-percentage-point increase in exercise frequency is associated with a 0.44 decrease in RHR on average, while for females the same increase is associated with a 0.75 decrease. The diagnostic plots did not show any violation of the model assumptions.
## Start: AIC=17905.01
## avg_rhr ~ freq * bmi * gender
##
## Df Sum of Sq RSS AIC
## - freq:bmi:gender 1 1.0871 258703 17903
## <none> 258702 17905
##
## Step: AIC=17903.03
## avg_rhr ~ freq + bmi + gender + freq:bmi + freq:gender + bmi:gender
##
## Df Sum of Sq RSS AIC
## - freq:bmi 1 1.082 258704 17901
## - bmi:gender 1 2.296 258705 17901
## <none> 258703 17903
## - freq:gender 1 257.219 258960 17905
##
## Step: AIC=17901.05
## avg_rhr ~ freq + bmi + gender + freq:gender + bmi:gender
##
## Df Sum of Sq RSS AIC
## - bmi:gender 1 3.443 258708 17899
## <none> 258704 17901
## - freq:gender 1 257.921 258962 17903
##
## Step: AIC=17899.11
## avg_rhr ~ freq + bmi + gender + freq:gender
##
## Df Sum of Sq RSS AIC
## <none> 258708 17899
## - freq:gender 1 260.3 258968 17902
## - bmi 1 18198.6 276906 18195
##
## Call:
## lm(formula = avg_rhr ~ freq + bmi + gender + freq:gender, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.530 -5.194 -0.345 4.843 39.557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.85692 0.76119 72.067 < 2e-16 ***
## freq -4.41406 0.93402 -4.726 2.36e-06 ***
## bmi 0.39961 0.02276 17.559 < 2e-16 ***
## gender 4.00365 0.59193 6.764 1.52e-11 ***
## freq:gender -3.06892 1.46150 -2.100 0.0358 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.683 on 4383 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.1091, Adjusted R-squared: 0.1083
## F-statistic: 134.2 on 4 and 4383 DF, p-value: < 2.2e-16
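The selection and the gender-specific slopes can be sketched as below. The first part runs backward selection on synthetic data (all effect sizes are made up, and the coding 1 = female, 0 = male is an assumption inferred from the positive gender coefficient). The second part only redoes the arithmetic on the two coefficients reported in the summary above.

```r
set.seed(42)
n <- 2000
# Synthetic per-user data; effect sizes are illustrative, not the reported ones.
data1 <- data.frame(
  freq = runif(n),                     # share of days with exercise
  bmi = rnorm(n, mean = 27, sd = 5),
  gender = rbinom(n, 1, 0.5)           # assumed coding: 1 = female, 0 = male
)
data1$avg_rhr <- 55 - 4 * data1$freq + 0.4 * data1$bmi + 4 * data1$gender -
  3 * data1$freq * data1$gender + rnorm(n, sd = 2)

# Start from the full three-way interaction model and drop terms while AIC improves.
full <- lm(avg_rhr ~ freq * bmi * gender, data = data1)
final <- step(full, direction = "backward", trace = 0)
formula(final)

# Gender-specific slopes from the coefficients reported in the summary above.
b_freq <- -4.41406          # slope of freq when gender = 0
b_freq_gender <- -3.06892   # extra slope when gender = 1
# Change in RHR for a 0.1 (10-percentage-point) increase in freq.
round(0.1 * c(male = b_freq, female = b_freq + b_freq_gender), 2)
```

With an interaction in the model, the `freq` coefficient alone is the slope for the reference group only; the slope for the other group is the sum of the main effect and the interaction term.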
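Next, I used all of the daily records to model the relationship. Since every user has multiple measurements, I chose a mixed model to account for this hierarchy. I kept the same fixed-effect structure as in the linear regression, with the daily indicator exercised_today in place of exercise frequency, and added a random intercept and a random slope for exercised_today for each user, as follows: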
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## resting_heart_rate ~ exercised_today + bmi + gender + exercised_today:gender +
## (1 + exercised_today | userID)
## Data: data1_1
##
## AIC BIC logLik deviance df.resid
## 1399972.5 1400066.5 -699977.3 1399954.5 253875
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8290 -0.6496 0.0058 0.6526 5.7621
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## userID (Intercept) 59.82 7.734
## exercised_today 1.02 1.010 -0.09
## Residual 13.05 3.612
## Number of obs: 253884, groups: userID, 4388
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 53.21883 0.65475 81.28
## exercised_today -0.76230 0.03098 -24.60
## bmi 0.40408 0.02284 17.69
## gender 3.30098 0.23583 14.00
## exercised_today:gender -0.10659 0.04411 -2.42
##
## Correlation of Fixed Effects:
## (Intr) exrcs_ bmi gender
## exercsd_tdy -0.027
## bmi -0.967 0.000
## gender -0.288 0.074 0.114
## exrcsd_tdy: 0.018 -0.702 0.000 -0.101
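A model of this form can be fit with `lmer` from the lme4 package, as in the sketch below. The data are synthetic, every simulated effect size is made up, and the gender coding (1 = female) is an assumption; only the model formula and the use of maximum likelihood (`REML = FALSE`) are taken from the output above.

```r
library(lme4)

set.seed(7)
n_users <- 300
n_days <- 30
# Synthetic users with their own baseline and their own response to exercise.
users <- data.frame(
  userID = seq_len(n_users),
  bmi = rnorm(n_users, mean = 27, sd = 5),
  gender = rbinom(n_users, 1, 0.5),     # assumed coding: 1 = female
  u0 = rnorm(n_users, sd = 7),          # user-specific intercept shift
  u1 = rnorm(n_users, sd = 1)           # user-specific exercise-effect shift
)
# One row per user per day.
data1_1 <- merge(users, expand.grid(userID = users$userID, day = seq_len(n_days)))
data1_1$exercised_today <- rbinom(nrow(data1_1), 1, 0.4)
data1_1$resting_heart_rate <- with(data1_1,
  53 + u0 + 0.4 * bmi + 3 * gender +
    (-0.8 + u1 - 0.1 * gender) * exercised_today +
    rnorm(nrow(data1_1), sd = 3.5))

# Random intercept and random slope for exercised_today, grouped by user.
mix.model <- lmer(
  resting_heart_rate ~ exercised_today + bmi + gender + exercised_today:gender +
    (1 + exercised_today | userID),
  data = data1_1, REML = FALSE
)
fixef(mix.model)     # population-level effects
VarCorr(mix.model)   # between-user variation in intercept and slope
```

Fitting by maximum likelihood rather than REML is what allows models with different fixed effects to be compared with a likelihood ratio test.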
The mixed model shows that for males, exercising on a given day is associated with a 0.76 decrease in RHR on average; for females, the decrease is 0.87 on average (the 0.76 main effect plus the 0.11 interaction term). The following likelihood ratio test shows that the fixed effects of exercise are significant.
## Data: data1_1
## Models:
## mix.null: resting_heart_rate ~ bmi + gender + (1 + exercised_today | userID)
## mix.model: resting_heart_rate ~ exercised_today + bmi + gender + exercised_today:gender +
## mix.model: (1 + exercised_today | userID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mix.null 7 1401142 1401215 -700564 1401128
## mix.model 9 1399973 1400067 -699977 1399955 1173.2 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
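The p-value in this table can be checked by hand from the reported statistic. The sketch below uses only numbers copied from the table above and base R's `pchisq`; the variable names are mine.

```r
# Values copied from the likelihood ratio test table above.
chisq <- 1173.2
df <- 9 - 7    # parameters in mix.model minus parameters in mix.null

# Upper tail of the chi-squared distribution with df degrees of freedom.
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
p_value < 2.2e-16

# The statistic is the drop in deviance between the two fits. The printed
# deviances are rounded, so this matches the reported value only approximately.
1401128 - 1399955
```

The two degrees of freedom correspond to the two fixed effects removed in the null model: `exercised_today` and its interaction with `gender`.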
First, we need to find out which users increased their exercise frequency. I calculated the 10-day moving average series for every user, and then estimated the linear trend by OLS (see here). I found that 2444 users showed an increase in exercise frequency at the 1% significance level.
Based on the previous models, we expect the moving average of RHR to decrease as exercise frequency increases. Applying the same computation to RHR (after replacing missing RHR values with the corresponding user’s average RHR), I found that 2772 users showed a decrease in RHR at the 1% significance level, and 1994 of them also increased their exercise frequency. The following graph shows the moving averages for three randomly selected users. The RHR series are centered for plotting.
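The per-user trend check can be sketched for a single user as follows. The exercise pattern is a made-up deterministic series, a trailing window is an assumption about how the 10-day average was taken, and the decision rule (positive slope with p < 0.01) is my reading of the description above.

```r
n_days <- 60
day <- seq_len(n_days)
# Made-up user: exercises every third day at first, two days out of three later.
exercised_today <- as.integer(ifelse(day <= 30, day %% 3 == 0, day %% 3 != 0))

# Trailing 10-day moving average; the first 9 values are NA by construction.
ma <- as.numeric(stats::filter(exercised_today, rep(1 / 10, 10), sides = 1))

# OLS trend of the moving average on the day index (lm drops the NA rows).
trend <- summary(lm(ma ~ day))$coefficients["day", ]
slope <- trend[["Estimate"]]
p_value <- trend[["Pr(>|t|)"]]

# Flag an increase when the slope is positive and significant at the 1% level.
increased <- slope > 0 && p_value < 0.01
increased
```

One caveat with this approach: consecutive moving-average values share nine of their ten days, so the OLS residuals are autocorrelated and the p-values are likely to be optimistic.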
Next, I built a mixed model for the 2444 users who showed an increase in exercise frequency (the fit below uses 2429 of them). After controlling for BMI and gender, these users see a daily decrease in RHR of 0.08 on average. A likelihood ratio test shows that the fixed effect of days is significant.
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: resting_heart_rate ~ days + bmi + gender + (1 + days | userID)
## Data: data1_2
## Control: lmerControl(check.conv.grad = .makeCC("warning", tol = 0.0022,
## relTol = NULL))
##
## AIC BIC logLik deviance df.resid
## 766535.3 766614.1 -383259.6 766519.3 140546
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.3680 -0.6472 -0.0025 0.6480 5.8448
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## userID (Intercept) 65.481829 8.09208
## days 0.002392 0.04891 -0.30
## Residual 12.098914 3.47835
## Number of obs: 140554, groups: userID, 2429
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 55.886975 0.859299 65.04
## days -0.081741 0.001138 -71.85
## bmi 0.366716 0.029803 12.30
## gender 3.506681 0.317341 11.05
##
## Correlation of Fixed Effects:
## (Intr) days bmi
## days -0.059
## bmi -0.961 0.000
## gender -0.297 0.000 0.102
## convergence code: 0
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Data: data1_2
## Models:
## mix.null2: resting_heart_rate ~ bmi + gender + (1 + days | userID)
## mix.model2: resting_heart_rate ~ days + bmi + gender + (1 + days | userID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mix.null2 7 769302 769370 -384644 769288
## mix.model2 8 766535 766614 -383260 766519 2768.3 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can find all the R code I used on my Gist.