I like wearing my fitbit to track steps and progress of exercise classes. Here is a quick analysis of my own fitbit data collected between April 2018 (when my husband brought me my fitbit) until 31st August 2019.
If you also have a fitbit you can download your own data (31 days at a time) from the account page settings here.
I will be structuring my analysis in the following steps 1. Audit 2. Descriptive Statistics 3. Statistical Testing 4. Model Building 5. Evaluation
For a fuller description of how I carry out these steps check out my blog post.
download library packages you require
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.0.1
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readxl)
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(RColorBrewer)
library(corrplot)
## corrplot 0.84 loaded
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(zoo)
library(dplyr)
library(forecast)
##
## Attaching package: 'forecast'
## The following object is masked from 'package:ggpubr':
##
## gghistogram
if you don’t have any of the above packages use the function install.packages(“insert package name here”)
importing the data from excel, here I used the import button underneath the data environment tab on the right hand side of Rstudio
I used the following code to import from excel into R
First what type of data do we have? This will help determine the range of analysis which can be robustly conducted on the dataset. Some of which may already be converted as part of the import process. For example date was specified as a date variable above but month number was specified as a character (as this will help with conversion to factor later).
A good place to start is to look at your data 1. View your data in the data window
View(Raw_Fitbit_Data)
head(Raw_Fitbit_Data, n=10)
## # A tibble: 10 x 15
## Date `Calories Burne~ Steps Distance Floors
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2018-04-09 00:00:00 1995 9364 6.52 0
## 2 2018-04-10 00:00:00 2135 13642 9.5 0
## 3 2018-04-11 00:00:00 2481 19112 14.7 28
## 4 2018-04-12 00:00:00 1877 8091 5.9 30
## 5 2018-04-13 00:00:00 2470 17129 13.3 40
## 6 2018-04-14 00:00:00 2104 13026 9.23 33
## 7 2018-04-15 00:00:00 2601 13641 10.9 43
## 8 2018-04-16 00:00:00 2432 17671 13.5 47
## 9 2018-04-17 00:00:00 2543 14072 10.8 46
## 10 2018-04-18 00:00:00 2735 24106 18.0 42
## # ... with 10 more variables: `Minutes Sedentary` <dbl>, `Minutes Lightly
## # Active` <dbl>, `Minutes Fairly Active` <dbl>, `Minutes Very
## # Active` <dbl>, `Activity Calories` <dbl>, Weekday <chr>,
## # Weekend <chr>, `Proportion of total calories active` <dbl>,
## # Month <chr>, Year <dbl>
R can aid you in finding out what types of variables we are dealing with.
summary(Raw_Fitbit_Data)
## Date Calories Burned Steps
## Min. :2018-04-09 00:00:00 Min. :1342 Min. : 0
## 1st Qu.:2018-08-14 06:00:00 1st Qu.:2186 1st Qu.:13712
## Median :2018-12-19 12:00:00 Median :2522 Median :18585
## Mean :2018-12-19 12:00:00 Mean :2483 Mean :18234
## 3rd Qu.:2019-04-25 18:00:00 3rd Qu.:2757 3rd Qu.:22740
## Max. :2019-08-31 00:00:00 Max. :4910 Max. :42719
## Distance Floors Minutes Sedentary Minutes Lightly Active
## Min. : 0.00 Min. : 0.00 Min. : 595 Min. : 0.0
## 1st Qu.:10.17 1st Qu.: 38.00 1st Qu.:1044 1st Qu.:178.0
## Median :14.04 Median : 48.00 Median :1097 Median :210.0
## Mean :14.00 Mean : 48.03 Mean :1097 Mean :208.2
## 3rd Qu.:17.68 3rd Qu.: 57.00 3rd Qu.:1163 3rd Qu.:241.0
## Max. :31.15 Max. :144.00 Max. :1440 Max. :507.0
## Minutes Fairly Active Minutes Very Active Activity Calories
## Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.:15.00 1st Qu.: 51.25 1st Qu.:1065
## Median :24.00 Median : 98.00 Median :1464
## Mean :25.89 Mean : 92.45 Mean :1407
## 3rd Qu.:36.00 3rd Qu.:128.00 3rd Qu.:1736
## Max. :94.00 Max. :405.00 Max. :4188
## Weekday Weekend Proportion of total calories active
## Length:510 Length:510 Min. :0.0000
## Class :character Class :character 1st Qu.:0.4843
## Mode :character Mode :character Median :0.5785
## Mean :0.5492
## 3rd Qu.:0.6297
## Max. :0.8530
## Month Year
## Length:510 Min. :2018
## Class :character 1st Qu.:2018
## Mode :character Median :2018
## Mean :2018
## 3rd Qu.:2019
## Max. :2019
Ok.. so some of those variables are not in a suitable format for analysis let’s convert
#first creating a copy
FitbitDatav2 <- Raw_Fitbit_Data
FitbitDatav2$Weekday <- factor(FitbitDatav2$Weekday, levels=c(1,2,3,4,5,6,7), labels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
FitbitDatav2$Month <- factor(FitbitDatav2$Month, levels=c(1,2,3,4,5,6,7,8,9, 10, 11, 12 ), labels=c("January","February", "March", "April", "May", "June" ,"July", "August", "September", "October", "November", "December"))
FitbitDatav2$Weekend <- factor(FitbitDatav2$Weekend, levels=c(0,1), labels=c("Weekday", "Weekend"))
let’s check the variables again
summary(FitbitDatav2)
## Date Calories Burned Steps
## Min. :2018-04-09 00:00:00 Min. :1342 Min. : 0
## 1st Qu.:2018-08-14 06:00:00 1st Qu.:2186 1st Qu.:13712
## Median :2018-12-19 12:00:00 Median :2522 Median :18585
## Mean :2018-12-19 12:00:00 Mean :2483 Mean :18234
## 3rd Qu.:2019-04-25 18:00:00 3rd Qu.:2757 3rd Qu.:22740
## Max. :2019-08-31 00:00:00 Max. :4910 Max. :42719
##
## Distance Floors Minutes Sedentary Minutes Lightly Active
## Min. : 0.00 Min. : 0.00 Min. : 595 Min. : 0.0
## 1st Qu.:10.17 1st Qu.: 38.00 1st Qu.:1044 1st Qu.:178.0
## Median :14.04 Median : 48.00 Median :1097 Median :210.0
## Mean :14.00 Mean : 48.03 Mean :1097 Mean :208.2
## 3rd Qu.:17.68 3rd Qu.: 57.00 3rd Qu.:1163 3rd Qu.:241.0
## Max. :31.15 Max. :144.00 Max. :1440 Max. :507.0
##
## Minutes Fairly Active Minutes Very Active Activity Calories
## Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.:15.00 1st Qu.: 51.25 1st Qu.:1065
## Median :24.00 Median : 98.00 Median :1464
## Mean :25.89 Mean : 92.45 Mean :1407
## 3rd Qu.:36.00 3rd Qu.:128.00 3rd Qu.:1736
## Max. :94.00 Max. :405.00 Max. :4188
##
## Weekday Weekend Proportion of total calories active
## Sunday :72 Weekday:365 Min. :0.0000
## Monday :73 Weekend:145 1st Qu.:0.4843
## Tuesday :73 Median :0.5785
## Wednesday:73 Mean :0.5492
## Thursday :73 3rd Qu.:0.6297
## Friday :73 Max. :0.8530
## Saturday :73
## Month Year
## May : 62 Min. :2018
## July : 62 1st Qu.:2018
## August : 62 Median :2018
## June : 60 Mean :2018
## April : 52 3rd Qu.:2019
## January: 31 Max. :2019
## (Other):181
that looks good so in summary we have * Date Type + Date (note we have one observation for each day) * Numeric Type + Calories Burned (per day in kcal) + Steps + Distance (km) + Floors + Minutes Sedentary + Minutes Lightly Active + Minutes Fairly Active + Minutes Very Active + Activity Calories + Proportion of total calories active (ok I created this one in excel and it is the activity calories divided by Calories burned total) + Year (however this could be turned into a factor) * Factor variables + Weekday (day of the week) + Weekend (coded as 1 if a weekend day and 0 if Mon to Fri) + Month
Now let’s check whether there is missing data
sapply(FitbitDatav2, function(x) sum(is.na(x)))
## Date Calories Burned
## 0 0
## Steps Distance
## 0 0
## Floors Minutes Sedentary
## 0 0
## Minutes Lightly Active Minutes Fairly Active
## 0 0
## Minutes Very Active Activity Calories
## 0 0
## Weekday Weekend
## 0 0
## Proportion of total calories active Month
## 0 0
## Year
## 0
some nice clean data by the looks of it - now let’s move onto the new stage Descriptive Statistics
This section focuses on what has occurred in the past and by interpreting the descriptive statistics it can help inform hypotheses to base statistical testing on.
First the numerical data variables, for some quick descriptive statistics I like using the describe function from the psych package
print(describe(FitbitDatav2[,c(2:10, 13)]))
## vars n mean sd median
## Calories Burned 1 510 2482.59 398.21 2522.50
## Steps 2 510 18233.85 6542.61 18584.50
## Distance 3 510 14.00 5.29 14.04
## Floors 4 510 48.03 16.40 48.00
## Minutes Sedentary 5 510 1096.53 119.47 1097.00
## Minutes Lightly Active 6 510 208.18 58.32 210.00
## Minutes Fairly Active 7 510 25.89 15.67 24.00
## Minutes Very Active 8 510 92.45 51.51 98.00
## Activity Calories 9 510 1407.20 482.12 1464.00
## Proportion of total calories active 10 510 0.55 0.12 0.58
## trimmed mad min max
## Calories Burned 2490.47 391.41 1342 4910.00
## Steps 18283.22 6863.70 0 42719.00
## Distance 13.99 5.55 0 31.15
## Floors 47.67 14.83 0 144.00
## Minutes Sedentary 1102.40 85.99 595 1440.00
## Minutes Lightly Active 209.38 47.44 0 507.00
## Minutes Fairly Active 25.00 14.83 0 94.00
## Minutes Very Active 91.84 53.37 0 405.00
## Activity Calories 1418.52 445.52 0 4188.00
## Proportion of total calories active 0.56 0.09 0 0.85
## range skew kurtosis se
## Calories Burned 3568.00 0.12 2.00 17.63
## Steps 42719.00 -0.08 0.00 289.71
## Distance 31.15 0.00 -0.15 0.23
## Floors 144.00 0.65 3.51 0.73
## Minutes Sedentary 845.00 -0.95 3.32 5.29
## Minutes Lightly Active 507.00 -0.14 2.58 2.58
## Minutes Fairly Active 94.00 0.73 1.00 0.69
## Minutes Very Active 405.00 0.36 1.47 2.28
## Activity Calories 4188.00 0.01 1.63 21.35
## Proportion of total calories active 0.85 -1.33 2.71 0.01
There are potentially some non normal variables at play here but it difficult to determine without testing
lapply(FitbitDatav2[,c(2:10, 13)], FUN=shapiro.test)
## $`Calories Burned`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.97179, p-value = 2.42e-08
##
##
## $Steps
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99564, p-value = 0.1679
##
##
## $Distance
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.99678, p-value = 0.4063
##
##
## $Floors
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.95901, p-value = 1.036e-10
##
##
## $`Minutes Sedentary`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.92476, p-value = 2.699e-15
##
##
## $`Minutes Lightly Active`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.96733, p-value = 3.126e-09
##
##
## $`Minutes Fairly Active`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.96529, p-value = 1.293e-09
##
##
## $`Minutes Very Active`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.96023, p-value = 1.659e-10
##
##
## $`Activity Calories`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.97515, p-value = 1.293e-07
##
##
## $`Proportion of total calories active`
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.90789, p-value < 2.2e-16
For my Fitbit statistics Steps and Distance are the only numeric variables where the normality assumption cannot be rejected according to the shapiro test.
#ok before I start this I am going to rename that calories variable
names(FitbitDatav2)[2] <- "Calories"
Calorie_Density <- ggplot(FitbitDatav2, aes(x=Calories)) + geom_line(stat="density") + expand_limits(y=0)
Steps_Density <- ggplot(FitbitDatav2, aes(x=Steps)) + geom_line(stat="density") + expand_limits(y=0)
Distance_Density <- ggplot(FitbitDatav2, aes(x=Distance)) + geom_line(stat="density") + expand_limits(y=0)
Floors_Density <- ggplot(FitbitDatav2, aes(x=Floors)) + geom_line(stat="density") + expand_limits(y=0)
ggarrange(Calorie_Density, Steps_Density, Distance_Density, Floors_Density, ncol = 2, nrow = 2)
Checking out some of the other numeric variable distributions
Minutes_Sedentary_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Sedentary)) + geom_line(stat="density") + expand_limits(y=0)
Minutes_Lightly_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Lightly_Active)) + geom_line(stat="density") + expand_limits(y=0)
Minutes_Fairly_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Fairly_Active)) + geom_line(stat="density") + expand_limits(y=0)
Minutes_Very_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Very_Active)) + geom_line(stat="density") + expand_limits(y=0)
Activity_Calories_Density <- ggplot(FitbitDatav2, aes(x=Activity_Calories)) + geom_line(stat="density") + expand_limits(y=0)
Activity_Calorie_Proportion_Density <- ggplot(FitbitDatav2, aes(x=Activity_Calorie_Proportion)) + geom_line(stat="density") + expand_limits(y=0)
ggarrange(Minutes_Sedentary_Density, Minutes_Lightly_Active_Density, Minutes_Fairly_Active_Density, Minutes_Very_Active_Density, Activity_Calories_Density, Activity_Calorie_Proportion_Density, ncol = 3, nrow = 2)
Now onto covariation i.e. how much two variables vary together. This is important as we would typically expect all of these variables to be particularly interrelated.
Fitbit_Corr <- cor(FitbitDatav2[,c(2:10, 13)])
round(Fitbit_Corr, digits=2)
## Calories Steps Distance Floors
## Calories 1.00 0.92 0.92 0.59
## Steps 0.92 1.00 0.99 0.60
## Distance 0.92 0.99 1.00 0.61
## Floors 0.59 0.60 0.61 1.00
## Minutes_Sedentary -0.68 -0.63 -0.61 -0.45
## Minutes_Lightly_Active 0.47 0.44 0.42 0.54
## Minutes_Fairly_Active 0.61 0.63 0.58 0.41
## Minutes_Very_Active 0.92 0.86 0.86 0.48
## Activity_Calories 0.98 0.91 0.90 0.61
## Activity_Calorie_Proportion 0.93 0.87 0.86 0.63
## Minutes_Sedentary Minutes_Lightly_Active
## Calories -0.68 0.47
## Steps -0.63 0.44
## Distance -0.61 0.42
## Floors -0.45 0.54
## Minutes_Sedentary 1.00 -0.53
## Minutes_Lightly_Active -0.53 1.00
## Minutes_Fairly_Active -0.47 0.26
## Minutes_Very_Active -0.57 0.18
## Activity_Calories -0.70 0.54
## Activity_Calorie_Proportion -0.71 0.60
## Minutes_Fairly_Active Minutes_Very_Active
## Calories 0.61 0.92
## Steps 0.63 0.86
## Distance 0.58 0.86
## Floors 0.41 0.48
## Minutes_Sedentary -0.47 -0.57
## Minutes_Lightly_Active 0.26 0.18
## Minutes_Fairly_Active 1.00 0.52
## Minutes_Very_Active 0.52 1.00
## Activity_Calories 0.62 0.91
## Activity_Calorie_Proportion 0.63 0.84
## Activity_Calories Activity_Calorie_Proportion
## Calories 0.98 0.93
## Steps 0.91 0.87
## Distance 0.90 0.86
## Floors 0.61 0.63
## Minutes_Sedentary -0.70 -0.71
## Minutes_Lightly_Active 0.54 0.60
## Minutes_Fairly_Active 0.62 0.63
## Minutes_Very_Active 0.91 0.84
## Activity_Calories 1.00 0.96
## Activity_Calorie_Proportion 0.96 1.00
corrplot(Fitbit_Corr, method="shade", shade.col=NA, tl.col="black", tl.srt=45)
From this we can see that for me, and very likely for the majority of fitbit users there are a couple of very correlated clusters. One is the highly correlated relationship with Calories, Steps and Distance and another cluster is Minutes very active and activity calories and ratio of activity calories to total.
Next onto those categorical variables - all of the categorical variables are based on dates and therefore are going to be fairly evenly distributed. Howeever we can instead look at the relationship with time.
#Converting to time series for some analysis
Fitbit_XTS <- xts(FitbitDatav2[,-1], order.by=as.Date(FitbitDatav2$Date))
Calorie_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Calories)) +
geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
method = "loess")+ theme_minimal()
Steps_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Steps)) +
geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
method = "loess")+ theme_minimal()
Distance_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Distance)) +
geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
method = "loess")+ theme_minimal()
Floors_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Floors)) +
geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
method = "loess")+ theme_minimal()
Calorie_TimeSeries
Steps_TimeSeries
Distance_TimeSeries
Floors_TimeSeries
The amount of variation on a daily or weekly basis suggests that some of these time series could be modelled as a random walk…. hehehe But an actual observation here is that it seems like there was an increase until January and has since levelled out potentially even reducing a little in August. As this is my data I know I have started doing a new weight lifting and less cardio based programme in August.
Minutes_Sedentary_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Sedentary)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
method = "loess")+ theme_minimal()
Minutes_Lightly_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Lightly_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",method = "loess")+ theme_minimal()
Minutes_Fairly_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Fairly_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",method = "loess")+ theme_minimal()
Minutes_Very_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Very_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()
Activity_Calories_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Activity_Calories)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()
Activity_Calories_Proportion_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Activity_Calorie_Proportion)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()
Minutes_Sedentary_TimeSeries
Minutes_Lightly_Active_TimeSeries
Minutes_Fairly_Active_TimeSeries
Minutes_Very_Active_TimeSeries
Activity_Calories_TimeSeries
Activity_Calories_Proportion_TimeSeries
Also during the last few weeks the weight lifting and reduction of cardio has meant that the minutes very active has decreased while the minutes lightly and fairly active has increased.
Is there any relationship between the categorical variables and the numerical variables?
qplot(FitbitDatav2$Weekday, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Calories by Day of Week", xlab= "Day of Week", ylab="Calories per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")
qplot(FitbitDatav2$Weekday, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Steps by Day of Week", xlab= "Day of Week", ylab="Steps per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")
qplot(FitbitDatav2$Weekday, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Distance by Day of Week", xlab= "Day of Week", ylab="Distance(km) per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")
Wednesday stands out for the proportion of calories burned during the day being the highest day of the week and Thursday tends to be lower. However there is a lot of overlap of the between the interquartile ranges across days.
qplot(FitbitDatav2$Weekend, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Calories Weekday versus Weekend", xlab= " ", ylab="Calories per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Steps Weekday versus Weekend", xlab= " ", ylab="Steps per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Distance Weekday versus Weekend", xlab= " ", ylab="Distance(km) walked per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Floors, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Floors Weekday versus Weekend", xlab= " ", ylab="Floors per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Sedentary, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Sedentary Minutes Weekday versus Weekend", xlab= " ", ylab="Sedentary Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Lightly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Lightly Active Minutes Weekday versus Weekend", xlab= " ", ylab="Lightly Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Fairly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Fairly Active Minutes Weekday versus Weekend", xlab= " ", ylab="Fairly Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Very_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Very Active Minutes Weekday versus Weekend", xlab= " ", ylab="Very Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Activity_Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Activity Calories Weekday versus Weekend", xlab= " ", ylab="Activity Calories per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
qplot(FitbitDatav2$Weekend, FitbitDatav2$Activity_Calorie_Proportion, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Activity Calorie Proportion Weekday versus Weekend", xlab= " ", ylab="Activity Calorie Proportion per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")
It would appear that there are only small differences between weekend days and days of the week. The most common observation is that the inter quartile range is smaller with the weekend days however this very well could be due to the smaller number of days.
qplot(FitbitDatav2$Month, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Calories by Month", xlab= "Month", ylab="Calories per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Steps by Month", xlab= "Month", ylab="Steps per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Distance(km) by Month", xlab= "Month", ylab="Distance per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Floors, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Floors by Month", xlab= "Month", ylab="Floors per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Sedentary, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Sedentary Minutes by Month", xlab= "Month", ylab="Sedentary Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Lightly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Lightly Active Minutes by Month", xlab= "Month", ylab="Lightly Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Fairly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Fairly Active Minutes by Month", xlab= "Month", ylab="Fairly Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Very_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Very Active Minutes by Month", xlab= "Month", ylab="Very Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Activity_Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Activity Calories by Month", xlab= "Month", ylab="Activity Calories per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
qplot(FitbitDatav2$Month, FitbitDatav2$Activity_Calorie_Proportion, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Activity Calorie Proportion by Month", xlab= "Month", ylab="Activity Calorie Proportion per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)
The autumn months so far appear to be the months with the lowest values for many of the above variables however for these months there is only a one month observation compared with the months of april to august where there are 2 months worth of observations.
Results 1. There is a high correlation between Calories, Steps and distance walked - consequently would not be sensible to include both steps and distance in the same regression as independent variables
I am not more active on weekend days
FitbitDatav3 <- FitbitDatav2
#Starting to create the dummies
FitbitDatav3$Wednesday <- 0
FitbitDatav3$Quarter1 <- 0
FitbitDatav3$Quarter2 <- 0
FitbitDatav3$Year2019 <- 0
FitbitDatav3$Wednesday[FitbitDatav3$Weekday=="Wednesday"] = 1
FitbitDatav3$Quarter1[FitbitDatav3$Month=="April"|FitbitDatav3$Month=="May"|FitbitDatav3$Month=="June"] = 1
FitbitDatav3$Quarter2[FitbitDatav3$Month=="July"|FitbitDatav3$Month=="August"|FitbitDatav3$Month=="September"] = 1
FitbitDatav3$Year2019[FitbitDatav3$Year==2019] = 1
##Statistical Testing Testing whether the hypotheses are correct
note calories is non-normal but steps are normally distributed
wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Wednesday)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Calories by FitbitDatav3$Wednesday
## W = 10690, p-value = 6.392e-06
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Quarter1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Calories by FitbitDatav3$Quarter1
## W = 32816, p-value = 0.02314
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Quarter2)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Calories by FitbitDatav3$Quarter2
## W = 27823, p-value = 0.7882
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Year2019)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Calories by FitbitDatav3$Year2019
## W = 21203, p-value = 1.376e-11
## alternative hypothesis: true location shift is not equal to 0
For calories burned per day - Wednesdays are significantly different to all other days - Quarter 1 months (April, May and June) are significantly different to other quarters - Quarter 2 months (July, August and September) are not signficantly different to other quarters - Observations this year (2019) are statistically significant from 2018 observations
t.test(FitbitDatav3$Steps, FitbitDatav3$Wednesday)
##
## Welch Two Sample t-test
##
## data: FitbitDatav3$Steps and FitbitDatav3$Wednesday
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17664.53 18802.88
## sample estimates:
## mean of x mean of y
## 1.823385e+04 1.431373e-01
t.test(FitbitDatav3$Steps, FitbitDatav3$Quarter1)
##
## Welch Two Sample t-test
##
## data: FitbitDatav3$Steps and FitbitDatav3$Quarter1
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17664.33 18802.69
## sample estimates:
## mean of x mean of y
## 1.823385e+04 3.411765e-01
t.test(FitbitDatav3$Steps, FitbitDatav3$Quarter2)
##
## Welch Two Sample t-test
##
## data: FitbitDatav3$Steps and FitbitDatav3$Quarter2
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17664.37 18802.72
## sample estimates:
## mean of x mean of y
## 1.823385e+04 3.019608e-01
t.test(FitbitDatav3$Steps, FitbitDatav3$Year2019)
##
## Welch Two Sample t-test
##
## data: FitbitDatav3$Steps and FitbitDatav3$Year2019
## t = 62.936, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17664.20 18802.55
## sample estimates:
## mean of x mean of y
## 1.823385e+04 4.764706e-01
However for steps all of the hypotheses are significant what I were to use the wilcoxon test instead
wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Wednesday)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Steps by FitbitDatav3$Wednesday
## W = 9432.5, p-value = 2.246e-08
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Quarter1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Steps by FitbitDatav3$Quarter1
## W = 32990, p-value = 0.01723
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Quarter2)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Steps by FitbitDatav3$Quarter2
## W = 26060, p-value = 0.3766
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Year2019)
##
## Wilcoxon rank sum test with continuity correction
##
## data: FitbitDatav3$Steps by FitbitDatav3$Year2019
## W = 22896, p-value = 9.381e-09
## alternative hypothesis: true location shift is not equal to 0
all still significant great :)
Now onto model building There are two dependent (outcome) variables I am interested in testing 1. Calories Burnt per day 2. Steps walked per day
To uncover the factors I would suggest using a standard regression technique
#multiple regression
Calorie_Gamma_model <- glm(Calories ~ Steps + Wednesday + Quarter1, family="Gamma", data=FitbitDatav3)
summary(Calorie_Gamma_model)
##
## Call:
## glm(formula = Calories ~ Steps + Wednesday + Quarter1, family = "Gamma",
## data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.24644 -0.04688 -0.01034 0.03484 0.27941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.763e-04 4.195e-06 137.358 <2e-16 ***
## Steps -9.101e-09 1.938e-10 -46.964 <2e-16 ***
## Wednesday 8.936e-06 3.488e-06 2.562 0.0107 *
## Quarter1 -2.869e-07 2.683e-06 -0.107 0.9149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.005059001)
##
## Null deviance: 13.6494 on 509 degrees of freedom
## Residual deviance: 2.5016 on 506 degrees of freedom
## AIC: 6705.6
##
## Number of Fisher Scoring iterations: 4
#normally distribured ols model
Calorie_OLS_model <- glm(Calories ~ Steps + Wednesday + Quarter1, family="gaussian", data=FitbitDatav3)
summary(Calorie_OLS_model)
##
## Call:
## glm(formula = Calories ~ Steps + Wednesday + Quarter1, family = "gaussian",
## data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -288.62 -96.75 -39.81 56.86 1028.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.453e+03 2.109e+01 68.893 <2e-16 ***
## Steps 5.684e-02 1.066e-03 53.311 <2e-16 ***
## Wednesday -4.795e+01 1.982e+01 -2.419 0.0159 *
## Quarter1 -2.157e-01 1.427e+01 -0.015 0.9879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 23136.33)
##
## Null deviance: 80711323 on 509 degrees of freedom
## Residual deviance: 11706982 on 506 degrees of freedom
## AIC: 6578.4
##
## Number of Fisher Scoring iterations: 2
#normally distribured ols model
Calorie_OLS_modelv2 <- glm(Calories ~ Wednesday + Quarter1, family="gaussian", data=FitbitDatav3)
summary(Calorie_OLS_modelv2)
##
## Call:
## glm(formula = Calories ~ Wednesday + Quarter1, family = "gaussian",
## data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1134.48 -281.23 49.31 264.10 2433.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2476.48 22.46 110.250 < 2e-16 ***
## Wednesday 207.84 49.42 4.205 3.08e-05 ***
## Quarter1 -69.29 36.51 -1.898 0.0583 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 152784.3)
##
## Null deviance: 80711323 on 509 degrees of freedom
## Residual deviance: 77461656 on 507 degrees of freedom
## AIC: 7540.1
##
## Number of Fisher Scoring iterations: 2
Steps_OLS_model <- glm(Steps ~ Wednesday + Quarter1 + Quarter2 + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_model)
##
## Call:
## glm(formula = Steps ~ Wednesday + Quarter1 + Quarter2 + Year2019,
## family = "gaussian", data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19549.1 -4278.4 566.1 4203.5 26159.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16559.2 539.6 30.690 < 2e-16 ***
## Wednesday 4489.0 774.5 5.796 1.20e-08 ***
## Quarter1 -1505.4 649.7 -2.317 0.0209 *
## Quarter2 -121.3 672.6 -0.180 0.8569
## Year2019 3320.9 545.8 6.084 2.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37522619)
##
## Null deviance: 2.1788e+10 on 509 degrees of freedom
## Residual deviance: 1.8949e+10 on 505 degrees of freedom
## AIC: 10349
##
## Number of Fisher Scoring iterations: 2
Steps_OLS_modelv2 <- glm(Steps ~ Wednesday + Quarter1 + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_modelv2)
##
## Call:
## glm(formula = Steps ~ Wednesday + Quarter1 + Year2019, family = "gaussian",
## data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -19497.6 -4263.4 590.1 4195.9 26218.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16500.3 429.0 38.459 < 2e-16 ***
## Wednesday 4489.0 773.8 5.801 1.16e-08 ***
## Quarter1 -1450.3 572.9 -2.532 0.0117 *
## Year2019 3328.3 543.8 6.120 1.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37450876)
##
## Null deviance: 2.1788e+10 on 509 degrees of freedom
## Residual deviance: 1.8950e+10 on 506 degrees of freedom
## AIC: 10347
##
## Number of Fisher Scoring iterations: 2
Steps_OLS_modelv3 <- glm(Steps ~ Wednesday + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_modelv3)
##
## Call:
## glm(formula = Steps ~ Wednesday + Year2019, family = "gaussian",
## data = FitbitDatav3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -18954.7 -4178.0 596.7 4453.0 26669.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16049.7 392.5 40.896 < 2e-16 ***
## Wednesday 4487.1 777.9 5.768 1.39e-08 ***
## Year2019 3236.0 545.5 5.933 5.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 37850437)
##
## Null deviance: 2.1788e+10 on 509 degrees of freedom
## Residual deviance: 1.9190e+10 on 507 degrees of freedom
## AIC: 10351
##
## Number of Fisher Scoring iterations: 2
Steps_LM <- lm(Steps ~ Wednesday + Year2019, data=FitbitDatav3)
summary(Steps_LM)
##
## Call:
## lm(formula = Steps ~ Wednesday + Year2019, data = FitbitDatav3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18954.7 -4178.0 596.7 4453.0 26669.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16049.7 392.5 40.896 < 2e-16 ***
## Wednesday 4487.1 777.9 5.768 1.39e-08 ***
## Year2019 3236.0 545.5 5.933 5.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6152 on 507 degrees of freedom
## Multiple R-squared: 0.1192, Adjusted R-squared: 0.1158
## F-statistic: 34.32 on 2 and 507 DF, p-value: 1.052e-14
Another way to look at the data is to see whether previous observations could predict furture observations. To do this
#Converting to time series for some analysis
Calories_XTS <- xts(FitbitDatav3[,2], order.by=as.Date(FitbitDatav3$Date))
Steps_XTS <- xts(FitbitDatav3[,3], order.by=as.Date(FitbitDatav3$Date))
#Creating training data
Calories_train <- subset(Calories_XTS, end=length(Calories_XTS)-28)
Steps_train <- subset(Steps_XTS, end=length(Steps_XTS)-28)
Calories_ETS <- ets(Calories_XTS)
checkresiduals(Calories_ETS)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 9.2735, df = 8, p-value = 0.3198
##
## Model df: 2. Total lags used: 10
autoplot(forecast(Calories_ETS))
Steps_ETS <- ets(Steps_XTS)
checkresiduals(Steps_ETS)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 15.599, df = 8, p-value = 0.0485
##
## Model df: 2. Total lags used: 10
autoplot(forecast(Steps_ETS))
fets <- function(y, h){
forecast(ets(y), h=h)
}
Calories_e1 <- tsCV(Calories_XTS, fets, h=28)
Calories_e2 <- tsCV(Calories_XTS, snaive, h=28)
mean(Calories_e1^2, na.rm=T)
## [1] 170419.4
mean(Calories_e2^2, na.rm=T)
## [1] 293607.4
Steps_e1 <- tsCV(Steps_XTS, fets, h=28)
Steps_e2 <- tsCV(Steps_XTS, snaive, h=28)
mean(Steps_e1^2, na.rm=T)
## [1] 53263772
mean(Steps_e2^2, na.rm=T)
## [1] 80752169
For both series the ets function works better than a seasonal naive model but it is clear there is some weekly seasonality at play for both of the variables.
So that concludes my analysis of my fitbit data and hopefully can help you in conducting your own analysis with similar datasets.
Despite thinking that my weekend days would be much more active than my workdays I have found them not to be significantly different. However wednesdays (where during part of this dataset I would tend to go to two cardio classes back to back) were found to be significantly correlated with both Calories and Step Count.
Now I have moved onto a new weight lifting based programme I would be interested in coming back to this analysis to see how this has impacted on the overall values.