Fitbit EDA

Stepping Up

An exploratory (actually pretty extensive) data analysis of my Fitbit Data

I like wearing my fitbit to track steps and progress of exercise classes. Here is a quick analysis of my own fitbit data collected between April 2018 (when my husband brought me my fitbit) until 31st August 2019.

If you also have a fitbit you can download your own data (31 days at a time) from the account page settings here.

I will be structuring my analysis in the following steps 1. Audit 2. Descriptive Statistics 3. Statistical Testing 4. Model Building 5. Evaluation

For a fuller description of how I carry out these steps check out my blog post.

Setup

download library packages you require

library(psych)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(tidyverse)

## -- Attaching packages -------------------------------------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.1       v purrr   0.3.2  
## v tidyr   0.8.3       v dplyr   0.8.0.1
## v readr   1.3.1       v stringr 1.4.0  
## v tibble  2.1.1       v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()

library(readxl)
library(ggpubr)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(RColorBrewer)
library(corrplot)

## corrplot 0.84 loaded

library(xts)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(zoo)
library(dplyr)
library(forecast)

## 
## Attaching package: 'forecast'

## The following object is masked from 'package:ggpubr':
## 
##     gghistogram

if you don’t have any of the above packages use the function install.packages(“insert package name here”)

importing the data from excel, here I used the import button underneath the data environment tab on the right hand side of Rstudio

I used the following code to import from excel into R

Auditing

First what type of data do we have? This will help determine the range of analysis which can be robustly conducted on the dataset. Some of which may already be converted as part of the import process. For example date was specified as a date variable above but month number was specified as a character (as this will help with conversion to factor later).

A good place to start is to look at your data 1. View your data in the data window

View(Raw_Fitbit_Data)

View a small sample of the data in the command window

head(Raw_Fitbit_Data, n=10)

## # A tibble: 10 x 15
##    Date                `Calories Burne~ Steps Distance Floors
##    <dttm>                         <dbl> <dbl>    <dbl>  <dbl>
##  1 2018-04-09 00:00:00             1995  9364     6.52      0
##  2 2018-04-10 00:00:00             2135 13642     9.5       0
##  3 2018-04-11 00:00:00             2481 19112    14.7      28
##  4 2018-04-12 00:00:00             1877  8091     5.9      30
##  5 2018-04-13 00:00:00             2470 17129    13.3      40
##  6 2018-04-14 00:00:00             2104 13026     9.23     33
##  7 2018-04-15 00:00:00             2601 13641    10.9      43
##  8 2018-04-16 00:00:00             2432 17671    13.5      47
##  9 2018-04-17 00:00:00             2543 14072    10.8      46
## 10 2018-04-18 00:00:00             2735 24106    18.0      42
## # ... with 10 more variables: `Minutes Sedentary` <dbl>, `Minutes Lightly
## #   Active` <dbl>, `Minutes Fairly Active` <dbl>, `Minutes Very
## #   Active` <dbl>, `Activity Calories` <dbl>, Weekday <chr>,
## #   Weekend <chr>, `Proportion of total calories active` <dbl>,
## #   Month <chr>, Year <dbl>

R can aid you in finding out what types of variables we are dealing with.

summary(Raw_Fitbit_Data)

##       Date                     Calories Burned     Steps      
##  Min.   :2018-04-09 00:00:00   Min.   :1342    Min.   :    0  
##  1st Qu.:2018-08-14 06:00:00   1st Qu.:2186    1st Qu.:13712  
##  Median :2018-12-19 12:00:00   Median :2522    Median :18585  
##  Mean   :2018-12-19 12:00:00   Mean   :2483    Mean   :18234  
##  3rd Qu.:2019-04-25 18:00:00   3rd Qu.:2757    3rd Qu.:22740  
##  Max.   :2019-08-31 00:00:00   Max.   :4910    Max.   :42719  
##     Distance         Floors       Minutes Sedentary Minutes Lightly Active
##  Min.   : 0.00   Min.   :  0.00   Min.   : 595      Min.   :  0.0         
##  1st Qu.:10.17   1st Qu.: 38.00   1st Qu.:1044      1st Qu.:178.0         
##  Median :14.04   Median : 48.00   Median :1097      Median :210.0         
##  Mean   :14.00   Mean   : 48.03   Mean   :1097      Mean   :208.2         
##  3rd Qu.:17.68   3rd Qu.: 57.00   3rd Qu.:1163      3rd Qu.:241.0         
##  Max.   :31.15   Max.   :144.00   Max.   :1440      Max.   :507.0         
##  Minutes Fairly Active Minutes Very Active Activity Calories
##  Min.   : 0.00         Min.   :  0.00      Min.   :   0     
##  1st Qu.:15.00         1st Qu.: 51.25      1st Qu.:1065     
##  Median :24.00         Median : 98.00      Median :1464     
##  Mean   :25.89         Mean   : 92.45      Mean   :1407     
##  3rd Qu.:36.00         3rd Qu.:128.00      3rd Qu.:1736     
##  Max.   :94.00         Max.   :405.00      Max.   :4188     
##    Weekday            Weekend          Proportion of total calories active
##  Length:510         Length:510         Min.   :0.0000                     
##  Class :character   Class :character   1st Qu.:0.4843                     
##  Mode  :character   Mode  :character   Median :0.5785                     
##                                        Mean   :0.5492                     
##                                        3rd Qu.:0.6297                     
##                                        Max.   :0.8530                     
##     Month                Year     
##  Length:510         Min.   :2018  
##  Class :character   1st Qu.:2018  
##  Mode  :character   Median :2018  
##                     Mean   :2018  
##                     3rd Qu.:2019  
##                     Max.   :2019

Ok.. so some of those variables are not in a suitable format for analysis let’s convert

#first creating a copy 
FitbitDatav2 <- Raw_Fitbit_Data
FitbitDatav2$Weekday <- factor(FitbitDatav2$Weekday, levels=c(1,2,3,4,5,6,7), labels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
FitbitDatav2$Month <- factor(FitbitDatav2$Month, levels=c(1,2,3,4,5,6,7,8,9, 10, 11, 12 ), labels=c("January","February", "March", "April", "May", "June" ,"July", "August", "September", "October", "November", "December"))
FitbitDatav2$Weekend <- factor(FitbitDatav2$Weekend, levels=c(0,1), labels=c("Weekday", "Weekend"))

let’s check the variables again

summary(FitbitDatav2)

##       Date                     Calories Burned     Steps      
##  Min.   :2018-04-09 00:00:00   Min.   :1342    Min.   :    0  
##  1st Qu.:2018-08-14 06:00:00   1st Qu.:2186    1st Qu.:13712  
##  Median :2018-12-19 12:00:00   Median :2522    Median :18585  
##  Mean   :2018-12-19 12:00:00   Mean   :2483    Mean   :18234  
##  3rd Qu.:2019-04-25 18:00:00   3rd Qu.:2757    3rd Qu.:22740  
##  Max.   :2019-08-31 00:00:00   Max.   :4910    Max.   :42719  
##                                                               
##     Distance         Floors       Minutes Sedentary Minutes Lightly Active
##  Min.   : 0.00   Min.   :  0.00   Min.   : 595      Min.   :  0.0         
##  1st Qu.:10.17   1st Qu.: 38.00   1st Qu.:1044      1st Qu.:178.0         
##  Median :14.04   Median : 48.00   Median :1097      Median :210.0         
##  Mean   :14.00   Mean   : 48.03   Mean   :1097      Mean   :208.2         
##  3rd Qu.:17.68   3rd Qu.: 57.00   3rd Qu.:1163      3rd Qu.:241.0         
##  Max.   :31.15   Max.   :144.00   Max.   :1440      Max.   :507.0         
##                                                                           
##  Minutes Fairly Active Minutes Very Active Activity Calories
##  Min.   : 0.00         Min.   :  0.00      Min.   :   0     
##  1st Qu.:15.00         1st Qu.: 51.25      1st Qu.:1065     
##  Median :24.00         Median : 98.00      Median :1464     
##  Mean   :25.89         Mean   : 92.45      Mean   :1407     
##  3rd Qu.:36.00         3rd Qu.:128.00      3rd Qu.:1736     
##  Max.   :94.00         Max.   :405.00      Max.   :4188     
##                                                             
##       Weekday      Weekend    Proportion of total calories active
##  Sunday   :72   Weekday:365   Min.   :0.0000                     
##  Monday   :73   Weekend:145   1st Qu.:0.4843                     
##  Tuesday  :73                 Median :0.5785                     
##  Wednesday:73                 Mean   :0.5492                     
##  Thursday :73                 3rd Qu.:0.6297                     
##  Friday   :73                 Max.   :0.8530                     
##  Saturday :73                                                    
##      Month          Year     
##  May    : 62   Min.   :2018  
##  July   : 62   1st Qu.:2018  
##  August : 62   Median :2018  
##  June   : 60   Mean   :2018  
##  April  : 52   3rd Qu.:2019  
##  January: 31   Max.   :2019  
##  (Other):181

that looks good so in summary we have * Date Type + Date (note we have one observation for each day) * Numeric Type + Calories Burned (per day in kcal) + Steps + Distance (km) + Floors + Minutes Sedentary + Minutes Lightly Active + Minutes Fairly Active + Minutes Very Active + Activity Calories + Proportion of total calories active (ok I created this one in excel and it is the activity calories divided by Calories burned total) + Year (however this could be turned into a factor) * Factor variables + Weekday (day of the week) + Weekend (coded as 1 if a weekend day and 0 if Mon to Fri) + Month

Now let’s check whether there is missing data

sapply(FitbitDatav2, function(x) sum(is.na(x)))

##                                Date                     Calories Burned 
##                                   0                                   0 
##                               Steps                            Distance 
##                                   0                                   0 
##                              Floors                   Minutes Sedentary 
##                                   0                                   0 
##              Minutes Lightly Active               Minutes Fairly Active 
##                                   0                                   0 
##                 Minutes Very Active                   Activity Calories 
##                                   0                                   0 
##                             Weekday                             Weekend 
##                                   0                                   0 
## Proportion of total calories active                               Month 
##                                   0                                   0 
##                                Year 
##                                   0

some nice clean data by the looks of it - now let’s move onto the new stage Descriptive Statistics

Descriptive Statistics

This section focuses on what has occurred in the past and by interpreting the descriptive statistics it can help inform hypotheses to base statistical testing on.

First the numerical data variables, for some quick descriptive statistics I like using the describe function from the psych package

print(describe(FitbitDatav2[,c(2:10, 13)]))

##                                     vars   n     mean      sd   median
## Calories Burned                        1 510  2482.59  398.21  2522.50
## Steps                                  2 510 18233.85 6542.61 18584.50
## Distance                               3 510    14.00    5.29    14.04
## Floors                                 4 510    48.03   16.40    48.00
## Minutes Sedentary                      5 510  1096.53  119.47  1097.00
## Minutes Lightly Active                 6 510   208.18   58.32   210.00
## Minutes Fairly Active                  7 510    25.89   15.67    24.00
## Minutes Very Active                    8 510    92.45   51.51    98.00
## Activity Calories                      9 510  1407.20  482.12  1464.00
## Proportion of total calories active   10 510     0.55    0.12     0.58
##                                      trimmed     mad  min      max
## Calories Burned                      2490.47  391.41 1342  4910.00
## Steps                               18283.22 6863.70    0 42719.00
## Distance                               13.99    5.55    0    31.15
## Floors                                 47.67   14.83    0   144.00
## Minutes Sedentary                    1102.40   85.99  595  1440.00
## Minutes Lightly Active                209.38   47.44    0   507.00
## Minutes Fairly Active                  25.00   14.83    0    94.00
## Minutes Very Active                    91.84   53.37    0   405.00
## Activity Calories                    1418.52  445.52    0  4188.00
## Proportion of total calories active     0.56    0.09    0     0.85
##                                        range  skew kurtosis     se
## Calories Burned                      3568.00  0.12     2.00  17.63
## Steps                               42719.00 -0.08     0.00 289.71
## Distance                               31.15  0.00    -0.15   0.23
## Floors                                144.00  0.65     3.51   0.73
## Minutes Sedentary                     845.00 -0.95     3.32   5.29
## Minutes Lightly Active                507.00 -0.14     2.58   2.58
## Minutes Fairly Active                  94.00  0.73     1.00   0.69
## Minutes Very Active                   405.00  0.36     1.47   2.28
## Activity Calories                    4188.00  0.01     1.63  21.35
## Proportion of total calories active     0.85 -1.33     2.71   0.01

There are potentially some non normal variables at play here but it difficult to determine without testing

lapply(FitbitDatav2[,c(2:10, 13)], FUN=shapiro.test)

## $`Calories Burned`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.97179, p-value = 2.42e-08
## 
## 
## $Steps
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99564, p-value = 0.1679
## 
## 
## $Distance
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.99678, p-value = 0.4063
## 
## 
## $Floors
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.95901, p-value = 1.036e-10
## 
## 
## $`Minutes Sedentary`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.92476, p-value = 2.699e-15
## 
## 
## $`Minutes Lightly Active`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96733, p-value = 3.126e-09
## 
## 
## $`Minutes Fairly Active`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96529, p-value = 1.293e-09
## 
## 
## $`Minutes Very Active`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.96023, p-value = 1.659e-10
## 
## 
## $`Activity Calories`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.97515, p-value = 1.293e-07
## 
## 
## $`Proportion of total calories active`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.90789, p-value < 2.2e-16

For my Fitbit statistics Steps and Distance are the only numeric variables where the normality assumption cannot be rejected according to the shapiro test.

#ok before I start this I am going to rename that calories variable 
names(FitbitDatav2)[2] <- "Calories"

Calorie_Density <- ggplot(FitbitDatav2, aes(x=Calories)) + geom_line(stat="density") + expand_limits(y=0) 
Steps_Density <- ggplot(FitbitDatav2, aes(x=Steps)) + geom_line(stat="density") + expand_limits(y=0)
Distance_Density <- ggplot(FitbitDatav2, aes(x=Distance)) +  geom_line(stat="density") + expand_limits(y=0)
Floors_Density <- ggplot(FitbitDatav2, aes(x=Floors)) + geom_line(stat="density") + expand_limits(y=0)
ggarrange(Calorie_Density, Steps_Density, Distance_Density, Floors_Density,  ncol = 2, nrow = 2)

Checking out some of the other numeric variable distributions

Minutes_Sedentary_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Sedentary)) + geom_line(stat="density") + expand_limits(y=0) 
Minutes_Lightly_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Lightly_Active)) + geom_line(stat="density") + expand_limits(y=0)
Minutes_Fairly_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Fairly_Active)) +  geom_line(stat="density") + expand_limits(y=0)
Minutes_Very_Active_Density <- ggplot(FitbitDatav2, aes(x=Minutes_Very_Active)) + geom_line(stat="density") + expand_limits(y=0)
Activity_Calories_Density <- ggplot(FitbitDatav2, aes(x=Activity_Calories)) + geom_line(stat="density") + expand_limits(y=0)
Activity_Calorie_Proportion_Density <- ggplot(FitbitDatav2, aes(x=Activity_Calorie_Proportion)) + geom_line(stat="density") + expand_limits(y=0)
ggarrange(Minutes_Sedentary_Density, Minutes_Lightly_Active_Density, Minutes_Fairly_Active_Density, Minutes_Very_Active_Density, Activity_Calories_Density, Activity_Calorie_Proportion_Density,  ncol = 3, nrow = 2)

Now onto covariation i.e. how much two variables vary together. This is important as we would typically expect all of these variables to be particularly interrelated.

Fitbit_Corr <- cor(FitbitDatav2[,c(2:10, 13)])
round(Fitbit_Corr, digits=2)

##                             Calories Steps Distance Floors
## Calories                        1.00  0.92     0.92   0.59
## Steps                           0.92  1.00     0.99   0.60
## Distance                        0.92  0.99     1.00   0.61
## Floors                          0.59  0.60     0.61   1.00
## Minutes_Sedentary              -0.68 -0.63    -0.61  -0.45
## Minutes_Lightly_Active          0.47  0.44     0.42   0.54
## Minutes_Fairly_Active           0.61  0.63     0.58   0.41
## Minutes_Very_Active             0.92  0.86     0.86   0.48
## Activity_Calories               0.98  0.91     0.90   0.61
## Activity_Calorie_Proportion     0.93  0.87     0.86   0.63
##                             Minutes_Sedentary Minutes_Lightly_Active
## Calories                                -0.68                   0.47
## Steps                                   -0.63                   0.44
## Distance                                -0.61                   0.42
## Floors                                  -0.45                   0.54
## Minutes_Sedentary                        1.00                  -0.53
## Minutes_Lightly_Active                  -0.53                   1.00
## Minutes_Fairly_Active                   -0.47                   0.26
## Minutes_Very_Active                     -0.57                   0.18
## Activity_Calories                       -0.70                   0.54
## Activity_Calorie_Proportion             -0.71                   0.60
##                             Minutes_Fairly_Active Minutes_Very_Active
## Calories                                     0.61                0.92
## Steps                                        0.63                0.86
## Distance                                     0.58                0.86
## Floors                                       0.41                0.48
## Minutes_Sedentary                           -0.47               -0.57
## Minutes_Lightly_Active                       0.26                0.18
## Minutes_Fairly_Active                        1.00                0.52
## Minutes_Very_Active                          0.52                1.00
## Activity_Calories                            0.62                0.91
## Activity_Calorie_Proportion                  0.63                0.84
##                             Activity_Calories Activity_Calorie_Proportion
## Calories                                 0.98                        0.93
## Steps                                    0.91                        0.87
## Distance                                 0.90                        0.86
## Floors                                   0.61                        0.63
## Minutes_Sedentary                       -0.70                       -0.71
## Minutes_Lightly_Active                   0.54                        0.60
## Minutes_Fairly_Active                    0.62                        0.63
## Minutes_Very_Active                      0.91                        0.84
## Activity_Calories                        1.00                        0.96
## Activity_Calorie_Proportion              0.96                        1.00

corrplot(Fitbit_Corr, method="shade", shade.col=NA, tl.col="black", tl.srt=45)

From this we can see that for me, and very likely for the majority of fitbit users there are a couple of very correlated clusters. One is the highly correlated relationship with Calories, Steps and Distance and another cluster is Minutes very active and activity calories and ratio of activity calories to total.

Next onto those categorical variables - all of the categorical variables are based on dates and therefore are going to be fairly evenly distributed. Howeever we can instead look at the relationship with time.

#Converting to time series for some analysis 
Fitbit_XTS <- xts(FitbitDatav2[,-1], order.by=as.Date(FitbitDatav2$Date))

Calorie_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Calories)) + 
     geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
  method = "loess")+ theme_minimal()
Steps_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Steps)) + 
     geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
  method = "loess")+ theme_minimal()
Distance_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Distance)) + 
     geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
  method = "loess")+ theme_minimal()
Floors_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Floors)) + 
     geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
  method = "loess")+ theme_minimal()

Calorie_TimeSeries

Steps_TimeSeries

Distance_TimeSeries

Floors_TimeSeries

The amount of variation on a daily or weekly basis suggests that some of these time series could be modelled as a random walk…. hehehe But an actual observation here is that it seems like there was an increase until January and has since levelled out potentially even reducing a little in August. As this is my data I know I have started doing a new weight lifting and less cardio based programme in August.

Minutes_Sedentary_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Sedentary)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",
  method = "loess")+ theme_minimal()
Minutes_Lightly_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Lightly_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",method = "loess")+ theme_minimal()
Minutes_Fairly_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Fairly_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07",method = "loess")+ theme_minimal()
Minutes_Very_Active_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Minutes_Very_Active)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()
Activity_Calories_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Activity_Calories)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()
Activity_Calories_Proportion_TimeSeries <- ggplot(data = FitbitDatav2, aes(x = Date, y = Activity_Calorie_Proportion)) + geom_line(color = "#00AFBB", size = 1)+ stat_smooth( color = "#FC4E07", fill = "#FC4E07", method = "loess")+ theme_minimal()

Minutes_Sedentary_TimeSeries

Minutes_Lightly_Active_TimeSeries

Minutes_Fairly_Active_TimeSeries

Minutes_Very_Active_TimeSeries

Activity_Calories_TimeSeries

Activity_Calories_Proportion_TimeSeries

Also during the last few weeks the weight lifting and reduction of cardio has meant that the minutes very active has decreased while the minutes lightly and fairly active has increased.

Is there any relationship between the categorical variables and the numerical variables?

qplot(FitbitDatav2$Weekday, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Calories by Day of Week", xlab= "Day of Week", ylab="Calories per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")

qplot(FitbitDatav2$Weekday, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Steps by Day of Week", xlab= "Day of Week", ylab="Steps per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")

qplot(FitbitDatav2$Weekday, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekday, main="Boxplot Distance by Day of Week", xlab= "Day of Week", ylab="Distance(km) per day")+labs(fill = 'Day of Week')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Accent")

Wednesday stands out for the proportion of calories burned during the day being the highest day of the week and Thursday tends to be lower. However there is a lot of overlap of the between the interquartile ranges across days.

qplot(FitbitDatav2$Weekend, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Calories Weekday versus Weekend", xlab= " ", ylab="Calories per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Steps Weekday versus Weekend", xlab= " ", ylab="Steps per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Distance Weekday versus Weekend", xlab= " ", ylab="Distance(km) walked per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Floors, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Floors Weekday versus Weekend", xlab= " ", ylab="Floors per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Sedentary, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Sedentary Minutes Weekday versus Weekend", xlab= " ", ylab="Sedentary Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Lightly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Lightly Active Minutes Weekday versus Weekend", xlab= " ", ylab="Lightly Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Fairly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Fairly Active Minutes Weekday versus Weekend", xlab= " ", ylab="Fairly Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Minutes_Very_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Very Active Minutes Weekday versus Weekend", xlab= " ", ylab="Very Active Minutes per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Activity_Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Activity Calories Weekday versus Weekend", xlab= " ", ylab="Activity Calories per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

qplot(FitbitDatav2$Weekend, FitbitDatav2$Activity_Calorie_Proportion, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Weekend, main="Boxplot Activity Calorie Proportion Weekday versus Weekend", xlab= " ", ylab="Activity Calorie Proportion per day")+labs(fill = 'Weekend')+theme_bw()+guides(fill=FALSE)+scale_fill_brewer(palette="Set2")

It would appear that there are only small differences between weekend days and days of the week. The most common observation is that the inter quartile range is smaller with the weekend days however this very well could be due to the smaller number of days.

qplot(FitbitDatav2$Month, FitbitDatav2$Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Calories by Month", xlab= "Month", ylab="Calories per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Steps, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Steps by Month", xlab= "Month", ylab="Steps per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Distance, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Distance(km) by Month", xlab= "Month", ylab="Distance per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Floors, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Floors by Month", xlab= "Month", ylab="Floors per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Sedentary, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Sedentary Minutes by Month", xlab= "Month", ylab="Sedentary Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Lightly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Lightly Active Minutes by Month", xlab= "Month", ylab="Lightly Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Fairly_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Fairly Active Minutes by Month", xlab= "Month", ylab="Fairly Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Minutes_Very_Active, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Very Active Minutes by Month", xlab= "Month", ylab="Very Active Minutes per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Activity_Calories, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Activity Calories by Month", xlab= "Month", ylab="Activity Calories per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

qplot(FitbitDatav2$Month, FitbitDatav2$Activity_Calorie_Proportion, data=FitbitDatav2, geom=c("boxplot"), fill=FitbitDatav2$Month, main="Boxplot Activity Calorie Proportion by Month", xlab= "Month", ylab="Activity Calorie Proportion per day")+labs(fill = 'Month')+theme_bw()+guides(fill=FALSE)

The autumn months so far appear to be the months with the lowest values for many of the above variables however for these months there is only a one month observation compared with the months of april to august where there are 2 months worth of observations.

Results and subsequent hypotheses to test

Results 1. There is a high correlation between Calories, Steps and distance walked - consequently would not be sensible to include both steps and distance in the same regression as independent variables

There is a high level of correlation between variables regarding active minutes

consequently again would not be sensible to include all three in a regression as independent variables

On wednesdays I am more active

a dummy variable needs to be created for wednesdays

I am not more active on weekend days
I have been more active in summer months

create a dummy variable for quarters in particular for quarter 1 (April to June) and quarter 2 (July to September)

I have been more active since 2019

create a dummy variable for observations since 2019

FitbitDatav3 <- FitbitDatav2
#Starting to create the dummies
FitbitDatav3$Wednesday <- 0 
FitbitDatav3$Quarter1 <- 0
FitbitDatav3$Quarter2 <- 0 
FitbitDatav3$Year2019 <- 0

FitbitDatav3$Wednesday[FitbitDatav3$Weekday=="Wednesday"] = 1 
FitbitDatav3$Quarter1[FitbitDatav3$Month=="April"|FitbitDatav3$Month=="May"|FitbitDatav3$Month=="June"] = 1 
FitbitDatav3$Quarter2[FitbitDatav3$Month=="July"|FitbitDatav3$Month=="August"|FitbitDatav3$Month=="September"] = 1 
FitbitDatav3$Year2019[FitbitDatav3$Year==2019] = 1

##Statistical Testing Testing whether the hypotheses are correct

note calories is non-normal but steps are normally distributed

wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Wednesday)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Calories by FitbitDatav3$Wednesday
## W = 10690, p-value = 6.392e-06
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Quarter1)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Calories by FitbitDatav3$Quarter1
## W = 32816, p-value = 0.02314
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Quarter2)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Calories by FitbitDatav3$Quarter2
## W = 27823, p-value = 0.7882
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Calories ~ FitbitDatav3$Year2019)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Calories by FitbitDatav3$Year2019
## W = 21203, p-value = 1.376e-11
## alternative hypothesis: true location shift is not equal to 0

For calories burned per day - Wednesdays are significantly different to all other days - Quarter 1 months (April, May and June) are significantly different to other quarters - Quarter 2 months (July, August and September) are not signficantly different to other quarters - Observations this year (2019) are statistically significant from 2018 observations

t.test(FitbitDatav3$Steps, FitbitDatav3$Wednesday)

## 
##  Welch Two Sample t-test
## 
## data:  FitbitDatav3$Steps and FitbitDatav3$Wednesday
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17664.53 18802.88
## sample estimates:
##    mean of x    mean of y 
## 1.823385e+04 1.431373e-01

t.test(FitbitDatav3$Steps, FitbitDatav3$Quarter1)

## 
##  Welch Two Sample t-test
## 
## data:  FitbitDatav3$Steps and FitbitDatav3$Quarter1
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17664.33 18802.69
## sample estimates:
##    mean of x    mean of y 
## 1.823385e+04 3.411765e-01

t.test(FitbitDatav3$Steps, FitbitDatav3$Quarter2)

## 
##  Welch Two Sample t-test
## 
## data:  FitbitDatav3$Steps and FitbitDatav3$Quarter2
## t = 62.937, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17664.37 18802.72
## sample estimates:
##    mean of x    mean of y 
## 1.823385e+04 3.019608e-01

t.test(FitbitDatav3$Steps, FitbitDatav3$Year2019)

## 
##  Welch Two Sample t-test
## 
## data:  FitbitDatav3$Steps and FitbitDatav3$Year2019
## t = 62.936, df = 509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17664.20 18802.55
## sample estimates:
##    mean of x    mean of y 
## 1.823385e+04 4.764706e-01

However for steps all of the hypotheses are significant what I were to use the wilcoxon test instead

wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Wednesday)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Steps by FitbitDatav3$Wednesday
## W = 9432.5, p-value = 2.246e-08
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Quarter1)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Steps by FitbitDatav3$Quarter1
## W = 32990, p-value = 0.01723
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Quarter2)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Steps by FitbitDatav3$Quarter2
## W = 26060, p-value = 0.3766
## alternative hypothesis: true location shift is not equal to 0

wilcox.test(FitbitDatav3$Steps ~ FitbitDatav3$Year2019)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  FitbitDatav3$Steps by FitbitDatav3$Year2019
## W = 22896, p-value = 9.381e-09
## alternative hypothesis: true location shift is not equal to 0

all still significant great :)

Model Building and Testing

Now onto model building There are two dependent (outcome) variables I am interested in testing 1. Calories Burnt per day 2. Steps walked per day

To uncover the factors I would suggest using a standard regression technique

#multiple regression 
Calorie_Gamma_model <- glm(Calories ~ Steps + Wednesday + Quarter1, family="Gamma", data=FitbitDatav3)
summary(Calorie_Gamma_model)

## 
## Call:
## glm(formula = Calories ~ Steps + Wednesday + Quarter1, family = "Gamma", 
##     data = FitbitDatav3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.24644  -0.04688  -0.01034   0.03484   0.27941  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.763e-04  4.195e-06 137.358   <2e-16 ***
## Steps       -9.101e-09  1.938e-10 -46.964   <2e-16 ***
## Wednesday    8.936e-06  3.488e-06   2.562   0.0107 *  
## Quarter1    -2.869e-07  2.683e-06  -0.107   0.9149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.005059001)
## 
##     Null deviance: 13.6494  on 509  degrees of freedom
## Residual deviance:  2.5016  on 506  degrees of freedom
## AIC: 6705.6
## 
## Number of Fisher Scoring iterations: 4

#normally distribured ols model
Calorie_OLS_model <- glm(Calories ~ Steps + Wednesday + Quarter1, family="gaussian", data=FitbitDatav3)
summary(Calorie_OLS_model)

## 
## Call:
## glm(formula = Calories ~ Steps + Wednesday + Quarter1, family = "gaussian", 
##     data = FitbitDatav3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -288.62   -96.75   -39.81    56.86  1028.75  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.453e+03  2.109e+01  68.893   <2e-16 ***
## Steps        5.684e-02  1.066e-03  53.311   <2e-16 ***
## Wednesday   -4.795e+01  1.982e+01  -2.419   0.0159 *  
## Quarter1    -2.157e-01  1.427e+01  -0.015   0.9879    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 23136.33)
## 
##     Null deviance: 80711323  on 509  degrees of freedom
## Residual deviance: 11706982  on 506  degrees of freedom
## AIC: 6578.4
## 
## Number of Fisher Scoring iterations: 2

#normally distribured ols model
Calorie_OLS_modelv2 <- glm(Calories ~ Wednesday + Quarter1, family="gaussian", data=FitbitDatav3)
summary(Calorie_OLS_modelv2)

## 
## Call:
## glm(formula = Calories ~ Wednesday + Quarter1, family = "gaussian", 
##     data = FitbitDatav3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1134.48   -281.23     49.31    264.10   2433.52  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2476.48      22.46 110.250  < 2e-16 ***
## Wednesday     207.84      49.42   4.205 3.08e-05 ***
## Quarter1      -69.29      36.51  -1.898   0.0583 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 152784.3)
## 
##     Null deviance: 80711323  on 509  degrees of freedom
## Residual deviance: 77461656  on 507  degrees of freedom
## AIC: 7540.1
## 
## Number of Fisher Scoring iterations: 2

Steps_OLS_model <- glm(Steps ~ Wednesday + Quarter1 + Quarter2 + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_model)

## 
## Call:
## glm(formula = Steps ~ Wednesday + Quarter1 + Quarter2 + Year2019, 
##     family = "gaussian", data = FitbitDatav3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -19549.1   -4278.4     566.1    4203.5   26159.8  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16559.2      539.6  30.690  < 2e-16 ***
## Wednesday     4489.0      774.5   5.796 1.20e-08 ***
## Quarter1     -1505.4      649.7  -2.317   0.0209 *  
## Quarter2      -121.3      672.6  -0.180   0.8569    
## Year2019      3320.9      545.8   6.084 2.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 37522619)
## 
##     Null deviance: 2.1788e+10  on 509  degrees of freedom
## Residual deviance: 1.8949e+10  on 505  degrees of freedom
## AIC: 10349
## 
## Number of Fisher Scoring iterations: 2

Steps_OLS_modelv2 <- glm(Steps ~ Wednesday + Quarter1 + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_modelv2)

## 
## Call:
## glm(formula = Steps ~ Wednesday + Quarter1 + Year2019, family = "gaussian", 
##     data = FitbitDatav3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -19497.6   -4263.4     590.1    4195.9   26218.7  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16500.3      429.0  38.459  < 2e-16 ***
## Wednesday     4489.0      773.8   5.801 1.16e-08 ***
## Quarter1     -1450.3      572.9  -2.532   0.0117 *  
## Year2019      3328.3      543.8   6.120 1.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 37450876)
## 
##     Null deviance: 2.1788e+10  on 509  degrees of freedom
## Residual deviance: 1.8950e+10  on 506  degrees of freedom
## AIC: 10347
## 
## Number of Fisher Scoring iterations: 2

Steps_OLS_modelv3 <- glm(Steps ~ Wednesday + Year2019, family="gaussian", data=FitbitDatav3)
summary(Steps_OLS_modelv3)

## 
## Call:
## glm(formula = Steps ~ Wednesday + Year2019, family = "gaussian", 
##     data = FitbitDatav3)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -18954.7   -4178.0     596.7    4453.0   26669.3  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16049.7      392.5  40.896  < 2e-16 ***
## Wednesday     4487.1      777.9   5.768 1.39e-08 ***
## Year2019      3236.0      545.5   5.933 5.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 37850437)
## 
##     Null deviance: 2.1788e+10  on 509  degrees of freedom
## Residual deviance: 1.9190e+10  on 507  degrees of freedom
## AIC: 10351
## 
## Number of Fisher Scoring iterations: 2

Steps_LM <- lm(Steps ~ Wednesday + Year2019, data=FitbitDatav3)
summary(Steps_LM)

## 
## Call:
## lm(formula = Steps ~ Wednesday + Year2019, data = FitbitDatav3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18954.7  -4178.0    596.7   4453.0  26669.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16049.7      392.5  40.896  < 2e-16 ***
## Wednesday     4487.1      777.9   5.768 1.39e-08 ***
## Year2019      3236.0      545.5   5.933 5.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6152 on 507 degrees of freedom
## Multiple R-squared:  0.1192, Adjusted R-squared:  0.1158 
## F-statistic: 34.32 on 2 and 507 DF,  p-value: 1.052e-14

Another way to look at the data is to see whether previous observations could predict furture observations. To do this

#Converting to time series for some analysis 
Calories_XTS <- xts(FitbitDatav3[,2], order.by=as.Date(FitbitDatav3$Date))
Steps_XTS <- xts(FitbitDatav3[,3], order.by=as.Date(FitbitDatav3$Date))

#Creating training data 
Calories_train <- subset(Calories_XTS, end=length(Calories_XTS)-28)
Steps_train <- subset(Steps_XTS, end=length(Steps_XTS)-28)

Calories_ETS <- ets(Calories_XTS)
checkresiduals(Calories_ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 9.2735, df = 8, p-value = 0.3198
## 
## Model df: 2.   Total lags used: 10

autoplot(forecast(Calories_ETS))

Steps_ETS <- ets(Steps_XTS)
checkresiduals(Steps_ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 15.599, df = 8, p-value = 0.0485
## 
## Model df: 2.   Total lags used: 10

autoplot(forecast(Steps_ETS))

fets <- function(y, h){
  forecast(ets(y), h=h)
}

Calories_e1 <- tsCV(Calories_XTS, fets, h=28)
Calories_e2 <- tsCV(Calories_XTS, snaive, h=28)

mean(Calories_e1^2, na.rm=T)

## [1] 170419.4

mean(Calories_e2^2, na.rm=T)

## [1] 293607.4

Steps_e1 <- tsCV(Steps_XTS, fets, h=28)
Steps_e2 <- tsCV(Steps_XTS, snaive, h=28)

mean(Steps_e1^2, na.rm=T)

## [1] 53263772

mean(Steps_e2^2, na.rm=T)

## [1] 80752169

For both series the ets function works better than a seasonal naive model but it is clear there is some weekly seasonality at play for both of the variables.

Conclusion

So that concludes my analysis of my fitbit data and hopefully can help you in conducting your own analysis with similar datasets.

Despite thinking that my weekend days would be much more active than my workdays I have found them not to be significantly different. However wednesdays (where during part of this dataset I would tend to go to two cardio classes back to back) were found to be significantly correlated with both Calories and Step Count.

Now I have moved onto a new weight lifting based programme I would be interested in coming back to this analysis to see how this has impacted on the overall values.