Here is what we have found out:

Installing and Loading Packages

# This is intentionally left blank

install.packages("readr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("lubridate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("ggpubr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("plotly")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("ggthemes")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("gganimate")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("lmtest")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("zoo")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
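
The install.packages() calls above reinstall every package each time the document is knit. A minimal sketch, assuming you only want to install what is missing; the helper names missing_pkgs and install_if_missing are hypothetical and not part of any package:

```r
# Packages installed in this analysis (same names as the calls above).
pkgs <- c("readr", "tidyverse", "lubridate", "janitor", "ggpubr",
          "plotly", "ggthemes", "gganimate", "lmtest", "zoo")

# Hypothetical helper: return the packages that are not installed yet.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing.
install_if_missing <- function(pkgs) {
  to_install <- missing_pkgs(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# install_if_missing(pkgs)  # uncomment to run the installation
```

The call is left commented out so that the block has no side effects until you choose to run it.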

library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ purrr     1.0.1
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate) 
library(ggplot2)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggpubr)
library(tidyr)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(gganimate)

## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(gplots)

## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess

## setwd("~/2022 R Code/Second Capstone")
setwd("/cloud/project")

Reading Files

setwd("/cloud/project/Capstone")
daily_activity = read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

daily_sleep = read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

daily_calories = read_csv("dailyCalories_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

daily_intensities = read_csv("dailyIntensities_merged.csv")

## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

daily_steps = read_csv("dailySteps_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

setwd("/cloud/project/Capstone")
hourly_calories = read.csv("hourlyCalories_merged.csv")
hourly_intensities = read.csv("hourlyIntensities_merged.csv")
hourly_steps = read.csv("hourlySteps_merged.csv")

weight_log_info = read.csv("weightLogInfo_merged.csv")

We skip the minute-level files because they contain too many samples, and analyzing activity by the minute is inefficient compared with the hourly and daily summaries.

We will also ignore the fat and logid columns in weight_log_info, as they add nothing meaningful to this analysis.

Data Cleaning

clean_names(daily_activity)

## # A tibble: 940 × 15
##            id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016    13162    8.5     8.5        0    1.88   0.550    6.06
##  2 1503960366 4/13/2016    10735    6.97    6.97       0    1.57   0.690    4.71
##  3 1503960366 4/14/2016    10460    6.74    6.74       0    2.44   0.400    3.91
##  4 1503960366 4/15/2016     9762    6.28    6.28       0    2.14   1.26     2.83
##  5 1503960366 4/16/2016    12669    8.16    8.16       0    2.71   0.410    5.04
##  6 1503960366 4/17/2016     9705    6.48    6.48       0    3.19   0.780    2.51
##  7 1503960366 4/18/2016    13019    8.59    8.59       0    3.25   0.640    4.71
##  8 1503960366 4/19/2016    15506    9.88    9.88       0    3.53   1.32     5.03
##  9 1503960366 4/20/2016    10544    6.68    6.68       0    1.96   0.480    4.24
## 10 1503960366 4/21/2016     9819    6.34    6.34       0    1.34   0.350    4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## #   ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## #   ⁷moderately_active_distance, ⁸light_active_distance

clean_names(daily_sleep)

## # A tibble: 413 × 5
##            id sleep_day             total_sleep_records total_minutes_…¹ total…²
##         <dbl> <chr>                               <dbl>            <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                   1              327     346
##  2 1503960366 4/13/2016 12:00:00 AM                   2              384     407
##  3 1503960366 4/15/2016 12:00:00 AM                   1              412     442
##  4 1503960366 4/16/2016 12:00:00 AM                   2              340     367
##  5 1503960366 4/17/2016 12:00:00 AM                   1              700     712
##  6 1503960366 4/19/2016 12:00:00 AM                   1              304     320
##  7 1503960366 4/20/2016 12:00:00 AM                   1              360     377
##  8 1503960366 4/21/2016 12:00:00 AM                   1              325     364
##  9 1503960366 4/23/2016 12:00:00 AM                   1              361     384
## 10 1503960366 4/24/2016 12:00:00 AM                   1              430     449
## # … with 403 more rows, and abbreviated variable names ¹total_minutes_asleep,
## #   ²total_time_in_bed

clean_names(daily_intensities)

## # A tibble: 940 × 10
##            id activity…¹ seden…² light…³ fairl…⁴ very_…⁵ seden…⁶ light…⁷ moder…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 1503960366 4/12/2016      728     328      13      25       0    6.06   0.550
##  2 1503960366 4/13/2016      776     217      19      21       0    4.71   0.690
##  3 1503960366 4/14/2016     1218     181      11      30       0    3.91   0.400
##  4 1503960366 4/15/2016      726     209      34      29       0    2.83   1.26 
##  5 1503960366 4/16/2016      773     221      10      36       0    5.04   0.410
##  6 1503960366 4/17/2016      539     164      20      38       0    2.51   0.780
##  7 1503960366 4/18/2016     1149     233      16      42       0    4.71   0.640
##  8 1503960366 4/19/2016      775     264      31      50       0    5.03   1.32 
##  9 1503960366 4/20/2016      818     205      12      28       0    4.24   0.480
## 10 1503960366 4/21/2016      838     211       8      19       0    4.65   0.350
## # … with 930 more rows, 1 more variable: very_active_distance <dbl>, and
## #   abbreviated variable names ¹activity_day, ²sedentary_minutes,
## #   ³lightly_active_minutes, ⁴fairly_active_minutes, ⁵very_active_minutes,
## #   ⁶sedentary_active_distance, ⁷light_active_distance,
## #   ⁸moderately_active_distance

clean_names(daily_steps)

## # A tibble: 940 × 3
##            id activity_day step_total
##         <dbl> <chr>             <dbl>
##  1 1503960366 4/12/2016         13162
##  2 1503960366 4/13/2016         10735
##  3 1503960366 4/14/2016         10460
##  4 1503960366 4/15/2016          9762
##  5 1503960366 4/16/2016         12669
##  6 1503960366 4/17/2016          9705
##  7 1503960366 4/18/2016         13019
##  8 1503960366 4/19/2016         15506
##  9 1503960366 4/20/2016         10544
## 10 1503960366 4/21/2016          9819
## # … with 930 more rows

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_calories))

## [1] 0

sum(duplicated(daily_intensities))

## [1] 0

sum(duplicated(daily_sleep)) ## This one has 3

## [1] 3

sum(duplicated(daily_steps))

## [1] 0

sum(duplicated(hourly_calories))

## [1] 0

sum(duplicated(hourly_intensities))

## [1] 0

sum(duplicated(hourly_steps))

## [1] 0

daily_sleep %>% distinct()

## # A tibble: 410 × 5
##            Id SleepDay              TotalSleepRecords TotalMinutesAsleep Total…¹
##         <dbl> <chr>                             <dbl>              <dbl>   <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM                 1                327     346
##  2 1503960366 4/13/2016 12:00:00 AM                 2                384     407
##  3 1503960366 4/15/2016 12:00:00 AM                 1                412     442
##  4 1503960366 4/16/2016 12:00:00 AM                 2                340     367
##  5 1503960366 4/17/2016 12:00:00 AM                 1                700     712
##  6 1503960366 4/19/2016 12:00:00 AM                 1                304     320
##  7 1503960366 4/20/2016 12:00:00 AM                 1                360     377
##  8 1503960366 4/21/2016 12:00:00 AM                 1                325     364
##  9 1503960366 4/23/2016 12:00:00 AM                 1                361     384
## 10 1503960366 4/24/2016 12:00:00 AM                 1                430     449
## # … with 400 more rows, and abbreviated variable name ¹TotalTimeInBed
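
distinct() returns a new table rather than changing daily_sleep in place, so the result has to be assigned if the duplicates are to stay out of later steps. A minimal sketch on a made-up three-row table (sleep_toy is a hypothetical name, and base R's duplicated() stands in for distinct() so the block runs without extra packages):

```r
sleep_toy <- data.frame(
  id = c(1, 1, 2),
  sleepday = c("4/12/2016", "4/12/2016", "4/12/2016"),
  totalminutesasleep = c(327, 327, 384)
)

n_dupes <- sum(duplicated(sleep_toy))             # one exact duplicate row
sleep_toy <- sleep_toy[!duplicated(sleep_toy), ]  # assign the result back
n_rows <- nrow(sleep_toy)                         # rows left after de-duplication
```

With dplyr, the equivalent would be daily_sleep <- daily_sleep %>% distinct(). The same applies to clean_names(), which also returns a renamed copy.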

sum(is.na(daily_sleep))

## [1] 0

##str(daily_activity)
##str(daily_calories)
##str(daily_intensities)
##str(daily_sleep)
str(daily_steps)

## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id         : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : num [1:940] 13162 10735 10460 9762 12669 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDay = col_character(),
##   ..   StepTotal = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Seeing this makes it easier to see that we need to convert the sleep day and activity day columns into actual dates. We should also change the column names to lower case so that they are easier to type. There are also few to no values in the Fat column of weight_log_info, so it is best to leave it out.

## str(hourly_intensities)
## str(hourly_calories)
str(hourly_steps)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...

We can also combine the daily tables into one data frame and the hourly tables into another, since each individual data frame contains only a few variables.

daily_activity = rename_with(daily_activity, tolower)
daily_calories = rename_with(daily_calories, tolower)
daily_intensities = rename_with(daily_intensities, tolower)
daily_sleep = rename_with(daily_sleep, tolower)
daily_steps = rename_with(daily_steps, tolower)

hourly_calories = rename_with(hourly_calories, tolower)
hourly_intensities = rename_with(hourly_intensities, tolower)
hourly_steps = rename_with(hourly_steps, tolower)

head(daily_activity) ## if it worked for one, it worked for all

## # A tibble: 6 × 15
##       id activ…¹ total…² total…³ track…⁴ logge…⁵ verya…⁶ moder…⁷ light…⁸ seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## #   lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹activitydate, ²totalsteps, ³totaldistance,
## #   ⁴trackerdistance, ⁵loggedactivitiesdistance, ⁶veryactivedistance,
## #   ⁷moderatelyactivedistance, ⁸lightactivedistance, ⁹sedentaryactivedistance

daily_activity$date = as.Date(daily_activity$activitydate, "%m/%d/%Y")
daily_calories$date = as.Date(daily_calories$activityday, "%m/%d/%Y")
daily_steps$date = as.Date(daily_steps$activityday, "%m/%d/%Y")
daily_intensities$date = as.Date(daily_intensities$activityday, "%m/%d/%Y")

daily_sleep %>%
    mutate(sleepday = str_remove_all(sleepday, "12:00:00 AM"))

## # A tibble: 413 × 5
##            id sleepday     totalsleeprecords totalminutesasleep totaltimeinbed
##         <dbl> <chr>                    <dbl>              <dbl>          <dbl>
##  1 1503960366 "4/12/2016 "                 1                327            346
##  2 1503960366 "4/13/2016 "                 2                384            407
##  3 1503960366 "4/15/2016 "                 1                412            442
##  4 1503960366 "4/16/2016 "                 2                340            367
##  5 1503960366 "4/17/2016 "                 1                700            712
##  6 1503960366 "4/19/2016 "                 1                304            320
##  7 1503960366 "4/20/2016 "                 1                360            377
##  8 1503960366 "4/21/2016 "                 1                325            364
##  9 1503960366 "4/23/2016 "                 1                361            384
## 10 1503960366 "4/24/2016 "                 1                430            449
## # … with 403 more rows

daily_sleep$date = as.Date(daily_sleep$sleepday, "%m/%d/%Y")

Setting up Data

daily_log = merge(daily_activity, daily_sleep, by=c("id", "date"))

I merge the data sets so that more statistical analysis can be done on complete records, without having to “predict” values that are missing. For me, it is better to keep only the rows where the data is filled in than to guess what the missing values are.
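
By default, merge() keeps only the rows whose key columns match in both tables, which is why nothing has to be filled in afterwards. A small sketch with made-up tables (activity_toy and sleep_toy are hypothetical):

```r
activity_toy <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 8000)
)
sleep_toy <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-14")),
  totalminutesasleep = c(327, 400)
)

# Default: inner join, only id/date pairs present in both tables survive.
inner <- merge(activity_toy, sleep_toy, by = c("id", "date"))

# all = TRUE keeps every row and fills the gaps with NA instead.
outer <- merge(activity_toy, sleep_toy, by = c("id", "date"), all = TRUE)
```

The trade-off is sample size: the inner join drops every day that lacks a sleep record.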

daily_log$percent_sleep = 100*((daily_log$totalminutesasleep/daily_log$totaltimeinbed))

daily_id_log = daily_log %>% 
  group_by(id) %>%
  summarise(avg_steps = mean(totalsteps), avg_cal = mean(calories), avg_sleep = mean(totalminutesasleep), avg_sleep_hour = (avg_sleep/60), avg_sleep_percent = mean(percent_sleep))

daily_log$id = as.factor(daily_log$id)

I convert id to a factor so that the program knows each id is a different person; in other words, it is treated as a categorical variable. This makes it possible to look at the distribution of sleep and calories for each person.

daily_id_log = daily_id_log %>% 
  mutate(active_users = case_when(
    avg_steps < 5000 ~ "Sedentary Inactive",
    avg_steps >= 5000 & avg_steps < 9000 ~ "Light Active",
    avg_steps >= 9000 & avg_steps < 12000 ~ "Moderately Active",  # was >= 9001, which left a gap
    avg_steps >= 12000 ~ "Very Active"))                          # was > 12001, which left a gap
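
Hand-written ranges in case_when() can leave gaps at the boundaries, and any value that matches no condition becomes NA. A hedged alternative sketch uses base R's cut() with shared break points, so every value falls into exactly one bin. The step values and the shortened label "Sedentary" are made up for illustration:

```r
avg_steps <- c(4999, 5000, 9000, 9000.5, 12000)

# right = FALSE makes each interval closed on the left: [5000, 9000), etc.
activity <- cut(avg_steps,
                breaks = c(-Inf, 5000, 9000, 12000, Inf),
                labels = c("Sedentary", "Light Active",
                           "Moderately Active", "Very Active"),
                right = FALSE)
```

Because adjacent bins share a break point, boundary values such as 9000 and 12000 always receive a label.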

daily_id_log = daily_id_log %>% 
  mutate(sleep_rating = case_when(
    avg_sleep_hour < 7 ~  "Insufficient Sleep",
    avg_sleep_hour >= 7 & avg_sleep_hour < 9 ~  "Recommended Amount",
    avg_sleep_hour >= 9 ~  "Too Much Sleep"))  # >= so that exactly 9 hours is not left as NA

steps_perc <- daily_id_log %>%
    group_by(active_users) %>%
    summarise(total = n()) %>%
    mutate(totals = sum(total)) %>%
    group_by(active_users) %>%
    summarise(total_percent = total / totals) %>%
    mutate(labels = scales::percent(total_percent))

sleep_perc <- daily_id_log %>%
    group_by(sleep_rating) %>%
    summarise(total = n()) %>%
    mutate(totals = sum(total)) %>%
    group_by(sleep_rating) %>%
    summarise(total_percent = total / totals) %>%
    mutate(labels = scales::percent(total_percent))
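
The share of users in each group can also be computed in one step. A minimal base R sketch, assuming a made-up vector of ratings (ratings is hypothetical, and sprintf() stands in for scales::percent() so the block needs no extra packages):

```r
ratings <- c("Insufficient Sleep", "Recommended Amount",
             "Insufficient Sleep", "Recommended Amount",
             "Recommended Amount")

# table() counts each rating; prop.table() turns counts into proportions.
share <- prop.table(table(ratings))

# Percentage labels for the chart, built from the plain numeric proportions.
labels <- sprintf("%.0f%%", 100 * as.numeric(share))
```

The proportions sum to 1, and the labels follow the alphabetical order of the rating names.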

daily_log = daily_log %>%
  mutate(activity_level = case_when(
    totalsteps < 5000 ~ "Sedentary Active",
    totalsteps >= 5000 & totalsteps < 9000 ~ "Light Active",
    totalsteps >= 9000 & totalsteps < 12000 ~ "Moderately Active",  # was >= 9001: 9000 steps became NA
    totalsteps >= 12000 ~ "Very Active"))                           # was > 12001: 12000 and 12001 became NA

hourly_calories = read.csv("hourlyCalories_merged.csv")
head(hourly_calories)

##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48

hourly_calories$date = as.Date(hourly_calories$ActivityHour, "%m/%d/%Y")
hourly_calories$datetime = as.POSIXct(hourly_calories$ActivityHour,format="%m/%d/%Y %I:%M:%S %p",tz=Sys.timezone())  # %I and %p parse the 12-hour clock with AM/PM
hourly_calories$time_component <- format(hourly_calories$datetime,'%H:%M:%S')
hourly_calories$day = weekdays(as.Date(hourly_calories$date))
hourly_calories$month = month(as.Date(hourly_calories$date))
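
The ActivityHour strings use a 12-hour clock with an AM/PM suffix, so the time format needs %I (hour 01-12) and %p (AM/PM) rather than %H. A small sketch with two made-up timestamps; UTC is an assumption to keep the example reproducible, and %p parsing assumes an English-style locale:

```r
stamps <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I reads the 12-hour value and %p applies the AM/PM suffix.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hours <- format(parsed, "%H")   # 24-hour clock values as text

# The date-only format still works because trailing text is ignored.
dates <- as.Date(stamps, format = "%m/%d/%Y")
```

With %H instead, the PM suffix would be ignored and afternoon hours would be read as morning hours.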

## head(hourly_calories)

Data Visualizations

steps_perc %>%
  ggplot(aes(x="",y=total_percent, fill=active_users))+
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0) + 
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5)) + 
  scale_fill_brewer(palette="Blues")+
  theme_minimal()+
  theme(axis.text.x=element_blank())+ 
  scale_fill_manual(values=c("#a4e9d5",
                             "#6e0d25",
                             "#c1aba6",
                             "#22577a")) +
  labs(title = "User's Activity Level Based on Steps", caption = 'Source: FitBit Fitness Tracker Data') +
  guides(fill = guide_legend(title = "Active User"))

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

## I was trying to get the code below to work, but had no success. 

##    geom_text(aes(y = total_percent/3 + c(0, cumsum(total_percent)[-length(total_percent)]), 
##            label = percent(total_percent/100)), size=5)

We can see that the activity level of the participants is good: over 79% of them are at least lightly active rather than sedentary.

sleep_perc %>%
  ggplot(aes(x="",y=total_percent, fill=sleep_rating))+
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0) + 
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5)) + 
  scale_fill_brewer()+
  theme_minimal()+
  theme(axis.text.x=element_blank())+ 
    scale_fill_manual(values=c(
                             "#a4e9d5",
                             "#6e0d25",
                             "#c1aba6"))+
  labs(title = "User's Sleep Based on Total Amount of Sleeps in Minutes", caption = 'Source: FitBit Fitness Tracker Data')+ 
  guides(fill = guide_legend(title = "Sleep Rating"))

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

This pie chart surprised me more than the previous one. I expected people to be active, since wearing a FitBit usually goes with wanting a healthier lifestyle. What is surprising is how many people are not getting a sufficient amount of sleep. I also found it surprising that, according to (https://www.hopkinsmedicine.org/health/wellness-and-prevention/oversleeping-bad-for-your-health#:~:text=How%20Much%20Sleep%20Is%20Too,an%20underlying%20problem%2C%20Polotsky%20says.), even 8-9 hours can count as oversleeping.

We can now plot calories against total daily steps with a scatter plot.

daily_log %>% 
   group_by(totalsteps, calories) %>% 
  ggplot(aes(x = totalsteps, y = calories, color = calories)) +
  geom_point() +
  scale_color_gradient(low = "blue",high = "red") +
  geom_smooth(color = "beige") + 
  theme(legend.position = c(.8, .3),
        legend.spacing.y = unit(2, "mm"), 
        panel.border = element_rect(colour = "black", fill=NA),
        legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black")) +
  labs(title = 'Calories vs. Total Steps',
       y = 'Calories',
       x = 'Total Steps',
       caption = 'Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

cor.test(daily_log$calories, daily_log$totalsteps)

## 
##  Pearson's product-moment correlation
## 
## data:  daily_log$calories and daily_log$totalsteps
## t = 9.1666, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3285635 0.4890481
## sample estimates:
##       cor 
## 0.4119959

We can see a moderate positive correlation between these two variables (r ≈ 0.41). The correlation may be held down by the wide spread of points in the middle of the step range. The p-value (< 2.2e-16) is below the significance level alpha = 0.05, so the correlation coefficient is statistically significant.
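
The t statistic reported by cor.test() follows directly from the correlation and the degrees of freedom: t = r * sqrt(df) / sqrt(1 - r^2). A quick check using the values printed in the output above:

```r
r  <- 0.4119959   # sample correlation from the cor.test() output
df <- 411         # degrees of freedom (n - 2)

t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Squared correlation: share of variance in calories explained by steps.
r_squared <- r^2
```

The squared correlation is the same quantity as the R^2 of a one-predictor linear regression of calories on total steps.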

reg_daily_steps = lm(daily_log$calories ~ daily_log$totalsteps)
summary(reg_daily_steps)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalsteps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1496.06  -571.36   -20.66   556.96  1875.44 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.752e+03  7.833e+01  22.363   <2e-16 ***
## daily_log$totalsteps 7.561e-02  8.248e-03   9.167   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 696 on 411 degrees of freedom
## Multiple R-squared:  0.1697, Adjusted R-squared:  0.1677 
## F-statistic: 84.03 on 1 and 411 DF,  p-value: < 2.2e-16

This shows that the linear regression above is statistically significant (p < 2.2e-16). However, R^2 (0.1697) and adjusted R^2 (0.1677) are very low: steps alone explain only about 17% of the variation in calories. Including more variables should raise R^2 and adjusted R^2, but adding too many can do more harm than good, because R^2 never decreases when a predictor is added, so the model can overfit and individual coefficients can lose significance.
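
Adjusted R^2 penalizes R^2 for the number of predictors p, using 1 - (1 - R^2)(n - 1)/(n - p - 1). A short sketch that reproduces the adjusted value from the summary above (adj_r2 is a hypothetical helper name):

```r
# Hypothetical helper: adjusted R^2 from R^2, sample size n and predictor count p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# 413 observations and 1 predictor give 411 residual degrees of freedom.
adj_one <- adj_r2(r2 = 0.1697, n = 413, p = 1)
```

With a single predictor the penalty is tiny, which is why the two values in the summary are so close.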

plot(reg_daily_steps)

bptest(reg_daily_steps)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_steps
## BP = 34.703, df = 1, p-value = 3.84e-09

What these R^2 and adjusted R^2 values tell us is that we need more variables in the model to explain more of the variation in calories. Adjusted R^2 is the fairer measure for comparing models, because it takes into account how many independent variables the model uses.

In the Residuals vs Fitted graph, we can see some heteroscedasticity: the spread of the residuals is not constant across the fitted values, so calories burned are harder to predict in some ranges than in others. This indicates that we need more variables in the mix to predict the amount of calories burned. Perhaps we can use the variables that break the distance down by intensity.

The Q-Q plot is relatively fine, although a few data points sit away from the line and could be outliers for some of the participants. Deleting them might help the prediction, but given the scope of this capstone it is best not to delete any data points, as none of them is an extreme outlier.

The Scale-Location graph does not make it clear whether this data has heteroscedasticity. Thankfully we can run the Breusch-Pagan test (thanks to the lmtest library). The p-value (3.84e-09) is less than 0.05, so we reject the null hypothesis of constant variance: there is sufficient evidence of heteroscedasticity in our model.
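
The Breusch-Pagan statistic is compared against a chi-squared distribution, and a small p-value means rejecting constant variance. A sketch that recomputes the p-value from the statistic printed by bptest(reg_daily_steps) above:

```r
bp_stat <- 34.703   # BP statistic from the test output
bp_df   <- 1        # one predictor in the model
alpha   <- 0.05     # significance level used in this report

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# TRUE means the null hypothesis of constant variance is rejected.
heteroscedastic <- p_value < alpha
```

The null hypothesis of this test is homoscedasticity, so only a p-value above alpha would leave the equal-variance assumption standing.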

reg_daily_steps_2 = lm(daily_log$calories ~ daily_log$totalsteps + daily_log$totaltimeinbed)
summary(reg_daily_steps_2)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalsteps + daily_log$totaltimeinbed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1568.30  -553.47   -46.58   540.94  1932.64 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1952.57561  157.71840  12.380   <2e-16 ***
## daily_log$totalsteps        0.07360    0.00835   8.814   <2e-16 ***
## daily_log$totaltimeinbed   -0.40041    0.27309  -1.466    0.143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 695 on 410 degrees of freedom
## Multiple R-squared:  0.1741, Adjusted R-squared:   0.17 
## F-statistic: 43.21 on 2 and 410 DF,  p-value: < 2.2e-16

plot(reg_daily_steps_2)

bptest(reg_daily_steps_2)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_steps_2
## BP = 37.45, df = 2, p-value = 7.375e-09

Now we can bring the activity level (the intensity) into the model.

ggplot(data=daily_log,aes(x=totalsteps, y=calories, color = activity_level)) + 
    geom_point() + 
    geom_smooth(method=lm)

## `geom_smooth()` using formula = 'y ~ x'

The Moderately Active trend line behaves quite differently from the others. I find this extremely interesting, since I would expect that the more intense the workout, the more calories the participant burns. The Very Active line also has a wide confidence band (its huge cone), so its slope is less certain than the others.

reg_daily_steps_3 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance + daily_log$sedentaryactivedistance)
summary(reg_daily_steps_3)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance + 
##     daily_log$moderatelyactivedistance + daily_log$lightactivedistance + 
##     daily_log$sedentaryactivedistance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1372.9  -463.9  -102.7   466.5  2266.7 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         1557.50      79.03  19.707   <2e-16 ***
## daily_log$veryactivedistance         180.99      15.98  11.329   <2e-16 ***
## daily_log$moderatelyactivedistance   -66.00      32.25  -2.046   0.0414 *  
## daily_log$lightactivedistance        164.56      18.11   9.088   <2e-16 ***
## daily_log$sedentaryactivedistance    627.86    3584.54   0.175   0.8610    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 627.8 on 408 degrees of freedom
## Multiple R-squared:  0.3294, Adjusted R-squared:  0.3228 
## F-statistic:  50.1 on 4 and 408 DF,  p-value: < 2.2e-16

plot(reg_daily_steps_3)

bptest(reg_daily_steps_3)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_steps_3
## BP = 35.073, df = 4, p-value = 4.487e-07

We can see a clear increase in R^2 (0.3294) and adjusted R^2 (0.3228), roughly double the values of the earlier models. However, it is still not great. What we can do is add even more variables, such as the intensity minutes, to see if they contribute more.

From what we can see, the model does not have any outliers worth worrying about, as no point falls beyond the Cook’s distance contours (Cook’s distance measures how much influence a point or set of points has on the fitted model).
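
Cook's distance can also be inspected numerically instead of only being read off the plot. A minimal sketch on made-up data with one planted outlier; the 4/n cutoff is a common rule of thumb, not something taken from this analysis:

```r
x <- 1:20
y <- 2 * x + 1
y[20] <- 100          # planted outlier (the line would give 41)

fit <- lm(y ~ x)
d <- cooks.distance(fit)

# Flag points whose influence exceeds the rule-of-thumb cutoff.
flagged <- which(d > 4 / length(x))

# Index of the single most influential observation.
most_influential <- unname(which.max(d))
```

On the real model, the same two lines applied to reg_daily_steps_3 would list the rows worth a closer look.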

The Scale-Location plot helps us check the assumption of equal variance (also called homoscedasticity). The values do seem to sit close together in the 2000-3000 range, but the Breusch-Pagan test above (p = 4.487e-07) still rejects equal variance, so this model is also heteroscedastic.

This model is worth keeping as it helps us understand which intensity is making a great impact towards calories burned while working out.
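
Writing the formula with bare column names and a data argument gives the same coefficients as the daily_log$... form, reads more easily, and lets predict() work on new data. A sketch on a made-up data frame (toy and its values are hypothetical):

```r
toy <- data.frame(totalsteps = c(2000, 5000, 8000, 11000, 14000),
                  calories   = c(1900, 2150, 2350, 2600, 2800))

# Column names are looked up in `data`, so no toy$ prefix is needed.
fit <- lm(calories ~ totalsteps, data = toy)

# Predict calories for a day with 10,000 steps.
pred <- predict(fit, newdata = data.frame(totalsteps = 10000))

# Extra calories per additional step, according to this toy fit.
slope <- unname(coef(fit)["totalsteps"])
```

The coefficient names in summary() also become shorter, for example totalsteps instead of daily_log$totalsteps.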

Now, I will add the intensity-minute variables: veryactiveminutes, fairlyactiveminutes, and sedentaryminutes.

reg_daily_steps_4 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance + daily_log$sedentaryactivedistance + daily_log$veryactiveminutes + daily_log$fairlyactiveminutes + daily_log$sedentaryminutes)

summary(reg_daily_steps_4)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance + 
##     daily_log$moderatelyactivedistance + daily_log$lightactivedistance + 
##     daily_log$sedentaryactivedistance + daily_log$veryactiveminutes + 
##     daily_log$fairlyactiveminutes + daily_log$sedentaryminutes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1017.40  -346.18   -20.55   320.20  2366.32 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         686.0157   133.0297   5.157 3.94e-07 ***
## daily_log$veryactivedistance       -102.4485    24.2561  -4.224 2.97e-05 ***
## daily_log$moderatelyactivedistance -262.4289    86.1158  -3.047  0.00246 ** 
## daily_log$lightactivedistance       202.0325    14.5317  13.903  < 2e-16 ***
## daily_log$sedentaryactivedistance  2792.5046  2801.5681   0.997  0.31947    
## daily_log$veryactiveminutes          17.7613     1.4319  12.404  < 2e-16 ***
## daily_log$fairlyactiveminutes         9.4607     3.9173   2.415  0.01617 *  
## daily_log$sedentaryminutes            0.9369     0.1482   6.323 6.80e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 486.8 on 405 degrees of freedom
## Multiple R-squared:  0.5998, Adjusted R-squared:  0.5929 
## F-statistic: 86.71 on 7 and 405 DF,  p-value: < 2.2e-16

## plot(reg_daily_steps_3)
bptest(reg_daily_steps_4)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_steps_4
## BP = 31.497, df = 7, p-value = 5.034e-05

The R^2 has increased considerably, and the overall F-test p-value is below alpha = 0.05, so this model is worth keeping. Note that the Breusch-Pagan p-value (5.034e-05) is also below 0.05, which suggests that the residual variance is not constant. Another concern is how robust the model is, since the fewer the variables, the easier the model is to interpret. What we can do is take out the sedentary variables, starting with sedentary active distance, which is not significant (p = 0.319).

reg_daily_steps_5 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance  + daily_log$veryactiveminutes + daily_log$fairlyactiveminutes)
summary(reg_daily_steps_5)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance + 
##     daily_log$moderatelyactivedistance + daily_log$lightactivedistance + 
##     daily_log$veryactiveminutes + daily_log$fairlyactiveminutes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1176.5  -364.3   -70.6   375.2  2278.5 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1431.641     65.415  21.886  < 2e-16 ***
## daily_log$veryactivedistance        -95.262     25.375  -3.754 0.000199 ***
## daily_log$moderatelyactivedistance -246.836     89.528  -2.757 0.006095 ** 
## daily_log$lightactivedistance       182.447     14.881  12.261  < 2e-16 ***
## daily_log$veryactiveminutes          17.371      1.498  11.595  < 2e-16 ***
## daily_log$fairlyactiveminutes         8.711      4.078   2.136 0.033261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 509.8 on 407 degrees of freedom
## Multiple R-squared:  0.5588, Adjusted R-squared:  0.5534 
## F-statistic: 103.1 on 5 and 407 DF,  p-value: < 2.2e-16

## plot(reg_daily_steps_3)
bptest(reg_daily_steps_5)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_steps_5
## BP = 18.012, df = 5, p-value = 0.002932

We can now see that the model performs about the same (the adjusted R^2 drops only from 0.5929 to 0.5534) with fewer variables. This model is robust enough to use as a starting point for machine learning techniques. Given the variables available to me, there is no real variable left to test besides sleep and calories. What I would have wanted is more variables, such as age group, sex, diet plans, and other lifestyle variables.
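One hedged way to check whether a smaller model loses much is to compare the two nested fits directly, using adjusted R^2 and a partial F-test. The sketch below uses simulated data (`sim`, `x1`, `x2`, `x3` are hypothetical names) and passes the data frame through the `data` argument of lm, which is an alternative to the `daily_log$` style used in this report.

```r
# Simulated data: x3 is pure noise, so dropping it should cost very little.
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

full    <- lm(y ~ x1 + x2 + x3, data = sim)
reduced <- lm(y ~ x1 + x2, data = sim)

# Adjusted R^2 penalizes extra variables, so it is fairer than plain R^2 here.
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(adj_r2)

# Partial F-test: a large p-value means the dropped terms add little.
comparison <- anova(reduced, full)
print(comparison)
p_value <- comparison[["Pr(>F)"]][2]
```

This comparison only applies when one model's variables are a subset of the other's, as with the two models above.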

I want to see whether burning more calories has any effect on sleep.

daily_log %>% 
   group_by(totalminutesasleep, calories) %>% 
  ggplot(aes(x = totalminutesasleep, y = calories, color = calories)) +
  geom_point() +
  scale_color_gradient(low = "blue",high = "red") +
  geom_smooth(color = "green") + 
  theme(legend.position = c(.8, .3),
        legend.spacing.y = unit(2, "mm"), 
        panel.border = element_rect(colour = "black", fill=NA),
        legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black")) +
  labs(title = 'Calories vs. Total Minutes Asleep',
       y = 'Calories',
       x = 'Total Minutes Asleep',
       caption = 'Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

cor.test(daily_log$calories, daily_log$totalminutesasleep)

## 
##  Pearson's product-moment correlation
## 
## data:  daily_log$calories and daily_log$totalminutesasleep
## t = -0.57854, df = 411, p-value = 0.5632
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12467707  0.06815644
## sample estimates:
##         cor 
## -0.02852571

We can see that the correlation for this pair is about -0.03, which indicates that calories burned has essentially no linear relationship with the minutes of sleep one gets. Perhaps the intensity of the run is the variable that affects it more? The p-value (0.5632) also means that we cannot reject the null hypothesis that the true correlation is zero, so we have no evidence that calories has an effect on the minutes of sleep.
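The pieces of the cor.test output can also be pulled out programmatically instead of being read off the printout. This is a small sketch on simulated, independently generated vectors (hypothetical values, not the FitBit data).

```r
# Hypothetical vectors, generated independently of each other.
set.seed(7)
sleep_minutes <- rnorm(150, mean = 420, sd = 60)
calories      <- rnorm(150, mean = 2300, sd = 500)

result <- cor.test(calories, sleep_minutes)

r       <- unname(result$estimate)  # sample correlation
p_value <- result$p.value           # test of "true correlation is 0"
ci      <- result$conf.int          # 95 percent confidence interval

# An interval that contains 0 tells the same story as a large p-value.
contains_zero <- ci[1] <= 0 && ci[2] >= 0
print(c(r = r, p_value = p_value))
print(contains_zero)
```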

We will now see if we can add both the totalminutesasleep and totaltimeinbed variables to the linear regression.

reg_daily_sleep = lm(daily_log$calories ~ daily_log$totalminutesasleep + daily_log$totaltimeinbed)
summary(reg_daily_sleep)

## 
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalminutesasleep + 
##     daily_log$totaltimeinbed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2258.1  -496.4  -169.6   457.6  2532.4 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  2661.6674   136.2240  19.539  < 2e-16 ***
## daily_log$totalminutesasleep    4.5505     0.8314   5.473 7.71e-08 ***
## daily_log$totaltimeinbed       -4.7376     0.7741  -6.120 2.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 731.7 on 410 degrees of freedom
## Multiple R-squared:  0.08445,    Adjusted R-squared:  0.07999 
## F-statistic: 18.91 on 2 and 410 DF,  p-value: 1.395e-08

plot(reg_daily_sleep)

bptest(reg_daily_sleep)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_daily_sleep
## BP = 5.4609, df = 2, p-value = 0.06519

Since the p-value is greater than 0.05, we cannot reject the null hypothesis of equal variance (homoscedasticity) in the residuals of the linear regression model. Note that this test is not definitive, and some degree of heteroscedasticity may go undetected by it.
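To make the test less of a black box, here is a sketch of the studentized (Koenker) form of the Breusch-Pagan statistic computed in base R: regress the squared residuals on the same predictors and multiply the R^2 of that auxiliary regression by n. The data are simulated with noise that grows with `x1` (all names here are hypothetical), and this is an illustration under those assumptions rather than a replacement for bptest.

```r
# Simulated data where the noise standard deviation grows with x1.
set.seed(3)
n  <- 250
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 2 * x1 - x2 + rnorm(n, sd = 1 + 0.4 * x1)
fit <- lm(y ~ x1 + x2)

# Auxiliary regression of the squared residuals on the same predictors.
sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df   <- 2                            # number of predictors in the model
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value here would argue against equal variance, which is the opposite situation from the sleep model above.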

cor(daily_log$totaltimeinbed, daily_log$totalminutesasleep)

## [1] 0.9304575

We can see that the correlation between these two variables is very high. This makes sense: most participants seem to fall asleep soon after going to bed, while a small number of them stay awake a little longer once in bed.

sleep_vs_bed = ggplot(data =daily_log, aes(totalminutesasleep,totaltimeinbed))

sleep_vs_bed + geom_point(aes(colour = factor(id))) + 
  geom_smooth() + 
  labs(title = "Total minutes asleep vs Total time in bed", colour = "IDs", caption = 'Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

sleep_box = ggplot(daily_log, aes(totalminutesasleep,totaltimeinbed, fill = factor(id))) +
  geom_boxplot() +
  labs(title = "Box Plot of Participants' Sleep in Minutes", fill = "IDs", caption = 'Source: FitBit Fitness Tracker Data')

sleep_box

## ggplotly(sleep_box) This works, but I cannot find a way to limit the bounds or to get nicer-looking boxplots that one can interact with

Although the R^2 of this model is very low, its p-value allows us to reject the null hypothesis. However, these two variables are highly correlated with each other. If the correlation were around 0.5-0.75, we could include both of them in the same model, but given the high correlation, it is best not to include them together.

According to the correlation test above, there is a tiny negative correlation between calories and the amount of sleep one gets in this sample. It is so small that we can simply say that there is no correlation.

While the regression is statistically significant, we cannot continue along this line, as there is a small R^2 and a high correlation between the predictors (we can see this in both the ggplot graph and the cor function), even when one adds both the time spent in bed and the time asleep. While we could add more variables, as mentioned before, we must keep the model flexible and robust.
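As a rough illustration of why a correlation this high is a problem, the variance inflation factor (VIF) for a model with exactly two predictors is 1 / (1 - r^2), where r is their correlation. The sketch below plugs in the correlation reported above. The second part uses simulated `asleep` and `in_bed` vectors (hypothetical names) to show the general form, where the R^2 comes from regressing one predictor on the other. The cut-off of about 5 is a common rule of thumb, not something derived in this analysis.

```r
# Correlation between totaltimeinbed and totalminutesasleep reported above.
r <- 0.9304575
vif <- 1 / (1 - r^2)
print(round(vif, 2))   # 7.45

# General form on simulated data: regress one predictor on the other.
set.seed(11)
asleep <- rnorm(200, mean = 420, sd = 60)
in_bed <- asleep + rnorm(200, mean = 40, sd = 20)
vif_sim <- 1 / (1 - summary(lm(in_bed ~ asleep))$r.squared)
print(round(vif_sim, 2))
```

A VIF near 7.45 means the variance of each coefficient is inflated roughly sevenfold compared with uncorrelated predictors, which fits the decision not to keep both variables.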

daily_log %>% 
   group_by(totalminutesasleep, totalsteps) %>% 
  ggplot(aes(x = totalminutesasleep, y = totalsteps, color = calories)) +
  geom_point() +
  geom_smooth(color = "green") + 
  theme(legend.position = c(.8, .3),
        legend.spacing.y = unit(2, "mm"), 
        panel.border = element_rect(colour = "black", fill=NA),
        legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black")) +
  labs(title = 'Total Steps vs. Total Minutes Asleep',
       y = 'Total Steps',
       x = 'Total Minutes Asleep',
       caption = 'Source: FitBit Fitness Tracker Data')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

cor(daily_log$totaldistance, daily_log$totalminutesasleep)

## [1] -0.1721427

We can see that there is a negative correlation between the totaldistance and totalminutesasleep variables. At about -0.17 it is weak, which indicates that there is little correlation between these variables.

It would be best to test more of these variables independently, that is, to see whether there is a statistical difference among them.

ANOVA Testing in each participant’s sleep and calories

Now we run an ANOVA test to find out whether the participants' means are equal, looking at calories and sleep separately. We do this to see whether each person is a unique case or whether all participants are relatively the same in terms of level.

plot_ly(daily_log, x =  ~id, y =  ~calories, 
        type = 'box',
        color = ~id)

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Here we have a massive difference between the participants in the study. It is amazing how different each participant's caloric output is. It is also concerning that we have a different number of samples for each participant, which makes this study a bit uneven. Another thing to note: is the plot enough to provide evidence against the idea that all the participants' caloric outputs are equal? Not really, so we test it with the aov function in R.

aov_calorie_cont = aov(daily_log$calories~daily_log$id)
summary(aov_calorie_cont)

##               Df    Sum Sq Mean Sq F value Pr(>F)    
## daily_log$id  23 175773479 7642325   46.45 <2e-16 ***
## Residuals    389  64008686  164547                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have a large F-value of 46.45 (the F-value tells us whether the variation among the sample means dominates the variation within the groups) and a very small p-value (p < 2e-16). At alpha = 0.05, we can therefore reject the null hypothesis that all participants have the same mean calories: there is a significant relationship between calories and the participants.
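The F-value is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand and compares it with aov, using simulated groups of unequal size (`group`, `shift`, and `value` are hypothetical stand-ins for the participant ids and calories).

```r
# Simulated groups with unequal sizes, loosely mimicking uneven participants.
set.seed(5)
group <- factor(rep(c("p1", "p2", "p3", "p4"), times = c(30, 25, 20, 10)))
shift <- c(p1 = 0, p2 = 200, p3 = 400, p4 = 800)
value <- unname(2000 + shift[as.character(group)] + rnorm(length(group), sd = 300))

grand_mean  <- mean(value)
group_means <- tapply(value, group, mean)
group_sizes <- tapply(value, group, length)

# Between-group and within-group sums of squares.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((value - as.vector(group_means[as.character(group)]))^2)

df_between <- nlevels(group) - 1
df_within  <- length(value) - nlevels(group)

f_manual <- (ss_between / df_between) / (ss_within / df_within)
f_aov    <- summary(aov(value ~ group))[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov))
```

In the table above, the degrees of freedom of 23 and 389 correspond in the same way to 24 participants and 413 daily records.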

plot_ly(daily_log, x =  ~id, y =  ~totalminutesasleep, 
        type = 'box',
        color = ~id)

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

aov_sleep_cont = aov(daily_log$totalminutesasleep~daily_log$id)
summary(aov_sleep_cont)

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## daily_log$id  23 2204023   95827   10.45 <2e-16 ***
## Residuals    389 3566228    9168                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model has a lower F-value than the calories model above (the variation between participants is smaller relative to the variation within them), so we might expect a correspondingly higher p-value. However, the reported p-value is the same as before (p < 2e-16), which tells us that there is a statistically significant difference between the participants' means.

sleep_fig = plot_ly(data = daily_log, x =  ~totaldistance, y = ~totaltimeinbed, color = ~id)

sleep_fig

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

More data visualization in terms of participants.

Here we have a better look at the distribution of each participant's activity and time in bed. It seems that one participant has fewer than 3 entries for their sleep. What surprises me the most is how widely spread the values for participants X-2279 and X-0313 are.

t = plot_ly(data = daily_log, x =  ~totalminutesasleep, y = ~totaltimeinbed, color = ~factor(id), type = "box")
t = t %>% layout(boxmode = "group", xaxis = list(range = c(220, 600)), yaxis = list(range = c(300, 600)))

t

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

While I do not know why the boxes are small, we can see from the plot that the distribution for each participant is different from the others. Based on observation (and not on calculations), we can safely assume that each participant's sleep pattern is different. We can also see that participant X-3714 tends to be awake while in bed rather than sleeping.

daily_log %>%  group_by(id) %>% summarise(mean_minutes_asleep=mean(totalminutesasleep)) %>% 
ggplot(aes(x=mean_minutes_asleep, y=reorder(id, (mean_minutes_asleep)))) +
  geom_col(fill = "#00abff", width=0.12) +
  geom_point(color = "#00abff", size=2.5) +
  geom_rect(aes(xmin=420, xmax=540, ymin=-Inf, ymax=Inf), fill="#59C96D", alpha=0.02) +
  theme_light() +
  theme(text = element_text(size=20)) +
  labs(y = "FitBit IDs",
       x = "Minutes Asleep",
       title = "Average Daily Sleep by ID") +
  annotate(geom="text", x=480, y=2, label="Ideal sleep \n duration", size=5)

Here we have a graph that displays the participants' FitBit IDs against their average minutes of sleep.

daily_log %>%  group_by(id) %>% summarise(mean_steps=mean(totalsteps)) %>% 
ggplot(aes(x=mean_steps, y=reorder(id, (mean_steps)))) +
  geom_col(fill = "#00abff", width=0.12) +
  geom_point(color = "#00abff", size=2.5) +
  geom_rect(aes(xmin=420, xmax=540, ymin=-Inf, ymax=Inf), fill="#59C96D", alpha=0.02) +
  theme_light() +
  theme(text = element_text(size=20)) +
  labs(y = "FitBit IDs",
       x = "Steps",
       title = "Figure 5: Average Daily Steps by ID") +
  annotate(geom="text", x=6000, y=2, label="Ideal Steps \n duration", size=5)

## Found this function thanks to https://stackoverflow.com/questions/34093169/horizontal-vertical-line-in-plotly
hline <- function(y = 0, color = "#6e0d25") {
  list(
    type = "line", 
    x0 = 0, 
    x1 = 1, 
    xref = "paper",
    y0 = y, 
    y1 = y, 
    line = list(color = color)
  )
}
daily_log$weekday = weekdays(daily_log$date)
daily_log$month = month(daily_log$date)

daily_log$weekday <- ordered(daily_log$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday"))


weekday_log = daily_log %>%
  group_by(weekday) %>%
  summarize(avg_daily_steps = mean(totalsteps), avg_sleep_mins = mean(totalminutesasleep), avg_percent_sleep = mean(percent_sleep))

sleep_week_fig = plot_ly(weekday_log, x = ~weekday, y = ~(avg_sleep_mins), 
                   type = 'bar', 
                   name = "Average Amount of Sleep (Minutes)", 
                   text = ~round((avg_sleep_mins), digits = 2))

sleep_week_fig = sleep_week_fig %>% layout(yaxis = list(title = "Average Sleep in Minutes"), barmode = 'group',  
                               xaxis = list(title = ""))

sleep_week_fig = sleep_week_fig %>% layout(shapes = list(hline(420)))

sleep_week_fig

What we can see is that, as a collective, the participants are not getting the recommended amount of sleep. This is perhaps due to their schedules and their work-life balance. It would have been great to see what types of jobs they were working, or to have more information about their age and sex groups. The red line represents 420 minutes, or 7 hours, of sleep, the lower end of the ideal range.
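One caveat about the weekday grouping behind this chart: weekdays() returns day names in the session's locale, so the English levels passed to ordered() only match on an English-locale machine. A hedged alternative, sketched here on made-up dates and sleep values, is to derive the order from the ISO weekday number given by format(date, "%u"), which does not depend on the locale.

```r
# Hypothetical two-week span starting on a Monday, with made-up sleep minutes.
dates <- as.Date("2016-04-11") + 0:13
sleep <- c(400, 410, 395, 420, 380, 450, 460,
           405, 415, 390, 425, 385, 455, 465)

# %u is the ISO weekday number: 1 = Monday, ..., 7 = Sunday.
day_num   <- as.integer(format(dates, "%u"))
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday   <- factor(day_names[day_num], levels = day_names, ordered = TRUE)

# Average sleep per weekday, already in Monday-to-Sunday order.
avg_sleep <- tapply(sleep, weekday, mean)
print(avg_sleep)
```

With this approach the labels are assigned explicitly, so the bars keep their Monday-to-Sunday order regardless of the system language.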

Now we can look at the average number of steps for each day of the week.

step_week_fig = plot_ly(weekday_log, x = ~weekday, y = ~(avg_daily_steps), 
                   type = 'bar', 
                   name = "Average Amount of Steps", 
                   text = ~round((avg_daily_steps), digits = 2))

step_week_fig = step_week_fig %>% layout(yaxis = list(title = "Average Steps"), barmode = 'group',
                               xaxis = list(title = ""))

step_week_fig = step_week_fig %>% layout(shapes = list(hline(6000)))

step_week_fig

I am surprised that, on average, a lot of people are getting in their steps each day! What surprises me the most is that Sunday has the fewest steps on average. This is probably because many participants treated Sunday as their true rest day and enjoyed staying indoors, while Saturday is often recognized as a day to go out and take care of personal matters (in this case, walking).

Capstone Data Analysis for FitBit

Capstone Project for Google’s Data Analytics Certification

About the data:

Background Information about the data: