Bellabeat Capstone Project with R


Introduction


This project is the investigation of the data collected by Bellabeat, a high-tech manufacturer of health-focused products for women. The dataset collected from thirty-three eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The main goal is to gain insight into how consumers are using the company smart devices.


Loading and Cleaning Datasets


Loading Libraries


First, load all the libraries for data analysis and visualization.

library("tidyverse")
library("lubridate")
library("viridis")
library("ggpubr")
library("ggrepel")
library("GGally")



Loading Datasets


Then, load the FitBit datasets. These datasets include 18 csv files containing data about physical activities, heart rate, sleep track, METS, calories burned, and weight information recorded on daily and/or hourly basis, some are even monitored in minute and/or seconds format.

dailyActivity_merged <- read.csv("C:/Users/Family/Downloads/dailyActivity_merged.csv")
dailyCalories_merged <- read.csv("C:/Users/Family/Downloads/dailyCalories_merged.csv")
dailyIntensities_merged <- read.csv("C:/Users/Family/Downloads/dailyIntensities_merged.csv")
dailySteps_merged <- read.csv("C:/Users/Family/Downloads/dailySteps_merged.csv")
heartrate_seconds_merged <- read.csv("C:/Users/Family/Downloads/heartrate_seconds_merged.csv/heartrate_seconds_merged.csv")
hourlyCalories_merged <- read.csv("C:/Users/Family/Downloads/hourlyCalories_merged.csv")
hourlyIntensities_merged <- read.csv("C:/Users/Family/Downloads/hourlyIntensities_merged.csv")
hourlySteps_merged <- read.csv("C:/Users/Family/Downloads/hourlySteps_merged.csv")
minuteCaloriesNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteCaloriesNarrow_merged.csv/minuteCaloriesNarrow_merged.csv")
minuteCaloriesWide_merged <- read.csv("C:/Users/Family/Downloads/minuteCaloriesWide_merged.csv~/minuteCaloriesWide_merged.csv", header=FALSE)
minuteIntensitiesNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteIntensitiesNarrow_merged.csv/minuteIntensitiesNarrow_merged.csv")
minuteIntensitiesWide_merged <- read.csv("C:/Users/Family/Downloads/minuteIntensitiesWide_merged.csv/minuteIntensitiesWide_merged.csv", header=FALSE)
minuteMETsNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteMETsNarrow_merged.csv/minuteMETsNarrow_merged.csv")
minuteSleep_merged <- read.csv("C:/Users/Family/Downloads/minuteSleep_merged.csv/minuteSleep_merged.csv")
minuteStepsNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteStepsNarrow_merged.csv/minuteStepsNarrow_merged.csv")
minuteStepsWide_merged <- read.csv("C:/Users/Family/Downloads/minuteStepsWide_merged.csv/minuteStepsWide_merged.csv", header=FALSE)
sleepDay_merged <- read.csv("C:/Users/Family/Downloads/sleepDay_merged.csv")
weightLogInfo_merged <- read.csv("C:/Users/Family/Downloads/weightLogInfo_merged.csv")



Check for Missing Values


sum(is.na(dailyActivity_merged))
sum(is.na(dailyCalories_merged))
sum(is.na(dailyIntensities_merged))
sum(is.na(dailySteps_merged))
sum(is.na(heartrate_seconds_merged))
sum(is.na(hourlyCalories_merged))
sum(is.na(hourlyIntensities_merged))
sum(is.na(hourlySteps_merged))
sum(is.na(minuteCaloriesNarrow_merged))
sum(is.na(minuteCaloriesWide_merged))
sum(is.na(minuteIntensitiesNarrow_merged))
sum(is.na(minuteIntensitiesWide_merged))
sum(is.na(minuteMETsNarrow_merged))
sum(is.na(minuteSleep_merged))
sum(is.na(minuteStepsNarrow_merged))
sum(is.na(minuteStepsWide_merged))
sum(is.na(sleepDay_merged))
sum(is.na(weightLogInfo_merged))
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 65



Only weightLogInfo_merged data frame has 65 missing values while the other data frames have none of it.


Check and Remove Duplication


sum(duplicated(dailyActivity_merged))
sum(duplicated(dailyCalories_merged))
sum(duplicated(dailyIntensities_merged))
sum(duplicated(dailySteps_merged))
sum(duplicated(heartrate_seconds_merged))
sum(duplicated(hourlyCalories_merged))
sum(duplicated(hourlyIntensities_merged))
sum(duplicated(hourlySteps_merged))
sum(duplicated(minuteCaloriesNarrow_merged))
sum(duplicated(minuteCaloriesWide_merged))
sum(duplicated(minuteIntensitiesNarrow_merged))
sum(duplicated(minuteIntensitiesWide_merged))
sum(duplicated(minuteMETsNarrow_merged))
sum(duplicated(minuteSleep_merged))
sum(duplicated(minuteStepsNarrow_merged))
sum(duplicated(minuteStepsWide_merged))
sum(duplicated(sleepDay_merged))
sum(duplicated(weightLogInfo_merged))
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 543
## [1] 0
## [1] 0
## [1] 3
## [1] 0



There’re some duplication on sleepDay_merged and minuteSleep_merged data frames. So let’s remove them all.

sleepDay_merged <-
  sleepDay_merged[!duplicated(sleepDay_merged), ]

minuteSleep_merged <- 
  minuteSleep_merged[!duplicated(minuteSleep_merged), ]



Exploring Datasets


First, let’s see the number of users (Id) in each data frame.

n_distinct(dailyActivity_merged$Id)
n_distinct(dailyCalories_merged$Id)
n_distinct(dailyIntensities_merged$Id)
n_distinct(dailySteps_merged$Id)
n_distinct(heartrate_seconds_merged$Id)
n_distinct(hourlyCalories_merged$Id)
n_distinct(hourlyIntensities_merged$Id)
n_distinct(hourlySteps_merged$Id)
n_distinct(minuteCaloriesNarrow_merged$Id)
n_distinct(minuteCaloriesWide_merged$Id)
n_distinct(minuteIntensitiesNarrow_merged$Id)
n_distinct(minuteIntensitiesWide_merged$Id)
n_distinct(minuteMETsNarrow_merged$Id)
n_distinct(minuteSleep_merged$Id)
n_distinct(minuteStepsNarrow_merged$Id)
n_distinct(minuteStepsWide_merged$Id)
n_distinct(sleepDay_merged$Id)
n_distinct(weightLogInfo_merged$Id)
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 14
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 0
## [1] 33
## [1] 0
## [1] 33
## [1] 24
## [1] 33
## [1] 0
## [1] 24
## [1] 8



Wide_merged data frames have 0 number of Id probably because they are modified version of Narrow_merged data frames with additional column. All the columns are renamed to capital V with tag number, Id column are rename to V1. In that column ,the name “Id” itself are underneath the new header V1


n_distinct(minuteCaloriesWide_merged$V1)
n_distinct(minuteIntensitiesWide_merged$V1)
n_distinct(minuteStepsWide_merged$V1)
## [1] 34
## [1] 34
## [1] 34



So each Wide merged table indicate 33 participants (not included second column header).

  • There’re only 8 users(Id) in weightLogInfo_merged
  • sleepDay_merged and minuteSleep_merged both have 24 users participated.
  • 14 users participate in heartrate_seconds_merged.
  • All the other data frames have 33 users.



Exploring Daily-Based Data Frames


Now let’s check the consistency of all daily-based data frame to see whether or not each users have the same number of days participated in every daily-based data frame. If the number are the same, we can merge all daily-base data frame into one.

dailyActivity_merged %>%                            
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityDate))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366    31
##  2 1624580081    31
##  3 1644430081    30
##  4 1844505072    31
##  5 1927972279    31
##  6 2022484408    31
##  7 2026352035    31
##  8 2320127002    31
##  9 2347167796    18
## 10 2873212765    31
## # ℹ 23 more rows
dailyCalories_merged %>%                              
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366    31
##  2 1624580081    31
##  3 1644430081    30
##  4 1844505072    31
##  5 1927972279    31
##  6 2022484408    31
##  7 2026352035    31
##  8 2320127002    31
##  9 2347167796    18
## 10 2873212765    31
## # ℹ 23 more rows
dailyIntensities_merged %>%
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366    31
##  2 1624580081    31
##  3 1644430081    30
##  4 1844505072    31
##  5 1927972279    31
##  6 2022484408    31
##  7 2026352035    31
##  8 2320127002    31
##  9 2347167796    18
## 10 2873212765    31
## # ℹ 23 more rows
dailySteps_merged %>%
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366    31
##  2 1624580081    31
##  3 1644430081    30
##  4 1844505072    31
##  5 1927972279    31
##  6 2022484408    31
##  7 2026352035    31
##  8 2320127002    31
##  9 2347167796    18
## 10 2873212765    31
## # ℹ 23 more rows



This turns out to be number of day recorded for each user are all the same in all these four data frame. So we can combine them into one but it seem to be that only the data frame dailyActivity_merged already have all the data the other three data frame have. So using only dailyActivity_merged would be enough.

Exploring Hourly-Based Data Frames


Applying the same process with all hourly-based table. Check the number of time recorded for each users.

hourlyCalories_merged %>% 
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366   717
##  2 1624580081   736
##  3 1644430081   708
##  4 1844505072   731
##  5 1927972279   736
##  6 2022484408   736
##  7 2026352035   736
##  8 2320127002   735
##  9 2347167796   414
## 10 2873212765   736
## # ℹ 23 more rows
hourlyIntensities_merged %>%
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366   717
##  2 1624580081   736
##  3 1644430081   708
##  4 1844505072   731
##  5 1927972279   736
##  6 2022484408   736
##  7 2026352035   736
##  8 2320127002   735
##  9 2347167796   414
## 10 2873212765   736
## # ℹ 23 more rows
hourlySteps_merged %>%
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366   717
##  2 1624580081   736
##  3 1644430081   708
##  4 1844505072   731
##  5 1927972279   736
##  6 2022484408   736
##  7 2026352035   736
##  8 2320127002   735
##  9 2347167796   414
## 10 2873212765   736
## # ℹ 23 more rows



All these three tables have the same set of Id and each Id also has the same number of recorded hours.

Preparing Data Frames for further Analysis


Organize Recorded Time in each Data Frame


Before processing daily-based and hourly-based data frames. Let’s first prepare some other data frames that will be used individually.

We start by organizing time recorded in heartrate_seconds_merged table.

heartrate_seconds_merged$Time <- parse_date_time(
  heartrate_seconds_merged$Time, 
  "%m/%d/%y %I:%M:%S %p"
)



Then extract date as “ActivityDay” using date()

heartrate_seconds_merged <- heartrate_seconds_merged %>%
  mutate(ActivityDay = date(Time))



Check days of participation for each users.

heartrate_seconds_merged %>%                              
  group_by(Id) %>%
  summarise(count = n_distinct(ActivityDay))
## # A tibble: 14 × 2
##            Id count
##         <dbl> <int>
##  1 2022484408    31
##  2 2026352035     4
##  3 2347167796    18
##  4 4020332650    16
##  5 4388161847    30
##  6 4558609924    31
##  7 5553957443    31
##  8 5577150313    28
##  9 6117666160    23
## 10 6775888955    18
## 11 6962181067    31
## 12 7007744171    24
## 13 8792009665    18
## 14 8877689391    31



Only 14 participants with days of participation vary from 4 to 31


Applying the same process with minuteSleep_merged table.

minuteSleep_merged$date <- parse_date_time(
  minuteSleep_merged$date, 
  "%m/%d/%y %I:%M:%S %p"
)



Organize time recorded in minuteMETsNarrow_merged table from character class into date time class.

minuteMETsNarrow_merged$ActivityMinute <- 
  parse_date_time(minuteMETsNarrow_merged$ActivityMinute, 
    "%m/%d/%y %I:%M:%S %p"
  ) 



And also for dailyActivity_merged

dailyActivity_merged$ActivityDate <- 
  mdy(dailyActivity_merged$ActivityDate)



METs value need to be divided by 10, and then add more columns with extracted date and hour from ActivityMinute. The data from this minuteMETsNarrow_merged table will later be incorporated into daily and hourly based tables.

minuteMETsNarrow_merged <- 
  mutate(minuteMETsNarrow_merged, METs = METs/10,
         date = date(ActivityMinute),
         ActivityHour = floor_date(ActivityMinute, "hour"))



Creating METs Daily Data Frame


For daily METs value, we use average average METs value calculated by mean function.

METs_daily <- minuteMETsNarrow_merged %>% 
  group_by(Id,date) %>% 
  summarize(METs = mean(METs),
            .groups='drop')



This table will be combined with dailyActivity_merged, but the number of day in this table contain 6 row less than dailyActivity_merged. A little bit of data will be lost, but the combination will give bigger and more complex picture of how all these data related.

Activity_METs_daily <-
  inner_join(dailyActivity_merged, METs_daily,
             by = c("Id","ActivityDate" = "date"))



Besides using this table for creating box plot, We would also use it for creating bar chart.

Add weekday colomn into this table.

Activity_METs_daily <- Activity_METs_daily %>% 
  mutate(Weekday = wday(ActivityDate, label = T, 
                        week_start = 1))



Prepare Sleep Data Frame Before Merging


We incorporate sleepDay_merged table into daily-based tables because we want to see how sleeping time can have some effect on other values such as calories burned, METs, etc.

sleepDay_merged %>%          
  group_by(Id) %>%
  summarise(count = n_distinct(SleepDay))
## # A tibble: 24 × 2
##            Id count
##         <dbl> <int>
##  1 1503960366    25
##  2 1644430081     4
##  3 1844505072     3
##  4 1927972279     5
##  5 2026352035    28
##  6 2320127002     1
##  7 2347167796    15
##  8 3977333714    28
##  9 4020332650     8
## 10 4319703577    26
## # ℹ 14 more rows



There’re only 24 users with only 410 rows in SleepDay_merged table compared to 33 users with 940 rows on other daily-based tables. To combine them together we would loose significant amount of data. So we only use this combination only to see the effect of sleeping on other values.

sleepDay_merged <- sleepDay_merged %>% 
  mutate(DateTime = mdy_hms(SleepDay))



Merging and Creating Data Frame


Activity_Sleep_daily <-
  inner_join(dailyActivity_merged, sleepDay_merged,
             by = c("Id","ActivityDate" = "DateTime"))



Adding sleep efficiency column, the value is the percentage of total minutes asleep divided by total time in bed. Normal sleep efficiency is considered to be 80% or greater.

Activity_Sleep_daily <- Activity_Sleep_daily %>% 
  mutate(
  Sleep_Efficiency = ((TotalMinutesAsleep/TotalTimeInBed)*100)
)



Then combine Activity_Sleep_daily table with METs_daily to be SMAd (Sleep_METs_Activity_daily)

SMAd <-
  inner_join(Activity_Sleep_daily, METs_daily,
             by = c("Id","ActivityDate" = "date"))



Add weekday column to SMAd

SMAd <- SMAd %>% 
  mutate(Weekday = wday(ActivityDate, label = T, 
                        week_start = 1))



SMAd table will be use for almost every daily based plot

Create data frame Activity_METs_daily_long for a chart that display each activity minutes into box plot

Activity_METs_daily_long <- Activity_METs_daily %>% 
  pivot_longer(cols = c(
    'VeryActiveMinutes',
    'FairlyActiveMinutes',
    'LightlyActiveMinutes',
    'SedentaryMinutes'
  ),
  names_to='ActivityMinutes',
  values_to='Minutes'
  )



It’s done for daily-based tables. Now, let’s try to join all hourly-based tables together.

list_df <- list(hourlyCalories_merged,
                hourlyIntensities_merged,
                hourlySteps_merged)

hourly_outer_join <- list_df %>% 
  reduce(full_join, by= c('Id','ActivityHour'))



Change time format from character to datetime

hourly_outer_join$ActivityHour <- parse_date_time(
  hourly_outer_join$ActivityHour, 
  "%m/%d/%y %I:%M:%S %p"
) 



Add more columns from extracting ActivityHour into time, date, day, and weekday from datetime

hourly_outer_join <- hourly_outer_join %>% 
  mutate(Weekday = wday(ActivityHour, label = T, week_start = 1), 
             date = date(ActivityHour),
             Day = day(ActivityHour),
             Time = format(ActivityHour, "%k")
)



Create hourly_METs data frame for merging with hourly-based tables.

hourly_METs <- minuteMETsNarrow_merged %>% 
  group_by(Id,ActivityHour) %>% 
  summarize(METs = mean(METs),
            .groups='drop')



Combining hourly_METs table with other hourly-based table called hourly_outer_joint_1

hourly_outer_join_1 <-
  inner_join(hourly_outer_join, hourly_METs,
             by = c("Id","ActivityHour"))



Create data frame for labeling pie chart from the table minuteIntensitiesNarrow_merged. This label indicated the percentage of each intensity value based on time spending for each users. We call this data frame df_1

df_1 <- minuteIntensitiesNarrow_merged %>% 
  group_by(Id) %>% 
  mutate(countId= n()) %>% 
  ungroup %>% 
  group_by(Id,countId,Intensity) %>% 
  summarise(count=n(),
            .groups='drop' ) %>%
  group_by(Id) %>% 
  mutate(cumcount=cumsum(count),
         pos=cumsum(count)-count/2,
         per=paste0(round(100*count/countId,2),'%')) %>% 
  ungroup



And for the data frame minuteSleep_merged will be used in plotting pie chart, we also need another data frame called df_2 for labeling each segment as well.

df_2 <- minuteSleep_merged %>% 
  group_by(Id) %>% 
  mutate(countId= n()) %>% 
  ungroup %>% 
  group_by(Id,countId,value) %>% 
  summarise(count=n(),
            .groups='drop' ) %>%
  group_by(Id) %>% 
  mutate(cumcount=cumsum(count),
         pos=cumsum(count)-count/2,
         per=paste0(round(100*count/countId,2),'%')) %>% 
  ungroup



In this project, all 33 participants will be categorized based on their participation and time spending on intense activities. To make it easier each Id will be replaced by user number from 01 to 33 by creating another column that tags each Id with simplified number.

thirtythree <- c("01", "02", "03", "04", "05", "06", 
                                  "07", "08", "09", 10:33)

Users <- paste("User", thirtythree, sep = " ")

Id <- unlist(distinct(dailyActivity_merged, Id))

users_id <- tibble(Users, Id)



Then attach this newly created column to dailyActivity_merged table. This modified data frame will be used to create two more columns that categorized users based on days of participation and level of intensity.


Participation level of each user will be based on number of days recorded in this table. Activity level of each user will be based on the median value of “very active minutes” recorded in dailyActivity_merged table. which means Users with high activity level, or spend more minutes each day with high intensity activity are likely to do exercise regularly.

dailyActivity_merged <- dailyActivity_merged %>% 
  left_join(users_id, by = "Id")



Create a data frame with 2 column called ALPL, which stands for Activity Level and Participation Level.
This data frame can be attached to any data frames in this study to put tag on each user to see how they spend their time on intense activity, and how often they participated in the program.

ALPL <- dailyActivity_merged %>% 
  group_by(Users, Id) %>%  
  summarize(
    MedVeryActive = median(VeryActiveMinutes),
    UsageRecords = n_distinct(ActivityDate),
    .groups='drop'
  ) %>% 
  mutate(
    ActivityLevel = case_when(
      MedVeryActive < 4 ~ "Low",
      MedVeryActive <= 32 ~ "Med",
      MedVeryActive < 211 ~ "High"
    ),
    ParticipationLevel = case_when(
      UsageRecords < 14 ~ "Low Usage",
      UsageRecords < 21 ~ "Moderate Usage",
      UsageRecords < 31 ~ "High Usage",
      UsageRecords == 31 ~ "Daily Usage",
    ),
    ActivityLevel = factor(
      ActivityLevel,
      levels=c('Low', 'Med', 'High')
    ),
    ParticipationLevel = factor(
      ParticipationLevel,
      levels=c('Low Usage','Moderate Usage',
               'High Usage','Daily Usage')
    )
  ) %>% 
  subset(select = -c(MedVeryActive,UsageRecords))  



Our activity level is based the median value of very active minutes each user spent in each day, which can tell how much exercise they do on regular basis. We categorize activity level into “High” for users who have median value of time over 32 minutes each day spending on intense activity, “Med” for users with median value from over 4 minutes to 32 spending on intense activity, and “Low” for users who spent 4 minutes or less.


For participation level, we categorize users based on number of days, from dailActivity_merged table, each user has their data recorded in this table. We have “Daily Usage” for users who have their data recorded everyday from beginning to the end to program, “High Usage” for users who participate from 21 to 30 days, “Moderate Usage” for 14 to 20 days, and “Low Usage” for users who participate less than 14 days.


Attach this ALPL data frame to all the data frames we need to analyze on participation and intensity. So we can see how intensity level and participation level of user can tell about the different in other outcome.



hourly_outer_join_1_ALPL <- hourly_outer_join_1 %>%
  left_join(ALPL, by = "Id")

Activity_METs_daily_ALPL <- Activity_METs_daily %>% 
  left_join(ALPL, by = "Id")

SMAd_ALPL <- SMAd %>% 
  left_join(ALPL, by = "Id")

minuteIntensitiesNarrow_merged_ALPL <- 
  minuteIntensitiesNarrow_merged %>% 
  left_join(ALPL, by = "Id")

minuteSleep_merged_ALPL <- minuteSleep_merged %>%
  left_join(ALPL, by = "Id")

df_1_ALPL <- df_1 %>% 
  left_join(ALPL, by = "Id")

df_2_ALPL <- df_2 %>% 
  left_join(ALPL, by = "Id")

Activity_METs_daily_long_ALPL <- Activity_METs_daily_long %>% 
  left_join(ALPL, by = "Id")

heartrate_seconds_merged_ALPL <- heartrate_seconds_merged %>% 
  left_join(ALPL, by = "Id") %>% 
  mutate(Weekday = wday(Time, 
                        label = T, week_start = 1), 
         Hour = hour(Time))

weightLogInfo_merged_ALPL <- weightLogInfo_merged %>% 
  left_join(ALPL, by = "Id")



Visualization and Analysis



Intensity Distribution

By User, Activity Level, Participation Level


cp <- coord_polar(theta = "y")
cp$is_free <- function() TRUE



options(ggrepel.max.overlaps = Inf)

minuteIntensitiesNarrow_merged_ALPL %>% 
  ggplot(aes(x=1, fill=factor(Intensity))) +
  geom_bar(position = "fill", width = 1) +
  cp +
  scale_fill_manual(values=c("dodgerblue",
                             "olivedrab1",
                             "orangered",
                             "yellow1")) +
  
  facet_wrap(~Users~ActivityLevel~ParticipationLevel
             , scales = "free") +
  geom_label_repel(data = df_1_ALPL,
                   aes(label = per, y=pos), 
                   position = position_fill(vjust = 0.5),
                   size = 3.0,
                   show.legend = FALSE) +
  theme(aspect.ratio = 1) +
  guides(fill = guide_legend(title = "Intensity")) +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        strip.background = element_rect(fill = "thistle1"),
        legend.position = "bottom") +
  labs(title="Intensity Distribution",
       subtitle="By User, Activity Level, Participation Level",
       y="Intensity Percentage")



  • Intensity value.( 0 = Sedentary, 1 = Light, 2 = Moderate, 3 = Very Active)
  • Above each pie labeled with user number, activity level and participation level. This pie chart have all the users which can be use to located level of activity and participation for every single one of them.
  • Pie charts of users with high activity level always have proportion of intensity value as very active (3) more than 2%.
  • Most users with medium activity level have proportion of intensity value of 3 between 1% to 2%, except user 03, user 09, user 11 and user 18. User 11 still have proportion of intensity value of 3 and 2 combined less than 2%, less than some users labeled as low activity level.



Hourly Box Plot of Calories, Intensity, Step and METS Distribution over Day


w <- hourly_outer_join_1_ALPL %>% 
  ggplot() +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8),
        axis.text.x = element_text(angle = 90,vjust=0.5))

ggarrange(
  
  w + geom_boxplot(aes(Time, Calories), 
               fill="burlywood4",
               width=0.7) +
    labs(x='Time', 
         y='Calories',
         title = "Calories vs Time of Day"),
  
  w + geom_boxplot(aes(Time, TotalIntensity), 
               fill="royalblue4",
               width=0.7) +
    labs(x='Time', 
         y='Total Intensity',
         title = "Total Intensity vs Time of Day"),
  
  w + geom_boxplot(aes(Time, StepTotal), 
               fill="red1",
               width=0.7) +
    labs(x='Time', 
         y='Total Steps',
         title = "Total Steps vs Time of Day"),
  
  w + geom_boxplot(aes(Time, METs), 
               fill="olivedrab1",
               width=0.7) +
    labs(x='Time', 
         y='METs',
         title = "METs vs Time of Day"),
  nrow=4
)



  • As we can see, there are too many outliers. For any hourly-based chart, median value would be more appropriate than mean value for analysis.
  • In this hourly-based table we use. TotalIntensity means value calculated by adding all the minute-level intensity values that occurred within the hour. But in the data frame dailyIntensities_merge, there is no column of intensity value. Instead there are only columns of time spending and distance based on intensity level.



Heat Map for Hourly Median Calories, Intensity, Steps and METs Distribution per Weekday


options(repr.plot.width=20, repr.plot.height=80)

z1 <- hourly_outer_join_1_ALPL %>% 
  group_by(Weekday, Time) %>% 
  summarize(medCal = round(median(Calories)),
            medInt = round(median(TotalIntensity)),
            medStep = round(median(StepTotal)),
            medMETs = round(median(METs), digits = 2),
            .groups='drop') %>%
  ggplot(aes(Time, Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  guides(fill = guide_colourbar(barwidth = 0.45,
                                barheight = 6))

ggarrange(
  z1 + geom_tile(aes(fill=medCal),colour = "white")  +
    geom_text(aes(label = medCal), 
              size = 2.5, color = "black") +
    scale_fill_viridis(option = "H", 
                       name = "Median Calories",
                       guide = guide_colorbar(
                         title.position = "top", 
                         direction = "vertical")) +
    labs(x='Time', 
         y='Weekday',
         title = "Hourly Median Calories Distribution per Weekday"),
  
  
  z1 + geom_tile(aes(fill=medInt),colour = "white")  +
    geom_text(aes(label = medInt), 
              size = 2.5, color = "black") +
    scale_fill_viridis(option = "H", 
                       name = "Median Intensity",
                       guide = guide_colorbar(
                         title.position = "top", 
                         direction = "vertical")) +
    labs(x='Time', 
         y='Weekday',
         title = "Hourly Median Intensity Distribution per Weekday"),
  
  z1 + geom_tile(aes(fill=medStep),colour = "white")  +
    geom_text(aes(label = medStep), 
              size = 2.5, color = "black") +
    scale_fill_viridis(option = "H", 
                       name = "Median Steps",
                       guide = guide_colorbar(
                         title.position = "top", 
                         direction = "vertical")) +
    labs(x='Time', 
         y='Weekday',
         title = "Hourly Median Steps Distribution per Weekday"),
  
  z1 + geom_tile(aes(fill=medMETs),colour = "white")  +
    geom_text(aes(label = medMETs), 
              size = 2.5, color = "black") +
    scale_fill_viridis(option = "H", 
                       name = "Median METs",
                       guide = guide_colorbar(
                         title.position = "top", 
                         direction = "vertical")) +
    labs(x='Time', 
         y='Weekday',
         title = "Hourly Median METs Distribution per Weekday"),
  nrow = 4
)



  • Calories, intensity, steps and METs are all at their peak as list below
    1. Tuesday 18:00
    2. Wednesday 18:00
    3. Thursday 17:00
    4. Saturday 13:00
  • In the late afternoon 16:00 to 19:00 in the evening, all the value on this heat map are in the orange zone, except on Saturday the orange zone starts from 11:00.
  • Only in calories heat map we can find the orange zone cluster from 10:00 to 13:00 during the weekdays, especially on Tuesday.



Hourly Median Calories Heat Map per Weekday

By Activity Level


hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Weekday, Time) %>% 
  summarize(medCal = round(median(Calories)),
            .groups='drop') %>% 
  ggplot(aes(Time, Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_tile(aes(fill=medCal),colour = "white")  +
  geom_text(aes(label = medCal), 
            size = 2.5, color = "black") +
  scale_fill_viridis(option = "H", 
                     name = "Median Calories",
                     guide = guide_colorbar(
                       title.position = "top", 
                       direction = "vertical")) +
  labs(x='Time', 
       y='Weekday',
       title = "Hourly Median Calories Distribution per Weekday") +
  facet_wrap(~ActivityLevel, ncol = 1)



  • Surprisingly, people with medium activity level burn less calories during sleeping time than both people with high and low activity level.



Hourly Median Intensity Heat Map per Weekday

By Activity Level


hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Weekday, Time) %>% 
  summarize(medInt = round(median(TotalIntensity)),
            .groups='drop') %>% 
  ggplot(aes(Time, Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_tile(aes(fill=medInt),colour = "white")  +
  geom_text(aes(label = medInt), 
            size = 2.5, color = "black") +
  scale_fill_viridis(option = "H", 
                     name = "Median Intensity",
                     guide = guide_colorbar(
                       title.position = "top", 
                       direction = "vertical")) +
  labs(x='Time', 
       y='Weekday',
       title = "Hourly Median Intensity Distribution per Weekday") +
  facet_wrap(~ActivityLevel, ncol = 1)



  • Again, we can see more orange zone on the heat map of medium activity level, especially on Saturday.



Hourly Median Steps Heat Map per Weekday

By Activity Level


hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Weekday, Time) %>% 
  summarize(medStep = round(median(StepTotal)),
            .groups='drop') %>% 
  ggplot(aes(Time, Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_tile(aes(fill=medStep),colour = "white")  +
  geom_text(aes(label = medStep), 
            size = 2.5, color = "black") +
  scale_fill_viridis(option = "H", 
                     name = "Median Steps",
                     guide = guide_colorbar(
                       title.position = "top", 
                       direction = "vertical")) +
  labs(x='Time', 
       y='Weekday',
       title = "Hourly Median Steps Distribution per Weekday") +
  facet_wrap(~ActivityLevel, ncol = 1)



  • For high activity level, Saturday 13:00 and Wednesday 18:00 are the obviously peak points.
  • For the medium activity level, the peak is on Saturday 11:00



Hourly Median METs Heat Map per Weekday

By Activity Level


hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Weekday, Time) %>% 
  summarize(medMETs = round(median(METs), digits = 2),
            .groups='drop') %>% 
  ggplot(aes(Time, Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_tile(aes(fill=medMETs),colour = "white")  +
  geom_text(aes(label = medMETs), 
            size = 2.5, color = "black") +
  scale_fill_viridis(option = "H", 
                     name = "Median METs",
                     guide = guide_colorbar(
                       title.position = "top", 
                       direction = "vertical")) +
  labs(x='Time', 
       y='Weekday',
       title = "Hourly Median METs Distribution per Weekday") +
  facet_wrap(~ActivityLevel, ncol = 1)



  • Similar pattern as see on the previous heat map. For high activity level, the peak are at Saturday 13:00, Thursday 14:00 and Wednesday 18:00. Monday seems to have the least overall METs value. And for medium activity level, the peak is at Saturday 11:00 and overall METs value is lowest on Thursday.



Bar Chart for Hourly Average Calories, Intensity, Steps and METs Distribution per Weekday


s <- hourly_outer_join_1_ALPL %>% 
  group_by(Weekday) %>% 
  summarize(avgCal = mean(Calories),
            avgInt = mean(TotalIntensity),
            avgStep = mean(StepTotal),
            avgMETs = mean(METs),
            .groups='drop') %>% 
  ggplot() +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8))
  
ggarrange(

  s + geom_col(aes(Weekday, avgCal), 
               fill="darkolivegreen1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Calories',
         title = "Hourly Average Calories Distribution per Weekday"),
  
  s + geom_col(aes(Weekday, avgInt), 
               fill="darkorange",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Intensity',
         title = "Hourly Average Intensity Distribution per Weekday"),
  
  s + geom_col(aes(Weekday, avgStep), 
               fill="cadetblue1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Steps',
         title = "Hourly Average Steps Distribution per Weekday"),
  
  s + geom_col(aes(Weekday, avgMETs), 
               fill="mediumpurple1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average METs',
         title = "Hourly Average METss Distribution per Weekday")
)



  • Intensity value in this chart is calculated by adding all the minute-level intensity values that occurred within the hour. Not the same value on dailyIntensities_merged table.
  • The data from all these bar charts are all hourly-based. The values are in hourly scale. Later in the project there will be another bar chart from daily-based table for comparison on pattern.
  • We use average value calculated by mean function instead of median value because this bar chart is weekday-based, not on hour. And also because all the value on another bar chart from daily-based table are average value, not median, so we’d better use value calculate from the same method.
  • All the values from these chart are related, we can see similar pattern. The bar is highest on Saturday, followed by Tuesday. Sunday is the lowest for every value.


The next bar chart will display the value of each activity level side by side.



Bar Chart for Hourly Average Calories, Intensity, Steps and METs Distribution per Weekday

By Activity Level


G <- hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Weekday) %>% 
  summarize(avgCal = mean(Calories),
            avgInt = mean(TotalIntensity),
            avgStep = mean(StepTotal),
            avgMETs = mean(METs),
            .groups='drop') %>% 
  ggplot() +
  scale_fill_manual(values=
                      c("gold1",
                        "red",
                        "grey0")) +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8),
        axis.text.x = element_text(angle = 90,vjust=0.5))

ggarrange(
  
  G + geom_col(aes(Weekday, avgCal, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Calories',
         title = "Hourly Average Calories Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  G + geom_col(aes(Weekday, avgInt, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Intensity',
         title = "Hourly Average Intensity Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  G + geom_col(aes(Weekday, avgStep, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Steps',
         title = "Hourly Average Steps Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  G + geom_col(aes(Weekday, avgMETs, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average METs',
         title = "Hourly Average METs Distribution per Weekday",
         subtitle = "By Activity Level")
)



  • The pattern for each activity level will be different from the previous bar chart.
  • High activity level bar is peak at Tuesday, not on Saturday as previous chart, followed by Thursday. The lowest bar is on Sunday.
  • Medium activity level is peak on Saturday, followed by Tuesday and Monday. The lowest is on Wednesday. We can see the range from minimum to maximum height is widest in this group.
  • Chart of low activity level is peak on Saturday and Sunday is the lowest.



The chart below is weekday chart with the daily-based values of calories, distance, steps and METs. All the data are taken from Activity_METs_daily_ALPL table. We can compare patterns of the same value from both daily-base data and hourly-base data



Bar Chart for daily Average Calories, Distance, Steps and METs Distribution per Weekday


J <- Activity_METs_daily_ALPL %>% 
  group_by(Weekday) %>% 
  summarize(avgCal = mean(Calories),
            avgDis = mean(TotalDistance),
            avgStep = mean(TotalSteps),
            avgMETs = mean(METs),
            .groups='drop') %>% 
  ggplot() +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8))

ggarrange(
  
  J + geom_col(aes(Weekday, avgCal), 
               fill="darkolivegreen1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Calories',
         title = "Daily Average Calories Distribution per Weekday"),
  
  J + geom_col(aes(Weekday, avgDis), 
               fill="red",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Distance',
         title = "Daily Average Distance Distribution per Weekday"),
  
  J + geom_col(aes(Weekday, avgStep), 
               fill="cadetblue1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average Steps',
         title = "Daily Average Steps Distribution per Weekday"),
  
  J + geom_col(aes(Weekday, avgMETs), 
               fill="mediumpurple1",
               width=0.6) +
    labs(x='Weekday', 
         y='Average METs',
         title = "Daily Average METss Distribution per Weekday")
)



  • This bar chart display the same values on daily scale except the distance chart. Activity_METs_daily_ALPL table doesn’t have the same intensity value like in the hourly-based table.
  • The pattern of calories, steps and METs values are similar to hourly based chart, only daily scale give much higher values for calories and steps



The next bar chart will display the value of each activity level side by side.



Bar Chart for daily Average Calories, Distance, Steps and METs Distribution per Weekday

By Activity Level


J1 <- Activity_METs_daily_ALPL %>% 
  group_by(ActivityLevel,Weekday) %>% 
  summarize(avgCal = mean(Calories),
            avgDis = mean(TotalDistance),
            avgStep = mean(TotalSteps),
            avgMETs = mean(METs),
            .groups='drop') %>% 
  ggplot() +
  scale_fill_manual(values=
                      c("sandybrown",
                        "peru",
                        "darkkhaki")) +
  
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8),
        axis.text.x = element_text(angle = 90,vjust=0.5))

ggarrange(
  
  J1 + geom_col(aes(Weekday, avgCal, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Calories',
         title = "Daily Average Calories Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  J1 + geom_col(aes(Weekday, avgDis, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Distance',
         title = "Daily Average Distance Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  J1 + geom_col(aes(Weekday, avgStep, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average Steps',
         title = "Daily Average Steps Distribution per Weekday",
         subtitle = "By Activity Level"),
  
  J1 + geom_col(aes(Weekday, avgMETs, 
               fill=ActivityLevel),
               position = "dodge") +
    labs(x='Weekday', 
         y='Average METs',
         title = "Daily Average METss Distribution per Weekday",
         subtitle = "By Activity Level")
)



  • From step chart, only people with high activity level have average daily steps over recommended value of 10,000 steps.
  • As expected, we can see very similar pattern between step chart and distance chart.
  • Only on Saturday, people with medium activity level have average steps slightly higher than people with high activity level, but have average distance slightly lower.
  • People with low activity level have average steps equal or lower than 5,000.



Bar Chart for Hourly Median Calories, Intensity, Steps, METs Distribution over Day


v <- hourly_outer_join_1_ALPL %>% 
  group_by(Time) %>% 
  summarize(medCal = median(Calories),
            medInt = median(TotalIntensity),
            medStep = median(StepTotal),
            medMETs = median(METs),
            .groups='drop') %>% 
  ggplot() +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8),
        axis.text.x = element_text(angle = 90,vjust=0.5))

ggarrange(

v + geom_col(aes(Time, medCal), 
             fill="burlywood4",
             width=0.6) +
  labs(x='Time', 
       y='Median Calories',
       title = "Median Calories vs Time of Day"),

v + geom_col(aes(Time, medInt), 
             fill="royalblue4",
             width=0.6) +
  labs(x='Time', 
       y='Median Intensity',
       title = "Median Intensity vs Time of Day"),

v + geom_col(aes(Time, medStep), 
             fill="red1",
             width=0.6) +
  labs(x='Time', 
       y='Median Steps',
       title = "Median Steps vs Time of Day"),

v + geom_col(aes(Time, medMETs), 
             fill="olivedrab1",
             width=0.6) +
  labs(x='Time', 
       y='Median METs',
       title = "Median METs vs Time of Day"),
nrow=4
)



  • We can also see similar pattern of all the values in these four chart. There are two peak point from 17:00 to 19:00, and also from 11:00 to 13:00
  • Sleeping hours are from 23:00 to 11:00 of the next day, there is no activity and step, only there is still calorie burning during this time.
  • Peak of all these chart is at 18:00, followed by 12:00 (only Calorie chart, the second peak is at 13:00).



Bar Chart for Hourly Median Calories, Intensity, Steps, METs Distribution over Day

By Activity Level


E <- hourly_outer_join_1_ALPL %>% 
  group_by(ActivityLevel,Time) %>% 
  summarize(medCal = median(Calories),
            medInt = median(TotalIntensity),
            medStep = median(StepTotal),
            medMETs = median(METs),
            .groups='drop') %>% 
  ggplot() +
  scale_fill_manual(values=
                      c("red",
                        "green",
                        "blue4")) +
  theme(plot.title=element_text(size=11),
        axis.title=element_text(size=9),
        axis.text=element_text(size=8),
        axis.text.x = element_text(angle = 90,vjust=0.5))

ggarrange(
  
  E + geom_col(aes(Time,
                   medCal,
                   fill=ActivityLevel),
                   position = "dodge") +
    labs(x="Time", 
         y="Median Calories",
         title = "Median Calories vs Time of Day",
         subtitle = "By Activity Level"),
  
  E + geom_col(aes(Time, 
                   medInt,
                   fill=ActivityLevel), 
                   position = "dodge") +
    labs(x="Time", 
         y="Median Intensity",
         title = "Median Intensity vs Time of Day",
         subtitle = "By Activity Level"),
  
  E + geom_col(aes(Time, 
                   medStep,
                   fill=ActivityLevel), 
               position = "dodge") +
    labs(x="Time", 
         y="Median Steps",
         title = "Median Steps vs Time of Day",
         subtitle = "By Activity Level"),
  
  E + geom_col(aes(Time, 
                   medMETs,
                   fill=ActivityLevel), 
               position = "dodge") +
    labs(x="Time", 
         y="Median METs",
         title = "Median METs vs Time of Day",
         subtitle = "By Activity Level"),
  nrow=4
)



  • The overall pattern is similar to the previous chart, but surprisingly on the intensity chart from 16:00 to 23:00, people with medium activity level have intensity value higher than people with high activity level.
  • The contradiction could probably come from different metrics used to define intensity. Users are categorized into groups based on activity level from the data on daily-based chart, which define intensity as time spending on intense activity. But in this chart from hourly-based values, intensity is not measured as time.
  • Another interesting point is calories burning during sleep time. From 22:00 to 6:00 in the morning of the next day, the red bars of people with low activity level are higher than the green bars of people with medium activity level. This indicates low activity people have higher calories burned during sleep time higher than medium activity people.
  • From 19:00 in the evening to 23:00, people with medium activity level have number of steps higher than people with high activity level.
  • From 20:00 to 22:00, people with medium activity level have highest METs value.
  • For the chart that indicate activity like step chart and intensity chart, people in medium activity level seem to have less sleep as their activity start at 6:00 and end at 23:00 while people in other group start at 7:00 (high activity level) or 9:00 (low activity level) and finish at 22:00 for both high and low activity level.



Step Total and Total Intensity

Total Intensity and Calories

Step Total and METs

METs and Calories

Step Total and Calories

Multiple Linear Regression Model for Hourly-Based Data Frame



Regression Model describes the correlation between a dependent variable, y, and one or more independent variables, X. From all the values in hourly-based data frame we have, we can see from our scatter plots that many values have almost linear relationship to one another. Linear regression model created from all these data can forecast the outcome from the data based on behavioral factors.



The functional form of Multiple Linear Regression is :



\[y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + ... + \beta_{n} x_{n}\]

We use lm() to find all the coefficients \(\beta_{0}, \beta_{1}, \beta_{2},...,\beta_{n}\) for all the independent variables \(x_{1}, x_{2},..., x_{n}\) we have.



Linear Regression Model between METs and Intensity



options(scipen = 999)

lmMETs_Intensity <- 
  lm(METs ~ TotalIntensity, 
            data=hourly_outer_join_1_ALPL)

summary(lmMETs_Intensity)
## 
## Call:
## lm(formula = METs ~ TotalIntensity, data = hourly_outer_join_1_ALPL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2173 -0.0258 -0.0054  0.0046  4.9917 
## 
## Coefficients:
##                  Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    1.00543699 0.00141218   712.0 <0.0000000000000002 ***
## TotalIntensity 0.03851601 0.00005806   663.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1824 on 22091 degrees of freedom
## Multiple R-squared:  0.9522, Adjusted R-squared:  0.9522 
## F-statistic: 4.4e+05 on 1 and 22091 DF,  p-value: < 0.00000000000000022



From the regression analysis we get from summary function, the significance level of intensity(TotalIntensity) indicates this variable contributes to the explanation of the dependent variable(METs) significantly (three asterisks). R-squared is 0.9522 and very low p-value which is quite good.



Linear Regression Model between METs and Steps



options(scipen = 999)

lmMETs_Step <- 
  lm(METs ~ 
       StepTotal, 
       data=hourly_outer_join_1_ALPL)

summary(lmMETs_Step)
## 
## Call:
## lm(formula = METs ~ StepTotal, data = hourly_outer_join_1_ALPL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1751 -0.1190 -0.0933  0.0553  7.5743 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) 1.118981219 0.002634363   424.8 <0.0000000000000002 ***
## StepTotal   0.001093226 0.000003461   315.8 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3552 on 22091 degrees of freedom
## Multiple R-squared:  0.8187, Adjusted R-squared:  0.8187 
## F-statistic: 9.975e+04 on 1 and 22091 DF,  p-value: < 0.00000000000000022



In this model, Step also contribute significantly to explanation of METs, with very low p-value and R-squared is 0.8187.



Multiple Linear Regression Model between METs, Steps and Intensity



options(scipen = 999)

lmMETs_StepandIntensity <- 
  lm(METs ~ 
       StepTotal +
       TotalIntensity, 
       data=hourly_outer_join_1_ALPL)

summary(lmMETs_StepandIntensity)
## 
## Call:
## lm(formula = METs ~ StepTotal + TotalIntensity, data = hourly_outer_join_1_ALPL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3580 -0.0209 -0.0114  0.0036  3.9978 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    1.0114400  0.0013462  751.31 <0.0000000000000002 ***
## StepTotal      0.0001869  0.0000038   49.19 <0.0000000000000002 ***
## TotalIntensity 0.0330450  0.0001241  266.22 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1732 on 22090 degrees of freedom
## Multiple R-squared:  0.9569, Adjusted R-squared:  0.9569 
## F-statistic: 2.453e+05 on 2 and 22090 DF,  p-value: < 0.00000000000000022



In this model, we incorporate both step and intensity to explain the correlation of these two independent variables and METs. The result is even better than the previous two models. Very low p-value and better R-squared of 0.9569


  • The model from this result is the following equation.



METs = 1.0114400 + 0.0001869(StepTotal) + 0.0330450(TotalIntensity)



Create regression diagnostics plot.



plot(lmMETs_StepandIntensity)



  • The line in Residuals vs Fitted chart is almost horizontal line without distinct patterns, an indication for a linear relationship.
  • Most of the points in Normal Q-Q chart fall approximately along the reference line in the middle, but as it comes closer to toward the ends on both sides, these points gradually deviate from the dashed line.
  • Most of the dots are cluster at the bottom-left of Scale-Location chart. Horizontal line with equally spread points is a good indication of homoscedasticity. But our case have a heteroscedasticity problem.
  • From Residuals vs Leverage chart, we can see all the outliers that might influence the regression results when included or excluded from the analysis. As we can see from our previous box plot, data from hourly-based table contain so many extreme values.



Hourly Heart Rate Box Plot for Each User



heartrate_seconds_merged_ALPL %>% 
  mutate(
    Hour = format(Time, "%k")
  ) %>% ggplot() + 
  geom_boxplot(aes(Hour,Value,fill=Hour)) +
  theme(axis.text.x = element_text(angle = 90,vjust=0.5),
        strip.background = element_rect(fill = "palegoldenrod")) +
  facet_wrap(~Users~ActivityLevel) +
  labs(x='Time', 
       y='Heart Rate Value',
       title = "Hourly Heart Rate for Each User")



  • For some certain hours, people with high activity level tend to have longer box which indicates the wider range of heart rate during that hours.
  • So many outliers in this chart, median value of heart rate would be more suitable for use than mean value.



Hourly Median Heart Rate Distribution per Weekday

By Activity Level



heartrate_seconds_merged_ALPL %>%
  group_by(Weekday,
           Hour,
           ActivityLevel) %>% 
  summarize(medValue = round(median(Value)),
            .groups='drop') %>% 
  ggplot(aes(Hour,Weekday)) +
  scale_y_discrete(limits = rev) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_tile(aes(fill=medValue),colour = "white")  +
  geom_text(aes(label = medValue), 
            size = 2.5, color = "black") +
  scale_fill_viridis(option = "H", 
                     name = "Median Heart Rate") +
  facet_wrap(~ActivityLevel, ncol = 1) +
  labs(x='Time', 
       y='Weekday',
       title = "Hourly Median Heart Rate Distribution per Weekday",
         subtitle = "By Activity Level")



  • We can see clearer in heat map that people with low activity level tend to have higher hear rate than people with medium and high activity level.
  • People with high activity level have lower heart rate during sleep time than people with both medium and low activity level. This could indicate the difference in quality of sleep.
  • For people with high activity level, the peak of heart rate in this heat map is correlate to the peak time of the previous hourly calorie heat map, at 13:00 on Saturday and 18:00 on Wednesday. Only the peak at 6:00 on Thursday is not corresponding to the calories burned at that time.
  • On Tuesday 18:00, heart rate is not significantly high in all activity level, in contrast to the high values of calories burned, total steps, intensity, and METs on hourly heat map.
  • For people with high activity level, we can see the orange zone stand out at 6:00 from Monday to Friday in contrast to the color of surrounding hours. This could suggest the routine of morning exercise before going to work.



Sleeping Value Distribution

By User, Activity Level, Participation Level



minuteSleep_merged_ALPL %>% 
  ggplot(aes(x=1, fill=factor(value))) +
  geom_bar(position = "fill", width = 1) +
  cp +
  scale_fill_manual(values=c("cadetblue1",
                             "goldenrod1",
                             "firebrick1")) +
  
  facet_wrap(~Users~ActivityLevel~ParticipationLevel, 
             scales = "free") +
  geom_label_repel(data = df_2_ALPL,
                   aes(label = per, y=pos), 
                   position = position_fill(vjust = 0.5),
                   size = 3.0,
                   show.legend = FALSE) +
  theme(aspect.ratio = 1) +
  guides(fill = guide_legend(title = "Value")) +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        strip.background = element_rect(fill = "thistle1"),
        legend.position = "bottom") +
  labs(title="Sleeping Value Distribution",
       subtitle = "By User, Activity Level, Participation Level",
       y="Sleeping Value Percentage")



  • Value indicating the sleep state.(1 = asleep, 2 = restless, 3 = awake)
  • People with low activity level tend to have more of red area in this pie chart, which indicate the higher percentage of time they stayed awake while sleeping. This could suggest about the quality of sleep they have.
  • There is some exception like user number 3, this person have very high percentage of awaking time while sleeping.



Box Plot for Activity Minutes per Weekday



Start from this chart, we will use data from daily-based data frames, like Activity_METs_daily_long_ALPL from this chart or Activity_METs_daily_ALPL for the next charts with additional data on sleeping. For the chart involved sleeping, we use data from SMAd_ALPL (which stand for Sleep, METs, Activity daily) instead.


The reason is because we have lost huge chunk of data (from 934 rows to 410 rows ) while incorporating sleep data into daily activity data, due to less number of participant (only 24 users) on SleepDay_merged table. We want all the charts created to be as inclusive as possible, So we would use data from SMAd table only for the charts involved sleeping.



Activity_METs_daily_long_ALPL %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>% 
  ggplot() +
  geom_boxplot(aes(Weekday,Minutes,fill=ActivityMinutes)) + 
  theme_classic()  



  • This chart we can see the comparison of each activity minutes stand side by side in each weekday.
  • The ranges of very active minutes are wider than the ranges of fairly active minutes.
  • Sedentary minutes has the widest range and has the lowest median value on Thursday


The next chart we can see distribution of each activity minutes on weekday in closer look.



Box Plot for Very Active Minutes, Fairy Active Minutes, Lightly Active Minutes and Sedetary Minutes per Weekday



MM <- Activity_METs_daily_ALPL %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>%
  ggplot() +
  scale_fill_manual(values=
                      c("gold1", 
                        "lightpink",
                        "olivedrab2",
                        "darkorange",
                        "cadetblue1",
                        "mediumorchid2",
                        "firebrick2")) +          
  theme_classic()+
  theme(axis.text.x = element_text(angle = 90))

ggarrange(
  
  MM + geom_boxplot(aes(Weekday,
                        VeryActiveMinutes,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Very Active Minutes',
         title = "Very Active Minutes Distribution per Weekday"),
  
  MM + geom_boxplot(aes(Weekday,
                        FairlyActiveMinutes,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Fairy Active Minutes',
         title = "Fairy Active Minutes Distribution per Weekday"),
  
  MM + geom_boxplot(aes(Weekday,
                        LightlyActiveMinutes,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Lightly Active Minutes',
         title = "Lightly Active Minutes Distribution per Weekday"),
  
  MM + geom_boxplot(aes(Weekday,
                        SedentaryMinutes,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Sedentary Minutes',
         title = "Sedentary Minutes Distribution per Weekday")
  
)



  • Sunday is the day that all the median values, except median value of sedentary minutes, are at the lowest. Sunday is like a rest day of the week according to the previous heat maps. All the value like calories burned, intensity, steps and METs are at the lowest.
  • Median of very active minutes and fairly active minutes are highest on Tuesday. But as you can see, both charts have so many outliers.


We also can create box plot for each activity level to see the differences in activity minutes distribution per weekday.



Box Plot for Activity Minutes per Weekday

By Activity Level



Activity_METs_daily_long_ALPL %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>% 
  ggplot() +
  geom_boxplot(aes(Weekday,Minutes,fill=ActivityMinutes)) + 
  theme_classic() +
  facet_wrap(~ActivityLevel, ncol = 1) +
  theme(strip.background = element_rect(fill = "aliceblue")) +
  labs(x='Minutes', 
       y='Weekday',
       title = "Box Plot for Activity Minutes per Weekday",
       subtitle = "By Activity Level")



  • People with high activity have higher value of very active minutes than fairy active minute, but the chart of people with medium activity level goes on the other way around.
  • Sedentary minutes box plot of people with medium activity level mostly has median value lower than 1,000 minute. lower than the values from both people with high activity and low activity level.



Next chart is the activity minutes stacked bar for each individual user.



Weekday Average Activity Minutes Stacked Bar

By User, Activity Level



Activity_METs_daily_long_ALPL %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>% 
  ggplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
  geom_bar(width = 0.7,stat = "summary", fun = "mean") + 
  theme_classic()  +
  facet_wrap(~Users~ActivityLevel) +
  theme(axis.text.x = element_text(angle = 90,vjust=0.5),
        strip.background = element_rect(fill = "thistle1")) +
  labs(title="Weekday Average Activity Minutes Stacked Bar",
       subtitle="By User, Activity Level, Participation Level")



  • For many users, the height of their stacked bars reach approximately near, or even reach, full 1,440 minutes in a day. Many of them have over 1,000 sedentary minutes in a day. Could that mean sedentary minutes include sleep time?


Next chart is weekday average activity minutes stacked bars categorized by activity level.



Weekday Average Activity Minutes Stacked Bar

Activity Level



Activity_METs_daily_long_ALPL %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>% 
  ggplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
  geom_bar(width = 0.7,stat = "summary", fun = "mean") + 
  theme_classic()  +
  facet_wrap(~ActivityLevel) +
  theme(axis.text.x = element_text(angle = 90,vjust=0.5),
        strip.background = element_rect(fill = "thistle1")) +
  labs(title="Weekday Average Activity Minutes Stacked Bar",
       subtitle="By User, Activity Level, Participation Level")



  • High activity level users have highest average value of very active minutes, with the peak on Tuesday.
  • Users with low activity level have the shortest of both very active minutes and fairly active minutes. On the other hand, this group have the value of sedentary minutes the most in every weekday.
  • Users with medium activity level have the most of fairly active minutes proportion.



Box Plot for Total Steps, Calories, METs and Total Distance per Weekday



AM <- Activity_METs_daily %>% 
  mutate(Weekday = wday(ActivityDate, 
                        label = T, 
                        week_start = 1)) %>%
  ggplot() +
  scale_fill_manual(values=
                      c("gold1", 
                        "lightpink",
                        "olivedrab2",
                        "darkorange",
                        "cadetblue1",
                        "mediumorchid2",
                        "firebrick2")) +          
   theme_classic()+
   theme(axis.text.x = element_text(angle = 90))

ggarrange(
  
  AM + geom_boxplot(aes(Weekday,
                        TotalSteps,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Total Steps',
         title = "Total Steps Distribution per Weekday"),
  
  AM + geom_boxplot(aes(Weekday,
                        Calories,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Calories',
         title = "Calories Distribution per Weekday"),
  
  AM + geom_boxplot(aes(Weekday,
                        METs,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='METs',
         title = "METs Distribution per Weekday"),
  
  AM + geom_boxplot(aes(Weekday,
                        TotalDistance,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Total Distance',
         title = "Total Distance Distribution per Weekday")
  
  )



  • The charts above, we have calories, steps and METs box plots we can compare to the previous heat map we created from hour-based table. We can see the median line in the box plots of three chart have the highest values on Tuesday, instead of Saturday on hourly-based heat maps that we can see more orange zone. Keep in mind that there are always differences on daily and hourly scale, and also we have to take hours from non-orange zone on heat map into account, like sleeping hours that color in heat maps are in deep purple.
  • The median value of distance is also highest on Tuesday.
  • Sunday is the day that all median values are at the lowest.
  • There are some outliers, but not as many as the data from hourly-based table.



Correlation Matrix for Daily-Based Data Fram with Sleeping Data



Now we use SMAd_ALPL table which stands for Sleep, METs and Activity daily in order to see how sleeping pattern can have some effect on other activities during waking hours, or some biological processes that happen throughout the day like calories burned and METs.


But before we creating any chart, let’s use correlation matrix to see how all the data related to each other



ggcorr(SMAd_ALPL[, 2:21], 
       label = TRUE,
       hjust = 1.0,
       angle = -45,
       layout.exp = 1)



  • The matrix doesn’t indicate correlation between total time in bed, total minutes asleep and total sleep records with METs and calories burned, but sleep efficiency has some correlation.


Let’s create multiple linear regression for this data frame and see how we can develop simple equation for all these relationship



Multiple Linear Regression Model for Sleep and METs Incorporated Daily-Based Data Frame



options(scipen = 999)

modSMAd_ALPL <- lm(Calories ~ 
              TotalSteps +
              VeryActiveMinutes + 
              FairlyActiveMinutes + 
              LightlyActiveMinutes +
              SedentaryMinutes+
              VeryActiveDistance + 
              ModeratelyActiveDistance + 
              LightActiveDistance +
              SedentaryActiveDistance +
              TotalSleepRecords +
              TotalMinutesAsleep +
              TotalTimeInBed + 
              Sleep_Efficiency,
            data=SMAd_ALPL)

summary(modSMAd_ALPL)
## 
## Call:
## lm(formula = Calories ~ TotalSteps + VeryActiveMinutes + FairlyActiveMinutes + 
##     LightlyActiveMinutes + SedentaryMinutes + VeryActiveDistance + 
##     ModeratelyActiveDistance + LightActiveDistance + SedentaryActiveDistance + 
##     TotalSleepRecords + TotalMinutesAsleep + TotalTimeInBed + 
##     Sleep_Efficiency, data = SMAd_ALPL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -759.5 -232.3  -26.5  189.2 4377.4 
## 
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              -2411.4893   799.7185  -3.015              0.00273 ** 
## TotalSteps                  -0.2179     0.0229  -9.515 < 0.0000000000000002 ***
## VeryActiveMinutes           14.4678     1.2094  11.962 < 0.0000000000000002 ***
## FairlyActiveMinutes         17.5818     3.2438   5.420  0.00000010369209365 ***
## LightlyActiveMinutes        -1.2111     0.4936  -2.454              0.01457 *  
## SedentaryMinutes             1.3626     0.1650   8.256  0.00000000000000227 ***
## VeryActiveDistance         210.9534    37.5440   5.619  0.00000003629868202 ***
## ModeratelyActiveDistance   -13.7444    77.5118  -0.177              0.85935    
## LightActiveDistance        543.9142    33.4873  16.242 < 0.0000000000000002 ***
## SedentaryActiveDistance   1391.7987  2277.3383   0.611              0.54145    
## TotalSleepRecords           88.2852    58.5429   1.508              0.13234    
## TotalMinutesAsleep          -1.2354     1.6156  -0.765              0.44492    
## TotalTimeInBed               2.1642     1.4485   1.494              0.13595    
## Sleep_Efficiency            25.5263     8.4937   3.005              0.00282 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 392.6 on 396 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.732 
## F-statistic: 86.95 on 13 and 396 DF,  p-value: < 0.00000000000000022



  • Adjusted R-Squared and p-value is good but we need to eliminate some independent variables that don’t significantly contribute to explaining the variance in the dependent variable.
  • In this case, these variables would be moderately active distance, sedentary active distance, total sleep records, total minutes asleep and total time in bed.


So the adjusted model would be as following.



options(scipen = 999)

modSMAd_ALPL_redux <- lm(Calories ~ 
                     TotalSteps +
                     VeryActiveMinutes + 
                     FairlyActiveMinutes + 
                     LightlyActiveMinutes +
                     SedentaryMinutes+
                     VeryActiveDistance + 
                     LightActiveDistance + 
                     Sleep_Efficiency,
                   data=SMAd_ALPL)


summary(modSMAd_ALPL_redux)
## 
## Call:
## lm(formula = Calories ~ TotalSteps + VeryActiveMinutes + FairlyActiveMinutes + 
##     LightlyActiveMinutes + SedentaryMinutes + VeryActiveDistance + 
##     LightActiveDistance + Sleep_Efficiency, data = SMAd_ALPL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -792.2 -254.8  -14.2  196.9 4523.6 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          -600.69244  267.68270  -2.244              0.02537 *  
## TotalSteps             -0.22911    0.02179 -10.515 < 0.0000000000000002 ***
## VeryActiveMinutes      13.96435    1.13897  12.260 < 0.0000000000000002 ***
## FairlyActiveMinutes    16.58093    1.78668   9.280 < 0.0000000000000002 ***
## LightlyActiveMinutes   -1.31264    0.48692  -2.696              0.00732 ** 
## SedentaryMinutes        0.81123    0.12577   6.450       0.000000000322 ***
## VeryActiveDistance    227.76369   37.95053   6.002       0.000000004374 ***
## LightActiveDistance   544.11503   34.18433  15.917 < 0.0000000000000002 ***
## Sleep_Efficiency       17.54828    2.72642   6.436       0.000000000350 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 404.8 on 401 degrees of freedom
## Multiple R-squared:  0.7207, Adjusted R-squared:  0.7151 
## F-statistic: 129.3 on 8 and 401 DF,  p-value: < 0.00000000000000022



  • With this model, we have every independent variable that contributes significantly to explain the variance in the dependent variable.
  • Adjusted R-squared may be a little bit less than the previous model, but p-value is still good.


  • The model from this result is the following equation.



Calories = -600.69244 - 0.22911(TotalSteps) + 13.96435(VeryActiveMinutes) + 16.58093(FairlyActiveMinutes) - 1.31264(LightlyActiveMinutes) + 0.81123(SedentaryMinutes) + 227.76369(VeryActiveDistance) + 544.11503(LightActiveDistance) + 17.54828(Sleep_Efficiency)



Let’s check the diagnostic plot to see if this model has a good fit for normal distribution or not.



plot(modSMAd_ALPL_redux)



  • Compare to the previous model created from hourly-based table, this model may not have value of Adjusted R-squared as high as previous model, but the diagnosis plot gives better result.
  • The line in Residuals vs Fitted chart is well horizontal with the points scatter evenly above and below that line.
  • Scale-Location plot of this model is even better than the previous model, with almost horizontal line at the height close to value of 1.
  • All the point in Normal Q-Q plot follow along the straight line nicely, with very few extreme observation.
  • There is an observation 289 on Residual vs Leverage plot that locates outside “Cook’s distance” dashed.


If sleep sleep efficiency is the only sleep related data that significantly contributes to calories burned. Let’s take a look at overall sleep distribution per weekday.



Box Plots for Total Minutes Asleep, Total Time in Bed and Sleep Efficiency Distribution per Weekday



SL <- SMAd_ALPL %>% 
  ggplot() +
  scale_fill_manual(values=
                      c("gold1", 
                        "lightpink",
                        "olivedrab2",
                        "darkorange",
                        "cadetblue1",
                        "mediumorchid2",
                        "firebrick2")) +          
  theme_classic()+
  theme(axis.text.x = element_text(angle = 90))

ggarrange(
  
  SL + geom_boxplot(aes(Weekday,
                        TotalMinutesAsleep,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Total Minutes Asleep',
         title = "Total Minutes Asleep Distribution per Weekday"),
  
  SL + geom_boxplot(aes(Weekday,
                        TotalTimeInBed,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Total Time In Bed',
         title = "Total Time In Bed Distribution per Weekday"),
  
  SL + geom_boxplot(aes(Weekday,
                        Sleep_Efficiency,
                        fill=Weekday)) +
    labs(x='Weekday', 
         y='Sleep Efficiency',
         title = "Sleep Efficiency Distribution per Weekday")
  
)



  • Median value of total minutes asleep is at the highest on Sunday, and lowest on Friday.
  • The median value of sleep efficiency also at the lowest on Sunday as well.
  • During the weekday, Wednesday has the highest median value of both total minutes asleep and total time in bed. But for sleep efficiency, Monday has the highest median value during weekday.



Sleep Efficiency Distribution per Weekday

By Activity Level



SMAd_ALPL %>% 
  ggplot() +
  scale_fill_manual(values=
                      c("gold1", 
                        "lightpink",
                        "olivedrab2",
                        "darkorange",
                        "cadetblue1",
                        "mediumorchid2",
                        "firebrick2")) +          
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_boxplot(aes(Weekday,
                        Sleep_Efficiency,
                        fill=Weekday)) +
  facet_wrap(~ActivityLevel, ncol = 1) +
  theme(strip.background = element_rect(fill = "aliceblue")) +
  labs(x='Weekday', 
         y='Sleep Efficiency',
         title = "Sleep Efficiency Distribution per Weekday",
         subtitle = "By Activity Level")



  • We can not see much difference in median value of sleep efficiency in each group, only box plot of people with medium activity level has widest range.
  • There are some outliers indicate sleep deprivation in every group. Most of them are in medium activity level, follow by low activity level.


From previous activity minutes stacked bar charts, we notice the part of average sedentary minutes in some bars are over 1,200 minutes as compared to total 1,440 minutes in one day. The question is, does sedentary minutes include sleep time? Let’s see the plot between total minutes asleep and sedentary minutes as related to sleep efficiency and sleep fragmentation (total sleep records).



Total Minutes Asleep Vs. Sedentary Minutes

Total Minutes Asleep Vs. Sedentary Minutes

Total Minutes Asleep Vs. Very Active Minutes

Total Minutes Asleep Vs. Very Active Minutes

Total Minutes Asleep Vs. Calories

Total Minutes Asleep Vs. Calories

Total Minutes Asleep Vs. Calories

Total Minutes Asleep Vs. Calories

Total Minutes Asleep Vs. METs

Total Minutes Asleep Vs. METs

Sleep Efficiency Vs. Calories

Sleep Efficiency Vs. METs

Sleep Efficiency Vs. METs

Total Steps Vs. METs

Total Steps Vs. METs

Weight (Kg.) Vs. BMI

By Activity Level



weightLogInfo_merged_ALPL %>% 
  ggplot(aes(WeightKg, BMI, 
             color=ActivityLevel)) +
  geom_point() +
  facet_wrap(~ActivityLevel) +
  theme(axis.text.x = element_text(angle = 90,vjust=0.5)) +
  labs(
    x='Weight (Kg.)', 
    y='BMI', 
    title='Weight (Kg.) Vs. BMI', 
    subtitle ='By Activity Level')



  • Look like we found extreme case, low activity level, high weight and high BMI to the point of obesity. This is user number 5.
  • User number 5 is one of the example of inconsistency we can find throughout this data set. Based on the dailyActivity_merged table, User 05 is daily user because we can find his data recorded from the beginning to the last day. But in some other table, for example sleepDay_merged table, User 05 participated only 5 days.


We can only find out some more information about his daily activity on non-sleep related table to get some idea on the life style related to data on this weightLogInfo_merged_ALPL table.


The chart below using data from the table Activity_METs_daily_ALPL, which provide substantial amount of data for this user to get some idea about daily life style.



Total Steps Vs. Calories

Data Limitations



  1. The data set has some inconsistency and incomplete in many level. There are different number of user participation in many data frames. If we want to join two data frame together, let’s take hourly based tables and heart rate table for instance, the differences in both user participation and also number of days or hours each user participated, make it inevitably to exclude significant amount of valuable data.
  2. There are irregularity in time recording. Let’s take heartrate_second_merged table for instance. The data in this table is supposed to recorded every 5 seconds, but beside the fact that not every user equally participated in collecting data, the data frame also skip on the period of time when there’s no data collected. Instead of having timeline with constant recorded interval every 5 seconds with NA value on the absent data, the absent periods are just skipped with the beginning of next recorded period on the next row. This cause some problem in case we want to plot some line chart to see how heart rate swing in each day, without consistent interval with NA value during absent period, Line chart will automatically connect the line across the absent period.
  3. The data set does not include very important data, such as gender, age, residential location, which is crucial factors that have effects on rate of metabolism, weather condition that suitable of outdoor activity, etc.



Recommendations



  1. From correlation matrix and linear regression model we created, Bellabeat should develop app that can predict trend calories burned, weight change, etc., as related to recorded data from users that depicted in simple dashboard. This app display all the recorded and development trend to users, and encourage users to improve by sending notification about daily steps, sleep hours and the most important, caution or warning for some unusual data recorded to the app., like unusual heart rate.
  2. Bellabeat app. should also develop custom made programs for each group of users according to their previous collected data. These programs encourage users to take their own steps to achieve their goal of improvement, in term of calories burned, weight loss, better sleeping quality, etc. All these program can also introduce users to more Bellabeat products that can help users achive their goal along the path.



Special Thanks

The author in the list below are the source of good ideas, and inspiration for this project. Please do yourself a favor, take a look at their beautiful works.


1.) https://www.kaggle.com/code/alixmcgettrick/bellabeat-case-study-in-r


2.) https://www.kaggle.com/code/zulkhaireesulaiman/bellabeat-capstone-project-in-r#-7.-Recommendations--


3.) https://www.kaggle.com/code/aymangouda/bellabeat-fitbit-capstone-project



Thank you for taking your time to have a look at my project!


Any comments and recommendations for improvement would be highly appreciated!