This project is the investigation of the data collected by
Bellabeat, a high-tech manufacturer of health-focused products for
women. The dataset collected from thirty-three eligible Fitbit users
consented to the submission of personal tracker data, including
minute-level output for physical activity, heart rate, and sleep
monitoring. The main goal is to gain insight into how consumers are
using the company smart devices.
First, load all the libraries for data analysis and
visualization.
library("tidyverse")
library("lubridate")
library("viridis")
library("ggpubr")
library("ggrepel")
library("GGally")
Then, load the FitBit datasets. These datasets include 18 csv
files containing data about physical activities, heart rate, sleep
track, METS, calories burned, and weight information recorded on daily
and/or hourly basis, some are even monitored in minute and/or seconds
format.
dailyActivity_merged <- read.csv("C:/Users/Family/Downloads/dailyActivity_merged.csv")
dailyCalories_merged <- read.csv("C:/Users/Family/Downloads/dailyCalories_merged.csv")
dailyIntensities_merged <- read.csv("C:/Users/Family/Downloads/dailyIntensities_merged.csv")
dailySteps_merged <- read.csv("C:/Users/Family/Downloads/dailySteps_merged.csv")
heartrate_seconds_merged <- read.csv("C:/Users/Family/Downloads/heartrate_seconds_merged.csv/heartrate_seconds_merged.csv")
hourlyCalories_merged <- read.csv("C:/Users/Family/Downloads/hourlyCalories_merged.csv")
hourlyIntensities_merged <- read.csv("C:/Users/Family/Downloads/hourlyIntensities_merged.csv")
hourlySteps_merged <- read.csv("C:/Users/Family/Downloads/hourlySteps_merged.csv")
minuteCaloriesNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteCaloriesNarrow_merged.csv/minuteCaloriesNarrow_merged.csv")
minuteCaloriesWide_merged <- read.csv("C:/Users/Family/Downloads/minuteCaloriesWide_merged.csv~/minuteCaloriesWide_merged.csv", header=FALSE)
minuteIntensitiesNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteIntensitiesNarrow_merged.csv/minuteIntensitiesNarrow_merged.csv")
minuteIntensitiesWide_merged <- read.csv("C:/Users/Family/Downloads/minuteIntensitiesWide_merged.csv/minuteIntensitiesWide_merged.csv", header=FALSE)
minuteMETsNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteMETsNarrow_merged.csv/minuteMETsNarrow_merged.csv")
minuteSleep_merged <- read.csv("C:/Users/Family/Downloads/minuteSleep_merged.csv/minuteSleep_merged.csv")
minuteStepsNarrow_merged <- read.csv("C:/Users/Family/Downloads/minuteStepsNarrow_merged.csv/minuteStepsNarrow_merged.csv")
minuteStepsWide_merged <- read.csv("C:/Users/Family/Downloads/minuteStepsWide_merged.csv/minuteStepsWide_merged.csv", header=FALSE)
sleepDay_merged <- read.csv("C:/Users/Family/Downloads/sleepDay_merged.csv")
weightLogInfo_merged <- read.csv("C:/Users/Family/Downloads/weightLogInfo_merged.csv")
sum(is.na(dailyActivity_merged))
sum(is.na(dailyCalories_merged))
sum(is.na(dailyIntensities_merged))
sum(is.na(dailySteps_merged))
sum(is.na(heartrate_seconds_merged))
sum(is.na(hourlyCalories_merged))
sum(is.na(hourlyIntensities_merged))
sum(is.na(hourlySteps_merged))
sum(is.na(minuteCaloriesNarrow_merged))
sum(is.na(minuteCaloriesWide_merged))
sum(is.na(minuteIntensitiesNarrow_merged))
sum(is.na(minuteIntensitiesWide_merged))
sum(is.na(minuteMETsNarrow_merged))
sum(is.na(minuteSleep_merged))
sum(is.na(minuteStepsNarrow_merged))
sum(is.na(minuteStepsWide_merged))
sum(is.na(sleepDay_merged))
sum(is.na(weightLogInfo_merged))
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 65
Only weightLogInfo_merged data frame has 65 missing
values while the other data frames have none of it.
sum(duplicated(dailyActivity_merged))
sum(duplicated(dailyCalories_merged))
sum(duplicated(dailyIntensities_merged))
sum(duplicated(dailySteps_merged))
sum(duplicated(heartrate_seconds_merged))
sum(duplicated(hourlyCalories_merged))
sum(duplicated(hourlyIntensities_merged))
sum(duplicated(hourlySteps_merged))
sum(duplicated(minuteCaloriesNarrow_merged))
sum(duplicated(minuteCaloriesWide_merged))
sum(duplicated(minuteIntensitiesNarrow_merged))
sum(duplicated(minuteIntensitiesWide_merged))
sum(duplicated(minuteMETsNarrow_merged))
sum(duplicated(minuteSleep_merged))
sum(duplicated(minuteStepsNarrow_merged))
sum(duplicated(minuteStepsWide_merged))
sum(duplicated(sleepDay_merged))
sum(duplicated(weightLogInfo_merged))
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 543
## [1] 0
## [1] 0
## [1] 3
## [1] 0
There’re some duplication on sleepDay_merged and
minuteSleep_merged data frames. So let’s remove them all.
sleepDay_merged <-
sleepDay_merged[!duplicated(sleepDay_merged), ]
minuteSleep_merged <-
minuteSleep_merged[!duplicated(minuteSleep_merged), ]
First, let’s see the number of users (Id) in each data frame.
n_distinct(dailyActivity_merged$Id)
n_distinct(dailyCalories_merged$Id)
n_distinct(dailyIntensities_merged$Id)
n_distinct(dailySteps_merged$Id)
n_distinct(heartrate_seconds_merged$Id)
n_distinct(hourlyCalories_merged$Id)
n_distinct(hourlyIntensities_merged$Id)
n_distinct(hourlySteps_merged$Id)
n_distinct(minuteCaloriesNarrow_merged$Id)
n_distinct(minuteCaloriesWide_merged$Id)
n_distinct(minuteIntensitiesNarrow_merged$Id)
n_distinct(minuteIntensitiesWide_merged$Id)
n_distinct(minuteMETsNarrow_merged$Id)
n_distinct(minuteSleep_merged$Id)
n_distinct(minuteStepsNarrow_merged$Id)
n_distinct(minuteStepsWide_merged$Id)
n_distinct(sleepDay_merged$Id)
n_distinct(weightLogInfo_merged$Id)
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 14
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 0
## [1] 33
## [1] 0
## [1] 33
## [1] 24
## [1] 33
## [1] 0
## [1] 24
## [1] 8
Wide_merged data frames have 0 number of Id probably
because they are modified version of Narrow_merged data frames with
additional column. All the columns are renamed to capital V with tag
number, Id column are rename to V1. In that column ,the name “Id” itself
are underneath the new header V1
n_distinct(minuteCaloriesWide_merged$V1)
n_distinct(minuteIntensitiesWide_merged$V1)
n_distinct(minuteStepsWide_merged$V1)
## [1] 34
## [1] 34
## [1] 34
So each Wide merged table indicate 33 participants (not
included second column header).
Now let’s check the consistency of all daily-based data frame
to see whether or not each users have the same number of days
participated in every daily-based data frame. If the number are the
same, we can merge all daily-base data frame into one.
dailyActivity_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityDate))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1644430081 30
## 4 1844505072 31
## 5 1927972279 31
## 6 2022484408 31
## 7 2026352035 31
## 8 2320127002 31
## 9 2347167796 18
## 10 2873212765 31
## # ℹ 23 more rows
dailyCalories_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1644430081 30
## 4 1844505072 31
## 5 1927972279 31
## 6 2022484408 31
## 7 2026352035 31
## 8 2320127002 31
## 9 2347167796 18
## 10 2873212765 31
## # ℹ 23 more rows
dailyIntensities_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1644430081 30
## 4 1844505072 31
## 5 1927972279 31
## 6 2022484408 31
## 7 2026352035 31
## 8 2320127002 31
## 9 2347167796 18
## 10 2873212765 31
## # ℹ 23 more rows
dailySteps_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityDay))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1644430081 30
## 4 1844505072 31
## 5 1927972279 31
## 6 2022484408 31
## 7 2026352035 31
## 8 2320127002 31
## 9 2347167796 18
## 10 2873212765 31
## # ℹ 23 more rows
This turns out to be number of day recorded for each
user are all the same in all these four data frame. So we can combine
them into one but it seem to be that only the data frame
dailyActivity_merged already have all the data the other three data
frame have. So using only dailyActivity_merged would be enough.
Applying the same process with all hourly-based table. Check
the number of time recorded for each users.
hourlyCalories_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 717
## 2 1624580081 736
## 3 1644430081 708
## 4 1844505072 731
## 5 1927972279 736
## 6 2022484408 736
## 7 2026352035 736
## 8 2320127002 735
## 9 2347167796 414
## 10 2873212765 736
## # ℹ 23 more rows
hourlyIntensities_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 717
## 2 1624580081 736
## 3 1644430081 708
## 4 1844505072 731
## 5 1927972279 736
## 6 2022484408 736
## 7 2026352035 736
## 8 2320127002 735
## 9 2347167796 414
## 10 2873212765 736
## # ℹ 23 more rows
hourlySteps_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityHour))
## # A tibble: 33 × 2
## Id count
## <dbl> <int>
## 1 1503960366 717
## 2 1624580081 736
## 3 1644430081 708
## 4 1844505072 731
## 5 1927972279 736
## 6 2022484408 736
## 7 2026352035 736
## 8 2320127002 735
## 9 2347167796 414
## 10 2873212765 736
## # ℹ 23 more rows
All these three tables have the same set of Id and each
Id also has the same number of recorded hours.
Before processing daily-based and hourly-based data frames.
Let’s first prepare some other data frames that will be used
individually.
We start by organizing time recorded in heartrate_seconds_merged
table.
heartrate_seconds_merged$Time <- parse_date_time(
heartrate_seconds_merged$Time,
"%m/%d/%y %I:%M:%S %p"
)
Then extract date as “ActivityDay” using date()
heartrate_seconds_merged <- heartrate_seconds_merged %>%
mutate(ActivityDay = date(Time))
Check days of participation for each users.
heartrate_seconds_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(ActivityDay))
## # A tibble: 14 × 2
## Id count
## <dbl> <int>
## 1 2022484408 31
## 2 2026352035 4
## 3 2347167796 18
## 4 4020332650 16
## 5 4388161847 30
## 6 4558609924 31
## 7 5553957443 31
## 8 5577150313 28
## 9 6117666160 23
## 10 6775888955 18
## 11 6962181067 31
## 12 7007744171 24
## 13 8792009665 18
## 14 8877689391 31
Only 14 participants with days of participation vary from 4 to 31
Applying the same process with minuteSleep_merged table.
minuteSleep_merged$date <- parse_date_time(
minuteSleep_merged$date,
"%m/%d/%y %I:%M:%S %p"
)
Organize time recorded in minuteMETsNarrow_merged table
from character class into date time class.
minuteMETsNarrow_merged$ActivityMinute <-
parse_date_time(minuteMETsNarrow_merged$ActivityMinute,
"%m/%d/%y %I:%M:%S %p"
)
And also for dailyActivity_merged
dailyActivity_merged$ActivityDate <-
mdy(dailyActivity_merged$ActivityDate)
METs value need to be divided by 10, and then add more
columns with extracted date and hour from ActivityMinute. The data from
this minuteMETsNarrow_merged table will later be incorporated into daily
and hourly based tables.
minuteMETsNarrow_merged <-
mutate(minuteMETsNarrow_merged, METs = METs/10,
date = date(ActivityMinute),
ActivityHour = floor_date(ActivityMinute, "hour"))
For daily METs value, we use average average METs value
calculated by mean function.
METs_daily <- minuteMETsNarrow_merged %>%
group_by(Id,date) %>%
summarize(METs = mean(METs),
.groups='drop')
This table will be combined with dailyActivity_merged,
but the number of day in this table contain 6 row less than
dailyActivity_merged. A little bit of data will be lost, but the
combination will give bigger and more complex picture of how all these
data related.
Activity_METs_daily <-
inner_join(dailyActivity_merged, METs_daily,
by = c("Id","ActivityDate" = "date"))
Besides using this table for creating box plot, We
would also use it for creating bar chart.
Add weekday colomn into this table.
Activity_METs_daily <- Activity_METs_daily %>%
mutate(Weekday = wday(ActivityDate, label = T,
week_start = 1))
We incorporate sleepDay_merged table into daily-based tables
because we want to see how sleeping time can have some effect on other
values such as calories burned, METs, etc.
sleepDay_merged %>%
group_by(Id) %>%
summarise(count = n_distinct(SleepDay))
## # A tibble: 24 × 2
## Id count
## <dbl> <int>
## 1 1503960366 25
## 2 1644430081 4
## 3 1844505072 3
## 4 1927972279 5
## 5 2026352035 28
## 6 2320127002 1
## 7 2347167796 15
## 8 3977333714 28
## 9 4020332650 8
## 10 4319703577 26
## # ℹ 14 more rows
There’re only 24 users with only 410 rows in
SleepDay_merged table compared to 33 users with 940 rows on other
daily-based tables. To combine them together we would loose significant
amount of data. So we only use this combination only to see the effect
of sleeping on other values.
sleepDay_merged <- sleepDay_merged %>%
mutate(DateTime = mdy_hms(SleepDay))
Activity_Sleep_daily <-
inner_join(dailyActivity_merged, sleepDay_merged,
by = c("Id","ActivityDate" = "DateTime"))
Adding sleep efficiency column, the value is the
percentage of total minutes asleep divided by total time in bed. Normal
sleep efficiency is considered to be 80% or greater.
Activity_Sleep_daily <- Activity_Sleep_daily %>%
mutate(
Sleep_Efficiency = ((TotalMinutesAsleep/TotalTimeInBed)*100)
)
Then combine Activity_Sleep_daily table with METs_daily
to be SMAd (Sleep_METs_Activity_daily)
SMAd <-
inner_join(Activity_Sleep_daily, METs_daily,
by = c("Id","ActivityDate" = "date"))
Add weekday column to SMAd
SMAd <- SMAd %>%
mutate(Weekday = wday(ActivityDate, label = T,
week_start = 1))
SMAd table will be use for almost every daily based
plot
Create data frame Activity_METs_daily_long for a chart that display
each activity minutes into box plot
Activity_METs_daily_long <- Activity_METs_daily %>%
pivot_longer(cols = c(
'VeryActiveMinutes',
'FairlyActiveMinutes',
'LightlyActiveMinutes',
'SedentaryMinutes'
),
names_to='ActivityMinutes',
values_to='Minutes'
)
It’s done for daily-based tables. Now, let’s try to
join all hourly-based tables together.
list_df <- list(hourlyCalories_merged,
hourlyIntensities_merged,
hourlySteps_merged)
hourly_outer_join <- list_df %>%
reduce(full_join, by= c('Id','ActivityHour'))
Change time format from character to datetime
hourly_outer_join$ActivityHour <- parse_date_time(
hourly_outer_join$ActivityHour,
"%m/%d/%y %I:%M:%S %p"
)
Add more columns from extracting ActivityHour into
time, date, day, and weekday from datetime
hourly_outer_join <- hourly_outer_join %>%
mutate(Weekday = wday(ActivityHour, label = T, week_start = 1),
date = date(ActivityHour),
Day = day(ActivityHour),
Time = format(ActivityHour, "%k")
)
Create hourly_METs data frame for merging with
hourly-based tables.
hourly_METs <- minuteMETsNarrow_merged %>%
group_by(Id,ActivityHour) %>%
summarize(METs = mean(METs),
.groups='drop')
Combining hourly_METs table with other hourly-based
table called hourly_outer_joint_1
hourly_outer_join_1 <-
inner_join(hourly_outer_join, hourly_METs,
by = c("Id","ActivityHour"))
Create data frame for labeling pie chart from the table
minuteIntensitiesNarrow_merged. This label indicated the percentage of
each intensity value based on time spending for each users. We call this
data frame df_1
df_1 <- minuteIntensitiesNarrow_merged %>%
group_by(Id) %>%
mutate(countId= n()) %>%
ungroup %>%
group_by(Id,countId,Intensity) %>%
summarise(count=n(),
.groups='drop' ) %>%
group_by(Id) %>%
mutate(cumcount=cumsum(count),
pos=cumsum(count)-count/2,
per=paste0(round(100*count/countId,2),'%')) %>%
ungroup
And for the data frame minuteSleep_merged will be used
in plotting pie chart, we also need another data frame called df_2 for
labeling each segment as well.
df_2 <- minuteSleep_merged %>%
group_by(Id) %>%
mutate(countId= n()) %>%
ungroup %>%
group_by(Id,countId,value) %>%
summarise(count=n(),
.groups='drop' ) %>%
group_by(Id) %>%
mutate(cumcount=cumsum(count),
pos=cumsum(count)-count/2,
per=paste0(round(100*count/countId,2),'%')) %>%
ungroup
In this project, all 33 participants will be
categorized based on their participation and time spending on intense
activities. To make it easier each Id will be replaced by user number
from 01 to 33 by creating another column that tags each Id with
simplified number.
thirtythree <- c("01", "02", "03", "04", "05", "06",
"07", "08", "09", 10:33)
Users <- paste("User", thirtythree, sep = " ")
Id <- unlist(distinct(dailyActivity_merged, Id))
users_id <- tibble(Users, Id)
Then attach this newly created column to
dailyActivity_merged table. This modified data frame will be used to
create two more columns that categorized users based on days of
participation and level of intensity.
Participation level of each user will be based on number of days
recorded in this table. Activity level of each user will be based on the
median value of “very active minutes” recorded in dailyActivity_merged
table. which means Users with high activity level, or spend more minutes
each day with high intensity activity are likely to do exercise
regularly.
dailyActivity_merged <- dailyActivity_merged %>%
left_join(users_id, by = "Id")
Create a data frame with 2 column called ALPL, which
stands for Activity Level and Participation Level.
This data
frame can be attached to any data frames in this study to put tag on
each user to see how they spend their time on intense activity, and how
often they participated in the program.
ALPL <- dailyActivity_merged %>%
group_by(Users, Id) %>%
summarize(
MedVeryActive = median(VeryActiveMinutes),
UsageRecords = n_distinct(ActivityDate),
.groups='drop'
) %>%
mutate(
ActivityLevel = case_when(
MedVeryActive < 4 ~ "Low",
MedVeryActive <= 32 ~ "Med",
MedVeryActive < 211 ~ "High"
),
ParticipationLevel = case_when(
UsageRecords < 14 ~ "Low Usage",
UsageRecords < 21 ~ "Moderate Usage",
UsageRecords < 31 ~ "High Usage",
UsageRecords == 31 ~ "Daily Usage",
),
ActivityLevel = factor(
ActivityLevel,
levels=c('Low', 'Med', 'High')
),
ParticipationLevel = factor(
ParticipationLevel,
levels=c('Low Usage','Moderate Usage',
'High Usage','Daily Usage')
)
) %>%
subset(select = -c(MedVeryActive,UsageRecords))
Our activity level is based the median value of very active minutes each user spent in each day, which can tell how much exercise they do on regular basis. We categorize activity level into “High” for users who have median value of time over 32 minutes each day spending on intense activity, “Med” for users with median value from over 4 minutes to 32 spending on intense activity, and “Low” for users who spent 4 minutes or less.
For participation level, we categorize users based on number of days, from dailActivity_merged table, each user has their data recorded in this table. We have “Daily Usage” for users who have their data recorded everyday from beginning to the end to program, “High Usage” for users who participate from 21 to 30 days, “Moderate Usage” for 14 to 20 days, and “Low Usage” for users who participate less than 14 days.
Attach this ALPL data frame to all the data frames we need to analyze on participation and intensity. So we can see how intensity level and participation level of user can tell about the different in other outcome.
hourly_outer_join_1_ALPL <- hourly_outer_join_1 %>%
left_join(ALPL, by = "Id")
Activity_METs_daily_ALPL <- Activity_METs_daily %>%
left_join(ALPL, by = "Id")
SMAd_ALPL <- SMAd %>%
left_join(ALPL, by = "Id")
minuteIntensitiesNarrow_merged_ALPL <-
minuteIntensitiesNarrow_merged %>%
left_join(ALPL, by = "Id")
minuteSleep_merged_ALPL <- minuteSleep_merged %>%
left_join(ALPL, by = "Id")
df_1_ALPL <- df_1 %>%
left_join(ALPL, by = "Id")
df_2_ALPL <- df_2 %>%
left_join(ALPL, by = "Id")
Activity_METs_daily_long_ALPL <- Activity_METs_daily_long %>%
left_join(ALPL, by = "Id")
heartrate_seconds_merged_ALPL <- heartrate_seconds_merged %>%
left_join(ALPL, by = "Id") %>%
mutate(Weekday = wday(Time,
label = T, week_start = 1),
Hour = hour(Time))
weightLogInfo_merged_ALPL <- weightLogInfo_merged %>%
left_join(ALPL, by = "Id")
cp <- coord_polar(theta = "y")
cp$is_free <- function() TRUE
options(ggrepel.max.overlaps = Inf)
minuteIntensitiesNarrow_merged_ALPL %>%
ggplot(aes(x=1, fill=factor(Intensity))) +
geom_bar(position = "fill", width = 1) +
cp +
scale_fill_manual(values=c("dodgerblue",
"olivedrab1",
"orangered",
"yellow1")) +
facet_wrap(~Users~ActivityLevel~ParticipationLevel
, scales = "free") +
geom_label_repel(data = df_1_ALPL,
aes(label = per, y=pos),
position = position_fill(vjust = 0.5),
size = 3.0,
show.legend = FALSE) +
theme(aspect.ratio = 1) +
guides(fill = guide_legend(title = "Intensity")) +
theme_classic() +
theme(axis.text.y = element_blank(),
strip.background = element_rect(fill = "thistle1"),
legend.position = "bottom") +
labs(title="Intensity Distribution",
subtitle="By User, Activity Level, Participation Level",
y="Intensity Percentage")
w <- hourly_outer_join_1_ALPL %>%
ggplot() +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 90,vjust=0.5))
ggarrange(
w + geom_boxplot(aes(Time, Calories),
fill="burlywood4",
width=0.7) +
labs(x='Time',
y='Calories',
title = "Calories vs Time of Day"),
w + geom_boxplot(aes(Time, TotalIntensity),
fill="royalblue4",
width=0.7) +
labs(x='Time',
y='Total Intensity',
title = "Total Intensity vs Time of Day"),
w + geom_boxplot(aes(Time, StepTotal),
fill="red1",
width=0.7) +
labs(x='Time',
y='Total Steps',
title = "Total Steps vs Time of Day"),
w + geom_boxplot(aes(Time, METs),
fill="olivedrab1",
width=0.7) +
labs(x='Time',
y='METs',
title = "METs vs Time of Day"),
nrow=4
)
options(repr.plot.width=20, repr.plot.height=80)
z1 <- hourly_outer_join_1_ALPL %>%
group_by(Weekday, Time) %>%
summarize(medCal = round(median(Calories)),
medInt = round(median(TotalIntensity)),
medStep = round(median(StepTotal)),
medMETs = round(median(METs), digits = 2),
.groups='drop') %>%
ggplot(aes(Time, Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
guides(fill = guide_colourbar(barwidth = 0.45,
barheight = 6))
ggarrange(
z1 + geom_tile(aes(fill=medCal),colour = "white") +
geom_text(aes(label = medCal),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Calories",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Calories Distribution per Weekday"),
z1 + geom_tile(aes(fill=medInt),colour = "white") +
geom_text(aes(label = medInt),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Intensity",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Intensity Distribution per Weekday"),
z1 + geom_tile(aes(fill=medStep),colour = "white") +
geom_text(aes(label = medStep),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Steps",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Steps Distribution per Weekday"),
z1 + geom_tile(aes(fill=medMETs),colour = "white") +
geom_text(aes(label = medMETs),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median METs",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median METs Distribution per Weekday"),
nrow = 4
)
hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Weekday, Time) %>%
summarize(medCal = round(median(Calories)),
.groups='drop') %>%
ggplot(aes(Time, Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
geom_tile(aes(fill=medCal),colour = "white") +
geom_text(aes(label = medCal),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Calories",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Calories Distribution per Weekday") +
facet_wrap(~ActivityLevel, ncol = 1)
hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Weekday, Time) %>%
summarize(medInt = round(median(TotalIntensity)),
.groups='drop') %>%
ggplot(aes(Time, Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
geom_tile(aes(fill=medInt),colour = "white") +
geom_text(aes(label = medInt),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Intensity",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Intensity Distribution per Weekday") +
facet_wrap(~ActivityLevel, ncol = 1)
hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Weekday, Time) %>%
summarize(medStep = round(median(StepTotal)),
.groups='drop') %>%
ggplot(aes(Time, Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
geom_tile(aes(fill=medStep),colour = "white") +
geom_text(aes(label = medStep),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Steps",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Steps Distribution per Weekday") +
facet_wrap(~ActivityLevel, ncol = 1)
hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Weekday, Time) %>%
summarize(medMETs = round(median(METs), digits = 2),
.groups='drop') %>%
ggplot(aes(Time, Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
geom_tile(aes(fill=medMETs),colour = "white") +
geom_text(aes(label = medMETs),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median METs",
guide = guide_colorbar(
title.position = "top",
direction = "vertical")) +
labs(x='Time',
y='Weekday',
title = "Hourly Median METs Distribution per Weekday") +
facet_wrap(~ActivityLevel, ncol = 1)
s <- hourly_outer_join_1_ALPL %>%
group_by(Weekday) %>%
summarize(avgCal = mean(Calories),
avgInt = mean(TotalIntensity),
avgStep = mean(StepTotal),
avgMETs = mean(METs),
.groups='drop') %>%
ggplot() +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8))
ggarrange(
s + geom_col(aes(Weekday, avgCal),
fill="darkolivegreen1",
width=0.6) +
labs(x='Weekday',
y='Average Calories',
title = "Hourly Average Calories Distribution per Weekday"),
s + geom_col(aes(Weekday, avgInt),
fill="darkorange",
width=0.6) +
labs(x='Weekday',
y='Average Intensity',
title = "Hourly Average Intensity Distribution per Weekday"),
s + geom_col(aes(Weekday, avgStep),
fill="cadetblue1",
width=0.6) +
labs(x='Weekday',
y='Average Steps',
title = "Hourly Average Steps Distribution per Weekday"),
s + geom_col(aes(Weekday, avgMETs),
fill="mediumpurple1",
width=0.6) +
labs(x='Weekday',
y='Average METs',
title = "Hourly Average METss Distribution per Weekday")
)
The next bar chart will display the value of each activity level side by side.
G <- hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Weekday) %>%
summarize(avgCal = mean(Calories),
avgInt = mean(TotalIntensity),
avgStep = mean(StepTotal),
avgMETs = mean(METs),
.groups='drop') %>%
ggplot() +
scale_fill_manual(values=
c("gold1",
"red",
"grey0")) +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 90,vjust=0.5))
ggarrange(
G + geom_col(aes(Weekday, avgCal,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Calories',
title = "Hourly Average Calories Distribution per Weekday",
subtitle = "By Activity Level"),
G + geom_col(aes(Weekday, avgInt,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Intensity',
title = "Hourly Average Intensity Distribution per Weekday",
subtitle = "By Activity Level"),
G + geom_col(aes(Weekday, avgStep,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Steps',
title = "Hourly Average Steps Distribution per Weekday",
subtitle = "By Activity Level"),
G + geom_col(aes(Weekday, avgMETs,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average METs',
title = "Hourly Average METs Distribution per Weekday",
subtitle = "By Activity Level")
)
The chart below is weekday chart with the daily-based values of calories, distance, steps and METs. All the data are taken from Activity_METs_daily_ALPL table. We can compare patterns of the same value from both daily-base data and hourly-base data
J <- Activity_METs_daily_ALPL %>%
group_by(Weekday) %>%
summarize(avgCal = mean(Calories),
avgDis = mean(TotalDistance),
avgStep = mean(TotalSteps),
avgMETs = mean(METs),
.groups='drop') %>%
ggplot() +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8))
ggarrange(
J + geom_col(aes(Weekday, avgCal),
fill="darkolivegreen1",
width=0.6) +
labs(x='Weekday',
y='Average Calories',
title = "Daily Average Calories Distribution per Weekday"),
J + geom_col(aes(Weekday, avgDis),
fill="red",
width=0.6) +
labs(x='Weekday',
y='Average Distance',
title = "Daily Average Distance Distribution per Weekday"),
J + geom_col(aes(Weekday, avgStep),
fill="cadetblue1",
width=0.6) +
labs(x='Weekday',
y='Average Steps',
title = "Daily Average Steps Distribution per Weekday"),
J + geom_col(aes(Weekday, avgMETs),
fill="mediumpurple1",
width=0.6) +
labs(x='Weekday',
y='Average METs',
title = "Daily Average METss Distribution per Weekday")
)
The next bar chart will display the value of each activity level side by side.
J1 <- Activity_METs_daily_ALPL %>%
group_by(ActivityLevel,Weekday) %>%
summarize(avgCal = mean(Calories),
avgDis = mean(TotalDistance),
avgStep = mean(TotalSteps),
avgMETs = mean(METs),
.groups='drop') %>%
ggplot() +
scale_fill_manual(values=
c("sandybrown",
"peru",
"darkkhaki")) +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 90,vjust=0.5))
ggarrange(
J1 + geom_col(aes(Weekday, avgCal,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Calories',
title = "Daily Average Calories Distribution per Weekday",
subtitle = "By Activity Level"),
J1 + geom_col(aes(Weekday, avgDis,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Distance',
title = "Daily Average Distance Distribution per Weekday",
subtitle = "By Activity Level"),
J1 + geom_col(aes(Weekday, avgStep,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average Steps',
title = "Daily Average Steps Distribution per Weekday",
subtitle = "By Activity Level"),
J1 + geom_col(aes(Weekday, avgMETs,
fill=ActivityLevel),
position = "dodge") +
labs(x='Weekday',
y='Average METs',
title = "Daily Average METss Distribution per Weekday",
subtitle = "By Activity Level")
)
v <- hourly_outer_join_1_ALPL %>%
group_by(Time) %>%
summarize(medCal = median(Calories),
medInt = median(TotalIntensity),
medStep = median(StepTotal),
medMETs = median(METs),
.groups='drop') %>%
ggplot() +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 90,vjust=0.5))
ggarrange(
v + geom_col(aes(Time, medCal),
fill="burlywood4",
width=0.6) +
labs(x='Time',
y='Median Calories',
title = "Median Calories vs Time of Day"),
v + geom_col(aes(Time, medInt),
fill="royalblue4",
width=0.6) +
labs(x='Time',
y='Median Intensity',
title = "Median Intensity vs Time of Day"),
v + geom_col(aes(Time, medStep),
fill="red1",
width=0.6) +
labs(x='Time',
y='Median Steps',
title = "Median Steps vs Time of Day"),
v + geom_col(aes(Time, medMETs),
fill="olivedrab1",
width=0.6) +
labs(x='Time',
y='Median METs',
title = "Median METs vs Time of Day"),
nrow=4
)
E <- hourly_outer_join_1_ALPL %>%
group_by(ActivityLevel,Time) %>%
summarize(medCal = median(Calories),
medInt = median(TotalIntensity),
medStep = median(StepTotal),
medMETs = median(METs),
.groups='drop') %>%
ggplot() +
scale_fill_manual(values=
c("red",
"green",
"blue4")) +
theme(plot.title=element_text(size=11),
axis.title=element_text(size=9),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 90,vjust=0.5))
ggarrange(
E + geom_col(aes(Time,
medCal,
fill=ActivityLevel),
position = "dodge") +
labs(x="Time",
y="Median Calories",
title = "Median Calories vs Time of Day",
subtitle = "By Activity Level"),
E + geom_col(aes(Time,
medInt,
fill=ActivityLevel),
position = "dodge") +
labs(x="Time",
y="Median Intensity",
title = "Median Intensity vs Time of Day",
subtitle = "By Activity Level"),
E + geom_col(aes(Time,
medStep,
fill=ActivityLevel),
position = "dodge") +
labs(x="Time",
y="Median Steps",
title = "Median Steps vs Time of Day",
subtitle = "By Activity Level"),
E + geom_col(aes(Time,
medMETs,
fill=ActivityLevel),
position = "dodge") +
labs(x="Time",
y="Median METs",
title = "Median METs vs Time of Day",
subtitle = "By Activity Level"),
nrow=4
)
Regression Model describes the correlation between a dependent variable, y, and one or more independent variables, X. From all the values in hourly-based data frame we have, we can see from our scatter plots that many values have almost linear relationship to one another. Linear regression model created from all these data can forecast the outcome from the data based on behavioral factors.
The functional form of Multiple Linear Regression is :
\[y = \beta_{0} + \beta_{1} x_{1} +
\beta_{2} x_{2} + ... + \beta_{n} x_{n}\]
We use lm() to find all the coefficients \(\beta_{0}, \beta_{1}, \beta_{2},...,\beta_{n}\) for all the independent variables \(x_{1}, x_{2},..., x_{n}\) we have.
options(scipen = 999)
lmMETs_Intensity <-
lm(METs ~ TotalIntensity,
data=hourly_outer_join_1_ALPL)
summary(lmMETs_Intensity)
##
## Call:
## lm(formula = METs ~ TotalIntensity, data = hourly_outer_join_1_ALPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2173 -0.0258 -0.0054 0.0046 4.9917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00543699 0.00141218 712.0 <0.0000000000000002 ***
## TotalIntensity 0.03851601 0.00005806 663.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1824 on 22091 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.9522
## F-statistic: 4.4e+05 on 1 and 22091 DF, p-value: < 0.00000000000000022
From the regression analysis we get from summary function, the significance level of intensity(TotalIntensity) indicates this variable contributes to the explanation of the dependent variable(METs) significantly (three asterisks). R-squared is 0.9522 and very low p-value which is quite good.
options(scipen = 999)
lmMETs_Step <-
lm(METs ~
StepTotal,
data=hourly_outer_join_1_ALPL)
summary(lmMETs_Step)
##
## Call:
## lm(formula = METs ~ StepTotal, data = hourly_outer_join_1_ALPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1751 -0.1190 -0.0933 0.0553 7.5743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.118981219 0.002634363 424.8 <0.0000000000000002 ***
## StepTotal 0.001093226 0.000003461 315.8 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3552 on 22091 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8187
## F-statistic: 9.975e+04 on 1 and 22091 DF, p-value: < 0.00000000000000022
In this model, Step also contribute significantly to explanation of METs, with very low p-value and R-squared is 0.8187.
options(scipen = 999)
lmMETs_StepandIntensity <-
lm(METs ~
StepTotal +
TotalIntensity,
data=hourly_outer_join_1_ALPL)
summary(lmMETs_StepandIntensity)
##
## Call:
## lm(formula = METs ~ StepTotal + TotalIntensity, data = hourly_outer_join_1_ALPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3580 -0.0209 -0.0114 0.0036 3.9978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0114400 0.0013462 751.31 <0.0000000000000002 ***
## StepTotal 0.0001869 0.0000038 49.19 <0.0000000000000002 ***
## TotalIntensity 0.0330450 0.0001241 266.22 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1732 on 22090 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9569
## F-statistic: 2.453e+05 on 2 and 22090 DF, p-value: < 0.00000000000000022
In this model, we incorporate both step and intensity to explain the correlation of these two independent variables and METs. The result is even better than the previous two models. Very low p-value and better R-squared of 0.9569
METs = 1.0114400 + 0.0001869(StepTotal) + 0.0330450(TotalIntensity)
Create regression diagnostics plot.
plot(lmMETs_StepandIntensity)
heartrate_seconds_merged_ALPL %>%
mutate(
Hour = format(Time, "%k")
) %>% ggplot() +
geom_boxplot(aes(Hour,Value,fill=Hour)) +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "palegoldenrod")) +
facet_wrap(~Users~ActivityLevel) +
labs(x='Time',
y='Heart Rate Value',
title = "Hourly Heart Rate for Each User")
heartrate_seconds_merged_ALPL %>%
group_by(Weekday,
Hour,
ActivityLevel) %>%
summarize(medValue = round(median(Value)),
.groups='drop') %>%
ggplot(aes(Hour,Weekday)) +
scale_y_discrete(limits = rev) +
theme(axis.text.x = element_text(angle = 90)) +
geom_tile(aes(fill=medValue),colour = "white") +
geom_text(aes(label = medValue),
size = 2.5, color = "black") +
scale_fill_viridis(option = "H",
name = "Median Heart Rate") +
facet_wrap(~ActivityLevel, ncol = 1) +
labs(x='Time',
y='Weekday',
title = "Hourly Median Heart Rate Distribution per Weekday",
subtitle = "By Activity Level")
minuteSleep_merged_ALPL %>%
ggplot(aes(x=1, fill=factor(value))) +
geom_bar(position = "fill", width = 1) +
cp +
scale_fill_manual(values=c("cadetblue1",
"goldenrod1",
"firebrick1")) +
facet_wrap(~Users~ActivityLevel~ParticipationLevel,
scales = "free") +
geom_label_repel(data = df_2_ALPL,
aes(label = per, y=pos),
position = position_fill(vjust = 0.5),
size = 3.0,
show.legend = FALSE) +
theme(aspect.ratio = 1) +
guides(fill = guide_legend(title = "Value")) +
theme_classic() +
theme(axis.text.y = element_blank(),
strip.background = element_rect(fill = "thistle1"),
legend.position = "bottom") +
labs(title="Sleeping Value Distribution",
subtitle = "By User, Activity Level, Participation Level",
y="Sleeping Value Percentage")
Start from this chart, we will use data from daily-based data frames, like Activity_METs_daily_long_ALPL from this chart or Activity_METs_daily_ALPL for the next charts with additional data on sleeping. For the chart involved sleeping, we use data from SMAd_ALPL (which stand for Sleep, METs, Activity daily) instead.
The reason is because we have lost huge chunk of data (from 934 rows to 410 rows ) while incorporating sleep data into daily activity data, due to less number of participant (only 24 users) on SleepDay_merged table. We want all the charts created to be as inclusive as possible, So we would use data from SMAd table only for the charts involved sleeping.
Activity_METs_daily_long_ALPL %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot() +
geom_boxplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
theme_classic()
The next chart we can see distribution of each activity minutes on weekday in closer look.
MM <- Activity_METs_daily_ALPL %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot() +
scale_fill_manual(values=
c("gold1",
"lightpink",
"olivedrab2",
"darkorange",
"cadetblue1",
"mediumorchid2",
"firebrick2")) +
theme_classic()+
theme(axis.text.x = element_text(angle = 90))
ggarrange(
MM + geom_boxplot(aes(Weekday,
VeryActiveMinutes,
fill=Weekday)) +
labs(x='Weekday',
y='Very Active Minutes',
title = "Very Active Minutes Distribution per Weekday"),
MM + geom_boxplot(aes(Weekday,
FairlyActiveMinutes,
fill=Weekday)) +
labs(x='Weekday',
y='Fairy Active Minutes',
title = "Fairy Active Minutes Distribution per Weekday"),
MM + geom_boxplot(aes(Weekday,
LightlyActiveMinutes,
fill=Weekday)) +
labs(x='Weekday',
y='Lightly Active Minutes',
title = "Lightly Active Minutes Distribution per Weekday"),
MM + geom_boxplot(aes(Weekday,
SedentaryMinutes,
fill=Weekday)) +
labs(x='Weekday',
y='Sedentary Minutes',
title = "Sedentary Minutes Distribution per Weekday")
)
We also can create box plot for each activity level to see the differences in activity minutes distribution per weekday.
Activity_METs_daily_long_ALPL %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot() +
geom_boxplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
theme_classic() +
facet_wrap(~ActivityLevel, ncol = 1) +
theme(strip.background = element_rect(fill = "aliceblue")) +
labs(x='Minutes',
y='Weekday',
title = "Box Plot for Activity Minutes per Weekday",
subtitle = "By Activity Level")
Next chart is the activity minutes stacked bar for each individual user.
Activity_METs_daily_long_ALPL %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
geom_bar(width = 0.7,stat = "summary", fun = "mean") +
theme_classic() +
facet_wrap(~Users~ActivityLevel) +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "thistle1")) +
labs(title="Weekday Average Activity Minutes Stacked Bar",
subtitle="By User, Activity Level, Participation Level")
Next chart is weekday average activity minutes stacked bars categorized by activity level.
Activity_METs_daily_long_ALPL %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot(aes(Weekday,Minutes,fill=ActivityMinutes)) +
geom_bar(width = 0.7,stat = "summary", fun = "mean") +
theme_classic() +
facet_wrap(~ActivityLevel) +
theme(axis.text.x = element_text(angle = 90,vjust=0.5),
strip.background = element_rect(fill = "thistle1")) +
labs(title="Weekday Average Activity Minutes Stacked Bar",
subtitle="By User, Activity Level, Participation Level")
AM <- Activity_METs_daily %>%
mutate(Weekday = wday(ActivityDate,
label = T,
week_start = 1)) %>%
ggplot() +
scale_fill_manual(values=
c("gold1",
"lightpink",
"olivedrab2",
"darkorange",
"cadetblue1",
"mediumorchid2",
"firebrick2")) +
theme_classic()+
theme(axis.text.x = element_text(angle = 90))
ggarrange(
AM + geom_boxplot(aes(Weekday,
TotalSteps,
fill=Weekday)) +
labs(x='Weekday',
y='Total Steps',
title = "Total Steps Distribution per Weekday"),
AM + geom_boxplot(aes(Weekday,
Calories,
fill=Weekday)) +
labs(x='Weekday',
y='Calories',
title = "Calories Distribution per Weekday"),
AM + geom_boxplot(aes(Weekday,
METs,
fill=Weekday)) +
labs(x='Weekday',
y='METs',
title = "METs Distribution per Weekday"),
AM + geom_boxplot(aes(Weekday,
TotalDistance,
fill=Weekday)) +
labs(x='Weekday',
y='Total Distance',
title = "Total Distance Distribution per Weekday")
)
Now we use SMAd_ALPL table which stands for Sleep, METs and Activity daily in order to see how sleeping pattern can have some effect on other activities during waking hours, or some biological processes that happen throughout the day like calories burned and METs.
But before we creating any chart, let’s use correlation matrix to see how all the data related to each other
ggcorr(SMAd_ALPL[, 2:21],
label = TRUE,
hjust = 1.0,
angle = -45,
layout.exp = 1)
Let’s create multiple linear regression for this data frame and see how we can develop simple equation for all these relationship
options(scipen = 999)
modSMAd_ALPL <- lm(Calories ~
TotalSteps +
VeryActiveMinutes +
FairlyActiveMinutes +
LightlyActiveMinutes +
SedentaryMinutes+
VeryActiveDistance +
ModeratelyActiveDistance +
LightActiveDistance +
SedentaryActiveDistance +
TotalSleepRecords +
TotalMinutesAsleep +
TotalTimeInBed +
Sleep_Efficiency,
data=SMAd_ALPL)
summary(modSMAd_ALPL)
##
## Call:
## lm(formula = Calories ~ TotalSteps + VeryActiveMinutes + FairlyActiveMinutes +
## LightlyActiveMinutes + SedentaryMinutes + VeryActiveDistance +
## ModeratelyActiveDistance + LightActiveDistance + SedentaryActiveDistance +
## TotalSleepRecords + TotalMinutesAsleep + TotalTimeInBed +
## Sleep_Efficiency, data = SMAd_ALPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -759.5 -232.3 -26.5 189.2 4377.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2411.4893 799.7185 -3.015 0.00273 **
## TotalSteps -0.2179 0.0229 -9.515 < 0.0000000000000002 ***
## VeryActiveMinutes 14.4678 1.2094 11.962 < 0.0000000000000002 ***
## FairlyActiveMinutes 17.5818 3.2438 5.420 0.00000010369209365 ***
## LightlyActiveMinutes -1.2111 0.4936 -2.454 0.01457 *
## SedentaryMinutes 1.3626 0.1650 8.256 0.00000000000000227 ***
## VeryActiveDistance 210.9534 37.5440 5.619 0.00000003629868202 ***
## ModeratelyActiveDistance -13.7444 77.5118 -0.177 0.85935
## LightActiveDistance 543.9142 33.4873 16.242 < 0.0000000000000002 ***
## SedentaryActiveDistance 1391.7987 2277.3383 0.611 0.54145
## TotalSleepRecords 88.2852 58.5429 1.508 0.13234
## TotalMinutesAsleep -1.2354 1.6156 -0.765 0.44492
## TotalTimeInBed 2.1642 1.4485 1.494 0.13595
## Sleep_Efficiency 25.5263 8.4937 3.005 0.00282 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 392.6 on 396 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.732
## F-statistic: 86.95 on 13 and 396 DF, p-value: < 0.00000000000000022
So the adjusted model would be as following.
options(scipen = 999)
modSMAd_ALPL_redux <- lm(Calories ~
TotalSteps +
VeryActiveMinutes +
FairlyActiveMinutes +
LightlyActiveMinutes +
SedentaryMinutes+
VeryActiveDistance +
LightActiveDistance +
Sleep_Efficiency,
data=SMAd_ALPL)
summary(modSMAd_ALPL_redux)
##
## Call:
## lm(formula = Calories ~ TotalSteps + VeryActiveMinutes + FairlyActiveMinutes +
## LightlyActiveMinutes + SedentaryMinutes + VeryActiveDistance +
## LightActiveDistance + Sleep_Efficiency, data = SMAd_ALPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -792.2 -254.8 -14.2 196.9 4523.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -600.69244 267.68270 -2.244 0.02537 *
## TotalSteps -0.22911 0.02179 -10.515 < 0.0000000000000002 ***
## VeryActiveMinutes 13.96435 1.13897 12.260 < 0.0000000000000002 ***
## FairlyActiveMinutes 16.58093 1.78668 9.280 < 0.0000000000000002 ***
## LightlyActiveMinutes -1.31264 0.48692 -2.696 0.00732 **
## SedentaryMinutes 0.81123 0.12577 6.450 0.000000000322 ***
## VeryActiveDistance 227.76369 37.95053 6.002 0.000000004374 ***
## LightActiveDistance 544.11503 34.18433 15.917 < 0.0000000000000002 ***
## Sleep_Efficiency 17.54828 2.72642 6.436 0.000000000350 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 404.8 on 401 degrees of freedom
## Multiple R-squared: 0.7207, Adjusted R-squared: 0.7151
## F-statistic: 129.3 on 8 and 401 DF, p-value: < 0.00000000000000022
Calories = -600.69244 - 0.22911(TotalSteps) + 13.96435(VeryActiveMinutes) + 16.58093(FairlyActiveMinutes) - 1.31264(LightlyActiveMinutes) + 0.81123(SedentaryMinutes) + 227.76369(VeryActiveDistance) + 544.11503(LightActiveDistance) + 17.54828(Sleep_Efficiency)
Let’s check the diagnostic plot to see if this model has a good fit for normal distribution or not.
plot(modSMAd_ALPL_redux)
If sleep sleep efficiency is the only sleep related data that significantly contributes to calories burned. Let’s take a look at overall sleep distribution per weekday.
SL <- SMAd_ALPL %>%
ggplot() +
scale_fill_manual(values=
c("gold1",
"lightpink",
"olivedrab2",
"darkorange",
"cadetblue1",
"mediumorchid2",
"firebrick2")) +
theme_classic()+
theme(axis.text.x = element_text(angle = 90))
ggarrange(
SL + geom_boxplot(aes(Weekday,
TotalMinutesAsleep,
fill=Weekday)) +
labs(x='Weekday',
y='Total Minutes Asleep',
title = "Total Minutes Asleep Distribution per Weekday"),
SL + geom_boxplot(aes(Weekday,
TotalTimeInBed,
fill=Weekday)) +
labs(x='Weekday',
y='Total Time In Bed',
title = "Total Time In Bed Distribution per Weekday"),
SL + geom_boxplot(aes(Weekday,
Sleep_Efficiency,
fill=Weekday)) +
labs(x='Weekday',
y='Sleep Efficiency',
title = "Sleep Efficiency Distribution per Weekday")
)
SMAd_ALPL %>%
ggplot() +
scale_fill_manual(values=
c("gold1",
"lightpink",
"olivedrab2",
"darkorange",
"cadetblue1",
"mediumorchid2",
"firebrick2")) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
geom_boxplot(aes(Weekday,
Sleep_Efficiency,
fill=Weekday)) +
facet_wrap(~ActivityLevel, ncol = 1) +
theme(strip.background = element_rect(fill = "aliceblue")) +
labs(x='Weekday',
y='Sleep Efficiency',
title = "Sleep Efficiency Distribution per Weekday",
subtitle = "By Activity Level")
From previous activity minutes stacked bar charts, we notice the part of average sedentary minutes in some bars are over 1,200 minutes as compared to total 1,440 minutes in one day. The question is, does sedentary minutes include sleep time? Let’s see the plot between total minutes asleep and sedentary minutes as related to sleep efficiency and sleep fragmentation (total sleep records).
weightLogInfo_merged_ALPL %>%
ggplot(aes(WeightKg, BMI,
color=ActivityLevel)) +
geom_point() +
facet_wrap(~ActivityLevel) +
theme(axis.text.x = element_text(angle = 90,vjust=0.5)) +
labs(
x='Weight (Kg.)',
y='BMI',
title='Weight (Kg.) Vs. BMI',
subtitle ='By Activity Level')
We can only find out some more information about his daily activity on non-sleep related table to get some idea on the life style related to data on this weightLogInfo_merged_ALPL table.
The chart below using data from the table Activity_METs_daily_ALPL, which provide substantial amount of data for this user to get some idea about daily life style.
The author in the list below are the source of good ideas, and inspiration for this project. Please do yourself a favor, take a look at their beautiful works.
1.) https://www.kaggle.com/code/alixmcgettrick/bellabeat-case-study-in-r
2.) https://www.kaggle.com/code/zulkhaireesulaiman/bellabeat-capstone-project-in-r#-7.-Recommendations--
3.) https://www.kaggle.com/code/aymangouda/bellabeat-fitbit-capstone-project
Thank you for taking your time to have a look at my project!
Any comments and recommendations for improvement would be highly appreciated!