Bellabeat is the go-to wellness brand for women with an ecosystem of products and services focused on women’s health. Bellabeat develops wearables and accompanying products that monitor biometric and lifestyle data to help women better understand how their bodies work and make healthier choices.
The business task:
Analyze smart device data to gain insight into how consumers are using their smart devices in order to present high-level recommendations for Bellabeat’s marketing strategy.
Key stakeholders:
* Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
* Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
* Bellabeat marketing analytics team
Questions:
Bellabeat products chosen for the analysis:
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Data
FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users.
Data privacy and accessibility:
Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
Metadata contains information about the data types and data description.
Data limitations:
* There is no information about age, sex, or other demographics;
* Data is not current (2016-04-12 to 2016-05-12);
* Small sample size;
* Data was not collected by Bellabeat.
Data organization
Data is organized in 18 CSV files. It has both long and wide formats.
First, all 18 files were opened in RStudio and reviewed for the number of unique IDs (using the n_unique function):
For our analysis we will use the following CSV files:
1. 33 IDs: dailyActivity_merged.csv (contains information about daily calories, intensities and steps from files: dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv), hourlyCalories_merged.csv, hourlySteps_merged.csv.
2. 24 IDs: sleepDay_merged.csv.
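The ID counts per file can be obtained in several ways. The following is a minimal base R sketch, not necessarily the procedure used here: the helper name `count_unique_ids` and the demo files are hypothetical, and it assumes every CSV has an `Id` column.

```r
# Count distinct participant IDs in every CSV file of a directory.
# Assumption: each file has an "Id" column, as in the Fitbit export.
count_unique_ids <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  counts <- vapply(files, function(path) {
    ids <- read.csv(path)$Id
    length(unique(ids))
  }, integer(1))
  data.frame(file = basename(files), unique_ids = unname(counts))
}

# Tiny demonstration with two made-up files in a temporary directory
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 1, 2), Steps = c(10, 20, 30)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2, 3), Minutes = c(5, 6, 7)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)
id_counts <- count_unique_ids(demo_dir)
print(id_counts)
```

Pointing `count_unique_ids()` at the folder with the 18 downloaded files would give one row per file, which makes it easy to see which files cover all users.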
The analysis will be done in R and shared with the key stakeholders.
First, we will load the packages (they must already be installed). Then we will import, transform, and analyze the data.
# loading packages:
library(tidyverse)
library(lubridate)
library(skimr)
library(janitor)
library(ggpubr)
library(ggrepel)
We will start our analysis with the daily activity data, which contains information about 33 users.
# daily activity file: importing and reviewing the structure
daily_activity <- read_csv("dailyActivity_merged.csv")
str(daily_activity)
spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
$ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
$ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
$ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
$ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
$ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
$ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
$ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
$ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
$ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
$ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
$ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
$ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
$ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
- attr(*, "spec")=
.. cols(
.. Id = col_double(),
.. ActivityDate = col_character(),
.. TotalSteps = col_double(),
.. TotalDistance = col_double(),
.. TrackerDistance = col_double(),
.. LoggedActivitiesDistance = col_double(),
.. VeryActiveDistance = col_double(),
.. ModeratelyActiveDistance = col_double(),
.. LightActiveDistance = col_double(),
.. SedentaryActiveDistance = col_double(),
.. VeryActiveMinutes = col_double(),
.. FairlyActiveMinutes = col_double(),
.. LightlyActiveMinutes = col_double(),
.. SedentaryMinutes = col_double(),
.. Calories = col_double()
.. )
- attr(*, "problems")=<externalptr>
# Cleaning: clean columns names
daily_activity <- daily_activity %>%
clean_names()
# Cleaning: changing date format
daily_activity <- daily_activity %>%
rename(date=activity_date)%>%
mutate(date=as_date(date, format = "%m/%d/%Y"))
# Cleaning: check if the format is changed
glimpse(daily_activity)
Rows: 940
Columns: 15
$ id <dbl> 1503960366, 1503960366, 1503960366, 1503960~
$ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0~
$ total_steps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 130~
$ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9~
$ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9~
$ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ very_active_distance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3~
$ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1~
$ light_active_distance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5~
$ sedentary_active_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ very_active_minutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,~
$ fairly_active_minutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, ~
$ lightly_active_minutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205~
$ sedentary_minutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 8~
$ calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2~
# Cleaning: duplicates check
sum(duplicated(daily_activity))
[1] 0
# Cleaning: unique ID numbers
n_unique(daily_activity$id)
[1] 33
# First look (summary statistics)
summary(daily_activity)
id date total_steps total_distance
Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
tracker_distance logged_activities_distance very_active_distance
Min. : 0.000 Min. :0.0000 Min. : 0.000
1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
Median : 5.245 Median :0.0000 Median : 0.210
Mean : 5.475 Mean :0.1082 Mean : 1.503
3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
Max. :28.030 Max. :4.9421 Max. :21.920
moderately_active_distance light_active_distance sedentary_active_distance
Min. :0.0000 Min. : 0.000 Min. :0.000000
1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
Median :0.2400 Median : 3.365 Median :0.000000
Mean :0.5675 Mean : 3.341 Mean :0.001606
3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
Max. :6.4800 Max. :10.710 Max. :0.110000
very_active_minutes fairly_active_minutes lightly_active_minutes
Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
Median : 4.00 Median : 6.00 Median :199.0
Mean : 21.16 Mean : 13.56 Mean :192.8
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
Max. :210.00 Max. :143.00 Max. :518.0
sedentary_minutes calories
Min. : 0.0 Min. : 0
1st Qu.: 729.8 1st Qu.:1828
Median :1057.5 Median :2134
Mean : 991.2 Mean :2304
3rd Qu.:1229.5 3rd Qu.:2793
Max. :1440.0 Max. :4900
Next, we would like to see the distribution of users based on activity. We will follow the classification proposed as a guide to how many daily steps are sufficient for health benefits in generally healthy adults (Tudor-Locke & Bassett, 2004). WHO Library Cataloguing in Publication Data
| Steps per day | Physical activity level |
|---|---|
| <5000 | Sedentary lifestyle |
| 5000-7499 | Low active |
| 7500-9999 | Somewhat active |
| 10000-12499 | Active |
| >=12500 | Highly active |
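As a compact alternative for encoding these thresholds, base R's `cut()` can map step counts to levels in one call. This is only a sketch: the function name `classify_steps` and the example values are ours, and the "active" band is taken to end just below 12,500 steps.

```r
# Map average daily steps to the activity levels in the table above.
# right = FALSE makes each interval closed on the left: [5000, 7500), etc.
classify_steps <- function(mean_steps) {
  cut(mean_steps,
      breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
      labels = c("sedentary", "low active", "somewhat active",
                 "active", "highly active"),
      right = FALSE)
}

# Illustrative values, including the boundary cases
example_steps <- c(916, 5744, 9999, 10000, 12500)
step_levels <- classify_steps(example_steps)
print(data.frame(example_steps, step_levels))
```

Because `cut()` returns an ordered set of factor levels, the result can be plotted from least to most active without a separate releveling step.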
# Average total steps grouped by ID
Total_steps_mean <- daily_activity %>%
group_by(id) %>%
summarize(mean_total_steps=mean(total_steps))
head(Total_steps_mean)
# A tibble: 6 x 2
id mean_total_steps
<dbl> <dbl>
1 1503960366 12117.
2 1624580081 5744.
3 1644430081 7283.
4 1844505072 2580.
5 1927972279 916.
6 2022484408 11371.
# Creating user types (Tudor-Locke & Bassett, 2004)
activity_user_type <- Total_steps_mean %>%
mutate(activity_type = case_when(
mean_total_steps < 5000 ~ "sedentary",
mean_total_steps >= 5000 & mean_total_steps < 7500 ~ "low active",
mean_total_steps >= 7500 & mean_total_steps < 10000 ~ "somewhat active",
mean_total_steps >= 10000 & mean_total_steps < 12500 ~ "active",
mean_total_steps >= 12500 ~ "highly active",
))
head(activity_user_type)
# A tibble: 6 x 3
id mean_total_steps activity_type
<dbl> <dbl> <chr>
1 1503960366 12117. active
2 1624580081 5744. low active
3 1644430081 7283. low active
4 1844505072 2580. sedentary
5 1927972279 916. sedentary
6 2022484408 11371. active
# Counting the number of users by type and calculating the percentage
activity_user_type_percent <- activity_user_type %>%
group_by(activity_type) %>%
summarise(total=n()) %>%
mutate(totals = sum(total)) %>%
group_by(activity_type) %>%
summarise(total_percent = total / totals) %>%
mutate(percent = scales::percent(total_percent))%>%
arrange(desc(total_percent))
activity_user_type_percent$activity_type <- factor(activity_user_type_percent$activity_type, levels = c("sedentary", "low active", "somewhat active", "active", "highly active"))
head(activity_user_type_percent)
# A tibble: 5 x 3
activity_type total_percent percent
<fct> <dbl> <chr>
1 low active 0.273 27.3%
2 somewhat active 0.273 27.3%
3 sedentary 0.242 24.2%
4 active 0.152 15.2%
5 highly active 0.0606 6.1%
# Creating a plot
options(repr.plot.width = 6, repr.plot.height = 6)
ggplot(activity_user_type_percent,aes(x="",y = total_percent, fill=activity_type)) +
geom_bar(stat="identity", width=1, color="white") +
coord_polar("y", start=0)+
scale_fill_brewer(palette='PuRd')+
theme_void()+ # remove background, grid, numeric labels
theme(plot.title = element_text(hjust = 0.5,vjust= -5, size = 22, face = "bold")) +
geom_text(aes(label = percent, x=1.2),position = position_stack(vjust = 0.5))+
labs(title="User type by activity")+
guides(fill = guide_legend(title = "Activity type"))
Conclusions
Next, we want to see how the average number of steps and calories is distributed across the days of the week.
# Adding days of the week
daily_activity <- daily_activity %>%
mutate(weekday=weekdays(date))
head(daily_activity)
# A tibble: 6 x 16
id date total_steps total_distance tracker_distance logged_activiti~
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 1.50e9 2016-04-12 13162 8.5 8.5 0
2 1.50e9 2016-04-13 10735 6.97 6.97 0
3 1.50e9 2016-04-14 10460 6.74 6.74 0
4 1.50e9 2016-04-15 9762 6.28 6.28 0
5 1.50e9 2016-04-16 12669 8.16 8.16 0
6 1.50e9 2016-04-17 9705 6.48 6.48 0
# ... with 10 more variables: very_active_distance <dbl>,
# moderately_active_distance <dbl>, light_active_distance <dbl>,
# sedentary_active_distance <dbl>, very_active_minutes <dbl>,
# fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
# sedentary_minutes <dbl>, calories <dbl>, weekday <chr>
# Grouping by weekday and summarizing average values
activity_weekdays <- daily_activity %>%
group_by(weekday) %>%
summarize(mean_total_steps=mean(total_steps), mean_total_distance=mean(total_distance), mean_calories=mean(calories))
head(activity_weekdays)
# A tibble: 6 x 4
weekday mean_total_steps mean_total_distance mean_calories
<chr> <dbl> <dbl> <dbl>
1 Friday 7448. 5.31 2332.
2 Monday 7781. 5.55 2324.
3 Saturday 8153. 5.85 2355.
4 Sunday 6933. 5.03 2263
5 Thursday 7406. 5.31 2200.
6 Tuesday 8125. 5.83 2356.
# Creating a plot
activity_weekdays$weekday <- ordered(activity_weekdays$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))
options(repr.plot.width = 10, repr.plot.height = 5)
ggplot(data=activity_weekdays)+
geom_col(mapping = aes(x=weekday,y=mean_total_steps), fill="#CE3A94")+
theme(axis.text.x = element_text(angle = 30))+
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 50),
plot.title = element_text(size = 22),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18))+
labs(title = "Weekly average steps", x="Weekday", y="Average Total Steps")
options(repr.plot.width = 10, repr.plot.height = 5)
ggplot(data=activity_weekdays)+
geom_col(mapping = aes(x=weekday,y=mean_calories), fill="#CE3A94")+
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 50),
plot.title = element_text(size = 22),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18))+
labs(title = "Weekly average calories", x="Weekday", y="Calories")
Conclusion
We do not see a large difference in the average number of steps or calories between days of the week.
Now we would like to find out if there is a relationship between the device usage and activity type based on the number of steps.
The maximum number of days is 31.
# Device usage and activity type
# Number of logged days (rows) per user
tracker_usage <- daily_activity %>%
group_by(id) %>%
summarise(total_entries_id=n())
activity_user_type <- merge(activity_user_type, tracker_usage, by="id")
head(activity_user_type)
id mean_total_steps activity_type total_entries_id
1 1503960366 12116.742 active 31
2 1624580081 5743.903 low active 31
3 1644430081 7282.967 low active 30
4 1844505072 2580.065 sedentary 31
5 1927972279 916.129 sedentary 31
6 2022484408 11370.645 active 31
# Creating a plot
options(repr.plot.width = 10, repr.plot.height = 7)
ggplot(data=activity_user_type)+
geom_point(mapping = aes(y=total_entries_id, x=mean_total_steps, color=activity_type))+
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 18),
plot.title = element_text(size = 22),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18))+
labs(title="Device usage vs. Total Steps", x="Average total steps",y="Days of device usage", color="Activity type")+
facet_wrap(~activity_type)
Conclusion
Users averaging more than 7500 steps (the somewhat active, active and highly active types) use the device on more days than the sedentary and low active types. This may mean that wearing the device more often helps users stay motivated.
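One way to back this reading with numbers is to average the days of usage within each activity type. The sketch below uses base R's `aggregate()` on a toy data frame; the values in `usage_demo` are hypothetical and only mimic the shape of `activity_user_type`.

```r
# Toy stand-in for activity_user_type (values are hypothetical)
usage_demo <- data.frame(
  activity_type = c("sedentary", "sedentary", "low active",
                    "active", "active", "highly active"),
  total_entries_id = c(20, 31, 30, 31, 31, 31)
)

# Mean number of logged days per activity type
usage_by_type <- aggregate(total_entries_id ~ activity_type,
                           data = usage_demo, FUN = mean)
print(usage_by_type)
```

Running the same call on the real `activity_user_type` table would show whether the difference between groups is large enough to support the conclusion.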
Next, we are going to analyze daily sleep. To do this, we need to import and clean one file containing information about 24 users.
# importing a file
daily_sleep <- read.csv("sleepDay_merged.csv")
str(daily_sleep)
'data.frame': 413 obs. of 5 variables:
$ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
$ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
$ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
$ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
# Cleaning: duplicates
sum(duplicated(daily_sleep))
[1] 3
# Cleaning: removing 3 duplicates
daily_sleep <- daily_sleep%>%
distinct()%>%
drop_na()
sum(duplicated(daily_sleep))
[1] 0
# Cleaning: column names
daily_sleep <- daily_sleep%>%
clean_names()
str(daily_sleep)
'data.frame': 410 obs. of 5 variables:
$ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ sleep_day : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
$ total_sleep_records : int 1 2 1 2 1 1 1 1 1 1 ...
$ total_minutes_asleep: int 327 384 412 340 700 304 360 325 361 430 ...
$ total_time_in_bed : int 346 407 442 367 712 320 377 364 384 449 ...
# Cleaning: date format changing
daily_sleep <- daily_sleep %>%
rename(date=sleep_day) %>%
mutate(date = as_datetime(date,format="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
str(daily_sleep)
'data.frame': 410 obs. of 5 variables:
$ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ date : POSIXct, format: "2016-04-12" "2016-04-13" ...
$ total_sleep_records : int 1 2 1 2 1 1 1 1 1 1 ...
$ total_minutes_asleep: int 327 384 412 340 700 304 360 325 361 430 ...
$ total_time_in_bed : int 346 407 442 367 712 320 377 364 384 449 ...
Now that the data is cleaned, we would like to see each user's average time asleep as a percentage of the recommended 8 hours.
# Average time in bed and time asleep
daily_sleep_mean <- daily_sleep %>%
group_by(id) %>%
drop_na() %>%
summarize(mean_total_min_asleep = mean(total_minutes_asleep), mean_time_in_bed=mean(total_time_in_bed))
# Adding a column with % asleep of the recommended 8h (480 min)
daily_sleep_mean <- daily_sleep_mean %>%
mutate(percent_time_asleep =(mean_total_min_asleep/480)*100)
head(daily_sleep_mean)
# A tibble: 6 x 4
id mean_total_min_asleep mean_time_in_bed percent_time_asleep
<dbl> <dbl> <dbl> <dbl>
1 1503960366 360. 383. 75.1
2 1644430081 294 346 61.3
3 1844505072 652 961 136.
4 1927972279 417 438. 86.9
5 2026352035 506. 538. 105.
6 2320127002 61 69 12.7
# creating a plot
# The horizontal line represents the recommended 8 hours of sleep
ggplot(data=daily_sleep_mean)+
geom_point(mapping=aes(x=mean_total_min_asleep, y=percent_time_asleep))+
theme(plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 18, color="#CE3A94"),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18))+
geom_hline(yintercept = 100, color="#CE3A94")+
labs(title="Average sleep time", subtitle="Percentage from recommended 8 hours", x="Average min asleep", y="Percentage from 480 min")
Conclusion
Analyzing our results, we see that the majority of users do not reach the recommended 8 hours of sleep.
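The share of users below the target can also be computed directly instead of being read off the plot. As a small illustration, the sketch below uses only the six rounded `percent_time_asleep` values printed by `head(daily_sleep_mean)`; the full column would be used in practice.

```r
# First six values of percent_time_asleep as printed earlier (rounded);
# the full daily_sleep_mean table would be used in practice.
percent_time_asleep <- c(75.1, 61.3, 136, 86.9, 105, 12.7)

# Share of users whose average sleep is below the 480-minute target
share_below_target <- mean(percent_time_asleep < 100)
print(share_below_target)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so no explicit counting is needed.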
As a next step, we want to explore the relationship between the average number of steps and the average minutes asleep.
For this analysis, we need to merge daily activity and daily sleep files.
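Note that the `date` column is a `Date` in the activity table and a `POSIXct` in the sleep table. A cautious option is to bring both keys to the same class before merging. The sketch below shows this on toy data; the data frames and their values are hypothetical, and a fixed `"UTC"` time zone is assumed for the demonstration.

```r
# Toy activity table with a Date key (values are hypothetical)
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 5000)
)

# Toy sleep table with a POSIXct key at midnight
sleep_demo <- data.frame(
  id = c(1, 2),
  date = as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 00:00:00"),
                    tz = "UTC"),
  total_minutes_asleep = c(327, 400)
)

# Convert the POSIXct key to Date so both keys have the same class
sleep_demo$date <- as.Date(sleep_demo$date, tz = "UTC")

merged_demo <- merge(activity_demo, sleep_demo, by = c("id", "date"))
print(merged_demo)
```

With matching key classes the join no longer depends on how each class happens to be converted during matching, and the row count after the merge is easier to reason about.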
# Merging daily activity and daily sleep files
merged_data <- merge(daily_activity,daily_sleep,by=c('id','date'))
str(merged_data)
'data.frame': 397 obs. of 19 variables:
$ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
$ date : Date, format: "2016-04-12" "2016-04-14" ...
$ total_steps : num 13162 10460 9762 12669 13019 ...
$ total_distance : num 8.5 6.74 6.28 8.16 8.59 ...
$ tracker_distance : num 8.5 6.74 6.28 8.16 8.59 ...
$ logged_activities_distance: num 0 0 0 0 0 0 0 0 0 0 ...
$ very_active_distance : num 1.88 2.44 2.14 2.71 3.25 ...
$ moderately_active_distance: num 0.55 0.4 1.26 0.41 0.64 ...
$ light_active_distance : num 6.06 3.91 2.83 5.04 4.71 ...
$ sedentary_active_distance : num 0 0 0 0 0 0 0 0 0 0 ...
$ very_active_minutes : num 25 30 29 36 42 50 28 66 41 39 ...
$ fairly_active_minutes : num 13 11 34 10 16 31 12 27 21 5 ...
$ lightly_active_minutes : num 328 181 209 221 233 264 205 130 262 238 ...
$ sedentary_minutes : num 728 1218 726 773 1149 ...
$ calories : num 1985 1776 1745 1863 1921 ...
$ weekday : chr "Tuesday" "Thursday" "Friday" "Saturday" ...
$ total_sleep_records : int 2 1 2 1 1 1 1 1 1 1 ...
$ total_minutes_asleep : int 384 412 340 700 304 360 325 361 430 277 ...
$ total_time_in_bed : int 407 442 367 712 320 377 364 384 449 323 ...
# Cleaning: reviewing duplicates and unique IDs
sum(duplicated(merged_data))
[1] 0
n_distinct(merged_data$id)
[1] 24
Finding relationships between Average number of steps and minutes asleep
# Finding relationships between Average number of steps and minutes asleep
# The horizontal line represents the recommended 8 hours of sleep
activity_sleep <- merged_data %>%
group_by(id) %>%
summarize(mean_total_steps=mean(total_steps), mean_total_minutes_asleep=mean(total_minutes_asleep))
# Creating a plot
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_total_steps, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_total_steps, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=22, color="#CE3A94"),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18))+
geom_hline(yintercept = 480)+
labs(title = "Average Steps vs. Average Minutes Asleep", x="Average Steps", y="Average Minutes Asleep")
Going deeper into our analysis, we want to check whether there is a relationship between activity minutes and minutes asleep.
# New dataframe with average time asleep and activity time
activity_sleep <- merged_data %>%
group_by(id) %>%
drop_na() %>%
summarize(mean_total_steps=mean(total_steps),
mean_total_minutes_asleep=mean(total_minutes_asleep),
mean_sedentary_minutes=mean(sedentary_minutes),
mean_lightly_active_minutes=mean(lightly_active_minutes),
mean_fairly_active_minutes=mean(fairly_active_minutes),
mean_very_active_minutes=mean(very_active_minutes))
head(activity_sleep)
# A tibble: 6 x 7
id mean_total_steps mean_total_minu~ mean_sedentary_~ mean_lightly_ac~
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1503960366 12656. 362. 849. 227.
2 1644430081 8021. 294 1099. 184
3 1844505072 5624. 652 714. 268
4 1927972279 748. 334. 1192. 38.8
5 2026352035 5548. 506. 683. 257.
6 2320127002 5583 61 1174 266
# ... with 2 more variables: mean_fairly_active_minutes <dbl>,
# mean_very_active_minutes <dbl>
# Creating plots
# The horizontal line represents the recommended 8 hours of sleep
options(repr.plot.width = 15, repr.plot.height = 7)
ggarrange(
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_sedentary_minutes, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_sedentary_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=18, color="#CE3A94"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16))+
geom_hline(yintercept = 480)+
labs(title = "Average Sedentary min vs. Average min asleep", x="Average sedentary minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_very_active_minutes, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_very_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=18, color="#CE3A94"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16))+
geom_hline(yintercept = 480)+
labs(title = "Average Very active min vs. Average min asleep", x="Average very active minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_lightly_active_minutes, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_lightly_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=18, color="#CE3A94"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16))+
geom_hline(yintercept = 480)+
labs(title = "Average Lightly active min vs. Average min asleep", x="Average lightly active minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_fairly_active_minutes, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_fairly_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=18, color="#CE3A94"),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16))+
geom_hline(yintercept = 480)+
labs(title = "Average Fairly active min vs. Average min asleep", x="Average fairly active minutes", y="Average minutes asleep")
)
According to the first graph, there appears to be a negative relationship between sedentary time and time asleep: users with more sedentary minutes tend to sleep less.
For very active, fairly active, and lightly active minutes, we do not see a clear relationship with time asleep.
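A visual impression like this can be complemented with a correlation coefficient. As a rough illustration, the sketch below applies base R's `cor()` to the six rounded rows printed by `head(activity_sleep)`. It is not a result for the full dataset, and the name `sleep_cor_demo` is ours.

```r
# Six rows as printed by head(activity_sleep) (rounded values)
sleep_cor_demo <- data.frame(
  mean_sedentary_minutes    = c(849, 1099, 714, 1192, 683, 1174),
  mean_total_minutes_asleep = c(362, 294, 652, 334, 506, 61)
)

# Pearson correlation between sedentary time and time asleep
r_sedentary <- cor(sleep_cor_demo$mean_sedentary_minutes,
                   sleep_cor_demo$mean_total_minutes_asleep)
print(r_sedentary)
```

The same call on the full `activity_sleep` table, repeated for each activity-minute column, would show how strong each relationship is. With only 24 users, any coefficient should be read with caution.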
Conclusion
In this section, we analyze how user activity is distributed over hours. For this, we will import, clean and merge the following two files:
* hourly steps (33 IDs);
* hourly calories (33 IDs).
# Importing files
hourly_steps <- read.csv("hourlySteps_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")
head(hourly_steps)
Id ActivityHour StepTotal
1 1503960366 4/12/2016 12:00:00 AM 373
2 1503960366 4/12/2016 1:00:00 AM 160
3 1503960366 4/12/2016 2:00:00 AM 151
4 1503960366 4/12/2016 3:00:00 AM 0
5 1503960366 4/12/2016 4:00:00 AM 0
6 1503960366 4/12/2016 5:00:00 AM 0
head(hourly_calories)
Id ActivityHour Calories
1 1503960366 4/12/2016 12:00:00 AM 81
2 1503960366 4/12/2016 1:00:00 AM 61
3 1503960366 4/12/2016 2:00:00 AM 59
4 1503960366 4/12/2016 3:00:00 AM 47
5 1503960366 4/12/2016 4:00:00 AM 48
6 1503960366 4/12/2016 5:00:00 AM 48
# Cleaning column names
hourly_steps <- hourly_steps %>%
clean_names()
hourly_calories <- hourly_calories %>%
clean_names()
head(hourly_steps)
id activity_hour step_total
1 1503960366 4/12/2016 12:00:00 AM 373
2 1503960366 4/12/2016 1:00:00 AM 160
3 1503960366 4/12/2016 2:00:00 AM 151
4 1503960366 4/12/2016 3:00:00 AM 0
5 1503960366 4/12/2016 4:00:00 AM 0
6 1503960366 4/12/2016 5:00:00 AM 0
head(hourly_calories)
id activity_hour calories
1 1503960366 4/12/2016 12:00:00 AM 81
2 1503960366 4/12/2016 1:00:00 AM 61
3 1503960366 4/12/2016 2:00:00 AM 59
4 1503960366 4/12/2016 3:00:00 AM 47
5 1503960366 4/12/2016 4:00:00 AM 48
6 1503960366 4/12/2016 5:00:00 AM 48
# Cleaning date format
hourly_steps <- hourly_steps %>%
rename(date=activity_hour) %>%
mutate(date=as.POSIXct(date,format = "%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
hourly_calories <- hourly_calories %>%
rename(date=activity_hour) %>%
mutate(date=as.POSIXct(date,format = "%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
head(hourly_steps)
id date step_total
1 1503960366 2016-04-12 00:00:00 373
2 1503960366 2016-04-12 01:00:00 160
3 1503960366 2016-04-12 02:00:00 151
4 1503960366 2016-04-12 03:00:00 0
5 1503960366 2016-04-12 04:00:00 0
6 1503960366 2016-04-12 05:00:00 0
head(hourly_calories)
id date calories
1 1503960366 2016-04-12 00:00:00 81
2 1503960366 2016-04-12 01:00:00 61
3 1503960366 2016-04-12 02:00:00 59
4 1503960366 2016-04-12 03:00:00 47
5 1503960366 2016-04-12 04:00:00 48
6 1503960366 2016-04-12 05:00:00 48
# Merging files by "id" and "date"
merged_steps_calories <- merge(hourly_steps,hourly_calories, by=c("id","date"))
# Creating a column with the time only
merged_steps_calories$time <- format(merged_steps_calories$date, format = "%H:%M")
head(merged_steps_calories)
id date step_total calories time
1 1503960366 2016-04-12 00:00:00 373 81 00:00
2 1503960366 2016-04-12 01:00:00 160 61 01:00
3 1503960366 2016-04-12 02:00:00 151 59 02:00
4 1503960366 2016-04-12 03:00:00 0 47 03:00
5 1503960366 2016-04-12 04:00:00 0 48 04:00
6 1503960366 2016-04-12 05:00:00 0 48 05:00
# Cleaning: Checking duplicates
sum(duplicated(merged_steps_calories))
[1] 0
# Cleaning: Checking unique IDs
n_unique(merged_steps_calories$id)
[1] 33
# Transforming the dataframe. We will group by the time
merged_steps_calories <- merged_steps_calories %>%
group_by(time) %>%
summarize(mean_calories=mean(calories), mean_total_steps=mean(step_total))
head(merged_steps_calories)
# A tibble: 6 x 3
time mean_calories mean_total_steps
<chr> <dbl> <dbl>
1 00:00 71.8 42.2
2 01:00 70.2 23.1
3 02:00 69.2 17.1
4 03:00 67.5 6.43
5 04:00 68.3 12.7
6 05:00 81.7 43.9
# Creating charts
options(repr.plot.width = 15, repr.plot.height = 7)
ggarrange(
ggplot(data=merged_steps_calories)+
geom_col(mapping = aes(x=time,y=mean_total_steps, fill=mean_total_steps))+
scale_fill_gradient(low = "white", high = "#CE3A94")+
theme(plot.title = element_text(hjust = 0.5,vjust= 1, size = 22, face = "bold"),
axis.text.x = element_text(angle = 80, size=11, vjust= 0.4),
axis.text.y = element_text(size=11))+
labs(title = "Hourly steps during the day", x="Time", y="Average Steps")+
guides(fill = guide_legend(title = "Average Steps")),
ggplot(data=merged_steps_calories)+
geom_col(mapping = aes(x=time,y=mean_calories, fill=mean_calories))+
scale_fill_gradient(low = "white", high = "#CE3A94")+
theme(plot.title = element_text(hjust = 0.5,vjust= 1, size = 22, face = "bold"),
axis.text.x = element_text(angle = 80, size=11, vjust= 0.4),
axis.text.y = element_text(size=11))+
labs(title = "Hourly calories during the day", x="Time", y="Average calories")+
guides(fill = guide_legend(title = "Average Calories"))
)
Looking at the charts, it can be seen that users are most active between 12pm and 2pm (lunchtime) and from 5pm till 7pm (after work).
It’s important to note that the second chart shows how people burn calories by doing different activities (not just steps).
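The busiest hours can also be extracted programmatically rather than read from the chart. The helper `top_hours` below is a hypothetical sketch, demonstrated on the six night-time rows printed by `head(merged_steps_calories)`, so its output here says nothing about the daytime peaks.

```r
# Return the n time slots with the highest average steps
top_hours <- function(hourly, n = 2) {
  ranked <- hourly[order(hourly$mean_total_steps, decreasing = TRUE), ]
  head(ranked$time, n)
}

# First six rows as printed by head(merged_steps_calories)
hourly_demo <- data.frame(
  time = c("00:00", "01:00", "02:00", "03:00", "04:00", "05:00"),
  mean_total_steps = c(42.2, 23.1, 17.1, 6.43, 12.7, 43.9),
  stringsAsFactors = FALSE
)
busiest <- top_hours(hourly_demo, n = 2)
print(busiest)
```

Applied to all 24 rows of the grouped table, the same function would list the peak hours that a notification schedule could target.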
Conclusion
Given the difference from the first chart, we can recommend different types of exercises.
It’s important to send notifications that encourage users to reduce sedentary time (to support good sleep and a healthier lifestyle). The recommendations could offer simple, quick exercises so that women can improve their habits without feeling stressed.
Send notifications to go to bed on time. The app can be enhanced with sleep/meditation music, etc.
Exercise recommendations may be based on lifestyle. For example, this may depend on the number of children, type of work, etc.
Given the pandemic and the fact that many people are working remotely, recommendations can be based on activity at home. Even during work, women could do simple exercises every 1 to 1.5 hours, or at another preferred interval.
Considering the data limitations, additional data is required:
* Current original data from Bellabeat;
* Age and demographics;
* Lifestyle;
* Preferences and sources of motivation.
The following ideas could help build an even stronger brand and make the device a helpful “friend”:
Emotions/mood control. Recommendations of exercises, affirmations, nutrition depending on the mood. Ideally, an application should use algorithms based on user preferences.
In pop-up notifications, users could rate their desire to exercise, readiness for sleep, etc.
Community support. Building a user community in an app can increase motivation.
Community meetings in different locations (it could be done through social media).
A notebook in which women could write down their feelings, emotions, or just something that inspires and motivates them. Women could choose to view their own written motivation in a pop-up window (if they selected that option in the app).
Hire a psychotherapist who could support and motivate women (additional payment).
A “friend” chat support (additional payment) from a Bellabeat team.
Possibility to create a playlist of favorite music which will help women to stay motivated.
Special recommendations for pregnant women.
Reminders about health checks such as cancer screening, hormone tests, etc.