1. Introduction

Bellabeat is the go-to wellness brand for women with an ecosystem of products and services focused on women’s health. Bellabeat develops wearables and accompanying products that monitor biometric and lifestyle data to help women better understand how their bodies work and make healthier choices.

2. The business task

The business task:

Analyze smart device data to gain insight into how consumers are using their smart devices in order to present high-level recommendations for Bellabeat’s marketing strategy.

Key stakeholders:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer;
  • Sando Mur: mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team;
  • Bellabeat marketing analytics team.

Questions:

Bellabeat products chosen for the analysis:

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

3. Data organization

Data

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): this Kaggle dataset contains personal fitness tracker data from thirty Fitbit users.

Data privacy and accessibility:

Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Metadata contains information about the data types and data description.

Data limitations:

  • There is no information about age, sex or other demographics;
  • The data is not current (2016-04-12 to 2016-05-12);
  • The sample size is small;
  • The data was not collected by Bellabeat.

Data organization

Data is organized in 18 CSV files. It has both long and wide formats.

First, all 18 files were opened in RStudio and reviewed for unique numbers of ID (using n_unique function):

  1. 33 ID: dailyActivity_merged.csv, dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteCaloriesNarrow_merged.csv, minuteCaloriesWide_merged.csv, minuteIntensitiesNarrow_merged.csv, minuteIntensitiesWide_merged.csv, minuteMETsNarrow_merged.csv, minuteStepsNarrow_merged.csv, minuteStepsWide_merged.csv.
  2. 24 ID: minuteSleep_merged.csv, sleepDay_merged.csv.
  3. 14 ID: heartrate_seconds_merged.csv (too small sample size).
  4. 8 ID: weightLogInfo_merged.csv (too small sample size).
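
The ID review above can be sketched in base R as follows. The two small data frames and their names are made up for illustration and stand in for the imported CSV files; only the `Id` column name comes from the dataset. This is one possible way to do the count, not necessarily the exact calls used for the review.

```r
# Hypothetical stand-ins for two imported files (values are illustrative only)
files <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep_day      = data.frame(Id = c(1, 1, 3))
)

# Count the distinct user IDs in one data frame
count_ids <- function(df) length(unique(df$Id))

# Apply the count to every file; the result is a named vector
id_counts <- sapply(files, count_ids)
print(id_counts)
```

With the real data, `files` would hold the 18 data frames read from the CSV files, and the counts would be compared before deciding which files to keep.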

For our analysis we will use the following CSV files:

  1. 33 ID: dailyActivity_merged.csv (contains information about daily calories, intensities and steps from files: dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv), hourlyCalories_merged.csv, hourlySteps_merged.csv.
  2. 24 ID: sleepDay_merged.csv.

4. Process and Analysis

The analysis will be done in R and shared with the key stakeholders.

First, we will load the packages. Then we will import, transform and analyze the data.

# loading packages (install them first with install.packages() if needed):
library(tidyverse)
library(lubridate)
library(skimr)
library(janitor)
library(ggpubr)
library(ggrepel)

4.1 Daily activity

4.1.1 Data cleaning

We will start our analysis with the daily activity data, which contains information about 33 users.

# importing the daily activity file and reviewing its structure
daily_activity <- read_csv("dailyActivity_merged.csv")

str(daily_activity)
spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
 $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
 $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   ActivityDate = col_character(),
  ..   TotalSteps = col_double(),
  ..   TotalDistance = col_double(),
  ..   TrackerDistance = col_double(),
  ..   LoggedActivitiesDistance = col_double(),
  ..   VeryActiveDistance = col_double(),
  ..   ModeratelyActiveDistance = col_double(),
  ..   LightActiveDistance = col_double(),
  ..   SedentaryActiveDistance = col_double(),
  ..   VeryActiveMinutes = col_double(),
  ..   FairlyActiveMinutes = col_double(),
  ..   LightlyActiveMinutes = col_double(),
  ..   SedentaryMinutes = col_double(),
  ..   Calories = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# Cleaning: clean columns names
daily_activity <- daily_activity %>%
  clean_names()

# Cleaning: changing date format
daily_activity <- daily_activity %>%
   rename(date=activity_date)%>%
   mutate(date=as_date(date, format = "%m/%d/%Y"))

# Cleaning: check if the format is changed
glimpse(daily_activity)
Rows: 940
Columns: 15
$ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960~
$ date                       <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0~
$ total_steps                <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 130~
$ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9~
$ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9~
$ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3~
$ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1~
$ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5~
$ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ very_active_minutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,~
$ fairly_active_minutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, ~
$ lightly_active_minutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205~
$ sedentary_minutes          <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 8~
$ calories                   <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2~
# Cleaning: duplicates check
sum(duplicated(daily_activity))
[1] 0
# Cleaning: unique ID numbers
n_unique(daily_activity$id)
[1] 33
#First look (mean values)

summary(daily_activity)
       id                 date             total_steps    total_distance  
 Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
 1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
 Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
 Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
 3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
 Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
 tracker_distance logged_activities_distance very_active_distance
 Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
 1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
 Median : 5.245   Median :0.0000             Median : 0.210      
 Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
 3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
 Max.   :28.030   Max.   :4.9421             Max.   :21.920      
 moderately_active_distance light_active_distance sedentary_active_distance
 Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
 1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
 Median :0.2400             Median : 3.365        Median :0.000000         
 Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
 3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
 Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
 very_active_minutes fairly_active_minutes lightly_active_minutes
 Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
 1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
 Median :  4.00      Median :  6.00        Median :199.0         
 Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
 3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
 Max.   :210.00      Max.   :143.00        Max.   :518.0         
 sedentary_minutes    calories   
 Min.   :   0.0    Min.   :   0  
 1st Qu.: 729.8    1st Qu.:1828  
 Median :1057.5    Median :2134  
 Mean   : 991.2    Mean   :2304  
 3rd Qu.:1229.5    3rd Qu.:2793  
 Max.   :1440.0    Max.   :4900  

4.1.2 Activity type (steps)

Next, we would like to see the distribution of users based on activity. We will follow the recommendations developed as a guide on how many daily steps are sufficient for health benefits in generally healthy adults (Tudor-Locke & Bassett, 2004). WHO Library Cataloguing in Publication Data

Steps per day    Physical activity level
<5000            Sedentary lifestyle
5000-7499        Low active
7500-9999        Somewhat active
10000-12499      Active
>=12500          Highly active
# Average total steps grouped by ID
Total_steps_mean <- daily_activity %>% 
  group_by(id) %>% 
  summarize(mean_total_steps=mean(total_steps))

head(Total_steps_mean)
# A tibble: 6 x 2
          id mean_total_steps
       <dbl>            <dbl>
1 1503960366           12117.
2 1624580081            5744.
3 1644430081            7283.
4 1844505072            2580.
5 1927972279             916.
6 2022484408           11371.
# Creating user types (Tudor-Locke & Bassett, 2004)

activity_user_type <- Total_steps_mean %>%
  mutate(activity_type = case_when(
    mean_total_steps < 5000 ~ "sedentary",
    mean_total_steps >= 5000 & mean_total_steps < 7500 ~ "low active", 
    mean_total_steps >= 7500 & mean_total_steps < 10000 ~ "somewhat active", 
    mean_total_steps >= 10000 & mean_total_steps < 12500 ~ "active",
    mean_total_steps >= 12500 ~ "highly active",
  ))
head(activity_user_type)
# A tibble: 6 x 3
          id mean_total_steps activity_type
       <dbl>            <dbl> <chr>        
1 1503960366           12117. active       
2 1624580081            5744. low active   
3 1644430081            7283. low active   
4 1844505072            2580. sedentary    
5 1927972279             916. sedentary    
6 2022484408           11371. active       
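
As a cross-check of the thresholds, the same classification can be sketched with base R's `cut()`. The helper name `classify_steps` is hypothetical, and the input values are the rounded averages of the first six users shown above plus two boundary values added for illustration.

```r
# Step thresholds from the table above; right = FALSE makes each interval
# closed on the left, e.g. [5000, 7500)
classify_steps <- function(mean_steps) {
  cut(mean_steps,
      breaks = c(-Inf, 5000, 7500, 10000, 12500, Inf),
      labels = c("sedentary", "low active", "somewhat active",
                 "active", "highly active"),
      right = FALSE)
}

# Rounded averages of the first six users, plus two boundary values
steps <- c(12117, 5744, 7283, 2580, 916, 11371, 5000, 12500)
activity <- as.character(classify_steps(steps))
print(data.frame(steps, activity))
```

A value exactly on a threshold falls into the higher category, which matches the `>=` comparisons in the `case_when()` call.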
#Counting the number by user type and calculating the percentage 
activity_user_type_percent <- activity_user_type %>%
  group_by(activity_type) %>% 
  summarise(total=n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(activity_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(percent = scales::percent(total_percent))%>%
  arrange(desc(total_percent))

activity_user_type_percent$activity_type <- factor(activity_user_type_percent$activity_type, levels = c("sedentary", "low active", "somewhat active", "active", "highly active"))

head(activity_user_type_percent)
# A tibble: 5 x 3
  activity_type   total_percent percent
  <fct>                   <dbl> <chr>  
1 low active             0.273  27.3%  
2 somewhat active        0.273  27.3%  
3 sedentary              0.242  24.2%  
4 active                 0.152  15.2%  
5 highly active          0.0606 6.1%   
# Creating a plot
options(repr.plot.width = 6, repr.plot.height = 6)
ggplot(activity_user_type_percent,aes(x="",y = total_percent, fill=activity_type)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0)+
  scale_fill_brewer(palette='PuRd')+
  theme_void()+ # remove background, grid, numeric labels
  theme(plot.title = element_text(hjust = 0.5,vjust= -5, size = 22, face = "bold")) +
  geom_text(aes(label = percent, x=1.2),position = position_stack(vjust = 0.5))+
  labs(title="User type by activity")+
  guides(fill = guide_legend(title = "Activity type"))

Conclusions

  • 54.5% of users are either low active or somewhat active;
  • 24.2% of users have a sedentary lifestyle;
  • 21.2% of users are active or highly active.
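
The combined shares can be checked by adding user counts rather than rounded percentages. The counts below are an assumption: they are inferred from the percentage table above (for example, 27.3% of 33 users is 9 users) and are not read from the raw data.

```r
# Users per activity type, inferred from the percentages (33 users in total)
counts <- c("low active" = 9, "somewhat active" = 9, "sedentary" = 8,
            "active" = 5, "highly active" = 2)
total_users <- sum(counts)

# Share of users (in %) falling into one or more activity types
share <- function(types) round(100 * sum(counts[types]) / total_users, 1)

low_and_somewhat <- share(c("low active", "somewhat active"))
sedentary        <- share("sedentary")
active_types     <- share(c("active", "highly active"))
print(c(low_and_somewhat, sedentary, active_types))
```

Working from counts avoids the small drift that appears when already rounded percentages are added together.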

4.1.3 Weekdays distribution

Next, we want to see user weekdays distribution based on the average number of steps and calories.

# Adding days of the week
daily_activity <- daily_activity %>%
   mutate(weekday=weekdays(date))

head(daily_activity)
# A tibble: 6 x 16
      id date       total_steps total_distance tracker_distance logged_activiti~
   <dbl> <date>           <dbl>          <dbl>            <dbl>            <dbl>
1 1.50e9 2016-04-12       13162           8.5              8.5                 0
2 1.50e9 2016-04-13       10735           6.97             6.97                0
3 1.50e9 2016-04-14       10460           6.74             6.74                0
4 1.50e9 2016-04-15        9762           6.28             6.28                0
5 1.50e9 2016-04-16       12669           8.16             8.16                0
6 1.50e9 2016-04-17        9705           6.48             6.48                0
# ... with 10 more variables: very_active_distance <dbl>,
#   moderately_active_distance <dbl>, light_active_distance <dbl>,
#   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
#   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
#   sedentary_minutes <dbl>, calories <dbl>, weekday <chr>
# Grouping by weekday and summarizing average values
activity_weekdays <- daily_activity %>% 
  group_by(weekday) %>% 
  summarize(mean_total_steps=mean(total_steps), mean_total_distance=mean(total_distance), mean_calories=mean(calories))

head(activity_weekdays)
# A tibble: 6 x 4
  weekday  mean_total_steps mean_total_distance mean_calories
  <chr>               <dbl>               <dbl>         <dbl>
1 Friday              7448.                5.31         2332.
2 Monday              7781.                5.55         2324.
3 Saturday            8153.                5.85         2355.
4 Sunday              6933.                5.03         2263 
5 Thursday            7406.                5.31         2200.
6 Tuesday             8125.                5.83         2356.
# Creating a plot
activity_weekdays$weekday <- ordered(activity_weekdays$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))

options(repr.plot.width = 10, repr.plot.height = 5)
ggplot(data=activity_weekdays)+
  geom_col(mapping = aes(x=weekday,y=mean_total_steps), fill="#CE3A94")+
  theme(axis.text.x = element_text(angle = 30))+
  theme(legend.title = element_text(size = 20), 
        legend.text = element_text(size = 50), 
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18))+
  labs(title = "Weekly average steps", x="Weekday", y="Average Total Steps")

options(repr.plot.width = 10, repr.plot.height = 5)
ggplot(data=activity_weekdays)+
  geom_col(mapping = aes(x=weekday,y=mean_calories), fill="#CE3A94")+
  theme(legend.title = element_text(size = 20), 
        legend.text = element_text(size = 50), 
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18))+
  labs(title = "Weekly average calories", x="Weekday", y="Calories")

Conclusion

We do not see a large difference in the average number of steps or calories between days of the week.
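
One caveat on the weekday step: `weekdays()` returns day names in the session's locale, so the hard-coded English factor levels only match on an English locale. A possible locale-independent sketch, assuming ISO weekday numbers from `format(date, "%u")` (1 = Monday), is shown below. The example dates are picked from the study period for illustration.

```r
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17", "2016-04-18"))

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday)
day_number <- as.integer(format(dates, "%u"))

# Look up the English label and keep Monday-to-Sunday ordering for plots
weekday <- factor(day_names[day_number], levels = day_names, ordered = TRUE)
print(data.frame(dates, weekday))
```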

4.1.4 Device usage vs. Activity

Now we would like to find out if there is a relationship between the device usage and activity type based on the number of steps.

The maximum number of days is 31.

# Device usage: number of daily records (days with tracker data) per user
tracker_usage <- daily_activity %>%
  group_by(id) %>%
  summarise(total_entries_id=n())

activity_user_type <- merge(activity_user_type, tracker_usage, by="id")

head(activity_user_type)
          id mean_total_steps activity_type total_entries_id
1 1503960366        12116.742        active               31
2 1624580081         5743.903    low active               31
3 1644430081         7282.967    low active               30
4 1844505072         2580.065     sedentary               31
5 1927972279          916.129     sedentary               31
6 2022484408        11370.645        active               31
# Creating a plot
options(repr.plot.width = 10, repr.plot.height = 7)
ggplot(data=activity_user_type)+
  geom_point(mapping = aes(y=total_entries_id, x=mean_total_steps, color=activity_type))+
  theme(legend.title = element_text(size = 20), 
        legend.text = element_text(size = 18), 
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18))+
  labs(title="Device usage vs. Total Steps", x="Average total steps",y="Days of device usage", color="Activity type")+
  facet_wrap(~activity_type)

Conclusion

Users averaging 7500 or more steps per day (somewhat active, active and highly active types) use the device on more days than sedentary and low active users. This may mean that wearing the device more often helps users stay motivated, although the data cannot show which way the relationship runs.

4.2 Daily sleep

4.2.1 Daily sleep percentage

Next, we are going to analyze daily sleep. To do this, we need to import and clean one file containing information about 24 users.

# importing a file
daily_sleep <- read.csv("sleepDay_merged.csv")

str(daily_sleep)
'data.frame':   413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
# Cleaning:duplicates
sum(duplicated(daily_sleep))
[1] 3
# Cleaning: removing 3 duplicates
daily_sleep <- daily_sleep%>%
   distinct()%>%
   drop_na()
sum(duplicated(daily_sleep))
[1] 0
# Cleaning: column names
daily_sleep <- daily_sleep%>%
   clean_names()

str(daily_sleep)
'data.frame':   410 obs. of  5 variables:
 $ id                  : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ sleep_day           : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ total_sleep_records : int  1 2 1 2 1 1 1 1 1 1 ...
 $ total_minutes_asleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ total_time_in_bed   : int  346 407 442 367 712 320 377 364 384 449 ...
# Cleaning: date format changing
daily_sleep <- daily_sleep %>% 
  rename(date=sleep_day) %>% 
  mutate(date = as_datetime(date,format="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
str(daily_sleep)
'data.frame':   410 obs. of  5 variables:
 $ id                  : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ date                : POSIXct, format: "2016-04-12" "2016-04-13" ...
 $ total_sleep_records : int  1 2 1 2 1 1 1 1 1 1 ...
 $ total_minutes_asleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ total_time_in_bed   : int  346 407 442 367 712 320 377 364 384 449 ...

Now that the data is cleaned, we would like to see the average time asleep as a percentage of the recommended 8 hours.

# Average time in bed and time asleep
daily_sleep_mean <- daily_sleep %>% 
   group_by(id) %>% 
   drop_na() %>% 
   summarize(mean_total_min_asleep = mean(total_minutes_asleep), mean_time_in_bed=mean(total_time_in_bed))

# Adding a column with % asleep of the recommended 8h (480 min)
daily_sleep_mean <- daily_sleep_mean %>%
  mutate(percent_time_asleep =(mean_total_min_asleep/480)*100)

head(daily_sleep_mean)
# A tibble: 6 x 4
          id mean_total_min_asleep mean_time_in_bed percent_time_asleep
       <dbl>                 <dbl>            <dbl>               <dbl>
1 1503960366                  360.             383.                75.1
2 1644430081                  294              346                 61.3
3 1844505072                  652              961                136. 
4 1927972279                  417              438.                86.9
5 2026352035                  506.             538.               105. 
6 2320127002                   61               69                 12.7
# creating a plot
# The horizontal line represents the recommended 8 hours of sleep
ggplot(data=daily_sleep_mean)+
  geom_point(mapping=aes(x=mean_total_min_asleep, y=percent_time_asleep))+
  theme(plot.title = element_text(size = 22),
        plot.subtitle = element_text(size = 18, color="#CE3A94"),
        axis.title.x = element_text(size = 18),
        axis.title.y = element_text(size = 18))+
  geom_hline(yintercept = 100, color="#CE3A94")+
  labs(title="Average sleep time", subtitle="Percentage from recommended 8 hours", x="Average min asleep", y="Percentage from 480 min")

Conclusion

Analyzing our results, we see that the majority of users sleep less than the recommended 8 hours.
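
The percentage calculation can be illustrated with a short worked example. The values are the rounded per-user averages from the first six rows printed above, so the result only describes those six users, not the full set of 24.

```r
recommended_min <- 480  # 8 hours, the reference used in this analysis

# Rounded average minutes asleep for the first six users shown above
mean_min_asleep <- c(360, 294, 652, 417, 506, 61)

# Time asleep as a percentage of the recommended 480 minutes
percent_of_recommended <- mean_min_asleep / recommended_min * 100

# TRUE for users whose average is below the recommended amount
below_recommended <- percent_of_recommended < 100

print(round(percent_of_recommended, 1))
print(sum(below_recommended))
```

Among these six users, four average less than 8 hours and two average more.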

4.2.2 Daily sleep vs. Steps

Next, we want to find out whether there is a relationship between the average number of steps and the average minutes asleep.

For this analysis, we need to merge daily activity and daily sleep files.

# Merging daily activity and daily sleep files
merged_data <- merge(daily_activity,daily_sleep,by=c('id','date'))
str(merged_data)
'data.frame':   397 obs. of  19 variables:
 $ id                        : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ date                      : Date, format: "2016-04-12" "2016-04-14" ...
 $ total_steps               : num  13162 10460 9762 12669 13019 ...
 $ total_distance            : num  8.5 6.74 6.28 8.16 8.59 ...
 $ tracker_distance          : num  8.5 6.74 6.28 8.16 8.59 ...
 $ logged_activities_distance: num  0 0 0 0 0 0 0 0 0 0 ...
 $ very_active_distance      : num  1.88 2.44 2.14 2.71 3.25 ...
 $ moderately_active_distance: num  0.55 0.4 1.26 0.41 0.64 ...
 $ light_active_distance     : num  6.06 3.91 2.83 5.04 4.71 ...
 $ sedentary_active_distance : num  0 0 0 0 0 0 0 0 0 0 ...
 $ very_active_minutes       : num  25 30 29 36 42 50 28 66 41 39 ...
 $ fairly_active_minutes     : num  13 11 34 10 16 31 12 27 21 5 ...
 $ lightly_active_minutes    : num  328 181 209 221 233 264 205 130 262 238 ...
 $ sedentary_minutes         : num  728 1218 726 773 1149 ...
 $ calories                  : num  1985 1776 1745 1863 1921 ...
 $ weekday                   : chr  "Tuesday" "Thursday" "Friday" "Saturday" ...
 $ total_sleep_records       : int  2 1 2 1 1 1 1 1 1 1 ...
 $ total_minutes_asleep      : int  384 412 340 700 304 360 325 361 430 277 ...
 $ total_time_in_bed         : int  407 442 367 712 320 377 364 384 449 323 ...
# Cleaning: reviewing duplicates and unique IDs
sum(duplicated(merged_data))
[1] 0
n_distinct(merged_data$id)
[1] 24
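
Note that the two join keys have different classes here: `date` is a `Date` in `daily_activity` but a `POSIXct` in `daily_sleep`. A more explicit variant, sketched below with made-up rows, converts the sleep timestamp to a `Date` before merging so that both keys share one class. The toy values and the parsing via `as.Date()` are assumptions for illustration, not the code used in this analysis.

```r
# Hypothetical activity rows with a Date key
activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 8000)
)

# Hypothetical sleep rows in the raw timestamp layout
sleep <- data.frame(
  id = c(1, 2),
  sleep_day = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  total_minutes_asleep = c(327, 400),
  stringsAsFactors = FALSE
)

# Keep only the calendar date; the rows shown above all carry a midnight time
sleep$date <- as.Date(sleep$sleep_day, format = "%m/%d/%Y")
sleep$sleep_day <- NULL

# Inner join on user and calendar date
merged <- merge(activity, sleep, by = c("id", "date"))
print(merged)
```

Because `merge()` performs an inner join by default, days with activity data but no sleep record drop out of the result.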

Finding relationships between Average number of steps and minutes asleep

# Finding relationships between Average number of steps and minutes asleep
# The horizontal line represents the recommended 8 hours of sleep
activity_sleep <- merged_data %>% 
  group_by(id) %>%
  summarize(mean_total_steps=mean(total_steps), mean_total_minutes_asleep=mean(total_minutes_asleep))

# Creating a plot
ggplot(data=activity_sleep)+
geom_point(mapping=aes(x=mean_total_steps, y=mean_total_minutes_asleep))+
geom_smooth(mapping=aes(x=mean_total_steps, y=mean_total_minutes_asleep), color="#CE3A94")+
theme(plot.title=element_text(size=22, color="#CE3A94"),
      axis.title.x = element_text(size = 18),
      axis.title.y = element_text(size = 18))+
geom_hline(yintercept = 480)+
labs(title = "Average Steps vs. Average Minutes Asleep", x="Average Steps", y="Average Minutes Asleep")

4.2.3 Daily sleep vs. Activity time

Going deeper into our analysis, we want to check whether there is a relationship between activity minutes and minutes asleep.

# New dataframe with average time asleep and activity time
activity_sleep <- merged_data %>% 
  group_by(id) %>%
  drop_na() %>% 
  summarize(mean_total_steps=mean(total_steps), 
            mean_total_minutes_asleep=mean(total_minutes_asleep),
            mean_sedentary_minutes=mean(sedentary_minutes),
            mean_lightly_active_minutes=mean(lightly_active_minutes),
            mean_fairly_active_minutes=mean(fairly_active_minutes),
            mean_very_active_minutes=mean(very_active_minutes))

head(activity_sleep)
# A tibble: 6 x 7
          id mean_total_steps mean_total_minu~ mean_sedentary_~ mean_lightly_ac~
       <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
1 1503960366           12656.             362.             849.            227. 
2 1644430081            8021.             294             1099.            184  
3 1844505072            5624.             652              714.            268  
4 1927972279             748.             334.            1192.             38.8
5 2026352035            5548.             506.             683.            257. 
6 2320127002            5583               61             1174             266  
# ... with 2 more variables: mean_fairly_active_minutes <dbl>,
#   mean_very_active_minutes <dbl>
# Creating plots
# The horizontal line represents the recommended 8 hours of sleep
options(repr.plot.width = 15, repr.plot.height = 7)
ggarrange(
ggplot(data=activity_sleep)+
  geom_point(mapping=aes(x=mean_sedentary_minutes, y=mean_total_minutes_asleep))+
  geom_smooth(mapping=aes(x=mean_sedentary_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
  theme(plot.title=element_text(size=18, color="#CE3A94"),
      axis.title.x = element_text(size = 16),
      axis.title.y = element_text(size = 16))+
  geom_hline(yintercept = 480)+
  labs(title = "Average Sedentary min vs. Average min asleep", x="Average sedentary minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
  geom_point(mapping=aes(x=mean_very_active_minutes, y=mean_total_minutes_asleep))+
  geom_smooth(mapping=aes(x=mean_very_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
  theme(plot.title=element_text(size=18, color="#CE3A94"),
      axis.title.x = element_text(size = 16),
      axis.title.y = element_text(size = 16))+
      geom_hline(yintercept = 480)+
  labs(title = "Average Very active min vs. Average min asleep", x="Average very active minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
  geom_point(mapping=aes(x=mean_lightly_active_minutes, y=mean_total_minutes_asleep))+
  geom_smooth(mapping=aes(x=mean_lightly_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
  theme(plot.title=element_text(size=18, color="#CE3A94"),
      axis.title.x = element_text(size = 16),
      axis.title.y = element_text(size = 16))+
      geom_hline(yintercept = 480)+
  labs(title = "Average Lightly active min vs. Average min asleep", x="Average lightly active minutes", y="Average minutes asleep"),
ggplot(data=activity_sleep)+
  geom_point(mapping=aes(x=mean_fairly_active_minutes, y=mean_total_minutes_asleep))+
  geom_smooth(mapping=aes(x=mean_fairly_active_minutes, y=mean_total_minutes_asleep), color="#CE3A94")+
  theme(plot.title=element_text(size=18, color="#CE3A94"),
      axis.title.x = element_text(size = 16),
      axis.title.y = element_text(size = 16))+
      geom_hline(yintercept = 480)+
  labs(title = "Average Fairly active min vs. Average min asleep", x="Average fairly active minutes", y="Average minutes asleep")
)    

According to the first graph, sedentary time and time asleep are negatively correlated: users with more sedentary minutes tend to sleep less.

By contrast, we see no clear correlation between very active, fairly active or lightly active time and time asleep.
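
The visual impression can be backed by a correlation coefficient. The sketch below uses only the rounded per-user averages from the first six rows of `activity_sleep` printed above, so it illustrates the calculation rather than reproducing the result for all 24 users.

```r
# Rounded per-user averages from the first six rows of activity_sleep
sedentary_min  <- c(849, 1099, 714, 1192, 683, 1174)
minutes_asleep <- c(362, 294, 652, 334, 506, 61)

# Pearson correlation; values near -1 or 1 indicate a strong linear relationship
r <- cor(sedentary_min, minutes_asleep)
print(round(r, 2))
```

For these six rows the coefficient is negative, in line with the downward trend in the first graph. Running `cor()` on the full `activity_sleep` columns would give the figure for the whole sample.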

Conclusion

  • The data indicates that most users sleep less than the recommended 8 hours.
  • More sedentary time is associated with less sleep, so reducing sedentary time may support healthier sleep.

4.3 Hourly steps and calories

In this section, we analyze how user activity is distributed over the hours of the day. For this, we will import, clean and merge the following two files:

  • hourly steps (33 ID);
  • hourly calories (33 ID).

#Importing files

hourly_steps <- read.csv("hourlySteps_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")

head(hourly_steps)
          Id          ActivityHour StepTotal
1 1503960366 4/12/2016 12:00:00 AM       373
2 1503960366  4/12/2016 1:00:00 AM       160
3 1503960366  4/12/2016 2:00:00 AM       151
4 1503960366  4/12/2016 3:00:00 AM         0
5 1503960366  4/12/2016 4:00:00 AM         0
6 1503960366  4/12/2016 5:00:00 AM         0
head(hourly_calories)
          Id          ActivityHour Calories
1 1503960366 4/12/2016 12:00:00 AM       81
2 1503960366  4/12/2016 1:00:00 AM       61
3 1503960366  4/12/2016 2:00:00 AM       59
4 1503960366  4/12/2016 3:00:00 AM       47
5 1503960366  4/12/2016 4:00:00 AM       48
6 1503960366  4/12/2016 5:00:00 AM       48
# Cleaning column names
hourly_steps <- hourly_steps %>%
  clean_names()

hourly_calories <- hourly_calories %>%
  clean_names()

head(hourly_steps)
          id         activity_hour step_total
1 1503960366 4/12/2016 12:00:00 AM        373
2 1503960366  4/12/2016 1:00:00 AM        160
3 1503960366  4/12/2016 2:00:00 AM        151
4 1503960366  4/12/2016 3:00:00 AM          0
5 1503960366  4/12/2016 4:00:00 AM          0
6 1503960366  4/12/2016 5:00:00 AM          0
head(hourly_calories)
          id         activity_hour calories
1 1503960366 4/12/2016 12:00:00 AM       81
2 1503960366  4/12/2016 1:00:00 AM       61
3 1503960366  4/12/2016 2:00:00 AM       59
4 1503960366  4/12/2016 3:00:00 AM       47
5 1503960366  4/12/2016 4:00:00 AM       48
6 1503960366  4/12/2016 5:00:00 AM       48
# Cleaning date format

hourly_steps <- hourly_steps %>%
  rename(date=activity_hour) %>% 
  mutate(date=as.POSIXct(date,format = "%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

hourly_calories <- hourly_calories %>%
  rename(date=activity_hour) %>% 
  mutate(date=as.POSIXct(date,format = "%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

head(hourly_steps)
          id                date step_total
1 1503960366 2016-04-12 00:00:00        373
2 1503960366 2016-04-12 01:00:00        160
3 1503960366 2016-04-12 02:00:00        151
4 1503960366 2016-04-12 03:00:00          0
5 1503960366 2016-04-12 04:00:00          0
6 1503960366 2016-04-12 05:00:00          0
head(hourly_calories)
          id                date calories
1 1503960366 2016-04-12 00:00:00       81
2 1503960366 2016-04-12 01:00:00       61
3 1503960366 2016-04-12 02:00:00       59
4 1503960366 2016-04-12 03:00:00       47
5 1503960366 2016-04-12 04:00:00       48
6 1503960366 2016-04-12 05:00:00       48
# Merging files by "id" and "date"
merged_steps_calories <- merge(hourly_steps,hourly_calories, by=c("id","date")) 
# Creating a column with the time only
merged_steps_calories$time <- format(merged_steps_calories$date, format = "%H:%M")
head(merged_steps_calories)
          id                date step_total calories  time
1 1503960366 2016-04-12 00:00:00        373       81 00:00
2 1503960366 2016-04-12 01:00:00        160       61 01:00
3 1503960366 2016-04-12 02:00:00        151       59 02:00
4 1503960366 2016-04-12 03:00:00          0       47 03:00
5 1503960366 2016-04-12 04:00:00          0       48 04:00
6 1503960366 2016-04-12 05:00:00          0       48 05:00
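
By default `merge()` performs an inner join, so any hour present in only one of the two tables is dropped without a message. The sketch below shows one way to check for that, using small toy tables with made-up values (the data frames `steps` and `calories` and all numbers in them are hypothetical, not taken from the dataset).

```r
# Toy tables (hypothetical values): one hour is missing from `calories`.
steps <- data.frame(id = c(1, 1, 2),
                    date = c("00:00", "01:00", "00:00"),
                    step_total = c(373, 160, 0),
                    stringsAsFactors = FALSE)
calories <- data.frame(id = c(1, 2),
                       date = c("00:00", "00:00"),
                       calories = c(81, 47),
                       stringsAsFactors = FALSE)

# merge() defaults to an inner join: rows without a match are dropped.
inner <- merge(steps, calories, by = c("id", "date"))

# all.x = TRUE keeps every row of `steps` and fills `calories` with NA.
left <- merge(steps, calories, by = c("id", "date"), all.x = TRUE)

# Rows of `steps` that found no partner in `calories`.
n_unmatched <- sum(is.na(left$calories))
c(steps = nrow(steps), inner = nrow(inner), unmatched = n_unmatched)
```

If the inner join has as many rows as each input table, no hours were lost in the merge.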
# Cleaning: Checking duplicates
sum(duplicated(merged_steps_calories))
[1] 0
# Cleaning: Checking unique IDs
n_unique(merged_steps_calories$id)
[1] 33
# Transforming the dataframe. We will group by the time
merged_steps_calories <- merged_steps_calories %>%
  group_by(time) %>% 
  summarize(mean_calories=mean(calories), mean_total_steps=mean(step_total)) 
head(merged_steps_calories)
# A tibble: 6 x 3
  time  mean_calories mean_total_steps
  <chr>         <dbl>            <dbl>
1 00:00          71.8            42.2 
2 01:00          70.2            23.1 
3 02:00          69.2            17.1 
4 03:00          67.5             6.43
5 04:00          68.3            12.7 
6 05:00          81.7            43.9 
# Creating charts
options(repr.plot.width = 15, repr.plot.height = 7)
ggarrange(
ggplot(data=merged_steps_calories)+
  geom_col(mapping = aes(x=time,y=mean_total_steps, fill=mean_total_steps))+
  scale_fill_gradient(low = "white", high = "#CE3A94")+
  theme(plot.title = element_text(hjust = 0.5,vjust= 1, size = 22, face = "bold"),
        axis.text.x = element_text(angle = 80, size=11, vjust= 0.4),
        axis.text.y = element_text(size=11))+
  labs(title = "Hourly steps during the day", x="Time", y="Average Steps")+
  guides(fill = guide_legend(title = "Average Steps")),  
ggplot(data=merged_steps_calories)+
  geom_col(mapping = aes(x=time,y=mean_calories, fill=mean_calories))+
  scale_fill_gradient(low = "white", high = "#CE3A94")+
  theme(plot.title = element_text(hjust = 0.5,vjust= 1, size = 22, face = "bold"),
        axis.text.x = element_text(angle = 80, size=11, vjust= 0.4),
        axis.text.y = element_text(size=11))+
  labs(title = "Hourly calories during the day", x="Time", y="Average calories")+
  guides(fill = guide_legend(title = "Average Calories"))
)

The charts show that users are most active between 12 pm and 2 pm (lunchtime) and between 5 pm and 7 pm (after work).
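
The peak hours can also be read from the summary table directly instead of from the chart. The sketch below ranks hours by average steps. It uses only the six rows printed in the earlier output (00:00 to 05:00), so it illustrates the technique and does not reproduce the lunchtime and evening peaks; the names `hourly_summary`, `ranked`, `top_n_hours`, and `busiest` are hypothetical.

```r
# First six rows of the hourly summary, copied from the output above.
hourly_summary <- data.frame(
  time = c("00:00", "01:00", "02:00", "03:00", "04:00", "05:00"),
  mean_calories = c(71.8, 70.2, 69.2, 67.5, 68.3, 81.7),
  mean_total_steps = c(42.2, 23.1, 17.1, 6.43, 12.7, 43.9),
  stringsAsFactors = FALSE
)

# Sort hours from most to least active by average steps.
ranked <- hourly_summary[order(-hourly_summary$mean_total_steps), ]

# Keep the busiest hours; `top_n_hours` is an illustrative cutoff.
top_n_hours <- 2
busiest <- head(ranked$time, top_n_hours)
busiest
```

Applied to the full 24-row `merged_steps_calories` table, the same ordering would list the most active hours of the day.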

It’s important to note that the second chart reflects calories burned through all kinds of activity, not just steps.

Conclusion

Because the calorie chart differs from the step chart, users burn calories in ways other than walking, so we can recommend different types of exercise rather than step goals alone.

5. Conclusion

5.1 Key insights

  • It’s important to send notifications that help reduce sedentary time (to support good sleep and a healthier lifestyle). The recommendations could suggest simple, quick exercises, so that women can improve their habits without feeling stressed.

  • Send notifications to go to bed on time. The app can be enhanced with sleep/meditation music, etc.

  • Exercise recommendations may be based on lifestyle. For example, this may depend on the number of children, type of work, etc.

  • Given the pandemic, and given that many people are working remotely, recommendations can be based on activity at home. Even during work, women could do simple exercises every 1 to 1.5 hours, or at whatever interval they prefer.

5.2 Additional data

Considering the data limitations, additional data is required: * Current original data from Bellabeat; * Age, demographics; * Lifestyle; * Preferences and what motivates users.

5.3 More ideas

The following ideas could help build an even stronger brand and make the device a helpful “friend”:

  • Emotions/mood control. Recommendations of exercises, affirmations, nutrition depending on the mood. Ideally, an application should use algorithms based on user preferences.

  • In pop-up notifications, users could rate their desire to exercise, readiness for sleep, etc.

  • Community support. Building a user community in an app can increase motivation.

  • Community meetings in different locations (it could be done through social media).

  • A notebook in which women could write down their feelings, emotions, or just something that inspires and motivates them. Women could choose to view their own written motivation in a pop-up window (if they selected that option in the app).

  • Hire a psychotherapist who could support and motivate women (additional payment).

  • “Friend” chat support from the Bellabeat team (additional payment).

  • The option to create a playlist of favorite music to help women stay motivated.

  • Special recommendations for pregnant women.

  • Reminders about health check-ups, such as cancer screening, hormone tests, etc.

5.4 Strategy

  1. Define the main advantages and disadvantages of applying the insights to different customer segments.
  2. Identify the main competition in each customer segment.
  3. Create a marketing plan for targeted segments.
  4. Test and improve.