Bellabeat Case Study

Introduction

Company Background

Bellabeat is a tech-driven wellness company for women, founded in 2013, that manufactures beautifully designed, health-focused smart products to inform and inspire women around the world. Bellabeat technology collects data on activity, sleep, stress, and reproductive health, empowering women with knowledge about their own health and habits.

Bellabeat Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

Project Background

As a junior data analyst on the marketing analytics team at Bellabeat, I have been asked to analyze usage data from non-Bellabeat (FitBit) smart devices to gain insight into how consumers use them, and then apply these insights to one of Bellabeat’s products to help guide the company’s marketing strategy.

Key Stakeholders

Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer

Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team

Marketing analytics team: A team of data analysts responsible for guiding Bellabeat’s marketing strategy.

Project Key Deliverables

  1. A clear summary of the business task
  2. A description of all data sources used
  3. Documentation of any cleaning or manipulation of data
  4. A summary of data analysis
  5. Supporting visualizations and key findings
  6. High-level content recommendations based on the analysis

Data Analysis

Business Task

  1. What are the trends in consumer usage of non-Bellabeat (FitBit) smart devices?
  2. How can these trends apply to Bellabeat’s customers?
  3. How can the insights gained from these trends help improve Bellabeat’s marketing strategy?

Data Preparation

Data Source

The FitBit Fitness Tracker Data (CC0: Public Domain) dataset used for this project is a public dataset that explores smart device users’ daily habits, made available through Mobius.

Dataset Description

These datasets were generated by thirty consenting Fitbit users who responded to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. For this analysis, the datasets containing personal tracker data for daily physical activity, weight logs, and sleep monitoring were used.

Setting up environment

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate) #for mdy()
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(janitor) #for clean_names()
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("corrplot") #for plotting cor matrix
## corrplot 0.90 loaded
library(stats) #for cor()
library("skimr") # for skim() data summaries
library(ggpubr) # for ggpie(), ggscatter() and stat_cor()

Loading Data

daily_activity <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/dailyActivity_merged.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   ActivityDate = col_character(),
##   TotalSteps = col_double(),
##   TotalDistance = col_double(),
##   TrackerDistance = col_double(),
##   LoggedActivitiesDistance = col_double(),
##   VeryActiveDistance = col_double(),
##   ModeratelyActiveDistance = col_double(),
##   LightActiveDistance = col_double(),
##   SedentaryActiveDistance = col_double(),
##   VeryActiveMinutes = col_double(),
##   FairlyActiveMinutes = col_double(),
##   LightlyActiveMinutes = col_double(),
##   SedentaryMinutes = col_double(),
##   Calories = col_double()
## )
weight_data <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/weightLogInfo_merged.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   Date = col_character(),
##   WeightKg = col_double(),
##   WeightPounds = col_double(),
##   Fat = col_double(),
##   BMI = col_double(),
##   IsManualReport = col_logical(),
##   LogId = col_double()
## )
sleep_data <- read_csv("D:/ugwun/Documents/R projects/fitbit_data/sleepDay_merged.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Id = col_double(),
##   SleepDay = col_character(),
##   TotalSleepRecords = col_double(),
##   TotalMinutesAsleep = col_double(),
##   TotalTimeInBed = col_double()
## )

Data Exploration

#Take a glimpse at the daily activity data
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/~
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
head(daily_activity)
## # A tibble: 6 x 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
#Take a glimpse at the sleep data
colnames(sleep_data)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
glimpse(sleep_data)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "~
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
head(sleep_data)
## # A tibble: 6 x 5
##          Id SleepDay           TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
##       <dbl> <chr>                         <dbl>             <dbl>          <dbl>
## 1    1.50e9 4/12/2016 12:00:0~                1               327            346
## 2    1.50e9 4/13/2016 12:00:0~                2               384            407
## 3    1.50e9 4/15/2016 12:00:0~                1               412            442
## 4    1.50e9 4/16/2016 12:00:0~                2               340            367
## 5    1.50e9 4/17/2016 12:00:0~                1               700            712
## 6    1.50e9 4/19/2016 12:00:0~                1               304            320
#Take a glimpse at the weight data
colnames(weight_data)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
glimpse(weight_data)
## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212~
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2~
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, ~
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6~
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,~
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,~
head(weight_data)
## # A tibble: 6 x 8
##         Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport     LogId
##      <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>              <dbl>
## 1   1.50e9 5/2/2016 ~     52.6         116.    22  22.6 TRUE             1.46e12
## 2   1.50e9 5/3/2016 ~     52.6         116.    NA  22.6 TRUE             1.46e12
## 3   1.93e9 4/13/2016~    134.          294.    NA  47.5 FALSE            1.46e12
## 4   2.87e9 4/21/2016~     56.7         125.    NA  21.5 TRUE             1.46e12
## 5   2.87e9 5/12/2016~     57.3         126.    NA  21.7 TRUE             1.46e12
## 6   4.32e9 4/17/2016~     72.4         160.    25  27.5 TRUE             1.46e12

Data Summary

# How many observations are in each dataset
nrow(daily_activity)
## [1] 940
nrow(sleep_data)
## [1] 413
nrow(weight_data)
## [1] 67
# How many unique IDs are in each dataset
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_data$Id)
## [1] 24
n_distinct(weight_data$Id)
## [1] 8
# How many duplicate rows are in the dataset
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_data))
## [1] 3
sum(duplicated(weight_data))
## [1] 0
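
The duplicate check above can be illustrated on a small made-up table (the values are invented for illustration and are not taken from the Fitbit files). `duplicated()` flags every row that repeats an earlier row, and `dplyr::distinct()` keeps one copy of each distinct row.

```r
library(dplyr)

# Toy sleep log (made-up values): the second and third rows are identical
toy_sleep <- tibble(
  id = c(1, 1, 1, 2),
  sleep_day = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  total_minutes_asleep = c(327, 384, 384, 410)
)

# Number of rows that repeat an earlier row
n_duplicates <- sum(duplicated(toy_sleep))

# Keep a single copy of each distinct row
toy_sleep_clean <- distinct(toy_sleep)

n_duplicates
nrow(toy_sleep_clean)
```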

Observation

From a quick scan of the loaded datasets, the following observations were made:

  1. The Id column is common to all 3 datasets and can be used to merge them.
  2. The date variables in the 3 datasets (daily_activity$ActivityDate, sleep_data$SleepDay, and weight_data$Date) are currently character variables and need to be converted to Date format.
  3. The sleep_data and weight_data datasets have the date and time merged in one column; these need to be separated, as only the date will be used for the analysis.
  4. Only 24 and 8 unique users logged sleep data and weight data respectively, compared with 33 unique users who logged their daily activity. This suggests that most users used the device to log daily activity, but not all of them tracked their weight and sleep with it.
  5. There are no duplicate rows in daily_activity or weight_data; however, sleep_data has 3 duplicate rows, which need to be removed.
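
As a sketch of the date conversion described above: lubridate's `mdy_hms()` can parse the combined date-time strings directly, and `as_date()` then drops the time part. This is an alternative to splitting the column first, shown on two strings copied from the `glimpse()` output; the object names are illustrative.

```r
library(lubridate)

# Two date-time strings in the format seen in the sleep and weight data
# (copied from the glimpse() output above)
raw_datetime <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Parse month/day/year hour:minute:second (AM/PM included), then keep the date only
parsed_date <- as_date(mdy_hms(raw_datetime))

parsed_date
class(parsed_date)
```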

Data Cleaning and Manipulation

# Clean column names to snake_case (lower case with underscores)
daily_activity <- clean_names(daily_activity)
sleep_data <- clean_names(sleep_data)
weight_data <- clean_names(weight_data)

# Change the activity_date column name to 'date' in the daily_activity dataset
daily_activity <- daily_activity %>% 
  dplyr::rename(date = activity_date)

# Examine the column names
colnames(daily_activity)
##  [1] "id"                         "date"                      
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"
colnames(sleep_data)
## [1] "id"                   "sleep_day"            "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(weight_data)
## [1] "id"               "date"             "weight_kg"        "weight_pounds"   
## [5] "fat"              "bmi"              "is_manual_report" "log_id"
# Remove duplicate rows from the sleep_data
sleep_data <- distinct(sleep_data)

# Confirm that the duplicate was removed
sum(duplicated(sleep_data))
## [1] 0
# find missing values
sum(is.na(daily_activity))
## [1] 0
sum(is.na(sleep_data))
## [1] 0
sum(is.na(weight_data))
## [1] 65
# There are 65 missing values in the fat column of the weight_data
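
To see which columns the missing values sit in, a per-column count is a quick check. The sketch below uses a small toy data frame (values chosen for illustration), not the real `weight_data`.

```r
# Toy weight log (illustrative values) with missing entries only in `fat`
toy_weight <- data.frame(
  id = c(1, 1, 2, 3),
  weight_kg = c(52.6, 52.6, 133.5, 56.7),
  fat = c(22, NA, NA, NA)
)

# Total number of missing values in the whole data frame
total_missing <- sum(is.na(toy_weight))

# Missing values per column, which shows where they come from
missing_by_column <- colSums(is.na(toy_weight))

total_missing
missing_by_column
```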

# remove the 'fat' column with the missing data in the weight_data and the log_id column
weight_data<- select(weight_data, -fat)
weight_data<- select(weight_data, -log_id)

# Examine the weight_data
head(weight_data)
## # A tibble: 6 x 6
##           id date                 weight_kg weight_pounds   bmi is_manual_report
##        <dbl> <chr>                    <dbl>         <dbl> <dbl> <lgl>           
## 1 1503960366 5/2/2016 11:59:59 PM      52.6          116.  22.6 TRUE            
## 2 1503960366 5/3/2016 11:59:59 PM      52.6          116.  22.6 TRUE            
## 3 1927972279 4/13/2016 1:08:52 AM     134.           294.  47.5 FALSE           
## 4 2873212765 4/21/2016 11:59:59 ~      56.7          125.  21.5 TRUE            
## 5 2873212765 5/12/2016 11:59:59 ~      57.3          126.  21.7 TRUE            
## 6 4319703577 4/17/2016 11:59:59 ~      72.4          160.  27.5 TRUE
# Convert the data type of the date column from character variable to date variable 
daily_activity$date <- lubridate::mdy(daily_activity$date)
daily_activity <-  mutate(daily_activity, weekday = weekdays(date))
# confirm that the data type is changed from character to date
head(daily_activity)
## # A tibble: 6 x 16
##       id date       total_steps total_distance tracker_distance logged_activiti~
##    <dbl> <date>           <dbl>          <dbl>            <dbl>            <dbl>
## 1 1.50e9 2016-04-12       13162           8.5              8.5                 0
## 2 1.50e9 2016-04-13       10735           6.97             6.97                0
## 3 1.50e9 2016-04-14       10460           6.74             6.74                0
## 4 1.50e9 2016-04-15        9762           6.28             6.28                0
## 5 1.50e9 2016-04-16       12669           8.16             8.16                0
## 6 1.50e9 2016-04-17        9705           6.48             6.48                0
## # ... with 10 more variables: very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, weekday <chr>
# sleep_data cleaning: separate sleep_day column to date and time column, convert the date from character variable to date format, and add weekdays column
sleep_data <- sleep_data %>%
    separate(sleep_day,c("date","time"), sep=" ") %>%
    mutate(date = mdy(date), weekday = weekdays(date)) %>%
    select(-"time")
## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

sleep_data$weekday <- factor(sleep_data$weekday, 
                                 levels = c("Monday", "Tuesday", "Wednesday", 
                                            "Thursday", "Friday", "Saturday", 
                                            "Sunday"))

# confirm that the data type is changed from character to date format
head(sleep_data)
## # A tibble: 6 x 6
##        id date       total_sleep_reco~ total_minutes_a~ total_time_in_b~ weekday
##     <dbl> <date>                 <dbl>            <dbl>            <dbl> <fct>  
## 1  1.50e9 2016-04-12                 1              327              346 Tuesday
## 2  1.50e9 2016-04-13                 2              384              407 Wednes~
## 3  1.50e9 2016-04-15                 1              412              442 Friday 
## 4  1.50e9 2016-04-16                 2              340              367 Saturd~
## 5  1.50e9 2016-04-17                 1              700              712 Sunday 
## 6  1.50e9 2016-04-19                 1              304              320 Tuesday
# weight_data cleaning: separate date column to date and time column, convert the date from character variable to date format
weight_data <- weight_data %>%
  separate(date, c("date", "time"), sep = " ")%>%
  select(-"time")%>%
  mutate(date = mdy(date), weekday = weekdays(date))
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# confirm that the data type is changed from character to date
head(weight_data)
## # A tibble: 6 x 7
##           id date       weight_kg weight_pounds   bmi is_manual_report weekday  
##        <dbl> <date>         <dbl>         <dbl> <dbl> <lgl>            <chr>    
## 1 1503960366 2016-05-02      52.6          116.  22.6 TRUE             Monday   
## 2 1503960366 2016-05-03      52.6          116.  22.6 TRUE             Tuesday  
## 3 1927972279 2016-04-13     134.           294.  47.5 FALSE            Wednesday
## 4 2873212765 2016-04-21      56.7          125.  21.5 TRUE             Thursday 
## 5 2873212765 2016-05-12      57.3          126.  21.7 TRUE             Thursday 
## 6 4319703577 2016-04-17      72.4          160.  27.5 TRUE             Sunday
# Add a Sleep_quality column to categorize minutes of sleep as sleep deprived, adequate, or excessive (per CDC recommendation)
sleep_data <- sleep_data %>% 
  mutate(Sleep_quality = case_when(
    total_minutes_asleep < 420 ~ "sleep deprived",
    total_minutes_asleep >= 420 & 
      total_minutes_asleep <= 540 ~ "adequate sleep",
    total_minutes_asleep > 540 ~ "excessive sleep"))
# Take a look at the dataframe
head(sleep_data)
## # A tibble: 6 x 7
##        id date       total_sleep_reco~ total_minutes_a~ total_time_in_b~ weekday
##     <dbl> <date>                 <dbl>            <dbl>            <dbl> <fct>  
## 1  1.50e9 2016-04-12                 1              327              346 Tuesday
## 2  1.50e9 2016-04-13                 2              384              407 Wednes~
## 3  1.50e9 2016-04-15                 1              412              442 Friday 
## 4  1.50e9 2016-04-16                 2              340              367 Saturd~
## 5  1.50e9 2016-04-17                 1              700              712 Sunday 
## 6  1.50e9 2016-04-19                 1              304              320 Tuesday
## # ... with 1 more variable: Sleep_quality <chr>
# Add a weight_status column to categorize BMI (cut-offs at 18.5, 25 and 30, with no gaps between categories)
weight_data <- weight_data %>%
  mutate(weight_status = case_when(
    bmi < 18.5 ~ "underweight",
    bmi < 25 ~ "healthy weight",
    bmi < 30 ~ "overweight",
    bmi >= 30 ~ "obesity"))
# take a look at the data frame
head(weight_data)
## # A tibble: 6 x 8
##           id date       weight_kg weight_pounds   bmi is_manual_report weekday  
##        <dbl> <date>         <dbl>         <dbl> <dbl> <lgl>            <chr>    
## 1 1503960366 2016-05-02      52.6          116.  22.6 TRUE             Monday   
## 2 1503960366 2016-05-03      52.6          116.  22.6 TRUE             Tuesday  
## 3 1927972279 2016-04-13     134.           294.  47.5 FALSE            Wednesday
## 4 2873212765 2016-04-21      56.7          125.  21.5 TRUE             Thursday 
## 5 2873212765 2016-05-12      57.3          126.  21.7 TRUE             Thursday 
## 6 4319703577 2016-04-17      72.4          160.  27.5 TRUE             Sunday   
## # ... with 1 more variable: weight_status <chr>
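
The BMI categories can also be expressed with base R's `cut()`, which makes the bin edges explicit. This is a sketch, assuming half-open bins that include their lower edge, with cut-offs at 18.5, 25 and 30; the input values are made up to probe the boundaries.

```r
# Made-up BMI values chosen to sit on or near the category boundaries
bmi <- c(18.4, 18.5, 24.95, 25, 29.95, 30, 47.54)

# right = FALSE makes each bin include its lower edge and exclude its upper edge,
# so every value falls into exactly one category
weight_status <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "healthy weight", "overweight", "obesity"),
  right = FALSE
)

data.frame(bmi, weight_status)
```

With this layout, values such as 24.95 or exactly 30 still receive a category rather than `NA`.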

Data Exploration and Analysis

daily_activity %>%  
  select(total_steps,
         total_distance,
         sedentary_minutes) %>%
  summary()
##   total_steps    total_distance   sedentary_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

1. What is the smart device usage rate among users?

Assumptions:

  1. Daily usage was calculated from the daily steps logged.
  2. A day with zero steps (steps = 0) was considered NO USAGE.
  3. A day with more than zero steps (steps > 0) was considered ACTIVE USAGE.
  4. active_days was defined as the total number of days of ACTIVE USAGE (steps > 0).
  5. total_days was defined as the total number of observed days.
  6. Usage rate was defined as active_days / total_days.
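
These definitions can be sketched on a toy data set (made-up step counts for two hypothetical users):

```r
library(dplyr)

# Made-up daily step counts for two hypothetical users
toy_activity <- tibble(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  total_steps = c(8000, 0, 5000, 12000, 3000, 7000, 9000, 100)
)

toy_usage <- toy_activity %>%
  group_by(id) %>%
  summarise(
    active_days = sum(total_steps > 0),  # days with at least one step
    total_days = n(),                    # all observed days
    usage_rate = active_days / total_days
  )

toy_usage
```

User 1 has one zero-step day out of four, so the usage rate is below 1; user 2 logged steps on every observed day.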

Findings

  1. 55% of the users had a ‘perfect’ usage record (100% usage rate), i.e. the smart device was used every day to log daily activity within the recorded period.
  2. 27% of the users had a usage rate of less than 100% but at least 80%, while the rest of the users (15%) had a usage rate between 50% and 80%.
# Creating dataframe for daily usage
daily_usage_df <- daily_activity %>%
    group_by(id) %>%
    summarise(active_days = sum(total_steps > 0),  # days with steps logged
              total_days = n(),                    # all observed days
              usage_rate = active_days / total_days)

#checking summary of usage rate 
daily_usage_df%>%
    select(-id)%>%
    summary()
##   active_days      total_days      usage_rate    
##  Min.   : 3.00   Min.   : 4.00   Min.   :0.5484  
##  1st Qu.:21.00   1st Qu.:29.00   1st Qu.:0.9231  
##  Median :30.00   Median :31.00   Median :1.0000  
##  Mean   :26.15   Mean   :28.48   Mean   :0.9138  
##  3rd Qu.:31.00   3rd Qu.:31.00   3rd Qu.:1.0000  
##  Max.   :31.00   Max.   :31.00   Max.   :1.0000
# grouping users by usage rate
daily_usage_df <- daily_usage_df %>%
    mutate( user_type = case_when(
        usage_rate == 1 ~ "perfect user",
        usage_rate < 1 & usage_rate >= 0.8 ~ "active user",
        usage_rate < 0.8 & usage_rate >= 0.5 ~ "average user",
        usage_rate < 0.5 ~ "casual user"))

# Creating a data frame for plotting different category of users
daily_usage_plot <- daily_usage_df %>%
    group_by(user_type) %>%
    summarise(user_count = n()) %>%
              mutate(perc_usage = (round(user_count / sum(user_count)*100,0)))

# assign and define color palette
mycols <- c("azure4", "#BFC9CA", "#FADBD8")
# paste percentage sign to calculated value
labs <- paste0(daily_usage_plot$perc_usage, "%")

# Plot pie chart distribution of users
ggpie(daily_usage_plot, "user_count", label = labs,
   fill = "user_type", color = "white",
   palette = (mycols))

2. What is the smart device usage rate at different days of the week?

Assumptions:

  1. Daily usage was calculated from the daily steps logged.
  2. A day with zero steps (steps = 0) was considered NO USAGE.
  3. A day with more than zero steps (steps > 0) was considered ACTIVE USAGE.
  4. active_user was defined as the number of user-day records with steps > 0 on a given weekday.
  5. total_user was defined as the total number of user-day records observed on that weekday.
  6. Usage rate was defined as active_user / total_user.
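
A per-weekday rate computed from rows counts user-day records rather than distinct users. The sketch below, on made-up data, shows both quantities side by side; `n_distinct(id)` is one way to count distinct users if that is the quantity of interest.

```r
library(dplyr)

# Made-up records: user 1 has two Mondays, one of them with zero steps
toy_activity <- tibble(
  id = c(1, 1, 2, 2, 3),
  weekday = c("Monday", "Monday", "Monday", "Friday", "Friday"),
  total_steps = c(0, 9000, 4000, 0, 6000)
)

toy_weekday_usage <- toy_activity %>%
  group_by(weekday) %>%
  summarise(
    active_records = sum(total_steps > 0),          # user-days with steps
    total_records = n(),                            # all user-days
    usage_rate = active_records / total_records,
    active_users = n_distinct(id[total_steps > 0])  # distinct users with steps
  )

toy_weekday_usage
```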

Findings

Usage rate was slightly higher on Fridays than on other days of the week.

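# Creating data frame (weekdays ordered Monday to Sunday for plotting)
daily_usage_weekday <- daily_activity %>%
    mutate(weekday = factor(weekday, levels = c("Monday", "Tuesday", "Wednesday",
                                                "Thursday", "Friday", "Saturday",
                                                "Sunday"))) %>%
    group_by(weekday) %>%
    summarise(active_user = sum(total_steps > 0),  # user-day records with steps
              total_user = n(),                    # all user-day records
              usage_rate = active_user / total_user)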
#Plotting daily usage per day
ggplot(data=daily_usage_weekday, aes(x=weekday, y=usage_rate)) +
  geom_bar(stat="identity", width=0.5, fill = "#FADBD8") +
    #zoom in y-axis to "0.8 ~ 1"
    coord_cartesian(ylim = c(0.8, 1))+
    #main title text
    ggtitle("Usage Rate By Weekdays") +
    # x, y axis label text
     xlab("Weekdays") + ylab("Usage Rate") +
    #text settings
    theme(
        plot.title = element_text(color="black", size=24, face="bold"),
        axis.title.x = element_text(color="black", size=18, face="bold"),
        axis.title.y = element_text(color="black", size=18, face="bold"))

3. What is the relationship between the smart device usage rate and the number of calories burned, total steps taken and total minutes spent not doing any physical activities by users?

Findings

  1. There is no significant correlation between the number of calories burned and the device usage rate (R = 0.13, p = 0.48).
  2. As expected, there was a positive correlation between the usage rate and the total steps taken by users (R = 0.7, p = 0.000005), and a negative correlation between the usage rate and the total minutes spent sedentary (R = -0.47, p = 0.0055).
# Creating data frame
usage_activity <- daily_activity %>%
    group_by(id) %>%
    summarise( active_user = sum(total_steps != 0), 
              total_user = sum(total_steps >= 0),
             usage_rate = active_user / total_user,
             avg_sedentary_minutes = mean(sedentary_minutes),
               avg_calories_burned = mean(calories),
               avg_steps = mean(total_steps)) 

# Relationship between usage rate and calories burned
ggscatter(usage_activity, x = "usage_rate", y = "avg_calories_burned",
   color = "black", shape = 21, size = 3, # Points color, shape and size
   add = "reg.line",  # Add regression line
   add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
   conf.int = TRUE, # Add confidence interval
   cor.coef = TRUE # Add correlation coefficient
   )
## `geom_smooth()` using formula 'y ~ x'

# Relationship between usage rate and total steps
ggscatter(usage_activity, x = "usage_rate", y = "avg_steps",
   color = "black", shape = 21, size = 3, # Points color, shape and size
   add = "reg.line",  # Add regression line
   add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
   conf.int = TRUE, # Add confidence interval
   cor.coef = TRUE # Add correlation coefficient
   )
## `geom_smooth()` using formula 'y ~ x'

# Relationship between usage rate and sedentary time
ggscatter(usage_activity, x = "usage_rate", y = "avg_sedentary_minutes",
   color = "black", shape = 21, size = 3, # Points color, shape and size
   add = "reg.line",  # Add regression line
   add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
   conf.int = TRUE, # Add confidence interval
   cor.coef = TRUE # Add correlation coefficient
   )
## `geom_smooth()` using formula 'y ~ x'
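
The R and p values quoted in the findings are the kind of output a correlation test produces. As a sketch, assuming the Pearson method, base R's `cor.test()` reports both; the per-user numbers below are made up and do not reproduce the results above.

```r
# Made-up per-user summaries (not the Fitbit data)
usage_rate <- c(0.55, 0.70, 0.80, 0.90, 0.95, 1.00, 1.00, 1.00)
avg_steps  <- c(2500, 4000, 5200, 6100, 9000, 7500, 11000, 8200)

# Pearson correlation test: `estimate` is R, `p.value` is the two-sided p value
result <- cor.test(usage_rate, avg_steps, method = "pearson")

r_value <- unname(result$estimate)
p_value <- result$p.value

r_value
p_value
```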

4. How does the intensity of activity of smart device users vary by day of the week?

Findings

Most participants spent most of their time sedentary (high sedentary minutes), followed by light activity. Tuesday appears to have the most active minutes overall.

# Creating a data frame to summarize the activity levels by week day
active_minutes <- daily_activity %>% 
  select(weekday, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes) %>% 
  group_by(weekday) %>%
  summarize(very_active = sum(very_active_minutes),
            fairly_active = sum(fairly_active_minutes),
            lightly_active = sum(lightly_active_minutes),
            sedentary= sum(sedentary_minutes))

#order the factor levels to follow a set sequence
active_minutes$weekday <- factor(active_minutes$weekday, 
                                 levels = c("Monday", "Tuesday", "Wednesday", 
                                            "Thursday", "Friday", "Saturday", 
                                            "Sunday"))

# Convert to long data for easy plotting
active_minutes_long <- pivot_longer(active_minutes,
                                    cols = very_active:sedentary, 
                                    names_to = "activity_levels", 
                                    values_to = "total_minutes")

# Plotting activity intensity levels by weekday
ggplot(active_minutes_long, aes(x = weekday, y = total_minutes)) +
  geom_col(aes(fill = activity_levels), position = position_dodge2(preserve = "single")) +
  labs(title = "Active Minutes by Intensity Per Day", 
       x = "Day of the Week",
       y = "Total Active Minutes",
       fill = "Level of Intensity") +
  theme(axis.text.x=element_text(angle=45,hjust=1))

# Expanding  on the activity levels without the sedentary time 
active_minutes_2 <- daily_activity %>% 
  select(weekday, very_active_minutes, fairly_active_minutes, lightly_active_minutes) %>% 
  group_by(weekday) %>%
  summarize(very_active = sum(very_active_minutes),
            fairly_active = sum(fairly_active_minutes),
            lightly_active = sum(lightly_active_minutes))

# Order the weekdays and reshape active_minutes_2 (not active_minutes) to long format
active_minutes_2$weekday <- factor(active_minutes_2$weekday, 
                                 levels = c("Monday", "Tuesday", "Wednesday", 
                                            "Thursday", "Friday", "Saturday", 
                                            "Sunday"))
active_minutes_2_long <- pivot_longer(active_minutes_2,
                                    cols = very_active:lightly_active, 
                                    names_to = "activity_levels", 
                                    values_to = "total_minutes")


ggplot(active_minutes_2_long, aes(x = weekday, y = total_minutes)) +
  geom_col(aes(fill = activity_levels), position = position_dodge2(preserve = "single")) +
  labs(title = "Active Minutes by Intensity Per Day", 
       x = "Day of the Week",
       y = "Total Active Minutes",
       fill = "Level of Intensity") +
  theme(axis.text.x=element_text(angle=45,hjust=1))
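
One caveat when comparing weekdays by summed minutes: the totals also depend on how many records fall on each weekday. A per-record mean is one way to check this. A minimal sketch on made-up data:

```r
library(dplyr)

# Made-up records: Tuesday has three records, Saturday only two
toy_minutes <- tibble(
  weekday = c("Tuesday", "Tuesday", "Tuesday", "Saturday", "Saturday"),
  very_active_minutes = c(30, 30, 30, 40, 40)
)

toy_summary <- toy_minutes %>%
  group_by(weekday) %>%
  summarise(
    n_records = n(),
    total_very_active = sum(very_active_minutes),  # grows with the number of records
    mean_very_active = mean(very_active_minutes)   # comparable across weekdays
  )

toy_summary
```

In this toy case Tuesday has the larger total but the smaller mean, so the ranking of weekdays can change depending on which summary is used.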

5. What is the relationship between the various activity intensity levels and sleep quality?

Findings

Although the sleep data is limited, the smart device users who recorded their sleep show no significant correlation between sleep and the levels of activity intensity, with the exception of sedentary minutes, which were negatively correlated with total minutes asleep (R = -0.6). That is, more time spent sedentary is associated with less sleep.

# Creating a data frame by joining sleep data and activity data, then selecting the relevant columns and filtering out rows with NA

Sleep_daily_activity <- left_join(daily_activity, sleep_data, by = c("id", "date"))

sleep_activity_df <- Sleep_daily_activity %>% 
  select(date, very_active_minutes, fairly_active_minutes, lightly_active_minutes, 
         sedentary_minutes, total_minutes_asleep, Sleep_quality)

sleep_activity_df <- sleep_activity_df %>% 
  filter(!is.na(total_minutes_asleep)) %>% 
  filter(!is.na(Sleep_quality))

# Plotting activity levels vs sleep quality

sleep_activity_df %>% 
  ggplot(mapping = aes(x = total_minutes_asleep, y = very_active_minutes)) +
  geom_point(aes(color = Sleep_quality)) +
  scale_color_brewer(palette = "Set1") +
  stat_cor(aes(color = Sleep_quality), label.x = 3)

sleep_activity_df %>% 
  ggplot(mapping = aes(x = total_minutes_asleep, y = fairly_active_minutes)) +
  geom_point(aes(color = Sleep_quality)) +
  scale_color_brewer(palette = "Set1") +
  stat_cor(aes(color = Sleep_quality), label.x = 3)

sleep_activity_df %>% 
  ggplot(mapping = aes(x = total_minutes_asleep, y = lightly_active_minutes)) +
  geom_point(aes(color = Sleep_quality)) +
  scale_color_brewer(palette = "Set1") +
  stat_cor(aes(color = Sleep_quality), label.x = 3)

sleep_activity_df %>% 
  ggplot(mapping = aes(x = total_minutes_asleep, y = sedentary_minutes)) +
  geom_point(aes(color = Sleep_quality)) +
  scale_color_brewer(palette = "Set1") +
  stat_cor(aes(color = Sleep_quality), label.x = 3)
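
Because stat_cor() is given a color aesthetic in the plots above, it reports one correlation per sleep-quality group rather than a single overall value. A minimal sketch of how an overall coefficient could be checked numerically with base R's cor.test() is shown below. The data frame toy_sleep_activity and its values are made up for illustration and only stand in for sleep_activity_df.

```r
# Toy data standing in for sleep_activity_df (hypothetical values, not from the dataset)
toy_sleep_activity <- data.frame(
  sedentary_minutes    = c(600, 650, 700, 750, 800, 850, 900, 950),
  total_minutes_asleep = c(480, 470, 450, 430, 400, 390, 350, 330)
)

# Pearson correlation across all rows, with a significance test
overall_cor <- cor.test(toy_sleep_activity$sedentary_minutes,
                        toy_sleep_activity$total_minutes_asleep,
                        method = "pearson")

overall_cor$estimate  # correlation coefficient R
overall_cor$p.value   # p-value for the null hypothesis that R equals 0
```

Running the same call on the real columns would give one R value to compare against the per-group labels printed on the plots.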

6. What is the relationship between time spent in bed and total time asleep?

Findings

There is a strong positive correlation between time spent in bed and total time asleep: users who spend more time in bed get more sleep.

ggplot(data=sleep_data, aes(x=total_minutes_asleep, y=total_time_in_bed)) + 
  geom_point(aes(color = Sleep_quality))+
  scale_color_brewer(palette = "Set2") +
  stat_cor()

7. Is the quality of sleep affected by the day of the week?

Findings

Weekends (Saturdays and Sundays) had the highest number of excessive-sleep records, while Wednesdays had the highest number of adequate-sleep records. Tuesdays had the highest number of sleep-deprived records.

sleep_qual_day <- sleep_data %>% 
  select(weekday, Sleep_quality) %>% 
  group_by(weekday, Sleep_quality) %>%
  tally()
 

# The y-axis shows the number of records (n) in each sleep-quality category
ggplot(sleep_qual_day, aes(x = weekday, y = n)) +
  geom_col(aes(fill = Sleep_quality)) +
  labs(title = "Sleep Quality Per Day", 
       x = "Day of the Week",
       y = "Number of Sleep Records",
       fill = "Sleep Quality") +
  theme(axis.text.x=element_text(angle=45,hjust=1))
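
Raw counts can be hard to compare when some days have more sleep records than others. As a rough sketch, the counts could be converted to within-day shares with base R's prop.table(). The toy_sleep data frame below is hypothetical, and the label "adequate sleep" is an assumption about how that category is spelled in Sleep_quality.

```r
# Toy data standing in for sleep_data (hypothetical values)
toy_sleep <- data.frame(
  weekday = c("Monday", "Monday", "Monday", "Monday", "Sunday", "Sunday"),
  Sleep_quality = c("sleep deprived", "sleep deprived", "adequate sleep",
                    "excessive sleep", "excessive sleep", "adequate sleep")
)

# Count records per weekday and sleep-quality category
counts <- table(toy_sleep$weekday, toy_sleep$Sleep_quality)

# Convert counts to shares within each weekday (each row sums to 1)
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

Shares make it clear whether a day stands out because of its mix of sleep categories or simply because it has more records.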

8. Does the weight of smart device users affect sleep quality?

Findings

There is no significant correlation between weight and sleep quality. However, the one user whose weight status was recorded as obese was also sleep deprived. This could be significant, but the data is not sufficient to draw a reasonable conclusion.

all_activity_log <- left_join(Sleep_daily_activity, weight_data, by = c("id", "date"))


# weight and sleep

weight_sleep_df <- all_activity_log %>% 
  select(date, bmi, total_minutes_asleep, Sleep_quality, weight_status)

head(weight_sleep_df)
## # A tibble: 6 x 5
##   date         bmi total_minutes_asleep Sleep_quality   weight_status
##   <date>     <dbl>                <dbl> <chr>           <chr>        
## 1 2016-04-12    NA                  327 sleep deprived  <NA>         
## 2 2016-04-13    NA                  384 sleep deprived  <NA>         
## 3 2016-04-14    NA                   NA <NA>            <NA>         
## 4 2016-04-15    NA                  412 sleep deprived  <NA>         
## 5 2016-04-16    NA                  340 sleep deprived  <NA>         
## 6 2016-04-17    NA                  700 excessive sleep <NA>
weight_sleep_df %>% 
  filter(!is.na(total_minutes_asleep)) %>% 
  filter(!is.na(Sleep_quality)) %>% 
  filter(!is.na(weight_status)) %>% 
  filter(!is.na(bmi)) %>% 
  ggplot(mapping = aes(x = total_minutes_asleep, y = bmi)) +
  # weight_status is mapped to shape; left unnamed it would be read as the x aesthetic
  geom_point(aes(color = Sleep_quality, shape = weight_status), size = 5) +
  labs(title = "Sleep vs Weight", x = "Total Minutes Asleep", 
       y = "BMI") +
  scale_color_brewer(palette = "Set1")+
  stat_cor()
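
Since the finding above rests on very few users, it can help to count how many distinct users fall into each weight class before reading anything into the plot. The sketch below uses base R on a made-up data frame, toy_weight_log, which only stands in for all_activity_log. The ids and weight_status labels are assumptions for illustration.

```r
# Toy data standing in for all_activity_log (hypothetical ids and labels)
toy_weight_log <- data.frame(
  id = c(1, 1, 1, 2, 2, 3),
  weight_status = c("healthy", "healthy", "healthy",
                    "overweight", "overweight", "obese")
)

# Number of distinct users in each weight class
users_per_class <- tapply(toy_weight_log$id,
                          toy_weight_log$weight_status,
                          function(x) length(unique(x)))

users_per_class
```

A class represented by a single user, as with the obese group here, cannot support a general conclusion on its own.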

9. Does the total number of calories burned affect sleep quality?

Findings

There appears to be no correlation between the number of calories burned in a day and the amount of time spent asleep.

# Calories and sleep
sleep_calories_df <- all_activity_log %>%
  select(calories, total_minutes_asleep, Sleep_quality) %>% 
  filter(!is.na(total_minutes_asleep)) %>% 
  filter(!is.na(Sleep_quality)) 

ggplot(data = sleep_calories_df, aes(x = total_minutes_asleep, y = calories)) + 
  # add transparency (alpha) to avoid overplotting
  geom_point(alpha = 0.5, aes(color = Sleep_quality)) + 
  # add text labels
  labs(title = "Time Asleep vs. Calories",
       x = "Total Minutes Asleep",
       y = "Calories Burned") +
  stat_cor()

10. What is the level of activity among the different weight classes?

Findings

Obese users spend most of their time sedentary and little time doing lightly active activities. Overweight users also spend most of their time sedentary but, interestingly, record the most time doing very active activities.

# Creating a data frame for weight and activity levels 
weight_activity <- all_activity_log %>% 
  select(id, bmi, very_active_minutes, fairly_active_minutes, lightly_active_minutes, sedentary_minutes, weight_status) %>% 
  filter(!is.na(bmi)) %>%
  filter(!is.na(weight_status)) %>%
  group_by(id) 

# Reshape to long format: one row per record and activity level
weight_activity_long <- pivot_longer(weight_activity,
                                    cols = very_active_minutes:sedentary_minutes, 
                                    names_to = "activity_levels", 
                                    values_to = "total_minutes")

# The x-axis is the weight class, so the labels refer to weight status
ggplot(weight_activity_long, aes(x = weight_status, y = total_minutes)) +
  geom_col(aes(fill = activity_levels), position = 'dodge') +
  labs(title = "Active Minutes by Intensity and Weight Status", 
       x = "Weight Status",
       y = "Total Active Minutes",
       fill = "Level of Intensity") +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Key Findings

  1. There is no correlation between the number of calories burned and the device usage rate.
  2. Most participants spent most of their time sedentary, without doing any form of activity.
  3. Sedentary time is associated with sleep deprivation.
  4. Users sleep the most on weekends.
  5. Going to bed early and spending more time in bed result in getting more sleep.
  6. Overweight users spend the most time doing high-intensity activities.

Conclusion and Recommendations

  1. The analysis shows that most smart device users in the dataset record their daily physical activity more often than their sleep and weight. It is not clear which devices the data came from, but it would be interesting to find out whether users fail to record sleep data because they take their devices off while sleeping (for example, some people may be uncomfortable sleeping with a watch or bracelet on). A viable solution would be to redesign the app so that it can track sleep even when users are not wearing their devices (the Samsung Health app already does this).

  2. The marketing team should engage users in healthy habits by prompting and recommending activities when users have been inactive for some time.

  3. Send users insights and analysis of their weekly activity and sleep records, for example how their activity levels for the week could be affecting their sleep. This could also help users become aware of their health status and may prompt them to adopt a healthier lifestyle.

Limitations

  1. Small sample size: The dataset is small and, in the case of the sleep data, incomplete, so it is not a true representation of the population. As a result, inferences made from the analysis may not be statistically significant.

  2. Time frame: Data collection was limited to a period of 31 days, which could greatly reduce the chance of finding significant insights.