Google Data Analytics Capstone Project: Bellabeat Case Study

Introduction

Bellabeat is a high tech company that manufactures health-focused smart products for women. Their products are designed to help users live healthier lives. Bellabeat is a successful small company but it is seeking to tap into the opportunities available in the global market in order to grow its business. This study was carried to analyze trends in health-focused smart product market in order to unlock new opportunites for growth.

Business Task

The business task is to analyze smart device usage data in order to gain insight into how users use these devices. Knowledge gained from this analysis is to be used to explore new opportunities to further grow the company.

Data

The data to be used for this study was sourced from the CCO public domain. As such, it is data that is freely available to every researcher to use. The data contained in the domain consist of data on activity, sleep, calories, intensities, steps taken and weight logs. My approach to this study is to do a deep dive into the different aspects of the data with the intention of gaining more insights into how smart devices can be used to help guide the health choices of users.

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(here)

## here() starts at /cloud/project

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)

Daily Activities

Loading data on activities

daily_activities <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(daily_activities)

## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance

Checking the number of participants available in the study

n_distinct(daily_activities$Id)

## [1] 33

nrow(daily_activities)

## [1] 940

The number of participants in the daily activities is 33. Statistically, this number is high enough for us to draw valid inferences from the data. Below is a quick summary of some observations in the study

daily_activities %>% 
  mutate(sedentary_hours = round(SedentaryMinutes/60, digits = 2)) %>% 
  select(TotalSteps, TotalDistance, sedentary_hours) %>% 
  summary()

##    TotalSteps    TotalDistance    sedentary_hours
##  Min.   :    0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:12.16  
##  Median : 7406   Median : 5.245   Median :17.62  
##  Mean   : 7638   Mean   : 5.490   Mean   :16.52  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:20.49  
##  Max.   :36019   Max.   :28.030   Max.   :24.00

The average steps taken by the participants daily is 7638 while the average number of hours users remain sedentary in a day is 16.52. The average sedentary hours is very high and users should be sensitized to reduce this as it could negatively impact their health.

Let us take closer look at users whose daily activities might expose them to higher health risks

daily_activeness <- daily_activities %>% 
  mutate(sedentary_hrs = round(SedentaryMinutes/60, digits = 2)) %>% 
  select(Id, TotalSteps, sedentary_hrs) %>% 
  group_by(Id) %>% 
  summarize(average_daily_steps = mean(TotalSteps), average_sedentary_hrs = mean(sedentary_hrs)) %>% 
  mutate(activeness = case_when(average_daily_steps < 4000 ~ "Low Daily Steps", average_daily_steps >= 4000 & average_daily_steps <= 6000 ~ " Fairly active", average_daily_steps > 6000 ~ "Highly active")) %>% 
  mutate(sedentariness = case_when(average_sedentary_hrs > 12 ~ "Highly sedentary", average_sedentary_hrs > 8 & average_sedentary_hrs <= 12 ~ "Averagely sedentary", average_sedentary_hrs < 8 ~ " Low sedentary"))

Let us preview the new dataframe just created

head(daily_activeness)

## # A tibble: 6 × 5
##           Id average_daily_steps average_sedentary_hrs activeness        seden…¹
##        <dbl>               <dbl>                 <dbl> <chr>             <chr>  
## 1 1503960366              12117.                  14.1 "Highly active"   Highly…
## 2 1624580081               5744.                  21.0 " Fairly active"  Highly…
## 3 1644430081               7283.                  19.4 "Highly active"   Highly…
## 4 1844505072               2580.                  20.1 "Low Daily Steps" Highly…
## 5 1927972279                916.                  22.0 "Low Daily Steps" Highly…
## 6 2022484408              11371.                  18.5 "Highly active"   Highly…
## # … with abbreviated variable name ¹sedentariness

Visualizing the different categories of activeness

ggplot(data = daily_activeness, aes(x = activeness, fill = activeness))+ geom_bar()+ labs(title = "Daily Activity", x = "Level of Daily Activity", y = "Number of Participants")

Visualizing the different categories of sedentariness

ggplot(data = daily_activeness, aes(x = sedentariness, fill = sedentariness))+ geom_bar()+ labs(title = "Daily Sedentary Lifestyle", x = "Level of Daily Sedentariness", y = "Number of Participants")

The two visualizations above clearly show that the daily habits of most participants exposes them to health risks especially as regards their sedentary lifestyle. Smart device users can be sensitized on the health risks posed by these daily habits.

Daily Calories

Now we take a look at the data on calories and see what we can learn from there.

calories_data <- read_csv("dailyCalories_merged.csv")

## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(calories_data)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

We will check the number of participants in the data to get some idea of the data’s suitability for analysis

n_distinct(calories_data$Id)

## [1] 33

nrow(calories_data)

## [1] 940

The number of participants is 33 which is high enough to make reasonable inferences from the data. We will now calculate some summary statistics of the data.

calories_data %>% 
  select(Calories) %>% 
  summary()

##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

Four of the data points on the calories column are zeroes and I believe this was because of non-response for that those data points. Because of the impact this extreme values could have on the results, I deem it necessary to remove the data points.

calories_data1 <- calories_data[-c(31, 654, 818, 880), ]
n_distinct(calories_data1$Id)

## [1] 33

nrow(calories_data1)

## [1] 936

We will now recalculate the summary statistics to get a better idea of information contained in the dataset.

calories_data1 %>% 
  select(Calories) %>% 
  summary()

##     Calories   
##  Min.   :  52  
##  1st Qu.:1834  
##  Median :2144  
##  Mean   :2313  
##  3rd Qu.:2794  
##  Max.   :4900

From the summary statistics above, the average daily calories burnt by participants is 2313. This falls within the recommended range for adults daily. We will now look at the average for each participant and get some idea what these figures look like.

average_daily_participant <- calories_data1 %>% 
  select(Id, Calories) %>% 
  group_by(Id) %>% 
  summarise(average_daily_calories = mean(Calories))
head(average_daily_participant)

## # A tibble: 6 × 2
##           Id average_daily_calories
##        <dbl>                  <dbl>
## 1 1503960366                  1877.
## 2 1624580081                  1483.
## 3 1644430081                  2811.
## 4 1844505072                  1573.
## 5 1927972279                  2173.
## 6 2022484408                  2510.

If we are to work with the recommended daily requirement of 2000 calories daily, a number of participants do not meet this requirement.

low_calories_category <- average_daily_participant %>% 
  filter(average_daily_calories < 2000)
head(low_calories_category)

## # A tibble: 6 × 2
##           Id average_daily_calories
##        <dbl>                  <dbl>
## 1 1503960366                  1877.
## 2 1624580081                  1483.
## 3 1844505072                  1573.
## 4 2026352035                  1541.
## 5 2320127002                  1724.
## 6 2873212765                  1917.

13 out of the total of 33 participants did not burn up to the required calories daily. This represents about 39 percent of the data.

Daily Sleep

We will now take a look at the sleep patterns of the participants in the study.

sleep_day <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(sleep_day)

## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed

Let’s check the number of participants contained in this dataset

n_distinct(sleep_day$Id)

## [1] 24

nrow(sleep_day)

## [1] 413

The total number of participants is 24. Statistically, for a dataset to be suitable for analysis, the number of participants should be 30 and above. But we will still try to examine some of the patterns in this dataset.

sleep_day %>% 
  select(TotalMinutesAsleep, TotalTimeInBed) %>% 
  mutate(TotalHoursAsleep = TotalMinutesAsleep/60, TotalHoursInBed = TotalTimeInBed/60) %>% 
  summary()

##  TotalMinutesAsleep TotalTimeInBed  TotalHoursAsleep  TotalHoursInBed 
##  Min.   : 58.0      Min.   : 61.0   Min.   : 0.9667   Min.   : 1.017  
##  1st Qu.:361.0      1st Qu.:403.0   1st Qu.: 6.0167   1st Qu.: 6.717  
##  Median :433.0      Median :463.0   Median : 7.2167   Median : 7.717  
##  Mean   :419.5      Mean   :458.6   Mean   : 6.9911   Mean   : 7.644  
##  3rd Qu.:490.0      3rd Qu.:526.0   3rd Qu.: 8.1667   3rd Qu.: 8.767  
##  Max.   :796.0      Max.   :961.0   Max.   :13.2667   Max.   :16.017

The average hours of sleep most of the participants get is 7.2. Twenty-five percent of the population only get six hours of sleep daily. This is below the recommended average of 8 hours daily given by health professionals. From these figures, we can infer that most of the participants are sleep deprived. We will now attempt to narrow this down to the participants in the study.

sleep_pattern <- sleep_day %>% 
  mutate(TotalHoursAsleep = TotalMinutesAsleep/60) %>% 
  select(Id, TotalMinutesAsleep, TotalHoursAsleep) %>% 
  group_by(Id) %>% 
  summarize(AverageHoursOfSleep = mean(TotalHoursAsleep)) %>% 
  mutate(HoursOfSleep = case_when(AverageHoursOfSleep < 8 ~ "Sleep Deprived", AverageHoursOfSleep >= 8 ~ "Sleeping Adequately"))
head(sleep_pattern)

## # A tibble: 6 × 3
##           Id AverageHoursOfSleep HoursOfSleep       
##        <dbl>               <dbl> <chr>              
## 1 1503960366                6.00 Sleep Deprived     
## 2 1644430081                4.9  Sleep Deprived     
## 3 1844505072               10.9  Sleeping Adequately
## 4 1927972279                6.95 Sleep Deprived     
## 5 2026352035                8.44 Sleeping Adequately
## 6 2320127002                1.02 Sleep Deprived

We will now visualize the sleep patterns among the participants

ggplot(data = sleep_pattern, aes(x = HoursOfSleep))+ geom_bar()+ labs(title = "Daily Hours of Sleep", x = "Hours Of Sleep", y = "Number of Participants")

From the visualizaion above, we can see that a large number of participants are not sleeping adequately. in fact, only two out of the 24 participants are getting adequate sleep. This represents only about 8 percent of the total participants.

We will now explore if the number of hours of sleep a participant gets has anything to do with the time spent in bed.

sleep_day %>% 
  select(TotalMinutesAsleep, TotalTimeInBed) %>% 
  summarize(correlation = cor(TotalMinutesAsleep, TotalTimeInBed))

## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.930

ggplot(data = sleep_day, aes(x = TotalMinutesAsleep, y = TotalTimeInBed))+ geom_point()+ labs(title = "Sleep: Total Minutes Asleep vs Total Time In Bed", x = "Total Minutes Asleep", y = "Total Time In Bed" )

The positive relationship shown between time spent in bed and hours of sleep and the high correlation of 0.93 between the two strongly suggests that people who spend more time in bed tend to get more sleep.

Heart Rate

We will now take a look at the heart rate data and see what we can learn from there.

heartrate_data <- read_csv("heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(heartrate_data)

## # A tibble: 6 × 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

Let us check out the number of participants in this data as well as the size of the entire data

n_distinct(heartrate_data$Id)

## [1] 14

nrow(heartrate_data)

## [1] 2483658

The number of participants in the data is 14 which is very low. This will make it difficult for us to draw valid inferences from the result of our analysis of this data. It will however be interesting to know what the data has to tell us about this 14 participants.

heartrate_cat <- heartrate_data %>% 
  select(Id, Value) %>% 
  group_by(Id) %>% 
  summarize(average_heartrate_per_minute = mean(Value)) %>% 
  mutate(heartrate_category = case_when(average_heartrate_per_minute < 60 ~ "At Risk of Bradycardia", average_heartrate_per_minute >= 60 & average_heartrate_per_minute <= 100 ~ " Normal Heartrate", average_heartrate_per_minute > 100 ~ "At Risk of Tachycardia"))
head(heartrate_cat)

## # A tibble: 6 × 3
##           Id average_heartrate_per_minute heartrate_category 
##        <dbl>                        <dbl> <chr>              
## 1 2022484408                         80.2 " Normal Heartrate"
## 2 2026352035                         93.8 " Normal Heartrate"
## 3 2347167796                         76.7 " Normal Heartrate"
## 4 4020332650                         82.3 " Normal Heartrate"
## 5 4388161847                         66.1 " Normal Heartrate"
## 6 4558609924                         81.7 " Normal Heartrate"

The heart rates of all 14 participants fall within the normal range.

Weights of Participants

We will also take a look at the data on weights of each of the participants

weight_data <- read_csv("weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(weight_data)

## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport

Checking the number participants and data size

n_distinct(weight_data$Id)

## [1] 8

nrow(weight_data)

## [1] 67

The number of participants in the dataset is too low for us to make any meaningful inferences from the result of our analysis of this data.

Key Takeaways

A large number of participants live sedentary lifestyles which could have major effect on their health.
Thirty-nine percent (39%) of participants are not burning enough calories per day.
Ninety-two percent (92%) of participants are not getting enough sleep daily. This could have serious health implications.
Spending more time in bed can help improve the hours of sleep of participants.

Recommendations

A successful marketing campaign relies on matching the products you have to offer with the needs of prospective customers. The key takeaways above reveal the classes of individuals who can take advantage of the products Bellabeat have to offer to improve their health and make better health choices. I therefore recommend that:

The key takeaways should be woven into the marketing campaign as a way of demonstrating to customers the practical utility of Bellabeat products.
Given that these products are marketed primarily to women and they are expected to be worn for a long time, aesthetics and comfort should be seriously considered in the design of the products.