Bellabeat Case Study in R

1. Introduction

Bellabeat is a high-tech manufacturer of health-focused products for women. They are looking for growth opportunities in the global smart device market.

2. Ask

2.1. Business Task

The task is to analyze smart device usage data to gain insight into users’ daily habits related to physical activity, heart rate, and sleep. These insights can reveal trends in how customers use smart devices, which can then inform Bellabeat’s marketing strategy and product improvements. The analysis should answer the following questions:
What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

2.2. Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

3. Prepare

3.1. About the data

Data source: a public data set of personal fitness tracker data from Fitbit, an American company specializing in health and fitness technology. It contains data from thirty eligible Fitbit users who consented to the submission of their personal tracker data.
Structure of data: 18 csv files covering daily activity, steps, intensity, weight, calories burned, and heart rate. The files mix long and wide formats, and each one holds information on a specific aspect of smart device usage.

3.2. Assessment of credibility and integrity of data

Sampling bias: a sample of about 30 users is too small to represent the population (people who use smart devices).
Missing demographics: Bellabeat focuses on women’s health, but the data set has no information on gender, age, or location, so it may not reflect Bellabeat’s target customers.
Not current: the data is historical (April to May 2016) and has not been updated.
Although the data set does not meet the ROCCC criteria (reliable, original, comprehensive, current, cited), this analysis is for study purposes only, so I will go ahead with it. In a real-life scenario, I would look for more credible data.
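
To give a rough sense of why about 30 users is a small sample, the sketch below computes the worst-case 95% margin of error for an estimated proportion. It assumes a simple random sample, which this data set is not, so the real uncertainty is likely larger. The variable names are illustrative.

```r
# Worst-case margin of error for a proportion estimated from n users.
# Assumption: simple random sampling (not true for this data set).
n <- 30            # approximate number of users in the sample
p <- 0.5           # p = 0.5 maximizes the variance p * (1 - p)
z <- qnorm(0.975)  # critical value for a 95% confidence level

margin_of_error <- z * sqrt(p * (1 - p) / n)
round(margin_of_error, 3)
```

With n = 30 the margin is roughly 18 percentage points, so a share estimated from this sample could plausibly be off by that much in either direction.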

4. Process

4.1. Tools

I will use R to explore, clean, analyze, and visualize the data in this case study because R is a comprehensive tool for statistical computing, data analysis, and graphical visualization.

4.2. Data Cleaning

I want to analyze the data at the daily level, so I have uploaded the four csv files that contain day-level tables (activity, calories, intensities, and sleep), along with the weight log.

4.2.1. Installing Packages

Install and load the tidyverse.

install.packages('tidyverse')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
install.packages('corrplot')

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(corrplot)

## corrplot 0.95 loaded

4.2.2. Exploring the data set

Name the dataframes and take a look at the data. We need to find out how many unique participants there are in each dataframe.

daily_activity <- read_csv("dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

n_distinct(daily_activity$Id)

## [1] 33

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

daily_calories <- read.csv("dailyCalories_merged.csv")
n_distinct(daily_calories$Id)

## [1] 33

head(daily_calories)

##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728

daily_intensities <- read.csv("dailyIntensities_merged.csv")
n_distinct(daily_intensities$Id)

## [1] 33

str(daily_intensities)

## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

sleep_day <- read_csv("sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

n_distinct(sleep_day$Id)

## [1] 24

colnames(sleep_day)

weight_log <- read.csv("weightLogInfo_merged.csv")
n_distinct(weight_log$Id)

## [1] 8

head(weight_log)

##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

After a quick review, I noted the following:
All tables have an Id column, so the tables can be merged.
The information in the daily_intensities and daily_calories tables is already present in daily_activity, so I dropped these two dataframes.
Only 8 participants logged weight information, so the weight data is not representative. I dropped this table as well.
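
One way to check that a smaller table is fully contained in a larger one is sketched below in base R. The two data frames are toy stand-ins with made-up values, not the real files, and the check assumes the smaller table has no duplicate rows.

```r
# Toy stand-ins for daily_activity and daily_calories (values are made up).
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000),
  Calories = c(1985, 1797, 2100)
)
calories_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100)
)

# The date column is named differently in the two files, so align it first.
names(calories_toy)[names(calories_toy) == "ActivityDay"] <- "ActivityDate"

# 1. Every column of the smaller table should exist in the larger one.
cols_covered <- all(names(calories_toy) %in% names(activity_toy))

# 2. Every row of the smaller table should match a row of the larger one.
matched <- merge(calories_toy, activity_toy, by = names(calories_toy))
rows_covered <- nrow(matched) == nrow(calories_toy)

cols_covered && rows_covered
```

If both checks pass on the real tables, dropping the smaller table loses no information.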

4.3. Merging data and more cleaning

Two tables remain to explore. Let’s look at summary statistics for each dataframe.

daily_activity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes) %>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

sleep_day %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Let’s parse the dates so that the two tables can be joined on a common date column.

activity <- daily_activity %>%
  mutate(ActivityDate = mdy(ActivityDate))

sleep <- sleep_day %>%
  mutate(SleepDay = mdy_hms(SleepDay)) %>%
  mutate(SleepDay = as_date(SleepDay))

merged_data <- inner_join(activity, sleep, 
                          by = c("Id" = "Id", "ActivityDate" = "SleepDay"))
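
To illustrate why the parsing step matters, the sketch below converts one sample string in each format to a Date. The activity string is copied from the output above; the sleep string is an assumed example of a date-time format that `mdy_hms()` accepts, since the raw sleep values are not shown here.

```r
library(lubridate)

activity_raw <- "4/12/2016"              # date only, as in the daily activity file
sleep_raw    <- "4/12/2016 12:00:00 AM"  # assumed date-time format of the sleep file

activity_date <- mdy(activity_raw)
sleep_date    <- as_date(mdy_hms(sleep_raw))  # parse, then drop the time part

# Both values are now Date objects and compare equal,
# so they can act as a join key together with Id.
class(activity_date)
activity_date == sleep_date
```

Joining on the raw strings would match nothing, because the two files write the same day differently.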

Now we have a merged dataframe. Next, we keep only the columns needed for the analysis and remove duplicate rows and missing values.

merged_data <- merged_data %>%
  select(Id, ActivityDate, TotalSteps, SedentaryMinutes, Calories, 
         TotalMinutesAsleep, TotalTimeInBed)
merged_data <- merged_data %>%
  distinct() %>%
  drop_na()
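
Before removing rows, it can help to count how many duplicates and missing values there actually are. The following is a minimal base R sketch on a toy data frame with made-up values; `unique()` and `na.omit()` are used here as base R analogues of `distinct()` and `drop_na()`.

```r
# Toy data with one duplicated row and one missing value (made up).
toy <- data.frame(
  Id = c(1, 1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 327, NA, 412)
)

n_duplicates <- sum(duplicated(toy))  # rows identical to an earlier row
n_missing    <- colSums(is.na(toy))   # missing values per column

# Remove duplicates first, then rows with any missing value.
cleaned <- na.omit(unique(toy))

n_duplicates
n_missing
nrow(cleaned)
```

Comparing the row count before and after cleaning shows how much data the cleaning step discards.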

5. Analyze and Share

Relationship between activity and sleep

I will compute a correlation matrix for the activity and sleep variables to find out whether they are correlated.

cor_data <- merged_data %>%
  select(TotalSteps, SedentaryMinutes, Calories, 
         TotalMinutesAsleep, TotalTimeInBed)

cor_matrix <- cor(cor_data)

corrplot(cor_matrix, method = "number", type = "upper", 
         tl.col = "navyblue", title = "Correlation Matrix", mar = c(0,0,1,0))

Based on the correlation analysis, some findings have been revealed:

Steps vs. Sleep: the number of steps taken and sleep time are only slightly correlated, so higher levels of activity may go with longer sleep time, but this is not strongly supported.
Sedentary Time vs. Sleep: an inverse relationship is found in the data, which suggests that higher sedentary time is associated with shorter sleep duration.
Calories Burned vs. Sleep: there is little evidence that the amount of calories burned is associated with sleep duration.
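
The correlation matrix gives coefficients only. To judge whether a coefficient differs from zero by more than chance, a significance test such as `cor.test()` from base R could be added. The sketch below uses short illustrative vectors, not values from the merged data.

```r
# Illustrative values only; they are not taken from merged_data.
sedentary_minutes <- c(728, 776, 726, 773, 539, 775, 818, 838)
minutes_asleep    <- c(327, 384, 412, 340, 700, 304, 360, 325)

# Pearson correlation with a two-sided test of "correlation = 0".
test <- cor.test(sedentary_minutes, minutes_asleep, method = "pearson")

test$estimate  # correlation coefficient, the same value cor() returns
test$p.value   # small values indicate the correlation is unlikely to be zero
test$conf.int  # 95% confidence interval for the coefficient
```

On the real data, the same call would take two columns of `merged_data`, for example `merged_data$SedentaryMinutes` and `merged_data$TotalMinutesAsleep`.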

Let’s visualize the findings for more clarity.

summary(merged_data)

##        Id             ActivityDate          TotalSteps    SedentaryMinutes
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :   17   Min.   :   0.0  
##  1st Qu.:3.977e+09   1st Qu.:2016-04-19   1st Qu.: 5189   1st Qu.: 631.2  
##  Median :4.703e+09   Median :2016-04-27   Median : 8913   Median : 717.0  
##  Mean   :4.995e+09   Mean   :2016-04-26   Mean   : 8515   Mean   : 712.1  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:11370   3rd Qu.: 782.8  
##  Max.   :8.792e+09   Max.   :2016-05-12   Max.   :22770   Max.   :1265.0  
##     Calories    TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 257   Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1841   1st Qu.:361.0      1st Qu.:403.8  
##  Median :2207   Median :432.5      Median :463.0  
##  Mean   :2389   Mean   :419.2      Mean   :458.5  
##  3rd Qu.:2920   3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :4900   Max.   :796.0      Max.   :961.0

ggplot(merged_data, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(alpha = 0.6, color = "violet") +
  geom_smooth(method = "lm", se = TRUE, color = "yellow") +
  labs(title = "Steps vs. Minutes Asleep",
       x = "Total Steps",
       y = "Total Minutes Asleep")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(merged_data, aes(x = SedentaryMinutes, y = TotalMinutesAsleep)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "yellow") +
  labs(title = "Sedentary Minutes vs. Minutes Asleep",
       x = "Sedentary Minutes",
       y = "Total Minutes Asleep")

## `geom_smooth()` using formula = 'y ~ x'

The visualizations support the above insights from the correlation analysis.

Activity analysis

While a target of 10,000 steps a day is widely known as a good benchmark, a healthy number of steps per day varies with age and individual factors. Generally, adults under 60 are recommended to take 8,000 to 10,000 steps, while 6,000 to 8,000 may be sufficient for those aged 60 and above. The dataset doesn’t include age or gender, so we cannot group users by demographics. However, we can group the daily records by step count: under 7,000 steps (the average healthy number for the older group), 7,000 to 10,000 steps, and 10,000 steps or more.

below_7000 <- sum(activity$TotalSteps < 7000) / length(activity$TotalSteps) * 100
between_7000_10000 <- sum(activity$TotalSteps >= 7000 & activity$TotalSteps < 10000) / length(activity$TotalSteps) * 100
more_10000 <- sum(activity$TotalSteps >= 10000) / length(activity$TotalSteps) * 100

data <- matrix(c(below_7000, between_7000_10000, more_10000), nrow=3, ncol=1)
rownames(data) <- c('Less than 7.000 steps','7.000-10.000 steps','More than 10.000 steps')
colnames(data) <- "Percentage" 
final=as.table(data)
print(final)

##                        Percentage
## Less than 7.000 steps    46.48936
## 7.000-10.000 steps       21.27660
## More than 10.000 steps   32.23404
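
As an alternative sketch, the same kind of percentage table can be built with `cut()` and `prop.table()`, which avoids writing one expression per group. The step counts below are made up for illustration, and the group labels are my own.

```r
# Made-up step counts, including values exactly on the boundaries.
steps <- c(0, 3500, 6999, 7000, 9999, 10000, 15000, 22000)

step_group <- cut(
  steps,
  breaks = c(-Inf, 7000, 10000, Inf),
  labels = c("Less than 7,000 steps", "7,000-10,000 steps", "10,000 steps or more"),
  right = FALSE  # intervals are [lower, upper), so 7000 falls in the middle group
)

# Share of records in each group, in percent.
percentages <- round(prop.table(table(step_group)) * 100, 1)
percentages
```

Because the groups come from one set of breaks, they cannot overlap or leave gaps, so the percentages always sum to 100.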

# Replace the numeric step count with a category label for plotting.
# Inside mutate(), columns can be referenced by name without the activity$ prefix.
activity2 <- activity %>%
    mutate(
        TotalSteps = case_when(
            TotalSteps < 7000 ~ "0-7.000 steps",
            TotalSteps >= 7000 &
                TotalSteps < 10000 ~ "7.000-10.000 steps",
            TotalSteps >= 10000 ~ "More than 10.000 steps"
        ))

ggplot(data = activity2, aes(x = TotalSteps, color = TotalSteps, fill = TotalSteps)) +
    geom_bar(alpha = 0.8, show.legend = FALSE) +
    labs(x = "total steps", title = "Categorization based on number of steps per record") +
    theme_classic()

Now it can be seen that almost half of the records (46.5%) are below 7,000 steps, and only 21.3% of records are between 7,000 and 10,000 steps. We can give some recommendations based on these findings.
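
The percentages above count daily records, not users. To make a statement about users, one option is to average each user’s steps first and then assign the groups. The sketch below does this in base R on toy data with made-up values; the column name `AvgSteps` is an assumption for illustration.

```r
# Toy data: three users with two daily records each (made up).
toy <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalSteps = c(12000, 11000, 4000, 8000, 9000, 7500)
)

# Average daily steps per user.
avg_steps <- aggregate(TotalSteps ~ Id, data = toy, FUN = mean)
names(avg_steps)[2] <- "AvgSteps"

# Assign each user to a group based on the average.
avg_steps$Group <- cut(
  avg_steps$AvgSteps,
  breaks = c(-Inf, 7000, 10000, Inf),
  labels = c("Less than 7,000 steps", "7,000-10,000 steps", "10,000 steps or more"),
  right = FALSE
)

users_per_group <- table(avg_steps$Group)
users_per_group
```

A per-user view gives each participant equal weight, whereas the per-record view gives more weight to users who wore the device on more days.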

Sleep analysis

It is recommended that most healthy adults get around 7-9 hours of sleep per night to feel rested and maintain optimal health. However, individual sleep needs vary with age, health, and lifestyle. We don’t have that demographic information, so let’s assume the users are general adults. I will calculate the average amount of sleep per user and divide users into categories. First, I convert minutes into hours in the sleep dataframe for easier categorization.

# Convert minutes to hours, rounded to one decimal place.
sleep2 <- sleep %>%
  mutate(
    TotalHoursAsleep = round(TotalMinutesAsleep / 60, 1),
    TotalHoursInBed = round(TotalTimeInBed / 60, 1)
  )

head(sleep2)

## # A tibble: 6 × 7
##           Id SleepDay   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <date>                 <dbl>              <dbl>          <dbl>
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
## # ℹ 2 more variables: TotalHoursAsleep <dbl>, TotalHoursInBed <dbl>

Now let’s see how much sleep each user gets on average.

sleep_group <- sleep2 %>%
    group_by(Id) %>%
    summarise(AveHoursAsleep = mean(TotalHoursAsleep))

# Label each user by average sleep duration.
sleep_group2 <- sleep_group %>%
    mutate(
        AverageHoursAsleep = case_when(
            AveHoursAsleep < 7 ~ "Under 7 hours",
            AveHoursAsleep >= 7 &
                AveHoursAsleep <= 9 ~ "From 7 - 9 hours",
            AveHoursAsleep > 9 ~ "More than 9 hours"
        ))

# Set the category order explicitly so the bars run from least to most sleep.
# "Healthy Sleep" is not one of the labels created above, so it cannot be used to reorder them.
sleep_levels <- c("Under 7 hours", "From 7 - 9 hours", "More than 9 hours")

ggplot(data = sleep_group2, aes(x = factor(AverageHoursAsleep, levels = sleep_levels), fill = factor(AverageHoursAsleep, levels = sleep_levels))) +
    geom_bar(alpha = 0.9, show.legend = FALSE) +
    labs(title = "Average sleep duration", x = "") +
    theme_classic()

As the graph shows, most users sleep less than the recommended amount (under 7 hours a day), and some sleep too much (more than 9 hours a day). This leaves room for recommendations.
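
The chart shows the groups visually; the counts and shares behind it can also be printed as a table. The sketch below uses a made-up vector of category labels in place of the real per-user results.

```r
# Made-up category labels for six users (not the real results).
sleep_category <- c("Under 7 hours", "From 7 - 9 hours", "Under 7 hours",
                    "More than 9 hours", "Under 7 hours", "From 7 - 9 hours")

# Explicit levels keep the order fixed and show a zero for any empty group.
sleep_levels <- c("Under 7 hours", "From 7 - 9 hours", "More than 9 hours")
sleep_category <- factor(sleep_category, levels = sleep_levels)

counts <- table(sleep_category)
shares <- round(100 * prop.table(counts), 1)

counts
shares
```

Reporting the counts alongside the chart makes it clear how few users each bar represents.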

6. Act

Conclusions:

While the data shows only mild correlations, there is some evidence that physical activity patterns are associated with sleep behavior. Promoting balanced daily routines could lead to improved sleep health.
Most daily records fall short of the 10,000-step benchmark, and almost half are below 7,000 steps.
Most users sleep less than the recommended amount.

Recommendations:

Increase Moderate Physical Activity: Engaging in more daily steps may help improve sleep duration and quality.

Reduce Sedentary Time: Long periods of inactivity are slightly linked to reduced sleep.

Monitor Consistency: Establishing consistent activity and sleep patterns could be more beneficial than just daily totals.

Ideas for Bellabeat marketing and product improvements:

Based on the insights from the analysis phase, I think the findings are best applied to improving the Bellabeat app, which provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company’s line of smart wellness products, such as Leaf and Time, so the company can promote these two products as well. Some ideas for the app follow:

Promote Activity-Tracking as a Path to Better Sleep: Emphasize the link between increased daily steps and improved sleep and bundle activity trackers with sleep analysis features to strengthen the value proposition.
Target Sedentary Users with Health Challenges: Create content campaigns or in-app tools for desk workers, remote employees, or gamers to reduce sedentary behavior.
Educational Content & Email Campaigns.
“Smart Sleep + Move” Challenges: Reward users with badges, coupons, or social shareable achievements and allow users to compare their sleep improvement over time.

Thank you for your time and attention!