Case Study 2: How Can a Wellness Technology Company Play It Smart? [Capstone Project]

Bellabeat Project

As a junior data analyst working with the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women, I have been tasked with analyzing FitBit fitness tracker data to gain insights into how consumers are using the FitBit app and to discover trends that can guide Bellabeat’s marketing strategy. The goal is to identify growth opportunities for the company in the global smart device market. Bellabeat is a successful small company, but with the potential to become a larger player in the market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

In this project, I have analyzed FitBit fitness tracker data, which includes minute-level output for physical activity, heart rate, and sleep monitoring from thirty eligible Fitbit users who have consented to the submission of personal tracker data. The data has been organized and cleaned to prepare for analysis, including handling missing data and duplicated entries.

Through exploratory data analysis, I have gained insights into users’ habits, including their daily activity, steps, and heart rate. I have also analyzed their sleep patterns, including the total time in bed and the total minutes asleep. Additionally, I have used data visualization techniques to gain further insights into the relationship between steps taken and time spent in bed.

Ask Phase

First, we need to identify our key stakeholders. In this case, they are:

-Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer

-Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team

-Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Business Objectives:

-What are some trends in smart device usage?

-How could these trends apply to Bellabeat customers?

-How could these trends help influence Bellabeat marketing strategy?

Prepare Phase

Sršen encouraged me to use public data that explores smart device users’ daily habits. She pointed me to a specific data set:

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. These thirty eligible users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The data is publicly available on Kaggle as FitBit Fitness Tracker Data and is stored in 18 CSV files.

In the Prepare phase, we identify the data being used and its limitations. Some of those limitations are the following:

-The data was collected 7 years ago, in 2016, and users’ daily activity, fitness and sleeping habits, diet, and food consumption may have changed since then. Therefore, the data may not be timely or relevant.

-The sample of 30 FitBit users is not representative of the entire fitness population. As the data was collected in a survey, we cannot be assured of its integrity or accuracy.

-The data only covers a period of 31 days, which is too short to draw firm conclusions about how the product is used.

-Small sample size: the files contain data from 33 users at most. This limits representativeness, and the findings may not generalize to the wider population of Leaf users.
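The number of users behind each table can be checked directly instead of being taken from the dataset description. The following is a minimal sketch on toy data: activity_toy and sleep_toy are hypothetical stand-ins with made-up ids, not the Kaggle files. The same calls could be applied to the real tables once they are loaded.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical ids and values).
activity_toy <- data.frame(
  id = c(1, 1, 2, 3, 3),
  total_steps = c(100, 200, 300, 400, 500)
)
sleep_toy <- data.frame(
  id = c(1, 3),
  total_minutes_asleep = c(400, 420)
)

# n_distinct() counts unique values, here the number of users in each table.
users_activity <- n_distinct(activity_toy$id)
users_sleep <- n_distinct(sleep_toy$id)

# Number of recorded days per user, to see how complete each history is.
days_per_user <- activity_toy %>%
  count(id, name = "days")
```

Comparing these counts across tables shows how many users actually contribute to each kind of measurement.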

Process Phase

In this phase, we process the data by cleaning it and ensuring that it is correct, complete, and error-free.

We have to check whether the data contains any missing or null values and transform it into the format we want for the analysis tool.

I used RStudio for data cleaning, data transformation, data analysis, and visualization.

First, we need to install and load the packages required for the analysis. I had already installed all of them, so I can load them all at once.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)
library(lubridate)
library(ggplot2)
library(tibble)

Using the read.csv command, we can access and load data from a secure hard disk and save it as a variable of our choosing.

daily_activity <- read.csv("dailyActivity_merged.csv")
daily_sleep <- read.csv("sleepDay_merged.csv")
weight_info <- read.csv("weightLogInfo_merged.csv")

I utilized the clean_names() function to modify the headers of each data frame, guaranteeing they adhere to proper syntax (i.e., beginning with a letter/underscore and consisting only of letters/numbers/underscores).

daily_activity <- clean_names(daily_activity)
daily_sleep <- clean_names(daily_sleep)
weight_info <- clean_names(weight_info)

The new headers look like this:

colnames(daily_activity)

##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"

colnames(daily_sleep)

## [1] "id"                   "sleep_day"            "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"

colnames(weight_info)

## [1] "id"               "date"             "weight_kg"        "weight_pounds"   
## [5] "fat"              "bmi"              "is_manual_report" "log_id"

Using the glimpse() function, I verified that the data type and format of the headers in each data frame are in line with the corresponding metadata. This helped ensure that the data is properly formatted and can be effectively used in subsequent analysis.

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date              <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/1…
## $ total_steps                <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance             <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance           <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes          <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories                   <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…

glimpse(daily_sleep)

## Rows: 413
## Columns: 5
## $ id                   <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1…
## $ sleep_day            <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",…
## $ total_sleep_records  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ total_minutes_asleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430,…
## $ total_time_in_bed    <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449,…

glimpse(weight_info)

## Rows: 67
## Columns: 8
## $ id               <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 28732…
## $ date             <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13…
## $ weight_kg        <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3…
## $ weight_pounds    <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159…
## $ fat              <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, N…
## $ bmi              <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.2…
## $ is_manual_report <chr> "True", "True", "False", "True", "True", "True", "Tru…
## $ log_id           <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+1…

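The output shows that id is stored as a number and that the date columns are stored as character strings. I used the mutate() function to convert id to character and to parse the dates with lubridate’s mdy() and mdy_hms() functions.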

daily_activity <- daily_activity %>%
  mutate(id = as.character(id)) %>%
  mutate(activity_date = mdy(activity_date))

daily_sleep <- daily_sleep %>%
  mutate(id = as.character(id)) %>%
  mutate(sleep_day = mdy_hms(sleep_day))

weight_info <- weight_info %>%
  mutate(id = as.character(id)) %>%
  mutate(date = mdy_hms(date))
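To illustrate what these conversions do, here is a small sketch applied to single values taken from the glimpse() output above. The variable names d, dt, and id_chr are only for illustration.

```r
library(lubridate)

# mdy() parses month/day/year strings into Date objects.
d <- mdy("4/12/2016")

# mdy_hms() parses date-times, including the 12-hour clock with AM/PM.
# The result is a POSIXct value in UTC unless a tz argument is given.
dt <- mdy_hms("5/2/2016 11:59:59 PM")

# Ids are labels rather than quantities, so they are stored as text.
id_chr <- as.character(1503960366)

class(d)
format(dt, "%Y-%m-%d %H:%M:%S")
```

Storing id as character also keeps it from being averaged or summed by accident in later summaries.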

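I used the rename() function to standardize the names of the date columns: activity_date and sleep_day become “date” (the sleep_day values shown above all have a time of midnight, so they carry only a date), and the timestamp in the weight log becomes “date_time”.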

daily_activity <- daily_activity %>%
  rename(date = activity_date)

daily_sleep <- daily_sleep %>%
  rename(date = sleep_day)

weight_info <- weight_info %>%
  rename(date_time = date)

I checked for observations with missing or null values using the !complete.cases() function.

sum(!complete.cases(daily_activity))

## [1] 0

sum(!complete.cases(daily_sleep))

## [1] 0

sum(!complete.cases(weight_info))

## [1] 65

There are 65 rows with missing values in weight_info, and all of the missing values are in the fat column. As this column is not used in the analysis, we can continue.
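One way to confirm which column holds the missing values is to count them per column. This is a minimal sketch on a toy weight log: weight_toy and its values are hypothetical and only mimic the shape of weight_info.

```r
# Toy weight log (hypothetical values) with NA only in the fat column.
weight_toy <- data.frame(
  id = c("a", "a", "b", "c"),
  weight_kg = c(52.6, 52.6, 133.5, 56.7),
  fat = c(22L, NA, NA, NA)
)

# Rows with at least one missing value.
incomplete_rows <- sum(!complete.cases(weight_toy))

# Missing values counted separately for each column.
na_per_column <- colSums(is.na(weight_toy))
na_per_column
```

If only one column has a non-zero count, the incomplete rows can be traced entirely to that column.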

I utilized the duplicated() function to check for any replicated entries.

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(daily_sleep))

## [1] 3

sum(duplicated(weight_info))

## [1] 0

Using the distinct() function, I eliminated three duplicated observations found in daily_sleep.

daily_sleep <- daily_sleep %>%
  distinct()

Checking again confirms that the duplicated rows are gone.

sum(duplicated(daily_sleep))

## [1] 0
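Before dropping duplicates, it can be useful to look at the repeated rows themselves. The sketch below uses a toy sleep table (sleep_toy, with hypothetical values) to show how duplicated() and distinct() relate.

```r
library(dplyr)

# Toy sleep table (hypothetical values); the second row repeats the first.
sleep_toy <- data.frame(
  id = c("a", "a", "a", "b"),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327, 327, 384, 412)
)

# duplicated() flags every repeat of an earlier row, so these are the
# rows that distinct() will drop.
dupes <- sleep_toy[duplicated(sleep_toy), ]

# distinct() keeps the first occurrence of each unique row.
deduped <- distinct(sleep_toy)
```

Only rows that match in every column count as duplicates here, so two different sleep records on the same day are both kept.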

Analyze Phase

Next, it is necessary to generate a summary of the data to uncover potential insights.

daily_activity %>%
  select(total_steps,
         total_distance,
         sedentary_minutes) %>%
  summary()

##   total_steps    total_distance   sedentary_minutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   
##  Median : 7406   Median : 5.245   Median :1057.5   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0

daily_sleep %>%
  select(total_sleep_records,
         total_minutes_asleep,
         total_time_in_bed) %>%
  summary()

##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.00        Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.00        1st Qu.:361.0        1st Qu.:403.8    
##  Median :1.00        Median :432.5        Median :463.0    
##  Mean   :1.12        Mean   :419.2        Mean   :458.5    
##  3rd Qu.:1.00        3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.00        Max.   :796.0        Max.   :961.0

weight_info %>% 
  select(weight_kg,
         bmi) %>%
  summary()

##    weight_kg           bmi       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

Statistical Interpretation:

Insights from daily activity data: The average number of recorded steps is 7638, which is lower than the recommended 10000 steps. The average distance covered is 5.490 km, which is also less than the recommended 8 km mark. The average sedentary time is 991.2 minutes, or 16.52 hours, which is very high, as it should not exceed 7 hours. Even for physically active people, sitting for more than 7 to 10 hours a day can negatively impact health. The average of very active minutes is 21.16, which is less than the target of 30 minutes per day.
Insights from weight log: We cannot determine a person’s health solely from their weight, as other factors such as height and fat percentage also affect overall health. The average BMI is 25.19, which is slightly above the healthy BMI range of 18.5 to 24.9.
Insights from daily sleep data: On average, there is a difference of about 39 minutes between time in bed (458.5 minutes) and time asleep (419.2 minutes). This suggests that people spend roughly that long in bed awake, part of which is the time they take to fall asleep.
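The conversions behind these figures are simple arithmetic on the means reported by summary(). The sketch below recomputes them from the values copied out of that output. The variable names are only for illustration, and the 10000-step target is the one quoted in the text.

```r
# Means copied from the summary() output above.
mean_sedentary_min <- 991.2
mean_time_in_bed <- 458.5
mean_min_asleep <- 419.2
mean_steps <- 7638

# Sedentary time converted from minutes to hours.
sedentary_hours <- mean_sedentary_min / 60

# Minutes spent in bed but not asleep.
awake_in_bed <- mean_time_in_bed - mean_min_asleep

# Shortfall against the 10000-step target quoted in the text.
step_gap <- 10000 - mean_steps

round(c(sedentary_hours = sedentary_hours,
        awake_in_bed = awake_in_bed,
        step_gap = step_gap), 2)
```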

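I decided to combine the cleaned data frames using the left_join() function. A left join keeps every row of the left-hand table and adds the matching columns from the right-hand table, filling them with NA where there is no match. The first join matches on id and date. The second matches on id alone, so each activity row is repeated once for every weight log of the same user, which is what the warning reports.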

daily_activity_sleep <- left_join(daily_activity, daily_sleep, by = c("id", "date"))
df <- left_join(daily_activity_sleep, weight_info, by = c("id"))

## Warning in left_join(daily_activity_sleep, weight_info, by = c("id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
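One possible way to avoid the many-to-many match is to derive a calendar date from the weight log timestamp and join on both id and date. This is a sketch on toy data rather than the approach used in this project: activity_toy, weight_toy, and weight_by_date are hypothetical names and values.

```r
library(dplyr)

# Toy tables (hypothetical values) shaped like the cleaned data frames.
activity_toy <- data.frame(
  id = c("a", "a", "b"),
  date = as.Date(c("2016-05-02", "2016-05-03", "2016-05-02")),
  total_steps = c(13162, 10735, 9762)
)
weight_toy <- data.frame(
  id = c("a", "a"),
  date_time = as.POSIXct(c("2016-05-02 23:59:59", "2016-05-03 23:59:59"),
                         tz = "UTC"),
  weight_kg = c(52.6, 52.6)
)

# Joining on id alone repeats each activity row for every weight log of
# that user. Newer dplyr versions warn about this, hence suppressWarnings().
by_id <- suppressWarnings(left_join(activity_toy, weight_toy, by = "id"))

# Deriving a calendar date from the timestamp lets the join match one
# weight log to one activity day.
weight_by_date <- weight_toy %>%
  mutate(date = as.Date(date_time)) %>%
  select(-date_time)
by_id_date <- left_join(activity_toy, weight_by_date, by = c("id", "date"))

nrow(by_id)
nrow(by_id_date)
```

With the id-only join, any average computed afterwards gives extra weight to users who logged their weight more often. Joining on id and date keeps one row per user and day.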

-Transforming Data-

This code manipulates and summarizes the data using the dplyr package in R.

The first two statements convert the ‘date’ column to a proper date format and then create a new ‘day_of_week’ column holding the day of the week for each date.

The third statement creates a new data frame called ‘df_by_day’ by grouping the ‘df’ data frame by the ‘day_of_week’ column and calculating the mean of ‘total_steps’, ‘total_distance’, ‘sedentary_minutes’, ‘total_minutes_asleep’, and ‘total_time_in_bed’. These means are stored in new columns named ‘avg_steps’, ‘avg_distance’, ‘avg_sedentary_minutes’, ‘avg_minutes_asleep’, and ‘avg_time_in_bed’, respectively. The two sleep columns are NA on days without a sleep record, so their means are computed with na.rm = TRUE; without it, those averages would themselves be NA.

df <- df %>%
  mutate(date = as.Date(date))

df <- df %>%
  mutate(day_of_week = weekdays(date))

df_by_day <- df %>%
  group_by(day_of_week) %>%
  summarize(avg_steps = mean(total_steps),
            avg_distance = mean(total_distance),
            avg_sedentary_minutes = mean(sedentary_minutes),
            # Sleep columns contain NA for days with no sleep record
            avg_minutes_asleep = mean(total_minutes_asleep, na.rm = TRUE),
            avg_time_in_bed = mean(total_time_in_bed, na.rm = TRUE))

Share Phase

In this stage, we will generate some visual representations based on our analysis and project objectives.

ggplot(df_by_day, aes(x = day_of_week, y = avg_steps)) +
  geom_bar(stat = "identity") +
  labs(x = "day of the week", y = "average daily steps")

Differences between days could have several explanations: the study participants may have more free time to walk on certain days, follow a more active routine on those days, or simply prefer to walk more on them. Further analysis of the data, taking external factors into account, would be needed to give a more accurate answer.
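Because day_of_week is a character column, ggplot2 places the bars in alphabetical order. A small sketch of one way to show them in calendar order follows, assuming English day names (weekdays() returns names in the current locale). df_by_day_toy and its step values are hypothetical.

```r
library(ggplot2)

# Toy summary table (hypothetical averages), in alphabetical day order.
df_by_day_toy <- data.frame(
  day_of_week = c("Friday", "Monday", "Saturday", "Sunday",
                  "Thursday", "Tuesday", "Wednesday"),
  avg_steps = c(7400, 7800, 8200, 6900, 7400, 8100, 7600)
)

# Converting to a factor with explicit levels fixes the axis order.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
df_by_day_toy$day_of_week <- factor(df_by_day_toy$day_of_week,
                                    levels = day_levels)

# geom_col() draws bars whose heights are the values in the data,
# which is equivalent to geom_bar(stat = "identity").
p <- ggplot(df_by_day_toy, aes(x = day_of_week, y = avg_steps)) +
  geom_col() +
  labs(x = "day of the week", y = "average daily steps")
```

With the days in calendar order, weekday and weekend patterns are easier to compare at a glance.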

ggplot(df, aes(x = total_steps, y = sedentary_minutes)) + 
  geom_point() + 
  labs(x = "Step Count", y = "Sedentary Minutes", title = "Relationship between steps and sedentary minutes")

The graph shows that as the amount of time spent in sedentary behavior increases, the number of daily steps taken decreases. This suggests that there is a negative relationship between the two variables - the more sedentary time, the fewer steps taken.
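A visual impression like this can be backed by a correlation coefficient. The sketch below uses made-up vectors, not the FitBit data, to show the calls; on the real data they would be applied to the total_steps and sedentary_minutes columns.

```r
# Hypothetical daily values, chosen only to illustrate the calls.
steps <- c(2000, 4000, 6000, 8000, 10000, 12000)
sedentary <- c(1300, 1200, 1100, 1000, 950, 800)

# Pearson correlation: values near -1 indicate a strong negative
# linear relationship, values near 0 indicate little or none.
r <- cor(steps, sedentary)

# cor.test() adds a confidence interval and a p-value.
result <- cor.test(steps, sedentary)
r
```

A correlation describes how the two measures move together. It does not by itself show that one causes the other.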

ggplot(df, aes(x = total_distance, y = calories)) +
  geom_point() +
  labs(x = "Total Distance", y = "Calories", title = "Calories Burned vs Total Distance Walked")

In the “Calories Burned vs Total Distance Walked” graph, a clear positive relationship between the total amount of distance walked and the number of calories burned can be observed. As the total distance walked increases, so does the number of calories burned. This pattern is expected since walking involves a higher energy expenditure, which in turn translates into a higher number of burned calories. In summary, the more a person walks, the more calories they burn.
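To put a number on this relationship, a simple linear model can estimate how many additional calories go with each additional unit of distance. The following is a sketch on made-up values: distance, calories, and fit are illustrative names, and the numbers are not taken from the data set.

```r
# Hypothetical daily values, chosen only to illustrate the calls.
distance <- c(1, 3, 5, 7, 9)
calories <- c(1650, 1880, 2230, 2470, 2810)

# Fit calories as a linear function of distance.
fit <- lm(calories ~ distance)

# The slope is the estimated change in calories per unit of distance.
slope <- coef(fit)[["distance"]]

# Predicted calories for a day with a distance of 6.
predicted <- predict(fit, newdata = data.frame(distance = 6))
slope
```

On the real data, the intercept would partly reflect calories burned at rest, since calories are recorded even on days with no distance covered.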

ggplot(daily_activity_sleep, aes(x = total_steps, y = total_time_in_bed)) +
  geom_point() +
  labs(x = "Total Steps", y = "Total Time in Bed", title = "Relationship between steps taken and time spent in bed")

## Warning: Removed 530 rows containing missing values (`geom_point()`).

In the graph we can see that most of the participants took approximately 10k steps and spent about 500 minutes in bed (a little over 8 hours).
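The warning about removed rows comes from activity days that have no matching sleep record: the left join fills total_time_in_bed with NA for them, and geom_point() drops those rows. A minimal sketch on toy data (hypothetical values) shows how to count and filter them before plotting.

```r
library(dplyr)

# Toy tables (hypothetical values); only one activity day has sleep data.
activity_toy <- data.frame(
  id = c("a", "a", "b"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_steps = c(13162, 10735, 9762)
)
sleep_toy <- data.frame(
  id = "a",
  date = as.Date("2016-04-12"),
  total_time_in_bed = 346
)

joined <- left_join(activity_toy, sleep_toy, by = c("id", "date"))

# Activity days without a sleep record.
rows_without_sleep <- sum(is.na(joined$total_time_in_bed))

# Keeping only days with sleep data makes the plotted sample explicit.
plot_data <- filter(joined, !is.na(total_time_in_bed))
```

Filtering explicitly also makes it clear how many observations the scatter plot is based on.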

Act Phase

The analysis of the FitBit data has provided valuable insights that can support data-driven decision making. Given that both companies develop similar products, the health and fitness trends identified in the analysis can also be applied to Bellabeat customers.

Based on the analysis, my recommendations are as follows:

We observed that the majority of people use the application to track steps and calories burned, with fewer users tracking sleep and even fewer tracking weight records. Therefore, it would be advisable to prioritize step, calorie, and sleep tracking in the application.

I would recommend encouraging people to increase their walking activity as a way to burn more calories. This can be achieved through various means such as promoting walking as a form of exercise, providing incentives for walking such as rewards or challenges, or incorporating walking into daily routines such as walking to work or taking a walk during breaks. Additionally, it may be useful to provide education and resources on the benefits of walking and how it can contribute to a healthy lifestyle.

I would also recommend encouraging people to reduce their sedentary behavior and incorporate more physical activity into their daily routine. This could be achieved through strategies such as taking short breaks to stand and move around during prolonged periods of sitting, using standing desks or treadmill desks, or scheduling regular exercise or physical activity breaks throughout the day. Additionally, it may be beneficial to educate people on the importance of reducing sedentary behavior and increasing physical activity for overall health and wellbeing.

I would recommend conducting further analysis to investigate the possible explanations mentioned and explore other external factors that could affect the number of steps taken on certain days. It may also be useful to conduct surveys or interviews with study participants to gather more information about their daily routines and preferences. By understanding the underlying reasons for the observed patterns, we can develop more targeted interventions or recommendations to encourage physical activity and promote healthy habits.

Valentin Fortuny

2023-04-12