Case study: Fitbit’s data for Bellabeat

About
Scenario
Business task
Stakeholders
Data source
- How does the data help in answering the business questions?
Analysis process
- Exploring the dataset
- Statistical summary and Visualisation
Insight and findings
Recommendations to the stakeholders
Further analysis

About

This is a capstone project I completed in RStudio as part of the Google Data Analytics Professional Certificate course.

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Business task

Analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, then apply these insights to one of Bellabeat’s products.

Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing manager and his/her analytics team

Data source

The data is located here.

The data is the Fitbit Fitness Tracker dataset, available in the public domain on Kaggle. The author is Mobius. About 30 Fitbit users consented to the collection of their personal tracker data on active time, sleep and heart rate over a period of one month (04/12/2016 to 05/12/2016). A total of 18 files are available in CSV format.

How does the data help in answering the business questions?

Reliability is low: the sample does not represent the whole population of smart device users, and it does not record gender. The users are therefore likely a mix of men and women, whereas the insights drawn from this data are meant to inform marketing campaigns targeting women.
Originality is low: this is third-party data collected through Amazon Mechanical Turk.
Comprehensiveness is moderate: the data contains multiple fields on daily activity intensity, calories burned, daily steps taken, daily sleep time and weight records.
Currency is low: the data was collected from April through May of 2016, so users’ habits may have changed over the years.
Citation is low: the data was collected by a third party, so key descriptive information about it is unknown.

Analysis process

Loading the required libraries

library('tidyverse')

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library('here')

## here() starts at C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project

library('skimr')
library('janitor')

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library('DT')

Importing dataset

daily_activity <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/dailyActivity_merged.csv")
sleep_day <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/sleepDay_merged.csv")
heartrate_seconds <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/heartrate_seconds_merged.csv")
weight_log_info <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/weightLogInfo_merged.csv")

A brief display of the data table

datatable(head(daily_activity, 10), class = 'display',
          options = list(pageLength = 5, dom = 'tip', scrollX = TRUE),
          rownames = FALSE)

Convert data types as appropriate

daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
sleep_day$SleepDay <- mdy_hms(sleep_day$SleepDay, tz=Sys.timezone())
heartrate_seconds$Time <- mdy_hms(heartrate_seconds$Time, tz=Sys.timezone())
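
The conversions above rely on the raw columns being text in month/day/year order. As a minimal sketch of what `mdy()` and `mdy_hms()` do, the sample strings below are illustrative values in that layout, not rows read from the files, and `tz = "UTC"` is an assumption made here so the result is the same on every machine.

```r
library(lubridate)

# Illustrative strings in month/day/year order (not taken from the files)
date_text <- "4/12/2016"
datetime_text <- "4/12/2016 12:00:00 AM"

# mdy() returns a Date
parsed_date <- mdy(date_text)

# mdy_hms() returns a date-time; tz = "UTC" is an assumption for reproducibility
parsed_datetime <- mdy_hms(datetime_text, tz = "UTC")

parsed_date
parsed_datetime
```

Using `Sys.timezone()`, as the report does, ties the parsed times to the machine running the code; a fixed time zone is one way to avoid that if it matters.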

Change column names to lower case, add underscore & ensure uniqueness

# Store the result in the existing table
daily_activity <- clean_names(daily_activity)
sleep_day <- clean_names(sleep_day)
heartrate_seconds <- clean_names(heartrate_seconds)
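
The later code refers to columns such as `daily_steps`, `lightly_active_distance`, `sedentary_distance` and `moderately_active_minutes`, so a rename step was presumably run at this point but is not shown. The sketch below is one way it could look. The old names on the right-hand side are assumptions about what `clean_names()` produces from the raw file, and the one-row table is a made-up stand-in for `daily_activity`.

```r
library(dplyr)

# Made-up one-row stand-in; column names are assumed, not confirmed by the report
toy_activity <- data.frame(
  id = 1,
  total_steps = 1000,
  light_active_distance = 0.5,
  sedentary_active_distance = 0,
  fairly_active_minutes = 10
)

# Hypothetical rename step, written as new_name = old_name
toy_activity <- toy_activity %>%
  rename(daily_steps = total_steps,
         lightly_active_distance = light_active_distance,
         sedentary_distance = sedentary_active_distance,
         moderately_active_minutes = fairly_active_minutes)

names(toy_activity)
```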

Exploring the dataset

Confirm the number of unique IDs in each table

n_distinct(daily_activity$id)

## [1] 33

n_distinct(sleep_day$id)

## [1] 24

n_distinct(weight_log_info$Id)

## [1] 8

n_distinct(heartrate_seconds$id)

## [1] 14

Of the 33 participants in the daily activity table, only 8 appear in the weight table and 14 in the heart rate table. Those samples are too small to support reliable conclusions.
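
To put the counts in proportion, the sketch below expresses each table's participant count as a percentage of the 33 participants in the daily activity table. It uses only the counts printed above and assumes the IDs in the smaller tables are a subset of those 33.

```r
# Distinct IDs per table, copied from the n_distinct() output above
id_counts <- c(daily_activity = 33, sleep_day = 24,
               weight_log_info = 8, heartrate_seconds = 14)

# Percentage of the 33 daily-activity participants represented in each table
coverage_pct <- round(100 * id_counts / id_counts[["daily_activity"]])
coverage_pct
```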

Statistical summary and Visualisation

# Distance per category
daily_activity %>%  
  select(daily_steps,
         very_active_distance,
         moderately_active_distance,
         lightly_active_distance,
         sedentary_distance) %>%
  summary()

##   daily_steps    very_active_distance moderately_active_distance
##  Min.   :    0   Min.   : 0.000       Min.   :0.0000            
##  1st Qu.: 3790   1st Qu.: 0.000       1st Qu.:0.0000            
##  Median : 7406   Median : 0.210       Median :0.2400            
##  Mean   : 7638   Mean   : 1.503       Mean   :0.5675            
##  3rd Qu.:10727   3rd Qu.: 2.053       3rd Qu.:0.8000            
##  Max.   :36019   Max.   :21.920       Max.   :6.4800            
##  lightly_active_distance sedentary_distance
##  Min.   : 0.000          Min.   :0.000000  
##  1st Qu.: 1.945          1st Qu.:0.000000  
##  Median : 3.365          Median :0.000000  
##  Mean   : 3.341          Mean   :0.001606  
##  3rd Qu.: 4.782          3rd Qu.:0.000000  
##  Max.   :10.710          Max.   :0.110000

The highest number of daily steps is 36,019 and the average is 7,638.

# Minutes per category
daily_activity %>%
  select(very_active_minutes,
         moderately_active_minutes,
         lightly_active_minutes,
         sedentary_minutes) %>%
  summary()

##  very_active_minutes moderately_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00            Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00            1st Qu.:127.0         
##  Median :  4.00      Median :  6.00            Median :199.0         
##  Mean   : 21.16      Mean   : 13.56            Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00            3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00            Max.   :518.0         
##  sedentary_minutes
##  Min.   :   0.0   
##  1st Qu.: 729.8   
##  Median :1057.5   
##  Mean   : 991.2   
##  3rd Qu.:1229.5   
##  Max.   :1440.0

Average daily intensity by category (distance)

# daily intensity distance
avg_intensity_levels <- daily_activity %>%
  summarise(very_active_distance = mean(very_active_distance),
            moderately_active_distance = mean(moderately_active_distance),
            lightly_active_distance = mean(lightly_active_distance),
            sedentary_distance = mean(sedentary_distance))

Pivot (convert) from wide to long

avg_intensity_levels_long <- avg_intensity_levels %>% 
  pivot_longer(cols = c('very_active_distance', 'moderately_active_distance', 'lightly_active_distance', 'sedentary_distance'),
               names_to = 'avg_intensity_level',
               values_to = 'kilometer')

#Add column for distance in meter
avg_intensity_levels_long <- mutate(avg_intensity_levels_long,
                                    meters = kilometer*1000)
avg_intensity_levels_long

## # A tibble: 4 × 3
##   avg_intensity_level        kilometer  meters
##   <chr>                          <dbl>   <dbl>
## 1 very_active_distance         1.50    1503.  
## 2 moderately_active_distance   0.568    568.  
## 3 lightly_active_distance      3.34    3341.  
## 4 sedentary_distance           0.00161    1.61

On average, users cover 1.5 km while very active, 0.57 km while moderately active and 3.3 km while lightly active. Distance recorded while inactive (sedentary) is negligible, at about 1.6 m. It is clear that users cover more distance while lightly active than while very active or moderately active.
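
As a rough check on that comparison, the sketch below turns the mean distances from the `summary()` output into shares of the total. The variable names are illustrative and not part of the original analysis.

```r
# Mean distances in km, copied from the summary() output above
avg_km <- c(very_active = 1.503, moderately_active = 0.5675,
            lightly_active = 3.341, sedentary = 0.001606)

# Share of the average daily distance covered in each intensity category
share_pct <- round(100 * avg_km / sum(avg_km), 1)
share_pct
```

On these figures, lightly active movement accounts for a little over 60% of the average daily distance.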

ggplot(data = avg_intensity_levels_long, aes(x = avg_intensity_level, y = kilometer))+
  geom_col(fill = '#56B4E9', color = 'black')+
  xlab('Intensity distance categories')+
  ylab('distance in km')+
  labs(title = 'Average daily-intensity distance',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 1') +
  theme(plot.tag.position = 'bottomleft')

Average daily intensity by category (minutes)

# daily intensity minutes
avg_intensity_mins <- daily_activity %>%
  summarise(very_active_minutes = mean(very_active_minutes),
            moderately_active_minutes = mean(moderately_active_minutes),
            lightly_active_minutes = mean(lightly_active_minutes),
            sedentary_minutes = mean(sedentary_minutes))

Pivot from wide to long

avg_active_mins_long <- avg_intensity_mins %>% 
  pivot_longer(cols = c('very_active_minutes', 'moderately_active_minutes', 'lightly_active_minutes', 'sedentary_minutes'),
               names_to = 'avg_active_times',
               values_to = 'minutes')

# Add column for time in hours
avg_active_mins_long <- mutate(avg_active_mins_long, hours = minutes/60)
avg_active_mins_long

## # A tibble: 4 × 3
##   avg_active_times          minutes  hours
##   <chr>                       <dbl>  <dbl>
## 1 very_active_minutes          21.2  0.353
## 2 moderately_active_minutes    13.6  0.226
## 3 lightly_active_minutes      193.   3.21 
## 4 sedentary_minutes           991.  16.5

Users spent, on average, just 21 minutes very active every day and 991 minutes, which is over 16 hours, being inactive (sedentary). That is a lot of hours spent being inactive.
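
The minutes-to-hours conversion can be checked directly from the means in the `summary()` output. The sketch below repeats that arithmetic and also adds up the four categories to show how much of a day the tracked minutes cover; the variable names are illustrative.

```r
# Mean minutes per category, copied from the summary() output above
avg_minutes <- c(very_active = 21.16, moderately_active = 13.56,
                 lightly_active = 192.8, sedentary = 991.2)

# Convert each category to hours
avg_hours <- avg_minutes / 60

# Total tracked time per day across the four categories, in hours
tracked_hours <- sum(avg_minutes) / 60

round(avg_hours, 2)
round(tracked_hours, 1)
```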

Data Viz

ggplot(data = avg_active_mins_long, aes(x = avg_active_times, y = hours))+
  geom_col(fill = '#0072B2', color = 'black')+
  xlab('Time-spent categories')+
  ylab('hours')+
  labs(title = 'Average daily-intensity minutes',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 2') +
  theme(plot.tag.position = 'bottomleft')

Average distance/minutes

distance_per_mins <- avg_intensity_levels_long$meters/avg_active_mins_long$minutes
distance_per_mins

## [1] 70.998743400 41.839071311 17.326752884  0.001620627

Users cover about 71 m per very active minute, about 42 m per moderately active minute and just 17 m per lightly active minute. Distance per sedentary minute is effectively zero.
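
The division above works because both long tables happen to list the categories in the same order. A minimal sketch of a more defensive variant is shown below: it uses named vectors and matches by name, so a different row order cannot silently pair the wrong categories. The values are the rounded means printed earlier, so the results differ slightly from the output above, and the vector names are illustrative.

```r
# Rounded means from the two tibbles printed earlier
avg_meters <- c(very_active = 1503, moderately_active = 568,
                lightly_active = 3341, sedentary = 1.61)

# Deliberately listed in a different order to show that matching is by name
avg_minutes <- c(sedentary = 991.2, lightly_active = 192.8,
                 moderately_active = 13.56, very_active = 21.16)

# Reorder minutes to follow avg_meters before dividing
meters_per_minute <- avg_meters / avg_minutes[names(avg_meters)]
round(meters_per_minute, 1)
```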

Data Viz

barplot(distance_per_mins,
        names.arg = c('very_active', 'moderate', 'light', 'sedentary'),
        col = "dodgerblue3",
        main = 'Average distance per minute',
        ylab = 'Meters per minute',
        xlab = 'Intensity category',
        sub = 'Fig. 3')

Calories versus daily steps

#Relationship between calories burnt and daily steps per week
daily_activity <- daily_activity %>%
  mutate(week = case_when(
    between(activity_date, as.Date('2016-04-12'), as.Date('2016-04-18'))~ 'week_1',
    between(activity_date, as.Date('2016-04-19'), as.Date('2016-04-25'))~ 'week_2',
    between(activity_date, as.Date('2016-04-26'), as.Date('2016-05-02'))~ 'week_3',
    between(activity_date, as.Date('2016-05-03'), as.Date('2016-05-09'))~ 'week_4',
    TRUE~ 'NA'))

calories_per_step <- daily_activity %>% 
  select(activity_date, daily_steps, calories, week) %>%
  filter(week == 'week_1'|week == 'week_2'|
           week == 'week_3'|week == 'week_4')
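
The week labels can also be derived arithmetically from the number of days since the first date. The sketch below is an alternative, not the method used in this report: `assign_week()` is a hypothetical helper, and it returns a real `NA` (not the string 'NA') for dates outside the four full weeks, so the follow-up filter would become `filter(!is.na(week))`.

```r
# Hypothetical helper: label each date with its week, counting from `start`
assign_week <- function(dates, start = as.Date("2016-04-12"), n_weeks = 4) {
  # Whole weeks elapsed since the start date, shifted so the first week is 1
  week_index <- as.integer(dates - start) %/% 7 + 1
  # Dates outside weeks 1..n_weeks get a missing value
  ifelse(week_index >= 1 & week_index <= n_weeks,
         paste0("week_", week_index),
         NA_character_)
}

# Boundary dates of the ranges used in case_when() above, plus one day past week 4
sample_dates <- as.Date(c("2016-04-12", "2016-04-18", "2016-04-19",
                          "2016-05-09", "2016-05-10"))
assign_week(sample_dates)
```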

Data Viz

#Calories burned per steps taken daily
ggplot(data = calories_per_step, aes(x = daily_steps, y = calories))+
  geom_point(color = 'blue')+
  geom_smooth(color = 'red')+
  facet_wrap(~week)+
  xlab('Daily steps')+
  ylab('Calories burned (kcal)')+
  labs(title = 'Calories burned vs steps taken daily',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 4') +
  theme(plot.tag.position = 'bottomleft')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Daily steps and calories burned are positively correlated: as daily steps increase, calories burned increase. The trend line dips toward the high end of the week 3 panel, but the point driving that dip can be treated as an outlier. The week 4 trend line also shifts around the middle of the panel, yet it continues upward. The weeks do not follow exactly the same pattern, but the overall trend clearly runs from the lower left to the upper right of the plot.
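
The strength of such a relationship can be summarised with a correlation coefficient as well as read off the plot. The sketch below shows the call on made-up illustration values; these are not the Fitbit data and the result says nothing about the real dataset. On the real table the same call would take `calories_per_step$daily_steps` and `calories_per_step$calories`.

```r
# Made-up illustration values, NOT taken from the Fitbit data
toy <- data.frame(
  daily_steps = c(2000, 5000, 8000, 11000, 15000),
  calories    = c(1700, 2000, 2300, 2600, 3100)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy$daily_steps, toy$calories)
r
```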

Average heart rate versus average sleep time

#Join average heart rate and sleep minutes
heart_rate_sleep <- avg_heart_rate %>% 
  full_join(avg_sleep_mins, by = 'id')
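
The report does not show how `avg_heart_rate` and `avg_sleep_mins` were built. The sketch below is one plausible construction, assuming they hold per-user means and that the heart rate reading sits in a column named `value` after `clean_names()`. The toy tables are made-up stand-ins for `heartrate_seconds` and `sleep_day`.

```r
library(dplyr)

# Made-up stand-ins for heartrate_seconds and sleep_day
toy_heartrate <- data.frame(id = c(1, 1, 2), value = c(60, 80, 70))
toy_sleep <- data.frame(id = c(1, 3), total_minutes_asleep = c(400, 300))

# Assumed construction: one mean heart rate per user
avg_heart_rate <- toy_heartrate %>%
  group_by(id) %>%
  summarise(average_heart_rate = mean(value))

# Assumed construction: one mean sleep duration per user
avg_sleep_mins <- toy_sleep %>%
  group_by(id) %>%
  summarise(average_sleep_mins = mean(total_minutes_asleep))

# full_join keeps users present in either table and fills the gaps with NA
heart_rate_sleep <- avg_heart_rate %>%
  full_join(avg_sleep_mins, by = "id")
heart_rate_sleep
```

Because `full_join()` keeps users who appear in only one table, rows with a missing value are dropped when plotted; `inner_join()` would keep only users with both measurements.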

Data Viz

ggplot(data = heart_rate_sleep, aes(x = average_sleep_mins, y = average_heart_rate))+
  geom_point(color = 'blue')+
  labs(title = 'Average heart rate vs sleep time',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 6')+
  theme(plot.tag.position = 'bottomleft')+
  xlab('Average sleep minutes--->')+
  ylab('Average heart rate--->')

The data points are scattered, showing no clear correlation between average sleep time and average heart rate. This may be because there is too little data to support a conclusion.

Relationship between minutes asleep and time in bed

ggplot(data = sleep_day, aes(x = total_time_in_bed, y = total_minutes_asleep))+
  geom_point(color = 'blue')+
  geom_smooth(color = 'red')+
  labs(title = 'Time asleep vs time in bed',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 5')+
  theme(plot.tag.position = 'bottomleft')+
  xlab('minutes in bed--->')+
  ylab('sleep minutes--->')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive correlation between time spent in bed and time spent asleep: the more minutes users spend in bed, the more minutes they spend asleep.

Insight and findings

At 7,638, the average number of steps taken daily is lower than what the Centers for Disease Control and Prevention (CDC) recommends for adults younger than 60. According to the CDC, for adults younger than 60, the risk of premature death leveled off at about 8,000 to 10,000 steps per day. For adults 60 and older, the risk of premature death leveled off at about 6,000 to 8,000 steps per day. See here.
Users cover 3.3 km while lightly active and 1.5 km while very active, so they clearly cover more distance while lightly active than while very active. It is true that slow and steady wins the race!
Users spent, on average, just 21 minutes very active every day and 991 minutes, which is over 16 hours, being inactive (sedentary). This could be a result of long sitting hours at work.
The data shows that as daily steps increase, calories burned increase. This could be an indicator for achieving a desired weight. Don’t wish for a good body, ‘walk’ for it!

Recommendations to the stakeholders

Launch social media ads for Bellabeat’s app.
Offer a free trial lasting a set number of days so that users can see the benefits of the app before committing to full membership through a subscription.
Consider using app notifications to remind users to get active, thereby reducing the time they spend inactive.
Promote Bellabeat’s other wellness products in the app.

Further analysis

Bellabeat should collect data from its app and other products and perform a detailed analysis of its users’ behavior.

Case study: Fitbit’s data for Bellabeat

Ayotola Olabode

2023-06-27
