Bellabeat Case Study

Background:

This analysis was completed as part of the Google Data Analytics Professional Certificate offered on Coursera.

Scenario

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

About the company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

PHASE 1: ASK

The business task is to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, and to apply those insights to one of Bellabeat’s products.

PHASE 2: PREPARE

This phase reviews the datasets used in the analysis and assesses their integrity.
The data source for this project is the FitBit Fitness Tracker Data set, made available through Mobius.
The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

Loading Libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readr)
library(skimr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)
library(rmarkdown)

Loading Datasets

daily_activity <- read.csv("dailyActivity_merged.csv")
calories <- read.csv("dailyCalories_merged.csv")
intensity <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
sleep_rate <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")

Viewing the datasets

head(daily_activity)

head(calories)

head(intensity)

head(daily_steps)

head(sleep_rate)

head(weight_log)

Number of Users

n_distinct(daily_activity$Id)

## [1] 33

n_distinct(calories$Id)

## [1] 33

n_distinct(intensity$Id)

## [1] 33

n_distinct(daily_steps$Id)

## [1] 33

n_distinct(sleep_rate$Id)

## [1] 24

n_distinct(weight_log$Id)

## [1] 8

24 out of 33 users have provided their sleep data and 8 out of 33 users have provided their weight and BMI data.
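The overlap between tables can also be counted directly. The sketch below uses small made-up Id vectors (hypothetical stand-ins for `daily_activity$Id`, `sleep_rate$Id` and `weight_log$Id`) and base R only, to count the users who appear in all three tables.

```r
# Hypothetical Id vectors standing in for the Id columns of the three tables
activity_ids <- c(1, 1, 2, 3, 4, 4)
sleep_ids    <- c(1, 2, 2, 4)
weight_ids   <- c(2, 4, 4)

# Users present in every table
all_three <- Reduce(intersect, list(unique(activity_ids),
                                    unique(sleep_ids),
                                    unique(weight_ids)))

# Share of all tracked users who recorded steps, sleep and weight
share_all_three <- length(all_three) / length(unique(activity_ids))

print(all_three)
print(share_all_three)
```

On the real data, the three vectors would be replaced by the Id columns of the loaded tables.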

Checking data integrity and credibility using the ROCCC approach:
1. Reliable - The data set is not reliable: the sample of 33 users is very small and does not represent the wider population.
2. Original - The data was collected through a survey on Amazon Mechanical Turk and reaches us through a third party, so its originality cannot be confirmed.
3. Comprehensive - The data is not comprehensive. Key information needed for the business question, such as age, gender and location, is unavailable.
4. Current - The data was collected in 2016 and is therefore outdated.
5. Cited - No cited source is given, so its credibility is difficult to confirm.

PHASE 3: PROCESS

The daily_activity table already contains the calories, intensity and daily steps data. As a first check that the separate tables hold the same records, we compare their row counts.

nrow(daily_activity)

## [1] 940

nrow(calories)

## [1] 940

nrow(intensity)

## [1] 940

nrow(daily_steps)

## [1] 940

All four tables have the same number of rows (940).
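Equal row counts do not by themselves show that the values agree. A minimal sketch of a stricter check follows, using two tiny made-up tables; the column names `ActivityDay` and `Calories` for the calories table are an assumption about the layout of `dailyCalories_merged.csv`.

```r
# Toy stand-ins for daily_activity and calories (values are made up)
activity_toy <- data.frame(Id = c(1, 1, 2),
                           ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
                           Calories = c(1985, 1797, 2100))
calories_toy <- data.frame(Id = c(1, 1, 2),
                           ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
                           Calories = c(1985, 1797, 2100))

# Join on user and date, keeping both Calories columns side by side
both <- merge(activity_toy, calories_toy,
              by.x = c("Id", "ActivityDate"),
              by.y = c("Id", "ActivityDay"),
              suffixes = c("_activity", "_calories"))

# TRUE only if every row found a partner and every value agrees
same_data <- nrow(both) == nrow(activity_toy) &&
  all(both$Calories_activity == both$Calories_calories)
print(same_data)
```

The same pattern would apply to the intensity and steps tables, comparing the columns they share with daily_activity.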

Cleaning the dataset of any duplicated values.

sum(duplicated(daily_activity))

## [1] 0

sum(duplicated(sleep_rate))

## [1] 3

sum(duplicated(weight_log))

## [1] 0

sleep_rate has 3 duplicated rows, which are removed below.

sleep_rate <- 
  sleep_rate %>% 
  distinct()

sum(duplicated(sleep_rate))

## [1] 0

Checking the data sets for missing (NA) values.

sum(is.na(daily_activity))

## [1] 0

sum(is.na(sleep_rate))

## [1] 0

sum(is.na(weight_log))

## [1] 65

str(weight_log)

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

Fat is missing for most entries, and since it is not used in this analysis, the column is dropped.

weight_log <- weight_log %>% 
  select(-Fat)
sum(is.na(weight_log))

## [1] 0
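A per-column count shows where missing values are concentrated, which is one way to confirm that they all sit in a single column such as Fat. A small sketch on made-up rows shaped like weight_log:

```r
# Toy rows shaped like weight_log (values are made up)
weight_toy <- data.frame(Id = c(1, 1, 2, 3),
                         WeightKg = c(52.6, 52.6, 133.5, 56.7),
                         Fat = c(22L, NA, NA, NA),
                         BMI = c(22.6, 22.6, 47.5, 21.5))

# Number of missing values in each column
na_per_column <- colSums(is.na(weight_toy))
print(na_per_column)
```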

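Separating date and time values in the weight and sleep data sets. Each timestamp has three space-separated pieces (date, time, AM/PM), so separate() warns that the extra AM/PM piece is discarded; only the date is used in the rest of the analysis.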

weight_log<-
  weight_log %>% 
  separate(Date, c("Date", "Time")," ") %>% 
  select(-IsManualReport, -WeightPounds, -LogId, -Time)

## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

head(weight_log)

sleep_rate<-
  sleep_rate %>% 
  separate("SleepDay", c("Date", "Time"), " ")

## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

head(sleep_rate)
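An alternative to splitting on spaces is to parse the full timestamp, which keeps the AM/PM marker. This is a sketch rather than the method used above; it assumes the month/day/year, 12-hour layout visible in the str() output and a locale in which `%p` recognises AM and PM.

```r
# Example strings in the same layout as the Date / SleepDay columns
stamps <- c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM", "4/12/2016 12:00:00 AM")

# Parse date, 12-hour time and AM/PM in one step (UTC chosen arbitrarily)
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split into a proper Date and a 24-hour time string
dates <- as.Date(parsed)
times <- format(parsed, "%H:%M:%S")
print(data.frame(dates, times))
```

Working with a real Date column also makes later steps such as weekday extraction independent of the string layout.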

PHASE 4: ANALYZE

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

1. Tracker usage

tracker_usage <- daily_activity %>%
  group_by(Id) %>%
  summarize(active_days = n_distinct(ActivityDate))

tracker_usage <- tracker_usage %>%
  mutate(usage = case_when(
    active_days <= 10 ~ 'Low',
    active_days > 10 & active_days <= 20 ~ 'Moderate',
    active_days > 20 ~ 'High'))
head(tracker_usage)

# Count users per usage level so that each slice is proportional to the number of users
ggplot(tracker_usage, aes(x="", fill=usage))+
  geom_bar(width = 1)+
  coord_polar("y")+
  labs(x=NULL, y=NULL, title = "Tracker Usage", caption ="33 User Data")+
  theme(axis.ticks=element_blank(),
        axis.text.x=element_blank(),
        legend.position="top",
        legend.title = element_blank())

The pie chart shows how users are distributed across the usage levels, based on the number of days each user recorded activity. The large majority of users fall in the high-usage group.
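The share of users in each usage group can also be read off as numbers rather than from a chart. A sketch with made-up usage labels (hypothetical, not the real result):

```r
# Made-up usage labels for eight users
usage <- c("High", "High", "Low", "High", "Moderate", "High", "High", "High")

# Users per group and each group's share of all users
usage_counts <- table(usage)
usage_share <- prop.table(usage_counts)
print(round(usage_share, 2))
```

On the real data, `usage` would be the usage column of tracker_usage.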

2. Sleep records for each day of the week

sleep_day <- sleep_rate %>%
  mutate(day = weekdays(as.Date(Date, format = "%m/%d/%Y")))

sleep_day <- sleep_day %>%
  group_by(day) %>%
  summarize(days_used = n()) %>%
  arrange(day)

sleep_day$day <-ordered(sleep_day$day, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
sleep_day <- sleep_day %>%
  arrange(day)

ggplot(data = sleep_day) +
  geom_col(aes(x=day, y=days_used), fill='#FFDEB4') +
  geom_text(aes(x=day, y=days_used, label = days_used), vjust = -0.5) +
  labs(
    x="Days of the week",
    y="Usage",
    title = "TRACKER USAGE OVER THE WEEK"
    )

Tracker usage for sleep is highest mid-week and comparatively lower on weekends and Mondays.
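weekdays() returns day names in the session's locale, so hard-coded English factor levels can silently produce NA on a non-English system. A hedged sketch of a locale-independent ordering follows, using the ISO weekday number (`%u`, Monday = 1); the dates are made up.

```r
# Made-up dates in the same month/day/year layout as the Date column
date_chr <- c("4/12/2016", "4/13/2016", "4/17/2016", "4/18/2016", "4/19/2016")
dates <- as.Date(date_chr, format = "%m/%d/%Y")

# ISO weekday number: 1 = Monday ... 7 = Sunday, independent of locale
iso_day <- as.integer(format(dates, "%u"))

# Fixed English labels attached to the numbers
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
day <- factor(day_names[iso_day], levels = day_names, ordered = TRUE)

# Records per weekday, already in Monday-to-Sunday order
print(table(day))
```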

3. Accuracy of tracker measurement

tracker_variance <- 
  daily_activity %>% 
  filter(TotalDistance!=TrackerDistance)

Red points mark the records where the two distance measurements differ.

ggplot(daily_activity)+
  geom_point(mapping = aes(x=TotalDistance, y=TrackerDistance))+
  geom_point(data=tracker_variance, mapping=aes(x=TotalDistance, y=TrackerDistance), color='red')+
  labs(
    x="Total Distance",
    y="Tracker Measurement",
    title="Tracker Measurement Variance",
    )

A few records do not match; in those, the tracker-measured distance is less than the total distance recorded.
The variation is minor and therefore does not affect the results.
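How minor the variation is can be put in numbers. A sketch with made-up distances shaped like the two columns compared above:

```r
# Made-up rows shaped like daily_activity
dist_toy <- data.frame(TotalDistance   = c(8.50, 6.97, 9.88, 5.20),
                       TrackerDistance = c(8.50, 6.97, 9.20, 5.20))

# Gap between total and tracker distance for each record
gap <- dist_toy$TotalDistance - dist_toy$TrackerDistance

# Share of records that differ and the largest gap
share_mismatch <- mean(gap != 0)
largest_gap <- max(abs(gap))
print(c(share_mismatch = share_mismatch, largest_gap = largest_gap))
```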

4. Distance vs Calories

ggplot(daily_activity, aes(x=TotalDistance, y=Calories))+
  geom_point(color="cyan")+
  geom_smooth()+
  labs(title="Distance vs Calories Burned", subtitle = "940 observations of 33 users")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
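The trend in the plot can be summarised with a correlation coefficient and a straight-line fit. The numbers below are made up so that the result is easy to check by hand; they are not taken from the Fitbit data.

```r
# Made-up daily records: calories rise by exactly 100 per unit of distance
toy <- data.frame(TotalDistance = c(1, 2, 3, 4, 5))
toy$Calories <- 1500 + 100 * toy$TotalDistance

# Pearson correlation between distance and calories
r <- cor(toy$TotalDistance, toy$Calories)

# Least-squares line: Calories = intercept + slope * TotalDistance
fit <- lm(Calories ~ TotalDistance, data = toy)
slope <- unname(coef(fit)["TotalDistance"])
print(c(r = r, slope = slope))
```

On the real data, daily_activity would take the place of `toy`. A clearly nonzero intercept would mean the relation is linear but not strictly proportional, since calories are burned even at zero distance.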

Next, totals are computed per user to compare the distance covered with the calories burned.

distance_report <- 
  daily_activity %>% 
  group_by(Id) %>% 
  summarize(total_distance=sum(TotalDistance), total_steps=sum(TotalSteps), calories=sum(Calories), 
            very_active=sum(VeryActiveDistance), moderately_active=sum(ModeratelyActiveDistance), 
            light_active=sum(LightActiveDistance))

distance_report <-
  distance_report %>% 
  # case_when() checks conditions in order; >= keeps boundary values from becoming NA
  mutate(activity=case_when(
    calories>=80000 ~ "intense",
    calories>=50000 ~ "moderate",
    calories<50000 ~ "light")) %>% 
  mutate(distance_range=case_when(
    total_distance>=350 ~ ">350km",
    total_distance>=250 ~ "250-350km",
    total_distance>=150 ~ "150-250km",
    total_distance<150 ~ "0-150km"))

head(distance_report)

ggplot(data=distance_report, aes(x=activity, fill=distance_range))+
  geom_bar()+
  labs(x="Intensity", title = "Intensity of calories burnt", subtitle = "940 observations of 33 users")+
  theme(axis.text.y= element_blank(), axis.title.y = element_blank(), axis.title.x = element_text(face = "bold"), legend.title = element_blank())

5. Activity vs Daily Steps

A user’s activity level is categorized as sedentary, lightly active, fairly active or very active based on their average daily step count.

daily_average <- daily_activity %>%
  group_by(Id) %>%
  summarise (mean_daily_steps = mean(TotalSteps), mean_daily_calories = mean(Calories))

user_type <- daily_average %>%
  group_by(Id) %>% 
  mutate(user_activity = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active", 
    mean_daily_steps >= 10000 ~ "very active"
  )) %>% 
  arrange(user_activity)

head(user_type)

user_type$Id <- as.character(user_type$Id)

ggplot(user_type, aes(x=user_activity, y=Id, fill=mean_daily_steps))+
  geom_bar(stat="identity", position = "stack")+
  coord_polar("x")+
  labs(x=NULL, y=NULL, title="ACTIVITY AND DAILY STEPS")+
  theme(axis.title = NULL, axis.text.x = element_text(colour='black', angle=22, face = "bold"))
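Because mean step counts are rarely whole numbers, the category boundaries should leave no gaps between them. One way to guarantee that is cut() with left-closed intervals; the step values here are made up and include the boundary cases.

```r
# Made-up mean daily steps, including values right at the boundaries
mean_daily_steps <- c(4999.9, 5000, 7499.5, 7500, 9999.9, 10000)

# Left-closed bins: [0, 5000), [5000, 7500), [7500, 10000), [10000, Inf)
user_activity <- cut(mean_daily_steps,
                     breaks = c(0, 5000, 7500, 10000, Inf),
                     labels = c("sedentary", "lightly active",
                                "fairly active", "very active"),
                     right = FALSE)
print(data.frame(mean_daily_steps, user_activity))
```

A single vector of breaks makes it impossible for a value to fall between two categories, which is easy to miss when each range is written as a separate condition.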

6. Time spent in bed vs Time asleep

ggplot(sleep_rate, aes(x=TotalMinutesAsleep, y=TotalTimeInBed, color=Id)) +
  geom_jitter()+
  geom_abline()+
  labs(
    x="Sleep Time",
    y="Time in Bed",
    title = "TIME IN BED vs TIME ASLEEP"
       )

A few users spend noticeably more time in bed than asleep, which suggests they have trouble falling asleep.
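The gap between time in bed and time asleep can be expressed as a sleep efficiency ratio. The 85% cut-off used below is only an illustrative assumption, and the rows are made up.

```r
# Made-up rows shaped like sleep_rate
sleep_toy <- data.frame(Id = c("A", "A", "B", "B"),
                        TotalMinutesAsleep = c(420, 400, 300, 330),
                        TotalTimeInBed     = c(440, 420, 450, 440))

# Fraction of time in bed actually spent asleep
sleep_toy$efficiency <- sleep_toy$TotalMinutesAsleep / sleep_toy$TotalTimeInBed

# Mean efficiency per user
by_user <- aggregate(efficiency ~ Id, data = sleep_toy, FUN = mean)

# Users below an illustrative 85% threshold (assumption, not from the source)
restless <- by_user$Id[by_user$efficiency < 0.85]
print(by_user)
print(restless)
```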

7. Individual users sleep time

sleep_rate$Id <- as.character(sleep_rate$Id)

less_sleep <- sleep_rate %>%
  mutate(sleep_hours=TotalMinutesAsleep/60) %>% 
  filter(sleep_hours <8)
  
suff_sleep <- sleep_rate %>% 
  mutate(sleep_hours=TotalMinutesAsleep/60) %>% 
  filter(sleep_hours>8)
  
glimpse(suff_sleep)

## Rows: 114
## Columns: 7
## $ Id                 <chr> "1503960366", "1503960366", "1644430081", "18445050…
## $ Date               <chr> "4/17/2016", "5/8/2016", "5/2/2016", "4/15/2016", "…
## $ Time               <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords  <int> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 700, 594, 796, 644, 722, 590, 750, 503, 531, 545, 5…
## $ TotalTimeInBed     <int> 712, 611, 961, 961, 961, 961, 775, 546, 565, 568, 5…
## $ sleep_hours        <dbl> 11.666667, 9.900000, 13.266667, 10.733333, 12.03333…

ggplot(less_sleep, aes(x=Id)) +
  geom_bar()+
  labs(
    x = "Users",
    y = "No. of days",
    title = "USERS GETTING INSUFFICIENT SLEEP")+
  theme(axis.text.x = element_text(angle = 90))

ggplot(suff_sleep, aes(x=Id)) +
  geom_bar()+
  labs(
    x = "Users",
    y = "No. of days",
    title = "USERS GETTING ADEQUATE SLEEP") +
  theme(axis.text.x = element_text(angle = 90))

average_sleep <- sleep_rate %>% 
  group_by(Id) %>% 
  summarise(avg_sleep_hour=mean(TotalMinutesAsleep)/60) %>% 
  mutate(sleep_level=case_when(
    avg_sleep_hour>=7 ~ "Adequate Sleep",
    avg_sleep_hour<7 ~ "Inadequate Sleep"
  ))

# Count users per sleep level so that each slice is proportional to the number of users
ggplot(average_sleep, aes(x="", fill=sleep_level))+
  geom_bar(width = 1)+
  coord_polar("y")+
  labs(x=NULL, y=NULL, title="SLEEP RATE")+
  theme(axis.text = element_blank(), legend.position = "top", 
        legend.title = element_blank())

More than half of the users average less than the recommended 7 hours of sleep.

8. User activity vs BMI

# Classify each user once, using their mean BMI across all weight logs
weight_class<-weight_log %>% 
  group_by(Id) %>% 
  summarise(BMI=mean(BMI)) %>% 
  mutate(weight_category=case_when(
    BMI<18.5 ~ "Underweight",
    BMI>=18.5 & BMI<25 ~ "Healthy",
    BMI>=25 & BMI<30 ~ "Overweight",
    BMI>=30 ~ "Obese"
  ))
head(weight_class)

weight_class$Id <- as.character(weight_class$Id)
# Count users per category so that each slice is proportional to the number of users
ggplot(weight_class, aes(x="", fill=weight_category))+
  geom_bar(width = 1)+
  coord_polar("y")+
  labs(x=NULL, y=NULL, title="WEIGHT CATEGORY", caption="Weight Data for 8 Users", colour="Category")+
  theme(axis.text = element_blank(), legend.title = element_blank(), legend.position = "bottom")

About two-thirds of the users are either overweight or obese.

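# Aggregate each table to one row per user before joining, so that steps are not
# multiplied by the number of weight logs
steps_by_user <- daily_activity %>%
  group_by(Id) %>% 
  summarise(total_steps=sum(TotalSteps))

bmi_by_user <- weight_log %>%
  group_by(Id) %>% 
  summarise(BMI=mean(BMI))

weight_comparison <- 
  inner_join(steps_by_user, bmi_by_user, by="Id") %>% 
  arrange(total_steps)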

weight_comparison$Id <- as.character(weight_comparison$Id)

ggplot(weight_comparison, aes(x=total_steps, y=BMI, color=Id))+
  geom_point()+
  geom_bar(stat="identity")+
  labs(x="Total Steps Taken", y="BMI", title="BMI vs Activity", caption="Weight Data for 8 Users")+
  theme(legend.position = "bottom")

The weight data cannot support any conclusive analysis, since the sample of 8 users is too small.
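One caution when combining the activity and weight tables: both hold several rows per user, so a join on Id alone pairs every activity row with every weight row and inflates any sums taken afterwards. A small made-up example:

```r
# Made-up data: one user with 2 activity days and 3 weight logs
activity_toy <- data.frame(Id = c(1, 1), TotalSteps = c(100, 200))
weight_toy   <- data.frame(Id = c(1, 1, 1), BMI = c(22.5, 22.6, 22.7))

# Joining on Id alone gives 2 x 3 = 6 rows, so steps are counted 3 times
joined <- merge(activity_toy, weight_toy, by = "Id")
inflated_steps <- sum(joined$TotalSteps)

# Aggregating each table to one row per user first avoids the inflation
steps_by_user <- aggregate(TotalSteps ~ Id, data = activity_toy, FUN = sum)
bmi_by_user   <- aggregate(BMI ~ Id, data = weight_toy, FUN = mean)
per_user <- merge(steps_by_user, bmi_by_user, by = "Id")

print(c(rows = nrow(joined), inflated = inflated_steps,
        correct = per_user$TotalSteps))
```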

PHASE 5: ACT

Even though the chosen data set is neither reliable nor consistent, we can still draw the following observations and recommendations:

Insight: Tracker usage is consistent among the users, which suggests that people are willing to wear fitness bands for prolonged periods.
Insight: Distance tracking is accurate across the trackers, with only minor discrepancies.
Insight: Tracker usage drops on Mondays and weekends and is highest mid-week.

Recommendation: Even though tracker usage is consistent, few users record all the features; among the 33 users, only 18% recorded all three variables (steps, sleep, weight). Bellabeat should encourage its users to record all health variables by explaining the importance and benefits of doing so, thereby increasing usage of its smart devices. Weekend events could be introduced to keep users engaged with the tracker every day.

Insight: Distance and calories burned are positively related: as a person covers more distance, they burn more calories.
Insight: Among the 8 users with BMI data, only a minority have a BMI in the healthy range.

Recommendation: Bellabeat could organize a campaign built around a distance milestone, to encourage users to take more steps, burn more calories and eventually maintain a healthy body.

Insight: Most users in the sample fall asleep soon after going to bed, while some still find it difficult to fall asleep.
Insight: Most users do not get adequate sleep, i.e., 7 hours (according to studies).

Recommendation: To improve users’ sleep and help them fall asleep sooner after going to bed, the fitness tracker could play relaxing or deep-focus sounds that start when the user gets into bed and stop once they fall asleep.