Table of contents


Introduction

The following project is a case study for the Google Data Analytics Professional Certificate focusing on the company Bellabeat, a high-tech manufacturer of health-focused products for women.


Scenario

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.


Characters and stakeholders

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s co-founder
  • Bellabeat’s marketing analytics team

Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.


Problems to be solved

Business Tasks

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat’s customers?

3. How could these trends help influence Bellabeat’s marketing strategy?

A clear summary of the business task:

Gaining insights into trends in smart device usage and applying them to offer recommendations to guide BellaBeat’s marketing strategy, particularly for the BellaBeat App as it is the closest match in terms of applicable data being measured by the FitBit sample dataset.


Data and limitations

The datasets contain fitness tracker information from thirty Fitbit users who consented to the submission of personal tracker data between 03.12.2016 – 05.12.2016 via Amazon Mechanical Turk that can be used to explore user behavior which includes daily, minute and second level information for:

A description of all data sources used:

Specific datasets included in analysis:

Limitations:

Omissions:

Specific datasets supplied but omitted in analysis:


Data cleaning

All data cleaning tasks were completed using R Studio, installing the following packages will allow me to do this:

Following package installation, I then imported the files into R Studio via the read.csv function:

dailyActivity_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/dailyActivity_merged.csv")
hourlyCalories_merged_test <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlyCalories_merged_test.csv")
hourlyIntensities_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlyIntensities_merged.csv")
hourlySteps_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlySteps_merged.csv")
sleepDay_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/sleepDay_merged.csv")
weightLogInfo_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/weightLogInfo_merged.csv")

Now that all the files have been loaded in, its time to have a look at the structure of the tables and look for any errors in format:

str(dailyActivity_merged)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1503960366 1503960366 1503960366 1503960366 1503960366 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(sleepDay_merged)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1503960366 1503960366 1503960366 1503960366 1503960366 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
str(weightLogInfo_merged)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1503960366 1503960366 1927972279 2873212765 2873212765 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1462233599000 1462319999000 1460509732000 1461283199000 1463097599000 ...
str(hourlyCalories_merged_test)
## 'data.frame':    22099 obs. of  5 variables:
##  $ Id          : num  1503960366 1503960366 1503960366 1503960366 1503960366 ...
##  $ ActivityHour: chr  "2016-04-12" "2016-04-12 01:00:00" "2016-04-12 02:00:00" "2016-04-12 03:00:00" ...
##  $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...
##  $ Date        : chr  "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
##  $ Time        : chr  "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
str(hourlyIntensities_merged)
## 'data.frame':    22099 obs. of  4 variables:
##  $ Id              : num  1503960366 1503960366 1503960366 1503960366 1503960366 ...
##  $ ActivityHour    : chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...
##  $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...
str(hourlySteps_merged)
## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1503960366 1503960366 1503960366 1503960366 1503960366 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...

Looking at the output, the following observations are made:

Now on to cleaning the data:

# Cleaning the column names and renaming the table

daily_activity <- clean_names(dailyActivity_merged)

# Changing the format of activity_date variable

daily_activity$activity_date <- as.Date(daily_activity$activity_date,"%m/%d/%y")
# Cleaning the column names and renaming the table 

daily_sleep <- clean_names(sleepDay_merged)

# Changing the format of sleep_day variable and dropping the time portion as 
# it is automatically logged and wont be useful

daily_sleep$sleep_day <- as.Date(daily_sleep$sleep_day,"%m/%d/%y")
# Cleaning the column names and renaming the table

weight_log_info <- clean_names(weightLogInfo_merged)

# Changing the format of date variable to date time

weight_log_info$date <- parse_date_time(weight_log_info$date, "%m/%d/%y %H:%M:%S %p")

# Changing the format of is_manual_report variable to logical

weight_log_info$is_manual_report <- as.logical(weight_log_info$is_manual_report)
# Cleaning the column names and renaming the table

hourly_intensities <- clean_names(hourlyIntensities_merged)

# Changing the format of activity_hour variable to date time

hourly_intensities$activity_hour <- parse_date_time(hourly_intensities$activity_hour, "%m/%d/y %H:%M:%S %p")
# Cleaning the column names and renaming the table

hourly_steps <- clean_names(hourlySteps_merged)

# Changing the format of activity_hour variable to date time

hourly_steps$activity_hour <- parse_date_time(hourly_steps$activity_hour, "%m/%d/%y %H:%M:%S %p")

Some manipulation to the datasets that I need to make in order to query and visualize the data in a meaningful way:

# Adding column day_of_week to this dataframe

daily_activity$day_of_week <- weekdays(daily_activity$activity_date)

# Changing the order of week days to better visualize 

daily_activity$day_of_week <- factor(daily_activity$day_of_week, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
# Adding column bmi_range to this dataframe

weight_log_info <- weight_log_info %>% 
  mutate(bmi_range = case_when(
    bmi > 24.9 ~ 'Over',
    bmi < 18.5 ~ 'Under',
    TRUE ~ 'Healthy'
    ))
# Adding column sleep_range to this dataframe

daily_sleep <- daily_sleep %>% 
  mutate(sleep_range = case_when(
    total_minutes_asleep >= 560 ~ 'Over',
    total_minutes_asleep < 420 ~ 'Under',
    TRUE ~ 'Healthy'
    ))

# Formatting sleep_range variable as factor 

daily_sleep$sleep_range <- as.factor(daily_sleep$sleep_range)

Analysis and Visualizations

The first aspect to analyze user behavior is to look at which days users are most and least active, with supporting arguments which may explain the behavior:

# A column chart showing week day and activity level by step count

daily_activity %>% 
  ggplot(aes(day_of_week, total_steps, fill = day_of_week)) +
  geom_col(alpha = 0.4) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  legend.position = "none")+
  labs(title = "Step Count Totals by Week Day",
       subtitle = "33 users over a one month period",
       x = "Week Day",
       y = "Step Count")

Observations regarding the chart above are:


The next aspect to analyze user behavior is to see how many users are under, at a healthy, or over the recommended BMI range for a young adult:

# A tally of users who are under recommended BMI

nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Under"))
## [1] 0
# A tally of users who are at a healthy BMI

nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Healthy"))
## [1] 3
# A tally of users who are over recommended BMI

nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Over"))
## [1] 5

Based on the users information the following observations can be made:


The next aspect to analyze user behavior is to inspect the time of day when strenuous activity occurs, to best understand when users are working out as opposed to general walking/etc.

# group_by function to sum user activity by time interval

hourly_cal_grouped <- hourlyCalories_merged_test %>% 
    group_by(Time) %>% 
    summarise(sum_calories = sum(Calories))

# A column chart mapping the time of day that calories are burnt

ggplot(data = hourly_cal_grouped, aes(x=Time, y=sum_calories))+
geom_col(position = "dodge", color = "#000000", fill = "#82D3E1", alpha = 0.8)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Calories burnt by time of day",
    subtitle = "33 users activity grouped by time",
    x = "Time of Day",
    y = "Sum of Calories")

Observations regarding the chart above are :


The next aspect to analyze is users sleep duration, and whether or not users are getting the recommended amount of sleep.

First is a pie chart segmenting the users into 3 ranges of sleep quality:

# creating a table to include frequency and variable

sleep_range_pie_data <- table(daily_sleep$sleep_range)

# A pie chart displaying the quality range of sleep users are getting

PieChart(sleep_range_pie_data, hole=0, quiet=TRUE, 
         fill = "colors", color = "black", lty = "solid", lwd=1, clockwise=FALSE, 
         transparency = 0.4, init_angle = 310, alpha = 0.5,
         main = "Actual sleep quality vs recommended (420 - 560 mins)")

Secondly, a histogram showing the distribution of the number of hours asleep daily by recorded counts:

# adding variable "hours asleep" to dataframe

daily_sleep$hours_asleep = round((daily_sleep$total_minutes_asleep)/60, digits = 2)

# a histogram showing the distribution of hours asleep

daily_sleep %>% 
  ggplot(aes(x = hours_asleep)) +
  geom_histogram(stat = "bin", binwidth = 1, color = "#000000", fill = "#82D3E1", alpha = 0.8) +
  scale_x_continuous(breaks=1:15) +
  theme_bw() +
  labs(x = "Hours Asleep",
       y = "Recorded Counts",
       title = "Distribution of hours asleep per day")

Thirdly, a column chart to visualize which days users are getting the least sleep:

# adding a new column week_day to daily_sleep dataset

daily_sleep$day_of_week <- weekdays(daily_sleep$sleep_day)

# reordering the week days so they display logically

daily_sleep$day_of_week <- factor(daily_sleep$day_of_week, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))

# grouping sleep_range

daily_sleep_filtered <- daily_sleep %>%
  group_by(day_of_week) %>% 
filter(sleep_range=="Under" | sleep_range=="Healthy")

# a column chart showing sleep range by weekday

ggplot(daily_sleep_filtered, aes(fill = sleep_range, x = day_of_week, y = hours_asleep))+
  geom_col(position = "dodge")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~sleep_range)+
  labs(title = "Weekday by sleep range",
       x = "Week Day",
       y = "Hours Asleep",
       fill = "Sleep Range")

Observations regarding the charts above are :


The next aspect to analyze is the correlation between total steps, very active minutes and calories burned.

Firstly a scatter plot showing the relationship between total steps and calories burned:

# dropping null values as they are only cluttering the calculation of the relationship
# between total steps and calories burned

daily_activity_3 <- daily_activity %>% 
    rowwise() %>% 
    filter(sum(c(total_steps)) != 0)

# a scatter plot showing total steps vs calories

daily_activity_3 %>%
  ggplot(aes(total_steps, calories)) +
  geom_smooth(color = "black") +
  geom_point(size = 1.5, alpha = 0.6, position = "jitter", color = "#2F8BC6") +
  theme_bw() +
  labs(title = "Calories burned vs Step total",
       x = "Step Total",
       y = " Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Below is a calculation of the correlation between the two variables:

cor(daily_activity$total_steps, daily_activity$calories)
## [1] 0.5915681

Secondly, another scatter plot mapping very active minutes against calories burned:

# dropping null values as they are only cluttering the calculation of the relationship
# between very active minutes and calories burned

daily_activity_2 <- daily_activity %>% 
    rowwise() %>% 
    filter(sum(c(very_active_minutes)) != 0)

# scatter plot mapping very active minutes against calories burned

daily_activity_2 %>% 
  group_by(activity_date) %>% 
  ggplot(aes(very_active_minutes, calories)) +
  geom_smooth(color = "black") +
  geom_point(size = 1.5, alpha = 0.6, position = "jitter", color = "#2F8BC6") +
  theme_bw() +
  labs(title = "Calories Burned vs Very Active Minutes",
       x = "Very Active Minutes",
       y = "Calories Burned")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Below is a calculation of the correlation between the two variables:

cor(daily_activity$total_steps, daily_activity$very_active_minutes)
## [1] 0.6670787

Observations regarding the charts above are :


Reccomendations to guide marketing strategy

Based on user behavior the following recommendations are made:

Activity

BMI / Weight

Sleep patterns

Additional recommendations


That concludes this case study, thank you for reading and I do hope that you enjoyed it!


Additional Citations:

Fitabase Dictionary - This was used for a deeper understanding of some of the variables, ie units of measurement.

Sleep guidelines per the CDC

BMI guidelines per the CDC