Table of contents
Introduction to the company
Problems to be solved (Ask phase)
Data and limitations (Prepare phase)
Data cleaning (Process phase)
Analysis and visualizations (Analyze phase)
Recommendations (Act phase)
Introduction
The following project is a case study for the Google Data Analytics Professional Certificate focusing on the company Bellabeat, a high-tech manufacturer of health-focused products for women.
Scenario
Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company.
Characters and stakeholders
- Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s co-founder
- Bellabeat’s marketing analytics team
Products
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Problems to be solved
Business Tasks
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat’s customers?
3. How could these trends help influence Bellabeat’s marketing strategy?
A clear summary of the business task:
Gaining insights into trends in smart device usage and applying them to offer recommendations to guide BellaBeat’s marketing strategy, particularly for the BellaBeat App as it is the closest match in terms of applicable data being measured by the FitBit sample dataset.
Data and limitations
The datasets contain fitness tracker information from thirty Fitbit users who consented to the submission of personal tracker data between 03.12.2016 – 05.12.2016 via Amazon Mechanical Turk that can be used to explore user behavior which includes daily, minute and second level information for:
- Physical activity
- Heart rate
- Sleep monitoring
A description of all data sources used:
- FitBit Fitness Tracker Data made available through Mobius
Specific datasets included in analysis:
- dailyActivity_merged.csv
- sleepDay_merged.csv
- weightLogInfo_merged.csv
- hourlyCalories_merged.csv
- hourlyIntensities_merged.csv
- hourlySteps_merged.csv
Limitations:
Missing data of participants including but not limited to gender, age, location and season.
Incomplete data on almost all datasets, they all lack some datapoints especially where manual user input is required which will skew analysis.
Unreliable data, many missing datapoints, dated from 2016 so not current, not comprehensive as there are only 33 participants, in some cases as few as 8 though.
Omissions:
All minute and second level datasets, as they are not influential in guiding marketing strategy.
dailyActivity_merged.csv is a compilation of other datasets supplied, hence it is being used in place.
Specific datasets supplied but omitted in analysis:
- dailyCalories_merged.csv
- dailyIntensities_merged.csv
- dailySteps_merged.csv
- heartrate_seconds_merged.csv
- minuteCaloriesNarrow_merged.csv
- minuteCaloriesWide_merged.csv
- minuteIntensitiesNarrow_merged.csv
- minuteIntensitiesWide_merged.csv
- minuteMETsNarrow_merged.csv
- minuteSleep_merged.csv
- minuteStepsNarrow_merged.csv
- minuteStepsWide_merged.csv
Data cleaning
All data cleaning tasks were completed using R Studio, installing the following packages will allow me to do this:
Following package installation, I then imported the files into R Studio via the read.csv function:
dailyActivity_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/dailyActivity_merged.csv")
hourlyCalories_merged_test <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlyCalories_merged_test.csv")
hourlyIntensities_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlyIntensities_merged.csv")
hourlySteps_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/hourlySteps_merged.csv")
sleepDay_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/sleepDay_merged.csv")
weightLogInfo_merged <- read.csv("~/Data Analytics/R Training/Bellabeat Case Study/To keep in analysis/weightLogInfo_merged.csv")Now that all the files have been loaded in, its time to have a look at the structure of the tables and look for any errors in format:
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1503960366 1503960366 1927972279 2873212765 2873212765 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1462233599000 1462319999000 1460509732000 1461283199000 1463097599000 ...
## 'data.frame': 22099 obs. of 5 variables:
## $ Id : num 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityHour: chr "2016-04-12" "2016-04-12 01:00:00" "2016-04-12 02:00:00" "2016-04-12 03:00:00" ...
## $ Calories : int 81 61 59 47 48 48 48 47 68 141 ...
## $ Date : chr "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
## $ Time : chr "00:00:00" "01:00:00" "02:00:00" "03:00:00" ...
## 'data.frame': 22099 obs. of 4 variables:
## $ Id : num 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityHour : chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ TotalIntensity : int 20 8 7 0 0 0 0 0 13 30 ...
## $ AverageIntensity: num 0.333 0.133 0.117 0 0 ...
## 'data.frame': 22099 obs. of 3 variables:
## $ Id : num 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityHour: chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : int 373 160 151 0 0 0 0 0 250 1864 ...
Looking at the output, the following observations are made:
- dailyActivity_merged$ActivityDate is in character format and not date format.
- sleepDay_merged$SleepDay is in character format and not date time format.
- weightLogInfo_merged$Date is in character format and not date time format.
- weightLogInfo_merged$IsManualReport is in character format and not logical/boolean.
- hourlyCalories_merged$ActivityHour is in character format and not date time format.
- hourlyIntensities_merged$ActivityHour is in character format and not date time format.
- hourlySteps_merged$ActivityHour is in character format and not date time format.
- All variables are in CamelCase.
- Slight name changes to tables as the “merged” part is not needed.
Now on to cleaning the data:
# Cleaning the column names and renaming the table
daily_activity <- clean_names(dailyActivity_merged)
# Changing the format of activity_date variable
daily_activity$activity_date <- as.Date(daily_activity$activity_date,"%m/%d/%y")# Cleaning the column names and renaming the table
daily_sleep <- clean_names(sleepDay_merged)
# Changing the format of sleep_day variable and dropping the time portion as
# it is automatically logged and wont be useful
daily_sleep$sleep_day <- as.Date(daily_sleep$sleep_day,"%m/%d/%y")# Cleaning the column names and renaming the table
weight_log_info <- clean_names(weightLogInfo_merged)
# Changing the format of date variable to date time
weight_log_info$date <- parse_date_time(weight_log_info$date, "%m/%d/%y %H:%M:%S %p")
# Changing the format of is_manual_report variable to logical
weight_log_info$is_manual_report <- as.logical(weight_log_info$is_manual_report)# Cleaning the column names and renaming the table
hourly_intensities <- clean_names(hourlyIntensities_merged)
# Changing the format of activity_hour variable to date time
hourly_intensities$activity_hour <- parse_date_time(hourly_intensities$activity_hour, "%m/%d/y %H:%M:%S %p")# Cleaning the column names and renaming the table
hourly_steps <- clean_names(hourlySteps_merged)
# Changing the format of activity_hour variable to date time
hourly_steps$activity_hour <- parse_date_time(hourly_steps$activity_hour, "%m/%d/%y %H:%M:%S %p")Some manipulation to the datasets that I need to make in order to query and visualize the data in a meaningful way:
# Adding column day_of_week to this dataframe
daily_activity$day_of_week <- weekdays(daily_activity$activity_date)
# Changing the order of week days to better visualize
daily_activity$day_of_week <- factor(daily_activity$day_of_week, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))# Adding column bmi_range to this dataframe
weight_log_info <- weight_log_info %>%
mutate(bmi_range = case_when(
bmi > 24.9 ~ 'Over',
bmi < 18.5 ~ 'Under',
TRUE ~ 'Healthy'
))# Adding column sleep_range to this dataframe
daily_sleep <- daily_sleep %>%
mutate(sleep_range = case_when(
total_minutes_asleep >= 560 ~ 'Over',
total_minutes_asleep < 420 ~ 'Under',
TRUE ~ 'Healthy'
))
# Formatting sleep_range variable as factor
daily_sleep$sleep_range <- as.factor(daily_sleep$sleep_range)Analysis and Visualizations
The first aspect to analyze user behavior is to look at which days users are most and least active, with supporting arguments which may explain the behavior:
# A column chart showing week day and activity level by step count
daily_activity %>%
ggplot(aes(day_of_week, total_steps, fill = day_of_week)) +
geom_col(alpha = 0.4) +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none")+
labs(title = "Step Count Totals by Week Day",
subtitle = "33 users over a one month period",
x = "Week Day",
y = "Step Count")Observations regarding the chart above are:
- Lowest step count in order are Friday, Saturday and Wednesday. Potential reasons are social engagements after a long week on a Friday, rest thereafter on a Saturday, and Wednesday being a potential break from the activities of Sunday, Monday and Tuesday.
- Highest step count is on a Sunday, Monday and Tuesday suggesting that users end the week healthy and that continues through until midweek, and then drops off going into the weekend.
- Users seem to engage in the highest step count on a Sunday, and the count lowers daily until a slight increase in step count on Thursday from the day prior.
The next aspect to analyze user behavior is to see how many users are under, at a healthy, or over the recommended BMI range for a young adult:
# A tally of users who are under recommended BMI
nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Under"))## [1] 0
# A tally of users who are at a healthy BMI
nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Healthy"))## [1] 3
# A tally of users who are over recommended BMI
nrow(filter(distinct(weight_log_info, id, .keep_all = TRUE),bmi_range == "Over"))## [1] 5
Based on the users information the following observations can be made:
- None of the users are under the recommended BMI.
- 3 users are at a healthy BMI.
- 5 of the users are over the recommended BMI, suggesting that users could benefit from knowing their BMI is over the recommended count.
- Having weight loss to reach a healthy BMI as a goal would apply to a larger segment of the users.
The next aspect to analyze user behavior is to inspect the time of day when strenuous activity occurs, to best understand when users are working out as opposed to general walking/etc.
# group_by function to sum user activity by time interval
hourly_cal_grouped <- hourlyCalories_merged_test %>%
group_by(Time) %>%
summarise(sum_calories = sum(Calories))
# A column chart mapping the time of day that calories are burnt
ggplot(data = hourly_cal_grouped, aes(x=Time, y=sum_calories))+
geom_col(position = "dodge", color = "#000000", fill = "#82D3E1", alpha = 0.8)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Calories burnt by time of day",
subtitle = "33 users activity grouped by time",
x = "Time of Day",
y = "Sum of Calories")Observations regarding the chart above are :
- The time interval between 12:00 and 14:00 sees an increase in user activity, suggesting that a lunch break workout is a common theme among users.
- The time interval between 17:00 and 19:00 also sees an increase in user activity, suggesting that a post-work workout is a common theme among users.
The next aspect to analyze is users sleep duration, and whether or not users are getting the recommended amount of sleep.
First is a pie chart segmenting the users into 3 ranges of sleep quality:
# creating a table to include frequency and variable
sleep_range_pie_data <- table(daily_sleep$sleep_range)
# A pie chart displaying the quality range of sleep users are getting
PieChart(sleep_range_pie_data, hole=0, quiet=TRUE,
fill = "colors", color = "black", lty = "solid", lwd=1, clockwise=FALSE,
transparency = 0.4, init_angle = 310, alpha = 0.5,
main = "Actual sleep quality vs recommended (420 - 560 mins)")Secondly, a histogram showing the distribution of the number of hours asleep daily by recorded counts:
# adding variable "hours asleep" to dataframe
daily_sleep$hours_asleep = round((daily_sleep$total_minutes_asleep)/60, digits = 2)
# a histogram showing the distribution of hours asleep
daily_sleep %>%
ggplot(aes(x = hours_asleep)) +
geom_histogram(stat = "bin", binwidth = 1, color = "#000000", fill = "#82D3E1", alpha = 0.8) +
scale_x_continuous(breaks=1:15) +
theme_bw() +
labs(x = "Hours Asleep",
y = "Recorded Counts",
title = "Distribution of hours asleep per day")Thirdly, a column chart to visualize which days users are getting the least sleep:
# adding a new column week_day to daily_sleep dataset
daily_sleep$day_of_week <- weekdays(daily_sleep$sleep_day)
# reordering the week days so they display logically
daily_sleep$day_of_week <- factor(daily_sleep$day_of_week, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday","Saturday", "Sunday"))
# grouping sleep_range
daily_sleep_filtered <- daily_sleep %>%
group_by(day_of_week) %>%
filter(sleep_range=="Under" | sleep_range=="Healthy")
# a column chart showing sleep range by weekday
ggplot(daily_sleep_filtered, aes(fill = sleep_range, x = day_of_week, y = hours_asleep))+
geom_col(position = "dodge")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~sleep_range)+
labs(title = "Weekday by sleep range",
x = "Week Day",
y = "Hours Asleep",
fill = "Sleep Range")Observations regarding the charts above are :
- CDC guidelines for total daily sleep for a young adult are 7 - 9 hours.
- 50% of users get the recommended amount of sleep.
- 44% of users are getting under the recommended amount of sleep.
- 6% of users are getting more than the recommended amount of sleep.
- There is not a specific day where users get drastically less sleep, suggesting that the lack of sleep is constant rather than on certain days.
The next aspect to analyze is the correlation between total steps, very active minutes and calories burned.
Firstly a scatter plot showing the relationship between total steps and calories burned:
# dropping null values as they are only cluttering the calculation of the relationship
# between total steps and calories burned
daily_activity_3 <- daily_activity %>%
rowwise() %>%
filter(sum(c(total_steps)) != 0)
# a scatter plot showing total steps vs calories
daily_activity_3 %>%
ggplot(aes(total_steps, calories)) +
geom_smooth(color = "black") +
geom_point(size = 1.5, alpha = 0.6, position = "jitter", color = "#2F8BC6") +
theme_bw() +
labs(title = "Calories burned vs Step total",
x = "Step Total",
y = " Calories")## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Below is a calculation of the correlation between the two variables:
## [1] 0.5915681
Secondly, another scatter plot mapping very active minutes against calories burned:
# dropping null values as they are only cluttering the calculation of the relationship
# between very active minutes and calories burned
daily_activity_2 <- daily_activity %>%
rowwise() %>%
filter(sum(c(very_active_minutes)) != 0)
# scatter plot mapping very active minutes against calories burned
daily_activity_2 %>%
group_by(activity_date) %>%
ggplot(aes(very_active_minutes, calories)) +
geom_smooth(color = "black") +
geom_point(size = 1.5, alpha = 0.6, position = "jitter", color = "#2F8BC6") +
theme_bw() +
labs(title = "Calories Burned vs Very Active Minutes",
x = "Very Active Minutes",
y = "Calories Burned")## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Below is a calculation of the correlation between the two variables:
## [1] 0.6670787
Observations regarding the charts above are :
There is a strong correlation of 0.59 between the two variables Step Total and Calories, which is obvious to the user, the more steps taken, the more calories burned.
There is a strong correlation of 0.67 between the two variables Very Active Minutes and Calories, which is obvious to the user, the more vigorous the activity, the more calories burned.
Note that the stronger correlation indicates that being very active burns more calories than steps alone. Again quite obvious.
Reccomendations to guide marketing strategy
Based on user behavior the following recommendations are made:
Activity
Considering the days that contain the least amount of activity are Wednesday and Friday, notifying users on those days to get a quick workout in would boost their overall output over a monthly basis.
Introducing a group workout aspect to the app may well boost user activity, and bring together users of the product range, with the potential of having new potential clients engaged in activity with product users.
Including home workout videos as an aspect of the app, and having notifications for days where user activity is low, would add value and potentially increase activity on those days.
Focusing app notifications around the times of 11:00 and 16:00 would be useful, as it is shortly before the time that users are most likely to work out.
BMI / Weight
With regards to manually logging weight information, 8 of 33 users used this function, indicating that it is not well utilized and would benefit from an alteration to an automatic logging function.
Considering 62.5% of the users are over the recommended BMI range, there is an opportunity to motivate users to use the weight logging and calorie count functions more rigidly.
Introduce a reward system for being within BMI range, eg discounted price range for monthly subscription to the Coach, or a simple “badge” for milestones or in-app points.
Sleep patterns
- Considering 44% of users are not getting the recommended daily sleep for a young adult, users would benefit from knowing this information. This could be done by having a sleep score each morning upon waking, with metrics for length of time to fall asleep, REM count, etc.
Additional recommendations
Having a female focused range of products does negate around half of potential market share, so leveraging Bellabeat’s existing technology to create a range of unisex products would increase market share. Could even be done under a separate brand manufactured by Bellabeat.
Run a referral discount promotion for users who refer a friend or colleague to Bellabeat successfully to grow membership.
There are some in app issues for the Fitbit, like 0 step counts, 0 logged distances, N/A values, varying dates from 2016 to 2020 on datasets, would be wise to look at data from Bellabeat’s customer base and see if there are similar problems in data collection.
Get larger data for exploration!
That concludes this case study, thank you for reading and I do hope that you enjoyed it!
Additional Citations:
Fitabase Dictionary - This was used for a deeper understanding of some of the variables, ie units of measurement.