This is a capstone project I’ve recently completed as part of the Google Data Analytics Professionl Certificate. As described in the summary, course participants can learn the following:
This report and analysis follows the six steps in the data analysis process:
Ask, Prepare, Process, Analyze, Share, and Act
I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers are using their smart devices. The insights gained from this analysis will help guide marketing strategy for Bellabeat. I will present my analysis to the Bellabeat executive team, together with high-level recommendations for Bellabeat’s marketing strategy.
Bellabeat was founded in 2014 by Urška Sršen and Sando Mur. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Since originally launching, the company has positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Products became available through online retailers as well as on their own website.
To identify and analyze trends in smart device usage data in order to gain insights into how consumers use non-Bellabeat smart devices. These insights will guide the marketing strategy for the company.
The data used for this case study is Fitbit Fitness Tracker Data. This is a public data set stored in Kaggle and made available by Möbius. The data set contains personal fitness tracker from thirty Fitbit users, all of whom consented to the submission of personal tracker data. It includes data about physical activity, sleep, heart rate, and weight that can be used to explore user habits.
This data set generated by respondents to a distributed survey via Amazon Mechanical Turk between 2016-March-03 and 2016-May-12. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
The data contains 18 CSV files of anonymized user information, and is organised in long format. Each row represents a specific time interval with corresponding data points such as activity level, heart rate, and sleep patterns.
The data was downloaded from Kaggle and stored in the local drive on my Macbook Air laptop.
A good data set follows the acronym: ROCCC. It stands for the following:
Let’s conduct a cursory review of the Fitbit data based on the aforementioned criteria: Reliable: The data is not reliable based on the sample size. It only contains 30 respondents. Though this is often the sample size for small studies in exercise physiology, it is comparatively small compared to the millions of user Bellabeat claims to have. Furthermore, given the data is anonymized, it lacks demographic data such as age and sex that would particularly useful in analysis. Original: The data is not original. It was generated by respondents during a survey conducted by Amazon Mechanical Turk, as opposed to collect via the company’s own collection efforts. Comprehensive: The data is not comprehensive. For the same reasons listed under reliable, the data lacks key demographic data and a robust sample size needed to given statistical power to the analysis. Current: The data is not current. It was collected during a two month window in May, 2016, more than 9 years prior to this analysis. Furthermore, the Kaggle repository puports annual updating to the data. It unfortunately has not been updated in two years. **Cited*: This data was collected by Amazon Mechanical Turk, which is a crowdsourcing marketplace where requesters (businesses) can hire a global workforce to complete tasks, with the task in this scenario being the generation of biometric health data. There is no information otherwise regarding the reliability of the source.
Sršen mentioned the data set may have limitations and encouraged me to add another data source to mitigate those limitations as analyze the data. Specific data sets are not mentioned. Exploring and incorporating complimentary data sets can strength the statistical validity of the analysis and assist in the cleaning process while providing a more comprehensive analysis of the data. This would lead to more valuable insights for the stakeholders. After searching for similar data sets that might address the shortcomings in the provided data I came up empty-handed.
I have done an initial review of the CSV files through the Numbers application on my personal device. This program allows for similar data review at Microsoft Excel. Its integration within my own device provided an ease of access for initial data viewing. After conducting a cursory review, I’ve elected to analyze the following data sets:
I chose the data to assess any correlations between hourly steps, hourly intensities, and daily activity, and how that may affect weight and sleep. I assumed some relationship exists in the above based on my own background in fitness data. Based on the titles alone, I acknowledge I will need to transform the data to make it more conducive for analysis.
I have opted to use Google BigQuery to begin processing the data and prepare it for analysis. This course has provided me practical SQL skills through BigQuery I will be looking to display below. I also have some experience with MySQL, but for the purpose of this course will be utilizing the tools recommended. Because of the size of these data sets, I’ve opted away from using Microsoft Excel or Apple Numbers, as any data transformations would be difficult and cumbersome on the processing power.
After downloading the data sets to my personal device, as mentioned above, I’ve uploaded them into my Google work space for use with BigQuery. I specifically uploaded the CSVs into the project titled “endler-da-capstone20251117” and the data set “BellaBeat_Data_Capstone_20251117”.
BigQuery proved to have some difficulty automatically parsing the scheme for table creation during upload, so I manually assigned the schema for each of the five documents. I also ran into initial problems delineating the correct scheme (TIMESTAMP or DATETIME) for any of the columns detailing time, so I set the initial schema to a string. For all other variables, I set them to the appropriate BigQuery schema, which included the following:
I elected to use both the Graphic User Interface (GUI) Table option in BigQuery and the manual text editor in order to practice both skills. I ensured to keep the same name for the table as the CSV file name for initial posterity, with the intention to make data transformation edits after uploading.
I also ensured I selected the advanced options and listed 1 row as a header row.
SleepDay_Merged BigQuery Coding Input
DailyActivity_Merged GUI Input
I did some initial data exploration in the Apple Numbers application. However, I did more thorough data exploration after uploading the CSVs to my BigQuery data set.
COUNT to Determine the Number of Unique
Users in the Data SetsCOUNT and DISTINCT functions in
BigQuery to parse the data for unique anonymous User IDs. This allowed
me to determine how many users contributed to each of the five analyzed
data sets.
All SELECT COUNT(DISTINCT) statements utilized the
general format. I’ve only included one for brevity.
From the above data chart, we can see the data sets only contained records from 33 users. Not all users utilized each function (namely sleep and weight tracking). So few users utilized the weight log, there isn’t any meaningful analysis to be done, especially with length of data collection (only a two month window). I know from experience it would likely create bias in the overall results. Furthermore, 33 users is a small sample size, and not likely a reasonable alliquot from the total population. It’s likely that any data trends observed may not be representative of the population at large. Regardless, we can see some trends:
When I imported all of the CSVs into BigQuery, I followed the same naming customs as the original files. This resulted in all columns being in Camel Case. Many analysis programs, like R and Python, utilize snake case. As such, I first needed to rename all of the data sets with the correct format:
Note, other than altering table name, I am no longer working with the weightlog_info dataset because of the small sample size
ALTER TABLE DailyActivity_Merged RENAME TO daily_activity;
ALTER TABLE SleepDay_Merged RENAME TO sleep_day;
ALTER TABLE HourlySteps_Merged RENAME TO hourly_steps;
ALTER TABLE HourlyIntensities_Merged RENAME TO hourly_intensities;
ALTER TABLE WeightLogInfo_Merged RENAME TO weightlog_info;
Then I specifically altered the columns in each table to be in snake case:
ALTER TABLE daily_activity
RENAME COLUMN Id TO user_id,
RENAME COLUMN ActivityDate TO activity_date,
RENAME COLUMN TotalSteps TO total_steps,
RENAME COLUMN TotalDistance TO total_distance,
RENAME COLUMN TrackerDistance TO tracker_distance,
RENAME COLUMN LoggedActivitiesDistance TO logged_activities_distance,
RENAME COLUMN VeryActiveDistance TO very_active_distance,
RENAME COLUMN ModeratelyActiveDistance TO moderately_active_distance,
RENAME COLUMN LightActiveDistance TO light_active_distance,
RENAME COLUMN SedentaryActiveDistance TO sedentary_active_distance,
RENAME COLUMN VeryActiveMinutes TO very_active_minutes,
RENAME COLUMN FairlyActiveMinutes TO fairly_active_minutes,
RENAME COLUMN LightlyActiveMinutes TO lightly_active_minutes,
RENAME COLUMN SedentaryMinutes TO sedentary_minutes;
ALTER TABLE sleep_day
RENAME COLUMN Id to user_id
RENAME COLUMN SleepDay to sleep_day,
RENAME COLUMN TotalSleepRecords to total_sleep_records,
RENAME COLUMN TotalMinutesAsleep to total_minutes_asleep,
RENAME COLUMN TotalTimeInBed to total_time_in_bed;
ALTER TABLE hourly_steps
RENAME COLUMN Id TO user_id,
RENAME COLUMN ActivityHour TO activity_hour,
RENAME COLUMN StepTotal TO step_total;
ALTER TABLE hourly_intensities
RENAME COLUMN Id TO user_id,
RENAME COLUMN AcitivtyHour TO activity_hour,
RENAME COLUMN TotalIntensity TO total_intensity,
RENAME COLUMN AverageIntensity TO average_intensity;
This created the following new tables, with theri associate new
columns:
- daily_activity from DailyActivity_Merged
- sleep_day from SleepDay_Merged
- hourly_steps from HourlySteps_Merged
- hourly_intensities from HourlyIntensities_Merged
Had I not manually selected the data schema for each data column when I
imported them into BigQuery, I would run a
DESCRIBE
statement to determine the data type of each, as shown:
All of the date/times are the incorrect data type. They need to be
the DATE data type (for daily_activity) and
DATETIME datatype for all other data sets. To accomplish
this, I used the PARSE_DATE function in BigQuery for
daily_activity:
CREATE OR REPLACE TABLEdaily_activity ASSELECTuser_id,PARSE_DATE('%m/%d/%Y', activity_date) AS activity_date,total_steps,total_distance,tracker_distance,logged_activities_distance,very_active_distance,moderately_active_distance,light_active_distance,sedentary_active_distance,very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes,CaloriesFROMdaily_activity;
For the other data sets, I used the function
PARSE_DATETIME.
CREATE OR REPLACE TABLE hourly_intensities AS
SELECT
user_id,
PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', activity_hour) AS activity_hour,
total_intensity,
average_intensity
FROM
hourly_intensities;
CREATE OR REPLACE TABLE
hourly_steps AS
SELECT
user_id,
PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', activity_hour) AS activity_hour,
step_total
FROM
hourly_steps;
CREATE OR REPLACE TABLEsleep_day ASSELECTuser_id,PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', t.sleep_day) AS sleep_day,total_sleep_records,total_minutes_asleep,total_time_in_bedFROMsleep_day AS t;
Note: For sleep_day, I had to use an alias for the column because
the DATETIME column and data set had the same name,
creating some anbiguity in the code.
Alternatively, I could have separated the DATETIME data
into two new columns (DATE and TIME) but have
elected to move forward with the above changes. After inputting the
above code, I observed the data types in the BigQuery GUI over using the
DESCRIBE function due to ease of access. See below:
With this done, I can move onto data cleaning.
Before I can export the data for analysis, I need to ensure it is clean of errors. I’ve already transformed it to ensure proper formatting. The main source of errors stems from duplicate or null values. In a normal distribution, I would also attempt to control for outliers. Given the small data set, I will be using all distinct values to lend maximum power to analysis. First, I will start with duplicate values.
In BigQuery and other SQL platforms, I can filter out duplicate
values with the SELECT DISTINCT function. This will select
only the unique data points and exclude any duplicate values.
First, I will use the function SELECT COUNT(*) to
determine the total count of data points in each data set.
SELECT COUNT: 940
SELECT DISTINCT: 940
SELECT COUNT: 413
SELECT DISTINCT: 410
SELECT COUNT: 22099
SELECT DISTINCT: 22099
SELECT COUNT: 22099
SELECT DISTINCT: 22099
From here, we can see everything except the sleep_day has no
duplicate values. I also assessed for blank (null) values using the
COUNTIF and IS NULL` commands in BigQuery. None returned
any null values. I’ve included the code snippet from the daily_activity
data set below:
SELECTCOUNTIF(user_id IS NULL) AS user_id_null_count,COUNTIF(activity_date IS NULL) AS activity_date_null_count,COUNTIF(total_steps IS NULL) AS total_steps_null_count,COUNTIF(total_distance IS NULL) AS total_distance_null_count,COUNTIF(tracker_distance IS NULL) AS tracker_distance_null_count,COUNTIF(logged_activities_distance IS NULL) AS logged_activities_distance_null_count,COUNTIF(very_active_distance IS NULL) AS very_active_distance_null_count,COUNTIF(moderately_active_distance IS NULL) AS moderately_active_distance_null_count,COUNTIF(light_active_distance IS NULL) AS light_active_distance_null_count,COUNTIF(sedentary_active_distance IS NULL) AS sedentary_active_distance_null_count,COUNTIF(very_active_minutes IS NULL) AS very_active_minutes_null_count,COUNTIF(fairly_active_minutes IS NULL) AS fairly_active_minutes_null_count,COUNTIF(lightly_active_minutes IS NULL) AS lightly_active_minutes_null_count,COUNTIF(sedentary_minutes IS NULL) AS sedentary_minutes_null_count,COUNTIF(Calories IS NULL) AS calories_null_countFROMdaily_activity;Now I can create the final tables, based on the above cleaning results:
Now that all four tables are finalized, I can export them as CSVs for data visualization and analysis. Exporting CSVs from BigQuery is not the most straight forward process. I complete the following sequence of events to download my CSV files for analysis:
Now that all of my data is cleaned and organized, I can begin the analysis section. Before importing any files into R, I made sure to load the tidyverse library:
library(tidyverse)
── Attaching core tidyverse packages ───────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4
✔ readr 2.1.5
✔ forcats 1.0.1
✔ stringr 1.5.2
✔ ggplot2 4.0.0
✔ tibble 3.3.0
✔ lubridate 1.9.4
✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────── ───────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
I imported all four files into R using the GUI following the pathway in the files panel in RStudio. This is convenient and allowed for quick access to all four data frames.
For posterity, had I needed to import via code I would’ve first
changed the working directory. I checked my working directory using the
code getwd() and then once I confirmed it wasn’t where I
saved my files, I changed it to via the code setwd(). I
repeated the getwd() command to ensure the working
directory changed, as intended. From there, I could enter the code
below:
daily_activity_final <- read_csv("daily_activity_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d")));
sleep_day_final <- read_csv("sleep_day_final", col_types = cols(user_id = col_character(), sleep_date = col_date(format = "%Y-%m-%d"), sleep_time = col_time(format = "%X")));
hourly_steps_final <- read_csv("hourly_steps_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));
hourly_intensities_final <- read_csv("hourly_intensities_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));
I created a new data frame categorizing the users based on the frequency they utilized the Fitbit app for data collection. This would be useful for providing a user overview and for exporting the CSV for use in data visualization in programs like Tableau or Flourish Studio.
user_types <- daily_activity_final %>% group_by(user_id) %>% summarise(days_used=sum(n())) %>% mutate(user_type = case_when(
days_used >= 1 & days_used <= 10 ~ "Light User",
days_used >= 11 & days_used <= 20 ~ "Moderate User",
days_used >= 21 & days_used <= 29 ~ "Heavy User",
days_used >= 30 ~ "Everyday User"))
This created four user types based on the above 4 variables. From a quick glance at the data, I can see there is 1 light user, 3 moderate uers, 5 heavy users, and 22 everyday users. I also exported the CSV for use in other programs.
write.csv(user_types, "user_types.csv", row.names=FALSE).
I then confirmed the file was in the selected directory.
I have some past experience working as a scientist and conducting
statistical analysis to confirm or deny a null hypothesis through
soemthing like a T or Z test. However, for situations like this,
summarizing the data is more useful to provide an overall assessment of
the data. To achieve this, I used the summary()
command.
Daily Activity Summary
Daily Sleep Summary
Hourly Intensities Summary
Hourly Steps Summary
We can observed some notable results right off the bad. Both the median (7,406) and mean (7,638) total daily steps were relatively low. The United States Center for Disease Control and Prevention previously published a study associating adults taking 8,000 steps per day with 51% lower mortality from all causes than an individual taking 4,000 steps. This effect with more pronounced as steps increase, with 12,000 steps showing a 65% decrease. A recent study published at The Lancet showed meaningful health benefits (47% reduction) with 7,000 steps versys 2,000.
Other studies even suggest health benefits based on frequency of movement, rather than a hard cut off value. This segways into the next point. The European Commission cites sedentary behavior as Sedentary behaviour refers to any waking behaviour characterised by energy expenditure ≤ 1.5 METs, while in in a sitting, reclining or lying position, where MET refers to metabolic equivalents. One MET is the energy expended while sitting at rest. This definition has been corroborated in studies as well as authoritative agencies such as the United States CDC and the World Health Organization. The average sedentary time is 991.2 minutes, more than 2/3rd the entire day. Even the most active quartile (1st quartile) exhibited sedentary behavior for 729.8 minutes, or just over half the day. Fitbit is meant to reveal this data, but generally users need to make the effort to change their habits on their own.
Activity of all levels varied. Users were lightly active for 199.0 minutes, fairly active for 6.00 minutes, and very active for 4.00 minutes on average, showing empirically nearly all active minutes were lightly active. The above referenced studies indicate any type of activity above sedentary is beneifical, and that increased intensity is not correlative with reductions in mortality at higher statistical rates. As such, just increasing activity from sedentary to light would likely cause a great boon to the users’ health.
Articles from the CDC indicate the average adult needs 7 or more hours per sleep a night. The National Institute of Health. Meanwhile, a study by the WHO suggests individuals with intermediate sleep durations (6-9 hours) exhibit statistically higher cognitive scores than those outside the range. The average sleep duration of these users is 6.99 hours and the median is 7.21 hours, with 50% of the users falling between 6.02 and 8.17 hours. The users in this data set generally fell near or within recommended ranges based on studies. There was also a notable difference between time in bed and time asleep. Users were in bed for 7.64 hours on average, indicating the slept for about 91.5% of their time in bed.
Bellabeat’s mission is focusing on creating innovative health and wellness products for women, in order to empower women to take control of their health by providing them with technology-driven solutions that blend design and function. Their products focus on sleep, activity, stress, and women’s reproductive health.
The provided dataset is small; a larger data set is necessary to carry out a more in-depth analysis. Ideally, a wide variety of data should be collected to provided data useful to Bellabeat’s mission. On top of the data provided from these Fitbit data sets, other useful information would included demographics (age, gender, occupation, location), nutrition information (especially hydration, given Bellabeat’s product-line), any training regiment, etc. Demographic data is useful to control for bias. Furthermore, this data is out-dated. Newer data sets would be beneficial to ensure business solutions are drafted based on the most accurate analysis.
In summary, these are the key findings: - Users had various levels of interaction with the Fitbit applicaiton. Not all utilized sleep tracking; even fewer utilized the weight tracking funciton. - The weight log function is by far the least popular function of those included in this data set. - Users were most active Saturday and Tuesday, with slightly less activity on Friday and Sunday, based on movement. - Users spent by far the most time in sedentary (</= 1.5 MET) activity. - Users were mostly active between 8 AM and 8 PM, with localized peaks in the evening and early afternoon. - Active users burnt more calories. - Of the limited cohort utilizing the sleep tracking, users seemed to have roughly enough sleep on average, per cited studies. However, users tended to have some difficulty falling asleep in under the 20 minutes cited as typical.
Based on this analysis, I woulld recommend the following: Stronger Sleep Integration - Many smart products today introduce a variety of applications for sleep assistance, including blue light filtering, automatic alarms, wind-down time, wind-up alarms, etc. Bellabeat can integrate functionalities such as sleep tracking seen in the Apple Watch to provide more relevent data to users, including REM data, sleep stage breakdown, bed-time notifications, and periodic (weekly, biweekly, monthly) sleep trends. - Bellabeat should campaign on the above features as a core metric for tracking overall health. Healthy sleep is beneficial to all apsects of health, from mental acuity, to fitness recovery, to digestion consistency, to hormonal regulation. Since many Fitbit users get barely adequate or inadequate sleep, it is likely that Bellabeat users have similar shortcomings. A strong informational campaign could help combat this.
Activity and Exercise Integration - Many Fitbit users exhibited mostly sedentary activity. There is no data about their use of activity (cardiovascular, weight training, etc.). Including functions for logging this information may be beneficial to Bellabeat users to promote overall activity by tracking progress.
Nutrition Integration - Bellabeat promotes products for tracking hydration and promotes themselves as a pioneer of women’s health. Including nutrition logging would be a boon to promoting proper hydration and ensuring better horomone regulation through macro and micronutrient monitoring.
Weight Tracking Analysis - The cohort for this data set barely utilized the weight tracking funciton. It would benefit Bellabeat to collect data collection (perhaps through surveys) to find out why it’s sparingly used. Weight tracking would sync well with other metrics Bellabeat finds beneficial for their users.
Step and Activity Notifications - Most smart devices offer some way to notify the user to hit the minimum step or activity goal, such as 250 steps/hour on Fitbit or activity every hour on Apple Watch. Intefgrating these features into the Bellabeat ecosystem would likely promote more activity for the users. - Promoting more activity in any form will reduce sedentary behavior, which according to many studies by the CDC, WHO, and indepdent researchers, are som eof the best eays to reduce overall risk of mortality.
Online Advertisements - Bellabeat’s overall goal is not just to promote their products for tracking health, but promote themselves as a bastion of women’s health. They should engage in product advertisements that describes the Bellabeat app not just as a fitness app, but as an app for promoting a healthy lifestyle and empowering women to strike the best balance between their work, personal, and fitness lives.