R Markdown

This is a capstone project I’ve recently completed as part of the Google Data Analytics Professionl Certificate. As described in the summary, course participants can learn the following:

Gain an immersive understanding of the practices and processes used by a junior or associate data analyst in their day-to-day job
Learn key analytical skills (data cleaning, analysis, & visualization) and tools (spreadsheets, SQL, R programming, Tableau)
Understand how to clean and organize data for analysis, and complete analysis and calculations using spreadsheets, SQL and R programming
Learn how to visualize and present data findings in dashboards, presentations and commonly used visualization platforms

This report and analysis follows the six steps in the data analysis process:

Ask, Prepare, Process, Analyze, Share, and Act

1: Introduction

I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers are using their smart devices. The insights gained from this analysis will help guide marketing strategy for Bellabeat. I will present my analysis to the Bellabeat executive team, together with high-level recommendations for Bellabeat’s marketing strategy.

1.1: About the Company

Bellabeat was founded in 2014 by Urška Sršen and Sando Mur. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Since originally launching, the company has positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Products became available through online retailers as well as on their own website.

1.2: Characters and Products

1.2.1: Characters

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
Bellabeat Marketing and Analytics Team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. I joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals — as well as how I, as a junior data analyst, can help Bellabeat achieve them.

1.2.2: Products

Bellabeat App: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat Membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

2: Step 1: Ask

2.1: Business Task

To identify and analyze trends in smart device usage data in order to gain insights into how consumers use non-Bellabeat smart devices. These insights will guide the marketing strategy for the company.

2.2: Key Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Bellabeat’s cofounder and key member of the Bellabeat executive team
Bellabeat Marketing Analytics Team

2.3: Guiding Questions for Analysis

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

3: Step 2: Prepare

3.1: Data Source Used

The data used for this case study is Fitbit Fitness Tracker Data. This is a public data set stored in Kaggle and made available by Möbius. The data set contains personal fitness tracker from thirty Fitbit users, all of whom consented to the submission of personal tracker data. It includes data about physical activity, sleep, heart rate, and weight that can be used to explore user habits.

3.2: Data Set Information

This data set generated by respondents to a distributed survey via Amazon Mechanical Turk between 2016-March-03 and 2016-May-12. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.

The data contains 18 CSV files of anonymized user information, and is organised in long format. Each row represents a specific time interval with corresponding data points such as activity level, heart rate, and sleep patterns.

The data was downloaded from Kaggle and stored in the local drive on my Macbook Air laptop.

3.3: Data Credibility and Limitations

A good data set follows the acronym: ROCCC. It stands for the following:

Reliable
Original
Comprehensive
Current
Cited

Let’s conduct a cursory review of the Fitbit data based on the aforementioned criteria: Reliable: The data is not reliable based on the sample size. It only contains 30 respondents. Though this is often the sample size for small studies in exercise physiology, it is comparatively small compared to the millions of user Bellabeat claims to have. Furthermore, given the data is anonymized, it lacks demographic data such as age and sex that would particularly useful in analysis. Original: The data is not original. It was generated by respondents during a survey conducted by Amazon Mechanical Turk, as opposed to collect via the company’s own collection efforts. Comprehensive: The data is not comprehensive. For the same reasons listed under reliable, the data lacks key demographic data and a robust sample size needed to given statistical power to the analysis. Current: The data is not current. It was collected during a two month window in May, 2016, more than 9 years prior to this analysis. Furthermore, the Kaggle repository puports annual updating to the data. It unfortunately has not been updated in two years. **Cited*: This data was collected by Amazon Mechanical Turk, which is a crowdsourcing marketplace where requesters (businesses) can hire a global workforce to complete tasks, with the task in this scenario being the generation of biometric health data. There is no information otherwise regarding the reliability of the source.

Sršen mentioned the data set may have limitations and encouraged me to add another data source to mitigate those limitations as analyze the data. Specific data sets are not mentioned. Exploring and incorporating complimentary data sets can strength the statistical validity of the analysis and assist in the cleaning process while providing a more comprehensive analysis of the data. This would lead to more valuable insights for the stakeholders. After searching for similar data sets that might address the shortcomings in the provided data I came up empty-handed.

4: Step 3: Process

4.1: Data Set Files Used for Analysis

I have done an initial review of the CSV files through the Numbers application on my personal device. This program allows for similar data review at Microsoft Excel. Its integration within my own device provided an ease of access for initial data viewing. After conducting a cursory review, I’ve elected to analyze the following data sets:

DailyActivity_Merged.csv
SleepDay_Merged.csv
HourlyIntensities_Merged.csv
HourlySteps_Merged.csv
WeightLogInfo_Merged.csv

I chose the data to assess any correlations between hourly steps, hourly intensities, and daily activity, and how that may affect weight and sleep. I assumed some relationship exists in the above based on my own background in fitness data. Based on the titles alone, I acknowledge I will need to transform the data to make it more conducive for analysis.

4.2: Tools Used to Process Data

I have opted to use Google BigQuery to begin processing the data and prepare it for analysis. This course has provided me practical SQL skills through BigQuery I will be looking to display below. I also have some experience with MySQL, but for the purpose of this course will be utilizing the tools recommended. Because of the size of these data sets, I’ve opted away from using Microsoft Excel or Apple Numbers, as any data transformations would be difficult and cumbersome on the processing power.

4.3: Data Upload and Familiarization in BigQuery

After downloading the data sets to my personal device, as mentioned above, I’ve uploaded them into my Google work space for use with BigQuery. I specifically uploaded the CSVs into the project titled “endler-da-capstone20251117” and the data set “BellaBeat_Data_Capstone_20251117”.

BigQuery proved to have some difficulty automatically parsing the scheme for table creation during upload, so I manually assigned the schema for each of the five documents. I also ran into initial problems delineating the correct scheme (TIMESTAMP or DATETIME) for any of the columns detailing time, so I set the initial schema to a string. For all other variables, I set them to the appropriate BigQuery schema, which included the following:

String for text (or date) that don’t fit in another category
Boolean for true/false values
Integer for whole numbers
Numeric for any columns with less decimal places
BigNumeric for any columns needing exact decimals

I elected to use both the Graphic User Interface (GUI) Table option in BigQuery and the manual text editor in order to practice both skills. I ensured to keep the same name for the table as the CSV file name for initial posterity, with the intention to make data transformation edits after uploading.

I also ensured I selected the advanced options and listed 1 row as a header row.

SleepDay_Merged BigQuery Coding Input

DailyActivity_Merged GUI Input

I did some initial data exploration in the Apple Numbers application. However, I did more thorough data exploration after uploading the CSVs to my BigQuery data set.

Hide Code

Show Code

4.4: Using `COUNT` to Determine the Number of Unique Users in the Data Sets

I used the COUNT and DISTINCT functions in BigQuery to parse the data for unique anonymous User IDs. This allowed me to determine how many users contributed to each of the five analyzed data sets.

Hide Code

All SELECT COUNT(DISTINCT) statements utilized the general format. I’ve only included one for brevity.

From the above data chart, we can see the data sets only contained records from 33 users. Not all users utilized each function (namely sleep and weight tracking). So few users utilized the weight log, there isn’t any meaningful analysis to be done, especially with length of data collection (only a two month window). I know from experience it would likely create bias in the overall results. Furthermore, 33 users is a small sample size, and not likely a reasonable alliquot from the total population. It’s likely that any data trends observed may not be representative of the population at large. Regardless, we can see some trends:

Not everyone utilizes all options within the smart devices. Sleep and weight tracking are underutilized compared to other metrics. This could be because users may not want to manually input data or have a scale (for weight) and may not want to wear their trackers to bed (for sleep).
Of the metrics I looked at, the weight log is the least utilized. I’ve personally used the Fitbit ecosystem before. Anecdotally, I can say it’s not likely because the feature is any clunkier in the app than any other manual entry feature. Manual is the operative word, however, as many users probably don’t want to do any addition logging past what the device does on its own.

4.5: Data Transformation

When I imported all of the CSVs into BigQuery, I followed the same naming customs as the original files. This resulted in all columns being in Camel Case. Many analysis programs, like R and Python, utilize snake case. As such, I first needed to rename all of the data sets with the correct format:

Note, other than altering table name, I am no longer working with the weightlog_info dataset because of the small sample size

ALTER TABLE DailyActivity_Merged RENAME TO daily_activity;
ALTER TABLE SleepDay_Merged RENAME TO sleep_day;
ALTER TABLE HourlySteps_Merged RENAME TO hourly_steps;
ALTER TABLE HourlyIntensities_Merged RENAME TO hourly_intensities;
ALTER TABLE WeightLogInfo_Merged RENAME TO weightlog_info;

Then I specifically altered the columns in each table to be in snake case:

ALTER TABLE daily_activity
RENAME COLUMN Id TO user_id,
RENAME COLUMN ActivityDate TO activity_date,
RENAME COLUMN TotalSteps TO total_steps,
RENAME COLUMN TotalDistance TO total_distance,
RENAME COLUMN TrackerDistance TO tracker_distance,
RENAME COLUMN LoggedActivitiesDistance TO logged_activities_distance,
RENAME COLUMN VeryActiveDistance TO very_active_distance,
RENAME COLUMN ModeratelyActiveDistance TO moderately_active_distance,
RENAME COLUMN LightActiveDistance TO light_active_distance,
RENAME COLUMN SedentaryActiveDistance TO sedentary_active_distance,
RENAME COLUMN VeryActiveMinutes TO very_active_minutes,
RENAME COLUMN FairlyActiveMinutes TO fairly_active_minutes,
RENAME COLUMN LightlyActiveMinutes TO lightly_active_minutes,
RENAME COLUMN SedentaryMinutes TO sedentary_minutes;

ALTER TABLE sleep_day
RENAME COLUMN Id to user_id
RENAME COLUMN SleepDay to sleep_day,
RENAME COLUMN TotalSleepRecords to total_sleep_records,
RENAME COLUMN TotalMinutesAsleep to total_minutes_asleep,
RENAME COLUMN TotalTimeInBed to total_time_in_bed;

ALTER TABLE hourly_steps
RENAME COLUMN Id TO user_id,
RENAME COLUMN ActivityHour TO activity_hour,
RENAME COLUMN StepTotal TO step_total;

ALTER TABLE hourly_intensities
RENAME COLUMN Id TO user_id,
RENAME COLUMN AcitivtyHour TO activity_hour,
RENAME COLUMN TotalIntensity TO total_intensity,
RENAME COLUMN AverageIntensity TO average_intensity;

This created the following new tables, with theri associate new columns:
- daily_activity from DailyActivity_Merged
- sleep_day from SleepDay_Merged
- hourly_steps from HourlySteps_Merged
- hourly_intensities from HourlyIntensities_Merged

Hide Code

Had I not manually selected the data schema for each data column when I imported them into BigQuery, I would run a DESCRIBE statement to determine the data type of each, as shown:

Hide Code

All of the date/times are the incorrect data type. They need to be the DATE data type (for daily_activity) and DATETIME datatype for all other data sets. To accomplish this, I used the PARSE_DATE function in BigQuery for daily_activity:

Hide Code

CREATE OR REPLACE TABLE
daily_activity AS
SELECT
user_id,
PARSE_DATE('%m/%d/%Y', activity_date) AS activity_date,
total_steps,
total_distance,
tracker_distance,
logged_activities_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
sedentary_active_distance,
very_active_minutes,
fairly_active_minutes,
lightly_active_minutes,
sedentary_minutes,
Calories
FROM
daily_activity;

For the other data sets, I used the function PARSE_DATETIME.

Hide Code

CREATE OR REPLACE TABLE hourly_intensities AS
SELECT
user_id,
PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', activity_hour) AS activity_hour,
total_intensity,
average_intensity
FROM
hourly_intensities;

CREATE OR REPLACE TABLE
hourly_steps AS
SELECT
user_id,
PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', activity_hour) AS activity_hour,
step_total
FROM
hourly_steps;

CREATE OR REPLACE TABLE
sleep_day AS
SELECT
user_id,
PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', t.sleep_day) AS sleep_day,
total_sleep_records,
total_minutes_asleep,
total_time_in_bed
FROM
sleep_day AS t;

Note: For sleep_day, I had to use an alias for the column because the DATETIME column and data set had the same name, creating some anbiguity in the code.

Alternatively, I could have separated the DATETIME data into two new columns (DATE and TIME) but have elected to move forward with the above changes. After inputting the above code, I observed the data types in the BigQuery GUI over using the DESCRIBE function due to ease of access. See below:

With this done, I can move onto data cleaning.

4.6: Cleaning the Data For Analysis

Before I can export the data for analysis, I need to ensure it is clean of errors. I’ve already transformed it to ensure proper formatting. The main source of errors stems from duplicate or null values. In a normal distribution, I would also attempt to control for outliers. Given the small data set, I will be using all distinct values to lend maximum power to analysis. First, I will start with duplicate values.

In BigQuery and other SQL platforms, I can filter out duplicate values with the SELECT DISTINCT function. This will select only the unique data points and exclude any duplicate values.

First, I will use the function SELECT COUNT(*) to determine the total count of data points in each data set.

Hide Code

SELECT COUNT: 940
SELECT DISTINCT: 940

Hide Code

SELECT COUNT: 413
SELECT DISTINCT: 410

Hide Code

SELECT COUNT: 22099
SELECT DISTINCT: 22099

Hide Code

SELECT COUNT: 22099
SELECT DISTINCT: 22099

From here, we can see everything except the sleep_day has no duplicate values. I also assessed for blank (null) values using the COUNTIF and IS NULL` commands in BigQuery. None returned any null values. I’ve included the code snippet from the daily_activity data set below:

SELECT
COUNTIF(user_id IS NULL) AS user_id_null_count,
COUNTIF(activity_date IS NULL) AS activity_date_null_count,
COUNTIF(total_steps IS NULL) AS total_steps_null_count,
COUNTIF(total_distance IS NULL) AS total_distance_null_count,
COUNTIF(tracker_distance IS NULL) AS tracker_distance_null_count,
COUNTIF(logged_activities_distance IS NULL) AS logged_activities_distance_null_count,
COUNTIF(very_active_distance IS NULL) AS very_active_distance_null_count,
COUNTIF(moderately_active_distance IS NULL) AS moderately_active_distance_null_count,
COUNTIF(light_active_distance IS NULL) AS light_active_distance_null_count,
COUNTIF(sedentary_active_distance IS NULL) AS sedentary_active_distance_null_count,
COUNTIF(very_active_minutes IS NULL) AS very_active_minutes_null_count,
COUNTIF(fairly_active_minutes IS NULL) AS fairly_active_minutes_null_count,
COUNTIF(lightly_active_minutes IS NULL) AS lightly_active_minutes_null_count,
COUNTIF(sedentary_minutes IS NULL) AS sedentary_minutes_null_count,
COUNTIF(Calories IS NULL) AS calories_null_count
FROM
daily_activity;

Hide Code

Now I can create the final tables, based on the above cleaning results:

Hide Data

Now that all four tables are finalized, I can export them as CSVs for data visualization and analysis. Exporting CSVs from BigQuery is not the most straight forward process. I complete the following sequence of events to download my CSV files for analysis:

I navgiated to the Google Cloud Console and created a bucket via the Cloud Storage -> Buckets -> Create pathway before I could export any data sets to the selected bucket. For the purpose of this project, I am operating under a free license to utilize Google Cloud services, but creating a bucket is a nominal cost for long-term data usage on the personal level.
I followed the guide listed here. I selected each of my four data sets in the Explorer Pane and exported them to the bucket.
I navigated back to the GCC, selected my bucket, title endler_data_analytics_capstone and manually downloaded each of the files to my desktop.
I confirmed all files opened and looked as expected in CSV format, ready for import for analysis.

Step 5: Analyze

Now that all of my data is cleaned and organized, I can begin the analysis section. Before importing any files into R, I made sure to load the tidyverse library:

library(tidyverse)
── Attaching core tidyverse packages ───────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4
✔ readr 2.1.5
✔ forcats 1.0.1
✔ stringr 1.5.2
✔ ggplot2 4.0.0
✔ tibble 3.3.0
✔ lubridate 1.9.4
✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────── ───────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors

I imported all four files into R using the GUI following the pathway in the files panel in RStudio. This is convenient and allowed for quick access to all four data frames.

For posterity, had I needed to import via code I would’ve first changed the working directory. I checked my working directory using the code getwd() and then once I confirmed it wasn’t where I saved my files, I changed it to via the code setwd(). I repeated the getwd() command to ensure the working directory changed, as intended. From there, I could enter the code below:

daily_activity_final <- read_csv("daily_activity_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d")));

sleep_day_final <- read_csv("sleep_day_final", col_types = cols(user_id = col_character(), sleep_date = col_date(format = "%Y-%m-%d"), sleep_time = col_time(format = "%X")));

hourly_steps_final <- read_csv("hourly_steps_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));

hourly_intensities_final <- read_csv("hourly_intensities_final", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));

5.1: User Types

I created a new data frame categorizing the users based on the frequency they utilized the Fitbit app for data collection. This would be useful for providing a user overview and for exporting the CSV for use in data visualization in programs like Tableau or Flourish Studio.

user_types <- daily_activity_final %>% group_by(user_id) %>% summarise(days_used=sum(n())) %>% mutate(user_type = case_when( days_used >= 1 & days_used <= 10 ~ "Light User", days_used >= 11 & days_used <= 20 ~ "Moderate User", days_used >= 21 & days_used <= 29 ~ "Heavy User", days_used >= 30 ~ "Everyday User"))

This created four user types based on the above 4 variables. From a quick glance at the data, I can see there is 1 light user, 3 moderate uers, 5 heavy users, and 22 everyday users. I also exported the CSV for use in other programs.

write.csv(user_types, "user_types.csv", row.names=FALSE).

I then confirmed the file was in the selected directory.

5.2: Data Summaries

I have some past experience working as a scientist and conducting statistical analysis to confirm or deny a null hypothesis through soemthing like a T or Z test. However, for situations like this, summarizing the data is more useful to provide an overall assessment of the data. To achieve this, I used the summary() command.

Daily Activity Summary

Daily Sleep Summary

Hourly Intensities Summary

Hourly Steps Summary

We can observed some notable results right off the bad. Both the median (7,406) and mean (7,638) total daily steps were relatively low. The United States Center for Disease Control and Prevention previously published a study associating adults taking 8,000 steps per day with 51% lower mortality from all causes than an individual taking 4,000 steps. This effect with more pronounced as steps increase, with 12,000 steps showing a 65% decrease. A recent study published at The Lancet showed meaningful health benefits (47% reduction) with 7,000 steps versys 2,000.
Other studies even suggest health benefits based on frequency of movement, rather than a hard cut off value. This segways into the next point. The European Commission cites sedentary behavior as Sedentary behaviour refers to any waking behaviour characterised by energy expenditure ≤ 1.5 METs, while in in a sitting, reclining or lying position, where MET refers to metabolic equivalents. One MET is the energy expended while sitting at rest. This definition has been corroborated in studies as well as authoritative agencies such as the United States CDC and the World Health Organization. The average sedentary time is 991.2 minutes, more than 2/3rd the entire day. Even the most active quartile (1st quartile) exhibited sedentary behavior for 729.8 minutes, or just over half the day. Fitbit is meant to reveal this data, but generally users need to make the effort to change their habits on their own.
Activity of all levels varied. Users were lightly active for 199.0 minutes, fairly active for 6.00 minutes, and very active for 4.00 minutes on average, showing empirically nearly all active minutes were lightly active. The above referenced studies indicate any type of activity above sedentary is beneifical, and that increased intensity is not correlative with reductions in mortality at higher statistical rates. As such, just increasing activity from sedentary to light would likely cause a great boon to the users’ health.
Articles from the CDC indicate the average adult needs 7 or more hours per sleep a night. The National Institute of Health. Meanwhile, a study by the WHO suggests individuals with intermediate sleep durations (6-9 hours) exhibit statistically higher cognitive scores than those outside the range. The average sleep duration of these users is 6.99 hours and the median is 7.21 hours, with 50% of the users falling between 6.02 and 8.17 hours. The users in this data set generally fell near or within recommended ranges based on studies. There was also a notable difference between time in bed and time asleep. Users were in bed for 7.64 hours on average, indicating the slept for about 91.5% of their time in bed.

Step 5: Share

I created visualizations using R (using the ggplot() function), and Tableau, which can be viewed here. All diagrams from both R and Tableau are listed below.

6.1: Number of Users with Recorded Data

Observation: The most habitual users used their Fitbit and recorded data daily. There is also a descention in overall use, with most uses being everyday, while light users (less than 11 days a month) make up the smaller cohort. Given the ease of access for Fitbit data collection and the long battery life, it makes sense most users would continue to use them. Reasons others may have not used it as frequently possibly stemmed from other factors such as not charging the device, disomfort with the Fitbit band, or desiring some disconnect from technology.

6.2: Steps

Observation: Users spent the most axtive part of their day between the hours of 8 AM and 8 PM. This is reiterated in descriptions throughout this presentation. This is likely due to the daylight hours behind times of activity such as work, school, errands, or leisure, and the nighttime hours reserved for sleeping.

Observation: Users moved most during the middle of the day, between the 8th hour (8 AM) and 20th hour (8 PM). This makes sense, given users are likely away from home, at work, running errands, or conducting some sort of liesurely activity. Meanwhile, step trends throughout the week vary, with generally more activity during the end of the week and weekend (Friday and Saturday). Many users are likely engaged in enjoyable activities during the weekend that required more movement. Their lessened movement during the week may be assosciated with sedentary jobs.

6.3: Overall Activity Trends

Observation: This diagram shows intense extercise increases steadily and peaks in the late afternoon/early evening. This correlates heavily with sleep and sedentary schedules, as expected. Users are most active during the day, roughly between 8 AM and 8 PM.

Observation: This diagram, like previously discussed, reiterates that users spent the brunt of their days in light or sedentary activity, by minutes.

ggplot(data=daily_activity_final, aes(x=total_steps, y=Calories)) + geom_point(aes(color=Calories))+geom_smooth(method = "gam", formula = y ~ s(x, bs="cs"), color="lemonchiffon") + labs(title="Total Steps vs. Calories Burnt", x="Total Steps", y="Calories Burnt", color="Calories") + theme(plot.title = element_text(hjust=0.5))

Hide Code

Observation: As expected, in general, the more steps users took, the more calories they burnt. A smooth trend line in the diagram supports this. There is of course the caveat that the intensity of walking affects the calories burnt (e.g. sprints will burn more calories than a jog, which will burn more calories than a brisk walk, which will burn more calories that a slow walk, etc.), but in general the effects of calories burned are easily seen in this chart.

Observation: This diagram illustrates each of the four intensity divisions and the total distance traveled in each. As discussed above, increased steps will generally result in more calories burnt. More calories can be burnt by higher intensity exercise. It makes sense that the distance traveled at higher intensity is higher, since it is likely associated with some cardiovascular activity, such as running or brisk walking. Both would yield significantly higher distances than regular walking.

6.4: Sleep

ggplot(data=sleep_day_final, aes(x=total_time_in_bed, y=total_minutes_asleep)) + geom_point(aes(color = total_minutes_asleep)) + scale_color_gradient(low = "lightblue", high = "darkgreen")+ geom_smooth(method = lm, color="yellow") + labs(title="Total Time Asleep vs. Total Time in Bed", x="Total Minutes in Bed", y="Total Minutes Asleep", color="Total Minutes Asleep")+theme(plot.title = element_text(hjust=0.5))

Hide Code

Observation: There is very clear positive correlation between the X (Total Minutes in Bed) and Y (Total Minutes Asleep). As mentioned above, the average time asleep is 6.99 hours, while the average time in bed is 7.64 hours. This means users generally slept for 91% of their time in bed. However, users typically took longer to fall asleep than average time for most adults. Deficiencies in sleep latency occur for a variety of reasons, including distraction, anxiety, hunger, hormonal disruption, light isues, etc. Whatveer the reason, users seem to have some issue falling asleep and it may be worth determing the root cause and addressing it for better sleep.

Observation: There isn’t a strong relationship between intensity and length of sleep among this cohort. In general, users of all sleep quantities participated in various levels of intense exercise. While some studies could suggest intense activity can lead to longer, deep sleep, others have shown the opposite.

7 Step 6: Act

Bellabeat’s mission is focusing on creating innovative health and wellness products for women, in order to empower women to take control of their health by providing them with technology-driven solutions that blend design and function. Their products focus on sleep, activity, stress, and women’s reproductive health.

The provided dataset is small; a larger data set is necessary to carry out a more in-depth analysis. Ideally, a wide variety of data should be collected to provided data useful to Bellabeat’s mission. On top of the data provided from these Fitbit data sets, other useful information would included demographics (age, gender, occupation, location), nutrition information (especially hydration, given Bellabeat’s product-line), any training regiment, etc. Demographic data is useful to control for bias. Furthermore, this data is out-dated. Newer data sets would be beneficial to ensure business solutions are drafted based on the most accurate analysis.

7.1: Key Findings

In summary, these are the key findings: - Users had various levels of interaction with the Fitbit applicaiton. Not all utilized sleep tracking; even fewer utilized the weight tracking funciton. - The weight log function is by far the least popular function of those included in this data set. - Users were most active Saturday and Tuesday, with slightly less activity on Friday and Sunday, based on movement. - Users spent by far the most time in sedentary (</= 1.5 MET) activity. - Users were mostly active between 8 AM and 8 PM, with localized peaks in the evening and early afternoon. - Active users burnt more calories. - Of the limited cohort utilizing the sleep tracking, users seemed to have roughly enough sleep on average, per cited studies. However, users tended to have some difficulty falling asleep in under the 20 minutes cited as typical.

7.2: Recommendations

Based on this analysis, I woulld recommend the following: Stronger Sleep Integration - Many smart products today introduce a variety of applications for sleep assistance, including blue light filtering, automatic alarms, wind-down time, wind-up alarms, etc. Bellabeat can integrate functionalities such as sleep tracking seen in the Apple Watch to provide more relevent data to users, including REM data, sleep stage breakdown, bed-time notifications, and periodic (weekly, biweekly, monthly) sleep trends. - Bellabeat should campaign on the above features as a core metric for tracking overall health. Healthy sleep is beneficial to all apsects of health, from mental acuity, to fitness recovery, to digestion consistency, to hormonal regulation. Since many Fitbit users get barely adequate or inadequate sleep, it is likely that Bellabeat users have similar shortcomings. A strong informational campaign could help combat this.

Activity and Exercise Integration - Many Fitbit users exhibited mostly sedentary activity. There is no data about their use of activity (cardiovascular, weight training, etc.). Including functions for logging this information may be beneficial to Bellabeat users to promote overall activity by tracking progress.

Nutrition Integration - Bellabeat promotes products for tracking hydration and promotes themselves as a pioneer of women’s health. Including nutrition logging would be a boon to promoting proper hydration and ensuring better horomone regulation through macro and micronutrient monitoring.

Weight Tracking Analysis - The cohort for this data set barely utilized the weight tracking funciton. It would benefit Bellabeat to collect data collection (perhaps through surveys) to find out why it’s sparingly used. Weight tracking would sync well with other metrics Bellabeat finds beneficial for their users.

Step and Activity Notifications - Most smart devices offer some way to notify the user to hit the minimum step or activity goal, such as 250 steps/hour on Fitbit or activity every hour on Apple Watch. Intefgrating these features into the Bellabeat ecosystem would likely promote more activity for the users. - Promoting more activity in any form will reduce sedentary behavior, which according to many studies by the CDC, WHO, and indepdent researchers, are som eof the best eays to reduce overall risk of mortality.

Online Advertisements - Bellabeat’s overall goal is not just to promote their products for tracking health, but promote themselves as a bastion of women’s health. They should engage in product advertisements that describes the Bellabeat app not just as a fitness app, but as an app for promoting a healthy lifestyle and empowering women to strike the best balance between their work, personal, and fitness lives.

Google Data Analytics Capstone Project: Bellabeat Case Study

Matthew Endler

2025-11-24

R Markdown

1: Introduction

1.1: About the Company

1.2: Characters and Products

1.2.1: Characters

1.2.2: Products

2: Step 1: Ask

2.1: Business Task

2.2: Key Stakeholders

2.3: Guiding Questions for Analysis

3: Step 2: Prepare

3.1: Data Source Used

3.2: Data Set Information

3.3: Data Credibility and Limitations

4: Step 3: Process

4.1: Data Set Files Used for Analysis

4.2: Tools Used to Process Data

4.3: Data Upload and Familiarization in BigQuery

4.4: Using `COUNT` to Determine the Number of Unique Users in the Data Sets

4.5: Data Transformation

4.6: Cleaning the Data For Analysis

Step 5: Analyze

5.1: User Types

5.2: Data Summaries

7 Step 6: Act

7.1: Key Findings

7.2: Recommendations

Google Data Analytics Capstone Project: Bellabeat Case Study

Matthew Endler

2025-11-24

R Markdown

1: Introduction

1.1: About the Company

1.2: Characters and Products

1.2.1: Characters

1.2.2: Products

2: Step 1: Ask

2.1: Business Task

2.2: Key Stakeholders

2.3: Guiding Questions for Analysis

3: Step 2: Prepare

3.1: Data Source Used

3.2: Data Set Information

3.3: Data Credibility and Limitations

4: Step 3: Process

4.1: Data Set Files Used for Analysis

4.2: Tools Used to Process Data

4.3: Data Upload and Familiarization in BigQuery

4.4: Using COUNT to Determine the Number of Unique Users in the Data Sets

4.5: Data Transformation

4.6: Cleaning the Data For Analysis

Step 5: Analyze

5.1: User Types

5.2: Data Summaries

Step 5: Share

6.1: Number of Users with Recorded Data

6.2: Steps

6.3: Overall Activity Trends

6.4: Sleep

7 Step 6: Act

7.1: Key Findings

7.2: Recommendations

4.4: Using `COUNT` to Determine the Number of Unique Users in the Data Sets