Bellabeat is a high-tech company that specializes in health-focused smart products designed specifically for women. Founded in 2013 by Urška Sršen and Sando Mur, the company combines technology with elegant design to give women insights into their health, fitness, and well-being.
With a diverse product line that includes wearable wellness trackers, smart water bottles, and a mobile app, Bellabeat collects and analyzes data on activity, sleep, stress, hydration, and reproductive health. By leveraging this data, the company aims to enhance user experiences and promote healthier lifestyles.
As a junior data analyst at Bellabeat, you are part of the marketing analytics team responsible for using data to guide business decisions. The company wants to expand its presence in the global smart device market by leveraging data-driven insights.
Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, believes that analyzing smart device fitness data can help identify trends and new growth opportunities. You have been assigned to focus on one of Bellabeat’s products and analyze smart device usage data to gain insights into consumer behavior.
Your report and recommendations will help Bellabeat refine its marketing approach and strengthen its position in the smart wellness market.
In this phase, the goal is to define the business problem and establish clear objectives for the data analysis.
Bellabeat wants to leverage data-driven insights to refine its marketing strategy and expand its presence in the smart wellness market. By analyzing smart device usage data, the company aims to understand consumer habits and apply these insights to enhance its product positioning and marketing efforts.
By defining the problem and identifying the right questions, this phase ensures that the analysis remains focused on providing actionable insights for Bellabeat’s business growth.
In this phase, the focus is on gathering, assessing, and preparing the data for analysis while ensuring its credibility, security, and usability.
Credibility of Data:
By thoroughly assessing and preparing the dataset while considering its credibility, privacy, and accessibility, we ensure that the data is reliable and suitable for generating meaningful insights in the next phase.
In this phase, the dataset is cleaned, transformed, and prepared using R to ensure accuracy and usability for analysis.
Setting up environment
This section loads the R packages needed for data processing, cleaning, and analysis.
#load packages
library(readr)
library(janitor)
library(dplyr)
library(tidyverse)
library(lubridate)
Import Data into R
This step imports the datasets using the read_csv() function from the readr package. Each dataset contains specific health-related data collected from Fitbit users.
#Import Data
activity <- read_csv("Data/dailyActivity_merged.csv")
sleep <- read_csv("Data/sleepDay_merged.csv")
weight <- read_csv("Data/weightLogInfo_merged.csv")
Data Validation
This step checks that the imported datasets have the expected structure by listing their column names.
#Data Validation
colnames(activity)
colnames(sleep)
colnames(weight)
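Printing column names relies on reading them by eye. A minimal sketch of a stricter check follows, assuming that the columns used later as join keys (Id plus a date column) are the ones that must be present. The helper check_columns() and the toy data frame are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: stop with an informative error if a required column is missing.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for the imported activity table (made-up values).
toy_activity <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(1000, 2000)
)

# Passes silently because both required columns exist.
check_columns(toy_activity, c("Id", "ActivityDate"))
```

The same call could be repeated for each imported table with its own list of required columns.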
Preview the Data
This step displays the first few rows of each dataset to understand its
structure and contents.
#Preview Data
head(activity)
head(sleep)
head(weight)
Data Type Conversion, Column Renaming
This step ensures consistency by converting ID columns to character
type, formatting date columns, and renaming date-related columns for
clarity.
# Convert Id (and LogId) to character, parse the date columns, and rename them to Day.
# Assumes the exported dates use four-digit years (e.g. 4/12/2016), hence %Y;
# %y would read only the first two year digits.
activity <- activity %>%
  mutate(Id = as.character(Id),
         ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y")) %>%
  rename(Day = ActivityDate)
sleep <- sleep %>%
  mutate(Id = as.character(Id),
         SleepDay = as.Date(SleepDay, format = "%m/%d/%Y")) %>%
  rename(Day = SleepDay)
weight <- weight %>%
  mutate(across(c(Id, LogId), as.character),
         Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  rename(Day = Date)
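The year specifier in the date format deserves a quick check. The sketch below assumes date strings with four-digit years in the style 4/12/2016 (the sample string is illustrative). It shows that %Y reads the full year, while %y reads only two digits and silently produces a different date.

```r
# Illustrative date string with a four-digit year.
x <- "4/12/2016"

# %Y consumes all four year digits.
parsed_full <- as.Date(x, format = "%m/%d/%Y")

# %y consumes only "20" and ignores the trailing "16", so the year becomes 2020.
parsed_short <- as.Date(x, format = "%m/%d/%y")

format(parsed_full)   # "2016-04-12"
format(parsed_short)  # "2020-04-12"

# The two dates fall on different days of the week, so any weekday
# column derived from a mis-parsed date would be wrong as well.
weekdays(parsed_full) == weekdays(parsed_short)
```

Because no error or warning is raised in the second case, comparing the parsed date range against the known collection period is a cheap safeguard.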
Merging Datasets and Adding Day of the Week
This step combines all three datasets (sleep, activity, and weight) into a single data frame while ensuring that no data is lost. It also adds a new column for the weekday name, making it easier to analyze trends based on days of the week.
# Combine data frames & add day of the week
combined_data <- sleep %>%
  full_join(activity, by = c("Id", "Day")) %>%
  full_join(weight, by = c("Id", "Day")) %>%
  # Day is already a Date at this point, so no format string is needed
  mutate(Weekday = weekdays(Day))
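To illustrate why a full join keeps all records, the following sketch joins two toy tables (made-up values, not taken from the dataset). Rows that appear in only one table are retained, with NA filling the columns from the other table.

```r
library(dplyr)

# Toy tables with made-up values: Id 2 has sleep data only, Id 3 has activity data only.
toy_sleep <- tibble(
  Id = c("1", "2"),
  Day = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(420, 380)
)
toy_activity <- tibble(
  Id = c("1", "3"),
  Day = as.Date(c("2016-04-12", "2016-04-12")),
  TotalSteps = c(9000, 4000)
)

# full_join keeps every Id/Day pair found in either table.
toy_combined <- full_join(toy_sleep, toy_activity, by = c("Id", "Day")) %>%
  mutate(Weekday = weekdays(Day))

nrow(toy_combined)  # 3 rows: Ids 1, 2, and 3

# An inner join, by contrast, keeps only the pair present in both tables.
nrow(inner_join(toy_sleep, toy_activity, by = c("Id", "Day")))  # 1
```

The trade-off is that a full join introduces NA values wherever a user did not log sleep or weight on a given day, which matters for the missing-value counts that follow.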
Removing Duplicates, Counting Missing Values, and Checking
Unique Users
This step supports data quality by removing duplicate rows, counting
missing (NA) values, and checking the number of unique users in each
dataset.
# Find and remove duplicate rows & count 'NA' and distinct Ids
combined_data <-combined_data[!duplicated(combined_data), ]
sum(is.na(combined_data))
n_distinct(combined_data$Id)
n_distinct(sleep$Id)
n_distinct(weight$Id)
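A single total of NA values does not show where the gaps are. A minimal sketch on toy data (made-up values) that counts missing values per column instead, using distinct() as an equivalent way to drop duplicate rows:

```r
library(dplyr)

# Toy table with one exact duplicate row and several missing values.
toy <- tibble(
  Id = c("1", "1", "2", "3"),
  TotalSteps = c(9000, 9000, 4000, NA),
  WeightKg = c(NA, NA, 61.5, NA)
)

# distinct() drops exact duplicate rows, like df[!duplicated(df), ].
toy_unique <- distinct(toy)

# Count missing values per column rather than in total.
na_per_column <- colSums(is.na(toy_unique))
na_per_column

# Number of distinct users remaining.
n_distinct(toy_unique$Id)
```

On a table built with full joins, a per-column count like this makes it easy to see which source table contributes most of the missing values.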
Ordering the Days of the Week
This step ensures that the Weekday column is ordered in a logical sequence (Sunday → Saturday) rather than being treated as unordered text. This is important for proper visualization and analysis.
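The difference is easy to see on a toy vector. The sketch below uses illustrative values and assumes English weekday names, since weekdays() returns names in the session's locale. It compares alphabetical sorting of plain text with sorting of a factor that has explicit levels.

```r
# Weekday names stored as plain text sort alphabetically.
days <- c("Monday", "Sunday", "Friday")
sort(days)  # "Friday" "Monday" "Sunday"

# As a factor with explicit levels, they sort in calendar order.
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
days_factor <- factor(days, levels = week_levels)
as.character(sort(days_factor))  # "Sunday" "Monday" "Friday"
```

Plotting and grouping functions follow the factor's level order, which is why the conversion affects how weekday charts are laid out.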
# Order the days of the week
combined_data$Weekday <- factor(
  combined_data$Weekday,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
)
Generating Summary Statistics
This step provides descriptive statistics for key
variables, helping to understand the overall distribution, central
tendencies, and possible outliers in the dataset.
# Select summary statistics
combined_data %>%
select(TotalMinutesAsleep, TotalSteps, TotalDistance, VeryActiveMinutes,
FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories,
WeightKg, Fat, BMI, IsManualReport) %>%
summary()
Saving the Cleaned Data
This step saves the cleaned and processed dataset as a
CSV file, ensuring that the data is available for further analysis,
visualization, or sharing.
#Save the cleaned data
write.csv(combined_data,file = "Data/combined_data.csv",row.names = FALSE)
This phase ensures that the dataset is accurate and optimized for the next step: Analysis. 🚀
Data visualization helps in identifying patterns, trends, and insights from our dataset in an intuitive and easy-to-understand manner. By visually analyzing the cleaned data, we can uncover key relationships between variables such as activity levels, calories burned, sleep patterns, and weight changes.
The goal of this phase is to:
Visualization 1:
Visualization 2:
This visualization shows the different types of activity minutes (Lightly Active, Fairly Active, and Very Active) across weekdays, split into three separate bar charts, one per activity level.
Lightly Active Minutes (Top Chart) → Highest on Sunday (29,996 minutes), followed by Monday and Tuesday; lowest on Friday.
Fairly Active Minutes (Middle Chart) → Sunday again leads with 2,179 minutes, while Wednesday is lowest with 1,526 minutes.
Very Active Minutes (Bottom Chart) → Sunday has the most intense activity (3,489 minutes), and Friday the least (2,418 minutes).
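Weekday totals of this kind correspond to a grouped sum over the activity columns. A minimal sketch on toy data (made-up values; the column names follow the activity table):

```r
library(dplyr)

# Toy records with made-up values for two weekdays.
toy <- tibble(
  Weekday = factor(c("Sunday", "Sunday", "Friday"),
                   levels = c("Sunday", "Friday")),
  LightlyActiveMinutes = c(200, 250, 180),
  FairlyActiveMinutes = c(15, 20, 10),
  VeryActiveMinutes = c(30, 25, 12)
)

# Sum each activity-minutes column per weekday, ignoring missing values.
weekday_totals <- toy %>%
  group_by(Weekday) %>%
  summarise(across(ends_with("ActiveMinutes"), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")

weekday_totals
```

Totals depend on how many records fall on each weekday, so comparing means per weekday alongside the sums can help separate genuine activity differences from differences in record counts.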
Visualization 3:
This stepped area chart compares Total Distance (annotated at the bottom) with different levels of Total Steps (middle of the chart) across the weekdays.
Visualization 4:
This visualization consists of two bar charts, each representing a different aspect of sleep behavior across the week. It compares Total Minutes Asleep (top chart) with Total Time in Bed (bottom chart) for each weekday, with color intensity representing the sum of calories burned.
Visualization 5:
This scatter plot visualization analyzes the relationship between different levels of activity (Lightly Active, Fairly Active, and Very Active Minutes) and the average calories burned. Each point represents an individual (ID), and the color-coding differentiates them. A trend line is added to observe general patterns.
1️⃣ X-Axis (Activity Levels)
2️⃣ Y-Axis (Avg. Calories Burned)
3️⃣ Data Points (Circles)
4️⃣ Trend Line
Visualization 6:
This line chart tracks the average total steps over a period of time. It shows daily fluctuations in step count and highlights the overall trend.
1️⃣ X-Axis (Time Period)
2️⃣ Y-Axis (Avg. Total Steps)
3️⃣ Line Chart (Daily Step Trend)
The blue line represents the fluctuations in average daily steps.
Noticeable peaks and dips suggest variations in physical activity over time.
4️⃣ Dashed Trend Line
The gray dashed line represents the overall trend of steps over time.
The downward slope indicates a gradual decrease in average steps over the recorded period.
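A trend line of this kind can be approximated by a least-squares fit of average steps against time. The sketch below uses toy data (made-up values) and assumes a simple linear model; the charting tool behind the figure may compute its trend line differently.

```r
# Toy daily averages (made-up values) over five consecutive days.
daily <- data.frame(
  Day = as.Date("2016-04-12") + 0:4,
  AvgSteps = c(8200, 8000, 8100, 7700, 7500)
)

# Fit steps against the day index; the slope is the average change in steps per day.
daily$DayIndex <- as.numeric(daily$Day - min(daily$Day))
fit <- lm(AvgSteps ~ DayIndex, data = daily)
slope <- unname(coef(fit)["DayIndex"])

slope  # -170 for this toy data: a negative slope means a downward trend
```

Calling summary(fit) would also report the standard error of the slope, which helps judge whether an apparent decline is distinguishable from day-to-day noise.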
Visualization 7:
Each point represents an individual (ID), showing their average daily steps and average daily calories burned over a period.
1️⃣ Positive Correlation Between Steps & Calories Burned
2️⃣ Variability in Calorie Burn
🔹 Possible Factors:
3️⃣ Outliers & Unique Cases
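The strength of the relationship between steps and calories can be quantified with a correlation coefficient computed on per-user averages. A minimal sketch on toy data (made-up values, three hypothetical users):

```r
library(dplyr)

# Toy daily records (made-up values) for three users.
toy <- tibble(
  Id = c("1", "1", "2", "2", "3", "3"),
  TotalSteps = c(4000, 6000, 9000, 11000, 14000, 16000),
  Calories = c(1800, 2000, 2300, 2500, 2900, 3100)
)

# One row per user: average daily steps and average daily calories.
per_user <- toy %>%
  group_by(Id) %>%
  summarise(AvgSteps = mean(TotalSteps),
            AvgCalories = mean(Calories),
            .groups = "drop")

# Pearson correlation between the per-user averages.
r <- cor(per_user$AvgSteps, per_user$AvgCalories)
r
```

A value close to 1 indicates a strong positive linear relationship. With a small number of users the estimate is sensitive to outliers, so it is best read alongside the scatter plot.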
By acting on these insights, we can create more effective, data-driven fitness strategies that cater to individual needs and optimize overall health outcomes.
The analysis of activity levels, steps, and calories burned has provided valuable insights into the relationship between movement intensity and energy expenditure. The data suggests that while step count is an important metric, the intensity of activity plays a larger role in calorie burn. Individuals who engage in higher-intensity workouts, reflected in their Very Active Minutes, tend to burn more calories than those with similar step counts but lower intensity levels.
Additionally, the variations in individual performance suggest that a one-size-fits-all approach to fitness tracking may not be ideal. Factors such as metabolic rate, workout type, and lifestyle differences impact the effectiveness of physical activity, highlighting the need for personalized fitness recommendations.
Moving forward, the key takeaway is that simply focusing on increasing daily steps may not be the most effective strategy for optimizing calorie burn. Instead, a more holistic approach that includes high-intensity activities, personalized fitness goals, and improved data tracking will yield better long-term health outcomes.
By leveraging these insights, we can refine fitness strategies to maximize efficiency, improve user engagement, and enhance overall well-being.
End of Case Study