Understand how smart device users interact with fitness trackers and identify trends that align with Bellabeat’s customer base.
Use these trends to propose actionable recommendations for Bellabeat’s marketing strategy.
Urška Sršen, Bellabeat’s co-founder and Chief Creative Officer. Sando Mur, Bellabeat’s co-founder and Mathematician. Bellabeat’s marketing analytics team.
Use Fitbit Fitness Tracker data: A public dataset containing minute-level output for physical activity, sleep monitoring, and heart rate for ~30 users. This dataset has been used in 6 published studies.
Focus on daily and sleep data to identify high-level trends.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
library(ggplot2)
library(tidyr)
Import relevant tables (dailyActivity_merged, dailyCalories_merged, sleepDay_merged), plus weightLogInfo_merged so we can assess whether it is usable.
daily_activity <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_calories <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_log_info <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Clean the data. Identify key patterns in user activity and sleep behaviors. Investigate relationships between activity levels, sleep quality, and device usage. Create visualizations for clear presentation of findings.
First, we noticed that the weightLogInfo data frame was small so we are going to see if it is worth including. We will check how many users it tracks.
n_distinct(weight_log_info$Id)
## [1] 8
Since only 8 users are tracked, we will not use the weightLogInfo data frame.
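Column names in R are case-sensitive, so a count taken on a column whose name differs only in case can silently come back as zero instead of failing. A minimal sketch with a made-up data frame (toy_weights and its values are hypothetical, not the Fitbit data):

```r
# Hypothetical toy data: three distinct users, four log entries.
toy_weights <- data.frame(
  Id = c(101, 101, 102, 103),
  WeightKg = c(60.5, 60.1, 72.0, 55.3)
)

# Correct column name: counts the distinct users.
n_users <- length(unique(toy_weights$Id))

# Wrong case: `$` finds no column called `id` and returns NULL,
# so the "count" is 0 rather than an error.
n_users_wrong_case <- length(unique(toy_weights$id))

print(n_users)
print(n_users_wrong_case)
```

Checking names() on a freshly imported table before referring to its columns is one way to catch this early.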
Next, we are going to change all of the column headings to lower case, since they are currently in mixed case (for example, TotalSteps).
daily_activity <- rename_with(daily_activity, tolower)
daily_calories <- rename_with(daily_calories, tolower)
sleep_day <- rename_with(sleep_day, tolower)
head(daily_activity)
head(daily_calories)
head(sleep_day)
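Next, we will clean up the date information, since sleepday holds a date and a time in one string whereas the other tables hold a date only. We split the string on the first space and keep the date part. The warning below is expected: the time portion contains a further space, so the extra piece is discarded.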
sleep_day <- sleep_day %>%
separate(sleepday, c("date", "time"), sep= " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(sleep_day)
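An alternative to splitting strings is to convert both kinds of column to real Date values, so that later joins compare dates rather than text. The sketch below uses base R and assumes the raw strings look like the hypothetical samples shown (month/day/year, optionally followed by a time); the actual file format should be confirmed before relying on it.

```r
# Hypothetical sample strings in the assumed shape of the sleepDay column.
raw_sleepday <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# as.Date() with an explicit format parses the leading month/day/year
# and ignores any trailing characters such as the time of day.
sleep_date <- as.Date(raw_sleepday, format = "%m/%d/%Y")

# Date-only strings, as assumed for the activity tables, use the same format.
activity_date <- as.Date(c("4/12/2016", "4/13/2016"), format = "%m/%d/%Y")

# Both vectors are now Date objects and compare equal.
print(sleep_date)
print(identical(sleep_date, activity_date))
```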
Next we will rename each of the date column headings in the daily_activity and daily_calories data frames so they all match.
daily_activity <- daily_activity %>%
rename(date=activitydate)
daily_calories <- daily_calories %>%
rename(date=activityday)
head(daily_activity)
head(daily_calories)
Next, we check for Not Available (NA) values to avoid any discrepancies in the data sets.
colSums(is.na(daily_activity))
## id date totalsteps
## 0 0 0
## totaldistance trackerdistance loggedactivitiesdistance
## 0 0 0
## veryactivedistance moderatelyactivedistance lightactivedistance
## 0 0 0
## sedentaryactivedistance veryactiveminutes fairlyactiveminutes
## 0 0 0
## lightlyactiveminutes sedentaryminutes calories
## 0 0 0
colSums(is.na(daily_calories))
## id date calories
## 0 0 0
colSums(is.na(sleep_day))
## id date time totalsleeprecords
## 0 0 0 0
## totalminutesasleep totaltimeinbed
## 0 0
There were zero NA values in the dataset.
is.null(daily_activity)
## [1] FALSE
is.null(daily_calories)
## [1] FALSE
is.null(sleep_day)
## [1] FALSE
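None of the three data frames is NULL. Note that is.null() only tests whether the object itself is NULL; missing cells inside a data frame are NA in R and were covered by the NA check above.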
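The difference between an object-level check and a cell-level check can be seen on a small example. This is only an illustration; the toy data frame and its values are hypothetical.

```r
# Hypothetical toy data frame with one missing value.
toy <- data.frame(id = c(1, 2, 3), steps = c(5000, NA, 7200))

# is.null() asks whether the object itself is NULL,
# not whether any of its cells are missing.
object_is_null <- is.null(toy)

# anyNA() and colSums(is.na()) look inside the data frame for missing cells.
has_missing <- anyNA(toy)
missing_per_column <- colSums(is.na(toy))

print(object_is_null)
print(has_missing)
print(missing_per_column)
```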
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_calories))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3
There are 3 duplicated rows in the sleep_day data frame. Using the duplicated() function, we located them (rows 162, 224, and 381) and verified that they could be deleted, so we will remove them.
sleep_day <- distinct(sleep_day)
sum(duplicated(sleep_day))
## [1] 0
Mission accomplished! No more duplicated rows in our sleep_day data frame.
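For readers who want to inspect duplicates before dropping them, a base R sketch along these lines may help. The toy_sleep data frame is made up for illustration, and unique() is used here as a base R stand-in for distinct().

```r
# Hypothetical toy sleep table in which the first row appears twice.
toy_sleep <- data.frame(
  id = c(1, 1, 2, 2),
  date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  totalminutesasleep = c(327, 327, 412, 380)
)

# duplicated() flags only the second and later copies of a row.
dup_later <- duplicated(toy_sleep)

# Adding fromLast = TRUE also flags the first copy,
# so every member of a duplicate group can be inspected.
dup_all <- dup_later | duplicated(toy_sleep, fromLast = TRUE)
print(toy_sleep[dup_all, ])

# unique() on a data frame keeps the first copy of each row.
toy_sleep_clean <- unique(toy_sleep)
print(nrow(toy_sleep_clean))
```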
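Now that all of our data is cleaned and ready to go, we will merge these datasets on id and date. Because daily_activity and daily_calories both contain a calories column, the merged data frame holds two copies, named calories.x and calories.y.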
merged_data <- daily_activity %>%
left_join(daily_calories, by= c("id", "date")) %>%
left_join(sleep_day, by= c("id", "date"))
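The join leaves two calorie columns because both tables carry one, which is why calories.x is used later on. A small sketch of that behavior, and of one possible way to avoid it by adding calories to the join key, is shown below. The toy tables and their values are hypothetical, and the three-key join is an assumption about what would suit this data, not the approach used in this report.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy tables that share id, date, and calories columns.
toy_activity <- tibble(
  id = c(1, 1, 2),
  date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  totalsteps = c(13162, 10735, 4200),
  calories = c(1985, 1797, 2100)
)
toy_calories <- select(toy_activity, id, date, calories)
toy_sleep <- tibble(id = 1, date = "4/12/2016", totalminutesasleep = 327)

# Joining on id and date only produces calories.x and calories.y.
joined_two_keys <- left_join(toy_activity, toy_calories, by = c("id", "date"))

# Adding calories to the key keeps a single calories column.
joined_three_keys <- left_join(toy_activity, toy_calories,
                               by = c("id", "date", "calories"))

# A left join keeps every activity row; days without sleep data get NA.
merged_toy <- left_join(joined_three_keys, toy_sleep, by = c("id", "date"))

print(names(joined_two_keys))
print(merged_toy)
```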
Metrics to Explore:
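Since demographic variables are not available in the dataset, users will be classified based on their average daily steps. The activity levels are categorized as follows:
- Sedentary: fewer than 5,000 steps per day
- Lightly active: at least 5,000 but fewer than 7,500 steps per day
- Fairly active: at least 7,500 but fewer than 10,000 steps per day
- Very active: 10,000 or more steps per day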
These classifications will allow us to analyze trends and behaviors based on user activity levels and identify meaningful insights about their lifestyles.
Since our data has daily figures, we will create a data frame that groups them by id and averages steps, calories, and sleep time for each user. Days without a sleep record are NA after the join, so the sleep average skips them with na.rm = TRUE.
daily_average <- merged_data %>%
group_by(id) %>%
summarise(mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories.x), mean_daily_sleep = mean(totalminutesasleep, na.rm = TRUE))
head(daily_average)
Now we will classify users by their average daily steps.
activity_level <- daily_average %>%
mutate(user_type = case_when(mean_daily_steps < 5000 ~ "sedentary", mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active", mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active", mean_daily_steps >= 10000 ~ "very active"))
head(activity_level)
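The same step thresholds can also be expressed with base R's cut(), which some readers may find easier to audit at the boundaries. This is an alternative sketch, not the method used above, and the step counts are hypothetical.

```r
# Hypothetical average daily step counts for five users,
# chosen to sit on and around the category boundaries.
mean_daily_steps <- c(3200, 5000, 7499, 7500, 12050)

# right = FALSE makes each interval closed on the left, e.g. [5000, 7500),
# which matches the >= and < comparisons in the case_when() call.
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE
)

print(data.frame(mean_daily_steps, user_type))
```

A side effect of cut() is that the result is already a factor with levels in threshold order.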
Now we want to set up for our first visual. To do this we will use our new column to create a data frame that breaks down activity levels as a percentage of the entire data set.
activity_level_percent <- activity_level %>%
group_by(user_type) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(user_type) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
# ensure the user_type factor levels are ordered correctly
activity_level_percent$user_type <- factor(activity_level_percent$user_type, levels = c("very active", "fairly active", "lightly active", "sedentary"))
head(activity_level_percent)
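The share of users in each category can be cross-checked with a shorter base R calculation. The sketch below uses made-up labels for eight users; the variable names are illustrative only.

```r
# Hypothetical user_type labels for eight users.
user_type <- c("sedentary", "lightly active", "fairly active", "fairly active",
               "very active", "lightly active", "sedentary", "fairly active")

# table() counts users per level; prop.table() divides each count by the total.
share <- prop.table(table(user_type))

activity_share <- data.frame(
  user_type = names(share),
  total_percent = as.numeric(share)
)

# Format the shares as percentage labels with at most one decimal place.
activity_share$labels <- paste0(round(100 * activity_share$total_percent, 1), "%")

print(activity_share)
```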
Visualize activity trends across the user base.
ggplot(activity_level_percent, aes(x = user_type, y = total_percent, fill = user_type)) +
geom_col() +
labs(
title = "Percentage of Users by Activity Level",
x = "Activity Level",
y = "Percentage"
) +
scale_y_continuous(labels = scales::percent) +
scale_x_discrete(labels = NULL) +
scale_fill_manual(
values = c(
"very active" = "#66CC66",
"fairly active" = "#99CC99",
"lightly active" = "#FF9999",
"sedentary" = "#CC6666"
)
) +
theme_minimal()
The data reveals the following distribution of activity levels among users:
These percentages indicate a fairly balanced distribution of activity levels, with a slight majority engaging in light to moderate activity.
Plot calories burned against average daily steps and show correlation.
# Perform a correlation test
cor_test <- cor.test(activity_level$mean_daily_steps, activity_level$mean_daily_calories)
# Extract the p-value
p_value <- cor_test$p.value
# Create the scatter plot with the p-value
ggplot(activity_level, aes(x = mean_daily_steps, y = mean_daily_calories)) +
geom_point(color = "#0072B2", alpha = 0.7, size = 3) + # Scatter points
geom_smooth(method = "lm", color = "red", linetype = "dashed", se = FALSE) + # Trend line
labs(
title = "Calories Burned vs. Average Daily Steps",
x = "Average Daily Steps",
y = "Average Daily Calories",
caption = "Note: The p-value indicates that the relationship is statistically significant."
) +
annotate(
"text",
x = max(activity_level$mean_daily_steps) * 0.9, # Adjust x position
y = max(activity_level$mean_daily_calories) * 0.6, # Adjust y position
label = paste0("p = ", signif(p_value, digits = 3)), # Format p-value
size = 5,
color = "darkred"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
While we expected this, the plot shows a clear positive relationship: users who average more daily steps also burn more calories on average. We can later use this in our marketing efforts.
First, we will visualize the relationship between sleep and steps.
# Filter out rows with non-finite or missing values
filtered_data <- merged_data %>%
filter(!is.na(totalsteps) & !is.na(totalminutesasleep) &
is.finite(totalsteps) & is.finite(totalminutesasleep))
# Create scatterplot
ggplot(filtered_data, aes(x = totalsteps, y = totalminutesasleep)) +
geom_point(color = "#0072B2", alpha = 0.6, size = 2) + # Scatterplot points
geom_smooth(method = "loess", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) + # Smoothed trend line
labs(
title = "Relationship Between Steps and Sleep Duration",
x = "Total Steps",
y = "Total Minutes Asleep"
) +
theme_minimal()
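As a side note on the filter used before this plot, is.finite() already returns FALSE for missing values, so the separate !is.na() checks are harmless but not strictly required. A quick base R check on a made-up vector:

```r
# Hypothetical vector mixing ordinary, missing, and non-finite values.
x <- c(8500, NA, NaN, Inf, -Inf, 0)

# is.finite() is FALSE for NA, NaN, Inf, and -Inf.
finite_only <- is.finite(x)

# Pairing it with !is.na() selects exactly the same elements.
both_checks <- !is.na(x) & is.finite(x)

print(finite_only)
print(identical(finite_only, both_checks))
```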
The scatterplot does not show a strong relationship, so we will quantify it.
We will run a correlation test to obtain the correlation coefficient (r) and the p-value.
# Correlation test
cor_test <- cor.test(filtered_data$totalsteps, filtered_data$totalminutesasleep)
# Display the correlation coefficient (r) and p-value
cor_test
##
## Pearson's product-moment correlation
##
## data: filtered_data$totalsteps and filtered_data$totalminutesasleep
## t = -3.9164, df = 408, p-value = 0.0001054
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.28199288 -0.09525253
## sample estimates:
## cor
## -0.1903439
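The t statistic and p-value in this output follow directly from r and the degrees of freedom, using the standard formula t = r * sqrt(df / (1 - r^2)). The sketch below reproduces them from the printed values as a sanity check; small differences in the last digits can arise because r is rounded in the printout.

```r
# Values copied from the cor.test() output for steps vs. minutes asleep.
r <- -0.1903439
df <- 408  # degrees of freedom = number of paired observations - 2

# Test statistic for a Pearson correlation.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance in sleep minutes associated with a linear fit on steps.
r_squared <- r^2

print(round(t_stat, 4))
print(signif(p_value, 4))
print(round(r_squared, 3))
```

The small r squared is a reminder that a very low p-value here reflects the number of observations more than the strength of the relationship.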
ggplot(filtered_data, aes(x = totalsteps, y = totalminutesasleep)) +
geom_point(color = "#0072B2", alpha = 0.6, size = 2) + # Scatterplot points
geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) + # Linear trend line
labs(
title = "Do More Steps Lead to Better Sleep?",
x = "Total Steps",
y = "Total Minutes Asleep",
caption = paste("Correlation Coefficient (r):", round(cor_test$estimate, 3),
"\nP-value:", signif(cor_test$p.value, 3))
) +
theme_minimal()
## Findings: Steps and Sleep
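Steps and sleep duration show a weak but statistically significant negative correlation (r = -0.19, p = 0.0001): in this sample, days with more steps do not come with more sleep. An interesting opportunity for Bellabeat would be to enhance its devices with a daily user feedback prompt. This feature could collect data on how activity levels affect users’ perceived well-being.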
Now we are going to run the same analysis comparing how many minutes users were “very active” versus daily sleep to see if there is a trend.
# Filter out rows with non-finite or missing values
filtered_data_activity <- merged_data %>%
filter(!is.na(veryactiveminutes) & !is.na(totalminutesasleep) &
is.finite(veryactiveminutes) & is.finite(totalminutesasleep))
# Correlation test
cor_test_activity <- cor.test(filtered_data_activity$veryactiveminutes, filtered_data_activity$totalminutesasleep)
# Display the correlation coefficient (r) and p-value
cor_test_activity
##
## Pearson's product-moment correlation
##
## data: filtered_data_activity$veryactiveminutes and filtered_data_activity$totalminutesasleep
## t = -1.787, df = 408, p-value = 0.07468
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.183408523 0.008795793
## sample estimates:
## cor
## -0.08812658
ggplot(filtered_data_activity, aes(x = veryactiveminutes, y = totalminutesasleep)) +
geom_point(color = "#0072B2", alpha = 0.6, size = 2) + # Scatterplot points
geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) + # Linear trend line
labs(
title = "Do More Very Active Minutes Lead to Better Sleep?",
x = "Very Active Minutes",
y = "Total Minutes Asleep",
caption = paste("Correlation Coefficient (r):", round(cor_test_activity$estimate, 3),
"\nP-value:", signif(cor_test_activity$p.value, 3))
) +
theme_minimal()
## Findings: Very active and Sleep
The correlation is again negative but weaker (r = -0.088), and the p-value of 0.075 is above the 0.05 threshold for significance. We will not present this to the client.
Now we will look at lightly active minutes and sleep to see if there is a correlation.
# Filter out rows with non-finite or missing values
filtered_data_light <- merged_data %>%
filter(!is.na(lightlyactiveminutes) & !is.na(totalminutesasleep) &
is.finite(lightlyactiveminutes) & is.finite(totalminutesasleep))
# Correlation test
cor_test_light <- cor.test(filtered_data_light$lightlyactiveminutes, filtered_data_light$totalminutesasleep)
# Display the correlation coefficient (r) and p-value
cor_test_light
##
## Pearson's product-moment correlation
##
## data: filtered_data_light$lightlyactiveminutes and filtered_data_light$totalminutesasleep
## t = 0.55737, df = 408, p-value = 0.5776
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06944947 0.12409914
## sample estimates:
## cor
## 0.02758336
# Graph
ggplot(filtered_data_light, aes(x = lightlyactiveminutes, y = totalminutesasleep)) +
geom_point(color = "#0072B2", alpha = 0.6, size = 2) + # Scatterplot points
geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) + # Linear trend line
labs(
title = "Do More Lightly Active Minutes Lead to Better Sleep?",
x = "Lightly Active Minutes",
y = "Total Minutes Asleep",
caption = paste("Correlation Coefficient (r):", round(cor_test_light$estimate, 3),
"\nP-value:", signif(cor_test_light$p.value, 3))
) +
theme_minimal()
This time the correlation is positive, but it is close to zero (r = 0.028) and not statistically significant (p = 0.58).
Lastly, we will look for a correlation between sedentary minutes and sleep.
# Filter out rows with non-finite or missing values
filtered_data_sedentary <- merged_data %>%
filter(!is.na(sedentaryminutes) & !is.na(totalminutesasleep) &
is.finite(sedentaryminutes) & is.finite(totalminutesasleep))
# Correlation test
cor_test_sedentary <- cor.test(filtered_data_sedentary$sedentaryminutes, filtered_data_sedentary$totalminutesasleep)
# Display the correlation coefficient (r) and p-value
cor_test_sedentary
##
## Pearson's product-moment correlation
##
## data: filtered_data_sedentary$sedentaryminutes and filtered_data_sedentary$totalminutesasleep
## t = -15.192, df = 408, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6595278 -0.5353923
## sample estimates:
## cor
## -0.6010731
ggplot(filtered_data_sedentary, aes(x = sedentaryminutes, y = totalminutesasleep)) +
geom_point(color = "#0072B2", alpha = 0.6, size = 2) + # Scatterplot points
geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) + # Linear trend line
labs(
title = "Do More Sedentary Minutes Lead to Better Sleep?",
x = "Sedentary Minutes",
y = "Total Minutes Asleep",
caption = paste("Correlation Coefficient (r):", round(cor_test_sedentary$estimate, 3),
"\nP-value:", signif(cor_test_sedentary$p.value, 3))
) +
theme_minimal()
Imagine tracking how much time people spend sitting versus sleeping:
- The data shows that people who sit for long periods tend to sleep less (r = -0.60).
- This connection is strong enough to matter and is highly unlikely to be random (p < 2.2e-16).
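To put the four sleep relationships on a common scale, each reported coefficient can be squared to give the share of variance in minutes asleep associated with a straight-line fit on that measure. A short sketch using the r values printed in the outputs above (the vector names are illustrative labels, not column names):

```r
# Correlation coefficients copied from the four cor.test() outputs.
r_values <- c(
  total_steps = -0.1903439,
  very_active_minutes = -0.08812658,
  lightly_active_minutes = 0.02758336,
  sedentary_minutes = -0.6010731
)

# r squared: proportion of variance in minutes asleep that a
# straight-line fit on each measure accounts for.
r_squared <- r_values^2

# Express as percentages, rounded to one decimal place.
print(round(100 * r_squared, 1))
```

On this scale, sedentary minutes stand apart from the other three measures. This describes association only and does not by itself show that sitting less would cause longer sleep.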
Prepared for Bellabeat’s Marketing Team
Bellabeat’s target customers—primarily health-conscious women—can leverage these insights for better health and wellness through the IVY+ tracker:
By combining insights from user behavior and the features of the IVY+ tracker, Bellabeat can:
- Address the specific needs of health-conscious women, focusing on sedentary behavior, stress reduction, and better sleep.
- Position the IVY+ tracker as a holistic wellness tool that empowers women to make meaningful changes in their daily routines.
- Create targeted marketing campaigns that emphasize female-centric, personalized wellness and scientifically backed insights to differentiate Bellabeat in the smart wearables market.