PHASE 1: ASK

Key Objectives:

Identify the Business Task:

Understand how smart device users interact with fitness trackers and identify trends that align with Bellabeat’s customer base.

Use these trends to propose actionable recommendations for Bellabeat’s marketing strategy.

Key Stakeholders:

  • Urška Sršen, Bellabeat’s co-founder and Chief Creative Officer.
  • Sando Mur, Bellabeat’s co-founder and mathematician.
  • Bellabeat’s marketing analytics team.

PHASE 2: PREPARE

Key Objectives:

Data Credibility:

Use Fitbit Fitness Tracker data: A public dataset containing minute-level output for physical activity, sleep monitoring, and heart rate for ~30 users. This dataset has been used in 6 published studies.

Focus on daily and sleep data to identify high-level trends.

Data Preparation:

Setting up the environment.

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
library(ggplot2)
library(tidyr)

Importing datasets for analysis

Import the relevant tables (dailyActivity_merged, dailyCalories_merged, sleepDay_merged, weightLogInfo_merged).

daily_activity <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_calories <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_log_info <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

PHASE 3 & 4: PROCESS & ANALYZE

Focus:

  • Clean the data.
  • Identify key patterns in user activity and sleep behaviors.
  • Investigate relationships between activity levels, sleep quality, and device usage.
  • Create visualizations for clear presentation of findings.

Clean Data

Check weightLogInfo to see if worth including

First, we noticed that the weightLogInfo data frame was small so we are going to see if it is worth including. We will check how many users it tracks.

# column names have not been lower-cased yet, so the column is `Id`
n_distinct(weight_log_info$Id)
## [1] 8

Since there are only 8 users tracked, we will not be using the weightLogInfo data frame.

Update case in column headings to be all lower case

Next, we are going to change all of the column headings to lower case, since they currently use mixed case (for example, TotalSteps).

daily_activity <- rename_with(daily_activity, tolower)
daily_calories <- rename_with(daily_calories, tolower)
sleep_day <- rename_with(sleep_day, tolower)
head(daily_activity)
head(daily_calories)
head(sleep_day)

Clean Date/Time Formatting

Next, we will clean up the date information since sleepday has a date/time format whereas the others are in date only format.

sleep_day <- sleep_day %>% 
  # each value splits into more than two pieces; keep the first two and drop the rest explicitly
  separate(sleepday, c("date", "time"), sep = " ", extra = "drop")
head(sleep_day)
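
The split above leaves the dates as text. As a minimal sketch of an alternative, assuming the values look like the hypothetical samples below (they are not copied from the dataset), base R's as.Date() can turn both formats into real Date values that compare and sort correctly:

```r
# Hypothetical sample values in the two formats described above
sleep_raw    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
activity_raw <- c("4/12/2016", "4/13/2016")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time text is ignored
sleep_dates    <- as.Date(sleep_raw, format = "%m/%d/%Y")
activity_dates <- as.Date(activity_raw, format = "%m/%d/%Y")

# Both columns now hold the same Date values and can serve as a join key
identical(sleep_dates, activity_dates)
```

Keeping dates as Date objects also makes later steps such as weekday summaries possible without re-parsing.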

Rename column headings

Next we will rename each of the date column headings in the daily_activity and daily_calories data frames so they all match.

daily_activity <- daily_activity %>%
  rename(date=activitydate)
daily_calories <- daily_calories %>% 
  rename(date=activityday)
head(daily_activity)
head(daily_calories)

Checking for ‘NA’ values in dataset

Checking for Not Available (NA) values to avoid any discrepancies in the data sets.

colSums(is.na(daily_activity))
##                       id                     date               totalsteps 
##                        0                        0                        0 
##            totaldistance          trackerdistance loggedactivitiesdistance 
##                        0                        0                        0 
##       veryactivedistance moderatelyactivedistance      lightactivedistance 
##                        0                        0                        0 
##  sedentaryactivedistance        veryactiveminutes      fairlyactiveminutes 
##                        0                        0                        0 
##     lightlyactiveminutes         sedentaryminutes                 calories 
##                        0                        0                        0
colSums(is.na(daily_calories))
##       id     date calories 
##        0        0        0
colSums(is.na(sleep_day))
##                 id               date               time  totalsleeprecords 
##                  0                  0                  0                  0 
## totalminutesasleep     totaltimeinbed 
##                  0                  0

There were zero NA values in the dataset.

Checking for ‘Null’ values

is.null(daily_activity)
## [1] FALSE
is.null(daily_calories)
## [1] FALSE
is.null(sleep_day)
## [1] FALSE

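None of the three data frames is NULL. Note that is.null() only tests whether the object itself is NULL; missing entries inside the tables are covered by the NA check above.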
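
The difference between is.null() and is.na() can be seen in a minimal sketch on a hypothetical toy data frame (not taken from the Fitbit data). It also counts empty strings, which neither check catches on its own:

```r
# Hypothetical toy data frame with one NA and one empty string
toy <- data.frame(
  id   = c(1, 2, 3),
  date = c("4/12/2016", "", NA),
  stringsAsFactors = FALSE
)

# FALSE: the object exists, whatever it contains
is.null(toy)

# Missing values per column
na_count <- colSums(is.na(toy))

# Empty strings per column (NA comparisons are skipped with na.rm)
empty_count <- colSums(toy == "", na.rm = TRUE)

na_count
empty_count
```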

Checking for Duplicate Data

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_calories))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3

The sleep_day data frame contains 3 duplicated rows. Inspecting the output of duplicated() located them (rows 162, 224, 381), and we verified that they could be deleted.

Deleting duplicates.

sleep_day <- distinct(sleep_day)
sum(duplicated(sleep_day))
## [1] 0

Mission accomplished! No more duplicated rows in our sleep_day data frame.
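
For reference, duplicated rows can be located and inspected before removal along these lines. This is a minimal base R sketch on a hypothetical toy table, not the actual sleep_day data:

```r
# Hypothetical toy sleep table in which row 2 repeats row 1
toy_sleep <- data.frame(
  id = c(1, 1, 2, 2),
  date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  totalminutesasleep = c(327, 327, 412, 390)
)

# Positions of rows that repeat an earlier row
dup_rows <- which(duplicated(toy_sleep))

# Look at the repeated rows before deciding to drop them
toy_sleep[dup_rows, ]

# Keep the first occurrence of every row
deduped <- unique(toy_sleep)
nrow(deduped)
```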

Merging datasets on id and date

Now that we have all of our data cleaned and ready to go, we will merge these datasets on id and date, the key columns after renaming.

merged_data <- daily_activity %>% 
  left_join(daily_calories, by= c("id", "date")) %>%
  left_join(sleep_day, by= c("id", "date"))
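
A left join keeps every activity row, so two side effects are worth checking: a column present in both tables is duplicated with .x and .y suffixes, and days without a sleep record get NA. The sketch below illustrates both with base R's merge() on hypothetical toy tables (dplyr's left_join() uses the same default suffixes):

```r
# Hypothetical toy tables sharing the keys id and date
activity <- data.frame(
  id = c(1, 1, 2),
  date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  calories = c(1985, 1797, 2100)
)
calories <- activity  # same calories column, as in the daily calories table
sleep <- data.frame(id = 1, date = "4/12/2016", totalminutesasleep = 327)

# all.x = TRUE makes merge() behave like a left join
merged <- merge(activity, calories, by = c("id", "date"), all.x = TRUE)
merged <- merge(merged, sleep, by = c("id", "date"), all.x = TRUE)

# calories appears twice: calories.x and calories.y
names(merged)

# Days with no sleep record are NA
missing_sleep <- sum(is.na(merged$totalminutesasleep))
missing_sleep
```

Since calories.x and calories.y carry the same values here, joining the calories table adds no new information in this toy case; whether that holds for the real tables is something to verify rather than assume.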

Analysis Structure

Analysis Plan

Since demographic variables are not available in the dataset, users will be classified based on their average daily steps. The activity levels are categorized as follows:

  • Sedentary: Fewer than 5,000 steps per day.
  • Lightly Active: Between 5,000 and 7,499 steps per day.
  • Fairly Active: Between 7,500 and 9,999 steps per day.
  • Very Active: 10,000 or more steps per day.

These classifications will allow us to analyze trends and behaviors based on user activity levels and identify meaningful insights about their lifestyles.

Since our data has daily figures, we will create a data frame that groups them by id and averages steps, calories, and sleep time.

daily_average <- merged_data %>% 
  group_by(id) %>% 
  summarise(mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories.x),
            mean_daily_sleep = mean(totalminutesasleep, na.rm = TRUE))  # skip days with no sleep record
head(daily_average)
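
Because days without a sleep record are NA after the left join, a plain mean() returns NA for any user with at least one such day. A minimal sketch on hypothetical toy data shows the effect of na.rm = TRUE:

```r
# Hypothetical toy data: user 1 has one sleep record, user 2 has none
toy <- data.frame(
  id = c(1, 1, 2, 2),
  totalsteps = c(4000, 6000, 12000, 10000),
  totalminutesasleep = c(400, NA, NA, NA)
)

# Without na.rm, a single NA makes the whole group mean NA
plain_mean <- tapply(toy$totalminutesasleep, toy$id, mean)

# With na.rm = TRUE, NA days are skipped; a user with no records gives NaN
clean_mean <- tapply(toy$totalminutesasleep, toy$id, mean, na.rm = TRUE)

plain_mean
clean_mean
```

Users who never logged sleep still end up without a usable average, so they drop out of any sleep comparison either way.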

Classify Users

Now we will classify users by their daily average steps

activity_level <- daily_average %>% 
  mutate(user_type = case_when(mean_daily_steps < 5000 ~ "sedentary", mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active", mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active", mean_daily_steps >= 10000 ~ "very active"))

head(activity_level)
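
The same thresholds can also be written with base R's cut(), which makes the boundary handling explicit. This is a sketch on hypothetical step averages, not a replacement for the classification above:

```r
# Hypothetical step averages, including the boundary values
mean_daily_steps <- c(3200, 5000, 7499, 7500, 10000, 12500)

# right = FALSE makes each interval include its lower bound: [5000, 7500) etc.
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE
)

data.frame(mean_daily_steps, user_type)
```

A side benefit is that cut() returns an ordered set of factor levels, so the categories plot in a sensible order without a separate factor() call.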

Prepare activity level data

Now we want to set up for our first visual. To do this we will use our new column to create a data frame that breaks down activity levels as a percentage of the entire data set.

activity_level_percent <- activity_level %>% 
  group_by(user_type) %>% 
  summarise(total = n()) %>% 
  mutate(totals = sum(total)) %>% 
  group_by(user_type) %>% 
  summarise(total_percent = total / totals) %>% 
  mutate(labels = scales::percent(total_percent))

# ensure the user_type factor levels are ordered correctly
activity_level_percent$user_type <- factor(activity_level_percent$user_type, levels = c("very active", "fairly active", "lightly active", "sedentary"))

head(activity_level_percent)
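
As a cross-check on the percentages, the same shares can be computed in one step with table() and prop.table(). The vector below is a hypothetical stand-in for the user_type column:

```r
# Hypothetical classification of four users
user_type <- c("sedentary", "very active", "fairly active", "sedentary")

# Counts per category, converted to shares of the total
shares <- prop.table(table(user_type))

# Shares as whole-number percentages
round(100 * shares)
```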

Create activity trend visual

Visualize activity trends across the user base.

ggplot(activity_level_percent, aes(x = user_type, y = total_percent, fill = user_type)) +
  geom_col() +
  labs(
    title = "Percentage of Users by Activity Level",
    x = "Activity Level",
    y = "Percentage"
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = NULL) +
  scale_fill_manual(
    values = c(
      "very active" = "#66CC66",  
      "fairly active" = "#99CC99",  
      "lightly active" = "#FF9999",  
      "sedentary" = "#CC6666"  
    )
  ) +
  theme_minimal()

Activity Level Distribution

The data reveals the following distribution of activity levels among users:

  • Very Active: Approximately 21% of users fell into this category.
  • Fairly Active: Around 27% of users were classified as fairly active.
  • Lightly Active: Similarly, 27% of users were lightly active.
  • Sedentary: About 24% of users were categorized as sedentary.

These percentages indicate a fairly balanced distribution of activity levels, with a slight majority engaging in light to moderate activity.

Visualize calories burned by avg daily steps

Plot calories burned against average daily steps and show correlation.

# Perform a correlation test
cor_test <- cor.test(activity_level$mean_daily_steps, activity_level$mean_daily_calories)

# Extract the p-value
p_value <- cor_test$p.value

# Create the scatter plot with the p-value
ggplot(activity_level, aes(x = mean_daily_steps, y = mean_daily_calories)) +
  geom_point(color = "#0072B2", alpha = 0.7, size = 3) +  # Scatter points
  geom_smooth(method = "lm", color = "red", linetype = "dashed", se = FALSE) +  # Trend line
  labs(
    title = "Calories Burned vs. Average Daily Steps",
    x = "Average Daily Steps",
    y = "Average Daily Calories",
    caption = "Note: The p-value indicates that the relationship is statistically significant."
  ) +
  annotate(
    "text",
    x = max(activity_level$mean_daily_steps) * 0.9,  # Adjust x position
    y = max(activity_level$mean_daily_calories) * 0.6,  # Adjust y position
    label = paste0("p = ", signif(p_value, digits = 3)),  # Format p-value
    size = 5,
    color = "darkred"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

As expected, users who average more daily steps also burn more calories on average, and the correlation is statistically significant. We can later use this in our marketing efforts.

Visualize sleep by activity level

First we will try visualizing the correlation between sleep and steps

# Filter out rows with non-finite or missing values
filtered_data <- merged_data %>%
  filter(!is.na(totalsteps) & !is.na(totalminutesasleep) & 
           is.finite(totalsteps) & is.finite(totalminutesasleep))

# Create scatterplot
ggplot(filtered_data, aes(x = totalsteps, y = totalminutesasleep)) +
  geom_point(color = "#0072B2", alpha = 0.6, size = 2) +  # Scatterplot points
  geom_smooth(method = "loess", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) +  # Smoothed trend line
  labs(
    title = "Relationship Between Steps and Sleep Duration",
    x = "Total Steps",
    y = "Total Minutes Asleep"
  ) +
  theme_minimal()

The scatterplot does not show a clear trend, so we will quantify the relationship with a correlation test.

Check statistical variables

Now we will run a correlation test and find out what our r and p-values are.

# Correlation test
cor_test <- cor.test(filtered_data$totalsteps, filtered_data$totalminutesasleep)

# Display the correlation coefficient (r) and p-value
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  filtered_data$totalsteps and filtered_data$totalminutesasleep
## t = -3.9164, df = 408, p-value = 0.0001054
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.28199288 -0.09525253
## sample estimates:
##        cor 
## -0.1903439
ggplot(filtered_data, aes(x = totalsteps, y = totalminutesasleep)) +
  geom_point(color = "#0072B2", alpha = 0.6, size = 2) +  # Scatterplot points
  geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) +  # Linear trend line
  labs(
    title = "Do More Steps Lead to Better Sleep?",
    x = "Total Steps",
    y = "Total Minutes Asleep",
    caption = paste("Correlation Coefficient (r):", round(cor_test$estimate, 3),
                    "\nP-value:", signif(cor_test$p.value, 3))
  ) +
  theme_minimal()

Findings: Steps and Sleep

Key Insights:

  • Correlation Coefficient (\(r\)):
    • \(r = -0.190\): This indicates a weak negative correlation between steps and sleep duration.
    • Interpretation: As steps go up, sleep goes down, but the effect is very small and not practically significant.
  • Statistical Significance (\(p\)):
    • \(p = 0.000105\): This very small p-value means the correlation is statistically significant: a correlation this far from zero would be unlikely if steps and sleep were unrelated.

What This Means:

  1. Steps and Sleep:
    • When people take more steps, they might sleep a tiny bit less, but the difference is minimal.
    • The relationship is a weak negative correlation.
  2. Statistical Significance:
    • The small p-value indicates this weak connection is unlikely to be a product of chance alone.
    • However, the correlation is so weak that it’s not very meaningful in practical terms.
  3. Possible Explanations:
    • Could this mean more active people need slightly less sleep, or are they simply busier and therefore sleeping less? Further research could explore this question.
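
The test statistic and p-value printed by cor.test() follow from r and the degrees of freedom through \(t = r\sqrt{df} / \sqrt{1 - r^2}\). The sketch below reproduces them from the printed values and also computes \(r^2\), the share of variance in sleep duration that a linear relationship with steps accounts for:

```r
# Values copied from the cor.test() output above
r  <- -0.1903439
df <- 408

# t statistic for a Pearson correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance explained by a linear fit
r_squared <- r^2

c(t = t_stat, p = p_value, r_squared = r_squared)
```

An \(r^2\) of under 4% is why the result can be statistically significant and still have little practical weight.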

Recommendation for Bellabeat:

An interesting opportunity for Bellabeat would be to enhance their devices with a daily user feedback prompt. This feature could collect data on how activity levels impact users’ perceived well-being.

Suggested Feature:

  • Prompt: Ask users a simple question like:
    • “Physically, how did you feel today?”
    • Options: Three clickable faces (😢 Sad, 😐 Okay, 😊 Happy).
  • Why This Matters:
    • This feedback, combined with activity data, could help Bellabeat gain valuable insights into how physical activity and perceived well-being are related.
    • Bellabeat could use this data to create personalized recommendations and provide users with a more holistic view of their health.

Next Steps:

  • Explore the potential relationship between activity levels, sleep duration, and subjective well-being using such feedback data.
  • Conduct a pilot study to test the effectiveness of the proposed feature in capturing meaningful insights for Bellabeat customers.

Comparing “very active” vs sleep

Now we are going to run the same analysis comparing how many minutes users were “very active” versus daily sleep to see if there is a trend.

# Filter out rows with non-finite or missing values
filtered_data_activity <- merged_data %>%
  filter(!is.na(veryactiveminutes) & !is.na(totalminutesasleep) & 
           is.finite(veryactiveminutes) & is.finite(totalminutesasleep))

# Correlation test
cor_test_activity <- cor.test(filtered_data_activity$veryactiveminutes, filtered_data_activity$totalminutesasleep)

# Display the correlation coefficient (r) and p-value
cor_test_activity
## 
##  Pearson's product-moment correlation
## 
## data:  filtered_data_activity$veryactiveminutes and filtered_data_activity$totalminutesasleep
## t = -1.787, df = 408, p-value = 0.07468
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.183408523  0.008795793
## sample estimates:
##         cor 
## -0.08812658
ggplot(filtered_data_activity, aes(x = veryactiveminutes, y = totalminutesasleep)) +
  geom_point(color = "#0072B2", alpha = 0.6, size = 2) +  # Scatterplot points
  geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) +  # Linear trend line
  labs(
    title = "Do More Very Active Minutes Lead to Better Sleep?",
    x = "Very Active Minutes",
    y = "Total Minutes Asleep",
    caption = paste("Correlation Coefficient (r):", round(cor_test_activity$estimate, 3),
                    "\nP-value:", signif(cor_test_activity$p.value, 3))
  ) +
  theme_minimal()

Findings: Very active and Sleep

Key Insights

The correlation is again negative and even weaker (r = -0.088), and the p-value (0.075) is above the 0.05 threshold, so it is not statistically significant. We will not present this to the client.
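
The 95 percent confidence interval in the output tells the same story as the p-value: it spans zero. As a sketch, the interval can be reproduced from r and the sample size with the Fisher z-transformation, taking n as df + 2 from the printed output:

```r
# Values copied from the cor.test() output above
r <- -0.08812658
n <- 408 + 2  # df + 2

# Fisher z-transformation and its standard error
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci

# TRUE when the interval contains zero
includes_zero <- ci[1] < 0 && ci[2] > 0
includes_zero
```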

Comparing lightly active and sleep

Now we will look at lightly active minutes and sleep to see if there is a correlation.

# Filter out rows with non-finite or missing values
filtered_data_light <- merged_data %>%
  filter(!is.na(lightlyactiveminutes) & !is.na(totalminutesasleep) & 
           is.finite(lightlyactiveminutes) & is.finite(totalminutesasleep))

# Correlation test
cor_test_light <- cor.test(filtered_data_light$lightlyactiveminutes, filtered_data_light$totalminutesasleep)

# Display the correlation coefficient (r) and p-value
cor_test_light
## 
##  Pearson's product-moment correlation
## 
## data:  filtered_data_light$lightlyactiveminutes and filtered_data_light$totalminutesasleep
## t = 0.55737, df = 408, p-value = 0.5776
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06944947  0.12409914
## sample estimates:
##        cor 
## 0.02758336
# Graph
ggplot(filtered_data_light, aes(x = lightlyactiveminutes, y = totalminutesasleep)) +
  geom_point(color = "#0072B2", alpha = 0.6, size = 2) +  # Scatterplot points
  geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) +  # Linear trend line
  labs(
    title = "Do More Lightly Active Minutes Lead to Better Sleep?",
    x = "Lightly Active Minutes",
    y = "Total Minutes Asleep",
    caption = paste("Correlation Coefficient (r):", round(cor_test_light$estimate, 3),
                    "\nP-value:", signif(cor_test_light$p.value, 3))
  ) +
  theme_minimal()

Findings: Lightly active and Sleep

Key insights

This is the first positive correlation (r = 0.028), but it is very weak and not statistically significant (p = 0.578).

Comparing Sedentary vs Sleep

Lastly, we will look for a correlation between sedentary minutes and sleep

# Filter out rows with non-finite or missing values
filtered_data_sedentary <- merged_data %>%
  filter(!is.na(sedentaryminutes) & !is.na(totalminutesasleep) & 
           is.finite(sedentaryminutes) & is.finite(totalminutesasleep))

# Correlation test
cor_test_sedentary <- cor.test(filtered_data_sedentary$sedentaryminutes, filtered_data_sedentary$totalminutesasleep)

# Display the correlation coefficient (r) and p-value
cor_test_sedentary
## 
##  Pearson's product-moment correlation
## 
## data:  filtered_data_sedentary$sedentaryminutes and filtered_data_sedentary$totalminutesasleep
## t = -15.192, df = 408, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6595278 -0.5353923
## sample estimates:
##        cor 
## -0.6010731
ggplot(filtered_data_sedentary, aes(x = sedentaryminutes, y = totalminutesasleep)) +
  geom_point(color = "#0072B2", alpha = 0.6, size = 2) +  # Scatterplot points
  geom_smooth(method = "lm", color = "red", linetype = "solid", se = TRUE, formula = y ~ x) +  # Linear trend line
  labs(
    title = "Do More Sedentary Minutes Lead to Better Sleep?",
    x = "Sedentary Minutes",
    y = "Total Minutes Asleep",
    caption = paste("Correlation Coefficient (r):", round(cor_test_sedentary$estimate, 3),
                    "\nP-value:", signif(cor_test_sedentary$p.value, 3))
  ) +
  theme_minimal()

Findings: Sedentary vs Sleep

Key Insights:

  1. Correlation Coefficient (\(r\)):
    • \(r = -0.601\)
    • This indicates a moderately strong negative correlation, meaning that as sedentary minutes increase, total minutes asleep tend to decrease.
  2. p-value:
    • \(p < 2.2 \times 10^{-16}\) (effectively 0)
    • This extremely small p-value shows that the relationship is statistically significant: a correlation this strong is very unlikely to arise by chance alone.

What This Means:

  • Practical Implications:
    • Individuals who spend more time being sedentary tend to sleep less.
    • The strength of the relationship (\(r = -0.601\)) suggests that sedentary time is meaningfully associated with sleep duration, although a correlation alone does not show which way any effect runs.
  • Statistical Significance:
    • The p-value confirms the relationship is robust and unlikely to be a random occurrence.
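
To put the four sleep correlations side by side, the sketch below squares each coefficient from the cor.test() outputs in this report. \(r^2\) is the share of variance in sleep duration that a linear relationship with each variable accounts for; the vector names are labels chosen here for illustration:

```r
# Correlation coefficients copied from the cor.test() outputs in this report
r_values <- c(
  steps          = -0.1903439,
  very_active    = -0.08812658,
  lightly_active =  0.02758336,
  sedentary      = -0.6010731
)

# Share of variance explained, as a percentage
r_squared <- r_values^2
round(100 * r_squared, 1)

# Variable with the strongest linear association with sleep
strongest <- names(which.max(r_squared))
strongest
```

Sedentary minutes account for roughly 36% of the variance in minutes asleep, while each of the other three measures accounts for less than 4%.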

Real-Life Analogy

Imagine tracking how much time people spend sitting versus sleeping:

  • The data shows that people who sit for long periods tend to sleep less.
  • This connection is strong enough to matter and is highly unlikely to be random.

Next Steps

  • Explore possible reasons for this relationship:
    • Does increased sedentary time replace physical activity, which could promote better sleep?
    • What are the causes of high amounts of sedentary time?
    • Is sedentary behavior linked to stress or other factors that disrupt sleep?

Phase 5 & 6

Smart Device Usage Report for Bellabeat

Prepared for Bellabeat’s Marketing Team


4. Key Takeaways

By combining insights from user behavior and the features of the IVY+ tracker, Bellabeat can:

  • Address the specific needs of health-conscious women, focusing on sedentary behavior, stress reduction, and better sleep.
  • Position the IVY+ tracker as a holistic wellness tool that empowers women to make meaningful changes in their daily routines.
  • Create targeted marketing campaigns that emphasize female-centric, personalized wellness and scientifically backed insights to differentiate Bellabeat in the smart wearables market.