Phase 1: Ask

In this phase, we define the business problem and the goals of our analysis. This sets the stage for our entire project by identifying what we are trying to solve and who we are solving it for.

1.1 What is the problem you are trying to solve?

Bellabeat, a high-tech manufacturer of health-focused smart products for women, wants to understand how consumers are using their smart devices. The insights from this analysis of existing user data will help guide the company’s marketing strategy and provide data-driven recommendations for future growth.

1.2 Who are the stakeholders?

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
  • Sando Mur: Cofounder and CEO of Bellabeat.
  • The Bellabeat Marketing Analytics Team: This team needs data-driven recommendations to inform their marketing strategy.

1.3 How will this data help stakeholders make decisions?

By analyzing user trends in activity, sleep, and weight, we can identify key patterns and behaviors. This analysis will provide insights into:

  • Opportunities for growth: What are the most and least used features of the smart devices?
  • Marketing focus: What user behaviors should Bellabeat encourage to drive sales and engagement?
  • Product development: How can insights into user habits inform future product features?

Phase 2: Prepare

In this phase, we load our data and verify its integrity. We need to understand where the data came from, its limitations, and its structure before we can clean and analyze it.

First, we load the R packages that will be essential for our analysis.

# Load necessary libraries for our analysis
library(tidyverse)
library(lubridate)
library(ggplot2)
# Make sure the CSV files are in the same folder as this RMarkdown file
activity <- read.csv("dailyActivity_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
weight <- read.csv("weightLogInfo_merged.csv")
# Get a glimpse of the first few rows of each dataset
head(activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    3/25/2016      11004          7.11            7.11
## 2 1503960366    3/26/2016      17609         11.55           11.55
## 3 1503960366    3/27/2016      12736          8.53            8.53
## 4 1503960366    3/28/2016      13231          8.93            8.93
## 5 1503960366    3/29/2016      12041          7.85            7.85
## 6 1503960366    3/30/2016      10970          7.16            7.16
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               2.57                     0.46
## 2                        0               6.92                     0.73
## 3                        0               4.66                     0.16
## 4                        0               3.19                     0.79
## 5                        0               2.16                     1.09
## 6                        0               2.36                     0.51
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                4.07                       0                33
## 2                3.91                       0                89
## 3                3.71                       0                56
## 4                4.95                       0                39
## 5                4.61                       0                28
## 6                4.29                       0                30
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  12                  205              804     1819
## 2                  17                  274              588     2154
## 3                   5                  268              605     1944
## 4                  20                  224             1080     1932
## 5                  28                  243              763     1886
## 6                  13                  223             1174     1820
head(sleep)
##           Id       SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 4/12/2016 0:00                 1                327            346
## 2 1503960366 4/13/2016 0:00                 2                384            407
## 3 1503960366 4/15/2016 0:00                 1                412            442
## 4 1503960366 4/16/2016 0:00                 2                340            367
## 5 1503960366 4/17/2016 0:00                 1                700            712
## 6 1503960366 4/19/2016 0:00                 1                304            320
head(weight)
##           Id                 Date WeightKg WeightPounds Fat   BMI
## 1 1503960366 4/5/2016 11:59:59 PM     53.3     117.5064  22 22.97
## 2 1927972279 4/10/2016 6:33:26 PM    129.6     285.7191  NA 46.17
## 3 2347167796 4/3/2016 11:59:59 PM     63.4     139.7731  10 24.77
## 4 2873212765 4/6/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 4/7/2016 11:59:59 PM     57.2     126.1044  NA 21.65
## 6 2891001357 4/5/2016 11:59:59 PM     88.4     194.8886  NA 25.03
##   IsManualReport        LogId
## 1           True 1.459901e+12
## 2          False 1.460313e+12
## 3           True 1.459728e+12
## 4           True 1.459987e+12
## 5           True 1.460074e+12
## 6           True 1.459901e+12
# Get information about data types and structure
str(activity)
## 'data.frame':    457 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "3/25/2016" "3/26/2016" "3/27/2016" "3/28/2016" ...
##  $ TotalSteps              : int  11004 17609 12736 13231 12041 10970 12256 12262 11248 10016 ...
##  $ TotalDistance           : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ TrackerDistance         : num  7.11 11.55 8.53 8.93 7.85 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  2.57 6.92 4.66 3.19 2.16 ...
##  $ ModeratelyActiveDistance: num  0.46 0.73 0.16 0.79 1.09 ...
##  $ LightActiveDistance     : num  4.07 3.91 3.71 4.95 4.61 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  33 89 56 39 28 30 33 47 40 15 ...
##  $ FairlyActiveMinutes     : int  12 17 5 20 28 13 12 21 11 30 ...
##  $ LightlyActiveMinutes    : int  205 274 268 224 243 223 239 200 244 314 ...
##  $ SedentaryMinutes        : int  804 588 605 1080 763 1174 820 866 636 655 ...
##  $ Calories                : int  1819 2154 1944 1932 1886 1820 1889 1868 1843 1850 ...
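The `head(weight)` output above already shows `NA` values in the `Fat` column, so a per-column count of missing values is a useful companion to `str()`. The following is a minimal sketch in base R; `weight_demo` is an illustrative name that rebuilds three columns of the six printed weight rows and is not part of the original analysis.

```r
# Illustrative copy of three columns from the six rows shown by head(weight)
weight_demo <- data.frame(
  WeightKg = c(53.3, 129.6, 63.4, 56.7, 57.2, 88.4),
  Fat      = c(22, NA, 10, NA, NA, NA),
  BMI      = c(22.97, 46.17, 24.77, 21.45, 21.65, 25.03)
)

# Count missing values in each column
na_counts <- colSums(is.na(weight_demo))
print(na_counts)

# Share of rows with a missing Fat reading
fat_missing_share <- mean(is.na(weight_demo$Fat))
print(fat_missing_share)
```

On the full data, the same check would be `colSums(is.na(weight))`. A column that is mostly empty is probably better excluded from any summary.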
# 1. CONVERT THE 'DATE' COLUMNS
# The format="%m/%d/%Y" tells R that the dates are in the "month/day/Year" format.
# SleepDay and Date also carry a time of day; as.Date() ignores that trailing text.
activity$ActivityDate <- as.Date(activity$ActivityDate, format="%m/%d/%Y")
sleep$SleepDay <- as.Date(sleep$SleepDay, format="%m/%d/%Y")
weight$Date <- as.Date(weight$Date, format="%m/%d/%Y")
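The three date columns do not share one layout: `SleepDay` and `Date` carry a time of day after the date. As a minimal sketch, the strings below are copied from the `head()` output and parsed with the same format string; the variable names are illustrative. It shows that `as.Date()` reads the leading month/day/year and ignores the rest, and that a failed parse would surface as `NA`.

```r
# Sample strings copied from the head() output of each dataset
activity_dates <- c("3/25/2016", "3/26/2016")
sleep_days     <- c("4/12/2016 0:00", "4/13/2016 0:00")
weight_dates   <- c("4/5/2016 11:59:59 PM", "4/10/2016 6:33:26 PM")

# as.Date() parses the leading month/day/year and ignores the trailing time text
parsed_activity <- as.Date(activity_dates, format = "%m/%d/%Y")
parsed_sleep    <- as.Date(sleep_days, format = "%m/%d/%Y")
parsed_weight   <- as.Date(weight_dates, format = "%m/%d/%Y")

# A failed parse yields NA, so any NA here would signal a format mismatch
parse_failures <- sum(is.na(c(parsed_activity, parsed_sleep, parsed_weight)))
print(parsed_sleep)
print(parse_failures)
```

Dropping the time of day is acceptable here because the analysis works at daily granularity. A finer-grained analysis would need a date-time type instead.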
# 2. CHECK FOR AND REMOVE DUPLICATES
# First, let's see how many duplicate rows exist in the sleep dataset
cat("Number of duplicates in sleep BEFORE cleaning: ", sum(duplicated(sleep)), "\n")
## Number of duplicates in sleep BEFORE cleaning:  3
# Now, remove the duplicate rows from the sleep dataset
sleep <- sleep %>% 
  distinct()
# Finally, let's verify that the duplicates are gone
cat("Number of duplicates in sleep AFTER cleaning: ", sum(duplicated(sleep)), "\n")
## Number of duplicates in sleep AFTER cleaning:  0
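To make the duplicate count easier to interpret, here is a small sketch on a hypothetical three-row sleep log (`sleep_demo` and its values are illustrative only). It shows that `duplicated()` flags the second and later copies of a row but not the first, and that base R's `unique()` plays the same role as `distinct()`.

```r
# Hypothetical sleep log in which the second row is an exact copy of the first
sleep_demo <- data.frame(
  Id                 = c(1503960366, 1503960366, 1503960366),
  SleepDay           = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 327, 384)
)

# duplicated() flags the second and later copies of a row, not the first
dup_flags <- duplicated(sleep_demo)
print(dup_flags)

# unique() keeps one copy of each row, the base R counterpart of dplyr::distinct()
sleep_clean <- unique(sleep_demo)
print(nrow(sleep_clean))
```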

# 3. SUMMARIZE UNIQUE PARTICIPANTS
cat("Unique participants in activity data: ", n_distinct(activity$Id), "\n")
cat("Unique participants in sleep data: ", n_distinct(sleep$Id), "\n")
cat("Unique participants in weight data: ", n_distinct(weight$Id), "\n")

# 1. CALCULATE SUMMARY STATISTICS
# We select only the most relevant columns for a cleaner summary
activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :  32.0   Min.   :   0  
##  1st Qu.: 1988   1st Qu.: 1.410   1st Qu.: 728.0   1st Qu.:1776  
##  Median : 5986   Median : 4.090   Median :1057.0   Median :2062  
##  Mean   : 6547   Mean   : 4.664   Mean   : 995.3   Mean   :2189  
##  3rd Qu.:10198   3rd Qu.: 7.160   3rd Qu.:1285.0   3rd Qu.:2667  
##  Max.   :28497   Max.   :27.530   Max.   :1440.0   Max.   :4562
sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
# 2. MERGE THE ACTIVITY AND SLEEP DATASETS
merged_data <- merge(activity, sleep, by.x=c("Id", "ActivityDate"), by.y=c("Id", "SleepDay"), all.x=TRUE)

# 3. ADD A 'DAY OF THE WEEK' COLUMN
merged_data$DayOfWeek <- weekdays(merged_data$ActivityDate)

# For better visualizations later, we want the days to be in the correct order
merged_data$DayOfWeek <- ordered(merged_data$DayOfWeek, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
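Because the merge uses `all.x=TRUE`, every activity row is kept and rows without a matching sleep record receive `NA` in the sleep columns. The `head()` output shows activity records starting on 3/25/2016 and sleep records starting on 4/12/2016, so it is worth checking how many rows actually matched. The sketch below illustrates that check on hypothetical data; `activity_demo`, `sleep_demo`, and their values are made up for the example.

```r
# Hypothetical activity log: three days for one participant
activity_demo <- data.frame(
  Id           = c(1503960366, 1503960366, 1503960366),
  ActivityDate = as.Date(c("2016-04-11", "2016-04-12", "2016-04-13")),
  TotalSteps   = c(12000, 13000, 11000)
)

# Hypothetical sleep log: records exist for only two of those days
sleep_demo <- data.frame(
  Id                 = c(1503960366, 1503960366),
  SleepDay           = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)

# Same left join as in the analysis: every activity row is kept
merged_demo <- merge(activity_demo, sleep_demo,
                     by.x = c("Id", "ActivityDate"),
                     by.y = c("Id", "SleepDay"),
                     all.x = TRUE)

# Rows without a sleep match carry NA in the sleep columns
matched_rows <- sum(!is.na(merged_demo$TotalMinutesAsleep))
match_rate   <- matched_rows / nrow(merged_demo)
print(match_rate)
```

On the real data, the equivalent check is `mean(!is.na(merged_data$TotalMinutesAsleep))`. A low value would mean the sleep-related plots rest on a small subset of the merged rows.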
# Create a scatter plot of TotalSteps vs. Calories
ggplot(data=merged_data, aes(x=TotalSteps, y=Calories)) +
  geom_point(alpha=0.5, color="dodgerblue") +
  geom_smooth(method="loess", color="red") + # Adds a smooth trend line
  labs(title="Daily Steps vs. Calories Burned",
       x="Total Steps per Day",
       y="Calories Burned") +
  theme_light() # Applies a clean theme to the plot
## `geom_smooth()` using formula = 'y ~ x'
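A scatter plot shows the direction of the relationship, but a correlation coefficient puts a number on its strength. The following is a minimal sketch using only the six activity rows printed by `head(activity)`. Those rows come from a single participant, so the result illustrates the calculation rather than the full dataset.

```r
# The six activity rows printed by head(activity)
total_steps <- c(11004, 17609, 12736, 13231, 12041, 10970)
calories    <- c(1819, 2154, 1944, 1932, 1886, 1820)

# Pearson correlation coefficient between steps and calories
steps_calories_r <- cor(total_steps, calories, method = "pearson")
print(steps_calories_r)
```

For the full dataset, the same call would be `cor(merged_data$TotalSteps, merged_data$Calories)`.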

# To get the average steps per day, we first need to aggregate our data
daily_avg_steps <- merged_data %>%
  group_by(DayOfWeek) %>%
  summarise(AverageSteps = mean(TotalSteps, na.rm = TRUE))
# Now, create the bar plot
ggplot(data=daily_avg_steps, aes(x=DayOfWeek, y=AverageSteps, fill=DayOfWeek)) +
  geom_col() + # geom_col is used for bar charts where you provide the y-value
  labs(title="Average Steps by Day of the Week",
       x="Day of the Week",
       y="Average Steps") +
  theme(axis.text.x = element_text(angle=45, hjust=1), # Rotates x-axis labels
        legend.position="none") # Hides the redundant legend
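One caveat with `weekdays()` is that it returns day names in the session's locale, so the English level names used for ordering may not match on a non-English system. As a hedged alternative, the sketch below derives the weekday from the numeric `wday` field of `as.POSIXlt()`, which does not depend on locale. It uses the six rows from `head(activity)`, and `steps_demo` and `day_names` are illustrative names.

```r
# The six activity rows printed by head(activity)
steps_demo <- data.frame(
  ActivityDate = as.Date(c("2016-03-25", "2016-03-26", "2016-03-27",
                           "2016-03-28", "2016-03-29", "2016-03-30")),
  TotalSteps   = c(11004, 17609, 12736, 13231, 12041, 10970)
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday_index <- as.POSIXlt(steps_demo$ActivityDate)$wday
steps_demo$DayOfWeek <- factor(day_names[wday_index + 1],
                               levels = day_names, ordered = TRUE)

# Average steps per weekday, returned in Sunday-to-Saturday order
avg_steps <- tapply(steps_demo$TotalSteps, steps_demo$DayOfWeek, mean)
print(avg_steps)
```

Days with no observations, such as Thursday in this small sample, appear as `NA` rather than being dropped.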

# First, create a temporary dataframe that has no missing sleep values
sleep_data_complete <- merged_data %>%
  filter(!is.na(TotalMinutesAsleep)) # Keeps rows where TotalMinutesAsleep is NOT NA
# Now, build the plot using this new, clean dataframe
ggplot(data=sleep_data_complete, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_point(color="purple") +
  # This adds a perfect 1:1 reference line in red
  geom_abline(color="red", linetype="dashed") +
  labs(title="Time Asleep vs. Time in Bed",
       subtitle="Red line represents perfect sleep efficiency (100%)",
       x="Total Minutes Asleep",
       y="Total Time Spent in Bed (minutes)") +
  theme_light()

Insight: Nearly all data points fall above the red reference line, indicating that users consistently spend more time in bed than they are actually asleep. This gap represents an opportunity for Bellabeat to market features that improve sleep quality and efficiency, not just duration.
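The gap between time in bed and time asleep can also be expressed as a number. The sketch below uses the six rows printed by `head(sleep)` and takes sleep efficiency to mean minutes asleep divided by minutes in bed. That definition is an assumption here, chosen because it matches the 100% reference line in the plot.

```r
# The six sleep rows printed by head(sleep)
minutes_asleep <- c(327, 384, 412, 340, 700, 304)
minutes_in_bed <- c(346, 407, 442, 367, 712, 320)

# Sleep efficiency: fraction of time in bed actually spent asleep
sleep_efficiency <- minutes_asleep / minutes_in_bed
print(round(sleep_efficiency, 3))

# Minutes spent in bed awake, i.e. the vertical gap above the 1:1 line
awake_in_bed <- minutes_in_bed - minutes_asleep
print(awake_in_bed)
```

On the full data, the same ratio could be added as a column of `sleep_data_complete` and summarized per participant.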

Phase 6: Act

This final phase is about delivering our conclusions and actionable recommendations based on the analysis.

6.1 Final Conclusion

Our analysis of the Fitbit user data reveals several key behavioral patterns. Sedentary time is high: the mean of 995 sedentary minutes per day means the average user is inactive for over 16 hours a day. Daily step counts rise with calories burned, yet the mean of 6,547 steps per day falls short of the commonly cited 10,000-step benchmark [1]. Sleep duration is generally adequate, with a mean of about 7 hours asleep per logged day, but the consistent gap between time in bed and time asleep shows a clear opportunity to improve sleep efficiency.
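The hour figures follow directly from the means in the `summary()` output: 995.3 sedentary minutes and 419.2 minutes asleep per day. A short sketch of the conversion, with illustrative variable names:

```r
# Means taken from the summary() output earlier in the analysis
mean_sedentary_minutes <- 995.3
mean_minutes_asleep    <- 419.2

# Convert minutes to hours
sedentary_hours <- mean_sedentary_minutes / 60
asleep_hours    <- mean_minutes_asleep / 60

print(round(sedentary_hours, 1))
print(round(asleep_hours, 1))
```

The sedentary figure comes to roughly 16.6 hours and the sleep figure to roughly 7 hours.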

6.2 Actionable Recommendations for Bellabeat’s Marketing Strategy

Based on these insights, we recommend the following three strategies for the Bellabeat marketing team:

  1. Focus Marketing on Small, Achievable Habit Changes: Given the high sedentary minutes, many users may feel overwhelmed by a goal like 10,000 steps. Bellabeat’s marketing should focus on the app’s ability to help users make small, incremental changes. Campaigns could highlight features like “stand-up reminders” or celebrate modest goals.

  2. Launch Targeted Engagement Campaigns: Our analysis showed that activity levels dip on certain days, particularly Sundays. Bellabeat should use this insight to create targeted in-app notifications and marketing campaigns. A “Sunday Reset” or “Weekend Warrior” challenge could motivate users to stay active during these lulls.

  3. Emphasize and Market Sleep Quality Features: While competitors may focus on tracking sleep duration, Bellabeat has an opportunity to stand out by focusing on sleep quality. Marketing materials should highlight how Bellabeat products can help users understand their sleep efficiency, positioning the brand as a more advanced and holistic wellness tool.

6.3 Citations

[1] Tudor-Locke, C., Craig, C. L., Brown, W. J., Clemes, S. A., De Cocker, K., Giles-Corti, B., … & Blair, S. N. (2011). How many steps/day are enough? For adults. International Journal of Behavioral Nutrition and Physical Activity, 8(1), 1-17.