In this phase, we define the business problem and the goals of our analysis. This sets the stage for our entire project by identifying what we are trying to solve and who we are solving it for.
Bellabeat, a high-tech manufacturer of health-focused smart products for women, wants to understand how consumers are using their smart devices. The insights from this analysis of existing user data will help guide the company’s marketing strategy and provide data-driven recommendations for future growth.
By analyzing user trends in activity, sleep, and weight, we can identify key patterns and behaviors. This analysis will provide insights into:
* Opportunities for growth: What are the most and least used features of the smart devices?
* Marketing focus: What user behaviors should Bellabeat encourage to drive sales and engagement?
* Product development: How can insights into user habits inform future product features?
In this phase, we load our data and verify its integrity. We need to understand where the data came from, its limitations, and its structure before we can clean and analyze it.
First, we load the R packages that will be essential for our analysis.
# Load necessary libraries for our analysis
library(tidyverse)
library(lubridate)
library(ggplot2)
# Make sure the CSV files are in the same folder as this RMarkdown file
activity <- read.csv("dailyActivity_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
weight <- read.csv("weightLogInfo_merged.csv")
# Get a glimpse of the first few rows of each dataset
head(activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 3/25/2016 11004 7.11 7.11
## 2 1503960366 3/26/2016 17609 11.55 11.55
## 3 1503960366 3/27/2016 12736 8.53 8.53
## 4 1503960366 3/28/2016 13231 8.93 8.93
## 5 1503960366 3/29/2016 12041 7.85 7.85
## 6 1503960366 3/30/2016 10970 7.16 7.16
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.57 0.46
## 2 0 6.92 0.73
## 3 0 4.66 0.16
## 4 0 3.19 0.79
## 5 0 2.16 1.09
## 6 0 2.36 0.51
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 4.07 0 33
## 2 3.91 0 89
## 3 3.71 0 56
## 4 4.95 0 39
## 5 4.61 0 28
## 6 4.29 0 30
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 12 205 804 1819
## 2 17 274 588 2154
## 3 5 268 605 1944
## 4 20 224 1080 1932
## 5 28 243 763 1886
## 6 13 223 1174 1820
head(sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 4/12/2016 0:00 1 327 346
## 2 1503960366 4/13/2016 0:00 2 384 407
## 3 1503960366 4/15/2016 0:00 1 412 442
## 4 1503960366 4/16/2016 0:00 2 340 367
## 5 1503960366 4/17/2016 0:00 1 700 712
## 6 1503960366 4/19/2016 0:00 1 304 320
head(weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 4/5/2016 11:59:59 PM 53.3 117.5064 22 22.97
## 2 1927972279 4/10/2016 6:33:26 PM 129.6 285.7191 NA 46.17
## 3 2347167796 4/3/2016 11:59:59 PM 63.4 139.7731 10 24.77
## 4 2873212765 4/6/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 4/7/2016 11:59:59 PM 57.2 126.1044 NA 21.65
## 6 2891001357 4/5/2016 11:59:59 PM 88.4 194.8886 NA 25.03
## IsManualReport LogId
## 1 True 1.459901e+12
## 2 False 1.460313e+12
## 3 True 1.459728e+12
## 4 True 1.459987e+12
## 5 True 1.460074e+12
## 6 True 1.459901e+12
# Get information about data types and structure
str(activity)
## 'data.frame': 457 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "3/25/2016" "3/26/2016" "3/27/2016" "3/28/2016" ...
## $ TotalSteps : int 11004 17609 12736 13231 12041 10970 12256 12262 11248 10016 ...
## $ TotalDistance : num 7.11 11.55 8.53 8.93 7.85 ...
## $ TrackerDistance : num 7.11 11.55 8.53 8.93 7.85 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 2.57 6.92 4.66 3.19 2.16 ...
## $ ModeratelyActiveDistance: num 0.46 0.73 0.16 0.79 1.09 ...
## $ LightActiveDistance : num 4.07 3.91 3.71 4.95 4.61 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 33 89 56 39 28 30 33 47 40 15 ...
## $ FairlyActiveMinutes : int 12 17 5 20 28 13 12 21 11 30 ...
## $ LightlyActiveMinutes : int 205 274 268 224 243 223 239 200 244 314 ...
## $ SedentaryMinutes : int 804 588 605 1080 763 1174 820 866 636 655 ...
## $ Calories : int 1819 2154 1944 1932 1886 1820 1889 1868 1843 1850 ...
# 1. CONVERT THE 'DATE' COLUMNS
# The format="%m/%d/%Y" tells R that the dates are in the "month/day/Year" format.
activity$ActivityDate <- as.Date(activity$ActivityDate, format="%m/%d/%Y")
sleep$SleepDay <- as.Date(sleep$SleepDay, format="%m/%d/%Y")
weight$Date <- as.Date(weight$Date, format="%m/%d/%Y")
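The conversions above pass a date-only format to columns that also carry a time of day (`SleepDay` and `Date`). Base R's `as.Date()` ignores trailing characters beyond the given format, so the time portion is dropped, but a value that does not match the format silently becomes `NA`. A minimal sketch of a guard against that, using sample strings copied from the `head()` output above; `parse_mdy` is a hypothetical helper name and not part of the original analysis:

```r
# Hypothetical helper: parse month/day/year strings and fail loudly
# if any value cannot be parsed (as.Date() returns NA on a mismatch).
# Note: it would also stop on values that were already NA in the input.
parse_mdy <- function(x) {
  parsed <- as.Date(x, format = "%m/%d/%Y")
  if (anyNA(parsed)) {
    stop("Unparseable dates: ", paste(x[is.na(parsed)], collapse = ", "))
  }
  parsed
}

# Sample values taken from the head() output shown earlier
activity_dates <- parse_mdy(c("3/25/2016", "3/26/2016"))
sleep_dates    <- parse_mdy(c("4/12/2016 0:00", "4/13/2016 0:00"))
weight_dates   <- parse_mdy(c("4/5/2016 11:59:59 PM", "4/10/2016 6:33:26 PM"))

print(sleep_dates)
print(weight_dates)
```

Wrapping the conversion this way trades a silent `NA` for an explicit error, which may be preferable when the CSV export format is not fully known.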
# 2. CHECK FOR AND REMOVE DUPLICATES
# First, let's count the fully duplicated rows in the sleep dataset
cat("Number of duplicates in sleep BEFORE cleaning: ", sum(duplicated(sleep)), "\n")
## Number of duplicates in sleep BEFORE cleaning: 3
# Now, remove the duplicate rows from the sleep dataset
sleep <- sleep %>%
  distinct()
# Finally, let's verify that the duplicates are gone
cat("Number of duplicates in sleep AFTER cleaning: ", sum(duplicated(sleep)), "\n")
## Number of duplicates in sleep AFTER cleaning: 0
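`duplicated(sleep)` only flags rows that match in every column. Because the later merge keys on `Id` and date, it may also be worth checking that no participant has two different records for the same day, since that would duplicate activity rows after the merge. A sketch on made-up rows (the values in `sleep_demo` are illustrative, not taken from the dataset):

```r
# Illustrative data: rows 2 and 3 share a key but differ in a value column
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 390, 412)
)

# Full-row duplicates: what duplicated(sleep) reports
full_row_dups <- sum(duplicated(sleep_demo))

# Key-level duplicates: same Id and SleepDay, regardless of other columns
key_dups <- sum(duplicated(sleep_demo[, c("Id", "SleepDay")]))

cat("Full-row duplicates:", full_row_dups, "\n")
cat("Key-level duplicates:", key_dups, "\n")
```

If the key-level count is non-zero on the real data, the conflicting rows would need a rule (for example, keeping the longest record) before merging; which rule is appropriate is a judgment call the original analysis does not address.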
cat("Unique participants in activity data:", n_distinct(activity$Id), "\n")
cat("Unique participants in sleep data: ", n_distinct(sleep$Id), "\n")
cat("Unique participants in weight data:", n_distinct(weight$Id), "\n")
# 1. CALCULATE SUMMARY STATISTICS
# We select only the most relevant columns for a cleaner summary
activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 32.0 Min. : 0
## 1st Qu.: 1988 1st Qu.: 1.410 1st Qu.: 728.0 1st Qu.:1776
## Median : 5986 Median : 4.090 Median :1057.0 Median :2062
## Mean : 6547 Mean : 4.664 Mean : 995.3 Mean :2189
## 3rd Qu.:10198 3rd Qu.: 7.160 3rd Qu.:1285.0 3rd Qu.:2667
## Max. :28497 Max. :27.530 Max. :1440.0 Max. :4562
sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
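The minute-based means in these summaries are easier to interpret as hours. A small sketch of the conversion, with the three means copied from the `summary()` output above:

```r
# Means copied from the summary() output above (minutes per day)
mean_minutes <- c(
  SedentaryMinutes   = 995.3,
  TotalMinutesAsleep = 419.2,
  TotalTimeInBed     = 458.5
)

# Convert minutes to hours, rounded to one decimal place
mean_hours <- round(mean_minutes / 60, 1)
print(mean_hours)
```

This gives roughly 16.6 sedentary hours, 7.0 hours asleep, and 7.6 hours in bed per day. The sedentary and sleep figures come from different tables with different row counts, so they should be read side by side rather than added together.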
# 2. MERGE THE ACTIVITY AND SLEEP DATASETS
merged_data <- merge(activity, sleep, by.x=c("Id", "ActivityDate"), by.y=c("Id", "SleepDay"), all.x=TRUE)
# 3. ADD A 'DAY OF THE WEEK' COLUMN
merged_data$DayOfWeek <- weekdays(merged_data$ActivityDate)
# For better visualizations later, we want the days to be in the correct order
merged_data$DayOfWeek <- ordered(merged_data$DayOfWeek, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
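Because the merge uses `all.x=TRUE`, every activity row is kept and days without a sleep record get `NA` in the sleep columns. The `head()` output earlier shows activity dates starting in March and sleep dates starting in April, so the overlap may be partial. A minimal sketch of how to check the match rate, using made-up rows (`activity_demo` and `sleep_demo` are illustrative, not from the dataset):

```r
# Illustrative data (values are made up, not from the dataset)
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(11004, 17609, 5000)
)
sleep_demo <- data.frame(
  Id = 1,
  SleepDay = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Left join: keep every activity row, attach sleep data where it exists
merged_demo <- merge(activity_demo, sleep_demo,
                     by.x = c("Id", "ActivityDate"),
                     by.y = c("Id", "SleepDay"),
                     all.x = TRUE)

# Rows without a matching sleep record carry NA in the sleep columns
n_total      <- nrow(merged_demo)
n_with_sleep <- sum(!is.na(merged_demo$TotalMinutesAsleep))
cat(n_with_sleep, "of", n_total, "activity days have a sleep record\n")
```

Running the same two counts on `merged_data` would show how many activity days actually contribute to any sleep-related plot.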
# Create a scatter plot of TotalSteps vs. Calories
ggplot(data=merged_data, aes(x=TotalSteps, y=Calories)) +
  geom_point(alpha=0.5, color="dodgerblue") +
  geom_smooth(method="loess", color="red") + # Adds a smooth trend line
  labs(title="Daily Steps vs. Calories Burned",
       x="Total Steps per Day",
       y="Calories Burned") +
  theme_light() # Applies a clean theme to the plot
## `geom_smooth()` using formula = 'y ~ x'
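The scatter plot shows the relationship visually; a correlation coefficient would put a number on it. A sketch using base R's `cor()` on the six rows printed by `head(activity)` above. These rows all belong to one participant, so the resulting value is illustrative only and says nothing about the full dataset:

```r
# Six rows copied from the head(activity) output above (a single participant,
# so the resulting coefficient is illustrative only)
steps    <- c(11004, 17609, 12736, 13231, 12041, 10970)
calories <- c(1819, 2154, 1944, 1932, 1886, 1820)

# Pearson correlation coefficient; ranges from -1 to 1
r <- cor(steps, calories, method = "pearson")
cat("Pearson r:", round(r, 2), "\n")

# On the full data, the equivalent call would be (assuming merged_data exists):
# cor(merged_data$TotalSteps, merged_data$Calories, use = "complete.obs")
```

Pearson's coefficient measures linear association only, so it complements the loess curve rather than replacing it.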
# To get the average steps per day, we first need to aggregate our data
daily_avg_steps <- merged_data %>%
  group_by(DayOfWeek) %>%
  summarise(AverageSteps = mean(TotalSteps, na.rm = TRUE))
# Now, create the bar plot
ggplot(data=daily_avg_steps, aes(x=DayOfWeek, y=AverageSteps, fill=DayOfWeek)) +
  geom_col() + # geom_col is used for bar charts where you provide the y-value
  labs(title="Average Steps by Day of the Week",
       x="Day of the Week",
       y="Average Steps") +
  theme(axis.text.x = element_text(angle=45, hjust=1), # Rotates x-axis labels
        legend.position="none") # Hides the redundant legend
# Keep only rows where TotalMinutesAsleep is NOT NA
sleep_data_complete <- merged_data %>%
  filter(!is.na(TotalMinutesAsleep))
# Now, build the plot using this new, clean dataframe
ggplot(data=sleep_data_complete, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
  geom_point(color="purple") +
  # This adds a perfect 1:1 reference line in red
  geom_abline(color="red", linetype="dashed") +
  labs(title="Time Asleep vs. Time in Bed",
       subtitle="Red line represents perfect sleep efficiency (100%)",
       x="Total Minutes Asleep",
       y="Total Time Spent in Bed (minutes)") +
  theme_light()
Insight: Nearly all data points fall above the red reference line, indicating that users consistently spend more time in bed than they are actually asleep. This gap represents an opportunity for Bellabeat to market features that improve sleep quality and efficiency, not just duration.
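The gap between the points and the reference line can be expressed as a sleep efficiency ratio, `TotalMinutesAsleep / TotalTimeInBed`. A minimal sketch using the six rows printed by `head(sleep)` earlier; these rows belong to a single participant, so they illustrate the calculation rather than summarize the dataset:

```r
# Six rows copied from the head(sleep) output above
asleep <- c(327, 384, 412, 340, 700, 304)   # TotalMinutesAsleep
in_bed <- c(346, 407, 442, 367, 712, 320)   # TotalTimeInBed

# Sleep efficiency: share of time in bed actually spent asleep
efficiency <- asleep / in_bed
awake_in_bed <- in_bed - asleep              # minutes awake while in bed

print(round(100 * efficiency, 1))
print(awake_in_bed)
```

On the full data, the same ratio could be added as a column to `sleep_data_complete` and summarized per participant, which would show whether low efficiency is widespread or concentrated in a few users.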
This final phase is about delivering our conclusions and actionable recommendations based on the analysis.
Our analysis of the Fitbit user data reveals several key behavioral patterns. The user base is characterized by a high level of sedentary time: the mean of 995.3 sedentary minutes per day works out to more than 16 hours. Daily step counts rise with calories burned, but the mean of 6,547 steps per day suggests that the average user does not consistently meet commonly recommended activity levels. Sleep duration is generally adequate, with a mean of about 7 hours asleep per night (419.2 minutes), but our analysis of sleep efficiency shows a clear opportunity for improvement.
Based on these insights, we recommend the following three strategies for the Bellabeat marketing team:
Focus Marketing on Small, Achievable Habit Changes: Given the high sedentary minutes, many users may feel overwhelmed by a goal like 10,000 steps. Bellabeat’s marketing should focus on the app’s ability to help users make small, incremental changes. Campaigns could highlight features like “stand-up reminders” or celebrate modest goals.
Launch Targeted Engagement Campaigns: Our analysis showed that activity levels dip on certain days, particularly Sundays. Bellabeat should use this insight to create targeted in-app notifications and marketing campaigns. A “Sunday Reset” or “Weekend Warrior” challenge could motivate users to stay active during these lulls.
Emphasize and Market Sleep Quality Features: While competitors may focus on tracking sleep duration, Bellabeat has an opportunity to stand out by focusing on sleep quality. Marketing materials should highlight how Bellabeat products can help users understand their sleep efficiency, positioning the brand as a more advanced and holistic wellness tool.
[1] Tudor-Locke, C., Craig, C. L., Brown, W. J., Clemes, S. A., De Cocker, K., Giles-Corti, B., … & Blair, S. N. (2011). How many steps/day are enough? For adults. International Journal of Behavioral Nutrition and Physical Activity, 8(1), 1-17.