This high-level study follows the six phases of the data analysis life cycle: Ask, Prepare, Process, Analyze, Share, and Act.
Bellabeat wants to better understand how users use their smart devices so that it can better position the marketing strategy for its own products.
The key stakeholders are: Urška Sršen, Chief Creative Officer and Bellabeat’s co-founder; Sando Mur, mathematician and Bellabeat’s co-founder; and Bellabeat’s marketing analytics team, a team of data analysts.
The dataset used in this study was downloaded from kaggle.com, where it was uploaded by the user ‘Mobius’. It can be accessed via this link. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to May 12, 2016).
Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The dataset has several limitations: a small sample of 30 users, data that is about 9 years old, and a collection period of only 2 months. Demographic information is also missing. Because of these constraints, the analysis used hourly data rather than minute-level data.
Minute-level data was generally excluded from this analysis due to the limited sample size; such detail was unnecessary for high-level insights sought in this project.
Additionally, specific metrics like METs (Metabolic Equivalent of Task, a measure of the energy expenditure of an activity relative to resting energy use, recorded by the minute) were omitted for the same reasons: the limited sample size and their redundancy for this high-level analysis.
The weight data included records from only 13 unique users, which was deemed too small a sample to yield meaningful insights, leading to its exclusion.
Caloric data was also not analyzed, primarily because the dataset lacked critical demographic variables such as age and gender, which are essential for interpreting calorie expenditure meaningfully.
DailyActivity (03/12/2016 – 04/11/2016 & 04/12/2016 – 05/12/2016): Records total daily steps taken, grouped by date.
hourlySteps (03/12/2016 – 04/11/2016 & 04/12/2016 – 05/12/2016): Contains hourly step counts recorded for each day.
DailySleep (04/12/2016 – 05/12/2016): Includes total minutes spent asleep and awake, grouped by date.
R was chosen for this project for its data manipulation and analysis capabilities, particularly when working with large-scale datasets exceeding one million rows. Additionally, my proficiency with R libraries made it a more effective choice than Python for this analysis.
Data was collected from two distinct time periods and imported from CSV files:
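library(dplyr)      # %>%, distinct(), n_distinct(), glimpse(), group_by(), summarise()
library(lubridate)  # mdy(), mdy_hms()
# 3.12.16 - 4.11.16: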
Activity1 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv')
Steps_h1 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv')
# 4.12.16 - 5.12.16:
Activity2 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
Steps_h2 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
Sleep_d <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
Data from the two periods was stacked (rows appended), and the datasets were prepared for further processing.
A join was not needed because the files for both periods have the same column headers.
#Stack:
Activity <- rbind(Activity1, Activity2)
Steps_h <- rbind(Steps_h1, Steps_h2)
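Stacking with rbind() relies on both data frames having the same column names, so it can be worth checking that explicitly before combining. The following is a minimal sketch, assuming two small made-up data frames; the names period1 and period2 and their values are illustrative and not taken from the Fitbit export.

```r
# Toy stand-ins for two exports of the same table (values are made up).
period1 <- data.frame(Id = c(1, 2), TotalSteps = c(5000, 7200))
period2 <- data.frame(Id = c(1, 3), TotalSteps = c(6100, 8300))

# rbind() matches data frame columns by name, so confirm the headers agree first.
same_headers <- identical(names(period1), names(period2))
stopifnot(same_headers)

# Append the rows of the second period below the first.
stacked <- rbind(period1, period2)
nrow(stacked)
```

If the headers differed, the stopifnot() call would halt the script before any rows were combined.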
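Datasets were inspected for unique IDs to confirm the user count in each dataset. Note that the activity and steps tables contain more distinct IDs than the 30 users cited in the dataset description.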
n_distinct(Activity$Id) #result: 35
n_distinct(Steps_h$Id) #result: 35
n_distinct(Sleep_d$Id) #result: 24
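Beyond counting distinct IDs, it may help to see which users are missing from a table and how many records each user contributed. The sketch below uses base R on made-up Ids; the objects activity, sleep, no_sleep, and records_per_user are hypothetical names introduced only for this illustration.

```r
# Toy tables (made-up Ids): four users with activity, two of them with sleep logs.
activity <- data.frame(Id = c(101, 101, 102, 103, 104))
sleep    <- data.frame(Id = c(101, 103, 103))

# Distinct users per table; a base R equivalent of dplyr::n_distinct().
n_activity <- length(unique(activity$Id))
n_sleep    <- length(unique(sleep$Id))

# Users who logged activity but never logged sleep.
no_sleep <- setdiff(unique(activity$Id), unique(sleep$Id))

# Records per user, useful for spotting users with very few entries.
records_per_user <- table(activity$Id)
```

A check like this makes it easier to judge whether a smaller table, such as the sleep data, still covers enough users to compare against the activity data.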
Data was inspected for duplicate, missing and NULL values:
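#Activity:
Activity <- Activity %>% distinct() # drop exact duplicate rows
glimpse(Activity) # column names, types and sample values
sum(duplicated(Activity)) # remaining duplicate rows; expect 0
is.null(Activity) #PASS (tests only whether the object itself is NULL, not its cells)
colSums(is.na(Activity)) # PASS (missing values per column)
#Steps
Steps_h <- Steps_h %>% distinct() # drop exact duplicate rows
glimpse(Steps_h)
sum(duplicated(Steps_h)) # expect 0
is.null(Steps_h) #PASS
colSums(is.na(Steps_h)) #PASS
#Sleep:
Sleep_d <- Sleep_d %>% distinct() # drop exact duplicate rows
glimpse(Sleep_d)
sum(duplicated(Sleep_d)) # expect 0
class(Sleep_d) # confirm the object is a data frame
is.null(Sleep_d) #PASS
colSums(is.na(Sleep_d)) #PASS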
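The three checks above answer different questions, which a small example can make concrete: duplicated() counts repeated rows, is.na() flags missing cells, and is.null() only reports whether the object itself is NULL. This is a sketch on a made-up data frame (df and its values are assumptions for illustration).

```r
# Made-up data: one exact duplicate row and one missing value.
df <- data.frame(
  Id = c(1, 1, 2, 3),
  TotalSteps = c(5000, 5000, NA, 7000)
)

n_dupes   <- sum(duplicated(df))   # rows that repeat an earlier row
na_counts <- colSums(is.na(df))    # missing cells per column
obj_null  <- is.null(df)           # FALSE for any existing data frame

# Drop exact duplicates, keeping the first occurrence (like dplyr::distinct()).
df_clean <- df[!duplicated(df), ]
```

Here is.null() returns FALSE even though the table contains a missing value, so the per-column is.na() count is the check that actually detects missing data.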
Date and time fields were formatted appropriately for time-series analysis:
#Activity
Activity$ActivityDate <- mdy(Activity$ActivityDate)
Activity$DayOfWeek <- format(as.Date(Activity$ActivityDate), "%A")
#Sleep
Sleep_d$SleepDay <- mdy_hms(Sleep_d$SleepDay)
Sleep_d$DayOfWeek <- format(as.Date(Sleep_d$SleepDay), "%A")
Sleep_d$TotalInHours <- Sleep_d$TotalMinutesAsleep/60
Sleep_d$TotalTimeInBed_H <- Sleep_d$TotalTimeInBed/60
#Steps
Steps_h$ActivityHour <- mdy_hms(Steps_h$ActivityHour)
Steps_h$TimeOnly <- format(Steps_h$ActivityHour, "%I:%M %p")
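One thing to watch with text labels such as "01:00 PM" or weekday names is that they sort alphabetically rather than chronologically when grouped or plotted. A possible alternative, sketched here with lubridate on made-up timestamps (raw, hour_of_day, and weekday_num are hypothetical names), is to keep a numeric hour and a numeric weekday alongside the labels.

```r
library(lubridate)

# Timestamps in the same month/day/year 12-hour layout as the export (made-up values).
raw <- c("4/12/2016 1:00:00 PM", "4/12/2016 1:00:00 AM", "4/13/2016 11:00:00 AM")
ts  <- mdy_hms(raw)   # parsed as UTC date-times

# Numeric hour (0-23) sorts chronologically, unlike "01:00 AM" / "01:00 PM" text.
hour_of_day <- hour(ts)

# Day of week as a number with Monday = 1; does not depend on the locale.
weekday_num <- wday(ts, week_start = 1)
```

Grouping or ordering by these numeric columns keeps hours and weekdays in their natural order, while the formatted labels can still be used for display.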
Data was grouped by day of the week, and new datasets were created with additional statistical measures to assess activity patterns. While total step counts highlight overall engagement across all users, they can be heavily influenced by highly active individuals. Therefore, both mean and median values were calculated: the mean provides insight into overall group activity, whereas the median better represents the behavior of a “typical” user by minimizing the impact of outliers.
#Daily Activity:
daily_averages <- Activity %>%
  group_by(DayOfWeek) %>%
  summarise(
    Sum_Steps = sum(TotalSteps),
    Mean_DailySteps = mean(TotalSteps),
    Median_DailySteps = median(TotalSteps)
  )
#Hourly_Activity:
hourly_averages <- Steps_h %>%
  group_by(TimeOnly) %>%
  summarise(
    TotalByHour = sum(StepTotal),
    Mean_HSteps = mean(StepTotal),
    Median_H = median(StepTotal)
  )
#Sleep:
daily_sleep <- Sleep_d %>%
  group_by(DayOfWeek) %>%
  summarise(
    TotalByDay = sum(TotalInHours),
    Mean_Sleep = mean(TotalInHours),
    Median_Sleep = median(TotalInHours),
    TimeInBed_H = sum(TotalTimeInBed_H),
    Median_TimeInBed = median(TotalTimeInBed_H)
  )
daily_sleep$Median_Awake <- daily_sleep$Median_TimeInBed - daily_sleep$Median_Sleep
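To illustrate why both mean and median were calculated, the sketch below uses made-up step totals in which one user is far more active than the rest. The numbers are invented for illustration and are not drawn from the dataset.

```r
# Made-up daily step totals for five users; the last user is far more active.
steps <- c(4000, 5000, 6000, 5000, 30000)

mean_steps   <- mean(steps)     # pulled upward by the single high value
median_steps <- median(steps)   # middle value, unaffected by the extreme
```

In this toy case the mean is 10,000 steps while the median is 5,000, so the median is closer to what most of the five users actually walked.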
Users are significantly more active after noon, with peak activity between 5-7 PM, and a noticeable drop-off after 8 PM.
Highest total steps are recorded on Tuesdays and Saturdays, but the median steps fall below 10,000 across all days, with the lowest activity on Sundays.
Users sleep the least on Mondays and most on Wednesdays (median close to 8 hours), with a steady decline toward the weekend, especially by Friday.
Users spend the most time awake in bed on Sundays, with the lowest awake time midweek, especially on Tuesdays.
Schedule fitness-related promotions (workout gear, supplements, or fitness classes) between 5 and 7 PM, when users are most active. After 8 PM, switch to wellness and recovery-focused offers (e.g., mobility tools, yoga mats, foam rollers), and push sleep-related products (sleep music, calming sounds, supplements) between 11 PM and 12 AM, when activity sharply declines.
Run marketing campaigns aimed at highly active users during the peak engagement window (5-7 PM) to capitalize on that activity and drive urgency.
Introduce a step goal incentive program that rewards users who hit 10,000+ steps with perks such as exclusive in-app discounts or badges, motivating them to be more active. When users reach the goal, suggest a premium subscription that offers more insight into their data.
Launch social challenges, where users earn rewards for encouraging and improving the activity levels of their less active friends, leveraging social networks of users to increase platform engagement.
On Sundays, when activity dips and awake-in-bed time peaks, promote wellness and relaxation products like meditation apps, yoga classes, and calming teas to align with users seeking rest and recovery. Partner with sleep-focused brands to run Friday and Sunday evening campaigns that target users who sleep less.