Overview of the Data Analysis Process

This high-level study will pass through the six phases of the data analysis life cycle:

  • Ask
  • Prepare
  • Process
  • Analyze
  • Share
  • Act

Ask:

Business Task

Bellabeat would like to better understand how users use their smart devices in order to inform the marketing strategy for its own products.

The key stakeholders include:

  • Urška Sršen: Chief Creative Officer and Bellabeat’s co-founder.
  • Sando Mur: mathematician and Bellabeat’s co-founder.
  • Bellabeat’s marketing analytics team: a team of data analysts.

Prepare:

About the data:

The dataset used in this study was downloaded from kaggle.com, where it was uploaded by the user ‘Mobius’. It can be accessed via this link. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.

Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Limitations of this dataset:

The dataset has several limitations: a small sample size of 30 users, its age (about 9 years old), and a collection period of only 2 months. Additionally, demographic information is missing. Due to these constraints, the analysis was conducted using hourly data instead of minute-level data.

Data that was not used:

Minute-level data was generally excluded from this analysis due to the limited sample size; such detail was unnecessary for the high-level insights sought in this project.

Additionally, specific metrics like METs (Metabolic Equivalent of Task, a measure of the energy expenditure of an activity relative to resting energy use, recorded at minute level) were omitted for the same reasons: the limited sample size and their redundancy for this high-level analysis.

The weight data included records from only 13 unique users, which was deemed too small a sample to yield meaningful insights, leading to its exclusion.

Caloric data was also not analyzed, primarily because the dataset lacked critical demographic variables such as age and gender, which are essential for interpreting calorie expenditure meaningfully.

Data used:

  • DailyActivity (03/12/2016 – 04/11/2016 & 04/12/2016 – 05/12/2016): Records total daily steps taken, grouped by date.

  • hourlySteps (03/12/2016 – 04/11/2016 & 04/12/2016 – 05/12/2016): Contains hourly step counts recorded for each day.

  • DailySleep (04/12/2016 – 05/12/2016): Includes total minutes asleep and total time in bed, grouped by date.

Process:

R was chosen for this project for its data manipulation and analysis capabilities, particularly when working with large datasets exceeding one million rows. Additionally, my proficiency with R libraries made it a more effective choice than Python for this analysis.

Data Import:

Data was collected from two distinct time periods and imported from CSV files:

# Packages used below: dplyr (%>%, distinct, glimpse, n_distinct) and lubridate (mdy, mdy_hms)
  library(dplyr)
  library(lubridate)

# 3.12.16 - 4.11.16:
  Activity1 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv')
  Steps_h1 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv')

# 4.12.16 - 5.12.16:
  Activity2 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
  Steps_h2 <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')

# Sleep data (4.12.16 - 5.12.16 only):
  Sleep_d <- read.csv('/Users/osama/Work/capstone R/capstone/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')

Data Consolidation:

Data from the two periods was stacked (combined), and the datasets were prepared for further processing.

A join was not needed because the files for both periods have the same headers, so the rows were simply appended.

    #Stack:
  Activity <- rbind(Activity1, Activity2)
  Steps_h <- rbind(Steps_h1, Steps_h2)
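
Because stacking relies on the headers matching, it can help to check this explicitly before calling rbind(). A minimal sketch on two small made-up data frames follows; the helper name `stack_if_same_headers` and the toy values are assumptions for illustration, not part of the original analysis:

```r
# Hypothetical helper: stack two data frames only if their headers match exactly.
stack_if_same_headers <- function(first, second) {
  # identical() checks both the column names and their order
  if (!identical(names(first), names(second))) {
    stop("Column headers differ; the files cannot be stacked safely.")
  }
  rbind(first, second)
}

# Toy data standing in for the two export periods (values are made up)
period1 <- data.frame(Id = c(1, 2), TotalSteps = c(5000, 7000))
period2 <- data.frame(Id = c(1, 3), TotalSteps = c(6500, 8200))

stacked <- stack_if_same_headers(period1, period2)
nrow(stacked)  # 4: two rows from each period
```

If the headers differ, the helper stops with an error instead of stacking mismatched columns.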

User Count Validation:

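Datasets were inspected for unique IDs to confirm the user count for each dataset. The activity and step data each contain 35 unique IDs, more than the 30 users given in the dataset description, while the sleep data covers 24 users.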

n_distinct(Activity$Id) #result: 35
n_distinct(Steps_h$Id) #result: 35
n_distinct(Sleep_d$Id) #result: 24
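
To see how the two periods contribute to the combined count, the unique IDs of each period can be compared before stacking. A minimal sketch using base R set functions on made-up ID vectors; `ids_period1` and `ids_period2` are hypothetical stand-ins for Activity1$Id and Activity2$Id:

```r
# Made-up user IDs for each collection period
ids_period1 <- c(101, 102, 103, 104)
ids_period2 <- c(103, 104, 105)

# Users present in both periods
in_both <- intersect(ids_period1, ids_period2)

# Users present in only one of the two periods
only_one <- c(setdiff(ids_period1, ids_period2),
              setdiff(ids_period2, ids_period1))

# All distinct users; the same count n_distinct() reports on the stacked data
all_ids <- union(ids_period1, ids_period2)

length(all_ids)  # 5
```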

Data Cleaning:

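Data was inspected for duplicate rows and missing or NULL values:

  #Activity:
  glimpse(Activity)
  sum(duplicated(Activity))            # duplicate rows found before removal
  Activity <- Activity %>% distinct()  # keep one copy of each row
  is.null(Activity)                    # FALSE: the object exists #PASS
  colSums(is.na(Activity))             # NA count per column #PASS

  #Steps:
  glimpse(Steps_h)
  sum(duplicated(Steps_h))             # duplicate rows found before removal
  Steps_h <- Steps_h %>% distinct()    # keep one copy of each row
  is.null(Steps_h)                     # FALSE: the object exists #PASS
  colSums(is.na(Steps_h))              # NA count per column #PASS

  #Sleep:
  glimpse(Sleep_d)
  class(Sleep_d)                       # confirm it is a data frame
  sum(duplicated(Sleep_d))             # duplicate rows found before removal
  Sleep_d <- Sleep_d %>% distinct()    # keep one copy of each row
  is.null(Sleep_d)                     # FALSE: the object exists #PASS
  colSums(is.na(Sleep_d))              # NA count per column #PASS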

Data Formatting:

Date and time fields were formatted appropriately for time-series analysis:

  #Activity: parse month/day/year strings into dates and derive the weekday name
  Activity$ActivityDate <- mdy(Activity$ActivityDate)
  Activity$DayOfWeek <- format(as.Date(Activity$ActivityDate), "%A")

  #Sleep: parse date-times, derive the weekday name, convert minutes to hours
  Sleep_d$SleepDay <- mdy_hms(Sleep_d$SleepDay)
  Sleep_d$DayOfWeek <- format(as.Date(Sleep_d$SleepDay), "%A")
  Sleep_d$TotalInHours <- Sleep_d$TotalMinutesAsleep/60
  Sleep_d$TotalTimeInBed_H <- Sleep_d$TotalTimeInBed/60

  #Steps: parse date-times and keep a 12-hour clock label for each hour
  Steps_h$ActivityHour <- mdy_hms(Steps_h$ActivityHour)
  Steps_h$TimeOnly <- format(Steps_h$ActivityHour, "%I:%M %p")
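
One thing to watch with the TimeOnly labels: they are character strings, so grouping or plotting by them orders the hours alphabetically rather than chronologically. A minimal sketch of one way to keep clock order, using base R on made-up timestamps; the `hours` data frame and the `HourOfDay` column are assumptions, not part of the original code:

```r
# Made-up hourly timestamps in UTC, standing in for Steps_h$ActivityHour
hours <- data.frame(
  ActivityHour = as.POSIXct(
    c("2016-04-12 14:00:00", "2016-04-12 01:00:00", "2016-04-12 09:00:00"),
    tz = "UTC"
  )
)

# Numeric hour of day (0-23) sorts chronologically
hours$HourOfDay <- as.integer(format(hours$ActivityHour, "%H"))

# 12-hour label, built the same way as TimeOnly in the original code
hours$TimeOnly <- format(hours$ActivityHour, "%I:%M %p")

# Order rows by the numeric hour, then fix the label order as factor levels
hours <- hours[order(hours$HourOfDay), ]
hours$TimeOnly <- factor(hours$TimeOnly, levels = unique(hours$TimeOnly))

levels(hours$TimeOnly)  # labels now follow clock order
```

With the labels stored as an ordered factor, group_by() and most plotting functions keep the hours in clock order.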

Analyze:

Data Grouping & Statistical Measures:

Data was grouped by day of the week, and new datasets were created with additional statistical measures to assess activity patterns. While total step counts highlight overall engagement across all users, they can be heavily influenced by highly active individuals. Therefore, both mean and median values were calculated: the mean provides insight into overall group activity, whereas the median better represents the behavior of a “typical” user by minimizing the impact of outliers.
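
As a small illustration of why both measures are reported, consider a made-up set of daily step counts in which one user is far more active than the rest (the values are hypothetical and not taken from the dataset):

```r
# Hypothetical daily step counts for five users; the last one is an outlier
steps <- c(4000, 5000, 6000, 7000, 28000)

mean(steps)    # 10000: pulled upward by the single very active user
median(steps)  # 6000: the middle value, unaffected by the outlier's size
```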

  #Daily Activity:
  daily_averages <- Activity %>%
    group_by(DayOfWeek) %>% 
    summarise(
      Sum_Steps = sum(TotalSteps),
      Mean_DailySteps = mean(TotalSteps),
      Median_DailySteps = median(TotalSteps)
    )

  #Hourly_Activity:
  hourly_averages <- Steps_h %>%
    group_by(TimeOnly) %>%
    summarise(
      TotalByHour = sum(StepTotal),
      Mean_HSteps = mean(StepTotal),
      Median_H = median(StepTotal)
    )
    
  #Sleep:
  daily_sleep <- Sleep_d %>%
    group_by(DayOfWeek) %>%
    summarise(
      TotalByDay = sum(TotalInHours),
      Mean_Sleep = mean(TotalInHours),
      Median_Sleep = median(TotalInHours),
      TimeInBed_H = sum(TotalTimeInBed_H),
      Median_TimeInBed = median(TotalTimeInBed_H)
    )
  # Median time awake in bed, taken as median time in bed minus median time asleep
  daily_sleep$Median_Awake <- daily_sleep$Median_TimeInBed - daily_sleep$Median_Sleep
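
Note that Median_Awake is the difference between two medians, which is not necessarily the same as the median of each night's awake time. A sketch of the alternative, computing awake time per record before summarising, is shown here on made-up values using base R; the `nights` data frame and the `AwakeInBed_H` column are hypothetical:

```r
# Made-up sleep records for three nights (hours)
nights <- data.frame(
  TotalInHours     = c(6.0, 7.0, 8.0),
  TotalTimeInBed_H = c(8.0, 7.5, 8.2)
)

# Awake time for each night, then its median
nights$AwakeInBed_H <- nights$TotalTimeInBed_H - nights$TotalInHours
median_of_differences <- median(nights$AwakeInBed_H)

# Difference of the two medians, as computed for Median_Awake
difference_of_medians <- median(nights$TotalTimeInBed_H) - median(nights$TotalInHours)

median_of_differences   # 0.5
difference_of_medians   # 1
```

The two figures can differ noticeably on skewed data, so it may be worth comparing both before interpreting the awake-in-bed results.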

Share:

Key Visualizations & Observations:

Four visualizations were created to highlight key insights:

  • Total daily step counts broken down by day of the week.
  • Hourly step counts across a 24-hour period to identify peak activity times.
  • Total hours spent asleep, segmented by day of the week.
  • Total hours spent awake while in bed, segmented by day of the week.
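
The plotting code is not shown in this write-up. As a rough sketch of how the first chart could be produced, assuming the ggplot2 package (which the text does not mention) and a made-up stand-in for daily_averages:

```r
library(ggplot2)

# Made-up totals standing in for daily_averages (not the real results)
toy_daily <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"),
  Sum_Steps = c(120, 150, 130, 125, 110, 145, 100)
)

# Fix the weekday order; otherwise the bars are sorted alphabetically
toy_daily$DayOfWeek <- factor(toy_daily$DayOfWeek, levels = toy_daily$DayOfWeek)

steps_plot <- ggplot(toy_daily, aes(x = DayOfWeek, y = Sum_Steps)) +
  geom_col() +
  labs(x = "Day of week", y = "Total steps", title = "Total steps by day of week")

steps_plot
```

The same pattern would apply to the sleep charts by swapping in the relevant summary column.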

1) Daily step counts broken down by day of the week:

Key observations from the daily step count analysis:

  1. Participants were most active on Tuesdays, with a total of 1,593,790 steps, followed closely by Saturdays at 1,542,702 steps.

  2. The median daily step count fell below the recommended 10,000 steps per day on every day of the week.

  3. The lowest median step count was recorded on Sundays, at just 5,600 steps.

2) Hourly step counts across a 24-hour period:

Key observations from the hourly step count analysis:

  1. Users are more active in the afternoon, with step counts increasing after noon compared to morning hours.

  2. The peak hour for total step activity is between 6 PM and 7 PM, with 1,053,728 total steps.

  3. The median step count is highest between 5 PM and 6 PM, at 282 steps.

  4. After 8 PM, activity trends noticeably downward, with a sharper decline after 10 PM, when total step counts drop to 386,395, a significant reduction from the evening peak.

  5. The largest jump in activity occurs between 6 AM and 7 AM, when the median step count doubles following the overnight decline that begins after 11 PM.

3) Total hours spent sleeping over a week:

Key observations from the weekly sleep data analysis:

  1. The median sleep duration is generally close to 8 hours, with the lowest median being 6.75 hours.

  2. Total sleep hours show that users sleep the least on Mondays and the most on Wednesdays, with an upward trend in sleep from Monday to Wednesday, followed by a gradual decline towards the weekend.

  3. The median sleep duration is highest on Sundays at 8.01 hours and lowest on Fridays at 6.75 hours.

4) Total hours spent awake while in bed over the course of a week:

Key observations from the hours spent awake (in bed) analysis:

  1. The median time awake in bed is highest on Sundays at 0.76 hours and lowest on Tuesdays at 0.48 hours.

  2. There is a general downward trend from Monday to Wednesday, followed by a gradual increase that continues through to Sunday.

Act:

Insights:

  1. Users are significantly more active after noon, with peak activity between 5-7 PM, and a noticeable drop-off after 8 PM.

  2. The highest total steps are recorded on Tuesdays and Saturdays, but median steps fall below 10,000 on all days, with the lowest median on Sundays.

  3. Total sleep hours are lowest on Mondays and highest on Wednesdays, with a steady decline toward the weekend. Median sleep is lowest on Fridays (6.75 hours) and highest on Sundays (8.01 hours).

  4. Users spend the most time awake in bed on Sundays, with the lowest awake time midweek, especially on Tuesdays.

Recommendations & next steps:

  1. Schedule fitness-related promotions (workout gear, supplements, or fitness classes) between 5 and 7 PM, when users are most active. After 8 PM, switch to wellness and recovery-focused offers (e.g., mobility tools, yoga mats, foam rollers), and push sleep-related products (sleep music, calming sounds, supplements) between 11 PM and 12 AM, when activity sharply declines.

  2. Run marketing campaigns aimed at highly active users during the peak engagement window (5-7 PM) to capitalize on high user activity and drive urgency.

  3. Introduce a step-goal incentive program that rewards users who hit 10,000+ steps with perks such as exclusive in-app discounts or badges to motivate them to be more active. When they reach the goal, suggest a premium subscription offering deeper insights into their data.

  4. Launch social challenges in which users earn rewards for encouraging and improving the activity levels of their less active friends, leveraging users’ social networks to increase platform engagement.

  5. On Sundays, when activity dips and awake-in-bed time peaks, promote wellness and relaxation products such as meditation apps, yoga classes, and calming teas, in line with users’ need for rest and recovery. Partner with sleep-focused brands to run Friday and Sunday evening campaigns targeting users with poorer sleep patterns.