Introduction

Bellabeat wants to improve user engagement with its wellness app and devices. Analyze smart device usage data to identify trends that can help Bellabeat improve user engagement and marketing strategy.

The key stakeholders include:

  • Urška Sršen: Chief Creative Officer and Bellabeat’s Co-founder.
  • Sando Mur: Mathematician and Bellabeat’s Co-founder.
  • Bellabeat’s marketing analytics team: a team of data analysts.

Business Problem - ASK

Business Problem Statement

Bellabeat wants to improve user engagement and encourage consistent activity tracking among its users. The company needs to understand how users interact with their devices and identify patterns that can inform marketing campaigns and product design.

Key Business Questions:

  1. How often do users track daily activity?
  2. Do users maintain consistent activity?
  3. How active are users when they track?
  4. How can engagement be improved?

Data Source - PREPARE

Dataset: Fitbit Fitness Tracker Data on Kaggle, published by Möbius (CC0: Public Domain)

Brief: Fitbit Fitness Tracker Data, collected via Amazon Mechanical Turk from 35 users in 2016, offers minute-level physical activity, heart rate, and sleep monitoring. It is accessible on Kaggle.

Table used: dailyActivity_merged.csv

Data Cleaning - PROCESS

The dataset was checked for missing values, duplicates, and data types. No missing or duplicate entries were found. Numeric columns such as TotalSteps and Calories were correctly typed as numeric; a small number of zero-step and zero-calorie days are examined later in this section.

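# Load the packages used in this report: tidyverse (readr, dplyr, ggplot2) and janitor (clean_names)
library(tidyverse)
library(janitor)
# Load the data set; trim_ws = TRUE trims leading and trailing whitespace from each field
daily_activity <- read_csv("dailyActivity_merged.csv", trim_ws = TRUE)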


glimpse(daily_activity)
## Rows: 457
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "3/25/2016", "3/26/2016", "3/27/2016", "3/28/…
## $ TotalSteps               <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 122…
## $ TotalDistance            <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ TrackerDistance          <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3.3…
## $ ModeratelyActiveDistance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0.8…
## $ LightActiveDistance      <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3.6…
## $ SedentaryActiveDistance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ VeryActiveMinutes        <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43, 3…
## $ FairlyActiveMinutes      <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, 18…
## $ LightlyActiveMinutes     <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244, …
## $ SedentaryMinutes         <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 636…
## $ Calories                 <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 186…
# Check missing values and duplicates
cat(
  "\n",
  "Missing values:",
  sum(is.na(daily_activity)),
  "\n",
  "Duplicate values:",
  sum(duplicated(daily_activity)),
  "\n",
  "Unique Ids:",
  n_distinct(daily_activity$Id)
)
## 
##  Missing values: 0 
##  Duplicate values: 0 
##  Unique Ids: 35
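
The duplicate check above compares entire rows. A complementary check, sketched here on a small made-up data frame (the name `toy` and all values are illustrative and not taken from the dataset), looks for repeated user-day keys, which would inflate tracked-day counts even when other columns differ. It uses base R only, so it runs without additional packages.

```r
# Toy data: user 1 has two rows for the same date with different step counts
toy <- data.frame(
  Id           = c(1, 1, 1, 2),
  ActivityDate = c("3/25/2016", "3/25/2016", "3/26/2016", "3/25/2016"),
  TotalSteps   = c(1000, 1200, 500, 800)
)

# Full-row duplicates: rows that are identical in every column
full_row_dups <- sum(duplicated(toy))

# Key-level duplicates: repeated (Id, ActivityDate) pairs, regardless of other columns
key_dups <- sum(duplicated(toy[, c("Id", "ActivityDate")]))

cat("Full-row duplicates:", full_row_dups, "\n")
cat("Key-level duplicates:", key_dups, "\n")
```

On the real table the same idea would apply to the Id and ActivityDate columns; whether any key-level duplicates exist there is not something this sketch can tell.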

Cleaning

  • Convert column names to lowercase snake_case, because R is case-sensitive and consistent names are easier to work with.
  • Convert “ActivityDate” from character to Date.

# Cleaning column names and Correcting column types
daily_activity <-
  clean_names(daily_activity) %>%
  mutate(activity_date = as.Date(activity_date, format = "%m/%d/%Y"))

# Checking daily_activity dataset after cleaning
glimpse(daily_activity)
## Rows: 457
## Columns: 15
## $ id                         <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date              <date> 2016-03-25, 2016-03-26, 2016-03-27, 2016-0…
## $ total_steps                <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 1…
## $ total_distance             <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ tracker_distance           <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance       <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3…
## $ moderately_active_distance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0…
## $ light_active_distance      <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3…
## $ sedentary_active_distance  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes        <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43,…
## $ fairly_active_minutes      <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, …
## $ lightly_active_minutes     <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244…
## $ sedentary_minutes          <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 6…
## $ calories                   <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 1…
# Number of tracked days (rows) per user
user_days_track <- daily_activity %>%
  count(id)

summary(user_days_track)
##        id                  n        
##  Min.   :1.504e+09   Min.   : 8.00  
##  1st Qu.:2.610e+09   1st Qu.:10.50  
##  Median :4.445e+09   Median :12.00  
##  Mean   :4.845e+09   Mean   :13.06  
##  3rd Qu.:6.869e+09   3rd Qu.:12.00  
##  Max.   :8.878e+09   Max.   :32.00

Key finding

Initial validation showed that every user has at least 8 tracked days, which suggests that the dataset was curated to include users with sufficient activity history. Because users with fewer than 8 days are absent, engagement metrics likely overestimate early retention, and real-world first-week churn may be higher than what is observed in this analysis.

# Explore the full dataset using summary()
summary(daily_activity)
##        id            activity_date         total_steps    total_distance  
##  Min.   :1.504e+09   Min.   :2016-03-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.347e+09   1st Qu.:2016-04-02   1st Qu.: 1988   1st Qu.: 1.410  
##  Median :4.057e+09   Median :2016-04-05   Median : 5986   Median : 4.090  
##  Mean   :4.629e+09   Mean   :2016-04-04   Mean   : 6547   Mean   : 4.664  
##  3rd Qu.:6.392e+09   3rd Qu.:2016-04-08   3rd Qu.:10198   3rd Qu.: 7.160  
##  Max.   :8.878e+09   Max.   :2016-04-12   Max.   :28497   Max.   :27.530  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.00    Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 1.28    1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 4.09    Median :0.0000             Median : 0.000      
##  Mean   : 4.61    Mean   :0.1794             Mean   : 1.181      
##  3rd Qu.: 7.11    3rd Qu.:0.0000             3rd Qu.: 1.310      
##  Max.   :27.53    Max.   :6.7271             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.00         Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 0.87         1st Qu.:0.000000         
##  Median :0.0200             Median : 2.93         Median :0.000000         
##  Mean   :0.4786             Mean   : 2.89         Mean   :0.001904         
##  3rd Qu.:0.6700             3rd Qu.: 4.46         3rd Qu.:0.000000         
##  Max.   :6.4000             Max.   :12.51         Max.   :0.100000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.: 64.0         
##  Median :  0.00      Median :  1.00        Median :181.0         
##  Mean   : 16.62      Mean   : 13.07        Mean   :170.1         
##  3rd Qu.: 25.00      3rd Qu.: 16.00        3rd Qu.:257.0         
##  Max.   :202.00      Max.   :660.00        Max.   :720.0         
##  sedentary_minutes    calories   
##  Min.   :  32.0    Min.   :   0  
##  1st Qu.: 728.0    1st Qu.:1776  
##  Median :1057.0    Median :2062  
##  Mean   : 995.3    Mean   :2189  
##  3rd Qu.:1285.0    3rd Qu.:2667  
##  Max.   :1440.0    Max.   :4562

This overall summary helps us explore each attribute quickly. Several attributes have a minimum value of zero (total_steps, total_distance, calories), which is worth a closer look.

During EDA, we found that days with zero recorded steps can still show non-zero calorie expenditure, which most plausibly reflects basal metabolic energy consumption. Total calorie estimates therefore appear to include both active and resting energy expenditure.

Possible reasons for TotalSteps = 0 with non-zero calories:

  • User didn’t wear tracker properly (steps not captured)
  • Tracker captured heart rate but not steps
  • User was truly inactive
  • Device imputes BMR regardless of activity

All of these are plausible. Let us take a closer look.

daily_activity %>%
  filter(total_steps == 0) %>%
  summarise(
    min_cal = min(calories),
    max_cal = max(calories),
    mean_cal = mean(calories)
  )
## # A tibble: 1 × 3
##   min_cal max_cal mean_cal
##     <dbl>   <dbl>    <dbl>
## 1       0    4562    1575.
daily_activity %>% 
  filter(total_steps == 0  & calories >= 2000) %>% 
  select(id, total_steps, very_active_minutes, fairly_active_minutes, calories)
## # A tibble: 8 × 5
##           id total_steps very_active_minutes fairly_active_minutes calories
##        <dbl>       <dbl>               <dbl>                 <dbl>    <dbl>
## 1 2891001357           0                   0                   660     4562
## 2 6290855005           0                  33                     0     2664
## 3 6290855005           0                   0                     0     2060
## 4 6290855005           0                   0                     0     2060
## 5 6290855005           0                   0                     0     2060
## 6 6290855005           0                   0                     0     2060
## 7 6290855005           0                   0                     0     2060
## 8 6290855005           0                   0                     0     2060

Based on this summary, we can infer that non-zero calorie values recorded on zero-step days primarily reflect basal metabolic energy expenditure. Although the maximum calorie value appears high, further validation showed that the corresponding record contains non-zero fairly active minutes, indicating that the device was worn and activity was detected. Therefore, this value does not necessarily represent invalid data and was retained, while being interpreted cautiously in step-based analyses.
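
The distinctions drawn above can be made explicit by labeling each zero-step day. The following is a minimal sketch on made-up rows (loosely modeled on the records shown above); the `day_type` column and its labels are hypothetical names introduced here for illustration, and they encode one possible reading of the data rather than a confirmed cause.

```r
# Toy daily records (illustrative values only)
toy <- data.frame(
  total_steps            = c(0, 0, 0, 5000),
  very_active_minutes    = c(0, 0, 0, 20),
  fairly_active_minutes  = c(0, 660, 0, 10),
  lightly_active_minutes = c(0, 0, 0, 200),
  calories               = c(0, 4562, 2060, 2300)
)

# Total active minutes per day
active <- toy$very_active_minutes +
  toy$fairly_active_minutes +
  toy$lightly_active_minutes

# Label each day; the labels are assumptions, not device-reported states
toy$day_type <- ifelse(
  toy$total_steps > 0, "steps recorded",
  ifelse(
    toy$calories == 0, "likely non-wear",
    ifelse(active > 0, "active without steps", "resting calories only")
  )
)

print(toy[, c("total_steps", "calories", "day_type")])
```

Keeping the label as a separate column leaves the original rows intact, so step-based analyses can filter on it while calorie-based analyses keep every record.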

Exploratory Data Analysis and Actionable KPIs - ANALYZE

I performed broad EDA initially to understand distributions and data quality, but for the final analysis I focused only on variables that directly answered the business questions and translated them into actionable KPIs.

Business Questions

Q1: How often do users track daily activity?

To understand user engagement, we want to know how frequently users record their daily activity. This helps identify whether users are consistently tracking their activity or only sporadically using the tracker.

Analysis Steps

  • For each user, the number of distinct activity days was calculated.
#calculating the number of tracked days per user
tracked_days <- daily_activity %>%
  group_by(id) %>%
  summarise(tracked_days = n_distinct(activity_date))

print(tracked_days)
## # A tibble: 35 × 2
##            id tracked_days
##         <dbl>        <int>
##  1 1503960366           19
##  2 1624580081           19
##  3 1644430081           10
##  4 1844505072           12
##  5 1927972279           12
##  6 2022484408           12
##  7 2026352035           12
##  8 2320127002           12
##  9 2347167796           15
## 10 2873212765           12
## # ℹ 25 more rows
summary(tracked_days$tracked_days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   10.50   12.00   13.06   12.00   32.00
  • Users were grouped into three tracking frequency categories:
| Category | Definition |
|---|---|
| Frequent trackers | ≥ 20 tracked days |
| Moderate trackers | 10–19 tracked days |
| Rare trackers | < 10 tracked days |
#Categorizing the users based on tracking level 
tracked_days <- tracked_days %>%
  mutate(tracking_level = case_when(
    tracked_days >= 20 ~ "Frequent trackers",
    tracked_days >= 10 & tracked_days < 20 ~ "Moderate trackers",
    tracked_days < 10 ~ "Rare trackers"
  ))

print(tracked_days)
## # A tibble: 35 × 3
##            id tracked_days tracking_level   
##         <dbl>        <int> <chr>            
##  1 1503960366           19 Moderate trackers
##  2 1624580081           19 Moderate trackers
##  3 1644430081           10 Moderate trackers
##  4 1844505072           12 Moderate trackers
##  5 1927972279           12 Moderate trackers
##  6 2022484408           12 Moderate trackers
##  7 2026352035           12 Moderate trackers
##  8 2320127002           12 Moderate trackers
##  9 2347167796           15 Moderate trackers
## 10 2873212765           12 Moderate trackers
## # ℹ 25 more rows
  • KPIs computed:
    • % of users in each category → main KPI
    • Average tracked days per category → supporting KPI
#Calculating the percentage of users and average tracked days for each category
tracked_days <- tracked_days %>%
  group_by(tracking_level) %>%
  summarise(
    users = n(),
    avg_days_tracked = round(mean(tracked_days), 0)
  ) %>%
  mutate(
    percent_users = round(users / sum(users) * 100, 1)
  )

print(tracked_days)
## # A tibble: 3 × 4
##   tracking_level    users avg_days_tracked percent_users
##   <chr>             <int>            <dbl>         <dbl>
## 1 Frequent trackers     2               32           5.7
## 2 Moderate trackers    28               13          80  
## 3 Rare trackers         5                8          14.3
# Visualization of Q1 - How often do users track daily activity?
ggplot(tracked_days, aes(x = tracking_level, y = percent_users)) +
  geom_col(fill = "#1f77b4") +
  geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
  labs(
    title = "User Tracking Frequency Distribution",
    x = "Tracking Frequency",
    y = "Percentage of Users"
  )

Insights / Key Findings

  • Near-daily tracking is concentrated among a small number of users.
  • Only 5.7% of users (2 out of 35) track activity frequently, averaging 32 tracked days, meaning they track on nearly every day of the 32-day window covered by the dataset (2016-03-12 to 2016-04-12).
  • The majority of users (80%, 28 users) are moderate trackers, averaging 13 tracked days, showing intermittent engagement rather than consistent daily tracking.
  • 14.3% of users (5 users) track activity rarely, averaging 8 days, highlighting a segment with very low engagement.
  • While the minimum and maximum tracked days indicate the full range of user behavior, the distribution across tracking categories and percentage of users provide the clearest picture of engagement patterns and are the key KPIs.
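
Raw day counts depend on the length of the observation window, so it can help to express them as a rate. The sketch below uses hypothetical tracked-day counts and takes the window from the dates reported by `summary()` above; it assumes every user was observable for the whole window, which the source does not confirm. It also shows `cut()` as a base R way to reproduce the three categories.

```r
# Hypothetical tracked-day counts for five users
tracked <- data.frame(id = 1:5, tracked_days = c(8, 10, 12, 19, 32))

# Observation window from the summary() output (2016-03-12 to 2016-04-12), inclusive
window_days <- as.integer(as.Date("2016-04-12") - as.Date("2016-03-12")) + 1

# Share of the window on which each user tracked
tracked$tracking_rate <- round(tracked$tracked_days / window_days, 2)

# right = FALSE makes the intervals [0, 10), [10, 20), [20, Inf)
tracked$tracking_level <- cut(
  tracked$tracked_days,
  breaks = c(0, 10, 20, Inf),
  labels = c("Rare trackers", "Moderate trackers", "Frequent trackers"),
  right  = FALSE
)

print(tracked)
```

A rate makes the categories easier to carry over to datasets with a different window length, provided the assumption about the observation period holds.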

Q2: Do users maintain consistent activity?

  • Tracking frequency tells us how often users track, but not how consistent their activity is on the days they track.
  • Q2 asks: Do users have steady activity levels, or do they fluctuate a lot day-to-day?
  • This is important to understand habit formation and user engagement quality.

Analysis Steps

Note: Users with zero total step counts were retained in the cleaned data set to reflect sedentary behavior; however, they were excluded from step-based consistency calculations, as variability metrics such as coefficient of variation are undefined when mean activity equals zero.

  • Choosing the metrics for activity consistency
| Metric | Description |
|---|---|
| Standard deviation (SD) | How much daily steps vary for each user |
| Coefficient of variation (CV) | SD divided by mean → normalizes for users with different activity levels |
  • Aggregating user-level activity metrics for further analysis

    For a step-based consistency analysis, users were first evaluated at the user level, and only those with a non-zero total step count were included in variability calculations using a post-aggregation filtering approach (analogous to a HAVING condition), implemented via a semi_join.
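
To make the CV definition and the zero-mean exclusion concrete, here is a small worked sketch in base R. The three step series are invented for illustration, and `cv()` is a helper defined here, not a function from any package.

```r
# Hypothetical step counts for three users over five days
steps <- list(
  steady  = c(9000, 10000, 11000, 10000, 10000),
  erratic = c(0, 12000, 500, 9000, 1000),
  no_data = c(0, 0, 0, 0, 0)
)

# Coefficient of variation: sample standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

cv_values <- sapply(steps, cv)
print(round(cv_values, 2))

# A user whose steps are all zero has sd = 0 and mean = 0, so the ratio is NaN
print(is.nan(cv_values[["no_data"]]))
```

The all-zero series is why such users are filtered out before the variability metrics are computed: the ratio 0 / 0 carries no information about consistency.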

# Identifying the valid users
valid_users <- daily_activity %>%
  group_by(id) %>%
  summarise(total_steps_sum = sum(total_steps, na.rm = TRUE)) %>%
  filter(total_steps_sum > 0)

# filtering only valid users for metric calculations
user_activity_variability <- daily_activity %>% 
  semi_join(valid_users, by = "id")

#calculating mean (avg_steps), standard deviation (sd_steps) and coefficient of variation (cv_steps) for total step count at user level
user_activity_variability <- user_activity_variability %>%
  group_by(id) %>%
  summarise(
    avg_steps = mean(total_steps, na.rm = TRUE),
    sd_steps = sd(total_steps, na.rm = TRUE),
    cv_steps = sd_steps / avg_steps
  )

print(user_activity_variability)
## # A tibble: 34 × 4
##            id avg_steps sd_steps cv_steps
##         <dbl>     <dbl>    <dbl>    <dbl>
##  1 1503960366    11641.    3330.    0.286
##  2 1624580081     4226.    4414.    1.04 
##  3 1644430081     9275.    5658.    0.610
##  4 1844505072     3641.    2868.    0.788
##  5 1927972279     2181.    1626.    0.745
##  6 2022484408    12175.    3851.    0.316
##  7 2026352035     3393.    1698.    0.501
##  8 2320127002     3138.    3432.    1.09 
##  9 2347167796     9800.    3661.    0.374
## 10 2873212765     6637.    4009.    0.604
## # ℹ 24 more rows
  • Users were categorized by step-count consistency using the coefficient of variation: CV ≤ 0.25 is consistent, 0.25 < CV ≤ 0.75 is moderately consistent, and CV > 0.75 is inconsistent.
#Categorizing the users based on coefficient of variation
user_activity_variability <- user_activity_variability %>%
  mutate(activity_consistency = case_when(
    cv_steps <= 0.25 ~ "Consistent",
    cv_steps > 0.25 & cv_steps <= 0.75 ~ "Moderately consistent",
    cv_steps > 0.75 ~ "Inconsistent"
  ))

print(user_activity_variability)
## # A tibble: 34 × 5
##            id avg_steps sd_steps cv_steps activity_consistency 
##         <dbl>     <dbl>    <dbl>    <dbl> <chr>                
##  1 1503960366    11641.    3330.    0.286 Moderately consistent
##  2 1624580081     4226.    4414.    1.04  Inconsistent         
##  3 1644430081     9275.    5658.    0.610 Moderately consistent
##  4 1844505072     3641.    2868.    0.788 Inconsistent         
##  5 1927972279     2181.    1626.    0.745 Moderately consistent
##  6 2022484408    12175.    3851.    0.316 Moderately consistent
##  7 2026352035     3393.    1698.    0.501 Moderately consistent
##  8 2320127002     3138.    3432.    1.09  Inconsistent         
##  9 2347167796     9800.    3661.    0.374 Moderately consistent
## 10 2873212765     6637.    4009.    0.604 Moderately consistent
## # ℹ 24 more rows
  • KPIs computed:
    • % of users in each consistency category → main KPI
    • Average daily steps per category
    • Average variability (CV) per category

Step-based consistency analysis was conducted on 34 users with non-zero total step counts to ensure valid variability measurement.

#  Summary table creation with KPIs
consistency_summary <- user_activity_variability %>%
  group_by(activity_consistency) %>%
  summarise(
    users = n(),
    avg_daily_steps = round(mean(avg_steps), 0),
    avg_CV = round(mean(cv_steps), 2)
  ) %>%
  mutate(percent_users = round(users / sum(users) * 100, 1))

print(consistency_summary)
## # A tibble: 3 × 5
##   activity_consistency  users avg_daily_steps avg_CV percent_users
##   <chr>                 <int>           <dbl>  <dbl>         <dbl>
## 1 Consistent                3           10343   0.22           8.8
## 2 Inconsistent             11            3070   1.37          32.4
## 3 Moderately consistent    20            8202   0.48          58.8
ggplot(consistency_summary, aes(x = activity_consistency, y = percent_users)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
  labs(
    title = "User Activity Consistency by Step count",
    x = "Consistency Level",
    y = "Percentage of Users"
  )

Insights / Key Findings

  • The majority of users (58.8%) exhibit moderately consistent activity, indicating that most users maintain a reasonably stable activity pattern but with noticeable day-to-day variation.
  • Nearly one-third of users (32.4%) are classified as inconsistent, suggesting irregular engagement with physical activity, potentially due to fluctuating routines, motivation, or device usage patterns.
  • Only a small proportion of users (8.8%) demonstrate highly consistent activity, highlighting that sustained, routine activity is relatively uncommon among users.
  • Overall, the distribution suggests that while users are generally active, consistency remains a challenge, with most users falling short of maintaining steady daily activity levels.
  • The presence of a sizable inconsistent group indicates an opportunity for intervention, such as reminders, goal-setting features, or personalized nudges to encourage more regular engagement.

Q2.1: Parallel Analysis: Consistency Using Active Minutes

Why active minutes for “user consistency” analysis?

  • Steps capture volume of movement
  • Active minutes capture intentional activity
  • Comparing both tells us whether users are:
    • consistently moving, and/or
    • consistently being active

Analysis Steps
  • Defining the active minutes metric: very active, fairly active, and lightly active minutes are summed to produce total active minutes.

# Calculating the total active minutes on daily basis
daily_activity <- daily_activity %>%
  mutate(
    total_active_minutes =
      very_active_minutes +
      fairly_active_minutes +
      lightly_active_minutes
  )

print(daily_activity)
## # A tibble: 457 × 16
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <date>              <dbl>          <dbl>            <dbl>
##  1 1503960366 2016-03-25          11004           7.11             7.11
##  2 1503960366 2016-03-26          17609          11.6             11.6 
##  3 1503960366 2016-03-27          12736           8.53             8.53
##  4 1503960366 2016-03-28          13231           8.93             8.93
##  5 1503960366 2016-03-29          12041           7.85             7.85
##  6 1503960366 2016-03-30          10970           7.16             7.16
##  7 1503960366 2016-03-31          12256           7.86             7.86
##  8 1503960366 2016-04-01          12262           7.87             7.87
##  9 1503960366 2016-04-02          11248           7.25             7.25
## 10 1503960366 2016-04-03          10016           6.37             6.37
## # ℹ 447 more rows
## # ℹ 11 more variables: logged_activities_distance <dbl>,
## #   very_active_distance <dbl>, moderately_active_distance <dbl>,
## #   light_active_distance <dbl>, sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>,
## #   total_active_minutes <dbl>
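# Optional check (kept commented out): list users whose active minutes sum to zero
# daily_activity %>%
#   group_by(id) %>%
#   summarise(total_active_minutes_sum = sum(total_active_minutes, na.rm = TRUE)) %>%
#   filter(total_active_minutes_sum == 0)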
  • Aggregating user-level activity metrics for further analysis

Note: To avoid undefined variability metrics, users with zero total active minutes were excluded from the user-level active-minutes consistency analysis.

# Identifying the valid users for active minutes analysis
valid_active_users <- daily_activity %>%
  group_by(id) %>%
  summarise(total_active_minutes_sum = sum(total_active_minutes, na.rm = TRUE)) %>%
  filter(total_active_minutes_sum > 0)

# filtering only valid users for metric calculations
active_user_activity_variability <- daily_activity %>%
  semi_join(valid_active_users, by = "id")

# calculating mean (avg_active_minutes), standard deviation (sd_active_minutes) and coefficient of variation (cv_active_minutes) for total active minutes at user level
active_user_activity_variability <- active_user_activity_variability %>%
  group_by(id) %>%
  summarise(
    avg_active_minutes = mean(total_active_minutes, na.rm = TRUE),
    sd_active_minutes = sd(total_active_minutes, na.rm = TRUE),
    cv_active_minutes = sd_active_minutes / avg_active_minutes
  )

print(active_user_activity_variability)
## # A tibble: 34 × 4
##            id avg_active_minutes sd_active_minutes cv_active_minutes
##         <dbl>              <dbl>             <dbl>             <dbl>
##  1 1503960366               280.              80.9             0.289
##  2 1624580081               122.              61.3             0.501
##  3 1644430081               286              175.              0.611
##  4 1844505072               160              126.              0.790
##  5 1927972279               113.              74.6             0.658
##  6 2022484408               316.              75.2             0.238
##  7 2026352035               169.              62.7             0.371
##  8 2320127002               128.             133.              1.04 
##  9 2347167796               288.              98.5             0.341
## 10 2873212765               286.             137.              0.478
## # ℹ 24 more rows
  • Users were categorized by active-minutes consistency using the coefficient of variation, with the same thresholds as the step-based analysis (0.25 and 0.75).
active_user_activity_variability <- active_user_activity_variability %>%
  mutate(active_consistency = case_when(
    cv_active_minutes <= 0.25 ~ "Consistent",
    cv_active_minutes > 0.25 & cv_active_minutes <= 0.75 ~ "Moderately consistent",
    cv_active_minutes > 0.75 ~ "Inconsistent"
  ))

print(active_user_activity_variability)
## # A tibble: 34 × 5
##            id avg_active_minutes sd_active_minutes cv_active_minutes
##         <dbl>              <dbl>             <dbl>             <dbl>
##  1 1503960366               280.              80.9             0.289
##  2 1624580081               122.              61.3             0.501
##  3 1644430081               286              175.              0.611
##  4 1844505072               160              126.              0.790
##  5 1927972279               113.              74.6             0.658
##  6 2022484408               316.              75.2             0.238
##  7 2026352035               169.              62.7             0.371
##  8 2320127002               128.             133.              1.04 
##  9 2347167796               288.              98.5             0.341
## 10 2873212765               286.             137.              0.478
## # ℹ 24 more rows
## # ℹ 1 more variable: active_consistency <chr>
  • KPIs computed:
    • % of users in each consistency category → main KPI
    • Average active minutes per category
    • Average variability (CV) per category

The active-minutes consistency analysis was conducted on 34 users with non-zero total active minutes to ensure valid variability measurement. The excluded user is the same user who was excluded from the step-based analysis for having a zero total step count.

active_user_consistency_summary <- active_user_activity_variability %>%
  group_by(active_consistency) %>%
  summarise(
    users = n(),
    avg_active_minutes = round(mean(avg_active_minutes), 0),
    avg_CV = round(mean(cv_active_minutes), 2)
  ) %>%
  mutate(
    percent_users = round(users / sum(users) * 100, 1)
  )

print(active_user_consistency_summary)
## # A tibble: 3 × 5
##   active_consistency    users avg_active_minutes avg_CV percent_users
##   <chr>                 <int>              <dbl>  <dbl>         <dbl>
## 1 Consistent                4                320   0.23          11.8
## 2 Inconsistent             10                123   1.28          29.4
## 3 Moderately consistent    20                234   0.42          58.8
ggplot(active_user_consistency_summary, aes(x = active_consistency, y = percent_users)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
  labs(
    title = "User Activity Consistency by Active Minutes",
    x = "Consistency Level",
    y = "Percentage of Users"
  )

Insights / Key Findings

  • Only ~12% of users are truly consistent in active minutes
  • Nearly 30% are highly inconsistent → large day-to-day swings
  • Average active minutes drop sharply for inconsistent users (123 vs 320)

Users who are inconsistent don’t just fluctuate — they are much less active overall.

Q2 overall comparison insights:

A parallel consistency analysis using total active minutes produced results similar to the steps-based analysis. In both cases, most users fall into the moderately consistent category, with only a small proportion maintaining stable activity levels. Users classified as inconsistent not only exhibit higher variability but also record substantially lower average activity. Users with zero activity were excluded from consistency analysis, as variability metrics are not meaningful in the absence of recorded activity.
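
One way to quantify how similar the two classifications are is to cross-tabulate each user's step-based label against the active-minutes label. The sketch below uses six hypothetical users with invented labels; on the real data the two user-level tables would first have to be joined by `id`.

```r
# Hypothetical consistency labels for six users under each metric
labels <- data.frame(
  id = 1:6,
  by_steps = c("Consistent", "Moderately consistent", "Moderately consistent",
               "Inconsistent", "Inconsistent", "Moderately consistent"),
  by_minutes = c("Consistent", "Consistent", "Moderately consistent",
                 "Inconsistent", "Moderately consistent", "Moderately consistent")
)

# Rows: step-based label; columns: active-minutes label
agreement_table <- table(steps = labels$by_steps, minutes = labels$by_minutes)
print(agreement_table)

# Share of users who receive the same label under both metrics
agreement_rate <- mean(labels$by_steps == labels$by_minutes)
cat("Share of users with the same label:", round(agreement_rate, 2), "\n")
```

Matching category percentages do not guarantee that the same users sit in each category, so the diagonal of this table is a stricter test of agreement than comparing the two summary tables.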

Q3. How active are users when they track?

Q1 (tracking frequency) tells us how often users track, not how consistent their activity is on the days they track.
Q2 tells us whether users have steady activity levels or fluctuate a lot from day to day.
Q3 focuses on the intensity and quality of activity on tracked days.
It answers the question “When users do wear the device, how active are they really?”

Analysis Steps

  • Defining the analysis population: For this business question, only non-wear days (TotalSteps = 0 and Calories = 0) were excluded; days or users with zero steps or zero active minutes but non-zero calories were retained to reflect sedentary or non-step-based activity.

  • Defining the activity intensity metrics:
    • Key variables: VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes
    • Derived variable: Total Active Minutes = Very + Fairly + Lightly

Note: Total active minutes were already calculated on the daily_activity table.

  • Aggregating user-level activity intensity metrics. For each user:
    • Daily averages: total active minutes, very active minutes, fairly active minutes, lightly active minutes, and sedentary minutes
    • Proportion of the day spent at each intensity
# calculating user-level metrics
activity_intensity <- daily_activity %>% 
  filter(!(total_steps == 0 & calories == 0)) %>% 
  group_by(id) %>% 
  # calculating daily average for the metrics 
  summarise(
    avg_total_active = mean(total_active_minutes, na.rm = TRUE),
    avg_very_active = mean(very_active_minutes, na.rm = TRUE),
    avg_light_active = mean(lightly_active_minutes, na.rm = TRUE),
    avg_fair_active = mean(fairly_active_minutes, na.rm = TRUE),
    avg_sedentary = mean(sedentary_minutes, na.rm = TRUE)
    
  ) %>% 
  # calculating the proportion of time spent in each intensity (1 day = 1440 minutes )
  mutate(
    pct_sedentary = round((avg_sedentary / 1440) * 100, 2),
    pct_light     = round((avg_light_active / 1440) * 100, 2),
    pct_fair      = round((avg_fair_active / 1440) * 100, 2),
    pct_very      = round((avg_very_active / 1440) * 100, 2)
  )

print(activity_intensity)
## # A tibble: 35 × 10
##            id avg_total_active avg_very_active avg_light_active avg_fair_active
##         <dbl>            <dbl>           <dbl>            <dbl>           <dbl>
##  1 1503960366             280.          35.8               228.          15.8  
##  2 1624580081             122.           0.737             121.           0.579
##  3 1644430081             286           14.8               228.          43.5  
##  4 1844505072             160            0.75              158.           0.75 
##  5 1927972279             113.           0                 112.           1.67 
##  6 2022484408             316.          40.1               254.          22.5  
##  7 2026352035             169.           0                 169.           0    
##  8 2320127002             128.           0.917             126.           1.08 
##  9 2347167796             288.          11.8               254.          23.1  
## 10 2873212765             312.           5.55              300.           6.55 
## # ℹ 25 more rows
## # ℹ 5 more variables: avg_sedentary <dbl>, pct_sedentary <dbl>,
## #   pct_light <dbl>, pct_fair <dbl>, pct_very <dbl>
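
Because each share is computed against a fixed 1,440-minute day, the four percentages need not add up to 100. Any remainder is time that the daily table does not attribute to one of the four intensity levels. The sketch below uses made-up values rather than rows from the data set, and the column names `recorded_minutes` and `pct_unrecorded` are hypothetical; it shows one possible way to surface that remainder.

```r
library(dplyr)

# Hypothetical per-user averages (minutes/day); not taken from the data set
example_users <- tibble(
  id = c(1, 2),
  avg_sedentary = c(840, 1200),
  avg_light_active = c(230, 120),
  avg_fair_active = c(20, 5),
  avg_very_active = c(30, 1)
)

example_users <- example_users %>%
  mutate(
    # Minutes attributed to any of the four intensity levels
    recorded_minutes = avg_sedentary + avg_light_active +
      avg_fair_active + avg_very_active,
    # Share of a 1,440-minute day not attributed to any level
    pct_unrecorded = round((1440 - recorded_minutes) / 1440 * 100, 2)
  )

print(example_users)
```

A large `pct_unrecorded` value would be a hint that the sedentary share for that user understates or overstates real behavior, so it may be worth checking before comparing users.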
  • Categorizing users by activity intensity
    Creating activity-level buckets using each user's average daily total active minutes:
      Low activity: ≤ 150 min/day
      Moderate activity: > 150 and ≤ 300 min/day
      High activity: > 300 min/day
# categorizing users by activity intensity 
activity_intensity <- activity_intensity %>% 
  mutate(active_intensity = case_when(
    avg_total_active <= 150 ~ "low activity",
    avg_total_active > 150 & avg_total_active <= 300 ~ "moderate activity",
    avg_total_active > 300 ~ "high activity"
  ))

print(activity_intensity)
## # A tibble: 35 × 11
##            id avg_total_active avg_very_active avg_light_active avg_fair_active
##         <dbl>            <dbl>           <dbl>            <dbl>           <dbl>
##  1 1503960366             280.          35.8               228.          15.8  
##  2 1624580081             122.           0.737             121.           0.579
##  3 1644430081             286           14.8               228.          43.5  
##  4 1844505072             160            0.75              158.           0.75 
##  5 1927972279             113.           0                 112.           1.67 
##  6 2022484408             316.          40.1               254.          22.5  
##  7 2026352035             169.           0                 169.           0    
##  8 2320127002             128.           0.917             126.           1.08 
##  9 2347167796             288.          11.8               254.          23.1  
## 10 2873212765             312.           5.55              300.           6.55 
## # ℹ 25 more rows
## # ℹ 6 more variables: avg_sedentary <dbl>, pct_sedentary <dbl>,
## #   pct_light <dbl>, pct_fair <dbl>, pct_very <dbl>, active_intensity <chr>
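
As an alternative sketch (not the method used in this report), the same buckets can be built with base R's `cut()`, which returns a factor whose levels follow the low, moderate, high order. The input values here are hypothetical and were chosen to sit on and around the cut-offs.

```r
# Hypothetical averages (minutes/day) on and around the cut-offs
avg_total_active <- c(96, 150, 150.5, 300, 322)

# right = TRUE closes each interval on the right:
# (-Inf, 150], (150, 300], (300, Inf)
active_intensity <- cut(
  avg_total_active,
  breaks = c(-Inf, 150, 300, Inf),
  labels = c("low activity", "moderate activity", "high activity"),
  right = TRUE
)

print(data.frame(avg_total_active, active_intensity))
```

A factor keeps the buckets in their natural order in summaries and plots, whereas a character column is sorted alphabetically (high, low, moderate).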
  • Sedentary behavior analysis for health insights
      1. Do users with higher active minutes still spend a large amount of time sedentary?
      2. Is sedentary time inversely related to activity level?
# calculating sedentary behavior metrics across intensity level
sedentary_behavior <- activity_intensity %>%
  group_by(active_intensity) %>% 
  summarise(
    users = n(),
    avg_sedentary_min = mean(avg_sedentary, na.rm = TRUE),
    avg_active_min = mean(avg_total_active, na.rm = TRUE)
  ) %>% 
  mutate(
    percent_users = round(users / sum(users) * 100, 1)
  )

print(sedentary_behavior)
## # A tibble: 3 × 5
##   active_intensity  users avg_sedentary_min avg_active_min percent_users
##   <chr>             <int>             <dbl>          <dbl>         <dbl>
## 1 high activity         7              958.          322.           20  
## 2 low activity         12             1187.           96.1          34.3
## 3 moderate activity    16              841.          244.           45.7
ggplot(sedentary_behavior, aes(x = active_intensity, y = percent_users)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
  labs(
    title = "Distribution of Users by Activity Intensity Level",
    x = "Intensity Level",
    y = "Percentage of Users"
  )

Insights and Key Findings:

  • Moderately active users are the largest group: nearly 46% of users fall into the moderate activity category. This suggests that many users maintain a consistent but not highly intensive level of physical activity.
  • High-activity users form the smallest segment: only 20% of users are classified as high activity. This indicates that averaging more than 300 active minutes per day is less common among users, possibly due to time, lifestyle, or physical constraints.
  • A significant portion of users show low activity levels: about 34% of users belong to the low activity category, just over one-third of the user base. This group presents a key opportunity for engagement and behavior improvement initiatives.
  • Overall, the largest share of users exhibit moderate activity levels, while a substantial portion remains low-active, highlighting opportunities to encourage gradual increases in physical activity.
ggplot(activity_intensity, aes(x = avg_total_active, y = avg_sedentary)) +
  geom_point(aes(color = active_intensity)) +
  labs(
    title = "Sedentary Behavior Across Intensity Level",
    x = "Average Active Minutes per Day",
    y = "Average Sedentary Minutes per Day"
  )

Key Findings: Sedentary Behavior vs Activity Level

  • Sedentary time remains high even among active users: high-activity users average ~322 active minutes/day, yet still spend ~958 sedentary minutes (~16 hours). High activity does not eliminate prolonged sedentary behavior.
  • Sedentary time does not fall steadily as activity rises: moving from low to moderate activity, average sedentary time drops sharply (~1,187 to ~841 minutes). Moving from moderate to high activity, it does not fall further and instead rises to ~958 minutes.
  • Weak inverse relationship between activity and sedentary time:
      Low-activity users cluster at very high sedentary minutes.
      Moderate- and high-activity users show a wide spread in sedentary time.
      There is no strong linear inverse trend.

Higher activity levels increase total movement but do not necessarily reduce overall inactivity during the day.
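
The strength of the inverse relationship described above could also be expressed as a single number. The following is a minimal sketch, assuming a rank-based (Spearman) correlation is an acceptable measure; the vectors are hypothetical stand-ins and would be replaced by `activity_intensity$avg_total_active` and `activity_intensity$avg_sedentary`.

```r
# Hypothetical per-user averages (minutes/day); not taken from the data set
avg_total_active <- c(96, 113, 122, 160, 244, 286, 312, 322)
avg_sedentary    <- c(1190, 1250, 1100, 800, 760, 900, 1010, 950)

# Spearman's rho works on ranks, so it captures monotonic association
# and is less sensitive to outliers than Pearson's r
rho <- cor(avg_total_active, avg_sedentary, method = "spearman")

print(round(rho, 2))
```

A value near -1 would indicate a strong inverse relationship, while a value closer to 0 would be consistent with the weak trend seen in the scatter plot. With only 35 users, any such coefficient should be read as indicative.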

Health Interpretation: Even users who meet high daily activity levels remain sedentary for a substantial portion of the day, highlighting the importance of reducing prolonged sitting in addition to promoting exercise.

Q4: Recommendations to Improve User Engagement - ACT

The following recommendations are derived from engagement, consistency, activity intensity, and sedentary behavior analyses (corresponding KPI tables are tracked_days, consistency_summary, active_user_consistency_summary, activity_intensity, and sedentary_behavior) presented in Q1–Q3.

Recommendation 1: Convert moderate trackers into frequent trackers

Insight: Although overall engagement exists, only 5.7% of users track activity frequently, while 80% are moderate trackers who engage intermittently (average 13 days). This indicates that most users are willing to track but lack sustained motivation.

Recommendation: Introduce tracking streaks and milestone-based rewards (e.g., 5-day, 10-day, and 20-day streak badges) with visual progress indicators to encourage moderate trackers to increase tracking frequency.
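
One way such streaks might be computed from tracked dates is sketched below. The dates are hypothetical, `longest_streak()` is an illustrative helper rather than part of this analysis, and the milestone values are the ones named in the recommendation.

```r
# Hypothetical tracked dates for one user (not from the data set)
tracked_dates <- as.Date(c(
  "2016-04-01", "2016-04-02", "2016-04-03", "2016-04-04", "2016-04-05",
  "2016-04-07", "2016-04-08",
  "2016-04-12"
))

# Length of the longest run of consecutive calendar days
longest_streak <- function(dates) {
  dates <- sort(unique(dates))
  if (length(dates) == 0) return(0L)
  # A gap of more than one day starts a new streak
  streak_id <- cumsum(c(TRUE, as.numeric(diff(dates)) != 1))
  max(tabulate(streak_id))
}

streak <- longest_streak(tracked_dates)

# Streak milestones from the recommendation (5, 10, and 20 days)
milestones <- c(5, 10, 20)
badges_earned <- milestones[milestones <= streak]

print(streak)
print(badges_earned)
```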

Expected impact: Improves daily tracking consistency, shifts users from moderate to frequent tracking behavior, and increases overall data completeness and user engagement.

Recommendation 2: Support users struggling with activity consistency

Insight: Only 8.8–12% of users demonstrate highly consistent activity, while nearly 30% show high inconsistency in both steps and active minutes, accompanied by significantly lower average activity levels.

Recommendation: Provide weekly consistency summaries highlighting activity variability and deliver gentle nudges when irregular patterns are detected, focusing on routine-building rather than performance goals.
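
A weekly consistency summary of this kind could be sketched as follows. The data are hypothetical, the coefficient of variation is assumed here as the variability measure, and the 0.5 nudge threshold is an illustrative value rather than one derived from this analysis.

```r
library(dplyr)

# Hypothetical daily steps for one user over two weeks (not from the data set)
daily_steps <- tibble(
  week = rep(c(1, 2), each = 7),
  total_steps = c(8000, 8200, 7900, 8100, 8000, 7800, 8000,
                  12000, 2000, 9000, 500, 11000, 3000, 4500)
)

weekly_summary <- daily_steps %>%
  group_by(week) %>%
  summarise(
    avg_steps = mean(total_steps),
    # Coefficient of variation: standard deviation relative to the mean
    cv_steps = sd(total_steps) / mean(total_steps)
  ) %>%
  # Assumed threshold: flag weeks whose CV exceeds 0.5
  mutate(send_nudge = cv_steps > 0.5)

print(weekly_summary)
```

In this example the first week is steady and the second is erratic, so only the second week would trigger a nudge.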

Expected impact: Helps users recognize irregular behavior early, reduces activity variability, and supports the development of more stable activity routines.

Recommendation 3: Encourage gradual progression for low-activity users

Insight: Approximately 34% of users fall into the low activity category (an average of 150 or fewer active minutes per day), indicating that a substantial portion of the user base engages primarily in light activity and does not reach moderate or high activity levels.

Recommendation: Introduce progressive, personalized goals (e.g., increasing daily active minutes by 5–10%) instead of fixed thresholds, allowing low-activity users to improve at a manageable pace.
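
A progressive goal of this kind can be illustrated with a small helper. This is a sketch only: `next_goal()` and its `cap` parameter are hypothetical, and the 10% growth rate is the upper end of the range suggested in the recommendation.

```r
# Suggest the next goal as a percentage increase over the user's own
# recent average. growth and cap are assumed parameters for illustration.
next_goal <- function(recent_avg_minutes, growth = 0.10, cap = 300) {
  # Round up to a whole minute, then never exceed the cap
  pmin(ceiling(recent_avg_minutes * (1 + growth)), cap)
}

# A low-activity user averaging 96 active minutes/day
print(next_goal(96))

# A user already close to the cap
print(next_goal(290))
```

Tying the goal to each user's own average keeps the target within reach for low-activity users, while the cap prevents the goal from growing without limit.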

Expected impact: Reduces user discouragement, increases adherence to activity goals, and supports long-term behavior change among low-activity users.

Recommendation 4: Address sedentary behavior independently from activity intensity

Insight: Even highly active users average ~16 hours of sedentary time per day, demonstrating that increased activity intensity does not necessarily reduce prolonged inactivity.

Recommendation: Implement sedentary break reminders that prompt short movement breaks (1–3 minutes) after extended periods of inactivity, regardless of users’ daily activity intensity.
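
The reminder logic could be prototyped on minute-level data, which this analysis does not use. The sketch below assumes a hypothetical logical vector with one entry per minute and an illustrative threshold of 60 consecutive sedentary minutes.

```r
# Hypothetical minute-level flags for one user: TRUE = sedentary minute
is_sedentary <- c(
  rep(TRUE, 70), rep(FALSE, 5),
  rep(TRUE, 30), rep(FALSE, 2),
  rep(TRUE, 65)
)

# Assumed threshold: remind after 60 consecutive sedentary minutes
threshold <- 60

# Run-length encode the flags to find uninterrupted sedentary stretches
runs <- rle(is_sedentary)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Minute index at which each long sedentary run first reaches the threshold
long_runs <- runs$values & runs$lengths >= threshold
reminder_minutes <- run_start[long_runs] + threshold - 1

print(reminder_minutes)
```

Here the 30-minute stretch is too short to trigger a reminder, while the 70-minute and 65-minute stretches each trigger one.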

Expected impact: Reduces prolonged sedentary periods, promotes healthier daily movement patterns, and improves overall well-being beyond total activity metrics.

Recommendation 5: Personalize engagement strategies by user segment

Insight: Users exhibit distinct engagement patterns across tracking frequency, consistency, and activity intensity, indicating that a one-size-fits-all engagement strategy may be ineffective.

Recommendation: Segment users based on tracking behavior and activity patterns, and deliver tailored messaging (e.g., motivation-focused prompts for inconsistent users, performance insights for highly active users).

Expected impact: Improves relevance of notifications, increases feature adoption, and strengthens long-term user engagement across diverse user groups.

Conclusion

This analysis examined user engagement patterns in daily activity tracking data to understand how often users track their activity, how consistently they remain active, and how activity intensity relates to sedentary behavior. The findings show that while most users engage with the tracker at least intermittently, sustained and consistent usage is limited to a small subset of users. A majority of users fall into moderate tracking and moderately consistent activity categories, indicating partial engagement rather than long-term habit formation.

Further analysis revealed that higher activity levels do not necessarily reduce sedentary time, which may indicate that users accumulate activity in concentrated periods while remaining inactive for much of the day. This highlights the importance of addressing sedentary behavior independently from promoting exercise alone. Based on these insights, targeted recommendations were proposed to improve engagement, including habit-building mechanisms, personalized nudges, progressive activity goals, and interventions to reduce prolonged sedentary time.

Overall, this project demonstrates how behavioral data can be translated into actionable product recommendations, supporting strategies that encourage sustained engagement and healthier daily movement patterns rather than focusing solely on total activity volume.

Limitations and Future Considerations

While the analysis provides meaningful insights into user engagement and activity behavior, several limitations should be considered:

1. Limited data period

The data set covers a short time window in 2016, which may not fully capture long-term behavior changes, seasonal effects, or evolving user engagement patterns. User activity habits and wearable usage have likely changed significantly since then.

2. Small sample size

The analysis is based on data from 35 users, which limits the statistical power of the findings and restricts generalization. The results should be interpreted as indicative patterns rather than population-level conclusions.

3. Not suitable for predictive modeling

Due to the small sample size, short time span, and lack of demographic or contextual features, the data set is not sufficient for robust predictive modeling or machine learning applications. This project is therefore positioned as an exploratory and descriptive analysis, rather than a data modeling exercise.

4. Lack of demographic and contextual features

The data set does not include user-level attributes such as age, gender, health status, occupation, or lifestyle factors. Without these features, it is not possible to analyze how engagement patterns differ across user segments or to control for confounding factors.

5. Device adherence and missing behavior context

Zero-step and zero-calorie days likely represent periods when the device was not worn rather than true inactivity. Although such cases were handled during analysis, the absence of explicit wear-time indicators introduces uncertainty into engagement and consistency metrics.

6. Aggregation hides intra-day behavior

Daily aggregation limits visibility into within-day activity patterns, such as time-of-day effects, activity bursts, or micro-sedentary periods. More granular data (e.g., hourly or minute-level) would enable deeper behavioral modeling.

7. Observational data limits causal inference

The data set is observational in nature, meaning relationships identified between engagement, activity intensity, and sedentary behavior should not be interpreted as causal. Controlled experiments or longitudinal interventions would be required to validate the effectiveness of proposed recommendations.