Bellabeat wants to improve user engagement with its wellness app and devices. Analyze smart device usage data to identify trends that can help Bellabeat improve user engagement and marketing strategy.
Bellabeat wants to improve user engagement and encourage consistent activity tracking among its users. The company needs to understand how users interact with their devices and identify patterns that can inform marketing campaigns and product design.
Dataset: Fitbit Fitness Tracker Data (Kaggle), by Möbius (CC0: Public Domain)
Brief: Fitbit Fitness Tracker Data, collected via Amazon Mechanical Turk from 35 users in 2016, offers minute-level physical activity, heart rate, and sleep monitoring. It is accessible on Kaggle.
Table used: dailyActivity_merged.csv
The dataset was checked for missing values, duplicates, and data types. No missing or duplicate entries were found. Numeric columns such as TotalSteps and Calories were correctly formatted, and values were within expected ranges.
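# Load the packages used in this analysis
library(tidyverse) # readr, dplyr, ggplot2
library(janitor) # clean_names()
# Load the data set and trim leading/trailing whitespace (trim_ws = TRUE)
daily_activity <- read_csv("dailyActivity_merged.csv", trim_ws = TRUE)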
glimpse(daily_activity)
## Rows: 457
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "3/25/2016", "3/26/2016", "3/27/2016", "3/28/…
## $ TotalSteps <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 122…
## $ TotalDistance <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ TrackerDistance <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, 7.…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3.3…
## $ ModeratelyActiveDistance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0.8…
## $ LightActiveDistance <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3.6…
## $ SedentaryActiveDistance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ VeryActiveMinutes <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43, 3…
## $ FairlyActiveMinutes <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, 18…
## $ LightlyActiveMinutes <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244, …
## $ SedentaryMinutes <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 636…
## $ Calories <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 186…
# Check missing values and duplicates
cat(
"\n",
"Missing values:",
sum(is.na(daily_activity)),
"\n",
"Duplicate values:",
sum(duplicated(daily_activity)),
"\n",
"Unique Ids:",
n_distinct(daily_activity$Id)
)
##
## Missing values: 0
## Duplicate values: 0
## Unique Ids: 35
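The range check mentioned in the data-quality summary is not shown in the code. A minimal sketch of what such a check could look like is below. The toy data frame `toy`, the helper `check_ranges`, and the specific rules (no negative values, at most 1440 minutes per day across the four minute columns) are illustrative assumptions, not the checks actually run for this report. Base R is used so the sketch runs without extra packages.

```r
# Hypothetical toy rows that mimic the cleaned column names; not real data.
toy <- data.frame(
  total_steps            = c(11004, 0, 5000),
  calories               = c(1819, 2060, 1500),
  very_active_minutes    = c(33, 0, 10),
  fairly_active_minutes  = c(12, 0, 700),
  lightly_active_minutes = c(205, 0, 300),
  sedentary_minutes      = c(804, 1440, 600)
)

# Assumed plausibility rules: nothing negative, and the four minute
# columns cannot add up to more than the 1440 minutes in a day.
check_ranges <- function(df) {
  minute_cols <- c("very_active_minutes", "fairly_active_minutes",
                   "lightly_active_minutes", "sedentary_minutes")
  total_minutes <- rowSums(df[, minute_cols])
  data.frame(
    negative_value   = rowSums(df < 0) > 0,
    minutes_over_day = total_minutes > 1440
  )
}

range_flags <- check_ranges(toy)
print(range_flags)
```

In this toy input only the third row is flagged, because its minute columns sum to 1610. Flagged rows would be candidates for manual review rather than automatic removal.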
Cleaning
- Convert column names to lowercase snake_case with clean_names() so they are consistent and easy to type (R is case-sensitive).
- Change the type of “ActivityDate” from character to Date.
# Cleaning column names and Correcting column types
daily_activity <-
clean_names(daily_activity) %>%
mutate(activity_date = as.Date(activity_date, format = "%m/%d/%Y"))
# Checking daily_activity dataset after cleaning
glimpse(daily_activity)
## Rows: 457
## Columns: 15
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960…
## $ activity_date <date> 2016-03-25, 2016-03-26, 2016-03-27, 2016-0…
## $ total_steps <dbl> 11004, 17609, 12736, 13231, 12041, 10970, 1…
## $ total_distance <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ tracker_distance <dbl> 7.11, 11.55, 8.53, 8.93, 7.85, 7.16, 7.86, …
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance <dbl> 2.57, 6.92, 4.66, 3.19, 2.16, 2.36, 2.29, 3…
## $ moderately_active_distance <dbl> 0.46, 0.73, 0.16, 0.79, 1.09, 0.51, 0.49, 0…
## $ light_active_distance <dbl> 4.07, 3.91, 3.71, 4.95, 4.61, 4.29, 5.04, 3…
## $ sedentary_active_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ very_active_minutes <dbl> 33, 89, 56, 39, 28, 30, 33, 47, 40, 15, 43,…
## $ fairly_active_minutes <dbl> 12, 17, 5, 20, 28, 13, 12, 21, 11, 30, 18, …
## $ lightly_active_minutes <dbl> 205, 274, 268, 224, 243, 223, 239, 200, 244…
## $ sedentary_minutes <dbl> 804, 588, 605, 1080, 763, 1174, 820, 866, 6…
## $ calories <dbl> 1819, 2154, 1944, 1932, 1886, 1820, 1889, 1…
# Number of tracked days per unique user
user_days_track <- daily_activity %>%
count(id)
summary(user_days_track)
## id n
## Min. :1.504e+09 Min. : 8.00
## 1st Qu.:2.610e+09 1st Qu.:10.50
## Median :4.445e+09 Median :12.00
## Mean :4.845e+09 Mean :13.06
## 3rd Qu.:6.869e+09 3rd Qu.:12.00
## Max. :8.878e+09 Max. :32.00
Initial validation revealed that each user had a minimum of 8 tracked days, indicating that the data set was curated to include users with sufficient activity history. Because users with fewer than 8 days are excluded, engagement metrics likely overestimate early retention. As a result, real-world first-week churn may be higher than what is observed in this analysis.
#Let us explore full dataset using summary()
summary(daily_activity)
## id activity_date total_steps total_distance
## Min. :1.504e+09 Min. :2016-03-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.347e+09 1st Qu.:2016-04-02 1st Qu.: 1988 1st Qu.: 1.410
## Median :4.057e+09 Median :2016-04-05 Median : 5986 Median : 4.090
## Mean :4.629e+09 Mean :2016-04-04 Mean : 6547 Mean : 4.664
## 3rd Qu.:6.392e+09 3rd Qu.:2016-04-08 3rd Qu.:10198 3rd Qu.: 7.160
## Max. :8.878e+09 Max. :2016-04-12 Max. :28497 Max. :27.530
## tracker_distance logged_activities_distance very_active_distance
## Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.28 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 4.09 Median :0.0000 Median : 0.000
## Mean : 4.61 Mean :0.1794 Mean : 1.181
## 3rd Qu.: 7.11 3rd Qu.:0.0000 3rd Qu.: 1.310
## Max. :27.53 Max. :6.7271 Max. :21.920
## moderately_active_distance light_active_distance sedentary_active_distance
## Min. :0.0000 Min. : 0.00 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 0.87 1st Qu.:0.000000
## Median :0.0200 Median : 2.93 Median :0.000000
## Mean :0.4786 Mean : 2.89 Mean :0.001904
## 3rd Qu.:0.6700 3rd Qu.: 4.46 3rd Qu.:0.000000
## Max. :6.4000 Max. :12.51 Max. :0.100000
## very_active_minutes fairly_active_minutes lightly_active_minutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 64.0
## Median : 0.00 Median : 1.00 Median :181.0
## Mean : 16.62 Mean : 13.07 Mean :170.1
## 3rd Qu.: 25.00 3rd Qu.: 16.00 3rd Qu.:257.0
## Max. :202.00 Max. :660.00 Max. :720.0
## sedentary_minutes calories
## Min. : 32.0 Min. : 0
## 1st Qu.: 728.0 1st Qu.:1776
## Median :1057.0 Median :2062
## Mean : 995.3 Mean :2189
## 3rd Qu.:1285.0 3rd Qu.:2667
## Max. :1440.0 Max. :4562
This overall summary helps us explore each attribute quickly. We notice that some attributes have a minimum value of zero (total_steps, total_distance, calories). Let us explore this observation.
During EDA, we found that days with zero recorded steps can still show non-zero calorie expenditure, reflecting basal metabolic energy consumption. Total calorie estimates therefore include both active and resting energy expenditure.
Possible reasons for TotalSteps = 0 + calories:
All are realistic/possible. Let us take a deeper look.
daily_activity %>%
filter(total_steps == 0) %>%
summarise(
min_cal = min(calories),
max_cal = max(calories),
mean_cal = mean(calories)
)
## # A tibble: 1 × 3
## min_cal max_cal mean_cal
## <dbl> <dbl> <dbl>
## 1 0 4562 1575.
daily_activity %>%
filter(total_steps == 0 & calories >= 2000) %>%
select(id, total_steps, very_active_minutes, fairly_active_minutes, calories)
## # A tibble: 8 × 5
## id total_steps very_active_minutes fairly_active_minutes calories
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2891001357 0 0 660 4562
## 2 6290855005 0 33 0 2664
## 3 6290855005 0 0 0 2060
## 4 6290855005 0 0 0 2060
## 5 6290855005 0 0 0 2060
## 6 6290855005 0 0 0 2060
## 7 6290855005 0 0 0 2060
## 8 6290855005 0 0 0 2060
Based on this summary, we can infer that non-zero calorie values recorded on zero-step days primarily reflect basal metabolic energy expenditure. Although the maximum calorie value appears high, further validation showed that the corresponding record contains non-zero fairly active minutes, indicating that the device was worn and activity was detected. Therefore, this value does not necessarily represent invalid data and was retained, while being interpreted cautiously in step-based analyses.
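One way to make the wear/non-wear reasoning above explicit is to label each zero-step day. The sketch below is a heuristic only: the helper name `label_zero_step_day`, the label texts, and the rule that a zero-step day with no very or fairly active minutes is "possible non-wear" are assumptions for illustration, since the data set has no explicit wear-time indicator. The first three rows reuse values from the output above; the fourth row is invented.

```r
# Rows shaped like the zero-step records shown above (last row is made up).
zero_days <- data.frame(
  total_steps           = c(0, 0, 0, 1200),
  very_active_minutes   = c(0, 33, 0, 5),
  fairly_active_minutes = c(660, 0, 0, 10),
  calories              = c(4562, 2664, 2060, 1900)
)

# Assumed labelling rule, applied element-wise to each day.
label_zero_step_day <- function(steps, very, fairly, calories) {
  active <- very + fairly
  ifelse(steps > 0, "steps recorded",
    ifelse(active > 0, "worn, non-step activity",
      ifelse(calories > 0, "possible non-wear (resting calories only)",
             "no data recorded")))
}

zero_days$day_type <- with(
  zero_days,
  label_zero_step_day(total_steps, very_active_minutes,
                      fairly_active_minutes, calories)
)
print(zero_days[, c("total_steps", "calories", "day_type")])
```

Under this rule the 4562-calorie record is labelled as worn with non-step activity, which matches the decision to retain it.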
I performed broad EDA initially to understand distributions and data quality, but for the final analysis I focused only on variables that directly answered the business questions and translated them into actionable KPIs.
To understand user engagement, we want to know how frequently users record their daily activity. This helps identify whether users are consistently tracking their activity or only sporadically using the tracker.
Analysis Steps
#calculating the number of tracked days per user
tracked_days <- daily_activity %>%
group_by(id) %>%
summarise(tracked_days = n_distinct(activity_date))
print(tracked_days)
## # A tibble: 35 × 2
## id tracked_days
## <dbl> <int>
## 1 1503960366 19
## 2 1624580081 19
## 3 1644430081 10
## 4 1844505072 12
## 5 1927972279 12
## 6 2022484408 12
## 7 2026352035 12
## 8 2320127002 12
## 9 2347167796 15
## 10 2873212765 12
## # ℹ 25 more rows
summary(tracked_days$tracked_days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 10.50 12.00 13.06 12.00 32.00
| Category | Definition |
|---|---|
| Frequent trackers | ≥ 20 tracked days |
| Moderate trackers | 10–19 tracked days |
| Rare trackers | < 10 tracked days |
#Categorizing the users based on tracking level
tracked_days <- tracked_days %>%
mutate(tracking_level = case_when(
tracked_days >= 20 ~ "Frequent trackers",
tracked_days >= 10 & tracked_days < 20 ~ "Moderate trackers",
tracked_days < 10 ~ "Rare trackers"
))
print(tracked_days)
## # A tibble: 35 × 3
## id tracked_days tracking_level
## <dbl> <int> <chr>
## 1 1503960366 19 Moderate trackers
## 2 1624580081 19 Moderate trackers
## 3 1644430081 10 Moderate trackers
## 4 1844505072 12 Moderate trackers
## 5 1927972279 12 Moderate trackers
## 6 2022484408 12 Moderate trackers
## 7 2026352035 12 Moderate trackers
## 8 2320127002 12 Moderate trackers
## 9 2347167796 15 Moderate trackers
## 10 2873212765 12 Moderate trackers
## # ℹ 25 more rows
#Calculating the percentage of users and average tracked days for each category
tracked_days <- tracked_days %>%
group_by(tracking_level) %>%
summarise(
users = n(),
avg_days_tracked = round(mean(tracked_days), 0)
) %>%
mutate(
percent_users = round(users / sum(users) * 100, 1)
)
print(tracked_days)
## # A tibble: 3 × 4
## tracking_level users avg_days_tracked percent_users
## <chr> <int> <dbl> <dbl>
## 1 Frequent trackers 2 32 5.7
## 2 Moderate trackers 28 13 80
## 3 Rare trackers 5 8 14.3
# Visualization of Q1 - How often do users track daily activity?
ggplot(tracked_days, aes(x = tracking_level, y = percent_users)) +
geom_col(fill = "#1f77b4") +
geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
labs(
title = "User Tracking Frequency Distribution",
x = "Tracking Frequency",
y = "Percentage of Users"
)
Analysis Steps
Note: Users with zero total step counts were retained in the cleaned data set to reflect sedentary behavior; however, they were excluded from step-based consistency calculations, as variability metrics such as coefficient of variation are undefined when mean activity equals zero.
| Metric | Description |
|---|---|
| Standard deviation (SD) | How much daily steps vary for each user |
| Coefficient of variation (CV) | SD divided by mean → normalizes for users with different activity levels |
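To illustrate why CV is reported alongside SD, the sketch below compares two hypothetical users whose daily steps have the same spread but different averages. The step values and the helper `cv` are invented for illustration only.

```r
# Two made-up users: identical day-to-day spread, different activity levels.
low_user  <- c(2000, 3000, 4000)
high_user <- c(10000, 11000, 12000)

# Coefficient of variation: standard deviation relative to the mean.
cv <- function(x) sd(x) / mean(x)

sd(low_user)    # 1000
sd(high_user)   # 1000
cv(low_user)    # about 0.33
cv(high_user)   # about 0.09
```

Both users have an SD of 1000 steps, but a 1000-step swing is a much larger share of a 3000-step average than of an 11000-step average, which is the difference CV captures.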
Aggregating user-level activity metrics for further analysis
For the step-based consistency analysis, steps were first summed per user, and only users with a non-zero total were kept for the variability calculations. This is a post-aggregation filter (analogous to a SQL HAVING condition), implemented via a semi_join.
# Identifying the valid users
valid_users <- daily_activity %>%
group_by(id) %>%
summarise(total_steps_sum = sum(total_steps, na.rm = TRUE)) %>%
filter(total_steps_sum > 0)
# filtering only valid users for metric calculations
user_activity_variability <- daily_activity %>%
semi_join(valid_users, by = "id")
#calculating mean (avg_steps), standard deviation (sd_steps) and coefficient of variation (cv_steps) for total step count at user level
user_activity_variability <- user_activity_variability %>%
group_by(id) %>%
summarise(
avg_steps = mean(total_steps, na.rm = TRUE),
sd_steps = sd(total_steps, na.rm = TRUE),
cv_steps = sd_steps / avg_steps
)
print(user_activity_variability)
## # A tibble: 34 × 4
## id avg_steps sd_steps cv_steps
## <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 11641. 3330. 0.286
## 2 1624580081 4226. 4414. 1.04
## 3 1644430081 9275. 5658. 0.610
## 4 1844505072 3641. 2868. 0.788
## 5 1927972279 2181. 1626. 0.745
## 6 2022484408 12175. 3851. 0.316
## 7 2026352035 3393. 1698. 0.501
## 8 2320127002 3138. 3432. 1.09
## 9 2347167796 9800. 3661. 0.374
## 10 2873212765 6637. 4009. 0.604
## # ℹ 24 more rows
#Categorizing the users based on coefficient of variation
user_activity_variability <- user_activity_variability %>%
mutate(activity_consistency = case_when(
cv_steps <= 0.25 ~ "Consistent",
cv_steps > 0.25 & cv_steps <= 0.75 ~ "Moderately consistent",
cv_steps > 0.75 ~ "Inconsistent"
))
print(user_activity_variability)
## # A tibble: 34 × 5
## id avg_steps sd_steps cv_steps activity_consistency
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1503960366 11641. 3330. 0.286 Moderately consistent
## 2 1624580081 4226. 4414. 1.04 Inconsistent
## 3 1644430081 9275. 5658. 0.610 Moderately consistent
## 4 1844505072 3641. 2868. 0.788 Inconsistent
## 5 1927972279 2181. 1626. 0.745 Moderately consistent
## 6 2022484408 12175. 3851. 0.316 Moderately consistent
## 7 2026352035 3393. 1698. 0.501 Moderately consistent
## 8 2320127002 3138. 3432. 1.09 Inconsistent
## 9 2347167796 9800. 3661. 0.374 Moderately consistent
## 10 2873212765 6637. 4009. 0.604 Moderately consistent
## # ℹ 24 more rows
Step-based consistency analysis was conducted on 34 users with non-zero total step counts to ensure valid variability measurement.
# Summary table creation with KPIs
consistency_summary <- user_activity_variability %>%
group_by(activity_consistency) %>%
summarise(
users = n(),
avg_daily_steps = round(mean(avg_steps), 0),
avg_CV = round(mean(cv_steps), 2)
) %>%
mutate(percent_users = round(users / sum(users) * 100, 1))
print(consistency_summary)
## # A tibble: 3 × 5
## activity_consistency users avg_daily_steps avg_CV percent_users
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Consistent 3 10343 0.22 8.8
## 2 Inconsistent 11 3070 1.37 32.4
## 3 Moderately consistent 20 8202 0.48 58.8
ggplot(consistency_summary, aes(x = activity_consistency, y = percent_users)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
labs(
title = "User Activity Consistency by Step count",
x = "Consistency Level",
y = "Percentage of Users"
)
Why active minutes for “user consistency” analysis?
Analysis Steps
- Defining the active minutes metric: adding all active minutes to produce the total active minutes
# Calculating the total active minutes on daily basis
daily_activity <- daily_activity %>%
mutate(
total_active_minutes =
very_active_minutes +
fairly_active_minutes +
lightly_active_minutes
)
print(daily_activity)
## # A tibble: 457 × 16
## id activity_date total_steps total_distance tracker_distance
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1503960366 2016-03-25 11004 7.11 7.11
## 2 1503960366 2016-03-26 17609 11.6 11.6
## 3 1503960366 2016-03-27 12736 8.53 8.53
## 4 1503960366 2016-03-28 13231 8.93 8.93
## 5 1503960366 2016-03-29 12041 7.85 7.85
## 6 1503960366 2016-03-30 10970 7.16 7.16
## 7 1503960366 2016-03-31 12256 7.86 7.86
## 8 1503960366 2016-04-01 12262 7.87 7.87
## 9 1503960366 2016-04-02 11248 7.25 7.25
## 10 1503960366 2016-04-03 10016 6.37 6.37
## # ℹ 447 more rows
## # ℹ 11 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>,
## # total_active_minutes <dbl>
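# Optional check (left commented out): users whose total active minutes sum to zero
# daily_activity %>%
# group_by(id) %>%
# summarise(total_active_minutes_sum = sum(total_active_minutes, na.rm = TRUE)) %>%
# filter(total_active_minutes_sum == 0)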
Note: To avoid undefined variability metrics, users with zero total active minutes were excluded from the user-level active-minutes consistency analysis.
# Identifying the valid users for active minutes analysis
valid_active_users <- daily_activity %>%
group_by(id) %>%
summarise(total_active_minutes_sum = sum(total_active_minutes, na.rm = TRUE)) %>%
filter(total_active_minutes_sum > 0)
# filtering only valid users for metric calculations
active_user_activity_variability <- daily_activity %>%
semi_join(valid_active_users, by = "id")
# calculating mean (avg_active_minutes), standard deviation (sd_active_minutes) and coefficient of variation (cv_active_minutes) for total active minutes at user level
active_user_activity_variability <- active_user_activity_variability %>%
group_by(id) %>%
summarise(
avg_active_minutes = mean(total_active_minutes, na.rm = TRUE),
sd_active_minutes = sd(total_active_minutes, na.rm = TRUE),
cv_active_minutes = sd_active_minutes / avg_active_minutes
)
print(active_user_activity_variability)
## # A tibble: 34 × 4
## id avg_active_minutes sd_active_minutes cv_active_minutes
## <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 280. 80.9 0.289
## 2 1624580081 122. 61.3 0.501
## 3 1644430081 286 175. 0.611
## 4 1844505072 160 126. 0.790
## 5 1927972279 113. 74.6 0.658
## 6 2022484408 316. 75.2 0.238
## 7 2026352035 169. 62.7 0.371
## 8 2320127002 128. 133. 1.04
## 9 2347167796 288. 98.5 0.341
## 10 2873212765 286. 137. 0.478
## # ℹ 24 more rows
active_user_activity_variability <- active_user_activity_variability %>%
mutate(active_consistency = case_when(
cv_active_minutes <= 0.25 ~ "Consistent",
cv_active_minutes > 0.25 & cv_active_minutes <= 0.75 ~ "Moderately consistent",
cv_active_minutes > 0.75 ~ "Inconsistent"
))
print(active_user_activity_variability)
## # A tibble: 34 × 5
## id avg_active_minutes sd_active_minutes cv_active_minutes
## <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 280. 80.9 0.289
## 2 1624580081 122. 61.3 0.501
## 3 1644430081 286 175. 0.611
## 4 1844505072 160 126. 0.790
## 5 1927972279 113. 74.6 0.658
## 6 2022484408 316. 75.2 0.238
## 7 2026352035 169. 62.7 0.371
## 8 2320127002 128. 133. 1.04
## 9 2347167796 288. 98.5 0.341
## 10 2873212765 286. 137. 0.478
## # ℹ 24 more rows
## # ℹ 1 more variable: active_consistency <chr>
The active-minutes consistency analysis was conducted on 34 users with non-zero total active minutes to ensure valid variability measurement. The excluded user is the same one excluded from the step-based analysis.
active_user_consistency_summary <- active_user_activity_variability %>%
group_by(active_consistency) %>%
summarise(
users = n(),
avg_active_minutes = round(mean(avg_active_minutes), 0),
avg_CV = round(mean(cv_active_minutes), 2)
) %>%
mutate(
percent_users = round(users / sum(users) * 100, 1)
)
print(active_user_consistency_summary)
## # A tibble: 3 × 5
## active_consistency users avg_active_minutes avg_CV percent_users
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Consistent 4 320 0.23 11.8
## 2 Inconsistent 10 123 1.28 29.4
## 3 Moderately consistent 20 234 0.42 58.8
ggplot(active_user_consistency_summary, aes(x = active_consistency, y = percent_users)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
labs(
title = "User Activity Consistency by Active Minutes",
x = "Consistency Level",
y = "Percentage of Users"
)
Users who are inconsistent don’t just fluctuate — they are much less active overall.
A parallel consistency analysis using total active minutes produced results similar to the steps-based analysis. In both cases, most users fall into the moderately consistent category, with only a small proportion maintaining stable activity levels. Users classified as inconsistent not only exhibit higher variability but also record substantially lower average activity. Users with zero activity were excluded from consistency analysis, as variability metrics are not meaningful in the absence of recorded activity.
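A user-level cross-tabulation is one way to check how closely the two classifications agree, beyond comparing the summary percentages. The sketch below uses six invented CV pairs and a hypothetical helper `classify_cv` that mirrors the thresholds used above. Applying it to the real tables would require joining them by `id`, which is not shown here.

```r
# Same cut-offs as the case_when() calls above.
classify_cv <- function(cv) {
  ifelse(cv <= 0.25, "Consistent",
    ifelse(cv <= 0.75, "Moderately consistent", "Inconsistent"))
}

# Invented CV values for six users (steps vs. active minutes).
cv_pairs <- data.frame(
  cv_steps          = c(0.20, 0.29, 0.61, 0.79, 1.04, 0.32),
  cv_active_minutes = c(0.22, 0.24, 0.61, 0.79, 0.50, 0.30)
)

steps_label   <- classify_cv(cv_pairs$cv_steps)
minutes_label <- classify_cv(cv_pairs$cv_active_minutes)

# Rows: label from steps; columns: label from active minutes.
agreement <- table(steps = steps_label, minutes = minutes_label)
print(agreement)

# Share of users given the same label by both metrics.
same_label <- mean(steps_label == minutes_label)
print(same_label)
```

Counts on the diagonal of the table are users who receive the same label from both metrics. Off-diagonal counts show where the two views of consistency disagree.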
Q1 (tracking frequency) tells us how often users track, not how consistent their activity is on the days they track.
Q2 asks: do users have steady activity levels, or do they fluctuate a lot day-to-day?
Q3 focuses on the intensity and quality of activity on tracked days. This answers “When users do wear the device, how active are they really?”
Analysis Steps
Defining the analysis population: For this business question, only non-wear days (TotalSteps = 0 and Calories = 0) were excluded; days or users with zero steps or zero active minutes but non-zero calories were retained to reflect sedentary or non-step-based activity.
Defining the activity intensity metrics:
- Key variables: VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes
- Create: Total Active Minutes = Very + Fairly + Lightly
Note: Total active minutes were already calculated on the daily_activity table.
# calculating user-level metrics
activity_intensity <- daily_activity %>%
filter(!(total_steps == 0 & calories == 0)) %>%
group_by(id) %>%
# calculating daily average for the metrics
summarise(
avg_total_active = mean(total_active_minutes, na.rm = TRUE),
avg_very_active = mean(very_active_minutes, na.rm = TRUE),
avg_light_active = mean(lightly_active_minutes, na.rm = TRUE),
avg_fair_active = mean(fairly_active_minutes, na.rm = TRUE),
avg_sedentary = mean(sedentary_minutes, na.rm = TRUE)
) %>%
# calculating the proportion of time spent in each intensity (1 day = 1440 minutes )
mutate(
pct_sedentary = round((avg_sedentary / 1440) * 100, 2),
pct_light = round((avg_light_active / 1440) * 100, 2),
pct_fair = round((avg_fair_active / 1440) * 100, 2),
pct_very = round((avg_very_active / 1440) * 100, 2)
)
print(activity_intensity)
## # A tibble: 35 × 10
## id avg_total_active avg_very_active avg_light_active avg_fair_active
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 280. 35.8 228. 15.8
## 2 1624580081 122. 0.737 121. 0.579
## 3 1644430081 286 14.8 228. 43.5
## 4 1844505072 160 0.75 158. 0.75
## 5 1927972279 113. 0 112. 1.67
## 6 2022484408 316. 40.1 254. 22.5
## 7 2026352035 169. 0 169. 0
## 8 2320127002 128. 0.917 126. 1.08
## 9 2347167796 288. 11.8 254. 23.1
## 10 2873212765 312. 5.55 300. 6.55
## # ℹ 25 more rows
## # ℹ 5 more variables: avg_sedentary <dbl>, pct_sedentary <dbl>,
## # pct_light <dbl>, pct_fair <dbl>, pct_very <dbl>
# categorizing users by activity intensity
activity_intensity <- activity_intensity %>%
mutate(active_intensity = case_when(
avg_total_active <= 150 ~ "low activity",
avg_total_active > 150 & avg_total_active <= 300 ~ "moderate activity",
avg_total_active > 300 ~ "high activity"
))
print(activity_intensity)
## # A tibble: 35 × 11
## id avg_total_active avg_very_active avg_light_active avg_fair_active
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 280. 35.8 228. 15.8
## 2 1624580081 122. 0.737 121. 0.579
## 3 1644430081 286 14.8 228. 43.5
## 4 1844505072 160 0.75 158. 0.75
## 5 1927972279 113. 0 112. 1.67
## 6 2022484408 316. 40.1 254. 22.5
## 7 2026352035 169. 0 169. 0
## 8 2320127002 128. 0.917 126. 1.08
## 9 2347167796 288. 11.8 254. 23.1
## 10 2873212765 312. 5.55 300. 6.55
## # ℹ 25 more rows
## # ℹ 6 more variables: avg_sedentary <dbl>, pct_sedentary <dbl>,
## # pct_light <dbl>, pct_fair <dbl>, pct_very <dbl>, active_intensity <chr>
# calculating sedentary behavior metrics across intensity level
sedentary_behavior <- activity_intensity %>%
group_by(active_intensity) %>%
summarise(
users = n(),
avg_sedentary_min = mean(avg_sedentary, na.rm = TRUE),
avg_active_min = mean(avg_total_active, na.rm = TRUE)
) %>%
mutate(
percent_users = round(users / sum(users) * 100, 1)
)
print(sedentary_behavior)
## # A tibble: 3 × 5
## active_intensity users avg_sedentary_min avg_active_min percent_users
## <chr> <int> <dbl> <dbl> <dbl>
## 1 high activity 7 958. 322. 20
## 2 low activity 12 1187. 96.1 34.3
## 3 moderate activity 16 841. 244. 45.7
ggplot(sedentary_behavior, aes(x = active_intensity, y = percent_users)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = paste0(percent_users, "%")), vjust = -0.5) +
labs(
title = "Distribution of Users by Activity Intensity Level",
x = "Intensity Level",
y = "Percentage of Users"
)
ggplot(activity_intensity, aes(x = avg_total_active, y = avg_sedentary)) +
geom_point(aes(color = active_intensity)) +
labs(
title = "Sedentary Behavior Across Intensity Level",
x = "active_mins",
y = "sedentary_mins"
)
Higher total activity does not eliminate sedentary behavior. High-activity users still average about 958 sedentary minutes per day (roughly 16 hours), which is more than the moderate-activity group (about 841 minutes), so being more active does not necessarily reduce overall inactivity during the day.
Health Interpretation: Even users who meet high daily activity levels remain sedentary for a substantial portion of the day, highlighting the importance of reducing prolonged sitting in addition to promoting exercise.
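As a quick arithmetic check on the group averages, the sketch below converts sedentary minutes to hours and computes how many of the day's 1440 minutes fall into neither the sedentary nor the active category. The input values are the rounded averages printed in the sedentary_behavior output above. Reading the remainder as sleep or non-wear time is an assumption, since the daily table does not say what it represents.

```r
# Group averages copied (as displayed, rounded) from sedentary_behavior.
groups <- data.frame(
  active_intensity  = c("high activity", "low activity", "moderate activity"),
  avg_sedentary_min = c(958, 1187, 841),
  avg_active_min    = c(322, 96.1, 244)
)

# Sedentary time in hours per day.
groups$sedentary_hours <- round(groups$avg_sedentary_min / 60, 1)

# Minutes of the day covered by neither category
# (assumed here to be sleep or non-wear time).
groups$unaccounted_min <- 1440 - groups$avg_sedentary_min - groups$avg_active_min

print(groups)
```

The moderate-activity group has noticeably more unaccounted minutes than the other two groups, so part of its lower sedentary average may reflect untracked time rather than less sitting. This is worth keeping in mind when comparing groups.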
The following recommendations are derived from engagement, consistency, activity intensity, and sedentary behavior analyses (corresponding KPI tables are tracked_days, consistency_summary, active_user_consistency_summary, activity_intensity, and sedentary_behavior) presented in Q1–Q3.
Insight: Although overall engagement exists, only 5.7% of users track activity frequently, while 80% are moderate trackers who engage intermittently (average 13 days). This indicates that most users are willing to track but lack sustained motivation.
Recommendation: Introduce tracking streaks and milestone-based rewards (e.g., 5-day, 10-day, and 20-day streak badges) with visual progress indicators to encourage moderate trackers to increase tracking frequency.
Expected impact: Improves daily tracking consistency, shifts users from moderate to frequent tracking behavior, and increases overall data completeness and user engagement.
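A streak badge needs a definition of "streak". One possible definition, sketched below, is the longest run of consecutive calendar dates on which a user has a record. The helper `longest_streak` and the sample dates are hypothetical. How Bellabeat would actually define or store streaks is not specified in this analysis.

```r
# Longest run of consecutive calendar days in a vector of dates.
longest_streak <- function(dates) {
  d <- sort(unique(as.Date(dates)))
  if (length(d) == 0) return(0L)
  gaps <- as.integer(diff(d))           # days between successive records
  runs <- rle(gaps == 1)                # runs of "next day" steps
  if (!any(runs$values)) return(1L)     # no two consecutive days
  max(runs$lengths[runs$values]) + 1L   # n consecutive steps = n + 1 days
}

# Made-up tracking dates: a 3-day run, a gap, then a 5-day run.
sample_dates <- as.Date(c(
  "2016-03-25", "2016-03-26", "2016-03-27",
  "2016-03-30", "2016-03-31", "2016-04-01", "2016-04-02", "2016-04-03"
))

longest_streak(sample_dates)  # 5
```

With the cleaned table, the same helper could be applied per user inside a grouped `summarise()` on `activity_date`, and the result compared against the 5-, 10-, and 20-day badge thresholds.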
Insight: Only 8.8% (steps) to 11.8% (active minutes) of users demonstrate highly consistent activity, while roughly 29–32% show high inconsistency across the two metrics, accompanied by much lower average activity levels.
Recommendation: Provide weekly consistency summaries highlighting activity variability and deliver gentle nudges when irregular patterns are detected, focusing on routine-building rather than performance goals.
Expected impact: Helps users recognize irregular behavior early, reduces activity variability, and supports the development of more stable activity routines.
Insight: Approximately 34% of users fall into the low activity category (an average of 150 or fewer total active minutes per day), indicating that a substantial portion of the user base records little daily activity and does not reach the moderate or high activity levels.
Recommendation: Introduce progressive, personalized goals (e.g., increasing daily active minutes by 5–10%) instead of fixed thresholds, allowing low-activity users to improve at a manageable pace.
Expected impact: Reduces user discouragement, increases adherence to activity goals, and supports long-term behavior change among low-activity users.
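As a small worked example of a progressive goal, the sketch below raises a user's target by a fixed percentage of their recent baseline. The helper `next_goal`, the 5% default, and the 96-minute baseline (close to the low-activity group average above) are illustrative assumptions rather than a tested goal-setting rule.

```r
# Next goal = baseline average raised by a chosen percentage,
# rounded up to a whole minute.
next_goal <- function(baseline_minutes, increase = 0.05) {
  ceiling(baseline_minutes * (1 + increase))
}

baseline <- 96                         # assumed recent daily average, in minutes
next_goal(baseline)                    # 5% step  -> 101
next_goal(baseline, increase = 0.10)   # 10% step -> 106
```

Because the increase is relative, a low-activity user is asked for only a few extra minutes per day, while a fixed threshold such as 150 minutes would require a jump of more than 50% from the same baseline.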
Insight: Even highly active users average ~16 hours of sedentary time per day, demonstrating that increased activity intensity does not necessarily reduce prolonged inactivity.
Recommendation: Implement sedentary break reminders that prompt short movement breaks (1–3 minutes) after extended periods of inactivity, regardless of users’ daily activity intensity.
Expected impact: Reduces prolonged sedentary periods, promotes healthier daily movement patterns, and improves overall well-being beyond total activity metrics.
Insight: Users exhibit distinct engagement patterns across tracking frequency, consistency, and activity intensity, indicating that a one-size-fits-all engagement strategy may be ineffective.
Recommendation: Segment users based on tracking behavior and activity patterns, and deliver tailored messaging (e.g., motivation-focused prompts for inconsistent users, performance insights for highly active users).
Expected impact: Improves relevance of notifications, increases feature adoption, and strengthens long-term user engagement across diverse user groups.
This analysis examined user engagement patterns in daily activity tracking data to understand how often users track their activity, how consistently they remain active, and how activity intensity relates to sedentary behavior. The findings show that while most users engage with the tracker at least intermittently, sustained and consistent usage is limited to a small subset of users. A majority of users fall into moderate tracking and moderately consistent activity categories, indicating partial engagement rather than long-term habit formation.
Further analysis revealed that higher activity levels do not necessarily reduce sedentary time. One possible explanation, which daily totals cannot confirm, is that users accumulate activity in short bursts while remaining inactive for much of the day. This highlights the importance of addressing sedentary behavior independently from promoting exercise alone. Based on these insights, targeted recommendations were proposed to improve engagement, including habit-building mechanisms, personalized nudges, progressive activity goals, and interventions to reduce prolonged sedentary time.
Overall, this project demonstrates how behavioral data can be translated into actionable product recommendations, supporting strategies that encourage sustained engagement and healthier daily movement patterns rather than focusing solely on total activity volume.
While the analysis provides meaningful insights into user engagement and activity behavior, several limitations should be considered:
1. Limited data period
The data set covers about one month in 2016 (12 March to 12 April in the table used here), which may not fully capture long-term behavior changes, seasonal effects, or evolving user engagement patterns. User activity habits and wearable usage have likely changed significantly since then.
2. Small sample size
The analysis is based on data from 35 users, which limits the statistical power of the findings and restricts generalization. The results should be interpreted as indicative patterns rather than population-level conclusions.
3. Not suitable for predictive modeling
Due to the small sample size, short time span, and lack of demographic or contextual features, the data set is not sufficient for robust predictive modeling or machine learning applications. This project is therefore positioned as an exploratory and descriptive analysis, rather than a data modeling exercise.
4. Lack of demographic and contextual features
The data set does not include user-level attributes such as age, gender, health status, occupation, or lifestyle factors. Without these features, it is not possible to analyze how engagement patterns differ across user segments or to control for confounding factors.
5. Device adherence and missing behavior context
Zero-step and zero-calorie days likely represent periods when the device was not worn rather than true inactivity. Although such cases were handled during analysis, the absence of explicit wear-time indicators introduces uncertainty into engagement and consistency metrics.
6. Aggregation hides intra-day behavior
Daily aggregation limits visibility into within-day activity patterns, such as time-of-day effects, activity bursts, or micro-sedentary periods. More granular data (e.g., hourly or minute-level) would enable deeper behavioral modeling.
7. Observational data limits causal inference
The data set is observational in nature, meaning relationships identified between engagement, activity intensity, and sedentary behavior should not be interpreted as causal. Controlled experiments or longitudinal interventions would be required to validate the effectiveness of proposed recommendations.